## Backtesting

When evaluating a trading strategy, it is routine to discount the Sharpe ratio from a historical backtest. The reason is simple: there is inevitable data mining, both by the researcher and by other researchers in the past. Our paper provides a statistical framework that systematically accounts for these multiple tests. We propose a method to determine the appropriate haircut for any given reported Sharpe ratio. We also provide a profit hurdle that any strategy needs to achieve in order to be deemed "significant".

## Introduction

A common practice in evaluating backtests of trading strategies is to discount the reported Sharpe ratios by 50%. There are good economic and statistical reasons for reducing the Sharpe ratios. The discount is a result of data mining. This mining may be carried out by academic researchers searching for asset pricing factors to explain the behavior of equity returns, or by researchers at firms that specialize in quantitative equity strategies trying to develop profitable systematic strategies.

The 50% haircut is only a rule of thumb. The goal of our paper is to develop an analytical way to determine the magnitude of the haircut.

Our framework relies on the statistical concept of multiple testing. Suppose you have some new data, Y, and you propose that variable X explains Y. Your statistical analysis finds a significant relation between Y and X with a t-ratio of 2.0, which has a probability value of 0.05. We refer to this as a single test. Now suppose you try to explain Y with variables X1, X2, ..., X100. In this case, you cannot use the same criterion for significance. You expect by chance that some of these variables will produce t-ratios of 2.0 or higher. What is an appropriate cut-off for statistical significance?
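To make the 100-variable example concrete, the following minimal sketch computes how often pure chance clears the single-test bar. It assumes the 100 tests are independent and uses a standard normal approximation for the t-ratio; the function names are illustrative and do not come from the paper.

```python
from statistics import NormalDist


def two_sided_p_value(t_ratio: float) -> float:
    """Two-sided p-value of a t-ratio under a standard normal approximation."""
    return 2.0 * NormalDist().cdf(-abs(t_ratio))


def prob_any_false_discovery(p_single: float, num_tests: int) -> float:
    """Chance that at least one of num_tests independent pure-noise
    variables clears the single-test significance bar."""
    return 1.0 - (1.0 - p_single) ** num_tests


p_single = two_sided_p_value(2.0)        # close to the 0.05 quoted in the text
expected_false = 100 * p_single          # expected number of chance "discoveries"
p_any = prob_any_false_discovery(p_single, 100)

print(f"single-test p-value at t = 2.0:    {p_single:.4f}")
print(f"expected chance discoveries / 100: {expected_false:.2f}")
print(f"P(at least one chance discovery):  {p_any:.4f}")
```

Under these assumptions, a handful of the 100 candidate variables are expected to look "significant" even when none of them truly explains Y, which is why the single-test cut-off cannot be reused.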

In Harvey and Liu (HL, 2015), we present three approaches to multiple testing. We answer the question in the above example. The t-ratio required for significance is generally higher as the number of tests (or X variables) increases. Our method can be summarized as follows. Any given strategy produces a Sharpe ratio. We transform the Sharpe ratio into a t-ratio. Suppose that t-ratio is 3.0. While a t-ratio of 3.0 is highly significant in a single test, it may not be if we take multiple tests into account. We proceed to calculate a p-value that appropriately reflects multiple testing. To do this, we need to make an assumption on the number of previous tests.

For example, Harvey, Liu and Zhu (HLZ, 2015) document that at least 316 factors have been tested in the quest to explain the cross-sectional patterns in equity returns. Suppose the adjusted p-value is 0.05. We then calculate an adjusted t-ratio which, in this case, is 2.0. With this new t-ratio, we determine a new Sharpe ratio. The percentage difference between the original Sharpe ratio and the new Sharpe ratio is the "haircut".
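The chain of steps described above (Sharpe ratio, t-ratio, p-value, adjusted p-value, adjusted t-ratio, haircut Sharpe ratio) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' program: it uses the textbook relation t = Sharpe × sqrt(years) for an annualized Sharpe ratio with i.i.d. returns, a standard normal approximation, and a Bonferroni adjustment as a simple stand-in for the multiple testing methods, which the text here does not name. The inputs (a Sharpe ratio of 1.0 over 9 years, 20 prior tests) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def sharpe_to_t_ratio(annual_sharpe: float, years: float) -> float:
    """Assumed relation: t-ratio = annualized Sharpe ratio * sqrt(years)."""
    return annual_sharpe * sqrt(years)


def two_sided_p_value(t_ratio: float) -> float:
    """Two-sided p-value under a standard normal approximation."""
    return 2.0 * STD_NORMAL.cdf(-abs(t_ratio))


def bonferroni_adjust(p_value: float, num_tests: int) -> float:
    """Bonferroni adjustment (an assumption here): scale by the test count, cap at 1."""
    return min(1.0, p_value * num_tests)


def p_value_to_t_ratio(p_value: float) -> float:
    """Invert the two-sided p-value back into a non-negative t-ratio."""
    return -STD_NORMAL.inv_cdf(p_value / 2.0)


def haircut_sharpe(annual_sharpe: float, years: float, num_tests: int):
    """Return (haircut Sharpe ratio, haircut as a fraction of the original).

    Assumes a positive Sharpe ratio that is not so large that its
    p-value underflows to zero.
    """
    t_ratio = sharpe_to_t_ratio(annual_sharpe, years)
    p_adjusted = bonferroni_adjust(two_sided_p_value(t_ratio), num_tests)
    adjusted_sharpe = p_value_to_t_ratio(p_adjusted) / sqrt(years)
    return adjusted_sharpe, 1.0 - adjusted_sharpe / annual_sharpe


# Hypothetical inputs: Sharpe 1.0 over 9 years gives a t-ratio of 3.0.
adjusted, haircut = haircut_sharpe(annual_sharpe=1.0, years=9.0, num_tests=20)
print(f"haircut Sharpe ratio: {adjusted:.3f}, haircut: {haircut:.1%}")
```

With a single test the adjustment leaves the p-value unchanged, so the haircut is zero; it grows as the assumed number of tests rises.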

The haircut Sharpe ratio that obtains as a result of multiple testing has the following interpretation. It is the Sharpe ratio that would have resulted from a single test, that is, a single measured correlation of Y and X.

We argue that it is a serious mistake to use the rule of thumb 50% haircut. Our results show that the multiple testing haircut is nonlinear. The highest Sharpe ratios are only moderately penalized while the marginal Sharpe ratios are heavily penalized.

This makes economic sense. The marginal Sharpe ratio strategies should be thrown out. The strategies with very high Sharpe ratios are probably true discoveries. In these cases, a 50% haircut is too punitive.
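The nonlinearity can be illustrated with a small sketch that applies the same adjustment to several Sharpe ratios. It is a simplified illustration, not the paper's results: it assumes a Bonferroni adjustment, a normal approximation, t = Sharpe × sqrt(years), and hypothetical inputs of 10 years of data and 100 tests.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def bonferroni_haircut(annual_sharpe: float, years: float, num_tests: int) -> float:
    """Fractional haircut under a Bonferroni adjustment (an assumption of this sketch)."""
    t_ratio = annual_sharpe * sqrt(years)
    p_single = 2.0 * STD_NORMAL.cdf(-abs(t_ratio))
    p_adjusted = min(1.0, num_tests * p_single)
    t_adjusted = -STD_NORMAL.inv_cdf(p_adjusted / 2.0)
    adjusted_sharpe = t_adjusted / sqrt(years)
    return 1.0 - adjusted_sharpe / annual_sharpe


# Hypothetical setting: 10 years of data, 100 tests.
for sharpe in (0.5, 1.0, 1.5):
    haircut = bonferroni_haircut(sharpe, years=10.0, num_tests=100)
    print(f"Sharpe {sharpe:.1f}: haircut {haircut:.0%}")
```

In this setting the haircut shrinks as the reported Sharpe ratio rises, in contrast to a flat 50% rule that penalizes every strategy by the same proportion.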

Our method does have a number of caveats, some of which apply to any use of the Sharpe ratio. First, high observed Sharpe ratios could be the result of non-normal returns, for instance an option-like strategy with high ex ante negative skew. In this case, Sharpe ratios need to be viewed in the context of the skew. Dealing with these non-normalities is the subject of future research. Second, Sharpe ratios do not necessarily control for risk. That is, the volatility of the strategy may not reflect the true risk. Importantly, our method also applies to Information ratios, which use residuals from factor models. Third, it is necessary in the multiple testing framework to take a stand on what qualifies as the appropriate significance level, e.g., is it 0.10 or 0.05? Fourth, a choice needs to be made on the multiple testing method. We present results for three methods as well as the average of the methods. Finally, some judgment is needed in specifying the number of tests.

Given the last three choices (the significance level, the multiple testing method, and the number of tests), it is important to determine the robustness of the haircuts to changes in these inputs. We provide a program at http://faculty.fuqua.duke.edu/charvey/backtesting that allows the user to vary the key parameters to investigate the impact on the haircuts. We also provide a program that determines the minimal level of profitability for a trading strategy to be considered "significant".
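As a rough illustration of what a profit hurdle involves, the sketch below computes the minimum average monthly return a strategy would need for a given significance level, number of tests, volatility, and sample length. It is not the program referenced above: it assumes a Bonferroni threshold, a normal approximation, and i.i.d. monthly returns, and all names and inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def required_t_ratio(significance: float, num_tests: int) -> float:
    """Two-sided critical t-ratio when the significance level is split
    equally across num_tests (Bonferroni, an assumption of this sketch)."""
    per_test_level = significance / num_tests
    return -NormalDist().inv_cdf(per_test_level / 2.0)


def profit_hurdle(significance: float, num_tests: int,
                  monthly_vol: float, num_months: int) -> float:
    """Minimum average monthly return whose t-ratio reaches the critical value,
    using t = mean / (vol / sqrt(num_months))."""
    return required_t_ratio(significance, num_tests) * monthly_vol / sqrt(num_months)


# Hypothetical inputs: 10% annual volatility, 20 years of monthly data.
monthly_vol = 0.10 / sqrt(12)
for num_tests in (1, 50, 300):
    hurdle = profit_hurdle(0.05, num_tests, monthly_vol, num_months=240)
    print(f"{num_tests:>3} tests: required mean monthly return {hurdle:.3%}")
```

Varying the significance level and the number of tests in this sketch gives a quick sense of how sensitive the hurdle is to those inputs.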
