ARTICLE | 4 MIN | OXFORD MAN INSTITUTE AND ACADEMIC PARTNERSHIPS

Backtesting

June 28, 2015

When evaluating a trading strategy, it is routine to discount the Sharpe ratio from a historical backtest.

Introduction

A common practice in evaluating backtests of trading strategies is to discount the reported Sharpe ratios by 50%. There are good economic and statistical reasons for reducing the Sharpe ratios. The discount is a result of data mining. This mining may manifest itself in academic researchers searching for asset pricing factors to explain the behavior of equity returns, or in researchers at firms that specialize in quantitative equity strategies trying to develop profitable systematic strategies.

The 50% haircut is only a rule of thumb. The goal of our paper is to develop an analytical way to determine the magnitude of the haircut.

Our framework relies on the statistical concept of multiple testing. Suppose you have some new data, Y, and you propose that variable X explains Y. Your statistical analysis finds a significant relation between Y and X with a t-ratio of 2.0, which has a probability value (p-value) of 0.05. We refer to this as a single test. Now suppose you instead try to explain Y with variables X1, X2, ..., X100. In this case, you cannot use the same criterion for significance: you expect that, by chance alone, some of these variables will produce t-ratios of 2.0 or higher. What is an appropriate cut-off for statistical significance?
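To make the 100-variable example concrete, the short sketch below computes how likely it is that at least one variable looks significant purely by chance. It assumes the tests are independent and that each has a 0.05 chance of a false positive; the function names are hypothetical and are not taken from the paper.

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of `num_tests` independent tests of a true
    null clears the single-test bar (assumes independence across tests)."""
    return 1.0 - (1.0 - alpha) ** num_tests


def expected_false_positives(num_tests, alpha=0.05):
    """Expected number of tests that look significant by chance alone."""
    return num_tests * alpha


for m in (1, 10, 100):
    print(m,
          round(prob_at_least_one_false_positive(m), 3),
          round(expected_false_positives(m), 2))
```

Under these assumptions, with 100 candidate variables a spurious "discovery" is close to certain, which is why the single-test criterion cannot be reused unchanged.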

In Harvey and Liu (HL, 2015), we present three approaches to multiple testing that answer the question in the above example. The t-ratio cut-off is generally higher as the number of tests (or X variables) increases. Consider a summary of our method. Any given strategy produces a Sharpe ratio. We transform the Sharpe ratio into a t-ratio. Suppose that t-ratio is 3.0. While a t-ratio of 3.0 is highly significant in a single test, it may not be if we take multiple tests into account. We proceed to calculate a p-value that appropriately reflects multiple testing. To do this, we need to make an assumption on the number of previous tests.

For example, Harvey, Liu and Zhu (HLZ, 2015) document that at least 316 factors have been tested in the quest to explain the cross-sectional patterns in equity returns. Suppose the adjusted p-value is 0.05. We then calculate an adjusted t-ratio which, in this case, is 2.0. With this new t-ratio, we determine a new Sharpe ratio. The percentage difference between the original Sharpe ratio and the new Sharpe ratio is the "haircut".
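The chain of steps described above (Sharpe ratio, t-ratio, adjusted p-value, adjusted t-ratio, haircut Sharpe ratio) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: it uses a Bonferroni correction as a simple stand-in, since the text here does not name the three methods; it assumes an annualized Sharpe ratio with t-ratio = Sharpe × sqrt(years); it uses a normal approximation for p-values; and it expects a positive Sharpe ratio. The function name `haircut_sharpe` is hypothetical.

```python
import math
from statistics import NormalDist


def haircut_sharpe(sharpe_annual, years, num_tests):
    """Return (haircut Sharpe ratio, haircut fraction) under a Bonferroni
    adjustment. Assumes sharpe_annual > 0 and a normal approximation."""
    # Step 1: Sharpe ratio -> t-ratio (assumed mapping: SR * sqrt(years)).
    t_ratio = sharpe_annual * math.sqrt(years)
    # Step 2: two-sided single-test p-value; erfc keeps precision for large t.
    p_single = math.erfc(t_ratio / math.sqrt(2.0))
    # Step 3: Bonferroni adjustment for the assumed number of tests.
    p_adjusted = min(1.0, num_tests * p_single)
    # Step 4: adjusted p-value -> adjusted t-ratio (lower-tail form is stable).
    t_adjusted = -NormalDist().inv_cdf(p_adjusted / 2.0)
    # Step 5: adjusted t-ratio -> haircut Sharpe ratio and percentage haircut.
    sharpe_adjusted = t_adjusted / math.sqrt(years)
    haircut = 1.0 - sharpe_adjusted / sharpe_annual
    return sharpe_adjusted, haircut


# Hypothetical inputs: 10 years of data, 100 assumed prior tests.
for sr in (0.5, 1.0, 1.5):
    adj, cut = haircut_sharpe(sr, years=10, num_tests=100)
    print(f"SR={sr:.1f} -> haircut SR={adj:.2f}, haircut={cut:.0%}")
```

With a single test the haircut is zero by construction. As the assumed number of tests grows, the haircut is large for marginal Sharpe ratios and comparatively small for very high ones.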

The haircut Sharpe ratio that obtains as a result of multiple testing has the following interpretation. It is the Sharpe ratio that would have resulted from a single test, that is, a single measured correlation of Y and X.

We argue that it is a serious mistake to use the rule of thumb 50% haircut. Our results show that the multiple testing haircut is nonlinear. The highest Sharpe ratios are only moderately penalized while the marginal Sharpe ratios are heavily penalized.

This makes economic sense. The marginal Sharpe ratio strategies should be thrown out. The strategies with very high Sharpe ratios are probably true discoveries. In these cases, a 50% haircut is too punitive.

Our method does have a number of caveats, some of which apply to any use of the Sharpe ratio. First, high observed Sharpe ratios could be the result of non-normal returns, for instance an option-like strategy with high ex ante negative skew. In this case, Sharpe ratios need to be viewed in the context of the skew. Dealing with these non-normalities is the subject of future research. Second, Sharpe ratios do not necessarily control for risk. That is, the volatility of the strategy may not reflect the true risk. Importantly, our method also applies to Information ratios, which use residuals from factor models. Third, it is necessary in the multiple testing framework to take a stand on what qualifies as the appropriate significance level, e.g., is it 0.10 or 0.05? Fourth, a choice needs to be made on the multiple testing method. We present results for three methods as well as the average of the methods. Finally, some judgment is needed in specifying the number of tests.

Given the last three choices above (the significance level, the multiple testing method, and the number of tests), it is important to determine the robustness of the haircuts to changes in these inputs. We provide a program at: http://faculty.fuqua.duke.edu/charvey/backtesting that allows the user to vary the key parameters to investigate the impact on the haircuts. We also provide a program that determines the minimal level of profitability for a trading strategy to be considered "significant".
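As a rough illustration of how the minimal level of profitability depends on these inputs, the sketch below computes the smallest annualized Sharpe ratio that would remain significant. It is not the program referred to above: it assumes a Bonferroni correction, a normal approximation, and t-ratio = Sharpe × sqrt(years), and the function name `min_significant_sharpe` is hypothetical.

```python
import math
from statistics import NormalDist


def min_significant_sharpe(years, num_tests, alpha=0.05):
    """Smallest annualized Sharpe ratio that stays significant at level
    `alpha` after a Bonferroni adjustment for `num_tests` tests."""
    # Bonferroni: each test must clear alpha / num_tests (two-sided).
    per_test_alpha = alpha / num_tests
    t_cutoff = -NormalDist().inv_cdf(per_test_alpha / 2.0)
    # Convert the t-ratio cut-off back to an annualized Sharpe ratio.
    return t_cutoff / math.sqrt(years)


# Vary the significance level and the assumed number of tests (10 years of data).
for alpha in (0.10, 0.05):
    for m in (1, 50, 316):
        print(alpha, m, round(min_significant_sharpe(10, m, alpha), 2))
```

Varying `alpha` and `num_tests` in this way gives a quick sense of how sensitive the required Sharpe ratio is to the judgment calls listed in the caveats.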

