Systematically Biased: Replication in Finance

Pockets of Replicability (Post #6)

Systematically Biased — Tue, 02 Jun 2026 20:40:09 GMT

In asset pricing, a stochastic discount factor (SDF) is a special random variable with the property that the price of any asset i can be obtained as the expected value of the asset’s payoff multiplied by the SDF:

$$p_i = \mathbb{E}[m \, x_i],$$

where p_i is the price of asset i, x_i is its payoff, and m is the SDF.

The existence of a strictly positive SDF is equivalent to the absence of arbitrage opportunities.

Let R_i denote the asset’s gross return and R_f the gross risk-free return. We can rearrange the pricing equation to show that risk premia are determined by covariation with the SDF, m:

$$\mathbb{E}[R_i] - R_f = -R_f \, \mathrm{Cov}(m, R_i).$$
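To make the pricing relation concrete, here is a minimal numerical sketch in a three-state economy. It checks that the price equals the expected SDF-weighted payoff, and that the risk premium E[R_i] − R_f equals −R_f·Cov(m, R_i). The state probabilities, SDF values, and payoffs are made-up numbers chosen purely for illustration.

```python
# Three equally likely states of the world (hypothetical numbers).
probs = [1 / 3, 1 / 3, 1 / 3]
sdf = [1.20, 0.95, 0.70]       # the SDF is high in bad states, low in good states
payoff = [80.0, 100.0, 130.0]  # the risky asset pays little in the bad state


def expectation(values, probs):
    """Probability-weighted average of a discrete random variable."""
    return sum(p * v for p, v in zip(probs, values))


def covariance(a, b, probs):
    """Covariance of two discrete random variables."""
    mean_a, mean_b = expectation(a, probs), expectation(b, probs)
    return sum(p * (x - mean_a) * (y - mean_b) for p, x, y in zip(probs, a, b))


# Price = E[m * x]
price = expectation([m * x for m, x in zip(sdf, payoff)], probs)

# A risk-free asset pays 1 in every state, so its price is E[m] and R_f = 1 / E[m].
risk_free = 1.0 / expectation(sdf, probs)

# Gross returns of the risky asset, state by state.
gross_returns = [x / price for x in payoff]

risk_premium = expectation(gross_returns, probs) - risk_free
premium_from_cov = -risk_free * covariance(sdf, gross_returns, probs)

print(f"price = {price:.4f}, R_f = {risk_free:.4f}")
print(f"E[R] - R_f       = {risk_premium:.6f}")
print(f"-R_f * Cov(m, R) = {premium_from_cov:.6f}")
```

Because the asset pays off poorly exactly when the SDF is high, its return covaries negatively with the SDF and it earns a positive premium.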

Asset pricing theory is concerned with specifying a particular form for the SDF, implying that expected returns are determined by covariations with some set of variables or factors. A central question in asset pricing, therefore, is to determine which factors enter the SDF. Since these are the factors that help explain cross-sectional differences in expected returns, this has not only economic but also practical implications, as investors can use these factors to construct and hedge portfolios.

However, economic theory does not provide much guidance on the precise structure of the SDF, which means that we must rely on empirical studies to test which factors actually matter. This is a non-trivial task that involves many steps and can be affected by data quality/availability and several design choices.

The (Equity) Factor Zoo

In the CAPM, the SDF is an affine function of a single systematic risk factor: the market return. Therefore, under the CAPM, the expected return on any asset is entirely determined by its covariance with the market (i.e. its market beta). Extensive research on the stock market has shown that the CAPM is unable to explain the returns on many different types of portfolios. This list of “anomalies” eventually morphed into what we today call the “factor zoo”: a large collection that includes potential risk factors or, more generally, variables that may represent mispricings, trading frictions, or that may simply be the result of data mining. Post #3 of this series discussed some Bayesian approaches to select the “right” factors from the factor zoo, and Post #5 discussed the issues in the replication of the anomalies in the factor zoo.

Equity markets have been extensively studied, leading to the hundreds of factors that we collectively call the “factor zoo”. In contrast, fewer studies have looked at corporate bonds. The main reasons for this are related to data availability and complexity. High-quality bond data is much more difficult to obtain than equity data. A single company can have hundreds of bonds with very different characteristics (maturity, seniority, optionality, coupon structure, etc.). In addition, bond trading can be much less liquid and generally happens in over-the-counter markets.

Factors for Bond Returns

In this post, I’m going to discuss the paper “Common risk factors in the cross-section of corporate bond returns”, by Jennie Bai, Turan G. Bali, and Quan Wen. This paper was published in the Journal of Financial Economics in 2019, but subsequently retracted by the authors in 2023. The reason for the retraction was straightforward. Another group of authors (Alexander Dickerson, Philippe Mueller, and Cesare Robotti) tried to replicate the results of Bai et al. and discovered an issue of temporal misalignment. Bai et al. confirmed that the issue was indeed present, and that their paper’s results did not reproduce once the issue was corrected.

The Bai et al. paper had a strong and economically intuitive result. The authors had identified three factors that explained differences in returns of corporate bonds issued by similar companies: downside risk, credit risk, and liquidity risk. Downside risk, in particular, is an intuitively appealing explanation of bond returns, as bond investors do not participate in the upside in the same way as equity investors. Their preferred model appeared to explain returns of bond portfolios quite well.

Figure 2 from Bai et al. (2019)

The strong results of the paper and the fact that the factor data were made available by Bai et al. led to their four-factor model (the bond-market factor plus the three new factors) quickly becoming a benchmark in other corporate bond papers.

The Replication

Dickerson, Mueller, and Robotti (2023) revisited the results of Bai et al. (2019). They found two main issues. The first, and most important, was a temporal misalignment. For most of the sample, the downside risk and credit risk factor returns reported for month t were actually the returns for month t + 1. In other words, the factors inadvertently incorporated information from the future. The liquidity-risk factor was also misaligned during the final two years of the sample, although in the opposite direction: its returns lagged by one month.
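One simple diagnostic for this kind of problem is to compare a reported factor series with an independently rebuilt one at several leads and lags. The sketch below uses simulated data and hypothetical names (`reported`, `rebuilt`, `best_shift`). It is not the procedure used by Dickerson et al., only an illustration of how a one-month lead would show up.

```python
import random


def correlation(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def best_shift(reported, rebuilt, max_shift=2):
    """Return the shift s that maximizes corr(reported[t], rebuilt[t + s]).

    s = +1 means the value reported for month t is really month t + 1
    (look-ahead); s = -1 means it is really month t - 1 (a lag).
    """
    n = len(reported)
    scores = {}
    for s in range(-max_shift, max_shift + 1):
        pairs = [(reported[t], rebuilt[t + s]) for t in range(n) if 0 <= t + s < n]
        scores[s] = correlation([p[0] for p in pairs], [p[1] for p in pairs])
    return max(scores, key=scores.get), scores


random.seed(0)
# Hypothetical "true" monthly factor returns, e.g. rebuilt from raw bond data.
rebuilt = [random.gauss(0.003, 0.02) for _ in range(240)]
# A reported series that carries next month's return, plus a little noise.
reported = [rebuilt[t + 1] + random.gauss(0, 0.002) for t in range(239)]

shift, scores = best_shift(reported, rebuilt[:239])
print("estimated shift (months):", shift)
```

With real data the correlation at the true shift would be lower than in this simulation, since independently rebuilt factors differ in filters and data sources, but a clear peak away from zero is still a warning sign.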

The second issue concerned the construction of the bond-market factor. Bai et al. truncated extreme returns in both tails of the distribution. This reduced the measured risk premium of the market factor and made the additional downside risk, credit risk, and liquidity risk factors appear stronger in multivariate tests.

Using multiple bond databases and correcting the misalignment issue, Dickerson et al. found that previously proposed corporate-bond factors generally did not add meaningful explanatory power beyond the value-weighted bond-market factor. In other words, the bond CAPM was difficult to outperform. The only marginal exception was traded liquidity.

Figure 1 from Dickerson, Mueller, and Robotti (2023)

This conclusion is striking because the original model appeared to work extremely well. Bai et al. had reported that their four-factor model explained much of the variation in the returns of corporate-bond portfolios, with predicted and realized returns clustering closely around the 45-degree line.

In a more recent paper, Dickerson, Robotti, and Rossetti (2026) broaden the exercise to a corporate-bond “factor zoo” of 108 signals. They argue that measurement error and look-ahead bias affect a wider segment of the literature, and that most reported bond factors do not retain statistically significant bond CAPM alphas after correction. The Bai et al. episode, therefore, is a reminder that corporate bond data are complex, and that empirical bond pricing studies can be sensitive to data construction, temporal alignment, liquidity measurement, and other methodological choices.

The Bayesians Join the Fray

In a paper recently published online in the JFE, Dickerson, Julliard, and Mueller (2026) use a Bayesian approach (similar to the ones discussed in Post #3, but adapted to handle multiple asset classes) to jointly price the cross-section of stock and bond returns. Their results show that equity and nontradable factors are sufficient to price corporate bonds once their Treasury term-structure risk is accounted for. Tradable bond factors become largely redundant for pricing the remaining credit component. However, bond factors, together with nontradable factors, remain necessary to price the Treasury component, which stock factors do not appear to capture.

Final Thoughts

The retraction of the Bai et al. paper, in my view, is an example of the system working as it should. The problem was identified because the factor data were publicly available. An independent group of authors found the issue and flagged it. As a result, our understanding of the subject improved.

An interesting point raised by both Dickerson, Mueller, and Robotti (2023) and Dickerson, Julliard, and Mueller (2026) is the large amount of model uncertainty present in empirical asset pricing studies. The latter paper states:

Overall, we find that the true latent SDF is dense in the space of observable nontradable and tradable bond and stock factors. Importantly, this implies that all low dimensional observable factor models proposed to date are affected by severe misspecification and rejected by the data.

In other words, substantial model uncertainty favors aggregation through Bayesian model averaging rather than reliance on a single sparse representation. This is similar to the conclusion reached by papers that apply Bayesian methods to the equity factor zoo. In Dickerson, Julliard, and Mueller (2026), the model space is gigantic: more than 18 quadrillion possible models. Even without the replication issue, a four-factor model such as the one proposed by Bai et al. should therefore be understood as one possible low-dimensional approximation of an unobservable SDF, rather than as its definitive representation.

References

Bai, Jennie, Turan G. Bali, and Quan Wen. “RETRACTED: Common risk factors in the cross-section of corporate bond returns.” Journal of Financial Economics 131, no. 3 (2019): 619-642.

Dickerson, Alexander, Christian Julliard, and Philippe Mueller. “The co-pricing factor zoo.” Journal of Financial Economics 182 (2026): 104295.

Dickerson, Alexander, Philippe Mueller, and Cesare Robotti. “Priced risk in corporate bonds.” Journal of Financial Economics 150, no. 2 (2023): 103707.

Dickerson, Alexander, Cesare Robotti, and Giulio Rossetti. “The Corporate Bond Factor Replication Crisis.” arXiv preprint arXiv:2604.07880 (2026).

Pockets of Replicability (Post #5)

Systematically Biased — Wed, 15 Apr 2026 10:37:54 GMT

In the first post of this series, I mentioned two papers that look into whether asset pricing anomalies survive replication. In this post, we look at the overall discussion on the replicability of asset pricing anomalies.

Hou, Xue and Zhang: Most Anomalies Fail to Replicate

The first one is an influential paper in the recent replicability debate in empirical asset pricing: Hou, Xue and Zhang’s (2020) Replicating Anomalies. Their conclusion is stark: once you use stricter procedures, such as value-weighted returns, NYSE breakpoints, and controls that reduce the influence of microcaps, most published anomalies no longer look very convincing. In their sample of 452 anomalies, 65% fail the standard single-test hurdle, and 82% fail once they impose a higher multiple-testing threshold. Their broader message is hard to miss: capital markets may be much more efficient than the anomalies literature had led us to believe. Among the anomalies that they find replicate well are value, momentum, investment and profitability.

Our key finding is that most anomalies fail to replicate, falling short of currently acceptable standards for empirical finance. First, of the 452 anomalies, 65% cannot clear the single test hurdle of |t|≥1.96. The key word is “microcaps.” Microcaps represent only 3.2% of the aggregate market capitalization but 60.7% of the number of stocks. Microcaps have the highest equal-weighted returns and the largest cross-sectional dispersions in returns and in anomaly variables. Many original studies overweight microcaps via equal-weighted returns and often with NYSE-Amex-NASDAQ breakpoints in portfolio sorts. Hundreds of studies perform cross-sectional regressions of returns on anomaly variables, mostly with ordinary least squares, which are highly sensitive to microcap outliers.
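The role of weighting is easy to see in a toy example. The sketch below uses made-up market capitalizations and returns for five stocks, three of them microcaps, and compares equal-weighted and value-weighted portfolio returns. The numbers are illustrative assumptions, not data from Hou, Xue, and Zhang.

```python
# Hypothetical portfolio: (market cap in $ millions, monthly return).
stocks = [
    (50_000.0, 0.010),  # large cap
    (20_000.0, 0.008),  # large cap
    (40.0, 0.060),      # microcap
    (25.0, 0.090),      # microcap
    (10.0, -0.020),     # microcap
]


def equal_weighted_return(stocks):
    """Every stock gets the same weight, regardless of size."""
    return sum(ret for _, ret in stocks) / len(stocks)


def value_weighted_return(stocks):
    """Each stock is weighted by its share of total market capitalization."""
    total_cap = sum(cap for cap, _ in stocks)
    return sum(cap / total_cap * ret for cap, ret in stocks)


ew = equal_weighted_return(stocks)
vw = value_weighted_return(stocks)
total_cap = sum(cap for cap, _ in stocks)
microcap_share = sum(cap for cap, _ in stocks if cap < 100.0) / total_cap

print(f"equal-weighted return: {ew:.4%}")
print(f"value-weighted return: {vw:.4%}")
print(f"microcap share of total cap: {microcap_share:.4%}")
```

The equal-weighted return is dominated by the three tiny stocks, while the value-weighted return is driven almost entirely by the two large ones. This is the mechanism behind the sensitivity of many anomalies to the weighting scheme.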

Jensen, Kelly, and Pedersen: The Zoo Looks More Alive Than You Think

The second one is a later paper by Jensen, Kelly, and Pedersen (2023), Is There a Replication Crisis in Finance?, which pushes back and arrives at a very different conclusion. They begin by distinguishing two challenges in finance research replicability. The first is internal validity: do results survive replication under slightly different data and methods? The second is external validity: even if they do, are they just the product of multiple testing and p-hacking?

They propose a Bayesian framework to study anomalies in light of these two challenges. Their approach also has one crucial difference relative to Hou, Xue and Zhang, in that they propose to look at the anomalies’ alphas relative to the CAPM, instead of looking at returns.¹ They also argue that the factor zoo should not be treated as a giant list of unrelated t-stats. Many factors are closely related, and a joint Bayesian framework can use that dependence to separate false discoveries from genuinely recurring signals more efficiently than blunt multiple-testing corrections. They propose to group anomalies into “themes” using their Bayesian modeling approach, and classify anomalies algorithmically into 13 themes that have high correlation and economic similarity. On their numbers, replication rates are much higher than in Hou, Xue, and Zhang. For example, using their sample and CAPM alphas, they find that 82.4% of anomalies replicate. They also extend the analysis to a large international sample and argue that the majority of factors work out of sample across 93 countries.
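For readers who want to see what “alpha relative to the CAPM” means operationally, here is a minimal sketch: regress a factor’s excess return on the market excess return and read off the intercept and its t-statistic. The data are simulated and the function name `capm_alpha` is hypothetical. Jensen, Kelly, and Pedersen’s actual Bayesian estimation is considerably richer than this plain OLS illustration.

```python
import random


def capm_alpha(factor_excess, market_excess):
    """OLS of factor excess returns on market excess returns.

    Returns (alpha, beta, t_alpha), using classical (homoskedastic)
    standard errors for the intercept.
    """
    n = len(factor_excess)
    mean_x = sum(market_excess) / n
    mean_y = sum(factor_excess) / n
    sxx = sum((x - mean_x) ** 2 for x in market_excess)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(market_excess, factor_excess))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    residuals = [y - alpha - beta * x for x, y in zip(market_excess, factor_excess)]
    sigma2 = sum(e ** 2 for e in residuals) / (n - 2)
    se_alpha = (sigma2 * (1.0 / n + mean_x ** 2 / sxx)) ** 0.5
    return alpha, beta, alpha / se_alpha


random.seed(1)
market = [random.gauss(0.005, 0.045) for _ in range(600)]
# Hypothetical long-short factor: negative market beta, positive true alpha.
factor = [0.004 - 0.3 * m + random.gauss(0, 0.02) for m in market]

alpha, beta, t_alpha = capm_alpha(factor, market)
raw_mean = sum(factor) / len(factor)
print(f"raw mean return = {raw_mean:.4f}")
print(f"CAPM alpha = {alpha:.4f}, beta = {beta:.2f}, t(alpha) = {t_alpha:.2f}")
```

A factor with a negative market beta has an alpha above its raw mean return, which is one reason why judging anomalies by alphas rather than raw returns can change the replication count.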

López de Prado and Fabozzi: Are We Even Asking the Right Question?

A recent paper by López de Prado and Fabozzi adds another twist to this debate by arguing that even the more optimistic attempts to estimate false discovery rates from the cross-section of published factor results face a deeper identification problem. Their point is not simply that finance researchers test too many things. It is that the published statistic is often the winner of an unobserved search over many specifications, so the observed cross-section is no longer generated by the “single-trial” experiment assumed by empirical-Bayes, local-FDR, or hierarchical-Bayes approaches. In that sense, they are challenging not just the pessimism of Hou, Xue, and Zhang, but also the optimism of Jensen, Kelly, and Pedersen: once latent search and selection are present, the prevalence of genuine factors cannot be recovered from in-sample reported statistics alone unless one explicitly models that search process or brings in genuinely independent validation data. Under their own maintained search-adjusted model, the implied false discovery rate is much higher than in the optimistic recent literature. Whether one buys all of their assumptions or not, the paper usefully sharpens the real issue: this is not only a fight about t-statistics, microcaps, or Bayesian estimation, but about what statistical experiment we think actually generated the factor zoo in the first place.

Final Thoughts

My own takeaway is that these papers are best read together. Hou, Xue, and Zhang are a useful corrective to an earlier literature that had become too comfortable with marginal t-statistics and too casual about robustness. Jensen, Kelly, and Pedersen are a reminder that once we bring in risk adjustment, dependence across factors, and international evidence, the case against the entire anomalies literature becomes much less clear-cut. López de Prado and Fabozzi, in turn, force the debate onto even deeper ground by asking whether the data we observe can identify the false discovery rate at all without modeling the hidden search process that generated the published result.

So the right conclusion is probably neither “the zoo is dead” nor “everything replicates.” It is that finance has likely discovered a fair amount of real structure in returns, but much less cleanly, and much less definitively, than the original papers often suggest.

¹ The idea being that the CAPM “…is the clearest theoretical benchmark model that is not mechanically linked to other so-called anomalies in the list of replicated factors.”

Pockets of Predictability (Post #4)

Systematically Biased — Mon, 30 Mar 2026 09:13:32 GMT

On this issue of Pockets of Predictability, I discuss the 2022 paper “Anomalies and the Expected Market Return”, by Dong, Li, Rapach, and Zhou (DLRZ), as well as the 2024 paper by Cakici, Fieberg, Metko, and Zaremba (CFMZ), who performed a large-scale replication of DLRZ using both U.S. and global market data.

I’ll also discuss the kinds of methodologies used in these papers (and in most papers in the time series return predictability literature) to assess forecasts from a statistical and economic point of view.

At the end of the article, I include Python code that downloads data from Amit Goyal’s website and from Open Source Asset Pricing, runs predictive regressions with both sets of predictors, and compares results.

Linking Cross-Sectional Anomalies and Aggregate Stock Market Prediction

Stock return predictability is a topic of interest to both practitioners and academics. There is a large literature on this topic, split along two lines of research:

Cross-sectional predictability: focuses on identifying which variables explain return differences among stocks. This literature created the so-called factor zoo, which I’ve mentioned a few times in previous posts. This is a large collection of asset pricing anomalies, i.e. long-short portfolios whose returns can’t be explained by typical asset pricing models.
Time-series predictability: focuses on identifying which variables can predict aggregate stock market returns and the market risk premium.

The research question in DLRZ is whether there is a link between these two strands. In other words, do long-short anomalies from the cross-sectional literature help predict aggregate stock market returns? If true, this would imply that signals traditionally used to rank stocks in the cross-section also contain information about aggregate market returns.

According to DLRZ’s findings, the answer is yes. They study the predictive ability of 100 long-short anomaly portfolio returns using different techniques such as dimension reduction and shrinkage, and report statistically robust return predictability. In addition, they find that market timing strategies based on out-of-sample forecasts using anomalies generate economic gains relative to a benchmark.

DLRZ’s paper uses a standard predictive regression setup common to many papers in the time series predictability literature. Therefore, it’s useful to briefly review the predictive regression framework used in these studies.

The Predictive Regression Setup

Many papers in this literature use a predictive regression model:

$$r_{M,t} = \alpha + \beta' x_{t-1} + \varepsilon_t,$$

where r_M,t is the excess market return in month t and x_t-1 contains lagged predictors. In words, we are using information available at time t−1 to forecast next month’s market return.

Different types of predictors have been tested in the literature, including macroeconomic variables, technical indicators, sentiment measures, etc. A few details that matter in practice:

The predictive regression can be estimated using a rolling or an expanding window scheme. Expanding (or recursive) estimation is more common, but if coefficients are unstable or if there are structural breaks, it can be problematic. While rolling estimation doesn’t solve these kinds of issues, it can attenuate their effects.
Many papers truncate forecasts at zero, which is justified by the idea that the equity risk premium shouldn’t be negative. This generally improves the performance of forecasts, but remains an ad hoc procedure. Some papers try to build this restriction directly into the model, for example using Bayesian methods.
The predictable component in aggregate market returns is very small. Adding more than a few predictors to x causes performance to degrade quickly unless some type of regularization is used.
The consensus in the literature is that predicting aggregate market returns is very difficult, and most predictors do not work out of sample. A key reference is Goyal and Welch (2008). An updated version with more predictors is Goyal, Welch, and Zafirov (2024). I use data from this paper later to compare with the performance of long-short anomalies.
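The expanding-window scheme and the truncation at zero can be sketched in a few lines. The example below uses a single simulated predictor and plain Python so that it runs without dependencies. The function names, the 120-month initial window, and the data-generating process are assumptions made for illustration, not the setup of any specific paper.

```python
import random


def ols_univariate(x, y):
    """Intercept and slope from a regression of y on a constant and x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def expanding_forecasts(returns, predictor, first_window, truncate_at_zero=True):
    """Out-of-sample forecasts of returns[t] using predictor[t - 1].

    At each date t >= first_window the regression is re-estimated on all
    pairs (predictor[s - 1], returns[s]) with s < t, so no future data is
    used. Also returns the prevailing-mean benchmark forecast.
    """
    model, benchmark = [], []
    for t in range(first_window, len(returns)):
        x_train = predictor[: t - 1]  # predictor[0 .. t-2]
        y_train = returns[1:t]        # returns[1 .. t-1]
        intercept, slope = ols_univariate(x_train, y_train)
        forecast = intercept + slope * predictor[t - 1]
        if truncate_at_zero:
            forecast = max(forecast, 0.0)
        model.append(forecast)
        benchmark.append(sum(returns[:t]) / t)
    return model, benchmark


random.seed(2)
T = 360
predictor = [random.gauss(0, 1) for _ in range(T)]
# Hypothetical data-generating process with a small predictable component.
returns = [0.005] + [
    0.005 + 0.004 * predictor[t - 1] + random.gauss(0, 0.04) for t in range(1, T)
]

model, benchmark = expanding_forecasts(returns, predictor, first_window=120)
print(f"{len(model)} out-of-sample forecasts, smallest = {min(model):.4f}")
```

A rolling scheme would replace the two training slices with slices of fixed length ending at the same dates.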

Statistical Performance

In the time-series return predictability literature, forecasts are usually evaluated relative to a prevailing mean forecast, which is just the average market excess return using observations available at forecast formation. Another way to think about this forecast is as a regression of excess returns onto a constant. This benchmark assumes that the equity risk premium is unpredictable, so it’s a natural choice. Papers usually report the R²_OOS statistic to compare the performance of a given forecast with that of the prevailing mean forecast. Consider a forecast of the excess market return at time t, using information until time t-1:

$$\hat{r}_{M,t|t-1}.$$

The forecast error is given by

$$e_t = r_{M,t} - \hat{r}_{M,t|t-1}.$$

Let MSFE₁ be its mean squared forecast error:

$$\mathrm{MSFE}_1 = \frac{1}{T} \sum_{t=1}^{T} e_t^2,$$

where T is the number of out-of-sample observations.

Likewise, we can define the prevailing mean forecast at time t and calculate its mean squared forecast error, which we’ll call MSFE₀. The R²_OOS statistic is defined as:

$$R^2_{OOS} = 1 - \frac{\mathrm{MSFE}_1}{\mathrm{MSFE}_0}.$$

This statistic can be interpreted as the proportional reduction in MSFE for the competing forecast relative to the prevailing mean benchmark. To test if a forecast has better predictive ability compared to the prevailing mean benchmark, papers often use the test by Clark and West (2007), which tests the null of equal predictive ability between a parsimonious model with only a constant (i.e., the prevailing mean benchmark) and a larger model that includes predictors. The null hypothesis of the CW test is equal predictive ability: MSFE₁=MSFE₀, which corresponds to R²_OOS=0.
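Both statistics are short to compute once the forecasts are in hand. The sketch below implements R²_OOS as one minus the ratio of the two mean squared forecast errors, and the Clark-West MSFE-adjusted statistic as the t-statistic of the mean of f_t = e₀,t² − [e₁,t² − (benchmark forecast − model forecast)²]. The function names and the toy numbers are hypothetical, and the t-statistic uses a plain standard error. Whether a given paper applies an autocorrelation-robust correction is not something this sketch takes a stand on.

```python
def r2_oos(actual, model_forecast, benchmark_forecast):
    """Out-of-sample R^2: 1 - MSFE_1 / MSFE_0."""
    n = len(actual)
    msfe_model = sum((a - f) ** 2 for a, f in zip(actual, model_forecast)) / n
    msfe_bench = sum((a - b) ** 2 for a, b in zip(actual, benchmark_forecast)) / n
    return 1.0 - msfe_model / msfe_bench


def clark_west_stat(actual, model_forecast, benchmark_forecast):
    """Clark-West (2007) MSFE-adjusted t-statistic (one-sided test).

    The adjustment term (benchmark - model)^2 removes the extra noise that
    the larger model carries from estimating parameters that are zero
    under the null. Large positive values favor the larger model.
    """
    f = [
        (a - b) ** 2 - ((a - m) ** 2 - (b - m) ** 2)
        for a, m, b in zip(actual, model_forecast, benchmark_forecast)
    ]
    n = len(f)
    mean_f = sum(f) / n
    var_f = sum((x - mean_f) ** 2 for x in f) / (n - 1)
    return mean_f / (var_f / n) ** 0.5


# Toy out-of-sample period of six months (made-up numbers).
actual = [0.02, -0.01, 0.03, 0.00, 0.01, -0.02]
benchmark = [0.005] * 6
model = [0.015, -0.005, 0.020, 0.005, 0.010, -0.010]

r2 = r2_oos(actual, model, benchmark)
cw = clark_west_stat(actual, model, benchmark)
print(f"R2_OOS = {r2:.4f}, Clark-West statistic = {cw:.2f}")
```

In applications the statistic is compared with one-sided standard normal critical values (for example 1.645 at the 5% level).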

Economic Performance

Economic performance is generally assessed based on market timing portfolios that solve a mean-variance optimization problem at each point in time. Specifically, the problem is a capital allocation one: based on the available forecast, decide how much to allocate to the overall market and to the risk-free asset. The problem at the end of month t can be written as:

$$\max_{w_t} \; w_t \, \hat{r}_{M,t+1|t} - \frac{\gamma}{2} \, w_t^2 \, \hat{\sigma}^2_{t+1|t},$$

where w_t is the weight on the market, the hatted quantities are the forecasts of next month’s excess market return and of its variance, and γ is the investor’s coefficient of risk aversion, set to γ=3 in most papers. The solution is

$$w_t^* = \frac{1}{\gamma} \, \frac{\hat{r}_{M,t+1|t}}{\hat{\sigma}^2_{t+1|t}}.$$

Although the solution is exact, this formula can lead to very large positive or negative weights. Because of that, weights are typically capped to some interval, such as [-1, 1.5] or [0, 1.5] if the researcher wants to restrict portfolios to be long only. Most papers in the literature estimate the variance of returns using some rolling window. DLRZ use a 60-month rolling window for this purpose. Once the optimal weights are calculated, the timing portfolio returns over the out-of-sample period are calculated in the usual way by multiplying weights by returns. Many papers report the annualized difference in realized utility as a measure of the economic performance. The average realized utility (or certainty equivalent return) of the investor’s portfolio based on the optimal weights is calculated as

$$\mathrm{CER}_1 = \bar{r}_{p,1} - \frac{\gamma}{2} \, \hat{\sigma}^2_{p,1},$$

where the required inputs are the average realized return and the realized variance of the optimal market timing portfolio over the out-of-sample period. Likewise, we can define the corresponding quantity based on the prevailing mean forecast:

$$\mathrm{CER}_0 = \bar{r}_{p,0} - \frac{\gamma}{2} \, \hat{\sigma}^2_{p,0}.$$

Finally, the quantity below expresses the increase in the investor’s utility from using the forecast relative to using the prevailing mean benchmark:

$$\Delta = \mathrm{CER}_1 - \mathrm{CER}_0.$$

If this quantity is positive, it means that the portfolio constructed using the model forecast performs better economically than the portfolio that relies on the benchmark forecast based on averaging past returns.
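The whole evaluation chain (capped mean-variance weight, realized portfolio returns, certainty equivalent, utility gain) fits in a short sketch. The function names, the default cap of [0, 1.5], and the toy inputs are assumptions for illustration. The variance forecasts are passed in by the caller, so any estimator (such as a rolling window) can be plugged in.

```python
def timing_weight(forecast, variance, gamma=3.0, lower=0.0, upper=1.5):
    """Mean-variance weight on the market, forecast / (gamma * variance), capped."""
    raw = forecast / (gamma * variance)
    return min(max(raw, lower), upper)


def certainty_equivalent(portfolio_returns, gamma=3.0):
    """Average realized utility: mean - (gamma / 2) * variance."""
    n = len(portfolio_returns)
    mean = sum(portfolio_returns) / n
    variance = sum((r - mean) ** 2 for r in portfolio_returns) / (n - 1)
    return mean - 0.5 * gamma * variance


def utility_gain(excess_returns, risk_free, model_fc, bench_fc, variances, gamma=3.0):
    """Annualized CER difference (model minus benchmark) from monthly inputs."""

    def realized(forecasts):
        # Weight w in the market and 1 - w in the risk-free asset earns r_f + w * excess.
        return [
            rf + timing_weight(fc, var, gamma) * r
            for fc, var, r, rf in zip(forecasts, variances, excess_returns, risk_free)
        ]

    cer_model = certainty_equivalent(realized(model_fc), gamma)
    cer_bench = certainty_equivalent(realized(bench_fc), gamma)
    return 12.0 * (cer_model - cer_bench)


# Toy six-month example (made-up numbers).
excess = [0.03, -0.04, 0.02, -0.01, 0.05, -0.03]
rf = [0.002] * 6
variances = [0.002] * 6
bench_fc = [0.004] * 6
model_fc = [0.009, -0.009, 0.009, -0.009, 0.009, -0.009]  # gets the sign right every month

gain = utility_gain(excess, rf, model_fc, bench_fc, variances)
print(f"annualized utility gain = {gain:.4f}")
```

The gain can be read as the annual fee an investor with risk aversion γ would be willing to pay to switch from the benchmark forecast to the model forecast.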

DLRZ: Data and Methodology

Now let’s move on to the specific results in DLRZ. The authors estimate the predictive regression discussed previously, where x_t-1 contains the lagged returns on 100 long-short anomaly portfolios constructed using CRSP data. The market excess return is the CRSP value-weighted market return minus the risk-free return, and the sample period goes from 1970 to 2017.

In terms of forecasts, DLRZ construct the following ones:

Conventional OLS: run a multiple regression of excess returns on the lagged 100 long-short anomalies returns. This is expected to perform poorly due to overfitting.
ENet: same but using elastic net, which regularizes the coefficients and should alleviate the overfitting concern.
Simple Combination: run univariate regressions and average the forecasts. This is essentially another kind of shrinkage as in Rapach, Strauss, and Zhou (2010).
Combination ENet: similar to the simple combination, but instead of averaging directly over all univariate forecasts, they first run univariate regressions on the training window, leaving 5 years as a holdout sample. Then, they run an elastic net regression of excess returns onto the univariate forecasts on the holdout sample, and finally they average the univariate regression forecasts only for the regressors that were selected by the elastic net.
Predictor Average: instead of using the 100 individual anomalies, first take their average and then run a univariate regression of excess returns on this average.
Principal Component: run a univariate regression of excess returns on the first principal component of the 100 anomalies.
PLS: uses the Partial Least Squares approach to construct a target-relevant factor that has maximum correlation with the market excess return.

The estimation of these models is done using an expanding window scheme. The initial estimation window is the 10-year period from 1970:01 to 1979:12. The period from 1980:01 to 1984:12 is used as a holdout sample for the Combination ENet method. One peculiarity of DLRZ is that they only use the holdout period for the elastic net combination method. For all the other models, including the standard ENet, they use the combined training and validation windows at each point.
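Two of the simpler methods in the list, the Simple Combination and the Predictor Average, can be sketched with nothing more than univariate regressions on an expanding window. The code below uses simulated anomaly returns and hypothetical function names. It is a toy version of the idea, not DLRZ’s implementation, and it omits the elastic net, principal component, and PLS variants.

```python
import random


def ols_univariate(x, y):
    """Intercept and slope from a regression of y on a constant and x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def combination_forecast(returns, anomalies, t):
    """Simple combination forecast of returns[t] made with data through t - 1.

    anomalies is a list of series, one per anomaly. Each univariate
    regression pairs anomaly[s - 1] with returns[s] for s < t, and the
    resulting forecasts are averaged with equal weights.
    """
    forecasts = []
    for series in anomalies:
        intercept, slope = ols_univariate(series[: t - 1], returns[1:t])
        forecasts.append(intercept + slope * series[t - 1])
    return sum(forecasts) / len(forecasts)


def predictor_average_forecast(returns, anomalies, t):
    """Forecast from one regression on the cross-sectional average anomaly return."""
    average = [sum(vals) / len(vals) for vals in zip(*anomalies)]
    intercept, slope = ols_univariate(average[: t - 1], returns[1:t])
    return intercept + slope * average[t - 1]


random.seed(3)
T, N = 200, 10  # hypothetical sample: 200 months, 10 anomalies
anomalies = [[random.gauss(0, 0.03) for _ in range(T)] for _ in range(N)]
returns = [random.gauss(0.005, 0.04) for _ in range(T)]

oos_dates = range(120, T)
combo = [combination_forecast(returns, anomalies, t) for t in oos_dates]
avg = [predictor_average_forecast(returns, anomalies, t) for t in oos_dates]
print(f"{len(combo)} combination forecasts, {len(avg)} predictor-average forecasts")
```

Averaging many univariate forecasts pulls the combined forecast toward the historical mean, which is why the simple combination behaves like a form of shrinkage.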

DLRZ: Results

The main result of DLRZ is that the long-short anomalies have strong predictive ability for aggregate market returns. Their main table for statistical performance shows that, while conventional OLS has terrible performance (as expected), several approaches deliver high R²_OOS close to 2% or even 3%.

In addition, several forecasts using the anomalies deliver significant economic gains relative to the prevailing mean benchmark or a buy-and-hold strategy:

These results are puzzling to some degree, because previous research had shown that firm characteristics themselves appear to have very limited ability to predict aggregate returns. Specifically, a 2023 JFQA paper by Engelberg, McLean, Pontiff, and Ringgenberg looked at whether average values of firm characteristics that are widely used to explain cross-sectional returns (or form portfolios), such as price-to-earnings, could be used to predict aggregate market returns, finding little evidence that this was the case.

There is much more in the paper (it’s a JF paper after all!), including an economic rationale based on mispricing that could explain why the anomalies should have predictive ability. But what is important for this series is the replication angle.

CFMZ: The Replication

In their 2024 RF paper, Cakici, Fieberg, Metko, and Zaremba (CFMZ) performed a large-scale replication of DLRZ using both U.S. and global market data.

Their conclusion is quite strong:

I first came upon this paper after having come to a similar conclusion myself (at least regarding the U.S. market), based on my own attempts to replicate the results in DLRZ. I was working on a paper on this topic, and decided to include the average of a large cross-section of long-short anomalies as a predictor. I relied on data from Chen and Zimmermann’s Open Source Asset Pricing, and found that this average had little to no explanatory power, and that what little it had appeared only under certain design choices, such as using an expanding window.

CFMZ analyze this in a much more detailed way using a large dataset with anomalies from the U.S. and 42 other countries.

CFMZ: Replication in Other Markets

The first result of CFMZ, using up to 153 anomalies in other countries and machine learning forecasts similar to those used in DLRZ, is that the average R²_OOS across the 42 countries is negative or close to zero. The market timing portfolios formed using the anomalies also underperform naive benchmarks and a buy-and-hold strategy.

CFMZ: Replication in Alternative Anomaly Sets

CFMZ try to replicate DLRZ’s results using four datasets:

Open Source Asset Pricing.
DLRZ’s dataset, which is provided in their replication package.
Hou, Xue, and Zhang’s (2020) anomalies.
Jensen, Kelly, and Pedersen’s (2023) repository.

Their conclusion:

Among all the tested samples, the return predictability holds only for one in four: the original sample of Dong et al. (2022). No other anomaly set generates any evidence of a similar pattern.

CFMZ also tested:

whether the selection of anomalies matters: they tested random samples of 100 anomalies from the various sets, finding large variations in R²_OOS but the same overall conclusion, even for the best sets. In particular, they also found that the set of anomalies used by DLRZ contained a higher percentage of anomalies with significant predictive ability, and that their predictive ability exceeds that of similar variables from other samples. Finally, they show that predictive ability in the DLRZ set is unusually concentrated in anomalies related to issuance.
whether anomaly construction can affect the results: they tested different options for methodological choices used to create the anomalies, such as weighting schemes, winsorization rules, and cut-off points. Their conclusion: only a few implementations showed some predictive ability that is comparable to that of DLRZ.

DLRZ Go Global

In a separate paper (not yet published), Dong, Li, Li, Rapach, and Zhou reassess the predictive power of anomalies, this time using global anomalies data for 43 non-US countries. Their story is that, while anomalies have limited predictive power at the country level (which seems to agree with CFMZ), they show strong predictive power when aggregated to the supranational level. While in this paper they also use alternative anomalies data from Jensen, Kelly, and Pedersen (2023), they focus exclusively on non-US countries.

A Python Workbook to Investigate Anomalies for Market Prediction

Pockets of Replicability (Post #3)

Systematically Biased — Mon, 16 Mar 2026 12:02:12 GMT

In their 2018 JF paper, Comparing Asset Pricing Models, Barillas and Shanken (BS) proposed a Bayesian asset pricing test to help sort through the so-called “factor zoo”, the collection of hundreds of asset pricing factors “discovered” over the last 30 years or so. This widely cited paper was part of a broader revival of interest in Bayesian methods in asset pricing, and the multifactor model they identified as having the largest posterior probability was used in several papers as an alternative to common benchmarks like the Fama and French 3- and 5-factor models, or the models proposed by Hou, Xue, and Zhang.

The main idea of the BS approach was to use an improper Jeffreys prior for the beta coefficients and the residual covariance matrix:

$$P(\beta, \Sigma) \propto |\Sigma|^{-(N+1)/2},$$

while keeping an informative prior for alphas conditional on other parameters:

$$\alpha \mid \beta, \Sigma \sim \mathcal{N}(0, k\Sigma),$$

where N is the number of test assets, Σ is the residual covariance matrix, and k is chosen based on plausible values for the maximum Sharpe ratio. Under these choices, they obtained a closed-form way of calculating posterior model and factor probabilities, which allows for model comparisons. Using a set of 13 candidate asset pricing factors, they compute posterior model probabilities recursively over time. The graph below suggests that the model in blue, which includes six factors (market, momentum, size, monthly updated value, investment, and ROE), seems to dominate others.

The Problem

In 2020, Chib, Zeng and Zhao (CZZ) published (also in the JF) the aptly titled paper “On Comparing Asset Pricing Models”, in which they showed that BS’s calculations do not lead to correct posterior model probabilities. CZZ state (emphasis mine):

In this paper, we revisit the framework of Barillas and Shanken (2018), BS henceforth, and show that the Bayesian marginal likelihood-based model comparison method in that paper is unsound. Hence, the BS “marginal likelihoods” each depend on an arbitrary constant, which voids the ranking of models by the size of the marginal likelihoods and invalidates any conclusions drawn from such a method about the underlying data-generating process (DGP).

The issue arises because the prior for the betas is improper (i.e., not a valid probability distribution that integrates to 1). Because of that, multiplying this improper prior by any positive value produces the same improper prior, which in turn implies that the marginal likelihoods (loosely speaking, the probability of the data given the model) are determined only up to a constant. If all models shared the same prior, this constant would cancel out in model comparisons.¹ However, as CZZ show, this is not the case in the BS framework. The key requirement for valid Bayesian model comparison is that the priors across competing models must be induced from a common underlying measure so that any arbitrary constants cancel in Bayes factors. CZZ show that this coherence condition fails in the BS setup, and then go on to propose a prior under which these conditions hold, and thus model comparison is valid.
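The role of the arbitrary constant can be made explicit with a short generic derivation. The notation below (models M_j with parameters θ_j, improper priors π_j, unnormalized densities g_j, and constants c_j) is introduced here for illustration and is not taken from either paper.

```latex
\begin{align*}
\pi_j(\theta_j) &= c_j \, g_j(\theta_j), \qquad c_j > 0 \text{ arbitrary}, \\
m_j(y) &= \int p(y \mid \theta_j, M_j)\, \pi_j(\theta_j)\, d\theta_j
        = c_j \, \tilde{m}_j(y),
        \qquad \tilde{m}_j(y) = \int p(y \mid \theta_j, M_j)\, g_j(\theta_j)\, d\theta_j, \\
\mathrm{BF}_{12} &= \frac{m_1(y)}{m_2(y)}
        = \frac{c_1}{c_2} \cdot \frac{\tilde{m}_1(y)}{\tilde{m}_2(y)}.
\end{align*}
```

When the improper part of the prior is the same object in both models, c_1 = c_2 and the ratio c_1/c_2 drops out of the Bayes factor. When the improper priors are defined over different parameter blocks in different models, nothing ties c_1 to c_2, so the Bayes factor, and hence the ranking of models, can be moved arbitrarily.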

How much difference does it make? Using simulations, CZZ show that under different DGPs and sample sizes, the BS prior never identifies the correct null model, in the sense that the highest posterior probability model under the BS prior is never the true model. This is disputed to some degree by BS.

The BS Riposte

Not surprisingly, Barillas and Shanken disputed this critique and responded in a follow-up paper. In a reply to CZZ called “Comparing Priors for Comparing Asset Pricing Models”, BS state that the approach described in CZZ had been discussed in communications between BS and CZZ, in response to an earlier version of CZZ’s paper. CZZ note in their paper that the issue had been raised by a reader of an earlier draft. As it turns out, one of the authors of BS had been a reviewer of the earlier version of CZZ’s paper.

BS also argue that the measure used by CZZ to compare the results under the two different priors was overly simplistic, as it relied on a binary outcome (assigning higher posterior probability to the correct model). In addition, they argue that CZZ pre-selected statistically significant models, which may induce some bias.

Applied CZZ

In “Winners from Winners: A Tale of Risk Factors”, Chib, Zhao, and Zhou (let’s call them CZZ2) put the CZZ prior to an empirical test using two sets of factors:

A smaller list of 12 “benchmark” factors for which there’s some support in the literature (Fama and French factors, Hou, Xue, and Zhang factors etc).
A larger list of 125 additional factors from the factor zoo (from Hou et al., 2020).

They first run a benchmark scan using the initial set of benchmark factors, which supports a 7-factor model:

Next, they use this 7-factor model to rule out factors from the larger set (with 125 factors).2 This approach reduces the model space by keeping only 24 “true anomalies” relative to the 7-factor model. Note that the model space is still huge: with a total of 36 factors (12 benchmark plus 24 true anomalies), the total number of models is over 68 billion (2³⁶). To further reduce dimensionality, they use the first 12 principal components (PCs) of the 24 anomalies, reducing the model space to “only” about 17 million models (2²⁴). The resulting models with highest posterior probability all include the PCs of the anomalies. In addition, the best model from this extended model scan outperforms the best model from the original scan over the 12 benchmark factors.

Other Bayesian Approaches for the Factor Zoo

I wrote a paper with Soosung Hwang in which we applied a Bayesian variable selection methodology to investigate the selection of asset pricing factors using individual stocks. Using individual stocks has the advantage of bypassing certain issues that can arise from grouping stocks into portfolios. We used a hierarchical model with indicator variables γ_j that are equal to 1 if a factor is included in the model, and 0 otherwise. Since we did not use improper priors, the issues described previously do not affect model comparison. One of the advantages of the specific hierarchical setup we used is that it is possible to integrate out the model parameters and obtain the posterior distribution of the vector γ directly through MCMC simulation, which can then be used to obtain the posterior factor and model probabilities. Our results suggested that:

of the 88 factors we considered, only a few appeared to be relevant;
of these, only the market and size factors coincide with the factors from widely used factor models, like the Fama-French 5-factor model or the Hou et al q-factor model;
many different factor combinations have similar posterior probability, suggesting that model uncertainty is pervasive and that the search for a single factor model is unlikely to give a definitive answer in asset pricing.

Bryzgalova et al. (2023) proposed a Bayesian approach to compare different linear asset pricing models by focusing directly on estimation of the stochastic discount factor (SDF). Although they use improper priors, they are careful to only use them for nuisance parameters, such that their effect is canceled out in Bayes factor calculations. In addition to identifying high-posterior probability factors that should be included in any SDF, a key empirical result from their paper is that the SDF is dense in the space of observable factors. In other words, due to substantial model uncertainty, aggregation through Bayesian model averaging outperforms sparse representations. This is similar to what we found in our paper, and in line with the findings of Giannone, Lenza, and Primiceri (2021), that dense models outperform sparse ones in various applications in economics and finance.

Final Thoughts

The use of Bayesian methods in asset pricing research has several advantages. The Bayesian approach is flexible, handles high dimensionality well, and naturally provides answers when model uncertainty is pervasive. However, as this short discussion shows, there are some pitfalls in Bayesian model comparison. Improper priors are often harmless for parameter estimation, but they can be problematic for model comparison because Bayes factors depend on the normalization of the prior.

An Oversimplified Example

To illustrate the issue with improper priors, consider a simple example. Suppose we observe data:

where σ² is known, and want to compare

against

Under the alternative model, suppose we use the improper flat prior:

for some arbitrary constant C.

The marginal likelihood under M₀ is a function only of the data. Let’s denote it by m₀(x).

The marginal likelihood under the alternative is defined only up to an arbitrary positive constant, because the prior is improper. When we integrate the likelihood with respect to this prior to compute the marginal likelihood, this constant carries through the calculation. The marginal likelihood under the alternative then becomes

m₁(x) = C·A(x),

where A(x) depends only on the data. The Bayes factor comparing the two models is

BF₁₀(x) = m₁(x)/m₀(x) = C·B(x),

where B(x)=A(x)/m₀(x). Since C can be chosen arbitrarily, the Bayes factor, and therefore the posterior model probabilities, are not uniquely defined.
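
As one concrete instance, the quantities m₀(x), A(x), and B(x) can be written out explicitly. The specific model below is an illustrative assumption (i.i.d. normal data with known variance, a point null μ=0, and a flat prior p(μ)=C under the alternative); it is a sketch of the mechanics rather than the setup of any of the papers discussed.

```latex
\begin{aligned}
x_1,\dots,x_n \mid \mu &\overset{\text{iid}}{\sim} N(\mu,\sigma^2),
  \qquad M_0:\ \mu = 0, \qquad M_1:\ p(\mu) = C, \\
m_0(x) &= (2\pi\sigma^2)^{-n/2}
  \exp\!\Big(-\tfrac{1}{2\sigma^2}\sum_{i=1}^n x_i^2\Big), \\
m_1(x) &= C\int_{-\infty}^{\infty}(2\pi\sigma^2)^{-n/2}
  \exp\!\Big(-\tfrac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2\Big)\,d\mu \\
  &= C\,\underbrace{(2\pi\sigma^2)^{-n/2}\sqrt{\tfrac{2\pi\sigma^2}{n}}
  \exp\!\Big(-\tfrac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\bar{x})^2\Big)}_{A(x)}, \\
\mathrm{BF}_{10}(x) &= \frac{m_1(x)}{m_0(x)}
  = C\,\underbrace{\sqrt{\tfrac{2\pi\sigma^2}{n}}
  \exp\!\Big(\tfrac{n\bar{x}^2}{2\sigma^2}\Big)}_{B(x)}.
\end{aligned}
```

The second line for m₁(x) uses the identity Σ(xᵢ−μ)² = Σ(xᵢ−x̄)² + n(μ−x̄)² and the Gaussian integral over μ. Doubling C doubles the Bayes factor, so with this prior the data alone cannot settle the comparison.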

1. At the end of this post, I include an oversimplified example of the problem of Bayesian model comparison with improper priors.

2. For each of the 125 factors, they run a Bayesian comparison of two models, one with the intercept and one without. If the Bayes factor favors the model with the intercept, they conclude that the factor is a true anomaly and include it in the next step.

Pockets of Replicability (Post #2)

Systematically Biased — Wed, 04 Mar 2026 14:05:01 GMT

Pockets of Replicability is a series of posts about research replicability in finance.
This post is about the paper “The Virtue of Complexity” by Kelly, Malamud, and Zhou.
I start this post with a simplified discussion about overfitting and the bias-variance trade-off. If this is familiar territory, you can skip directly to “Double Descent”. If that’s also familiar, you can skip directly to “The Virtue of Complexity”.

The Bias-Variance Trade-off

Any textbook on machine learning discusses the bias-variance trade-off: the tension between model complexity and in-sample vs out-of-sample performance. I like the example below from Bishop’s classic book on machine learning. We’re given a sample of N=10 points from a function y=sin(2πx). In the graph below, the truth is plotted as the green curve. In real life, all we observe are the data (blue circles).

Next, we try to build a model using polynomials of the observed x:

As we increase M, the in-sample fit improves. At M=9, the model interpolates the training data:

When M=9, the curve passes through every training point, but the fitted model oscillates wildly. If we were to change the data slightly, the fitted curve could look very different. The model is fitting the training data too closely (overfitting). In practice, the regression coefficients become unstable, and when tested on new data, performance deteriorates. This is the bias-variance trade-off: making the model more complex reduces the bias (it can better fit the training data) but increases the variance (the sensitivity of the fitted model to sample variation).
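
The polynomial experiment can be sketched in a few lines. This is a minimal illustration, not Bishop's exact setup: the noise level (0.3), the random seed, the test sample, and the helper names `design` and `rmse` are all my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# N = 10 noisy observations of y = sin(2*pi*x); the noise level is an assumption.
N = 10
x_train = np.linspace(0.0, 1.0, N)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=N)

# A separate, larger sample from the same process to measure out-of-sample error.
x_test = rng.uniform(0.0, 1.0, size=200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.3, size=200)


def design(x, M):
    """Design matrix with columns 1, x, x^2, ..., x^M."""
    return np.vander(x, M + 1, increasing=True)


def rmse(y, y_hat):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


degrees = [0, 1, 3, 9]
train_rmse, test_rmse = {}, {}
for M in degrees:
    # Least-squares fit of the order-M polynomial.
    w, *_ = np.linalg.lstsq(design(x_train, M), y_train, rcond=None)
    train_rmse[M] = rmse(y_train, design(x_train, M) @ w)
    test_rmse[M] = rmse(y_test, design(x_test, M) @ w)
    print(f"M={M}: train RMSE={train_rmse[M]:.4f}, test RMSE={test_rmse[M]:.4f}")
```

The training error can only fall as M grows, because each polynomial nests the previous one, and it reaches (numerically) zero at M=9, where there are as many coefficients as data points. The test error printed alongside it shows whether that extra fit carries over to new data.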

The bias-variance trade-off and conventional statistical wisdom suggest that we shouldn’t try to estimate a model with 10 parameters when we have 10 data points. If we increase the sample size, the overfitting problem becomes less severe. The graph below shows the estimated model using 15 (left) and 100 (right) data points.

Shrinkage

What if we can’t increase the sample size? In this case, a solution to the overfitting problem is to use shrinkage or regularization. The idea is to “discipline” the model by preventing coefficients from becoming too large. Ridge regression is one such method that works by placing a penalty term on the error function, controlled by a parameter λ. Instead of minimizing the sum of squared errors, we minimize an objective function that adds a penalty term:1

Setting λ=0 reproduces the OLS overfit (left below). Setting λ too large collapses the model to the mean (right). The middle panel shows a value of λ that offers a compromise that alleviates overfitting, even with only 10 data points.
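
A minimal sketch of ridge regression on the same order-9 polynomial, assuming the standard objective ‖y − Xw‖² + λ·Σⱼ₌₁ᴹ wⱼ² with an unpenalized intercept. The data, the λ grid, and the helper `ridge_fit` are illustrative assumptions; the fit is computed through an augmented least-squares problem for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 9
x = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)
X = np.vander(x, M + 1, increasing=True)   # columns 1, x, ..., x^9

# Penalize every coefficient except the intercept (first column).
D = np.eye(M + 1)
D[0, 0] = 0.0


def ridge_fit(X, y, lam):
    """Minimize ||y - Xw||^2 + lam * w'Dw via an augmented least-squares problem."""
    X_aug = np.vstack([X, np.sqrt(lam) * D])
    y_aug = np.concatenate([y, np.zeros(X.shape[1])])
    w, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return w


lambdas = [1e-8, 1e-3, 1.0, 1e8]
rss, penalty = [], []
for lam in lambdas:
    w = ridge_fit(X, y, lam)
    rss.append(float(np.sum((y - X @ w) ** 2)))
    penalty.append(float(w[1:] @ w[1:]))
    print(f"lambda={lam:g}: RSS={rss[-1]:.4f}, sum of squared slopes={penalty[-1]:.4g}")

# With a very large penalty only the unpenalized intercept survives,
# so the fitted curve is flat at the sample mean of y.
fit_large = X @ ridge_fit(X, y, 1e8)
```

As λ increases, the in-sample fit worsens and the coefficients shrink, which is the trade-off the three panels illustrate.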

The Double Descent

The conventional statistical wisdom from examples like the above is that there is a trade-off between a model’s complexity and its generalization ability (the performance of a model on unseen data). In the polynomial example, when M=9, the model has as many coefficients as the number of available data points. Beyond this point, ordinary least squares (OLS) estimation breaks down. What happens when the number of coefficients is larger than the number of observations? Recent research has shown that flexible models like neural networks, in which the number of parameters can easily exceed the number of data points, can still have good generalization despite interpolating the training data.

In the textbook bias-variance trade-off described above, error on new data starts to increase as the model complexity increases (Panel A below). But in some situations, a phenomenon called “double descent” is observed: as model complexity increases past the point where the model can interpolate the training data, test error can start to decrease once again (Panel B).2

Double Descent in Linear Models

Double descent is not confined to complex neural networks.3 Suppose we have n data points and are trying to estimate a linear model with p features. We can consider two complexity regimes:

underparameterized: the number of features is less than the number of data points (p<n).
overparameterized: the number of features exceeds the number of data points (p>n).

In the overparameterized regime, the least squares objective function does not have a unique minimizer: there are infinitely many coefficient vectors that fit the training data equally well. When the design matrix has full row rank and p ≥ n, the model can interpolate the data, achieving zero training error.

Ridge regression resolves this non-uniqueness by shrinking coefficients toward zero. As before, we must select the shrinkage parameter λ. A special solution in the overparameterized regime is the ridgeless (minimum-norm) estimator, obtained as the limit of ridge regression as λ→0⁺. In that limit, the estimator converges to the unique interpolating solution with the smallest Euclidean norm (equivalently, the Moore–Penrose pseudoinverse solution).4
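
The ridgeless estimator is easy to check numerically. The sketch below uses synthetic Gaussian data (an assumption, as are the dimensions and the value of λ) to compare three coefficient vectors: the pseudoinverse solution, ridge with a tiny penalty, and a second interpolator built by adding a null-space vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                      # more features than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Minimum-norm interpolator via the Moore-Penrose pseudoinverse.
b_min_norm = np.linalg.pinv(X) @ y

# Ridge with a tiny penalty, written in its n x n form:
# (X'X + lam I)^{-1} X'y = X'(XX' + lam I)^{-1} y.
lam = 1e-8
b_ridge = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

# Another interpolator: add any vector from the null space of X.
v = rng.normal(size=p)
v_null = v - np.linalg.pinv(X) @ (X @ v)   # part of v orthogonal to the rows of X
b_other = b_min_norm + v_null

print("max |X b_min_norm - y| =", np.max(np.abs(X @ b_min_norm - y)))
print("max |X b_other    - y| =", np.max(np.abs(X @ b_other - y)))
print("norms:", np.linalg.norm(b_min_norm), np.linalg.norm(b_other))
print("max |b_ridge - b_min_norm| =", np.max(np.abs(b_ridge - b_min_norm)))
```

Both `b_min_norm` and `b_other` fit the training data exactly, but only the former has the smallest Euclidean norm, and it is the one that ridge approaches as λ→0⁺.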

Is it better to use ridgeless regression or more aggressive regularization (i.e., higher λ)? The answer depends on the structure of the data and on whether the model is well-specified (the true signal lies in the span of the features) or misspecified (the features only approximate the true signal). In some settings, ridge regression yields lower risk; in others, the minimum-norm interpolator generalizes better.

The Virtue of Complexity (Kelly, Malamud, Zhou, 2024 JF)

In a recent (and, it turns out, controversial) paper, Kelly, Malamud, and Zhou (2024, KMZ) study equity return prediction in the overparameterized regime. Specifically, they study the accuracy of forecasts, and the returns of market timing portfolios built from those forecasts, when the number of model parameters grows far beyond the sample size. KMZ define complexity as the ratio of the number of features to the number of observations, and their theoretical results rely on random matrix theory to study the asymptotics when the number of features grows with the number of observations at a fixed rate:

P/T → c > 0 as P, T → ∞, where P is the number of features and T is the number of training observations.

Their theoretical results show that out-of-sample forecast accuracy increases with model complexity once an appropriate degree of shrinkage is applied. Their recommendation is directly opposed to conventional statistical wisdom:

Our central research question therefore is, what level of model complexity […] should the analyst opt for? […] The analyst should always use the largest approximating model that she can compute.

Predicting the S&P500 with Complex Models

In their empirical application, KMZ forecast U.S. equity returns from 1926 to 2020 using 15 predictors that have been widely studied in the equity risk premium literature.5 To transition from low to high complexity regimes, they use a machine learning technique known as Random Fourier Features (RFF). This consists in generating random features from the original set of 15 features using transformations of the form:

and then estimating a model of the form:

The advantage of this approach is that any number of features can be generated from the original ones, allowing them to investigate different complexity regimes. KMZ vary the number of RFF features from P=2 to P=12,000, therefore transitioning from the underparameterized to the overparameterized regime. They then conduct a backtesting exercise to create out-of-sample forecasts and derive (conditionally) optimal market timing portfolio weights. Their results suggest that the expected returns of market timing portfolios increase and their volatilities decrease in the high complexity regime, where the number of features vastly exceeds the number of training observations. Perhaps the most surprising empirical pattern they document is that high-complexity models estimated with as little as 12 months of training data still generate positive R² and Sharpe ratios. Most papers in this literature use decades of data and many papers rely on expanding window estimation.
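
To make the construction concrete, here is a minimal sketch of random Fourier features followed by a no-intercept ridgeless fit. It assumes the common sine/cosine form with Gaussian random weights; the bandwidth `gamma`, the dimensions, the synthetic data, and the helper `rff` are my assumptions, and KMZ's replication code may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, P = 12, 15, 200      # 12 training months, 15 predictors, 200 random features
gamma = 2.0                # bandwidth parameter; the value is an assumption

G = rng.normal(size=(T, K))        # placeholder predictors (synthetic)
r_next = rng.normal(size=T)        # placeholder next-period returns (synthetic)
g_new = rng.normal(size=K)         # predictor vector at the forecast date


def rff(G, omega, gamma):
    """Map predictors to [sin(gamma * G @ omega), cos(gamma * G @ omega)]."""
    proj = gamma * np.atleast_2d(G) @ omega
    return np.hstack([np.sin(proj), np.cos(proj)])


# Each random weight vector yields one sine and one cosine feature.
omega = rng.normal(size=(K, P // 2))
S = rff(G, omega, gamma)           # T x P feature matrix
s_new = rff(g_new, omega, gamma)   # 1 x P feature vector

# No-intercept "ridgeless" fit: minimum-norm least squares via the pseudoinverse.
beta = np.linalg.pinv(S) @ r_next
forecast = float(s_new[0] @ beta)

print("feature matrix shape:", S.shape)
print("max in-sample error:", np.max(np.abs(S @ beta - r_next)))
print("forecast:", forecast)
```

With P=200 features and only T=12 observations, the fit reproduces the training returns exactly, so all of the discipline in the forecast comes from the minimum-norm (or ridge) selection among interpolating solutions.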

Important Details

KMZ make several empirical design choices that will be important to understand the criticisms about their results:

The lagged return on the market is included in the set of 15 variables. That is, R_t is included among the 15 features.
KMZ volatility-standardize market excess returns using a trailing 12-month standard deviation of excess returns computed using returns up to time t. They volatility-standardize the predictors as well, but using an expanding window to estimate volatilities.
The regression model above does not include an intercept.
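
The return standardization in the second item can be sketched as follows. This is my reading of the description, not KMZ's code: the return being forecast is divided by the sample standard deviation of the 12 returns that precede it. The synthetic data, the use of `ddof=1`, and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.normal(loc=0.005, scale=0.05, size=240)   # synthetic monthly excess returns

window = 12
r_std = np.full(r.shape, np.nan)
for t in range(window, len(r)):
    # Volatility estimated from the 12 returns ending at t-1, i.e. only data
    # available before the return r[t] being scaled is realized.
    trailing_vol = np.std(r[t - window:t], ddof=1)
    r_std[t] = r[t] / trailing_vol

print(r_std[window:window + 3])
```

Because the scaling uses only past returns, the standardized series is known in real time, but it also means the regression target itself varies with recent volatility.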

The Reaction

The results from KMZ have been challenged by a number of authors. Most of the critique has focused on their empirical implementation.6 But there are also some critiques about the theoretical setup in the paper, especially regarding how KMZ model the economy.

Modeling Choices

Buncic (2025) does a deep dive into the empirical implementation in KMZ. There are many interesting points in the paper, but the main ones are:

Missing intercepts: KMZ estimate models (both the complex models using RFFs and the simple models using the predictors directly) without including an intercept, which induces large bias in KMZ’s out-of-sample forecasts. Omitting the intercept from a ridge regression altogether is uncommon: the usual practice is to keep it in the model and leave it unpenalized (see footnote 1). Buncic shows that including the intercept immediately improves all models. It also explains why KMZ’s forecasts are mostly positive. KMZ argue that the complex model “learns from the data”, but Buncic’s results suggest that this is simply a byproduct of excluding the intercept.7
Aggregation schemes: due to the random nature of RFFs, KMZ repeat estimation 1,000 times for each value of P and aggregate forecasts across runs by taking an average. This is a standard technique in machine learning when individual forecasts have a random component (e.g. as in bagging or random forests). The result is a single forecast for each value of P at each point in time, calculated as the average across the 1,000 estimates. The time series of forecasts can then be used to define the market timing portfolio positions.8 However, when they analyze the performance of market timing portfolios, KMZ calculate averages over each of the 1,000 market timing portfolios. For linear measures like expected returns, this doesn’t make a difference, but for nonlinear metrics like the Sharpe ratio, it does. Buncic argues that this approach doesn’t make sense; in practice, it would require the investor to hold 1,000 portfolios.
Placebo using IID data: Buncic runs an experiment in which he deploys KMZ machinery on random data from an i.i.d. Normal distribution. The patterns reported in KMZ, especially their “Virtue of Complexity” or VoC curve, obtain also for this unpredictable data. But since the data are random in this experiment, there is nothing to learn.
Poor performance: perhaps the strangest element of KMZ is the apparently poor performance of complex as well as simple models (linear regression models with or without shrinkage using the 15 predictors). KMZ state (my emphasis):
We find extraordinary agreement between empirical patterns and our theoretical predictions. Over the standard Center for Research in Security Prices (CRSP) sample from 1926 to 2020, out-of-sample market timing Sharpe ratio improvements (relative to market buy-and-hold) reach roughly 0.47 per annum with t-statistics near 3.0.
but their Table 1 (below) actually seems to report absolute (not relative) Sharpe ratios, as noted by Buncic. Since the Sharpe ratio of the buy-and-hold strategy over this period is 0.51, none of the market timing portfolios outperform the buy-and-hold strategy. Another puzzling choice is to report only two very different cases (the ridgeless case in which z=0⁺, and a ridge case with significant shrinkage, z=10³).
Buncic shows that the KMZ implementation of complex models (shown on the right below) actually underperforms the buy and hold (the red line) across the board, while ridge regression with less shrinkage than the one chosen in KMZ’s Table 1 outperforms the buy and hold. When the intercept term is added, KMZ’s complex models improve a bit, but are still worse than simple ridge regression.
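
The aggregation point is easy to see with simulated numbers. The sketch below uses purely synthetic returns (a shared component plus draw-specific noise, with parameters I made up) and assumes, as in KMZ, that positions equal forecasts, so the portfolio built from the averaged forecast earns the average of the individual portfolio returns.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 600, 1000     # months and number of random-feature draws

# Synthetic strategy returns: a shared component plus draw-specific noise.
common = rng.normal(loc=0.01, scale=0.04, size=(T, 1))
noise = rng.normal(loc=0.0, scale=0.04, size=(T, K))
returns = common + noise                      # column k = returns of portfolio k


def sharpe(r):
    """Per-period Sharpe ratio of an excess-return series (or of each column)."""
    return np.mean(r, axis=0) / np.std(r, axis=0, ddof=1)


# (a) One portfolio built from the averaged forecast.
avg_portfolio = returns.mean(axis=1)
sharpe_of_average = float(sharpe(avg_portfolio))

# (b) Average of the K individual Sharpe ratios.
average_of_sharpes = float(np.mean(sharpe(returns)))

# Mean returns agree under both aggregation schemes (a linear measure).
mean_a = float(avg_portfolio.mean())
mean_b = float(returns.mean(axis=0).mean())

print("mean return:", mean_a, mean_b)
print("Sharpe of averaged portfolio:", sharpe_of_average)
print("average of individual Sharpes:", average_of_sharpes)
```

The mean return is identical either way, but averaging the portfolios diversifies away the draw-specific noise, so the two Sharpe ratio calculations answer different questions and generally give different numbers.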

It’s All Momentum

Nagel (2025) argues that, when the number of RFF features (P) is much larger than the size of the training sample (T),

…the RFF-based forecast becomes a weighted average of the T training sample returns, with weights determined by the similarity between the predictor vectors in the training data and the current predictor vector. In short training windows, similarity primarily reflects temporal proximity, so the forecast reduces to a recency-weighted average of the T return observations in the training data—essentially a momentum strategy. Moreover, because similarity declines with predictor volatility, the result is a volatility-timed momentum strategy.

Nagel’s argument is that, since several of the predictors are persistent (e.g. dividend/price ratio), recent observations tend to be more similar to the current one, which leads to higher weights on recent returns. Nagel’s conclusion is that the performance of the high-complexity KMZ forecasts is simply due to the fact that a volatility-timed momentum strategy performs well historically. He goes on to directly compare the average weight on recent returns in KMZ’s high-complexity ridgeless regression with those from a simple volatility-adjusted momentum strategy:

Finally, Nagel simulates data with a reversal (rather than momentum) pattern by adding an MA(2) component that induces strong negative autocorrelation to actual market returns. The KMZ forecasts continue to behave as a volatility-timed momentum strategy and produce negative out-of-sample returns. According to Nagel, this confirms that the KMZ complex models are not learning from the data.

A similar point about the convergence of KMZ’s approach to a momentum strategy is made by Elmore and Strauss (2026), who claim that the combination of complex models with a high degree of shrinkage leads to forecasts that amount to a rolling window of past returns.

Buncic also noted that any rolling window regression forecast is a weighted average of the returns in the rolling window used.9 He computes time-series averages of these weights for different complex (RFF) and simple (ridge using the 15 features) models, estimated with and without an intercept. His results show a similar pattern to Nagel’s when models are estimated without an intercept: the weights decay exponentially over recent returns (both for RFF and simple forecasts). When models are estimated with an intercept, this pattern disappears.
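
The weighted-average representation follows directly from the ridge formula: the forecast x′(X′X+λI)⁻¹X′y can be regrouped as w′y with w = X(X′X+λI)⁻¹x. A minimal numerical check on synthetic data (the dimensions, the penalty, and the variable names are assumptions) for the no-intercept case:

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 12, 15
X = rng.normal(size=(T, K))      # synthetic predictors in the training window
y = rng.normal(size=T)           # synthetic returns in the training window
x_new = rng.normal(size=K)       # predictors at the forecast date
lam = 1.0                        # ridge penalty; the value is an assumption

# Usual route: estimate coefficients, then forecast.
beta = np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)
forecast = float(x_new @ beta)

# Same forecast written as a weighted sum of the T training returns.
# The weights depend only on the predictors, not on the returns.
weights = X @ np.linalg.solve(X.T @ X + lam * np.eye(K), x_new)
forecast_from_weights = float(weights @ y)

print(forecast, forecast_from_weights)
print("weights on the T training returns:", np.round(weights, 3))
```

In general these weights need not be positive or sum to one. The debate is about their shape in KMZ's setting: when predictors are persistent, recent observations resemble the current one and receive the largest weights.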

It’s a Noisy World

Cartea et al. (2025) study the behavior of complex models when features are observed with noise. Their results show that, when features are noisy, increasing model complexity can exacerbate the impact of the noise on performance, limiting the benefits of more complex models.

Theoretical Misgivings

Perhaps the harshest critic of the paper so far has been Berk (2023). The paper’s full abstract simply states:

In contrast to what is claimed in Kelly, Malamud and Zhou (2023), the implication of the theoretical analysis in that article is so narrow that it is virtually useless to financial economists.

Berk’s main arguments are:

Assumptions: The statistical theory in KMZ was derived in the statistics literature (see footnote 3), and the application of this theory to financial economics therefore hinges on the assumptions of KMZ about the return process. Berk argues that the return process assumed in KMZ, which has expected excess return equal to 0, is inconsistent with economic equilibrium: the only assets that have excess return equal to 0 are those uncorrelated with the stochastic discount factor. Because the strategy implied by KMZ’s complex model has positive excess return, it would imply excess demand and the market couldn’t clear.
Excess returns: Berk argues that, because the asset used in the empirical tests violates the assumption in KMZ’s theory (the excess return on the CRSP value-weighted index is not zero), the theory in the paper isn’t useful for explaining the empirical findings. Berk also mentions that KMZ’s tests are not truly out-of-sample, since the predictors were identified using the entire sample (although I guess you could use this criticism against almost all papers in this literature).
Bad performance (again): Berk also notes that the performance of KMZ’s complex models is actually not that good. In KMZ Table 1, when the sample size is closer to what one would minimally use in practice (120 months), the Sharpe ratio of the linear model with shrinkage is actually higher than that of the complex model.
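
Berk’s first point rests on the standard SDF pricing relation. As a reminder, with m denoting the SDF, R_i a gross return, and R_f the gross risk-free return (this is the textbook identity, not a result specific to Berk or KMZ):

```latex
\begin{aligned}
1 &= E[m R_i] = E[m]\,E[R_i] + \operatorname{Cov}(m, R_i), \qquad R_f = \frac{1}{E[m]}, \\
E[R_i] - R_f &= -R_f \operatorname{Cov}(m, R_i).
\end{aligned}
```

An asset has a zero expected excess return exactly when its return is uncorrelated with the SDF, which is why a return process with zero risk premium is such a restrictive assumption for the aggregate stock market.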

The Riposte

Kelly and Malamud (2025, KM) reply to their critics in a long article. KM provide both simulation and additional empirical results to answer their critics, including results using deep neural networks. They also emphasize that the VoC is about reducing specification error: more complex models nest simpler models. And they highlight that the main contribution of KMZ is theoretical, and the critics of KMZ are overly fixated on narrow aspects of their empirical application:

Small sample sizes
Use of RFFs as the basis for complex models
Adjustments made on data (e.g. standardization)
Sharpe ratios of resulting strategies

In their reply to Nagel, KM argue that:

Nagel is reverse-engineering momentum from the RFF forecasts, creating in effect an “approximator” of KMZ’s forecasts.
The MA(2) reversal exercise of Nagel is invalid, because the features are kept the same. Since many of the features are persistent, it’s not surprising that the KMZ approach fails to learn anything and produces persistent forecasts.

In their reply to Cartea et al, KM argue that it’s not surprising that performance deteriorates when noise is added to the data. KM also argue that Cartea et al’s results, that the benefit of complexity reverses when predictors are sufficiently noisy, are contradicted by the “real world data” in KMZ.10

In their reply to Buncic, KM state that:

KMZ never claim that a complex model is better than simple models, or that a simple model can’t outperform a complex model.
Buncic’s approach isn’t valid, because it’s not an apples-to-apples comparison. By this it seems that KM mean that researchers shouldn’t compare their complex RFF approach with no intercept with a simple linear model with shrinkage and an intercept, as in Buncic.11
The aggregation used in KMZ is in fact the correct way to “build an empirical counterpart to the expected Sharpe ratio” in KMZ’s theory. However, as noted by Buncic, since the data are kept fixed, this expectation is over the random RFF weights, so the aggregation in KMZ can’t be an estimate of the true expected Sharpe ratio in an out-of-sample sense.
Buncic mistakenly takes the empirical results of KMZ out of context and uses them as a license to search for higher Sharpe ratios. Specifically, KM say that Buncic misinterpreted what the benchmark for economic performance is in KMZ by focusing too much on Sharpe ratios. KM argue that what matters is alpha relative to buy-and-hold, i.e. the market timing portfolios in KMZ should be combined with a buy-and-hold strategy. In an update to his paper, Buncic analyses these combined strategies. His results suggest that lower complexity RFF models do essentially as well as higher complexity ones. In addition, simple linear models (or indeed the historical average of returns) also seem to perform as well as the complex ones. Finally, Buncic shows that linear models with regularization achieve the highest cumulative wealth among all the models, while RFF models achieve a cumulative wealth that is very similar to that of the historical average (graph below).

Final Thoughts

KMZ’s results initially seem striking: performance (both statistical and economic) improves as model complexity increases. Yet the underlying statistical ideas—double descent and benign overfitting—have been studied extensively in the statistics and machine learning literature. The research on double descent, however, suggests that benign overfitting is not a given: whether performance improves in the overparameterized regime relative to the underparameterized one can depend on the structure of the data.

Under KMZ’s assumptions, performance is guaranteed to increase with model complexity. To the extent that they want this to be informative in financial economics, researchers need to examine both the assumptions underlying KMZ’s theory and their empirical implementation, which is what the follow-up papers I’ve discussed in this post have done. In other words, if the goal is to draw conclusions about real-world return predictability, the proposed complex approach should be compared with the simpler forecasting methods commonly used in the literature.

KM’s reply to the critics creates some tensions. They emphasize that KMZ’s contribution is primarily theoretical, but the paper clearly invites the interpretation that complexity improves forecast performance empirically. KMZ’s theoretical results are derived under a specific DGP with strong assumptions about the predictors, in an asymptotic regime where both the sample size and the number of features grow large. But their proof-of-concept relies on an empirical design that does not correspond directly either to their theoretical setup (as argued by Berk) or to the conventional frameworks used in most studies of return predictability. KM argue that critics place too much weight on the details of the empirical implementation:

These challenges miss the point that KMZ’s empirical analysis is a deliberately simple proof-of-concept for their main contribution, which is theoretical. KMZ do not advocate RFF as their preferred machine learning model for return prediction. Their claim is not that machine learning works best with small samples. Their point is plainly that unusually large asset pricing models are surprisingly successful out-of-sample. KMZ’s experiments are designed to convey this point clearly and concisely.

By emphasizing that their empirical exercise is “only” a proof-of-concept, while also arguing that comparisons with conventional forecasting approaches are not appropriate,12 the reply may leave the impression that much of the criticism of KMZ is simply being dismissed as “missing the point”.

The Proof of the Pudding

Where does all that leave us? Machine learning techniques are increasingly used in finance, and it is not controversial that they are useful. Because financial data are far messier than assumed in most theoretical models (KMZ’s included), empirical adaptations are inevitable. In my view, high-dimensional machine learning methods have clearly improved our understanding of cross-sectional asset pricing (both theoretically and empirically). But in my experience, complex nonlinear machine learning models do not seem to add much value relative to properly regularized linear models in the case of aggregate stock market prediction (although I haven’t personally tested KMZ’s RFF approach).

I would argue that KMZ’s approach can and should be put to the test relative to other existing models, as done by Buncic and others. If the conclusion is that simpler models do as well as complex ones (as seems to be the case), this would not invalidate KMZ’s theoretical contribution, but it would suggest that their proof-of-concept may not add much to our empirical understanding of return predictability or the practical relevance of their theory.

In that sense, the debate around KMZ may be less about whether their theory is correct, and more about whether their empirical illustration meaningfully informs the problem of return prediction.

The intercept coefficient β₀ is often excluded from the regularization problem. The reason for this is that the intercept simply captures the mean of the response variable when the values of all features are equal to 0. This point will be important to understand some of the criticism about the paper.

These figures are from Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849-15854. See also Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3), 107-115.

For linear models, see Hastie, T., Montanari, A., Rosset, S., & Tibshirani, R. J. (2022). Surprises in high-dimensional ridgeless least squares interpolation. Annals of statistics, 50(2), 949.

This may seem counterintuitive, since λ controls the penalty term. However, in the overparameterized (interpolation) regime, there are infinitely many solutions that fit the data exactly. Ridge regression selects among these by shrinking coefficients toward zero. When (X’X) is singular; ridge yields (X’X+λI)^-1X’y. As λ→0⁺, the ridge solution converges to the minimum-norm interpolator equivalent to the Moore–Penrose solution.

These features are from Welch, I., & Goyal, A. (2008). A comprehensive look at the empirical performance of equity premium prediction. The Review of Financial Studies, 21(4), 1455-1508. Data can be downloaded from Amit Goyal’s website.

Replication code for KMZ is available on the Journal of Finance website.

A widely used approach in the return predictability literature consists in truncating forecasts at zero. The justification is that the expected excess return on the market, in principle, shouldn’t be negative. See Campbell, J. Y., & Thompson, S. B. (2008). Predicting excess stock returns out of sample: Can anything beat the historical average?. The Review of Financial Studies, 21(4), 1509-1531.

In KMZ, market timing positions correspond directly to the excess return forecasts due to the volatility standardization.

This comes from the fact that regression fitted values are obtained by multiplying the projection (or hat) matrix by the vector of observed responses, so the fitted value for a given observation is a weighted combination of all the responses.
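To make the hat-matrix point concrete, here is a small sketch for plain OLS on simulated data (the data and variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 3
# Design matrix with an intercept column plus two random features.
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
y = rng.standard_normal(n)

# OLS coefficients and the hat matrix H = X (X'X)^{-1} X'.
beta = np.linalg.solve(X.T @ X, X.T @ y)
H = X @ np.linalg.solve(X.T @ X, X.T)

# Fitted values are H @ y: row i of H holds the weights that
# observation i's fitted value places on every response.
fitted = H @ y
weights_for_obs_0 = H[0]
print(np.allclose(fitted, X @ beta))
```

The same logic carries over to ridge, where the matrix becomes X(X’X+λI)^-1X’.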

I would add, however, that based on the papers surveyed in this post, the real-world performance seems to be debatable.

This comment is odd. It’s common practice in the literature to compare the out-of-sample forecasts of different approaches. Of course, the appropriate statistical tests depend on whether the models being compared are nested or non-nested.

Including standard machine learning approaches (like shrinkage) that have been used in the literature on return predictability for over a decade.

Pockets of Replicability (Post #1)

Systematically Biased — Thu, 26 Feb 2026 11:07:11 GMT

When I worked as a quant trader/researcher at a hedge fund, whenever one of us came up with a backtested strategy, someone else on the team would replicate it from scratch, without looking at the original code. The main objectives were to make sure that the results were (1) correct and (2) not dependent on specific implementation choices made by the original researcher. I can’t overstate how easily small mistakes and seemingly innocuous modeling choices conspire to ruin your day. Usually, backtests that seem too good to be true are a good indication that something’s not quite right, but less impactful errors can easily go unnoticed until later. Of course, reality doesn’t care about your backtest; mistakes are revealed through negative P&L (quite rarely, a mistake generates positive results, but Murphy’s law is usually binding).

In academia, much has been said about the lack of replicability of scientific studies. Finance is no exception: Hou, Xue and Zhang (2020) suggest that, after microcap stocks are excluded, 65% of so-called asset pricing anomalies fail to show statistical significance, although some argue that the replication crisis in finance is not real. Top finance journals now require authors to share “replication packages” with code and data, which has made it easier than ever for others to check the results of published papers.

In this series of posts, I go over some recent examples of papers (in areas of research that interest me) whose results have been challenged. The title of this series is tongue-in-cheek, and the objective is not to single out any specific paper or author, but rather to show how the recent focus on replicability in finance research actually works and reveals issues in published studies. In other words, replication challenges are not failures of the scientific process: they are part of it.

Pockets of Predictability (Farmer, Schmidt, Timmermann, 2023 JF)

The finance profession has come up with many variables that supposedly predict aggregate stock returns. Anyone who has ever worked with these predictors knows that:

most predictors have limited to no predictability out-of-sample
the level of predictability is low: we’re talking about R-squared values of less than 2% when forecasting monthly returns
even for predictors that do seem to work, coefficient estimates and performance are unstable over time

The core idea of Pockets of Predictability is elegant. The paper starts with a generalization of the commonly used predictive regression model to allow for time-varying coefficients:

Instead of fully specifying the dynamics of the time-varying coefficient, the authors propose to estimate it using a nonparametric kernel method, which doubles as a way to construct “local” estimates by giving more weight to nearby points. To avoid look-ahead bias, they use a one-sided Epanechnikov kernel (that’s a mouthful) that is designed to only look at past data:

The one-sided nature is critical: it ensures real-time implementability and avoids look-ahead bias.
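A minimal sketch of what a one-sided kernel regression can look like in code. This is my own illustration, not the authors' implementation: I assume a single predictor, a kernel of the form 1.5(1 − u²) on −1 < u < 0, a bandwidth measured in periods, and the hypothetical function names `one_sided_epanechnikov` and `local_forecast`.

```python
import numpy as np


def one_sided_epanechnikov(u):
    """Kernel weight for scaled time distance u; nonzero only for -1 < u < 0 (the past)."""
    u = np.asarray(u, dtype=float)
    return np.where((u > -1.0) & (u < 0.0), 1.5 * (1.0 - u**2), 0.0)


def local_forecast(x, r_next, t, bandwidth):
    """Forecast the return realized at t+1 from the predictor x[t].

    x[s] is the predictor observed at time s and r_next[s] is the return
    realized at s+1. Only pairs with s < t get positive weight, so the
    estimate uses nothing that is unknown at time t.
    """
    s = np.arange(len(x))
    w = one_sided_epanechnikov((s - t) / bandwidth)
    past = w > 0  # drop zero-weight observations (including all s >= t)
    Z = np.column_stack([np.ones(past.sum()), x[past]])
    ZtW = Z.T * w[past]  # weight each observation
    beta = np.linalg.solve(ZtW @ Z, ZtW @ r_next[past])  # weighted least squares
    return beta[0] + beta[1] * x[t], beta


# Simulated example: the slope is 0 in the first half and 0.5 in the second.
rng = np.random.default_rng(42)
n = 300
x = rng.standard_normal(n)
true_beta = np.where(np.arange(n) < 150, 0.0, 0.5)
r_next = true_beta * x + 0.1 * rng.standard_normal(n)

forecast_early, beta_early = local_forecast(x, r_next, t=140, bandwidth=50)
forecast_late, beta_late = local_forecast(x, r_next, t=290, bandwidth=50)
print("local slope at t=140:", beta_early[1])
print("local slope at t=290:", beta_late[1])
```

Because the weights vanish for every s ≥ t, changing any future observation leaves the estimate at t untouched, which is exactly the real-time property the one-sided kernel is meant to deliver.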

Once they estimate the model, they compare their forecasts to a simple average to calculate a squared error difference (SED):

When SED > 0, the kernel forecast is more accurate than the benchmark forecast. Now comes the second step of their approach: run another nonparametric regression of the form

and use the sign of the forecasted SED to define a “pocket of predictability”. Using this approach, they go on to show that several predictors used in the literature exhibit periods of stronger predictability, interspersed with periods of no predictability, and how this can be leveraged to forecast stock returns.
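The SED and the pocket definition can be sketched as follows. This is an illustration under my own assumptions: I take the “simple average” benchmark to be the expanding mean of past returns, I treat the forecasted SED as a given input rather than reproducing the second-stage regression, and the function names are hypothetical.

```python
import numpy as np


def prevailing_mean(returns):
    """Benchmark forecast for period t: the mean of returns observed before t."""
    returns = np.asarray(returns, dtype=float)
    expanding = np.cumsum(returns) / np.arange(1, len(returns) + 1)
    out = np.full(len(returns), np.nan)  # no history for the first period
    out[1:] = expanding[:-1]
    return out


def squared_error_difference(realized, model_forecast, benchmark_forecast):
    """SED > 0 means the model forecast had the smaller squared error."""
    realized = np.asarray(realized, dtype=float)
    model_error = realized - np.asarray(model_forecast, dtype=float)
    benchmark_error = realized - np.asarray(benchmark_forecast, dtype=float)
    return benchmark_error**2 - model_error**2


def find_pockets(sed_forecast):
    """Return (start, end) index pairs for contiguous runs with forecasted SED > 0."""
    in_pocket = np.asarray(sed_forecast, dtype=float) > 0
    pockets, start = [], None
    for t, flag in enumerate(in_pocket):
        if flag and start is None:
            start = t  # a pocket opens
        elif not flag and start is not None:
            pockets.append((start, t - 1))  # the pocket closed last period
            start = None
    if start is not None:  # pocket still open at the end of the sample
        pockets.append((start, len(in_pocket) - 1))
    return pockets


# Toy numbers, made up for illustration only.
realized = np.array([0.02, -0.01])
model = np.array([0.01, 0.03])
benchmark = np.array([0.00, 0.00])
sed = squared_error_difference(realized, model, benchmark)
pockets = find_pockets([-1.0, 0.5, 0.2, -0.1, 0.3])
print(sed, pockets)
```

In the toy example the model beats the benchmark in the first period (positive SED) and loses in the second, and the made-up SED forecast series contains two pockets.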