Systematically Biased: Portfolio Lab

Backtesting Mean-Variance Alternatives

Systematically Biased — Thu, 28 May 2026 17:22:17 GMT

In a previous post, I discussed several alternatives to mean-variance optimization (MVO), which are summarized below:

In this post, I’m going to put them all to the (back)test. To keep things simple but also practical, I’ll use a common universe of ETFs representing different asset classes and regions, and I’ll make some choices about how to group them into buckets to avoid undue concentrations.

Asset Universe and Data

I start with the following universe of ETFs. Although there are arguably some better (i.e. cheaper) ETFs for at least some of the asset classes, I chose these to maximize the length of the backtest period.

The ETFs belong to different asset classes/regions. To allow me to control overall exposures, I assign them to an upper-level “allocation bucket”. I would note that both my definitions of asset class and allocation bucket involve a degree of subjectivity. For example, I could have put REITs, Commodities, and Gold into one big bucket and called it “Alternatives”, but I preferred to split real estate from commodities and gold.

The ETFs all have different inception dates. To extend the backtest as much as possible, I first download total return series for the ETFs as well as their corresponding indices from Bloomberg. Then, for each ETF, if the ETF data starts after the index data, I backfill the series using the index returns. This gives me synthetic series starting in 2000 for almost all ETFs, the exceptions being VNQI (start date: 02/01/2001) and BNDX (start date: 04/01/2013).
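A minimal sketch of the backfilling step, assuming the ETF and index returns are already aligned on the same dates and that missing ETF history is marked with NaN. The helper name `backfill_returns` and the numbers are hypothetical; the actual Bloomberg series are not reproduced here.

```python
import numpy as np

def backfill_returns(etf_returns, index_returns):
    """Use index returns wherever the ETF has no data (NaN), ETF returns otherwise."""
    etf = np.asarray(etf_returns, dtype=float)
    idx = np.asarray(index_returns, dtype=float)
    return np.where(np.isnan(etf), idx, etf)

# Hypothetical monthly returns: the ETF launches in the third period.
etf = np.array([np.nan, np.nan, 0.010, -0.020, 0.015])
index = np.array([0.020, -0.010, 0.011, -0.019, 0.016])

synthetic = backfill_returns(etf, index)
```

Note that this simple version fills any missing ETF observation, not only the pre-inception history, so gaps inside the live period would need separate handling.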

After completing this step, I end up with the following series:

Setting Up the Backtest

I use 3 years of daily data to estimate inputs for all methods, so the backtests start in 2003. On each rebalance date, I require an ETF to have available data over the last 3 years. Portfolios are rebalanced at the end of each month, and held for one month.

To avoid extreme allocations to individual assets/buckets, I constrain the optimizer for most optimizations to avoid solutions that drift too far from a diversified multi-asset allocation. At the allocation bucket level, I use the following constraints:

At the individual ETF level, I use the following constraints:

In my view, these constraints embed a relatively flexible asset allocation policy, while also preventing overly concentrated portfolios.

I used simple historical estimates for expected returns and the Ledoit-Wolf shrinkage estimator for the covariance matrix.1 All optimizations were done using the skfolio Python package. I clean up the weights to remove very small positions, and use some reasonable failsafes in case an optimization fails.
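The weight clean-up can be sketched as follows. This is only an illustration under assumptions: the threshold of 1e-4 and the helper name `clean_weights` are hypothetical, and the sketch assumes long-only weights.

```python
import numpy as np

def clean_weights(weights, threshold=1e-4):
    """Zero out positions smaller than `threshold` and rescale to sum to one."""
    w = np.asarray(weights, dtype=float).copy()
    w[np.abs(w) < threshold] = 0.0
    total = w.sum()
    if total <= 0:
        raise ValueError("all weights were removed; lower the threshold")
    return w / total

# Hypothetical optimizer output with one negligible position.
raw = np.array([0.59995, 0.40000, 0.00005])
cleaned = clean_weights(raw)
```

Rescaling after removing small positions can push a weight marginally past a binding constraint, so a production version would probably re-check the bounds.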

The Models

I consider the following allocation models. With the exception of the risk budget portfolio, all optimizations are done using the constraints discussed previously. Note also that both the Global 60/40 Benchmark and the 1/N portfolio are feasible under this set of constraints.

1. Global 60/40 Benchmark:

A simple strategic allocation: 40% U.S. equities, 20% international equities, and 40% bonds, with a global tilt through EFA, EEM, and BNDX. The allocations within equity (SPY, EFA, EEM) and fixed income (AGG, BNDX) represent loosely the breakdown of the market cap of the markets represented by the ETFs.2

2. 1/N Portfolio

This is an equally weighted portfolio of all assets available in each month. Note that applying 1/N across these ETFs/asset classes is not an innocuous or view-free allocation decision. With all nine ETFs available, it implies the following allocations: 33.33% to equity, 22.22% to bonds, 22.22% to REITs, and 22.22% to Commodities/Gold (11.11% each).

3. Minimum Variance Portfolio (MVP)

This portfolio chooses weights to minimize overall portfolio volatility within the constraints.

4. Mean-Variance Optimization with Target Volatility (MVO σ=10%)

This portfolio chooses the highest expected return it can find while targeting a fixed volatility level of 10%.

5. Mean-Semivariance Portfolio (Mean-SV)

This portfolio maximizes expected return subject to a target semivariance equal to the realized semivariance of the Global 60/40 portfolio. The semivariance is calculated relative to a target return of zero.

6. Mean-CVaR Portfolio (Mean-CVaR)

This portfolio optimizes with respect to tail risk rather than volatility. CVaR measures the average loss in the worst part of the return distribution, so this approach is designed to be more sensitive to extreme downside events. I use a confidence level β=0.95 and a target CVaR equal to the CVaR of the Global 60/40 portfolio.

7. Maximum Diversification Portfolio (MDP)

This portfolio maximizes the diversification ratio, favoring assets that contribute distinct risk exposures rather than moving closely together.

8. Risk Budget Portfolio (RB)

This portfolio allocates portfolio risk, rather than capital, across assets. Note that the risk budget approach is sensitive to the choice of the assets. For this reason, I decided to implement risk budgets across the allocation buckets:

Equities: 40% (split equally between SPY, EFA, and EEM)
Fixed Income: 40% (split equally between AGG and BNDX)
Real Estate: 10% (split equally between IYR and VNQI)
Commodities/Gold: 10% (split equally between DBC and GLD)

This results in ETF-level risk budgets that range from 5% to 20%. Note that a risk parity solution at the ETF level would allocate 11.1% (1/9) to each ETF, which would result in a different risk budget at the bucket level.3 It should also be noted that for the RB portfolio, the asset-level and bucket constraints on portfolio weights do not apply.
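The arithmetic behind these budgets is short enough to write out. The sketch below uses the bucket budgets and tickers listed above; the variable names are my own.

```python
# Bucket-level risk budgets and the ETFs that share each budget equally.
bucket_budgets = {
    "Equities": (0.40, ["SPY", "EFA", "EEM"]),
    "Fixed Income": (0.40, ["AGG", "BNDX"]),
    "Real Estate": (0.10, ["IYR", "VNQI"]),
    "Commodities/Gold": (0.10, ["DBC", "GLD"]),
}

# ETF-level budgets: each bucket budget divided equally across its members.
etf_budgets = {
    ticker: budget / len(tickers)
    for budget, tickers in bucket_budgets.values()
    for ticker in tickers
}

# For comparison: equal budgets per ETF imply these bucket-level budgets.
n_assets = len(etf_budgets)
parity_bucket_budgets = {
    bucket: len(tickers) / n_assets
    for bucket, (_, tickers) in bucket_budgets.items()
}
```

The ETF-level budgets sum to one, with the smallest going to the real estate and commodities/gold ETFs and the largest to the two bond ETFs.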

Results

The backtests cover the period from February 2003 to April 2026. The table below shows summary statistics of all the models.

Before looking at the numbers, it’s important to note a few things:

Since the strategies have different levels of risk, comparing most statistics directly is not appropriate. In particular, comparing returns and maximum drawdowns would be misleading.
The alternative allocation models I considered have different objectives. Some of the metrics are directly related to certain objectives; others are not. For example, none of the methods directly relate to maximum drawdown. While we can certainly analyze the realized maximum drawdowns, we can’t really draw any conclusions about the methods in general based on this.
In the case of the risk-based models (MVP, MDP, and RB) in particular, it’s important in my view to look at statistics related to what these methods try to achieve. For example, within the methods that implement the constraints, MVP achieves the lowest volatility, which suggests the objective of this model is attained.4 Likewise, when looking at MDP, we should probably look at other metrics that directly relate to the diversification ratio. For RB, the primary objective is to attain the desired risk budget, which the method does.

In order to make the strategies easier to compare, the table below shows the same statistics, but with the volatility of all strategies scaled to 10%. The best performing strategies in terms of risk-adjusted ratio metrics (Sharpe and Sortino ratio) are the mean-risk optimizations, which all produce very similar results.
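One simple way to do this kind of scaling is sketched below. It assumes ex post scaling over the full sample, monthly returns annualized with a square-root-of-12 rule, and no financing cost for the implied leverage; the simulated return series is hypothetical.

```python
import numpy as np

def scale_to_target_vol(returns, target_vol=0.10, periods_per_year=12):
    """Rescale a return series so its annualized sample volatility equals target_vol."""
    r = np.asarray(returns, dtype=float)
    realized_vol = r.std(ddof=1) * np.sqrt(periods_per_year)
    return r * (target_vol / realized_vol)

# Hypothetical monthly strategy returns.
rng = np.random.default_rng(0)
monthly = rng.normal(0.005, 0.03, size=240)

scaled = scale_to_target_vol(monthly)
scaled_vol = scaled.std(ddof=1) * np.sqrt(12)
```

Because the whole series is multiplied by a constant, ratios of mean to volatility are unchanged, while returns and drawdowns become comparable across strategies.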

The animation below shows the equity curves for all vol-scaled strategies.

The graphs below show the allocations on each rebalance date. A few points to note:

MVP allocates significantly to bonds, as expected. Over the second half of the sample, the bucket level constraint is binding most of the time.
The mean-risk (MVO, Mean-SV, Mean-CVaR) allocations follow almost identical patterns, with more volatility in allocations over time compared with MDP and RB.
MVP and MDP do not allocate to REITs at all over the second half of the sample. This is explained by the fact that the correlation between REITs and equity is significantly higher over the second period. As a result, there is little diversification to be gained by adding REITs to the portfolio.
The allocations of RB are very bond-heavy (as expected) and generally stable over time.

The table below shows the average monthly turnover of the models, computed from drifted weights and assuming a conservative one-way transaction cost of 25bps.5 Mean-risk approaches have higher turnover compared to other strategies, but the transaction costs remain very manageable.
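A hedged sketch of how turnover from drifted weights can be computed. The convention here is an assumption: one-way turnover is half the sum of absolute weight changes, and the 25bps charge is applied to every unit traded, buys and sells alike. The weights and returns are hypothetical.

```python
import numpy as np

def drifted_weights(weights, asset_returns):
    """Weights at the end of the holding period, before rebalancing."""
    grown = np.asarray(weights, dtype=float) * (1.0 + np.asarray(asset_returns, dtype=float))
    return grown / grown.sum()

def rebalance_cost(target, drifted, cost_per_unit=0.0025):
    """Return (one-way turnover, transaction cost) for moving from drifted to target weights."""
    traded = np.abs(np.asarray(target) - np.asarray(drifted)).sum()
    return traded / 2.0, traded * cost_per_unit

# Hypothetical two-asset example: the first asset gains 10% over the month.
prev = np.array([0.6, 0.4])
rets = np.array([0.10, 0.00])

drift = drifted_weights(prev, rets)
turnover, cost = rebalance_cost(prev, drift)
```

Using drifted rather than previous target weights matters: a strategy with constant target weights still trades every month to undo the drift.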

Final Thoughts

The objective of this post was to show a practical implementation of alternative methods to mean-variance optimization (MVO) for a diversified asset universe covering major asset classes. In this backtest, alternative mean-risk portfolio construction approaches, such as mean-semivariance and mean-CVaR, produced almost identical results to MVO. After scaling the strategies to the same volatility, the mean-risk optimization models also delivered better risk-adjusted performance than simpler allocations, including a global 60/40 benchmark and the 1/N allocation, as well as other risk-based approaches, such as risk budget and maximum diversification.

Once embedded inside a sensible asset-allocation policy, mean-risk optimizations, including MVO, worked well.

I started this post with no prior expectation for how MVO would perform relative to the other methods. In fact, I have no horse in this race, and I believe we should follow the empirical evidence. Despite being often criticized, MVO performed well in this application, even though I relied on simple historical estimates of expected returns and imposed only relatively simple constraints. Of course, this does not mean that MVO is appropriate in every situation, or that more sophisticated methods cannot deliver better solutions, particularly if investors have specific preferences about risk and return.

One important caveat is that the mean-risk optimizations use historical expected returns. In a nine-asset universe with strong long-run differences across realized asset-class returns, this can make a lot of difference. The result should therefore not be interpreted as evidence that historical mean returns are generally reliable forecasts. Rather, it shows that in this particular universe, sample period, and constrained implementation, mean-risk optimization was able to exploit return differences without producing pathological portfolios.

With 3 years of daily data per asset and only 9 assets, the sample covariance matrix would have worked just as well.

In the periods prior to BNDX availability, the strategy allocates 40% to AGG.

The risk budget in this case would be as below. I don’t like this allocation as it gives over 40% of the total risk budget to Real Estate, Commodities, and Gold.

Equities: 33.3%
Fixed Income: 22.2%
Real Estate: 22.2%
Commodities/Gold: 22.2%

The fact that RB produces lower volatility than MVP is explained by the fact that the RB portfolio is not subject to the same constraints. Since fixed income has lower risk, RB allocates more to this bucket to balance the risk contributions.

I estimate transaction costs as follows (example for MVO σ=10%):

Beyond Mean-Variance Optimization

Systematically Biased — Mon, 25 May 2026 17:26:03 GMT


In a previous post, I discussed mean-variance optimization (MVO), and some of the many things that can go wrong when MVO is implemented naively. A big issue with MVO or indeed any portfolio optimization method is estimation error. The animation below shows how unstable the MVO solution can be when we resample the data.

One of the tricks commonly used in practice to stabilize MVO solutions is the use of constraints, which is related to regularization or shrinkage. Jagannathan and Ma (2003) showed that adding long-only constraints to the optimization problem is equivalent to using a modified covariance matrix, with the effect of shrinking the largest elements of the covariance matrix. If large covariances are due to estimation error, this helps because the shrunk covariance matrix becomes more precise.

There are other ways to achieve this shrinkage, notably by using a more structured estimator for the covariance matrix (e.g. an index model), or by explicitly shrinking the sample covariance estimator towards such a structured target. In cases where enough data are available, or if the optimization problem is already constrained, as in many practical cases, the choice of covariance estimator is much less critical.1

This post is going to look at some alternatives to MVO. The papers mentioned in the post are shown in the references section at the end, and most have presentation decks in the Systematically Biased Library.

But before we dive into the alternatives, let’s take two steps back to consider:

The measure of risk used in MVO.
Some misconceptions about MVO that regularly pop up.

Risk ≠ Variance

Textbook MVO equates risk with variance. One formulation of MVO is in terms of maximizing the mean-variance criterion:

where γ is a risk aversion parameter. This formulation says that investor utility increases with the portfolio expected return and decreases with its variance.2
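One common way to write this criterion, assuming portfolio weights w, expected returns µ, and covariance matrix Σ, is shown below. The factor of 1/2 is a convention and an assumption on my part; some authors omit it and absorb it into γ.

```latex
\max_{w} \; w^{\top}\mu \;-\; \frac{\gamma}{2}\, w^{\top}\Sigma\, w
```

A larger γ penalizes variance more heavily and pushes the solution toward lower-risk portfolios.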

However, variance as a measure of risk has shortcomings:

Variance penalizes large positive returns as much as large negative returns.
Variance measures deviations from the mean, which may not be the appropriate level of return targeted by the investor.

Regarding the first point, most investors are not worried about large positive returns. What really matters is downside risk: the occurrence of large, negative returns. Therefore, if returns are not symmetrically distributed, using the variance may not be an appropriate way to measure risk.

The second issue is more subtle. Variance is a statistical measure of dispersion, and the first moment (the expected value) is a natural reference point. However, when it comes to investments or measuring risk, there’s no a priori reason why we should measure deviations relative to the expected return. The investor may have a different benchmark level of return (say, the risk-free rate of return, or a fixed level of return like 5%).

In sum, if returns are not symmetrically distributed about the mean, or if the benchmark level of return is not the mean, variance does not in general coincide with downside risk.

One possibility, in this case, is to use another risk measure. A natural candidate is to replace variance with semivariance, which measures deviations only below a benchmark level of return B, and likewise replace covariances with semicovariances. In fact, Markowitz (1959) had already recognized that semivariance is a more plausible measure of risk than the variance.

Semivariance is not the only option. Another commonly used approach is to use a tail risk measure, such as the conditional Value at Risk (CVaR). We discuss both mean-semivariance and mean-CVaR approaches later on.

Before turning to alternatives, it is worth clearing up two misconceptions that repeatedly appear whenever MVO is discussed.

Myths about MVO

It’s important to remember that the estimation error issues discussed above are not specific to MVO. As a general rule, using noisy inputs (either for µ or Σ) can lead to poor results in any optimization problem, especially if it’s unconstrained. But MVO is often the target of these criticisms, even when the results being discussed have been obtained in a naive setting (µ estimated from historical returns; no or unreasonable constraints).

But there are other persistent myths about MVO, and I recommend the article by Benveniste, Kolm, and Ritter (2024) for a discussion of those. I’ll focus here on two of those myths that are probably the most persistent. The first one is that MVO requires assuming normally distributed asset returns. The second one is that it requires assuming investors have a quadratic utility function. I think one of the reasons that these myths persist is that these assumptions indeed imply mean-variance preferences. However, they are not needed for MVO to be a reasonable approach. Let’s take a quick look at each.

Normally Distributed Returns

I don’t think that discussing the unreasonableness of assuming normally distributed asset returns is necessary. But if asset returns were normally distributed, then the entire return distribution would be characterized by its mean and variance. Any expected-utility comparison among normally distributed portfolios would therefore reduce to a comparison involving those two moments. While assuming normally distributed returns gives you mean-variance preferences, it is not needed. What matters is the maximization of expected utility. Benveniste, Kolm, and Ritter (2024) define a class of mean-variance equivalent distributions, which have the property that the expected utility maximization for any standard utility function coincides with the maximum of an equivalent MVO problem. Many distributions, including heavy tailed distributions, elliptical distributions, and even some skewed distributions, are mean-variance equivalent.

Example of a mean-variance equivalent distribution from Benveniste, Kolm, and Ritter (2024)

Quadratic Utility

Quadratic utility indeed implies mean-variance preferences, but it also has some economically unreasonable properties. The most relevant is that quadratic utility implies increasing absolute risk aversion. This means that investors would invest less in risky assets as their wealth increases, i.e. risky assets are inferior goods.

As with the normality assumption, quadratic utility implies mean-variance preferences, but is not necessary for MVO to be a reasonable solution to a portfolio allocation problem. Indeed, the mean-variance criterion stated above is not a utility function, but rather an approximation of the expected utility. Under a second-order approximation to expected utility, expected return enters positively and variance enters negatively, with the coefficient on variance governed by risk aversion.3

The rest of this post is going to discuss some alternatives to MVO.

Alternative #0: Keep it Simple (SAA)

The simplest alternative, which existed long before Markowitz proposed MVO, is not to optimize at all. Investors can simply define an asset allocation policy that makes sense in the long run and rebalance their portfolios periodically to maintain it. Of course, defining these allocations requires some assumptions about the expected return and risk of the assets, as well as the investor’s risk preferences.

This kind of approach is sometimes referred to as strategic asset allocation (SAA). The widely used 60/40 benchmark (60% in equities, 40% in bonds) is a canonical example. Many investment products implement variations of this, including options that change the allocations to reduce the portfolio risk at a target date (e.g. for retirement).

Many different SAAs have been proposed, using different assets and asset classes. I implemented several of them in my AssetAllocation package for R (as well as several tactical asset allocation strategies).

Alternative #1: Keep it Simple (the 1/N rule)

Another rule, which is a special case of Alternative #0 and has been extensively studied, is the “Talmudic” or “1/N” rule of allocating equally across investments. This simple rule has the benefits of not requiring forecasts, mechanically enforcing diversification, and keeping turnover low.4 However, 1/N is not assumption-free. The main active decision is shifted from the weights to the definition of the opportunity set.

In terms of empirical performance, there is some controversy. A widely cited study by DeMiguel, Garlappi, and Uppal (2009) ran an out-of-sample horse-race between the 1/N portfolio and 14 mean-variance models across seven datasets, concluding that none of the models outperformed 1/N. Their conclusion: any potential gains due to optimal diversification are more than offset by estimation error.

These results, however, have been questioned by some studies on two fronts:

Kritzman, Page, and Turkington (2010) argue that the outperformance of 1/N in DeMiguel, Garlappi, and Uppal (2009) is the result of using short samples. When estimates are constructed using longer samples, or simple but reasonable assumptions, Kritzman et al. find that optimized portfolios outperform 1/N out of sample.
Allen, Lizieri, and Satchell (2019) make the point that, if investors have even modest forecasting ability, they can benefit substantially from MVO, which they substantiate analytically, via simulation, and empirically through out-of-sample comparisons.

The 1/N rule allocates equally across investments. It requires no forecasts and mechanically forces diversification. Whether 1/N outperforms portfolios obtained using optimization models is debatable.

An example of an investment product based on the 1/N rule is the RSP ETF, which has about $87 billion in assets.

Alternative #2: Minimum Variance Portfolios

The minimum variance portfolio (MVP) is the only portfolio on the mean-variance efficient frontier that doesn’t require estimation of expected returns. It is simply the solution of the problem below: 5

Because expected returns are harder to estimate and forecast, they are a major source of estimation error in MVO or indeed of any other optimization approach that requires them as inputs. In addition, variances and covariances are more persistent, and therefore more predictable. Therefore, although in principle other portfolios on the efficient frontier may be preferable, the MVP is likely to be estimated with more precision than other efficient portfolios.
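For intuition, the fully invested minimum variance problem has a closed-form solution when there are no other constraints: the weights are proportional to Σ⁻¹1. The sketch below uses that formula with a hypothetical covariance matrix. It deliberately ignores long-only and concentration constraints, so it is not the constrained problem used in practice.

```python
import numpy as np

def min_variance_weights(cov):
    """Unconstrained (budget-only) minimum variance weights: Sigma^{-1} 1, rescaled to sum to one."""
    cov = np.asarray(cov, dtype=float)
    ones = np.ones(cov.shape[0])
    x = np.linalg.solve(cov, ones)
    return x / x.sum()

# Hypothetical annualized volatilities and correlations for three assets.
vols = np.array([0.16, 0.05, 0.20])
corr = np.array([[1.0, 0.1, 0.6],
                 [0.1, 1.0, 0.0],
                 [0.6, 0.0, 1.0]])
cov = np.outer(vols, vols) * corr

w_mvp = min_variance_weights(cov)
mvp_var = w_mvp @ cov @ w_mvp

w_equal = np.full(3, 1.0 / 3.0)
equal_var = w_equal @ cov @ w_equal
```

Only the covariance matrix enters the calculation, which is the point made above: no expected return estimates are needed. Without a long-only constraint the solution can contain short positions.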

The MVP is the only portfolio on the efficient frontier that doesn’t require expected return estimates/forecasts

Finance theory makes the prediction that the market-cap weighted portfolio of securities is the optimal efficient portfolio in equilibrium, and should, in principle, outperform the MVP.

Although neat, this result is based on a list of strong assumptions, all of which are violated in practice. Therefore, even a comprehensive market-cap weighted portfolio of all stocks in the market is bound to be inefficient. This point was made by Haugen and Baker (1991), who compared such a portfolio (the Wilshire 5000) to an MVP constructed from the largest 1000 stocks in the US, with some concentration and sector constraints. The MVP achieved similar returns to the market-cap weighted index, but with lower risk. Clarke, de Silva, and Thorley (2006) confirm this result with a long backtest, which shows that the MVP has about three-quarters of the realized risk of a cap-weighted portfolio, but earns higher average returns. Clarke, De Silva, and Thorley (2011) further study the composition of the MVP. They show that a long-only MVP typically invests in a small number of securities, tilted towards low betas.

The surprisingly good performance of the MVP relative to market-cap weighted portfolios is therefore related to the well-known critique of the CAPM (i.e., portfolios sorted on beta have negligible differences in return). More generally, this is related to the low beta and low volatility anomalies.

It should be noted that the MVP avoids expected return forecasts, but it still depends on a risk model (volatilities, correlations) and other choices like constraints and turnover assumptions.

The MVP is the only portfolio on the mean-variance efficient frontier that doesn’t require expected returns as inputs. In many long-run empirical studies, constrained minimum-variance portfolios have delivered lower volatility and competitive, sometimes higher, realized returns than capitalization-weighted indices.

An example of an investment product based on the MVP is the USMV ETF (about $24 billion in assets).

Alternative #3: Mean-Risk Optimization

As discussed above, the variance has some shortcomings as a risk measure. There are several alternative risk measures that can be used to replace the variance. The two most commonly used in practice are the semivariance and the conditional Value at Risk (CVaR).

Mean-Semivariance Optimization

The semivariance is similar to the variance, but considers only returns below a benchmark level of return B:

Intuitively, mean-semivariance optimization tries to find portfolios with attractive expected returns while penalizing only the observations in which the portfolio falls below the benchmark. This makes the relevant downside observations endogenous: changing the weights changes which scenarios count as downside scenarios.
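The endogeneity of the downside set is easy to see in a small example. The sketch below assumes the sample semivariance is the average of squared shortfalls below B taken over all observations (some authors divide by the number of downside observations instead). The returns are hypothetical.

```python
import numpy as np

def semivariance(portfolio_returns, benchmark=0.0):
    """Average squared shortfall below the benchmark, averaged over all periods."""
    shortfall = np.minimum(np.asarray(portfolio_returns, dtype=float) - benchmark, 0.0)
    return np.mean(shortfall ** 2)

# Hypothetical returns for two assets over four periods (rows are periods).
asset_returns = np.array([[ 0.04, -0.005],
                          [-0.03,  0.020],
                          [ 0.02, -0.020],
                          [-0.01,  0.010]])

# Two candidate portfolios with opposite tilts.
rp_a = asset_returns @ np.array([0.8, 0.2])
rp_b = asset_returns @ np.array([0.2, 0.8])

# Periods in which each portfolio falls below B = 0.
downside_a = np.flatnonzero(rp_a < 0.0)
downside_b = np.flatnonzero(rp_b < 0.0)

sv_a = semivariance(rp_a)
sv_b = semivariance(rp_b)
```

The two portfolios fall below the benchmark in different periods, so the set of observations that enters the risk measure changes with the weights.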

For this reason, mean-semivariance optimization is not as simple as mean-variance optimization. The issue is that the semivariance of a portfolio cannot be written as a quadratic form, precluding the use of quadratic programming.

This problem can be resolved by noting that, within the regions where assets underperform the benchmark, the semivariance can be written in a quadratic form. This approach, presented by Markowitz et al. (2020), is the typical implementation used in practice. It relies on introducing a matrix of excess returns or deviations relative to the benchmark.

Mean-CVaR Optimization

To talk about conditional Value-at-Risk (CVaR), we first need to define the Value-at-Risk. Loosely speaking, the VaR is a loss that we’re fairly sure won’t be exceeded over some horizon. For example, suppose that the level of confidence is 90%. If the daily VaR of a portfolio with a 90% confidence level is $1 million, we are 90% confident that we won’t lose more than $1 million on any given day. Conversely, we should expect to lose more than $1 million on 10% of days.

VaR has some shortcomings as a way to measure risk. Notably, VaR is not a coherent risk measure. In particular, VaR does not satisfy the sub-additivity property, which requires that the risk of the sum of two portfolios be at most the sum of the risks of the individual portfolios. The consequence is that VaR may discourage diversification.

Another problem with VaR is that it only gives us a level of loss that we should not expect to exceed with some confidence, but it tells us nothing about what level of loss to expect when we do exceed it. The CVaR, or expected shortfall, gives you exactly that.

CVaR is a tail risk measure: it tells us how much we expect to lose, given that the loss exceeds the VaR. The animation below shows examples of VaR and CVaR. Note that this is shown using the distribution of returns (i.e., negative values correspond to losses).
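A simple historical estimate of both measures can be sketched as follows, using the 90% confidence level from the example above. Conventions for discrete samples differ; here the VaR is taken as the largest loss outside the worst (1-β) fraction of observations, and the CVaR as the average of that worst fraction. The return sample is hypothetical.

```python
import numpy as np

def historical_var_cvar(returns, beta=0.90):
    """Historical VaR and CVaR of losses (losses are the negatives of returns)."""
    losses = np.sort(-np.asarray(returns, dtype=float))   # ascending order
    n_tail = int(round((1.0 - beta) * losses.size))       # number of tail observations
    if not 0 < n_tail < losses.size:
        raise ValueError("sample too small for this confidence level")
    var = losses[-n_tail - 1]        # largest loss outside the tail
    cvar = losses[-n_tail:].mean()   # average of the worst n_tail losses
    return var, cvar

# Hypothetical sample of 100 daily returns, evenly spaced from -5.0% to +4.9%.
returns = (np.arange(100) - 50) / 1000.0

var_90, cvar_90 = historical_var_cvar(returns, beta=0.90)
days_beyond_var = int((-returns > var_90).sum())
```

In this sample the loss exceeds the VaR on 10 of the 100 days, and the CVaR is the average loss on those days.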

To define VaR and CVaR mathematically, we need some notation:

w : vector of portfolio weights
r : vector of asset returns
β ∈ (0,1) : confidence level
L(w, r) = -w’r: portfolio loss (negative of portfolio return)
p(r) : probability density function of returns
Ψ_L(y): cumulative distribution function of losses (the probability of not exceeding a threshold loss y).6
α_β(w): the VaR of portfolio w at confidence level β.

Note that we went from working with returns to working with losses. This means the distribution shown above would be flipped, with large positive values corresponding to large losses. With this notation, the VaR at confidence level β is defined as the smallest loss such that the probability of not exceeding it is at least β:

$$\alpha_\beta(w) = \min\{\, y \in \mathbb{R} : \Psi_L(y) \geq \beta \,\}$$

Now that we have defined VaR, we can define the CVaR mathematically as:

The expression above resembles the expected loss. Indeed, CVaR_β is a specialization of the expected value in which we’re averaging only over the worst (1-β) fraction of losses.7
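In the notation introduced above, the definition used by Rockafellar and Uryasev (2000) can be written as follows. The symbol CVaR_β(w) on the left-hand side is my choice of notation.

```latex
\mathrm{CVaR}_{\beta}(w) \;=\; \frac{1}{1-\beta}
\int_{L(w,r)\,\geq\,\alpha_{\beta}(w)} L(w,r)\, p(r)\, dr
```

The integral runs only over outcomes whose loss is at least the VaR, and dividing by (1-β) turns it into a conditional average.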

Portfolio optimization using the CVaR is complicated, because the CVaR depends on an integral over VaR values. However, Rockafellar and Uryasev (2000) proved two key results that make mean-CVaR optimization practical, effectively transforming it into a linear programming problem. Their paper rests on introducing the following function:

where [x]⁺ = max(x, 0). Their first theorem shows that CVaR_β can be obtained as the minimum of this function in α. Their second theorem states that minimizing CVaR_β over all portfolios w is equivalent to minimizing the function above over all values of (w, α).

In practice, the integral can be approximated as a sum, using values that are either simulated from p(r), if it’s available, or (more commonly) taken from a historical sample. Suppose that a sample of returns r₁, r₂, …, r_T is available. Then the function can be approximated by

The optimization problem can then be written as a linear programming problem using auxiliary variables:
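For reference, the three objects discussed in this part can be written out as below, following Rockafellar and Uryasev (2000). The names F_β and its sample version, the auxiliary variables u_t, and the feasible set W (budget, bounds, and any return target) are notation I am assuming.

```latex
% Rockafellar-Uryasev function
F_{\beta}(w,\alpha) \;=\; \alpha + \frac{1}{1-\beta}\int \big[\,L(w,r)-\alpha\,\big]^{+} p(r)\, dr

% Sample approximation with returns r_1, \dots, r_T
\tilde{F}_{\beta}(w,\alpha) \;=\; \alpha + \frac{1}{(1-\beta)\,T}\sum_{t=1}^{T}\big[\,-w^{\top}r_t-\alpha\,\big]^{+}

% Linear program with auxiliary variables u_t
\min_{w,\,\alpha,\,u} \;\; \alpha + \frac{1}{(1-\beta)\,T}\sum_{t=1}^{T}u_t
\quad \text{s.t.} \quad
u_t \geq -w^{\top}r_t-\alpha, \quad u_t \geq 0, \quad t=1,\dots,T, \quad w \in \mathcal{W}
```

Each u_t replaces one positive-part term, which is what makes the objective and constraints linear in (w, α, u).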

Mean-risk approaches replace variance with other risk measures. Two commonly used approaches are the mean-semivariance and mean-CVaR. Both optimization approaches are more complicated than MVO, but can be resolved by augmenting the optimization problem through auxiliary variables.
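The first Rockafellar-Uryasev result can be checked numerically for a fixed portfolio. The sketch below evaluates the sample version of their function at every observed loss, takes the minimum, and compares it with the average of the worst (1-β) fraction of losses. The simulated heavy-tailed losses are hypothetical, and the sample size is chosen so that (1-β)T is an integer.

```python
import numpy as np

def ru_objective(alpha, losses, beta):
    """Sample Rockafellar-Uryasev function for a fixed portfolio's losses."""
    losses = np.asarray(losses, dtype=float)
    excess = np.maximum(losses - alpha, 0.0)
    return alpha + excess.sum() / ((1.0 - beta) * losses.size)

# Hypothetical daily losses with heavy tails (Student-t with 4 degrees of freedom).
rng = np.random.default_rng(42)
losses = -rng.standard_t(df=4, size=1000) * 0.01
beta = 0.95

# The objective is piecewise linear in alpha, so its minimum is attained at a sample point.
values = np.array([ru_objective(a, losses, beta) for a in losses])
cvar_ru = values.min()

# Direct estimate: average of the worst (1 - beta) fraction of losses.
n_tail = int(round((1.0 - beta) * losses.size))
cvar_direct = np.sort(losses)[-n_tail:].mean()
```

The two numbers agree up to floating-point error, which is why minimizing over (w, α) jointly delivers the minimum-CVaR portfolio without ever computing the VaR explicitly.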

Alternative #4: Maximum Diversification

Choueifaty and Coignard (2008) note that one of the main difficulties with MVO is the need to estimate expected returns. They propose a heuristic approach based on maximizing the diversification ratio. Denote portfolio weights by w, the covariance matrix by Σ, and σ_d = diag(Σ)^{1/2} the vector of volatilities for the assets. The diversification ratio is defined as

The numerator is the weighted average of volatilities, which disregards diversification due to asset comovement. The denominator is the portfolio volatility, which accounts for asset comovement. Therefore, the diversification ratio captures the extent to which asset comovements reduce risk. They propose to compute the most diversified portfolio (MDP) by choosing w that maximizes this ratio. In their empirical applications, the MDP has higher Sharpe ratio than market-cap weighted indices.
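Written out, the ratio described above is DR(w) = w'σ_d / sqrt(w'Σw). A small sketch with two hypothetical assets of equal volatility shows the two extremes: uncorrelated assets and perfectly correlated assets.

```python
import numpy as np

def diversification_ratio(weights, cov):
    """Weighted average volatility divided by portfolio volatility."""
    w = np.asarray(weights, dtype=float)
    cov = np.asarray(cov, dtype=float)
    vols = np.sqrt(np.diag(cov))
    return (w @ vols) / np.sqrt(w @ cov @ w)

# Two hypothetical assets, each with 15% volatility.
vols = np.array([0.15, 0.15])
w = np.array([0.5, 0.5])

cov_uncorrelated = np.diag(vols ** 2)
cov_perfect = np.outer(vols, vols)

dr_uncorrelated = diversification_ratio(w, cov_uncorrelated)
dr_perfect = diversification_ratio(w, cov_perfect)
```

With perfect correlation there is nothing to diversify and the ratio equals 1. With zero correlation and equal weights it equals the square root of 2.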

MDP belongs to a class of risk-based portfolio construction approaches, which do not require estimation of expected returns. Other approaches in this category are the MVP and risk parity approaches.

A related paper is Clarke, De Silva, and Thorley (2013), who derive analytical expressions for risk-based portfolios (MVP, MDP, and risk parity) under a single-index model.

The maximum diversification approach selects portfolio weights that maximize the ratio between the weighted average volatility of the individual assets and the volatility of the resulting portfolio.

Alternative #5: Risk Parity

Risk parity is perhaps the most popular risk-based portfolio construction method. It is widely used by institutional investors because it provides a disciplined way to diversify risk. It is particularly popular with CTAs and trend followers, who use it to equalize how much risk each asset contributes to the overall portfolio.

A canonical example to explain risk parity is to look at a 60/40 portfolio of stocks and bonds. Using 10 years of data ending in April 2026, we get the following realized performance:

The 60/40 portfolio has a volatility of 11.3%. However, approximately 93.5% of this total risk comes from the equity allocation. The bond allocation, while representing 40% of the total allocation, accounts for less than 10% of the risk. Risk parity focuses on finding the portfolio allocations that would result in equal risk contributions. In this example, the allocations that equalize risk contributions are 23.6% in SPY and 76.4% in BND. The resulting portfolio has a volatility of 6.39%.
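The mechanics of this example can be reproduced with made-up inputs. The volatilities and correlation below are hypothetical, not the realized SPY/BND figures. With only two assets, equal risk contributions are obtained with inverse-volatility weights, whatever the correlation.

```python
import numpy as np

def risk_contribution_shares(weights, cov):
    """Share of portfolio volatility attributable to each asset."""
    w = np.asarray(weights, dtype=float)
    marginal = cov @ w
    return w * marginal / (w @ marginal)

# Hypothetical annualized volatilities (stocks, bonds) and correlation.
vols = np.array([0.17, 0.05])
rho = 0.2
cov = np.array([[vols[0] ** 2, rho * vols[0] * vols[1]],
                [rho * vols[0] * vols[1], vols[1] ** 2]])

shares_6040 = risk_contribution_shares(np.array([0.6, 0.4]), cov)

# Two-asset risk parity: weights proportional to 1 / volatility.
w_erc = (1.0 / vols) / (1.0 / vols).sum()
shares_erc = risk_contribution_shares(w_erc, cov)
```

Even with these invented inputs, the 60/40 portfolio draws well over 90% of its risk from stocks, while the inverse-volatility weights split the risk evenly.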

The example above uses volatility as the risk measure, but risk parity is more general. Denote by R(w) a risk measure for a portfolio w. The marginal risk contribution of asset i is defined as

and the risk contribution of asset i is defined as

The risk measure must satisfy the Euler allocation principle, which states that risk can be decomposed as follows:

In words: the total risk is the sum of the risk contributions, where each contribution is the allocation multiplied by the derivative of the risk measure with respect to that allocation.

In the case of the volatility, we have

and the vector of marginal risk contributions is

Notice that the denominator is the same for all assets, such that the marginal risk contribution of asset i is:

where (Σw)_i denotes the i-th element of Σw. We can verify that this satisfies the allocation property:

Suppose that we have a risk budget b=(b₁, …, b_n) that defines how much risk each asset should contribute to the total risk. The risk parity case corresponds to equal risk contributions, i.e. b₁= ⋯ = b_n. The risk budget portfolio is the solution of the following system:

$$\left\{\begin{array}{rl} w_i \dfrac{\partial R(w)}{\partial w_i} &= b_i R(w), \quad i=1,\dots,n \\ w_i &> 0 \\ \sum_i{w_i} &= 1 \end{array}\right.$$

Writing the Lagrangian for this problem and solving the first-order conditions, it can be shown that it is equivalent to solving the following problem:

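$$\left\{\begin{array}{l} y^\star = \arg\min_y R(y) \\ \text{s.t. } \sum_{i=1}^{n} b_i \ln y_i \geq c \\ \phantom{\text{s.t. }} y_i > 0,\ i=1,\dots, n \end{array}\right.$$

where c is an arbitrary constant and the portfolio weights are recovered by rescaling the solution to sum to one, w_i = y_i^⋆ / Σ_j y_j^⋆.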
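One way to solve a risk budgeting problem of this kind numerically, when the risk measure is volatility, is sketched below. It is an assumption on my part rather than the method used for the results in this post: the code runs cyclical coordinate descent on the log-barrier objective ½y'Σy − Σ b_i ln(y_i), a variance-based variant whose first-order conditions give y_i(Σy)_i = b_i, and then rescales y to sum to one. The inputs and function names are hypothetical.

```python
import numpy as np

def risk_budget_weights(cov, budgets, tol=1e-12, max_iter=10_000):
    """Long-only risk budgeting weights via cyclical coordinate descent."""
    cov = np.asarray(cov, dtype=float)
    b = np.asarray(budgets, dtype=float)
    y = 1.0 / np.sqrt(np.diag(cov))  # any strictly positive start works
    for _ in range(max_iter):
        y_prev = y.copy()
        for i in range(len(b)):
            # Cross term: sum over j != i of cov_ij * y_j
            c = cov[i] @ y - cov[i, i] * y[i]
            # Positive root of cov_ii * y_i^2 + c * y_i - b_i = 0
            y[i] = (-c + np.sqrt(c * c + 4.0 * cov[i, i] * b[i])) / (2.0 * cov[i, i])
        if np.max(np.abs(y - y_prev)) < tol:
            break
    return y / y.sum()

def risk_shares(cov, w):
    """Share of portfolio volatility attributable to each asset."""
    sigma_w = cov @ w
    return w * sigma_w / (w @ sigma_w)

# Hypothetical inputs: three assets, made-up volatilities and correlations.
vols = np.array([0.16, 0.05, 0.20])
corr = np.array([[1.0, 0.1, 0.3],
                 [0.1, 1.0, 0.0],
                 [0.3, 0.0, 1.0]])
cov = np.outer(vols, vols) * corr

budgets = np.array([0.50, 0.25, 0.25])
w = risk_budget_weights(cov, budgets)
print("weights:", w.round(4))
print("risk shares:", risk_shares(cov, w).round(4))
```

Setting all budgets equal gives the equal risk contribution (risk parity) portfolio. Each coordinate update has a closed form, so no general-purpose optimizer is needed.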

One important element to consider when using the risk budget approach, especially with equal risk budgets, is that the choice of the asset universe can have a significant impact on the resulting portfolio. For instance, suppose we added another equity ETF to the universe in the toy example above. Then a risk parity solution would end up allocating 2/3 of the risk budget to equities and 1/3 to bonds. One possibility in these cases is to assign equal risk budgets to the asset classes, and then divide each asset-class budget equally across the instruments within that class.
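A two-level budget of this kind is easy to build. The sketch below is only illustrative: the function name and the second equity ticker ("EQ2") are made up, and the resulting budgets would then be passed to whatever risk budgeting solver is in use.

```python
from collections import Counter

def nested_risk_budgets(asset_classes):
    """Equal budget per asset class, split equally across the class members."""
    counts = Counter(asset_classes.values())
    n_classes = len(counts)
    return {asset: 1.0 / (n_classes * counts[cls])
            for asset, cls in asset_classes.items()}

# Hypothetical universe: the toy example plus a second equity ETF.
universe = {"SPY": "Equity", "EQ2": "Equity", "BND": "Bonds"}
budgets = nested_risk_budgets(universe)
print(budgets)
```

With these budgets, equities as a whole are assigned half of the risk, rather than the two thirds implied by equal budgets per instrument.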

A related approach is hierarchical risk parity. Rather than solving directly for equal risk contributions across all assets, hierarchical risk parity first uses the correlation structure to cluster similar assets, and then allocates risk through the resulting hierarchy. This can make the allocation less sensitive to small changes in the covariance matrix and can avoid some of the arbitrary effects of treating all assets in the universe as exchangeable.

There’s an enormous literature on risk parity and its performance. Some interesting earlier papers are Chaves, Hsu, Li, and Shakernia (2011), Asness, Frazzini, and Pedersen (2012), and Clarke, De Silva, and Thorley (2013). Hierarchical risk parity is introduced in Lopez de Prado (2016).

Note that risk parity is usually solved as a long-only problem with w_i > 0. In general, there’s no unique solution if weights are allowed to be negative. However, if we know which assets we want to short, a solution can be found by modifying the problem above. This approach is applied by Rubesam (2022) in the context of three systematic trading strategies (trend following, pairs trading, and factor investing).

Risk parity is a risk-based approach that constructs a portfolio so that assets or asset classes contribute equally, or according to predefined budgets, to total portfolio risk. Instead of allocating capital equally, it allocates risk equally.

Summary

Mean-variance optimization is often criticized for the sensitivity of its optimal solutions to changes in the inputs, for its exposure to estimation error, and for the shortcomings of variance as a risk measure. At the same time, there are some persistent misconceptions about MVO, such as MVO requiring an assumption of normality or a quadratic utility function, which deserve to be put to rest.

This post reviews some alternatives to MVO, which range from doing away with forecasts altogether (1/N), to getting rid only of expected return forecasts (MVP, MDP, risk parity), to using different risk measures (mean-semivariance, mean-CVaR). The table below summarizes these alternatives and their tradeoffs.

In an upcoming post, I’ll discuss a practical implementation of these alternatives with ETFs.

I should also note that this is not an exhaustive list of alternatives to MVO. Another family of approaches, robust optimization, keeps the optimization framework but explicitly accounts for uncertainty in the inputs. Views-based approaches, such as Black-Litterman and Entropy Pooling, also remain close to the optimization tradition, but they change the way investor views, priors, or scenarios enter the problem. I will discuss these approaches in future posts.

References

Allen, D., Lizieri, C., & Satchell, S. (2019). In defense of portfolio optimization: What if we can forecast?. Financial Analysts Journal, 75(3), 20-38.

Asness, C. S., Frazzini, A., & Pedersen, L. H. (2012). Leverage aversion and risk parity. Financial Analysts Journal, 68(1), 47-59.

Benveniste, J., Kolm, P. N., & Ritter, G. (2024). Untangling universality and dispelling myths in mean-variance optimization. The Journal of Portfolio Management, 50(8), 90-116.

Chaves, D., Hsu, J., Li, F., & Shakernia, O. (2011). Risk parity portfolio vs. other asset allocation heuristic portfolios. Journal of Investing, 20(1), 108.

Choueifaty, Y., & Coignard, Y. (2008). Toward Maximum Diversification. The Journal of Portfolio Management, 35(1), 40-51.

Clarke, R., De Silva, H., & Thorley, S. (2011). Minimum-variance portfolio composition. Journal of Portfolio Management, 37(2), 31.

Clarke, R., De Silva, H., & Thorley, S. (2013). Risk parity, maximum diversification, and minimum variance: An analytic perspective. The Journal of Portfolio Management, 39(3), 39-53.

DeMiguel, V., Garlappi, L., & Uppal, R. (2009). Optimal versus naive diversification: How inefficient is the 1/N portfolio strategy?. The Review of Financial Studies, 22(5), 1915-1953.

Clarke, R., De Silva, H., & Thorley, S. (2006). Minimum-variance portfolios in the US equity market. Journal of Portfolio Management, 33(1), 1-14.

Dom, M. S., Howard, C., Jansen, M., & Lohre, H. (2025). Beyond GMV: the relevance of covariance matrix estimation for risk-based portfolio construction. Quantitative Finance, 25(3), 403-419.

Haugen, R. A., & Baker, N. L. (1991). The efficient market inefficiency of capitalization-weighted stock portfolios. Journal of Portfolio Management, 17(3), 35.

Jagannathan, R., & Ma, T. (2003). Risk reduction in large portfolios: A role for portfolio weight constraints. Journal of Finance, 58, 1651-1684.

Kritzman, M., Page, S., & Turkington, D. (2010). In defense of optimization: the fallacy of 1/N. Financial Analysts Journal, 66(2), 31-39.

Ledoit, O., & Wolf, M. (2003). Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. Journal of Empirical Finance, 10(5), 603-621.

Lopez de Prado, M. (2016). Building diversified portfolios that outperform out-of-sample. Journal of Portfolio Management.

Markowitz, H. (1959). Portfolio selection: Efficient diversification of investments.

Markowitz, H. M., Starer, D., Fram, H., & Gerber, S. (2020). Avoiding the downside: A practical review of the critical line algorithm for mean–semivariance portfolio optimization. Handbook of Applied Investment Research, 369-415.

Pflug, G. C., Pichler, A., & Wozabal, D. (2012). The 1/N investment strategy is optimal under high model ambiguity. Journal of Banking & Finance, 36(2), 410-417.

Rockafellar, R. T., & Uryasev, S. (2000). Optimization of conditional value-at-risk. Journal of Risk, 2, 21-42.

Rubesam, A. (2022). The Long and the Short of Risk Parity. Journal of Portfolio Management, 48(4).

This approach was pioneered in a series of papers by Ledoit and Wolf, and their shrinkage estimators are widely available in portfolio optimization packages. These estimators can be helpful when the number of assets is large relative to the number of data points available. Dom et al. (2025) study the impact of the covariance matrix estimator on minimum variance portfolios and find that sophisticated models do not add much value compared to the sample covariance matrix when long-only and turnover constraints are used.

Solving this problem for different levels of γ traces out the same efficient frontier as the more common approaches of minimizing the portfolio variance for different levels of expected return, or maximizing expected return for different levels of variance/volatility.

If we also assume decreasing absolute risk aversion, we can make statements about investors’ preferences regarding skewness (investors prefer positive to negative skew). If we further assume decreasing absolute prudence, investors dislike kurtosis.

Interestingly, this rule can even be optimal under extreme model uncertainty, as shown by Pflug et al. (2012).

Alternatively, it is the solution of mean-variance utility maximization stated previously with a coefficient of risk aversion γ→ ∞.

Note that we omit the dependence on the portfolio w.

That is,

Portfolio Optimization: What Could Go Wrong?

Systematically Biased — Thu, 30 Apr 2026 15:05:52 GMT

Mean-variance optimization is one of those ideas that is both foundational and strangely easy to caricature. In theory, it gives us the cleanest possible answer to a portfolio choice problem. In practice, small changes in the inputs can lead to large changes in the portfolio. This post is about the second part. I’ll show some concrete examples of how easily things can go wrong when portfolio optimization is implemented naïvely, and how some of these issues can be attenuated in practice. I’ll focus only on mean-variance optimization (MVO), the most vanilla of all portfolio optimization approaches.

Python code for all the examples shown in this article is provided at the end.

The Mean-Variance Optimization Problem

Given n assets, we are trying to build an efficient portfolio in a mean-variance sense. For a target level of expected return µ, we want the portfolio with the lowest risk possible.¹

Any finance textbook gives us the following recipe:

Find the minimum variance portfolio (MVP) and compute its expected return. The MVP is the solution of the problem above without the target expected return constraint.
Create a grid from the expected return of the MVP to a maximum expected return.
Solve the problem above for each target expected return, store portfolio weights, expected returns, and standard deviations.
Plot the resulting set of portfolios in standard deviation-expected return space.

The result is the so-called efficient frontier, i.e., the set of portfolios with the lowest risk for each level of expected return (or alternatively, with highest expected return for each level of risk). This is the starting point for many classic results in finance which I’m not going to discuss here.
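The recipe above can be sketched in a few lines. This is a minimal version under stated assumptions: the only constraints are full investment and the target expected return (so short positions are allowed), the maximum expected return on the grid is taken to be the highest single-asset expected return, and the inputs are made up. The function names are hypothetical. With equality constraints only, each frontier point is the solution of a linear (KKT) system, so no optimizer is needed.

```python
import numpy as np

def min_variance_weights(mu, cov, target=None):
    """Minimize w' cov w s.t. sum(w) = 1 and, if target is given, w @ mu = target."""
    n = len(mu)
    rows = [np.ones(n)] if target is None else [np.ones(n), mu]
    rhs = [1.0] if target is None else [1.0, target]
    A = np.vstack(rows)
    k = A.shape[0]
    # KKT system: 2*cov*w + A'lambda = 0 and A*w = rhs
    kkt = np.block([[2.0 * cov, A.T], [A, np.zeros((k, k))]])
    sol = np.linalg.solve(kkt, np.concatenate([np.zeros(n), rhs]))
    return sol[:n]

def efficient_frontier(mu, cov, n_points=20):
    w_mvp = min_variance_weights(mu, cov)                    # step 1: MVP
    targets = np.linspace(w_mvp @ mu, mu.max(), n_points)    # step 2: grid
    weights = np.array([min_variance_weights(mu, cov, t)     # step 3: solve
                        for t in targets])
    vols = np.sqrt(np.einsum("ij,jk,ik->i", weights, cov, weights))
    return targets, vols, weights

# Hypothetical inputs: three assets.
mu = np.array([0.04, 0.06, 0.08])
asset_vols = np.array([0.05, 0.12, 0.18])
corr = np.array([[1.0, 0.2, 0.1],
                 [0.2, 1.0, 0.5],
                 [0.1, 0.5, 1.0]])
cov = np.outer(asset_vols, asset_vols) * corr

targets, vols, weights = efficient_frontier(mu, cov)
for t, v in zip(targets[::5], vols[::5]):                    # step 4: inspect or plot
    print(f"expected return {t:.3%}  volatility {v:.3%}")
```

Adding a long-only constraint turns each point into a quadratic program, which is where a numerical solver becomes necessary.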

Problems with MVO

Already we can identify some potential issues:

We don’t really know the expected returns of the assets or their covariance matrix;
We’re assuming that the investor’s preference is fully captured by expected returns and the covariance matrix of returns;
Everything is static (this is a single period model);
The framework doesn’t take into account practical considerations, such as transaction costs.

I will come back to some of these issues in a future article. For now, let’s focus on the first one. Since we do not know the expected returns and covariances, the best we can do is to estimate them, which brings estimation error into the problem. The standard Markowitz machinery is not designed to take that into account. A common critique is that MVO is an “error maximizer”, but this is a critique of the estimation process, not of the method.

This distinction matters. Optimization does not create information. It operates on whatever information is in the inputs. If the inputs contain signal, optimization can turn small edges into meaningful portfolio gains. If the inputs contain mostly noise, optimization can amplify that noise into concentrated bets.

Among the inputs to be estimated, there is a pecking order in terms of the impact on the resulting portfolios:

Expected returns → Variances → Covariances

The intuition is simple: expected returns enter the optimizer as the reward for taking risk, so small differences in estimated means can dominate the allocation. Variances determine the scale of risk for each asset, while covariances determine diversification benefits across pairs.

The estimation of expected returns is the most critical. Not only are expected returns difficult to estimate precisely, but small differences in expected return estimates can significantly affect the resulting portfolios. In terms of the covariance matrix, errors in the variances are approximately twice as important as errors in the covariances.² There are also known issues in estimating covariances, especially in high dimensions (hundreds or thousands of assets), but there are many methods that can attenuate these issues.

Many empirical studies that find lackluster performance for MVO portfolios estimate inputs directly from historical returns. Interestingly, this is clearly at odds with what Markowitz himself suggested. His 1952 paper opens with the following statement:

The process of selecting a portfolio may be divided into two stages. The first stage starts with observation and experience and ends with beliefs about the future performances of available securities. The second stage starts with the relevant beliefs about future performances and ends with the choice of portfolio. This paper is concerned with the second stage…
To use the E-V rule in the selection of securities, we must have procedures for finding reasonable μ_i and σ_ij. These procedures, I believe, should combine statistical techniques and the judgment of practical men.

So dismissing MVO because naïve implementations based on historical return estimates perform poorly feels a bit like throwing the baby out with the bathwater. That is, the fact that estimates based only on past returns produce poor results should not automatically disqualify the optimization process.

One interesting aspect regarding the estimation of expected returns is that even a small forecasting edge can significantly improve results. Expected return estimates based only on realized returns are mostly noise, but firm characteristics and other variables can be used to construct forecasting models with low, but nonzero, predictive power. This is a point I will come back to in a future article.

Practical Examples

The rest of this post is concerned with simple illustrations of the very real issues that arise when MVO is applied using naïve estimates from historical returns. These examples are pedagogical but typical of what happens in practice if we implement MVO “out of the box”. Examples 1 through 4 illustrate common problems that arise in naïve MVO implementations. Examples 5 and 6 show simple ways to attenuate some of them.

Example 1: Input sensitivity

A common issue with MVO is the sensitivity of the results to changes in the inputs. Consider three assets with the characteristics below. Assume all correlations are equal to 0.8, and the objective of the manager is to maximize the expected return for a 15% volatility target.³

If we solve the problem, we get the following portfolio weights:

Now suppose we increase all pairwise correlations from 0.8 to 0.9. In this case, we get the following solution:

An apparently small change in the correlation leads to large differences in portfolio weights.
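The effect is easy to reproduce with made-up inputs. The sketch below is not the exact problem solved above: it uses hypothetical expected returns and volatilities, and it assumes no full-investment or long-only constraint, in which case the maximum-return portfolio at a target volatility has the closed form w = σ_T Σ⁻¹μ / sqrt(μ'Σ⁻¹μ). The function names are made up for illustration.

```python
import numpy as np

def equicorrelation_cov(vols, rho):
    """Covariance matrix with the same correlation rho for every pair."""
    n = len(vols)
    corr = np.full((n, n), rho)
    np.fill_diagonal(corr, 1.0)
    return np.outer(vols, vols) * corr

def max_return_at_target_vol(mu, cov, target_vol):
    """Maximize w @ mu subject to sqrt(w' cov w) = target_vol (no other constraints)."""
    direction = np.linalg.solve(cov, mu)
    return target_vol * direction / np.sqrt(mu @ direction)

# Hypothetical inputs, not the ones in the table above.
mu = np.array([0.06, 0.07, 0.08])
vols = np.array([0.15, 0.18, 0.20])

cov_80 = equicorrelation_cov(vols, 0.8)
cov_90 = equicorrelation_cov(vols, 0.9)
w_80 = max_return_at_target_vol(mu, cov_80, 0.15)
w_90 = max_return_at_target_vol(mu, cov_90, 0.15)
print("weights at rho = 0.8:", w_80.round(3))
print("weights at rho = 0.9:", w_90.round(3))
print("largest change in any weight:", round(float(np.abs(w_90 - w_80).max()), 3))
```

Both portfolios hit the same 15% volatility, but the allocation to the middle asset changes by several percentage points when only the correlation moves.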

Example 2: Estimation Error

Example 1 illustrated the fact that MVO weights are very sensitive to the inputs. We also know that financial returns are very noisy, and the inputs are estimated under significant estimation error, especially expected returns. But that doesn’t mean that we can take covariance matrix estimates for granted. Even with a small number of assets, sample variability can have an important impact.

In this example, we consider an extremely simple application of MVO. There are only two assets:

SPY (U.S. equities)
TLT (Long-Term U.S. Treasuries)

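And we are estimating the minimum variance portfolio (MVP) of the two assets on a given date. In the two-asset case, the MVP is calculated in closed form as

$$w_{SPY} = \frac{\sigma_{TLT}^2 - \sigma_{SPY,TLT}}{\sigma_{SPY}^2 + \sigma_{TLT}^2 - 2\sigma_{SPY,TLT}}, \qquad w_{TLT} = 1 - w_{SPY}$$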

Therefore, we only need to estimate 3 parameters:

The variance of SPY
The variance of TLT
The covariance between SPY and TLT

Suppose we decide to estimate these parameters using 5 years of daily returns for the two ETFs. This means we have approximately 1,260 daily return observations for each asset, which is a relatively generous sample for estimating only three parameters.

Using 5 years of daily data ending in March 2026, we get the following estimates:

SPY volatility: 17%
TLT volatility: 16%
Correlation (SPY,TLT): 0.0712

The corresponding MVP allocates 46% to SPY and 54% to TLT, resulting in a volatility of 12%. The reduction in risk is due to the low correlation.
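These figures can be checked directly from the closed-form two-asset solution. The sketch below plugs in the rounded estimates quoted above, so it reproduces the reported allocation only approximately (roughly 47% in SPY and a volatility of about 12%). The function name is made up for illustration.

```python
import math

def two_asset_mvp(vol_a, vol_b, corr):
    """Closed-form minimum variance portfolio for two assets."""
    cov_ab = corr * vol_a * vol_b
    var_a, var_b = vol_a ** 2, vol_b ** 2
    w_a = (var_b - cov_ab) / (var_a + var_b - 2.0 * cov_ab)
    w_b = 1.0 - w_a
    port_var = w_a ** 2 * var_a + w_b ** 2 * var_b + 2.0 * w_a * w_b * cov_ab
    return w_a, w_b, math.sqrt(port_var)

# Rounded estimates from the text: SPY vol 17%, TLT vol 16%, correlation 0.0712.
w_spy, w_tlt, mvp_vol = two_asset_mvp(0.17, 0.16, 0.0712)
print(f"SPY {w_spy:.1%}, TLT {w_tlt:.1%}, volatility {mvp_vol:.1%}")
```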

In this simple example with only 3 parameters to estimate, how much sampling variability exists? That is, if we had used a slightly different sample, we would have obtained different estimates. To quantify the variability in these estimates, we use a technique known as bootstrapping. The idea is to construct new samples by randomly drawing, with replacement, from the original sample. The graphs below were generated with 1,000 bootstrap samples.

As we can see, even in this simple case in which data is relatively abundant, there’s a substantial amount of uncertainty. A 95% confidence interval for the MVP weight on SPY is [41.20%, 51.20%], while the corresponding interval for MVP volatility is [11.23%, 12.76%].
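A bootstrap of this kind can be sketched as follows. Since the ETF data is not reproduced here, the code uses simulated normal returns as a stand-in, with volatilities and correlation loosely matching the estimates above, so the interval it prints will not match the one in the text. It also resamples individual days independently, which ignores any serial dependence in returns. The function names are hypothetical.

```python
import numpy as np

def mvp_weight_first_asset(returns):
    """Two-asset MVP weight on the first column, from the sample covariance."""
    cov = np.cov(returns, rowvar=False)
    return (cov[1, 1] - cov[0, 1]) / (cov[0, 0] + cov[1, 1] - 2.0 * cov[0, 1])

def bootstrap_mvp_weight(returns, n_boot=1000, seed=0):
    """Re-estimate the MVP weight on n_boot samples drawn with replacement."""
    rng = np.random.default_rng(seed)
    n_obs = returns.shape[0]
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_obs, size=n_obs)  # rows drawn with replacement
        draws[b] = mvp_weight_first_asset(returns[idx])
    return draws

# Simulated stand-in for 5 years of daily SPY/TLT returns (assumption).
rng = np.random.default_rng(42)
daily_vols = np.array([0.17, 0.16]) / np.sqrt(252)
corr = np.array([[1.0, 0.07], [0.07, 1.0]])
cov = np.outer(daily_vols, daily_vols) * corr
returns = rng.multivariate_normal(np.zeros(2), cov, size=1260)

draws = bootstrap_mvp_weight(returns)
lo, hi = np.percentile(draws, [2.5, 97.5])
print(f"95% interval for the MVP weight on the first asset: [{lo:.1%}, {hi:.1%}]")
```

A block bootstrap would be a natural refinement if one wanted to preserve volatility clustering in the resampled data.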

Example 3: Corner Solutions/Extreme Concentration

Another common issue in naïvely implemented MVO portfolios, especially when weights are left unconstrained, is the emergence of corner solutions. The optimizer finds a solution that invests almost all capital in just a small subset of the assets, leading to extreme portfolio concentration.

Let's consider the problem of optimizing a portfolio using the following ETFs representing different asset classes:

U.S. Stocks (SPY)
International Stocks (EFA)
Emerging Stocks (EEM)
Long-Term U.S. Treasuries (TLT)
Intermediate-Term U.S. Treasuries (IEF)
U.S. Corporate bonds (LQD)
Real Estate (VNQ)
Commodities (DBC)
Gold (GLD)

The objective is to find the portfolio with the highest expected return subject to a target volatility of 10%.

Below are solutions obtained from a naïve implementation of MVO on a specific date (December 2019). Expected returns and the covariance matrix estimates are obtained using sample moments from the previous 3 years of daily data. The “Unconstrained” solution allows short positions, while the “Long-Only” requires non-negative weights. In both cases, the full investment constraint (sum of weights = 1) is used.

The unconstrained solution invests in all ETFs, but the level of leverage is unreasonable (gross leverage of over 600%). The optimizer takes large opposite positions in IEF and LQD. The two ETFs are highly correlated (ρ=0.94), but have a difference in expected returns, so the optimizer aggressively buys LQD (expected return 6.7%) and shorts IEF (expected return 4%). While the positions are extreme, this is not unexpected. In the absence of other constraints, the optimizer is trying to take advantage of the fact that the two assets seem like close substitutes but have a return differential.

Estimated correlation matrix as of December 2019

The long-only solution, on the other hand, is extremely concentrated. It invests only in SPY (78.7%) and GLD (21.3%).

What about other efficient portfolios? The graph below shows the entire efficient frontier under the long-only constraint (w_i≥ 0). As we can see below, the entire frontier lacks diversification, investing on average in only 3 assets at any given level of volatility.

Example 4: Instability of Optimal Weights

Using the same setup as in Example 3, let’s explore how the optimal weights for the target volatility of 10% in the long-only case evolve if we rebalance the portfolio at each month end. As can be seen in the graph below, the optimal allocations can vary dramatically over time. In Example 3, the optimal allocations as of December 2019 were concentrated in SPY (~80%) and GLD (~20%). But we can see that, prior to that date, the GLD allocation was replaced by either TLT or LQD.

Optimal allocations can vary dramatically over time

The COVID shock shifts allocations dramatically due to changes in the estimated parameters. As shown below, the optimal portfolio moves to an almost 20% allocation to TLT in January 2020, and then abruptly allocates almost 100% to TLT in February 2020.

Once again, this is not unexpected. We relied on sample moments for the optimization. When the sample moments abruptly changed, the optimization results changed accordingly. This is not necessarily bad: it would make sense to reduce exposure to risky assets during this period. But without forward-looking estimates, the best the optimizer can do is react when the sample moments change. Using shorter samples will lead to faster reactions, with higher portfolio turnover.

Another interesting question is the extent to which empirical regularities can make their way into the optimization process. For example, using past returns to construct estimates of expected returns can introduce a momentum effect in the portfolio construction process. This illustrates an important nuance: using historical returns as expected returns is not always an accidental mistake. Sometimes it is an intentional signal design choice. The problem is that once expected returns are estimated from recent returns, the optimizer needs additional discipline, usually in the form of constraints.

JPMorgan designed an index based exactly on this idea. The JP Morgan Efficiente 5 index uses a set of 12 ETFs and constructs an optimal portfolio using mean-variance optimization with a lookback window of 6 months to estimate expected returns. The idea behind the short lookback period is to try to capture time series momentum in the ETFs. To get around the lack of diversification we have seen in this example, they impose maximum weights per ETF and per asset class. The inclusion of these additional constraints is the simplest approach to attempt to discipline MVO results, as I’ll illustrate in the next example.

Example 5: Adding Constraints

One of the most commonly used approaches to deal with the concentration issues in MVO is to introduce constraints on the portfolio weights. We explore the impact of simple constraints for the long-only optimal portfolio in Examples 3 and 4. We consider the following portfolios:

Long-only: the same portfolio as in Example 3 (non-negative weights with the full investment constraint).
Max. asset weight: long-only portfolio with an additional constraint that the weight on any asset is capped at 20%.
Max. asset class weight: long-only portfolio with the constraint that the weight on any asset is capped at 20% and additional caps per asset class.

For the last one, we divide the ETFs into 3 asset classes for simplicity as follows:

Equity: SPY, EFA, EEM. The upper bound for the combined weights is 50%.
Fixed Income: TLT, IEF, LQD. The upper bound for the combined weights is 50%.
Alternatives: VNQ, GLD, DBC. The upper bound for the combined weights is 30%.
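The caps above are linear inequalities in the weights, which is how they would be handed to an optimizer. As a small illustration, the sketch below only checks a candidate portfolio against them rather than solving the optimization. The function name and the second set of weights are hypothetical, while the first set is the long-only solution from Example 3.

```python
ASSET_CAP = 0.20
CLASS_CAPS = {"Equity": 0.50, "Fixed Income": 0.50, "Alternatives": 0.30}
ASSET_CLASS = {
    "SPY": "Equity", "EFA": "Equity", "EEM": "Equity",
    "TLT": "Fixed Income", "IEF": "Fixed Income", "LQD": "Fixed Income",
    "VNQ": "Alternatives", "GLD": "Alternatives", "DBC": "Alternatives",
}

def constraint_violations(weights, tol=1e-9):
    """List every constraint that a {ticker: weight} portfolio breaks."""
    violations = []
    if abs(sum(weights.values()) - 1.0) > tol:
        violations.append("weights do not sum to 1")
    for asset, w in weights.items():
        if w < -tol:
            violations.append(f"{asset}: negative weight")
        if w > ASSET_CAP + tol:
            violations.append(f"{asset}: above the {ASSET_CAP:.0%} asset cap")
    for cls, cap in CLASS_CAPS.items():
        total = sum(w for a, w in weights.items() if ASSET_CLASS[a] == cls)
        if total > cap + tol:
            violations.append(f"{cls}: above the {cap:.0%} class cap")
    return violations

# Long-only solution from Example 3.
concentrated = {"SPY": 0.787, "GLD": 0.213}
# Hypothetical diversified portfolio, for comparison.
diversified = {"SPY": 0.20, "EFA": 0.15, "EEM": 0.10,
               "TLT": 0.10, "IEF": 0.10, "LQD": 0.10,
               "VNQ": 0.05, "DBC": 0.05, "GLD": 0.15}

print(constraint_violations(concentrated))
print(constraint_violations(diversified))
```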

For each case, we obtain the optimal portfolio with a target volatility of 10% at the end of December 2019. The solutions in each case are shown below. The first column shows the same overly concentrated portfolio from Example 3, which invests in only two out of the nine assets. The introduction of the maximum weight constraint at the asset level (column “Max. asset weight”) forces the optimizer to diversify. The constraint is binding for SPY and GLD, but the portfolio now invests in six assets. Finally, the introduction of the additional asset class constraints (column “Max. asset class weight”) improves diversification a bit more, although two assets (TLT and VNQ) are still left out.

As we can see, introducing constraints into the portfolio optimization process has the practical effect of improving diversification. There are also theoretical reasons for why adding constraints can be beneficial, as they induce a shrinkage effect on the covariance matrix, which attenuates estimation error.

Example 6: A Simple Resampling Approach

In Example 2, I used bootstrapping to assess the variability in MVP weights. An interesting approach that can be used to mitigate several of the issues with MVO illustrated above is to apply resampling to the portfolio optimization process. A typical process works as follows:

Estimate expected returns and covariance matrix from the original sample.
Generate many simulated or bootstrap samples.
For each sample, estimate a new µ and Σ.
Compute an efficient frontier for each resample.
Average the portfolio weights across resamples at corresponding points on the frontier.
Evaluate the averaged portfolios using the original estimates.
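The steps above can be sketched as follows. This is a simplified version under assumptions that differ from the results shown in this example: the frontier is computed with the full-investment constraint only (short positions allowed), so that each point has a closed-form solution; "corresponding points" are matched by their rank on a return grid starting at the MVP return; and the data is simulated. The function names and grid choice are hypothetical.

```python
import numpy as np

def frontier_weights(mu, cov, n_points):
    """Frontier weights with only the full-investment constraint (shorts allowed)."""
    n = len(mu)
    A = np.vstack([np.ones(n), mu])
    kkt = np.block([[2.0 * cov, A.T], [A, np.zeros((2, 2))]])
    w_mvp = np.linalg.solve(cov, np.ones(n))
    w_mvp /= w_mvp.sum()
    # Return targets from the MVP return upward (the grid width is an arbitrary choice).
    targets = w_mvp @ mu + np.linspace(0.0, mu.max() - mu.min(), n_points)
    rhs = np.zeros((n + 2, n_points))
    rhs[n] = 1.0          # weights sum to one
    rhs[n + 1] = targets  # expected return equals the target
    return np.linalg.solve(kkt, rhs)[:n].T  # shape (n_points, n)

def resampled_frontier(returns, n_points=10, n_boot=500, seed=0):
    rng = np.random.default_rng(seed)
    n_obs, n_assets = returns.shape
    mu, cov = returns.mean(axis=0), np.cov(returns, rowvar=False)   # step 1
    total = np.zeros((n_points, n_assets))
    for _ in range(n_boot):
        sample = returns[rng.integers(0, n_obs, size=n_obs)]        # step 2
        mu_b = sample.mean(axis=0)                                  # step 3
        cov_b = np.cov(sample, rowvar=False)
        total += frontier_weights(mu_b, cov_b, n_points)            # step 4
    avg_w = total / n_boot                                          # step 5
    exp_ret = avg_w @ mu                                            # step 6
    vol = np.sqrt(np.einsum("ij,jk,ik->i", avg_w, cov, avg_w))
    return avg_w, exp_ret, vol

# Simulated daily returns for three hypothetical assets.
rng = np.random.default_rng(1)
returns = rng.normal(loc=[0.0003, 0.0002, 0.0004],
                     scale=[0.010, 0.005, 0.012], size=(750, 3))
avg_w, exp_ret, vol = resampled_frontier(returns, n_boot=200)
print(avg_w.round(3))
```

Because the full-investment constraint is linear, the averaged weights still sum to one at every point on the resampled frontier.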

I apply this approach to the multi-asset-class portfolio of the 9 ETFs as of December 2019 (for comparison with the results in Examples 3-5). The resampling step uses 500 bootstrap samples.

The resampled frontier is shorter and plots below the full sample one (top chart). The bottom chart shows that the resampling approach improves diversification significantly. In particular, all assets are now part of the frontier, and the concentrations are much less extreme. Note that the resampled frontier below does not include any constraints, either at the individual asset or the asset class level. Nevertheless, the allocations are much less extreme compared to a single optimization.

The resampled frontier is much more diversified because it averages the optimal weights obtained from many slightly different versions of the data. In the original sample, mean-variance optimization tends to put large weights on the assets that look best in-sample, especially those with high estimated returns, low estimated risk, or favorable estimated correlations.

But these estimates are noisy, so the identity of the “best” assets changes across bootstrap samples. An asset that receives a large weight in one resample may receive a smaller weight, or no weight, in another. When the weights are averaged across many resampled frontiers, these unstable extreme positions are diluted, while assets that are consistently useful across samples retain higher weights. The result is a smoother and more diversified allocation that sacrifices some in-sample efficiency in exchange for lower sensitivity to estimation error.

Final Thoughts

Mean-variance optimization (MVO) is often criticized as yielding nonsensical or impractical results. This article shows practical examples that illustrate:

how sensitive MVO solutions can be to small changes in inputs;
how MVO can produce overly concentrated portfolios or extreme long-short positions;
how naïve sample-moment estimates can lead to unstable allocations over time;
how constraints and resampling can reduce, though not eliminate, these problems.

MVO is not the only game in town. Other portfolio construction approaches are designed, at least in part, to handle some of the issues illustrated in this article:

The Black-Litterman model starts from the idea that market weights contain useful equilibrium information, then allows investors to incorporate views with an explicit degree of confidence.
Bayesian approaches, more generally, model uncertainty about expected returns, covariances, or other inputs directly rather than treating estimates as known.
The Total Portfolio Approach uses the strategic asset allocation process to define a fund’s return objective and overall risk budget, but treats the resulting benchmark as a guide rather than a binding allocation. It recognizes the uncertainty in long-term forecasts and gives investment teams more discretion to deploy risk across the total portfolio as opportunities and market conditions change.
Risk parity reduces reliance on expected return estimates by focusing instead on how much each asset contributes to total portfolio risk.
Parametric portfolio policies allow investors to link portfolio weights directly to asset characteristics that may contain information about expected returns.

While there are many alternative portfolio construction methodologies, I would argue that MVO remains useful because it makes the trade-off between return, risk, and diversification explicit. Knowing its practical limitations, and ways to avoid them, remains essential.

Python Code

Notebooks that replicate all examples in this article are provided below for supporters of Systematically Biased.

Building a Systematic Trading System With AI (Post #4: Building the Trading Layer)

Systematically Biased — Fri, 24 Apr 2026 18:22:15 GMT

This is the fourth and last post of a series in which I’m documenting the process of building a functional systematic trading system from scratch using an AI agent. Previous parts are here:

Post #1 (data pipeline).
Post #2 (trading rules/subsystems/strategies).
Post #3 (backtesting engine).

Once the data pipeline and backtesting engine were in place, the next challenge was operational. At that point, the app could already process historical futures data for signal generation and backtest strategies under fairly realistic assumptions. But neither of those things is enough to trade a strategy in practice. That requires a different layer of the system.

The trading problem can be stated very simply:

How do we take a signal generated by a model and turn it into an actual position held at the broker, while keeping the process observable, reviewable, and safe?

This quickly expands into a chain of distinct steps. Strategies produce signals. Given an account size and choices about how to combine them to achieve some target risk, the signals need to be turned into target positions. The target positions then need to be compared with what is currently held in the account. That comparison produces one or more orders that, once executed, change the state of the portfolio, which in turn determines future P&L.

In other words, the trading layer is really about managing the transformation

Signals → Target Positions → Orders → Actual Positions → P&L

in a way that maintains internal consistency.

From Signals to Targets

The first step is to decide what the strategy wants to hold based on a signal. In the app, signals are generated by trading rules, then combined inside subsystems, and finally combined into a strategy and translated into desired positions by the portfolio engine. The output of this stage is not an order but a target: the number of contracts the strategy would like to hold for each instrument.

Conceptually, if the strategy signal for instrument i is S_i and the sizing engine determines that one unit of signal corresponds to some risk-scaled contract quantity, then the target can be written schematically as

The exact form of the function depends on the sizing logic. While the distinction between signal and position may seem obvious, it’s important in order to avoid confusion between the two layers.

The strategy I adopted in this project is typical of many systematic trading setups. We take a snapshot of the current market state and append it, in effect, as the latest observation for the purpose of computing the current signal, which then is turned into a target position as described above.

Once the app computes targets, it has to decide what to do with them. The current setup is semi-automatic: the app treats targets as snapshots. A target run is computed, saved, and then reviewed. Execution then operates on that persisted snapshot rather than silently recomputing the strategy again at order time. This keeps the trading decision being executed explicit and auditable. Obviously, if we were to move into higher frequency strategies, automation and speed would be much more important.

From Targets to Orders

Once a target is known, the next step is to compare the target with the current position. If the desired position is n_i^* and the current position attributed to the strategy is n_i^B, then the order quantity is

$$\Delta n_i = n_i^* - n_i^B$$

Again, this is conceptually simple, but there are some important details. The app is not generating trades directly from signals, but from the difference between desired and current state. This also means that the app needs to be able to read broker snapshots and reconcile positions with one or more strategies. Without a reliable view not only of current broker holdings, but of what each strategy is holding, the app cannot compute meaningful order deltas.
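The delta calculation itself can be sketched in a few lines. This is an illustration rather than the app's actual code: the function name and instrument symbols are made up, and it assumes that an instrument missing from the target snapshot should be flattened (target of zero).

```python
def order_quantities(targets, current):
    """Contracts to trade per instrument: target minus current, dropping zeros.

    Assumption: an instrument missing from `targets` has a target of zero,
    and one missing from `current` is not held.
    """
    orders = {}
    for inst in sorted(set(targets) | set(current)):
        delta = targets.get(inst, 0) - current.get(inst, 0)
        if delta != 0:
            orders[inst] = delta  # positive = buy, negative = sell
    return orders

# Hypothetical persisted target snapshot and strategy-level positions.
targets = {"ES": 3, "ZN": -2, "GC": 0}
current = {"ES": 1, "ZN": -2, "CL": 1}
orders = order_quantities(targets, current)
print(orders)
```

Computing orders from a saved target snapshot, rather than from a fresh signal, is what keeps the executed decision the same one that was reviewed.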

From Orders to Positions

Making the app connect to my broker’s API was extremely easy (despite a few glitches related to a well-known issue in Interactive Brokers’s API running inside a Streamlit environment). But the trading layer is more than a wrapper around the broker API. Once orders are submitted, the app has to update its view of what the account actually holds. It has to compare that with what the broker reports and decide whether the internal state and the external state are still aligned.

Connecting the app to Interactive Brokers was surprisingly easy

This is especially important because the broker account only knows account-level positions, i.e. it does not know about the internal hierarchy of rules, subsystems, and strategies that exists inside the app. So if several strategies are meant to coexist in the same account, the app has to maintain its own strategy-level accounting. That means the system needs to be able to retrieve broker positions and attribute them internally to one or more strategies. The attribution part has to be handled consistently by the app itself. Without this internal attribution layer, a multi-strategy system running in one account becomes impossible to monitor coherently.

From Positions to P&L

Given existing positions, we would like to measure P&L to monitor the system. In a futures system, this means marking positions to market using current prices and contract multipliers. The calculation is relatively straightforward, but the system needs to maintain a coherent chain from executed positions to current account state to P&L. Without that, the system cannot really be monitored.
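The mark-to-market step can be sketched as follows. This is an illustration under simple assumptions (one position, no fees, no rolls), and the function and parameter names are hypothetical rather than taken from the app.

```python
def mark_to_market_pnl(contracts: int, multiplier: float,
                       prev_price: float, price: float) -> float:
    """P&L of a futures position between two marks.

    contracts is signed (negative for short positions), and multiplier is the
    currency value of a one-point move in one contract.
    """
    return contracts * multiplier * (price - prev_price)


# Hypothetical examples: a long and a short position.
long_pnl = mark_to_market_pnl(contracts=2, multiplier=50.0, prev_price=5000.0, price=5010.0)
short_pnl = mark_to_market_pnl(contracts=-1, multiplier=100.0, prev_price=2000.0, price=2015.0)
print(long_pnl, short_pnl)  # 1000.0 -1500.0
```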

The broad logic is simple enough, but the difficulty is in all the boundary conditions around it. Signals are generated on continuous futures series, but orders must be placed in actual tradable contracts. This means the app has to distinguish between the instrument used for signal generation and the instrument used for execution. I’ve also implemented a feature that allows the user to change the mapping of the contracts used for signal generation and trading. For example, I may use BTC for signal generation but execute on MBT.

One important detail was to make sure that the app would not break in case of unattributed positions. Since I have other positions in my brokerage account that have nothing to do with the system, I had the agent implement logic to put these aside in an “unattributed” bucket, which allows me to quickly check anything that is not internally attributed to a strategy.
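The idea of an "unattributed" bucket can be illustrated with a short sketch. It assumes positions are stored as plain dictionaries keyed by instrument, and that strategy books are keyed by a strategy name. Both are hypothetical layouts, not the app's schema.

```python
def unattributed_positions(broker: dict[str, int],
                           books: dict[str, dict[str, int]]) -> dict[str, int]:
    """Broker position minus the sum of strategy-attributed positions."""
    attributed: dict[str, int] = {}
    for positions in books.values():
        for instrument, qty in positions.items():
            attributed[instrument] = attributed.get(instrument, 0) + qty

    residuals: dict[str, int] = {}
    for instrument in sorted(set(broker) | set(attributed)):
        residual = broker.get(instrument, 0) - attributed.get(instrument, 0)
        if residual != 0:  # anything the strategies do not explain
            residuals[instrument] = residual
    return residuals


# Hypothetical account: two strategies share MES, and SPY belongs to neither.
broker = {"MES": 5, "MGC": -1, "SPY": 100}
books = {"trend": {"MES": 3, "MGC": -1}, "mean_reversion": {"MES": 2}}
print(unattributed_positions(broker, books))  # {'SPY': 100}
```

A non-empty result is not necessarily an error. In this example it simply flags a holding that lives outside the system, which is the case described above.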

That is what makes the trading layer an engineering problem rather than just an API integration (which the AI agent did very efficiently).

Monitoring Exceptions

Real life is messy, and it is not possible to plan for every eventuality. However, we can and should carefully monitor what happens along the way, and the app should flag any situations that seem anomalous. My app ended up with separate monitoring and exception views to keep track of whether the system is currently in a trustworthy state.

I did (and do) intend to use this app for real trading. For this reason, paper trading is essential, as it reveals many of the operational issues that can happen. To avoid any risk of mixing paper and live trading, I implemented a strict segregation between the two: each has its separate state, separate storage, separate logs, and separate broker connections.

This part of the project also made the limitations of AI-assisted coding very visible. The difficulty was not in coding, but in making sure the entire chain was logically coherent, due to the many details involved in even a modestly complex trading system. Testing strategies in paper trading revealed several issues and allowed me to fix them safely.

What Still Needs Work

So this is the end of the series, but not really the end of the project. At this stage, I’ve been paper trading strategies, but I still don’t trust the system enough to put real money on the line. Some of the items on my to-do list are:

The system still runs locally on my laptop, with a local database. That is fine for development and paper trading, but too fragile for serious live use. A more robust deployment and storage architecture is still needed.
Incorporate a complete workflow for ETF strategies so I can automate the tactical allocation strategy that I use.
Improve the backtesting engine to use contract/expiry level data for futures. Although the continuous futures series are good enough to backtest trend following and mean reversion strategies, using the right expiries is more realistic and will allow me to more correctly model rollover assumptions, as well as test strategies like carry, which require trading in more than one contract.
Maintenance tools: if something goes wrong (say, I decide to manually close a position), there should be a way to provide that information to the system. Of course, I can tell the agent to fix it, but the app should be able to do this independently.
Improve the overall robustness of the system, including better P&L and strategy monitoring.
Further automation of data update and signal generation.

What This Project Taught Me

I had very little experience working with AI agents when I started this project. In the process of building the app, I learned enough about how to scope, guide, and iterate with them that part of me is tempted to start over and rebuild it from scratch with that knowledge in hand. I’ve been experimenting with agents on other projects as well, and both the speed and the quality of what I can produce keep improving.

One lesson in particular has become very clear to me: it pays to spend more time upfront defining the context, clarifying the scope, and working with the agent to produce a coherent plan before any coding begins. I’ve found it especially useful to instruct the agent to keep asking questions until it has clarity on all relevant points.

Over the course of building this app, the AI agent was extremely useful. It could write large amounts of code quickly, scaffold interfaces, refactor workflows, build database layers, and implement features at a pace I could never have matched on my own. While I’m quite comfortable with trading strategies, backtests, and portfolio construction, my knowledge of database architecture and UI development is rudimentary at best. In that sense, AI significantly expanded what I could realistically build.

This project also reinforced for me that AI is not a substitute for domain knowledge. It can accelerate development dramatically, but it is most useful when we can provide detailed specifications of how things should work and still detect problems even when, superficially, things look correct. There were very few cases where the code itself was the main problem. Most bugs arose because I had not provided enough detail about the desired behavior, or because the problem being tackled turned out to be more complex than it initially appeared. In other cases, the code would run and the results would seem reasonable, yet still be conceptually wrong in ways that were not obvious unless I inspected the logic directly or questioned the agent. These are the kinds of mistakes that can get lost in “vibe coding.” In that respect, developers have a real advantage when using these tools, because they are used to thinking in terms of specifications, tests, and edge cases.

I began this series with a simple question:

Can an AI agent build a functional systematic trading system from scratch?

My answer is yes, but with some caveats. AI can greatly accelerate the process, but the quality of the result still depends heavily on the human component. I was able to build a prototype that is close to functional for my own purposes, although I would still want much more testing and validation before trusting it with real money. Even so, it remains far from anything I have used, or would consider using, in a professional setting. Now, I’m not a professional developer, and I’m well aware of my limitations in that regard. I have no doubt that a professional developer using AI agents could build something orders of magnitude better.

So that, at least, is what this project taught me. AI made it possible to build much more, much faster, than I could have on my own. AI agents have effectively removed the speed of code creation as a limiting factor, but in a project like this, coherence and a vision of how the parts fit together are what matter. These still depend heavily on the human(s) in the loop.

Your Trend Strategy is Just a Weighted Average of Past Returns

Systematically Biased — Fri, 17 Apr 2026 14:47:32 GMT

Trend following is one of the most widely used systematic trading strategies. There’s solid evidence that trend following has worked across long periods of time, different markets, and is particularly useful during periods of crisis.[1] There are endless ways to implement it, but many popular trend indicators turn out to be much closer cousins than they first appear. In fact, once you rewrite them in return space rather than price space, a lot of trend-following rules are just weighted averages of past returns. What differs across rules is not the basic idea, but the shape of the weights.

There are countless trend indicators around, but most are variations of simple ideas like:

Compare the current price with price n periods ago.
Compare current price with a moving average of prices.
Compare a fast moving average with a slow one.

Time Series Momentum Rules

Take a simple rule based on the first idea. We can express the signal as follows:

$$\text{signal}_t = \operatorname{sign}\left(P_t - P_{t-n}\right)$$

That means we should go long when the current price is above the price n periods ago, and short otherwise. We can express the same rule by looking at the cumulative returns, which is the time series momentum signal used in many of the papers on trend following:

$$\text{signal}_t = \operatorname{sign}\left(r_{t-n,t}\right)$$

where r_{t-n,t} is the cumulative return over the last n periods. How we define returns here matters. If we work with simple returns, then r_{t-n,t} must reflect compounding. Alternatively, we can write the signal as a sum of past price differences. If we use log returns, then r_{t-n,t} is simply the sum of the last n one-period log returns:

$$r_{t-n,t} = \sum_{i=0}^{n-1} r_{t-i}$$
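A quick numerical check of this equivalence, as a sketch on simulated data: the sign of the price difference over n periods should match the sign of the sum of the last n log returns. The helper names and the simulated return parameters are assumptions chosen only for illustration.

```python
import math
import random


def sign(x: float) -> int:
    """Return +1, 0 or -1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def rules_agree(log_returns: list[float], n: int) -> bool:
    """Check that the price rule and the summed log-return rule give the same sign."""
    # Rebuild prices from log returns: P_t = P_0 * exp(r_1 + ... + r_t).
    prices = [100.0]
    for r in log_returns:
        prices.append(prices[-1] * math.exp(r))
    for t in range(n, len(prices)):
        price_rule = sign(prices[t] - prices[t - n])
        # log_returns[t - n:t] holds the n one-period returns between t-n and t.
        return_rule = sign(sum(log_returns[t - n:t]))
        if price_rule != return_rule:
            return False
    return True


random.seed(0)  # simulated returns, for illustration only
simulated = [random.gauss(0.0003, 0.01) for _ in range(500)]
print(rules_agree(simulated, n=63))
```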

Price-SMA Crossover

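Another commonly used trend following indicator is the price-simple moving average crossover. This is based on comparing the current price with a moving average of past prices. Suppose we compare the current price with a simple moving average (SMA) of the past n prices. For simplicity, let’s assume we’re working with log prices p_t=log(P_t):

$$\text{signal}_t = \operatorname{sign}\left(p_t - \text{SMA}_t(n)\right)$$

The SMA is defined as:

$$\text{SMA}_t(n) = \frac{1}{n}\sum_{j=0}^{n-1} p_{t-j}$$

Then the price/SMA crossover signal can be written as (see the end of the post):

$$\text{signal}_t = \operatorname{sign}\left(\sum_{i=0}^{n-2} \frac{n-1-i}{n}\, r_{t-i}\right), \qquad r_t = p_t - p_{t-1}$$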

Therefore, the price/SMA crossover signal is equivalent to a weighted average of the past n-1 returns, where the weights are linearly decreasing.
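The linearly decreasing weights are easy to verify numerically. The sketch below computes the gap between the latest log price and its SMA directly, then again as a weighted sum of one-period log returns with weights (n-1-i)/n on the return i periods back. Function names and the simulated prices are hypothetical.

```python
import math
import random


def sma_crossover_weights(n: int) -> list[float]:
    """Weights on r_t, r_{t-1}, ..., r_{t-n+2} implied by price minus SMA(n)."""
    return [(n - 1 - i) / n for i in range(n - 1)]


def price_minus_sma(log_prices: list[float], n: int) -> float:
    """Latest log price minus the simple average of the last n log prices."""
    window = log_prices[-n:]
    return log_prices[-1] - sum(window) / n


def weighted_return_sum(log_prices: list[float], n: int) -> float:
    """The same quantity, computed from log returns and the weights above."""
    # returns[0] is the most recent return r_t, returns[1] is r_{t-1}, and so on.
    returns = [log_prices[-1 - i] - log_prices[-2 - i] for i in range(n - 1)]
    return sum(w * r for w, r in zip(sma_crossover_weights(n), returns))


random.seed(1)  # simulated log-price path, for illustration only
log_prices = [0.0]
for _ in range(300):
    log_prices.append(log_prices[-1] + random.gauss(0.0, 0.01))

print(sma_crossover_weights(5))  # [0.8, 0.6, 0.4, 0.2]
print(math.isclose(price_minus_sma(log_prices, 50),
                   weighted_return_sum(log_prices, 50), abs_tol=1e-12))
```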

Price-EMA Crossover

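What about replacing the simple moving average with an exponential moving average (EMA)? The signal then becomes

$$\text{signal}_t = \operatorname{sign}\left(p_t - \text{EMA}_t(\lambda)\right)$$

where

$$\text{EMA}_t(\lambda) = (1-\lambda)\sum_{j=0}^{\infty} \lambda^{j}\, p_{t-j}, \qquad 0 < \lambda < 1$$

Following the same strategy used for the price-SMA signal, we can show that

$$p_t - \text{EMA}_t(\lambda) = \lambda \sum_{i=0}^{\infty} \lambda^{i}\, r_{t-i}$$

Noting that multiplication by λ>0 doesn’t change the sign of the signal, the price-EMA crossover rule is equivalent to using the sign of an EMA of returns:

$$\text{signal}_t = \operatorname{sign}\left(\sum_{i=0}^{\infty} \lambda^{i}\, r_{t-i}\right)$$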
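Here is a small numerical check of the same idea, under an assumed parametrization: the EMA is computed recursively as EMA_t = (1-λ)p_t + λ EMA_{t-1} and initialized at the first observation, which is equivalent to assuming zero returns before the sample starts. With that convention, the gap between the log price and its EMA obeys d_t = λ(r_t + d_{t-1}), a geometrically discounted sum of past returns. Function names are hypothetical.

```python
import math


def price_minus_ema(log_prices: list[float], lam: float) -> list[float]:
    """Gap between each log price and its recursively computed EMA."""
    ema = log_prices[0]  # assumption: EMA initialised at the first observation
    gaps = [0.0]
    for p in log_prices[1:]:
        ema = (1.0 - lam) * p + lam * ema
        gaps.append(p - ema)
    return gaps


def discounted_return_sum(log_prices: list[float], lam: float) -> list[float]:
    """lam * sum_i lam**i * r_{t-i}, built with the recursion d_t = lam * (r_t + d_{t-1})."""
    acc = 0.0
    out = [0.0]
    for prev, cur in zip(log_prices, log_prices[1:]):
        acc = lam * ((cur - prev) + acc)
        out.append(acc)
    return out


# Hypothetical log prices.
log_prices = [4.60, 4.62, 4.61, 4.65, 4.70, 4.68, 4.73]
gaps = price_minus_ema(log_prices, lam=0.9)
sums = discounted_return_sum(log_prices, lam=0.9)
print(all(math.isclose(g, s, abs_tol=1e-12) for g, s in zip(gaps, sums)))
```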

Moving Averages in General

Valeriy Zakamulin and Javier Giner have several papers that look at trend following in general, and alternative formulations of different rules in terms of returns:

This paper by Valeriy Zakamulin provides results for other kinds of signals based on different types of weighted averages.
In a subsequent paper with Javier Giner, the authors compare time series momentum and different moving average strategies.
In a more recent paper, Zakamulin and Giner look at optimal trend strategies under a two-state regime switching model.

I particularly like the chart below from the last paper, which shows the shape of the weights on past returns for different trend following rules:

The three cases I mentioned above (time series momentum, price-SMA crossover, and price-EMA crossover) are shown in the chart as MOM, SMA, and EMA. Zakamulin and Giner distinguish the following cases:

Constant weights: using the time series momentum (MOM) is equivalent to using an equal average of past returns.
Declining weights: using price/SMA or price/EMA crossover is equivalent to overweighting the most recent returns.
Hump-shaped weights that underweight the most recent and most distant returns: this pattern describes rules such as crossovers between a fast and a slow SMA or EMA.
Shapes where the sign of the weights can alternate between positive and negative. This is the case for the moving average convergence/divergence (MACD) indicator. In the last case, the rule negatively weights distant returns, suggesting return reversal at long horizons.

Trend Indicators and Return Dynamics

This matters because those weighting schemes are not just technical details. They embed views about the dynamics of returns. Equal weights assume that all past returns inside the lookback window matter similarly. Declining weights put more emphasis on recent information. Hump-shaped weights imply that the most informative lags may be somewhere in the middle, while sign-changing weights, as in MACD, effectively combine short-run continuation with long-run reversal. So when we choose a trend rule, we are not just choosing an indicator. We are implicitly choosing a model of return persistence.

Appendix: Deriving the Price-SMA Crossover Rule

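The price-SMA crossover signal can be expressed as:

$$p_t - \frac{1}{n}\sum_{j=0}^{n-1} p_{t-j} = \frac{1}{n}\sum_{j=0}^{n-1}\left(p_t - p_{t-j}\right) = \frac{1}{n}\sum_{j=1}^{n-1}\left(p_t - p_{t-j}\right)$$

In the first passage, we use the fact that p_t can be written as a sum with n terms equal to (1/n)p_t, which allows us to put p_t inside the sum. In the second passage, we used the fact that the term for j=0 is equal to 0. Next, we use the fact that

$$p_t - p_{t-j} = \sum_{i=0}^{j-1} r_{t-i}$$

Substituting this above gives

$$p_t - \text{SMA}_t(n) = \frac{1}{n}\sum_{j=1}^{n-1}\sum_{i=0}^{j-1} r_{t-i}$$

Expanding this sum, we can see that each return r_{t-i} appears for all j=i+1,…,n-1, that is, exactly n-1-i times. Therefore,

$$p_t - \text{SMA}_t(n) = \sum_{i=0}^{n-2} \frac{n-1-i}{n}\, r_{t-i}$$

The signal in the price/SMA crossover using n prices reduces to a weighted average of n-1 returns. We could re-index the rule by the number of returns entering the signal. Let k=n-1, then the rule becomes

$$\text{signal}_t = \operatorname{sign}\left(\sum_{i=0}^{k-1} \frac{k-i}{k+1}\, r_{t-i}\right)$$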

Here are some references with links:

Moskowitz, T. J., Ooi, Y. H., & Pedersen, L. H. (2012). Time series momentum. Journal of Financial Economics, 104(2), 228-250.
Hurst, B., Ooi, Y. H., & Pedersen, L. H. (2013). Demystifying managed futures. Journal of Investment Management, 11(3), 42-58.
Hurst, B., Ooi, Y. H., & Pedersen, L. H. (2017). A century of evidence on trend-following investing. The Journal of Portfolio Management, 44(1), 15-29.
Lim, B. Y., Wang, J. G., & Yao, Y. (2018). Time-series momentum in nearly 100 years of stock returns. Journal of Banking & Finance, 97, 283-296.
Yang, K., Qian, E., & Belton, B. (2019). Protecting the downside of trend when it is not your friend. The Journal of Portfolio Management, 45(5), 99-111.
Harvey, C. R., Hoyle, E., Rattray, S., Sargaison, M., Taylor, D., & Van Hemert, O. (2019). The best of strategies for the worst of times: Can portfolios be crisis proofed?. The Journal of Portfolio Management, 45(5), 7-28.
Rubesam, A. (2022). The Long and the Short of Risk Parity. Journal of Portfolio Management, 48(4).

Building a Systematic Trading System With AI (Post #3: Building the Backtesting Engine)

Systematically Biased — Tue, 07 Apr 2026 16:25:27 GMT

This post is part of a series in which I’m documenting the process of building a functional trading system from scratch using agentic AI. Post #1 was about building the data pipeline (focused on futures for now). Post #2 covered the design of the infrastructure to handle trading rules (one trading signal for one instrument), trading subsystems (combinations of trading rules for one instrument) and trading strategies (combinations of subsystems for multiple instruments). This post is about the implementation of the backtesting engine. But before I get into that, a quick update.

Stuff Breaks

This series is happening in real time. At this stage, my systematic trading app/platform already “works”. I can:

Run a daily routine to update historical data
Create strategies with different trading rules and backtest them
Connect to my broker
Execute orders
Keep track of positions by strategy/instrument and automatically reconcile them with broker positions.

I have been paper trading a set of strategies on a group of futures for a few weeks. However, some events have happened that have caused me to make adjustments as I go along:

Data update was still fragile. Depending on when during the day I ran the update routine, I was ending up with incomplete bars for the current trading session. The expected behavior was that, on the next update, the system would complete those bars, but this wasn’t working, so I had to make some changes. It’s much more robust now, and I include some automated checks to ensure everything is ok.
The data provider occasionally has its own issues, which made me think about redundancy. For the goal of this project, I’ll keep it as is for now though, as I’m aiming for zero cost.

In the process of dealing with the data issue, I decided at one point to do a clean reset of the DB to reingest all historical data. However, the DB also contained metadata on saved strategies, including the ones I was paper trading. This was, of course, my fault for not having considered it, but after this, I decided to save strategies in a separate DB to avoid this kind of issue. I also implemented some tools for backups.

The Backtesting Engine

At first glance, backtesting sounds simple. Given a price series and a trading rule, it is easy enough to write a few lines of code that generate positions and compute returns. But that kind of toy backtest is not a backtesting engine. An actual engine has to combine signals across rules and instruments, translate them into tradable contracts, apply volatility targeting and transaction costs, and do all this in a way that remains consistent with how the strategy would actually be traded. The danger of toy backtests is not just that they are simplistic, but that they can give you misleading confidence about whether the strategy can actually be traded.

What the Backtesting Engine Needs to Do

For this project, the backtesting engine has to solve a very specific problem. It must:

take continuous futures series as the inputs for signal generation
allow multiple trading rules per instrument
combine those rules into an instrument-level subsystem
combine multiple subsystems into a portfolio strategy
transform signals into futures contracts
apply volatility targeting
incorporate transaction costs
handle instruments with different data histories

The backtesting engine is not just producing an equity curve. It is acting as the layer that connects research signals to tradable positions, allowing realistic calculation of simulated P&L/returns. This is also slightly different from the usual backtesting approach used in academic papers, which are mostly focused on returns, not necessarily P&L, and which mostly focus only on weights, not position sizes.

In sum, the engine has to make decisions about hierarchy, position sizing, rebalancing, and contract-level implementation.

Signal Generation vs Position Sizing

A second important design choice was to separate signal generation from position sizing. A trading rule should tell us something about direction or conviction. It should not decide final leverage.

Some rules naturally produce discrete signals such as:

Others naturally produce continuous signals bounded in some interval such as:

But regardless of whether a rule is binary or continuous, the rule output is only an input into the portfolio engine: it is not yet the final position.

For this project, the cleanest approach was:

trading rules produce signals,
subsystems combine signals,
the backtesting engine converts the resulting signal into futures positions.

Futures

With ETFs or stocks, one can often get away with backtesting directly on adjusted price series. Futures are less forgiving. For futures strategies, at least three separate objects matter:

the continuous series used for signal generation,
the actual contract being traded at a given point in time,
and the contract specifications needed for P&L and risk sizing.

This forces the engine to keep several things separate that are often conflated in simpler systems. For now, the signals of the strategies I’m considering are computed on continuous series. But positions must ultimately be expressed in real contracts, each with its own price and multiplier.

That means a backtest engine for futures has to answer two distinct questions:

What is the signal on the synthetic continuous series?
What does that imply in terms of contracts in the currently tradable maturity?

For now, I’m relying on continuous series, which for strategies like trend following or mean reversion, provide a very good approximation. In a future version, I plan to modify the engine to work directly with each instrument’s contract chain, which will also allow me to incorporate other types of strategies that require trading more than one contract for the same instrument, like carry.

Position Sizing and Volatility Targeting

The most important part of the engine is probably how it sizes positions. The engine currently supports two broad approaches:

per-subsystem volatility targeting,
portfolio-level volatility targeting with covariance scaling.

Per-Contract Notional and Dollar Volatility

For an instrument indexed by i, define current futures price and contract multiplier as P_i and m_i. The notional value of one contract is then N_i = P_i m_i. Let annualized return volatility be σ_{i, ann}. Then annualized dollar volatility per contract is:

$$\sigma^{\$}_{i} = N_i\, \sigma_{i,\text{ann}} = P_i\, m_i\, \sigma_{i,\text{ann}}$$

This converts percentage price risk into dollar risk per contract, which is the quantity the engine needs for sizing.

Per-Subsystem Vol Targeting

Under the simpler approach, each subsystem is sized independently. Suppose account equity is A, the target annualized volatility is σ^*, and the subsystem signal is S_i ∈ [-1,1]. Then the float number of contracts is approximately:

The backtest then rounds this to an integer contract count.

This is simple and intuitive, but since it ignores cross-asset correlations, it will undershoot target volatility.
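To make the sizing arithmetic concrete, here is a hedged sketch. The per-contract dollar volatility follows directly from the definitions above. The sizing function is only one plausible form: it takes a `risk_capital` argument (a hypothetical name for the slice of account equity allotted to the subsystem) because the exact formula used by the engine is not reproduced here.

```python
def dollar_vol_per_contract(price: float, multiplier: float, ann_vol: float) -> float:
    """Annualized dollar volatility of one contract: price * multiplier * volatility."""
    return price * multiplier * ann_vol


def subsystem_contracts(signal: float, risk_capital: float, target_vol: float,
                        price: float, multiplier: float, ann_vol: float) -> tuple[float, int]:
    """Return (float contracts, rounded contracts) for one subsystem.

    Assumption: contracts = signal * risk_capital * target_vol / dollar vol per contract.
    """
    if not -1.0 <= signal <= 1.0:
        raise ValueError("signal must lie in [-1, 1]")
    per_contract = dollar_vol_per_contract(price, multiplier, ann_vol)
    float_contracts = signal * risk_capital * target_vol / per_contract
    # Python's round() rounds exact halves to the nearest even integer.
    return float_contracts, round(float_contracts)


# Hypothetical inputs: price 5000, multiplier 5, 20% annualized volatility.
float_n, rounded_n = subsystem_contracts(
    signal=-0.6, risk_capital=250_000, target_vol=0.12,
    price=5000.0, multiplier=5.0, ann_vol=0.20,
)
print(float_n, rounded_n)
```

With these inputs, one contract carries about 5,000 of annualized dollar volatility, so a risk budget of 30,000 scaled by a signal of -0.6 works out to roughly -3.6 contracts before rounding.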

Portfolio-Level Vol Targeting

A more interesting case is portfolio-level volatility targeting. Here, we first build inverse-volatility base positions. Suppose each instrument also has a risk budget weight w_i. Then the base float contracts are:

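This gives a risk-balanced composition (although it still ignores correlations). To translate this into portfolio weights, we use:

$$\omega_i = \frac{n_i^{\text{base}}\, N_i}{A}$$

where n_i^base denotes the base float contracts for instrument i. Let the annualized covariance matrix of returns be Σ. The portfolio volatility is then:

$$\sigma_p = \sqrt{\omega^{\top} \Sigma\, \omega}$$

To hit the target, the engine computes a global scaling factor:

$$k = \frac{\sigma^*}{\sigma_p}$$

and rescales the base contracts:

$$n_i = k\, n_i^{\text{base}}$$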

Only after this step are contracts rounded. This last point is important, because rounding too early produces systematic distortions, especially when capital is small relative to contract size or when many instruments are competing for limited risk budget.
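The scaling step can be sketched in a few lines of plain Python. The example assumes base contracts are already given (here built by hand as inverse-volatility positions), converts them to notional weights, measures portfolio volatility with the covariance matrix, rescales to the target, and rounds only at the end. All names and numbers are hypothetical.

```python
import math


def portfolio_vol(weights: list[float], cov: list[list[float]]) -> float:
    """Annualized portfolio volatility sqrt(w' * cov * w)."""
    n = len(weights)
    variance = sum(weights[i] * cov[i][j] * weights[j]
                   for i in range(n) for j in range(n))
    return math.sqrt(variance)


def scale_to_target(base_contracts: list[float], notionals: list[float], equity: float,
                    cov: list[list[float]], target_vol: float):
    """Return (scaling factor, scaled float contracts, rounded contracts)."""
    # Portfolio weights: notional exposure of each position over account equity.
    weights = [n * notional / equity for n, notional in zip(base_contracts, notionals)]
    vol = portfolio_vol(weights, cov)
    if vol == 0.0:
        return 0.0, [0.0] * len(base_contracts), [0] * len(base_contracts)
    k = target_vol / vol
    scaled = [k * n for n in base_contracts]
    return k, scaled, [round(x) for x in scaled]  # rounding is the last step


# Two uncorrelated instruments with 20% and 10% volatility.
cov = [[0.04, 0.0], [0.0, 0.01]]
k, scaled, rounded = scale_to_target(
    base_contracts=[100.0, 50.0], notionals=[25_000.0, 100_000.0],
    equity=1_000_000.0, cov=cov, target_vol=0.10,
)
print(round(k, 4), rounded)  # 0.1414 [14, 7]
```

Because the global factor rescales everything proportionally, any common constant in the base positions cancels out: only their relative sizes matter for the final portfolio.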

I’m also going to implement a full risk budget/parity option soon.

Rebalancing Frequency

The system currently handles strategies which compute signals daily. To make backtesting more flexible, the system can simulate trading at a lower frequency, such as weekly or monthly. This matters because the research frequency of a signal and the trading frequency of a strategy do not always need to coincide. It will also come in handy later for my other book, which has tactical allocation strategies with ETFs that rebalance monthly or weekly.
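One simple way to decouple the two frequencies is to compute targets every day but only act on them on rebalance days. The sketch below assumes a fixed number of trading days between rebalances, which is a simplification: real weekly or monthly schedules would follow calendar dates. The function name is hypothetical.

```python
def hold_between_rebalances(daily_targets: list[int], every: int) -> list[int]:
    """Held position per day when targets are only adopted every `every` days."""
    if every <= 0:
        raise ValueError("every must be positive")
    held: list[int] = []
    current = 0
    for day, target in enumerate(daily_targets):
        if day % every == 0:  # rebalance day: adopt the latest target
            current = target
        held.append(current)  # otherwise keep the previous position
    return held


print(hold_between_rebalances([1, 2, 3, 4, 5, 6, 7], every=3))  # [1, 1, 1, 4, 4, 4, 7]
```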

Transaction Costs

Any realistic futures backtest needs to model trading costs in contract space.

The app currently incorporates two simple but useful cost components:

commission per contract,
slippage in ticks.

These costs are applied whenever the position changes, not only when the sign of the signal flips. That matters because a volatility-targeted system may resize positions even when its directional view remains unchanged.

Commission cost is proportional to the absolute contract change |Δn_i|, whereas slippage cost is proportional to |Δn_i| × slippage in ticks × tick value. This is still a fairly simple model, and I still need to incorporate rolling costs.
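As a rough sketch of this cost model (names and numbers are hypothetical, and roll costs are ignored), the cost of a position change is the number of contracts traded times the sum of the commission and the slippage expressed in currency:

```python
def trade_cost(prev_contracts: int, new_contracts: int, commission: float,
               slippage_ticks: float, tick_value: float) -> float:
    """Cost of moving from prev_contracts to new_contracts."""
    traded = abs(new_contracts - prev_contracts)
    return traded * (commission + slippage_ticks * tick_value)


# Resizing from 3 to 5 contracts costs money even though the direction is unchanged.
resize_cost = trade_cost(3, 5, commission=0.62, slippage_ticks=1.0, tick_value=1.25)
# Flipping from +2 to -2 trades four contracts.
flip_cost = trade_cost(2, -2, commission=0.62, slippage_ticks=1.0, tick_value=1.25)
print(round(resize_cost, 2), round(flip_cost, 2))  # 3.74 7.48
```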

Trading Micro Futures

For some futures contracts, different versions exist, typically with different multipliers. The larger contracts usually have longer histories and better quality data. For example, micro Bitcoin futures (MBT) started trading later than the mini contracts BTC. Similarly for ES vs MES. I added functionality that allows me to use the larger contracts for signal generation and simulate trading with the micro by adjusting the contract multipliers and customizing cost assumptions.

What Broke While Building It

This part is worth emphasizing because it says something about both the engineering challenge and the use of AI for development. A backtesting engine sounds easy until one starts dealing with the boundary between signal generation and actual trading logic. Some of the issues I ran into included:

Weird backtested P&L led me to discover bugs in the contract roll logic for continuous series,
Rounding positions too early was causing issues in some edge cases,
Inconsistencies between signal contracts and execution contracts.

These were not cosmetic bugs: they reflected places where the internal logic of the system was not yet fully coherent. This was also one of the places where working with an AI agent was most revealing. The agent was very good at producing working code quickly, but consistency across the entire pipeline had to be enforced through repeated testing, inspection, and revision.

What the Engine Still Does Not Do Perfectly

At this stage, the backtesting engine is good enough to support serious experimentation, which is all I need for now. It is not “finished,” and probably never will be in any absolute sense. But it is now coherent enough to connect research signals to tradable positions without relying on the kinds of shortcuts that make toy backtests misleading.

Building a Systematic Trading System With AI (Post #2: Trading Rules)

Systematically Biased — Sat, 14 Mar 2026 10:50:50 GMT

This is the second post in a series in which I’m documenting an experiment to build a fully functional systematic trading system using AI. You can read part 1, in which I went over the data pipeline for futures contracts here.

A Disclaimer

AI is a powerful technology, but with all the vibe coding hype that has been going around lately, I feel like I need to add some disclaimers:

This series is not “I vibe coded a full trading platform in one day”. First, it’s taking much longer than a day. Second, the scope is limited to what I described as a “functional” app on the first post of this series.
Specifically, I want to explore if/how AI can fill the gaps in my knowledge that I have no time or interest to fix. I have no interest in becoming really good at designing UI or backend systems.
I’m not selling anything. If it turns out to be good enough, I’ll simply use it myself.
This Substack is written by me, not by an AI. Any errors are the result of traditional human stupidity. Also, no em dashes. No short impactful sentences with bold type. And — I hope — no slop.

The Framework

As a recap, my trading framework is a web-based app running (for now) locally on my machine. In the first post of this series, I covered the data pipeline:

setting up a database
ingesting downloaded historical data
automatically updating data from a provider
creating continuous time series for different futures contracts

Building Daily Bars

After playing around with the data from my provider, it became clear that their daily bars data were not a good option for me, even for strategies that use daily data. The reason has to do with how futures markets trade (i.e. markets are open for ~23 hours per day), and how the data are recorded in the provider's daily bar schema, which uses UTC dates. Following the provider’s recommendation, I switched to a higher frequency (1-hour) schema for data ingestion and now the app reconstructs daily bars internally, matching them to exchange sessions. This change was surprisingly painless. Once I realized the issue, I asked the agent to make the change and the implementation worked without any issues. An added benefit is that the app can now handle higher frequency data, which may be handy in the future.
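To illustrate the idea of session-based daily bars, here is a simplified sketch. It assumes a single fixed session open hour in UTC (22:00 is a made-up value), so any hourly bar starting at or after that hour is assigned to the next calendar day's session. A real implementation would need exchange calendars, daylight saving changes, and holidays, none of which are handled here. The bar layout and function names are hypothetical.

```python
from datetime import datetime, timedelta


def session_date(bar_start_utc: datetime, session_open_hour_utc: int = 22):
    """Map an hourly bar to its trading session date (simplified rule)."""
    if bar_start_utc.hour >= session_open_hour_utc:
        return (bar_start_utc + timedelta(days=1)).date()
    return bar_start_utc.date()


def build_daily_bars(hourly_bars: list[dict], session_open_hour_utc: int = 22) -> dict:
    """Aggregate hourly OHLCV bars into one bar per session date."""
    daily: dict = {}
    for bar in sorted(hourly_bars, key=lambda b: b["ts"]):
        key = session_date(bar["ts"], session_open_hour_utc)
        if key not in daily:
            # First bar of the session sets the open.
            daily[key] = {k: bar[k] for k in ("open", "high", "low", "close", "volume")}
        else:
            day = daily[key]
            day["high"] = max(day["high"], bar["high"])
            day["low"] = min(day["low"], bar["low"])
            day["close"] = bar["close"]  # last bar of the session sets the close
            day["volume"] += bar["volume"]
    return daily


bars = [
    {"ts": datetime(2026, 3, 9, 21), "open": 1.0, "high": 2.0, "low": 0.5, "close": 1.5, "volume": 10},
    {"ts": datetime(2026, 3, 9, 22), "open": 1.5, "high": 3.0, "low": 1.4, "close": 2.5, "volume": 5},
    {"ts": datetime(2026, 3, 10, 5), "open": 2.5, "high": 2.6, "low": 1.0, "close": 1.2, "volume": 7},
]
for day, bar in build_daily_bars(bars).items():
    print(day, bar)
```

In this example the 22:00 bar on 9 March and the 05:00 bar on 10 March end up in the same session, whereas grouping by UTC date would have split them.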

I also ran several diagnostics and checks on the continuous time series, and I’m quite confident with the results. With all of that in place, I have clean time series that can be used to generate and test different strategies.

Hierarchy

In this post, I focus on how trading signals are implemented and aggregated into trading subsystems. At this stage my goal is not to fine-tune trading rules, but to build infrastructure that allows systematic experimentation. The framework I’m developing follows the hierarchy described in the first post:

trading signal+instrument = trading rule

trading rules→trading subsystems→strategy

A trading rule applies one signal generation idea to one instrument. A trading subsystem is a collection of trading rules for one instrument. A strategy is a combination of several trading subsystems for multiple instruments.1

Trading Rules

The main requirements I had in mind for trading rules were the following:

Ability to add/modify new signal generation rules. Examples:
- A time-series momentum rule with a look-back of n days.
- A Bollinger breakout rule with look-back of n days and multiplier k.
Ability to toggle between long/short, long-only or even short-only versions of specific rules. Trend-following on equities, for example, tends to work much better as long-only rules.
Bulk generation of multiple trading rules for a given instrument.
- Example: generating variations of the time-series momentum with n∈ {21, 63, 126} days for the instrument ES.
Sweeping rules within a range of parameters and performing simplified backtests (this comes with a very strong warning, which I discuss below).
Easy visualization of signals from individual/multiple trading rules.

An example

In general, my preference is to use signals that generate a discrete output. For example, a time series momentum (TSMOM) signal can be defined as follows:

$$S_t = \left\{\begin{array}{ll} +1 & \textrm{if}\;{r_{t-n,t} > 0}\\ -1 & \textrm{if}\;{r_{t-n,t} < 0} \end{array}\right.$$

where r_t-n,t is the cumulative return between t-n and t. This is a basic trend following indicator that goes long the asset when the return over the look-back window is positive, and goes short otherwise. The AI agent came up with this:

from __future__ import annotations

from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class TSMOMRule:
    instrument: str
    lookback: int
    name: str | None = None

    def __post_init__(self) -> None:
        if self.lookback <= 0:
            raise ValueError("lookback must be positive")
        if self.name is None:
            object.__setattr__(self, "name", f"tsmom_{self.lookback}")

    def compute_signal(self, data: pd.DataFrame) -> pd.Series:
        if "date" not in data.columns or "close" not in data.columns:
            raise ValueError("Data must include 'date' and 'close' columns")

        df = data.copy()
        df["date"] = pd.to_datetime(df["date"])
        df = df.sort_values("date")

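        # Cumulative simple return over the look-back window (NaN during warm-up).
        ret = df["close"] / df["close"].shift(self.lookback) - 1.0
        # +1 if the return is positive, -1 if negative, 0 otherwise (including NaN warm-up rows).
        signal = (ret > 0).astype(int) - (ret < 0).astype(int)
        # Defensive guard: signal is already integer-valued, so this leaves it unchanged.
        signal = signal.where(~signal.isna(), 0)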
        signal.index = df["date"]
        signal.name = self.name or "tsmom"
        return signal

This seems reasonable. It could have used returns and compounded, or used log returns and summed, but this approach will give the same signal. Of course, creating the code for signal generation is just one part of the puzzle. The other parts include the DB schema, some helper functions (to fetch, insert, delete, edit existing rules), UI layers etc. The agent handled all of that seamlessly for this signal and several other trading signals that I asked for. At the moment, the easiest way to add a new trading signal is to make a request to the agent and have it handle all of that. Of course, if I wanted to do all of this manually, I could, but I’d be spending a lot of time handling DB migrations and UI design. The AI agent makes all of that very straightforward, but the tradeoff is dependence on the agent. One possibility that I’m thinking of exploring in the future is to have the agent build the functionality to allow for signal creation/editing directly on the app. The user provides the code for the class in a pre-specified format, the app validates and handles all the dirty work in the backend.

In terms of bulk creation of trading rules, the solution I implemented allows me to quickly sweep a trading rule within a range of parameters and bulk-save selected ones as individual trading rules. This leads to an important warning.

A word of warning:

The absolute worst way to select individual trading rules, which will guarantee overfitting and give you disappointing live results, is to backtest a rule over a grid of parameters and select the best-performing one(s).

Backtesting is the Schrödinger’s cat of systematic investing. Until you backtest an idea, you don’t know whether a trading rule makes money. Once you backtest it, you have already used the data. If you used the entire historical sample for that instrument, that’s it, you’ve opened the door to look-ahead bias. If you now select the parameters/combinations that performed the best in your backtest, the most you can hope for is that your backtest seriously overstates performance.

Another way to think about it is this. No matter how long the historical time series you used, it’s still only one observed sample or realized history for that particular financial instrument. Your backtest provides you with one realization of a set of metrics (CAGR, Sharpe ratio, etc.) that reflects that realized history. A more relevant question is: what is the distribution of that performance metric?
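One rough way to look at the distribution of a metric rather than a single realization is a block bootstrap of the strategy's daily returns. The sketch below makes several assumptions: the returns are simulated, the helper names are hypothetical, and the 20-day block size and 500 resamples are arbitrary choices.

```python
import numpy as np


def annualized_sharpe(returns: np.ndarray, periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio of a return series (risk-free rate ignored)."""
    sd = returns.std(ddof=1)
    if sd == 0:
        return 0.0
    return float(np.sqrt(periods_per_year) * returns.mean() / sd)


def block_bootstrap_sharpe(returns, n_samples=500, block_size=20, seed=0):
    """Resample contiguous blocks of returns and compute a Sharpe ratio per resample."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns, dtype=float)
    n = len(returns)
    n_blocks = int(np.ceil(n / block_size))
    out = np.empty(n_samples)
    for i in range(n_samples):
        # Random block start points; each block keeps its internal ordering.
        starts = rng.integers(0, n - block_size + 1, size=n_blocks)
        sample = np.concatenate([returns[s:s + block_size] for s in starts])[:n]
        out[i] = annualized_sharpe(sample)
    return out


# Simulated daily strategy returns (placeholder for a real backtest output).
rng = np.random.default_rng(42)
daily_returns = rng.normal(0.0004, 0.01, size=2520)

sharpes = block_bootstrap_sharpe(daily_returns)
low, high = np.percentile(sharpes, [5, 95])
print(f"point estimate: {annualized_sharpe(daily_returns):.2f}")
print(f"5%-95% bootstrap range: {low:.2f} to {high:.2f}")
```

The width of the range is a reminder of how little a single backtested Sharpe ratio pins down.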

Some relevant questions

There are many things to take into account when selecting trading signals in a systematic trading system. This is not the main focus of this series and has been discussed extensively elsewhere, so I will only highlight a few important questions we should ask:

Why does this make money? Is it because we’re collecting a risk premium? Is it a market structure issue? Is it because investors collectively have some sort of cognitive/behavioral bias? If you don’t have a reasonable explanation for why something works, you won’t be in a position to understand the risks and probably shouldn’t trade it.
How sensitive is it to implementation details? Does performance change dramatically if we slightly change the signal definition, sampling frequency, or execution timing?
Does it work across instruments? Does the signal work only on one instrument, or does it appear across multiple markets or asset classes? Signals that work only in one market are more likely to reflect noise or sample-specific effects.
Does the signal depend on a specific market regime? Does the strategy only work during a particular period (e.g., the 2008 crisis, the post-2010 QE era, or the COVID crash)? Strategies that rely heavily on one episode are unlikely to be robust.
How correlated is it with existing signals? A signal that performs well on its own but is highly correlated with existing strategies may add little value to the overall system.
How frequently does it trade? Faster trading signals trade more frequently and therefore incur higher transaction costs. A trading signal that rarely triggers may produce very misleading performance metrics.
Does it survive transaction costs? How much does it cost to trade the instrument, accounting for broker commissions, slippage, etc.? This is related to the speed of trading.
What is the minimum notional exposure and risk? This is particularly relevant for futures. The minimum notional exposure depends on the contract specs, and the risk depends on the volatility of the instrument. Examples:
1. ES: contract multiplier = $50 per S&P 500 index point. At current levels (~6700), one contract has a notional exposure of about $335k. Assuming annualized volatility of 15%, the annualized dollar volatility is roughly $50k.
2. MES: micro version of the ES contract with a multiplier of $5. One contract therefore has about $33.5k notional exposure and annualized dollar volatility of roughly $5k.
3. MBT: micro bitcoin futures at CME with a multiplier of 0.10 BTC. At current prices (~$70k), one contract has notional exposure of about $7k. Because bitcoin volatility is much higher (~60%), the annualized dollar volatility is roughly $4.2k.
4. ZF: contract size = $100,000 face value of a 5-year Treasury note. The volatility of ZF is only about 3%, so despite the much larger notional, the annualized dollar volatility is roughly $3k. That is lower than the roughly $4.2k for one MBT contract, which has a notional exposure of only about $7k.[2]
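The arithmetic in these examples is just notional = multiplier × price and dollar volatility = notional × annualized volatility. A small sketch that reproduces the figures quoted above; `contract_risk` is a hypothetical helper, and ZF is simplified by treating the $100,000 face value as the notional.

```python
def contract_risk(multiplier: float, price: float, annual_vol: float) -> tuple[float, float]:
    """Return (notional exposure, annualized dollar volatility) for one contract."""
    notional = multiplier * price
    return notional, notional * annual_vol


# (multiplier, price, annualized volatility), using the levels quoted in the text.
specs = {
    "ES": (50.0, 6700.0, 0.15),
    "MES": (5.0, 6700.0, 0.15),
    "MBT": (0.10, 70000.0, 0.60),
    # Simplification: face value used as notional, so the price is set to 1.0.
    "ZF": (100000.0, 1.0, 0.03),
}

risk = {name: contract_risk(*spec) for name, spec in specs.items()}
for name, (notional, dollar_vol) in risk.items():
    print(f"{name}: notional ~ ${notional:,.0f}, annualized dollar vol ~ ${dollar_vol:,.0f}")
```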

Trading Subsystems

Diversification is the only free lunch in finance.

In a systematic trading program, several layers of diversification are possible:

Across asset classes
Across instruments within the same asset class
Across trading rules for the same instrument (e.g. combining trend following with carry)

In the hierarchy I’m using, the last type of diversification takes place within a trading subsystem, which combines different trading rules for the same instrument. My requirements for trading subsystems were similar to those for trading rules, with one addition: the ability to customize the weights assigned to the trading rules.

Aggregation of trading rules into a subsystem. Example:
- a trading subsystem for ES could combine variations of time-series momentum with different lookbacks and a Bollinger breakout rule.
Ability to customize weights assigned to each trading rule (equal weights, custom weights, or weights obtained through bootstrapping optimization).
Ability to force a discrete signal as the output of the subsystem. When aggregating signals from multiple trading rules (e.g., by averaging), the result is no longer discrete. Using that continuous signal directly mixes signal generation with position sizing. If the degree of agreement between trading rules is informative about market movements, keeping the continuous signal could be a good idea. For example, when aggregating trend indicators with different lookbacks, the signal starts to decrease from its maximum value of +1 as a trend weakens, reducing exposure.
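The requirements above can be sketched as a weighted average of rule signals with an optional discretization step. This is an illustration only: `combine_signals`, its arguments, and the toy signals are hypothetical and are not the app's actual interface.

```python
import pandas as pd


def combine_signals(signals: pd.DataFrame, weights=None, discrete=False,
                    threshold: float = 0.0) -> pd.Series:
    """Combine rule signals (one column per rule) into one subsystem signal."""
    if weights is None:
        # Default: equal weights across rules.
        weights = {col: 1.0 for col in signals.columns}
    w = pd.Series(weights, dtype=float).reindex(signals.columns)
    if w.isna().any():
        raise ValueError("weights must cover every rule in `signals`")
    w = w / w.sum()  # normalize so the weights sum to one
    combined = signals.mul(w, axis=1).sum(axis=1)
    if discrete:
        # Force the output back to -1 / 0 / +1.
        combined = (combined > threshold).astype(int) - (combined < -threshold).astype(int)
    return combined.rename("subsystem")


# Toy signals from three TSMOM rules as a trend gradually reverses.
signals = pd.DataFrame({
    "tsmom_21": [1, -1, -1, -1],
    "tsmom_126": [1, 1, -1, -1],
    "tsmom_252": [1, 1, 1, -1],
})

continuous = combine_signals(signals)                # 1, 1/3, -1/3, -1
discrete = combine_signals(signals, discrete=True)   # 1, 1, -1, -1
print(pd.concat([continuous.rename("continuous"), discrete.rename("discrete")], axis=1))
```

The continuous output shrinks as the rules start to disagree, which is the implicit position sizing described above; the discrete output only flips once the weighted majority changes sign.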

Optimization

I’m generally skeptical of using any kind of optimization due to the risk of overfitting. The worst kind of optimization is full-sample or in-sample optimization: using the full sample of historical data to pick the weights that compose the subsystem.[3] This tends to produce very concentrated weights and introduces in-sample overfitting. A better approach is to use a resampling technique such as bootstrapping, which works by resampling blocks of the original time series to preserve temporal dependence. Optimizations across each of the samples are then aggregated, which reduces concentrations and gives much more stable results. Although resampling provides less concentrated and more stable results compared with full-sample optimization, the results are not truly out-of-sample, unless we do it at each point in time using only past data and allow the weights assigned to different signals to vary over time. I implemented optimization as a research tool, but my preference is for simpler, equal-weighting schemes.
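A rough sketch of the resampling idea: draw block-bootstrap samples of the rules' returns, pick the best long-only weights in each sample, and average the winners. The defaults mirror the settings quoted in the footnote (200 samples, each 15% of the history, 20-day blocks). Everything else is an assumption: the returns are simulated, `bootstrap_weights` is a hypothetical name, and a random search over candidate weights stands in for whatever optimizer the app actually uses.

```python
import numpy as np


def bootstrap_weights(rule_returns, n_samples=200, sample_frac=0.15,
                      block_size=20, n_candidates=500, seed=0):
    """Average the Sharpe-maximizing long-only weights across block-bootstrap samples."""
    rng = np.random.default_rng(seed)
    x = np.asarray(rule_returns, dtype=float)
    n_obs, n_rules = x.shape
    sample_len = max(block_size, int(sample_frac * n_obs))
    n_blocks = int(np.ceil(sample_len / block_size))
    # Candidate weight vectors on the simplex (non-negative, summing to one).
    candidates = rng.dirichlet(np.ones(n_rules), size=n_candidates)
    best = np.empty((n_samples, n_rules))
    for i in range(n_samples):
        # Resample contiguous blocks of rows to preserve temporal dependence.
        starts = rng.integers(0, n_obs - block_size + 1, size=n_blocks)
        rows = np.concatenate([np.arange(s, s + block_size) for s in starts])[:sample_len]
        port = x[rows] @ candidates.T  # returns of each candidate portfolio
        mu = port.mean(axis=0)
        sd = port.std(axis=0, ddof=1)
        scores = np.divide(mu, sd, out=np.zeros_like(mu), where=sd > 0)
        best[i] = candidates[np.argmax(scores)]
    # Aggregating across samples is what reduces concentration.
    return best.mean(axis=0)


# Simulated daily returns for five rules (placeholder for real rule returns).
rng = np.random.default_rng(1)
rule_returns = rng.normal(0.0003, 0.01, size=(1500, 5))
weights = bootstrap_weights(rule_returns)
print(np.round(weights, 3))
```

As noted above, weights obtained this way still use the whole history, so they are not out-of-sample unless the procedure is repeated at each point in time using only past data.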

An example: TSMOM on ES

Here’s a concrete example for ES. Research has shown that time-series momentum rules on equities typically work with different lookbacks of up to 12 months. Slower rules (such as a lookback window of n = 252 days, roughly one year) may take too long to react during sharp market reversals. On the other hand, fast rules (n = 21 days, roughly one month) may produce many short signals that are subsequently reversed (i.e., a “whipsaw”). At the daily or lower frequency, it’s very hard to make money shorting the ES. Therefore, I tested long-only versions of TSMOM with lookbacks of 21, 63, 126, 189, and 252 days (1, 3, 6, 9, and 12 months, respectively). Full-sample Sharpe ratio optimization gives the following result:

The result is not really surprising. Since the rules are highly correlated, the full-sample optimization concentrates on the top two performers (21 and 126 days).

Using a block bootstrap [4]:

And finally using an expanding window bootstrap:

The results of the bootstrap optimizations suggest reducing the weights of the 189-day and 252-day rules. I would argue this is not worth the hassle, and my preference would be to equally weight the signals. The chart below shows a composite equally weighted subsystem for ES (end date = March 06, 2026). The recent market movements have reduced exposure, although not to zero. The reason is that the three slower TSMOM rules (n=126, n=189, and n=252) are still active.

Strategies

Strategy construction in this framework is explicitly hierarchical: a trading rule is one signal model on one instrument with one parameter set, a subsystem combines multiple rules for the same instrument, and a strategy combines multiple subsystems across instruments with configurable subsystem weights. Signals are generated daily, can be constrained by position mode (long/short, long-only, short-only), and are then translated into futures contracts through a volatility-targeting engine. For risk control, the app supports (for now) two volatility-targeting modes:

Per-Subsystem: each subsystem sized independently to a target risk budget
Portfolio-Level: first builds inverse-vol base exposures, then rescales the full vector using the return covariance matrix to target overall portfolio volatility.

Both of these are simple inverse-vol approaches. The per-subsystem approach will undershoot target volatility because it ignores correlations across subsystems. Portfolio volatility depends not only on the volatility of individual subsystems but also on their correlations; when each subsystem is scaled independently, the resulting portfolio volatility is lower than the sum of the individual volatility targets unless the subsystems are perfectly correlated. The portfolio-level approach starts from subsystem exposures scaled by inverse volatility and then rescales the full vector to achieve a desired portfolio volatility. A next step would be to use a proper risk parity approach that lets you define risk budgets per instrument or per asset class, as I did in this paper.
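The difference between the two modes can be illustrated with a few lines of linear algebra. This is a sketch under assumptions: the volatilities and correlations are made up, the function names are hypothetical, and the per-subsystem mode is interpreted here as giving each subsystem an equal share of the target volatility.

```python
import numpy as np


def per_subsystem_exposures(vols, target_vol, budgets=None):
    """Size each subsystem on its own so its stand-alone vol equals budget * target."""
    vols = np.asarray(vols, dtype=float)
    n = len(vols)
    budgets = np.full(n, 1.0 / n) if budgets is None else np.asarray(budgets, dtype=float)
    return budgets * target_vol / vols


def portfolio_level_exposures(cov, target_vol):
    """Start from inverse-vol exposures, then rescale the vector to hit the target."""
    cov = np.asarray(cov, dtype=float)
    base = 1.0 / np.sqrt(np.diag(cov))
    base_vol = np.sqrt(base @ cov @ base)
    return base * target_vol / base_vol


def portfolio_vol(exposures, cov):
    """Volatility of the combined portfolio given exposures and a covariance matrix."""
    return float(np.sqrt(exposures @ cov @ exposures))


# Illustrative annualized vols and correlations for three subsystems.
vols = np.array([0.15, 0.05, 0.60])
corr = np.array([[1.0, 0.2, 0.3],
                 [0.2, 1.0, 0.0],
                 [0.3, 0.0, 1.0]])
cov = np.outer(vols, vols) * corr
target = 0.10

w_sub = per_subsystem_exposures(vols, target)
w_port = portfolio_level_exposures(cov, target)
print(f"per-subsystem portfolio vol:   {portfolio_vol(w_sub, cov):.4f}")
print(f"portfolio-level portfolio vol: {portfolio_vol(w_port, cov):.4f}")
```

With these inputs the per-subsystem mode lands below the 10% target because the correlations are well below one, while the portfolio-level mode hits the target by construction.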

Next: Backtesting

In the next part, I will go over the backtesting engine of the app.

[1] As mentioned in the first post, the hierarchy for trading rules, subsystems and strategies was inspired by this book by Robert Carver.

[2] For Treasury futures, practitioners typically measure risk using DV01 (the dollar value of a one-basis-point move in yields) together with yield volatility, rather than price volatility. I use price volatility here only to keep the comparison across contracts simple.

[3] If you pre-select only the best trading rules in the previous step, overfitting will have two chances to manifest: first in the pre-selection of the most profitable rules, second in the optimization step.

[4] I used 200 bootstrap samples, each with size equal to 15% of the length of the available historical time series, and a block size of 20 days.

Building a Systematic Trading System With AI (Post #1: Data)

Systematically Biased — Thu, 26 Feb 2026 10:27:17 GMT

If you’re here, you’re probably interested in systematic trading, so we have that in common. Over the years, I’ve built tools for testing and trading systematic strategies, both in industry and academia. I’ve written plenty of code, but mostly for research and strategy logic. Whenever I tried to build full applications myself, I ran into the same bottlenecks: UI design, database structure, API integrations, deployment, and keeping code maintainable as complexity grows.

Recently, I’ve been experimenting with agentic AI for coding. That led to a concrete experiment:

Can an AI agent build a functional systematic trading system from scratch?

To make this a meaningful test, I’m going to focus (mostly) on trend-following using futures, but the idea is to build something general for systematic strategies.[1] Using futures will also force the AI agent to deal with some technical details:

Futures data comes with structural complexities: expiries, contract multipliers, tick sizes, roll rules, and the construction of continuous series.
Trend following is simple enough to explain clearly, but rich enough to expose real system-design tradeoffs.
End-to-end automation is realistic: data ingestion, signal generation, monitoring, and broker execution can all be connected.

In other words, this isn’t a toy backtest: it’s a self-contained but non-trivial engineering problem. Throughout this series, I’ll highlight where the agent was surprisingly effective, where it struggled, where I had to intervene, and where domain knowledge proved indispensable.

This first post will focus on the data workflow (data acquisition and management).

The AI Agent Hype (or is it?)

While I’m more on the skeptical side regarding the AI hype in general (particularly when it comes to claims about cognition and AGI), the speed of improvement in AI for coding is mind-boggling. A lot of friends who are professional developers are doing very sophisticated things which make me feel like the ape in this meme. They have agents managing agents in self-improving loops that work while they, well, work even more on other things. It’s clear to me that AI agents are being used everywhere (even at Rockstar!), but I also see some criticism and risks:

It’s great for demos, but not for production.
What about debugging and code maintenance?
Governance and accountability: who is responsible when an AI agent ships bad code?
Security vulnerabilities
“AI agent made me forget how to code”

Define “Functional”

Like many people, I recently started experimenting with AI coding agents. I particularly like how I can quickly prototype simple web apps running locally on my machine to speed up recurring and time-consuming tasks. I’ve created some apps for teaching, some for research, and some for trading. It’s perfect for me because it gets me around the gaps in my coding skill set (which I have to admit, I already had little intention of fixing, even before AI agents came along).

My objective was to build a local web app that could run a systematic trading workflow with minimal hassle. The requirements I had in mind were:

Data pipeline: ingest historical contract-level data, fetch updates on demand, process continuous futures series, inspect quality/coverage.
Research workflow: generate trading signals, combine them into subsystems/strategies, and backtest across multiple instruments.
Execution plumbing: connect to broker, monitor signals/positions, stage orders, and optionally execute with safeguards.
Usable interface: browser-based UI on a local server. Based on advice from the AI agent, I used Streamlit for rapid iteration.

In short, “functional” means end-to-end: data, research, execution, and interface.

The Framework

The framework I’m building assumes that systems would trade at most on a daily basis. I have futures in mind for the moment, but it could be adapted to other asset classes like ETFs (in that case, it would look more like a tactical allocation program that should probably trade at most on a weekly or monthly basis, and the level of automation may be overkill).

The system is based on the following hierarchy [2]:

Parent instrument: the family of contracts for one instrument. For example, ES is the parent instrument for S&P 500 futures trading on the CME.
- Child instruments: associated with each parent are the individual contracts with specific maturities. For example, ESH6 is the ES contract with expiry in March 2026.
- Continuous series: associated with each parent, we will build a continuous time series that “stitches together” the children contracts. This is needed because as we roll positions, there is a difference in price between the current contract that will expire soon and the next one.
Trading rule: a trading rule generates a signal to be long or short a particular instrument. Example: a time-series momentum (TSMOM) rule on ES with a lookback period of 6 months would provide a +1 or -1 signal to be long or short that instrument.
Trading subsystem: a trading subsystem combines trading rules for one parent instrument. These could be either variations of the same trading rule (e.g. TSMOM with lookbacks of 6 and 12 months on ES), or different kinds of trading rules (e.g. TSMOM with lookback of 6 months and a moving average crossover). Later, we need to make choices about how to combine the signals from different trading rules.
Strategy: a combination of several trading subsystems for multiple instruments. When we get to this stage, we will need to focus on risk management and position sizing. I’ll implement different options and discuss the tradeoffs.
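One way to picture this hierarchy is as a handful of small data structures. The sketch below is hypothetical: the class and field names are illustrative assumptions, not the app's actual schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ParentInstrument:
    symbol: str                        # e.g. "ES"
    multiplier: float                  # dollars per index point
    contracts: tuple[str, ...] = ()    # child contracts, e.g. ("ESH6", "ESM6")


@dataclass(frozen=True)
class TradingRule:
    instrument: str                    # parent symbol the rule trades
    model: str                         # e.g. "tsmom"
    lookback: int                      # parameter set (a single lookback here)


@dataclass
class Subsystem:
    instrument: str
    rules: list[TradingRule] = field(default_factory=list)

    def add_rule(self, rule: TradingRule) -> None:
        # A subsystem only combines rules for its own parent instrument.
        if rule.instrument != self.instrument:
            raise ValueError("rule and subsystem must share the same instrument")
        self.rules.append(rule)


@dataclass
class Strategy:
    name: str
    subsystems: dict[str, Subsystem] = field(default_factory=dict)

    def add_subsystem(self, subsystem: Subsystem) -> None:
        self.subsystems[subsystem.instrument] = subsystem


es = ParentInstrument("ES", 50.0, ("ESH6", "ESM6"))
es_sub = Subsystem("ES")
es_sub.add_rule(TradingRule("ES", "tsmom", 126))
es_sub.add_rule(TradingRule("ES", "tsmom", 252))

strategy = Strategy("trend")
strategy.add_subsystem(es_sub)
print(strategy)
```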

How I used the AI Agent

I started from scratch and built the app in an exploratory, sequential way. This was partly because I wanted to explore how the AI agent works, but also because I knew that many details and issues would only become clear to me when I started building the app, so I didn’t write detailed specs to start with. Instead, I initially provided as many details as possible. I had some ideas about the workflows, but I iterated quite a bit with the agent and requested suggestions on some architecture decisions (data layout, DB, web interface). According to my AI agent:

You optimized for fast learning and domain correctness under evolving requirements; professional usage usually optimize for predictability, auditability, and safe integration into established codebases.

I provided some of my own existing code to the AI agent, especially for the backtesting engine, but this was mostly because I wanted it to build something that I would be familiar with.

Even with this approach, I was able to build something functional fairly quickly. I have no doubt that a professional quant developer with AI minions could build something much more sophisticated within the same time frame.

Show me the Data

The first step was to find a data provider. To build a trading/backtesting app, I wanted to find a provider with:

contract-level historical futures data,
API access for updates,
low enough cost (ideally free) for iterative development.

A friend who is a professional systematic trader recommended I take a look at databento. For daily frequency data, the startup credits when you create a new account are more than enough to download historical data for a large number of futures contracts.[3] The downside is that historical data are available for a relatively short period going back to 2010 for most contracts. After setting up the databento account, generating an API key is straightforward. This will come in handy later to automate downloads.

Databento provides multiple schemas. For this system’s first milestone, the key ones are:

Definitions
- Instrument metadata over time: symbol mappings, instrument IDs, activation/expiration timestamps, exchange, tick size, multiplier, and related contract fields.
- This info will be needed for correct contract mapping and roll logic.
OHLCV-1d
- Daily open/high/low/close/volume bars at instrument level.
- This is the core dataset needed for backtesting and daily signal generation.

Databento allows downloads in different formats. I chose “Databento Binary Encoding (DBN)” which seems to be the most convenient/fastest. I gave the DBN documentation to the AI agent so that it knew how to work with this format.

Tickers

In this series, I’ll use the following contracts for illustration purposes:

ES (E-mini S&P 500)
ZF (5-Year T-Note)
BTC (Bitcoin)
M6E (Micro EUR/USD)

There’s nothing special about these specific contracts. I chose them because they cover different asset classes and illustrate the complexities of futures data (e.g., the contracts have different maturities, different contract multipliers, etc.).

Data Workflow

The app supports two operational paths for data workflow:

Download data from API and ingest
Ingest previously downloaded local files

In practice, local ingest is often preferable for larger definition pulls, because databento takes a few minutes to prepare downloads of definitions files. In the future, I plan to automate this so I never have to touch it unless I want to add a new ticker.

Pipeline Design

Raw files are stored unchanged in a dedicated folder (data/raw/).
Ingestion parses and validates raw files.
Cleaned contract-level bars are written to curated Parquet datasets.
SQLite stores metadata such as:
- parent-to-contract mappings,
- data coverage windows,
- parent-level contract specs (exchange, currency, tick size, multiplier, roll settings, adjustment method).

This raw/curated/metadata separation keeps updates reproducible and debugging manageable.
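As an illustration of the metadata layer, here is a minimal SQLite sketch using Python's standard library. The table and column names, the roll setting, and the coverage dates are all hypothetical; the real app's schema may look quite different.

```python
import sqlite3

# Hypothetical schema: one row per parent, one row per child contract.
SCHEMA = """
CREATE TABLE IF NOT EXISTS parents (
    symbol TEXT PRIMARY KEY,
    exchange TEXT,
    currency TEXT,
    tick_size REAL,
    multiplier REAL,
    roll_days_before_expiry INTEGER,
    adjustment_method TEXT
);
CREATE TABLE IF NOT EXISTS contracts (
    contract TEXT PRIMARY KEY,
    parent TEXT NOT NULL REFERENCES parents(symbol),
    expiry TEXT,
    first_date TEXT,   -- start of the data coverage window
    last_date TEXT     -- end of the data coverage window
);
"""

conn = sqlite3.connect(":memory:")  # a file path would be used in practice
conn.executescript(SCHEMA)

# Parent-level contract specs (the roll setting is an illustrative value).
conn.execute(
    "INSERT INTO parents VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("ES", "CME", "USD", 0.25, 50.0, 5, "ratio"),
)
# Parent-to-contract mappings with coverage windows (illustrative dates).
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?, ?, ?, ?)",
    [
        ("ESH6", "ES", "2026-03-20", "2025-03-21", "2026-03-06"),
        ("ESM6", "ES", "2026-06-19", "2025-06-20", "2026-03-06"),
    ],
)

rows = conn.execute(
    "SELECT contract, first_date, last_date FROM contracts "
    "WHERE parent = ? ORDER BY expiry",
    ("ES",),
).fetchall()
print(rows)
```

Keeping the price bars in Parquet and only the small, relational metadata in SQLite means a coverage query like the one above never has to touch the price files.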

Creating Continuous Time Series

For backtesting and signal generation, we need to create a continuous time series that “stitches” together data from different maturities. There are several details that affect how these series are created [4]:

Contract listing structure

Example: ES is listed quarterly (H/M/U/Z), while other markets may list monthly.
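Futures symbols encode the delivery month with a standard letter code (H/M/U/Z are March, June, September, December). A small sketch of how a symbol such as ESH6 could be decoded; `parse_contract` is a hypothetical helper, and because the symbol carries only one year digit, the decade has to be supplied as an assumption.

```python
# Standard futures month codes.
MONTH_CODES = {
    "F": 1, "G": 2, "H": 3, "J": 4, "K": 5, "M": 6,
    "N": 7, "Q": 8, "U": 9, "V": 10, "X": 11, "Z": 12,
}


def parse_contract(symbol: str, decade_start: int = 2020) -> tuple[str, int, int]:
    """Split a symbol like 'ESH6' into (root, year, month).

    The single year digit is ambiguous across decades, so `decade_start`
    is an assumption the caller must supply (2020 by default here).
    """
    if len(symbol) < 3:
        raise ValueError(f"symbol too short: {symbol!r}")
    root, code, digit = symbol[:-2], symbol[-2], symbol[-1]
    if code not in MONTH_CODES or not digit.isdigit():
        raise ValueError(f"unrecognized contract symbol: {symbol!r}")
    return root, decade_start + int(digit), MONTH_CODES[code]


parsed = [parse_contract(s) for s in ("ESH6", "ESM6", "ESU6", "ESZ6")]
print(parsed)
```

In practice the definitions metadata gives the exact expiration timestamp, which is safer than inferring it from the symbol.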

Roll rule

The rule for when to roll positions as contracts approach expiry. Examples:

roll N days before expiry.
roll on liquidity trigger (e.g., next contract overtakes front contract in volume/open interest).
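Both kinds of roll rule can be sketched in a few lines of pandas. The function names and volumes are hypothetical, and the business-day calendar below ignores exchange holidays, which a real implementation would need to handle.

```python
import pandas as pd


def roll_date_before_expiry(expiry, n_days: int, trading_days: pd.DatetimeIndex) -> pd.Timestamp:
    """Return the trading day that falls n_days trading days before expiry."""
    eligible = trading_days[trading_days < pd.Timestamp(expiry)]
    if len(eligible) < n_days:
        raise ValueError("not enough trading days before expiry")
    return eligible[-n_days]


def roll_date_on_volume(front_volume: pd.Series, next_volume: pd.Series):
    """Return the first date on which the next contract out-trades the front one."""
    crossed = next_volume > front_volume
    if not crossed.any():
        return None
    return crossed[crossed].index[0]


# Simplified calendar: weekdays only, no exchange holidays.
days = pd.bdate_range("2026-03-02", "2026-03-20")
calendar_roll = roll_date_before_expiry("2026-03-20", 5, days)

# Illustrative daily volumes around a roll.
idx = pd.bdate_range("2026-03-09", periods=4)
front = pd.Series([100, 90, 60, 30], index=idx)
nxt = pd.Series([10, 40, 70, 120], index=idx)
liquidity_roll = roll_date_on_volume(front, nxt)

print(calendar_roll.date(), liquidity_roll.date())
```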

Adjustment method

This avoids artificial jumps at roll dates. Typical choices are:

Ratio adjustment: multiplicative scaling across rolls.
Additive adjustment: constant offsets across rolls.

The app infers available contract months from ingested metadata and lets the user define and save rolling rules per parent contract. Then, it builds or updates continuous series from those definitions, which the user can inspect on charts/tables before using them in backtests.

Different methods to create continuous series come with different tradeoffs. I chose to implement a methodology that always adjusts prices backwards. This means that the current price for the front contract matches the continuous series. However, whenever a new roll enters the dataset, we need to reprocess the entire series. This takes seconds so it’s totally acceptable for me. It can also be optimized by storing adjustment factors (which currently I don’t see a need to do).
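To make the backward adjustment concrete, here is a sketch for a single roll between two contracts. `back_adjust` and the prices are illustrative assumptions; with several rolls, the same adjustment is applied cumulatively, walking backwards from the newest contract.

```python
import pandas as pd


def back_adjust(front: pd.Series, nxt: pd.Series, roll_date, method: str = "ratio") -> pd.Series:
    """Stitch two contracts at roll_date, adjusting the older prices backwards.

    Prices on and after roll_date come from the next contract unchanged,
    so the latest continuous price matches the contract currently held.
    """
    roll_date = pd.Timestamp(roll_date)
    old = front[front.index < roll_date]
    if method == "ratio":
        # Multiplicative scaling: preserves percentage returns across the roll.
        old = old * (nxt.loc[roll_date] / front.loc[roll_date])
    elif method == "additive":
        # Constant offset: preserves price differences across the roll.
        old = old + (nxt.loc[roll_date] - front.loc[roll_date])
    else:
        raise ValueError("method must be 'ratio' or 'additive'")
    return pd.concat([old, nxt[nxt.index >= roll_date]]).rename("continuous")


# Illustrative prices: the next contract trades 5% above the front contract.
dates = pd.bdate_range("2026-03-09", periods=5)
front = pd.Series([100.0, 101.0, 102.0, 103.0, 104.0], index=dates)
nxt = front * 1.05
roll = dates[2]

cont_ratio = back_adjust(front, nxt, roll, method="ratio")
cont_add = back_adjust(front, nxt, roll, method="additive")
print(pd.concat([front.rename("front"), nxt.rename("next"), cont_ratio], axis=1))
```

Without the adjustment, the series would jump by about 5% on the roll date even though no position made or lost that money.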

Hiccups and Bugs

When I started building the app, I wasn’t quite sure what to expect. This is a breakdown of my experience in this part of the experiment:

Where AI Was Strong

The AI agent trivially set up the local database and created workflows to store curated data in Parquet format.
Likewise, the web app looked decent on the first iteration.
Implementing on-demand data updates using the databento API worked straightaway, with a few bugs on date ranges that were easily fixed.
The speed to prototype and test new features and make changes in the UI is incredible.

Where AI Struggled

UI behavior was occasionally flaky. A recurring issue was certain parts of a view being hidden because the AI agent placed the block in the wrong place.
The AI agent tended to infer structure from the ingested data. Because that structure is not homogeneous across contracts, this produced noisy metadata.
Occasionally, asking the agent to fix one bug (for example, symbol normalization) made something else break (although this is pretty common when a human is writing code too).

Human Intervention

I had to intervene heavily in a few cases:

The AI agent inferred contract roll dates incorrectly in a few cases because it overgeneralized from what it had inferred about one of the contracts. This caused gaps in the continuous series for some contracts.
The first implementation of the contract stitching logic was incorrect. The error was neither obvious nor easy to detect.

Some Lessons

AI excels at structured “plumbing” tasks.
It struggles with heterogeneous domain structures.
Ambiguous specifications lead to brittle implementations. This was particularly relevant for UI-related requests. I learned that I need to be very precise when describing the behavior I wanted.

What the app looks like at this stage

In the video below, I go over the data management part of the current version of the app.

In the Next Post

Systematic trading requires good quality data. This initial step gave me a decent starting point. In the next post in this series, I’ll move from data plumbing to signal generation:

Defining trading rules
Combining rules into subsystems
Building strategies combining different instruments

Building Along

If you want to build along, these specs can be given to any AI agent.

[1] Trend-following has strong long-run empirical support across markets. See for example this, this, this and this. Trend-following also seems to be particularly helpful during crises: see this and this.

[2] The hierarchy for trading rules, subsystems and strategies was heavily inspired by this book by Robert Carver. The rest diverges because I mostly focus on binary trading signals, while his approach is based on continuous forecasts. On the site for his book, there’s a link to a python project that implements the systematic trading framework exactly as in his book.

[3] I have no affiliation with databento.

[4] See this for a discussion.