Additivity is all you need?
Arc Institute’s “MULTI-evolve” learns a classical additive model, not epistasis.
A recent paper from Arc Institute introduced MULTI-evolve, a framework for machine-learning-guided directed evolution:
Tran, Nemeth, Bartie, Chandrasekaran, Fanton, Moon, Hie, Konermann, Hsu. Rapid directed evolution guided by protein language models and epistatic interactions. Science 0, eaea1820 (2026)
The work is framed as navigating an exponentially complex search in a genotype-phenotype landscape by learning and exploiting epistasis. Arc’s announcement opens:
The search space for protein engineering grows exponentially with complexity. A protein of just 100 amino acids has \(20^{100}\) possible variants—more combinations than atoms in the observable universe.
The proposed solution centers on learning epistasis—interactions between pairs of mutations that manifest as departures from additivity of mutational effects (i.e. nonlinearity). The paper describes the MULTI-evolve neural networks as performing
epistatic modelling to predict synergistic combinations,
and as
learn[ing] the epistatic landscape from a compact dataset of double mutants and extrapolat[ing] to synergistic combinations.
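Concretely, pairwise epistasis (in notation we introduce here, not the paper’s) is whatever remains of a double mutant’s measured effect after adding up the two single-mutant effects:

\[
\epsilon_{ij} = \Delta y_{ij} - \left(\Delta y_i + \Delta y_j\right),
\]

where \(\Delta y_i\) and \(\Delta y_j\) are the activity changes of the two single mutants relative to wild type and \(\Delta y_{ij}\) is that of the double mutant. An additive model asserts \(\epsilon_{ij} = 0\) for every pair.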
The Arc website explains the learning mechanism as arising from double mutants that “reveal epistasis”, and these
pairwise interaction patterns teach models the rules for how mutations combine, enabling extrapolation to predict which 5-, 6-, or 7-mutation combinations will work synergistically.
Schematic figures in the paper, the Arc website, and accompanying media coverage depict MULTI-evolve as able to “jump” directly to hyperactive multimutant proteins across a rugged epistatic landscape.
Claims of this kind—that a neural network has learned nonlinear functions and interactions beyond what simpler models can represent—warrant scrutiny. In a critique of methodological practice in machine learning research, Lipton & Steinhardt catalog common failure modes that undermine scientific claims, including: neglecting to benchmark against simple baselines, speculative conflation of a model’s capacity to represent complex functions with evidence that it has learned them, and misattribution of empirical gains to a paper’s proposed model rather than to more mundane causes.
Lipton, Steinhardt. Troubling trends in machine learning scholarship. arXiv:1807.03341 (2018)
In discussions about MULTI-evolve with UW CSE student Gian Marco Visani, we noticed that the paper reports no baseline comparisons to substantiate the claim of learning epistasis. Additive models (linear models) are the key baseline here because, by definition, they have no capacity to represent epistasis: an additive model assumes the activity change of a multimutant protein is simply the sum of the activity changes of its individual mutations. We set out to perform this neglected baseline analysis using the code and data from the original paper, enlisting Aayush Verma for help along the way. As we report in a new preprint, we found no evidence that MULTI-evolve learns epistasis; instead, we conclude that it trains a neural network to recapitulate a classical additive model (linear regression).
Visani, Verma, DeWitt. Additive baselines furnish no evidence for epistasis learning by MULTI-evolve. bioRxiv 2026.04.23.719915 (2026).
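To make the baseline concrete, here is a minimal sketch of an additive model of this kind. The mutation names and activity values are hypothetical, and this is an illustration of the idea rather than the exact pipeline from our preprint:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Encode each variant by which single mutations it carries; an additive model
# is then just linear regression on that binary encoding, so the predicted
# effect of any combination is the sum of per-mutation coefficients.
mutations = ["A14S", "K41N", "E112D"]        # hypothetical mutation vocabulary
variants = [
    ({"A14S"}, 0.8),                          # (mutations carried, activity change)
    ({"K41N"}, 1.3),
    ({"E112D"}, 0.5),
    ({"A14S", "K41N"}, 2.0),
    ({"K41N", "E112D"}, 1.9),
]

X = np.array([[m in muts for m in mutations] for muts, _ in variants], dtype=float)
y = np.array([dy for _, dy in variants])

additive = LinearRegression().fit(X, y)       # one coefficient per mutation

# The additive prediction for the triple mutant is the intercept plus the sum
# of the three fitted single-mutation effects; there is no term through which
# one mutation can modify another's effect.
triple = np.array([[1.0, 1.0, 1.0]])          # A14S + K41N + E112D
print(additive.predict(triple)[0])
```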
The neural network did learn a rule for how mutations combine: addition. The schematic depictions of MULTI-evolve traversing a rugged epistatic landscape are misleading: the surface MULTI-evolve learns is flat, with no synergy or antagonism represented. Ranking single-mutation effects and stacking them is all you need to recapitulate MULTI-evolve’s multimutant engineering, a standard strategy for at least four decades:
Wells. Additivity of mutational effects in proteins. Biochemistry 29, 8509–8517 (1990)
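As a toy illustration of that strategy (with hypothetical single-mutant effects, not code from either the paper or our preprint), ranking and stacking amounts to a few lines:

```python
# Hypothetical measured activity changes for candidate single mutations.
single_effects = {"A14S": 0.8, "K41N": 1.3, "E112D": 0.5, "G77R": 1.1}

def stack_top_k(effects, k, wildtype_activity=1.0):
    """Pick the k largest single-mutation effects and predict the combination additively."""
    top = sorted(effects, key=effects.get, reverse=True)[:k]
    return top, wildtype_activity + sum(effects[m] for m in top)

print(stack_top_k(single_effects, k=3))  # top-3 stack: K41N, G77R, A14S; predicted activity ≈ 4.2
```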
Our argument turns on three lines of evidence.
First, MULTI-evolve’s neural network produces multimutant predictions almost perfectly correlated (\(r > 0.999\)) with those of an additive model fit to identical training data. Below we show the results for APEX peroxidase engineering; our preprint shows similar results for dCasRx and HuABC2, where we also find that binding-expression Pareto frontiers are practically identical between MULTI-evolve and an additive model.
Second, we found that MULTI-evolve does not outperform an additive baseline on held-out test data, and does not even represent epistasis in its own training data, much less extrapolate it to higher-order variant predictions (see preprint for details).
Third, the MULTI-evolve paper presented a benchmarking study on publicly available deep mutational scanning (DMS) data from ProteinGym, finding that multimutant prediction improved when double mutants were added to a training set initially composed only of single mutants, and improved further still upon adding triple mutants. This trend was interpreted as evidence that MULTI-evolve is learning epistasis; however, we noted that it does not control for the fact that adding variants increases the training set size. We show that the reported trend of improvement is expected even under a null additive model due to elementary statistical effects, and we even fit an additive model (which has no capacity to represent epistasis) to the ProteinGym data, reproducing the same trend (see preprint for details).
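To illustrate the statistical point, here is a toy simulation of our own devising, with arbitrary parameters rather than the actual ProteinGym analysis in the preprint. The ground truth is purely additive, yet held-out multimutant prediction still improves as doubles and then triples are added to the training set, simply because the training set grows and covers more mutations:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_sites, noise = 50, 0.5

# Purely additive ground truth: each mutation has a fixed effect, and a
# variant's activity is the sum of its mutations' effects plus noise.
true_effects = rng.normal(size=n_sites)

def sample_variants(n, order):
    """Draw n random variants that each carry exactly `order` mutations."""
    X = np.zeros((n, n_sites))
    for row in X:
        row[rng.choice(n_sites, size=order, replace=False)] = 1.0
    y = X @ true_effects + rng.normal(scale=noise, size=n)
    return X, y

X1, y1 = sample_variants(80, 1)            # "single mutants"
X2, y2 = sample_variants(80, 2)            # "double mutants"
X3, y3 = sample_variants(80, 3)            # "triple mutants"
X_test, y_test = sample_variants(500, 6)   # held-out higher-order variants

training_sets = {
    "singles only": (X1, y1),
    "singles + doubles": (np.vstack([X1, X2]), np.concatenate([y1, y2])),
    "singles + doubles + triples": (np.vstack([X1, X2, X3]), np.concatenate([y1, y2, y3])),
}
for label, (X, y) in training_sets.items():
    model = LinearRegression().fit(X, y)   # additive model: no epistasis anywhere
    print(f"{label:>27}  test R^2 = {model.score(X_test, y_test):.3f}")
```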
MULTI-evolve learns a classical additive model, not epistasis.
We share the MULTI-evolve authors’ excitement about the potential for machine learning to accelerate protein engineering, but call for a sober approach: run the simple baselines before making the claims.
We also emphasize that our conclusions are about MULTI-evolve, not protein genotype-phenotype landscapes in general. Epistasis is out there, and ML models can, in principle, learn it!