
This post explores mathematical aspects of recovering mutation rate histories from an ancestral recombination graph (ARG) vs. a sample frequency spectrum (SFS), expanding on a recent collaborative paper:

Deng, DeWitt, Song, Nielsen. A general framework for branch length estimation in Ancestral Recombination Graphs. Proceedings of the National Academy of Sciences 122(48): e2504461122 (2025).


Background

The germline mutation rate $\mu(t)$ (per site per generation) may vary across evolutionary time $t$ (generations before present) due to variation in DNA replication and repair mechanisms, environmental effects, or life history traits. Inferring this temporal variation from population genomic data provides insights into mutagenic processes and helps calibrate molecular clocks. Below I investigate the observability of mutation rate histories from genetic data, comparing a previous inference framework based on the sample frequency spectrum (SFS) with our new approach based on the ancestral recombination graph (ARG).

In an ARG, mutations are assigned to branches, but the precise times of mutation events along each branch are unknown (they are bounded only by the branch start and end times), so one can’t directly read off a Poisson mutation intensity from mutation age data. The method Relate (Speidel et al., 2019) has a heuristic for estimating mutation rate histories (marginally for each interval in discretized time) from mutation counts on ARG branches, but it implicitly assumes the rate is constant over each individual branch. This heuristic is therefore valid only for mutation rate histories that are globally constant, and it distorts inhomogeneous mutation rate histories even in the large data limit. Inhomogeneous mutation rate histories induce correlated estimates across the time domain, which calls for a joint inference approach.

In a later section I’ll derive a likelihood for $\mu(t)$ conditioned on an ARG. The approach is implemented in the POLEGON software accompanying Deng et al. (2025), but originally took shape in notes and a prototype implementation I developed over 5 years ago after a visit to U Oregon hosted by Peter Ralph to give a seminar on my mushi project at the Institute of Ecology and Evolution.

Side note: Getting a seminar invite as a student was a fantastic experience engaging with a broader scientific community. It was a highlight of my time as a PhD student and helped me build connections and meet future collaborators. Why don’t we all do this more often?

Peter’s mathematical innovations for ARGs inspired me to investigate whether my mushi framework could be ported from the SFS to the ARG. It turns out it could, but my writeup and prototype sat neglected for years until finding a natural home in Yun Deng’s project on branch length estimation in ARGs. Not everything made sense to include in the POLEGON paper, so this post presents my linear systems perspective on mutation rate history observability from an ARG vs. from an SFS.

SFS-based inference of mutation rate history

Initial approaches to mutation rate history inference relied on the sample frequency spectrum (SFS), which summarizes genetic variation by counting how many variants appear at each sample frequency. This provides a natural point of comparison for an ARG-based approach, which I will describe after first elaborating the SFS setting.

My main PhD project with Kelley Harris was developing SFS-based inference of mutation history:

DeWitt, Harris, Ragsdale, Harris. Nonparametric coalescent inference of mutation spectrum history and demography. Proceedings of the National Academy of Sciences 118(21): e2013798118 (2021).

We obtained a theorem showing that the expected SFS $\boldsymbol\xi$ is a certain linear transform of the mutation rate history function $\mu(t)$:

\[\boldsymbol\xi = \mathcal{L}\mu, \label{eqn:sfs-linear}\]

where the finite-rank linear operator $\mathcal{L}$ is parameterized by the population size history $N(t)$ via the coalescent process (an explicit construction is provided in the proof in the paper, as well as a finite-dimensional representation for piecewise constant $\mu(t)$ and $N(t)$). The software mushi accompanying the paper formulates mutation rate inference as a linear inverse problem. While this approach has proven useful, the SFS-based formulation suffers from spectral degeneracies—many different mutation rate histories can produce nearly identical site frequency spectra, requiring strong regularization to obtain stable inference.

The linearity of \eqref{eqn:sfs-linear} is appealing (compared with the nonlinear problem of demographic inference from the SFS) because we can draw on powerful tools from the theory of linear systems—especially spectral analysis on the operator $\mathcal{L}$—to understand the information contained in the SFS about mutation rate history.

A perturbative approach to Poisson noise

A technicality is that we typically use a count-based noise model for the SFS—the Poisson random field, for which variance depends on mean—rather than the additive Gaussian noise of classical linear systems theory. We can calibrate for Poisson noise using a perturbative approach. Suppose we have a baseline mutation rate history $\mu_0(t)$ producing an expected SFS $\boldsymbol\xi_0 = \mathcal{L}\mu_0$. For example, we might take $\mu_0(t) = \mu_0$ to be a constant function, in which case $\boldsymbol\xi_0$ is the standard expected SFS under demography $N(t)$, as in:

Rosen, Bhaskar, Roch, Song. Geometry of the sample frequency spectrum and the perils of demographic inference. Genetics 210(2): 665-682 (2018).

If $N(t) = N_0$ is also constant, then $\boldsymbol\xi_0$ has a classical (and very simple) analytical form due to Fu (1995):

\[\xi_{0,i} = \frac{4N_0\ell\mu_0}{i}, \quad i = 1, \ldots, n-1,\]

where $\ell$ is the genome length (in base pairs) and $n$ is the sample size. In the example below, I will work with this constant population and constant mutation rate baseline for concreteness in numerical examples and plots, but the approach generalizes to arbitrary $\mu_0(t)$ and $N(t)$.
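
For concreteness, here is a minimal numpy sketch of this baseline expected SFS, using the parameter values from the example later in the post (variable names are my own):

```python
import numpy as np

# Baseline parameters (the same values used in the example below)
N0 = 1e4        # constant diploid population size
mu0 = 1.25e-8   # mutation rate per base pair per generation
ell = 1e8       # genome length in base pairs
n = 200         # sample size (haplotypes)

# Fu (1995): expected SFS under constant N(t) = N0 and mu(t) = mu0,
#   xi_{0,i} = 4 * N0 * ell * mu0 / i,  i = 1, ..., n - 1
i = np.arange(1, n)
xi0 = 4 * N0 * ell * mu0 / i
```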

Now consider a perturbation to the mutation rate history: $\mu(t) = \mu_0(t) + \epsilon\mu_1(t)$, where $\epsilon$ is a small parameter. This perturbation represents a small modulation of the mutation rate about the baseline $\mu_0(t)$. Using \eqref{eqn:sfs-linear}, the corresponding perturbed expected SFS is $\boldsymbol\xi_0 + \Delta\boldsymbol\xi$ with

\[\Delta\boldsymbol\xi = \epsilon\mathcal{L}\mu_1. \label{eqn:perturb}\]

We now rescale the Poisson noise to unit variance by normalizing each component of \eqref{eqn:perturb} by the square root of its expected value under the baseline mutation rate history:

\[\widetilde{\Delta\boldsymbol\xi} = \mathrm{diag}(\boldsymbol\xi_0)^{-1/2} \ \Delta\boldsymbol\xi = \epsilon \ \mathrm{diag}(\boldsymbol\xi_0)^{-1/2} \ \mathcal{L}\mu_1 = \epsilon \ \widetilde{\mathcal{L}}\mu_1, \label{eqn:perturb-normalized}\]

where we define the normalized linear operator $\widetilde{\mathcal{L}} = \mathrm{diag}(\boldsymbol\xi_0)^{-1/2} \ \mathcal{L}$. We can now treat $\widetilde{\Delta\boldsymbol\xi}$ as being corrupted by additive Gaussian noise with unit variance in each component, which is a standard setting for linear inverse problems.
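
In code, given any finite-dimensional representation `L` of the operator and the baseline expectation `xi0`, this rescaling is a single row-wise diagonal scaling; a minimal sketch (array names are my own):

```python
import numpy as np

def normalize_operator(L, xi0):
    """Left-multiply L by diag(xi0)^(-1/2), so that Poisson noise about the
    baseline expectation xi0 has approximately unit variance per component."""
    return L / np.sqrt(xi0)[:, None]
```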

Spectral analysis of mutation rate observability

Taking the singular value decomposition (SVD) of $\widetilde{\mathcal{L}}$ gives singular values that reflect the effective signal-to-noise ratio (SNR) of each mode under Poisson noise about the baseline $\mu_0(t)$:

\[\widetilde{\mathcal{L}} = \mathbf{U} \boldsymbol\Sigma \boldsymbol{\mathcal{V}}^T, \label{eqn:svd}\]

with an orthonormal basis $\boldsymbol{\mathcal{V}}$ of “modes” of mutation rate history (the right singular vectors, which live in the time domain) and their corresponding signatures $\mathbf{U}$ in the SFS (the left singular vectors, which live in the sample frequency domain), along with singular values $\boldsymbol\Sigma$ that quantify how much each mode is amplified or attenuated by the linear operator $\widetilde{\mathcal{L}}$.

Using our transformation of the noise process, we can interpret the singular values as effective SNRs for each mode of mutation rate history, so that modes with singular values below unity are corrupted by noise. We can therefore define an effective rank $k$ of the operator $\widetilde{\mathcal{L}}$ as the number of singular values exceeding unity, indicating the number of mutation rate history modes from $\boldsymbol{\mathcal{V}}$ that can be reliably recovered from SFS data corrupted by Poisson noise about the baseline $\mu_0(t)$.
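
In code, the effective rank is just a count of singular values exceeding the unit noise floor; a minimal sketch:

```python
import numpy as np

def effective_rank(L_tilde, threshold=1.0):
    """SVD of the normalized operator and the number of singular values
    exceeding the noise floor (unity after the Poisson rescaling)."""
    U, sigma, Vt = np.linalg.svd(L_tilde, full_matrices=False)
    k = int(np.sum(sigma > threshold))
    # Rows 0..k-1 of Vt span the reliably observable modes of mu(t);
    # the remaining rows span the effective null space.
    return k, sigma, Vt
```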

The effective null space of $\widetilde{\mathcal{L}}$ (the span of right singular vectors with singular values below unity) represents mutation rate history perturbations that are effectively invisible in the SFS data modality, because they are overwhelmed by Poisson noise. These define directions in solution space that inference is blind to.

ARG-based inference of mutation rate history

I’ll now derive a likelihood for $\mu(t)$ conditioned on an ARG, overloading some notation from the previous sections to emphasize conceptual analogies with the SFS setting. Consider a branch indexed $i \in \{1, \dots, n\}$ (out of $n$ total branches) with start time $a_i$, end time $b_i > a_i$, and genomic span $s_i$. The number $X_i$ of mutations that occur on branch $i$ is a Poisson random variable $X_i \sim \mathrm{Pois}(\xi_i)$, where

\[\xi_i = \mathbb{E}[X_i\mid\mu] = s_i\int_{a_i}^{b_i}\mu(t)\mathrm{d}t \label{eqn:branch-intensity}\]

is the Poisson intensity measure of mutations on branch $i$. In functional analysis notation, we can write

\[\boldsymbol\xi = \mathcal{L}\mu, \label{eqn:arg-linear}\]

where the finite-rank linear operator $\mathcal{L}$ maps mutation rate histories $\mu(t)$ to the vector of expected ARG branch mutations $\boldsymbol\xi$ via the integrals in \eqref{eqn:branch-intensity}.

Discretizing the time domain

In practice we need to make this problem finite dimensional for computations by representing $\mu(t)$ as a piecewise constant function on a dense time grid. Let

\[\boldsymbol{\mu} = \begin{bmatrix} \mu_1\\ \vdots\\ \mu_m \end{bmatrix}\]

denote the mutation rate in each of the $m$ pieces, bounded by the fixed time grid

\[\mathbf{t} = \begin{bmatrix} t_0\\ \vdots\\ t_m \end{bmatrix},\]

with $t_0=0$ and $t_m=\infty$. The discrete analog of equation \eqref{eqn:arg-linear} is

\[\begin{equation} \label{eqn:foward_discrete} \boldsymbol{\xi} = \mathbf{L}\boldsymbol{\mu}, \end{equation}\]

where the discretized forward operator $\mathbf{L}$ is the $n\times m$ matrix with elements

\[\begin{equation} \label{eqn:L} L_{ij} = \begin{cases} 0 & \text{if } a_i > t_j \text{ or } t_{j-1} > b_i,\\ s_i\left(\min(b_i, t_j) - \max(a_i, t_{j-1})\right) & \text{otherwise}, \end{cases} \end{equation}\]

which measures the intersection of grid piece $j$ with branch $i$.
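
A minimal numpy sketch of this construction, assuming arrays `a`, `b`, and `s` of branch start times, end times, and genomic spans, and a time grid `t` with $t_0 = 0$ and $t_m = \infty$ (in practice $\mathbf{L}$ is very sparse, so sparse storage would be preferable for a large ARG):

```python
import numpy as np

def arg_operator(a, b, s, t):
    """Discretized forward operator L with one row per branch and one column
    per time grid piece.

    a, b : branch start and end times (b > a), arrays of length n
    s    : branch genomic spans in base pairs, array of length n
    t    : time grid of length m + 1 with t[0] = 0 and t[-1] = np.inf
    """
    # Overlap of branch interval [a_i, b_i] with grid piece [t_{j-1}, t_j],
    # clipped at zero when the two intervals do not intersect.
    lower = np.maximum(a[:, None], t[None, :-1])  # max(a_i, t_{j-1})
    upper = np.minimum(b[:, None], t[None, 1:])   # min(b_i, t_j)
    overlap = np.clip(upper - lower, 0.0, None)
    return s[:, None] * overlap
```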

Spectral analysis of mutation rate observability

Just as before, we can consider a perturbation to the mutation rate history about a baseline $\mu_0(t)$, and normalize the perturbed expected data to unit variance under Poisson noise. The key difference is that now the data vector $\boldsymbol\xi$ represents expected mutation counts on branches of the ARG, rather than entries of the SFS, and the ambient rank of the operator $\mathcal{L}$ is much larger than that of the corresponding SFS-based operator (the number of branches in the ARG is typically orders of magnitude larger than the number of SFS entries for moderate sample sizes). The effective rank $k$ of the normalized operator $\widetilde{\mathcal{L}}$ (again obtained by taking only singular values larger than unity) reflects the number of mutation rate history modes that can be reliably recovered from ARG mutation data. The effective rank and null space of the ARG-based operator $\widetilde{\mathcal{L}}$ can be compared directly with those of the SFS-based operator described above, providing a mathematical account of why mutation rate histories are far more observable from an ARG than from an SFS.

Example

For simplicity I’ll work with constant population size $N_0 = 10^4$, constant mutation rate baseline $\mu_0 = 1.25 \times 10^{-8}$ per base pair per generation, genome length $\ell = 10^8$ base pairs, and sample size $n = 200$ haplotypes. Using these population genetic parameters, and a discretized time domain with $m = 200$ logarithmically spaced pieces, I constructed the SFS-based operator $\widetilde{\mathcal{L}}$ as above (first using the mushi API to construct $\mathcal{L}$). I then used msprime with the above parameters and a recombination rate $\rho = 1.25 \times 10^{-8}$ per base pair per generation to simulate an ARG, and constructed the corresponding ARG-based operator $\widetilde{\mathcal{L}}$. The simulated ARG contained over a quarter million marginal trees and almost a million branches.
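
Here is a rough sketch (not the exact code behind the figures) of how the ARG-based operator could be assembled from such a simulation using msprime and tskit, reusing the `arg_operator` and `normalize_operator` helpers sketched above; the time grid bounds and random seed are arbitrary illustrative choices, and a full-size run is memory hungry, so shrink `sequence_length` for a quick test:

```python
import msprime
import numpy as np

# Simulate an ARG under the example's parameters
# (100 diploid individuals = 200 haplotypes).
ts = msprime.sim_ancestry(
    samples=100,
    population_size=1e4,
    sequence_length=1e8,
    recombination_rate=1.25e-8,
    random_seed=1,
)

# Treat each edge of the tree sequence as a branch with start/end times and span.
node_time = ts.tables.nodes.time
edges = ts.tables.edges
a = node_time[edges.child]    # branch start times (generations ago)
b = node_time[edges.parent]   # branch end times
s = edges.right - edges.left  # genomic spans (base pairs)

# Logarithmically spaced time grid with m = 200 pieces, open at infinity.
m = 200
t = np.concatenate([[0.0], np.logspace(0, 5, m - 1), [np.inf]])

L = arg_operator(a, b, s, t)           # dense here; sparse storage scales better
xi0 = L @ np.full(m, 1.25e-8)          # expected mutations per branch at baseline
L_tilde = normalize_operator(L, xi0)   # unit-variance Poisson noise scaling
```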

Performing SVD on both operators yields the singular value spectra depicted below.

Singular value spectra for SFS- and ARG-based operator $\widetilde{\mathcal{L}}$ under Poisson noise about a constant mutation rate baseline.

The singular values decay much more rapidly for the SFS-based operator than for the ARG-based operator, and far fewer singular values exceed the noise floor (dashed black line). This directly demonstrates that many more modes of mutation rate history can be recovered in principle from ARG mutation data than from an SFS derived from the very same sample data.

Next we can visualize the time-domain singular vectors for each operator, which constitute an orthonormal eigenbasis for mutation rate history recovery from each data modality, taking only those basis functions not corrupted by Poisson noise.

Time-domain singular vectors for SFS- and ARG-based operator $\widetilde{\mathcal{L}}$ corresponding to singular values that exceed the noise floor.

The SFS-based eigenbasis (left) contains only a few low-frequency modes, reflecting the strong regularization (low-pass filtering) intrinsic to the SFS data modality. In contrast, the ARG-based eigenbasis (right) contains many more modes, including high-frequency modes that enable recovery of more complex mutation rate histories with localized features.

Projecting a pulse perturbation onto the SFS and ARG eigenbases

To illustrate the observability of mutation rate history perturbations from each data modality, let’s consider a mutation rate history $\mu(t)$ consisting of a baseline mutation rate $\mu_0$ plus a square pulse perturbation of amplitude $0.01\mu_0$. We consider a pulse at three different times in the past: recent, ancient, and very ancient.

The recoverable mutation history from each pulse in each data modality is obtained by projecting the pulse perturbation onto the corresponding eigenbasis.
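
A minimal sketch of this projection, reusing the singular vectors `Vt` and effective rank `k` from the SVD sketch above (the pulse placement here is an arbitrary illustrative choice, not the post’s exact pulses):

```python
import numpy as np

def project_onto_observable_modes(mu1, Vt, k):
    """Project a perturbation mu1 (values on the m grid pieces) onto the k
    modes whose singular values exceed the noise floor."""
    V_k = Vt[:k]                 # retained right singular vectors
    return V_k.T @ (V_k @ mu1)   # reconstruction within the observable subspace

# Example: a square pulse of amplitude 0.01 * mu0 over a block of grid pieces.
mu0 = 1.25e-8
mu1 = np.zeros(200)
mu1[50:70] = 0.01 * mu0
```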

Reconstruction of mutation rate history perturbations consisting of pulses at different times in the past, using SFS- and ARG-based eigenbases.

The SFS-based reconstructions (blue) are heavily smoothed due to the low effective rank of the SFS-based operator, the recent and very ancient pulses are especially poorly observed, and localized features (the pulse edges) are spread out in the time domain as oscillations. In contrast, the ARG-based reconstructions (orange) capture the pulse perturbations with high fidelity at all three time points, accurately localizing each mutation rate change.

Invisible directions in solution space

To further illustrate the difference in observability of mutation rate histories from SFS vs. ARG data, we can visualize the effective null spaces of each operator. These correspond to mutation rate history perturbations that are invisible because they are overwhelmed by Poisson noise. We can generate random samples from each null space by taking random linear combinations of the poorly determined modes (those with singular values below unity) in each eigenbasis, and adding these to the pulse perturbation above.
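
A minimal sketch of this null-space sampling, again reusing `Vt` and `k` from the SVD sketch above (the perturbation scale is an arbitrary illustrative choice):

```python
import numpy as np

def sample_effective_null_space(Vt, k, n_samples=20, scale=1.25e-10, seed=0):
    """Random perturbations confined to the effective null space: linear
    combinations of the modes whose singular values fall below unity."""
    rng = np.random.default_rng(seed)
    V_null = Vt[k:]  # poorly determined modes (rows of Vt below the noise floor)
    coeffs = rng.standard_normal((n_samples, V_null.shape[0]))
    return scale * coeffs @ V_null  # each row is an effectively invisible perturbation
```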

20 random mutation rate history perturbations in the effective null spaces of SFS- and ARG-based operators, all producing virtually identical expected data.

We can see directly the superiority of the ARG: the only directions in solution space that are invisible from ARG data are very high-frequency and beyond the coalescent horizon, whereas the SFS-based null space contains many low-frequency directions that substantially distort mutation rate history estimates over the entire time domain.

Conclusion

This analysis makes concrete why mutation rate histories are far more observable from an ARG than from the SFS. For practitioners, this makes inference much better behaved: because the noise-corrupted modes are much better separated in the ARG setting, only generic and weak regularization is needed to suppress solution irregularities and render the ill-posed inverse problem numerically stable, without imposing strong prior assumptions on the solution.
