
Genome Rearrangement

“Nothing in biology makes sense except in the light of combinatorics.”
— Some famous geneticist, probably


Biological Sampling

In the early 1900s, decades before the double helix, scientists tracking fruit-fly and maize chromosomes observed that whole segments of DNA can break, fuse, flip, or otherwise swap locations. As opposed to point mutations, which are single-letter “typos” in the genetic code, these genome rearrangements are large, chromosome-scale edits. Examples of such mutations are shown below, where each arrow is a single gene in the genome, its direction indicating which strand of DNA contains the coding sequence.

Four-panel diagram showing genome rearrangement operations: (i) cut operation splits a linear chromosome into two linear chromosomes; (ii) join operation connects two free ends of a linear chromosome to form a circular chromosome; (iii) cut-join takes a linear chromosome and cuts it into a circular chromosome and a linear chromosome; (iv) double-cut-and-join shows a linear chromosome doubling back on itself to rearrange its gene order in one step.
Four types of genome rearrangements: (i) cut splits an adjacency; (ii) join connects two free ends; (iii) cut-join combines a cut and a join in a single step; (iv) double‑cut‑and‑join makes two cuts and reconnects the four ends.

In addition to their role in driving evolution and speciation, such mutations can have major effects on health. In a process called chromothripsis, for example, chromosomes can shatter and reassemble in a single catastrophic event, sometimes leading to cancer or congenital disease.

Regardless of the specific biological context, we are concerned with the same central question: can we “replay history”? That is, given an ancestor genome $A$ and a descendant genome $D$, can we determine the sequence of mutations that transformed $A$ into $D$? Guided by Occam’s Razor (“pick the simplest explanation”), we focus only on the shortest possible sequences of mutations that convert $A$ into $D$; we call such sequences the most parsimonious scenarios (MPSs) from $A$ to $D$.

It turns out, even for relatively simple genomes, that there are absurdly many MPSs from $A$ to $D$—far too many to feasibly compute. Hunting for “the one true history” is hopeless. We thus take a statistical lens, asking:

Out of all the MPSs from $A$ to $D$, can we pick one uniformly at random?

If yes, then we can treat the set of all MPSs as a population from which we can sample. We could estimate how often particular biological features occur (circular intermediates, breakpoint reuse, etc.), build empirical distributions, and run standard hypothesis tests. As we will see below, however, it isn’t always easy to pick something “at random.”


From Sampling to Counting

Decision tree diagram with probability-weighted branches showing fractions like 3/8, 1/8, 4/8, rather than uniform probabilities at each step.
We can imagine each MPS as a branch in a decision tree. Weighting each branch by the fraction of completions beneath it guarantees that each MPS is equally likely to be chosen. Here, the product of probabilities along each branch is 1/8.

We might be tempted to just simulate random histories: pick a mutation that brings us closer to $D$ uniformly at random, then repeat until we reach $D$. Unfortunately, this naive random walk is biased: branches with relatively few continuations are over-sampled. We need to weight our choices unequally in order to generate unbiased samples (see figure).
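To see the bias concretely, here is a small sketch on a hypothetical decision tree (not a real genome instance): the root has one branch with a single completion and one branch with three, so the naive uniform walk reaches the lone completion with probability 1/2 rather than the fair 1/4.

```python
from fractions import Fraction

# Hypothetical decision tree: keys are nodes, values are child lists;
# leaves (empty child lists) stand in for completed MPSs.
tree = {"root": ["a", "b"], "a": [], "b": ["b1", "b2", "b3"],
        "b1": [], "b2": [], "b3": []}

def leaf_probs(node, prob=Fraction(1)):
    """Probability of reaching each leaf under the naive uniform walk."""
    children = tree[node]
    if not children:
        return {node: prob}
    out = {}
    for child in children:
        out.update(leaf_probs(child, prob / len(children)))
    return out

probs = leaf_probs("root")
# The lone completion "a" gets probability 1/2; each of b1, b2, b3 gets 1/6.
print(probs)
```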

Luckily, because our problem is self-reducible, there is a standard way to do this. Specifically, let

$$\#\mathrm{MPS}(A,D) = \text{the number of MPSs from } A \text{ to } D.$$

If we can compute this value, then for any mutation $m$ we can also compute $\#\mathrm{MPS}(A,D\mid m)$, the number of MPSs that start with $m$, simply by applying our $\#\mathrm{MPS}$ formula to the genomes $A+m$ and $D$. We can thus simulate random histories by picking our first mutation $m$ with probability

$$\mathrm{P}[\text{choose } m] = \frac{\#\mathrm{MPS}(A,D \mid m)}{\#\mathrm{MPS}(A,D)},$$

and recursing until we reach $D$. In short: we only need the counting formula $\#\mathrm{MPS}(A,D)$ to create a uniform sampler.
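As a sketch of this self-reduction, the toy sampler below (with a stand-in tree instead of real genomes, and a brute-force counting oracle in place of a $\#\mathrm{MPS}$ formula) weights each branch by the number of completions beneath it, which makes every completion equally likely:

```python
import random

# Stand-in instance: nodes with child lists; leaves are completed scenarios.
tree = {"root": ["a", "b"], "a": [], "b": ["b1", "b2", "b3"],
        "b1": [], "b2": [], "b3": []}

def count(node):
    """Counting oracle: number of completions (leaves) below `node`."""
    children = tree[node]
    return 1 if not children else sum(count(c) for c in children)

def sample_uniform(node):
    """Walk to a leaf, choosing each branch with probability
    count(branch) / count(node), as in the formula above."""
    while tree[node]:
        children = tree[node]
        node = random.choices(children, weights=[count(c) for c in children])[0]
    return node

random.seed(0)
draws = [sample_uniform("root") for _ in range(4000)]
# Each of the four completions now appears about 1/4 of the time.
```

In practice the expensive part is the oracle itself; the walk only ever makes polynomially many calls to it.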


Computational Complexity

Counting and sampling algorithms are only useful if they can be evaluated on real inputs. For us, this means we want a way to compute $\#\mathrm{MPS}(A,D)$ in polynomial time, loosely meaning that the computation time required to analyze a genome of size $n$ is bounded by $n^k$ for some fixed $k$. This ensures that the runtime scales in a controlled way as $n$ increases. By contrast, we want to avoid super-polynomial-time algorithms (e.g. exponential runtime $2^n$), where even a modest increase in $n$ renders computation impractical.
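As quick arithmetic (with illustrative sizes, not real genome data): doubling $n$ multiplies a cubic bound $n^3$ by only 8, while it squares an exponential bound $2^n$.

```python
# Compare a polynomial bound (n^3) with an exponential one (2^n).
for n in (10, 20, 40):
    print(f"n={n:>2}  n^3={n**3:>6}  2^n={2**n}")
# Already at n=40 the exponential bound dwarfs the cubic one.
```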

The most famous framing of complexity is undoubtedly the P versus NP problem, a Millennium Prize Problem which carries a million‑dollar reward if you can solve it. This question is concerned with two classes of decision problems:

$\textsf{P}$: problems that can be solved in polynomial time.
$\textsf{NP}$: problems whose “yes” answers can be verified in polynomial time, given a suitable certificate.

For example, “Is there an MPS from $A$ to $D$?” is a class $\textsf{P}$ problem: if $A$ and $D$ have different genes, then no amount of genome rearrangement will make them match, so “no”; if $A$ and $D$ have the same genes, then our naive, biased simulation above will build an MPS, so “yes.” Every $\textsf{P}$ problem is also in class $\textsf{NP}$; indeed, given a proposed MPS from $A$ to $D$, it takes only polynomial time to verify that it is truly an MPS (just perform the proposed mutations and check the result).

The problem of computing $\#\mathrm{MPS}(A,D)$, though, is a counting problem, not a yes/no decision problem. Specifically, it counts the number of solutions to an $\textsf{NP}$ problem. This places our problem in the class $\#\textsf{P}$, and we hope it also lies in $\textsf{FP}$, the class of function problems solvable in polynomial time.

It is widely believed that these two classes are different and that, regardless of how clever we are, some $\#\textsf{P}$ problems simply cannot be solved in polynomial time.


SCaJ is #P-Complete

So is it polynomial-time solvable? Well… that depends on which mutation operations we include in our model. At the top of the page we gave four examples of mutations: (i) cuts, (ii) joins, (iii) cut-joins, and (iv) double‑cut‑and‑joins. This gives us a natural ordering of three models:

SCoJ (single cut-or-join): only cuts and joins are allowed.
SCaJ (single cut-and-join): cuts, joins, and cut-joins are allowed.
DCJ (double cut and join): all four operations are allowed.

It is known that $\#\mathrm{MPS}(A,D)$ can be computed in polynomial time in the simplest SCoJ model, while it is conjectured that it cannot be computed in polynomial time in the more complex DCJ model (assuming $\textsf{FP}\neq \#\textsf{P}$). This tension motivated us to ask: what about the “middle” SCaJ model? Biologists must balance biological relevance with computational tractability when choosing which model to use in their analyses: if DCJ is too complex for these calculations, what about SCaJ?

Answer: computing $\#\mathrm{MPS}(A,D)$ in the SCaJ model is $\#\textsf{P}$-complete. Beyond not being computable in polynomial time (assuming $\textsf{FP}\neq \#\textsf{P}$), the “-complete” suffix indicates that this problem is among the hardest problems in $\#\textsf{P}$: given an efficient algorithm to compute $\#\mathrm{MPS}(A,D)$, we could efficiently adapt that algorithm to solve any other $\#\textsf{P}$ problem. See Bailey, et int., Wiedemann 2024 for more details (or our preceding COCOON 2023 conference paper).


Fixed-Parameter Tractability

Three-panel diagram showing ancestor genome $A$ (top), adjacency graph (middle), and descendant genome $D$ (bottom).
The differences between an ancestor genome (top) and descendant genome (bottom) can be accentuated using a graph, where vertices are defined by the adjacencies and endpoints of each genome, and edges are defined by the intersection of vertex labels.

While $\#\textsf{P}$-completeness rules out a fast algorithm for computing $\#\mathrm{MPS}(A,D)$ in the SCaJ model for arbitrary inputs, it does not preclude a fast algorithm on “structured” inputs. Indeed, we have been able to show that many biologically reasonable instances fall into a regime where exact counting can still be done in polynomial time.

Informally, there is a standard way to translate the differences between $A$ and $D$ into an adjacency graph (see figure for a loose definition), effectively consolidating the data about how $A$ and $D$ differ into a single object. If the number $k$ of components in that adjacency graph is small, then we can compute $\#\mathrm{MPS}(A,D)$ in polynomial time. This means our problem is fixed-parameter tractable with respect to this structural parameter $k$. Because the genomic distance bounds $k$, the problem is also fixed-parameter tractable with respect to the number of mutations needed to transform $A$ into $D$. To the best of our knowledge, this is the first explicit use of parameterized complexity in the genome rearrangement literature.
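To make the parameter $k$ concrete, here is a minimal sketch of counting adjacency-graph components. The encoding is an assumption made for illustration: a genome is a list of adjacencies, each a set of one or two gene extremities such as `"1h"` (head of gene 1) or `"2t"` (tail of gene 2), with singleton sets playing the role of telomeres. This is not the paper's exact formalism.

```python
# Toy genomes over genes 1, 2, 3 (extremity "1h" = head of gene 1, etc.).
A = [frozenset({"1h", "2t"}), frozenset({"2h", "3t"}),
     frozenset({"1t"}), frozenset({"3h"})]
D = [frozenset({"1h", "3t"}), frozenset({"3h", "2t"}),
     frozenset({"1t"}), frozenset({"2h"})]

def component_count(A, D):
    """Vertices are the adjacencies/telomeres of A and D; an edge joins
    an A-vertex and a D-vertex sharing an extremity. Count connected
    components with a simple union-find."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    verts = [("A", adj) for adj in A] + [("D", adj) for adj in D]
    for v in verts:
        parent[v] = v
    # Each extremity occurs in exactly one adjacency per genome, so a
    # shared extremity identifies one edge of the adjacency graph.
    loc_A = {e: ("A", adj) for adj in A for e in adj}
    loc_D = {e: ("D", adj) for adj in D for e in adj}
    for e in loc_A:
        if e in loc_D:
            parent[find(loc_A[e])] = find(loc_D[e])
    return len({find(v) for v in verts})

k = component_count(A, D)  # the structural parameter k from the text
```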

At a high level, our counting algorithm is a dynamic program over the components of the adjacency graph, bundling those components which “sort together.” The details of this algorithm are technical and nontrivial—not surprising, as the general problem is $\#\textsf{P}$-complete—but a practical, efficient implementation can be found on GitHub. See Bailey, et int., Wiedemann 2025 for more details (or our preceding SWAT 2024 conference paper).


Next Steps: Near-Uniform Sampling and Medians

SCaJ counting is $\#\textsf{P}$‑complete, so we cannot expect exact polynomial‑time counting in all scenarios. The next best thing is approximation, and our current work aims to design a fully polynomial‑time randomized approximation scheme (FPRAS). At a high level, this would allow us to approximate $\#\mathrm{MPS}(A,D)$ well enough that the self‑reducible sampler from above would yield a near‑uniform sampler. If successful, this would give us a practical, unbiased way to run statistical analyses even when exact counting is out of reach.

A related problem, not mentioned above, is our work on genetic medians. Informally, given a collection of descendant genomes $D_1, D_2, \ldots, D_n$, we seek the ancestor genome $A$ that minimizes the sum of the genomic distances to the $D_i$; that is, the “closest common ancestor.” The $\#\textsf{Median}$ problem asks for the number of such medians. For the simple SCoJ model this problem was known to be polynomial‑time computable, and in Bailey, et int., Wiedemann 2024 we improved this to logspace computable. In the more complex DCJ model the problem is $\#\textsf{P}$‑complete, so again we have a tension: what about the “middle” SCaJ model? We have made some progress in answering this question, but our results are preliminary.
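For intuition, here is a hedged sketch of the median idea in a SCoJ-like setting. It assumes that a genome can be treated as a set of adjacencies and that the distance between two genomes is the size of the symmetric difference of their adjacency sets; adjacencies are opaque labels here, ignoring the constraint that each gene extremity appears in at most one adjacency. Under those assumptions, an adjacency present in $t$ of the $n$ genomes costs $n - t$ to include and $t$ to exclude, so a median keeps exactly the majority adjacencies, and ties are one source of multiple medians.

```python
from collections import Counter

def majority_median(genomes):
    """Keep each adjacency held by a strict majority of the genomes."""
    n = len(genomes)
    counts = Counter(adj for g in genomes for adj in g)
    return {adj for adj, t in counts.items() if 2 * t > n}

def scoj_distance(g1, g2):
    """Assumed distance: size of the symmetric difference."""
    return len(g1 ^ g2)

# Toy descendants over opaque adjacency labels "x", "y", "z".
D1, D2, D3 = {"x", "y"}, {"x", "z"}, {"x", "y"}
M = majority_median([D1, D2, D3])                    # keeps "x" and "y"
total = sum(scoj_distance(M, g) for g in (D1, D2, D3))
```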


Students:

If you enjoy combinatorial puzzles or building recursive programs, then please reach out. I have a few tightly scoped projects at the intersection of combinatorics, algorithms, and probability. Recommended coursework includes discrete math and some programming experience (any language). For summer work, please get in touch early spring semester (SCUR proposals are typically due mid‑semester).