Genome Rearrangement
“Nothing in biology makes sense except in the light of combinatorics.”
— Some famous geneticist, probably
Biological Sampling
In the early 1900s, decades before the double helix, scientists tracking fruit-fly and maize chromosomes observed that whole segments of DNA can break, fuse, flip, or otherwise swap locations. As opposed to point mutations, which are single-letter “typos” in the genetic code, these genome rearrangements are large, chromosome-scale edits. Examples of such mutations are shown below, where each arrow is a single gene in the genome, its direction indicating which strand of DNA contains the coding sequence.
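As a rough mental model (my own toy encoding, not the formal one used later), a chromosome can be written as a signed sequence of genes, where the sign plays the role of the arrow's direction; a "flip" then reverses a segment and switches its signs, and a "fuse" concatenates two chromosomes.

```python
def flip(chromosome, i, j):
    """Invert the segment chromosome[i:j]: the genes reverse order and each
    switches strand (sign)."""
    return chromosome[:i] + [-g for g in reversed(chromosome[i:j])] + chromosome[j:]

def fuse(chrom_a, chrom_b):
    """Fuse two chromosomes end to end into a single chromosome."""
    return chrom_a + chrom_b

chromosome = [+1, +2, +3, +4, +5]      # five genes, all read left to right
print(flip(chromosome, 1, 4))          # [1, -4, -3, -2, 5]: genes 2-4 flipped
print(fuse([+1, +2], [+3, +4]))        # [1, 2, 3, 4]
```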
In addition to their role in driving evolution and speciation, such mutations can have major effects on health. In a process called chromothripsis, for example, chromosomes can shatter and reassemble in a single catastrophic event, sometimes leading to cancer or congenital disease.
Regardless of the specific biological context, we are concerned with the same central question: can we “replay history”? That is, given an ancestor genome $A$ and a descendant genome $B$, can we determine the sequence of mutations that transformed $A$ into $B$? Guided by Occam’s Razor (“pick the simplest explanation”), we focus only on the shortest possible sequences of mutations that convert $A$ into $B$; we call such sequences the most parsimonious scenarios (MPSs) from $A$ to $B$.
It turns out, even for relatively simple genomes, that there are absurdly many MPSs from $A$ to $B$—far too many to feasibly compute. Hunting for “the one true history” is hopeless. We thus take a statistical lens, asking: can we sample an MPS from $A$ to $B$ uniformly at random?
If yes, then we can treat the set of all MPSs as a population from which we can sample. We could estimate how often particular biological features occur (circular intermediates, breakpoint reuse, etc.), build empirical distributions, and run standard hypothesis tests. As we will see below, however, it isn’t always easy to pick something “at random.”
From Sampling to Counting
We might be tempted to just simulate random histories: pick a mutation that brings us closer to $B$ uniformly at random, then repeat until we reach $B$. Unfortunately, this naive random walk is biased: histories that pass through branches with relatively few continuations are over-sampled. We need to weight our choices unequally in order to generate unbiased samples (see figure).
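To make the bias concrete, here is a small toy calculation (with hypothetical branch names standing in for genomes and mutations): one branch of the search tree leads to a single complete history, the other to three, yet the naive walk enters each branch with probability 1/2.

```python
from fractions import Fraction

# Toy search tree: from "start" we may take branch "x" (1 continuation) or
# branch "y" (3 continuations); the four leaves are the complete histories.
tree = {"start": ["x", "y"], "x": ["x1"], "y": ["y1", "y2", "y3"],
        "x1": [], "y1": [], "y2": [], "y3": []}

def leaf_probabilities(node, prob=Fraction(1)):
    """Probability of ending at each leaf when every branch is chosen uniformly."""
    children = tree[node]
    if not children:
        return {node: prob}
    out = {}
    for child in children:
        out.update(leaf_probabilities(child, prob / len(children)))
    return out

print(leaf_probabilities("start"))
# x1 is sampled with probability 1/2, while y1, y2, y3 each get 1/6;
# a uniform sampler would return each of the four histories with probability 1/4.
```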
Luckily, because our problem is self-reducible, there is a standard way to do this. Specifically, let

$$N(A, B) = \text{the number of MPSs from } A \text{ to } B.$$

If we can compute this value, then for any mutation $m$ we can also compute $N(m(A), B)$, the number of MPSs that start with $m$, simply by using our formula on genomes $m(A)$ and $B$. We can thus simulate random histories by picking our first mutation $m$ with probability

$$\frac{N(m(A), B)}{N(A, B)},$$

and recursing until we reach $B$. In short: we only need the counting formula to create a uniform sampler.
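Here is a minimal sketch of that sampler, assuming a counting oracle `count_mps(A, B)` that returns $N(A, B)$ and a helper `moves_toward(A, B)` that yields the mutations bringing $A$ one step closer to $B$ (both names are hypothetical).

```python
import random

def sample_mps(A, B, count_mps, moves_toward):
    """Return a uniformly random MPS from A to B, given a counting oracle.

    count_mps(A, B)    -> N(A, B), the number of MPSs from A to B
    moves_toward(A, B) -> iterable of (m, m_of_A) pairs for mutations m
                          that move A one step closer to B
    """
    history = []
    while A != B:
        remaining = count_mps(A, B)            # N(A, B)
        r = random.randrange(remaining)        # uniform index into the scenarios
        for m, A_next in moves_toward(A, B):
            n = count_mps(A_next, B)           # N(m(A), B): scenarios starting with m
            if r < n:                          # chosen with probability n / remaining
                history.append(m)
                A = A_next
                break
            r -= n
    return history
```

Because the counts of the continuations sum to $N(A, B)$, every complete history is produced with probability exactly $1 / N(A, B)$.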
Computational Complexity
Counting and sampling algorithms are only useful if they can be evaluated on real inputs. For us, this means that we desire a way to compute $N(A, B)$ in polynomial time, loosely meaning that the computation time required to analyze a genome of size $n$ is bounded by $n^k$ for some fixed $k$. This ensures that the runtime scales in a controlled way as $n$ increases. By contrast, we want to avoid super-polynomial-time algorithms (e.g. exponential runtime $2^n$), where even a modest increase in $n$ renders computation impractical.
The most famous framing of complexity is undoubtedly the P versus NP problem, a Millennium Prize Problem which carries a million‑dollar reward if you can solve it. This question is concerned with two classes of decision problems:
- P: yes/no problems solvable in polynomial time
- NP: yes/no problems where a proposed “yes” answer can be checked in polynomial time
For example, “Is there an MPS from $A$ to $B$?” is a class P problem: if $A$ and $B$ have different genes then no amount of genome rearrangements will make them match, so “no”; if $A$ and $B$ have the same genes then our naive, biased simulation above will build an MPS, so “yes.” Every P problem is also in class NP; indeed, given a proposed MPS from $A$ to $B$, it is certainly polynomial time to verify that it is truly an MPS (just do the proposed mutations and see if it works).
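As a sanity check, the “yes/no” part really is this easy; a toy version, assuming genomes are flattened to lists of signed genes (one entry per gene), might look like the following.

```python
from collections import Counter

def mps_exists(A, B):
    """Decide “Is there an MPS from A to B?” for genomes given as flat lists
    of signed genes: a scenario exists exactly when the two genomes carry the
    same multiset of genes, regardless of order, orientation, or how the genes
    are grouped into chromosomes."""
    return Counter(abs(g) for g in A) == Counter(abs(g) for g in B)

print(mps_exists([+1, -2, +3], [+3, +1, +2]))  # True: same genes, rearranged
print(mps_exists([+1, -2, +3], [+1, +2, +4]))  # False: gene 3 vs. gene 4
```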
The problem of computing $N(A, B)$, though, is a counting problem, not a yes/no decision problem. Specifically, it is counting the number of solutions to an NP problem. This places our problem into the class #P, and we hope it lies in FP.
- #P: counting problems arising from NP decision problems
- FP: function problems (like counting) that can be solved in polynomial time
It is widely believed that these two classes are different and that, regardless of how clever we are, some #P problems simply cannot be solved in polynomial time.
SCaJ is #P-Complete
So is computing $N(A, B)$ polynomial-time solvable? Well… that depends on which mutation operations we include in our model. At the top of the page we gave several examples of mutations: (i) cuts, (ii) joins, (iii) cut-joins, and (iv) double-cut-joins. This gives us a natural ordering of several models (a toy sketch of operations (i)–(iii) follows the list):
- SCoJ: The Single Cut or Join model allows mutations of type (i) and (ii)
- SCaJ: The Single Cut-and-Join model allows mutations of type (i)–(iii)
- DCJ: The Double-Cut-and-Join model allows mutations of type (i)–(iv)
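To make operation types (i)–(iii) concrete, here is a toy sketch that represents a genome as a set of adjacencies between gene extremities (my own simplified formalization, not the exact encoding used in the papers); a double-cut-join would modify two adjacencies at once.

```python
# A genome is modelled as a set of adjacencies, each a frozenset of two gene
# extremities (e.g. '2h' = head of gene 2); extremities that appear in no
# adjacency are telomeres.

def cut(genome, adjacency):
    """(i) Cut: remove one adjacency, creating two telomeres."""
    return genome - {adjacency}

def join(genome, p, q):
    """(ii) Join: glue two telomeres p and q into a new adjacency."""
    return genome | {frozenset((p, q))}

def cut_join(genome, adjacency, keep, new_partner):
    """(iii) Cut-join: cut an adjacency and, in the same operation, re-join the
    freed extremity `keep` to the telomere `new_partner`."""
    return join(cut(genome, adjacency), keep, new_partner)

G = {frozenset(('1h', '2t')), frozenset(('2h', '3t'))}
# Relocate gene 3's tail so that it now sits next to gene 1's tail.
print(cut_join(G, frozenset(('2h', '3t')), '3t', '1t'))
```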
It is known that $N(A, B)$ can be computed in polynomial time for the simplest SCoJ model, while it is conjectured that $N(A, B)$ for the more complex DCJ model cannot be computed in polynomial time (assuming FP ≠ #P). This tension motivated us to ask “what about the ‘middle’ SCaJ model?” Biologists must balance biological relevance with computational tractability when choosing which model to use in their analyses: if DCJ is too complex for these calculations, what about SCaJ?
Answer: computing $N(A, B)$ in the SCaJ model is #P-complete. Beyond not being computable in polynomial time (assuming FP ≠ #P), the “#P-complete” suffix indicates that this problem is among the hardest problems in #P. That is, given an efficient algorithm to compute $N(A, B)$, we could efficiently adapt that algorithm to solve any #P problem. See Bailey, et int., Wiedemann 2024 for more details (or our preceding COCOON 2023 conference paper).
Fixed-Parameter Tractability
While #P-completeness rules out a fast algorithm for computing $N(A, B)$ in the SCaJ model for arbitrary inputs, it does not preclude a fast algorithm on “structured” inputs. Indeed, we have been able to show that many biologically reasonable instances fall into a regime where exact counting can still be done in polynomial time.
Informally, there is a standard way to translate the differences between $A$ and $B$ into an adjacency graph (see the figure for a loose definition), effectively consolidating the data about how $A$ and $B$ differ into a single object. If the number of components $c$ in that adjacency graph is small, then we can compute $N(A, B)$ in polynomial time. This means our problem is fixed-parameter tractable with respect to the structural parameter $c$. Because the genomic distance bounds $c$, the problem is also fixed-parameter tractable with respect to the number of mutations needed to transform $A$ into $B$. To the best of our knowledge, this is the first explicit use of parameterized complexity in the genome rearrangement literature.
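The parameter itself is cheap to measure. As an illustration (a simplified construction, not necessarily the exact graph used in the paper), one can take every adjacency of $A$ and of $B$ as an edge between gene extremities and count the connected components of the resulting graph with union-find.

```python
def count_components(adjacencies_A, adjacencies_B):
    """Count connected components of the graph whose vertices are gene
    extremities (e.g. '3h', '3t') and whose edges are the adjacencies of A
    together with the adjacencies of B. Extremities appearing in no adjacency
    of either genome are ignored in this simplified sketch."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for u, v in list(adjacencies_A) + list(adjacencies_B):
        union(u, v)
    return len({find(x) for x in parent})

# Tiny example over genes 1..3, adjacencies written as pairs of extremities.
A = [('1h', '2t'), ('2h', '3t')]
B = [('1h', '3t'), ('3h', '2t')]
print(count_components(A, B))   # 1 component for this toy pair
```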
At a high level, our counting algorithm is a dynamic program over the components of the adjacency graph, bundling those components which “sort together.” The details of this algorithm are technical and nontrivial—not surprising, as the general problem is #P-complete—but a practical, efficient implementation can be found on GitHub. See Bailey, et int., Wiedemann 2025 for more details (or our preceding SWAT 2024 conference paper).
Next Steps: Near-Uniform Sampling and Medians
SCaJ counting is #P-complete, so we cannot expect exact polynomial-time counting on all inputs. The next best thing is approximation, and our current work is aimed at designing a fully polynomial-time randomized approximation scheme (FPRAS). At a high level, this would allow us to approximate $N(A, B)$ well enough that the self-reducible sampler from above would yield a near-uniform sampler. If successful, this would give us a practical, unbiased way to run statistical analyses even when exact counting is out of reach.
A related problem, not mentioned above, is our work on genetic medians. Informally, given a collection of descendant genomes $B_1, \dots, B_k$, we seek the ancestor genome $A$ that minimizes the sum of the genomic distances $d(A, B_1) + \dots + d(A, B_k)$; that is, we want the “closest common ancestor.” The counting version of this problem asks for the number of such medians. For the simple SCoJ model this problem was known to be polynomial-time computable, and in Bailey, et int., Wiedemann 2024 we improved this to logspace computable. In the more complex DCJ model the problem is #P-complete, so again we have a tension: what about the “middle” SCaJ model? We have made some progress in answering this question, but results are preliminary.
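In its most naive form the median problem is just a minimization over candidate ancestors; a brute-force sketch (assuming a pairwise `distance` function for whichever model is in play, and an explicit list of candidate genomes) shows the structure of the question, even though it is hopeless on real data.

```python
def count_medians(descendants, distance, candidates):
    """Count (and list) the candidate genomes that minimize the total genomic
    distance to the descendant genomes. `distance(X, Y)` is assumed to be the
    pairwise genomic distance of the chosen model (SCoJ, SCaJ, or DCJ)."""
    totals = [(sum(distance(M, D) for D in descendants), M) for M in candidates]
    best = min(t for t, _ in totals)
    medians = [M for t, M in totals if t == best]
    return len(medians), medians
```

The point of the complexity results above is to determine when this count can be obtained without any such exhaustive search.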
Students:
If you enjoy combinatorial puzzles or building recursive programs, then please reach out. I have a few tightly scoped projects at the intersection of combinatorics, algorithms, and probability. Recommended coursework includes discrete math and some programming experience (any language). For summer work, please get in touch early spring semester (SCUR proposals are typically due mid‑semester).