DNA and group theory - Part II


 

V4 didn’t just ‘work’: it separated real DNA from randomized sequences in a way SU(2) completely failed to do. First, a bit about V4:

The Klein four-group V4 is the simplest non-trivial example of an abelian group that is not cyclic. It has four elements: the identity e, and three elements a, b, ab, each of order 2, meaning every element is its own inverse. The group table is entirely determined by this: any two distinct non-identity elements multiply to give the third. What makes V4 attractive for DNA is its natural fit with the four bases, the asymmetric metric structure you get by weighting transitions and transversions differently, and the fact that it has exactly four irreducible representations, all one-dimensional. When you assign each DNA base to a V4 element and multiply the elements along a sequence window as a path-ordered product, the probability that the product returns to the identity encodes the statistical structure of the sequence, in a way that turns out to be sensitive to the number-theoretic properties of the window length. In addition, the Watson-Crick base pairs, A with T and G with C, are not just a chemical accident: they correspond exactly to the partition of V4 into cosets of a two-element subgroup, the mathematical structure that underlies the entire analysis.
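To make this concrete, here is a minimal Python sketch of the machinery above. V4 elements are encoded as 2-bit vectors with XOR as the group operation (so every element is its own inverse, and any two distinct non-identity elements XOR to the third); the base assignment A→e, T→a, G→b, C→ab is the one used in this series, while the function names are mine.

```python
# V4 elements as 2-bit vectors over GF(2); the group operation is XOR,
# so every element is its own inverse and a * b = ab automatically.
V4 = {"e": 0b00, "a": 0b01, "b": 0b10, "ab": 0b11}

# The base-to-element mapping from the post: Watson-Crick pairs share
# a coset of the subgroup {e, a}.
BASE_TO_V4 = {"A": V4["e"], "T": V4["a"], "G": V4["b"], "C": V4["ab"]}

def v4_mul(x, y):
    """Group operation: XOR of the 2-bit encodings."""
    return x ^ y

def path_product(window):
    """Path-ordered product of the V4 elements along a sequence window."""
    prod = V4["e"]
    for base in window:
        prod = v4_mul(prod, BASE_TO_V4[base])
    return prod

def identity_return_fraction(seq, window):
    """Fraction of length-`window` windows whose product returns to e."""
    hits = [path_product(seq[i:i + window]) == V4["e"]
            for i in range(len(seq) - window + 1)]
    return sum(hits) / len(hits)
```

Because the operation is XOR, the path product of a window depends only on the parity of each non-identity element's count in it, which is why the return-to-identity probability is sensitive to the window length.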

To test whether the V4 group structure reveals non-random patterns in the E. coli genome, I used a distance-stratified approach. Each DNA base is assigned to a V4 group element using the canonical biologically motivated mapping — A→e, T→a, G→b, C→ab — where Watson-Crick pairs share a coset. For each lag distance d from 1 to 200, I computed the mean V4 distance between all pairs of bases separated by exactly d positions in the sequence, giving a profile of how V4 distance varies with sequence separation. To assess significance, I compared this profile against two null models: an IID null that shuffles the sequence completely while preserving base frequencies, and a Markov null that additionally preserves dinucleotide frequencies. Thirty replicates of each null were generated and used to z-score the real profile at each lag. The V4 distance metric itself is asymmetric — transitions are weighted differently from transversions, with the weights estimated empirically from the actual transition/transversion ratio in the sequence. This is the biologically motivated choice that distinguishes this mapping from an arbitrary one.
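The analysis above can be sketched in a few functions. Two assumptions to flag: the distance below returns 0 for identical bases, a transition weight for purine-purine or pyrimidine-pyrimidine pairs, and a transversion weight otherwise; and the alpha = 0.5, beta = 1.0 values are illustrative placeholders, not the empirically fitted weights described in the text. The IID null here is the full shuffle; the Markov null is sketched separately below.

```python
import random

PURINES = {"A", "G"}

def v4_distance(x, y, alpha=0.5, beta=1.0):
    """Weighted V4 distance: 0 if equal, alpha for transitions,
    beta for transversions (alpha < beta; placeholder values)."""
    if x == y:
        return 0.0
    if (x in PURINES) == (y in PURINES):  # transition
        return alpha
    return beta                            # transversion

def lag_profile(seq, max_lag):
    """Mean pairwise V4 distance at each lag d = 1..max_lag."""
    prof = []
    for d in range(1, max_lag + 1):
        dists = [v4_distance(x, y) for x, y in zip(seq, seq[d:])]
        prof.append(sum(dists) / len(dists))
    return prof

def iid_null_zscores(seq, max_lag, n_reps=30, seed=0):
    """Z-score the real profile against an IID (full-shuffle) null."""
    rng = random.Random(seed)
    real = lag_profile(seq, max_lag)
    null_profiles = []
    for _ in range(n_reps):
        shuf = list(seq)
        rng.shuffle(shuf)  # preserves base frequencies, destroys order
        null_profiles.append(lag_profile(shuf, max_lag))
    zs = []
    for d in range(max_lag):
        vals = [p[d] for p in null_profiles]
        mu = sum(vals) / n_reps
        var = sum((v - mu) ** 2 for v in vals) / (n_reps - 1)
        sd = var ** 0.5 or 1e-12  # guard against a degenerate null
        zs.append((real[d] - mu) / sd)
    return zs
```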

The result is striking. The real E. coli sequence sits dramatically outside both null envelopes across all 200 lags tested: bases at a given separation are more similar in V4 space than chance predicts, with z-scores against the IID null reaching -35 at short lags and oscillating between -5 and -20 across all distances. Even the Markov null, which preserves dinucleotide context, is beaten by 5 to 20 standard deviations at most lags. And the oscillation pattern is unmistakable: a clear periodicity of 3, the codon signal, running through the entire profile. Real DNA is not just different from random; it is overwhelmingly, persistently different, at every scale tested.
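One simple way to quantify a period-3 oscillation like this is to project the per-lag profile onto a frequency-1/3 Fourier component. A sketch, where `profile` stands for any per-lag series such as the z-score profile (the function name is mine):

```python
import cmath

def period3_power(profile):
    """Magnitude of the frequency-1/3 Fourier component of a per-lag
    series, after removing the mean; large values indicate a strong
    period-3 oscillation."""
    n = len(profile)
    mean = sum(profile) / n
    coef = sum((v - mean) * cmath.exp(-2j * cmath.pi * k / 3)
               for k, v in enumerate(profile))
    return abs(coef) / n
```

A flat profile scores zero; a profile that repeats every three lags scores high, which is how the codon signal shows up.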

But here is where I have to be honest about what this result does and does not say. The statistic I computed, mean pairwise V4 distance at each lag, finds signal, but the interpretation of that signal is not entirely clear. We are measuring something real about the structure of the sequence, but we are not yet asking a sharp biological question. What does it mean, precisely, that bases separated by distance d are more similar in V4 space than expected? The period-3 oscillation tells us the codon structure is present, as it must be. But what is the rest of the signal, the persistent deviation from the null across all lags, actually encoding?

This lack of clarity is itself informative. It tells us that V4 is sensitive to structure in DNA that survives both IID and Markov randomization, but that the distance-stratified statistic is too blunt an instrument to tell us what that structure is. The right question, it turns out, is a different one entirely. 

Each of these "failures", and the one "success", reveals something very cool about the symmetry structure of DNA. The fact that so many of these transformations (SU(2), GL(4), SO(3)) give no signal against a well-chosen null is itself a result, since we would expect at least some structure to surface in a molecule like DNA. So what was the issue? Here's my take:

SU(2) and V4 are not unrelated. The identity and the three imaginary Pauli matrices iσx, iσy, iσz generate a finite subgroup of SU(2) (the quaternion group Q8), and modulo sign these four matrices multiply exactly like V4: V4 is the image of that subgroup under the double cover SU(2) → SO(3). In this sense, moving from SU(2) to V4 was not an abandonment of the original intuition but a refinement of it: instead of the full continuous symmetry group, take its discrete skeleton. The reason SU(2) failed is now clear in retrospect. The Pauli-based elements are equidistant from one another on the 3-sphere... there is no natural asymmetry in the geometry of SU(2) that can encode the biological distinction between transitions and transversions. V4, by contrast, admits an asymmetric metric: you can weight the three non-identity elements differently, and the biologically motivated choice... transitions cheaper than transversions... turns out to be exactly what the sequence structure requires. The continuous group was too uniform. The discrete group, with the right metric, was just right.
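This relationship can be checked directly with numpy: each imaginary Pauli matrix squares to -I in SU(2) (order 4 there), but up to sign the four matrices satisfy exactly the V4 multiplication rules, which is the sense in which V4 is the discrete skeleton of the original construction. A small verification sketch:

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

# Candidate V4 representatives inside SU(2).
elems = {"e": I2, "a": 1j * sx, "b": 1j * sy, "ab": 1j * sz}

def same_up_to_sign(M, N):
    """Equality in SU(2)/{±I} ≅ SO(3)."""
    return np.allclose(M, N) or np.allclose(M, -N)

# In SU(2) itself each non-identity representative squares to -I,
# so it only has order 2 modulo sign.
for g in ("a", "b", "ab"):
    assert np.allclose(elems[g] @ elems[g], -I2)

# Modulo sign, any two distinct non-identity elements multiply to the
# third: the V4 group table.
assert same_up_to_sign(elems["a"] @ elems["b"], elems["ab"])
assert same_up_to_sign(elems["b"] @ elems["ab"], elems["a"])
assert same_up_to_sign(elems["ab"] @ elems["a"], elems["b"])
```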

Finally, the answer to my question at the end of the last blog post comes down to the choice of null. If you test any transformation of a real sequence against complete randomization, you will almost certainly find signal. Why? Because real sequences have structure at every scale, and destroying all of it at once is a very low bar to clear. The real question is: what is the right null? In the V4 case, the critical test is not the IID null but the Markov null, which preserves local dinucleotide context. The V4 distance profile survives that test by 5 to 20 standard deviations! This means we are seeing something that local sequence memory alone cannot explain. The alpha and beta parameters, the transition and transversion weights in the metric, matter not because they save us from a bad null, but because they ensure the metric is biologically grounded rather than arbitrary. Without that grounding, the statistic would still be picking up structure, but we wouldn't know which structure or why it matters.
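For concreteness, here is one way to sketch a Markov null. An exact dinucleotide-preserving shuffle would use an Eulerian-path construction; this simpler illustrative variant fits a first-order Markov chain to the (circularized) sequence and samples surrogates from it, preserving dinucleotide frequencies in expectation. Whether this matches the exact null used in the analysis is an assumption on my part.

```python
import random
from collections import defaultdict

def fit_markov(seq):
    """First-order transition counts, fitted on the circularized sequence
    so every observed base has at least one outgoing transition."""
    counts = defaultdict(lambda: defaultdict(int))
    for x, y in zip(seq, seq[1:] + seq[0]):
        counts[x][y] += 1
    return {x: (list(nxt), list(nxt.values())) for x, nxt in counts.items()}

def markov_surrogate(seq, rng):
    """Sample a surrogate sequence of the same length from the fitted
    chain, starting from the real first base."""
    model = fit_markov(seq)
    out = [seq[0]]
    for _ in range(len(seq) - 1):
        choices, weights = model[out[-1]]
        out.append(rng.choices(choices, weights=weights)[0])
    return "".join(out)
```

Every adjacent pair the surrogate emits is a dinucleotide observed in the real (circularized) sequence, which is exactly the "local memory" this null is meant to preserve.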

Next time: A discussion of what groups, as mathematical objects, measure in DNA.


