Friday, April 24, 2026

Do prime numbers hide in the genome?

 


I had a dream about primes hanging from the genome and decided to look for them there.

Before I tell you what I found, a little biology.

When a gene needs to be switched on, proteins called transcription factors bind to short DNA sequences near the gene and trigger the machinery that reads it. These binding sites — TFBS, transcription factor binding sites — are not randomly scattered. They cluster, they cooperate, and the distances between them are not arbitrary. Two binding sites close together can act in concert; far apart, they may regulate independently, or not at all. The spacing between TFBS tells you something real about how genes are co-regulated, how developmental programs are coordinated, how a cell decides what to become.

I've been learning number theory for the past several years — working through proofs, building intuition, trying to understand what primes actually are rather than just what they do. At some point the two worlds — the genome and the number line — started bleeding into each other in the way things do when you spend too much time with both of them.

So I looked.

Specifically, I asked: for a given type of transcription factor binding site, what is the distance to the next occurrence of the same site in the genome? And within those distances, are prime numbers overrepresented or underrepresented relative to what chance would predict?
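To make the question concrete, here is a minimal sketch of the measurement in Python. It is an illustration under stated assumptions, not the exact pipeline behind the numbers in this post: the function names are hypothetical, the motif is passed in as a regular expression, and the spacing is taken as the distance between the start positions of consecutive matches.

```python
import re

def is_prime(n: int) -> bool:
    """Trial-division primality test; fine for spacings of genomic size."""
    if n < 2:
        return False
    if n < 4:
        return True
    if n % 2 == 0:
        return False
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True

def site_positions(seq: str, pattern: str) -> list[int]:
    """Start index of every match of `pattern`, overlapping matches included."""
    # The zero-width lookahead lets the regex engine report overlapping sites.
    return [m.start() for m in re.finditer(f"(?=(?:{pattern}))", seq)]

def spacings(positions: list[int]) -> list[int]:
    """Distance from each site to the next occurrence of the same site."""
    return [b - a for a, b in zip(positions, positions[1:])]

def prime_fraction(gaps: list[int]) -> float:
    """Share of spacings that are prime (0.0 if there are no spacings)."""
    return sum(is_prime(g) for g in gaps) / len(gaps) if gaps else 0.0

# Toy example: a made-up sequence and a simplified TATA pattern (assumption).
toy = "TATAGGGTATACCCCCCTATA"
gaps = spacings(site_positions(toy, "TATA"))
print(gaps, prime_fraction(gaps))
```

A real run would scan whole chromosomes and would probably use a degenerate pattern (for example, something like `CA[ACGT]{2}TG` for the E-box consensus CANNTG), but the counting logic stays the same.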

The answer depends entirely on which binding site you ask about.

E-boxes are short DNA sequences bound by bHLH transcription factors — a family that includes the proteins driving the circadian clock, cell differentiation, and neural development. In yeast, the spacing between E-boxes is prime-depleted. Primes show up less often than expected: Z = −4.06, meaning the depletion is more than four standard deviations below the null. In human, Z = −4.52.

TATA boxes are core promoter elements found upstream of many genes, among the most ancient regulatory sequences in biology. Their spacing pattern is the opposite. TATA box spacings are prime-enriched: the distance from one TATA box to the next is a prime more often than chance predicts. In yeast, Z = +14.72. In human, Z = +149.4.

SP1/GC binding sites show neither enrichment nor depletion. They are neutral.
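For readers who want to see what a Z-score like the ones above measures, here is a small sketch. It assumes the null is represented by a list of prime fractions, one per shuffled control sequence, and that the observed prime fraction is compared against their mean and sample standard deviation. The function name and that representation are my assumptions for illustration.

```python
import statistics

def z_score(observed: float, null_values: list[float]) -> float:
    """How many null standard deviations the observed value sits from the null mean."""
    mean = statistics.mean(null_values)
    sd = statistics.stdev(null_values)  # sample standard deviation of the null
    if sd == 0:
        raise ValueError("null distribution has zero variance")
    return (observed - mean) / sd
```

A negative value means primes are rarer among the real spacings than among the shuffled ones; a positive value means they are more common.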

The signal is not a statistical fluke. I ran the analysis against a dinucleotide null — a shuffled control that preserves local sequence composition while destroying large-scale structure — and the results hold.
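One simple way to build a control of this kind is sketched below. It is a stand-in, not necessarily the procedure used here: rather than an exact dinucleotide-preserving shuffle, it generates a new sequence from a first-order Markov chain whose transitions are sampled from the original, so dinucleotide frequencies are preserved only approximately. The function name is hypothetical.

```python
import random
from collections import defaultdict

def markov_null(seq: str, rng: random.Random) -> str:
    """Generate a same-length sequence whose adjacent pairs all occur in `seq`."""
    # For each base, record every base that follows it in the original.
    followers = defaultdict(list)
    for a, b in zip(seq, seq[1:]):
        followers[a].append(b)

    out = [seq[0]]
    while len(out) < len(seq):
        options = followers.get(out[-1])
        if not options:
            # Dead end: this base only occurs at the very end of `seq`,
            # so fall back to the overall base composition.
            options = list(seq)
        out.append(rng.choice(options))
    return "".join(out)

# Usage: build many controls with a seeded generator for reproducibility.
rng = random.Random(0)
controls = [markov_null("ACGTACGTTGCA" * 5, rng) for _ in range(3)]
```

Running the spacing measurement on many such controls gives the null distribution that an observed prime fraction can be compared against.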


What does it mean?

I genuinely do not know. The consistency across two organisms separated by roughly a billion years of evolution suggests this is not noise. E-boxes and TATA boxes are among the most ancient regulatory elements in biology; if there is a signal here, it has been maintained across deep evolutionary time, which implies it is doing something.

The opposing directions are the part I keep returning to. It is not that primes are generally enriched or depleted near regulatory elements. It is that different regulatory elements have opposite relationships to the prime structure of the number line. E-boxes avoid prime spacings; TATA boxes seek them. That specificity feels like it is pointing at something.

But I do not have a mechanism. I have a pattern.


Do any of you have any idea why?
