Unexpected link between pure mathematics and genetics (ox.ac.uk)
134 points by geox on Aug 7, 2023 | 41 comments


The title is not terribly informative, but the result sounds fairly interesting (and the article is open access):

"Maximum mutational robustness in genotype–phenotype maps follows a self-similar blancmange-like curve"

https://royalsocietypublishing.org/doi/10.1098/rsif.2023.016...


A decade ago, while doing some bioinformatics research, I discovered that if you run the Lempel-Ziv algorithm on the genetic sequences of many organisms, you get a sequence of "words" (of length 1 to n) whose power law matches the one displayed by human languages and, crucially, differs from the "power law" you get from random, or almost random, sequences.
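
For anyone who wants to try this kind of analysis, here is a minimal sketch. It is not the original code from those papers. It assumes an LZ78-style parse (each "word" is the shortest prefix of the remaining text that has not appeared as a word before) and assumes the quantity of interest is the distribution of word lengths. The function names `lz78_words` and `length_histogram` are made up for illustration.

```python
from collections import Counter


def lz78_words(seq: str) -> list[str]:
    """Split seq into LZ78-style phrases ("words").

    Each word is the shortest prefix of the remaining text that has not
    been emitted as a word before. The final word may be a repeat if the
    input ends in the middle of a phrase.
    """
    seen = set()
    words = []
    current = ""
    for ch in seq:
        current += ch
        if current not in seen:
            seen.add(current)
            words.append(current)
            current = ""
    if current:
        words.append(current)  # leftover phrase that was already seen
    return words


def length_histogram(words: list[str]) -> Counter:
    """Count how many words there are of each length."""
    return Counter(len(w) for w in words)


if __name__ == "__main__":
    # Toy input only; a real test would read a genome from a FASTA file.
    words = lz78_words("AACAACGTTGCAACGT")
    print(words)
    print(sorted(length_histogram(words).items()))
```

To look for a power law you would plot the histogram on log-log axes for a real genome and compare it with the same plot for a shuffled copy of that genome.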

I submitted two papers about this, both were rejected. Not surprising I guess, my first papers ever!

It was also fascinating to me how complex (and how much larger) plant genomes are compared to animals. Plants have enormous genomes, almost making me wonder, what do they know that we don't?

I think the answer could be--a lot! :)


Papers, unfortunately, are only about answering questions and never about asking great questions.

As a consequence, there are just a handful of papers that answer great questions.

This has probably held back science a bit.


Gzip on genes is something I wondered about for a good while (since I read about "language trees and zipping" about 20 years ago[1].) Pleased to hear someone tried it!

I was wondering in particular whether anyone looked at "junk" DNA with gzip, etc., as a test of how randomised it was.. Do you know if that ever happened?

1: https://arxiv.org/abs/cond-mat/0108530
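
A rough sketch of that test, assuming Python's `zlib` module (the same DEFLATE compression that gzip uses) and a shuffled copy of the sequence as the "fully randomised" baseline. The helper names `compression_ratio` and `shuffled` are hypothetical, and the toy input stands in for real sequence data.

```python
import random
import zlib


def compression_ratio(seq: str) -> float:
    """Compressed size divided by raw size; lower means more structure."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    data = seq.encode("ascii")
    return len(zlib.compress(data, 9)) / len(data)


def shuffled(seq: str, seed: int = 0) -> str:
    """Same letters in random order: keeps base composition, destroys order."""
    chars = list(seq)
    random.Random(seed).shuffle(chars)
    return "".join(chars)


if __name__ == "__main__":
    # Toy sequence; replace with a real genomic region to run the actual test.
    region = "ACGTTGCA" * 500
    print("original:", compression_ratio(region))
    print("shuffled:", compression_ratio(shuffled(region)))
```

If a region compresses about as poorly as its shuffled copy, the compressor is finding little structure beyond base composition. That is weaker than proving randomness, since DEFLATE only sees repeats within a 32 KB window.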


I looked at it. It had the same structure. Maybe there was a different scale, I can't remember now.


It's more accurate to say that plant genome sizes vary over a far larger range than those of other eukaryotes than to say that plant genomes are inherently large compared to animals'. Plants can have quite small genomes as well as giant ones. The large genomes are in large measure due to the fact that plants typically can duplicate their entire genomes (leading to polyploidy), or large elements of their genomes, with no ill effect. Animals typically don't survive massive genome duplication.


Plants will hybridize, combining the entire genomes of multiple species. Common wheat is an example of this: it has full pairs of chromosomes from three ancestor species. https://en.wikipedia.org/wiki/Wheat


Have you posted the drafts to arXiv? If so, please share. If not, please do so. I don’t like the idea that a couple of reviewers can block knowledge transmission permanently.


Thanks. Although sadly I think that might have been what happened. After the rejections, I moved into a startup and lost track of where the papers are. :(


I'm curious, what were the reasons given for the rejections, if you received any?


I'm going to guess because that topic was covered in the 1990s, making it harder for a new researcher to come up with a novel idea.

For example, quoting "Sublinear growth of information in DNA sequences", Giulia Menconi, https://academic.oup.com/nar/article/32/suppl_2/W628/1040725... from 2005:

> The Lempel–Ziv complexity measure is based on text segmentation; we have termed it a ‘complexity decomposition’. It may be interpreted as the representation of a text in terms of repeats. Initially, this approach was implemented for analyzing DNA by Gusev and coauthors (13,14).

13 is Gusev,V.D., Kulichkov,V.A. and Chupakhina,O.M. (1991) Complexity analysis of genomes. I. Complexity and classification methods of detected structural regularities. Mol. Biol. (Mosk). 25 , 825–834.

14 is Gusev,V.D., Nemytikova,L.A. and Chuzhanova,N.A. (1999) On the complexity measures of genetic sequences. Bioinformatics, 15, 994–999.

Google Scholar has 326 matches for "Lempel-Ziv power law dna" at or before 2010, for example, "Entropy and predictability of information carriers" from 1995 with "The capability to describe the structure of information carriers as DNA, proteins, texts and musical strings is investigated.".


I forget. It was many years ago. I don't care.

At the time, it hurt, but now I don't care. I don't even remember where I submitted it! I don't care.

If they'd accepted it, I probably wouldn't be working in software. I'd probably still be in some kind of research, I guess. I don't know!

It's so many years ago, I just don't care. Time changes everything I guess. I suppose they missed out! Ha ha! :)


+1. I would definitely be interested in this kind of investigation. Ideas to consider: Is this just the coding part of the genome, or the whole genome? Was there any difference in characteristics between different parts of the genome? Could you show differences in characteristics between coding, intronic, regulatory, and other parts of the genome, and could these be used to potentially identify candidate regulatory elements?

Unfortunately (as I am discovering) the intersection between biology/genetics and some of these more computer-science-specific ideas gets far less interest than it should.


I assumed it's because plants can't run, so they need a larger chemical arsenal to convince predators not to eat them and mates to coit with them, all with line of sight to the sun.


Likewise cold-blooded things like frogs have fairly massive genomes too, and I thought it was so that the organism could cope with the vast differences in chemical processes required to operate at different temperatures. We as warm-blooded creatures don't need to be as complicated.


> cold-blooded things like frogs have fairly massive genomes too

Yes, the genome of the common frog (Rana temporaria) is 4.1 Gb, https://www.ncbi.nlm.nih.gov/datasets/taxonomy/8407/ , which is larger than mammal genomes.

On the other hand, the smallest frog genome is that of the ornate burrowing frog at 1.06 Gb, which is about a third the size of the human genome. https://en.wikipedia.org/wiki/Genome#Genome_size . https://en.wikipedia.org/wiki/Ornate_burrowing_frog says:

"This is an adaptation to the desert environment where it lives. Because the ponds where they breed dries up fast in the desert, the tadpoles has to go through metamorphosis as fast as possible, which can occur just eleven days after the eggs were fertilized. A small genome gives small cells, and the smaller the cells are, the faster the tadpoles transform into small frogs and can escape the shrinking ponds"

Xenopus laevis (African clawed frog) is 2.7 Gb, also smaller than humans - https://www.ncbi.nlm.nih.gov/datasets/taxonomy/8355/

Thus, warm/cold-blooded cannot be the main reason for the difference.

From https://en.wikipedia.org/wiki/Genome_size :

"genome size is not proportional to the number of genes present in the genome ... In eukaryotes (but not prokaryotes), genome size is not proportional to the number of genes present in the genome, an observation that was deemed wholly counter-intuitive before the discovery of non-coding DNA and which became known as the "C-value paradox" as a result."

From https://en.wikipedia.org/wiki/C-value#Variation_among_specie... :

"Variation in C-values bears no relationship to the complexity of the organism or the number of genes contained in its genome; for example, some single-celled protists have genomes much larger than that of humans. ... C-values correlate with a range of features at the cell and organism levels, including cell size, cell division rate, and, depending on the taxon, body size, metabolic rate, developmental rate, organ complexity, geographical distribution, or extinction risk ..."



> It was also fascinating to me how complex (and how much larger) plant genomes are compared to animals.

Perhaps a naive view, but plants have just a limited nervous system[1] and no brains. Doesn't it make sense that they compensate with more preprogrammed behavior in the form of a more complex genome?

[1]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8331040/


No.

The range of genome sizes for plant genomes is very large. "In animals they range more than 3,300-fold, and in land plants they differ by a factor of about 1,000" and "genome size is not proportional to the number of genes present in the genome", quoting https://en.wikipedia.org/wiki/Genome_size .

See also the chart "Genome size ranges (in base pairs) of various life forms" from that page, and the table at https://en.wikipedia.org/wiki/Genome#Genome_size .

Arabidopsis thaliana has about 135 million base pairs (135 Mb). Humans have about 3.1 billion (3.1 Gb). Paris japonica has about 150 billion (150 Gb).


Thanks for the links, much appreciated.

I knew about non-coding or junk DNA but had forgotten how extreme it was.

Ended up learning about the onion test[1] and this interesting article[2] which suggests classifying non-coding DNA as either junk DNA, spam DNA or both, to better capture the differences that exist.

I also found this article[3] providing some interesting overview and discussion on the topic.

[1]: https://en.wikipedia.org/wiki/Onion_Test

[2]: https://doi.org/10.1093/gbe/evac055

[3]: https://doi.org/10.3390/plants12020282


You'd have a lot of genes too if you synthesized chemicals to communicate and remember things.


Chemicals like… neurotransmitters?


This may or may not have been described by Zipf's law.

https://en.wikipedia.org/wiki/Zipf%27s_law


Did you put the papers on your website or on arXiv.org?



Here's a Project Euler challenge featuring it:

https://projecteuler.net/problem=226


I've just implemented this Blancmange curve using fewer than 140 characters of JavaScript. It's a fractal also known as the Takagi curve. Source code and graphics: https://www.dwitter.net/d/28260 (it can be edited and remixed)
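
For reference, a longer and more readable sketch in Python (this is not the dweet's code). It uses the standard series definition of the curve, T(x) = sum over n >= 0 of s(2^n x) / 2^n, where s(y) is the distance from y to the nearest integer. The function name `blancmange` and the cutoff of 40 terms are arbitrary choices for illustration.

```python
def blancmange(x: float, terms: int = 40) -> float:
    """Approximate the Takagi (blancmange) function at x.

    Sums the first `terms` terms of sum_{n>=0} s(2^n x) / 2^n, where s(y)
    is the distance from y to the nearest integer. The truncated tail is
    at most 2**-terms, so 40 terms is ample for plotting.
    """
    total = 0.0
    for n in range(terms):
        y = (2 ** n) * x
        total += abs(y - round(y)) / (2 ** n)
    return total


if __name__ == "__main__":
    # Sample the curve on [0, 1]; feed these points to any plotting library.
    for i in range(0, 9):
        x = i / 8
        print(x, blancmange(x))
```

The curve peaks at a height of 2/3, reached for example at x = 1/3, and it equals 1/2 at both x = 1/4 and x = 1/2.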


The rabbit hole that is the Blancmange curve wikipedia page goes very deep, though. See you back here in a few years?


> modern encryption techniques based on factoring prime numbers

This obviously wasn’t written by a mathematician. I _hope_ it wasn’t written by a physicist.


It's an easy flub, almost so easy as to be content free. The correct statement is either: factoring into prime numbers, or factoring semiprime numbers.

So when I see pedants picking this one out, I think people should be given a break! :)


I actually came here to delete my comment, because it occurred to me that it might sound mean-spirited. But it seems I am too late. I found the mistake amusing, that's all.


If you don't mind, I'd prefer if you leave it up as a reminder that comments that might seem mean-spirited should be given the benefit of the doubt.

In my opinion it is better to assume people mean well and I feel like you've proven me right (this time)


OH no! I've prevented you from deleting your comment! I wasn't being too hard on you. If you want, I can delete mine, and we can both delete everything before the 2 hour deadline?


Nope! It’s an interesting and informative mistake.


Funny — how it spawned a whole subthread on its own!


Some modern encryption techniques, RSA in particular, are based on our inability to factor large numbers into prime numbers. If you factor the number that's used as the public modulus of the key into the two prime numbers it was composed from, then you have broken the encryption.
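
A toy illustration of that point, assuming textbook RSA with deliberately tiny numbers (p = 61, q = 53, so n = 3233, with public exponent e = 17). The helper names are made up, and trial division is used only because n is tiny. Real moduli are thousands of bits long and real RSA also uses padding.

```python
def factor_semiprime(n: int) -> tuple[int, int]:
    """Find p, q with p * q == n by trial division (tiny n only)."""
    f = 2
    while f * f <= n:
        if n % f == 0:
            return f, n // f
        f += 1
    raise ValueError("n has no nontrivial factor")


def recover_private_exponent(n: int, e: int) -> int:
    """Recompute the private exponent d from the public key (n, e)."""
    p, q = factor_semiprime(n)
    phi = (p - 1) * (q - 1)
    # Modular inverse via three-argument pow (Python 3.8+).
    return pow(e, -1, phi)


if __name__ == "__main__":
    n, e = 3233, 17                 # toy public key (n = 61 * 53)
    ciphertext = pow(65, e, n)      # someone encrypts the message 65
    d = recover_private_exponent(n, e)
    print("recovered d:", d)
    print("decrypted message:", pow(ciphertext, d, n))
```

Once the attacker knows p and q, computing d takes one modular inverse. The whole difficulty sits in the factoring step.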


Given https://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness... I sort of expected there would be a link.


Isn't it the case that we don't fully understand how our bodies, and the bodies of other beings, work?

I don't understand how someone can be confident about quantifying phenotype robustness without complete knowledge of the physical properties of organic life.

Maybe in the robust category there are small changes to the processes in the organisms that we do not fully understand that build up to break into a threshold of our perception.


One day we will find that all math is a fractal of biochemistry


We’re definitely living in a simulation


In the outer world maths isn’t as useful?



