A decade ago, doing some bioinformatics research, I discovered that if you run the Lempel-Ziv algorithm on the genetic sequences of many organisms, you get a sequence of "words" (of length 1 to n) that follows a power law matching the one displayed by human languages, and which crucially is different from the "power law" you get from random, or almost random, sequences.
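For anyone curious what that looks like in practice, here's a minimal sketch (not my original code, which is long gone): an LZW-style greedy parse into "words", followed by a rank-frequency count. The toy sequences and the particular LZ variant are assumptions for illustration; on the repetitive string the word distribution comes out far more skewed than on the uniform random one.

    from collections import Counter

    def lzw_parse(seq):
        # Greedy LZW-style segmentation: emit the longest phrase already in the
        # dictionary, then add phrase + next symbol as a new dictionary entry.
        dictionary = set(seq)  # seed with the single symbols
        words = []
        i = 0
        while i < len(seq):
            j = i + 1
            while j < len(seq) and seq[i:j + 1] in dictionary:
                j += 1
            words.append(seq[i:j])
            if j < len(seq):
                dictionary.add(seq[i:j + 1])
            i = j
        return words

    if __name__ == "__main__":
        import random
        random.seed(0)
        structured = "ATGGCGTTAGC" * 200 + "TTAGCATG" * 300  # toy "genome-like" string
        random_seq = "".join(random.choice("ACGT") for _ in range(len(structured)))
        for name, s in [("structured", structured), ("random", random_seq)]:
            counts = Counter(lzw_parse(s))
            print(name, "distinct words:", len(counts))
            for rank, (word, freq) in enumerate(counts.most_common(5), start=1):
                print(f"  rank {rank}: {word!r} x{freq}")

Plotting rank against frequency on log-log axes is the usual way to eyeball whether the distribution looks Zipf-like.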
I submitted two papers about this, both were rejected. Not surprising I guess, my first papers ever!
It was also fascinating to me how complex (and how much larger) plant genomes are compared to animals. Plants have enormous genomes, almost making me wonder, what do they know that we don't?
Gzip on genes is something I wondered about for a good while (since I read about "language trees and zipping" about 20 years ago[1].) Pleased to hear someone tried it!
I was wondering in particular whether anyone looked at "junk" DNA with gzip, etc., as a test of how randomised it was. Do you know if that ever happened?
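In case it's useful to anyone who wants to try it, a bare-bones version of that test might look like the sketch below: compress a region, compress a shuffled copy with the same base composition, and compare the ratios. The toy sequence is a placeholder for a real region sliced out of a FASTA file, and bear in mind that gzip's 32 KB window limits which repeats it can actually see.

    import gzip
    import random

    def gzip_ratio(seq):
        # compressed size / raw size; lower means gzip found more structure
        raw = seq.encode("ascii")
        return len(gzip.compress(raw, compresslevel=9)) / len(raw)

    random.seed(0)
    region = "ACGTTGACCA" * 500                             # placeholder for a real "junk" region
    shuffled = "".join(random.sample(region, len(region)))  # same composition, order destroyed
    print("region  :", round(gzip_ratio(region), 3))
    print("shuffled:", round(gzip_ratio(shuffled), 3))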
It's more accurate to say that plant genome sizes vary over an incredibly large range, far wider than that found in other eukaryotes, than to say that plant genomes are inherently large compared to animals'. Plants can have quite small genome sizes as well as giant genomes. The large genomes are in large measure due to the fact that plants typically can duplicate their entire genomes (leading to polyploidy), or large elements of their genomes, with no ill effect. Animals typically don't survive massive genome duplication.
Plants will also hybridize, combining the entire genomes of multiple species.
Common wheat is an example of this: it has full pairs of chromosomes from three ancestor species.
https://en.wikipedia.org/wiki/Wheat
Have you posted the drafts to arXiv? If so, please share. If not, please do so. I don't like the idea that a couple of reviewers can block knowledge transmission permanently.
Thanks. Although sadly I think that might have been what happened. After the rejections, I moved into a startup and lost track of where the papers are. :(
> The Lempel–Ziv complexity measure is based on text segmentation; we have termed it a ‘complexity decomposition’. It may be interpreted as the representation of a text in terms of repeats. Initially, this approach was implemented for analyzing DNA by Gusev and coauthors (13,14).
13 is Gusev, V.D., Kulichkov, V.A. and Chupakhina, O.M. (1991) Complexity analysis of genomes. I. Complexity and classification methods of detected structural regularities. Mol. Biol. (Mosk.), 25, 825–834.
14 is Gusev, V.D., Nemytikova, L.A. and Chuzhanova, N.A. (1999) On the complexity measures of genetic sequences. Bioinformatics, 15, 994–999.
Google Scholar has 326 matches for "Lempel-Ziv power law dna" at or before 2010, for example, "Entropy and predictability of information carriers" from 1995 with "The capability to describe the structure of information carriers as DNA, proteins, texts and musical strings is investigated.".
+1. I would definitely be interested in this kind of investigation. Ideas to consider: Is this just the coding part of the genome, or the whole genome? Was there any difference in characteristics between different parts of the genome? Could you show differences in characteristics between coding, intronic, regulatory, and other parts of the genome, and could these be used to potentially identify candidate regulatory elements?
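One cheap way to get started on that last question, assuming you have annotation intervals handy (from a GFF or BED file, say): compute a compression ratio per region class and compare. The sequence and intervals below are made up; the point is only the shape of the comparison.

    import gzip

    def ratio(seq):
        return len(gzip.compress(seq.encode("ascii"))) / max(len(seq), 1)

    genome = "ATGGCCAAATTT" * 400          # stand-in for one contig
    regions = [                            # hypothetical (label, start, end) annotation
        ("coding",     0,    1200),
        ("intronic",   1200, 2400),
        ("regulatory", 2400, 3000),
        ("other",      3000, len(genome)),
    ]
    for label, start, end in regions:
        print(f"{label:10s} {ratio(genome[start:end]):.3f}")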
Unfortunately (as I am discovering) the intersection between biology/genetics and some of these more computer-science-specific ideas gets far less interest than it should.
I assumed it's because plants can't run, so they need a larger chemical arsenal to convince predators not to eat them and mates to breed with them, all with line of sight to the sun.
Likewise cold-blooded things like frogs have fairly massive genomes too, and I thought it was so that the organism could cope with the vast differences in chemical processes required to operate at different temperatures. We as warm-blooded creatures don't need to be as complicated.
"This is an adaptation to the desert environment where it lives. Because the ponds where they breed dries up fast in the desert, the tadpoles has to go through metamorphosis as fast as possible, which can occur just eleven days after the eggs were fertilized. A small genome gives small cells, and the smaller the cells are, the faster the tadpoles transform into small frogs and can escape the shrinking ponds"
"genome size is not proportional to the number of genes present in the genome ... In eukaryotes (but not prokaryotes), genome size is not proportional to the number of genes present in the genome, an observation that was deemed wholly counter-intuitive before the discovery of non-coding DNA and which became known as the "C-value paradox" as a result."
"Variation in C-values bears no relationship to the complexity of the organism or the number of genes contained in its genome; for example, some single-celled protists have genomes much larger than that of humans. ... C-values correlate with a range of features at the cell and organism levels, including cell size, cell division rate, and, depending on the taxon, body size, metabolic rate, developmental rate, organ complexity, geographical distribution, or extinction risk ...
> It was also fascinating to me how complex (and how much larger) plant genomes are compared to animals.
Perhaps a naive view, but plants have just a limited nervous system[1] and no brains. Doesn't it make sense that they compensate with more preprogrammed behavior in the form of a more complex genome?
The range of genome sizes for plant genomes is very large. "In animals they range more than 3,300-fold, and in land plants they differ by a factor of about 1,000" and "genome size is not proportional to the number of genes present in the genome", quoting https://en.wikipedia.org/wiki/Genome_size .
I knew about non-coding or junk DNA but had forgotten how extreme it was.
Ended up learning about the onion test[1] and this interesting article[2] which suggests classifying non-coding DNA as either junk DNA, spam DNA or both, to better capture the differences that exist.
I also found this article[3] providing some interesting overview and discussion on the topic.
I've just implemented this Blancmange curve using fewer than 140 characters of JavaScript. It's a fractal, also known as the Takagi curve. Source code and graphics: https://www.dwitter.net/d/28260 (it can be edited and remixed)
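The dwitter entry is golfed JavaScript, so for anyone who wants the construction spelled out, here's an ungolfed sketch of the textbook Takagi definition, T(x) = sum over n of s(2^n x)/2^n, where s(y) is the distance from y to the nearest integer (this is the standard formula, not the dwitter code itself):

    def blancmange(x, terms=30):
        # T(x) = sum_n s(2^n * x) / 2^n, where s(y) = distance from y to the nearest integer
        total = 0.0
        for n in range(terms):
            y = (2 ** n) * x
            total += abs(y - round(y)) / (2 ** n)
        return total

    for i in range(11):                    # coarse sample of the curve on [0, 1]
        x = i / 10
        print(f"T({x:.1f}) = {blancmange(x):.4f}")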
It's an easy flub, almost so easy as to be content-free. The correct statement is either "factoring into prime numbers" or "factoring semiprime numbers".
So when I see pedants picking this one out, I think people should be given a break! :)
I actually came here to delete my comment, because it occurred to me that it might sound mean-spirited. But it seems I am too late. I found the mistake amusing, that's all.
OH no! I've prevented you from deleting your comment! I wasn't being too hard on you. If you want, I can delete mine, and we can both delete everything before the 2 hour deadline?
Some widely used encryption schemes (RSA in particular) are based on our inability to factor large numbers into their prime factors. If you factor the number that's used as the public key into the two prime numbers it was composed from, then you have broken the encryption.
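To make that concrete, here's a toy sketch using the classic textbook RSA numbers (p=61, q=53, e=17); real keys use primes hundreds of digits long, which is exactly why the brute-force factoring step below stops being feasible:

    def trial_factor(n):
        # brute-force search for a nontrivial factor; only feasible for tiny n
        f = 2
        while f * f <= n:
            if n % f == 0:
                return f
            f += 1
        raise ValueError("n is prime")

    p, q, e = 61, 53, 17                  # toy primes and public exponent
    n = p * q                             # public modulus
    d = pow(e, -1, (p - 1) * (q - 1))     # private exponent

    cipher = pow(42, e, n)                # "encrypt" the message 42 with the public key

    p2 = trial_factor(n)                  # an attacker factors n...
    q2 = n // p2
    d2 = pow(e, -1, (p2 - 1) * (q2 - 1))  # ...recomputes the private exponent...
    print(pow(cipher, d2, n))             # ...and recovers 42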
Don't we still not fully understand how the body, and the bodies of other beings, work?
I don't understand how someone can be confident about quantifying phenotype robustness without complete knowledge of the physical properties of organic life.
Maybe in the robust category there are small changes to processes in the organisms, ones we do not fully understand, that build up until they cross the threshold of our perception.
"Maximum mutational robustness in genotype–phenotype maps follows a self-similar blancmange-like curve"
https://royalsocietypublishing.org/doi/10.1098/rsif.2023.016...