With a current population size of over 7 billion, the human population should contain a huge amount of genetic variation. Most of it resides in junk DNA so it's of little consequence. We would like to know more about the amount of variation in functional regions of the genome because it tells us something about population genetics and evolutionary theory.
A recent paper in Nature (Aug. 2016) looked at a large dataset of 60,706 individuals. They sequenced the protein-coding regions of all these people to see what kind of variation existed (Lek et al., 2016) (ExAC). The group included representatives from all parts of the world although it was heavily weighted toward Europeans. The authors used a procedure called "principal component analysis" (PCA) to cluster the individuals according to their genetic characteristics. The analysis led to the typical clustering by "population clusters." (That term is used to avoid the words "race" and/or "subspecies.")
It's difficult to figure out which sequences were analyzed other than the fact they were exons. They say that all exons were sequenced but they don't say much about how many and the total sequence length. As near as I can figure out, they sequenced about 60 Mb (60 million base pairs) of exon sequences. This is way more than the total amount of coding sequence in the human genome (~26 Mb). That's because they sequenced 50 bp on each side of an exon for an additional 100 bp for every exon.
The authors identified 7,404,909 high-quality variants in the population of 60,706 individuals. Of these, the majority (>95%) were single nucleotide polymorphisms (SNPs). The rest were insertions or deletions. More than half of all the variants (54%) were singletons—mutations present in a single individual out of more than 60 thousand. The vast majority (99%) of all variants were present at a frequency of less than 1%. (There would be more than 400 million variants in the entire genome.)
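The arithmetic behind these numbers can be sketched in a few lines. This is only a rough illustration: the variant count and singleton fraction are the figures quoted above, the 60 Mb target is my estimate from the previous paragraph, and the 3.2 Gb genome size is an assumption. The genome-wide number also assumes that variants are as dense outside the sequenced regions as inside them, which is a simplification.

```python
# Figures quoted in the text (Lek et al., 2016)
TOTAL_VARIANTS = 7_404_909
SINGLETON_FRACTION = 0.54

# Assumptions, not values stated in the paper
SEQUENCED_BP = 60e6   # ~60 Mb of sequenced target (estimate from the text)
GENOME_BP = 3.2e9     # ~3.2 Gb haploid human genome


def singleton_count(total_variants, fraction):
    """Approximate number of variants seen in only one individual."""
    return round(total_variants * fraction)


def variants_per_bp(total_variants, sequenced_bp):
    """Average number of variants per sequenced base pair."""
    return total_variants / sequenced_bp


def genome_wide_estimate(total_variants, sequenced_bp, genome_bp):
    """Scale the observed density linearly to the whole genome."""
    return variants_per_bp(total_variants, sequenced_bp) * genome_bp


singletons = singleton_count(TOTAL_VARIANTS, SINGLETON_FRACTION)
density = variants_per_bp(TOTAL_VARIANTS, SEQUENCED_BP)
genome_total = genome_wide_estimate(TOTAL_VARIANTS, SEQUENCED_BP, GENOME_BP)

print(f"singletons: about {singletons:,}")
print(f"one variant every {1 / density:.1f} bp of sequenced target")
print(f"genome-wide estimate: about {genome_total / 1e6:.0f} million variants")
```

With these inputs the estimate comes out just under 400 million. A slightly larger genome size, or a higher variant density in non-coding DNA, pushes it over that mark.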
It looks like there are about 10,000 genes that have premature stop codons or frameshift mutations that should prevent synthesis of a functional protein. These are loss-of-function mutations. They are not lethal because only one of the two copies is mutated. The data can be used to calculate how many genes cannot tolerate a loss-of-function mutation because losing one copy would be lethal. These are haploinsufficient genes and there are about 3,000 of them in the human genome.
By comparing the variants to presumed disease-causing mutations in medical databases, the authors calculated that the average person in their database is heterozygous for 54 such variants. Even though the individuals don't suffer from the genetic disease, this is too high a genetic load according to most population genetics models. Many of the presumed disease-causing alleles are present at frequencies greater than 1%, suggesting they may have been mischaracterized as disease-causing mutations (see note 1).
Several of the variants have been reclassified as benign based on the result of this study. The authors point out that filtering based on allele frequency is important: high-frequency alleles are probably not disease-causing in most instances. They also point out that even many low-frequency alleles previously suspected of causing genetic disease have probably been mischaracterized (see note 2). The average individual in their group contains about 1 dominant disease-causing mutation and that's probably too high, suggesting that some of these mutations are probably benign.
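Filtering by allele frequency is simple enough to sketch. The example below is not the authors' pipeline. The variant names and allele counts are made up for illustration, the 1% cutoff is an assumed threshold borrowed from the frequency mentioned above, and the chromosome count assumes that every site was successfully called in all 60,706 individuals.

```python
from dataclasses import dataclass

N_INDIVIDUALS = 60_706
N_CHROMOSOMES = 2 * N_INDIVIDUALS  # diploid; assumes complete calls at every site


@dataclass
class Variant:
    name: str                 # hypothetical label, not a real variant ID
    allele_count: int         # number of chromosomes carrying the variant
    claimed_pathogenic: bool  # listed as disease-causing in some database


def allele_frequency(variant, n_chromosomes=N_CHROMOSOMES):
    """Fraction of sampled chromosomes that carry the variant."""
    return variant.allele_count / n_chromosomes


def too_common_to_be_pathogenic(variants, max_af=0.01):
    """Names of claimed disease variants whose frequency exceeds max_af."""
    return [
        v.name
        for v in variants
        if v.claimed_pathogenic and allele_frequency(v) > max_af
    ]


# Made-up example data
variants = [
    Variant("var_A", 1, True),        # a singleton
    Variant("var_B", 2_500, True),    # about 2% of chromosomes
    Variant("var_C", 900, True),      # about 0.7% of chromosomes
    Variant("var_D", 40_000, False),  # common, never claimed to cause disease
]

flagged = too_common_to_be_pathogenic(variants)
print(flagged)
```

Only var_B is flagged here. Note that a singleton in this dataset has a frequency of 1 in 121,412 chromosomes, so a frequency cutoff alone says nothing about whether very rare variants are harmful.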
There are 179,774 different variants classified as protein-truncating variants (PTV). These are mostly due to conversion of a normal codon to a stop codon. They are assumed to be loss-of-function mutations as described above. Many of them are present at significant frequencies in the population. The average person has 85 heterozygous and 35 homozygous PTVs. This suggests there are quite a few genes that can be inactivated (both copies) without significant effect. One of these would be the gene causing O-type blood [Is the high frequency of blood type O in native Americans due to random genetic drift?].
The highlight of the paper in both the abstract and the discussion is the fact there are 3,230 genes that are highly loss-of-function intolerant. Knocking out even a single copy is probably lethal. Most (72%) are not associated with any known genetic disease. This should not be a surprise. These genes are likely housekeeping genes such as those that encode RNA polymerase subunits or important enzymes. Any disruption of function will be lethal and they never appear as disease-causing. Most genetic diseases occur in genes that are not essential for cell survival. The most important genes are rarely associated with genetic diseases, a counter-intuitive fact that should be more widely disseminated.
The final comment in the paper concerns the total amount of variation in the entire human population. The authors point out that nearly every possible mutation should be present in a population of 7 billion individuals. The only exceptions should be dominant lethal mutations. Even with a sample size of 60,000 individuals they are only able to detect a fraction of the total variants present in humans.
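A back-of-the-envelope calculation shows why the saturation claim is plausible. The numbers below are my assumptions, not figures from the paper: roughly 70 new single-nucleotide mutations per person, a 3.2 Gb genome, and a uniform mutation rate across sites (real rates vary a lot from site to site). The sketch counts only mutations that arose new in living people, so it ignores inherited variation and selection.

```python
import math

POPULATION = 7e9               # from the text
NEW_MUTATIONS_PER_PERSON = 70  # assumption: new single-nucleotide mutations per person
GENOME_BP = 3.2e9              # assumption: haploid genome size
ALTERNATIVES_PER_SITE = 3      # each base can change to three others


def expected_hits_per_change(population, per_person, genome_bp):
    """Expected number of living people carrying a given new single-base change."""
    possible_changes = genome_bp * ALTERNATIVES_PER_SITE
    return population * per_person / possible_changes


def prob_change_absent(expected_hits):
    """Poisson probability that a given change arose in nobody."""
    return math.exp(-expected_hits)


hits = expected_hits_per_change(POPULATION, NEW_MUTATIONS_PER_PERSON, GENOME_BP)
absent = prob_change_absent(hits)

print(f"expected new occurrences of each possible change: about {hits:.0f}")
print(f"chance that a given change is absent: {absent:.1e}")
```

Under these assumptions each possible single-base change arises about 50 times among people alive today, so the chance that any particular one is missing is vanishingly small, unless it kills its carrier.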
1. Yes, the authors and I know about balancing selection and sickle-cell disease.
2. This is one of the problems with genetic testing.
Lek, M., Karczewski, K., Minikel, E., Samocha, K., Banks, E., Fennell, T., O'Donnell-Luria, A., Ware, J., Hill, A., Cummings, B. et al. (2016) Analysis of protein-coding genetic variation in 60,706 humans. Nature 536:285-291. [doi: 10.1038/nature19057]