 DNA Fingerprinting, Genetics and Crime: DNA Testing and the Courtroom
Session 3

Determining the Frequency of the Genetic Profile in the Population

[DNA profiles]
Reprinted with permission from Nakamura et al., "Variable Number of Tandem Repeat (VNTR) Markers for Human Gene Mapping" Science 235, 1619, (1987) fig. 3, Copyright 1987 American Association for the Advancement of Science. (www.sciencemag.org)
A sampling of DNA profiles.

The frequency of the DNA profile obtained from the stain on White House intern Monica Lewinsky's dress was reported to be 1 in 7.9 trillion. Since the population of the world is estimated to be only a little more than 6 billion--much less than 7.9 trillion--the question naturally arises: Where does this number come from, and how was it calculated? As we will see, this calculation involves assumptions about the genetics of the population itself. A further issue is which population or population group should be used.

The value of the VNTR and STR genes for discriminating between individuals lies in the number of different forms, or alleles, they may take. As with most genes in the genome, two copies of each VNTR and STR locus are present in every cell, one copy inherited from the father and the other from the mother. Each VNTR or STR gene is located on one of the 23 pairs of chromosomes we possess in the nuclei of our cells. As explained earlier, the different alleles can be separated by size, because they possess differing numbers of the repeated elements, or motifs.

To understand how the frequency of a DNA profile is calculated, we need to introduce a couple of technical terms and some notation to allow us to understand the process. For any VNTR or STR gene, let us denote the alleles by the number of copies of the motif that the allele contains. Thus, for example one individual could possess an allele with 7 copies of the motif and one with 19 copies of the motif. We can write the genetic constitution or genotype of that individual as (7,19).

As we mentioned earlier, alleles may have as few as 7 copies of the motif and as many as 44. Thus, the total number of different alleles we could see would be 38. If there are 38 different alleles there will be many more different genotypes, (7,7), (7,8), (7,9) and so on. In fact we can calculate the total number of different genotypes, with a simple equation:


n  x  (n+1) / 2

where n is the number of different alleles. In our case the number of different alleles is 38, so the number of different genotypes will be 741.
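The count can be checked with a short Python sketch. It assumes, as in the text, that alleles are labelled by repeat counts from 7 to 44; the function name count_genotypes is ours and purely illustrative.

```python
from itertools import combinations_with_replacement

def count_genotypes(n_alleles):
    """Number of unordered allele pairs: n x (n+1) / 2."""
    return n_alleles * (n_alleles + 1) // 2

# Alleles labelled by repeat count, 7 through 44 inclusive.
alleles = range(7, 45)

# Enumerate every genotype explicitly: (7,7), (7,8), ..., (44,44).
genotypes = list(combinations_with_replacement(alleles, 2))

print(len(alleles))                   # 38
print(count_genotypes(len(alleles)))  # 741
print(len(genotypes))                 # 741
```

The explicit enumeration and the formula agree, because a genotype is an unordered pair in which the two alleles may be the same.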

One way to estimate the frequency of each genotype in the population would be to take a sample of the population and determine the genotype of each individual in the sample. However, with 741 different genotypes the task is daunting. We certainly cannot expect every genotype to occur at the same frequency, so even with a sample of 1,000 individuals we will not, simply by chance, find every one of the 741 genotypes. In fact, even with a sample of 10,000 individuals we may not see all the genotypes, even though they all exist in the population.

How, then, can we estimate the frequency of so many genotypes with a manageable sample size? We first estimate the frequencies of the different alleles, and then take advantage of a basic law in population genetics known as the Hardy-Weinberg Law.

The task of estimating the frequencies of the alleles in the population is considerably easier, as there are only 38 alleles, whereas there are 741 different genotypes. Moreover, each individual possesses 2 alleles for each VNTR or STR gene, so a sample of 500 individuals actually allows us to sample 1,000 alleles. To estimate the frequency of each allele, we simply determine the genotype of each of the 500 individuals and count the number of alleles of each type that we see.
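The counting procedure can be sketched in a few lines of Python. The sample of five individuals below is invented for illustration, and the function name estimate_allele_frequencies is ours.

```python
from collections import Counter

def estimate_allele_frequencies(genotypes):
    """Count every allele in a list of (allele, allele) genotypes and
    divide by the number of alleles sampled (two per individual)."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = 2 * len(genotypes)
    return {allele: count / total for allele, count in sorted(counts.items())}

# Hypothetical sample of five individuals (ten alleles), not real data.
sample = [(7, 19), (7, 7), (19, 23), (7, 23), (19, 19)]

freqs = estimate_allele_frequencies(sample)
print(freqs)  # {7: 0.4, 19: 0.4, 23: 0.2}
```

Note that a homozygous individual such as (7,7) contributes two copies of the same allele to the count.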


We use the word "estimate" instead of "calculate" or "determine" to distinguish between the allele frequency in our sample and the allele frequency in the whole population (for example, the total US white population). We are really only interested in the allele frequency in the population, not in the sample, and our calculation of allele frequency is our best "guess," or estimate, of the frequency in the whole population. As an estimate it is, of course, subject to error. However, using well-known statistical techniques we can calculate a range in which we expect the real (that is, the population) value to lie with a given probability--usually 95 percent. This range is known as the "confidence interval." Estimates of the frequency of a given DNA profile are typically presented with the confidence interval.
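The text does not say which statistical technique is used. As one possibility, the sketch below applies the common normal-approximation (Wald) interval to an allele frequency; this choice, the function name wald_interval, and the counts are all assumptions made for illustration.

```python
import math

def wald_interval(allele_count, total_alleles, z=1.96):
    """Normal-approximation interval for an allele frequency.
    z = 1.96 corresponds to roughly 95 percent coverage."""
    p_hat = allele_count / total_alleles
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / total_alleles)
    # Clip to [0, 1], since a frequency cannot fall outside that range.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Hypothetical counts: an allele seen 50 times among 1,000 sampled alleles.
low, high = wald_interval(50, 1000)
print(f"estimate {50 / 1000:.3f}, interval {low:.3f} to {high:.3f}")
```

This approximation behaves poorly for very rare alleles, where other interval methods are generally preferred.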

Once we have calculated the frequencies of the alleles, we can estimate the frequency of each genotype by multiplying the allele frequencies together. A useful analogy that has sometimes been used is that of a roulette wheel with 38 numbered slots. The chance that the pair of numbers 14 and 35 will come up in two spins of the wheel is:

2  x  1/38  x  1/38 = 1/722

We multiply 1/38 x 1/38 by two because there are two ways in which we can obtain the pair of numbers 14 and 35 in two spins of the wheel: 14 may come up first and then 35, or 35 may come up first and then 14. Notice, however, that there is only one way to obtain the pair of numbers 14 and 14: 14 must come up first and 14 must come up second, so the chance of that pair is simply 1/38 x 1/38 = 1/1,444. This is analogous to the process by which genotype frequencies are calculated from allele frequencies.
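The roulette arithmetic can be reproduced exactly with Python's fractions module. This is a minimal sketch assuming a fair wheel with 38 equally likely slots; the function name pair_probability is ours.

```python
from fractions import Fraction

def pair_probability(first, second, slots=38):
    """Chance that two spins of a fair wheel give the unordered pair
    (first, second). Two orderings exist unless the numbers are equal."""
    single = Fraction(1, slots)
    orderings = 1 if first == second else 2
    return orderings * single * single

print(pair_probability(14, 35))  # 1/722
print(pair_probability(14, 14))  # 1/1444
```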

Now imagine a roulette wheel that has slots of different sizes, so that the frequency with which the roulette ball ends up in each slot varies. Spinning this wheel 1,000 times and counting the number of times the ball ends up in each slot will allow us to estimate the sizes of the slots. These sizes are analogous to the allele frequencies.
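A small simulation illustrates the idea. The three slot sizes below are invented, and the function name estimate_slot_sizes is ours; a fixed seed is used only so that the run is repeatable.

```python
import random
from collections import Counter

def estimate_slot_sizes(slot_sizes, spins, seed=0):
    """Spin a wheel with unequal slots and return the observed
    fraction of spins that landed in each slot."""
    rng = random.Random(seed)
    slots = list(slot_sizes)
    weights = [slot_sizes[slot] for slot in slots]
    results = rng.choices(slots, weights=weights, k=spins)
    counts = Counter(results)
    return {slot: counts[slot] / spins for slot in slots}

# Hypothetical wheel with three slots of different sizes.
true_sizes = {"A": 0.5, "B": 0.3, "C": 0.2}
estimates = estimate_slot_sizes(true_sizes, spins=1000)

for slot in true_sizes:
    print(slot, "true", true_sizes[slot], "estimated", estimates[slot])
```

The estimates come close to the true sizes but do not match them exactly, which is why a frequency obtained from a sample is called an estimate.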


This simple procedure for calculating genotype frequencies from allele frequencies is an application of the so-called Hardy-Weinberg principle, named after an English mathematician and a German physician. By multiplying the allele frequencies together, we assume that the combinations of alleles in each individual occur at random, and that each copy of each allele has the same chance of being passed on to the next generation--from parent to child. This is perhaps not unreasonable if we consider that a man and a woman do not decide to have children on the basis of their respective VNTR or STR genotypes.
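Under these assumptions, the genotype frequency is the square of the allele frequency for a pair of identical alleles, and twice the product of the two allele frequencies for a pair of different alleles. The sketch below applies this rule to three invented allele frequencies; the function name genotype_frequency is ours.

```python
from itertools import combinations_with_replacement

def genotype_frequency(a, b, allele_freqs):
    """Expected genotype frequency when alleles combine at random:
    p x p for identical alleles, 2 x p x q for different ones."""
    p, q = allele_freqs[a], allele_freqs[b]
    return p * p if a == b else 2 * p * q

# Hypothetical allele frequencies for a locus with only three alleles.
allele_freqs = {7: 0.4, 19: 0.4, 23: 0.2}

pairs = list(combinations_with_replacement(sorted(allele_freqs), 2))
for a, b in pairs:
    print((a, b), round(genotype_frequency(a, b, allele_freqs), 4))

# The genotype frequencies account for the whole population.
total = sum(genotype_frequency(a, b, allele_freqs) for a, b in pairs)
print(round(total, 4))  # 1.0
```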

But what if an individual possessing a particular allele or combination of alleles is more resistant to a life-threatening infectious disease--or, conversely, is more liable to contract cancer? Then we may expect that certain genotypes will be more frequent--or less frequent--than expected from the simple multiplication of allele frequencies, and this method will not reliably estimate genotype frequencies. We do not know the function of VNTR and STR loci, and perhaps they have none. However, we do know of other genes that have such effects. For example, some genes are known to increase resistance to infection with HIV (Human Immunodeficiency Virus), the AIDS virus, and others to increase susceptibility to breast cancer.

Thinking Point
Researchers have identified the genetic sources of various diseases, and in some cases can test for them. While tests for some diseases--such as phenylketonuria (PKU)--are virtually 100 percent predictive, tests for others--such as breast cancer--can only tell whether the likelihood of contracting the disease is increased. Carrying a gene associated with breast cancer does not mean one will develop it. This marks the difference between predictive tests and susceptibility tests.

Consequently, STR and VNTR loci have been the subject of a number of studies to determine the relationship between allele frequencies and genotype frequencies. There is now little disagreement that VNTR and STR genotype frequencies can be reliably determined from allele frequencies using the Hardy-Weinberg principle.

Until now we have tacitly assumed that we can unambiguously determine the number of motif repeats that a particular allele contains from the distance a DNA fragment migrates on an electrophoretic gel. In other words, we should be able to see distinct migration distances for the different alleles, corresponding to 7 repeats, 8 repeats, 9 repeats and so on. In the real world this is often not possible; we actually see a more or less continuous distribution of migration distances. To deal with this problem, we consider that bands that migrate within a certain range are operationally the same and fall into the same "bin." Thus, all bands that fall within a certain "bin" are considered to represent a single allele, even though in reality they may include alleles with similar numbers of repeats, say 25, 26 and 27.

This "binning" operation does not invalidate in any way the simple calculations made above. Neither is it related in any way to the criteria for determining a match. However, because the number of alleles defined by "bins" will be smaller than the number of alleles that actually exist, it is a conservative procedure: the frequencies of genotypes defined by bins will be higher than the frequencies of genotypes defined by the true alleles. Imagine, for example, that we define only one "bin" covering the entire length of the electrophoretic gel. Then there would be only one "allele" and only one "genotype"; everyone would be the same, and the procedure would have no power to distinguish between individuals.
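The conservative effect of binning can be seen in a small numerical sketch. All frequencies below are invented, and the function name heterozygote_frequency is ours; a bin is assumed to take the summed frequency of the alleles it contains.

```python
def heterozygote_frequency(p, q):
    """Expected frequency of a genotype with two different alleles."""
    return 2 * p * q

# Hypothetical true frequencies of three alleles that share one bin.
binned_alleles = {25: 0.02, 26: 0.03, 27: 0.01}
bin_frequency = sum(binned_alleles.values())  # frequency of the whole bin

# Hypothetical frequency of a second allele lying in a different bin.
other_allele = 0.05

# The same person, described by the true allele 26 or by the bin it falls in.
true_genotype = heterozygote_frequency(binned_alleles[26], other_allele)
binned_genotype = heterozygote_frequency(bin_frequency, other_allele)

print(round(true_genotype, 4))    # 0.003
print(round(binned_genotype, 4))  # 0.006
```

Because the bin frequency can never be smaller than that of any allele inside it, the binned genotype frequency can never be smaller than the true one.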

The simple calculations (given in the previous section) allow us to calculate the frequency of a DNA profile at one VNTR or STR locus. The frequencies of the alleles vary somewhat, some being more frequent and some less, but depending on the particular genotype we can expect most genotype frequencies to lie in the range of 1 in 500 to 1 in 2,000. For simplicity, let us assume the frequency is 1 in 1,000. How can we reconcile numbers like these with the numbers that have appeared in the press in recent celebrated cases? For example, the frequency of the profile of the DNA obtained from the stain on Monica Lewinsky's dress was 1 in 7.9 trillion! The answer lies in the use of several different VNTR and STR loci.

A number of such loci are known, located on practically all of the 23 different human chromosomes. Although two genes located on the same chromosome tend to be inherited together, genes that are located on separate chromosomes should be inherited independently. If we make certain assumptions--for example, that combinations of genotypes at different loci do not increase or decrease the chances of contracting a life-threatening disease--we can multiply the frequencies of the genotypes at each locus. As many as 10 different loci, all located on different chromosomes, may be used, and it is easy to see that the frequency of the combination of genotypes at all 10 loci can be vanishingly small. For example, if the frequency of a genotype at one locus is 1 in 1,000, and the frequency of the genotype at a second locus is also 1 in 1,000, then the frequency of the combination of genotypes at both loci will be 1 in 1,000,000. As the number of loci increases, the frequency of the combination becomes correspondingly lower.
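The multiplication across loci can be sketched as follows, using the simplifying figure of 1 in 1,000 per locus from the text. The function name profile_frequency is ours, and the calculation is valid only if the loci are inherited independently.

```python
from fractions import Fraction
from math import prod

def profile_frequency(locus_frequencies):
    """Multiply the genotype frequencies at each locus.
    Assumes the loci are independent of one another."""
    return prod(locus_frequencies)

one_in_thousand = Fraction(1, 1000)

print(profile_frequency([one_in_thousand] * 2))  # 1/1000000

# Ten loci at 1 in 1,000 each give 1 in 10 to the power 30.
ten_loci = profile_frequency([one_in_thousand] * 10)
print(ten_loci == Fraction(1, 10**30))  # True
```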


