Review Of The Statistical Part Of The RG Article

2 October 2006

    The statistical part of RG's article (attributable to Dr. Grant) is only one-third as long as the metallurgical part (by Dr. Randich). The statistical part is also easier to review, since it contains only two basic ideas.
    This statistical critique is mine alone. Even though Larry Sturdivan knows much more about statistics than I do, it is not necessary to be expert in statistics to see the flaws in the statistical part of the RG paper, since they are one part basic logic and one part very basic statistics.

The basic statistical argument
    (1) Dr. Guinn used uncertainties for his analytical data that are too small.
    (2) When the proper uncertainties are used, his two apparent groups of crime-scene fragments merge into one big one.

The analytical uncertainties
    The paper notes that Guinn reported 1-σ analytical uncertainties (from counting statistics alone), and then noted that the full analytical uncertainties should be 2–3 times larger than that. Both statements are correct, since they come directly from Guinn's testimony to the HSCA. (For the record, Guinn's factor of 2–3 also corresponds well with my experience in NAA.)

Using the proper uncertainties
    This part of the statistical argument is the important one. It has several parts, which need to be isolated and considered individually.
    Weighted averages. The paper proposes that weighted averages be used when calculating mean concentrations of antimony in the two apparent groups of fragments. That is certainly a reasonable approach. Weighted averages of the two groups would then be 811 ± 17 ppm antimony (the body shot) and 627 ± 10 ppm. The weighting has reduced the standard deviations of these means from 3.1% to 2.1%, and from 3.2% to 1.6%, respectively. These calculations seem very reasonable, although this conclusion does not affect the basic logic to come.
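    The weighted averages here are presumably inverse-variance weighted means. As a check on the arithmetic, here is a minimal sketch of that calculation (with hypothetical input values, since the individual fragment concentrations are not reproduced in this review):

```python
import math

def weighted_mean(values, sigmas):
    """Inverse-variance weighted mean and its standard error."""
    weights = [1.0 / s**2 for s in sigmas]
    mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    sigma_mean = 1.0 / math.sqrt(sum(weights))
    return mean, sigma_mean

# Hypothetical example: two measurements with equal uncertainties.
# When the n uncertainties are equal, the weighted mean's uncertainty
# shrinks by a factor of sqrt(n) -- roughly matching the reductions
# quoted above (3.1% -> 2.1% for n = 2, and 3.2% -> 1.6% for n = 4).
m, s = weighted_mean([800.0, 822.0], [25.0, 25.0])
print(m, s)  # 811.0 and 25/sqrt(2), about 17.7
```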
    The article then considers the effect of weighting on the averages of Guinn's reproducibility studies for two WCC/MC bullets. With weighting, the article comes up with an average standard deviation (the spread of the four replicate measurements on each bullet) of 4.3%. We need not review this value further here, however, because it is immediately dropped on the grounds that it is built into the next uncertainty, which is claimed to arise from chemical heterogeneity of the bullets. That last point (equating uncertainty of measurement with heterogeneity within a bullet) is where the article goes seriously wrong, and we now review it carefully.
    Uncertainty vs. heterogeneity. The article notes that Guinn's measurements of overall heterogeneity (four samples from each of three MC bullets) would average to 3.2%, 21%, and 14% when weighted averages were used. That would make an overall average of 13% for the heterogeneity of WCC/MC lead. It notes that the 4.3% from analytical reproducibility (previous paragraph) is built into this number, and so need not be considered further. So far, so good. But then it takes a giant leap and claims that the heterogeneity of the full bullet should be considered as the analytical uncertainty, or "overall accuracy," of the individual NAA measurements. In other words, no matter how well Sb is measured in an individual fragment, its true uncertainty will always be 13% because it cannot represent the bullet as a whole to better than 13%.
    This idea seems to be based on an assumption that an MC bullet possesses a meaningful average concentration of antimony, which in turn seems to hearken back to the article's metallurgical idea that all the chemical variations in lead fragments are a consequence of small-scale metallurgy. According to this view, WCC/MC lead is homogeneous overall, and its fragments display only those variations that are caused by different micro-origins: some coming more from centers of crystals and others more from edges of crystals. Thus the statistics and the metallurgy in this article are reflecting a common viewpoint. Unfortunately for them, that viewpoint has been shown by other chemical data to be wrong. (See the metallurgical critique.)
    The article then applies the 13% heterogeneity to the concentrations of antimony in the fragments from the crime scene. It first gives the individual concentrations an uncertainty of 13%, then uses that to calculate the weighted averages and standard deviations for the two groups. It gets 814 ± 75 ppm for the upper group and 622 ± 41 ppm for the lower group. Note that although these means are nearly the same as for the weighted averages given above, the uncertainties are four times greater. That means that the error bars for the groups are four times larger than before, which would make the groups overlap more easily.
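    Those enlarged uncertainties are consistent with simply propagating a flat 13% relative uncertainty through the averaging: for a group of n fragments, the uncertainty of the mean comes out near 13% of the mean divided by √n. A quick sketch of that arithmetic check (this is my consistency check, not the paper's actual computation):

```python
import math

def sigma_of_mean(mean, rel_uncertainty, n):
    """Uncertainty of the mean of n values that each carry the same
    relative uncertainty (here, the paper's 13% 'heterogeneity')."""
    return rel_uncertainty * mean / math.sqrt(n)

print(round(sigma_of_mean(814, 0.13, 2)))  # about 75, as quoted for the upper group
print(round(sigma_of_mean(622, 0.13, 4)))  # about 40, close to the quoted 41
```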
    Separation of groups. The article then estimates the probability that the groups overlap by examining the confidence interval at which their error bars would just touch. ("Two distinct sample populations may be affirmed until error bars overlap, at which point the data become consistent with but one population.") That turns out to be 1.6σ rather than the 4.2σ from Guinn's 1-σ analytical uncertainties. Those values correspond to probabilities of 89% and >99.99%, respectively. They mean that under this view (heterogeneity representing uncertainty), the groups do not meet the conventional scientific standard for being distinct (95% or 98%, depending on whom you listen to).
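    The "error bars just touch" criterion can be reproduced from the group means alone: the bars touch at k sigma when the gap between the means equals k times the sum of the two uncertainties. A sketch of that arithmetic, using the paper's 13%-based values from above (the resulting k and its two-sided normal coverage land near the 1.6σ and 89% just quoted):

```python
import math

def touch_sigma(m1, s1, m2, s2):
    """Number of sigmas k at which the two error bars just touch:
    m1 - k*s1 == m2 + k*s2  =>  k = (m1 - m2) / (s1 + s2)."""
    return (m1 - m2) / (s1 + s2)

k = touch_sigma(814, 75, 622, 41)
coverage = math.erf(k / math.sqrt(2))  # two-sided normal coverage at k sigma
print(k, coverage)  # roughly 1.7 sigma and roughly 90% coverage
```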
    But this approach is a sort of worst-case analysis. It claims that two groups are no longer distinct the moment that their extremes just touch. The more appropriate test is the classical one that evaluates two populations by examining whether the difference in their means could have arisen by chance, that is, from random variations in the samples that made the groups look different when the underlying populations were actually the same. We can try this test on the sets of fragments in two different ways—on the weighted means and standard deviations built from the 13% "uncertainties" (the paper's recommended values) and on the weighted means and standard deviations that come from Dr. Guinn's report.
    The paper's preferred set of values (derived from the 13% "uncertainties"), 814 ± 75 ppm (n = 2) and 626 ± 41 ppm (n = 4), gives probabilities of 0.0137 for two-sided analysis and half that (0.0068) for one-sided analysis. In other words, the paper's own data show that there is a probability of 98.6% (or 99.3%) that the two groups cannot be said to be the same (informally speaking). That meets any reasonable scientific test.
    But those calculations are based on "analytical uncertainties" that are inflated by a wrong view of impurities in MC lead. If we use the correct view, namely that the true analytical uncertainties are just those coming directly from Dr. Guinn's NAA (multiplied by the factors that he recommended), we get an even stronger result. We can do this in three steps. The first is to take the paper's weighted averages for Guinn's 1-σ results (from counting statistics alone), which will make the groups appear more separated than they really are. The values are 811 ± 17 ppm (n = 2) and 627 ± 10 ppm (n = 4). The probabilities of the different means arising by chance are 0.0001 and 0.0000 (limited by the readout of the computer program), respectively. Doubling the uncertainties (to 2σ) gives 811 ± 34 ppm (n = 2) and 627 ± 20 ppm (n = 4), and probabilities of 0.0009 and 0.0005. Tripling the uncertainties (to 3σ) gives 811 ± 51 ppm (n = 2) and 627 ± 30 ppm (n = 4), and probabilities of 0.0043 and 0.0021. In other words, even under Guinn's worst case (3σ), the probabilities that the groups came from a single population are well below 1 in 100. The groups do not merge into one big group, as claimed in the paper. You can easily tell the difference between the groups, as is obvious from any plot of the Sb in the fragments. Things are exactly as they seem at first glance.
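    The conventional test referred to here compares the difference in means against its combined uncertainty. The exact program and test the author used are not identified in this review, so the sketch below uses a simple two-sample z-test (normal approximation) on the reported weighted means and their uncertainties; it illustrates the method and the trend under 1σ, 2σ, and 3σ scaling, though it will not reproduce the quoted p-values exactly.

```python
import math

def two_sample_z(m1, se1, m2, se2):
    """z statistic and two-sided p-value for a difference of two means,
    each reported with its standard error (normal approximation)."""
    z = abs(m1 - m2) / math.hypot(se1, se2)
    p_two_sided = math.erfc(z / math.sqrt(2))
    return z, p_two_sided

# Scale Guinn's 1-sigma counting-statistics uncertainties by 1, 2, and 3,
# as in the three steps above (811 +/- 17 ppm vs. 627 +/- 10 ppm).
for factor in (1, 2, 3):
    z, p = two_sample_z(811, 17 * factor, 627, 10 * factor)
    print(factor, round(z, 2), p)
# Even at the 3-sigma step, p stays far below 0.01:
# the two groups remain clearly separate.
```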

Summary
    (1) The paper is correct when it states that Dr. Guinn's 1-σ analytical uncertainties are too small. But he acknowledged that.
    (2) The paper quite reasonably proposes that weighted averages be used in calculating means and standard deviations of the two groups of crime-scene fragments, although it makes little difference in practice.
    (3) The paper proposes that the "overall accuracy" of an analytical measurement be equated to the variable's overall heterogeneity in a bullet, which averages 13% for Sb. This unconventional idea is based on a view of impurities in MC lead that has been shown by other data to be wrong.
    (4) The paper uses confidence intervals and "overall accuracy" to claim that there is only an 89% probability that the groups of crime-scene fragments are drawn from different populations, which is well below the normal threshold for accepting the difference.
    (5) When the conventional test for difference of means is used instead, the probabilities rise to ≥99%, which means that the groups can in practice be considered to be separate.

Conclusions
    (1) The statistical part of RG's paper fails because it made two basic errors.
    (2) This mirrors the metallurgical part of the paper, which also failed by making two basic errors.
    (3) Thus the paper as a whole fails.

(Note: I am posting this part of the review a little earlier than I would have preferred, because I will be traveling for the better part of the next month and wanted to get it out for others to see. There are two or three places where I may strengthen the arguments later, and I reserve the right to do so.)
