Review Of The Statistical Part Of The RG Article
2 October 2006
The statistical part of RG's
article (attributable to Dr. Grant) is only one-third as long as the metallurgical part (by
Dr. Randich). The statistical part is also easier to review, since it contains
only two basic ideas.
This statistical critique is mine alone.
Even though Larry Sturdivan knows much more about statistics than I do, it is
not necessary to be expert in statistics to see the flaws in the statistical
part of the RG paper, since they are one part basic logic and one part very
basic statistics.
The basic statistical argument
(1) Dr. Guinn used uncertainties for his analytical data that
are too small.
(2) When the proper uncertainties are used, his two apparent
groups of crime-scene fragments merge into one big one.
The analytical uncertainties
The paper notes that Guinn reported 1-σ
analytical uncertainties (from counting statistics alone) and then observes that
the full analytical uncertainties should be 2–3 times larger. Both
statements are correct, since they come directly from Guinn's testimony to the HSCA. (For the record, Guinn's factor of 2–3
also corresponds well with my experience in NAA.)
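For reference, a 1-σ uncertainty from counting statistics alone follows directly from Poisson statistics: a gamma-ray peak of N net counts carries a relative 1-σ uncertainty of 1/√N. A minimal sketch (the count value is hypothetical, not from Guinn's data):

```python
import math

def counting_uncertainty(counts):
    """Relative 1-sigma uncertainty from Poisson counting statistics alone."""
    return 1.0 / math.sqrt(counts)

# Hypothetical peak of 10,000 net counts: 1% counting uncertainty.
rel = counting_uncertainty(10_000)
print(f"{rel:.1%}")  # 1.0%

# Guinn's factor of 2-3 then brackets the full analytical uncertainty.
full_low, full_high = 2 * rel, 3 * rel
```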
Using the proper uncertainties
This part of the statistical argument is the important one.
It has several parts, which need to be isolated and considered individually.
Weighted averages. The paper proposes that weighted averages be used when
calculating mean concentrations of antimony in the two apparent groups of
fragments. That is certainly a reasonable approach. Weighted averages of the two
groups would then be 811 ± 17 ppm antimony (the body shot) and 627 ± 10 ppm. The
weighting has reduced the standard deviations of these means from 3.1% to 2.1%,
and from 3.2% to 1.6%, respectively. These calculations seem very reasonable,
although this conclusion does not affect the basic logic to come.
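These weighted averages follow the standard inverse-variance recipe. A minimal sketch in Python (the input values here are hypothetical, since the fragment-level concentrations are not reproduced in this review):

```python
import math

def weighted_mean(values, sigmas):
    """Inverse-variance weighted mean and its standard error."""
    weights = [1.0 / s**2 for s in sigmas]
    wsum = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / wsum
    return mean, 1.0 / math.sqrt(wsum)

# With equal sigmas this reduces to the ordinary mean,
# with standard error sigma / sqrt(n).
m, se = weighted_mean([800.0, 820.0], [20.0, 20.0])  # -> 810.0, 20/sqrt(2)
```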
The article then considers the effect of weighting
on the averages of Guinn's reproducibility studies for two WCC/MC bullets. With
weighting, the article comes up with an average standard deviation (the spread
of the four replicate measurements on each bullet) of 4.3%. We
need not review this value further here, however, because the article
immediately drops it on the grounds that it is built into
the next uncertainty, which is claimed to arise from chemical heterogeneity of
the bullets. That last point (equating uncertainty of measurement with
heterogeneity within a bullet) is where the article goes seriously wrong, and we
now review it
carefully.
Uncertainty vs. heterogeneity. The article notes that Guinn's measurements of
overall heterogeneity (four samples from each of three MC bullets) come to 3.2%,
21%, and 14% when weighted averages are used, for an overall average of 13% for
the heterogeneity of WCC/MC lead. It notes that the 4.3% from analytical reproducibility (previous paragraph) is built into this number, and so need not be
considered further. So far, so good. But then it takes a giant leap and claims
that the heterogeneity of the full bullet should be considered as the analytical uncertainty, or "overall
accuracy," of the individual NAA measurements. In other words, no matter
how well Sb is measured in an individual fragment, its true uncertainty will
always be 13%, because it cannot represent the bullet as a whole to better than
13%.
This idea seems to be based on an assumption that an
MC bullet possesses a meaningful average concentration of antimony, which in
turn seems to hearken back to the article's metallurgical idea that all the
chemical variations in lead fragments are a consequence of small-scale
metallurgy. According to this view, WCC/MC lead is homogeneous overall, and its fragments display
only those variations that are caused by different microorigins—some coming more from centers of
crystals and others more from edges of crystals. Thus the statistics and
the metallurgy in this article are reflecting a common viewpoint. Unfortunately for them, that viewpoint has been shown by
other chemical data to be wrong. (See the
metallurgical critique.)
The article then applies the 13% heterogeneity to the
concentrations of antimony in the fragments from the crime scene. It first gives the
individual concentrations an uncertainty of 13%, then uses that to calculate the
weighted averages and standard deviations for the two groups. It gets 814 ± 75 ppm for the upper group and 622 ± 41 ppm for the lower group. Note that
although these means are nearly the same as for the weighted averages given
above, the uncertainties are four times greater. That means that the error bars
for the groups are four times larger than before, which would make the groups
overlap more easily.
Separation of groups. The article then estimates the probability that the groups
overlap by examining the confidence interval at which their error bars would
just touch. ("Two distinct sample populations may be affirmed until error bars
overlap, at which point the data become consistent with but one population.") That turns out to be 1.6σ rather than
the 4.2σ from Guinn's 1-σ analytical uncertainties. Those values correspond
to probabilities of 89% and >99.99%, respectively. They mean that under this
view (heterogeneity representing uncertainty), the groups do not meet the
conventional scientific standard for being distinct (95% or 98%, depending on
whom you listen to).
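The touching criterion can be made explicit. For means m1 > m2 with standard errors s1 and s2, error bars of half-width k·s just touch when k = (m1 − m2)/(s1 + s2), and the corresponding two-sided normal confidence is erf(k/√2). A sketch with the paper's 13%-based group values, assuming the ± figures are the standard errors of the means, lands near the 1.6σ and 89% quoted above:

```python
import math

def touching_k(m1, s1, m2, s2):
    """Multiple k of the standard errors at which the two error bars just touch."""
    return abs(m1 - m2) / (s1 + s2)

def two_sided_confidence(k):
    """Two-sided normal confidence level corresponding to +/- k sigma."""
    return math.erf(k / math.sqrt(2))

k = touching_k(814, 75, 622, 41)  # the paper's 13%-based group values
print(f"bars touch at {k:.1f} sigma -> {two_sided_confidence(k):.0%} confidence")
```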
But this approach is a sort of worst-case analysis. It claims
that two groups are no longer distinct the moment that their extremes just
touch. The more appropriate test is the classical one that evaluates two
populations by examining whether the difference in their means could have arisen
by chance, that is, from random variations in the samples that made the groups
look different when the underlying populations were actually the same. We can
try this test on the sets of fragments in two different ways—on the weighted
means and standard deviations built from the 13% "uncertainties" (the paper's
recommended values) and on the weighted means and standard deviations that come
from Dr. Guinn's report.
The paper's preferred set of values (derived from the 13%
"uncertainties"), 814 ± 75 ppm (n = 2) and 626 ± 41 ppm (n = 4), gives
probabilities of 0.0137 for two-sided analysis and half that (0.0068) for
one-sided analysis. In other words, the paper's own data show that there is a
probability of 98.6% (or 99.3%) that the two groups cannot be said to be the
same (informally speaking). That meets any reasonable scientific test.
But those calculations are based on "analytical
uncertainties" that are inflated by a wrong view of impurities in MC lead. If we
use the correct view, namely that the true analytical uncertainties are just
those coming directly from Dr. Guinn's NAA (multiplied by the factors that he
recommended), we get an even stronger result. We can do this in three steps. The
first is to take the paper's weighted averages for Guinn's 1-σ
results (from counting statistics alone), which will make the groups appear more
separated than they really are. The values are 811 ± 17 ppm (n = 2) and
627 ± 10 ppm (n = 4). The probabilities of the different means arising by chance
are 0.0001 and 0.0000 (limited by the readout of the computer program),
respectively. Doubling the uncertainties (to 2 σ)
gives 811 ± 34 ppm (n = 2) and 627 ± 20 ppm (n = 4), and probabilities of
0.0009 and 0.0005. Tripling the uncertainties (to 3 σ)
gives 811 ± 51 ppm (n = 2) and 627 ± 30 ppm (n = 4), and probabilities of
0.0043 and 0.0021. In other words, even under Guinn's worst case (3σ),
the probabilities that the groups came from a single population are well below 1
in 100. The groups do not merge into one big group, as claimed in the paper. You
can easily tell the difference between the groups, as is obvious from any plot
of the Sb in the fragments. Things are exactly as they seem at first glance.
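The three steps above can be sketched as follows. The computer program used for the quoted probabilities is not specified, so this sketch uses a simple normal (z) approximation on the weighted means and standard errors rather than a small-sample t test; the exact p-values therefore differ somewhat from those in the text, but the conclusion (probabilities well below 1 in 100 even at 3σ) holds either way:

```python
import math

def two_sided_p(m1, se1, m2, se2):
    """Normal-approximation two-sided p-value for a difference of two means."""
    z = abs(m1 - m2) / math.sqrt(se1**2 + se2**2)
    return math.erfc(z / math.sqrt(2))

# Weighted group means with Guinn's 1-sigma counting uncertainties,
# then doubled and tripled per his recommended factor of 2-3.
for factor in (1, 2, 3):
    p = two_sided_p(811, 17 * factor, 627, 10 * factor)
    print(f"{factor} sigma: two-sided p = {p:.2g}")
```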
Summary
(1) The paper is correct when it states that Dr. Guinn's 1-σ
analytical uncertainties are too small. But he acknowledged that.
(2) The paper quite reasonably proposes that weighted
averages be used in calculating means and standard deviations of the two groups
of crime-scene fragments, although it makes little difference in practice.
(3) The paper proposes that the "overall accuracy" of an
analytical measurement be equated to the variable's overall heterogeneity in a
bullet, which averages 13% for Sb. This unconventional idea is based on a view of
impurities in MC lead that has been shown by other data to be wrong.
(4) The paper uses confidence intervals and "overall
accuracy" to claim that there is only an 89% probability that the groups of
crime-scene fragments are drawn from different populations, which is well below
the normal threshold for accepting the difference.
(5) When the conventional test for difference of means is
used instead, the probabilities rise to ≥99%, which means that the groups
can in practice be considered to be separate.
Conclusions
(1) The statistical part of RG's paper fails because it made
two basic errors.
(2) This mirrors the metallurgical part of the paper, which
also failed by making two basic errors.
(3) Thus the paper as a whole fails.
(Note: I am posting this part of the review a little earlier than I would have preferred, because I will be traveling for the better part of the next month and wanted to get it out for others to see. There are two or three places where I may strengthen the arguments later, and I reserve the right to do so.)