To the outsider coming to the dispute for the first time, the flurry of numbers and measures is bewildering. In displaying and analyzing their measurements, scientists call on two distinct intellectual traditions, both often labeled with the word statistics.66 The first tradition, the amassing of numbers in large quantity to assess or measure a social problem, has its roots (still visible today) in eighteenth- and nineteenth-century practices of census takers and the building of actuarial tables by insurance companies.67 This heritage has slowly mutated into the more recent methodology of significance testing, aimed at establishing differences between groups, even when individuals within a group show considerable variation. Most people assume that, because they are highly mathematical and involve complex ideas about probability, the statistical technologies of difference are socially neutral. Today’s statistical tests, however, evolved from efforts to differentiate elements of human society, to make plain the differences between various social groups (rich and poor; the law-abiding and the criminal; the Caucasian and the Negro; male and female; the English and the Irish; the heterosexual and the homosexual, to name but a few).68
How are these two traditions applied to the problem of gender differences in the CC? The CC studies use both approaches. On the one hand, morphometrists make many measurements and arrange them in tables and graphs. On the other, they use statistical tests to correlate measurements with variables such as sex, sexual preference, handedness, and spatial and verbal abilities. Sophisticated statistical tools serve both rhetorical and analytical functions. Each CC study amasses hundreds of individual measurements. To make sense of what the philosopher Ian Hacking calls this “avalanche of numbers,”69 biologists categorize and display them in readable fashion.70 Only then can investigators “squeeze” information out of them. Does a structure change size with age or differ in people suffering from a particular disease? Do men and women or people of different races differ? The specialized research article, which presents numbers and extracts meaning from them, is really a defense of a particular interpretation of results. As part of his or her rhetorical strategy, the writer cites previous work (thus gathering allies), explains why his or her choice of method is more appropriate than that used by another lab with a different outcome, and uses tables, graphs, and drawings to show the reader a particular result.71
But statistical tests are not just rhetorical flourishes. They are also powerful analytic tools used to interpret results that are not obvious from casual observation. There are two approaches to the statistical analysis of difference.72 Sometimes distinctions between groups are obvious, and what is more interesting is the variation within a group. If, for example, we were to examine a group of 100 adult Saint Bernard dogs and 100 adult Chihuahuas, two things might strike us. First, all the Saint Bernards would be larger than all the Chihuahuas. A statistician might represent them as two nonoverlapping bell curves (figure 5.5A). We would have no trouble concluding that one breed of dog is larger and heavier than the other (that is, there is a group difference). Second, we might notice that not all Bernards are the same height and weight, and the Chihuahuas vary among themselves as well. We would place such Bernard or Chihuahua variants in different parts of their separate bell curves. We might pick one out of the lineup and want to know whether it was small for a Saint Bernard or large for a Chihuahua. To answer that question we would turn to statistical analyses to learn more about individual variation within each breed.
Figure 5.5: A: Comparing Chihuahuas to Saint Bernards. B: Comparing huskies to German shepherds. (Source: Alyce Santoro, for the author)
Sometimes, however, researchers turn to statistics when the distinction between groups is not so clear. Imagine a different exercise: the analysis of 100 huskies and 100 German shepherds. Is one breed larger than the other? Their bell curves overlap considerably, although the average height and weight differ somewhat (figure 5.5B). To solve this problem of “true difference,” modern researchers usually employ one of two tactics. The first applies a fairly simple arithmetical test, now automated in computer programs. The test takes three factors into account: the size of the sample, the mean for each population, and the degree of variation around that mean. For example, if the mean weight for shepherds is 50 pounds, are most of the dogs close to that weight or do they range widely, say, from 30 to 80 pounds? This range of variation is called the standard deviation (SD). If there is a large SD, then the population varies a great deal.73 Finally, the test calculates the probability that the two population means (that of the huskies and that of the shepherds) differ by chance.
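To see how the three ingredients of such a test interact, consider a minimal sketch in Python. It is not the procedure used in any of the CC studies; the means, SDs, and the function name `two_sample_z` are hypothetical, and the p-value relies on a normal approximation that is only reasonable for samples as large as the 100 dogs per breed imagined here.

```python
import math
from statistics import NormalDist

def two_sample_z(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (z, two-sided p) for the difference between two sample means."""
    # Standard error of the difference: grows with the SDs, shrinks with sample size.
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z = (mean_a - mean_b) / se
    # Normal approximation (an assumption; small samples would call for a t distribution).
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical weights in pounds: 100 shepherds (mean 50) versus 100 huskies (mean 48).
z_tight, p_tight = two_sample_z(50, 5, 100, 48, 5, 100)    # dogs cluster near the mean
z_wide, p_wide = two_sample_z(50, 15, 100, 48, 15, 100)    # dogs range widely

print(f"small SD: z = {z_tight:.2f}, p = {p_tight:.4f}")
print(f"large SD: z = {z_wide:.2f}, p = {p_wide:.4f}")
```

The same two-pound gap between the means falls below the conventional 5 percent cutoff when the dogs cluster tightly and fails to reach it when they range widely, which is why the degree of variation matters as much as the means themselves.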
Researchers don’t have to group their data under separate bell curves to establish differences between populations. They can instead group all the data together, calculate how variable it is, and then analyze the causes of that variability. This process is called the analysis of variance (ANOVA). In our doggie example, researchers interested in the weight of huskies and German shepherds would pool the weights of all 200 dogs, and then calculate the total variability, from the smallest husky to the largest German shepherd.74 Then they would use an ANOVA to partition the variation: a certain percentage accounted for by breed difference, a certain percentage by age or brand of dog chow, and a certain percentage unaccounted for.
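The partitioning step can be sketched in a few lines of Python. This is a simplified illustration with a single factor (breed) and ten invented weights, so everything not explained by breed lands in the unaccounted-for share; the function name `partition_variation` and the data are hypothetical.

```python
from statistics import mean

def partition_variation(groups):
    """Split the total sum of squares into between-group and within-group parts."""
    pooled = [x for values in groups.values() for x in values]
    grand_mean = mean(pooled)
    # Total variability of the pooled data around the grand mean.
    ss_total = sum((x - grand_mean) ** 2 for x in pooled)
    # Variability attributable to the group (breed) means differing from the grand mean.
    ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
    # Variability of individual dogs around their own breed mean (unaccounted for here).
    ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
    return ss_total, ss_between, ss_within

# Invented weights in pounds, five dogs per breed.
weights = {"husky": [44, 47, 50, 53, 56], "shepherd": [48, 52, 55, 58, 62]}
ss_total, ss_between, ss_within = partition_variation(weights)

print(f"share attributed to breed: {ss_between / ss_total:.1%}")
print(f"share unaccounted for:     {ss_within / ss_total:.1%}")
```

Note that the output describes shares of variation around a mean rather than the size of any dog, which is the shift in the object of study that an ANOVA brings with it.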
Tests for mean differences allow us to compare different groups. Is the difference in IQ between Asians and Caucasians real? Are males better at math than females? Alas, when it comes to socially applied decision making, the clarity of the Chihuahua versus the Saint Bernard is rare. Many of the CC studies use ANOVA. They calculate the variability of a population and then ask what percentage of that variability can, for example, be attributed to gender or handedness or age. With the widespread use of ANOVAs, then, a new object of study has crept in. Now, rather than actually looking at CC size, we are analyzing the contributions of gender and other factors to the variation of CC size around an arithmetical mean. As scientists use statistics to tame the CC, they distance it yet further from its feral original.75
Convincing others of a difference in CC size would be easiest if the objects simply looked different. Indeed, in the CC dispute a first line of attack is to claim that the difference in shape between the splenia of male and female CC’s is so great that it is obvious to the casual observer. To test this claim, researchers draw an outline of each of the 2-D CC’s in their sample. They then give a mixture of the drawings, each labeled only with a code, to neutral observers, who sort the drawings into bulbous and slender categories. Finally, they decode the sorted drawings and see whether all or most of the bulbous file turn out to have come from women and the slenders from men. This approach does not yield a very impressive box score. Two groups claim a visually obvious sex difference; a third group also claims a sex difference, but males and females overlap so much that the researchers can only detect it using a statistical test for significant difference.76 In contrast, five other research groups tried visual separation of male from female CC’s but failed in the attempt.
When direct vision fails to separate male from female, the next step is to bring on the statistical tests. In addition to those who attempted to visually differentiate male from female CC’s, nine other groups attempted only a statistical analysis of difference.77 Two of these reported a sex difference in splenial shape, while seven found no statistical difference. This brings the box score for a sex difference in splenial shape to 5 for, 13 against. Even statistics can’t discipline the object of study into neatly sorted categories. As Mall found in 1908, the CC seems to vary so much from one individual to the next that assigning meaningful differences to large groups is just not possible.
In 1991, after the CC debate had been raging for nine years, a neurobiologist colleague told me that a new publication had definitively settled the matter. And the news accounts, in both the popular and the scientific press, suggested he was right. When I began to read the article by Laura Allen and her colleagues I was indeed impressed.78 They used a large sample size (122 adults and 24 children), they controlled for possible age-related changes, and they used two different methods to subdivide the corpus callosum: the straight-line and the curved-line methods (see figure 5.4). Furthermore, the paper is packed with data. There are eight graphs and figures interspersed with three number-packed, subdivided tables, all of which attest to the thoroughness of their enterprise.79 Presenting their data in such detail demonstrates their fearlessness. Readers need not trust the authors; they can look at their numbers for themselves, recalculating them in any fashion they wish. And what do the authors conclude about gender differences? “While we observed a dramatic sex difference in the shape of the corpus callosum, there was no conclusive evidence of sexual dimorphism in the area of the corpus callosum or its subdivisions.”80
But despite their emphatic certainty, the study, I realized as I reread it, was less conclusive than it seemed. Let’s look at it step by step. They used both visual inspection and direct measurement. From their visual (which they call subjective) data, they reach the following conclusion.
Subjective classification of the posterior CC of all subjects by sex based on a more bulbous-shaped female splenium and a more tubular-shaped male splenium revealed a significant correlation between the observers’ sex rating based on shape and the actual gender of the subject (χ² = 13.2603; 1 df; contingency coefficient = 0.289; p < 0.003). Specifically, 80 out of 122 (66 percent) of the adult’s CC (χ² = 10.123; 1 df; contingency coefficient = 0.283; p < 0.0011) were correctly identified.81
First, we can extract the actual numbers: using splenial shape, their blind classifiers could correctly categorize as male or female 80 out of 122 tracings of adult 2-D CC’s. Was that good enough to claim a visual difference, or might we expect the 80 out of 122 to occur by chance? To find out, the authors employ a chi-squared test (symbolized by χ², from the Greek letter chi). The well-known founder of modern statistics, Karl Pearson (and others), developed this test to analyze situations in which there was no unit of measurement (for example, inches or pounds). In this case the question is: Is the correlation between bulbous and female or slender and male good enough to warrant the conclusion of a visual difference? The take-home is in the figure p < 0.0011. This means that the probability of 80 of 122 correct identifications happening solely by chance is about one-tenth of 1 percent, well below the cutoff point of 5 percent (p < 0.05) used in standard scientific practice.82
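The logic of asking whether 80 out of 122 could occur by chance can be sketched as follows. This is not a reconstruction of the authors' calculation: their test rests on a contingency table of shape rating against sex that the quoted passage does not give, so the sketch substitutes a simpler goodness-of-fit test against the assumption that pure guessing is right half the time, and its statistic therefore differs from the quoted one. The function name `chi_squared_1df` is hypothetical.

```python
import math

def chi_squared_1df(observed, expected):
    """Pearson chi-squared statistic and p-value for one degree of freedom."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 df the upper-tail probability has a closed form: erfc(sqrt(stat / 2)).
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

correct, total = 80, 122                   # figures taken from the quoted passage
observed = [correct, total - correct]      # correct vs. incorrect classifications
expected = [total / 2, total / 2]          # assumption: guessing is right half the time

stat, p = chi_squared_1df(observed, expected)
print(f"chi-squared = {stat:.3f}, p = {p:.5f}")
print("below the 5 percent cutoff" if p < 0.05 else "not below the 5 percent cutoff")
```

Even under this cruder comparison the result clears the conventional cutoff comfortably, so the weight of the argument falls on what was counted as a correct classification rather than on the test itself.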
Well, 66 percent of the time observers could separate male from female CC’s just by eyeballing their shape. And the χ² test tells us how significant this differentiation process is. Statistics don’t lie. They do, however, divert our attention from the study design. In this case, Allen et al. gave their CC tracings to three different observers, who had no knowledge of the sex of the individual whose brain had generated the drawing. These blind operators divided the drawings into two piles, bulbous or tubular, on the assumption that if the difference were obvious, the pile of tubular shapes should mostly turn out to have come from men and the bulbous from women. So far so good. Now here comes the trick. The authors designated a subject’s gender as correctly classified if two out of the three blind observers did it right.
How does this work out numerically? The complex statistical passage quoted above says that 66 percent of the time the observers got it right. This could actually mean several things. There were 122 drawings of the corpus callosum. Since three different observers looked at each drawing, that means that there were 366 individual observations. In the best case (from the authors’ point of view), all three observers always agreed about any individual CC. This would mean that 240/366 (66 percent) of their individual observations accurately divined sex on the basis of shape. In the worst case, however, for those measures that they counted as successful separations, only two out of the three observers ever agreed about an individual brain. This would mean that only 160/366 (44 percent) of the individual observations successfully separated the CC drawings on the basis of sex. Allen et al. do not provide the reader with the complete data, so their actual success remains uncertain. But using a chi-squared test on their refined data convinces many that they have finally found an answer that all can accept.
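The arithmetic behind the best and worst cases can be checked with a short Python sketch. It follows the framing used here and counts only observations on the 80 tracings scored as correct (three observers times 80 tracings in the best case, two times 80 in the worst), ignoring whatever individual observers did with the other 42; the function name `observation_bounds` is hypothetical.

```python
def observation_bounds(n_tracings, n_majority_correct, n_observers=3):
    """Bounds on correct individual observations under a majority-rule score."""
    total = n_tracings * n_observers
    majority = n_observers // 2 + 1          # two out of three observers
    # Best case: every observer was right on every tracing scored as correct.
    best = n_majority_correct * n_observers
    # Worst case: only the bare majority was right on each of those tracings.
    worst = n_majority_correct * majority
    return total, best, worst

total, best, worst = observation_bounds(n_tracings=122, n_majority_correct=80)
print(f"individual observations: {total}")
print(f"best case:  {best}/{total} = {best / total:.1%}")
print(f"worst case: {worst}/{total} = {worst / total:.1%}")
```

The gap between the two bounds is what the majority rule conceals: the same headline figure of 80 out of 122 is compatible with observers who were individually right about two-thirds of the time or well under half the time.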
The data do not speak for themselves. The reader is presented with tables, graphs, and drawings and is pushed through rigorous statistical trials, but no clear answer emerges. The data still need more support, and for this scientists try next to interpret their results plausibly. They support their interpretations by linking them to previously constructed knowledge. Only when their data are woven into this broader web of meaning can scientists finally force the CC to speak clearly. Only then can “facts” about the corpus callosum emerge.83