I am the youngest of five children, so I get to ask questions today.^{1} Sara Goldrick-Rab pointed out yesterday that John Hattie has turned his giant summary of education intervention research into a book designed for practitioners, Visible Learning for Teachers, which promises to be somewhat more accessible than the giant-book-o’-meta-analysis.

Meta-analysis is one important approach to summarizing research data on specific questions: put different studies in the same general universe of measures and compare. The term was coined by Gene Glass in the 1970s, and the idea that one could formally analyze multiple studies with different measures turned a number of people's methodological caution inside-out,^{2} but eventually it replaced qualitative reviews in many fields, including medicine, psychology, and education. The effort to compare studies with different measures gave us the "effect size" measured relative to a sample/study standard deviation, a clever cheat that takes its place alongside other late-20th century statistical hacks including the bootstrap, the jackknife, and multiple imputation. Among other things, the existence of meta-analysis should make you skeptical of any attempt at a quantitative research review that engages in the equivalent of "vote-counting" rather than a summary of effect sizes. It satisfies the tremendous urge to answer big questions in messy areas, and as long as one understands that it's a messy technique, it's enormously useful.
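To make the "clever cheat" concrete (my illustration, not part of the post): a standardized mean difference divides each study's raw gap by a pooled standard deviation, so studies using different instruments land on one scale, and inverse-variance weighting then pools them across studies. A minimal Python sketch with invented numbers:

```python
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference: the raw gap divided by a pooled SD."""
    pooled_sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2)
                          / (n_t + n_c - 2))
    return (mean_t - mean_c) / pooled_sd

def fixed_effect_pool(effects, variances):
    """Inverse-variance weighted mean effect across studies."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return pooled, se

# Two hypothetical studies on different instruments, made comparable:
d1 = cohens_d(105, 100, 10, 10, 50, 50)   # 5-point gap, SD 10 -> d = 0.5
d2 = cohens_d(52, 50, 8, 8, 40, 40)       # 2-point gap, SD 8  -> d = 0.25
```

Once every study is expressed as a d with a variance, the weighted pooling is the whole fixed-effect summary; random-effects versions add a between-study variance term.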

Meta-analysis is at one end of the general-particular spectrum of research, coming from the urge to draw universal conclusions (or reasonably universal conclusions, depending on the question at hand). In this vein, it addresses the "how representative/general are your conclusions?" question by aggregation.^{3} I am an historian, and while I have some social-science and quantitative training, there are important questions at the other end of the spectrum, where context and contingency rule. "How unique is this situation?" is the flip side of "How general are these conclusions?"^{4}

What does not currently exist is a quantitative social-science toolkit for sussing out uniqueness. I don't mean individual outliers (tools for those exist in various forms^{5}) but rather the uniqueness of a dataset. Maybe an example will help: For my first book, *Creating the Dropout*, I looked at the relationships between household and demographic measures and teens' high school graduation status in decennial censuses from 1940 to 1990. Some measures had coefficients and odds ratios that were very similar from decade to decade.^{6} But some changed, including the relationship between race and graduation after taking a whole bunch of other factors into account. In 1940, in the depths of Jim Crow and the Great Depression, being African American was a significant disadvantage for teens' high school attendance and completion even after taking household income, household home ownership/rental status, and parental education into account. A few decades later, being African American was no longer a disadvantage in high school completion compared to white children of similar circumstances. Looking at that single issue, I can tell a broader story and did in part of *Creating the Dropout*, one updated with more specific and persuasive data in John Rury and Shirley Hill's new book (which you should read, either by getting it from a library [the previous link] or buying it).

Apart from the storyline, the practical question is: how do we *know* that the census samples from later in the 20th century are different from earlier ones? We can run individual tests, or sets of tests, for the single question I have described, but I was working well after the fact, and I was already attuned to the issue. How would one detect a pattern's emergence with less complete information? In *A Piece of the Pie* (1980), Stanley Lieberson used a variety of clever techniques to figure out what was happening with equal opportunity precisely at the time the disadvantage in high school completion was disappearing (important caveat: *after* other measurable factors are accounted for).^{7} On my best days I may be about a quarter as clever as Lieberson was for that project, so I'm thinking we need to steal ideas from elsewhere. Here are a few I have been mulling recently:

*Ordinary, plodding analysis*. This is the type of thing I would do now, in the same way I looked at my analyses of censuses: how do the coefficients of measures of interest vary across samples/years/contexts?
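A hypothetical sketch of that plodding comparison (mine, with invented numbers): fit the same simple model to each decade's sample and line up the coefficients. Plain OLS on a group indicator stands in here for the book's logistic regressions:

```python
import statistics

def slope(x, y):
    """Simple OLS slope of y on x."""
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    return (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
            / sum((a - xbar) ** 2 for a in x))

# Invented per-decade samples: predictor (0/1 group indicator) and outcome.
samples = {
    1940: ([0, 0, 1, 1], [0.3, 0.4, 0.1, 0.2]),   # group gap present
    1980: ([0, 0, 1, 1], [0.3, 0.4, 0.3, 0.4]),   # group gap gone
}
by_year = {year: slope(x, y) for year, (x, y) in samples.items()}
```

The whole method is just eyeballing `by_year`; everything below tries to do something more principled with the same comparison.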

*Bootstrap mixed samples*. This would create a set of samples drawn from different data sets with different mixing ratios from the “potential sample of interest” to a blend of other samples. How would the coefficients for measures of interest change as that mixing ratio changes? This is only available when you have access to the individual records of different data sets, and it starts to move towards speculative simulations.
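A minimal sketch of the mixing idea, with stand-in records and a trivially simple statistic; the point is only the sweep over the mixing ratio (in practice each record would be a full individual record and the statistic a regression coefficient):

```python
import random

def mixed_bootstrap(sample_a, sample_b, mix, n_draws, seed=0):
    """Resample n_draws records with replacement, taking each record
    from sample_a with probability `mix` and from sample_b otherwise."""
    rng = random.Random(seed)
    return [rng.choice(sample_a) if rng.random() < mix else rng.choice(sample_b)
            for _ in range(n_draws)]

a = [1.0] * 100   # stand-in for the "potential sample of interest"
b = [0.0] * 100   # stand-in for the blend of other samples
trajectory = {mix: sum(mixed_bootstrap(a, b, mix, 1000)) / 1000
              for mix in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

A statistic that moves smoothly along `trajectory` suggests the sample of interest differs only in degree; a statistic that jumps suggests something more like genuine uniqueness.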

*Hypothetical bootstrap mixed samples*. This would perform the same analysis as above, but where you do not have access to original records, you would assume a parametrized distribution based on reported effect sizes and standard errors to generate hypothetical records. This is something I would have to label a crazy Monte Carlo (simulation) based on the specifics of published research. I hope it would be a fun crazy Monte Carlo…
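A toy version of that Monte Carlo, assuming (as a deliberate simplification) a normal sampling distribution around each published effect; the study names and numbers are invented:

```python
import random
import statistics

def simulate_effects(reported, n_sims, seed=0):
    """For each published (effect, se) pair, generate hypothetical
    replications from an assumed normal sampling distribution."""
    rng = random.Random(seed)
    return {name: [rng.gauss(effect, se) for _ in range(n_sims)]
            for name, (effect, se) in reported.items()}

# Invented "published" results: (effect size, standard error).
reported = {"1940 sample": (-0.35, 0.05), "1980 sample": (0.02, 0.04)}
sims = simulate_effects(reported, n_sims=5000)
```

The simulated draws can then feed the same mixing-ratio sweep as above, at the cost of everything depending on the distributional assumption.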

*Overfitting as a feature, not a bug*. One of the standard critiques of many poli-sci prediction models is that they are highly fitted to historical data, such as the presidential elections since 1860 or Congressional elections since 1952. Such "overfitting" reduces the ability to generalize because statistical algorithms try to squeeze as much as they can out of existing variability, generating more "performance" for a model than one could reasonably expect when extending to all hypothetical presidential or Congressional elections (to use the two examples mentioned above). But to an historian or anyone else wanting a restricted context, this is not a problem. Or, more practically, one could compare a potentially generalizable, cautious model with a model cranked up to its overfitted extreme. The difference between the cautious and the cranked-up model would be one reasonable measure of specificity.
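One way to sketch that comparison (mine, with invented data): fit a cautious model and a saturated, "cranked-up" model to the same toy data and take the gap in in-sample fit as a rough specificity number:

```python
import statistics

def rss(y, fitted):
    """Residual sum of squares."""
    return sum((a - b) ** 2 for a, b in zip(y, fitted))

# Invented data: an outcome observed in three "years", two cases each.
years = [1, 1, 2, 2, 3, 3]
y = [2.0, 2.2, 2.9, 3.1, 5.8, 6.2]

# Cautious model: one linear trend in year.
xbar, ybar = statistics.mean(years), statistics.mean(y)
slope = (sum((x - xbar) * (v - ybar) for x, v in zip(years, y))
         / sum((x - xbar) ** 2 for x in years))
linear_fit = [ybar + slope * (x - xbar) for x in years]

# Cranked-up model: a separate mean for every year (saturated in year).
year_means = {yr: statistics.mean([v for x, v in zip(years, y) if x == yr])
              for yr in set(years)}
saturated_fit = [year_means[x] for x in years]

# The gap in fit is one rough, in-sample estimate of specificity.
specificity = rss(y, linear_fit) - rss(y, saturated_fit)
```

Here the third "year" breaks from the earlier trend, so the saturated model fits much better and `specificity` is large; if every year followed the same line, the gap would be near zero.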

*Speciation as a model of uniqueness*. Here, I am taking the biologist's classic model of species differentiation as inspiration for defining uniqueness. Speciation occurs through mutation and natural selection when a segment of a population is separated from the larger population and subjected to different environmental pressures. At some point, the separated population's characteristics become sufficiently different from the rest of the population that you can say it is a unique population (or a different species), even with the variation in characteristics one can expect within any population. Note: I have no idea how to operationalize this one. A quick search on "measuring speciation" leads to chemistry, not natural history.

Kibitzing most welcome…

**Notes**

- For those reading this entry months or years later who don't have the skills to translate between calendars on the fly: it's the first day of Passover, and traditionally asking the four questions is the privilege/obligation of the young. [↩]
- How can you combine studies that are so different??? [↩]
- And if you’re not satisfied by simple aggregation, you can code studies by characteristics and get fancy… [↩]
- You can interpret more recognizably historical questions as variants on uniqueness: continuity vs. discontinuity, contingency and causation, etc. [↩]
- Including influence measures in statistical packages. [↩]
- I used logistic regression in the analysis. Want to quibble more? Read the book! [↩]
- Other important caveat: that relationship may have changed since 1980–I should update the analysis with 2000 census and ACS data. [↩]

An obvious “day after” observation: of course there’s a huge literature on statistical smoothing that is somewhat related in its aim of separating general patterns from deviations, as well as the too-often-ignored art of residual analysis.

Hey Sherman– I shared this and discussed it with Doug Harris and Geoffrey Borman, who had these thoughts to add. You asked the question, “how do we know that the census samples from later in the 20th century are different from earlier?” It seems you mean this in terms of some model (rather than descriptives)? If so, there are lots of fairly standard ways to do this. One is to merge the samples with lots of interaction terms related to the key sample characteristics and see whether the interactions are significant with an F-test. Also, as you note, one can specify random effects meta-analytic models to test if effects (or relationships) vary by study (or sample) and if they do vary, you can specify methodological and/or substantive predictors coded for each sample to determine the underlying moderators of the effects/relationships.

Wouldn’t that do it?

Sara,

You, Doug, and Geoffrey are right on the question of models with comparable data structures — merge samples, code for year. That’s not really “plodding analysis,” but it’s part of the contemporary toolset for narrow questions. I’m playing around with whether you can look at broader structures, and I may be completely off my rocker here.
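For concreteness, here is a bare-bones sketch (mine, with invented numbers) of the merged-sample test: a Chow-style F statistic comparing a pooled fit against fully interacted per-sample fits, with a single predictor standing in for the long list of interaction terms a real census model would carry:

```python
import statistics

def ols_rss(x, y):
    """Residual sum of squares from a one-predictor OLS fit."""
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    slope = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
             / sum((a - xbar) ** 2 for a in x))
    return sum((b - (ybar + slope * (a - xbar))) ** 2 for a, b in zip(x, y))

def chow_f(x1, y1, x2, y2):
    """F statistic for 'do the two samples share one intercept and slope?'
    (the fully interacted model against the pooled one)."""
    rss_pooled = ols_rss(x1 + x2, y1 + y2)          # restricted (pooled) model
    rss_full = ols_rss(x1, y1) + ols_rss(x2, y2)    # separate per-sample fits
    q = 2                                            # restrictions tested
    df = len(x1) + len(x2) - 4                       # n minus full-model params
    return ((rss_pooled - rss_full) / q) / (rss_full / df)
```

A large F (against an F(q, df) critical value) flags samples that don't share one underlying relationship, which is exactly the "are the later censuses different?" question in model form.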

The pleasure of being tenured: not only is my job safe if I look like an ass in public being wrong, but I *should* take such risks, as long as I am being honest.

The last part of your comment is now my new motto. I adore it.