What “average effect” means and does not mean

Earlier this month, Politico education reporter Stephanie Simon wrote an article on what she describes as minimal positive effects of vouchers on student achievement. Rick Hess wrote one of the more visible responses, which repeated the following from an Ed Week column he co-authored last year:

Among voucher programs, random-assignment studies generally find modest improvements in reading or math scores, or both…. None of these studies has found a negative impact.

I could quibble with the summary of voucher research, but this is reasonably close to how I see the published literature: some number of small positive average effect sizes in a pool of no-significant-effect studies. There isn’t a rigorous meta-analysis of voucher-effect studies that I am aware of, but I don’t think either a meta-analysis or more studies are likely to budge the general pattern. You can spin this pattern in a variety of ways; the research evidence on vouchers is not driving policy, but that should be no surprise to observers of education politics.

For today, I wish to use the recent articles to point out something about the meaning of average-effect descriptions such as the one above. Economists (and many other social scientists) tend to refer to and manipulate average effect sizes as if the sample population is the unit of analysis. You can see this in papers such as the Chetty-Friedman-Rockoff study, which looks at teacher value-added through the lens of future student earnings. Doing so makes a great deal of sense if there is a clear, very large effect size for some policy (or medical intervention, to use a different context). When the average effect is large and beneficial, and there is no comparable alternative among the practical choices facing policymakers (or doctors or …), you use the average effect.

But it is not clear that it makes sense to use the estimated effect for a sample population (or a hypothetical broader population) when the evidence is less compelling, even if it tends in one direction. The average effect size in voucher studies generally hovers around zero or is slightly positive. It is also not clear that you want to use the effect size when there are significant costs associated with an intervention; for example, there is solid evidence that smaller class sizes in the primary grades can have important benefits, but lowering class sizes is also expensive. When does it make sense to reduce a policy decision to an average effect in that way, and what do you do if that reductionist approach is not sensible?

Start with the basics: in general, an average effect size is an estimate of the relationship between one variable of interest (in this case, something you can presumably control) and an outcome of interest. Let’s assume we are talking about a multivariate analysis that takes into account at least a good proportion of the potential confounding issues, and that we understand the limits of the particular study.1 Unlike something like r² (which tells you something about the explanatory power of the relationship), an effect size is a claim about the average change in the outcome if you move the variable of interest in one direction.
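To make that distinction concrete, here is a minimal sketch with made-up numbers (my illustration, not anything from the studies above): the simulated treatment shifts average scores by a few points, which is a real effect size, even though treatment status by itself explains almost none of the variance in scores.

```python
# Minimal sketch (hypothetical numbers): a treatment can have a meaningful
# average effect even when it explains little of the outcome's variance.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
treated = rng.integers(0, 2, size=n)      # 0/1 treatment indicator
noise = rng.normal(0, 20, size=n)         # lots of individual variation
score = 500 + 3.0 * treated + noise       # true average effect = 3 points

# "Effect size": the average difference in outcomes between groups.
effect = score[treated == 1].mean() - score[treated == 0].mean()

# r^2: share of outcome variance explained by treatment status alone.
corr = np.corrcoef(treated, score)[0, 1]
r_squared = corr ** 2

print(f"estimated average effect: {effect:.2f} points")
print(f"r^2 for treatment alone:  {r_squared:.4f}")  # tiny, despite a real effect
```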

Average is the key term. An estimated average effect is important, but so is the spread of estimates around the average. There are multiple sources of spread in effect estimates. One is the uncertainty about the average effect itself. For some methods of estimating the average effect (resampling or bootstrap estimates, or a Bayesian MCMC method), you can calculate the spread directly as the dispersion of estimates of the average effect. For most authors,2 this is what statistical significance refers to: they are fairly confident that the spread of the estimates of the average effect excludes zero.
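As a rough illustration of the resampling idea (a sketch with simulated data, not numbers from any voucher study), you can bootstrap the estimated average effect and look at the spread of those estimates directly:

```python
# Sketch of a bootstrap for the uncertainty around an estimated average effect.
# The data here are simulated stand-ins, not results from any voucher study.
import numpy as np

rng = np.random.default_rng(1)
control = rng.normal(500, 20, size=400)   # hypothetical control-group scores
treated = rng.normal(503, 20, size=400)   # hypothetical treated-group scores

def avg_effect(t, c):
    return t.mean() - c.mean()

boot = []
for _ in range(5_000):
    t_resampled = rng.choice(treated, size=treated.size, replace=True)
    c_resampled = rng.choice(control, size=control.size, replace=True)
    boot.append(avg_effect(t_resampled, c_resampled))
boot = np.array(boot)

low, high = np.percentile(boot, [2.5, 97.5])
print(f"point estimate: {avg_effect(treated, control):.2f}")
print(f"95% bootstrap interval: ({low:.2f}, {high:.2f})")
# If this interval excludes zero, most authors would call the average effect
# "statistically significant" in the sense described above.
```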

But while statistical significance focuses on the estimate of the average effect, there is another spread around that estimate: the spread of effects for individuals, either individuals in the sample or individuals in a hypothetical, simulated population. This spread of individual effects is as important as the average effect. If people respond very differently as individuals to a reading program, or monetary incentives, or a voucher program, then the average effect may be obscuring important information. Such heterogeneous effects may mean that the program (or treatment) should be restricted to those who can truly benefit… which requires more research.
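A toy simulation (again, entirely invented numbers) shows how this kind of heterogeneity can hide behind a modest average: one slice of the population gains, the rest lose a little, and the pooled average lands near zero.

```python
# Sketch: heterogeneous treatment effects hiding behind a near-zero average.
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
helped = rng.random(n) < 0.3                        # 30% of individuals respond well
individual_effect = np.where(helped, +6.0, -2.0)    # gain for some, small loss for others

print(f"average effect:       {individual_effect.mean():+.2f}")  # about +0.4
print(f"effect if helped:     {individual_effect[helped].mean():+.1f}")
print(f"effect if not helped: {individual_effect[~helped].mean():+.1f}")
# A program targeted only at the "helped" group would look very different
# from the pooled average, which is the point about heterogeneity above.
```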

And sometimes it may be impossible to craft a program that is truly targeted. For example, take the voucher research that Paul Peterson and Matt Chingos have recently updated, which tries to connect voucher access to high school graduation and college attendance: the overall average effect size is not statistically significant, but the effect size for African American students in the sample is positive and strongly so. By implication, that means that the effect for everyone else in the sample is negative, if weakly so. Suppose for a moment that this heterogeneity is verified through further analysis or other studies. Does that mean that we should have voucher programs available for African American students but not others, so that we can be sure that the program targets those whom it is likely to benefit but does not harm others? Err…. probably not. For a moment, ignore your preconceptions about voucher policies; the point here is that the average treatment effect for many potential programs is relevant but an insufficient basis on which to make policy, and that life can quickly become complicated when you look beyond average treatment effects.
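The “by implication” step is just a weighted-average decomposition: the overall effect is the subgroup shares times the subgroup effects. With placeholder numbers (not the actual Peterson-Chingos estimates), the arithmetic looks like this:

```python
# Sketch of the weighted-average logic behind "by implication":
#   overall = share_a * effect_a + (1 - share_a) * effect_other
# The numbers are placeholders, not estimates from the actual study.

share_a = 0.4      # hypothetical share of the sample in subgroup A
effect_a = 5.0     # hypothetical (strong, positive) effect for subgroup A
overall = 0.5      # hypothetical near-zero overall effect

# Solve the decomposition for the remaining subgroup's effect.
effect_other = (overall - share_a * effect_a) / (1 - share_a)
print(f"implied effect for everyone else: {effect_other:.2f}")  # -2.50: negative, if modest
```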

The second issue I want to highlight is the narrow construction of the term average treatment effect. It ignores the context of potential policies, and it ignores policy alternatives. That is not to say that knowing average treatment effects is bad; far from it. Rather, it tells us one piece of information, and here, only one kind of information. The classic example here is class size research. We know from the Tennessee STAR experiment and other research that, at least in the primary grades, there is evidence of significant positive effects from reducing class sizes to a relatively low student-teacher ratio.3 Two pieces of contextual information are missing here. Experimental research such as the Tennessee STAR experiment can only tell us what happens at the margins of a single state, when the changes will not affect other pieces of a large system. If you were to reduce the class size of primary-grade classrooms in a few schools, it is highly unlikely either that the overall cost would be significant for the state or that the small number of additional teachers would affect the teacher labor market, except maybe in a few neighborhoods. But if you mandate class-size caps at the state level, there will be both significant costs and noticeable aggregate changes in the labor market for teachers.

Pointing out the missing contextual information does not mean we should ignore the research, but rather that we should understand its limits: research about individual programs often tells us the effects of a program that is about as controlled, faithful, and well-executed as possible, and in limited settings. It does not tell us what would happen if we tried to scale up that program, with attendant issues of aggregate costs, other (predictable) effects, and management issues.

Notes

  1. This is a generous assumption; a good deal of even highly competent research cannot address important, known confounding issues because no relevant (or trustworthy) data were collected. To steal from Donald Rumsfeld, you go into analysis with the data you have collected.
  2. You can call these people frequentists.
  3. For an accessible description of the research, see a 1995 article written for a general audience by Fred Mosteller.