Sins of presentation: False precision and Just-So Stories of heterogeneity

Most real-world studies of education are messy. Even when you can run a quasi-experimental study or a pure randomized controlled trial, life can interfere with research designs. How one handles that messiness is an important factor in how I read and evaluate studies. That may not be fair, but an attempt to hide the ball on important issues tends to reduce credibility. The latest CREDO (Stanford) report on charter schools is an example of such a messy real-world research project, as is the recently published Chingos-Peterson study of (privately financed) vouchers in New York City and college attendance. They present different difficulties in interpretation. In the first case, there is a portrayal of false precision in findings. In the second case, there is a thin and unconvincing attempt to explain a surprising heterogeneity of results — or, in plain, jargon-free English, the paper does not persuasively explain why not every group had the same result. In one case the sin is venial: the overall study is important and exhaustively presented, even if the presentation is weakened. In the other, the presentation dramatically undermines the credibility of what would otherwise be a new and interesting result.

1. False precision

The CREDO report makes a fundamental error in presentation, albeit in the service of being friendlier to lay readers. In an attempt to portray its estimates of effect sizes in human terms, the authors translated them into “days of extra learning”: “To make these results more meaningful to non-technical readers, we also include a transformation of the results in days of learning” (p. 12). That sounds reasonable, until you see the translations by individual state: effects ranging from a relative advantage for local public schools in math in Nevada of 137 days to a relative advantage for charter schools in math in Rhode Island of 108 days. Really, authors? 137 days, not 136 or 138? And for the tiny state of Rhode Island, where the number of charter-school students in the study in any year is below 1,000, the authors are so confident in the precision of their estimates and the robustness of their models that they know the advantage for charter schools in math is 108 days — not 107 or 109, but precisely 108. At least they didn’t attempt to tell us whether the last day of advantage included the afternoon or stopped at lunch.

Keep in mind, this is in a study where the overall effect sizes (in terms of estimated differences in growth between charter and local-public sectors) are minimal, on the order of 0.01 standard deviations for the whole set of states examined, with slightly larger effects (in either direction) for some subgroups. As Tom Loveless noted, that’s essentially bupkis.1 We might have some confidence in the precision of a national estimate and estimates in states with reasonably large charter sectors (by number of kids with data, not necessarily percentage). But for every state, we can estimate relative value-added effects down to the day?

Overall, the report is an interesting and exhaustive follow-up to the 2009 CREDO report on charter-school achievement compared to local public schools. As Loveless, Matt Di Carlo, and others have pointed out, the study provides a lot to think about. I just wish that study authors who choose a “user-friendly” presentation would also take a little care in not implying greater-than-appropriate precision. Maybe some confidence intervals, folks?
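To see how much uncertainty a day-count can hide, here is a minimal sketch. The effect estimate, standard error, and days-per-standard-deviation conversion factor below are all hypothetical, chosen only to show how wide an honest interval around a number like “108 days” could be:

```python
# Convert a value-added effect size (in standard deviations) into
# "days of learning," carrying the 95% confidence interval along.
# DAYS_PER_SD is an assumed conversion factor (0.01 SD ~ 7.2 days),
# and the estimate and standard error are hypothetical.

DAYS_PER_SD = 720

def days_of_learning(effect_sd, se_sd, z=1.96):
    """Return (point estimate, CI low, CI high) in days of learning."""
    point = effect_sd * DAYS_PER_SD
    half_width = z * se_sd * DAYS_PER_SD
    return point, point - half_width, point + half_width

point, lo, hi = days_of_learning(effect_sd=0.15, se_sd=0.05)
print(f"{point:.0f} days (95% CI: {lo:.0f} to {hi:.0f})")
# → 108 days (95% CI: 37 to 179)
```

Reporting “somewhere between 37 and 179 days” is far less quotable than “108 days,” which may be exactly why the interval never makes it into the press release.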

2. Just-So Stories to explain heterogeneity

In the case of the Chingos-Peterson paper, the study found no average net difference for voucher-receiving students in comparison with students who applied for but did not get to use a voucher. What became a point of contention shortly after the release of the working paper was the authors’ focus on the statistically significant positive impact on one subgroup, African American students, who comprised about 43% of the sample. In reviewing the working paper, Sara Goldrick-Rab focused on the overall effect, which was zero or close to zero, as well as several technical issues related to the focus on one subgroup. It is important to understand Goldrick-Rab’s criticisms, as it appears that Chingos and Peterson did not address them in publishing the study in Education Next (where Peterson is the editor). The two most crucial concerns of Goldrick-Rab2 focus on the subgroup effect the authors claim for African American students:

  • The reliance on linkages using the National Student Clearinghouse may underestimate college attendance because of the clearinghouse’s failure to include for-profit institutions, with that measurement error disproportionately greater for African American students.
  • The use of naturalistic rather than designed subgroups — i.e., instead of creating a study design that selects members from each relevant subgroup, the authors analyzed all participants in each subgroup and then assumed the differences in results are as robust as if the authors had set the rules for inclusion within each subgroup.

In addition, when the paper first came out I performed some crude simulations suggesting that if African Americans on average benefited from vouchers, then it was highly likely that everyone else in the sample on average suffered worse college attendance if they received vouchers. I remember that on Twitter, Chingos said they would rework the description of the null overall effect versus subgroup effects and address Goldrick-Rab’s concerns. In the version that found print in Education Next, the authors did not appear to respond to Goldrick-Rab’s methodological concerns. When discussing subgroup effect estimates, the authors speculated about hypothetical reasons for (nonsignificant) differences between African American and Latino students:

We do not know for sure why larger impacts were observed for African American students than for Hispanic students, but it appears that the African American students in the study had fewer educational opportunities in the absence of a voucher. As noted above, Hispanic students were considerably more likely to attend college in the absence of a voucher opportunity. There is also some evidence that the public schools attended by Hispanic students were superior to those attended by African American students…. A possible alternative explanation focuses on motivations for moving from public to private school. Many Hispanic families may have been seeking a voucher opportunity for religious reasons, while most African American families had secular education objectives in mind.

What’s the problem here? Both speculated explanations assume that there is an effect to explain, which is circular reasoning: assuming your conclusion. There is no attempt to build or test a competing hypothesis in which the difference is spurious, even though Goldrick-Rab specifically suggested both a better way to report this subgroup difference and some empirical testing of the just-so stories that appeared in the working paper… and in this published version.
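The crude simulations mentioned earlier rest on nothing more exotic than weighted-average arithmetic: if the overall effect is roughly zero and one subgroup shows a positive effect, the effect for everyone else must be negative. A minimal sketch, where the 43% subgroup share comes from the paper but the subgroup effect size is purely illustrative:

```python
# Weighted-average identity for an overall effect split into two groups:
#   overall = share * effect_subgroup + (1 - share) * effect_rest
# Solving for effect_rest shows what a null overall effect implies.
# The effect size below is illustrative, not the paper's estimate.

def complement_effect(overall, share, effect_subgroup):
    """Solve the weighted average for the complementary group's effect."""
    return (overall - share * effect_subgroup) / (1 - share)

share_aa = 0.43      # African American share of the sample (from the paper)
effect_aa = 0.07     # hypothetical positive effect for that subgroup
overall = 0.0        # overall effect near zero

rest = complement_effect(overall, share_aa, effect_aa)
print(f"Implied effect for everyone else: {rest:.3f}")
# → Implied effect for everyone else: -0.053
```

Whatever the exact magnitudes, the identity means you cannot celebrate the positive subgroup effect without owning the implied negative one.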

And then, on the larger question of how one should look at a subgroup effect that appears statistically significant when there is no observed overall effect, the authors write:

The small group of students in the study from other ethnic backgrounds was diverse and less likely to use the voucher when it was offered to them, so we are hesitant to interpret their results. The group consists of 196 treatment and 127 control students, including 91 white students, 14 Asian students, 78 students from another background, and 140 students for whom information on ethnicity was not supplied. For this group as a whole, the estimated impact of the voucher offer on college enrollment within three years of expected graduation has a negative sign but is imprecisely estimated.

If we separate out white or Asian students, other-race students, and those for whom information on race is unavailable, the estimated effects of the voucher offer are all negative, but only the effect for white or Asian students is statistically significant. This group includes only 105 students, however, and we find that the treatment and control groups did not have similar characteristics at the beginning of the study. Consequently, we do not place much weight on this negative effect.

My teeth grind at the last sentence: either you treat the heterogeneous disaggregated results as important, or you don’t. You can either consider pre-treatment differences important (as Chingos and Peterson did with the residual numbers) or not (as they did with the differences in parental education that Goldrick-Rab pointed out for African American students), but to try both is remarkable intellectual flexibility. You cannot preferentially select some issues as important without a model of what is going on or without testing your just-so stories. This is the equivalent of saying I’m going on a diet, weighing myself every day, and only counting a daily weight change if it’s a loss. That would not be much of a diet, or much of an estimate of weight change.
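The diet analogy can be made concrete with a tiny simulation, purely illustrative: daily weight changes that average zero look like steady loss if you only count the days the scale goes down.

```python
import random

# Simulate mean-zero daily weight changes, then compare the honest
# average to the "only count it if it's a loss" average. Purely
# illustrative of selection bias, not a model of anything real.

random.seed(42)
changes = [random.gauss(0.0, 0.5) for _ in range(10_000)]

honest_mean = sum(changes) / len(changes)
losses_only = [c for c in changes if c < 0]
cherry_picked_mean = sum(losses_only) / len(losses_only)

print(f"All days:    {honest_mean:+.3f} lbs/day")        # near zero
print(f"Losses only: {cherry_picked_mean:+.3f} lbs/day")  # apparent steady loss
```

Conditioning on the outcome you wanted manufactures an effect out of noise, which is exactly the worry with picking and choosing which subgroup results to take seriously.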

The bullheaded presentation after Goldrick-Rab’s critique of the working paper is a real disappointment, especially after Chingos hinted that the presentation would change. With the most careful published voucher studies, such as those written or co-written by David Figlio, we have information generally about state achievement tests in the short term rather than either long-term outcomes or social outcomes. So having something that explores potential heterogeneity in college attendance would be very interesting. Alas, that did not happen here.

Addendum: Chingos notes that a new version of the paper is out this month. We discussed it on Twitter yesterday evening, mostly before the Zimmerman story broke.

I appreciate Chingos’s continued engagement on this — and the revised version of the working paper does have estimates for the “neither African American nor Hispanic” subgroup. In addition, I think the revised version has some other model specifications suggesting some pretty good stability on the effect estimates for this particular sample (in particular, they have an instrumental-variable analysis that gets around some of the problems with an intent-to-treat model). I’ll let others read the revision and judge for themselves the extent to which Chingos and Peterson address the concerns listed above in the main blog entry, in particular Goldrick-Rab’s.


  1. Loveless did not use the word bupkis, so I am taking some poetic license here. Note Loveless’s discussion of statistical significance as a red herring, and his discussion of a different problematic lay-oriented way of presenting the findings.
  2. Or perhaps the two that are easiest to capture briefly…

One response to “Sins of presentation: False precision and Just-So Stories of heterogeneity”

  1. CCPhysicist

    They can’t include confidence intervals because the typical college grad legislator will be confused by them. I don’t recall seeing a real statistics class in the list of classes recommended to pre-law students.