State-level NAEP data — slightly wonkish

Earlier this week, Morgan Polikoff and I had a brief Twitter exchange about the use of the low-stakes National Assessment of Educational Progress (NAEP) tests for policy analysis, specifically for studying the consequences of high-stakes accountability. I am on record as being moderately dissatisfied with the use of state-level NAEP data for education policy analysis, and given its continued use, I should explain more than 140-character bites can.

First, a mea culpa: at one point in the brief exchange, I stated that the state-level NAEP research at this point was no better than not having conducted the research. On this, I was wrong.

Why am I dissatisfied with the continued use of state-level NAEP scale scores in policy analysis? A little more than eight years ago, when I was an editor at Education Policy Analysis Archives, we published an article by Sharon Nichols, David Berliner, and Gene Glass, High Stakes Testing and Student Achievement, using state-level NAEP results to assess the results of high-stakes accountability. Shortly afterwards came Gregory Marchant, Sharon Paulson, and Adam Shunk’s Relationships between High-Stakes Testing Policies and Student Achievement after Controlling for Demographic Factors in Aggregated Data. Nichols et al. have updated the work more recently, and the substantive conclusions of the 2006 articles are reasonably close to those of Thomas Dee and Brian Jacob’s 2012 article, The Impact of No Child Left Behind on Student Achievement, which is also based on aggregate data and was where my exchange with Polikoff began.

Because I was copyediting/composing pages for the first two articles linked above, I spent considerable time on the manuscripts. As I kept rereading the articles before publication, I kept coming back to two conclusions:

  • The pieces were deserving of publication. In particular, Nichols, Berliner, and Glass had spent time working through the method they had chosen, a unique and clever way of assessing the strength of accountability.
  • At the same time, the use of state-level NAEP data was problematic. States are far from ideal units of analysis, especially in education policy. As Matt DiCarlo wrote in 2011, “[NAEP] test scores at this aggregate level – across entire states – cannot be used to make arguments about the causal impact of specific policies.”

Part of the problem with state-level data is the length of the implicit causal chain, as DiCarlo wrote. It’s an inherently weaker design if you want to draw strong policy conclusions. On the one hand, you sometimes have to draw conclusions with incomplete, messy data, as long as there are appropriate caveats. And I appreciate people who use or create clever tools to address data problems. However, it’s hard to see states as the best unit of analysis when your question is whether policies improve achievement for children.[1]

More importantly, the use of aggregate data is now unnecessary: no one has to use state-level aggregate data with NAEP. The National Assessment Governing Board has provided a restricted-use data set that lets researchers work with individual-level data, which permits more sophisticated analysis than state-level aggregates do. The individual test scores are not point estimates: because of the sampling nature of NAEP, the data consist of several score estimates (plausible values) for each individual. But NAEP provides software that is built to enable researchers to use the data.
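For readers who have not worked with this kind of data, here is a minimal sketch of how an analysis is usually pooled across the several score estimates per individual. It assumes the standard multiple-imputation combining rule (Rubin's rule): run the same analysis once per set of score estimates, average the results, and add the between-set variance to the sampling variance. The function name and all the numbers are hypothetical, and the sampling variances are simply taken as given; this is not the NAEP software's own procedure or real NAEP output.

```python
from math import sqrt
from statistics import mean, variance


def combine_plausible_values(estimates, sampling_variances):
    """Pool one statistic computed separately on each set of score estimates.

    estimates: the statistic (e.g. a group mean) from each set.
    sampling_variances: the sampling variance of each of those statistics
        (assumed to be supplied by the analyst; not computed here).
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(sampling_variances):
        raise ValueError("need at least two estimates, one variance per estimate")

    pooled = mean(estimates)             # average across the sets
    within = mean(sampling_variances)    # average sampling variance
    between = variance(estimates)        # variance across sets (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical numbers for illustration only (not real NAEP results).
example_estimates = [250.0, 252.0, 251.0, 249.0, 253.0]
example_variances = [4.0, 4.0, 4.0, 4.0, 4.0]
pooled_mean, pooled_se = combine_plausible_values(example_estimates, example_variances)
print(f"pooled mean = {pooled_mean:.1f}, standard error = {pooled_se:.2f}")
```

The point of the extra between-set term is that the standard error reflects uncertainty about each child's score as well as sampling error, which a single state mean scale score hides.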

At the time, in late 2006, I did not see a problem with clever attempts to use state-level data. Relying on state-level data weakens the conclusions one can draw, but it was not (and is not) a professional flaw in the conduct of studies.

On the other hand, I did see a problem with aggregate data being the default dataset used, because of the richness of the available data on an individual level. In the long term, staying with state aggregate data by default is an unhealthy rut. As an editor, I wanted to see the use of individual-level NAEP data in future studies, and wrote an editorial to accompany the Nichols et al. article in hopes of prodding change. Unfortunately, on the whole, researchers who focus on achievement and accountability have ignored my plea. I think we’re the worse for that. At this point, the burden is on researchers who use NAEP data to use the data that is readily available, which is at the individual level.


  1. It should be kept in mind that those who use modern microeconomic methods sometimes have a fairly gung-ho approach to interpretation: not to the details of statistics but to the underlying meaning of concepts such as treatment effects. See the recent popular books Mostly Harmless Econometrics and Mastering Metrics by Joshua Angrist and Jörn-Steffen Pischke for a taste of this approach. To the James Bond of econometrics, the difference between a state mean scale score and a child’s achievement is not an insurmountable gap.