Can sampling save high-stakes testing?

Over the weekend, the Washington Post’s Valerie Strauss described one Colorado school district’s proposal to test a sample of children for accountability purposes. Proposals like this float up occasionally: let’s test not all children in all subjects but a sample. Sometimes the National Assessment of Educational Progress (NAEP) sampling plan is used as a model: if sampling is good enough for the gold standard for assessment in the country, why shouldn’t it be used everywhere?

I’m a fan of a great deal about NAEP, and I wouldn’t mind some use of sampling in assessment systems. Having said that, there are limits to borrowing from NAEP, because that system was not designed for accountability purposes. There is no expectation that NAEP will provide detailed profiles of performance at the student, classroom, or school level. To move from an every-child-every-year testing system to a sampling system for all subjects would reduce the total amount of testing in schools. But I don’t think it would pass political muster, because it removes the illusion that we can closely estimate the academic performance of individual children, classrooms, or schools, let alone address issues such as whether vulnerable populations are getting the short end of the stick. And that is the itch that high-stakes testing scratches.

There is another reason why moving to a sampling system for annual tests is likely to be unsatisfactory: it may not reduce testing or test-prep as much as such a proposal might suggest. On top of NCLB-style testing in the past decade, districts have layered on additional tests partway through the year. So-called “benchmark testing” at its best has a light footprint in the school and provides important feedback. At its worst, it lards up the year with even more poorly constructed tests, creating formative assessment theater driven by the desire to target mid-year interventions at the district level and hold a whip over the heads of principals. Parents and school board members worried about extensive and wasteful test-prep should focus as much attention on benchmark testing as on annual testing and ask important questions about the time involved, the speed and usefulness of feedback for teachers, and the business aspects of decisions (was the choice of assessment tied to a commercial bundle of curriculum sold to the administration as a “complete solution” for accountability pressures?).

There is a potential role for assessment sampling: fighting the narrowing of curriculum that often attends high-stakes testing. Here, sampling would not be of the NAEP style, which is expensive to create. Rather, it would be simple: say, ask 5 or 10% of high school students in a history course to take something like an AP document-based question, and have the answers assessed in part publicly by a committee of teachers, parents, and community stakeholders (with anonymization of student answers, of course). The results could not be analyzed at the classroom level, with a few exceptions (recorded performances by music ensembles would count), but they would allow conversations around the breadth of the curriculum at the district level.

I suppose if PARCC and Smarter Balanced had built sampling into their framework, it could also be used to make sure that assessments in math and reading covered more of the curriculum, even if not measured at the student or classroom level. But unless I am misreading their materials, that simple step was not part of the so-called “next generation” tests that will not have form-fitting uniforms or much else next-generation-y. Alas, a more complex system of sampling needs more investment than most individual states can muster. And even there, there are limits to what it could accomplish.

I wish there were a magic wand like sampling that would solve all of the concerns parents and others have about high-stakes testing. But in the same way that every-child testing is not a panacea, neither is sampling.

One response to “Can sampling save high-stakes testing?”

  1. Paul Bielawski

    One of the serious limitations in using assessment data for accountability is the trade-off with reliability. NAEP uses a stratified sample to estimate performance of the population of students in a state. The beauty of NAEP is that it is not a high-stakes assessment. It is given in 200 schools in a state every other year, and few schools repeat across administrations. For accountability at the school level, the number of student scores needed is usually around 30 to get sufficient reliability in an annual performance measure. If it took four students to get a complete assessment, the minimum n would be increased to 120 or so. There would be very many small schools without an annual accountability rating.
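
The arithmetic in this comment can be sketched in a few lines of Python. This is an illustration only, not anything from the post or from NAEP: the function names and the list of school enrollments are hypothetical, and the sketch simply takes the comment's figures at face value (about 30 scores for a reliable annual measure, four students needed to cover one complete assessment).

```python
def min_students_for_rating(scores_needed: int, students_per_complete_assessment: int) -> int:
    """Students a school must test so that `scores_needed` complete assessments exist.

    With every-child testing, each student yields one complete score, so the
    second argument is 1. If the test is split so that it takes several students
    to cover the whole assessment, the minimum scales up proportionally.
    """
    return scores_needed * students_per_complete_assessment


def schools_without_rating(tested_enrollments: list[int], min_n: int) -> list[int]:
    """Return the enrollments of schools too small to reach the minimum n."""
    return [n for n in tested_enrollments if n < min_n]


# Figures taken from the comment above.
every_child_min = min_students_for_rating(30, 1)   # 30
sampled_min = min_students_for_rating(30, 4)       # 120

# Hypothetical tested-grade enrollments for four schools (made up for illustration).
enrollments = [45, 90, 130, 300]

unrated_every_child = schools_without_rating(enrollments, every_child_min)
unrated_sampled = schools_without_rating(enrollments, sampled_min)
```

In this made-up example all four schools clear the bar of 30, but the two smallest fall short of 120, which is the commenter's point about small schools losing an annual rating.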