Earlier today on Twitter, Annie Lowrey and I had a brief exchange of tweets about the column inches she used for the Hanushek-like extrapolation in Friday’s New York Times story on the Chetty, Friedman, and Rockoff value-added measure paper.1 Of the 1,142 words in the story, about 15 percent went to the following passages:
All else equal, a student with one excellent teacher for one year between fourth and eighth grade would gain $4,600 in lifetime income, compared to a student of similar demographics who has an average teacher….
In the aggregate, these differences are potentially enormous. Replacing a poor teacher with an average one would raise a single classroom’s lifetime earnings by about $266,000, the economists estimate. Multiply that by a career’s worth of classrooms.
“If you leave a low value-added teacher in your school for 10 years, rather than replacing him with an average teacher, you are hypothetically talking about $2.5 million in lost income,” said Professor Friedman, one of the coauthors….
“The message is to fire people sooner rather than later,” Professor Friedman said.
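It is worth seeing how thin the arithmetic behind those headline numbers is. Here is a back-of-envelope check using only the figures quoted above — a sketch, not the paper’s actual calculation, and the gap from Friedman’s $2.5 million presumably reflects rounding or discounting in the paper itself:

```python
# Back-of-envelope check on the quoted figures (not the paper's own
# calculation, which involves its own discounting assumptions).
per_classroom_gain = 266_000   # replacing a poor teacher with an average one
years = 10                     # Friedman's ten-year hypothetical

print(f"${per_classroom_gain * years:,}")   # $2,660,000, vs. the quoted
                                            # "hypothetically $2.5 million"
```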
There are two reasons why those passages concerned me. One was the policy frame the paper’s authors accepted from Erik Hanushek: the main way you improve teaching is by firing people. But more relevant to the reporting, Lowrey was spending a measurable amount of space on the weakest part of the paper.
I understand the instinct of reporters to look for implications, especially in reporting on research. Who cares that CFR created a new algorithm to test whether a teacher’s value-added measures might be an artifact of student assignment within a school? In my professional judgment, that algorithm is much more important than the (s)extrapolation in section 5 of the paper. But it is very hard work for a reporter to explain that significance, and when researchers such as Erik Hanushek or Chetty, Friedman, and Rockoff give you a hook, a strong sound bite is hard to resist.
I don’t think Chetty, Friedman, and Rockoff wrote the paper to attract reporters’ interest by section 5 — if so, it wouldn’t really be section 5. But Friedman certainly went there in the interview, and Lowrey used it. The same space could have been used to get Jesse Rothstein’s views on whether the paper addressed the potential bias of value-added measures. While Lowrey quotes Rothstein, it’s on a different point entirely. If you’re writing a story for the New York Times, for goodness’ sake, don’t talk about the research equivalent of the Kardashians when there’s more substantive material available! Lowrey’s a good reporter on the whole, but in this case I cringed at the waste of space on (s)extrapolation.
For another view on the same question, here’s Bruce Baker (who lays more of the blame on Chetty and Friedman):
These two quotes by authors of the study were unnecessary and inappropriate. Perhaps it’s just how NYT spun it… or simply what the reporter latched on to. I’ve been there. But these quotes in my view undermine a study that has a lot of interesting stuff and cool data embedded within.
One more bit of perspective: the reporter’s job is complicated by a history of researchers’ odd attempts to quantify the unquantifiable, most obviously in the practice of cost-benefit analysis (how much is a 46-year-old professor’s life worth?). And in education, there are various studies that make reasonable but fragile assumptions, whether you’re talking about cost-effectiveness analyses such as Clive Belfield and Hank Levin’s work on various interventions or the whole practice of meta-analysis. So what can reporters do when trying to explain the significance of new research, without getting trapped by a poorly-supported sound bite?
- If a claim could be removed from the paper without affecting the other parts, it is more likely to be a poorly-justified (s)implication/(s)extrapolation than something that connects tightly with the rest of the paper.
- If a claim is several orders of magnitude larger than the data used for the paper (e.g., taking data on a few schools or a district to make claims about state policy or lifetime income), don’t just reprint it. Give readers a way to understand the likelihood of that claim being unjustified (s)extrapolation.
- More generally, if a claim sounds like something from Freakonomics, hunt for a researcher who has a critical view before putting it in a story.
Notes
- If you are curious about my substantive views of the paper, see my comments, and I also highly recommend the notes of Bruce Baker. [↩]
There is a problem with the fundamental extrapolation itself, Sherman, let alone the sound bite.
Three years ago, I had the opportunity to examine the math test “growth scores” for my own tenth-grade chemistry students. I had to go through the school report and pick out my own students by name. The first thing that surprised me was how many names were missing, but that is for another comment.
The method compared each student’s eighth-grade (low-stakes) MCAS score with his or her tenth-grade high-stakes score, and then weighted the difference somehow according to the expected performance of the socioeconomic cohort to which each student was assigned. I believe the high-stakes/low-stakes score difference was compensated for in that process.
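To make that concrete, here is a toy version of the kind of computation I mean, with made-up numbers and an assumed linear cohort model (the actual state growth model is more elaborate than this sketch):

```python
import numpy as np

# Toy growth-score computation: a student's "growth" is the gap between the
# actual grade-10 score and the score predicted from grade-8 performance and
# cohort expectations. All coefficients here are invented for illustration.
rng = np.random.default_rng(0)

n_students = 25
grade8 = rng.normal(240, 15, n_students)             # grade-8 scaled scores
predicted = 0.8 * grade8 + 50                        # assumed cohort prediction
grade10 = predicted + rng.normal(0, 12, n_students)  # actual grade-10 scores

growth = grade10 - predicted                         # per-student residuals
print(f"teacher 'value-added': {growth.mean():+.2f} points")
```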
I was struck by the large number of very wild swings, in both directions, in the raw test scores. Since I knew the students well, I could see the effect of a drug crash or family tragedy or triumph in some, but in others the result was just baffling.
Overall, my value-added was slightly above average, but when I calculated the mean for each section, a pattern emerged. My blue class had very poor growth. I’m a bad teacher. On the other hand, my yellow section was really spectacular. If we could fire the blue me and give all those students the yellow me for three years running, their lifetime earnings would skyrocket.
Of course the sample size is too small, and the data too limited, to draw any such conclusion. The within-group variance swamps the differences in the means, and that was exactly the conclusion Mathematica Policy Research came to when it examined the statistical underpinnings of “value-added” metrics.
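A quick simulation shows how easily that happens. With invented but plausible numbers, four sections taught identically by the same teacher still produce section means that wander all over:

```python
import numpy as np

# Four sections, same teacher, zero true teacher effect. With a realistic
# spread of individual growth residuals, the section means still differ by
# chance alone. The numbers are invented for illustration.
rng = np.random.default_rng(1)

n_sections, n_students = 4, 25
within_sd = 12.0   # assumed spread of individual growth residuals

sections = rng.normal(0.0, within_sd, size=(n_sections, n_students))
means = sections.mean(axis=1)
std_err = within_sd / np.sqrt(n_students)   # ~2.4 points per section mean

for i, m in enumerate(means, start=1):
    print(f"section {i}: mean growth {m:+6.2f} (std. error ~ {std_err:.1f})")
```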
There will always be sections with higher and lower average growth, but those differences are an artifact of how the populations were selected. They never accumulate in the real world, because they aren’t real.
This is what it must have felt like, to honest people, as the “science” of phrenology gained ascendancy.
Mary,
Very interesting take on testing results.
Are the “wild swings” more like a bimodal distribution? Or trimodal?
I saw this kind of “polarization” in the college course evaluations I developed, with older students getting more out of the class than the younger students just out of high school.
You suggest that these results correlate with what you know to be going on — but the statistical view of testing seeks to minimize just that sort of correlation. Too bad for the students, eh?