This morning, the New York Times carried a column by Nicholas Kristof talking about the import of the Chetty, Friedman, and Rockoff paper; later, Kevin Carey wrote a blog entry telling us what to think about the study.1 To be honest, I’m shocked it took more than half a week for folks to use Friday’s Times story by Annie Lowrey as a springboard for public policy discussions. Maybe the quick responses by Bruce Baker and Matthew Di Carlo played a role in delaying the inevitable.2 What was most surprising about the Kristof column is not that he bought the weakest part of the paper as a shiny bright object (as did Carey) but that he first cited (and linked to) Di Carlo’s comments and then entirely ignored Di Carlo’s cautions about the extrapolatory analysis on young-adult effects.
What we’re seeing in both comments is confirmation bias: Carey and Kristof (but especially Kristof) are using the study to confirm preexisting policy preferences. Neither acknowledges any weakness in the extrapolations made by the study authors, even though several items should raise red flags for a reader reasonably well-educated in statistics. Carey even makes the (surprising) mistake of confusing statistical significance with effect size (note: see discussion of this item in comments).3
There’s a short passage in Carey that conflates two separate issues and requires some explanation in rebuttal:
Academics complain all the time that policy is insufficiently informed by evidence, and as a general proposition that’s true. But these complaints are themselves often informed by a vague or naive view of how standards of evidence properly translate to policy choices…. For CFR to conclude from their research that present policies ought to be more strongly weighted toward the possibility of going with someone else … [is] a case of academic researchers fulfilling their responsibility to make their findings meaningful on behalf of society.4
The issues here5 are whether it was appropriate for this study to identify useful policy consequences and, quite separately, the burden of proof in using research evidence.
1) Is the classroom-aggregate income claim a responsible effort by the researchers to reach out to policymakers? I don’t know if Carey read my comments on the paper, but I specifically pointed out clear policy consequences I saw from the stronger parts of the paper (specifically, the method CFR used to test potential bias effects on value-added measures from within-school student assignment). But exaggeration of policy implications annoys me as a reader, and that’s what I saw in the section Carey likes. He quoted a clear example of Statistical Bull Shiitake from the paper:
Replacing a teacher in the bottom 5% with an average teacher generates earnings gains of $9,422 per student, or $267,000 for a class of average size… (underlining of non-zero figures added)
I spent half a decade as a journal editor dreading the occasional discussion with authors on the number of non-zero figures that made sense in research results, and the basic lesson is that just because SAS prints out 16 digits doesn’t mean it’s impressive or justifiable to use all of them; in general, it shows one’s statistical ignorance (or a temptation to imply too-great accuracy) instead. In this case, the study authors estimate the income effects of a 1 standard-deviation change in teacher effects on the order of 0.9%-1.1%. So how did they get from two significant figures in the underlying parameter estimate to three or four in the dollar amounts? When I see that sort of nonsense, my first impression is that the measure is “merely corroborative detail to add verisimilitude to an otherwise bald and unconvincing narrative,” as Poo-Bah from The Mikado put it. Let me state clearly that the estimate of long-term outcomes by itself is fine as research. It’s the packaging of extrapolation as a soundbite that is irresponsible, and CFR have to perform some interesting contortions to come up with anything that doesn’t look like the moderate effects they found.6
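To make the significant-figures point concrete, here is a minimal sketch of the arithmetic. The $1,000,000 lifetime-earnings baseline is my own illustrative assumption, chosen only for round numbers, not a figure from the CFR paper; the point is what happens when a two-significant-figure estimate is propagated to a dollar amount:

```python
# Propagating the two-significant-figure parameter estimate
# (a 0.9%-1.1% earnings change per 1 SD of teacher effect) through a
# HYPOTHETICAL lifetime-earnings baseline. The $1,000,000 baseline is
# an illustrative assumption, not a figure from the CFR paper.
LIFETIME_EARNINGS = 1_000_000  # hypothetical, for illustration only

low, high = 0.009, 0.011  # the reported range: two significant figures

gain_low = low * LIFETIME_EARNINGS
gain_high = high * LIFETIME_EARNINGS
print(f"per-student gain: ${gain_low:,.0f} to ${gain_high:,.0f}")
# The range spans roughly $9,000 to $11,000: only the leading digit is
# stable, so a figure like "$9,422" implies four digits of precision
# that the underlying inputs cannot support.
```

Whatever baseline one substitutes, the spread between the low and high parameter estimates swamps everything past the first digit of the dollar figure.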
2) What is the proper default stance toward incorporating research, and who has the burden of proof in arguing policy? Carey argues that there’s no such thing as a nonchoice in policy–or, more specifically in the case of the Hanushek argument about “deselection” of teachers with low value-added measures, that the alternative to “deselection” (or to any proposed policy) is not a policy vacuum. At one level, I agree that the lack of explicit comparisons makes it difficult to argue ethically that one should wait for conclusive research findings before ever changing policy. But that’s a false dichotomy. One need not keep singing A Study’s About To Begin to know there’s a substantial difference between a nihilist approach to policy change (what Carey essentially accuses me of having) and cautious reception of a single study, no matter how interesting.7 And, speaking of explicit comparisons, the relevant section of the Chetty, Friedman, and Rockoff paper has no explicit modeling of opportunity costs. That’s the central conceptual engine in economics, and the lack of a full comparative analysis in their paper doesn’t mean the classroom-aggregate estimates are evil, just that they create a sketch and nothing more. Not much to hang policy on, Kristof and Carey’s protests notwithstanding.
- Yes, he used “what to think” in the entry title. Carey’s mind-control machinery isn’t working tonight, at least down here in Florida. I suspect he reversed the polarity of his neocortical reticulator. Or he bought the model that was shown at the 2011 Consumer Electronics Show. Didn’t he know not to trust that stuff? [↩]
- I think it notable that Baker, Di Carlo, and I generally agreed both on where the paper overreached and on what its significant contribution was. We don’t always agree, and the convergence of separate readings should say something to those inclined to take the paper at face value or to dismiss it out of hand. [↩]
- Given the quantity of records the study uses, it’s very easy to see statistical significance with a minimal effect. All statistical significance tells you is how likely it is that an apparent effect–not its magnitude, just its existence–could have been generated by random variation in the data. [↩]
- The full quotation of the second sentence from Carey, without ellipses, is “For CFR to conclude from their research that present policies ought to be more strongly weighted toward the possibility of going with someone else isn’t the academic equivalent of staging a fake wedding for Entertainment Tonight and pocketing the profits, it’s a case of academic researchers fulfilling their responsibility to make their findings meaningful on behalf of society.” He’s responding to my quip that for Lowrey to spend more than 10% of the article on the authors discussing the need to fire teachers based on value-added measures is a waste of column inches when she did not get Jesse Rothstein’s response to CFR’s bias-testing method. Carey is confusing my criticism of CFR with my criticism of Lowrey. [↩]
- Apart from Carey’s misunderstanding my Kardashian quip. [↩]
- I could distort the findings to minimize the effect of schools on the lives of students, but that would be equally irresponsible. The demonstration of both a deliberate minimization and its methodological irresponsibility is left as an exercise for the reader. [↩]
- As far as I am aware, neither Bruce Baker nor Matthew Di Carlo is on record as being a policy change nihilist. [↩]
4 responses to “Myside bias in deciding “what to think” about research results–(S)extrapolation II”
I thought the irony of the blog post title was obvious.
How do I confuse statistical significance with effect size? That paragraph doesn’t even mention effect size.
If the authors had said, “about $9,400 per student,” would you have no other objections to their conclusions?
You say “I’m shocked it took more than half a week for folks to use Friday’s Times story by Annie Lowrey as a springboard for public policy discussions.” Do you think using the study, or the story, as a springboard for public policy discussions is inappropriate or regrettable? Why?
Here’s the relevant sentence Kevin and I are discussing: “That’s why the study is full of findings at a p < .01 level of significance, or better, i.e. ‘beyond a statistical doubt,'” in the middle of a paragraph about the value of developing information systems. If I misread “full of findings,” mea culpa; the alternative reading is pretty odd (and probably why I read the sentence the way I did): that we should be so glad we have millions of records because the findings are so modest that they require the accumulation of hundreds of thousands of records to avoid findings of statistical nonsignificance. (There’s another subtle problem with the claim that the study was “full of findings,” but I’ll let the reader search for “Bonferroni adjustment” to see why.)
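For readers who want the significance-versus-magnitude point made concrete, here is a quick simulation sketch. The data are hypothetical (not the study’s), and the sample size and number of tests are my own illustrative assumptions: with a million records per group, even a trivially small 0.01 standard-deviation effect sails past p < .01, and the Bonferroni aside falls out of the same arithmetic:

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(0)
n = 1_000_000          # hypothetical records per group: a "large" dataset
true_effect = 0.01     # a trivially small true effect: 0.01 SD

a = rng.normal(0.0, 1.0, n)
b = rng.normal(true_effect, 1.0, n)

# Two-sample z-test (SD is known to be 1 in this simulation)
z = (b.mean() - a.mean()) / sqrt(2.0 / n)
p = erfc(abs(z) / sqrt(2.0))   # two-sided p-value
print(f"z = {z:.1f}, p = {p:.1e}")   # comfortably below .01

# Significance attests to an effect's existence, not its size:
# the effect here is a mere 0.01 SD, yet p is microscopic.

# The "Bonferroni adjustment" aside: a paper "full of findings" runs
# many tests, so the per-test threshold should shrink to alpha / m
# to keep the overall false-positive rate near alpha.
m = 50                 # hypothetical number of reported tests
print(f"Bonferroni per-test threshold: {0.01 / m:.5f}")
```

The simulation is the converse of the journalistic reading: huge samples make statistical significance cheap, which is exactly why significance alone says nothing about whether an effect is worth building policy on.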
On the sig figs issue (until I have more time to write), “about $9,400 per student” would be consistent with more modest statements in general about lifetime earnings and classroom aggregates. As I stated in a footnote, I could create an alternative debating point from the same conclusions, but I don’t wish to because (a) there are similar methodological problems with the debating point I could craft, and (b) it’s an inappropriate use of the findings.
My question for Kevin is: What concrete policy conclusions do you draw from this paper’s findings?
Assuming you don’t believe in dismissals based entirely on growth model estimates, are you simply saying that these methods have some legitimate role to play in teacher personnel policies?