About five weeks ago, Kevin Carey wrote a longish blog entry about null hypotheses, the status quo, and decision-making about policy. The gist of Carey’s argument was that we should be willing to make policy changes with a preponderance of evidence in favor of change.
Carey claimed that academic skepticism aimed at various policy proposals was a legacy of frequentist notions of the null hypothesis, where you have to prove that a result was unlikely to have occurred by chance (usually stated as a p < .05 threshold, though that’s a value choice and convention, not carved into tablets). In contrast, he said, policy options need to be chosen on an epistemological equivalent of first-past-the-post voting, i.e., based on the preponderance of evidence on which was the best option at the time.
I think Carey has at least a few people pegged wrong in the reasons for skeptical views of reform, including me, and I think he has the causality backwards for the few social science-y folks for whom he might be right on surface rhetoric. The null hypothesis exists in the disciplines where it does because academics (and I hope others!) are conservative in accepting new claims of Truth (or truth). We’re socialized to be skeptical, to begin with the caveats as the main story, and the null-hypothesis framework is just one operationalization of that broader academic culture. (Minor bit of evidence: usually the first advice given to academics in media training is to reverse the order of presentation, to start with a main positive claim and only later get to caveats.) Because academics are conservative on changing views about their disciplinary reality, the most popular type of article with a new factual claim is the plausible surprise, the small twist on disciplinary convention that makes the reader go, “Hmmmn…. not what I had thought, but I can see it.” (There’s also a danger in that socialization: a scholar can create a professionally attractive claim by heading for that plausible-surprise sweet spot. Witness the Bancroft Prize given to Michael Bellesiles’ Arming America before the fabrication/falsification charges were investigated, followed by his resignation and the embarrassed withdrawal of the Bancroft.)
Back to the core of Carey’s argument instead of the straw-man argument he had created: Carey was responding to criticisms of value-added approaches to accountability (by the anonymous Eduwonkette, but I’ve made similar criticisms). Over at Eduwonkette’s blog, skoolboy argued in rebuttal that policy conservatism exists because policies are always enacted in specific times and places, and the real costs of implementation as well as the existence of unintended consequences mean that the a priori preponderance of evidence is not always a good prediction of what would happen in practice. This is very close to the default framework I remember from cross-examination team debates in high school, where the negative wins by default unless the affirmative team overcomes the predisposition towards the status quo. (For the life of me, I can’t remember the early-1980s term used for this, though I think it started with a p.) But the default position in high school debate is a faux default created to hone the competition with ground rules rather than a great Rule in the Sky.
There are two broader perspectives I have on this question about warrants and evidentiary evaluation, and then an idea for someone else’s dissertation. First, the status quo v. reform framework is itself fictive. There ain’t no such thing as a monolithic status quo or monolithic reform, policy rhetoric is fluid, and evidence about practices isn’t stagnant, either. I don’t even think many people use that specific framework as the set of mental bins in which they store the various policy proposals floating in the ether. C’mon, Kevin and skoolboy, fess up: where would you slot “performance pay”? “Collective bargaining”? “High-stakes testing”? [insert whole-school curriculum plan here]? You can think of the counterarguments as well as I can. We can talk about the policy frameworks people work with, but they’re likely to be much more earthy than “I work with a preponderance standard” or “I’m waiting for a representative sample before I’m convinced.” Well, unless you’re one of Russ Whitehurst’s in-house methodological purists (and I doubt even Whitehurst is his own purist). That fictive framework doesn’t mean that people don’t ask questions about “school reform,” but the more useful work takes the term as much as a problem as a foundation (e.g., Tyack and Cuban’s work).
It may be useful here to separate the evaluation of factual claims from the evaluation of policy options. In my relatively limited experience in the world, both inside and outside academe people have separate ways of judging claims. In academe, these are very roughly divided into questions of procedural warrants and questions of substantive warrants. Procedural warrant debates are often called methodology, especially in experimental disciplines, but the procedural warrant does not always require a section called methods: the historian’s standard procedural warrant is the footnote, and it’s a pretty serious matter if you screw that one up (see the case of Bellesiles, referred to above). Substantive warrants revolve around the interpretation of evidence and how that dovetails with previous disciplinary knowledge and substantive frameworks (i.e., “the literature”). Herbert Gutman’s response to/critique of Fogel and Engerman’s Time on the Cross is full of such substantive arguments about what he claimed were Fogel and Engerman’s misinterpretations of the evidence.
In a similar way, we can (and do) have all sorts of debates about what the right substantive questions are on policy as well as what evidence we will accept about a particular factual claim. The last time elected officials took a very-well-designed study as the sole basis for creating policy was California’s class-size initiative. STAR was a great study. I suspect Kevin Carey would admit that California’s policymakers didn’t ask enough questions after being convinced of the factual claim that in Tennessee, a pretty-darned-close-to-random-assignment study documented both short- and long-term benefits to very low elementary class sizes.
So you know by now that I’m an advocate of separating the evaluation of factual claims from making policy. On the narrow question of evaluating factual claims, I’m going to be even more iconoclastic: there is a difference between confirmation bias and outright irrationality. We all have confirmation bias, and moreover, there’s a pretty good case to be made that it’s okay to have a confirmation bias if you’re honest. I’m not much into Bayesian probability theory, but there’s some pretty famous philosophical stuff which starts from the premise that we have preconceived ideas about what truth is before we come across any chunk of evidence pertaining to a factual claim. If I understand the Bayesian perspective, that’s not irrational because our personal (or subjective) judgment about reality before we come across a chunk of evidence should be affected by the evidence to push us towards post-encounter judgments of reality (or, more formally, posterior probability estimates). Or, in more gutsy language, it’s okay to have preconceptions as long as you’re willing to change them based on the evidence. What’s not kosher is to entirely ignore evidence that’s been reasonably vetted. (Holocaust denial claims and the like can be dismissed because their advocates have violated this test of rationality.)
While driving around central Florida over the past few weeks, I’ve been thinking about the Carey-skoolboy posts and trying to think through a formal approach to work backwards from Bayes’ theorem to an identification of assumptions, something like this: “After reading a lay description of research claiming benefits for prescribing watermelon juice to ADD-identified adolescent boys, a reader is still skeptical and believes that it’s highly unlikely (say, only a 2% probability) that the claim is true. From that posterior gut-level belief and the research evidence, can we infer an a priori assumption about the claim?” Unfortunately, my glorious plans for a simple article that would win me the Nobel Prize for Mathematical Political Philosophy Written by Historians were chewed up by a rabid pack of math and common sense (to paraphrase Berkeley Breathed). (For this reason, please do not put me in charge of planning any post-invasion occupation of a country. Or planning the Great Hydrogen Economy. Or the Best School Uniform Policy. I tend to be… oh, yeah, academically skeptical.)
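For the curious, here is a rough sketch of what "working backwards" would look like in the odds form of Bayes' theorem, where posterior odds equal prior odds times a likelihood ratio. The function names and the likelihood ratio of 5 are my own hypothetical choices for illustration. The big assumption is that the reader could put a single number on how strongly the evidence favors the claim, which is a large part of what the math and common sense objected to.

```python
def posterior_from_prior(prior: float, likelihood_ratio: float) -> float:
    """Forward Bayesian update in odds form.

    likelihood_ratio = P(evidence | claim true) / P(evidence | claim false).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if likelihood_ratio <= 0.0:
        raise ValueError("likelihood_ratio must be positive")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


def prior_from_posterior(posterior: float, likelihood_ratio: float) -> float:
    """Invert the update: recover the prior implied by a stated posterior."""
    if not 0.0 < posterior < 1.0:
        raise ValueError("posterior must be strictly between 0 and 1")
    if likelihood_ratio <= 0.0:
        raise ValueError("likelihood_ratio must be positive")
    posterior_odds = posterior / (1.0 - posterior)
    prior_odds = posterior_odds / likelihood_ratio
    return prior_odds / (1.0 + prior_odds)


if __name__ == "__main__":
    # Hypothetical numbers: the reader ends up at 2% after seeing evidence
    # assumed to be 5 times likelier if the watermelon-juice claim is true.
    implied_prior = prior_from_posterior(0.02, 5.0)
    print(f"implied prior: {implied_prior:.4f}")
```

The inversion is trivial once the likelihood ratio is fixed. The inferred prior is only as trustworthy as that number, and a skeptical reader and an enthusiastic one would almost certainly disagree about it.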
But in my late-night wanderings through lit databases, I came across a fascinating 2004 article by Drazen Prelec in Science that argues for a much better way of processing subjective judgment than well-known approaches such as the Delphi process. And then there’s a more concrete demonstration paper he wrote with H. Sebastian Seung. Prelec’s search for a “Bayesian truth serum” is wonderfully outlandish, but the basic stuff seems to be sensible, which is to use an individual’s own set of judgments as a filter with which to identify particularly common or uncommon judgments in a data set and particularly accurate or inaccurate judgments of individuals about the distribution of judgments in the population. That’s pretty abstract, but it strikes me as a definite improvement on the Delphi process and possibly very useful for research on sociometrics… or subjective judgments of education. Last doc student in is a rotten egg!
(No, not really, but if you can understand the math of both papers, there are some obvious applications here.)