Please leave your magic numbers on the magic carpet with the magic wand

The shorter Bill Tucker: "No, you pull a number out of the air." Well, buckaroos, if anyone's going to be all airy-fairy on this quantitative-component teacher-evaluation thing, it's going to be all Ivory Tower, historian me. But I don't think you'd like that.

Maybe I should explain. Tucker's argument in a nutshell is, "Okay, if the Economic Policy Institute paper authors don't think 50% of teacher evaluations should be based on test scores, what number do they think is the right one?" The sly assumption here is that there should be an a priori percentage. Yes, yes, I get the point–if there isn't an a priori percentage, then why is 50% wrong?

Since I wasn't involved in the development of the EPI white paper, I can step in here with my judgment: I don't think there's a magic number, but both my research and my teaching experience tell me that there is no pragmatic setting of a weight that should come close to 50% for any derivative of test scores. Maybe I'll be dead wrong, but I don't think so.

First, the teaching experience, since that's more easily explained. I have taught hundreds of students, and I don't remember a single time when more than 40% of any term grade algorithm depended on a single assignment–not a paper nor an exam. Having multiple inputs to a term grade can be a headache, but I prefer having more information about student performance, and I suspect today's students hate classes where 50% or more of the grade depends on the final exam.

In addition, my experience calculating grades for hundreds of students tells me that weighting has a nonlinear relationship with the influence of a single assignment on the final term grade. An assignment with 20% contribution to the final grade does not have twice as much influence as an assignment with 10% contribution. Part of the influence is in the effective range of scores (something I've mentioned before). And part is a threshold effect: because three points near a grade threshold can be especially influential, components with larger effective ranges have a disproportionately larger influence on final grades. That teaching experience is consistent with my airy-fairy paper on the subject, which you should take seriously because it has lots of fancy equations. I think I've been sufficiently careful to explain the consequences of a Bayesian approach to evaluation components: don't try to widgetize evaluation systems.
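To make the effective-range point concrete, here is a minimal sketch. It is not the method from the paper mentioned above. It assumes the components are uncorrelated and measures "influence" as each component's share of the variance in the weighted total. The function name `effective_shares`, the weights, and the score spreads are all made up for illustration.

```python
def effective_shares(weights, spreads):
    """Share of the variance in a weighted total attributable to each component.

    Assumes the components are uncorrelated (a simplifying assumption).
    weights: nominal weights of each component in the final grade.
    spreads: standard deviation of scores on each component (same scale).
    """
    contributions = [(w * s) ** 2 for w, s in zip(weights, spreads)]
    total = sum(contributions)
    return [c / total for c in contributions]


# Hypothetical course: four components with invented weights and score spreads.
weights = [0.10, 0.20, 0.30, 0.40]
spreads = [5.0, 20.0, 10.0, 10.0]  # standard deviations on a 0-100 scale

shares = effective_shares(weights, spreads)
for w, s, share in zip(weights, spreads, shares):
    print(f"nominal weight {w:.0%}, spread {s:4.1f} -> effective share {share:.1%}")
```

Under these invented numbers, the 20% component with a wide spread of scores carries the same effective share as the 40% component, and far more than twice the share of the 10% component. The nominal weight alone does not tell you how much a component moves the final grade.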

So what should happen? Glad you asked! Maybe we gather data for different components, as is happening now in a number of places, and run simulations for evaluations with different models and for different classes of teachers. I even have some bold predictions you can test:

  • Any system for non-core academic teachers (e.g., teachers in the arts) is going to be clearly, obviously unstable.
  • Within core academic areas, you will see less stability for teachers who share responsibility for some students (with disabilities, who are English language learners, etc.).
  • Well-trained teams of peer evaluators will create more stable evaluation components than either test-score components or typical administrator-observation evaluations.
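The kind of simulation suggested above might be sketched as follows. Everything here is an assumption for illustration: the function `year_to_year_stability`, the two-component setup, and especially the noise levels are invented, so the output says nothing about real evaluation systems. It only shows the shape of the exercise: pick a weight, simulate two years of composite ratings for the same teachers, and see how well one year predicts the next.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def year_to_year_stability(test_weight, n_teachers=5000,
                           test_noise=2.0, obs_noise=1.0, seed=0):
    """Correlation of a composite rating across two simulated years.

    test_noise and obs_noise are hypothetical noise levels (standard
    deviations) for a test-score component and an observation component.
    """
    rng = random.Random(seed)
    year1, year2 = [], []
    for _ in range(n_teachers):
        quality = rng.gauss(0.0, 1.0)  # fixed underlying teacher effect
        for scores in (year1, year2):
            test = quality + rng.gauss(0.0, test_noise)
            obs = quality + rng.gauss(0.0, obs_noise)
            scores.append(test_weight * test + (1 - test_weight) * obs)
    return pearson(year1, year2)


for w in (0.0, 0.25, 0.5, 1.0):
    r = year_to_year_stability(w)
    print(f"test-score weight {w:.0%}: year-to-year correlation {r:.2f}")
```

With real data, the noise levels would come from the components actually collected, and the same loop could be run separately for different classes of teachers to check the predictions in the list above.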

As I've said before, the 50% figure was pulled out of thin air in the same way that the "65% solution" figure was pulled out of thin air. Or maybe they were pulled out of pieces of someone's anatomy. We all do that on occasion, make sheer guesses that turn out later to be bold and utterly false. Most of us don't get to turn such braggadocio into policy, and I don't think it's wise for it to happen with teacher evaluation, either.

Update: Tucker responds.

9 responses to “Please leave your magic numbers on the magic carpet with the magic wand”

  1. Bruce Baker

    Now, I like numbers and sometimes feel as though I’ve become 50% economist (not sure if that’s really my magic number though). But I have noticed of late on my own blog that this “magic number” argument is becoming increasingly prevalent among the pundits and reformers in response to critiques of their ill-conceived and overly rigid proposals. I like that you mentioned 65% solution above, because indeed, every time I criticized that idea I was confronted with the question – “well then smarty pants professor guy… what is the magic number?” Telling them that there likely wasn’t one usually didn’t go over very well.

    My most recent confrontation with the bogus “magic number” argument was last week on my own blog, when I argued that Louisiana in particular had been slacking off on education funding and was unlikely to make substantive improvements to their education system without substantially greater investment (and not RttT type investment, but the basics). As one might expect, I was pressed to provide the “magic number” for what Louisiana should spend. Is there a handbook out there that these people are using that tells them to request the “magic number?” What is the source of this “magic number” insurgency?

  2. Catherine Lugg

    To me, all of this search for “holy grail” percentages feels like Monty Python meets “Education and the Cult of Efficiency.” If we could only get the numbers “right,” public schools would end their “holy” quest for whatever the current political fetish may be involving schooling.

    But if the quest is, by definition, on-going and never-ending (like the search for any “perfect performance”), then it becomes much harder to develop magic public policy solutions. For policy entrepreneurs whose livelihoods, political status, or both, depend on simple solutions to hideously complex and idiosyncratic situations (see public education), that’s a non-starter or “buzz kill.”

    In a society filled with all sorts of people who love the lure of magic numbers (see lotteries, CNBC, the Wall Street Journal, “supply siders,” “deficit hawks,” etc.), the more complex and nuanced policy ideas will always have an uphill political fight. Like the poor woman accused of being a witch in Python’s “Holy Grail,” those who offer these policy proposals are likely to be sunk no matter what.

  3. Chad


    I think you’re being a little disingenuous here. It’s understandable to argue the 50% figure is too high, but right now the percentage pretty much everywhere is 0. I doubt there is a magic number here, but it’s somewhere above 0 and less than 100. And, if you agree with that, we should be talking about the best way to slowly incorporate this new data, test it out, and be prepared to tinker with the measure or expand it as we learn more.

  4. Bill Tucker

    Hi Sherman — I should have answered my own question. My response here:

  5. Chad


    I didn’t mean to be offensive, and I certainly didn’t mean to attack your credentials on this issue. My use of the word “disingenuous” was directed at two things. One, I think the “magic wand” title was a little glib, and, two, I think you know that there are people who are using or who will use the EPI paper in defense of the 0% stance.

  6. CCPhysicist

    Bill Tucker mentions what he might do if he had his own school, but forgot to mention whether he would get to select the students and how those students get assigned to teachers. If you get to assign students to teachers, you can guarantee whatever result you want in anything other than an elite high school environment.

    I stumbled on this blog observation today …
    … and it makes an important point that those of us teaching at a university (or who attended HS in a previous lifetime) can easily forget. We can throw kids out of class. We can have the campus police remove them if they won’t leave. High school teachers are stuck with whatever jerks someone puts in their class. The idea that you weight the value-added performance measure based on the number of disciplinary referrals out of a classroom is an interesting one and fits well with Sherman’s observation about other cases where the teacher faces specific challenges.

    Even someone with the skillz to take that 12th grader working at a 3rd grade level of math up to the 6th grade level will not be able to do so if that kid is in a 10th grade class of kids where the objective is to get them from 10.0 to 11.0+. Doubly so if his best friend is there too.

    I’ve written here before about my view that no “A” teacher should get a special bonus unless they can swap classes (including swapping schools) with an “F” teacher and do just as well, and vice versa for firing the “F” teacher. You see, anyone would have looked good if I had been in their class.

  7. Bob Calder

    Sometimes I wonder if it is all noise.

    It might be true that creating this kind of accountability measurement is motivated by the helpless feeling of being incapable of evaluating employees in the face of what appears to be the overwhelming evidence of scores. The scores must therefore be crucial to the measurement. It seems like common sense, but common sense isn’t necessarily to be trusted.

    It may also be true that students in different cultural settings do not place the same stress on success. High stakes are only scary if you think the people making the test have power over you. It wouldn’t take much of an effect. Personally, I would like to see how much of an effect sugar pills would have if given on test day in combination with a suitable ceremony.