The shorter Bill Tucker: "No, you pull a number out of the air." Well, buckaroos, if anyone's going to be all airy-fairy on this quantitative-component teacher-evaluation thing, it's going to be all Ivory Tower, historian me. But I don't think you'd like that.
Maybe I should explain. Tucker's argument in a nutshell is, "Okay, if the Economic Policy Institute paper authors don't think 50% of teacher evaluations should be based on test scores, what number do they think is the right one?" The sly assumption here is that there should be an a priori percentage. Yes, yes, I get the point–if there isn't an a priori percentage, then why is 50% wrong?
Since I wasn't involved in developing the EPI white paper, I can step in here with my own judgment: I don't think there's a magic number, but both my research and my teaching experience tell me that no pragmatic weight for any derivative of test scores should come close to 50%. Maybe I'll be dead wrong, but I don't think so.
First, the teaching experience, since that's more easily explained. I have taught hundreds of students, and I don't remember a single time when more than 40% of any term-grade algorithm depended on a single assignment, whether a paper or an exam. Having multiple inputs to a term grade can be a headache, but I prefer having more information about student performance, and I suspect today's students hate classes where 50% or more of the grade depends on the final exam.
In addition, my experience calculating grades for hundreds of students tells me that a component's nominal weight has a nonlinear relationship with its influence on the final term grade. An assignment that contributes 20% of the final grade does not have twice the influence of one that contributes 10%. Part of the influence lies in the effective range of scores (something I've mentioned before). And part is a threshold effect: a three-point swing near a grade threshold can be decisive, so components with larger effective ranges are disproportionately likely to push students across those thresholds. That teaching experience is consistent with my airy-fairy paper on the subject, which you should take seriously because it has lots of fancy equations. I think I've been sufficiently careful there to explain the consequences of a Bayesian approach to evaluation components: don't try to widgetize evaluation systems.
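If you want to see the mechanism without the fancy equations, here's a toy sketch in Python. Everything in it is invented for illustration (class size, score distributions, grade cutoffs), and it is not the model from my paper: one component carries a 20% nominal weight but a narrow effective range of scores, next to one with a 10% weight but a wide range. With these made-up spreads, the 10% component typically changes more letter grades than the 20% one.

```python
# Toy illustration of nominal weight vs. actual influence on term grades.
# All distributions and cutoffs are invented, not drawn from any real class.
import numpy as np

rng = np.random.default_rng(0)
n = 500  # hypothetical class size

comp_a = rng.normal(85, 3, n).clip(0, 100)   # 20% weight, narrow range
comp_b = rng.normal(75, 15, n).clip(0, 100)  # 10% weight, wide range
comp_c = rng.normal(80, 10, n).clip(0, 100)  # 70% weight, everything else

final = 0.20 * comp_a + 0.10 * comp_b + 0.70 * comp_c

def letter(scores):
    """Map 0-100 scores to letter-grade bins at 60/70/80/90 thresholds."""
    return np.digitize(scores, [60, 70, 80, 90])

for name, comp, w in [("A", comp_a, 0.20), ("B", comp_b, 0.10)]:
    # Counterfactual: flatten this component to its class mean and count how
    # many letter grades change -- a crude measure of its real influence.
    flat = final - w * comp + w * comp.mean()
    flips = int(np.sum(letter(final) != letter(flat)))
    print(f"Component {name} ({w:.0%} weight): {flips} letter grades change")
```

The point of the sketch is only that the component with the wider effective range does more work near grade thresholds, whatever its nominal weight says.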
So what should happen? Glad you asked! Maybe we gather data on the different components, as is already happening in a number of places, and run simulations of evaluations under different models and for different classes of teachers. I even have some bold predictions you can test (see the toy sketch after this list):
- Any system for non-core academic teachers (e.g., teachers in the arts) is going to be clearly, obviously unstable.
- Within core academic areas, you will see less stability for teachers who share responsibility for some students (those with disabilities, English language learners, and so on).
- Well-trained teams of peer evaluators will create more stable evaluation components than either test-score components or typical administrator-observation evaluations.
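For anyone who wants to kick the tires, here's a back-of-the-envelope version of the kind of stability simulation I mean. The signal-to-noise ratios below are pure assumptions, not estimates from any real evaluation data; the only point being illustrated is that noisier components produce less stable year-to-year rankings, which is what the predictions above turn on.

```python
# Toy stability simulation: a fixed "true" teacher effect plus year-specific
# noise. Noise levels are assumptions chosen for illustration only.
import numpy as np

rng = np.random.default_rng(1)
n_teachers = 1000
true_effect = rng.normal(0, 1, n_teachers)  # hypothetical underlying quality

def year_to_year_stability(noise_sd: float) -> float:
    """Fraction of teachers who land in the same quintile two years running."""
    year1 = true_effect + rng.normal(0, noise_sd, n_teachers)
    year2 = true_effect + rng.normal(0, noise_sd, n_teachers)
    q1 = np.digitize(year1, np.quantile(year1, [0.2, 0.4, 0.6, 0.8]))
    q2 = np.digitize(year2, np.quantile(year2, [0.2, 0.4, 0.6, 0.8]))
    return float(np.mean(q1 == q2))

# Assumed noise levels: a noisy test-score component vs. a steadier
# peer-evaluation component. Swap in estimates from real data to test this.
print("noisy component  :", year_to_year_stability(noise_sd=2.0))
print("steady component :", year_to_year_stability(noise_sd=0.8))
```

Swap in noise levels estimated from real component data, run it separately for different classes of teachers, and these predictions stop being bold talk and become testable claims.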
As I've said before, the 50% figure was pulled out of thin air in the same way the "65% solution" figure was. Or maybe they were pulled out of pieces of someone's anatomy. We all do that on occasion: make sheer guesses that later turn out to be bold and utterly false. Most of us don't get to turn such braggadocio into policy, and I don't think it's wise for that to happen with teacher evaluation, either.
Update: Tucker responds.