The Institute of Education Sciences has released a new evaluation of the Teacher Incentive Fund (TIF) after its first two years. The study, by Mathematica Policy Research, is generally solid research, at least on a quick first read today. The main findings:
- Most of the experimental part of TIF was implemented by the schools.
- Some parts of the program were more difficult to implement (e.g., higher performance pay for a more limited group of educators) or more difficult to maintain.
- Part of the logic model was hard to confirm, especially whether educators understood their opportunities to earn higher pay.
- The bottom-line effects on student performance were weak: 0.04 standard deviations in math, 0.03 in reading. If you obsess about p values, only the association with reading was statistically significant.
I say that this is generally solid research… until you get to the part of the document where the main effect size for reading is translated into a statement that teacher and principal performance pay is associated with three additional “weeks of learning” in reading. Mathematica is using a common, well-intended attempt to translate the abstract concept of effect size into something a general audience can understand. This translation has become more common in the last few years.
It is also bad interpretation. That doesn’t mean that documents should not attempt the translation for a general audience, but there are problems with just using terms like “three weeks of learning” as naked representations. To cut to the chase:
- “Weeks (or days) of learning” avoids the most important part of recontextualizing effect sizes: comparing the effect size in an individual study with effect sizes from empirical research in the same domain — i.e., if you are translating your research findings for use in the real world, how does this intervention or policy compare with other interventions or policies that are realistic alternatives?
- “Weeks (or days) of learning” implies more accuracy than is realistic; it is hard to spot a difference between 3 and 4 weeks of learning (and for those tempted to publish “days of learning,” under no circumstances in the real world can research make empirically justified distinctions between 15 and 16 days of learning). This study does not report standard errors for the estimates, but Mathematica does report the effect sizes under different models (i.e., sensitivity to model assumptions), and the variations easily surpass 0.01 standard deviations, or the equivalent of one week of learning. If you want to talk about weeks of learning for this study, we need to understand that, depending on the model used, the inferred effect on reading for the first cohort in the second year likely ranges somewhere between 0 and 4 weeks of learning. That interval may look odd, but it is a better representation of the research than the statement in the report.
As a consequence, reporters reading such findings can ask the authors two questions before writing stories:
- What are the effect sizes of potential alternatives, either in standard-deviation units or weeks/days of learning?
- What is the error of the estimate — or the confidence intervals, in weeks or days of learning?
Background: the idea of translating effect sizes into “weeks/days of learning” is an attempt to recontextualize abstract research. With the growth of systematic research reviews over the past 40 years, including meta-analyses, it has become more common not only to publish the raw estimates of effects in the context of an individual study but also to convert those estimates from the scale of an individual study into a more generalized unit, standard deviations. So we now float in a research environment of effect sizes: all well and good for comparing all sorts of things, à la John Hattie’s Visible Learning project comparing effect sizes of various education techniques and policies, but hard to explain to a general audience.
Thus, the attempt to translate effect sizes back into a concrete unit. In the case of this report, the Mathematica authors use a 2008 study that provides one estimate of generalized annual gain in learning for specific grades, and then convert the effect size they find, 0.03 standard deviations, as follows.1 It is not clear which benchmark they use, but it looks to be about 0.40 standard deviations per year (the grade 4-to-5 gain in the referenced study):
- 0.03 standard deviation units / 0.40 standard deviations per year =
- 0.075 years of learning * 36 weeks in a school year =
- 2.7 weeks of learning, rounded to 3.
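Under the report’s apparent assumptions (a 0.40 standard-deviation annual gain and a 36-week school year), the whole chain is a one-line conversion. A minimal sketch in Python, with the function name my own invention, not Mathematica’s:

```python
# Translate an effect size (in standard deviations) into "weeks of learning,"
# using the apparent benchmark of 0.40 SD per year (the grade 4-to-5 gain
# from Hill et al. 2008) and a 36-week school year.
ANNUAL_GAIN_SD = 0.40
WEEKS_PER_YEAR = 36

def effect_to_weeks(effect_sd: float) -> float:
    """Effect size in standard-deviation units -> 'weeks of learning'."""
    years_of_learning = effect_sd / ANNUAL_GAIN_SD
    return years_of_learning * WEEKS_PER_YEAR

print(round(effect_to_weeks(0.03), 1))  # reading effect: 2.7, rounded to "3 weeks"
print(round(effect_to_weeks(0.04), 1))  # math effect: 3.6 weeks
```

The arithmetic matches the chain above: 0.03 / 0.40 = 0.075 years, times 36 weeks is 2.7 weeks.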
Conversion between units is an interesting exercise; my very first lab in high school physics was a snail race, in which we had to estimate the speed of each of our snails in furlongs per fortnight.2 The conversion of school effect sizes into weeks or days of learning is a similar stretch in unit conversion. It is not nearly as ridiculous as Randall Munroe’s comic about abusing dimensional analysis, but it is best understood as a counter-intuitive chain of reasoning, one with some plausibility but no inherent logic.
I have two primary concerns with the “weeks/days of learning” translation. The first is that it misunderstands the needs of readers. If you are a teacher, principal, or policymaker, you might want a graspable figure for “how much” from the evidence, but the relevant decision is different: among my realistic choices in this domain, what is best? Translating a single effect size into weeks of learning is useless for that purpose, unless the report had also translated the effect sizes of competing options into weeks of learning.
The second concern is the implication of more accuracy than the evidence can bear. If you follow the logic of this specific translation, a week of learning represents about 0.011 standard deviations. To say that incentive pay is associated with 3 weeks of additional learning in reading, rather than 2 or 4 weeks, we need to trust that the research really can make distinctions down to one hundredth of a standard deviation. In few real-world education studies can one make that claim credibly, and as noted above, the sensitivity analyses reported in the appendices make clear that model assumptions alone create changes of more than 0.01 standard deviations — you would need to add standard errors in as well.
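To see how little resolution the label carries, here is a quick sketch of which effect sizes end up reported as the same whole number of weeks under that benchmark; the sample effect sizes below are illustrative, not estimates from the report:

```python
# Which effect sizes (in SD units) get reported as the same whole number of
# "weeks of learning"? Assumes 0.40 SD per year and a 36-week school year.
ANNUAL_GAIN_SD = 0.40
WEEKS_PER_YEAR = 36

def weeks(effect_sd: float) -> float:
    return effect_sd * WEEKS_PER_YEAR / ANNUAL_GAIN_SD

# Illustrative effect sizes spanning roughly one "week" (~0.011 SD) of variation:
for es in (0.028, 0.030, 0.033, 0.038, 0.040):
    print(f"{es:.3f} SD -> {weeks(es):.2f} weeks -> reported as {round(weeks(es))} weeks")
```

Everything from roughly 0.028 to 0.038 standard deviations, a spread of a full 0.01, comes out as “3 weeks of learning.”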
Fortunately, these flaws are easily remedied if either study authors or reporters specify the relevant answers: 3 weeks of learning in comparison with (in this case unspecified) realistic alternatives, or a range of 0 to 4 weeks of learning (from the sensitivity analyses).
That ends my serious criticism of days and weeks of learning. And now we get to have some fun. Because I have two non-serious criticisms of the concept of weeks or days of learning.
One is that we don’t know what kind of school week this study is describing. Is incentive pay associated with three weeks where everyone is focused, and things are happening on all cylinders? Or are we talking about the weeks right before Christmas break, when no one is paying attention? Are these weeks with lots of subs? Or three weeks with a series of schoolwide assemblies where teachers never have time to get into a topic? Maybe — and I hate to bring this up, but you know it’s a possibility — these are three weeks when everyone is sick in turn, and on too many days, Johnnie came in despite being sick because his mother didn’t have any sick days at work, and he upchucked right beside Betsy’s desk at 9:30, the school splits the janitor with another school and didn’t have one that day, and it took about 60 minutes to find someone who could spare the time to find the key to the closet with mops, clean the mess up, find a fan to blow out the air into the hallway (and the fourth graders passing by the room on their way to the playground instantly made a HUGE complaint about what they smelled), and then find the class and let Ms. Deronde know she could bring the kids inside. With luck, no one else threw up in the meantime. I don’t know about you, but if this study is saying that performance pay is associated with three of THOSE weeks, we don’t want any part of that.
My other complaint — yes, there’s another one — is that we don’t know that weeks of learning is the right measure. Should we stop the unit conversion with weeks of learning? Let’s see what else we could do.3 A week is a unit of time, and we know that light travels at 186,282 miles per second, so a week is equivalent to about 112.7 billion miles of learning. That’s not very useful to children, but I know that the average home run in the big leagues is 397 feet, so a week of learning is also equivalent to about 1.5 trillion home runs of learning. Not bad! Let’s imagine that we’re asking what performance pay means in New York City — I mean, if an intervention or policy can make it in New York, it can make it anywhere, right? Everyone in New York loves Manny Rivera, and he allowed only 71 home runs in his career. So a week of learning is equivalent to about 21 billion Manny Riveras of learning. Or, for this study, 57 billion Manny Riveras.
Doesn’t every child deserve 57 billion Manny Riveras of learning? If you don’t support performance pay, you are denying every child her or his right to 57 billion Manny Riveras of extra learning in reading.4
But maybe we should get out of the Big Apple. Manny Rivera’s official stats give a weight of 195 pounds, so a week is also equivalent to about 4.1 trillion pounds. But pounds of what?5
Bull. Pounds of bull. You can take Sherman out of the University of South Florida, but I had to see what the equivalent of 21 billion Manny Riveras in bull would be. At about a ton a bull, it’s 2.1 billion bulls. Every week of learning is the equivalent of just over two billion bulls.
And I know what you’re thinking: what are these bulls doing?6 They’re running in Pamplona! It turns out that while 12 animals participate on each of the eight days of the annual Pamplona Encierros, half of them are steers. With 2.1 billion bulls and 48 bulls per Running of the Bulls in the San Fermin Festival, each week is worth approximately 43 million San Fermin Festivals with their bull runs. Or, for the effect size in reading, that’s about 116 million San Fermin Festivals.
And why would you ever deny young children a chance to observe the Pamplona Running of the Bulls 116 million times? It’s criminal even to think about! Oh, I know what some of you are thinking: Sherman, many young children would never want to see the Running of the Bulls once, let alone 116 million times. People are gored! Some have died!
Nonsense. The Running of the Bulls is just like YouTube cat videos, if the cats could chase, maim, and kill people.
Okay, I lied. I am not ending with the ridiculous. Because Steve Sawchuk wrote a story on the study and quoted my “I am not impressed” tweet (to put it precisely, “Meh” — and Morgan Polikoff’s response suggesting a Likert-type scale for effect sizes), one final thought:
Whether you think performance pay is a good or a bad idea, this is hard work both in politics and in practice. This is the type of hard work that is very common with complicated policy ideas. The reason it is important to compare effect sizes against alternatives in the same domain is that it is important to consider the return on the political investment in policies. If you have the choice among several policies, all of which are hard to accomplish, you may want to consider where to invest time and political capital. Or, if there are a number of smaller-scale alternatives that are easier to pull off (simultaneously), each of which has at least the same magnitude of effect as the Almost Impossible Lift, it would make a great deal of sense to push for the several Difficult Lifts with Demonstrable Effects over the One Almost Impossible Lift with Minimal Effects.
- Hill, C. J., Bloom, H. S., Black, A. R., & Lipsey, M. W. (2008). Empirical benchmarks for interpreting effect sizes in research. Child Development Perspectives, 2(3), 172-177. [↩]
- It’s about 10^-4 furlongs per fortnight; snails are not fast. [↩]
- For convenience, I have put all of the calculations and sources used in a public Google Sheet. [↩]
- Pedants may object on grammatical grounds. Perhaps it should be 57 billion Mannys Rivera of learning. [↩]
- An alternative is to translate each Manny Rivera into his $90 million net worth. A week of learning is thus worth about 1.9 quintillion dollars, or about 5.1 quintillion dollars per child for the estimated effect of performance pay in reading. Puts Raj Chetty’s calculations to shame, if you ask me. [↩]
- Oh, I understand what those of you with one-track minds are thinking. Okay, for those of you who are wondering, an adult bovine produces approximately 65 pounds of manure per day. You can do the calculations from there. [↩]