The Institute of Education Sciences has released a new two-year Evaluation of the Teacher Incentive Fund (TIF), generally solid research by Mathematica Policy Research, at least on a quick first read today. The main findings:
- Most of the experimental part of TIF was implemented by the schools.
- Some parts of the program were more difficult to implement (e.g., higher performance pay for a more limited group of educators) or more difficult to maintain.
- Part of the logic model was hard to confirm, especially whether educators understood their opportunities to earn higher pay.
- The bottom-line effects on student performance were weak: 0.04 standard deviations in math, 0.03 in reading. If you obsess about p values, only the association with reading was statistically significant.
I say that this is generally solid research… until you get to the part of the document where the main effect size for reading is translated into a statement that teacher and principal performance pay is associated with three additional “weeks of learning” in reading. Mathematica is making a well-intended attempt to translate the abstract concept of effect size into something a general audience can understand, and this kind of translation has become increasingly common in the last few years.
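For concreteness, here is roughly how that translation works. A minimal sketch in Python, assuming the one-week-per-0.01-standard-deviations equivalence discussed below; the conversion factor here is my assumption for illustration, not a figure quoted from the report, and the real benchmark varies by grade level and subject:

```python
# Hypothetical sketch of the "weeks of learning" translation.
# Assumes roughly 0.01 standard deviations of achievement growth per week;
# in practice the benchmark varies by grade level and subject.

SD_PER_WEEK = 0.01  # assumed conversion factor

def effect_to_weeks(effect_size_sd, sd_per_week=SD_PER_WEEK):
    """Convert an effect size in SD units into 'weeks of learning'."""
    return effect_size_sd / sd_per_week

print(effect_to_weeks(0.03))  # reading effect: 3.0 "weeks of learning"
print(effect_to_weeks(0.04))  # math effect: 4.0 "weeks of learning"
```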
This translation is also bad interpretation. That doesn’t mean documents should never attempt to translate effect sizes for a general audience, but there are problems with presenting terms like “three weeks of learning” as naked representations. To cut to the chase:
- “Weeks (or days) of learning” avoids the most important part of recontextualizing effect sizes: comparing the effect size in an individual study with effect sizes from empirical research in the same domain — i.e., if you are translating your research findings for use in the real world, how does this intervention or policy compare with other interventions or policies that are realistic alternatives?
- “Weeks (or days) of learning” implies more accuracy than is realistic; it is hard to spot a difference between 3 and 4 weeks of learning (and for those tempted to publish “days of learning,” under no circumstance in the real world can research make an empirically justified distinction between 15 and 16 days of learning). This study does not report standard errors for the estimates, but Mathematica does report the effect sizes under different models (i.e., sensitivity to model assumptions), and the variations easily surpass 0.01 standard deviations, or the equivalent of one week of learning. If you want to talk about weeks of learning for this study, you need to understand that, depending on the model used, the inferred effect on reading for the first cohort in the second year likely ranges somewhere between 0 and 4 weeks of learning (see the sketch after this list). That interval may look odd, but it is a better representation of the research than the statement in the report.
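To see how model sensitivity undermines a single-number headline, here is a minimal sketch with made-up alternative-model estimates; the report’s actual sensitivity figures would go in their place:

```python
# Illustrative only: hypothetical effect-size estimates for the same reading
# outcome under different model specifications. The report's sensitivity
# analyses show variation of this general magnitude (more than 0.01 SD).
model_estimates_sd = [0.00, 0.02, 0.03, 0.04]  # made-up values

weeks = [est / 0.01 for est in model_estimates_sd]  # 0.01 SD ~ 1 week (assumed)
print(f"Implied effect: {min(weeks):.0f} to {max(weeks):.0f} weeks of learning")
# -> Implied effect: 0 to 4 weeks of learning
```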
As a consequence, reporters reading such findings can ask the authors two questions before writing stories:
- What are the effect sizes of potential alternatives, either in standard-deviation units or weeks/days of learning?
- What is the error of the estimate, or the confidence interval, in weeks or days of learning?
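The second question has a mechanical answer once a standard error is in hand. A sketch, again assuming the 0.01-SD-per-week conversion and a hypothetical standard error (the report does not publish one):

```python
# Hypothetical: turn a point estimate and standard error into a 95%
# confidence interval expressed in "weeks of learning".
SD_PER_WEEK = 0.01   # assumed conversion factor
effect_sd = 0.03     # reading effect size from the report
se_sd = 0.015        # hypothetical standard error, not a reported figure

low = (effect_sd - 1.96 * se_sd) / SD_PER_WEEK
high = (effect_sd + 1.96 * se_sd) / SD_PER_WEEK
print(f"95% CI: {low:.1f} to {high:.1f} weeks of learning")
# -> 95% CI: 0.1 to 5.9 weeks of learning
```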