Validity’s likely irrelevance in personnel processes

My colleague Audrey Amrein-Beardsley is starting a series of blog entries glossing the March special issue of Educational Researcher devoted to value-added or growth measures, specifically their technical qualities vis-à-vis teacher evaluations. In her first entry on the issue today, she argues the following:

[Quoting Doug Harris and Carolyn Herrington:] “The issue is not whether value-added measures are valid but whether they can be used in a way that improves teaching and learning.” [Amrein-Beardsley:] I would strongly argue that validity is a pre-condition to use, as we do not want educators using invalid data to even attempt to improve teaching and learning. I’m actually surprised this statement was published, as so scientifically and pragmatically off-based.

In psychological and educational research, the term validity refers to a basket of substantive warrants for the relevance of research: is a construct or measure predictive of something important (predictive validity), is it highly associated with another important and independently verified construct or measure (concurrent validity), does its use have defensible merits for particular purposes (consequential validity), and so on. While researchers sometimes fetishize reliability and validity as terms of art, it is important to understand that the terms simply refer to procedural and substantive warrants for the concept or tool in question.
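To make that concrete: in practice, a predictive or concurrent validity claim often boils down to a correlation between a measure and something independently verified. Here is a minimal sketch using simulated data and hypothetical variable names (nothing below comes from any actual evaluation system):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: an evaluation score for 200 teachers and an
# independently verified outcome (say, a later peer-review rating).
evaluation_score = rng.normal(size=200)
peer_review = 0.4 * evaluation_score + rng.normal(size=200)

# A predictive/concurrent validity coefficient is, at bottom, the
# correlation between the measure and the external criterion.
validity_coefficient = np.corrcoef(evaluation_score, peer_review)[0, 1]
print(f"validity coefficient: {validity_coefficient:.2f}")
```

The number itself settles nothing; the warrant lies in whether the criterion is itself important and independently verified.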

Here is the broader context of the quotation (with the quoted sentence emphasized and Harris and Herrington’s references deleted):

Since [Race to the Top’s] inception in 2009, our academic journals have been filled with articles about the validity and reliability of value-added measures. Books and reports have been written that describe the pros and cons of value added. Others have expressed skepticism, whereas prominent foundations and some think tank reports have been more positive. The various sides look at the same evidence and see a different picture partly because the evidence we have is so far removed from the decisions that have to be made. *The issue is not whether value-added measures are valid but whether they can be used in a way that improves teaching and learning.* How do educators actually respond to policies that use value-added measures? On this question, we know very little. Therefore, in the debate about these policies, perspective has taken over where the evidence trail ends.

In essence, Harris and Herrington are saying that value-added measures need no other clearly accepted substantive warrant for use if they are useful for helping teachers… which usually falls under the term consequential validity. Amrein-Beardsley is saying that an assertion of consequential validity cannot stand by itself without evidence of some other form of validity. She does not flesh out that argument, but let’s try substituting something ridiculous or irrelevant for the term value-added measures to see how the reasoning plays out in other areas where “we know little” is a reasonable conclusion:

The issue is not whether a classroom’s average cholesterol score is valid but whether it can be used in a way that improves teaching and learning. How do educators actually respond to policies that use a classroom’s average cholesterol score? On this question, we know very little.

The issue is not whether a Rorschach test is valid for teachers but whether it can be used in a way that improves teaching and learning. How do educators actually respond to policies that use Rorschach tests? On this question, we know very little.

My colleague wins on the logic front: it is reasonable to ask if there is any there there before using a weak evaluative mechanism as a projective test for professional changes.

On the other hand, there is neither a legal nor a practical reason to set the bar high for some use of weak evaluative mechanisms, as long as they do not determine outcomes by themselves. I have absolutely no reason to think that value-added residuals tied to teachers will give better practical guidance for professional change than reasonably constructed classroom observations. Classroom observation tools are also flawed, sometimes very flawed. As I have written before, the fundamental issue is that schools and school systems cannot wait for the perfect evaluation system, and given my cynical nature I think most of them fall so far below Lee Shulman’s idea of a “marriage of insufficiencies” that they are best characterized as marriages of incompetencies.
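For readers who have not seen one, “value-added residuals” are just that: residuals. A minimal sketch under toy assumptions (simulated scores, a single prior-score control, ordinary least squares; real models add many controls and some form of shrinkage):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: 1,000 students with a prior-year score, a
# current-year score, and an assigned teacher (20 teachers).
n_students, n_teachers = 1000, 20
teacher = rng.integers(0, n_teachers, size=n_students)
prior = rng.normal(size=n_students)
current = 0.7 * prior + rng.normal(scale=0.8, size=n_students)

# Predict current scores from prior scores (a simple OLS fit)...
slope, intercept = np.polyfit(prior, current, 1)
residual = current - (intercept + slope * prior)

# ...and attribute each teacher's mean residual to that teacher.
value_added = {t: residual[teacher == t].mean() for t in range(n_teachers)}
```

Everything contentious (which controls to include, how to handle noise from small classes, whether the residual measures teaching at all) happens outside this sketch.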

Without getting into the details of designs, any good system of evaluation should have the following political and human attributes:

  • A process to check an evaluator’s gut instinct/first impression, and one that holds face validity for the bulk of teachers (or any group under that process). Here, face validity simply means that the bulk of teachers think the process for checking first impressions holds water; this is a political rather than a technical use of the word.
  • A requirement for the evaluator to use professional judgment rather than rely on strong inferences from weak data. Any version of “The numbers made me do it” should be forbidden.
  • An incentive and support for evaluators to make tough calls where they are needed.
  • A systematic check of the entire process and overall outcomes, to address the potential for bias and caprice.

These are desirable qualities, not legal requirements. The legal requirements for evaluations have nothing to do with wisdom or technical qualities, and even in collective bargaining contexts, the decisions reached in evaluation processes are generally considered management rights. What most unions can bargain is the process, and the bar courts set for legally proper evaluations is fairly low.

In Florida, a teacher of the year in an Alachua County elementary school was rated unsatisfactory in 2012 based on the test scores of children she had never taught, despite high ratings in other areas of the evaluation system. Yet the federal courts ruled that this method of evaluation, arbitrary and capricious as it was for this teacher (and many others), met the very low constitutional bar for government policy because the overall goal of the legislation was plausibly served by the policy. There is no legal requirement for policies to be sensible when applied, and thus the issue of validity is largely irrelevant in the practices of school systems. Sadly, the debates over the reliability and validity of value-added measures depend for their meaning on the conscience and good sense of policymakers. Judges are not easily persuaded that a rational basis test for government action requires a rational test of evidence and logic.