Assessment has a learn and hungry look

I was in sixth grade when I first argued with a teacher about a math test. It was a multiple-choice test, made by the teacher and returned to us with her corrections and our grades after she compared our work to her answer key. During a few minutes reserved for seatwork, I walked to the front of the room. I was explaining why my choice on a specific question was reasonable when she replied, “Well, it may be a possible answer, but it’s not the best answer.” Like many students, I fumed a bit as I returned to my seat, indignant at the capricious nature of school authority. But I also realized something important: the answer key was simply the set of answers my teacher had chosen when creating the test. The test was partly a measure of how well I could guess the teacher’s mind at the point of the test’s creation.

Most adults have years of experience with school-based assessment, and the frustration I experienced that day is common. Our experience of assessment is institutional (each assessment measures agreement of some sort with the test-creators) at the same time that we are told assessment is about learning (how well I knew the math taught in sixth grade in the late 1970s). Critics of assessment practices often urge that assessment be stripped of everything but what is essential to learning. I certainly had my own version of that: “My answer was sensible to me, and not wrong!” was my thought that day.

But desiring assessment to be separated from its context is an unrealistic expectation, and I knew it after sitting back down at my desk. Tests and other assignments are always created by someone, and the judgment of relative success or failure is also explicitly a human creation, in the context of a particular year, school, and society.  

Historical and sociological perspectives on assessment

Various forms of assessment have long histories rooted in institutional contexts. The new book Off the Mark by Jack Schneider and Ethan Hutt is a useful, historically informed short critique of grading, testing, and the recording and use of academic transcripts. Its argument is sadder but wiser: many of the practices that may irk you and me are now rooted as deeply in the politics of education as in its institutionalized practices, and entrenched in practice around the world. I recommend it to anyone who has engaged in a discussion about grading or testing, or has had questions about the inherent contradiction between assessment as a way to measure learning and assessment as used for organizational purposes.

Looking at assessment from an historical perspective teaches us about the choices that have been available — the United States was not destined to have things called the SAT or ACT that became required parts of applications for many four-year colleges. As Bill Reese has documented, the origins of written standardized tests in the U.S. go back at least to early 19th century Boston, and those tests arose in the context of arguments about whether Massachusetts schools were effective. An historical perspective such as Schneider and Hutt’s also teaches us about the political and cultural side of assessment: how IQ tests in the early 20th century fed the racist eugenics movement and were used to justify immigration restrictions, and the way that our modern experience with assessment frames testing and grades as part of a “real school.” That cultural legacy of testing also guides the way that Americans think of testing as a concrete judgment about test-takers, making it hard to use testing in other ways (Dorn, 2010). There is also a mass-culture dimension: a centuries-long, global body of literature and now films that centers on testing (Dorn, 2014).

One can be cynical about the role of assessment, like postwar sociologist Robert Dreeben. In his 1968 book On What Is Learned in School, Dreeben argued that one of the roles of school was to strip away the comforting psychological cocoon of children’s families and teach them that they were going to be judged by society based on fragments of their whole selves — and assessment is a critical part of that humiliating experience. Like many sociologists of his ilk, Dreeben was capturing what he saw in a static sense: he didn’t try to explain how assessment came to acquire this dehumanizing role, let alone modern schooling as a whole.1 But as long as we remember that static interpretations don’t capture the dynamic background story, they can be useful for generating informative perspectives. Our modern experience of being judged by snapshot assessments is one reason people think of assessment primarily as a judgment rather than as a guide to practice.

This search for institutional roles can be useful in thinking about assessment. When I wrote Accountability Frankenstein (2007), I sought to explain something about the key characteristics of standardized tests as used in accountability systems, and to do so without ignoring the history, the technical expertise behind the construction of tests, or the way technical language can insulate tests from criticism. My finesse was to translate the technical construction of tests into an account of how those techniques functioned in their institutional context. Standardized tests follow some professional rules in a way that teacher-made assignments don’t have to in their classroom use. Some of these rules are aligned with research use: reliability and validity. Some are shared by the world of surveys as well, but within a testing context: item/test construction and scaling. Others are standards for administration and scoring. Others are responses to critiques of test bias and unfair use in the 1960s and 1970s. Those are professional standards that require a certain minimum level of knowledge, experience, and conscientiousness that organizations involved in testing must be able to demonstrate.2

But as used in schools, concepts such as reliability and validity don’t appear as themselves. Reliability appears instead embedded in the broader notion of consistency: tests need to behave consistently. That consistency means they need to be reliable, but they also need to be consistent with organizational and political expectations of what tests look like. Validity appears within the role that standardized tests play in comparing students, schools, and systems. Item and test construction, along with scoring and scaling, appear within two complementary traits of tests: they are composed of multiple skills and subdomains of content, and they are circumscribed, sampling from and thus limiting their coverage of that content. Describing these 4 C’s of testing (consistency, comparison, composition, and circumscription) was my way of inviting readers to be skeptical (or skeptical enough) about test scores: they behave in certain ways when administered for accountability, and those behaviors are not exactly about professional standards.

Cautious and expansive languages of assessment

This isn’t the only way in which standards and use differ: in many cases, assessment researchers who would recognize the caveats and cautions in standards are far looser when they advocate for practices and policies that maintain or expand the use of standardized testing (e.g., Hambrick & Chabris, 2014; Mayer, 2014). This gap between lay and expert conversations isn’t universal, by any means: Daniel Koretz (2009, 2017) is an example of an assessment expert who wrote carefully to a lay audience about the limits of testing in the NCLB era. But the common slippage between research exactitude and public advocacy deserves some scrutiny. Where does this gap come from?3 

To explain this gap, we can adapt some of the language of cultural-historical activity theory, or CHAT, which can give us a way of talking about tools and concepts that stretch across institutions. Despite its name, CHAT comes out of psychology rather than out of anthropology, sociology, or history. Rooted in the work of early 20th century Soviet psychologist L.S. Vygotsky and his intellectual progeny, the CHAT framework focuses on the interactive nature of knowledge. Much of its modern use focuses on knowledge and learning (more about that and assessment later), but there is another use of the framework that seeks to explain the way organizations create concepts and tools that they find useful, and the way that different organizations and contexts can develop shared concepts that serve their respective needs. This version of CHAT is associated with Yrjö Engeström (2015), who developed a diagrammatic way of discussing the development of norms, practices, and concepts within and between organizations. What is useful here is less the diagram than the idea: concepts and practices can arise and develop a logic out of the relationship between organizations, at the boundaries between them. These shared concepts are boundary objects.4

This notion of a boundary object puts a little shape onto a well-known aspect of education history: long-lived ideas are flexible enough to be useful to a broad coalition. The classic example is the history of the kindergarten, whose advocates talked about a range of its alleged benefits, from saving poor children and their parents to preparing children for first grade and reforming the regular school curriculum (Cuban, 1992). The shared definition of a boundary object does not mean that all of the involved organizations see the object in the same way, and their internal sets of norms and practices can be quite distinct. The result is that the discourse within a smaller, more bounded community will defer to its developed (and developing) norms, while the shared discourse around an object will respond to coalition dynamics. Even more broadly, a technical tool can serve the development of multiple boundary objects. If used in enough contexts, the technical tool becomes part of the experience of a population, part of modern life.

Hungry assessment

In a single context, a tool operates to address a single boundary object, but the same organization or network that supports and mediates a specific tool can have relationships with multiple other organizations and thus make the tool seem useful in different contexts. In this way, a technical tool can acquire uses and generally recognized value through its attachment to multiple boundary objects. Over time, the tool accumulates more and more uses and absorbs more attention and credibility. A tool can thus be hungry for uses. Assessment is such a tool, and it has long had a learn and hungry look.5

The development of national standardized exams for professional licensure illustrates this dynamic in the U.S. In nursing, licensure exams had begun before World War II, originally as state-specific standardized tests (Benefiel, 2011). But in the 1940s, nursing licensure exams moved to national exams with a common set of test specifications and thus common content coverage. Nursing was but the first of many occupations with a national standardized test as part of licensure requirements. These licensure exams addressed multiple needs of both individual states and professional groups, from nurses to electricians. In part, 20th century licensure exams served to certify the expertise of members of an occupation, with that recognition of expertise as one of the key characteristics of 20th century professionalization (e.g., Freidson, 1984). But that certification role was already played by state-level exams, which nursing had well before the nationalization of the exam now known as the NCLEX. National exams were more efficient, their advocates claimed, as states (and state-level nursing organizations) did not have to create their own sets of tests with dozens if not hundreds of items. But further, a national exam created an implicit national curriculum for nursing programs: any program that failed to prepare its students for the national exam eventually risked shaming with the public reporting of passing rates.

I saw this dynamic firsthand when I served on the faculty senate graduate council at the University of South Florida. At one meeting, the entire dean’s staff from the College of Nursing came to the graduate council and begged it to approve a wholesale change in the master’s-level nursing programs, requesting that we waive our usual rules about having several weeks to review proposals. We were presented with more than a dozen syllabi, and had scant time to read and ask questions. Why this urgency? The American Nurses Association had announced its test guidelines for the new version of the NCLEX shortly before that meeting, and the new specifications would be in effect by the time the next semester’s beginning nursing students sat for the exam. For the next class of nursing students to have a chance at licensure, the program needed to revolve around the new NCLEX, and several nursing faculty had been working almost literally around the clock to prepare syllabi for our approval.6

But the proliferation of licensure exams led to an additional phenomenon: the proliferation of test preparation guides. When I lived in Tampa in the late 1990s and 2000s, there were several chain bookstores within a few miles of my house. Each had a reference section that occupied more than 10% of the total store shelving, and a clear majority of that section consisted of test preparation guides. Some focused on undergraduate admissions tests, preparing high school students to take the SAT and ACT, or even the PSAT in their junior year. However, these stores marketed the bulk of their test study guides to those preparing for professional licensure exams.

The market for licensure exam study guides has been part of American life for more than half a century. In one sense it marks the extent to which assessment has gobbled up the time of Americans at all ages, not just in actual test-taking but in planning for and thinking about the process well beyond high school. Assessment is a tool that has been more than hungry: it has eaten. 

If we cannot entirely unwind the development of assessment machinery and its relationships with the politics of education, can we realistically address our concerns about it?7

Yes, and

Off the Mark provides a survey of the alternatives various educators have proposed as improvements to grading, testing, and making permanent records of student work. The language of these proposed alternatives is important: authentic assessment, pass/fail and contract grading, narrative evaluation, micro-credentialing, competency-based credits, portfolio graduation. Together, the alternative proposals suggest that standard practice is inauthentic, judgmental and capricious, narrow, irrelevant to real-world application, and prone to fragmenting the type of thoughtful life we would like students to lead. Schneider and Hutt are careful to point out not only the limits of these experiments, but the way that they often fail to address the roles that grades, tests, and transcripts play in education, and specifically the relationships among different institutions and interests. The current configuration of practices was not inevitable, but we need to pay attention to the role that assessments play in addressing felt needs, in serving as tools to address boundary objects that we call by a more ordinary name: communicating something about schooling to people outside school walls, for all sorts of reasons.8

For educators, students, families, and others who have some concern about assessment practices, it is important to address technical concerns and philosophical perspectives on assessment, and also move beyond them. Few alternative suggestions will be robust without considering the aggregation of practices and experiences of assessment, what Schneider and Hutt argue is now part of our common definition of real schools. Here, we can borrow an argument from Harry Brighouse and colleagues in their 2018 book Educational Goods: whatever long-term and institutional interests are served by various assessment practices, we should insist that each assessment practice also serve the short-term needs of students.9 To borrow a term from improvisation, we need to “yes, and” assessment practices currently driven by needs and interests far from the everyday concerns of students.

Here, we can look to sociocultural and sociocognitive arguments about assessment as healthy inspiration. Pryor and Crossouard (2008) look at formative assessment through a CHAT lens; to them, it is important to understand that there can be both “convergent assessment” structured to provide definitive answers to questions (usually asked by educators) and “divergent assessment” that is more open-ended.10 Dann (2014) argues that from a CHAT perspective, formative assessment or assessment for learning is missing the mark if it omits student perspectives; from that, she makes the case for constructing assessment as learning (i.e., an inherent structure that provides feedback to students in a way that makes sense in a classroom context). For these authors, critical questions about assessment revolve not only around the purpose of assessment (formative vs. summative) but also around the relationship between the tool of assessment and the individuals in a class setting, with all of its social contexts. Zooming out to a policy perspective, Shepard, Penuel, and Davidson (2017) argue for focused development of assessments that are sociocultural in orientation. Harris et al. (2022) review the relatively new literature on designing assessments around learning progressions.

These perspectives do not exactly provide a prescriptive program, but they provide questions that we can ask as a supplement to the “yes, and” frame I am proposing: Whose relationships does the assessment affect? What model of learning does the form of assessment assume is possible and worth capturing? What are the practical tradeoffs when we craft new forms of assessment, or adapt older techniques, to align assessment with explicit models of learning? As I argue above, these questions need to be tempered with an understanding of the institutional roles of current assessment practices and with the construction of assessments that are feasible and address the institutional motivations behind current practices. I turn to them as ways of thinking about assessment that are constructive and worthy of our time.

References

Artiles, A. J., Dorn, S., & Bal, A. (2016). Objects of protection, enduring nodes of difference: Disability intersections with “other” differences, 1916–2016. Review of Research in Education, 40, 777-820. https://doi.org/10.3102/0091732X16680606.  

Benefiel, D. (2011). The story of nurse licensure. Nurse Educator, 36(1), 16-20. https://doi.org/10.1097/NNE.0b013e3182001e82

Brighouse, H., Ladd, H. F., Loeb, S., & Swift, A. (2018). Educational goods: Values, evidence, and decision-making. University of Chicago Press.

Cuban, L. (1992). Why some reforms last: The case of the kindergarten. American Journal of Education, 100(2), 166-194.

Dann, R. (2014). Assessment as learning: blurring the boundaries of assessment and learning for theory, policy and practice. Assessment in Education: Principles, Policy & Practice, 21(2), 149-166. https://doi.org/10.1080/0969594X.2014.898128

Dorn, S. (2007). Accountability Frankenstein: Understanding and taming the monster. Information Age Publishing. 

Dorn, S. (2010). The political dilemmas of formative assessment. Exceptional Children, 76(3), 325-337.

Dorn, S. (2014). Testing like William the Conquerer: Cultural and instrumental uses of examinations. Education Policy Analysis Archives, 22(119). http://dx.doi.org/10.14507/epaa.v22.1684

Dreeben, R. (1968). On what is learned in school. Addison-Wesley Publishing Company. 

Engeström, Y. (2015). Learning by expanding. Cambridge University Press.

Freidson, E. (1984). Are professions necessary? In T. L. Haskell (Ed.), The authority of experts (pp. 3-27). Indiana University Press.

Hambrick, D. Z., & Chabris, C. (2014, April 14). Yes, IQ really matters: Critics of the SAT and other standardized testing are disregarding the data. Slate.  https://slate.com/technology/2014/04/what-do-sat-and-iq-tests-measure-general-intelligence-predicts-school-and-life-success.html

Harris, L. R., Adie, L., & Wyatt-Smith, C. (2022). Learning progression–based assessments: A systematic review of student and teacher uses. Review of Educational Research, 92(6), 996-1040. https://doi.org/10.3102/00346543221081552

Koretz, D. M. (2009). Measuring up: What educational testing really tells us. Harvard University Press.

Koretz, D. M. (2017). The testing charade: Pretending to make schools better. University of Chicago Press.

Mayer, J. D. (2014, March 10). We need more tests, not fewer. New York Times. https://www.nytimes.com/2014/03/11/opinion/we-need-more-tests-not-fewer.html?_r=1

Pryor, J., & Crossouard, B. (2008). A socio‐cultural theorisation of formative assessment. Oxford Review of Education, 34(1), 1-20. https://doi.org/10.1080/03054980701476386

Reese, W. J. (2013). Testing wars in the public schools: A forgotten history. Harvard University Press. 

Schneider, J., & Hutt, E. (2023). Off the mark: How grades, ratings, and rankings undermine learning. Harvard University Press.  

Shepard, L. A., Penuel, W. R., & Davidson, K. L. (2017). Design principles for new systems of assessment. Phi Delta Kappan, 98(6), 47-52. 

Notes

  1. As David Karen taught me almost 40 years ago, the failure to explain change is a key trait of most structural-functionalism. There are plenty of sociologists who incorporate history into their scholarship, so don’t judge them by Dreeben’s flaws!
  2. My focus here on organizations is related to the institutional context of standardized testing, and also to the discussion of cultural-historical activity theory (CHAT) below.
  3. One consequence of this shifting discourse is that researchers who use expansive language about the public uses of assessment can sound like positivists, for whom assessment measures something that exists in the real world, while cautious writings for a professional audience can read as postpositivist, treating assessment as a practical measure of something that is inherently unstable and slippery. One could also see this slippage as a form of motte-and-bailey argument, in which an expansive public claim can retreat to a more restrictive, defensible position when challenged.
  4. In 2016, Alfredo Artiles, Aydin Bal, and I discussed the Progressive-Era creation of special education as a boundary object concurrent with the modern definition of disabilities. In that article, we discussed the role of psychology in a triangle of expertise: a bundle of tools, object, and network. That has a relationship with assessment’s history, but is beyond the scope of this blog entry.
  5. My apologies to fans of Julius Caesar for the awful pun. I had originally used the line from the second scene, but a typo changed “lean” to “learn,” and I’ve kept it.
  6. I remember that the involved nursing faculty did a good job; we approved the changes unanimously, with I think one or two questions.
  7. In the last chapter of Accountability Frankenstein, I listed 15 specific ways to address my critiques of accountability mechanisms, and I think they mostly hold up. This blog entry addresses the practical side of criticism from a different angle.
  8. Schneider and Hutt call these roles short-haul communication, long-haul communication, and synchronization.
  9. Brighouse et al. define a distinction between general “educational goods” and “childhood goods,” or the value of a good childhood. You may recognize this as an application of the third formulation of Kant’s categorical imperative: one cannot see students just as means to an end, but must see their needs as ends in and of themselves.
  10. They posit a more abstract and general understanding of formative assessment as well.