High Stakes
SCORING MULTIPLE-CHOICE EXAMS IS A VISION OF LATE-20TH CENTURY EFFICIENCY:
A quietly humming machine electronically reads the penciled-in bubbles and spits out answer sheets, neatly stacked, at a rate of 10,000 an hour. In contrast, scoring portfolios or essays is a messy, imprecise affair: Raters by the roomful sweat over raggedy piles of handwritten papers, looking for evidence of complex qualities such as "voice" or "purpose."
Ours is an age when personal computers and golden arches are the defining artifacts of culture. Yet many states across the nation are forsaking speed and economy in large-scale testing. Fast and cheap have their place, even in education. But policymakers and educators who want more than just a ranking and sorting of students (the information that standardized multiple-choice tests were designed to deliver) are turning to more cumbersome, labor-intensive strategies.
Requiring students to perform complex or open-ended tasks can tap rich veins of insight about what kids can really do. For example, can they use their powers of reasoning to defend a political viewpoint? Can they employ scientific method to solve an environmental problem? Can they apply mathematical principles to resolve an engineering dilemma? Can they incorporate writing traits to create a story that dances across the page?
Standardized tests are still the bread and butter of statewide assessment. But in all regions—north and south, liberal and conservative—alternative assessments have been added to the menu, if not as the main course, then at least as a side dish. While nearly 85 percent of states use traditional multiple-choice exams for statewide assessments, most of them combine multiple-choice with writing tasks, portfolios, or some other form of performance, according to a 1995 report from the Council of Chief State School Officers and the North Central Regional Educational Laboratory. In The Status of State Student Assessment Programs in the United States, Linda Ann Bond and Edward Roeber report that in 1993-94, 38 states assessed writing, 25 states used other performance assessments, and seven states required portfolios. Two states—Kentucky and Maine—had abandoned multiple-choice altogether in favor of alternative assessment strategies.
In the Northwest, the same study reports, Montana and Washington were among only seven states nationwide that relied solely on standardized multiple-choice (see Voyage of Discovery for an update on Washington's comprehensive statewide standards and assessment initiative). Alaska, Idaho, and Oregon were supplementing standardized tests with statewide writing assessments.
But signs of backlash are everywhere. Recent headlines in Education Week and Education Daily declare:
- EDUCATORS WEIGH HIGH PRICE OF ASSESSMENT
- TEACHERS FOUND SKEPTICAL ABOUT REVAMPED TESTS
- STATE TEST QUESTIONS FOCUS OF RENEWED SCRUTINY
- STATES STRUGGLING TO ALIGN STANDARDS, ASSESSMENTS
- ASSESSMENT REFORM AT A CROSSROADS: A RETREAT FROM PERFORMANCE-BASED PRACTICE MAY SIGNAL THE RETURN TO FAILED FORMS OF TESTING
"Caution is increasingly the watchword, testing experts say," Millicent Lawton wrote on the front page of Education Week in January. She went on to note that California and Arizona "recently dumped troubled statewide programs."
More than half the states that are designing alternatives to multiple-choice are encountering "major difficulties," according to Bond and Roeber. States' experiments with alternative assessment have faltered in three key areas: cost, purpose, and scoring.
PERFORMANCE-BASED ASSESSMENTS ARE EXPENSIVE. Researcher D.H. Monk estimated in 1993 that statewide performance assessments would cost between $35 and $70 per student. Standardized multiple-choice tests, in contrast, can cost as little as $1 or $2 per student. A recent study by the U.S. General Accounting Office predicted that a national multiple-choice achievement test would cost about $40 million, while a slightly longer test with short, performance-based questions would cost nearly $210 million.
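To see how wide the gap is, the figures quoted above can be turned into simple ratios. The sketch below uses only the dollar amounts cited in the text; the function name `cost_ratio_range` is hypothetical, and the comparison assumes the per-student estimates are directly comparable, which the sources may not intend.

```python
def cost_ratio_range(perf_low, perf_high, mc_low, mc_high):
    """Return the (smallest, largest) ratio of performance-assessment cost
    to multiple-choice cost, given a low and high estimate for each."""
    # Smallest ratio: cheapest performance test vs. priciest multiple-choice test.
    smallest = perf_low / mc_high
    # Largest ratio: priciest performance test vs. cheapest multiple-choice test.
    largest = perf_high / mc_low
    return smallest, largest


# Per-student estimates quoted in the article (dollars).
smallest, largest = cost_ratio_range(35, 70, 1, 2)
print(f"Per student: {smallest:.1f}x to {largest:.1f}x as expensive")

# GAO projections for a national test (millions of dollars).
gao_ratio = 210 / 40
print(f"National test: about {gao_ratio:.2f}x as expensive")
```

On these numbers, a performance assessment runs roughly 17.5 to 70 times the per-student cost of a multiple-choice test, while the GAO projection implies a gap of a little more than five to one for a national exam.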
"From a policy perspective, the major concern is whether the added cost of alternative assessment produces more meaningful results for parents and policymakers," observes an article in the Spring 1996 issue of the U.S. Education Department's online newsletter, Improving America's Schools. "Research has not yet explored this question."
The "meaningful results" states seek are clues to students' mastery of the complex skills missed by multiple-choice. It's the complexity of the desired findings—how well kids can think, create, analyze—that makes performance assessments costly. Those rooms full of raters must be trained to distinguish an adequate performance from an outstanding one. To do this, they compare each performance against a set of criteria laid out on a grid, usually called a scoring rubric. The technical quality of performance tests can hinge on how well raters understand the criteria and how consistently they apply them. Still, even if the scoring is reliable—different raters assign similar scores to the same performance—there's another key question that must be answered: Does the test really measure what it's supposed to measure? In other words, is it valid?
For performance assessments to succeed, the experts say, states must invest heavily in designing high-quality assessments and in training raters—usually teachers—to use the scoring criteria effectively. States that have rushed new assessments into schools or cut the training process short have run into criticism and opposition when the tests fail to produce valid, reliable results.
"Given problems in some of the assessment efforts first implemented—in Arizona, California, Georgia, and Maine, to name a few—policymakers pushed to set aside innovative approaches to assessment and to return to commercially available norm-referenced tests," write Bond and Roeber. "Although there have been some successes, such as in Maryland and Kentucky, the setbacks...elsewhere indicate that widespread acceptance of performance assessment is certainly not automatic."
Observes Dr. Richard Stiggins, director of the Portland-based Assessment Training Institute: "There have been a number of instances where states have tried to use performance assessments and portfolios in large-scale assessment and had it blow up on them because of a lack of quality. The problem associated with conducting good-quality, judgment-based assessment is that the criteria need to be clear. The other thing is that raters need to be systematically trained.
"At a school-building or classroom level," Stiggins continues, "that's relatively manageable. But at a statewide level, an awful lot of people have got to be involved. An awful lot of people have got to internalize this vision of a target and be trained to apply the criteria. I see few states, few large-scale assessment enterprises, willing to invest the resources it takes to do it well. It's a troubling paradox. It's just a whole lot cheaper to score multiple choice tests by machine than it is to train raters to be good judges of student achievement."
STATES WANT ASSESSMENT DATA FOR ALL KINDS OF REASONS. Often, those reasons clash. The most jarring conflict happens when states use the same assessment program to hold schools or students accountable on one hand and to improve instruction on the other, Bond and Roeber assert. To be useful as a tool for improving instruction, assessments should be flexible and ongoing. They should turn data around fast so that teachers can feed the information back into their practice, making continuous adjustments in instruction. Accountability, in contrast, calls for nearly the opposite qualities. Standardization, which ensures comparability and fairness, means that tests can't be tailored to fit schools' needs. Centralized scoring, which ensures the security and accuracy of scores, means that weeks or months go by before teachers see results.
"The primary goal of state assessment continues to be the improvement of instruction in order to help students meet new, challenging standards," write Bond and Roeber. "But states seem unsure whether improved assessment content and format or increased accountability will result in the most improvement. They therefore continue to do both, a situation that limits the utility of the assessment program for either purpose."
Most states use each of their assessment components for anywhere from two to five purposes, Bond and Roeber found in their national study. Improving instruction is the most common purpose (43 states), followed by school performance reporting (41 states). Other purposes include program evaluation (37 states), student diagnosis (26 states), high school graduation (17 states), and school accreditation (12 states). Stakes for schools are high—funding gains and losses, loss of accreditation, warnings, and eventual takeover of schools—in almost half the states.
"Caught between politics and science" is the way one policymaker, now an official in the Clinton administration, describes the status of current assessment initiatives. Science, represented by testing experts, cautions against using assessments as carrots and sticks to reward or punish teachers, schools, or students. These experts warn, too, against rushing ahead with new assessments before all the data are in.
Politics, on the other hand, calls for holding schools accountable—now. Lorraine McDonnell, who interviewed 34 state and national policymakers for a 1994 Center for Research on Evaluation, Standards, and Student Testing (CRESST) study, found a consensus among respondents that said, in essence: "America can't wait for all the data to come in. Let's move ahead with new assessments and iron out the kinks as we go."
McDonnell found at least seven different types of purposes that policymakers expect assessments to serve: (1) providing information about the status of the educational system; (2) aiding in instructional decisions about individual students; (3) bringing greater curricular coherence to the system; (4) motivating students to perform better and parents to demand higher performances; (5) acting as a lever to change instructional content and strategies; (6) holding schools and educators accountable for student performance; and (7) certifying individual students as having attained specified levels of achievement or mastery.
"The sharpest disagreements between testing experts and the policy community have been over the policy uses of assessment," McDonnell asserts in her report, Policymakers' Views of Student Assessment. "Experts warn that if assessments are used to advance policy objectives, particularly if they involve the imposition of rewards and sanctions, negative consequences are likely to result. These may include widening the gap in educational opportunities available to different kinds of students, a narrowing of the content and skills taught, a centralization of educational decisionmaking, and the deprofessionalization of teachers."
Researchers have found that assessments can act as powerful "curricular magnets," pulling teachers' attention to the content of test items, report Audrey Noble and Mary Lee Smith in Old Beliefs About Measurement-Driven Reform: The More Things Change, the More They Stay the Same, published by CRESST in 1994. They cite studies of statewide testing programs in Pennsylvania and Maryland where researchers found a direct relationship between what's at stake and what's taught. "The higher the stakes, the more likely narrowing of the curriculum will occur," the studies concluded.
For students, perhaps the highest-stakes assessments are those that serve as doorways to diplomas. Next year, Oregon will join the ranks of states that require students to pass competency exams for graduation. Such exams can run into legal challenges. Assessments that are used to award or withhold diplomas "are the ones most likely to end up in court," Bond and Roeber note. Courts have ruled that such assessments are proper only if they test what the school actually taught. That is, did the student have the opportunity to learn the skills and information she needed to pass the exam? Also important, the researchers say, are "the timing of the notice (students need to know approximately three years ahead of time that passing the exam will be a requirement for graduation), and the technical quality of the exam (the test must be reliable, valid, and fair)."
TECHNICAL QUALITY IS THE HOLY GRAIL OF ALTERNATIVE ASSESSMENT. One of the early pilgrims was Vermont, whose statewide portfolios in math and writing have succeeded in changing the way teachers teach. The state's efforts, nonetheless, have fallen short in scoring reliability.
To score portfolios and performances, raters draw upon their pool of knowledge and experience. They use their collective professional judgment. "Performance assessment scoring may be viewed as a social process in which a group of individuals negotiates meanings and comes to consensus about the interpretation of scoring rubrics," write Edward Haertel and Robert Linn in a 1996 publication from the National Center for Education Statistics, Technical Issues in Large-Scale Performance Assessment.
Compared to choosing A, B, C, or "all of the above," this method of scoring looks soft and squishy. But, in the words of Geof Hewitt, one of the designers of Vermont's portfolio assessment system, "Reliable large-scale scoring is not a dream, but it takes patience and lots of training." He cites Oregon's statewide writing assessment as an example of scoring success. Some 400 teachers in 11 districts reached inter-rater agreement about 95 percent of the time, he reports. The state then certified the scores by comparing them against pre-scored "anchor papers."
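An agreement rate like the one Hewitt cites can be computed in more than one way, and the article does not say which method Oregon used. The minimal sketch below shows two common conventions: exact agreement (both raters give the same score) and adjacent agreement (scores differ by no more than one point). The function name, the six-point scale, and the sample scores are all hypothetical, chosen only for illustration.

```python
def agreement_rates(rater_a, rater_b, tolerance=1):
    """Return (exact, adjacent) agreement rates for two raters' scores.

    exact    -- share of papers on which both raters gave the same score
    adjacent -- share of papers on which the scores differ by at most
                `tolerance` points (this includes exact matches)
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of papers")
    pairs = list(zip(rater_a, rater_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= tolerance for a, b in pairs) / len(pairs)
    return exact, adjacent


# Hypothetical scores from two raters on ten papers (1-6 rubric scale).
scores_a = [4, 3, 5, 2, 4, 6, 3, 4, 5, 1]
scores_b = [4, 3, 4, 2, 4, 6, 3, 5, 5, 3]

exact, adjacent = agreement_rates(scores_a, scores_b)
print(f"Exact agreement: {exact:.0%}")
print(f"Adjacent agreement: {adjacent:.0%}")
```

With these made-up scores the raters match exactly on 70 percent of papers and fall within one point on 90 percent, which shows why a reported agreement figure means little unless the counting rule is stated alongside it.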
Even if states can devise a reliable scoring system, they face another trouble spot in performance assessments: generalizability. The question is, Does a student's performance on one task, or a few tasks, give an accurate reading of his overall ability? Researchers suggest that multiple tasks are needed to make sure the full range of student abilities is sampled.
Yet another technical issue is comparability: Will tasks produce comparable results from student to student, school to school, district to district? Experts acknowledge that the costs—both in time and money—of designing, refining, administering, and scoring assessments that meet rigorous standards of reliability, validity, generalizability, and comparability are huge.
In Technical Issues in Large-Scale Performance Assessment, Richard Jaeger and colleagues write: "One reason for having performance standards is to attempt to establish comparability among the performances of students in different jurisdictions. That is, theoretically, parents of students in one state and school district could compare their children's performance to students' achievements in another state or country, or employers and colleges could make comparisons across students. Further, school systems and the nation could monitor trends across time in attaining curriculum standards. Yet a score on one task often means little in predicting the score on a second, seemingly related, but different task, either for individuals or for groups of students.
"Developing comparable standards across performance assessments," Jaeger concludes, "appears to be the most problematic venture of all."
As promising as performance assessments are, they should not be embraced to the total exclusion of other methods, researchers warn. Multiple-choice tests, they say, still have a place in the assessment mix. Stiggins argues that multiple-choice tests (along with true/false, matching, and other "selected-response" tests in which students choose the best response from an array of options) are valuable tools for finding out whether students have mastered content knowledge. In performance assessments, content knowledge is mixed in with a lot of other skills and is tough to isolate. In his book, Student-Centered Classroom Assessment, Stiggins calls for balance.
"Performance assessment and objectively scored paper-and-pencil tests do not represent different ways to assess the same things," he writes. "They attend to different outcomes. Further, one is not inherently superior to the other. They are equally powerful in fulfilling their respective assessment roles—in the hands of qualified users."
Walter Haney and George Madaus, writing in the Phi Delta Kappan in May 1989, make an eloquent plea for a reasoned approach—one that doesn't tip the scales too heavily in one direction. "Technologies of educational assessments (of whatever variety) in and of themselves will never cure the ills commonly attributed to standardized tests," they assert. "More than new forms of assessment, what is needed is a refusal to accept bondage to any single technology—no matter how useful it may be in a particular instance—and thoughtful selection of different kinds and mixes of assessments for different purposes."
RESOURCE NOTES: The report Technical Issues in Large-Scale Performance Assessment is available for $10 from the U.S. Government Bookstore in Portland, (503) 221-6217.
© 2001 Northwest Regional Educational Laboratory
Date of Last Update: 9/28/01