Thursday, August 25, 2011

STANDARDIZED TESTING INSANITY?

by Ron Willett

This writer was in the classroom for a quarter century, as an instructor and professor.  Even after leaving academia for industry, it turns out one of the functions of corporate leadership is promoting learning, by employees, customers, vendors, stockholders and stakeholders, and even regulatory bodies.  That teaching is more boots-on-the-ground, but the core principles of creating learning apply.

As an academic, I designed multiple-choice tests, essay tests, problem-based tests, pop quizzes, open-book tests, take-home tests, and, according to former students who are now friends or associates, some nasty tests, and graded thousands of structured tests and blue books.  Scores were scaled, transformed, clustered, curved, and so on, to try to properly assign metrics for achievement.  Because I was a practicing statistician and researcher for most of that tenure, those tests rarely came out of a publisher's end-of-chapter boilerplate for teachers.

Reflecting, I also agonized over every one of those tests and their outcomes: Was the test a proper sample of the course contents; was it fair in difficulty and its language; was the grading fair and consistent; had multiple tests given across multiple sections, therefore changed to arrest cheating, been made equivalent, in contemporary parlance, “standardized;” if scaled, had results been correctly adjusted?  How would individual students respond to their grades and any comments?  Had I created a negative stimulus to learning?  Did the test create any learning?  Did the test results truly differentiate levels of student learning achievement?

In sum, achievement testing in education, K-12 or higher, is an essential part of assessing student learning performance.  K-12 public education characteristically coined another term, "formative assessment," to describe what good teaching has been doing for decades, but that in-process assessment and testing is a critical component of executing formal learning strategies and tactics.

So the issue and point of this blog is not whether there should be standardized testing in our schools, not even whether that testing should be reflected in high stakes assessment of student, school, teacher or jurisdictional performance, but precisely how you do that with integrity, and whether we have the knowledge and mastery of measurement to do that without causing damage greater than the good alleged.

There are strong indications that the current standardized testing vision, and its application to assessing student learning across schools, especially teachers, and longitudinally, are badly flawed.  That is the quest in this particular blog: What is standardized testing; when, where and how can it be employed for proper effect; how must its attributes be gauged to make it viable; and how is the present overreach likely to impact our public education system?

Adding some credence to these questions, the August 12, 2011 issue of the premier journal Science contains a well-documented review of the “value-added” approach to judging our teachers based on longitudinal standardized test scores.  Its author is a researcher at the University of Wisconsin’s Department of Educational Policy Studies.  Concluding on whether value-added can improve education, the review states:  “The statistical properties of value-added measures are unlikely to improve much.  According to Campbell’s law, using the measures in high stakes testing will likely distort the measures themselves and make matters worse.”  Then:  “Until researchers who have demonstrated the theoretical promise of value-added measures also demonstrate its effectiveness in practice, the vacuum of empirical evidence will continue to be filled by ideology and speculation.”  The writer might have added, and some really terrible decisions impacting our teachers where the process is misapplied.

In bolder terms, are we really in danger of further degrading US public education, already damaged by a half century of dry rot and refusal to consider new ideas and groundbreaking results from expanding neurobiological research on learning?

What is standardized testing?

Sounds obvious, but in fact "standardized" is deceptive and the concept is quite complex.  Adding to the confusion especially among our policy makers, what is being called standardized testing is rarely standardized at all, with discrepant elements appearing through all aspects of that testing.  These discrepancies make up the basis of much of the current critique of NCLB, RttT, and the "value added" models being employed to assess and even fire teachers.  In fact, the sloppy and in some cases ideological use of flawed measurement is almost criminal, because of the abuses of measurement logic, incompetent administration, and flawed judgments being imposed on both students and teachers.

Start with simple definitions.  Standardized tests as they are being used are subsets of the generic category achievement testing.  Achievement testing has likely been with us as long as there has been a concept of formal teaching.

Perhaps the most famous unknown in American testing is E. F. Lindquist at the University of Iowa.  In a prescient paper over a half-century ago Lindquist articulated the pros, cons and criteria for defensible standardized achievement testing.  Unknown except to the few deeply into testing theory, Lindquist was hardly anti-testing, and was one of the developers of the Iowa Tests of Basic Skills, the Iowa Tests of Educational Development, the ACT, the high school equivalency GED, and the National Merit Scholarship Qualifying Test.  His paper basically lays waste to most of the present standardized testing protocols.

Lindquist's arguments are highlighted and extended by Daniel Koretz, an education professor at Harvard, and one of the reigning experts on educational testing.  The substance of his 2008 book, Measuring Up:  What Educational Testing Really Tells Us, should be taught to every would-be K-12 teacher and administrator, ironically with aggressive formative and summative testing of their grasp of its concepts before they are unleashed to staff a classroom or lead a school.

An additional caveat:  standardized testing has come to be associated with multiple-choice tests, sometimes pejoratively referred to as “bubble tests,” but that is not a correct assumption.  Any test theoretically could be made a standardized test by using the protocols that seek to assure comparability of results across students, classrooms, schools, districts, time periods, and even states.

At the level of grand design, a standardized test in K-12 is any test where student achievement being measured is attributable to that classroom’s practice, and free from competing explanations for the results across all of the causes, conditions and environments present.  Even with little specific knowledge of experimental logic, the reader can quickly sense that it could be a challenge to qualify a test as standardized when it is taken by a dynamic group of children or youth, then multiple groups, with different prior experiences and achievement, from different sub-cultures, with varying pre-test immediate experiences, for different knowledge areas, in various classroom physical environments, at different points in time, at different indoor noise levels, temperatures and humidity, and, to push the envelope, even depending on whether the sun is shining.

There are multiple critical issues:  An assumption – and another dimension of whether that test result means anything – is whether the test itself and the individual items or questions elected measure what was intended or appropriate; another, establishing those protocols that will assure you really have standardization is metaphorically just a tad short of trying to find the Higgs boson given our present tools for creating tests and assessment. Virtually never mentioned, what we call knowledge is now doubling every decade, as well as being corrected for historical errors of interpretation, meaning that the sample of learning acquired in any test is getting more and more selective as time marches on.

You’re measuring what?

Put this into a real-world context.  Imagine 30 eighth graders, reflecting an assortment of backgrounds, personalities, sub-cultures, parental interest, present moods, and other school exposure, are given a “standardized” test on science.  You are supplied, by some distant test vendor, a 50-item multiple-choice test, created by an equally distant test designer who has never been in your classroom, perhaps any classroom recently, using an unknown sample of science knowledge and protocol to select 50 items to constitute that test, covering a fragment of the learning you believe you’ve stimulated over a term or year.  Across the city, town or county, another 30 eighth graders are administered the same test, but at a different time in a different environment, with those 30 students representing different backgrounds, personalities, cultures or sub-cultures, prior learning, and contemporary mindsets.  Employment of you and your peer across the way is to be determined by the results of that testing, allegedly measuring precisely and exclusively what either of you, uniquely, has managed to instill as science learning in your respective 30 kids.

Just a common sense question, that has a long history of measurement science and research at its roots, how confident should you be that when you’re fired because of that test’s scores you had a fair hearing?  Multiply that isolated testing challenge by a large multiple via accumulating those results to rate schools, jurisdictions and even states; room for error?  There’s an old saying for those of us who entered the computer age at its tender stage, before Bill Gates and Steve Jobs were out of their nappies, and a window was just a window:  Its acronym, GIGO; its definition, garbage in, garbage out.

The earlier cited work by Daniel Koretz is pretty dry reading, but it is a quality assessment by a competent testing professional on what we can measure and where those measurements can supply usable information.  The above “competing explanations,” which is simply an eloquent way of summarizing effects, explodes in your face when you open the box.

First, the question breaks into two questions:  The test itself; and how, where, when, and to whom it is applied. 

A test is almost invariably a specific sample of some aspect of learning or achievement from the universe of what is covered as instruction in the particular discipline.  To have it be otherwise would make the testing approximately the equivalent in scope and time as the original instruction.  Who selects what elements of knowledge or learning achievement are to be tested?  Who elects what specific property of what is learned to be the object of testing?  Is the test arbitrary or to be used to reference a student’s result (or accumulated results) against the norm of a larger universe of students’ results? Alternately is any result compared to some standard of learning, i.e., standards referenced?  Who selects the kind of test – i.e., the complexity or scope of each topic tested?  Does the test expression validly match what you wish to know, construct relevance?  The disconnect between the learning model(s) and the testing models becomes a yawning gap.

Then the issues really get hairy:  Validity – more than one type, construct underrepresentation, content versus performance standards, standard error of measurement (test-retest reliability, not traditional sampling error), sampling error resulting from who is tested and how they are specified, scaling or relating test scores to meaning, linear versus non-linear scale properties, bias, the real world tradeoff between the complexity of what is tested and reliability of results; and on.  Overarching the technical issues, present standardized testing has delegated answers to most of the above questions to a black box of test designers and psychometricians, all in a different place than the teacher in the classroom.
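To make one item on that list concrete, here is a minimal sketch, in Python, of how the standard error of measurement is commonly computed in classical test theory: the standard deviation of scores multiplied by the square root of one minus the test's reliability.  The function names and the numbers used (a score scale with a standard deviation of 15 and a reliability of 0.91) are assumptions chosen only for illustration; they do not describe any actual state test.

```python
import math


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    if score_sd < 0:
        raise ValueError("score_sd must be non-negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


def score_band(observed: float, sem: float, z: float = 1.96) -> tuple[float, float]:
    """Approximate band around an observed score (z = 1.96 for roughly 95%)."""
    return (observed - z * sem, observed + z * sem)


# Hypothetical numbers, chosen only for illustration.
sem = standard_error_of_measurement(score_sd=15.0, reliability=0.91)
low, high = score_band(observed=100.0, sem=sem)
print(f"SEM = {sem:.2f}, approximate 95% band = {low:.1f} to {high:.1f}")
```

Under these assumed numbers, a single student's observed score of 100 is compatible with a true score anywhere within roughly nine points either side, which is worth keeping in mind before a few points of difference are treated as meaningful.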

Lastly, there is the classic wrecking ball of “correlation is not causation.”  One of the more common explanatory mistakes, even for professionals, is equating two or more sets of metric data because they demonstrate superficial association.   As test results proliferate, with pooling of data to move from usually defensible results within a classroom for specified materials, to comparisons of schools and longitudinal performance, the chance of visually suggestive associations increases.  Even most professional educators are inadequately trained in interpreting what is termed statistical correlation, much less multiple and partial correlation, factor analysis, and other tools that have been used at a research level to try to diagnose and verify such associations.  Even when there is a replicable statistical fit, the issue of causation is a logical evaluation, not just number manipulation.
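The point can be sketched with a small simulation.  In the Python example below, which is an illustration under stated assumptions and not an analysis of real data, a hypothetical background factor drives two measures that have no direct effect on each other.  The variable names (background, resource_index, test_score) and the model itself are invented for the example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals(ys, zs):
    """Residuals of ys after a simple least-squares fit on zs."""
    n = len(zs)
    mz, my = sum(zs) / n, sum(ys) / n
    slope = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
             / sum((z - mz) ** 2 for z in zs))
    return [(y - my) - slope * (z - mz) for z, y in zip(zs, ys)]


rng = random.Random(2011)  # fixed seed so the sketch is repeatable
n = 5000

# Hypothetical model: the background factor drives BOTH measures;
# neither measure has any direct effect on the other.
background = [rng.gauss(0, 1) for _ in range(n)]
resource_index = [b + rng.gauss(0, 1) for b in background]
test_score = [b + rng.gauss(0, 1) for b in background]

raw_r = pearson(resource_index, test_score)
partial_r = pearson(residuals(resource_index, background),
                    residuals(test_score, background))
print(f"raw correlation: {raw_r:.2f}")
print(f"after controlling for background: {partial_r:.2f}")
```

By construction the raw correlation comes out near 0.5, while the correlation that remains after removing the shared background factor is near zero.  The numbers alone cannot say which picture is the causal one; that takes knowledge of how the data were generated, which is exactly what is missing when pooled test results are compared.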

A perspective comes from contrasting the models where measurement science is confident, with the kinds of testing involved in K-12.  It has been a challenge in social research to successfully install classic random assignment and experimental treatments with controls in the simplest economic and social experiments, and usually with few variables to make explanation robust.  In the kind of standardized testing called out here, throwing the results from millions of test takers, from thousands of schools, and an unknown mix of prior and environmental conditions, into crude classification “grades” for schools is an oar short of rowing to some reasonable destination.

One blossoming and terrible result for K-12 education, call it the cart dragging the horse, is that what is taught and billed as learning is increasingly being determined by what is coming down the pike as testing, and there is a full disconnect between what that on-site learning effort should be and emphasize and the origins of the tests.  The defensive albeit unethical result from our schools has been dubbed “teaching to the tests.”  It has even resulted in a caricature of serious testing, an abomination of more time devoted to trying to game the design of the options in those multiple-choice questions to try to increase scores, than had been devoted to creating usable learning.  Even more destructive of our education systems, full cheating via changing or manipulating actual test results is turning into an epidemic.

Question, why are we testing?

Most thoughtful critics of the present trend in testing, to force standardized testing into every aspect of K-12 change, make a highly valid point:  Present testing has done a tectonic shift, from being primarily a valid tool for diagnosing the effectiveness of classroom efforts to create learning, to being Thor’s Hammer, a “get tough on education” method of forcing accountability.  It offers little comfort to students impacted by flawed testing, or teachers fired for test results, or schools condemned on the basis of those results, that public education over a half century shot itself in the foot resulting in NCLB and its devil’s spawn.

But the shift has occurred, being aggressively promoted by our corporate business community on the premise that our schools should be more “businesslike,” and by a cabal of corporate test creators, vendors and scorers for profit.  It is also being politically driven by conservatives because it has an aroma of control and discipline, and Federally because however crude, it is the only tool Constitutionally available.  Curiously, the one reason not promulgated is that it will demonstrate how to improve learning.  You judge, is it likely to enhance K-12 education, or create more problems than solutions?

The central issue is that at every stage of aggregation – from a student’s scores, to classroom level, to school level, to district, to state, across time periods – standardized does not necessarily mean equivalence.  It is, in classic research parlance, conducting a massive set of experiments with high stakes impacts, but without the controls that assure that the student achievement or learning produced by its classroom mediator, and only that, is what is represented in the number tossed out with little regard for the underlying assumptions.  Arbitrary assignment of value judgments to results, such as the crude school ratings now being issued by Ohio (and other states), is even more suspect.

Standardized testing has become an educational “wave,” almost a flash mob of rabid reformers, in railroading students and schools.  The comparisons claimed by those advocating high stakes testing and present standardized tests as the mechanism for improving learning simply don’t hold up as defensible when subjected to the standards that, for example, Koretz details.  We aren’t yet knowledgeable enough about learning as a process, or able to dictate the conditions for testing, or in possession of needed statistical discriminators, to allow absolute judgments of performance to support broadly changing educational infrastructure.

Does this take testing off the table as an educational tool?  Hardly, but it should place its use back into the hands of those in the classroom creating learning, for the purpose that makes most sense, diagnosing whether the classroom performance or any other medium is achieving desired learning, whether it is in process, or a summary effect needed to assess whether resources are being applied effectively.  A proposition is that the present trend in using testing to nationally penalize or award resources or systems is dysfunctional.

The counter to this is the large common sense question, if not such testing, then how do we hold our schools and educators accountable for doing their job, for producing needed student learning, and operating at desired levels of productivity?  Trust them? 

That’s a legitimate question, but it has a business counterpart in the distinction between traditional manufacturing quality control based on inspection, versus a contemporary concept of quality assurance based on process control.  In the latter model, by assuring the perfection of the inherent component process(es) (in this context, teachers, administrators, curricula, technology, assessment models) you ensure finished product quality.  You rarely inspect because the quality of the output is determined by the earlier control of the processes involved.

It is ironic and just short of condemnation, that the gang of school reformers smashing our public schools with standardized testing, are using last century’s obsolete management model of quality, while hypocritically trumpeting the need for sound business principles to be applied to education.  It appears that neither our reformers nor our public education establishment has managerial credibility.  Perhaps it is also that to establish educational process control means better teacher training, screening for competent administration, curricular reform, reforming school board oversight, understanding technology, and allocation of resources to learning first, to sports and bricks and mortar a distant second?  Called change, says easy, does hard.

But it also reflects a simplistic view of an enormously complex system, and the assumption that you can crank out a simple measure that will validly differentiate education that’s working from its obverse, and by cracking the whip.  Reality is that testing linked to the classroom, when those using it have been properly trained, works, even when there is a good deal of diversity in how it is handled teacher to teacher.  We should not be creating teachers with the notion that their skill level is limited to posturing at the front of a classroom, or on the side in a rocking chair in one of the more bizarre manifestations of pedagogical theory.

Where large scale results are necessary, testing like the National Assessment of Educational Progress (NAEP) has proven viable; the ACT and SAT testing less so but still effective.   Futuristically, but not very far out, there needs to be a “race to the moon” effort to create testing models that can get beyond simple memorization of facts or simplistic relationships, and test more complex integration of learned information as it applies to problem solving.  Paradoxically, Harvard’s Howard Gardner, creator of the concept of multiple intelligences, pioneered trial versions of such testing decades ago, but it was never followed by subsequent recognition and development.

Pointing up the movement

At the minimum, there needs to be some common sense in allowing the Federal government to develop some nationally comparable testing, à la NAEP, that can be used to validly assess learning achievement across schools and our states.  Controversial, yes, in the current political muddle, but the present testing game allowing states to choose to test what each considers knowledge is a fool’s paradise.  Present state deception, provinciality, ego, or belief in magic, by geography, simply isn’t supported by how knowledge is evolving.

Lastly, the multiple-choice testing being saturated into K-12 is pushing precisely the wrong learning buttons, emphasizing isolated components of memory and narrow concepts rather than their integration into constructs that explain real phenomena and enable prediction, science’s sine qua non.  Ideology being vocalized, that we need to get tough and slam those public educators and their social engineering, can seem attractive because of the arrogance frequently demonstrated by our schools of education and self-righteous public educators/administrators in silos, but the game played that way has never in the history of education produced real winners.

Perhaps the best way to summarize a perspective about standardized testing as it is being used presently is to resort to a hoary old saying from bygone days of teaching and designing tests.  Those creating present standardized tests in absentia, and heavy-handedly using their results to flog students, teachers and schools as the way to improve K-12, are attempting “brain surgery with a meat ax.”

Nuance and incremental change in inducing positive change in our K-12 schools are not big in present reform circles.  They should be, because on its present course reform based primarily, or even heavily, on the use of standardized testing as it is being postured has the potential of pushing public education into a more fragile and dysfunctional position than the one it was in when it exited the last century.  The reform movement should heed an ancient admonition, “first do no harm.”

References

For readers who haven’t yet challenged the literature of testing, the following books are entry points:

Daniel Koretz, Measuring Up:  What Educational Testing Really Tells Us.  Cambridge, MA:  Harvard University Press, 2008.  ISBN 978-0-674-02805-0

Diane Ravitch, The Death and Life of the Great American School System:  How Testing and Choice Are Undermining Education.  New York:  Basic Books, 2010.  ISBN 978-0-465-01491-0
 
Additionally, on a day-to-day basis, some of the very best commentary on K-12 reform and testing can be found in the blogs edited by The Washington Post’s Valerie Strauss, in its Education Section, "The Answer Sheet."  Many posts are by seasoned K-12 educators, education administrators and researchers.
