Friday, December 23, 2011

SQUINTS 12/23/2011 - THE TESTING GRINCH

In the spirit of the season, today’s SQUINTS looks at Ohio’s K-12 standardized testing Grinch, and its dubious gift of assessment that keeps on misgiving.

Tilt

If, as an Ohio parent, you believe your children’s K-12 education is life-enabling, and that Ohio is accurately measuring learning outcomes -- equitably rewarding or penalizing schools and even teachers on the basis of that testing -- today's SQUINTS is sobering.

A prior SQUINTS related the story of Ohio’s Legislature requiring Ohio’s Department of Education (ODE) to produce a “performance index” for every Ohio K-12 school district and to rank all districts.  Ostensibly, the exercise was an attempt to improve on the prior system of assigning a series of descriptors to Ohio’s districts, ranging from “excellence with distinction” to “academic emergency.”

Readers may recall, from the earlier SQUINTS, that except for adding the Ohio Graduation Test (OGT) to the formula, and weighting the test components, the Performance Index (PI) was based on virtually the same NCLB standardized testing as before.

The catch came in the form of an ODE acknowledgment that the PI did not necessarily track well with Ohio’s 12th grade ACT (originally the abbreviation for "American College Testing") test results.  Given the addition of the OGT to the formula, the observation was curious.*  The ACT has been well validated; a reasonable assertion is that it is currently a better measure of a school district’s learning outcomes than the now increasingly criticized NCLB standardized tests given at earlier grades.  In 2011, ACT test-takers represented 63 percent of Ohio’s district graduating classes.

Without revisiting the lengthy prior arguments for why NCLB-type testing has become a K-12 school liability -- including nationally entrenched and regular cheating on those tests to avoid NCLB’s negative effects -- a legitimate question is just how well Ohio’s PI scores and rankings parallel ACT results.

The question is valid because the answer is neither trivial nor a matter of wading through some obscure analysis.  On the basis of that standardized testing, and any positioning of school districts because of those scores, children may not be receiving the promised education, schools may see penalties, cheating can be rewarded, school levies can be affected, and teachers’ attitudes, ratings, and even employment may be impacted.

A Brief Update

When this issue was last visited, a request had been made to acquire Ohio school district ACT results to attempt a comparison.  Ohio’s Department of Education had not yet responded, but subsequently complied with the request, providing a 2011 ACT file of scores by school district. 

A small discrepancy, not yet fully sorted:  ACT reported 92,313 Ohio test-takers in 2011, while the ODE-supplied file accounted for 79,212 of those tests, 86 percent of the test-takers.  To date neither ODE nor ACT has provided an adequate explanation of the difference, which could be attributable to either ODE’s or ACT’s school categorization or data processing.  In the process of reconciling the ODE and ACT district-by-district data, the tentative inference is that a large share of the roughly 13,000 test-taker shortfall is associated with districts with bottom-dwelling PI scores, schools where the SAT is taken rather than the ACT, or charters and private schools not in the database provided.
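For readers who track the numbers, the reconciliation amounts to little more than totaling and matching the two files.  A minimal sketch in Python follows; the file and column names are hypothetical stand-ins for the actual ODE and ACT layouts.

    # Hypothetical file and column names; the real ODE and ACT layouts differ.
    import pandas as pd

    ode = pd.read_csv("ode_act_by_district_2011.csv")   # district, act_takers, act_avg, pi_score
    act = pd.read_csv("act_ohio_report_2011.csv")       # district, act_takers

    print("ODE file test-takers:", ode["act_takers"].sum())    # about 79,212 per the text
    print("ACT report test-takers:", act["act_takers"].sum())  # about 92,313 per ACT

    # Districts in the ACT report with no match in the ODE file
    unmatched = act[~act["district"].str.upper().isin(ode["district"].str.upper())]
    print(len(unmatched), "unmatched districts accounting for",
          unmatched["act_takers"].sum(), "test-takers")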

Also, a repeat admonition:  there is an expected relationship between ACT scores and the proportion of a graduating class taking the ACT.  Two effects are present.  One, a larger fraction of test-takers may pull down the ACT average; and/or two, a larger fraction of test-takers may reflect a school’s testing culture and preparation, producing higher overall ACT performance.  The two effects cannot be separated here, but a positive relationship was found between a class’ fraction of test-takers and both the PI and the ACT average, a relationship that tends to inflate the resulting ACT scores.

That relationship has to be factored into any comparison of the ODE PI with the ACT, which entails processing additional Ohio district data to create such a control variable.  All of these chores were accomplished just days ago, resulting in a correlation analysis of 650 Ohio school districts’ PI scores and comparable ACT scores.
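As a rough illustration of those chores, a sketch of assembling the analysis file might look like the following; again, the file and column names are hypothetical.

    # Merge the ODE Performance Index file with the ACT district file and
    # derive the proportion-of-test-takers control variable.
    import pandas as pd

    pi  = pd.read_csv("ode_performance_index_2011.csv")  # district, pi_score, grads
    act = pd.read_csv("ode_act_by_district_2011.csv")    # district, act_takers, act_avg

    df = pi.merge(act, on="district", how="inner")
    df["pct_takers"] = df["act_takers"] / df["grads"]     # fraction of the class tested
    df = df.dropna(subset=["pi_score", "act_avg", "pct_takers"])
    print(len(df), "districts in the analysis file")      # about 650 per the text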

The Results

The test conducted is a partial correlation:  simply, the relationship between districts’ PI scores and their comparable ACT scores, with the effect of the proportion of test-takers held statistically constant.  It produces a simple answer, but one with large impact and implications.
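For the statistically curious, a minimal sketch of that computation, assuming the hypothetical analysis frame built in the sketch above:  both variables are residualized on the proportion of test-takers and the residuals are then correlated, which is equivalent to the standard partial correlation.

    import numpy as np

    def residualize(y, x):
        """Residuals of y after a simple linear regression on x."""
        X = np.column_stack([np.ones(len(x)), x])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return y - X @ beta

    pi_res  = residualize(df["pi_score"].to_numpy(float), df["pct_takers"].to_numpy(float))
    act_res = residualize(df["act_avg"].to_numpy(float),  df["pct_takers"].to_numpy(float))

    partial_r = np.corrcoef(pi_res, act_res)[0, 1]
    print("partial r =", round(partial_r, 3),
          "partial r-squared =", round(partial_r ** 2, 3))   # the reported fit: about .52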

A correlation coefficient (r) of 1.00 would be perfect association; values of .80 and up are considered respectable, but even a coefficient of .80 (r-squared = .64) means that only 64 percent of the variation in a criterion variable is being explained by the other indicator.

The ODE PI and ACT fit had a partial r-squared (also known as the coefficient of determination) of .52.  The translation:  the PI scores “statistically explained” 52 percent of the variation in ACT scores across districts.  In sum, if the ACT is a better measure of Ohio learning outcomes, then dollars, policies, or any other actions taken using Ohio’s PI rankings as a criterion are not on solid footing.

A scatter chart of the composite results appears below in FIGURE ONE.

FIGURE ONE

(Fit:  Partial r-squared = .52)

Clusters at the extremes of both the Performance Index and ACT distributions drive the correlation, tending to anchor the line of fit on the extreme points.  But a practical measure of learning outcomes must discriminate across the full range of an indicator, with reasonably uniform error in its estimates.  More meaning comes from looking separately at the low region of PI scores, the high end, and the middle performances.

Looking at just the highest (4th) quartile (25 percent) of districts based on the PI, the fit of ACT scores to PI scores produced an r-squared of .256; in that top quartile the PI explained only 26 percent of the variation in ACT scores.  In the lowest PI quartile the fit was just a bit better, the PI explaining 36 percent of the ACT variation.  Both fits are evidence of very weak association.  The challenge of discriminating among school districts based on PI scores can be seen below in FIGURE TWO, describing the fit between the ACT estimated by the PI scores and the actual ACT scores.

FIGURE TWO

(Fit:  r-squared = .256)

The most troubling fits of PI and ACT were in the 50 percent of all school districts studied that fall in the 2nd and 3rd quartiles of the PI score -- over 320 school districts, the middle majority.  That scatter of ACT scores estimated by PI scores versus actual ACT scores is shown below in FIGURE THREE.  With an r-squared of .1225, it indicates that in the middle 50 percent of Ohio districts, based on PI rank, the PI scores explained only slightly over 12 percent of the variation in ACT scores.

FIGURE THREE

(Fit:  r-squared  = .1225)
 
Hence, discriminating among any of these districts at other than a very macro level, based on ODE’s Performance Index and rankings, is a serious challenge, one that can misrepresent a school district’s standing in Ohio.
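A sketch of those quartile breakdowns, continuing with the same hypothetical frame:  districts are split on PI quartiles and an ordinary r-squared of ACT on PI is computed within each subset.

    import numpy as np

    q1, q3 = df["pi_score"].quantile([0.25, 0.75])

    def r_squared(sub):
        r = np.corrcoef(sub["pi_score"], sub["act_avg"])[0, 1]
        return r ** 2

    top    = df[df["pi_score"] >= q3]                            # 4th (highest) quartile
    bottom = df[df["pi_score"] <= q1]                            # 1st (lowest) quartile
    middle = df[(df["pi_score"] > q1) & (df["pi_score"] < q3)]   # 2nd and 3rd quartiles

    for name, sub in [("top", top), ("bottom", bottom), ("middle", middle)]:
        print(name, len(sub), "districts, r-squared =", round(r_squared(sub), 3))
    # The reported fits: roughly .256 (top), .36 (bottom), and .1225 (middle two).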

Flags

One question raised by the findings is why a district might score well on the standardized tests yet fail to match that performance on the terminal ACT.  One quite serious possibility is that a system has fallen into the pedagogy of “teaching to the tests.”

The latter term is frequently misunderstood.  It doesn't necessarily refer to the obvious, direct transmission of future (or even past) standardized test questions to students, or their overt inclusion in lesson plans, although that truly egregious practice has been widely documented nationally in our public K-12 systems, along with worse (most recently, again in Georgia), in attempts to beat the NCLB sanctions.

When a school's leadership and faculty start to inch into the dark side, even believing it is for the greater good, the tactics are more subtle, but no less corrupting of what is offered as K-12 education.  Those approaches take the form of targeted selection of texts, prohibition of alternative learning materials that might dilute the focus, exclusion or suppression of any sources that might question curricular or teaching policies, and construction of lesson plans to maximize test scores whether or not the tactics and material constitute genuine learning.  This isn't going to be visible to most school boards unless they immerse themselves in a system and dig into the pedagogy it has embraced.

From the present analysis, the places to look are the data quadrants where a district’s PI score is high but its ACT is lower than expected.  For example, the top half of districts’ PI scores starts at the median PI score of 99.  A flag should go up where a district has a PI score materially above that value but an ACT score at or below Ohio’s median of approximately 22.
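As an illustration only, that flag can be expressed as a simple filter on the same hypothetical frame; the three-point margin above the median is an arbitrary stand-in for “materially above.”

    PI_MEDIAN, ACT_MEDIAN, PI_MARGIN = 99.0, 22.0, 3.0   # medians from the text; margin is illustrative

    flags = df[(df["pi_score"] > PI_MEDIAN + PI_MARGIN) & (df["act_avg"] <= ACT_MEDIAN)]
    print(len(flags), "districts flagged: high PI but a median-or-lower ACT average")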

There are obviously still other explanations for that combination, but none is good news.  It may signify a one-off, a senior class that has simply underperformed in that year.  More significant is the possibility of a performance gradient between the administration and teaching of grades 9-12 versus middle and elementary school performance.  That has increasingly become a source of learning vulnerability as the high school level has been pressured to create greater college-readiness, which frequently entails adopting more of the course design of post-secondary work, challenging curricula, materials, and teacher preparation alike.

Lastly, the errors of fit may simply signify, as now asserted by many education professionals and researchers, that the standardized testing that has been allowed to flourish is both an invalid and an unreliable measurement of meaningful K-12 learning.  The obvious sequel to that is the question of why the US public K-12 establishment has simply acquiesced to a corporate oligopoly, with little to no oversight, exercising major input and control over K-12 testing and curriculum -- the education tail wagging the dog, so to speak.

Bottom Line

An implication that should be visible to all but the dogmatic or the deniers is that material decisions or claims made on behalf of a school district, based on the Ohio PI or any similar derivative of present standardized tests, have a high probability of being flawed.  At the polar extremes, a high PI and an equally high ACT average likely identify a superior system.  Conversely, very low scores on both the PI and the ACT do signal poor-performing districts.  But in between, using the present PI scores to either strategize or tactically manage Ohio’s schools is unreliable and has the potential to harm Ohio K-12 education.

The Greater Challenge
 
The results above simply mirror findings starting to appear across the US as education researchers challenge the almost maniacal dependence being placed on present standardized, corporately devised testing and scoring.  The irony is that virtually none of the present highly orchestrated and heavily funded attempts to make that testing the order of the day was subjected to research or experimental verification, or even employed prudent small-scale trial runs, before being dropped on US K-12 schools.

The hard reality is that the “testing to force K-12 change” model is not functioning.  Although in the present example the ACT may not be the ultimate measure of a school district’s learning performance, it is likely a better basis for overall assessment than present NCLB testing.  Another aspect of that reality graced this week’s news, where New York State is now investigating the Pearson Foundation, an arm of the nation’s largest publisher of standardized tests and packaged curricula, for improper lobbying.

But the greater challenge is to go beyond calling out folly and see K-12 education creatively and rigorously substitute better measures of learning outcomes to assess school and student performance.  That the venue has not reached for its bootstraps and assumed the responsibility for devising more valid learning outcome measures is both vexing and something of a mystery.

One explanation was offered in the December 21, 2011 Education Section of the Washington Post, by Mark Phillips, a professor emeritus of education.  His view is empathetic and realistic:

"Most teachers and administrators, dealing with the daily challenges of teaching, don’t have the luxury of thinking beyond the present paradigm.  They’re too busy dealing with meeting student needs, designing engaging lessons, and responding to external pressures, from assessment to the latest mandated 'innovation.'  But for those of us who have the luxury of time to think and lead, reformers and policy makers alike, I think the relative paralysis should be a matter of concern." 
"The concept of schools without walls is not a new one, and yet in this age of instantaneous electronic communication, as we freely Skype and network in multiple ways with people all over the world, how can we possibly think of education as taking place in a building in blocks of 49 or 53 minutes?  While I don’t know exactly what a new paradigm should look like, the little I see suggests that it might include classrooms as command centers to coordinate schooling without walls, with present subject organizers vastly changed, the line between teaching, facilitating, and counseling blurred, the functions integrated, and a seamless connection between the school, the community and the land itself.  This is not boring?"

Phillips, obviously, sees even more opportunity for innovation than just improving the measurement of learning outcomes.  But a return to K-12 sanity would mean ginning up the needed research and grass-roots testing of indicators of school performance and quality that reflect all aspects of genuine education:  categorically, direct evidence-based measurement of learning, but learning with context that satisfies the test that knowledge has been created; measures of the performance of the process variables that functionally allow learning to happen -- curricular validity, teacher preparation, organizational alignment with the social and cultural environment of a school; and equally solid evidence of school administrative, leadership, and ethical performance.

Lastly, longitudinal measurement has now become part of the nomenclature for assessing teachers, but in a perverse fashion.  The so-called value-added measures being touted are totally misaligned with the processes that produce good teaching.  The longitudinal effects of a school’s and a teacher’s contribution to a child’s learning do need to be assessed, but that is not accomplished by rigid, microscopic measurements of short-term change in fragments of retained information.  It requires at least three initiatives:  recognizing that a child’s “prior knowledge” is a major condition of present learning and that we are still vague on how to address that; acknowledging that social and cultural factors do impact a student’s immediate capacity for learning; and using the tools employed by virtually every sophisticated private-sector organization to gauge post-transaction customer satisfaction and behavior.  The latter notion and methods have been around as long as modern marketing, now multiplied by digital tools, but seemingly go unrecognized or ignored by K-12 education.

After tours through multiple web sites of states’ departments of education, there is evidence that a few states are asking the right questions.  Whether that is enough to kick-start better learning performance measurement is a question mark.  One perspective is that it will take an orchestrated national voluntary effort, or a Federal data initiative, to generate consistent inputs for better assessments.

One closing perspective:  the digital capability to codify good data for measurement, and the techniques to automate their application to both student and school assessment, appear to outstrip the basic understanding of many K-12 educators; digital methods are intrinsically not the roadblock.  For those of us who encountered computers before they fit in a shirt pocket, or a garage, or a few years earlier in a warehouse, there was an anthem every computer user learned to respect.  It went by the acronym GIGO, “garbage in, garbage out”; it still applies.

Happy Holidays.

* A footnote:  Another test that seems timely, and a matter of due diligence, would be an assessment of the concordance of Ohio districts’ 2011 OGT results with their 2011 ACT scores.


Technical Postscript
Analysis Coverage

The comparative ODE and ACT score databases do not fully align.  The report references the discrepancy between the count of test-takers in the ODE file supplied (79,212) and the ACT Ohio report of 92,313 test-takers in 2011, a 14 percent shortfall.  Some of that discrepancy appears to have occurred in areas of very low PI scores.  Another part may have different explanations:  schools where only the SAT was taken, and charters and private schools not in the database ODE provided.  There is presently no way to sort that out.

However, the ACT test-takers represented in the above analyses (77,966) account for approximately 98 percent of the 79,212 ACT test scores provided. The small difference occurred where district data as reported by ODE could not be matched with the ACT report.

The SAT Test-Takers

The present analysis did not include SAT test-takers; their present data coverage represents approximately 15 percent of Ohio's 12th graders, versus the approximately 63 percent of graduating seniors who took the ACT.  A new data set was required to test the concordance of SAT scores with the ODE Performance Indexes.

Because of the difference between the numbers taking the SAT versus the ACT, much smaller data sets were available.  Total 2011 SAT test-takers numbered 18,998, and district-average SAT scores could be assigned for testing against only about ten percent of the districts in the prior analysis.

Available data indicate that where the ACT and SAT tests were jointly reported, the proportion of 12th grade SAT test-takers was approximately 88 percent.  In this same subset of test-takers, the proportion of students taking the ACT remained at 63 percent, congruent with the full ACT analysis.

Correlation Results:  SAT Versus ACT & Versus PI

A correlation analysis was run to assess the fit of SAT scores to ACT scores, and the fit of SAT scores to the PI scores.  Results were:

   SAT x ACT:  r-squared = .716, 71.6% of the variation in SAT scores explained by ACT scores.

   SAT x PI:  r-squared = .526, 52.6% of the variation in SAT scores explained by PI scores.
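A sketch of those two fits, assuming a hypothetical district-level SAT file merged onto the earlier hypothetical frame:

    import numpy as np
    import pandas as pd

    sat_df = pd.read_csv("ohio_sat_by_district_2011.csv").merge(df, on="district")

    def fit_r2(frame, x, y):
        r = np.corrcoef(frame[x], frame[y])[0, 1]
        return r ** 2

    print("SAT x ACT r-squared:", round(fit_r2(sat_df, "act_avg", "sat_avg"), 3))   # reported: .716
    print("SAT x PI  r-squared:", round(fit_r2(sat_df, "pi_score", "sat_avg"), 3))  # reported: .526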

With less data representation as well as excluded districts, there is little opportunity for a disaggregated analysis of how the error associated with the SAT x PI fit is distributed.  The overall fit of SAT scores with the PI scores is consistent with the larger set of PI x ACT findings reported (r-squared = .52), although, referencing the ACT x SAT fit above, it may reflect a different pattern of errors.

In this subset of schools where both the ACT and the SAT were taken, the ACT scores better fit the ODE PI scores in the region above the median of both variables, but not below.  The explanation may be that this subset of systems heavily represents the 4th-quartile cluster of Ohio systems with high-level performance on all testing.
