Fordham Institute’s Simplistic System for Evaluating State Accountability Systems Gets It Wrong for California

The Fordham Institute’s recently published evaluation of state accountability systems is dangerously off the mark. It uses three criteria to give each state a strong, medium, or weak ranking.

The first criterion elevates clear and intuitive annual ratings, such as A-F grades for schools, as an easily understood way for parents, educators, and policy makers to evaluate a school and then push for improvements. California had persuasive reasons to reject that approach. A-F grades are simple and clear, yet often misleading, because they are based primarily on reading and math scores, which are much too narrow a definition of school quality. Several states that have used this system have experienced widespread misidentification of schools and found that the grades merely tracked socio-economics. Many schools with low grades were actually high performers.

California instead uses a broader set of measures, called a dashboard, which includes the results of annual exams but also such measures as graduation rates, preparation for college, preparation for careers, achievement gaps among groups, school climate, enrollment in advanced courses, and suspension rates. Each measure is given a rating using a quadrant method that combines growth and level of performance. This strategy is much more useful for educators and parents in determining where improvements need to be focused.

Mingling these diverse measures to produce an average score may be simple, but it can mask major differences in performance. One school may have medium test scores but high engagement levels, graduation rates, and college attendance. Another school with the same ranking may have very high test scores but low college attendance. If the purpose of accountability is to provide useful information to school and district staffs to guide improvement efforts, then discrete information on each measure is warranted, and mushing these various measures together is inappropriate and counter-productive.

If your car’s temperature gauge is in the red but all the other gauges are fine, a high average score will mask the seriousness of the situation. Or conversely, if the gas gauge is on empty but the temperature gauge is fine in one car and the opposite is true in another car, the same average score is highly misleading and doesn’t pinpoint the problem.
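The masking effect is easy to see with toy numbers. The sketch below uses entirely made-up schools and indicator values (none of these figures come from California's dashboard); it shows only that two very different performance profiles can collapse to an identical composite score once the measures are averaged.

```python
# Hypothetical illustration: two schools with very different profiles
# produce the same composite "grade" when their indicators are averaged.
school_a = {"test_scores": 60, "graduation_rate": 95, "college_attendance": 90}
school_b = {"test_scores": 95, "graduation_rate": 80, "college_attendance": 70}

def composite(school):
    """Single averaged score -- the approach the article argues against."""
    return sum(school.values()) / len(school)

# The averages are identical, yet the schools need very different help.
print(composite(school_a) == composite(school_b))  # True

# A dashboard-style report keeps each measure visible instead.
for name, school in [("A", school_a), ("B", school_b)]:
    print(name, school)
```

A dashboard preserves exactly the per-measure detail that the composite throws away, which is the article's point about guiding improvement efforts.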

Fordham agrees that a variety of measures could also be provided but argues that parents won’t be able to understand multiple measures, so they need one ranking even if it is not accurate. Fordham produces no evidence that parents can’t use a dashboard to push for needed changes. The California PTA found that its parent members liked the dashboard as a more precise method of understanding strengths and weaknesses in their schools and had no difficulty understanding it. Moreover, even if multiple measures are offered, the single ranking will become the main way to judge schools and will crowd out the more useful information, to the detriment of the proper educational response.

I suspect the real reason Fordham advocates a flawed ranking system based on averaging measures is that it has wholly bought into a “test and punish” policy: weaponizing a single grade as a way to put pressure on schools to improve, for districts or states to close “low performing” schools, and to encourage charter expansion.

The strategy of using reading and math test scores and imposing consequences for low performance was the basic policy idea behind No Child Left Behind (NCLB). That program and philosophy did not produce results but did cause large-scale deleterious consequences. NAEP (the National Assessment) scores were climbing before NCLB, slowed down during its first years, and came to a screeching halt in the final years when consequences multiplied. Closing schools has also proven to produce no effect on average but has caused significant collateral damage to communities and families. And Fordham has been an acknowledged advocate of charter expansion.

California views accountability much differently. It is following a “build and support” approach primarily aimed at producing useful information for educators and others to improve the quality of schools. The state policy assumes that teachers and educators are committed to continuous improvement and don’t need to be bludgeoned to get them to improve.

In summary, using California as an example, there is strong evidence that the rankings for the first criterion are backwards: states ranked weak should be ranked strong, and the “strong” states that rely on misleading letter grades should be ranked weak.

Fordham’s second criterion is valid and important: avoid basing test-score measures on reaching a set proficiency level, which encourages schools to concentrate only on those students just below the proficiency cut. Instead, use scaled scores or averages, so that all students contribute to the measure.

Unfortunately, the Fordham review of California was flawed and completely misrepresented the state’s approach. Fordham gave the state a weak designation because, through sloppy staff work, it thought that the state used proficiency levels to determine its measures. It didn’t, and even a cursory look at California’s system would have proven it. The state uses the distance from a “standard met” level, which is fully in keeping with the scaled-score or average approach. David Sapp, deputy policy director and assistant legal counsel for the state board, said the report [Fordham’s rankings] also contained a big error: California already has moved away from the old standard of rating achievement based on the percentage of students who scored proficient. The dashboard measures performance in relation to the point identified as minimum proficiency on the Smarter Balanced math and English language arts tests. It measures how far above or below that point students, on average, scored.
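The difference between the two approaches can be sketched in a few lines. The threshold and scale scores below are hypothetical stand-ins (not the actual Smarter Balanced cut scores): a percent-proficient measure is blind to gains by students well below the cut, while an average distance-from-standard measure credits every student's movement.

```python
# Hypothetical "standard met" threshold -- a stand-in, not the real cut score.
STANDARD_MET = 2500

def avg_distance_from_standard(scale_scores):
    """Average of (score - threshold): every student's score contributes."""
    return sum(s - STANDARD_MET for s in scale_scores) / len(scale_scores)

def percent_proficient(scale_scores):
    """Old NCLB-style measure: only crossing the cut point counts."""
    return 100 * sum(s >= STANDARD_MET for s in scale_scores) / len(scale_scores)

scores_before = [2400, 2490, 2510]
scores_after = [2460, 2490, 2510]  # lowest scorer improved 60 points, still below the cut

# Percent-proficient sees no change; average distance records the gain.
print(percent_proficient(scores_before) == percent_proficient(scores_after))  # True
print(avg_distance_from_standard(scores_after) > avg_distance_from_standard(scores_before))  # True
```

This is why the distance-from-standard approach removes the incentive to focus only on "bubble" students hovering just below proficiency.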

Fordham’s third criterion, fairness to all schools, has a very helpful basic idea: growth scores on tests are fairer to low-income schools. One of the major problems with NCLB was its reliance on levels of performance, which disadvantaged schools with lower socio-economics and gave a pass to schools with higher-income children. Under that rubric, school scores almost completely tracked the socio-economic level of the school.

However, Fordham’s fix is terribly flawed. It wants growth in student test scores in math and reading to count for at least fifty percent of the total grade of the school. If growth is emphasized to the exclusion of status (the actual performance level), then schools that have historically produced students scoring at high levels are mistakenly identified as mediocre or worse. Imagine a school with low-income students that, after considerable effort, has reached a high plateau of performance and maintained that level for several years. It would unfairly look mediocre or worse on a measure heavily weighted toward growth.

California solved this problem in a clever way, based on what some of the best jurisdictions in the US and Canada have instituted. It uses a quadrant method, so schools get high marks if they produce high scores in either growth or status, the fairest method around. So in reality, growth could count for more than 50% for some schools on this measure. This solution completely escaped the pundits at Fordham, and they gave the state a weak designation for minimizing growth. Furthermore, while California has used cohort growth instead of individual student growth, this is only temporary, until four years of student data are available.
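The quadrant idea can be sketched as a simple rating rule. The cut points, labels, and two-dimensional structure below are illustrative assumptions, not the state's actual rubric (the real dashboard uses five performance colors and detailed cut scores); the point is only that a school is credited on the better of its status and its growth.

```python
# Illustrative quadrant-style rating; cut points and labels are hypothetical.
def quadrant_rating(status, growth, status_cut=70.0, growth_cut=3.0):
    """Rate a school on status level AND growth, crediting either dimension."""
    high_status = status >= status_cut
    high_growth = growth >= growth_cut
    if high_status and high_growth:
        return "blue"    # high level and still improving
    if high_status or high_growth:
        return "green"   # strong on at least one dimension
    return "red"         # low on both: flagged for support

# A high plateau with flat growth is NOT penalized...
print(quadrant_rating(status=85, growth=0.5))  # green
# ...and a low-scoring school with strong growth gets equal credit.
print(quadrant_rating(status=55, growth=6.0))  # green
```

Under a growth-only weighting, the first school would look mediocre despite sustained high performance, which is exactly the unfairness the quadrant method avoids.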

In addition, Fordham made a major strategic error in this criterion. To emphasize growth, its standard requires that the growth score be at least 50% of the total score or grade. If status scores are added at a somewhat smaller percentage to protect schools already achieving at high levels, the overall score becomes essentially an annual math and reading test score. That strategy in NCLB resulted in a profound narrowing of the curriculum, shortchanging history, civics, science, and the arts and humanities. It also produced widespread gaming through extensive test prep, de-emphasizing of quality instruction, and outright cheating, and yet results were still meager or non-existent. It also ignored local measures of quality, which are essential for a realistic picture of school performance. Daniel Koretz, in his recent book The Testing Charade, persuasively demonstrates how placing high stakes on math and reading scores was so devastating. Campbell’s law is still potent: “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.”

Finally, Fordham’s selection of only these three criteria is highly questionable. Why aren’t there measures of the strength of the curriculum or instructional program, of teacher and community engagement, of effective school teams devoted to continuous improvement, of performance gaps among groups, or of social and community support? In its rush for a club with which to beat schools, Fordham has ignored those measures which actually produce results. As an example, effective team building and promoting teacher efficacy produce extremely high effect sizes, large multiples of those achieved by giving letter grades to schools or by the punitive use of testing. Fordham at one time supported such valuable measures, but, unfortunately, the Institute has lately neglected them as it became chained to a much narrower approach.

California has developed some of the strongest efforts in the country in developing and implementing a powerful curriculum, school-site team building and continuous improvement, district support for those efforts, and state policies which enhance them. See the report on standards by Achieve, which gives California the highest rankings for the quality of its standards, frameworks, and instruction. Of course, these policies fly in the face of Fordham’s reliance on a discredited “test and punish” agenda. A piece of advice to Fordham: go back to the drawing board and base your efforts on the best research and proven experience of what works and what doesn’t.


