Be Careful What You Test For... You Might Get It

John Denker

* Contents

1 Some Examples

1.1 GPA

1.2 Teaching to the Test

2 Management Issues

2.1 How to Survive the Test – Or Not

2.2 Extending the Accountability Horizon

2.3 Caveats and Controls

2.4 Prerequisites and Prospects

1 Some Examples

1.1 GPA

Once upon a time, physics major #1 took an elective course in music, namely “orchestration and composition”. Everybody else in the course was a music major.

Meanwhile, physics major #2 took a different elective course, namely “music appreciation”. This involved sitting in class, listening to recordings and discussing them. Music majors are not allowed to take this course, at least not for credit.

Student #1 got a C.

Student #2 got an A.

Which of these students got the better grade? Which got the better education?

This is significant, because there are lots of real-world incentives that depend on GPA ... including school admissions, scholarship money, job offers, et cetera. On the other hand, it is perverse to focus on a good GPA rather than a good education.

1.2 Teaching to the Test

In every grade, the teacher and students are subject to all sorts of incentives to produce a good score on that grade’s standardized test.

It is obvious that teachers will teach to the test to some extent ... and students will study to the test to some extent. Simplifying things a bit, we can identify several cases and sub-cases:

1. In the rare but not-unheard-of case that it’s a high-quality comprehensive test, you just teach to the test, and everybody is happy with that. You spend 100% of the time doing the right thing and 100% of the time teaching to the test.
2. In many cases, the required test is far from comprehensive. All too commonly, it is little more than a trivia test. Little if any reasoning is required, let alone critical thinking. Any principle that comes to mind can be applied directly or not at all. In such a case there are – simplifying again – two main approaches:
   a. The wise teacher will concentrate on using sound methods to teach general-purpose reasoning skills plus a principled approach to the subject matter ... while illustrating the principles with examples taken from the test. In this way you can spend 90% of your time doing things that make sense and 40% of your time teaching to the test, and it adds up to only 100% because about 30% of your time counts toward both. We can call this the strategic approach, the far-sighted approach, the understanding-based approach, or simply the wise approach.
   b. The not-so-wise teacher will be hypnotized by the test, like a rabbit in the headlights. The temptation is to fixate on the items that show up on the test, to the exclusion of the fundamental subject matter and to the exclusion of sound teaching methods. This approach may sometimes yield good test results, but in any case it leaves the students ill-prepared to handle later grades. We can call this the gaming-the-test approach, or the short-sighted approach, or simply the unwise approach.
3. A truly bad test causes lots of problems. I’ve seen plenty of standardized test questions where the required answer was arbitrary, incomprehensible, or outright wrong. This is a disaster, and there is not much the teacher can do to mitigate it. In such a situation, if the students want/need to achieve good test scores, they must spend time learning things that cannot possibly be true. This is the opposite and the enemy of critical thinking.

To repeat: Teachers will teach to the test to some extent. This is either a good thing, a manageable challenge, or a disaster ... depending on details. Much depends on the test, and on how the teacher goes about teaching to the test.

2 Management Issues

2.1 How to Survive the Test – Or Not

Let’s look more closely at the cases mentioned in section 1.2.

Case 1 takes care of itself.
In case 3, the only recourse is to get rid of the lousy test. You can then replace it with something better, or not replace it at all.
Case 2 is the case that demands the most clever management, as we now discuss.

Any teacher with good sense can tell the difference between case 2a and 2b, i.e. the wise approach and the unwise approach to coping with the end-of-year test. However, a bureaucrat who is not in the classroom cannot easily tell the difference, because the distinction is not reflected in the test scores – at least not in the obvious way. Given the current emphasis on scores rather than good sense, this commonly leads to perverse incentives, i.e. situations where teachers feel obliged to game the test in unwise ways.

If you confine yourself to using the test score itself in the direct, obvious way, then there is no reliable way to distinguish the wise approach from the unwise approach. However, there are less-obvious ways of measuring the distinction. It shows up in later grades, and in later life.

Here are a couple of ways things could play out:

The hallmark of a good school system in a disadvantaged area is that the students start out behind national norms in first grade, and then grade-by-grade they catch up and pull ahead.
The hallmark of a bad school system is that grade-by-grade the students do less-and-less well compared to national norms. In such a case, it is profoundly wrong and futile to punish the high school for the bad scores of the high-school students, insofar as they reflect problems that lie elsewhere. Don’t shoot the messenger. Don’t punish the school that detected the problem.
(Of course the latter school may have problems of its own, but that is a separate issue, not necessarily correlated.)

2.2 Extending the Accountability Horizon

This leads to a constructive suggestion: In any grade school that is large enough to have more than one class at each grade level, shuffle the students – randomly – when assigning classes at the beginning of each year. For example, half of the students from class 3A and half of the students from class 3B wind up in class 4A. The other two halves wind up in class 4B. There are then four possibilities.

If the 3A students do better on the third-grade end-of-year test and also do systematically better in 4th grade, that’s great.

If the 3A students do worse on the test but better in 4th grade, it suggests that the 3A teacher is not paying enough attention to the test.

If the 3A students do better on the test but do systematically worse in 4th grade, it suggests that the 3A teacher is paying too much attention to the test and not enough to the fundamentals.

If the 3A students do worse on the test and worse in 4th grade, it suggests there is room for improvement in the 3A classroom. Perhaps the 3A teacher can learn a thing or two from the 3B teacher.
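
The shuffle and the four-way reading above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `reshuffle` and `interpret`, the round-robin dealing, and the `margin` threshold for treating a gap as too small to call are all hypothetical choices, not part of the proposal itself. Deciding what counts as "systematically" better would need a proper statistical test in practice.

```python
import random


def reshuffle(old_classes, seed=None):
    """Randomly redistribute students from last year's classes.

    old_classes maps a class name such as "3A" to a list of student ids.
    Returns one list of students per new class; each new class receives a
    random, near-equal share of every old class.
    """
    rng = random.Random(seed)
    new_classes = [[] for _ in old_classes]
    for students in old_classes.values():
        shuffled = list(students)
        rng.shuffle(shuffled)
        # Deal the shuffled students round-robin across the new classes.
        for i, student in enumerate(shuffled):
            new_classes[i % len(new_classes)].append(student)
    return new_classes


def interpret(test_gap, later_gap, margin=0.0):
    """Read the four cases for class 3A relative to class 3B.

    test_gap:  3A's mean score on the third-grade test minus 3B's.
    later_gap: mean fourth-grade performance of former 3A students
               minus that of former 3B students.
    margin:    assumed threshold below which a gap is not meaningful.
    """
    if abs(test_gap) <= margin or abs(later_gap) <= margin:
        return "no clear difference"
    if test_gap > 0 and later_gap > 0:
        return "great"
    if test_gap < 0 and later_gap > 0:
        return "not enough attention to the test"
    if test_gap > 0 and later_gap < 0:
        return "too much attention to the test"
    return "room for improvement"
```

For example, `interpret(4.0, -2.5)` returns `"too much attention to the test"`: the 3A students scored higher on the third-grade test but did worse the following year.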

This same procedure can be applied to students who graduate from elementary school to middle school, provided the classes they take in middle school are not too strongly correlated with which class they were in during elementary school.

A coarse-grained version of the procedure can be applied on a school-by-school basis when multiple elementary schools feed a single middle school, and when multiple middle schools feed a single high school. You can hold the school as a whole accountable not only for how well the students do on the end-of-year test, but also for how well they do in later years.

2.3 Caveats and Controls

There is no escape from the law of unintended consequences. The measures set forth in section 2.2 do not eliminate all possible ways of gaming the system. In particular, there is an obvious scheme whereby a teacher (or a school) can improve student scores in the short term and the long run, namely by selecting the incoming students.

There is a lot of selection going on already, sometimes for good reasons and sometimes otherwise. Extending the accountability horizon will not make it go away. If you want to reach any halfway-valid conclusions based on test scores, with or without extending the accountability horizon, you need to control for this. Assigning students randomly to one class or another – and randomly re-shuffling them at each year-to-year boundary – helps with some of this.

2.4 Prerequisites and Prospects

The question arises: To what extent is it worthwhile to extend the accountability horizon, along the lines suggested in section 2.2? Well, that depends.

Best case: Given good teachers, no cumbersome assessment is needed and no cumbersome accountability is needed. Just leave the teachers alone. Let them do their job.
Interesting case: Hypothetically, if you decided you really needed a testing program, and if you started with a good (or even halfway decent) test, extending the accountability horizon would make things better. It would make the gaming-the-test approach approximately twice as hard, since it would require gaming this year’s test and next year’s test. Meanwhile, it would place no additional burden on teachers who use the strategic, understanding-based approach.
At some point, people would conclude that gaming the test is not worth the trouble, which is what we want them to conclude.
Worst case: On the other hand, many of the state-mandated year-end tests that I’ve seen are at such a low level that the interpretation of the results is very asymmetrical: A low score is a reliable indication of disaster, whereas a high score doesn’t tell me anything I consider worth knowing.
In a disaster situation, tests are superfluous. A disaster is easy to detect, and there is rarely any need to quantify how disastrous it is. Making the trivia test slightly better won’t help. More importantly, a trivia test may detect a problem, but it won’t tell you much about the causes or possible solutions.
Also: It is easy to foresee a situation where results of each grade’s test are largely uncorrelated with the previous grade’s test. It is hard to draw meaningful conclusions from such a situation.
- Perhaps the test is so dysfunctional that it doesn’t measure anything.
- Perhaps the school is so dysfunctional that at every grade along the way, the teacher assumes that nobody learned anything in the previous grades. (Items that were temporarily learned but not retained don’t count.)
- Perhaps something else.
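
One way to check whether one grade's results track the previous grade's is to compute a correlation coefficient over the same students' scores in consecutive years. The sketch below uses the Pearson coefficient, chosen here purely for illustration; nothing above prescribes a particular statistic, and the function name `pearson` is hypothetical. A value near zero would match the "largely uncorrelated" situation, though it would not say which of the explanations listed above applies.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores.

    xs: each student's score on one grade's test.
    ys: the same students' scores on the next grade's test, in the same order.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are equal")
    return cov / sqrt(var_x * var_y)
```

The result ranges from -1 to 1. With only a handful of students the estimate is noisy, so a real analysis would also want some measure of uncertainty.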

Let’s be clear: In situations where the test is a disaster-detector only, the technique outlined in section 2.2 will be primarily of academic interest, in the worst sense of the word. That’s because staving off disaster is nothing to be proud of. The goals should be much, much higher.

Beware that many of the current tests are so bad that trying to use them in cleverer ways is like re-arranging the deck chairs on the Titanic. The fact that I have mentioned something that could be done using well-behaved test scores must not be taken as an endorsement of the current crop of tests.

I am not opposed to all testing. I am opposed to dumb testing.
