Comments on California Standards Test

1 California Standards Test – Process Issues

I was asked to comment on the Physics Standards Test (reference 1). It is part of the California Standardized Testing and Reporting Program.

Executive summary: The main issue has to do with understanding the process, and fixing the process. How did the system get to be so broken as to produce questions like this? How can we fix it?

As a tangentially-related minor point, we note that the released questions have many flaws. (Details can be found in section 2.) That’s a minor point, because you presumably already knew that. Anybody with rudimentary physics knowledge and teaching ability could see that in an instant. It is not the purpose of this document to quantify just how awful this test is, nor to see how/whether we can improve specific test-questions.

An informative overview of the existing process can be found in reference 2.

The test is called a “Standards Test” but the standardization is not very effective. At the beginning of reference 1 one finds a list of standards, but I see no evidence that the questions uphold those standards.

For that matter, the relevance and even the correctness of the stated standards is open to serious question. This is discussed in section 4.

I am informed that the taxpayers paid ETS (and possibly some other companies) to produce these questions. Well, they should demand a refund. Whoever prepared these questions should be disqualified from doing so in the future.

More generally, whoever prepares questions in the future should provide, for each question, a brief statement as to its rationale. That is, they ought to tell us what the question is supposed to test for. Also the preparer should be responsible for vetting the questions, and should provide evidence of validity and sensitivity. That is, does the question do what it is supposed to do? This includes making sure the intended answer really is better than the distractors, and that each of the distractors really serves some purpose.

It is also important to consider the standard as a whole. It is unacceptable for a question to address one narrow part of the standard, while contravening other more-important parts of the standard (notably the parts that implicitly and explicitly expect students to be able to think and reason effectively).

Similarly it is important to consider the test as a whole. There are important questions of balance and coverage, as discussed in section 3. In particular, even if some plug-and-chug questions are individually acceptable, a heavy preponderance of such questions, to the near-exclusion of questions that require even a modicum of reasoning, is collectively unacceptable.

Process must not be used as an excuse. That is, to turn the proverb on its head, the means do not justify the ends. It is preposterous to argue that the questions must be OK because they resulted from an established standards-based process. The unalterable fact is that the questions are not OK. Somebody has got to take responsibility for this.

I mention this because some folks involved in the process have adopted the attitude that

“If an item assesses a standard, it’s good to go on the test”.

Talk like that makes my hair stand on end! It’s a complete cop-out, i.e. it is a lame excuse for acting without good judgment, without common sense. It ignores the truism that no matter what you’re doing, you can do it badly, and sure enough, we can see in section 2 many items that badly address various aspects of the standard.

The facts of the matter are:

a) The standards do not produce the test. People produce the test. These people need to take responsibility for the product.
b) If it’s a good test, the standards do not get the credit.
c) If it’s a bad test, the standards do not get the blame.
d) It is possible to create a good set of questions, with far fewer deficiencies than we see in section 2, within the existing standards.
e) There are opportunities to improve the standards in certain ways, as discussed in section 4, but doing so will not change facts (a), (b), (c), and (d).

As an example of something that has much more potential for improving the process, here’s a suggestion: Dual-source the questions. That is, have two companies submit questions.

A crucial ingredient in this arrangement is to pay each company in proportion to how many of its questions are selected for use.

This provides each company with a financial incentive to submit top-quality questions.
That in turn means the selection committee will be less likely to find itself in the unenviable position of being forced to choose the best of a bad lot.
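
The proportional-payment rule can be sketched in a few lines. This is only an illustration of the arithmetic, not a proposed contract: the function name `split_payment`, the vendor names, the budget, and the selection counts are all hypothetical.

```python
from fractions import Fraction


def split_payment(budget, selected_counts):
    """Divide `budget` among vendors in proportion to how many of each
    vendor's questions were selected for use.

    `selected_counts` maps a (hypothetical) vendor name to the number of
    its questions that the selection committee accepted.
    """
    total = sum(selected_counts.values())
    if total == 0:
        raise ValueError("no questions were selected, so there is nothing to split")
    # Exact rational arithmetic avoids rounding drift in the shares.
    return {
        vendor: Fraction(budget) * Fraction(count, total)
        for vendor, count in selected_counts.items()
    }


# Hypothetical numbers, for illustration only.
payments = split_payment(100_000, {"vendor_a": 45, "vendor_b": 15})
for vendor, amount in payments.items():
    print(vendor, float(amount))
```

With these made-up numbers, the vendor that supplied three quarters of the selected questions receives three quarters of the budget, which is the incentive the arrangement relies on.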

If a vendor objects to such a dual-sourcing arrangement, it is a dead giveaway that they don’t have confidence in their own product. You don’t want to do business with such a vendor.

Test printing, scoring, and analysis should be covered by a separate contract, separate from composing the questions.

2 Comments on Specific Test Materials

There is a bank of questions that are eligible to be used on California Standards Tests. Each year, after the tests have been administered, 25% of the questions are released from the bank. The released questions are not used again. For an index of all released questions, see reference 3.

It is unclear whether the released physics questions (reference 1) are representative of the bank. The released questions are so bad that one hopes they are not representative. It is astonishing to see such a high concentration of objectionable questions. Perhaps they are the culls, removed in order to improve the bank. (Releasing the culls would create the paradoxical situation where doing the right thing makes matters seem worse than they are.)

2.1 Test Questions

The question numbers here conform to the 2006/2007 version of reference 1.

Question 3: In technical usage – notably in the expression “freely falling reference frame” – the word “falling” does not necessarily mean falling downward. That means we should consider sideways or upward trajectories, in which case answer B works at least as well as answer A.

As a result, this question could well have negative discriminatory power. See discussion of this point under question 17.

Furthermore, some skilled teachers have taught their students to adopt the policy that “human error” is too vague to be acceptable as an explanation for anything. Therefore some test-takers will reject answer A out-of-hand.

It is pointless to argue whether these are strong objections or weak objections. Instead, the convenient and 100% proper way to proceed is to re-word the question to eliminate the objections. In this case a good starting place would be to speak of an object “dropped from rest” (as opposed to merely “falling”).

This just goes to support the larger point that it is important to have a timely, systematic vetting process. Also it is important to have a process for improving questions, not just selecting questions, because choosing the best of a bad lot is unsatisfactory.

Question 4: This is a remarkably poor question. There are twin weaknesses.

The obvious major weakness is that the data could be equally well described by greater mass or less air resistance. The counter-argument that the object appears smaller and therefore “shouldn’t” have greater mass is untenable, because it is based on the undocumented assumption that the objects “should” have comparable densities. Forsooth, if they had the same density, object A would have a smaller terminal velocity, because of a smaller volume-to-surface ratio.
On the other side of the same coin, the other major weakness is that the question tries but fails to capture the spirit of real-world problem solving. In the situation as described, it would not typically be necessary to determine mass by looking at the picture. It would make more sense to weigh the objects, or ask what they are made of, et cetera. In the real world, one would also have to consider hypotheses not mentioned, such as updrafts, inconsistent strobe settings, et cetera.
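
The density argument under question 4 can be checked with a rough calculation. The sketch below assumes solid spheres, quadratic (high-speed) air drag, and negligible buoyancy; the air density and drag coefficient are illustrative defaults, not values taken from the test. Under those assumptions, weight scales as radius cubed while drag area scales as radius squared, so terminal speed scales as the square root of the radius at fixed density.

```python
import math


def terminal_speed(radius, density, air_density=1.2, drag_coeff=0.47, g=9.8):
    """Terminal speed (m/s) of a solid sphere falling in air.

    Assumes quadratic drag and neglects buoyancy: the weight m*g is
    balanced by the drag force 0.5 * air_density * drag_coeff * area * v**2.
    All default values are illustrative assumptions (SI units).
    """
    mass = density * (4.0 / 3.0) * math.pi * radius**3
    area = math.pi * radius**2  # cross-sectional area facing the flow
    return math.sqrt(2.0 * mass * g / (air_density * drag_coeff * area))


# Same density: the smaller sphere has the smaller terminal speed.
small = terminal_speed(radius=0.01, density=1000.0)
large = terminal_speed(radius=0.04, density=1000.0)
print(small / large)  # ratio of speeds at equal density

# Different density: a small dense sphere can outrun a large light one.
small_dense = terminal_speed(radius=0.01, density=8000.0)
print(small_dense > large)
```

So the picture alone cannot settle the matter: a smaller object falls more slowly if the densities match, and faster if it is sufficiently denser, which is why the "comparable densities" assumption needs to be stated rather than left implicit.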

Question 5: This is another remarkably poor question. The obvious criticism is that the question is unanswerable. The expectation could be based on formal theory or it could be based on direct observation. Real scientists do not make the distinction between “theory” and “hypothesis” that is implied here – see reference 4 for more on this.

Furthermore, the stem of the question contains strange misconceptions and anomalies. One normally applies voltage (not current) to light-bulb circuits, for a number of very good reasons. (One might apply current to a magnet coil, or to the base of a transistor, but those are significantly different physical situations.) Indeed applying current to a series circuit containing a chunk of rubber might be downright dangerous. On an advanced test, I might not object to a question-stem containing obiter dicta that cause confusion, but the confusion here is so pointless and so out of balance with the chug-and-plug questions that make up the rest of the test that I have to assume it is simply another mistake, another question drawn up by somebody who didn’t really understand the subject matter.

I am reminded of the proverbial expression “all hat, no cattle” (referring of course to someone who dresses like a cowboy but has not the slightest understanding of real cows or real cowboys). This question, like the previous one, appears to have been designed by someone who likes to talk about “the scientific method” but has no idea how real science is done ... likes to talk about “physics” but has no real understanding of the subject matter ... and likes to talk about “teaching” but has no idea how real teaching is done.

Question 7: There are two possibilities:

Case (a): We assume all test-takers already know how many feet there are in a yard, or
Case (b): We are testing to see whether they do.

In either case, this is an objectionable question.

In case (a): What’s the point? This is a very low-level plug-and-chug question. Why put it on the test? Are we testing for the ability to perform arithmetic at the 3rd-grade or 4th-grade level?
In case (b): This is blatantly culture-biased. Test-takers from overseas may not know how long a yard is. Mentioning “football” only compounds the offense, because that term has a culture-dependent meaning. I note that the tables at the back of the test define some standard units including the joule ... but do not define the non-standard unit “yard”. What’s the logic in that?

Question 9: What’s the point? Is there any reason to believe this question has any appreciable sensitivity?

Question 11: This tests only rather low levels of understanding (not quite the lowest possible levels).

Question 13: Again, this tests only the almost-lowest levels of understanding. As far as I can tell, it discriminates against the worst sort of mindless plugging-and-chugging, but not much more.

Question 15: This question was extensively discussed on the Phys-L discussion group. The consensus found this question to be highly objectionable. The four answers are so absurd and bizarre that one cannot imagine any rational basis for deciding which is “best”.

I am informed that the stem of the question should have asked about “speed” not “velocity”. Even with that explanation, the situation is well-nigh incomprehensible.

In general, it is always possible for mistakes to occur during the initial design of the question ... but it is something else entirely that the mistake went uncaught. It indicates a highly dysfunctional quality-control process.
In this particular case, it is rather hard to understand how anybody in the field could make this particular mistake, since speed and velocity are such central, important topics in high-school physics.

Question 17: This is not the worst question I’ve ever seen, but it’s not the best, either.

A relatively minor weakness is that one of the alternatives can be dismissed out of hand, so what appears to be a one-out-of-four choice is more like a one-out-of-three choice. This is a common weakness on poorly-designed tests.
A more interesting weakness is that this question could easily have negative discriminatory power ... that is, it wouldn’t surprise me to find the score on this question anti-correlated with overall proficiency in the subject. An unsophisticated test-taker would naturally answer the question using a terrestrial reference frame, in which case C is the right answer, whereas a more-sophisticated customer might imagine that the satellite had just been released from a space shuttle, and in the shuttle’s reference frame it is weightless, making A the right answer. Indeed answer A is the preferable answer in terms of 20th-century physics, which emphasizes freely-falling reference frames. You could argue that test-takers who know enough physics to prefer answer A also know enough sociology to realize that whoever constructed this test isn’t sophisticated enough to appreciate a 20th-century answer, so the 17th-century answer must be the expected one. As a result, rather than testing mastery of the subject matter, we are testing for perverse sociology and test-taking skills per se. Question 3 suffers from a similar defect. How would you like to be put in a position where you will be rewarded for giving answers you know to be wrong, and penalized for giving answers you know to be right? Again, this is symptomatic of a poorly designed test.
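
One common way to check whether an item has negative discriminatory power is to correlate the score on that item with the score on the rest of the test. The following is a minimal sketch of that idea, using an ordinary Pearson correlation on 0/1 item scores; the function names and the response matrix are hypothetical, and I have no access to the actual response data.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a score has no variation")
    return sxy / math.sqrt(sxx * syy)


def item_discrimination(responses, item):
    """Correlate one item's 0/1 score with the total on all *other* items.

    `responses` is a list of rows, one per test-taker; each row is a list
    of 0/1 scores, one per question. A negative result means stronger
    test-takers tend to get this item "wrong".
    """
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_scores)


# Hypothetical data: six test-takers, four questions. The last question
# is "missed" mostly by those who do well on everything else.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
for item in range(4):
    print(item, round(item_discrimination(responses, item), 2))
```

An item whose index comes out near zero contributes little, and one that comes out negative is actively working against the purpose of the test. Excluding the item itself from the total keeps it from being correlated with its own score.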

Question 18: No serious objections, although C is a pretty lame distractor.

Another question on the same topic appears on the ETS web site (reference 5), as an example of an SAT question. The SAT version is markedly better. Why did California get saddled with the dumbed-down version?

Question 19: Plug and chug, especially considering that the definition of kinetic energy is given in the appendix.

Question 21: Plug and chug.

Question 24: No objections.

Question 26: Plug and chug. Ineffective distractors, resulting in negligible sensitivity.

Question 28: This is objectionable for two reasons.

At the tactical level, the distractors are ineffective ... it’s obvious what the desired answer is.
At the strategic level, we must realize that giving a test has consequences. The primary intended consequence is to evaluate the test-takers ... but another unavoidable and usually-desirable consequence is to teach them something in the process. Alas this question teaches the wrong thing. The law of conservation of momentum should be expressed in such a way that momentum always obeys a strict local conservation law. This is how it should be taught, and this is how it should be learned. This question needlessly reinforces possible confusion involving the idea of “conserved” versus the idea of “unchanging”. Momentum is always strictly conserved, even if it is flowing out of one subsystem into another. For details on the important distinction between conserved and unchanging, see reference 6.
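
The distinction between "conserved" and "unchanging" can be made concrete with a simple calculation. The sketch below assumes an idealized one-dimensional elastic collision between two carts; the masses and velocities are made-up numbers. The momentum of each cart changes, yet whatever leaves one cart shows up in the other, so the total is the same before and after.

```python
import math


def elastic_collision(m1, v1, m2, v2):
    """Final velocities after an idealized 1-D elastic collision."""
    v1_final = ((m1 - m2) * v1 + 2.0 * m2 * v2) / (m1 + m2)
    v2_final = ((m2 - m1) * v2 + 2.0 * m1 * v1) / (m1 + m2)
    return v1_final, v2_final


# Hypothetical carts: a 2 kg cart moving at 3 m/s hits a 1 kg cart at rest.
m1, v1, m2, v2 = 2.0, 3.0, 1.0, 0.0
v1_final, v2_final = elastic_collision(m1, v1, m2, v2)

change_cart1 = m1 * v1_final - m1 * v1  # momentum that left cart 1
change_cart2 = m2 * v2_final - m2 * v2  # momentum that entered cart 2

print("cart 1 momentum change:", change_cart1)
print("cart 2 momentum change:", change_cart2)
print("total conserved:", math.isclose(change_cart1 + change_cart2, 0.0, abs_tol=1e-12))
```

Neither cart's momentum is unchanging, but momentum is conserved throughout: it has merely flowed from one subsystem into the other.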

Question 29: Plug and chug. Ineffective distractors ... is anybody really likely to choose D?

Question 31: What’s the point? Is it really possible to get this wrong? If somebody gets it wrong, what do we infer from that?

Question 35: No objections.

Question 36: No objections.

Question 37: No objections.

Question 38: Ineffective distractors. Does anybody really think that smashing a coffee-cup with a hammer increases the “order”?

Also, this uses “order” as a stand-in for “entropy”, which is not really correct, for reasons discussed in reference 7.

Question 42: Rote memory.

Question 44: This is at best a one-out-of-two choice, not a one-out-of-four choice. Also, the question is masquerading as a real-world experimental scenario, but how often do you measure a solid that has a speed of sound as low as 1000 m/s? Cork is an example, but there aren’t many such.

Question 47: This is at best a one-out-of-two choice, i.e. two of the distractors are ineffective.

Question 48: Rote memory. Ineffective distractors.

Question 49: Small sensitivity, since it’s obvious that answer D is what is wanted ... even though answer B would be physically correct in almost all practical situations. The other two distractors are ineffective.

Question 50: Two of the three distractors are lame. (This question could have been made pretty good, with a little reworking. Again this illustrates the point that a process that involves choosing the best of a bad lot is a process in urgent need of improvement.)

Question 52: Plug and chug. The fact that the battery and second resistor are unlabeled is a dead giveaway that they are irrelevant.

Question 56: I’m not sure that the nonclassical spin of an electron counts as “motion” in the usual sense ... so the desired answer can be seen as perpetuating a misconception. See discussion of strategy under question 28.

I also wonder about distractor A. Although there is a somewhat-common misconception concerning the northness of north poles and the southness of south poles, I suspect anyone who is sophisticated enough to know the vocabulary word “monopoles” doesn’t suffer from this misconception. It seems as if the test-makers were trying to show off, trying to show me that they knew about monopoles, but alas doing so in a way that weakened the test. All hat, no cattle.

Question 59: What’s the point? This question seems to be testing for rote-level familiarity with the word “plasma”.

2.2 Reference Sheet

In conjunction with the test-question booklet, students are given a “Physics Reference Sheet” containing “Formulas, Units, and Constants”.

Among these is the formula

ΔS = Q / T     [allegedly]     (1)

which is really quite objectionable, for reasons discussed in reference 7.

The released questions do not make use of this formula, but one may reasonably fear that some of the unreleased questions do.

3 Coverage and Balance

When viewed as a whole, a well-designed test must meet a number of global requirements, such as requirements as to coverage and balance. That includes making sure that none of the important topics are skipped or unduly under-weighted.

Checking individual questions one by one is not good enough. Checking to see that an individual question is “acceptable” is nowhere near sufficient for achieving acceptable coverage and balance. The existing test-construction process pays far too little attention to coverage and balance. The individual questions, as bad as they are, are not the biggest problem. The lack of coverage and balance is a more serious and more deep-seated problem.

The idea of balance applies not just to individual bits of domain-specific knowledge, but also to higher-level goals. As an example of a higher-level goal, consider question 6 on the corresponding chemistry document (reference 8). In reference to figure 1 it says: «The chart ... shows the relationship between the first ionization energy and the increase in atomic number. The letter on the chart for the alkali family of elements is: ....»

Figure 1: Ionization Energy Trends

This question requires knowing the definition of “alkali family” but does not ask for a rote recitation of the definition. Similarly it requires knowing the definition of “first ionization energy” but does not ask for a rote recitation. Thirdly it requires some minimal skill in interpreting a graph. Therefore this question has a special role to play as part of the overall test:

If you are interested in testing the student’s ability to think, to combine ideas, as opposed to plugging and chugging, then this is a commendable question. The question is not difficult if you know the material, but it does require knowing the material and it also requires a multi-step thought process.

In contrast, if you aren’t trying to test for thinking skills, this question is sub-optimal, because its score is difficult to interpret: If the student gives a wrong answer, you won’t know which of the various steps went wrong.

I am not saying that all questions should require combining multiple ideas. I am saying there needs to be a balance between checking basic factoids one-by-one and checking for higher-level thinking skills.

The alkali question has some remarkable weaknesses to go along with the strengths mentioned above. For one thing, the chart has four Ws but only three Xs, two Zs, and two Ys ... even though it would have been the easiest thing in the world to add a third Y and/or remove one of the Ws. This telegraphs that W is the “interesting” thing. A student who doesn’t know the material but is test-wise will pick up on this. I cannot imagine how this weakness arose, or how it slipped through the question-selection process.
Actually there are multiple excellent reasons for removing one of the Ws. Calling hydrogen an alkali metal represents the triumph of dogma over objective reality. For details on the placement of hydrogen, see reference 9.

If the released questions are representative, they indicate that the physics test has serious coverage and balance problems:

Vast parts of the standards are left uncovered.
The test is seriously unbalanced toward rote learning as opposed to thinking.
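
A coverage-and-balance check of this kind is easy to mechanize once each question has been tagged with the standards it addresses and the kind of thinking it demands. The sketch below is only an illustration: the standard labels (`S1` to `S5`), the question tags, and the function name `coverage_report` are all hypothetical, and the tagging itself still requires human judgment.

```python
from collections import Counter


def coverage_report(standards, items):
    """Return (uncovered standards, share of items of each kind).

    `standards` is a list of standard labels. `items` is a list of dicts,
    each with a "standards" list and a "kind" such as "rote",
    "plug-and-chug", or "reasoning".
    """
    covered = {label for item in items for label in item["standards"]}
    uncovered = sorted(set(standards) - covered)
    kinds = Counter(item["kind"] for item in items)
    shares = {kind: count / len(items) for kind, count in kinds.items()}
    return uncovered, shares


# Hypothetical tagging, for illustration only.
standards = ["S1", "S2", "S3", "S4", "S5"]
items = [
    {"id": "Q1", "standards": ["S1"], "kind": "plug-and-chug"},
    {"id": "Q2", "standards": ["S1"], "kind": "plug-and-chug"},
    {"id": "Q3", "standards": ["S2"], "kind": "plug-and-chug"},
    {"id": "Q4", "standards": ["S2", "S4"], "kind": "reasoning"},
]

uncovered, shares = coverage_report(standards, items)
print("uncovered standards:", uncovered)
print("share by kind:", shares)
```

The point of such a report is that it looks at the test as a whole: every one of these hypothetical questions might pass an item-by-item review, and the report would still show two standards untouched and three quarters of the test devoted to plug-and-chug.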

4 The Standards Themselves

The full, official physics standards can be found in reference 10. You may wish to compare them to the standards for related subjects, such as math (reference 11), algebra (reference 12), science (reference 13), and chemistry (reference 14). There is even a standard for “Investigation & Experimentation” (reference 15). The state also provides some “framework” documents in areas such as science (reference 16) and math (reference 17) to explain and elaborate upon the standards.

The standards need to be explicit about the following:

The primary, fundamental, and overarching goal is that students should be able to think, to reason effectively. This is far more important than any single bit of domain-specific knowledge.

Clearly it couldn’t hurt to say that, but you may be wondering whether it is really necessary.

To be fair, the existing standards hint at this. Many of the items in the PHIE1 cluster can be seen as means toward this end.
Alas it appears that not everyone is playing fair. These standards are being subjected to legalistic wise-guy micro-interpretation.

Therefore, yes, it is important to stop pussyfooting around and explicitly make “thinking” the primary, fundamental, and overarching goal.

Moving now to a lower level, the tactical level, the standards fail to mention the great scaling laws. In 1638, Galileo wrote a book On Two New Sciences. In it, he made heavy use of scaling laws. The scaling laws are simultaneously more profound, more age-appropriate, and more readily applicable than many of the topics that are mentioned in the standards.

Except for those omissions, and except for a few howlers mentioned below, the standards themselves seem reasonable. Although they could be improved, they are not the rate-limiting step. As mentioned in section 1, it is perfectly possible to make a good set of questions, much better than the questions discussed in section 2, within the current standards.

On the other side of the same coin, we should keep in mind the fact that you can’t make anything foolproof, because fools are so ingenious. No matter how good the standards are, they will never be foolproof or abuse-proof.

Cluster PHIE1:
PHIE1.a) No released questions address this point.
PHIE1.b) No released questions successfully address this point.
PHIE1.c) How does this standard differ from the previous one?
PHIE1.d) No released questions successfully address this point.
PHIE1.e) No released questions address this point.
PHIE1.f) This makes no sense. First of all, the word theory has two conflicting meanings, both of which have been accepted for thousands of years, and both of which are relevant to this situation. (Hypothesis also has multiple meanings, but only one is relevant here.) Secondly, successful practicing scientists do not draw any important distinction between hypothesis and theory. See reference 4 for a well-reasoned discussion of scientific methods, including the confusion surrounding the word “theory”.
PHIE1.g) No released questions address this point.
PHIE1.h) No released questions address this point.
PHIE1.i) No released questions address this point. Question 44 directly violates this criterion.
PHIE1.j) No released questions address this point.
PHIE1.k) No released questions address this point.
PHIE1.l) No released questions address this point.
PHIE1.m) No released questions address this point.
PHIE1.n) No released questions address this point.
Cluster PH1: The released questions fail to address some of the points.
Cluster PH2:
- The standard should be revised to include the notion that the rest energy (mc²) is included in the energy. This conversion of mass to other forms of energy is already covered by the chemistry standard (reference 8), so it would be foolish to divorce energy from mass on the physics standard.
- The standard should be revised to include a clear statement that momentum always obeys a strict local conservation law. Ditto for conservation of energy.
- The released questions fail to address some of the points in the existing standard.
Cluster PH3:
- Item PH3.c is just plain wrong. Not all thermal energy is kinetic.
- Item PH3.d is just plain wrong. Energy levels are not distributed uniformly (eventually or otherwise).
- Item PH3.e defines entropy in terms of disorder versus order. This is not correct, for reasons discussed in reference 7. This notion is widely used as a crutch by those who don’t understand what entropy is – but wide usage does not make it any less incorrect.
- Missing item: The standard should be revised to include a precise, specific statement of the second law, namely strict local paraconservation of entropy. You can’t do thermo without the second law!
Cluster PH4: The released questions fail to address most of the points.
Cluster PH5: The released questions fail to address most of the points.

5 References

1. California Standards Test – Released Questions (physics). The 2004/2005 version of the rtqphysics document (which was the basis for the first draft of this document) is archived at:
http://web.archive.org/web/20050325132956/www.cde.ca.gov/ta/tg/sr/documents/css05rtqphysics.pdf
I call this the 2004/2005 version, because it appeared in early 2005 and covers questions used on the 2004 and earlier tests.
The corresponding “current” document is at: http://www.cde.ca.gov/ta/tg/sr/documents/rtqphysics.pdf
This analysis has been revised so that the question-numbering conforms to the 2006/2007 rtqphysics document, i.e. the one released in January 2007 and covering questions administered to students in 2006 and earlier years.

2. Dean Baird, “CST Schoolhouse Rock” http://phyzblog.blogspot.com/2007/08/cst-schoolhouse-rock.html

3. Index of all released questions (various topics and grade levels) http://www.cde.ca.gov/ta/tg/sr/css05rtq.asp

4. John Denker, “Scientific Methods” ./scientific-methods.htm

5. Test question: what happens when the string breaks? http://www.collegeboard.com/student/testing/sat/lc_two/phys/prac/prac08.html?phys

6. John Denker, “Conservation as related to Continuity and Constancy” ./conservation-continuity.htm

7. John Denker, “The Laws of Thermodynamics” ./thermo-laws.htm

8. California Standards Test – Released Questions (chemistry). The 2004 version is archived at: http://web.archive.org/web/20050325132956/www.cde.ca.gov/ta/tg/sr/documents/css05rtqchem.pdf
and the “current” version is at: http://www.cde.ca.gov/ta/tg/sr/documents/rtqchem.pdf

9. John Denker, “Periodic Table of the Elements – Cylinder with Bulges” www.av8n.com/physics/periodic-table.htm

10. California Department of Education, “Physics” (9-12 standards) http://www.cde.ca.gov/be/st/ss/scphysics.asp

11. California Department of Education, “Mathematics” (K-12 standards) http://www.cde.ca.gov/be/st/ss/mthmain.asp

12. California Department of Education, “Algebra 1” (8-12 standards) http://www.cde.ca.gov/be/st/ss/mthalgebra1.asp

13. California Department of Education, “Science” (K-12 standards) http://www.cde.ca.gov/be/st/ss/scmain.asp

14. California Department of Education, “Chemistry” (9-12 standards) http://www.cde.ca.gov/be/st/ss/scchemistry.asp

15. California Department of Education, “Investigation & Experimentation - Grades 9 to 12” (standards) http://www.cde.ca.gov/be/st/ss/scinvestigation.asp

16. California Department of Education, “Science Framework” (K-12) http://www.cde.ca.gov/re/pn/fd/documents/scienceframework.pdf

17. California Department of Education, “Mathematics Framework” http://www.cde.ca.gov/ci/ma/cf/index.asp
