© 1992-2008 by Glenn Elert
All Rights Reserved -- Fair Use Encouraged
We tend to be better at concocting excuses for standardized tests than we are at making sense of their results. Faced with the baffling complexity of human thought, we look for "objective" methods in hopes of a direct route to its assessment. No one would presume to describe a student's mind in a single sentence; but we are confident that a number can say it all. Every year, 1.7 million students subject themselves to the Scholastic Aptitude Test at the request of college admissions officers looking for just such a short cut (Milwaukee Journal). Faced with a diverse array of applicants, 92% of American colleges require SAT scores on the assumption that they provide a method of equating students with differing academic backgrounds but identical grade point averages (Hartnett & Feldmesser 4). Educational Testing Service (ETS), which produces the SAT and four other entrance exams1, claims that the test measures not just how capably individuals answer analogy and geometry questions, but how capably they will perform in the academic world. A review of the literature tells a different story. The SAT is not a measure of how successful one will be in college, but how well one conforms to the demographics of the group that did well on the first exam.
The original SAT was offered in 1926. The format was much the same then as it is now: two 30-minute "verbal" sections on "vocabulary, verbal reasoning, and reading comprehension;" two 30-minute "math" sections on "arithmetic, algebra, and geometry;" and an additional 30-minute "experimental" section (verbal or math) used to equate the exam with previous versions of itself and to pre-test questions that might appear on future exams2 (CEEB; 1991: 3). The experimental section is not identified on the test and does not count towards a candidate's score. Scores range in value from 200 to 800 points for both the math and verbal sections. Between 1926 and 1941, the scores were readjusted to produce an average of 500 points. All versions of the test subsequent to 1941 are equated to one another using the experimental section. The first SAT was developed for the College Entrance Examination Board (CEEB) by Carl Campbell Brigham, who had previously participated in the development of the "Army-Alpha" intelligence tests. The College Board, the Carnegie Foundation, and the American Council on Education (all of which still exist) consolidated their standardized testing programs under the Educational Testing Service in 1947 (adapted from Owen and Nairn).
Colleges derive predictions of applicants' performance from regression equations based on the performance of the previous year's students. The ability to predict freshman grades is the backbone of the SAT's claim to measure aptitude. In the world of psychometrics, the valid aptitude test is one that can predict a person's performance. A perfect prediction would be 100% accurate. A physical measurement, such as that provided by a thermometer or a ruler, generally delivers accuracy of 95% or more. The SAT, according to figures compiled by Ford and Campos of ETS, ranges in accuracy from 8 to 15% in the prediction of freshman grade point average3 (11). This means that, on the average, for 88% of the applicants (though it is impossible to know which ones) an SAT score will predict their grade rank no more accurately than a pair of dice. The best known record of prediction by the SAT, reported in a 1978 ETS survey of studies, was at a New Jersey college where the 1978 SAT-Verbal would have been matched as a predictor by random chance only 59% of the time. The worst result was reported at a university in Indiana where chance would have predicted grades as well as the 1972 SAT-Verbal 99.96% of the time (Breland & Minsky 149, 153).
As dismal as these self-reported results are, the actual results are lower still. ETS has been known to take liberties with correlation coefficients for use in defending the SAT. Slack and Porter, in a 1980 Harvard Educational Review article, showed that Ford and Campos consistently misreported validity calculations in an apparent effort to make the SAT look better. Ford and Campos found average predictive accuracies of 16% for SAT-Verbal, 12% for SAT-Math, and 25% for high school record (11). But when Slack and Porter redid the arithmetic they found actual values of 14%, 10%, and 27% respectively (165). The errors were quite systematic, and always in favor of the SAT. Previous grades are thus about twice as good as the SAT at predicting academic achievement.
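The comparison of these percentages can be checked with a few lines of arithmetic. The sketch below assumes that each "predictive accuracy" figure is a squared correlation coefficient expressed as a percentage, which is a common convention but is not stated outright in the text. The function names are illustrative only.

```python
import math

# Slack and Porter's recalculated values, as quoted in the text.
# Assumption: each figure is a squared correlation coefficient (r**2 * 100).
PREDICTIVE_ACCURACY = {
    "SAT-Verbal": 14.0,
    "SAT-Math": 10.0,
    "High school record": 27.0,
}


def implied_correlation(percent):
    """Return the correlation r implied by an accuracy of `percent` (r**2 * 100)."""
    return math.sqrt(percent / 100.0)


def relative_strength(percent_a, percent_b):
    """Return how many times larger predictor A's accuracy is than predictor B's."""
    return percent_a / percent_b


if __name__ == "__main__":
    for name, percent in PREDICTIVE_ACCURACY.items():
        print(f"{name}: {percent:.0f}% -> r = {implied_correlation(percent):.2f}")
    hs = PREDICTIVE_ACCURACY["High school record"]
    for name in ("SAT-Verbal", "SAT-Math"):
        ratio = relative_strength(hs, PREDICTIVE_ACCURACY[name])
        print(f"High school record vs {name}: {ratio:.1f} times the accuracy")
```

Under that assumption, the high school record comes out at roughly 1.9 times the accuracy of the SAT-Verbal and 2.7 times that of the SAT-Math, which is consistent with "about twice as good."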
Although the SAT is an inferior predictor relative to high school grades, it can increase the accuracy of prediction when used in combination with them. This has been the main justification for requiring the tests for admissions. Data from several validity studies, however, indicate that inclusion of SAT scores improves prediction by an average of only 5% or less (Nairn 66). The major reason that the benefits are so low is that the SAT provides redundant information. Gottfredson and Crouse argued that at least 90% of the decisions to admit or reject a student are the same whether the SAT is used in conjunction with high school rank or not. "SAT scores and high school rank," they said, "are moderately correlated with each other [from 0.4 to 0.5] so that outcomes predicted from high school rank alone have a part-whole correlation of at least 0.8 with outcomes predicted from rank plus SAT" (368).
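The part-whole correlation that Gottfredson and Crouse describe follows from the correlation between the two predictors. The sketch below assumes standardized scores and a prediction that is a simple weighted sum of rank and SAT. The weights are hypothetical, since the text does not give the regression weights any college actually used.

```python
import math


def part_whole_correlation(r_hsr_sat, w_hsr=1.0, w_sat=1.0):
    """Correlation between w_hsr*HSR and w_hsr*HSR + w_sat*SAT.

    Assumes HSR and SAT are standardized (mean 0, variance 1) and
    correlate with each other at r_hsr_sat. The weights are hypothetical.
    """
    covariance = w_hsr * (w_hsr + w_sat * r_hsr_sat)
    sd_part = abs(w_hsr)
    sd_whole = math.sqrt(w_hsr ** 2 + w_sat ** 2 + 2 * w_hsr * w_sat * r_hsr_sat)
    return covariance / (sd_part * sd_whole)


if __name__ == "__main__":
    for r in (0.4, 0.5):
        equal = part_whole_correlation(r)
        rank_heavy = part_whole_correlation(r, w_hsr=2.0, w_sat=1.0)
        print(f"r = {r}: equal weights {equal:.3f}, rank weighted double {rank_heavy:.3f}")
```

With equal weights the part-whole correlation is about 0.84 when the two predictors correlate at 0.4 and about 0.87 at 0.5. Weighting rank more heavily, as its stronger record would suggest, pushes the value higher still, so the "at least 0.8" figure is plausible under these assumptions.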
Marginal as they are, the predictions of first year grades are the test's most accurate forecasts. Correlations between scores and grades in later years, and overall college average, are lower still. One study found that the ability of college admission tests to predict grades declined consistently from one semester to the next throughout eight semesters (Humphreys). The virtual disappearance of the aptitude tests' ability to predict beyond the freshman year has been explained by some commentators as a result of the nature of advanced study. Multiple choice testing predominates in introductory courses, they argue, but intermediate and advanced courses demand a broader range of performance.
An even better standard of scholastic success is staying in school. Alexander W. Astin suggested that:
In a very practical sense, the student's ability to stay in college is a more appropriate measure of his success than is his freshman GPA. Although it is true that good grades will help him gain admission to graduate school, to win graduate fellowships, and even to secure certain types of jobs, they are irrelevant to any of these outcomes if the student drops out of college before completing his degree requirements (14-15).
Astin found that using SAT scores to predict who will graduate resulted in 3.2% of perfect prediction for men and 2.9% for women (17-18). This means that for over 95% of the cases, random selection would predict the odds of remaining in school as well as the SAT. "Whether or not the student will drop out of college after the freshman year," Astin noted, "can be predicted with only a low degree of accuracy" (20).
Crouse and Trusheim, in their book The Case Against the SAT, conducted the most detailed statistical analysis of the SAT's predictive shortcomings. Using data from the National Longitudinal Study (NLS) of the high school class of 1972, they calculated the number of additional correct admissions using high school rank (HSR) alone and with the SAT. With four different measures of undergraduate success, they calculated that using the SAT in admissions adds between 0.1 and 2.7 additional correct forecasts per 100 applicants (see Table 1).
Table 1: Additional Admissions Correctly Forecasted Compared to Random Selection per 100 Applicants (adapted from Crouse & Trusheim 54-56)
|Test|Freshman Year with GPA > 2.5|Freshman Year with GPA > 3.0|Bachelors Degree with GPA > 2.5|Bachelors Degree with GPA > 3.0|
|HSR + SAT|11.9|16.2|6.2|6.1|
Statistics on test scores and college success, however, can never reveal what might have been. In a rare practical experiment, Williams College admitted 358 students over a ten-year period (ten percent of each year's new admissions) who would otherwise have been rejected under the school's normal test score and grade requirements. The identities of the "ten percenters" were kept secret from faculty and students. They were subjected to the same academic requirements as other students and received no special aid. In 1976 the results were announced: 71% of the "ten percenters" graduated, compared with the school average of 85%. In one graduating class, the class president, the president of the college council, and the president of the honor society were all "ten percenters" (New York Times).
The qualities students need to succeed over the long haul were examined in a 1972 study of over 3400 professors by ETS psychologist Jonathan R. Warren. His findings were as follows:
What does it take to succeed in college? Motivation was the quality most frequently cited by over 3400 college teachers during a recent study of academic performance. The teachers mentioned students' academic commitment and interest even more often than intellectual ability as characteristics of their best students (qtd in Nairn 71).
Similarly, a study by the American College Testing Program reported that "variables such as motivation and a student's background" discriminated between average students and those who dropped out of college, while academic data such as SAT scores and College Board Achievement Tests did not discriminate between these groups (Nicholson). In general, the best predictor of creative output in adulthood is participation during youth in independent, "self-sustaining" ventures (Joekel 6). According to research summarized by ETS in 1979, "the best predictor of accomplishment in college" is not the SAT but "accomplishment in the same area in high school, as measured by simple check lists of nonacademic achievements" (Baird qtd in Nairn 77).
In a truly bizarre experiment sponsored by ETS, Dr. John R. P. French of the University of Michigan reported high correlations of "achievement orientation" with uric acid levels in the blood (0.66). According to Dr. French:
[We] were able to predict four and a half years in advance which high school students would go on to college and which would not. We were also able to predict which ones would drop out of college, and if we took into account IQ, how long before they dropped out.... We hope to do some studies of serum uric acid in the selection of executives (31).
Considering the SAT's poor predictive ability, one has to wonder exactly what takes priority at ETS. Such criticisms are not isolated. A former ETS executive once described the company as "an educational country club [designed to] pamper over-priced researchers who sit all day and contemplate their psychometric navels4" (Owen 8).
To get an idea of what ETS really thinks about the accuracy of the SAT, consider its principal method of detecting cheating. ETS' scoring machines are programmed to set aside the answer sheets of students whose scores rise or fall suspiciously when they take the SAT a second time. In order to set off the machines, there has to be a 150 point difference on either half of the test or a 250 point combined difference between the first and the second times a test is taken (Milwaukee Journal). When a paper is set aside, ETS investigators check for inconsistencies in handwriting or signatures and can recreate the seating arrangements to look for indications of collaboration or spying. In most cases cheating is not found and the results are allowed to stand. If these swings aren't cheating, then what are they? Does ETS think that scholastic aptitude is so volatile that it can grow or shrink 25% in three months5? If aptitude were really an innate, unlearnable thing and if the SAT really measured it, then any change over 34 points (the SAT's standard error of measurement) should be suspect (CEEB; 1965: 21).
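The arithmetic behind the 25% figure, and the gap between the cheating thresholds and the standard error of measurement, can be laid out explicitly. This is a minimal sketch using only the numbers quoted above. The constant and function names are illustrative, and the percentage is taken against the 600-point span of the 200 to 800 scale, which is an assumption about how the figure was derived.

```python
# Figures quoted in the text.
SCORE_MIN = 200
SCORE_MAX = 800
STANDARD_ERROR = 34       # SAT standard error of measurement, in points
SECTION_THRESHOLD = 150   # change on either half that triggers review
COMBINED_THRESHOLD = 250  # combined change that triggers review


def fraction_of_scale(points, sections=1):
    """Express a score change as a fraction of the available score range.

    Assumption: the range is (SCORE_MAX - SCORE_MIN) points per section.
    """
    return points / (sections * (SCORE_MAX - SCORE_MIN))


def in_standard_errors(points):
    """Express a score change as a multiple of the standard error."""
    return points / STANDARD_ERROR


if __name__ == "__main__":
    print(f"Section threshold: {fraction_of_scale(SECTION_THRESHOLD):.0%} of the scale")
    print(f"Combined threshold: {fraction_of_scale(COMBINED_THRESHOLD, sections=2):.1%} of the scale")
    print(f"Section threshold: {in_standard_errors(SECTION_THRESHOLD):.1f} standard errors")
```

On these numbers a score has to move by more than four standard errors on one section before the machines take notice.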
That ETS would allow such a wide variance is, to me, an indication of the exam's misguided construction. The SAT is not built from content specifications or on a model of human reasoning, but rather from statistical guidelines. This results in circular reasoning, where the right answer is the one chosen most often by the students who perform best on the test. Questions are selected solely for their ability to discriminate high scorers from low ones. When new questions are written they are "pretested" to see if they conform to these requirements. A different thirty minute section of each SAT consists of untried questions that don't count towards the test-taker's score. How students respond to these items determines whether or not they will be used on real SATs. An item writer for ETS explained the process:
It was all very pragmatic. It wasn't... theoretical or anything. I had always known this to be true, but it had never been presented to me with such force. There is no Platonic correct answer to any of these questions; it's all determined by the statistical performance of the question as it relates to other questions. If students who do well on the exam generally tend to pick the same answer then it must be pretty good (Owen 79).
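The kind of statistical screening described here can be illustrated with a classic upper-lower discrimination index: the share of high scorers who answer an item correctly minus the share of low scorers who do. This is a textbook measure offered as a sketch only. The text does not say which statistic ETS actually used, and the data below are invented for illustration.

```python
def discrimination_index(responses, group_size):
    """Upper-lower discrimination index for one test item.

    responses:  list of (total_score, answered_correctly) pairs.
    group_size: number of test takers in the top and bottom groups.
    Returns a value from -1.0 to 1.0; values near 0 mean the item does
    not separate high scorers from low scorers.
    """
    ranked = sorted(responses, key=lambda pair: pair[0], reverse=True)
    top = ranked[:group_size]
    bottom = ranked[-group_size:]
    p_top = sum(1 for _, correct in top if correct) / group_size
    p_bottom = sum(1 for _, correct in bottom if correct) / group_size
    return p_top - p_bottom


# Hypothetical pretest data: (total score, got this item right?)
SHARP_ITEM = [(780, True), (720, True), (690, True), (610, True), (560, False),
              (520, True), (480, False), (430, False), (390, False), (350, False)]
FLAT_ITEM = [(780, True), (720, False), (690, True), (610, False), (560, True),
             (520, False), (480, True), (430, True), (390, False), (350, True)]

if __name__ == "__main__":
    print("Sharp item:", discrimination_index(SHARP_ITEM, group_size=3))
    print("Flat item:", discrimination_index(FLAT_ITEM, group_size=3))
```

An item like the second one, which low scorers answer as well as high scorers, would be rewritten or discarded under this kind of rule regardless of whether its keyed answer is defensible on its merits.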
It's interesting to note that when the first SAT was administered, the idea of multiple choice testing was almost unknown in American schools. Many of the students who took the test were hesitant to guess when they weren't certain of an answer. "They felt that guessing was not only risky but even immoral, equating it with cheating" (Owen 93). This led one ETS researcher to suggest that students should learn "how to behave effectively when taking a test" (qtd in Nairn 95). If one could be taught effective test taking might it not also be possible to teach superior test taking? The official line at ETS was no.
In 1976, the Federal Trade Commission responded to ETS' long-standing wish for a government investigation of the coaching schools. ETS' claim was that the aptitude the SAT measured was acquired over years, so promises of significant results (over 100 points) in six weeks were false advertising. In Effects of Coaching on Scholastic Aptitude Test Scores, the College Board reported that:
Despite variable factors from one study to another, the net result across all studies is that score gains directly attributable to coaching amount, on the average, to fewer than 10 points — a difference of such small magnitude... that it is unreasonable to expect it to affect college admissions decisions. The magnitude of the gains resulting from coaching vary slightly, but they are always small regardless of the coaching method used or the differences in the students coached (4).
Unfortunately for ETS, the plan backfired. The test preparation schools were not cited for fraudulent advertising; ETS was. The initial FTC report found that coaching courses, on the average, raised scores more than 100 points on both the verbal and math sections6 (Nairn 102). "Contrary to [the] explicit claims of ETS/CEEB," said Albert Kramer, Director of the Bureau of Consumer Protection, "coaching can be effective..." (Levine 5).
How ETS got into this situation is beyond me. The original 1968 report, Effects of Coaching on Scholastic Aptitude Test Scores, had become known within the company as "The Little Green Book That Lies" (Owen 89). By the time of the investigation, it had been known for four years that the math portion was vulnerable to coaching (Pike & Evans). In a report published in 1978, Lewis W. Pike of ETS summarized the results of previous coaching studies, including several that ETS had neglected to cite in its earlier summaries. He concluded that the SAT-Math was clearly coachable and that the SAT-Verbal probably was, but that no comprehensive study of the latter had been attempted (Pike). A few weeks later, Lewis Pike was fired. The following year, the College Board issued a new official statement on coaching, published under this headline: "Board reaffirms its position that 'coaching' for SAT is not likely to improve students' scores" (Owen 100). One really has to doubt the objectivity of ETS in assessing its own products.
In order for a test question to make it onto a real SAT it has to have certain statistical characteristics. A question is used only if high scoring students tend to get it right and low scoring students tend to get it wrong. If low scoring students do as well as high scoring students, then the question will have an unacceptably low discrimination, and ETS will either have to rewrite it or discard it. The key to a successful test preparation course lies in its ability to address the circular logic in writing questions on a statistical foundation. David Owen, in his book None of the Above, describes just one such coaching school: The Princeton Review. In their method, test candidates are taught how to recognize the incorrect alternatives to a question. As Owen reported:
Adam Robinson [one of Princeton Review's founders] calls this average test taker Joe Bloggs. When Joe Bloggs takes the SAT, he scores 450. When ETS lays a trap for him, he steps in it. Princeton Review students learn how to avoid these traps by learning to understand how Joe Bloggs thinks. When Princeton Review students come to a hard question they don't understand, they ask themselves: What would Joe do here? Then they do something else (124).
Owen's reported average score increases for Princeton Review students were 185 points on either portion of the SAT. Roughly 30% of the students experienced gains in excess of 250 points (122).
At Princeton Review, candidates are also taught how to find the experimental section. Since it doesn't count towards the final score, they fill it out at random and save themselves the trouble70. This effectively sabotages the SAT's statistics on future versions of the exam. When they answer the questions at random, they reduce the difficulty of future SATs by making the pretest questions look harder than they are. Remarked John Katzman, another founder of Princeton Review, "The SAT is bullshit" (Owen 140).
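The effect of random answers on a pretest item's apparent difficulty can be estimated with a simple mixture. The sketch below assumes five answer choices per question and that a known fraction of test takers guess blindly. Both are assumptions made for illustration, and the function name is hypothetical.

```python
def observed_proportion_correct(true_proportion, random_fraction, n_choices=5):
    """Proportion of correct answers seen when some test takers guess blindly.

    true_proportion: share of serious test takers who answer correctly.
    random_fraction: share of all test takers who fill in answers at random.
    n_choices:       answer choices per question (assumed to be five).
    """
    chance = 1.0 / n_choices
    return (1.0 - random_fraction) * true_proportion + random_fraction * chance


if __name__ == "__main__":
    true_p = 0.60
    for random_fraction in (0.0, 0.1, 0.2):
        seen = observed_proportion_correct(true_p, random_fraction)
        print(f"{random_fraction:.0%} guessing at random: {seen:.0%} answer correctly")
```

Any item that serious test takers answer correctly more often than chance will look harder than it is once random responses are mixed in. In this example, an item that 60% of serious candidates get right appears to be a 52% item when a fifth of the room is guessing.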
The triumph of Princeton Review over the SAT reveals an inherent problem with a test based on a statistical model. If the psychometricians can't define what aptitude is — outside of saying that apt people have it — what exactly are they measuring? It might pay to examine some of the characteristics of the people ETS identifies as less apt.
If the SAT is an extremely weak predictor of academic potential it is a moderate predictor of family income. Average scores are proportional to family income: students from families with higher incomes tend to receive higher scores80. Estimates of the correlation between SAT score and family income vary from 0.23 to 0.40 (Crouse & Trusheim and Doermann, respectively). This ranking by income prevails not just when large groups are averaged together but also among applicants within the same institution.
A table from Crouse & Trusheim's book The Case Against the SAT (reprinted below) indicates that SAT scores differentiate people not only by income but also by their parents' role in the economic system. The average scores of the children of professionals are higher than those of the children of white collar workers, which, in turn, are higher than those of the children of blue collar workers. High school rank, which is a better measure of academic achievement than SAT scores, shows no such correlation.
Table 2: Correlations of SAT and High School Rank (HSR) with Socioeconomic Background (Crouse & Trusheim 126)
If SAT scores really measure a person's scholastic aptitude, then that aptitude is distributed according to parental income.
Some have used such studies to indicate the academic superiority of the upper classes. Further investigation reveals the folly of such assumptions. An American Council on Education study of 36,581 students in 55 colleges concluded that: "The income of a student's parents has no relationship to freshman GPA, either before or after controlling for high school grades, academic aptitude, and college selectivity" (Astin; 1971: 14). Similarly, an ETS study of 15,535 college bound students found that actual accomplishments outside the classroom did not correlate with income either:
Although educational ambitions were significantly related to accomplishments in several areas, family income was not [one of them]. That is, students from families with different incomes did not significantly differ in the number or level of accomplishments they reported (23).
Not only do the children of the wealthy score unusually high on the SAT, they also have, by virtue of their wealth, increased access to test preparation materials and coaching schools — tuition for the Princeton Review course was $500 in 1984. The FTC investigation found that those candidates who had taken advantage of coaching were heavily concentrated in the upper income brackets: In 1978, 41% of the coached students came from the top income bracket of $30,000 or more (Levine qtd in Nairn 98). As Owen said of Princeton Review students, "[They] simply don't take the same test.... The effect would be the same if ETS randomly selected a thousand white, wealthy students each year, gave them the answers to the SAT in advance, and then denied that it had done so" (139).
In addition to its socioeconomic bias, the SAT is also prejudiced against non-whites. For example:
ETS' reply to such claims — some of them from their own researchers — is that the SAT does not discriminate against particular groups per se, but rather that it reflects the fundamental inequality of American society. The SAT is no more responsible for these inequalities than a thermometer is responsible for a fever. It is one thing to use test scores to illuminate disparity, but it is something else entirely to restrict opportunities with them. The real crime of the SAT is that it disguises this disparity as a morally neutral difference in aptitude. Daniel Seligman, associate managing editor of Fortune magazine, had this to say on the subject:
ETS tests persist in showing some people to be smarter than others. And if some people are smarter than others, there might actually be some justification for an economic system in which some people have more money and authority than others.... The really interesting question is not whether rich people are smarter, but why they are. Is it because of their superior environments or their superior genes? The answer... is 'both, obviously' (84).
Wealthy whites don't see SAT results as proof that the poor are mistreated; they see them as proof that mistreatment of the poor is fair.
The decline of test scores with age has long been a feature of standardized aptitude and intelligence tests. The American College Testing Program openly admitted the age discrimination in its college admissions test: "Age groups are combined for prediction [by ACT scores]; however, this procedure leads to consistent underprediction of the grades of older students, and thus to bias against them" (ACT 23). If claims about what these tests measure are taken at face value, they show that adults decline in aptitude as soon as they pass their early twenties. If we buy into the whole notion that the SAT measures "verbal and mathematical abilities... developed over many years both in and out of school" then on the average most people lose ability shortly after high school (CEEB; 1991: 3). What's really happening, however, is that as the test taker advances in the performance and skill-oriented job world he moves farther and farther away from the test-oriented school world. The tendency of aptitude tests to penalize people without recent practice in test-taking skills has its greatest impact on candidates returning to school after several years in the job market. Those who weren't able to go on to higher education immediately out of high school, displaced workers looking for additional education, and homemakers returning to school are all penalized.
A lot of press over the last 20 years has been devoted to the consistent decline in average SAT scores beginning in 1963 and their miraculous turnaround in 1982. Every SAT since 1941 has been normed against the test administered in April of that year. Whatever else has changed in the world, the SAT remains, according to a College Board publication, an "unchanging standard" (Advisory Panel 8). Various reasons for the decline have been proposed, from increased electives and "diminished seriousness" to television and fluoride in drinking water, but none of them is based on a test-oriented model9 (Advisory Panel 46-48). Given that the test is a better predictor of status quo demographics than of scholastic aptitude, I would imagine that any statistically significant changes are directly attributable to demographic changes in the population of students that take the test. In much the same manner that scores decline with age, they decline as the demographics of the test-takers move away from the white, Anglo-Saxon, upper class norm of 1941 and towards a multiethnic, economically heterogeneous sample.
Carl Campbell Brigham, the creator of the first SAT, was a firm believer in tying advancement and opportunity to "merit." Unfortunately, he was also a bigot. His only book, A Study of American Intelligence, "proved," through army intelligence statistics, that Catholics, Greeks, Hungarians, Italians, Jews, Poles, Russians, Turks and, especially, Negroes were innately less intelligent than Germanic and Scandinavian peoples. "We... face here," he wrote, "a possibility of racial admixture here that is infinitely worse than that faced by any European country today for we are incorporating the Negro into our racial stock, while all of Europe is comparatively free from this taint" (209). By carefully sampling the mental power of the nation's young people, he hoped to identify and reward those citizens whose racial inheritance had granted them superior intellectual powers. The SAT was to be the currency of merit in a new American social order based on an aristocracy of aptitude, or meritocracy.
Henry Chauncey, founder and first president of ETS, inherited the meritocratic ideal and continued to preach its virtues:
To many the prospect of measuring in quantitative terms what have previously been considered intangible qualities is frightening, if not downright objectionable. Yet, I venture to predict that we will become accustomed to it and find ourselves better off for it. In no instance that I can think of has the advance of accurate knowledge been detrimental to society.... Educational and vocational guidance, personal and social adjustment most certainly should be greatly benefited. Life may have less mystery but it will also have less disillusionment and disappointment. Hope will not be a lost source of strength, but it will be kept within reasonable bounds (qtd in Nairn 4).
The Scholastic Aptitude Test is a clever attempt to conceal aristocracy and racism behind the cover of science and objectivity. Students, parents, teachers, and administrators who submit themselves to the SAT and believe the lies that ETS tells them are not acting in their own best interest. The SAT is a tool for the privileged to maintain the status quo. Like the razor wire surrounding a gated community, the "reasonable bounds" of the SAT serve to isolate the well-to-do from the rest of society and ensure that the wealthy and powerful are the only ones with access to the wealth and power.
Cornell University News Service. Research News Release: 6 November 1997.
Intelligence test scores among racial and socio-economic segments of American society are not growing ever wider, contrary to arguments in The Bell Curve, but are, in fact, converging, say Cornell University psychologists Wendy M. Williams and Stephen J. Ceci, based on analyses of national data sets of mental test scores. This is contrary to often-reported arguments that Americans are getting dumber because low-IQ parents are outbreeding high-IQ parents (EurekAlert!).
Complete text of an email reply from an ETS representative.
Subject: re: SAT Inquiry
Date: Wed, 16 Oct 96 15:38:24 EDT
From: sat_agent3 <email@example.com>
To: firstname.lastname@example.org

Thank you for contacting College Board Online. SAT is now an acronym for Scholastic Assessment Test. The name change had been in effect since March 1994. If we can be of further assistance, please contact us.
Well, you wouldn't know it from reading any ETS publications. They seem ashamed to admit what it is they actually measure with their products (see footnote 9).
Quote from an interview with Steve Jobs, co-founder of Apple Computer.
It's a political problem. The problems are sociopolitical. The problems are unions. You plot the growth of the NEA [National Education Association] and the dropping of SAT scores, and they're inversely proportional. The problems are unions in the schools. The problem is bureaucracy. I'm one of these people who believes the best thing we could ever do is go to the full voucher system.
I have a 17-year-old daughter who went to a private school for a few years before high school. This private school is the best school I've seen in my life. It was judged one of the 100 best schools in America. It was phenomenal. The tuition was $5,500 a year, which is a lot of money for most parents. But the teachers were paid less than public school teachers — so it's not about money at the teacher level. I asked the state treasurer that year what California pays on average to send kids to school, and I believe it was $4,400. While there are not many parents who could come up with $5,500 a year, there are many who could come up with $1,000 a year.
If we gave vouchers to parents for $4,400 a year, schools would be starting right and left. People would get out of college and say, "Let's start a school." You could have a track at Stanford within the MBA program on how to be the businessperson of a school. And that MBA would get together with somebody else, and they'd start schools. And you'd have these young, idealistic people starting schools, working for pennies.
They'd do it because they'd be able to set the curriculum. When you have kids you think, What exactly do I want them to learn? Most of the stuff they study in school is completely useless. But some incredibly valuable things you don't learn until you're older — yet you could learn them when you're younger. And you start to think, What would I do if I set a curriculum for a school?
God, how exciting that could be! But you can't do it today. You'd be crazy to work in a school today. You don't get to do what you want. You don't get to pick your books, your curriculum. You get to teach one narrow specialization. Who would ever want to do that? (Wired. February 1996: 158.)
How exciting indeed! You're right Steve, I am paid too much. I can't wait to start working for pennies. I feel so deprived teaching only one narrow specialization. Throw some more work my way. Unions? They just get in the way, don't they. All these teachers with more education than you demanding reasonable salaries and decent working conditions. The temerity! Send a couple of MBAs our way, Steve. We need their leadership and insightful knowledge. How did I ever manage to teach without a Stanford grad manning the whip? Thank you, Steve. Thank you for solving our nation's educational problems. Please excuse my tears of joy.