By: Hans Sandberg
Often a hot topic when educational testing is discussed, high and low stakes are actually connected to the use of test scores rather than the test itself. Focus on ETS R&D asked ETS scientists Michael T. Kane and Richard J. Tannenbaum to explore the issue of stakes in relation to decisions based on test scores.
What do we mean when we talk about stakes in testing?
Michael Kane: The notion of stakes in testing programs grew out of a recognition that tests can have serious consequences, which is why it is so important to make the tests as good as possible. Since we don't want to harm test takers or misinform test users, it is critical that test programs with serious consequences, or high stakes, live up to the highest possible standards when it comes to the accuracy of the scores and how they are interpreted and used. But not all test uses have high stakes. Take a spelling quiz that an elementary school teacher gives his or her students to encourage them to study their weekly spelling list and to decide where more attention is needed. It wouldn't make sense to demand the same level of rigor for such a quiz as for a test that is used to decide whether a student should be promoted to the next grade level.
Richard Tannenbaum: I agree with Michael that an in-class quiz doesn't carry the high stakes of an admissions test, but that doesn't mean that teachers should use quizzes with no measurement quality. For the quiz to be useful, its content should certainly be relevant to what is being taught. The point is that we want to strike a balance between our general interest in promoting high standards in testing, and our interest in not unnecessarily burdening teachers and other practitioners who need to assess students or candidates.
Is it fair to have tests whose scores have high stakes?
Michael Kane: There is nothing inherently fair or unfair about using tests to measure something. Even if we didn't have tests, decisions on selection would be made, and some could have serious consequences for individuals and institutions. That's just a fact of life in our society. For example, we require that candidates for professional licensure meet certain educational requirements and that they pass licensure tests. We want to protect the public by assuring that professionals meet basic requirements for safe and effective performance. People don't want to take unnecessary risks when they see a doctor or dentist, or when they board an airplane. The decision to grant or deny a candidate a license has serious and lasting consequences for both the candidate and society.
Richard Tannenbaum: Obviously, the goal of using a test is to reach a well-informed decision, one that is better than the decision that would be made if the test were not used. In general, a test is considered fair when there is a comparable opportunity for different groups of test takers to demonstrate their mastery of the knowledge, skills, or abilities being measured and when performance differences between groups of test takers are related to the knowledge, skills, or abilities being measured. But fairness is a complex issue, as it may also be associated with different opportunities to learn and be exposed to the content that the test measures. This means that fairness is somewhat distinct from stakes, but fairness is of course more important when the stakes are high, and when the consequences of the decisions based on the test scores are significant and difficult to reverse.
If the stakes are low, why bother with a test?
Michael Kane: Every testing program seeks to achieve some goal, arising from some need. That goal would still be of some importance even if the stakes are low. In some cases, tests are categorized as low stakes because the potential consequences associated with the use of the scores are not too serious, and in some cases because their consequences are temporary. In many educational contexts, we use assessments to provide test takers and teachers with feedback on how students are doing, including their strengths and weaknesses in the course content. If a student or a class is having trouble with a concept, then the teacher may want to provide some additional instruction or add examples that clarify the concept. This kind of test score use will likely be seen as having low stakes because the consequences for the students are not too serious and don't last long. But such "formative" uses are intended to shape the course of instruction, and they can therefore have a significant impact on the quality of instruction over the long term.
So, what lies beyond high and low stakes?
Richard Tannenbaum: Let's take a closer look at the terms. Do high stakes mean that all typical indicators of measurement and testing quality, such as reliability, validity, scoring accuracy and consistency, accuracy of equating and scaling, fairness, and test security, must apply and meet rigorous expectations? Do low stakes, on the other hand, mean that we don't need to worry about evidence of measurement and testing quality? Our answer to both of these questions is no!
The uses of the test scores are certainly a major factor in considering stakes, but the testing conditions and context also matter. How often a test is given, for example, affects the stakes involved. A licensure test that is only offered a few times over the course of a year is likely to be considered as having higher stakes than the same test offered many times in a year since test takers will have fewer opportunities to take the test. This also affects how important it is that the test meets high standards for accurate classification.
Michael Kane: In a paper that Rick and I recently presented at NCME¹, we suggested an alternative framework for evaluating stakes in testing. We think it makes more sense to move away from thinking about stakes as simply low or high and to move toward thinking about a range of stakes. We need a comprehensive approach for looking at how stakes apply to any particular test use, and how we can identify a range of possible consequences, as well as how positive and negative consequences sometimes can balance each other out.
Every testing program is designed to achieve some desirable consequences, such as identifying student needs, or helping institutions make sensible selection and placement decisions.² We also know that every program can have unintended negative consequences, such as when the use of a test results in teachers changing their instructional practice to focus only on the content covered by the test.
Richard Tannenbaum: Our approach encourages test publishers to focus their attention on those aspects of measurement and testing quality that carry more severe consequences. A simple dichotomous approach to labeling tests may result in ineffective and inefficient use of staff resources and expertise if we place too much attention on less critical areas. Take a licensure test as an example. Greater attention should be placed on assuring reliability at the cut score, which is the point that separates those who succeeded from those who failed to reach a required level of performance, rather than across the entire score scale. If, however, we are dealing with an admissions test, then we must pay attention to reliability across the score scale, because decisions are likely to be made at more than one score location.
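To illustrate the point about the cut score, here is a minimal Python sketch. It is not taken from the interview or from the NCME paper. It assumes, purely for illustration, that an observed score equals a true score plus normally distributed error with a constant standard error of measurement (SEM). The cut score of 70, the SEM of 4, and the function names are all hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def misclassification_probability(true_score: float,
                                  cut_score: float,
                                  sem: float) -> float:
    """Chance that the pass/fail decision disagrees with the true status.

    Assumption: observed = true + error, with error ~ Normal(0, sem).
    A test taker passes when the observed score is at or above the cut.
    """
    p_observed_below_cut = normal_cdf((cut_score - true_score) / sem)
    if true_score >= cut_score:
        # A true passer is misclassified when observed below the cut.
        return p_observed_below_cut
    # A true non-passer is misclassified when observed at or above the cut.
    return 1.0 - p_observed_below_cut


CUT_SCORE = 70.0  # hypothetical licensure cut score
SEM = 4.0         # hypothetical standard error of measurement

for true_score in (50, 62, 68, 70, 72, 78, 90):
    p = misclassification_probability(true_score, CUT_SCORE, SEM)
    print(f"true score {true_score:>3}: P(wrong decision) = {p:.3f}")
```

Under these assumptions, the chance of a wrong pass/fail decision is largest for test takers whose true scores sit close to the cut and negligible far from it. That is why measurement precision matters most at that one point when a single cut drives the decision, and across the whole scale when decisions are made at many score locations.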
Michael Kane: It will be easier to design testing programs that achieve their intended goals if we can identify any potential negative consequences, evaluate how serious they are, and estimate how likely they are to occur. Hopefully the result would be a more balanced view of how effective testing programs are. More importantly, it could help us identify and ameliorate program characteristics that have unwanted consequences. If, for example, we learn that some students fail to complete a test and are then misdiagnosed or placed in the wrong program because of tight time limits, we could suggest that the time limit be extended or the test be shortened or broken up into two or three segments.
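The three steps Kane mentions (identify potential negative consequences, judge how serious each is, and estimate how likely it is) can be sketched as a simple ranking exercise. The Python below is only an illustration and is not the framework from the NCME paper. The `Consequence` class, the 1-to-5 severity scale, the severity-times-likelihood score, and every number shown are hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Consequence:
    """One potential negative consequence of a test use (illustrative)."""
    description: str
    severity: int      # assumed scale: 1 (minor, short-lived) to 5 (serious, hard to reverse)
    likelihood: float  # assumed estimate between 0.0 and 1.0

    @property
    def risk(self) -> float:
        # Simple severity-times-likelihood score; one of many possible choices.
        return self.severity * self.likelihood


def rank_by_risk(consequences: list[Consequence]) -> list[Consequence]:
    """Return consequences ordered from highest to lowest risk score."""
    return sorted(consequences, key=lambda c: c.risk, reverse=True)


# Hypothetical entries loosely based on examples from the interview.
candidates = [
    Consequence("Tight time limit leads to a wrong placement", 4, 0.2),
    Consequence("Instruction narrows to tested content", 3, 0.4),
    Consequence("Quiz feedback briefly discourages a student", 1, 0.3),
]

for item in rank_by_risk(candidates):
    print(f"{item.risk:.1f}  {item.description}")
```

A ranking like this only helps decide where to look first. Whether to extend a time limit or shorten a test remains a judgment about the specific program.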
Richard Tannenbaum: The distinction between high stakes and low stakes test use helps us allocate resources for developing and evaluating testing applications. We want all tests to be as good as they reasonably can be, but we don't want to place excessive burdens on teachers and others who use assessments in a low stakes environment. The distinction between high stakes and low stakes is useful in this context, but we should also recognize that there is a range of stakes associated with different test uses, not only high and low. This insight encourages us to address the specific risks associated with specific programs and to use our resources effectively to improve testing programs and reduce the presence of negative consequences.
Michael T. Kane holds the Samuel J. Messick Chair in Test Validity at ETS's R&D division, and Richard J. Tannenbaum is a Principal Research Director in the same division.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Educational Testing Service. (2014). ETS Standards for Quality and Fairness. Princeton, NJ: Author.
Geisinger, K. F. (2011). The future of high-stakes testing in education. In J. A. Bovaird, K. F. Geisinger, & C. W. Buckendahl (Eds.), High-stakes testing in education: Science and practice in K–12 settings (pp. 231–248). Washington, DC: American Psychological Association.
Heubert, J. P., & Hauser, R. M. (Eds.). (1999). High stakes: Testing for tracking, promotion, and graduation. Washington, DC: National Academy Press.
¹ Tannenbaum, R. J., & Kane, M. T. (2017, April). A reasoned approach to considering testing stakes. Paper presented at the meeting of the National Council on Measurement in Education, San Antonio, TX.
² Wendler, C. (2016, April). Using a theory of action to ensure high quality tests. Paper presented at the meeting of the National Council on Measurement in Education, Washington, DC.