By: Hans Sandberg
Computer technology allows for new ways of designing assessments, whether they use multiple-choice tasks, essays, or interactive games and simulations. This technology also makes it possible to follow how students work their way through a test by collecting process data, such as keystroke data, mouse clicks, and more, which can provide new insights into their performance. Focus on ETS R&D asked three experts on test fairness, Neil Dorans, Michael Zieky, and Rebecca Zwick, about the impact of new technology on test fairness.
What does it mean when ETS says that it strives to build assessments that are fair?
Rebecca Zwick: Let's begin by looking at ETS Standards for Quality and Fairness from 2014, which says that we must "design, develop, administer, and score tests so that they measure the intended construct and minimize the effects of construct-irrelevant characteristics of test takers." We don't want a test of computational skills to include complex word problems that would, in effect, also test English skills. And we don't want an English-language test to include specialized vocabulary that is likely to be more familiar to some groups than others. ETS Guidelines for Fair Tests and Communications, published in 2015, states that we shouldn't assume that most adults know about "a combine, a joist, a margin call, an aria, a subpoena, a stenosis, an RPG, a filibuster, a lumen, a bunt, a buffer, a chuck, or a sloop" unless these words really are the focus of the test. It is a basic fairness principle that groups of test takers defined by language, ethnicity, gender, or socioeconomic status should differ in test performance only if they differ in terms of the skills or qualities the test is intended to measure.
Michael Zieky: There are many definitions of fairness in assessment, but I think the most useful one is based on validity. A good way to think about validity is that test scores are valid to the extent that they meet their intended purpose and that differences in the scores reflect real and relevant differences in test takers. For a test to be fair, it must be valid for different groups of test takers, so that it accurately reflects real group differences in relevant knowledge, skills or other attributes. I wish to be explicit in saying that fairness doesn't require all groups to have equal average scores. If groups on average differ in relevant knowledge, skills or other attributes, a fair test will show those differences.
There are a couple of ways in which scores can be invalid and therefore unfair. Rebecca pointed out that tests can be unfair if they measure irrelevant knowledge, skills, or other attributes in which groups may differ. Tests can also be unfair if they fail to measure important aspects of what they are supposed to measure, and that leads to an advantage or disadvantage for certain groups. For example, men tend on average to be better than women at visualizing objects in three-dimensional space. If a test measured that skill even though it was not important for valid measurement, it would be unfair to women. But if a test failed to measure that skill even though it was required for valid measurement, that would be unfair to men, who would have done better if it had been measured.
Neil Dorans: I agree with Rebecca about the need to avoid measuring the irrelevant and with Michael's point about the need to measure all of the relevant. In practice, our fairness procedures have been designed to weed out the irrelevant. For example, we can ask whether a test of comprehension, composed of problems associated with reading passages written in English, is measuring the same thing in the same way for men and women. However, we can't claim that scores from such a test represent measures of universal reading ability, nor can we assess their fairness as measures of that universal reading ability. We can't assess the fairness of a construct, only the fairness of our attempt to implement that construct. Neither should we expect the test design to guarantee that all relevant material is included in the mix of materials needed to measure the skills or attributes a test seeks to measure.
The edited volume Fairness in Educational Assessment and Measurement (NCME, 2016) describes existing strategies for designing, developing, and administering fair assessments. It also covers procedures for detecting unfairness relating to scoring and fair use of scores. These empirical procedures were designed specifically to weed out irrelevant test content in settings where the conditions of measurement permit direct comparisons of test takers from different subgroups, such as men and women. They do not ensure that all relevant material is included in the test. The volume also makes it clear that these empirical procedures are not readily applicable to a variety of settings that have become commonplace in the field of educational assessment. These settings include comparing scores from tests that are built to different specifications, designed for different grade levels of students, administered in different modes (for example, paper vs. computer), or given in different languages.
Computerized assessments could collect process data that reveal how test takers reached their answers. What might this mean for test fairness?
Michael Zieky: We all agree that fairness is closely related to validity and valid use of test scores. We also agree that validity and fairness require us to measure a good sample of the right stuff, while not measuring the wrong stuff. Innovative assessments make it possible to routinely measure aspects of student performance that previously could only be measured with great difficulty, for example, the amount of time a test taker spends on answering each question, or the number of times a test taker revises a sentence in an essay. But just because we can measure additional aspects of a test taker's performance in computer games, technology-enhanced items, and simulations doesn't mean that they are part of the right stuff. Fairness requires that every new measured variable be shown to be valid. It may be true, for example, that the number of revisions to an essay is related to group membership, but not relevant to the purpose of the test. In that case, it would not be fair to evaluate test takers based on the number of revisions.
The definition of fairness does not change with the types of assessments used. Just as was the case with traditional assessments, new types of tests must be validated. Real and relevant sources of score differences will be fair, while irrelevant sources of score differences will remain unfair. New assessment types don't require new fairness definitions; they require additional investigations of their validity.
Rebecca Zwick: Computer games and simulations present their own fairness challenges. We need to determine, for example, whether test takers with previous experience of playing computer games tend to perform better on these tests even when they are not more proficient in the skills the test is intended to measure. Once again, the issue is one of validity: If skills in computer gaming are not part of the construct, they should not give test takers an advantage.
I agree with Mike's note of caution about the use of process data such as response time, number of pauses between words or paragraphs in creating an essay, and the number and type of intermediate responses made before the final solution to a science problem is reached. Such data can be a gold mine for instructors by helping to reveal why students have certain misunderstandings, or what aspects of essay-writing are difficult for them. But whether process data should be incorporated in a test score is an entirely different question, which, as Mike said, is ultimately tied to validity. Suppose an essay test is offered to prospective employees in a newsroom, where it may be important to be able to compose text rapidly. Let's also assume that an article that is composed faster is considered superior to one that is composed more slowly, all other things being equal. In that situation, it could be reasonable to consider response time as part of the essay test score. However, in most settings speed is not seen as an important aspect of essay writing — within limits of course! You could say that speed is not ordinarily part of the construct, so it wouldn't be fair to give extra points for it. Besides, awarding points for speed could be counterproductive in a classroom setting, since it might encourage haste rather than thoughtfulness.
Neil Dorans: I share Rebecca's concern about how variation in test takers' familiarity with a new type of assessment can interfere with the valid assessment of a construct. In addition to the ability of interest, a test taker's performance depends on how quickly the test taker masters the rules of the assessment, how the test is scored, how much time the test taker has to answer questions, and other contextual aspects of the assessment. If only the mode of assessment changes, but not its purpose, then the innovation might introduce construct-irrelevant differences in performance among test takers, differences that reflect their knowledge of the changed rules. This underscores the importance of informing test takers of how an innovative assessment works.
Since the late 1980s, ETS has implemented many procedures for ensuring test fairness, procedures that work very well with assessments that use a clear specification to assemble easily scored questions that measure a particular skill or proficiency. However, these procedures may not be easily adapted to more complex assessments. Take writing, for example. A multiple-choice test might be faulted for under-representation of the construct of interest, which is how well a person can write. An alternative is to ask the test taker to write an essay, which could be seen as a more authentic and direct method.
The fairness of multiple-choice test items is easy to assess since we have access to a reliable total score based on responses to many test questions. In contrast, assessing the fairness of a direct writing assessment is more challenging, in part because the total score is based on a limited number of tasks. This difference in the number of items or tasks is important for standard fairness assessment procedures.
Proponents of innovative assessments often argue that they are superior because the tasks used, for example, essay writing, mirror reality more authentically than simpler assessments, but not all assessment ideas are easily engineered. There is always a risk that the implementation of an innovative concept introduces multiple sources of construct-irrelevant variance, which could undermine the assessment's fairness and, consequently, its validity. So, I agree with my colleagues that we need to proceed with caution as we innovate.
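To make the point about total scores concrete, here is a minimal Python sketch of one widely used way to check whether a single multiple-choice item behaves differently for two groups: the Mantel-Haenszel procedure for differential item functioning. The interview does not name a specific procedure, so the choice of Mantel-Haenszel is an assumption, as are the function names, the group labels, and the toy counts, which are invented purely for illustration. The sketch compares test takers only within the same total-score level, which is why a reliable total score matters.

```python
import math
from collections import defaultdict


def mantel_haenszel_dif(records):
    """Hypothetical helper: Mantel-Haenszel check for one studied item.

    records: iterable of (group, total_score, correct), where group is
    "ref" or "focal", total_score is the matching variable, and correct
    is 1 or 0 for the studied item.
    Returns (alpha_mh, mh_d_dif).
    """
    # One 2x2 table per total-score level:
    # [ref right, ref wrong, focal right, focal wrong]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, total, correct in records:
        index = (0 if group == "ref" else 2) + (0 if correct else 1)
        tables[total][index] += 1

    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        numerator += a * d / n    # ref right x focal wrong
        denominator += b * c / n  # ref wrong x focal right

    if numerator == 0 or denominator == 0:
        raise ValueError("too little data to estimate a common odds ratio")

    alpha_mh = numerator / denominator
    # Conventional rescaling of the log odds ratio; negative values mean
    # the item is relatively harder for the focal group.
    mh_d_dif = -2.35 * math.log(alpha_mh)
    return alpha_mh, mh_d_dif


def expand(group, total, n_right, n_wrong):
    """Turn counts into individual (group, total, correct) records."""
    return [(group, total, 1)] * n_right + [(group, total, 0)] * n_wrong


# Invented toy data: two total-score levels, two groups.
records = (
    expand("ref", 10, 8, 2) + expand("focal", 10, 6, 4)
    + expand("ref", 20, 9, 1) + expand("focal", 20, 8, 2)
)
alpha_mh, mh_d_dif = mantel_haenszel_dif(records)
print(f"common odds ratio: {alpha_mh:.2f}, MH D-DIF: {mh_d_dif:.2f}")
```

With many items, the total score offers many levels on which to match test takers. With only one or two essay tasks, there is little to stratify on, which illustrates why procedures of this kind are harder to apply to direct writing assessments.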
Neil J. Dorans, Michael Zieky, and Rebecca J. Zwick are Distinguished Presidential Appointees in ETS's R&D division.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Dorans, N. J., & Cook, L. L. (Eds.). (2016). Fairness in Educational Assessment and Measurement. National Council on Measurement in Education. New York, NY: Routledge.
Educational Testing Service. (2015). ETS Guidelines for Fair Tests and Communications. Princeton, NJ: Author.
Educational Testing Service. (2014). ETS Standards for Quality and Fairness. Princeton, NJ: Author.
Zieky, M. (2016). Developing fair tests. In S. Lane, M. Raymond, & T. Haladyna (Eds.), Handbook of Test Development. New York, NY: Routledge.
Zwick, R. (2016). Who Gets In? Strategies for Fair and Effective College Admissions. Cambridge, MA: Harvard University Press.