
Test-based Accountability Systems: The Importance of Paying Attention to Consequences


Video duration: 1:06:24

On-screen: [ETS®.]

On-screen: [Test-based Accountability Systems: The Importance of Paying Attention to Consequences. Suzanne Lane, University of Pittsburgh. 17th William H. Angoff Memorial Lecture.]

Speaker: ETS® Ida Lawrence, Senior Vice President, Research and Development, ETS®

Ida Lawrence - Good afternoon and welcome to what is now the 17th William H. Angoff Memorial Lecture. My name is Ida Lawrence; I’m the senior vice president for the Research and Development Division at Educational Testing Service. For those of you who aren’t familiar with where this lecture started, it’s a series that was established in 1994 to honor the life and work of Bill Angoff. He died in 1993 and he was a long-term employee at Educational Testing Service. He was there for more than 40 years and during this time he made major contributions to educational and psychological measurement. He was recognized by major societies in the field.

In keeping with Bill’s interests, this lecture series is devoted to presenting relatively non-technical discussions of important public interest issues related to educational measurement. Bill Angoff was a wonderful presenter, and many of us who were at ETS® and around in the field actually, had the benefit of listening to his lectures. They were always very understandable, and that’s the spirit of this lecture series as well.

I’m sure you’re all aware; this year’s lecture will be delivered by Professor Suzanne Lane. Suzanne Lane is a professor in the Research Methodology Program in the Department of Psychology and Education at the University of Pittsburgh. Her research and professional interests focus on educational measurement and testing with an emphasis on design and validity issues in large-scale assessment programs and the effectiveness of education and accountability programs.

The topic today is listed up here: Test-based Accountability Systems: The Importance of Paying Attention to Consequences. In her talk, Dr. Lane will address intended and potentially unintended consequences of test-based state accountability systems. This is an enormously important topic for researchers and policymakers in order to be able to examine and improve educational opportunities for all students. She will provide a conceptual framework for evaluating consequences and also discuss the inherent conflict in test-based systems between the goal of accountability and the need to support teaching and learning. She’s going to also discuss implications for state assessment and accountability systems and as part of the talk, she’s going to discuss the impact of alternate assessments that are used for students with the most severe disabilities.

By way of a little background, Professor Lane’s research has been published widely in educational measurement and testing journals. She’s also had many, many key roles, including president of NCME and vice president of Division D of AERA. In addition, she’s served on technical advisory committees for the College Board, for Educational Testing Service, for PARCC, for the U.S. Department of Education’s evaluation of NAEP, and also on several state and accountability program technical advisory committees.

I think you’re really going to enjoy the talk. I’ve heard Suzanne present many times and she’s been a great colleague to us at ETS® as well. She’s on our Visiting Panel on Research and she’s a friend to lots of scholars and people who have worked with her at ETS® and around the field. I’d like to now welcome Suzanne Lane.


Speaker: Suzanne Lane, Professor, Research Methodology Program, University of Pittsburgh

Suzanne Lane - Good afternoon, and I’d like to thank you all for coming out in the rain. Yesterday was a beautiful day, but you can’t order the weather.

I first would like to thank Ida and her colleagues at ETS® for inviting me to present in the Angoff Memorial Lecture Series. It is truly an honor. I am amongst many scholars in measurement who have presented in the past. In fact, I will probably be dropping a few of their names throughout my talk. Thank you, Ida and ETS®.

As Ida said, I’m going to be talking about test-based accountability systems with respect to the consequences of using tests for accountability purposes.

On-screen: [Overview: Consequences for Tests used for Accountability Purposes. Legislation for Test-based Accountability Systems. Conceptual Framework for Evaluating Consequences. Concluding Thoughts.]

In framing my talk, I’m going to be talking about the need for thinking about an interpretation and use argument and a validity argument for test-based accountability systems. As many of us know, many of them include a theory of action, which really specifies the claims one wants to make, the intended outcomes, and the mechanisms to get there.

I’m also going to be talking about the social consequences of test-based accountability systems and value judgments, the values that different stakeholders have in making decisions about what is important in terms of consequences, both positive and negative. That’s just going to provide a frame and then I’m going to discuss different legislation that has gone on in this country, starting with the NCLB Act and identify some consequences there based on some research, Race to the Top, and ESSA. Then I’m going to hopefully provide a conceptual framework that I’ve been thinking about for a long time. We proposed one in the past, my colleagues and I.

On-screen: [Consequences for Tests used for Accountability Purposes.]

The intent of test-based accountability systems is to improve the educational opportunities afforded to all students, particularly those students who have been historically underserved. Students living in economically disadvantaged areas, black students, Hispanic students, English language learners, and other students, students with severe cognitive disabilities as well. They are also there to hold schools, educators, and sometimes students accountable. Tests serve as powerful tools for forwarding educational reforms. However, they also expose some societal and educational inequities. When we think about achievement gaps, I think we also need to think about what the educational gaps have been, because those are typically the things that move—or the achievement gaps result from educational gaps.

Integral to validation efforts is the appraisal of test-based decisions and uses in terms of the consequences. The use of tests as policy levers bears on fairness and equity issues: there are differential outcomes for different types of schools and for different subgroups of students. When we’re evaluating the consequences, we need to determine the extent to which the intended positive consequences outweigh any unintended negative consequences, such as narrowing the curriculum, especially for some groups.

On-screen: [Importance of an Interpretation / Use Argument.]

Many of us have heard this already, but something I always try to do to frame my talks is to stress that when we’re evaluating consequences, it should be part of a validity argument. When we specify an interpretation and use argument, what we’re doing is explicitly linking inferences from performance to conclusions and decisions, including actions resulting from decisions.

In the interpretation and use argument, we want to specify the claims that the test and the use of the test is based on, such as that there’s going to be increased student performance, that there will be changes in instruction. The validity argument actually provides evidence to support the claims that we want to make. We need to accumulate this validity evidence to support each claim and we need to synthesize the validity evidence.

On-screen: [Theories of Action (TOA).]

As I mentioned, theories of action provide a framework for evaluating programs, and they identify critical components and the logical points of impact. Theories of action for accountability systems can guide the development of comprehensive interpretation and use (IU) arguments and validity arguments.

For example, the Race to the Top assessment program called for a theory of action that describes in detail the relationships between specific actions or strategies and its desired outcomes, including improvements in student achievement in college and career readiness.

Randy Bennett at ETS® has identified aspects that theories of action should include, the intended outcomes of the assessment systems. The primary outcome of test-based accountability systems being used now is to prepare students to be college- and career-ready.

Secondly, components of the assessment system and a rationale for each should be specified, so what are the components within this particular system? Interpretive claims that will be made from the assessment results that perhaps instruction has changed, that students are college- and career-ready. Action mechanisms designed to cause the intended effects, that resources are allocated to different instructional approaches and to ensure that the content standards are being instructed on. And the potential unintended negative effects, as well as positive effects, and to identify how we can minimize the negative effects.

On-screen: [Consequences in the Argument-based Approach to Validity.]

There has been a long debate in our field, and probably outside of our field, of whether consequences should be part of a validity argument. Some consider that consequences are not important enough, or that they are important but not part of the validity argument. Others say that they definitely are, and I stand in the latter camp. In terms of the debates, there’s actually been a recent issue of Assessment in Education debating whether or not the examination of consequences should be part of the validity argument.

About five decades ago, Cronbach considered evaluation of decisions and actions based on test scores as part of validity evaluation, and Bob Linn indicated that the best way of encouraging adequate consideration of major intended positive effects, and plausible unintended negative effects of test use, is to recognize the evaluation of such effects as a central aspect of test validation. So, both positive and negative consequences should be part of our validity argument. Some of the well-documented unintended negative consequences are the narrowing of instruction for some students and the focus on students slightly below the cut point for their proficiency level.

An evaluation of the congruency between the intended interpretations and uses and the actual or enacted interpretations and uses should be integral to the validity argument. What’s intended to be and the way in which the test is intended to be used can be very different than the way it is actually being used and the data from it.

As Coburn and Turner indicate, enacted interpretations and uses are more complex and varied than intended interpretations and uses. Tests used locally in different contexts by educators and administrators to address their own needs should be looked at. For example, in some school districts it might be the case that the state test data is being used for grade promotion.

On-screen: [Social Consequences and Value Judgments.]

Tests that are used are mechanisms of social action, resulting in consequences from both direct and indirect actions. Cronbach argued that the validity argument must link concepts, evidence, social and personal consequences, and values. Social consequences weren’t addressed until the 1960s. In an early paper by Sam Messick, he reflected on the ethical aspects of whether a test should be used for a particular purpose, which requires a justification of the proposed use in terms of social values.

In Messick’s perspective, adverse social consequences have a role in test validation; however, that role is limited to consequences that can be traced back to construct-irrelevant variance, such as math tests that might be assessing reading skills too much, and construct underrepresentation. He viewed consequences as part of validity only if they were linked to problems with the test itself.

Messick’s perspective, as Mike Kane has said, gave consequences a secondary role. Cronbach indicated that consequences have an integral role in the validity argument, that negative consequences can invalidate test use, even if they cannot be traced back to a flaw in the test. Tests that impinge on the rights and life chances of individuals are inherently disputable, Cronbach said. It’s the obligation to evaluate whether a use has appropriate consequences for both individuals and institutions and, more importantly, to argue against adverse consequences.

On-screen: [Social Consequences and Value Judgments.]

Decisions made in test design, development, and use are grounded in value judgments. When we develop a test, each of us has different values, so the way in which we specify the purpose of the test and the use of the test is laden with our values. The specification of the construct reflects what we value. The extent to which state tests measure some content standards over others reflects values. The design and development of items and scoring systems and the development of performance level descriptors reflect differing values. What we say is proficient is value-laden. So, value judgments in test design and use impact test consequences. The way in which we define the construct for the test will have an impact on the instruction that’s provided for some individuals and students. Value judgments are inherent in the testing program, including in the specification of intended consequences and potentially unintended negative consequences. We need to be very careful in documenting the value judgments that go through the entire design process, as well as the values that we associate with the consequences that we’re going to study.

Different stakeholders, policymakers, advocacy groups, administrators, educators, parents, students, the community, business leaders, and the list could go on, have different, perhaps, conflicting perspectives and values on the purposes and the uses of the tests. What’s considered of value with regard to test use and consequences depends on the stakeholder.

Bachman, in a 2005 paper, actually indicated that perhaps we should have different use arguments for the particular stakeholder, which would be rather comprehensive and hard to probably provide evidence for, but at least we should document what values stakeholders hold.

On-screen: [Test-based Accountability Systems and Consequences.]

With respect to test-based accountability systems, advocates view test-based accountability systems as a way to raise academic standards for all students, and then to focus instruction on rigorous content. This is one of the standards, italicized here, from the Standards for Educational and Psychological Testing, and all individuals who develop tests, use tests, interpret scores, etc., should be following those Standards. This standard, 1.6, states “When a test use is recommended on the grounds that testing or the testing program itself will result in some indirect benefit, logical or theoretical arguments and empirical evidence for the indirect benefit should be provided.” In test-based accountability systems, one of the indirect benefits is to improve the instruction for students, so that the goal is to change instruction through the use of tests and other means.

There’s inherent tension between using tests to achieve the goals of instruction and the accountability due to practical design constraints. In this country we like to use tests, and it’s more economically feasible to use tests for multiple purposes—to use it for accountability purposes as well as instructional purposes. We know that it’s difficult to design a test for both purposes. We do our best to design tests for both purposes, but there is an inherent tension in doing so. So, the design decision should be informed by the multiple purposes.

Those who mandate and use tests are obliged to make a case for the appropriateness of test decision uses and their resulting consequences. Validity and fairness of test score interpretations and uses may be compromised due to inequities in education. Thus, test-based decisions in uses of tests may be compromised, leading to negative consequences. As an example, using exit exams at the high school level may increase dropout rates, and the increase in dropout rates might be differential in affecting certain subgroups over others.

I put the same standard there, because I think it is extremely important, and the comment underneath it, “Certain educational testing programs have been advocated on the grounds that they would have a salutary influence on classroom instructional practices.” That basically implies that it is important to evaluate the extent to which they are having an effect on instructional practices in the classroom, and that is part of the validity argument.

When we’re thinking about consequences, we need to think about consequences based on using test scores and consequences associated with using tests as levers of educational change, and both of them are within that framework.

On-screen: [Research on Consequences of State Performance Assessments and Alternative Assessments.]

I’d like to go over some research on the consequences of state performance assessments, as well as alternate assessments that are geared for students with the most severe cognitive disabilities. The reason why I chose state performance assessments and alternate assessments is because these assessments, all assessments, beg for looking at the evaluation of consequences, but I think these do because they are supposed to have substantial impact on instruction.

Performance assessments. There was a renewed interest in the 1990s, due in part to being valuable tools for educational reform. Linn had an excellent quote, and I use it all the time, so people who have heard me talk probably have seen this quote before. “Consequential evidence is especially compelling for performance-based assessments, because particular intended consequences are an explicit part of the system’s rationale.”

One such program was the Maryland School Performance Assessment Program, and I hoped Steve Ferrara was going to be here because he was the director of the assessment during that and led the Maryland School Performance Assessment Program. It was based on performance tasks, integrated tasks for science, math, English/language arts, social studies and it required students to work collaboratively and student individual scores were not provided. This was before No Child Left Behind; school-level scores were provided.

On-screen: [Research on the Consequences of State Performance Assessments.]

My colleagues at the University of Pittsburgh and I were asked to evaluate the consequences of the use of such tests. What we found is that the more teachers reported impact of MSPAP on instruction, including higher-level thinking skills and rigorous content, the greater the rates of change in MSPAP school performance in math and science over the five years. The positive changes in school performance were related to the impact of MSPAP on instruction. This was supported by some research that Bob Linn did, where he looked at the trends in math gains for NAEP and MSPAP and found that they were similar. Their performance on MSPAP and their gains over the years could be generalized to their performance on NAEP, which was a different assessment measuring somewhat different skills and with different item formats.

We also found that mathematics instructional activities were more aligned with MSPAP for schools with higher gain scores compared to schools with lower gain scores, again providing some evidence that MSPAP and the policies instituted in Maryland to that point had an impact on performance-based instruction.

When using test scores to make inferences regarding the quality of education, contextual information is needed to inform the inferences and actions. We found that a school contextual variable, socioeconomic status, was significantly related to school-level performance on MSPAP in math, reading, writing, science, and social studies. SES, however, was not significantly related to growth on MSPAP in math, writing, science, and social studies.

We argued that MSPAP did not have an adverse impact for students living in economically disadvantaged areas, because of comparable school gains, regardless of SES. However, others could argue, and have argued with me, that in fact there was some degree of adverse impact, because these children, these schools, should have shown even greater gains, because they started out at a lower level.

On-screen: [Research on the Consequences of State Alternate Assessments.]

Alternate assessments for students with the most severe cognitive disabilities have been put into place. They are put into place because they are seen to promote grade-level academic instruction for students. In one study, teachers from schools with high performance on the state alternate assessment indicated that the assessment had a positive impact on instruction and inclusion for students with disabilities, whereas teachers from schools with lower performance indicated that the assessment had little impact.

Another study indicated that the use of alternate assessment impacted instruction more than it impacted the development of the individual educational plans. In a third study, teachers’ indications of student access to the general curriculum were not related to teachers’ perceptions of student assessment.

These kinds of studies provided a call for more professional development on how to provide an inclusive environment and standards-based instruction for students with severe cognitive disabilities, which I’m going to be talking about a little later on.

On-screen: [Legislation for Test-Based Accountability Systems.]

In thinking about legislation for test-based accountability systems, I’m going to talk about each of the recent pieces of legislation and some of the consequences from them. But first I’d like to just mention the Minimum Competency Testing Reform in the 1970s. This was an admirable goal, the Minimum Competency Testing, to ensure that all students attain certain standards of minimum competency and to have meaning behind the high school diploma. However, it led to undesirable consequences: instructional shifts to focus on basic skills and higher high school dropout rates, particularly for underserved students. I think the Minimum Competency Testing movement was probably the movement that told us very clearly that we need to think about the unintended negative consequences and who is affected most.

The No Child Left Behind Act created incentives for educators and students to focus on rigorous content. The goal was for all students to reach proficiency by setting challenging performance standards. Tests were used as the primary measure of student outcomes for accountability, and the primary goal was for schools to focus on education for low-achieving students and to reduce the achievement gap. While the goals of the program were admirable, I think the focus was more on the accountability than on ensuring that instruction was changed for all students.

On-screen: [No Child Left Behind Act (2001-2015).]

Under the No Child Left Behind Act, tests were used to evaluate whether schools made adequate yearly progress (AYP) toward 100% of the students being proficient in the year 2014. It provided rewards and sanctions based on each year’s AYP. As a sanction, a school could be taken over; as a reward, there was teacher pay for performance. The pressure was to raise test scores, and it led to a narrowing of the curriculum, shifting of resources away from those subjects not tested, prolonged test preparation for some students, and, for some schools, a focus on students near the proficient cut scores.

States provided their own definitions of proficient and their own starting points for evaluating adequate yearly progress. The starting point indicated the percent of students who were proficient in the 2002 year. Some states had proficient performance standards set close to the 70th percentile, which is similar to NAEP’s proficient achievement, which is considered challenging.

On-screen: [Table from No Child Left Behind Act.]

This is a table from Bob Linn’s study. He did it across states, but this is showing some extremes with respect to Arizona and North Carolina. It’s for proficiency for grade eight, and I believe it was in mathematics. Arizona only had 7% of the students at proficient at its starting time, whereas North Carolina had 75% of the students at proficient on the state test. Basically, it was going to be much easier for states like Arizona to reach adequate yearly progress who had lower standards than a state like North Carolina who set their standards very high. What is interesting on this table is comparing it with the percent of students at proficient or above on NAEP, where you see the discrepancy is much less between the states.

Another quote from Bob Linn that I use often, “The variability in the percentage of students who are labeled proficient or above due to the context in which the standards are set, the choice of judges, and the choice of methods to set the standard is, in each instance so large that the term proficient becomes meaningless.” I think that’s what we found out with the beginning stages of No Child Left Behind.

On-screen: [NCLB Act and Consequences.]

Stecher has looked at the consequences for a number of programs. For NCLB, he used surveys for this particular study, and he found that some of the positive consequences, based on teachers’ reports, were that NCLB led to improvements in academic rigor and instruction and a focus on student learning, especially for low-achieving students. The negative consequences were that content standards were too difficult for some and too easy for others; more time was spent on tested standards than non-tested standards; more time was spent on tested subjects than non-tested subjects; and a lack of time and resources had a negative impact on teachers’ efforts.

NRC got a panel together where they reviewed research on implications of test-based incentives for schools, teachers, and students. They found that test-based incentive programs were associated with increased student achievement to some degree. However, in the report they stated that the U.S. performance was still lagging behind, as it is today, with the highest-achieving countries.

The largest estimates of achievement effects were for NCLB-like school incentives, and the use of high school exit exams decreased the rate of high school graduation. However, those programs that provided rewards for graduation showed a relationship between the use of the exams and the rate of high school graduation. Providing positive incentives tended to increase the rate of high school graduation.

In another study that examined whether NCLB impacted student achievement using NAEP fourth and eighth grade state data, Dee and Jacob found increases in the average math performance of students in fourth grade across the score scale. There were actually larger effects among black and Hispanic students. There were also increases, but not significant, in the average math performance of students in the eighth grade, and there was no evidence of increased reading performance.

On-screen: [Race to the Top.]

With Race to the Top, the Race to the Top grant called for tests that assess cognitively challenging skills that are difficult to measure and that are designed to create conditions for education innovation and reform. Their theory of action stated that the tests will play a critical role in the educational system, provide data and information needed to continuously improve teaching and learning. Again, the need to focus on the impact on instruction to evaluate the assessment system.

We know that it encouraged states to join the assessment consortiums based on the Common Core State Standards, either PARCC or Smarter Balanced, and it supported measuring teacher effectiveness with state test results. States adopted the Common Core State Standards and PARCC or Smarter Balanced, or they developed their own assessments with Common Core State Standard-like standards. Race to the Top supported measuring teaching effectiveness with state test results and there was a shift in the rigor of the state content standards and performance standards. The consistency between the percentages of students proficient or above on state tests and on NAEP has increased since the implementation of Race to the Top.

On-screen: [Grade 8 Mathematics State Test and NAEP Performance.]

This data comes from some of the Achieve reports, and it shows the discrepancy between the percent of students proficient or above on state tests as compared to NAEP, and this was considerable prior to Race to the Top, which is in that second column. The first column shows the percent proficient or above discrepancy for state versus NAEP testing. If we take that 13, that indicates that 13 states had between 31% and 53% more students classified as proficient or above by the state test as compared to NAEP. Seventeen states had between 16% and 30% more students classified as proficient or above by the state test as compared to NAEP. Only two states had more students classified as proficient or above on the state test as compared to NAEP.

When we look at the second and the third column, you can see that states heightened their content standards, their tests, as well as their performance standards. In that third column, the three indicates that only three states had between 31% and 43% more students classified as proficient or above by the state test as compared to NAEP. In fact, if you look at the 16, that’s indicating that 16 states had between 0 and 22% more students classified as proficient and above on their state tests as compared to NAEP. States were increasing their rigor, as well as their performance, once Race to the Top came into effect. The data for 2017 is very similar.

On-screen: [RTTT and Consequences.]

In examining the consequences of the Common Core State Standards and Smarter Balanced and PARCC tests on curriculum instructional practices, Thomas Kane and his colleagues at Harvard found a relationship between student performance on the math test and the Common Core state implementation strategies used by teachers in grades four and eight. Increased test performance was related to teachers obtaining explicit feedback on how well they were implementing the standards in their instruction after they were observed. Increased test performance was related to the inclusion of test scores in teacher performance evaluations. However, there was no relationship between student test performance and teacher-reported shifts in instruction.

A widely reported unintended consequence of these test-based accountability systems recently was the opt-out movement. This gained national recognition in 2015 when 20% of the students in New York opted out of their state tests. I think through shortening state tests, modifying academic standards, and modifying graduation requirements, at least in New York and some other states, the opt-out movement has decreased.

On-screen: [Alternate Assessments under RTTT.]

In terms of alternate assessments, those assessments for students with the most severe cognitive disabilities, there are two assessment consortiums, or there were two: NCSC, and DLM, which is still running strong. Again, the primary goal of these is to focus instruction on academic content and improve academic learning for these students. NCSC had a theory of action, which Rachel and Claudia and Ellen Forte put together, where they treated the summative assessment as a component of the overall system that was considered in light of other components, such as instruction and professional development for educators. In their theory of action, their long-term outcomes were greater exposure to grade-level academic curricula, higher academic outcomes for students, and high school students ready to participate in college, career, and community.

The theory of action for DLM was somewhat similar, though it had its own features. In terms of their long-term outcomes, one of them was for teachers to understand how to build breadth and depth of conceptual understanding, to make use of the information provided by the test, and to think differently about how to educate students with severe cognitive disabilities.

There’s been some initial studies examining some of the consequences of DLM on instructional practices through a survey. It was found that the learning profiles from the summative assessment provided useful information, but the teachers treated it as a fixed map. The learning profiles identified what skills the students may have acquired or most likely have acquired and then what skills they needed to acquire.

In another study using focus groups, they found that teachers use information from the learning profiles and from the interim assessments, assessments that are given at different points in time in the instructional year, in the development of the IEPs. The summative test results helped inform the IEP general goals. Here was where these assessment consortiums were really focusing on how do we use the information? How do teachers use that information to plan instruction for individual students?

On-screen: [Every Student Succeeds Act (ESSA).]

ESSA, as we know, replaced NCLB. Tests are still used as tools for instructional change and accountability, and the goals were similar to NCLB. With ESSA there’s a narrowed role of the federal government. It allowed states to play a larger role in accountability and stipulated that states could use other measures in addition to tests in the school accountability systems, which is very positive. It helps relieve some of the pressure put on tests.

On-screen: [Use of College Admission Tests under ESSA.]

Under ESSA there was more flexibility, and this allowed states to more easily adopt the ACT or the SAT as their high school accountability test. When thinking about adopting the ACT or the SAT as a high school accountability test, we can think of some positive consequences, one of them being that it affords the opportunity for more students to go to college. Whereas a potentially negative consequence is that it’s not necessarily well aligned with the curriculum for high school students.

In thinking about this, you have to weigh the positive consequences if, in fact, it does allow and does promote more students to go to college. Maine was one of the first states to adopt the SAT, and this was in 2006, way before ESSA. Back then, in order to be used, it needed to augment the math section with additional items to improve alignment. One study found that a two to three percentage point increase in the college-going rate was attributable to the use of the SAT.

In Michigan, they adopted the ACT as its high school accountability test. The use of the ACT in one study resulted in an increase of approximately 50% of students living in economically disadvantaged areas scoring at the college-ready level, but only 6% of these students enrolled in a four-year institution. We need programs to intervene to get these students to enroll in college, if that’s their choice, if they are college- and career-ready. A colleague of mine at the University of Pittsburgh, Lindsay Page, has such programs going on where she tracks students who are doing well and are ready to go to college, through the school year as well as through the summer, to make sure that they are applying and that they know what financial aid they can receive and other benefits.

On-screen: [Pilot Program under ESSA.]

One of the aspects under ESSA was the pilot program. This allowed up to seven states to develop innovative assessment systems. It allowed states to incorporate new measures of student performance that better support student learning, and it provided the opportunity for states to build assessment systems that lessen the tension between the goals of instruction and accountability. They could incorporate competency-based assessments, interim assessments, performance assessments and other forms.

ESSA specified that it should be developed in collaboration with a broad range of stakeholders: educators, civil rights organizations, and parents, to name a few. Results need to be valid, fair, and comparable for all students in subgroups, and states need to report annually how they will ensure that all students receive instructional support. Again, we need to make sure, we need to provide evidence that students are receiving the instructional support. New Hampshire and Louisiana were approved in 2018, and Georgia and North Carolina in 2019.

I’d like to talk a little bit about the New Hampshire program, the Performance Assessment of Competency-based Education, PACE. Within their program, under this pilot program, they have locally developed performance tasks and some common tasks that are shared by the participating districts. The summative state test is used only for a limited number of grades within these districts. The theory of action includes the program’s impact on classroom practices, specifying the assessment design features and the mediating effects required to achieve its goals, and the end goal, of course, is that students are college- and career-ready. It specifies contextual factors that need to be considered in interpreting results, such as district size, and it includes potential unintended effects, such as doing no harm to the performance on this summative assessment for those districts using these performance assessments.

On-screen: [Theory of Action: PACE influence on classroom practices. (Adapted from State of NH, DOE, 2018, p. 55).]

This is the theory of action for this program where it has design features, where it is involving local educators in designing the assessment, providing support, using competency-based approaches for instruction. Some of the mediating effects, that it’s fostering positive organizational learning and change, building local capacity, restructuring the rigor and content representation of curriculum instruction and assessment. Outcomes, changes to the instructional core of student engagement, improved instructional quality, and meaningful content and skills.

On-screen: [Pilot Program under ESSA – NH PACE Program.]

An initial evaluation of their program looked at doing no harm to student performance on the summative test. One study examined the effects of the PACE program on state tests. They found small, positive effects for students in grades eight and eleven in PACE schools in contrast to students not in PACE schools, and PACE schools are the schools receiving the performance-based assessments. Students tended to perform slightly better than predicted on state tests as compared to students not in PACE schools, but with comparable school contextual variables.

On-screen: [Pilot Program under ESSA.]

My hope would be that in the design of locally developed performance tasks, whether it be in New Hampshire or anyplace else, that it would reflect the social nature of learning these particular tasks, reflect cultural practices in the community that shape learning, and therefore it would address some fairness issues. The assessments would be more culturally responsive and help alleviate potential adverse consequences.

Here, I have a quote from Sam Messick. “It should not be taken for granted that richly contextualized assessment tasks are uniformly good for all students, because contextual features that engage and motivate one student and facilitate effective task functioning may alienate and confuse another student and bias or distort task functioning.” My hope, when we think about moving forward with test-based accountability systems, is that we think a little more carefully about how we design these assessments, and that they reflect differing ways of learning in different cultures.

On-screen: [Conceptual Framework for Evaluating Consequences.]

In terms of a conceptual framework, I’m just going to move to this slide. In thinking about how we’re going to evaluate consequences for test-based accountability systems, I’ve mentioned that values are inherent in everything that we do in testing, from test design to specifying what consequences are important, both positive and potentially negative consequences. We need to include stakeholders, and these stakeholders have different values. At times they have competing values.

In thinking about consequential evidence, we need to think about the stakeholders and the competing values of stakeholders in terms of design issues, in terms of use issues, in terms of consequences. We need to think about test score-based consequences, so we interpret scores, we use scores. We use scores to hopefully change instruction, and we need to think about what consequences result from doing that at both the individual and institutional level, the positive and the negative consequences.

As I mentioned before, we use tests as levers of educational change. They serve as tools to make things happen in education, so we need to think about the consequences of those, so what is actually happening in a systematic way to our education at the individual and institutional level, positive as well as negative consequences. Then differential impact or adverse impact; the extent to which some students are being affected more positively and other students are being affected more negatively with respect to the use of tests. In doing this, we have to think about the contextual variables: where children are growing up, the type of education students are receiving, the district, the monies being poured into some districts over other districts when we’re interpreting consequences.

How do we do this? Evaluating consequences—I was involved in one state, in Maryland, as a researcher evaluating the consequences of MSPAP. It’s huge. It takes money and it takes time. In terms of prioritizing our consequences, I think we first have to think about competing values and needs by the different stakeholders.

What are some competing values and needs? Using perhaps the SAT and the ACT, some stakeholders might consider that positive. Others might consider it negative. The more the stakeholders have competing values, then I think those are the consequences we need to focus on, such as using tests for teacher evaluation.

We need to look at the fundamental intended positive consequences, those based on test scores and those based on using tests as levers of educational change. Are things happening in instruction? Are things happening uniformly in instruction across schools? The severity of potential negative consequences. When we look back at the minimum competency testing, the education that some students were receiving was not what was intended. Differential consequences. Adverse impact—that should be a major feature when we think about consequential evidence.

On-screen: [Concluding Thoughts.]

Where do we go from here? ESSA has this pilot program that states are involved in. I mentioned one of them, New Hampshire, that’s using performance-based assessments. New Hampshire’s small, I know, so it’s easier to do there. There’s another program that is going to be used in interim assessments, assessments throughout the year so that data can be provided back to teachers. That’s difficult to do. PARCC thought about using— Gerunda, you were involved in those discussions of interim assessments in terms of what do you assess? Are we going to be following the sequence in mathematics appropriately with interim assessments? It’s a difficult thing, but one program is actually looking into that.

I think when we use test-based accountability systems, they have a goal. They have a goal to increase learning for all students. They have a goal for increasing the quality of instruction for all students. We are obliged to look at whether or not this is happening, so consequential evidence should be at the forefront and money should be put into looking at consequential evidence. When I looked at MSPAP and the consequences MSPAP had on its state—not me by myself, my colleagues at the University of Pittsburgh—money was allocated by the Department of Education for states to use in discretionary ways if it was approved, and that’s how it was funded. We need more of that, to really make sure that the intent of our state assessment systems is actually realized. Are they having an effect on all students? Is the achievement gap closing to some extent? Are we providing the education that we want for different subgroups of students, and also students with the most severe cognitive disabilities?

Thank you very much; I appreciate you coming out.


Ida Lawrence - What I wanted to say now is we have a fair amount of time to take questions. The way we’re set up is there’s two microphones. If people just want to queue up and you can just alternate in terms of—

Suzanne Lane - Sounds great, if anybody has questions or comments or would like to provide your beliefs or thoughts on the subject…

Speaker: Drew Atchison, (Audience) - Thank you for your talk. You talked a lot about values and understanding values are important when you’re evaluating unintended consequences. Can you give an example of how you went about understanding different values and how that guided your evaluation of whether unintended consequences then occurred?

Suzanne Lane - I think a classic example is when we develop a state summative assessment, the nature of the assessment is going to guide what content standards we can actually assess. If we look at the Common Core State Standards, those content standards, some of them cannot be assessed by a lot of the summative assessments.

What we value in those content standards and what we value in putting on the test are going to be at odds with each other, because some of the standards that actually require more performance-based, students actually doing a lot, are not going to be fully assessed. That’s one case where differing values, if I developed a different form of assessment, if I had a more balanced assessment system where we were assessing things throughout the year using different formats, then I would be able to bring in all of the content standards to be assessed. But it’s impossible for most state end-of-year assessments.

Speaker: Michael Feuer, (Audience) - Can I pick up on that? Thank you very much. Speaking of values, I think this was a wonderfully valuable presentation. I want to thank you. It’s nice to see every so often someone who can actually bring together so much of this context and these issues. Just for the sake of a plug for the measurement community, it is interesting to me that some of the most, shall we say, critical judgments about the uses of assessment are coming from the community that is responsible for producing them. I’m not sure that’s something I would say about other instances of production of technologies that do have unintended consequences, where the people who are producing them are actually as willing as in this case ETS® is to say “Look, some of this stuff has unintended consequences.” Anyway, that’s just a passing comment.

My question about values, I’m not sure if it’s a question, but much of what you’ve been talking about are the kinds of values that go into the actual decisions about what content is included and in what context the tests might or might not be used. But there’s another part of the values thing, which I think your talk illuminates, and that has to do with how we as a society weigh the downside risks of certain kinds of unintended consequences against the potential benefits.

I don’t think we have a real theory of how those decisions are made. For some people, the possibility that there will be teaching to the test is unfortunate, but it is justified by the other good things that happen as a result of accountability systems. For some people, the idea of teaching to the test or narrowing the curriculum is so abhorrent that they are willing to rule out the possibility that the positive effects would make up for that. Do you see what I’m getting at? There’s a values judgment that comes about as a result of even thinking as to what to do with these unintended consequences. I don’t know if you’ve given that some thought or whether you know where we might go on that.

Suzanne Lane - Thank you, Michael. I think when we’re looking at the potential unintended consequences and what we value, if the positive consequences are of more value, I really think we have to look at adverse impact. If we are adversely impacting subgroups of our students who traditionally have been underserved, then I think those negative consequences outweigh the positive consequences.

I go back to thinking about the Minimum Competency Testing movement, and that was supposed to help put meaning behind the diploma, to ensure all kids get certain skills. But it really did just do harm to a large number of students in this country by narrowing the curriculum. It was drill-and-kill time for some kids and these kids were typically kids living in economically disadvantaged areas—black students, Hispanic students, English language learners. When we think about a theory of evaluating the consequences, I think we have to think hard about what are the potential negative consequences for different groups in our country, and that should speak heavily in terms of when we’re looking at the positive consequences.

What are your thoughts, Michael?

Michael Feuer - I agree.


Suzanne Lane - I mean, these tests, we’re closing the achievement gap, but the problem is tests can’t close an achievement gap. The educational system needs to be closed—the gap needs to be closed for the educational system.

One last question? OK, Ida.

Ida Lawrence - I can kill two birds with one stone. I’ve been told to say after the last question, which maybe this is, to please join us. We have a reception down the hall in the Holman Lodge—Lounge, which is on this floor.

Suzanne, my question is: I’m trying to pick up on something you just said about how can we improve the situation through our testing systems? I was wondering if you had any comments around technology, artificial intelligence, any of these new tools that people are so excited about. If you had any thoughts about how maybe five to ten years from now we would feel that some of the unintended consequences had been—or the negative ones—had been mitigated. Also the thinking around how, unfortunately, the technology could actually introduce other unintended consequences.

Suzanne Lane - A lot of work at ETS® and elsewhere is looking at game-based assessments. That’s more of a performance-based assessment, game-based assessments. They’re actually doing things, demonstrating what they know through the games. It can be very motivating for some students, though we have to be careful because what is motivating for some students is not going to be motivating for other students.

Anecdotally, I heard that maybe boys are more attracted to game-based assessments than girls, so what do you do? You have to try to think about how do we engage girls in game-based assessments? What is the context of those assessments that are engaging them? I haven’t read anything or seen anything. Some in the audience might know if there’s been differences in different subgroups on game-based assessment.

I think whatever we can do to engage students in learning and to marry assessment with instruction.

If the computer, like states are always thinking about “Let’s put the assessment on the computer,” but they’re not using it in instruction, so how do we marry instruction with assessment? I think what Gerunda brought up in terms of more of a balanced assessment system will help, where we’re not so tied to this one end product of a summative assessment. Though it has its purposes, clearly needed to some extent, but to not have the stakes so high for students and teachers with the summative assessment.

Thank you. I appreciate it. If you would like my slides, I can get you—

On-screen: [ETS® Copyright © 2019 by Educational Testing Service. All rights reserved. ETS and the ETS logo are registered trademarks of Educational Testing Service (ETS). MEASURING THE POWER OF LEARNING is a trademark of ETS.]

End of Video: ETS®. Test-based Accountability Systems: The Importance of Paying Attention to Consequences.
