Do Educational Assessments Yield Achievement Measurements?
Mark D. Reckase, Michigan State University

On-screen: [ETS®]

On-screen: [The National Press Club, Where News Happens, press.org]

Ida Lawrence - Okay, good afternoon everyone. Thank you all for coming, and welcome to the 16th annual William Angoff Memorial Lecture. My name is Ida Lawrence, and I'm the senior vice president at ETS. I lead the research and development division at ETS.

Some of you may be aware of sort of the background around the Angoff Memorial Lecture Series, but just to give a little background, it was established in 1994. And it was established to honor the life and work of Bill Angoff, who died in January 1993. Now, for more than 50 years, Bill made major contributions to educational and psychological measurement and he was deservedly recognized in the field. And in keeping with Bill's interest this lecture series is devoted to relatively non-technical discussions of important public interest issues related to educational measurement.

One of the things that Bill Angoff took a lot of pride in doing at ETS was in being able to clearly explain technical material to people who didn't necessarily have that expertise. The other thing he took great pains to do was to mentor junior scientists and psychometricians and all sorts of other people at ETS, around some technical issues where he could help them learn how to communicate effectively to non-experts.

This year's Angoff Memorial Lecture will be delivered by Dr. Mark Reckase, University Distinguished Professor Emeritus of measurement and quantitative methods at Michigan State University. He is a specialist in the development of educational and psychological tests, educational policy related to testing, and the psychometric theory behind the assessment of cognitive skills and content knowledge.

His topic today is "Do Educational Assessments Yield Achievement Measurements?" In his talk today, Doctor Reckase will discuss two main perspectives on how tests are designed and developed. The first perspective is that individuals differ along a continuum, and the goal of test design is to get information to accurately identify where a person is on that continuum. Now, the second perspective on test design is that there is a domain of knowledge and skills that a person should have, and the goal is to find out how much the person has acquired. So in comparing the two perspectives or approaches, Dr. Reckase will examine areas of possible conflict between them, and he will also provide recommendations to help clarify the issues and minimize problems that policymakers could encounter from interpreting the results of assessments that are built under each approach.

So we're really delighted to have Mark here to give this year's Angoff Lecture. He's made several important contributions to knowledge about standard setting, computer-adaptive testing, and models for the evaluation of educators, just to name a few. In 2009, he wrote Multidimensional Item Response Theory, which is a book that was published by Springer, and he's currently writing a book which will focus on how educational standards are set. Mark has served as president of the National Council on Measurement in Education, vice president of Division D of the AERA, and editor of several journals. In 2016, he received the NCME Award for Career Contributions to Educational Measurement. Now, prior to coming to Michigan State, Mark served as an assistant vice president in assessment innovations at ACT, and earlier as a faculty member at the University of Missouri. I'd like to stop there, and please join me in welcoming Mark Reckase.

Audience - [Applause.]

Mark D. Reckase - Okay, thank you, Ida, for that wonderful introduction. When I hear those introductions I always wonder if it's really me. I hadn't really expected to be in this kind of position at this point in my life, but I appreciate it and I'm really glad to be here. I'd also like to thank all of you for coming. It's really a pleasure to talk to different audiences. This is an audience that I don't know all that well, so I'm a little bit nervous about it, because I want to try and connect with all of you, so I'll give my best shot at that.

I also want to thank ETS for inviting me to give this Angoff Lecture. I knew Bill Angoff, not well, but I'm old enough that I actually met him. I thought he was a really wonderful person and gentlemanly. As I've said to people before, I thought he was the best-dressed psychometrician around, because he was always well dressed. He always looked really good. I'll dress up occasionally, but my students knew that when I first went to Michigan State, I changed from the suits and ties that I wore when I worked at ACT to the lumberjack look. You know, jeans and plaid flannel shirts. I was a faculty member looking like that for a while. Since then, I've regressed back and I tend to dress a little bit better than I did for a few years.

Before I get into the details of this on the slide — there's only about 25 slides or so, so I'm not going to have a huge thing. The slides are there more for you to have something to look at instead of looking at me. I'll use the slides to guide what I'm going to say, but honestly, I don't know what I'm going to say, until actually I get to that slide. I do have, actually, a long script. It's a 17-page outline of what it is, and I have to write a report for ETS afterwards that they're going to publish. This will be written up and you'll get a formal copy of this at the end.

Before I get into it, I want to give a little motivation for why I'm doing this particular presentation. Over the years, I've sat in a lot of meetings — and I'm sure if you're here you probably have sat in a lot of meetings, too — most of these meetings had to do with test design and development. When I worked at ACT I was in charge of those meetings, which was good. I could kind of push the direction of where I want to go to. But, now I go to a lot of meetings that have to do with state assessments or NAEP or some other kinds of assessment programs — language testing programs and various kinds of things.

When I'm in those meetings I am often confused. The reason I'm confused is because I have certain ideas in my head about how you go about designing and developing tests, and oftentimes the people who I'm talking to around the meetings seem to have a whole different idea about how to go over and design and develop tests. I'm trying to get into their head and understand what they're doing. It's taken me a lot of years to sort of maybe get an understanding. What I'm going to try and present today is what my understanding is of these different positions. Now I'll make them a little more extreme than they probably are in reality to make a point, but I'll try and show why there's a conflict and why we have some trouble when we start getting into educational assessment, because there is some confusion about what it is we're actually trying to do. That's sort of my overall goal for doing this.

On-screen: [

Slide 2, Direct Measurement vs. Technology of Measurement.]

Mark D. Reckase - Before doing that, since this is about educational measurement, I want to talk a little bit about measurement. Now, I could talk about this in a very abstract way. There are terrible books on the theory of measurement. I've got a three-volume set in my library at home on the fundamentals of measurement that's almost impossible to read. So I'm not going to do that. I'm not going to get into the details of it. I'm going to try and get you to recall back to your days as an elementary school student, which is where I learned about measurement. I learned that you have a ruler and that you have a line and you can go and measure how long the line is. I learned that you can have a scale and you can put things on the scale and you can measure weight. If you were to go back far enough, you might have balance beams that you could go over and put something on one side, and you could put weights on the other side. I think they still teach that to kids. There's also time. You have stopwatches to go over and see how long it takes people to run a certain distance or work on these things. So we have these sort of direct measurement ideas that we learn as children, and that we carry on into adulthood. I think it shows the perspective that we bring to the idea when we think about measurement. I first think about measurement in these ways. But the reality of the world is, that stuff that you learned in elementary school is really not the way measurement works.

When I came over here from the hotel that I was staying in last night, I actually got my phone out and I got into Google Maps, and I said, how long is it going to take me to walk over to this place? It actually showed a path and it told me exactly how long the distance was, and how long it's going to take me to walk. Now, I don't know how it could do that, because it doesn't know how fast I walk. I know from other experiences that, that number they give you for distance isn't absolutely accurate, because it depends upon whether you're walking at the curb or in the middle. It's off a little bit sometimes, but the fact is it does — and I don't know if anybody knows how it does this, in the details of how that number comes up — if it says that it's a quarter mile or something or other, or so many yards, or so many meters, or whatever it is. How does Google do this? Now, I have some ideas. It's got a satellite up there, and it's got GPS things, and it's got all this stuff where it's got maps that it's produced over a period of time, and it's probably making approximations from the maps and the GPS location. But it's telling me this distance even before I've walked there, so it's got to know where the Press Club is, and where the hotel is, and all this stuff. There's a lot of technology that's behind this.

Basically, the point of all that is, in the real world, we do things in a really complicated, technical way, but we interpret them in the same way we did when we were measuring things with a ruler. We could, in theory, get a long tape measure and run it from the hotel down to here and work this out in detail. The same thing is true with weight. I've got a kitchen scale; I put some chicken on there and it tells me how much it weighs. I usually do it because it's frozen and I want to know how long it's going to take to defrost when I throw it in the microwave. But how does it do this? It's not the balance beam.

What I think is happening is that there's some kind of a crystal or piece of metal in there, and the weight of this thing changes the resistance to the electrical current that flows through this thing, and all of a sudden on a digital panel in front of it, it tells you how much this thing weighs. What it is, there's a physical model in the background of this thing, a model that has to do with the electrical resistance versus the amount of force that's on this crystal, and we're using that model to convert the weight of this object into a number on a screen.

So this is a complicated model. I do the same thing with time. I've got my phone here. I should put it out here so I can say, it's 2:16 right now, or I could start my little — but, how does it do this. How does it know? The 2:16 I think it picked up from the cell tower someplace. But if I start my little stopwatch inside of this to keep track of how long I'm taking, well, I think what's really going on here, there's a crystal inside that's vibrating a certain amount when the current goes through it, and it's counting the number of vibrations, and it has so many thousands of vibrations per second, then it counts up those things and it's telling how much time.

There's a physical model behind this, and we're using the physical model in order to go over and convert the length of time that passes through this physical thing into the number that shows up on my screen. It doesn't matter, in some sense. We can still treat this as the stopwatch or a clock with a pendulum and it's counting movements of a pendulum. We're getting the same number on the same scale. The important point out of all this is we're getting these now through a sophisticated scientific model that connects our observations — our distances and everything else — through stuff, into something that gives us the number that comes out on the screen. This is pretty complex.
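As a rough illustration of the stopwatch point, here is a minimal sketch of how a count of vibrations could be turned into the number on the screen. The 32,768 vibrations-per-second figure is an assumption (it is a common value for watch crystals; the talk only says "so many thousands"), and the function names are made up for this example.

```python
# Assumed crystal frequency: 32,768 Hz is common in watches, but the talk
# does not say what frequency the phone actually uses.
TICKS_PER_SECOND = 32_768


def elapsed_seconds(tick_count: int, ticks_per_second: int = TICKS_PER_SECOND) -> float:
    """The 'physical model': convert a raw vibration count into seconds."""
    return tick_count / ticks_per_second


def format_stopwatch(seconds: float) -> str:
    """Report the result on the familiar minutes:seconds scale."""
    minutes, remainder = divmod(seconds, 60)
    return f"{int(minutes):02d}:{remainder:04.1f}"


if __name__ == "__main__":
    ticks = TICKS_PER_SECOND * 90  # what the counter reads after 90 seconds
    print(format_stopwatch(elapsed_seconds(ticks)))  # prints 01:30.0
```

The user only ever sees the last line's output; the count and the conversion stay hidden, which is the same arrangement the talk goes on to describe for test scores.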

Now, the reason for bringing this up is, educational testing in psychometrics and everything, all the testing that we do, does essentially the same thing. People can think about this, as well. When I was in school I had a teacher and they gave me a test, and at the end we might have been able to score it ourselves. We'd go over and get points, we'd count them up and we'd get a total score, and then I got a score on the test that told me how I did. I could compare it to the kids who were sitting around me and tell me, oh wow, I'm up here. I got 17, they only got 14. I did better. This is sort of a model that we have in our head about how educational measurement works. Or, the 17 might be converted into a percent correct, and we think 70% is passing; if I got above 70% then I'm doing okay.

Well, that's not really the way large-scale assessments work. We go over and we give these items, we get the responses. Depending on what sort of responses they are we have psychometric models that go on in the background — usually item response theory models. We go over and use those models, which are based on the idea that people who are more able will be more likely to get this item right than people who are less able. And then we do scaling and we get a number out, and then we report this number. It's a number that's sort of like that number, how many items did you get correct, but it's not really. It's a thing that's been translated through this complicated model. We like to make that model hidden in the background because we don't want to confuse people with it.
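To make the idea behind these models concrete, here is a minimal sketch of one common form, a logistic item response function. The talk does not say which model any particular program uses, so the formula, the parameter names, and the example values are all assumptions chosen for illustration.

```python
import math


def p_correct(ability: float, difficulty: float, discrimination: float = 1.0) -> float:
    """Probability of a correct response under a two-parameter logistic model.

    ability and difficulty sit on the same scale; discrimination controls how
    sharply the probability rises as ability passes the item's difficulty.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


if __name__ == "__main__":
    item_difficulty = 0.0  # hypothetical item in the middle of the scale
    for ability in (-2.0, 0.0, 2.0):
        print(ability, round(p_correct(ability, item_difficulty), 3))
```

The only property the talk relies on is visible here: for a fixed item, a more able person has a higher probability of getting it right than a less able person.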

But this is what makes us, as people who work in the area of psychometrics — I don't know if it's we keep this secret so we'd be more valued. I know this secret of how all this is done. When I was talking there, I have to watch out for getting lost in stories as I get into this, but I'll tell one here because I think it's kind of fun. When I was working at ACT, every once in a while, I'd get a phone call transferred to me from the front desk because they didn't know how — they were looking for somebody to answer this, and they didn't know who to give it to, so they'd give it to me. The kind of question it would be was, my son or daughter got this score on the ACT test — which reports scores on a 1–36 scale — and they'd say, "How is this score computed? They took a test that's got 60 items in the mathematics test, but they only got 30 on this test. How come they only got 30 instead of some bigger number?"

Originally, when I would get these calls, I would start going into this thing, "Well, okay there's an equating process. We're trying to go over and convert all these things onto a common scale which goes back to 1968 or something, and we go through all this process and everything." And then people say, "No, no, no. I don't want to know all that. I just want to know where this score came from." Eventually I would just say, "Well, there's a table, and you go over there and you get the number correct score and in the column next to it you look in that column next to it, and that's the score they're going to report on the score report, you know. It comes from this table." They'd say, "Oh, good. Can you send me a copy of that table?" [laughter] and I'd say, "Sure." And we'd send them a copy of the table. This technology that's in the background, people don't really want to know. Likewise, I never really thought about how Google gets the distance or how my kitchen scale works, until I was making this presentation. I started wondering, how does all this stuff work? Maybe at the reception afterwards, people who know the answers to these things can inform me about all that.
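The "it comes from this table" answer can be sketched as a plain lookup. The numbers below are invented for a toy five-item test and are not taken from any ACT conversion table; the two forms and their names are also hypothetical, included only to show why each form needs its own table if scores are to land on a common scale.

```python
# INVENTED tables for a toy 5-item test. Form B is imagined to be slightly
# harder, so the same number correct maps to a slightly higher scale score.
CONVERSION_TABLES = {
    "form_a": {0: 1, 1: 8, 2: 15, 3: 22, 4: 29, 5: 36},
    "form_b": {0: 1, 1: 9, 2: 17, 3: 24, 4: 31, 5: 36},
}


def scale_score(form: str, number_correct: int) -> int:
    """Look up the reported score for a number-correct score on a given form."""
    table = CONVERSION_TABLES[form]
    if number_correct not in table:
        raise ValueError(f"number correct {number_correct} is not in the table")
    return table[number_correct]


if __name__ == "__main__":
    print(scale_score("form_a", 3), scale_score("form_b", 3))
```

All of the equating work the caller did not want to hear about lives in how the table entries were produced; the lookup itself is as simple as described.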

The main point, though, is that the real way this all works is that we have models of the relationship between performance on these particular tasks that make up a test, and where a person is along some kind of a scale, and we use the responses through that model to go over and figure out where they are. It's the same thing we're doing with time and weight and other things. So, this is sort of the major idea.

On-screen: [

Slide 3, My Personal Context for this Topic.]

Mark D. Reckase - Okay, now, in order to let you know a little bit about how I came into this, I'm going to tell a little bit of personal history, not because I think it's particularly interesting or I'm trying to go over and show off or something, but because it gives you a context, so you can understand where I'm coming from. I was one of these undergraduate students. I went to the University of Illinois, which used to be called Champaign-Urbana, but now they call it Urbana-Champaign. I think they moved the office across the border between Urbana and Champaign and they have to change the way it's listed.

But I started out as a math major, because I liked mathematics in high-school, and I thought, okay, if I liked it in high school I'll love it in — I found out when you get into upper level mathematics it's really abstract. I actually was taking an advanced course from the head of the department out of a book by Hardy, which is called Pure Mathematics, and I said, "This is too much." I still have the book on my shelf. I ought to go back to that book now and see if I like it any better now than I did when I took the course as an undergraduate. I started searching around for other majors, and I tried English literature. I found out I was not a good enough writer to be in English literature. I tried philosophy — we talked about being too abstract, that was kind of a little crazy. I tried human biology, or human physiology.

Eventually I ended up in psychology. There's a long story — that would be way too long, we'd run out of time if I told that story, about how that happened — but I specialized in psychology in the area of social research, in research and social psychology. I was particularly interested in one area, which is attitude change. I thought this is a great thing to study. Of course, if you're going to be studying attitude change — this is how you change attitudes and how people's attitudes get changed by events that are around them — when you're studying that, what you have to do is, first of all, know what the attitude was to begin with, and then see where it is later.

This required you to create a scale of some sort to measure the change in attitude. Then, we were looking at things that are still current, amazingly: attitudes towards the use of nuclear power (are you for or against it, positive or negative towards it?) and issues related to the environment (do you believe in global warming?). Are you a strong believer that global warming is caused by human activity, or do you believe the opposite? You get a continuum of these attitudes, and then you do research studies to see what kind of a message would change a person's attitude, or what events in life would change those attitudes. I started out from that perspective.

After I left, I was planning to go to graduate school and study the same thing. My advisor, when I got to graduate school, the person who was assigned to me — who actually I think assigned me to him because he was the head of the department — was Eric Gardner, who was a really famous guy in measurement. He said, "Oh, the social science program is full, would you mind going into psychometrics for a year?" [Laughter.] I didn't know any better. I think he saw high GRE scores in quantitative and he saw I had been a math major and everything else. "We don't have anybody in psychometrics, and I'm the head of the department. I'll take you on as an advisee and you'll go into this area." Anyway, I stayed there and I did a lot of work with him over a period of time, and learned how psychological measurement really worked, and learned that from that point of view.

Now, what happened, though (and I don't want this story to be too long), is that when I finished graduate school, my first job was in educational psychology instead of psychology. This is another long story. It had to do with the Vietnam War: I was looking for a job in the middle of the year, in January, instead of in the usual hiring sequence for the fall, and there happened to be a position at the University of Missouri that had not been filled in that sequence, so it was open in January. My advisor knew the person in charge of that position. I went over and interviewed and ultimately got the job.

Now, I think, looking back on this, and I tell people this, I messed up that interview badly. They must have been desperate [laughter], because I knew psychological measurement, and this was an ed-psych program where they wanted to know about education. They asked me about criterion-referenced testing, and I had no idea what criterion-referenced testing was. I thought they were talking about criterion-related validity. I answered the question as a validity thing, about correlating my test score with external variables and all this kind of stuff. I think that guy must have said, "What? He doesn't know what he's talking about." [Laughter.] And I didn't, but they hired me anyway, so they must have been desperate.

So, anyway, I got into that, and then the first thing they had me do was to teach a classroom assessment course to prospective teachers in a large lecture, and I taught that course 25 times in a row. At that particular institution, I would teach two sections each semester, and then I taught it at summer school. That's five times each year for five years straight, so 25 times. I got to know this stuff on classroom assessment cold. I just really got this down. It's a great way to learn things, to teach it over and over again. The bottom line is that, from all that, I got two different perspectives on the idea of testing. One was this psychological perspective that has to do with what people might call trait measurement, or things working with hypothetical constructs. Then, I got this concept of educational measurement, which had to do with trying to figure out what kids are learning in classrooms and then assessing that. That's really what this whole talk is about.

On-screen: [

Slide 4, Two Perspectives on Test Design and Development.]

Mark D. Reckase - I got these two perspectives, and I'm just going to be emphasizing them, and I'm going to try and talk about these two as we go through and then look at the consequences of having these two perspectives and what that does to the way we develop tests. The first perspective, I'll call the continuum perspective, I could call it the trait perspective or the hypothetical construct perspective, but I'll try to use a more general term. You think that people are ordered in some way, and that the purpose of the test is to try and find the location of people along this particular continuum that you have. It could be attitudes towards nuclear power. It could be spatial relations ability. It could be math, arithmetic mastery. Or you could define a continuum and then you could go over and think you're going to ask questions to figure out where a person's located on that continuum. I'll give you some more examples in a minute. That's one approach.

The second approach is what I'm going to call the domain approach. This says, okay, in our educational system, for an academic year, we have a domain of knowledge and skills that we would like students to acquire through that education. This is what our goal is. Through our achievement testing program, what we would like to do is to find out how much of the knowledge and skills that we intend them to learn do they actually acquire. This is a different model. You could argue — maybe I'll get an argument from the ETS folks tomorrow — that, well, you're defining a continuum there. It's a percent of the mastery of the domain. Well, it's a different continuum. We'll get into that a little bit.

To break this up, so that I don't talk too fast and get myself in trouble, I've got these metaphors, visual metaphors, to help set these things up. The one for the continuum is this railroad track. This is out in the west. The railroad tracks run a long distance out in the west, and there aren't many stops. We can say, okay, here's my metaphor for the continuum idea. We're trying to find the location of where somebody is on this railroad track.

On-screen: [

Slide 5, Metaphor for the Continuum.

Finding a person's location along the continuum is like finding his/her location along a train track.]

Mark D. Reckase - It actually works out really well, because when you've got this continuum idea you don't really want to get off the continuum. The train doesn't usually get off the tracks unless you have a disaster. It also works pretty well that the railroads, when they do this, they know where the train is because as they go over and pass certain blocks of tracks, it sets off a signal and they've got a little electric marker. They can say, well, it passed all these things; it hasn't passed those things yet, so we know where the train is. This is the idea of the continuum, and this is the idea with what we want to do with testing to go and find a person's location on the continuum.

The other one I've got — which I thought was really funny — this is actually a picture of a statue that's in Leuven, Belgium. It's right in the middle of the square in the middle of the city, and it's a — I always looked at this and when I saw it I just laughed because this is the idea of education. Here's a student studying and we're pouring knowledge into the top of their head. Now, they used water, but the way I think about it more, because we're in this domain concept, is it's a smoothie. It's not water, because they put all kinds of stuff in a blender, they mashed it up, and they're pouring it in. Then, we can tell how much they've learned by putting a dipstick in and we can see what the level of the fluid is that's filled them up. This is sort of the visual metaphor for domain assessment.

On-screen: [

Slide 6, Metaphor for Achievement.

This statue is in the center of Leuven, Belgium. Although, this picture shows water poured in, I think of it more as a smoothie – a mix of all of the things the student is supposed to learn. We can measure achievement with a dipstick.]

Mark D. Reckase - Now, the actual funny thing about this picture, though, is — I didn't show the whole picture and you're only seeing the top part so we can see what it's about — all this stuff's leaking out of this person at the bottom, because it's a fountain. It kind of keeps recycling, so they're forgetting. They don't remember all of it. [Laughter.]. So, the water is not just filling this stuff up; it's also coming out as you're going through it. Anyway, these are just to help you kind of think about these two ideas of a continuum and a domain.

On-screen: [

Slide 7, Developing a Test Using the Continuum Model.]

Mark D. Reckase - Let me get into, now, a little bit more details here. Actually, I was going to put my glasses on, but the lights are so nice and bright up here that I can read this without my glasses, so that's good. When I was working for my advisor at Syracuse University, I actually worked on a thing called the Adult Development Study as a research associate after I finished all my coursework, and I stuck around there for quite a while. I was one of these six-year graduate students who didn't want to leave town. Life was pretty good. They were paying me pretty well. We developed all kinds of tests. They got a grant to study the adult population of the United States, with the idea, are these people re-trainable.

They were worried about — as we still are — as jobs change and everything else, can the people who have been in the workforce, can they be retrained? We wanted to know the capabilities of the average adult population. People who were out of school, starting in their twenties all the way up into their nineties. We developed a whole bunch of tests, basically based on the continuum idea, because there isn't any curriculum we were trying to assess or anything else, about saying how good are these people in mathematical skills, how good are these people in vocabulary skills. Do we have any evidence that they were able to learn things after they got out of school and they're continuing to learn?

We created some wonderful tests. There was one that was called Death and Disease. This was the idea that, as you got older, if you're paying attention, you have to learn things about death and disease because they are things that you personally worry about. They figured that if older people knew a lot about this, that showed they were still learning. We thought it was funny, because I was in my twenties at the time and I was kind of, oh okay, Death and Disease Test. We made up a test of slang, and it had slang from different periods, so we could see if people were learning new slang. I think the deal there is, do you have grandchildren? If you have grandchildren you'll learn new slang; if you don't you're stuck with your old slang. We created all kinds of interesting tests based on this continuum model.

Basically, the idea was, find items that are highly related to the continuum you're trying to assess. What that means is that you expect that people who are at different points on the continuum are likely to answer the items differently. From the point of view of achievement testing, you expect that people who are way up on the continuum will likely get items right, and people who are lower down on the continuum are likely to get items wrong. You look for these kinds of items. Actually, at ACT I found some wonderful items like this; Pythagorean theorem items are great for defining locations on a continuum. Then you select ones that are of varying difficulty, so you can find out where people are on the continuum overall. Then you develop a data analysis procedure to figure out where they're located. Now this is actually item response theory, but there's an assumption here that items give a lot of information at particular points on this continuum, and you space items along the continuum to try to figure out where people are located. Then, the pattern of responses helps you figure out where they are.
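The last step, using the pattern of responses to figure out where a person is, can be sketched as follows. This is a minimal illustration and not the procedure used by any program mentioned in the talk: it assumes a one-parameter logistic model, five hypothetical items spaced evenly along the scale, and a simple grid search for the location that makes the observed pattern most likely.

```python
import math


def p_correct(location: float, difficulty: float) -> float:
    """Assumed one-parameter logistic model for a single item."""
    return 1.0 / (1.0 + math.exp(-(location - difficulty)))


def log_likelihood(location: float, difficulties, responses) -> float:
    """Log probability of the whole response pattern at a candidate location."""
    total = 0.0
    for difficulty, correct in zip(difficulties, responses):
        p = p_correct(location, difficulty)
        total += math.log(p) if correct else math.log(1.0 - p)
    return total


def estimate_location(difficulties, responses, low=-4.0, high=4.0, steps=801) -> float:
    """Grid search for the location that best explains the responses."""
    grid = [low + i * (high - low) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda loc: log_likelihood(loc, difficulties, responses))


if __name__ == "__main__":
    difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical items, easy to hard
    responses = [1, 1, 1, 0, 0]  # right on the three easiest, wrong on the rest
    print(estimate_location(difficulties, responses))
```

With this pattern the estimate falls between the third and fourth items, much like knowing that the train has passed some signal blocks but not the next ones.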

On-screen: [

Slide 8, A Fanciful Example.]

Mark D. Reckase - Now, I have an example that I actually use in class all the time. I call this my fanciful example. You have a friend out on K Street. In Washington D.C. on K street. The only reason I picked K Street is because the AERA offices are on K Street. This person's out there on K Street, and after this whole talk is done and you went to the reception, you're supposed to meet up with them to do something later on, but you don't know where they are. They just said, "I'll be on K Street doing something." But, they have a cell phone; you're supposed to call them. They have a cell phone, and their cell phone — they dropped it or got it wet or whatever it is — and it doesn't work right. It'll receive text messages, but it can only respond "yes" and "no." It doesn't work. You can send a message saying, "Where are you?" and they can't respond to tell you where they are. They can only say "yes" or "no." What you have to do is think of what questions can you ask of this person so that you can find them, and I say over here, get within ten feet of them so you can see them, so that you can connect up with them after the meeting is over. This is the task that you have.

You have to think, what questions would you ask to find this person on K Street in order to be able to meet up with them to do whatever you're going to do later in the day. This is actually, really, the equivalent of the continuum idea if I'm asking questions to find the location of a person. Turns out that if this person and you are both knowledgeable of Washington D.C. and K Street — you have to be knowledgeable about it, you have to have content expertise, you can't just do this, and they have to know the answer to your questions — that you can locate a person on K Street with about 11 questions. It doesn't take very many to go over and get down to this — actually five feet in 11 questions. You can really zero in.

You could ask a question like — and I've got it on another slide coming up — are you west of 12th Street, on K Street? And they could say, "Okay. Yeah." Then if they say yes, you know they're over on one side of town, if they say no, it's the other side of town. Then you can go over and say, okay, I know you're on one side or the other side, now I'm going to go over and find a landmark that's in between and I'll ask about that one. Then have you answer this. You keep zeroing in until you get in a certain region, and then you find out where they are.
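
The zeroing-in procedure described here behaves like a bisection search. The sketch below is only an illustration and was not shown in the lecture: the 20,000-foot street length (roughly four miles), the five-foot tolerance taken from the "five feet in 11 questions" remark, and the function name `locate` are all assumptions chosen for the example.

```python
def locate(true_position, street_length=20_000.0, tolerance=5.0):
    """Find a position on a line using only yes/no answers.

    Each question is "Are you west of point X?", where X is the midpoint
    of the stretch of street that is still possible. Every answer halves
    that stretch. Distances are in feet; street_length and tolerance are
    illustrative assumptions, not measurements of K Street.
    """
    low, high = 0.0, street_length
    questions = 0
    # Keep asking until the midpoint is guaranteed to be within tolerance.
    while (high - low) / 2 > tolerance:
        midpoint = (low + high) / 2
        questions += 1
        if true_position < midpoint:   # the friend answers "yes"
            high = midpoint
        else:                          # the friend answers "no"
            low = midpoint
    return (low + high) / 2, questions


estimate, questions = locate(13_337.0)
print(f"estimate {estimate:.1f} ft after {questions} questions")
```

Under these assumptions the search always stops after 11 questions, because 20,000 feet halved 11 times leaves a stretch of less than ten feet. A different street length would change the count by a question or two.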

On-screen: [Slide 9, screen shot of map showing K Street and three arrows pointing to locations along K Street.]

Mark D. Reckase - This works out pretty well. It could work. The question is — and I've got a diagram here — suppose that initially we had this person who's over by Rock Creek Park, so you found them, would you use the same set of questions for another person who happens to be over near Union Station? I'm hoping you're all going to say, "No, I wouldn't. Why would I do that? It doesn't make any sense," you know, ask questions about Rock Creek Park, when I know that they're over here at Union Station. Basically, what you would do is you would customize the set of questions you would ask depending upon the responses you got from them so that you could zero in on them. This is exactly the idea of computerized adaptive testing. You can play with this if you want to, with your kids. You can tell them, "I'm going to send you a text message. You can only answer yes and no, and see if I can find you." Be an interesting kind of activity.

There are some assumptions here, that the person knows the answer to your question, then they say 12th Street, I don't know where 12th Street is, but they can only answer yes or no, so you don't know. This is like guessing on a multiple-choice item. You don't know if they really know the answer. When that happens it means you've got to go over and give more questions, because you have to compensate for the fact that they don't really know what's going on. This idea, this concept of this continuum — K Street's the continuum. I could have used Pennsylvania Avenue. It doesn't have to be east and west and north and south; it can be even a curved highway. I could use Highway 1 when I go down and talk about this at ETS, and say, well, you've got some road or street that's nearby. I can't remember what the names of all those streets are.

On-screen: [Slide 10, Elaboration on the Continuum Model.]

Mark D. Reckase - The idea is, you get the answers to this, you get this pattern of responses, and from the pattern of responses you can find the location on the continuum. This is really the basis for item response theory, for what's called Guttman scaling, for computerized adaptive testing. This is all the technology that's behind these things that we're using. This is the model that we use when we're going through and getting a score that's related to this set of responses that you get off the test. It's used on NAEP; it's used on a lot of state assessments, and used a lot of different places.
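
As a rough illustration of finding a location from a pattern of responses, the sketch below estimates a person's position under the Rasch model, the simplest item response theory model. The lecture does not say which model or estimation method is meant at this point, so the model choice, the Newton-Raphson procedure, the item difficulties, and the function names are all assumptions made for the example.

```python
import math


def rasch_probability(theta, difficulty):
    """Probability of a correct answer under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def estimate_location(responses, difficulties, iterations=25):
    """Maximum-likelihood estimate of a person's location (theta).

    responses    -- list of 1 (right) / 0 (wrong), one per item
    difficulties -- list of item locations on the same continuum
    Uses Newton-Raphson on the Rasch log-likelihood.
    """
    if sum(responses) in (0, len(responses)):
        # All-right or all-wrong patterns have no finite ML estimate.
        raise ValueError("need at least one right and one wrong answer")
    theta = 0.0
    for _ in range(iterations):
        probs = [rasch_probability(theta, b) for b in difficulties]
        gradient = sum(u - p for u, p in zip(responses, probs))
        information = sum(p * (1 - p) for p in probs)
        theta += gradient / information
    return theta


# Hypothetical five-item test with items spaced along the continuum.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
theta = estimate_location([1, 1, 1, 0, 0], difficulties)
print(f"estimated location: {theta:.2f}")
```

At the estimate, the expected number of correct answers equals the observed number, which is why spacing items along the continuum lets the response pattern pin down a location.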

On-screen: [Slide 11, Developing a Test Using the Domain Model.]

Mark D. Reckase - That's the one model. We're going to keep that separate. That's our continuum model. We have the second model, which is this domain model, and when you develop tests for the domain model, the way it basically goes, is you start out with having to have a detailed description of what the domain is. Now, usually in a state assessment, what they do is they have their state curriculum documents. It'll describe what it is that they're trying to say that a student is supposed to learn over the course of the academic year. You get this detailed domain description, and then usually what happens next is people say, "How much testing time do I have? How much time do I have?" The question is how many items can I ask, can I use to go over and find out how much of the domain do they know? The main question we have is not where are they on this continuum? We want to say, how much of this domain have they acquired, that we want them to acquire?

We figure out how many questions we can have, and you'd say, okay, 40 questions. You can have 40 questions. In 40 questions, we want to go over and figure this out. Now, the important point is, 40 questions is clearly not enough questions to totally cover a domain that's for an academic year. It's just not possible to go over and measure everything that a kid has learned in a full academic year with 40 questions, or 100 questions. Basically, if you wanted to really have thorough coverage of everything, you would have to have a test that is probably as long as the instruction was. You would, kind of, need to have that much time to go over and do it. The reality is that when you do this, you don't try to cover the full domain. What you do is you do a sampling from the domain. You'd say, okay, we're going to take a sample from the domain, with the idea that after we sample from the domain whatever proportion of that sample the student gets right, I'm going to infer that that's the proportion of the full domain they can do.

This is a problem in that it requires that your sample be a random sample, or a representative sample of that domain. It's a question of whether you can actually get a representative sample of the domain. Usually, I think what really happens is people create the test, and then they say the domain is whatever it is that this is a random sample of. It's sort of backwards thinking. Then they say, okay, I sort of got that. But, lately they've done a lot of other things in the educational community. They said, well, we want this to be a representative sample of the domain, so they do alignment studies.

They go over and say, here's the curriculum, we want to go over and see does this test really align with the curriculum, and if it's not aligned we have to add items, but you can't add items, because you don't have enough testing time. You only got 40 items, so how are you going to be — especially if in the alignment study they say you got to have at least three items per each objective that's listed. This is an attempt to try to make sure that this set of items is representative of the domain.

On-screen: [Slide 12, A Fanciful Example Due to Bock.]

Mark D. Reckase - When I first ran into this — this is kind of a fanciful example to make this a little bit more concrete, that's due to Darrell Bock, who was a professor at the University of Chicago, who did a lot of work on item response theory and testing and things like that. I was at a meeting — and I can't remember where it was, but maybe somebody else here can remind me where it was. He got up and he made an example up. He said, back in the days before word processors and computers, secretaries would have on their desks a book — I can remember this because I can remember a secretary having this book. It was red. It was a book that had to do with misspelled words or spelling words that they would use as a resource. They would think about how they would spell the word, they would look it up in this book, they would find that misspelling, and then it would tell them how to spell it properly, or it would tell them that they were spelling it right, yes, go ahead and use it. This is the paper-and-pencil version of a spell checker.

He said, okay, if I'm going to go over and create a test to see if people know how to spell business vocabulary that they need for working, I can take this book and I can do a random sample of words from this book. Let's say 100 of them. This would be a true random sample, so it's representative of this book — although that book had 10,000 and some words in it — and then give them this test with 100 words on it and see if they could spell each of those words. It would be done orally. You'd read off the word and then they'd have to spell it and we'd score this. If they could get 80% of those words correct, you infer they can do 80% of the full book, because it's a random sample. This is the same technology that we use for voting polling — for presidential polling — and everything else.

We know that sometimes those can go wrong, because we don't know that we really have the representative sample. We want to have the representative sample of people who are going to vote, and we want them to be honest about what they're going to do. To the extent that our sample is off, that means that our inference is going to be off. We can say, well, we think kids know 80% of this material, but if our sample isn't really representative of the full domain, we really don't know if they know 80% of the full domain. This is the concept of domain sampling. If they do it well, then you can make that kind of an inference.
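
The domain-sampling inference can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Bock's procedure: the domain is stood in for by 10,000 numbered "words", the speller is a hypothetical person who knows exactly 8,000 of them, and the function name and the fixed random seed are made up for the example.

```python
import math
import random


def estimate_domain_proportion(domain, can_do, sample_size, rng):
    """Estimate the share of a domain a person has mastered.

    domain      -- list of every item in the domain (all words in the book)
    can_do      -- function returning True if the person gets the item right
    sample_size -- number of items actually administered
    rng         -- random.Random instance, so the draw is reproducible
    """
    sample = rng.sample(domain, sample_size)  # simple random sample
    proportion = sum(1 for item in sample if can_do(item)) / sample_size
    # Standard error of a proportion with the finite-population correction.
    n, N = sample_size, len(domain)
    standard_error = (math.sqrt(proportion * (1 - proportion) / n)
                      * math.sqrt((N - n) / (N - 1)))
    return proportion, standard_error


# Hypothetical speller who knows 8,000 of the 10,000 words (80%).
domain = list(range(10_000))
known = set(range(8_000))
proportion, standard_error = estimate_domain_proportion(
    domain, lambda word: word in known, 100, random.Random(2024))
print(f"estimated share of domain: {proportion:.2f} +/- {standard_error:.3f}")
```

The standard error only describes sampling variation. If the sample is not representative of the domain, the estimate is biased and the standard error gives no warning of that.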

On-screen: [Slide 13, Elaboration of the Domain Model.]

Mark D. Reckase - Now, in reality, when people do this, they don't just do a random sample from the domain. They would do more technically what we would call stratified sampling, because they have different content areas and they'd be selecting items from different content areas. In Bock's example we can say, we could have nouns and verbs, or we could have words with foreign origins and words that — actually, probably all words have foreign origins. Anyway, you could develop categories, long words, short words. Then, so we have statistical procedures and weighting and everything else to go and figure out how well this represents the whole domain, so we can get into all of this.
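
A small sketch of the weighting idea in stratified sampling follows. The two categories, their sizes, and the scores are hypothetical numbers chosen for the example, and `stratified_estimate` is an illustrative name rather than a procedure described in the lecture.

```python
def stratified_estimate(strata):
    """Weight each stratum's sample proportion by its share of the domain.

    strata -- list of dicts with keys:
        'domain_size' : number of items of this type in the full domain
        'n_sampled'   : number of items of this type on the test
        'n_correct'   : number of those the person answered correctly
    """
    total = sum(s["domain_size"] for s in strata)
    return sum((s["domain_size"] / total) * (s["n_correct"] / s["n_sampled"])
               for s in strata)


# Hypothetical split of a 10,000-word domain into short and long words,
# with 30 words sampled from each category.
strata = [
    {"name": "short words", "domain_size": 6_000, "n_sampled": 30, "n_correct": 27},
    {"name": "long words", "domain_size": 4_000, "n_sampled": 30, "n_correct": 15},
]
raw_score = (sum(s["n_correct"] for s in strata)
             / sum(s["n_sampled"] for s in strata))
weighted = stratified_estimate(strata)
print(f"raw proportion correct: {raw_score:.2f}")
print(f"weighted domain estimate: {weighted:.2f}")
```

With these made-up numbers the raw proportion correct is 0.70 but the weighted estimate is 0.74, because the category the person does well on makes up more of the domain than it does of the test.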

Now, the market basket concept, which has come up for testing, is very much like this. Market basket concept comes from economics, where they do it for the cost of living. You get a whole bunch of things that you buy and you see if the cost is going up or going down, but it's a big mix of stuff. It's got appliances, it's got gasoline for cars, it's got food, it's got healthcare. It's got all kinds of stuff, so you can see that market basket going up and down but you really have all kinds of things that are being mixed together, and that's a major characteristic of these domains. If you think about what's being taught in any given year in social studies or science or whatever, it's a big mix of things. It's very similar to that.

Okay, so two concepts here, and I hope you got both of these concepts down, because now we have to move to, great, we got two concepts, why is this a problem? That's where I want to sort of head to next.

On-screen: [Slide 14, Comparison of the Two Approaches.]

Mark D. Reckase - First, this continuum model doesn't require a representative sample. When I'm trying to find people on K Street, I'm not going over there and saying I have to have a representative sample of items on K Street. I'm actually doing sort of an adaptive test, and I change my selection depending upon what information I have about where the person is, so it's not a representative sample, and it's consistent with the computerized adaptive testing and IRT kind of approach. I'm picking out items that give me good information about points on the particular scale that I'm interested in.

Now, it happens, the continuum model developed all the technology that we've got, the item response theory, scoring, equating technology, most of it has come from a continuum model. Most of it originally came out of psychology and then is transferred over to education. We've got huge books on psychometrics with all this technology and technique that we can use, which is mostly consistent with the continuum model. That's the nice thing about it. The domain model is pretty much consistent with a mastery interpretation and a curriculum design point of view. People who are doing curriculum design and working with kids, and they're trying to think about what it is that they're learning, they like the domain model, because that's the way they're thinking. We got all this stuff that we have to cover, we've got textbooks with all this material in it, and we want to go over and try and get kids to learn as much as possible about all these things.

The scoring model tends to ignore the differences between items. For me, the domain model is, let's say if I go over here and I pull out my wallet — I actually am an old-fashioned kind of person, I actually have cash — I go over here and I say, "I have this many bills in my wallet." Then, I ask somebody else, "Do you have some cash, too?" Maybe I can see Ernie's got cash. But anyway, and say, "Which of us has more bills?" Whoever it is would say, "Oh, I've got 10 and you have 12, so that must mean you have more of stuff." Well, it's not true, because it's ignoring the fact that some bills are larger denominations than others, and who knows, I might actually have some foreign currency stashed in here that I can't spend around here, so it doesn't matter if I've got it there or not.

The domain model sort of treats everything equally, and my garage sale example earlier on — maybe I didn't use that example yet — but, basically you've got all this stuff in there and you're counting these things sort of equally and you're saying, "I've got 80% of this stuff." That's sort of the scale that we're working with, but this stuff isn't all equal. Some of these things are more important than others. Of course, then the real challenge is to get that representative sample so you can make the kind of inferences that you're really interested in.

Okay, so these are the two approaches, and they're not really the same thing. That's the important point. One is trying to go over to pick items to get them going down the line. The other one is trying to get coverage over a whole bunch of stuff, which is what the curriculum model says that you want to teach, and you're trying to find out how much of that it is.

On-screen: [Slide 15, Desired Interpretations of Achievement Assessments.]

Mark D. Reckase - Keeping those in mind, let's go over and think about what people really want to do. We get into the educational practices and uses. This is where I sit in these meetings and I start going a little bit crazy, because I'm listening to people tell me what we want to do.

One thing they want to do is they want to measure growth. Now, in actuality a lot of times they say they want to measure growth, but actually what they want to do is they want to measure change in rank-ordering of people. They really like the idea of sort of the growth chart, the wall chart for seeing how your kids grow, and you make lines on the wall, and you're seeing how much have they increased in achievement over years when they take additional coursework. This is a growth idea. That's one thing people like to do. That's used for teacher evaluation, for school evaluation. It's used for a lot of things.

The second thing you see that people want to do a lot of, they want to go over and get sub-scores, and sub-scales, and diagnostic classifications. The idea is that teachers will want to have this because it will help them customize their instruction for the children and know where it is that I need to work harder because they're not doing as well as this as they did in some other areas. We want sub-scores. The other thing they want to do, they'd like to go over and have what I call big items, or items that are good examples of instructional activities, to make this test more realistic — more authentic is the term they used to use — so that when people are teaching to the test it's not a bad thing, because they're teaching towards the kind of activities that they want kids to learn as well.

We've got these three things that I've seen often, and that are related to educational achievement testing. Now the first one, the growth, is related to the continuum model. This is where you're trying to move, to find out where somebody is along a line of growth. The second and the third one are really related to the domain model. When they create these sampling plans for the domain, they usually have categories and they say, can't we have a sub-score for that category so we know how much of that category they have, or they meet a mastery requirement, which is sort of like a diagnostic classification. Then, they'd also want to put in essay items or science simulations, or larger kinds of things that will be good models for instructional activities. This is sort of what's out there.

Even listing these, there's a conflict, because one wants to have sort of the continuum where you can say this is how far kids are growing — you could see that in NAEP reports; they want to go see up and down about how things are going, but states do this as well. Then you get these other ones, which are sort of justifications for the amount of time we're spending on testing. If I can get sub-scores and diagnostic classifications and if they're a good target for instruction, then it's a justification for the test.

On-screen: [Slide 16, Problems with Supporting All the Interpretations.]

Mark D. Reckase - The big problem is the continuum model is not consistent with the domain model. The reason I say that is that the domain model is really very multidimensional. The domains have a lot of things going on in them, and these domains are — it depends on the subject matter area, but science is really complex. It's got a lot of stuff in it. In fact, when you go from year to year, you may be going from teaching biological science in sixth grade, and then in seventh grade maybe it's astronomy, and then in eighth grade it might be earth science. Then it doesn't really make sense to measure growth when you're going from different content to different content, as you're going up through the grades. This is inconsistent with this idea of the continuum. It's just got a lot of things that kids are learning, and we want to go over and figure out whether they've acquired all of those.

The other problem is that items that are good targets for instruction take a lot of time. The fact that they take a lot of time, and you have usually a fixed length of time, means that these are going to reduce your domain coverage. Because if you go over and have a lot of big items that take a lot of time, it means you can't have as many, so you can't cover as many topics to cover the domain. You get these conflicts that kind of come up. You'd say, why is it that we're going and doing this? People like the continuum model, because it's got all this technology of testing that we've developed — the IRT models, the adaptive tests — so we would like to use the continuum model to analyze data, but we like the domain model because we're related to instruction. We want to go over and make sure these things are covered, and if we cover them all on the test — I think in states they really believe this — we got to have them all on the test, because if we don't have them on the test, people won't teach them. If we don't have a full coverage of the domain on the test, then we're going to limit instruction, or narrow instruction, and there will be problems. We have conflicts that come out of this.

On-screen: [Slide 17, Parallel to the Fanciful Example.]

Mark D. Reckase - Now, to go back to my example — and I have to make sure I don't spend too much time on these things — if we were doing my K Street example, if I said, okay, I'm going to do domain coverage on my K Street example. I'm going to force questions to cover every region of K Street, and I'm going to ask people questions about all those regions of K Street when I'm trying to find out where that person is. If they're out near Union Station, it doesn't make any sense for me to ask a question about Rock Creek, because they're not anywhere near there. I'm wasting time. It's become very inefficient. That's sort of what happens if you try to apply the domain model to a continuum.

On the other side, if you're going over and you've got, let's say, the person's really in D.C. total — the whole area of D.C. not just K Street — and I limit myself to asking questions about K Street, I'm not going to be able to tell the difference between somebody who's up near the Cathedral and somebody who's south of the Mall. Then I'm going to have a really bad estimate of their location, because I forced myself into thinking about using the continuum model when I've got a complicated domain. We can run into problems by trying to go over and use these models in a way that they don't really fit.

On-screen: [Slide 18, Having It All.]

Mark D. Reckase - Unfortunately, when we get to the end of all this, people want to have it all. I've sat in many meetings where people will tell me, okay, I want to have it all. I want to do all these things. Why can't you do all of these things? Especially when you talk to a governor. They're going to say, "Don't tell me you can't do this. Tell me how to do it." Oh, okay. Me as a person who's worked in this, I'm guilty of this as anybody else. I'm not blaming anybody. I'll go over here and I'll — and you can see these things. We're going to have different ways of doing it. Now, I know how to have it all, and this slide is actually saying, if it was up to me, this is the way that I would do things.

If people wanted to have sub-scores, I'd say, "Fine. I'll go over and I'll create a nice short adaptive test for each sub-score that you want to produce, and then I'll go over and I'll combine those continuum models to go over and give an overall composite to go over and get something that's based on what all these things are about." This is one approach to doing it. Or, since I wrote a book on multidimensional IRT — and everybody should buy a copy — I'd use a multidimensional model to go over and model the domain, so that I could go over and get information from the whole domain in a multidimensional way, instead of the simpler models for analysis, so that we could go over and see where they are in all of D.C. instead of just along K Street. Or, actually you'd probably need a three-dimensional model for D.C., because we are on the 13th floor now, and to find me right now it's not enough to go over and just find me on a two-dimensional grid; we need to get in a three-dimensional grid. Actually, probably a four-dimensional grid, because the hallways are weird, so you couldn't just say the 13th floor. You've got to figure out — actually, when I tried to get up here, I had to try and find the right elevator just to get up here, so it's kind of...

That's another approach. Or we could go over — and this is a little more farfetched. We can get some of these really big extended open-ended items, but we have to figure out how to pull all this information out of them, instead of just a single score, and get quicker scoring if they're really elaborate, so we can go over and do a better job of domain coverage from all these things. There's sort of directions we can go, but all the testing programs I know about don't want to do any of these things. They want to go over and have the domain coverage and go with it from there.

Now, there are some good models for these. The first one — the ASVAB actually does this, the Armed Services Vocational Aptitude Battery. It goes over and has a lot of separate little short adaptive tests that it creates, and then all the agencies go over and create their own composites out of those that best predict the performance in the courses for them. For the Air Force, and the Navy, and the Army, they all kind of make their own composites up out of all these little sub-scores that they create for the ASVAB. That's sort of a model for that. Portfolios are a way of getting at the third one. If you go over and get portfolios. I tried to get one going over at ACT when I was there. I thought it was very successful, but it never became profitable, let's put it that way. It was too expensive and took a little bit too much effort on the part of people, so it never got to be successful. I still like the idea.

On-screen: [Slide 19, Current State of Affairs.]

Mark D. Reckase - Okay I'm up to slide 19. We only get to 25 or something, so we're almost there. What I see as the current affairs, this is where we're at right now. Most achievement tests that are out there for school-based education start out developing using a domain model. They go over there and they have a description of the curriculum, and they go over and make up a test plan that's based on trying to get a representative sample of items covering the domain as much as they can, given the number of items they can give in the time that's given to them. Then they also, probably because somebody has forced them to do it, put in big items, too, because they want to go over and be authentic and try and have good examples for instruction. We get this instructional relevance component that gets put in there, so we get larger science simulations and we get essays, and we get more elaborate items that go in there.

But, what happens is these tests are typically analyzed using the technology of the continuum model. I can think of bunches of state tests that they go and do all this, then they take the Rasch model, and they will apply the Rasch model to this thing to get the score on a continuum, and they'll actually start throwing away items that don't fit the Rasch model, which starts reducing their representativeness of the continuum. They'll get the thing to work and they'll come out fine. Then, usually what will happen after they got all this work with the uni-dimensional continuum model, then somebody will say, "Oh, we want sub-scores." And you say, "Well, we just forced this into fitting on a continuum and now you want to have sub-scores? Okay, let's go and see if we can find something that will be not too highly intercorrelated that we could make some sense of, to go over and give some sub-scores."

We go over and do all these things. The result is a testing program that's got conflicting priorities. On one hand we're trying to go over and use technology for analysis that's based on a continuum model, then on the other hand we're trying to go over and cover domains. The worst example that I think of is the one for adaptive testing, where you force content balancing onto the adaptive test, because you want to have domain coverage. Usually the methodology that's running the adaptive test is based on a uni-dimensional IRT model, but then you want to go over and say, I don't really believe that model so I'm going to put content balancing on top of that because I think the test is really multidimensional and I need to have all this stuff covered. You get these conflicting concepts.

When I go back to class I usually go over and tell my students I now need therapy because I was in this session, and I'm having a schizophrenic moment here because I got these different ideas that are being pushed at me at the same time and I'm having trouble dealing with them. I never do get therapy, but I tell people anyway. Then we get the extended items that are thrown in there as a little bit extra. That's what I think our current state of affairs is. We can discuss that later, if you don't agree with that.

On-screen: [Slide 20, What Can Be Done?]

Mark D. Reckase - Now what I think can be done about this is — I don't want to be just a downer here — is that we can sort of change the way that we go about doing this, and do this in a more reasoned way. I think the way to change this is, instead of going over there and thinking, okay, here's my domain and I've got to cover it, the thing to do first is to say what are, really, the inferences I want to make from the score when I'm done, when I'm reporting results of this. Let me think about all the things that I really want to say first. Then, let me see how I'm going to go over and develop the instrument that's going to allow me to say those and have support for it. You're going to have these inferences.

Now, great if you only had one. Some people are in licensure and certification and they want to say, can these people pass the exam and are they ready to go out into the workforce. They have one high priority and the rest is kind of minimal, but a lot of times in educational assessments, they have many priorities. They want to have a growth model, they want to do teacher evaluation, they want to do school evaluation, they want to give sub-scores, they want to give diagnostic things to the teachers. They want that list that I gave over there before, so they have all of these. What I want to do is say, okay, have all those, but let's prioritize them. Then let's say, what are the financial resources you have available for this? Let's start at the top priority, figure out how to do a good job on that one, and then work our way down the list until we run out of money, or time, or whatever those resources are. And admit to people, the rest of those we cannot do because we don't have the resources to do it and the model that we've got that we're working with is not going to work.

The advantage of this is you actually are developing the validity argument for what you want to say as you're going and doing the design. This is sort of, at least in the beginning of this, kind of approach. In the beginning, I want to find out do you really want to have a growth model, and this is really what you're after, and you want to have a continuum and find a location of a student on this continuum. Let's forget about everything else and do a good job of doing that, and then see what else we can do afterwards.

On-screen: [Slide 21, Final Thoughts.]

Mark D. Reckase - I've been guilty of doing all these things myself, so I can't go over and criticize people too much. When I was at ACT I remember the first program I worked on at ACT was a thing called the Proficiency Analysis Evaluation Program. It was a credit-by-examination program. We did a really good job of creating the test forms and everything else, and I thought the program was working great, and then the next thing they said is, "Oh, we'd like to have sub-scores." I said, "Oh, okay. This test isn't that long. How many sub-scores can we get out of this?" We actually did some analyses. We found out we didn't want to report a sub-score unless it had at least 12 items, because with 12 items we could get a reasonable reliability out of it. Even there we didn't want to report scores on that, so we'd only report high, middle, and low on each of these sub-scores, so we could be confident in our classifications of this, because it's only 12 items. Then, we'd maybe only give two or three sub-scores per test, and that's it. Then say, okay we'll be responsible for this.
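
One common way to see why short sub-scores are hard to report is the Spearman-Brown formula, which projects reliability when a test is shortened or lengthened. The lecture does not say which analyses were actually run, so this is only a sketch: the 60-item full test, its 0.90 reliability, and the assumption that the 12 sub-score items behave like the rest of the test are all hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor.

    Assumes the added or removed items are comparable to the existing ones.
    """
    return (length_factor * reliability
            / (1 + (length_factor - 1) * reliability))


full_test_reliability = 0.90   # hypothetical 60-item test
subscore_reliability = spearman_brown(full_test_reliability, 12 / 60)
print(f"projected 12-item sub-score reliability: {subscore_reliability:.2f}")
```

Under these assumptions the projected reliability of a 12-item sub-score is about 0.64, which is consistent with the caution described here about reporting only high, middle, and low classifications rather than numeric sub-scores.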

I'd go and try and push this and make it work. Me personally, I think I really like the idea of getting rid of testing overall. I know it would take a lot of us and get us out of work, but my real dream is that now we have computers in the classroom all over the place. If every kid had a computer and we collected everything that they did academically in the classroom during the whole academic year, and then developed a model for aggregating all of this. You go over — and we'd need some cognitive psychology models, too, in order to go over and figure out how the things we collected matched up with what they actually knew, that we wouldn't need to give the test; we'd just be accumulating all of what they had done, and then figure out how much they knew. That's the exact way of going and getting the domain.

We were challenged a number of years ago. I was in a research meeting that was funded by the military, and a guy from the Air Force came into the meeting and he said to us, he said, "Hey, I want to show you something here." He took us over and he showed us this great flight simulator. This is where they were training pilots. We went and looked at this thing, and they had all these... You know, it's a flight simulator. It just did everything. He said, "When we run a flight simulation through this thing, we collect everything that happens in the flight simulation. It's just all recorded on the computer." He says, "It's like a million observations. We haven't figured out how to score it. We know the person lands without crashing. We know if they got to the right place, but beyond that we don't know how to score this thing. It's just too much information, too much data, but why don't you guys figure out how to score this for us? That's why we're paying you." I said, "I don't know how to do that." Well, it's the same issue here. The portfolio was an attempt to do this in paper-and-pencil form, but as I said, it was expensive. If we can get them to computerize stuff it will work better. Anyway, that's a long-term research program. I think it's going to take 10 years, 15 years. Originally, when I wrote this, I said 30 years, but I didn't want to be that pessimistic about it. I don't think most testing programs are going to wait that long. They're going to want us to do these things right now, so we need to figure out things.

I have one little other side note and then the final thoughts, because I'm running out of time. The big issue on testing, right now the way we do it, is testing time. This is the real limit. Usually if we got a 60-item test, then we get 60 binary digits, what we're trying to do is get a picture of a student in 60 binary digits. I used this example once in an AERA presentation and I dug it back up.

On-screen: [

Slide 22, 56 graded responses, represented by black and white photo pixelated so much that subject cannot be determined.]

Mark D. Reckase - This is 56 graded responses. This is in that grid. This is like if you have 56 items, they're all scored on a five-point scale. This is a picture, a picture of something, and I don't know if you can tell what this is a picture of from this. No? Anybody know what this is? Only if you saw my presentation before, you saw the slides in advance. So, you can't tell what that is. I can't tell what it is. Let's say we go, instead of having only five points for each item, maybe we'll have maybe 10 or 15. We could put in color then.

On-screen: [

Slide 23, same pixelated photo is now in color but subject still cannot be determined.]

Mark D. Reckase - Okay, so now we've still got our pixels, 56 items, but now we're scoring them on like a 10-point scale. Can you tell what this is? Okay, well, this is what we have in reality when we're doing a test and we're trying to interpret. Here's what the picture is.

On-screen: [

Slide 24, photo of child.]

Mark D. Reckase - This is my grandson a couple years ago. He doesn't look like that anymore. He's gotten older. Anyway, this is what people want. They want this picture. That's what they want out of the testing program, but what we can give them is something like that. [Slide 22, black and white photo] [Laughter.] This is where our big problem is.
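The gap between the pixelated grid and the photograph can be put in rough numbers by counting the most information each response pattern could carry. The sketch below is an illustration, not something from the lecture: it treats every response as independent and equally likely across score points, which gives an upper bound, and the 640 by 480, 8-bit grayscale photograph is a hypothetical comparison.

```python
import math


def max_bits(n_items: int, score_points: int) -> float:
    """Upper bound on the information in n_items responses,
    each scored in one of score_points categories."""
    return n_items * math.log2(score_points)


binary_test = max_bits(60, 2)    # 60 right/wrong items
graded_5 = max_bits(56, 5)       # 56 items on a five-point scale
graded_10 = max_bits(56, 10)     # 56 items on a ten-point scale
photo_bits = 640 * 480 * 8       # hypothetical 8-bit grayscale photograph

for label, bits in [
    ("60 binary items", binary_test),
    ("56 items, 5-point scale", graded_5),
    ("56 items, 10-point scale", graded_10),
    ("640x480 grayscale photo", photo_bits),
]:
    print(f"{label:<28s}{bits:>12,.0f} bits")
```

Moving from five to ten score points adds only a few dozen bits, while the hypothetical photograph is several orders of magnitude larger than any of the test formats.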

On-screen: [

Slide 25, Do Achievement Assessments Result in Achievement Measurement?]

Mark D. Reckase - Final thoughts. There's our last two slides, and then it'll be done and then I'll open it up for questions. Do achievement tests — this is my original question — give achievement measurements? My answer right now would be to say, almost always no. I'm thinking in the terms of, like, the scale that we have for time. You have a reference point; you can tell how far away somebody is from a reference point. I'd say, in most cases, we don't have this continuum like K Street; we have this domain model, but it's on top of the continuum and we're getting projections of the domain onto a line, and we really don't know what the distances mean on this thing, so we don't really have solid measurement.

I think in some cases, like in mathematics where we have a strong hierarchical structure, you're going to get a pretty good approximation. In other cases, like science and social studies and some other areas, where the domains are really, really complicated, we aren't getting anywhere close to what we would consider some kind of a measurement model. This is sort of where we're at. We could work more on this, but I think we need to go over and think more creatively.

On-screen: [

Slide 26, Two Thoughts.]

Mark D. Reckase - Last slide is — I got two thoughts — the one thought is "do no harm" — this is coming out of the medical community. If we go out and do this, we got to be able to tell people, no, we can't do this. Occasionally at least, if we really don't know. The other one is actually — I was going to use this as the title of the talk over here, because it was running through my head as I was developing this — you can't always get what you want. If I had a voice I'd sing it for you. Then I looked up the full quote to make sure that I got it right. Keith Richards and Mick Jagger, those great psychometricians that have been working in the field for a long time, since 1969. "You can't always get what you want, but if you try sometimes, well, you might find, you get what you need." I think what we want to do is make sure that people get what they need and sometimes ignore some of the things that they say that they want. Let me leave it at that, and thank you for listening. I hope that...

Unidentified Male Moderator - Ladies and gentlemen, we will now begin Q&A. If you have a question please line up behind the two microphones located on either side of the room. Please identify yourself and remember to keep these to questions rather than comments.

Walter Rosenkranz - Hello, my name is Walter Rosenkranz. I'm a retired professor of mathematics. I got my PhD at the University of Illinois, maybe at the time you were a graduate student there. My first job was at the Coordinated Science Laboratory, working on ILLIAC I. This was before computer science. This was one of the first electronic computing machines. My task was a mathematical one, to use the Jacobi method to diagonalize a positive definite matrix. This problem arose most frequently in psychology. I was really amazed that some of the heaviest users of the ILLIAC I were graduate students in psychology who were doing factor analysis. This led to this interesting math problem.

Anyways, what I wanted to talk about is this evaluation of schools on the basis of these multidimensional criteria. For example, ESSA — the Every Student Succeeds Act — you're supposed to evaluate schools, and here in the district, they had a series of community meetings where they wanted to rate schools — and I thought this was horrifying — on 1–5 stars. It's like rating restaurants. When I teach mathematics, I have students, some of them are A, some of them are B, some of them are C, some of them are F's. I evaluate each individual student. I don't give a course grade to the class as a whole. There's very outstanding students. What would it mean to give a course grade, of let's say, C-plus to a course where some students are outstanding? This is really ridiculous. It's stupid to evaluate these schools according to these criteria.

Mark D. Reckase - This is — is this mic on? Can you hear me? You bring up a really interesting point, because on one side — I'm on both sides of this issue. I was on a committee in Michigan where we were supposed to develop the teacher evaluation model, and we actually came up with what we thought was a new and innovative teacher evaluation model during the process of all this, because of problems of exactly the kind that you're talking about. That, of course, the teachers are given a group of students, and then they have to do what they can with that group of students with the resources that are given by the school, and also the resources that are given by the community and the family to go over and support the students.

Evaluating teachers is very tricky because of the complexity of the situation that they're in. The basic, I think, the thought process of people is, we want children to learn this amount and we would hope that the teachers are going to help those students learn this amount. Now, this is, in terms of the talk over here, this is a continuum model, or growth model, that there was this amount and here's a threshold; we'd like to get them all above this threshold. It's called proficiency in a lot of states.

We want to get them above that proficiency level, and we hope that teachers can get them above that proficiency level. Of course, a teacher could be really good at working with very bright students and get the students way above that proficiency level. But if they've got challenging students, who don't have the background from previous courses and don't have family support and they're in a community where there's a lot of problems, that teacher's going to have trouble getting them above that proficiency level. We have to figure out how to come up with an evaluation system for teachers that takes all that into account.

Now, in theory, value-added models try to do that by putting in what they call covariates, and things to go over and balance out all this stuff, but they don't work very well, and they actually don't classify teachers very accurately. I've been involved in some research with a couple economists who have actually shown that. We've got some creative ideas about how to do that better, but none of those have actually been put in practice anywhere. This is a nice thing when you're a faculty member. You can come up with ideas and you can try them out, and you never know how it's going to play out. Then you have to go out and try and sell them to the world afterwards. This actually becomes a marketing thing, because if it's different from what people are used to, then you have to convince them that your different thing is going to be better than what they're already doing. That, sometimes, is a hard sell.

I agree with you. I think this is really tough, but it does — in terms of this talk — it fits in with the idea that a lot of states think about education as a continuum, and they want to see this continuum and see what students are above a point on the continuum. That's the way if you've got a teacher and you got your kids above the point in the continuum, then you're great; if you don't, you're not. That's an overly simplistic model, which needs a lot more work.

Jamaal Abdul-Alim - Jamaal Abdul-Alim, reporter with Diverse: Issues in Higher Education. One of the concerns that has been raised about the testing industry, is that from time to time, tests become laden with questions that are considered culturally biased. As a reporter, I've never really wanted to write that story unless I could find examples of what the people making this allegation were talking about. In trying to find discarded questions, questions that were discarded because they were considered culturally biased, I discovered that the testing industry was very guarded about that and would not release questions that had been discarded because they were deemed to be culturally offensive.

I think the thinking is that doing so is only against their legal interest, because it could lead to a lawsuit. In any event, I'm wondering if you could speak to, one, whether or not a cultural bias in testing is an issue, is it a significant issue, and also, any light you can shine on why the testing industry is so guarded about questions that they discard? Thank you.

Mark D. Reckase - That's a really interesting question. For one thing, I think I'm trying to remember back to this, because I've been out of the actual testing industry as a faculty member for a while, so I'm not up on the current processes, but when I worked at ACT, my job was to actually build the ACT test. I was actually the person who sat there and picked the items to go on the test. When we found items that were potentially a problem — back, early on, there was a law that was passed in New York state, that said that we were supposed to do these analyses and keep track of these items, and we were supposed to create a file of all of these items. We were supposed to file them in the State Department in New York, up in Albany, which we did. We sent things there.

Right now, I would say, first of all, that legislation, even though I hated it at the time because I was in the business, I think overall has improved the quality of the tests, removed the kinds of items that were the most blatant type that might have existed at that time. Things have gotten a lot better over time because of what we now talk about as DIF analysis or the bias review, or all these review processes that went on. When I was working at ACT we actually got to the point where we had bias review committees, we had DIF analysis, we had all kinds of different kinds of analyses to try and weed out those items before they got onto the test.

There would be an occasion when one would show up. I'll take the risk here of talking about one item, because it's a little bit less sensitive than some other ones. We found one item, where we took a male-female difference item on the science test. What the item was about was actually about when you look at the side of a tire, that says inflate this tire to something like 45 pounds of pressure, and the question was, "What does that mean when you say that this tire's inflated to 45 pounds of pressure?" We found that males got this right a lot more than females did.

We joked about this a lot in the office, because we said, women just never fill up their tires; they always find some guy to do it for them because they get their hands dirty. Anyway, we were just being disparaging because of this item. What we found out is — and the answers to this were actually science answers — they were saying, "What the 45 pounds of pressure means is that there's a difference in the interior pressure of the tire and the exterior pressure of the air, which is 45 pounds per square inch." The other answer choices were, that the tire when it was filled up weighs 45 pounds more. We had a whole bunch of choices like this. This showed up as what we would call a DIF item. We were trying to decide on whether this is what we would call a biased item, in the sense of culturally biased, because we were trying to decide if this was a legitimate topic of instruction in the science classroom, and if it isn't, then we shouldn't have had it on the test.

We did eventually delete this item, we didn't use it, because we considered that we weren't sure and we were being cautious about it. That was the kind of thing that sort of slipped in. I thought it was kind of interesting, but I never really could explain why there were really big differences between males and females on that item. At that time, we did have a file, and it may still be at ACT, that there's a file that has all these items that have been taken out because of whatever, that people found that there were differences in performance, and then they decided that they needed to take them off the test and took care of it. You might be able to find some of these if you were after them, or you might be able to find them up in Albany, if they're still filing them there. I haven't worked at ACT for 18 years, so things might have changed since the time that I was there. Let's see who should be next? Okay.

Unidentified Male - My name is … I'm a faculty member in Assessment and Testing Measurement Program in George Washington University. Recently I'm working on the interpretation and use of the item response theory, so it was such a great amount of deal I have learned from this presentation. As you mentioned, Professor Bock's paper related to domain score, and he also used his item response theory to produce domain scores. It's like a very nicely mapping into combined two idea, and then your expanding to a multidimensional way, so above being just uni-dimensional presentation.

Even though that domain score, Bock's definition they produced, which is like aggregated, weighted sum of item characteristic curve, which is TCC, sort of like a TCC with representative item, for example, simply defined. Still, we would like to explain that one to other people and that there's a theta score there, but we could not directly interpret that theta, even multidimensional theta value. We have to rely on those test characteristic curve based. My idea is even though we have theta we cannot use theta, and we have to something else the theta value in IRT.

I tried to understand where that problem came from, and that you mentioned, because the theta values are scaled from negative infinity to positive infinity. Really not easy for the public to understand negative theta value and negative difficulty. Instead of we're doing it, to go to the domain's score one, my idea is how about we transform our item response theory model to not produce those negative? How about we have an analytical form to the zero to one the theta value will be scaled, and then we are using those zero to one theta value scale and we don't have to do the additional transformation for interpretation to the public?

Mark D. Reckase - That's a really interesting question that could get a really long answer. I have to try and figure out how to answer that in a short way. I don't really like, let's say, number correct scores. Period. I don't like reporting things on zero to one scale, either. The reason is — and this I learned a long time ago — we've got a lot of technology that we use for data analysis — this is not IRT stuff, this is just doing analyses, like normal distributions and things like that. A lot of times in the metric of theta, we'll find out that the distribution of examinees' scores are normal. You can get the things that are not too far away from the normal distribution, but depending upon the set of items that you picked — and this is in Lord and Novick and … you can get a distribution of number-correct scores, which is very non-normal. Then, if it's a very easy test they can be very skewed in the opposite directions.

Moving from the theta scale to a zero to one scale, it's very tricky to go over and do this in a way that maintains the distributional properties of theta that you would have off of that theta scale, and would maybe end up with being something that causes problems when you do other secondary analysis afterwards on those things because of the statistical properties of that scale that you have. We could get into a much longer discussion about this, but I've thought about this problem. It's kind of an interesting one. We just need to figure out ways to deal with all these weird, non-normal distributions and then go and deal with this stuff. Yeah?
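The distributional point in this answer can be sketched numerically. The example below is an illustration under assumptions, not the speaker's analysis: it uses a Rasch model as the item response model, hypothetical item difficulties for an "easy" and a "hard" five-item test, and a symmetric grid of theta values taken from standard normal quantiles. It maps each theta through the test characteristic curve to an expected proportion correct on a zero-to-one scale and compares skewness.

```python
import math
from statistics import NormalDist, mean


def rasch_prob(theta: float, b: float) -> float:
    """Rasch probability of a correct response for ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def tcc_proportion(theta: float, difficulties: list) -> float:
    """Test characteristic curve rescaled to 0..1: expected proportion correct."""
    return mean([rasch_prob(theta, b) for b in difficulties])


def skewness(values: list) -> float:
    """Population skewness (third standardized moment)."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Symmetric, approximately normal theta values (standard normal quantiles).
n = 2001
thetas = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# Hypothetical item difficulties, chosen only for illustration.
easy_test = [-3.0, -2.5, -2.0, -1.5, -1.0]
hard_test = [-b for b in easy_test]

easy_scores = [tcc_proportion(t, easy_test) for t in thetas]
hard_scores = [tcc_proportion(t, hard_test) for t in thetas]

print(f"skewness of theta:             {skewness(thetas): .3f}")
print(f"skewness, easy-test 0-1 score: {skewness(easy_scores): .3f}")
print(f"skewness, hard-test 0-1 score: {skewness(hard_scores): .3f}")
```

With these assumed items, the same symmetric theta distribution produces a score distribution skewed one way on the easy test and the opposite way on the hard test, which is the kind of scale-dependent shape that complicates secondary analyses.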

Jim Belanich - Jim Belanich from the Institute for Defense Analyses and my question is on selection. A lot of what you're talking about is identifying tests or scores to represent the person, but now when someone is looking across many people, trying to make a selection decision. Could you describe a little bit about, maybe about how the difference between continuum variables or domain would influence how you might go about doing it?

Mark D. Reckase - Making selection decisions?

Jim Belanich - Yes.

Mark D. Reckase - Are you talking about selection decisions into two categories, like pass-fail or are you talking about multiple categories?

Jim Belanich - Multiple. Yeah, let's say if you were a university trying to select a student and you had multiple variables that you want as a student population, some of those being academic, some of those being within certain — you need a certain number coming into this part of the university, this part of the university, and there may be some other characteristics that you care about as well.

Mark D. Reckase - Let me actually — the example that I mentioned very briefly of the Armed Services Vocational Aptitude Battery, is sort of an example that I think maybe applies to what you're thinking about. We've got a whole series of tests, and they're actually classifying people from those tests into schools in the military, different military schools. Each of these schools basically is able to create their own composites for going and predicting performance in those schools. If you thought about it in the academic area like universities, you can do a similar kind of model. Electrical engineering versus biological science versus teacher education versus communications versus — you got all these different majors, and actually the predictions to those majors probably ought to be different depending upon the sort of profile scores on a series of different characteristics that you're interested in.

They might not all be just academic characteristics. I've been interested in thinking about the non-cognitive side of that, too. That there's different personalities of people that go into different kind of areas. You might want to go look into those, as well, to go over and make your predictions in the categories. The idea there would be — and this unfortunately takes a lot of work and research — is to go over and find out which combinations of these things are going to be most predictive of success in those particular categories that you've got. They may be very different from one to the other.

Jim Belanich - With regards to continuum versus domain are there particulars?

Mark D. Reckase - I think each one of these I've got is a continuum on each one of these, but I'm forming a composite. There is an alternative model, which I'll just mention. There are some people who try to argue that there are discrete categories, and that you want to go over and try — and this is a relatively new area of psychometrics where you're trying an adaptive test but it's not based on the continuum model, it's designed for discrete categories, and you're trying to select items to decide which category a person goes into. This is fairly new, but this would be another sort of approach to that, which is a whole different idea, and it's just being developed now. It looks interesting. I haven't seen it applied anywhere yet, but it's an interesting idea. To me, that's the kind of thing you're after.

Christine Carrino Gorowara - Hi, Christine Carrino Gorowara from the University of Delaware. To what extent is this a difference between measuring skills and measuring knowledge? You've got the continuum model, which seems like it would be good for measuring skills because you could have some hierarchy. The knowledge, like if you just needed to know anatomy or all the capitals of the 50 states, which the knowledge has no particular relationship to each other, that seems more like a domain.

Mark D. Reckase - I think you're right in a certain sense, but it has more to do with is the structure of the — there's two things it has to do with. What kind of inference you want to make, and the other one is the structure of the data that you're working with, so with your test items and things like that. If you're, really, when you're thinking about skills versus knowledge, when you're thinking about knowledge, then it covers a whole wide range of things. I was actually trying to think of how I could use Jeopardy as an example for this presentation, because in Jeopardy they got all these categories. You're getting knowledge related to all those categories, and you get the same number of dollars depending upon those categories, but depending upon the person, it's very multidimensional. That person may know absolutely nothing about sports but they're really great with music, so they run the thing on music and get a lot of dollar credit for it, but they can't answer any of the questions on sports, so you go over and they're having a lot of trouble.

This is where we run into this, I was thinking, Jeopardy domain, on the board is the domain. Fitting a uni-dimensional model to that is not a good idea. The domain model works pretty well for that. There is a problem there about how much weight you give to each question. I'm not sure that they've got them properly ordered in difficulty. They haven't done the IRT analysis on their questions. I don't think. Who knows? I don't know anybody who — you know, I want to know how they do the Jeopardy questions. How does this work? [Laughter.] Anyway, I think it depends upon the structure of the domain. That's why at the end I was trying to say, some domains — even though, let's say, they're knowledge domains — are pretty hierarchical. People tend to learn these things in a particular order and they build on each other. Other ones are not, and the ones that are not don't do very well in the continuum model, because they're all over the place. They've got a lot of different ideas that are going.

Christine Carrino Gorowara - Yeah, and sometimes I think you just have to know the stuff. If you're a doctor you have to know where things are, for example. You just have to remember all these things. It's not —

Mark D. Reckase - That's true. Although the nursing certification exam that's put out, called the NCLEX, uses a continuum model. They scale the whole thing with a Rasch model, and they do a computerized adaptive test, but they put content balancing on top of it with eight content areas. This is another example of where people are mixing these models. They force everything to work with their uni-dimensional continuum, but they don't feel comfortable with it because they think, they've got to learn all this different stuff to be a good nurse, so we're going to put eight content constraints on this thing and make sure you have to give all these items from all these different eight content areas, even though we're picking the uni-dimensional model. This is where I say I need therapy afterward, because they're mixing these models, and because they're doing it, they're making each one less efficient and they're having, you know — anyway. Last question. Any questions left? If not, I will...

Audience - [Applause.]

Ida Lawrence - Thank you. Mark, I want to thank you. This was a great talk and these were terrific questions. For those of you who have extra time and would like to hang around, right down the hall here we're having a reception. I hope you'll join us in there and we can continue the conversation. Thank you.

Audience - [Applause.]

On-screen: [ETS®. Copyright © 2017 by Educational Testing Service. All rights reserved. ETS and the ETS logo are registered trademarks of Educational Testing Service (ETS). MEASURING THE POWER OF LEARNING is a trademark of ETS.]

End of ETS® Do Educational Assessments Yield Achievement Measurements video.

Video duration: 1:28:35