Wednesday, March 09, 2011

Which Gap to Close

By Skip Kifer

Both No Child Left Behind (NCLB) and Kentucky's Senate Bill 1 (SB1) refer to achievement gaps and include expectations for closing them. As defined by Kentucky's Senate Bill 168 (SB168) and NCLB, gaps are differences in test scores based on gender, disabilities, limited English proficiency, ethnicity, or socio-economic status. Each school in the Commonwealth, given a set of rules about what counts as a gap, is expected to minimize differences between those groups, thereby closing the gap. The implication is that each student's score is expected to increase, but the scores of lower scoring students are expected to increase by greater amounts. There is a desire for overall improvement in test scores as well as for closing the gap.

As I write, there has been no reauthorization of the Elementary and Secondary Education Act (NCLB), so there is no decision about how to define a gap or how to decide whether a gap is closing. To my knowledge, no decision has been made about how the implementation of SB1 will deal with those issues, either. I expect, for reasons discussed below, NCLB definitions to change. My guess is that Kentucky's might also change.

NCLB now defines an achievement gap as a difference between the percent proficient for one group, say girls, versus that of another, say boys. For a school to close the gap, it must reduce the difference between the two percentages. For SB168, Kentucky initially used a complicated average difference to look at closing the gap. That is, a school closed an achievement gap when it reduced, by a set amount, the weighted average of between-group achievement differences across grade levels and content areas.
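
For readers who prefer to see the percent-proficient definition spelled out, here is a minimal Python sketch. The scores, the cut-point of 40, and the function names are hypothetical and chosen only for illustration; they are not NCLB's or Kentucky's actual rules or data.

```python
def percent_proficient(scores, cut):
    """Percent of scores at or above the cut-point."""
    return 100.0 * sum(s >= cut for s in scores) / len(scores)


def proficiency_gap(group_a, group_b, cut):
    """Difference in percent proficient between two groups (a minus b)."""
    return percent_proficient(group_a, cut) - percent_proficient(group_b, cut)


# Hypothetical scores for two small groups of students.
girls = [28, 35, 41, 44, 52, 57, 38, 46, 49, 33]
boys = [25, 31, 36, 42, 39, 47, 29, 34, 51, 37]

gap = proficiency_gap(girls, boys, cut=40)
print(f"Gap in percent proficient: {gap:.0f} points")
```

Closing the gap, under this definition, means shrinking the printed number from one year to the next.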

In what follows, I hope to point out the strengths and weaknesses of both approaches and then suggest a third alternative for consideration. It is so easy for one to mouth the words "closing achievement gaps" without being aware of the technical difficulties of defining the gap and knowing either when it exists or when it has been closed. As a way to discuss the issues, I created data[1] and drew pictures of them.

Figure 1. Six representations of an achievement gap.

Figure 1 contains six pictures of the data. The graphs depict comparisons for one grade level and content area; for example, fourth grade reading. Three pictures on the left (A, C, E) are ways to show shapes, centers and spreads of the data. Three pictures on the right (B, D, F) are ways to show the gaps across levels of the test scores. Pictures A and C are the same, as are B and D, but each pair will be used to describe different features of the data.

Centers, Shapes, Spreads - Averages as Gaps

Figures 1A and 1C compare two groups, one of which is four times larger than the other. The size difference could happen if, for example, one were comparing majority students to minority students. Such differences in size do not affect the ensuing discussion. The groups could be of equal size, too. These dotplots are just detailed histograms that better represent the shapes and spreads of the distributions. A reader should see several things in Figure 1A: the distributions overlap substantially; the shapes are rather similar; the spreads are similar; but the centers are different. The bottom distribution is shifted to the left, indicating lower average performance for Group 2. That average difference could be a measure of the "achievement gap."

Figure 1B is another way to describe the data. This is a particularly good way to view cut-points that are used as the percent proficient goals. The lines I added to the figure are guides to interpreting the data. These curves depict what proportion of a group scores at or below certain values. For example, if one follows the lines, one can see that fifty percent of Group 2 students score at or below 35. The comparable number is 40 for Group 1, the higher scoring group. The difference in those percents is the measure of the gap when the cut-point is 40 (i.e., 40 represents the goal, the desired percent, the percent proficient). One's eye can see different achievement gaps as the curves move from about 10 to 70.

Percent Proficient (Cut-points) as Gap Measures

In Kentucky there are three major cut-points, producing four major scoring categories - Novice, Apprentice, Proficient, and Distinguished. NCLB requires at least three categories of performance and that percent proficient be the cut-point for determining gaps.

There are several desirable properties of defining the gap in terms of cut-points.

  1. There are several well-defined, judgmental methods to define the cut-points, i.e., what will be called a proficient performance.
  2. Given the defined cut-points, it is straightforward to calculate the gap and changes in the gap. This is especially true for summing across grade levels and content areas within a school.
  3. Coupled with a long-term goal of each student being proficient, the gaps are eliminated when the goal is met.
  4. The notions of being proficient in a subject area and having the percent proficient be the indicator of success are easily conveyed to a broad audience.

There are several undesirable properties as well.

Perhaps the most serious one is depicted in Figure 1D. It shows that if the cut-point is at 40 rather than 50, the gap will be almost double the size. That is, the size of the gap varies according to where a cut-point is placed. Since the methods used to determine cut-points are judgmental, there is no one logical, well-defined place on the scoring scale to place a cut-point. That is a major reason why different states have different percents of students who are proficient.
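
The sensitivity to the cut-point can be checked with a short Python sketch. It assumes, as a simplification, that both groups are exactly normal with the means (40 and 35) and standard deviation (10) quoted elsewhere in this post. The created data behind Figure 1 are only approximately of that form, so the numbers are indicative rather than exact.

```python
from statistics import NormalDist

# Assumption: idealized normal curves standing in for the created data.
group1 = NormalDist(mu=40, sigma=10)  # higher scoring group
group2 = NormalDist(mu=35, sigma=10)  # lower scoring group


def gap_at(cut):
    """Gap in percent proficient (at or above `cut`), Group 1 minus Group 2."""
    proficient1 = 1 - group1.cdf(cut)
    proficient2 = 1 - group2.cdf(cut)
    return 100 * (proficient1 - proficient2)


for cut in (30, 40, 50, 60):
    print(f"cut-point {cut}: gap = {gap_at(cut):.1f} percentage points")

print(f"ratio of gaps, 40 vs 50: {gap_at(40) / gap_at(50):.2f}")
```

Nothing about the students changes between the rows of output; only the placement of the cut-point does.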

Another weakness of cut-points as proficiency standards is that if those in the school wished to "game" the system, it is clear how that might be done. A gap can be narrowed by dealing with only a small proportion of the students. One would focus on students in the lower scoring group who are below, but not too far below, the cut-point. When they are moved to or above the cut-point, the gap is narrowed regardless of the performance of the lowest scoring students. So differences in the percent proficient can be minimized by working with relatively few students.

Conversely, a school could dramatically increase the scores of the lowest scoring students without having an impact on the percent proficient. Imagine moving each student below the cut-point closer to, but not up to, the cut-point. Although the accomplishment would be dramatic, it would have no impact on the percent proficient.
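
Both points can be illustrated with a small Python sketch. The ten scores, the cut-point of 40, and the two "strategies" are hypothetical, invented only to show how the percent proficient responds; they are not drawn from any school's data.

```python
def percent_proficient(scores, cut):
    """Percent of scores at or above the cut-point."""
    return 100.0 * sum(s >= cut for s in scores) / len(scores)


CUT = 40
# Hypothetical lower scoring group of ten students.
group2 = [12, 18, 24, 29, 33, 37, 38, 39, 44, 51]

# Strategy A: nudge only the students within 3 points below the cut-point.
near_cut = [CUT if CUT - 3 <= s < CUT else s for s in group2]

# Strategy B: raise every student below the cut-point to just under it.
big_gains = [CUT - 1 if s < CUT else s for s in group2]

for label, scores in [("original", group2),
                      ("near-cut only", near_cut),
                      ("large gains, none crossing", big_gains)]:
    total_gain = sum(scores) - sum(group2)
    print(f"{label}: {percent_proficient(scores, CUT):.0f}% proficient, "
          f"total points gained = {total_gain}")
```

In this made-up example, a handful of points given to three students changes the percent proficient, while a much larger total gain spread across the lowest scorers leaves it untouched.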

The combination of cut-points with a rule that each student must be proficient within a certain amount of time gives a school an impossible task. Figure 1C shows where the cut-points of 1D fall on the score distributions. When the cut-point is at a score of 50, 90 percent of students in Group 2 must be moved to or past it. For Group 1, which is four times larger than Group 2, more than 80 percent of students must likewise be moved. When the cut-point is lower, the task is less onerous: about 70 and 50 percent, respectively. I know of no empirical results that show such dramatic effects.
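
A rough check of those percentages, as a Python sketch: it assumes idealized normal curves with the means of 40 and 35 and the standard deviation of 10 mentioned in this post, which only approximate the created data, so the results should land near the figures above rather than match them exactly.

```python
from statistics import NormalDist

# Assumption: normal curves with the means and standard deviation quoted in the post.
groups = {"Group 1": NormalDist(40, 10), "Group 2": NormalDist(35, 10)}

for cut in (50, 40):
    for name, dist in groups.items():
        below = 100 * dist.cdf(cut)
        print(f"cut-point {cut}, {name}: {below:.0f}% of students score below it")
```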

Finally, the whole idea of being proficient may be illusory. Simply placing a label on a test score does not make it true. Tests labeled science, for instance, may be very different kinds of tests. The science portion of Explore, the ACT eighth grade test, contains only multiple choice questions and requires an inordinate amount of reading. The National Assessment of Educational Progress (NAEP) eighth grade science test contains constructed response and extended constructed response questions and tends to minimize the effects of reading. Whatever proficient may be, it is likely to result in substantially different definitions depending on what science measure is used. And they both are wrong!

Mean Differences as Gap Measures

Just as for cut-points, defining achievement gaps in terms of mean differences has both desirable and undesirable properties. The positive aspects of such a definition include:

  1. Given data that are approximately bell-shaped, the mean is a good typical value.
  2. As opposed to a cut-point definition where not all students are affected, the mean takes into account all cases.
  3. An average is a number most persons understand.

But, as I tell my students, "never a center without a spread." Figures 1E and 1F show the effects on differences between groups when the spreads differ. The difference in spread between the figures is about 2 1/2 points: a standard deviation of 10 for the first four pictures and between 7 and 8 for the last two. The differences in the cumulative distributions get rapidly "fatter" above the mean of 40 (incidentally, the area between cumulative distributions is equal to the difference between means for the two groups). Minimizing differences when spreads are small may mean something different than when they are large.
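
The parenthetical remark about areas can be verified numerically. The Python sketch below assumes normal curves and, for the second case, a spread of 7.5 for the lower scoring group; both are my simplifications of the created data. It integrates the signed difference between the two cumulative curves. When the curves cross, as they can when spreads differ, it is the signed area, not the unsigned one, that equals the mean difference.

```python
from statistics import NormalDist


def signed_area_between_cdfs(higher, lower, lo=-100.0, hi=200.0, steps=30000):
    """Trapezoid-rule integral of lower.cdf(x) - higher.cdf(x) over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * width
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * (lower.cdf(x) - higher.cdf(x))
    return total * width


group1 = NormalDist(mu=40, sigma=10)
# Assumption: two spreads for the lower scoring group, 10 and 7.5.
for sigma in (10, 7.5):
    group2 = NormalDist(mu=35, sigma=sigma)
    area = signed_area_between_cdfs(group1, group2)
    print(f"Group 2 SD = {sigma}: area = {area:.3f}, "
          f"mean difference = {group1.mean - group2.mean:.3f}")
```

The area stays tied to the mean difference in both cases, which is exactly why the mean difference alone cannot tell the two spreads apart.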

Because decreasing mean differences may mean different things depending on the spread of the data, interpretation problems arise across grade levels and content areas. Unlike summing percents based on cut-points, there is a question of how one should sum the effects to get an overall school index.

It is possible to "game" the means, although effects may be smaller than what one gets when gaming the cut-point definitions. If one believes, for example, that there are faster and slower learners, then focusing on relatively fast learners in the lowest scoring group could provide bigger gains than focusing on each of the students.

Finally, if it were just a matter of reducing differences between means, there would not necessarily be improvement across the system. So, there should be some specification of an expected amount of improvement.

Effect Sizes and Mastery Learning

An effect size, classically defined, is the mean for a treatment group, minus the control group mean, divided by the control group standard deviation.
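
Written out, using symbols introduced here only for convenience (they are not taken from any particular text), the definition is:

```latex
\[
  \text{effect size} \;=\; \frac{\bar{X}_{T} - \bar{X}_{C}}{s_{C}}
\]
```

Here \bar{X}_{T} and \bar{X}_{C} stand for the treatment and control group means, and s_{C} for the control group standard deviation.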

This standardizes mean differences, making them interpretable in terms of standard deviation units. The general idea can be used in the context of gap differences. For the data I have displayed, Group 1 has a mean of 40 and Group 2 has a mean of 35. Using the larger group's standard deviation of 10, we come up with an effect size of .5; that is, Group 1 performance is on average 1/2 of a standard deviation higher. That magnitude of effect often would be interpreted as medium sized.

These effect sizes can be summed over content areas and grade levels in a school to produce a school index. It would take some empirical work to decide how much the index should be reduced in order to say that an achievement gap is closing.
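
As a sketch of what such an index might look like in Python: the first row below uses the means and standard deviation from this post, while the other rows, the grade and subject labels, and the function name are hypothetical. Simple summing is used because that is what is described above; whether to sum, average, or weight the cells is one of the open empirical questions.

```python
def effect_size(mean_reference, mean_comparison, sd_reference):
    """Mean difference expressed in reference-group standard deviation units."""
    return (mean_reference - mean_comparison) / sd_reference


# (reference mean, comparison mean, reference SD) for each grade and content area.
cells = {
    ("grade 4", "reading"): (40.0, 35.0, 10.0),      # figures from the post
    ("grade 4", "mathematics"): (44.0, 41.0, 12.0),  # hypothetical
    ("grade 5", "reading"): (50.0, 44.0, 8.0),       # hypothetical
}

effects = {cell: effect_size(*stats) for cell, stats in cells.items()}
school_index = sum(effects.values())

for cell, value in effects.items():
    print(cell, f"effect size = {value:.2f}")
print(f"school index (sum of effect sizes) = {school_index:.2f}")
```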

Although effect sizes respond nicely to the question of different spreads, they do not help when it comes to different shapes. When Ben Bloom in 1967 outlined the properties of his approach to Learning for Mastery, he recognized the problem of only dealing with average improvement. So his goals included not only influencing average performance but also influencing the spread and shape of performance. The goals are to raise the mean, minimize the variance, and skew the distribution! A desirable outcome, then, is a heavily positively skewed set of higher scores rather than ones that look bell-shaped.

I don't know of anyone who has argued for reducing spreads and creating positive skewness as measures related to closing the achievement gap. Perhaps someone should. It may be worth a look.


If I were to decide what to use as indicators for defining a gap and determining whether it has been closed, I would not use either a method based on cut-points or simple mean differences. I would start with effect sizes and then do some analyses to determine whether indicators of reducing variation or creating positively skewed outcome data are other possible measures.

Whatever measure is chosen, it should be grounded in empirical results. So, there is a major task for the assessment persons in the Kentucky Department of Education: to analyze their assessment data and come up with defensible suggestions for measuring a gap, measuring how much it changes, and deciding how much it must change before the gap can be said to have been reduced.


I have tried to respond directly to the gap issues without divulging my reluctance to base decisions about what is a good or effective school simply on the basis of test scores, or, for that matter, my doubts about whether schools should be held accountable for "gaps" that are based only on test scores. There is what I consider a naive view that backgrounds of students should be ignored when looking at whether schools are effective. At the same time there is an almost religious belief in the efficacy of test scores as the way to determine whether a school is good. Such views defy common experience and ignore research about schools and schooling. Some schools, for example, have relatively small amounts of turnover during a school year; others turn over almost completely. Some schools have huge amounts of parental participation; others have virtually none. And it remains true that the strongest within-country correlations with test scores in international studies are based on the background characteristics of students.

The effects of schooling are many, diverse, desirable and undesirable, both short term and long term. Tests get at a small number of similar, desirable, short term effects. NCLB ignores most content areas in judging schools. The Commonwealth's assessment measures fewer than half of its goals. Whatever happened to self-sufficiency, effective group membership, and integration of knowledge?

Tests do not get at whether a school produces persons who are thoughtful and reflective. They do not get at whether persons are well-informed. They do not get at how well persons work together or how well they respect other persons and other points of view. They do not get at whether a school produces good citizens. Good schools do all of the above! Those things are as worth thinking about as is the achievement gap, however defined.

[1] I produced these data. They do, however, mimic those I analyzed for a paper on the gap.
