Friday, August 08, 2014

Wilhoit Signals Change away from Statewide Summative Assessments as the Basis for All Decisions

Wednesday, former Ky Ed Commissioner Gene Wilhoit, who led much of the Council of Chief State School Officers' school reform effort (2007-12), signaled a possible policy shift, telling the Kentucky Board of Education that we must move away from summative assessments as the basis for all decisions. Wilhoit currently directs a Gates- and Hewlett-funded policy center at UK.

Wilhoit's comment follows months of concerns, raised here, there, and everywhere, that an over-reliance on student test score data for a host of accountability purposes has contributed to unfair and inaccurate assessments of teachers, "teaching to the test," and a general degradation of the profession. Wilhoit rationalized his comment by claiming that there has been a shift in teaching and learning in recent years (?) and that the "current system of assessment and accountability is inhibiting us from reaching our goals," something Kentucky teachers have been saying for a while now.

But the U. S. Office of Education, in a June conference call about Kentucky's NCLB Waiver, informed KDE of its "concern" that the state was under-valuing (test score) growth as a determining factor in teacher effectiveness and that the lack of weights placed on growth might under-value state assessment data. (See PGES Staff Note)

In the past, foundations like Gates helped drive national education policy, which the U. S. government often followed. Are Wilhoit's comments a signal that policy leaders are backing off some of their early exuberance over test scores as the best (and too often only) measure of teacher effectiveness? It is too soon to say. Politically, one assumes the Obama administration will finish out its years under the same general philosophy. But might changes be underway that will influence the next administration?

This from KDE via email:


At its annual retreat today, the Kentucky Board of Education discussed a new approach to assessments meant to more accurately measure students’ higher level skills and drive instructional improvement toward the goal of college/career-readiness for all students. 

Former Commissioner Gene Wilhoit, who now leads the National Center for Innovation in Education, facilitated a session on Leadership for Instructional Transformation and told the board the “current system of assessment and accountability is inhibiting us from reaching our goals.” He said there has been a shift in teaching and learning in recent years and we are “beginning to see a system that is very different than the one we inherited, a system aligned with a new vision of learning” that is more personalized, competency-based and with a clear vision of the knowledge, skills and dispositions students must possess.

Wilhoit told the board this requires new types of assessments that are both state-designed and locally-developed, are more open-ended, focused on problem solving, and include authentic performance tasks. He said we must move away from statewide summative assessments as the basis for all decisions. 

Associate Commissioner Ken Draut gave an example of an assessment model that is being developed to assess the new Kentucky Core Academic Standards in science. The model includes classroom-embedded assessments as well as multiple through-course assessments that are used to transform instruction and improve student learning but do not count toward state accountability. Accountability would be determined by a two-part end-of-year summative assessment that includes a through-course task similar to what is given throughout the school year and a content-based assessment. The summative assessments would serve to validate the reliability of classroom- and through-course assessments delivered throughout the school year. 

“What this does is free us up to focus more on our commitment to education rather than just compliance,” said board member Nawanna Privett. “It moves us away from ‘teaching to the test’,” she said.

Board members expressed an interest in a vision for transforming the current assessment and accountability system from one that drives instruction to a tool that is used to improve instruction.

At the retreat, the board also discussed dual credit/enrollment in Kentucky and the need to develop a statewide solution for consistent, equitable access to dual credit courses whereby a student may receive credit from both the high school and postsecondary institution.

Dr. Jennifer Dounay Zinth from the Education Commission of the States told the board dual credit is typically recognized with improving college completion rates, especially among minority and low-income students. 

However, Associate Commissioner Dale Winkler said that while dual credit is growing in most states, it has seen a decline in Kentucky in recent years.  

Zinth presented 13 model state-level policy components gathered from other states that increase student access and success in dual credit programs.

A statewide task force will begin meeting next month with the goal of issuing a list of recommendations in November so that more students may experience rigorous programs of study that lead to college/career-readiness and persistence to postsecondary degree programs. 

At the start of the board retreat, Franklin Circuit Court Chief Judge Phillip Shepherd delivered the oath of office for two new board members, Debra L. Cook of Corbin and Samuel D. Hinkle of Shelbyville.           

Debra Cook is a retired educator and represents members at large. She replaces Brigitte Ramsey, who resigned to take a position with the Prichard Committee. Cook will serve for the remainder of the unexpired term ending April 14, 2016.

Samuel Hinkle is an attorney with Stoll Keenon Ogden PLLC and represents the 6th Supreme Court District. He replaces Judy Gibbons, who did not seek reappointment. Hinkle’s term expires April 14, 2018.


Richard Innes said...

I was at the school board retreat and the discussion on the science assessment sounded an awful lot like KIRIS resurrection.

For one example: The NGSS don’t specify when in the school year various topics are to be covered, so Through Course Assessments will probably create a big guessing game about scope and sequence. Get that game wrong, and your kids won’t be ready for the assessment.

The performance item approach, which was mentioned as a probable format for the Through Course Assessments, was also tried before in KIRIS assessments. These were one of the first elements to crash (dead by 1996). Problem: it is virtually impossible to come up with new performance type items that sample the same content and skills to the same level of difficulty. But, get that wrong and your trend lines go out the window.

I have posted more detailed comments here:

It would be good for teachers to chime in early on this, as they will be on the front lines of trying, once again, to cope with the unworkable.

Richard Day said...

There needs to be a general sequence of prerequisites within disciplines. I’m quite sure there is.

You are referring to the performance events in KIRIS. But you’ve misidentified the problem a bit. The KIRIS performance events did not die because of a lack of equatable events. It was because (I forget the year) KDE did not (or could not) field test the items, as would be required to equate them, and thus make them meaningful. They did not have the budget to do so, as I recall. After the large infusion of money in 1991-92 the legislature progressively lessened their support. These big plans only work when they are funded. Think Senate Bill 1.

As the principal of a pretty good school, I did not find performance events to be unworkable at all. The kids at Cassidy generally enjoyed them. The one or two times we did them, the kids got a real world problem requiring them to apply some scientific and mathematical principle to solve a group problem. They were scored as a group, and as individuals. But, one must be able to discriminate between the hard problems and the easy.

Anonymous said...

Also worth noting that the current research on performance assessment coming out of the SCALE work at Stanford is significantly more sophisticated than the brand of performance assessment that we had in the 90s.

We have the potential to have master teachers designing localized assessments tied to nationally and internationally accepted standards of performance.

Implementation will be key, but I think that we, as a system of educators, have to begin reevaluating how we assess learning.

The conundrum: how do we design authentic local assessments that are norm-referenced? Figure that out and you get a cookie.

Richard Innes said...


Actually, you only have this partly right.

KDE and ASME (the testing contractor) did do a really dumb thing by not running a trial on the 1996 performance events, but there is more to the story.

Some of that is covered in the Catterall report for OEA issued in 1998, “Kentucky Instructional Results Information System: A Technical Review.” It is available online here:

Here are a few quotes:

"…the decision to drop performance events was made only after several attempts to justify the equating of these assessments proved inadequate." (Page 18)

"It was clear from the beginning that 'equating' performance event across events and years would be a difficult task." (Page 41)

"The Deputy Commissioner for Learning Results Services, Ed Reidy, writes about, 'Puzzling patterns of data for performance events ... ' in a letter to OEA's Dr. K. Penny Sanders on January 31, 1997 (page 3, paragraph 6). 'The patterns involved performance event results that were large in magnitude and inconsistent with results from open-response in the same content area, and inconsistent with results from performance events in the past'. This letter and other correspondence and materials accompanying it, provide careful documentation of the unsuccessful attempts to equate results based on the performance events and the rationale for the decision not to use them in the accountability index." (Page 41)

"The policy decision to include performance events was made for educational reasons in the absence of any technical procedures that could be used to equate or link scores across performance events. Performance events were administered and plans to use them in the accountability index were made. Only after the fact was the absence of adequate technical support recognized, confronted, and resolved by the decision not to use the performance events." (Page 42)

The bottom line is that there were unresolvable technical issues that went beyond the obvious mistake in 1996. Those technical issues have never been resolved, and I don’t think they really can be. Psychometricians might be able to do some number magic that makes it seem like you have a valid trend line over time, but the OEA Panel didn’t buy it in 1995 and Catterall wasn’t buying it in 1998, either.

Skip Kifer said...

The performance events were to influence instruction: science experiment as an assessment; hopes for science experiments in the classroom.

It does not matter if they scale.

Results look different from other kinds of responses because they are different.

Another example of terrible side effects of accountability systems.

Richard Day said...

Perhaps the day will come when we can put a high-tech helmet on a kid's head and harmlessly extract what students know. But until then we will test. All social science testing has its limitations...and performance events - doubly so.

When testing is used (as it should be, and was designed to be) to inform instruction, its limitations are less of a concern. But in a high-stakes environment where scores are used to reward and punish (mostly punish) students, teachers and schools, you are quite correct. Test makers will struggle to make performance events statistically acceptable.

Thanks for the link to Catterall et al.

August 9, 2014 at 4:57 PM: I am not familiar with the "SCALE work at Stanford." I'll take a peek next chance I get.

Anonymous said...

–-designing-deeper-learning-how-develop-performance-tasks-common-core

They are conducting a free MOOC next month. KY teachers will probably have a large presence.

Richard Innes said...

RE: Anonymous, August 16, 2014 at 2:12 PM and August 9, 2014 at 4:57 PM

I took a look at the materials related to the link you posted on August 16. As a note, this is Linda Darling-Hammond’s organization.

The MOOC course outline only talks about helping classroom teachers develop a few of their own, individual performance events.

Regarding your comments in the August 9th post, there is no discussion in the MOOC course outline about the very different requirements for a valid state assessment program built around performance events. Unlike individual, teacher-created events (which really are nothing new; I had to do such things as a student way back in the 1960s), the state must be able to link and equate performance events across years if a valid and reliable assessment program is to be maintained.

If the Stanford crowd has magic answers that can overcome the reasons why Kentucky’s KIRIS Performance Events crashed in 1996, I’d be interested to hear that.

However, I currently don’t think that is possible, and neither did testing experts who looked at the KIRIS problems in 1995 and 1998.

It’s a little like the writing portfolios. As teacher-operated classroom instruction tools, portfolios have value. But, put these performance items into assessment, and no end of problems result (and did).

Richard Day said...

Richard: It is like writing portfolios in the sense that both portfolios and performance events were designed to support sound instruction. It was only when these practices were jammed into state-wide accountability systems (where they fall prey to data hawks) that they became a bad idea.

If one's priority is accountability, they are messy. But if one's priority is teaching and learning, they provide a picture of student performance denied to multiple choice questions...even with their relatively neat statistics.

Richard Day said...

Due to some glitch I can't explain, Skip Kifer's comment (above) was only posted today when it hit my in-box, despite being sent ten days earlier on Aug 11th. I sent Skip a note, and he confirmed when he sent it. My apologies to Skip, and my hope that this hasn't happened to any other commenters. I'll try to get some explanation/help from Blogger.

I want KSN&C readers to know that I don't spike comments I receive unless they are spam (which I get a handful of every week), meant to sell you something or send you to an unknown commercial (usually porn) site, or the commenter is reduced to name-calling. In those few cases, I usually inform readers that I spiked it and share the gist of the comment - if it had any substance to it.