Using and Interpreting Tennessee's Value-Added Assessment System


A Primer for Teachers and Principals



Samuel E. Bratton, Jr.

Coordinator of Research and Evaluation

Knox County Schools

Sandra P. Horn

Educational Consultant

Value-Added Research and Assessment Center

University of Tennessee

S. Paul Wright

Statistical Consultant

Value-Added Research and Assessment Center

University of Tennessee




To all the children in Tennessee

who deserve the best and fairest

accountability system we can give them





User's Guide

 This booklet will help you learn about Tennessee's Value-Added Assessment System (TVAAS) and about the Tennessee Comprehensive Assessment Program (TCAP). While the authors' intent was to focus on how teachers and principals may use the results of TVAAS/TCAP, they also thought it appropriate to include some basic principles of the TVAAS statistics and of achievement testing itself.

To facilitate the accomplishment of multiple purposes, the booklet has been divided into four parts. The lines separating the content of the four parts may appear fuzzy at times; the authors are aware of that, and the overlap may even prove helpful to those rugged individuals who plan to read the whole thing. Use the Table of Contents freely. Skip around. The text of some topics will refer the reader to a related topic.

Part I, Evaluating and Using TVAAS Results, was designed for those who are ready for some help in interpreting results and putting them to good use. Part II, Basic Principles of TVAAS, was planned for anyone who wants to better understand how TVAAS works and what makes it new and different. Although they may not read as fast as a John Grisham novel, these explanations are written in reasonably simple, non-technical language. Part III, Standardized Testing and Alternative Assessments, deals with educational evaluation issues that have more to do with assessment of learning in general than with TVAAS directly. Finally, it would be a mistake to pass lightly over Part IV, Odd Questions and Other Important Stuff. It covers an assortment of topics and may include the very thing the reader needs most.




 Part I: Evaluating and Using TVAAS Results




How to read the official TVAAS reports 
What are some tips for reading and accurately interpreting TVAAS reports?


Standard errors of estimates  
What are standard errors?  
How are they calculated?  
Why should we know about them?


Low gains at the school level  
What are some possible causes for low gains at the school level?


Fluctuating gains  
Why do gain scores sometimes rise and fall without apparent reason?  
How common is this problem?


Figure your own student gains  
How can teachers figure their own gains, and what are the benefits and pitfalls?



Part II: Basic Principles of TVAAS




Importance of appropriate indicators of learning  
What will educators use to determine whether students have achieved their goals?


Mixed-model statistics  
What are some issues important to teachers that are addressed through mixed-model methodology?


A longitudinally merged student data base  
What is a longitudinally merged data base and why is it so important to TVAAS?


Student gains and TVAAS beat raw scores and no TVAAS in fairness test  
How can the TVAAS software fairly attribute student progress to teachers?


Teacher effects  
Should good teachers be worried about teacher effects?  
Should poor teachers be worried about teacher effects?


The cutting edge  
Has anyone ever been on the cutting edge and lived to tell about it?



Part III: Standardized Testing and Alternative Assessments




Why the norm-referenced/criterion-referenced controversy is a non-issue with TVAAS 
What is the historical perspective for the fuss over these two types of tests, and why are the TVAAS critics confused?


National norm gain  
What is the national norm gain, and where does it come from?


Improving performance on the TCAP achievement tests (norm-referenced) 
Should teachers expect higher TCAP scores if they could obtain an item analysis of their previous year's test results and drill their students on items most frequently missed?


Validity and reliability of the TCAP achievement tests 
Are the TCAP achievement tests themselves worth a toot?  
Is it time for a change?


Authentic assessment 
Is the rest of the country abandoning standardized testing in favor of authentic assessment?



 Part IV: Odd Questions and Other Important Stuff





Around the world with value-added assessment  
Is value-added assessment working somewhere outside Tennessee?


Let's talk dollars and sense  
How does the cost of TVAAS compare to other Tennessee expenditures?


Classroom teachers and statistics: tuning in or turning off?  
Can TVAAS statistics be explained in simple, non-technical language, or must the majority of Tennessee's educators accept them on faith?


Test scores as estimates  
Why are TVAAS gains reported as estimates?  
Where's the real thing?


No relationship between student gain and socio-economic status 
How can teacher effects fail to be affected by the achievement level of the students? 


 Gerrymandering TCAP scores? 
Can low gains at a given grade level be eliminated by somehow depressing or holding back the scores in the previous grade? 


The research potential of the TVAAS data base 
What are some research questions, enabled by TVAAS, which need to be addressed?





The Authors



Part I: Evaluating and Using TVAAS Results

How to read the official TVAAS reports.

What are some tips for reading and accurately interpreting TVAAS reports?

At the time of this writing, two types of official TVAAS reports had been produced and distributed to superintendents/directors of schools, typically in separate packages in early autumn. The first has been titled simply 1995 TVAAS Report. For the past three years the State Department has distributed it at the annual fall meeting of the superintendents' study council in Gatlinburg. The second report is titled Simple Paired Mean Gain by Scale Score Groups. It has been mailed to superintendents/directors of schools shortly after the first report has been received, usually in October. A third report, tentatively titled 1996 TVAAS Teacher Report, was scheduled to be produced for the first time in 1996. The first two reports, already existing at the time of this writing, will be examined in this section in some detail. Capital letters in bold print, e.g., A through DD, found throughout the text of this response, are keyed to the two sample reports reproduced on Pages 8 and 10.

The 1995 TVAAS Report is complete on one sheet of paper, front and back. A partial system-level example (dealing with only one subject) can be seen on the following page. There will be one sheet for the entire school system and one sheet for each school within the school system. Math, reading, and language are reported on one side, while the other side contains social studies and science plus several explanatory notes which apply equally to both sides (all five subjects). Columns are headed by grades (2 through 8) A. Columns are printed only for the TVAAS grades that exist in a given school or school system. Figures that appear in the columns are (1) mean scale scores, e.g., 714.7 B; (2) mean scale score gains, e.g., 25.2 C; and (3) standard errors of the estimated gains, enclosed in parentheses, e.g., (0.6) D, which are printed to the right of each mean scale score gain. No mean scale score gains are reported in the second grade column because first grade testing is not mandated and no first grade mean scale scores are reported.

The top rows of figures for both mean scale scores E and mean gains F are labeled USA Norm. For the scale scores, these are the numbers that correspond to the 50th percentile (nationally). For the gains, these are the numbers that represent the scale score points required to maintain the same percentile rank when moving from the end of one grade to the end of the next grade. For example, the USA norm for (the end of) fifth grade math is 726 G. The USA norm gain for fifth grade math is 25 H. That 25 points is obtained by subtracting the end-of-fourth-grade math scale score (701) I from the end-of-fifth-grade math scale score: 726 - 701 = 25. If a student scores 701 in math at the end of the fourth grade, he/she will be at the 50th percentile. If that student gains exactly 25 points during the fifth grade and scores a 726 at the end of the fifth grade, he/she will again be at the 50th percentile. In TVAAS these USA norm gains (25 points in this fifth-grade math example) are considered to be the expected gains for any student, regardless of where that student's score ranks on the percentile scale.

Notice that mean (average) scale scores are reported for the most recent four years. If you are looking at a 1995 report, those years are 1992, 1993, 1994, and 1995 J. Mean gains are reported for the most recent three years K. To follow the mean scale scores or the mean gains for a single group of students, one must move through the tables diagonally. This makes perfect sense because

[Sample] 1995 TVAAS Report

System: New Prospect


Math -- Estimated Mean Scale Scores

[Only fragments of this table survived reproduction. In the original, columns are headed by grades 2 through 8 A. The USA Norm row E lists the scale scores corresponding to the national 50th percentile, including the end-of-fourth-grade math norm of 701.0 I and the end-of-fifth-grade norm of 726 G. Below it are the system's estimated mean scale scores for 1992 through 1995 J, e.g., 714.7 B, each also expressed as a % of Norm.]


Math -- Estimated Mean Gains and (in parentheses) their Standard Errors

                          Grade 3        Grade 4         Grade 5        Grade 6        Grade 7        Grade 8        Cumulative % of Norm M
USA Norm F                                               H 25.0
1993 Mean Gain            46.9 (0.7)     30.8 (0.6)      22.2 (0.5)      9.6 (0.5)     17.7 (0.5)     13.2 (0.5)      86.1 (0.9)
1994 Mean Gain            55.0 (0.7)     30.6 (0.6)      26.5 (0.5)     16.5 (0.5)     19.1 (0.5)     16.8 (0.5)     101.0 (0.9)
1995 Mean Gain            51.8 (0.7)     C 25.2 (0.6) D  25.3 (0.5)     18.6 (0.5)     16.6 (0.5)     13.3 (0.5)      92.4 (0.9)
1995 3-Yr-Avg Gain L      51.2 (0.4) R*  28.9 (0.3) G    24.7 (0.3) R   14.9 (0.3) R*  17.8 (0.3) G   14.4 (0.3) R*   93.2 (0.4)
1994 3-Yr-Avg Gain        49.4 (0.4) R*  34.2 (0.4) G    27.0 (0.3) G   12.9 (0.3) R*  19.4 (0.3) G   16.7 (0.3) R*
1995 Mean Gain minus
  1994 3-Yr-Avg Gain N     2.4 S O                                       5.6 S                        -3.4 NS P

(Notes: R denotes an estimated mean gain more than one standard error below the USA norm gain; R* denotes more than two standard errors below. In the bottom row, S O denotes statistically significant improvement and NS P denotes no statistically significant improvement.)

third graders in 1993 would be fourth graders in 1994 and fifth graders in 1995. The mean gains are obtained by subtracting mean scale scores on the diagonal. You do not have to do that, since the mean gains are already printed there on the report, but understanding their origin may help simplify the report for you. One caveat to keep in mind is that when several students have transferred in from another school, the gain score calculations include them too. The difference one obtains by subtracting the scale scores, therefore, may not be an exact match with the gain score printed on the next section of the report.

When one finds an unusually large mean gain or an unusually small (even negative) mean gain, one should look at the mean scale score for that group of students in the previous year. When tracing a group back on the diagonal, one sometimes finds fluctuating scores: down one year, up the next, down the next, and so on. This phenomenon has been called the water bed effect. One should then turn to the second TVAAS report, Simple Paired Mean Gain by Scale Score Groups, because it will help determine which students contributed most to the score decline in a given year: low achievers, average achievers, or high achievers. [Also see Fluctuating gains beginning on Page 14.]

Below the three years' mean gains are two rows of figures labeled "3-Yr-Avg Gain" L. The top row is labeled as the most recent year. For 1995 it consists of an average of the gains for 1993, 1994, and 1995. The row right below it is labeled as the previous year. On the 1995 report it consists of an average of the gains for 1992, 1993, and 1994. By reporting three-year averages, any spiked annual scores will be tempered by two other years.

The column at the far right is labeled "Cumulative % of Norm" M. It reports the cumulative gain across all the grades reported. In a typical elementary school that would be grades three through five; in a typical middle school, grades six through eight. Schools with a different grade level organization would be reported accordingly within the TVAAS limits of grades three through eight. This figure is the cumulative mean gain (for the grades in the school between three and eight) expressed as a percent of the cumulative national norm gain. Since 100% of the national norm gain is the expected gain, all figures in this column should be at or above 100%. The figures for the most recent three-year average gains are the ones used to determine eligibility for state department incentive awards. To meet the TVAAS eligibility criterion, a school must have this figure at 100% or more for each of the five subjects.
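The arithmetic behind this column can be sketched in Python. The function name and the grade-by-grade figures below are invented for illustration; only the rule itself (cumulative mean gain expressed as a percent of the cumulative national norm gain, with 100% as the eligibility threshold) comes from the report.

```python
# Sketch of the "Cumulative % of Norm" calculation. All figures invented.

def cumulative_percent_of_norm(mean_gains, norm_gains):
    """Sum the school's mean gains and the USA norm gains across the
    reported grades, then express the former as a percent of the latter."""
    return 100.0 * sum(mean_gains) / sum(norm_gains)

# A hypothetical elementary school reporting grades 3 through 5:
school_gains = [22.0, 19.5, 26.0]   # the school's mean gains
norm_gains = [20.0, 22.0, 25.0]     # USA norm gains for the same grades

pct = cumulative_percent_of_norm(school_gains, norm_gains)
print(round(pct, 1))  # 100.7 -- at or above 100%, so this subject meets the criterion
```

A school would need such a figure at 100% or more in every one of the five subjects to meet the eligibility criterion.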

The bottom row is labeled, "1995 Mean Gain minus 1994 3-Yr-Avg Gain" N. This row reports statistically significant increases during the most recent year where any previous three-year average gain is marked by R*. The most recent year on the sample is 1995. As the notes on the bottom of the report indicate, an entry (gain) in this row followed by an "S" O denotes statistically significant improvement, while an "NS" P denotes no statistically significant improvement. (This row will be blank if no previous three-year average gains are marked by R*.)

The other TVAAS report, Simple Paired Mean Gain by Scale Score Groups, is very interesting and useful to teachers. Unfortunately, some teachers say that they have never seen this report. A partial school-level example showing three of five subjects can be seen on the following page. The format was revised in 1995, so that there is now a page for each grade in a given school. All five subjects are reported on the page. Typically elementary and middle schools will have three grades reported; hence, three pages per school. Mean gains are calculated by scale score groups in 50-point increments. These groups are formed by averaging an individual student's scale scores from current and previous tests. In most cases this grouping results in three to five columns (groups) with the lower achievers reported to the left Q, the average achievers in the middle R and the higher achievers to the right S. The odd column on the far right of the report contains the USA Norm Gain T.

Vertically, each subject is divided into two parts, the top part containing 1995 data U and the bottom part containing data representing the previous three-year average (1994, 1993, 1992) V. Rows labeled "N" contain the number of students whose gains were averaged W for the mean gain X in

[Sample] Simple Paired Mean Gain by Scale Score Groups

For Diagnostic Purposes Only

School System: New Prospect

School: Harper Valley Elementary

Grade 3

[Most of this table did not survive reproduction. In the original, each of the five subjects occupies one band of the page, split into a 1995 section U and a previous-three-year (1992-1994) section V, each with rows for N (the number of students) W, Mean X, and Std Error. The columns are the 50-point scale score groups, with lower achievers at the left Q, average achievers in the middle R, higher achievers at the right S, and the USA Norm Gain in the far-right column T. Among the surviving figures are a USA Norm Gain of BB 29.0, a group mean gain of CC 24.5, and its standard error of DD 5.8.]
each column of scale score groups. Cells containing fewer than eight students were left blank Y to prevent drawing inferences from too little data. The three-year average section contains larger N's Z because students from three years were added together.

By looking left and right one can determine which achievement groups were gaining more or less, and the USA Norm Gain provides the benchmark or target for all groups. By looking up and down one can determine whether the most recent gains (1995) have improved or declined from the previous three years' average. Moreover, if teachers had identified and targeted low gaining groups from the previous reports, the 1995 data will show whether those efforts are reflected in the latest report. In other words, if the high achievers had gained less than the average and low achievers and the teachers had concentrated on correcting this problem, it would be most interesting to see if gains among the high achievers had increased.

One can observe that as the number of students in a given cell increases, the size of the standard error usually decreases AA. When the standard error is both added and subtracted from the mean gain with which it is associated, it creates a kind of confidence band. Fluctuations within that range may be due to chance alone. For example, if the USA Norm Gain is 29 BB and a scale score group's mean gain is 24.5 CC with a standard error of 5.8 DD the group may have met the expected gain. When 5.8 is added to 24.5, the result is 30.3 which is above the target of 29. Of course, the 5.8 may also be subtracted from 24.5, resulting in an 18.7. In the color scheme used in the other TVAAS report, this mean gain of 24.5 would be yellow because it is within one standard error (in this case, 5.8) of the USA Norm Gain of 29. [For a more thorough discussion see the following section, Standard errors of estimates.]

A third TVAAS report, dealing with teacher effects, is expected to have a format similar to the first one, which provides system or school effects. Student mean gains for a given teacher may be compared to the expected gain (USA Norm Gain), the State mean gain, and the system mean gain. There will be a different report for each subject and/or grade taught. (When a teacher's grade or subject assignment is changed, TVAAS teacher effects are computed separately for each grade and/or subject taught.) Of course, copies of individual teacher reports will be furnished only to the teacher and his/her appropriate administrators. The law states:

The estimates of specific teacher effects on the educational progress of students will not be a public record, and will be made available only to the specific teacher, the teacher's appropriate administrators as designated by the local board of education, and school board members.1


Standard errors of estimates.

What are standard errors? How are they calculated? Why should we know about them?

The standard error of an estimate is a way to show, mathematically, how good an estimate is. The smaller the standard error, the better the estimate, in the sense that an estimate with a small standard error is more likely to be close to the true value than an estimate with a large standard error. In other words, the smaller the standard error, the more confidence one can have in the accuracy of the estimate. (Remember that all test scores are estimates of true test scores and that true test scores exist in theory only; see Test scores as estimates starting on Page 31.) Beginning in 1995, TVAAS reports included standard errors for the gains and percentages of national norm gains.

If standard errors of estimates are not taken into account, it is very easy to overreact to differences among reported gains and percentages. For example, suppose the target (in this case, the national norm gain) is 20 and the reported average gain is 18.3. One might erroneously conclude that a gain of 18.3 represents subpar performance; but if the standard error is 2.2, an estimated mean gain as low as 18.3 could very easily occur "just by chance" even though the "true" mean gain is 20 or more. If you add and subtract a standard error from the score with which it is associated, you create a kind of "confidence band" which brackets your estimate, providing wiggle room sufficient to contain deviations from the estimate due to measurement error. For the example just given, the confidence band would extend from 16.1 (18.3 minus 2.2) to 20.5 (18.3 plus 2.2). Since the target gain of 20 falls within the confidence band, the reported mean gain of 18.3 may be attributable to estimation error rather than subpar performance.

The three-year average gains on the TVAAS report illustrate this concept. The mean gain of 18.3 would be flagged with the color yellow (Y) for "caution," because it might represent subpar performance, but then again it might not. The (Y) indicates that this area should be examined carefully to determine which is the case. On the TVAAS reports, red (R) is used to indicate an estimated mean gain more than one standard error below the norm, a serious cause for alarm. (R*) or "ultra-red" indicates estimated mean gains more than two standard errors below the norm.
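As a minimal sketch, the color scheme described above can be written as a small Python function. The rules for yellow (Y), red (R), and ultra-red (R*) follow the text directly; treating a gain more than one standard error above the norm as green (G) is an assumption based on the codes shown in the sample report, and the function name and figures are invented.

```python
# Sketch of the TVAAS color-flagging rules described in the text.
# The "G" branch is an assumption; everything else follows the text.

def flag_gain(mean_gain, std_error, norm_gain):
    deficit = norm_gain - mean_gain      # how far the gain falls below the norm
    if deficit > 2 * std_error:
        return "R*"                      # more than two standard errors below
    if deficit > std_error:
        return "R"                       # more than one standard error below
    if deficit >= -std_error:
        return "Y"                       # within one standard error: caution
    return "G"                           # more than one standard error above (assumed)

print(flag_gain(18.3, 2.2, 20.0))  # Y -- the example in the text
print(flag_gain(24.5, 5.8, 29.0))  # Y -- the scale score group example
```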

How are standard errors calculated? Those who have taken a statistics course will almost certainly have calculated the sample mean and standard deviation of a set of scores. The sample mean estimates the true underlying average score. The sample standard deviation measures the amount of variability among the scores, i.e., how much the scores of different individuals differ from one another. In this simplest possible case, the standard error of the estimated mean (i.e., of the sample mean, which is an estimate of the true mean) is simply the standard deviation divided by the square root of the sample size. The calculation of standard errors in TVAAS is more complicated because each individual student has more than one score (there are scores in five subject areas for up to five consecutive years), but the underlying principle is the same. What is clear is that the standard error depends (among other things) on the amount of data. An estimate based on nine scores will have a larger standard error than one based on 109 scores. Similarly, schools will have larger standard errors than school systems, and small schools will have larger standard errors than large schools.
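For readers who want to see the simplest case in code, here is a short Python sketch; the function name and the gain scores are invented for illustration.

```python
# The simplest case: the standard error of a sample mean is the sample
# standard deviation divided by the square root of the sample size.
import math
import statistics

def standard_error_of_mean(scores):
    return statistics.stdev(scores) / math.sqrt(len(scores))

# Two hypothetical sets of gain scores with roughly the same spread:
few = [12, 18, 25, 31, 19, 22, 28, 15, 20]   # 9 scores
many = few * 12 + [21]                        # 109 scores
# The larger sample yields the smaller standard error of the mean.
print(standard_error_of_mean(few) > standard_error_of_mean(many))  # True
```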



Low gains at the school level.

What are some possible causes for low gains at the school level?

The answers to low gain problems at the school level must be found at the school level. A careful analysis of a given school's TVAAS reports can usually provide at least some answers to this question. What we can do here is provide some questions for you to ask yourselves as you look at your data, subject by subject and grade by grade. Begin with these six questions. Do not argue or fret over those that do not apply to your school; just read and move on.

First, look to see if certain subject areas are being short-changed by one or more teachers. In self-contained classrooms this can be due to a teacher's lack of interest and/or expertise in one or more of the five TVAAS subjects. In the self-contained, elementary classroom, science may be the most likely victim of this problem, although it could be any subject. Other reasons for slighting a particular subject could be a teacher's lack of resources or a teacher's perception of expectations held by the principal, the superintendent, or the community.

Second, use your TVAAS report of gains by achievement groups to determine if student gains are consistent across your low achievers, average achievers, and high achievers. All three groups need to achieve gains in order to sustain growth equal to the national norm gain. Most people find varied patterns from school to school or even from grade to grade and subject to subject within a given school. Appropriately challenging all students is one thing that makes good teaching difficult, but it is also one of the things that separates excellent teachers from average teachers. Incidentally, teachers who do not believe that all children can learn will have a lot of trouble with this one.

Third, gains may be low if students are frequently off task or teachers are not fully engaged in teaching. Some teachers seem to have difficulty recognizing this problem. There may be too many interruptions of the instructional process, daily, weekly, or monthly. Also, consider a lack of mission and vision for the school, scheduling problems, overcrowded facilities, trouble with support services, extra-curricular activities encroaching on instructional time, classroom management problems, a laissez-faire administration, poor instructional planning, a lack of instructional assistance, and so on. Note this word of caution: Do not let some of these instructional distractions be used as an excuse for adopting a "what's the use?" attitude.

Fourth, a human energy problem can be responsible for low gains. We knew, intuitively and before it was confirmed by TVAAS data, that some teachers are more effective than others. We also know that for a variety of reasons, some teachers have lost the energy, the drive, or the enthusiasm that is necessary to be a successful classroom teacher today. Perhaps a few never possessed it. If several such teachers happen to be employed in the same building, for whatever reasons, the instructional program will not function as smoothly as it should. That is not according to TVAAS; that is according to common sense. Ask yourself the question: Does my school have several teachers who fit the profile of the tired or the unfulfilled?

Fifth, if several student scores should be abnormally high due to inappropriate action by the students or by a test administrator, potential gains in the following year may be reduced. With the TVAAS analysis, it is easier to detect cheating and to pinpoint the guilty person(s). Incidentally, one of the fine features of the mixed-model statistical process is that it will detect score anomalies and adjust the effects, thereby protecting subsequent teachers when, in a previous year, some of their students have received artificially inflated test scores-whatever the reason for it.

Sixth, some low gains may defy explanation, at least for now. Some such anomalies may be found to be school-wide, district-wide, or even state-wide. A few may eventually be traced to curricular misalignment or some other kind of error. Each such oddity should be checked out. In an undertaking as large as TVAAS, one should expect a few glitches. Again, this "unexplained phenomenon" theory must not become a blanket excuse, a reason to stop looking for other explanations, or a "justification" for discrediting the entire process.




Fluctuating gains.

Why do gain scores sometimes rise and fall without apparent reason? How common is this problem?

The fluctuating gain phenomenon is a frequent topic of TVAAS conversation. Critics sometimes cite it as evidence that TVAAS is flawed. The TVAAS staff, concerned about the number of times the question was being raised, looked closely at the data and determined that the rise-and-fall problem was not nearly as common as some would suggest: in over 70% of cases checked, the problem simply was not found. In the first place, apparent fluctuations are not always real fluctuations; some are merely changes due to chance. Chance is a statistical phenomenon which is always present, and the degree to which it is present can be precisely measured and expressed as an error of measurement or estimate. The remainder of this discussion, then, must be divided in response to two questions: "Are the fluctuations real?" and "What if they are?"

Determining whether there is a real difference between this year's gains and last year's gains requires the standard statistical procedure for calculating the standard error of the difference between two estimates. The next paragraph explains how to do that and, like a textbook, follows the discussion with an example.

Each estimate has its own standard error, but to look at the difference between estimates, one must derive the standard error of the difference by combining the two standard errors of estimates. Provided the two estimates are "independent" (as they will be if they are for two completely different sets of students, such as "this year's class" versus "last year's class"), the standard error of the difference is the square root of the sum of the squares of the individual standard errors. (Square each standard error. Add them together. Take the square root of that sum.) By the conventional rules for statistical decision-making, the difference between two estimates is "not significantly different from zero," i.e., probably not a "real" difference, unless the magnitude of the difference is larger than twice the standard error of the difference.


First class' mean estimated gain = 20.2 (standard error = 2.2)

Second class' mean estimated gain = 4.5 (standard error = 1.9)

2.2 × 2.2 = 4.84;  1.9 × 1.9 = 3.61;  4.84 + 3.61 = 8.45;  √8.45 = 2.91

20.2 - 4.5 = 15.7;  2.91 × 2 = 5.82;  15.7 is larger than 5.82; therefore, the difference is real
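The same calculation can be written as a short Python function (the function name is invented; the figures are the ones from the example above): square each standard error, add the squares, take the square root to obtain the standard error of the difference, and treat the difference as "real" only if its magnitude exceeds twice that value.

```python
# Sketch of the standard-error-of-the-difference test described above.
import math

def difference_is_real(gain_a, se_a, gain_b, se_b):
    # sqrt(2.2^2 + 1.9^2) = sqrt(8.45), about 2.91, in the example above
    se_diff = math.sqrt(se_a**2 + se_b**2)
    return abs(gain_a - gain_b) > 2 * se_diff

print(difference_is_real(20.2, 2.2, 4.5, 1.9))  # True: 15.7 is larger than 5.82
```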

If, after going through the above process, one determines that there is a real difference between gains from year to year, it may signify a genuine difference in educational outcome. When this is the case, the difference may be due to any number of practices. Doing "exactly the same thing" with this year's class as with last year's class does not guarantee comparable growth in achievement. This year's students may differ significantly from last year's students. For whatever reason, they may be less well prepared or, on the other hand, far more advanced than the previous year's students. Other possible causes of fluctuating gains may be the institution of new methods of instruction or different teaching strategies, curriculum revisions, changes in professional personnel, changes in school or district policies, changes in the school's mission and vision, or a change in the availability of resources.

If fluctuations over time result in a trend of increased gains, this is a good thing. In years where gains are below the national norm, even going so far as no gain or "negative gain," look for the cause among the suggestions (negative direction, of course) listed in the previous paragraph. In addition, look at the group's mean scale scores along with the group's insufficient gains. If a group has made exceptionally good gains the previous year, the current teacher must know that and adjust his/her instruction accordingly, or that group will almost certainly fail to "grow" enough to meet the current year's expected gain. The "secret" to obtaining consistent gains is to teach children from where they are when they enter the classroom. [Also see Improving performance on the TCAP achievement tests (norm-referenced) beginning on Page 26.]




Figure your own student gains.

How can teachers figure their own gains, and what are the benefits and pitfalls?

Like all "hands-on" learning, figuring your own gains will give you a better feel for the process. For most teachers it will help to relieve unwarranted anxiety about the teacher effects. Computing your own student gains is relatively easy, but it does require a very specific process to guard against unacceptable measurement error. The first part of this response is like a recipe that explains exactly how to figure your own student gains. The second part explains why you must do it this way and why the more sophisticated TVAAS process can do it even better. In most cases the "real" teacher effects, computed with the benefit of mixed-model statistics, will be more positive than those you figure yourself.

As you work your way through this paragraph and the next one, refer to the example which follows. Choose one of the five TVAAS subjects and list all of the students for whom you have TCAP test results in that subject for two (or more) years. Find and record the scale scores for the previous year in one column, and in a parallel column, do the same for the most recent year. Average the two scale scores for each student and record the result in a third column beside the appropriate student's name. Rearrange your list of students from high to low based on the average scale scores you just computed. Divide the list roughly into thirds. Let no group have fewer than five students. (If all the students total fewer than 15, use only two groups.) Keep in mind that when there are very few students in a category, the information about their gains is of less use in coming to conclusions about the effects of instruction on that group.

Obtain the gains for the most recent year by subtracting the older scale score from the most recent one. Do this for each student and record the gains in still another column. Compute an average gain for the students in each of your three achievement groups. If you happen to have any students with negative gains (actually losses), take particular care as you average negative numbers. The ideal finding is comparable gains that approximate or exceed the national norm across all groups. If the average gains of the three achievement groups are not similar, it may suggest some instructional strategy changes you should make. Always remember, too, that the more students you have in a group, the more confidence you can place in the results. The remainder of this section (following the classroom example) explains why that is so.
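For teachers comfortable with a little programming, the recipe above can be sketched in a few lines of Python. The student names and scale scores below are invented for illustration, and this is only the simple raw-gain calculation described here, not the TVAAS computation itself.

```python
# Sketch of the "figure your own gains" recipe. Names and scale scores
# are hypothetical; with only six students, two groups are used (see text).

students = {
    "Student A": (695, 736),  # (previous-year SS, most-recent SS)
    "Student B": (662, 717),
    "Student C": (631, 647),
    "Student D": (604, 641),
    "Student E": (575, 632),
    "Student F": (548, 594),
}

# Average the two scale scores for each student, then sort high to low.
rows = sorted(
    ((name, old, new, (old + new) / 2, new - old)
     for name, (old, new) in students.items()),
    key=lambda r: r[3], reverse=True,
)

# Fewer than 15 students in total, so use only two achievement groups.
half = len(rows) // 2
groups = {"higher": rows[:half], "lower": rows[half:]}

for label, group in groups.items():
    avg_gain = sum(r[4] for r in group) / len(group)
    print(f"{label} achievers: average gain = {avg_gain:+.1f}")
```

Any spreadsheet can do the same arithmetic; the point is simply to average two scale scores before grouping, then average the gains within each group.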

In this example the lowest achieving group made the highest average gain (+41), while the average achieving group made the lowest average gain (+19). The highest achieving group made an average gain (+31) that fell between the other two. The average group was the only one of the three that did not achieve the expected gain (+25). This teacher needs to think about his/her delivery of instruction to the middle or average students. Does the profile make sense to this particular teacher, and if so, what might be done to improve the gain scores of mid-range students? As a general rule, both low achieving and high achieving students tend to demand more attention than average achieving students. Perhaps this teacher needs to be more intentional with instructional strategies for the average achiever. Perhaps this teacher's expectations for these students were too low, although this would appear less likely, because the low achieving students were doing so well.

Anyone who seeks to use the information obtained from raw gains, as illustrated on the following page, must understand that the resulting information contains bias that TVAAS is designed to minimize. TVAAS is a sophisticated process that takes into account a multitude of factors in rendering estimates of student gains. Nevertheless, valuable insight can result from the study of simple gain scores by individual teachers or principals. By averaging two (or more) test scores it is possible to eliminate some of the bias or measurement error that is always present. It is far more likely that a student will be correctly assigned to an achievement group on the basis of an average score than on the basis of any single score. Although one might reason intuitively that two scores are better than one, let us explore the scientific reasons for it.

A true score, according to CTB/McGraw-Hill, ". . . is the hypothetical average score that would result if the test could be administered repeatedly without practice or fatigue effects."2 However, standardized tests are not administered repeatedly. Usually, they are administered once a year. In Tennessee the tests, although equivalent, are different each year. Even if the same tests were given again and again, students would undoubtedly experience learning and fatigue. Therefore, the score for any single student for any single testing situation is unlikely to be the exact score that is the true measure of his or her achievement. Instead, the scores that students receive from a single testing experience reflect their actual level of achievement modified by elements of chance or luck.


Example: Computing gains at the classroom level

Grade 5 Mathematics     National norm gain = 25

(Worksheet columns: SS for 1994, SS for 1995, Average SS, and gain = 1995 - 1994. The gains for each achievement group are shown below.)

Highest achieving group gains: -1, +40, +55, +45, +22, +26, +27
Average gain = +31

Average achieving group gains: +16, +37, -6, +43, +3
Average gain = +19

Lowest achieving group gains: +57, +46, +46, +39, +46, +11
Average gain = +41


All tests exhibit these errors of measurement. CTB/McGraw-Hill explains:

It is assumed that measurement error is associated with any test score. The standard error of measurement is an estimate of the amount of error to be expected in a particular score from a particular test. This statistic provides a range within which a student's true score is likely to fall. Therefore, an obtained score should be regarded not as an absolute value but as a point within a range that probably includes a student's true score.3
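The publisher's point can be made concrete with a worked example. Suppose a student's obtained scale score is 710 and the test's standard error of measurement is 18; both numbers are invented for illustration, not actual TCAP figures.

```python
# Hypothetical illustration of the standard error of measurement (SEM).
obtained = 710  # invented obtained scale score
sem = 18        # invented SEM for this test

# The student's true score probably lies within about one SEM of the
# obtained score, so the obtained score marks a range, not a fixed point.
low, high = obtained - sem, obtained + sem
print(f"obtained score: {obtained}; true score likely between {low} and {high}")
```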

This phenomenon causes difficulty when we try to place students into achievement groups by using only one test score. The students who scored close to their true level of attainment would form the majority in each group, but some of their lucky and unlucky classmates would also appear in the high and low groups. The luck factor shows up as bias in their scores. Experience has shown that these very lucky and very unlucky students are likely to score closer to their own true scores the next time they are tested. This means that extremely lucky high scorers will tend to score lower the next time and extremely unlucky low scorers will tend to score higher. All of this may seem fairly obvious, but it lies at the root of the problem with the interpretation of raw scores.

TVAAS does not rely upon single scores to calculate gains. Students are followed longitudinally over a period of three to five years, and the variance of their scores is entered into the determination of system, school, and teacher effects. When scores are analyzed this way, it is possible to strip the bias from the individual scores and furnish unbiased estimates of gains. Although TVAAS employs complex computational and statistical methodologies, an individual may use averages (of at least two scale scores) as a simple way to mitigate an important part of the bias inherent in student scores.
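The benefit of averaging two scores can be demonstrated with a small simulation. The true score and SEM below are hypothetical; the simulation simply shows that the average of two noisy measurements lands closer to the truth, on average, than either measurement alone.

```python
import random

# Simulation with invented numbers: a student's true score is 700 and each
# test administration adds random measurement error with SEM = 20.
random.seed(1)
TRUE, SEM, TRIALS = 700.0, 20.0, 100_000

single_err = avg_err = 0.0
for _ in range(TRIALS):
    s1 = random.gauss(TRUE, SEM)  # one test score
    s2 = random.gauss(TRUE, SEM)  # a second, independent score
    single_err += abs(s1 - TRUE)
    avg_err += abs((s1 + s2) / 2 - TRUE)

# Averaging two scores shrinks the typical error by a factor of about 1.4.
print(f"typical error, one score:      {single_err / TRIALS:.1f}")
print(f"typical error, average of two: {avg_err / TRIALS:.1f}")
```

This square-root-of-two improvement is exactly why an average of two scale scores assigns students to achievement groups more dependably than a single score.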




1. Tennessee Code Annotated, Title 49, Chapter 1, Part 2 (g) (5).

2. CTB/McGraw-Hill, CTBS Spring Norms Book, March Through June, 4th ed. (Monterey, CA: CTB/McGraw-Hill, 1990), p. 7.

3. CTB/McGraw-Hill, p. 7.





 Part II: Basic Principles of TVAAS

Importance of appropriate indicators of learning.

What will educators use to determine whether students have achieved their goals?

The means teachers use to determine whether students have achieved their goals may be called indicators of learning. They range from simple observation to group-administered standardized tests, from daily homework to complex laboratory experiments. Test scores, documented performance, and portfolio artifacts are all indicators of learning. Determining which indicators are best suited to specific purposes is the core of the student assessment debate.

The determination of whether learning has taken place depends upon what questions are asked and how they are asked. Regardless of the subject or grade level, there are countless indicators that can provide information about the action or subject under consideration. The precision of the measurement depends on the means used to gather data and the extent to which data are collected. To put it another way, a meaningful evaluation depends largely upon the quality of the indicators utilized.

Frequently, several different indicators may be considered. Statistics can easily determine the correlations among indicators. By knowing the capacity of the indicators to assess the subject and the correlation between various indicators, it is possible to draw inferences from one indicator to another. If the indicators are highly correlated, then it is no longer a question of which is better, but which is more cost effective. For example, if the eighth grade TCAP language arts achievement test is highly correlated with the eighth grade writing assessment, then it is not necessary to use both from an evaluation perspective. Both may be needed for other reasons, e.g., the test results for calculating gains and the writing assessment as part of an instructional strategy.

On the other hand, a total lack of correlation between indicators suggests one of two things: they are not measuring the same things, or at least one of them is a poor measure of the subject. Absolute accuracy in any type of measurement is impossible. To disregard cost, time, and the impact on the subject would be irresponsible. The point is that statistical correlations can be extremely useful to educators in checking the validity of measurement devices and in providing valuable input for making cost effective decisions.
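Computing such a correlation requires nothing more than paired scores for the same students. The scores below are invented: a language arts achievement test (scale scores) and a writing assessment (holistic ratings) for seven hypothetical students.

```python
# Hypothetical paired scores on two indicators for the same seven students.
test_scores = [640, 655, 673, 690, 702, 718, 733]      # achievement test SS
writing_scores = [2.1, 2.4, 2.9, 3.0, 3.4, 3.6, 3.9]   # writing ratings

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

r = pearson(test_scores, writing_scores)
print(f"correlation between the two indicators: r = {r:.2f}")
```

A value of r near 1 would support the argument in the text: when two indicators track each other this closely, the cheaper one can stand in for the more expensive one for evaluation purposes.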

In education, the assessment of learning is generally built around the demonstration of competence in certain domains. These domains and the goals and objectives that address them are formalized in curricular frameworks and course outlines. Teachers design courses of instruction based on the curricular guidelines and, generally, even though teaching may take any number of forms, there is a high correlation between what is taught and the formal curriculum. It is this correlation that makes large-scale assessment possible. Indicators can be developed that measure learning along the articulated curriculum, and because of the correlation between instruction and the stated curriculum,

inferences can be drawn about the effectiveness of instruction in school systems, schools, and classrooms. [Also see Why the norm-referenced/criterion-referenced controversy is a non-issue with TVAAS beginning on Page 24.]


Mixed-model statistics.

What are some issues important to teachers that are addressed through mixed-model methodology?

First, the mixed-model methodology used in TVAAS makes it possible to use all the data available on each child. This is important because, as everyone knows, children sometimes miss tests. Other models that use test data for assessment must either eliminate all sets of incomplete data or must somehow "impute" data to fill in the blanks. By using mixed-model methodology, TVAAS can utilize all the available data without imputing any data. TVAAS does this by weighting complete records more heavily than partial records, so the records of children with fewer years of data or scores for fewer subjects count less in the determination of educational effects than do the records of children for whom more data are available.

Second, by using longitudinal data, TVAAS is able to produce more reliable estimates of the school, system, and teacher effects on the academic gains of students than other assessment systems. Because students are followed over time and because several years of data are used to determine these estimates, more data are utilized to determine the effects, making them more reliable than "one shot" assessment models.

Third, TVAAS contains a methodology that ensures that no teacher will be misclassified as extremely good or extremely bad due to chance. The "shrinkage" estimate that is an integral part of TVAAS prevents this misclassification from occurring. In TVAAS, all teachers are considered to be at their system's mean until overwhelming data "pull" their estimates away from that mean. Since all teacher estimates are measured against their own system's mean gain, a teacher must be found to have gains significantly different from this system mean to be classified above or below average.
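The shrinkage idea can be sketched with a simple weighting formula of the empirical-Bayes kind. This is an illustration of the concept only, not the actual TVAAS computation; the system mean, the shrinkage constant, and the gains are all invented.

```python
# Sketch of shrinkage toward the system mean (not the TVAAS formula).
SYSTEM_MEAN = 25.0  # hypothetical system-wide mean gain
K = 30.0            # hypothetical shrinkage constant

def shrunken_estimate(raw_mean_gain: float, n_student_scores: int) -> float:
    """Weight the raw estimate by n/(n+K); the rest stays at the mean."""
    w = n_student_scores / (n_student_scores + K)
    return SYSTEM_MEAN + w * (raw_mean_gain - SYSTEM_MEAN)

# A high raw gain based on only a few scores barely moves off the mean...
print(shrunken_estimate(45.0, 5))
# ...but the same raw gain backed by much more data pulls well away from it.
print(shrunken_estimate(45.0, 150))
```

Notice the direction of the safeguard: with little data, the estimate stays near the system mean, so a teacher cannot be flagged as extreme on the strength of a handful of lucky or unlucky scores.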

Fourth, other assessment systems based on standardized testing have depended on simple raw scores. TVAAS, on the other hand, has dealt with the same evaluation problems by focusing on the measurement of academic progress. TVAAS data have shown that academic progress of students cannot be determined by knowing the economic or racial composition of a school. This means that all students can be expected to make comparable gains, regardless of race or level of affluence, when taught in schools, systems, and classrooms of equal effectiveness.

Fifth, experts in the field of educational statistics and highly respected theoretical statisticians, who have studied TVAAS, have found the process sound and appropriate for the assessment of educational effects.

Mixed-model statistics were pioneered outside the field of education (in genetics), and though the statistical concepts have been around for several years, they were not widely used until recently because of their hardware and software requirements. Matrix algebra is used and thousands of equations must be solved simultaneously. Even now, one can find little in the literature on the use of mixed-model statistics in the social sciences. Tennessee is on the cutting edge of this methodology, and that is exciting. The cutting edge is never found in the comfort zone, but it is not necessarily in la-la land either. [Be sure to read Classroom teachers and statistics: tuning in or turning off? on Page 31.]
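To give a feel for what "thousands of equations solved simultaneously" means, here is a toy mixed-model system in the style pioneered in genetics, shrunk to three equations: one overall mean and two teacher effects. Everything here is hypothetical (the gains, the two-teacher layout, and the variance ratio); TVAAS solves systems of this general form at vastly larger scale.

```python
# A toy mixed-model system: solve for an overall mean and two random
# teacher effects. Illustrative sketch only; all numbers are invented.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

# Hypothetical student gains for two teachers, three students each.
gains = {"teacher 1": [40.0, 35.0, 45.0], "teacher 2": [10.0, 15.0, 20.0]}
lam = 10.0  # assumed ratio of error variance to teacher variance

n1, n2 = len(gains["teacher 1"]), len(gains["teacher 2"])
n_all = n1 + n2
y_sum = sum(sum(g) for g in gains.values())
t1_sum, t2_sum = (sum(g) for g in gains.values())

# The simultaneous equations: rows for the mean and each teacher effect.
A = [[n_all, n1, n2],
     [n1, n1 + lam, 0.0],
     [n2, 0.0, n2 + lam]]
mean, u1, u2 = solve(A, [y_sum, t1_sum, t2_sum])

# Raw teacher means deviate from the grand mean by +/-12.5; the solved
# effects are shrunk toward zero, matching the safeguard described above.
print(f"overall mean gain: {mean:.1f}")
print(f"teacher effects: {u1:+.2f}, {u2:+.2f}")
```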


A longitudinally merged student data base.

What is a longitudinally merged data base and why is it so important to TVAAS?

The huge TVAAS data base, currently containing nearly four million student records, has sometimes been underemphasized in previous value-added explanations. This data base stores up to five years of test scores "on-line," allowing calculations to include a historical profile of each student's scores. TVAAS could not function without such a data base. The computer that handles this huge data base requires one gigabyte of random access memory (RAM).

Typically, achievement test scores have been reported annually, and all calculations involved the current year's data only. The next year the whole process began anew. Only in Title I programs or in small school districts was one likely to find multiple-year data, and those annual merges were typically performed at the central office level. Moreover, gain scores were calculated only for students with matched pre-test and post-test scores. With TVAAS, individual student scores are retained on-line for up to five years, and mixed-model statistics allow all individuals to be included in teacher, school, and district effects calculations, including those with only one year's scores. [Also see No relationship between student gain and socio-economic status on Page 32.]


Student gains and TVAAS beat raw scores and no TVAAS in fairness test.

How can the TVAAS software fairly attribute student progress to teachers?

TVAAS addresses the issue of fairness in calculating individual teacher effects by capitalizing on what is known about normal behavior (of both students and test items) and measuring the magnitude of any significant deviations from that normal behavior. Such deviations are aggregated over time (at least three years) for each teacher. Deviations (inconsistencies) among a few students are "normal" and may be attributed to many different causes. Deviations by a majority of the students a given teacher has taught flag that teacher as different from the norm, positively or negatively, depending on the direction of the deviations. Read on to see how this works.

Any student taking achievement tests more than once tends to make similar scores on them. We call this consistency. After we have taught students for a little while, we even label them. (Whether we should or not is another question, but we do.) We say John is an "A" student, and Mac is a "C" student. We do this because over a period of time we find that John normally performs at a higher level than Mac (on tests or any other criteria we might use to judge their work). This consistency of performance is found whether we are looking in the suburbs or in the inner city, whether we are looking at high achievers or low achievers.

Turning to the TCAP achievement tests, we find that individual test items tend to be answered the same way by similar students time after time. We call this reliability. Because of TVAAS's test score history on each student (the longitudinal data base), each student can be evaluated for consistency. Remember that the test items are reliable or they would have been thrown out by the test publisher. When a given student's profile (test score history) is found to contain inconsistencies, there has to be some reason for it. TVAAS data clearly show that gain scores are not sensitive to socio-economic differences or racial differences. When the student inconsistencies (deviations from the norm) are counted and aggregated, the most important factor always turns out to be the same: who the teacher was. Incidentally, the teacher effects calculations in TVAAS were intentionally designed very conservatively to prevent any teacher from being mis-labeled. When a teacher's effects deviate significantly from the average of all teachers, one can be almost certain it is not a fluke.


Teacher effects.

Should good teachers be worried about teacher effects? Should poor teachers be worried about teacher effects?

Good teachers have nothing to worry about. Poor teachers should have been worried long before TVAAS came along. Here are words of comfort for the great majority:

(1) Teacher effects will not be published in newspapers.

(2) Based on three feasibility studies and preliminary indications from state-wide data, we believe that most teachers will profile very well.

(3) The law says value-added data may be used in teacher evaluations. The law does not speak of consequences at the teacher level at all.

(4) In a teacher evaluation process, student achievement would always be only one of several components. The State teacher evaluation process has stipulated for several years that student data may be included in one's data sources. Without TVAAS, however, there was no way to filter out a number of confounding variables and ensure fairness.

(5) Since school and district effects are computed differently from teacher effects, school effects in small schools, e.g., those with one teacher per grade, will not necessarily be the same as the teacher effects. Because of safeguards included in computing the teacher effects, they will almost always be more positive than the corresponding school effects.



The cutting edge.

Has anyone ever been on the cutting edge and lived to tell about it?

Yes indeed. If none of us had ventured out there, we would all still be waiting for lightning to start our camp fires. TVAAS has taken us to the cutting edge in the use of student achievement data in educational evaluation, and that is exciting. The TVAAS development team has created customized software, brought mixed model statistics from other disciplines, assembled what may be education's largest longitudinally merged student data base, utilized a good norm-referenced achievement test, and with state-of-the-art computing power, has resolved some educational evaluation problems that had previously defied resolution. We are witnessing creative problem solving and technology on the move in educational evaluation. We expect and accept innovation and progress in other fields; why do some of us consider it impossible in our own? [Also see Around the world with value-added assessment on Page 30.]


Part III: Standardized Testing and Alternative Assessments

Why the norm-referenced/criterion-referenced controversy is a non-issue with TVAAS.

What is the historical perspective for the fuss over these two types of tests, and why are the TVAAS critics confused?

It is indeed unfortunate that TVAAS has been cast into a controversy between norm-referenced tests (NRT's) and criterion-referenced tests (CRT's). TVAAS has been a pawn in a battle it did not start or need, a fight that is irrelevant to the success or failure of value-added assessment in Tennessee.

Here are the facts: In order to function as it was conceived, TVAAS needed a set of scaled tests that are reasonably related to the curriculum and that contain questions of varying difficulty in order to adequately discriminate among the wide range of achievers typically found in most classrooms. When the Education Improvement Act was enacted in 1992 and TVAAS got started, the norm-referenced portion of the TCAP achievement tests best fit the necessary criteria. TVAAS can also function with properly constructed criterion-referenced tests (refer to the criteria just listed), as we will see with the high school subject matter tests, beginning with five mathematics courses in 1996. The labels, NRT's, CRT's, or whatever, are unimportant to TVAAS as long as the needed measurement properties are present.

Turning to the historical perspective: In the beginning there were norm-referenced tests. Like most things, they had their advantages and their disadvantages. Then, about 30 years ago a new kind of test was proposed, at least partially to address some of the perceived problems of norm-referenced tests. The first skirmishes of the war were fought between the proponents of the new CRT's and the defenders of the established NRT's. It soon became obvious that the new CRT's had some disadvantages of their own, one being that they could not, by their very nature, furnish the same kind of information that was available (and needed, many said) from norm-referenced tests. Subsequently, a third party was formed with a platform suggesting that both types of tests were needed. By this time, however, battle lines had been drawn deep in the sands for some people, and some of those folks continue to fight even after the war has ended.

Meanwhile, one of the significant innovations to come along has been TVAAS. While searching for tests with properties which would meet their specifications, the developers fell into this now well-worn testing controversy. Originally, circumstances led them toward NRT's, for quite logical reasons, incidentally. The good news is that TVAAS, with its mixed-model statistics and longitudinal student data base, has solved or by-passed almost all of the historical disadvantages of norm-referenced tests. The bad news is that hardly anyone knows it, and those who do have not been very successful, so far, in convincing those who doubt it. Many of the doubters are among TVAAS's severest critics. These critics are busy trying to shut down the entity that has solved some of their most perplexing problems: irony at its zenith.


Finally, for those who are not sure what the NRT/CRT dichotomy is all about, the remainder of this section compares and contrasts the properties of the two. Syntactical clues reveal the basic difference between NRT's and CRT's (what do the respective tests reference?), and we should understand this difference, because in addition to filling a void in our pedagogical toolboxes, the TCAP achievement tests consist of both NRT and CRT items. An easy way to distinguish between norm-referenced and criterion-referenced tests is to compare and contrast critical points as outlined below:

(1) NRT: One's score is compared (referenced) to the scores of a peer group, which may be local (e.g., one's own school), state, or national.
    CRT: One's score stands alone, indicating a level of mastery of the objectives (criteria) on which the test was based.

(2) NRT: NRT's are timed.
    CRT: CRT's may or may not be timed.

(3) NRT: Questions vary as to difficulty, ranging from a few very easy questions to a majority of "grade level" questions to a few very difficult questions.
    CRT: Questions have a much narrower range of difficulty than NRT's, the vast majority being "on (or below) grade level."

(4) NRT: An average student is expected to correctly answer only about 60% of the questions. There should be a "reasonable" match between the NRT and the curriculum taught.
    CRT: An average student is expected to correctly answer 100% of the questions. Since a near-perfect match exists between the CRT and the curriculum taught, objectives not mastered by a given student should be retaught and retested.

(5) NRT: Several types of scores may be derived from the number of questions answered correctly, but all show how a given student ranks in relation to his/her peers. Score types are: percentiles, stanines, normal curve equivalents, scale scores, and grade equivalent scores.
    CRT: Scores may be reported as a simple number or percent of questions answered correctly. Objectives or domains may be reported separately as mastery, partial mastery, or non-mastery.

(6) NRT: There is no such thing as passing or failing a norm-referenced test.
    CRT: A pass/fail cut-off score may be set for a criterion-referenced test, as is the case with the TCAP competency test (70%) in language arts and mathematics.

A lot of teachers seem to be more comfortable with criterion-referenced tests. This is probably because CRT's more nearly resemble their own teacher-made tests, and the test items tend to be very course specific and limited in difficulty level to average or below. What many do not yet understand is that through mixed model statistics and a longitudinal data base, TVAAS can accomplish with an NRT that which could previously only be accomplished with a CRT. It is like having both the advantages of NRT's and the advantages of CRT's without taking on any disadvantages. Again, irrespective of labels, TVAAS needs an achievement test series with (1) a continuous scale, (2) items related to the curriculum, and (3) some items both above and below grade level. Check out the topic, The cutting edge, on Page 23.


National norm gain.

What is the national norm gain and where does it come from?

National norm gains may also be called target gains or expected gains. One of the first things you should know is that there is nothing mysterious or secretive about national norm gains. They are derived from the norming process, which is typically planned and directed by the test publisher prior to the introduction of a new or revised achievement test series. National norm gains are printed in the annual TVAAS reports. National norm gains for the TCAP achievement tests may also be obtained from the publisher, CTB/McGraw-Hill, or from the State Testing and Evaluation Center (STEC). National norm gains remain constant for the duration of a particular edition of a test. Each of the five TVAAS subjects has its own set of national norm gains, so if you are interested in all subjects, you will have not one, but five sets of expected gains.

If you prefer, you may compute the national norm gains that TVAAS uses and construct your own graphs to illustrate them. Take a piece of ordinary graph paper and put time on the horizontal axis in the form of years (actually, grade levels, beginning and ending anywhere you would like between kindergarten and the twelfth grade). On the vertical axis put scale scores. For the entire span of grades on the TCAP's, the scale will begin with 1 and end with 999. At each grade marked off on your horizontal axis, plot the scale score which corresponds to the 50th percentile. These scale scores can also be found in the annual TVAAS reports. If you do not have a TVAAS report, you can find these scores on a class list (STEC print-out), although you may have to search through several students. Find a student who ranked at the 50th percentile for each grade on your graph, and look to see what that student's scale score was. To avoid clutter you should do each subject on a separate graph. Use a ruler to connect the points and you will have the normal growth "curve" for the grades you chose to plot. To obtain the national norm gain for a given subject/grade-level, subtract the previous year's scale score (that corresponds to the 50th percentile) from the current year's scale score (that corresponds to the 50th percentile for that grade).
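The subtraction step at the end of that recipe can be sketched in a few lines. The 50th-percentile scale scores below are invented placeholders; the real values come from the annual TVAAS reports or from CTB/McGraw-Hill.

```python
# Sketch of the norm-gain computation: the national norm gain for a grade
# is that grade's 50th-percentile scale score minus the previous grade's.
# The scale scores below are hypothetical, not actual TCAP norms.
median_scale_score = {3: 610, 4: 648, 5: 673, 6: 695}  # grade -> 50th %ile SS

norm_gains = {}
for grade in sorted(median_scale_score)[1:]:
    norm_gains[grade] = median_scale_score[grade] - median_scale_score[grade - 1]
    print(f"grade {grade}: national norm gain = {norm_gains[grade]}")
```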


Improving performance on the TCAP achievement tests (norm-referenced).

Should teachers expect higher TCAP scores if they could obtain an item analysis of their previous year's test results and drill their students on items most frequently missed?

Strangely enough, the answer to that specific question is, "No," if it's a norm-referenced test (NRT). TVAAS utilizes only norm-referenced items on the TCAP achievement tests. (If you need to know the difference between an NRT and a criterion-referenced test, detour through Why the norm-referenced/criterion-referenced controversy is a non-issue with TVAAS, beginning on Page 24, before you proceed with this response.)

The question of how to improve test scores (legally and ethically, of course) is a legitimate one. The answer is a little complicated. Let's break it down and take one piece at a time.

(1) One may reasonably expect to increase scores slightly by teaching test-taking techniques. These include generic skills such as test session logistics, test format familiarity, handling separate answer sheets, bubbling, anxiety relief, mental readiness, physical readiness, and practice with timed written exercises.

(2) From a practical perspective, it is almost impossible to guess the content (and then teach it) of a norm-referenced test. Why? Because NRT's sample content from domains of knowledge that are simply too broad. For example, a topic covered on one year's test may be absent from the next year's test-replaced by a related topic with different specific objectives.

(3) Another difficulty of teaching to the content of an NRT is that a majority (60% according to the publisher) of the items require mental processing-skills in logic, synthesis, analysis, etc. It is simply not true, as most test bashers claim, that multiple choice tests consist mainly of items that ask for the recall of facts. Anyone who doubts this statement should obtain a copy of a practice book for the ACT or SAT and work through one of the practice tests. Almost all the items require higher order thinking skills; almost none rely purely on factual recall.

(4) A norm-referenced test, by definition, contains some items above and some items below grade level. Low achievers in your classroom will not be ready to be taught some of the more difficult concepts.

What, then, is the teacher's solution for improving student gains? Do not worry about specific little content skills. Teach the child, not the test. Begin where the child is. Teach all the children. Remember that TVAAS gains resulting from good teaching will be reflected irrespective of where the child may rank among other children. TVAAS does not suggest or prescribe a particular method for encouraging academic growth because how teachers help students learn is, and should be, a highly individual decision based on teacher expertise and the needs of students. Typically, students perform well on norm-referenced achievement tests whenever good teachers, day after day, promote scholarship and make sound instructional decisions.


Validity and reliability of the TCAP achievement tests.

Are the TCAP achievement tests themselves worth a toot? Is it time for a change?

The content validity of the TCAP achievement tests is good. There are four reasons why this is so: First, back in 1988, the original test selection committee of some 35 Tennessee educators chose CTB/McGraw-Hill over the other two bidders, and an important one of the selection criteria was the degree of curricular match between the proposed tests and the Tennessee curricula. Second, the CRT items were written by Tennessee teachers to intentionally match the Tennessee curricula in language arts and mathematics, and there is a high correlation between the CRT items and the NRT items. Third, the NRT test items come from the same item bank used to build achievement tests that are marketed world-wide by the publisher, CTB/McGraw-Hill. They design mainstream tests, i.e., tests that mirror a national curriculum, albeit a hypothetical one, because they are a profit-making organization. They have no interest in producing a test that will sell only to a narrow market. We believe the curricula in Tennessee are as close to the average national curricula as those in any other state. In other words, Tennessee has no significant curricular deviations from the norm. Fourth, TVAAS calculations demonstrate a sufficient relationship between the TCAP NRT's and the Tennessee curricula, because the gains demonstrated all across the state would simply not exist if the tests and the curricula were not sufficiently related.

Reliability of the TCAP is also good. Achieving the necessary reliability for a given test is a matter of applying appropriate technical expertise to the test construction process. Again, the publisher is more interested than anyone in producing a reliable test. Since test reliability can be easily demonstrated statistically, the figures are available to show it for the TCAPs.

Whether it is time to change achievement tests is a matter of opinion. Our opinion is probably not, and here are the reasons. First, the three or four major publishers of tests are all reputable and qualified to produce good tests. The recent experience the current publisher has had with the Tennessee testing program, however, gives them a slight edge. Second, no test is perfect, and there are always individuals who are dissatisfied with whatever they have; we believe many of the critics of TCAP fall into that category. Third, we believe most of Tennessee's teachers would prefer to remain with the known rather than be faced with a new test format. TVAAS can adapt to any of the major achievement test series, but in our opinion, there is no compelling reason to do so at this time.


Authentic assessment.

Is the rest of the country abandoning standardized testing in favor of authentic assessment?

Enemies of TVAAS would like for you to think that standardized testing is going down the tubes in every other state, in favor of authentic assessment or alternative assessment or performance assessment, but that is simply not true. Stay with us on this one, and we will try to sort out this very complicated mess. There are, of course, other student evaluation strategies besides standardized tests, and there has been a great deal of interest in alternative assessments recently. Much of this interest has been sparked by persons who are dissatisfied with standardized tests. The most rabid of these test bashers would do away with standardized testing entirely. Others take a more moderate position, frequently concluding that both types of assessments are needed.

Before going any further, it would probably be wise to pause to discuss terminology, loose as it is. Alternative assessment seems to mean any alternative to testing. If authentic assessment does not imply that everything else must be inauthentic, then we are missing something. Performance assessment seems to mean that the student will do something (perform) that can be observed and evaluated. Portfolios, writing assessments, research projects, and collaborative assignments all seem to be examples of alternative (authentic) assessments.

None of these methods of demonstrating proficiency was invented yesterday. They are all legitimate instructional tools, but attempts to use them as alternatives to standardized testing have met with difficulty because of reliability problems. Attempts to enhance repeatability and inter-rater reliability have not been entirely successful, and they tend to drive costs beyond fiscally responsible limits. Some states that embraced alternative assessments are now reversing gears and going back to norm-referenced tests, or at least including them in a blend of evaluative tools.

On the other side of the coin, TVAAS has enhanced the results of standardized tests by focusing on gain, employing a longitudinal data base, and using mixed-model statistics to analyze the scores. One might be at least partially correct to conclude that the TVAAS development team and the test bashers are trying to achieve the same goal: to enhance student assessment. The TVAAS team would do it by bringing new technology to the analysis of test results, while the severest critics would do it by eliminating standardized tests altogether and substituting alternative assessments.


Part IV: Odd Questions and Other Important Stuff

Around the world with value-added assessment.

Is value-added assessment working somewhere outside Tennessee?

Tennessee leads the nation with an educational accountability system based on student improvement as measured by standardized tests. To the best of our knowledge, Tennessee's Value-Added Assessment System includes the largest student data base of test scores in the world. The statistical analysis, which uses mixed-model statistics to report district, school, and teacher effects, is the most sophisticated system in use anywhere. It solves traditional measurement problems associated with norm-referenced testing and, therefore, attributes credit to instructional programs and personnel fairly, without being confounded by variables such as the socio-economic status of communities.

A few other places around the globe are beginning to focus on value-added concepts in various types of educational evaluations. Some of this work is occurring in England, in Australia, and here in the United States in the Dallas (Texas) Independent School District. Most are using hierarchical linear modeling (HLM), which is less sophisticated and less efficacious. When a prototype is being designed, there are no models to go by. The endeavor is a bit riskier, but when it succeeds, it is very exciting; by contrast, the alternative, stagnation, is very dreary. [Also see The cutting edge on Page 23.]


Let's talk dollars and sense.

How does the cost of TVAAS compare to other Tennessee expenditures?

Since it is possible to do funny things with figures, this response needs to be very clear and precise. The direct cost of TVAAS has been paid by the State Department of Education (SDE) by way of a contract with the Value-Added Research and Assessment Center, headed by Dr. William L. Sanders, at the University of Tennessee. Those funds are used for the analysis of test scores: specifically, for developing and refining the necessary customized software; hardware upgrades (after the original, one-time computer purchase); annual merges of test data; annual data analyses; and other minor operational costs. The test data have been supplied to TVAAS on magnetic tape by the State Testing and Evaluation Center (STEC).

The TVAAS costs, then, are separate from the costs of obtaining and scoring the achievement tests. The TVAAS contract for fiscal year 1995 amounted to $275,000, or about 60¢ per student [grades two through eight (456,165 students)]. Compare that 60¢ per-student TVAAS cost with the average TCAP cost of $3.59 per student.1 The 60¢ TVAAS cost is even more dramatic when one observes that the per-pupil expenditure in Tennessee (1994-95) was $4,544.2 That means the combined costs of TVAAS and the TCAP achievement tests averaged $4.19 per student, or about nine-hundredths of one percent of the amount spent per student. If the TCAP achievement tests and value-added assessment applied to all grades, which they do not, the state-wide cost would have been approximately 0.1% of the 3.7 billion dollars spent on K-12 education in Tennessee in 1994-95.
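For readers who wish to verify the figures, the per-student arithmetic above can be checked directly. This short sketch simply reproduces the calculations from the numbers quoted in the text; it assumes nothing beyond those published figures:

```python
# Reproduce the cost arithmetic from the figures quoted above.
tvaas_contract = 275000        # FY 1995 TVAAS contract, in dollars
students = 456165              # students tested, grades two through eight
tcap_per_student = 3.59        # average TCAP cost per student, in dollars
per_pupil_expenditure = 4544   # Tennessee per-pupil expenditure, 1994-95

tvaas_per_student = tvaas_contract / students       # about $0.60
combined = tvaas_per_student + tcap_per_student     # about $4.19
share = combined / per_pupil_expenditure            # about 0.0009, i.e., 0.09%

print(round(tvaas_per_student, 2))  # 0.6
print(round(combined, 2))           # 4.19
print(round(100 * share, 2))        # 0.09
```

The last figure confirms the "nine-hundredths of one percent" claim: $4.19 is roughly 0.09% of $4,544.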

Most achievement test scoring in Tennessee has been done by the State Testing and Evaluation Center for the last six decades, usually at a lower cost than similar services provided by test publishers. No matter who scores the achievement tests, however, the cost of student assessment with TCAP/TVAAS is much cheaper than any alternative assessment program with which we are familiar. The bottom line is that the current TCAP/TVAAS cost in Tennessee is an unusually good bargain.


Classroom teachers and statistics: tuning in or turning off?

Can TVAAS statistics be explained in simple, non-technical language, or must the majority of Tennessee's educators accept them on faith?

Educators should view statistics for what they are: interesting and useful tools which can help teachers and principals make better educational decisions. Unfortunately, a few people think of statistics as incomprehensible and impractical nonsense, if they think of them at all. TVAAS statistics can be explained, though not simply or in a few words. Anyone willing to take the time to develop a knowledge base in statistics can come to understand the TVAAS model. Moreover, there is nothing incomprehensible about mixed-model statistics, but the subject is complex, and without spending some time and energy on it, one will probably have to go on faith. Incidentally, there is nothing wrong with faith, and it is certainly preferable to ignorance and prejudice. Although some TVAAS opponents may cite the complexity of the model as a basis for scrapping the concept, it is not necessary to be a statistician in order to benefit from the use of statistics.

In the nineties a contractor may use both a tape measure and computer software for estimating the quantity of building materials needed. A physician may use both a tongue depressor and sophisticated imaging equipment to make a diagnosis. Should contractors suspend construction until they are competent at computer programming? Should physicians delay diagnoses until they can build their own imaging equipment? Men and women in many professions effectively utilize information derived through statistical processes they may or may not fully understand. It is a disservice to educators and their students to promote panic over the complexity of the TVAAS model rather than to encourage and support its effective use.


Test scores as estimates.

Why are TVAAS gains reported as estimates? Where's the real thing?

To statisticians, all statistics are estimates. An achievement test score is an estimate of a student's true achievement. One must understand that a given student's achievement test score does not represent "truth" but, rather, an estimate of truth. The "real thing" exists in theory only. A good estimate is as good as it gets, except that the more estimates (test scores) one has, the closer one can come to estimating the truth. The current state of the technology does not yet permit the direct measurement of student learning (like taking one's temperature with a thermometer), but indicators of learning can serve as a proxy for direct measurement.

If one can begin to think of test scores as estimates, one can begin to accept the notion that last year's test scores (estimates) can legitimately be revised when this year's scores become available. The new information makes it possible to calculate a better estimate of last year's "truth" as well as this year's. Such calculations are possible only because of mixed-model statistics operating on a longitudinally merged student data base, neither of which was available to Tennesseans or anyone else before TVAAS.
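The idea that more estimates bring one closer to the truth can be illustrated with a small simulation. This sketch is only an illustration of that statistical principle, not the TVAAS methodology itself; the "true" score and the size of the measurement error are invented for the example:

```python
import random

random.seed(1)  # fixed seed so the illustration is repeatable

true_score = 700.0   # hypothetical "true" achievement (never observed directly)
noise_sd = 30.0      # hypothetical measurement error of a single test

def observed_score():
    # One test administration: the truth plus random measurement error.
    return random.gauss(true_score, noise_sd)

# A single score can miss the truth by quite a bit, but the average of
# many scores (many estimates) tends to land much closer to it.
one = observed_score()
many = [observed_score() for _ in range(50)]
average = sum(many) / len(many)

print(abs(one - true_score))      # error of a single estimate
print(abs(average - true_score))  # error of the averaged estimate, typically much smaller
```

In the same spirit, each new year of test data lets TVAAS refine its estimates of the years that came before.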


No relationship between student gain and socio-economic status.

How can teacher effects fail to be affected by the achievement level of the students?

In the past, attempts to use student achievement scores for educational assessment have been confounded by a great number of factors, including socio-economic status, race, gender, educational attainment of parents, and so on. These factors biased the results because they were associated, to a greater or lesser degree, with the scores children made on tests.

TVAAS uses a sophisticated methodology to partition the effects of these factors from the effects of educational entities. Furthermore, the determinant of educational effectiveness is no longer the score a child makes on a given test but the gain a child achieves from year to year. Three years of state-wide TVAAS reports have conclusively shown that mean gain scores cannot be predicted by racial composition, by the percentage of students on free and reduced-price lunches, or by the location of the school or system. Neither can gains be predicted by previous academic attainment. Students of all backgrounds and achievement levels can make appropriate gains if they are taught from the level at which they enter the classroom.
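The shift from level to gain can be shown with a deliberately simple example. The names and scale scores below are invented, and actual TVAAS gain estimates come from a mixed-model analysis rather than simple subtraction, but the point survives the simplification: two students at very different score levels can show the same growth.

```python
# Hypothetical scale scores for two students on successive spring tests.
# All names and numbers are invented for illustration only.
scores = {
    "student_a": {"grade_4": 620, "grade_5": 650},  # lower-scoring student
    "student_b": {"grade_4": 740, "grade_5": 770},  # higher-scoring student
}

for name, s in scores.items():
    gain = s["grade_5"] - s["grade_4"]
    print(name, gain)  # both students show a gain of 30, despite different levels
```

Judged by level, student_b looks far "better"; judged by gain, both students, and their teachers, made the same year's progress.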


Gerrymandering TCAP scores?

Can low gains at a given grade level be eliminated by somehow depressing or holding back the scores in the previous grade?

No. TVAAS is capable of attributing gain, or lack thereof, accurately to the appropriate teacher. Moreover, this "gerrymandering of TCAP scores" seems to be one of the newer examples of an "urban legend." The TVAAS data base makes it possible to investigate questions of this nature directly, and there is no evidence of this phenomenon occurring in reality anywhere in the state. Consider the difficulty of achieving such an effect: all teachers at a grade level would have to decide to depress scores so that teachers at the subsequent grade level would appear to produce exceptional gains. They would then have to prevent an entire grade level of students from learning at a normal rate for an entire year. It is hard to imagine that enough teachers could be found in the entire state of Tennessee to accomplish such a morally reprehensible outcome in even one small school for one year, much less repeatedly, as would have to be the case if such a fabrication were to be perpetuated. However, even if such an outcome could be achieved, TVAAS can pinpoint such an aberration and "correct" for it, i.e., attribute the effects appropriately in spite of the attempted manipulation.


The research potential of the TVAAS data base.

What are some research questions, enabled by TVAAS, which need to be addressed?

Three important and exciting elements have come together to provide a potentially historic impact on the educational assessment community, a paradigm shift of enormous proportions: (1) enabling legislation passed by a far-sighted Tennessee legislature; (2) a new methodology for analyzing student achievement data (thereby enhancing the usability and acceptability of achievement tests); and (3) personnel at the Value-Added Research and Assessment Center who know how to implement the methodology.

The historic TVAAS data base [see A longitudinally merged student data base on Page 21] and statistical methodology [see Mixed-model statistics beginning on Page 20] open up an almost limitless field of research opportunities. Bock and Wolfe wrote in their TVAAS review:

The educational data collection and management system implemented for TVAAS, in combination with the Tennessee Comprehensive Assessment Program annual achievement testing in grades 2-8, is virtually unique [italics ours] among the states in its ability to keep a continuing record of students' achievement test scores as they move from grade to grade or school to school in each county of the state.3

Just a few examples of research questions which need to be pursued include:

(1) Can new teacher screening devices predict future success as measured by TVAAS teacher effects?

(2) How does student mobility affect achievement?

(3) What patterns or trends can be discovered by tracking student gains over long periods of time?

(4) Can teachers or schools with high gains be profiled?

(5) What can be learned from very effective teachers?

(6) How can less effective teachers be assisted to become more effective?

(7) What are the cumulative achievement effects on students that result from various sequences of prior teachers, predicated on their relative effectiveness?

Those who take the time to thoughtfully consider and act upon the educational ramifications of a state-wide longitudinal data base of achievement test scores and the intricacies of mixed-model statistical methodology will be personally and professionally rewarded. As a result of the TCAP/TVAAS partnership, Tennessee educators have an unusual opportunity to participate in significant and unique research and to provide world-wide leadership in the field of educational accountability.


1. State Testing and Evaluation Center, Review and Comparative Analysis of Services Provided by the State Testing and Evaluation Center, (Knoxville: State Testing and Evaluation Center, 1996), Second page of Section 4 (unnumbered).

2. State Department of Education, State of Tennessee Report Card, October 1995, (Nashville: State Department of Education, 1995), p. 3.

3. R. Darrell Bock and Richard Wolfe, A Review and Analysis of the Tennessee Value-Added Assessment System, Part 1, (Nashville: Comptroller of the Treasury, Office of Education Accountability, 1996), p. 69.



 The Authors
Samuel E. Bratton, Jr. is Coordinator of Research and Evaluation for the Knox County public school system. He has been responsible for coordinating that district's testing program since 1975. Prior to that he was a science teacher and a supervisor of science and mathematics. He has directed numerous research projects, published several articles and reports, and made presentations to a wide variety of audiences. His interests, in addition to testing and measurement, include organizational theory, curriculum evaluation, and communication. He received his Ed.D. from the University of Tennessee.

Sandra P. Horn is an educational consultant to the University of Tennessee Value-Added Research and Assessment Center. She brings twenty-three years as a library information specialist in the public schools to the position. She is a member of the Tennessee State Board of Education's Advisory Council on Teacher Education and Certification and is past state chair of the Tennessee Teachers Study Council. She is also a member of the founding board of the Consortium for Research on Educational Accountability and Teacher Evaluation (CREATE). She received her Ed.D. from the University of Tennessee.

S. Paul Wright is an instructor in the Department of Statistics and a statistical consultant for the Value-Added Research and Assessment Center at the University of Tennessee at Knoxville. He holds master's degrees in both psychology and statistics from the University of Tennessee. His statistical interests include mixed-model methodology, multivariate analysis, and model selection techniques.
