School Improvement in Maryland

Assessment Literacy Glossary

Achievement gap
In common usage, achievement gap often refers to the differences in levels of achievement among groups of students (such as Asian, African American, Hispanic, White, students with disabilities, and students living in poverty). A more accurate definition of the achievement gap is the difference between the performance of each student group and the standards. The reason for this distinction is that the goal is not to have every student group achieving at the same level, but to have every student group meeting the same high standards or expectations.

Adequate Yearly Progress (AYP)
The goal of No Child Left Behind is that all students will be proficient in reading and mathematics by the year 2014. To meet this goal, each school is required to make Adequate Yearly Progress (AYP) every year. Schools meet AYP by having all the identified AYP student groups -- including American Indian, African American, Asian, Hispanic, White, English Language Learners (ELL), students with disabilities (receiving special education services) and students living in poverty -- meet a set of standards each year. The standards include proficiency in reading and mathematics as demonstrated on MSA and HSA, student attendance, and other indicators set by the state. Failure to make AYP over multiple years results in increasing sanctions for the school and additional requirements for its district.

Analytic scoring
Analytic scoring is scoring done in parts or units. For example, an essay may be scored for several different criteria such as content accuracy, organization, grammar, punctuation, and spelling. This type of scoring provides more detailed information to students because it allows them to see more precisely where improvement is needed. It also allows teachers to see where additional instruction and/or reteaching might be required.

Annual Measurable Objective (AMO)
The Annual Measurable Objectives (AMOs) are performance objectives or targets of student achievement for schools. AMOs have been set for each year between 2002-03 and 2013-14 to measure progress in moving toward the 100% proficiency in reading and mathematics that is required of schools in 2013-14 by No Child Left Behind. The AMOs increase each year, requiring schools to improve incrementally. Having every student group meet the AMO in reading and mathematics each year is a key to making AYP. (See "Adequate Yearly Progress.")

Authentic assessment
A concept related to performance assessment, authentic assessment avoids the more contrived types of assessment, such as multiple choice or other selected response items, in favor of a format more closely related to how the assessed knowledge is applied in the "real" world. For example, students may be asked to compute the tip for a restaurant bill or to measure a segment of the classroom floor for new tile. Advantages of this type of assessment are that it provides teachers insights as to the extent to which students can apply their learning in realistic settings and it demonstrates to students some ways they will use what they are learning, making education more relevant to them. (See "performance assessment.")

Computer-adaptive
Computer-adaptive tests usually consist of large banks of test items that can be chosen to "customize" an appropriate assessment for each student. Assessment items are identified by the computer program based on the student's previous responses (correct or incorrect) to focus quickly on areas of strength and weakness. These types of tests are very efficient, and feedback is immediate.

Confidence interval
Because all tests have built-in error, decisions made based on tests may not be accurate. This is especially true when the group being evaluated is small. Therefore, a confidence interval is calculated to widen the target around each year's AMO depending on the size of the group and subgroups. The smaller the group, the larger the confidence interval will be, because possible errors are less likely to be compensated for. The larger the group, the smaller the confidence interval will be. As long as the group and its subgroups perform within the confidence interval, they are considered to have met AYP. The confidence interval is an error adjustment applied to a group; the standard error is an estimate of error associated with an individual student's score.
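
The glossary does not state the formula used to build the confidence interval. As a rough sketch only, the Python below assumes a normal-approximation interval for a proportion, with a hypothetical z value of 1.96; the function names amo_margin and within_confidence_interval are invented for illustration. It shows how the margin below the AMO shrinks as the group grows.

```python
from math import sqrt


def amo_margin(amo, group_size, z=1.96):
    """Half-width of an assumed normal-approximation interval around an AMO.

    amo is a proportion between 0 and 1; z=1.96 is an illustrative choice,
    not a value taken from the glossary.
    """
    if not 0 <= amo <= 1:
        raise ValueError("amo must be a proportion between 0 and 1")
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return z * sqrt(amo * (1 - amo) / group_size)


def within_confidence_interval(pct_proficient, amo, group_size, z=1.96):
    """True if the group's proportion proficient is no lower than AMO minus the margin."""
    return pct_proficient >= amo - amo_margin(amo, group_size, z)
```

Under these assumptions, a group of 25 students receives a wider margin than a group of 400, so the same observed percentage may fall inside the interval for the small group and outside it for the large one.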

Constructed response
Assessment items requiring students to supply their own answers are called constructed response items. On these assessments, instead of students being given a choice of several answers to choose from, students must write out their own answers, show their work, organize their thoughts, construct tables or graphs, or complete a variety of tasks on their own. Constructed response items take more time for students to answer than selected response items, therefore limiting the content coverage of an assessment. However, constructed response items provide a more authentic mode for students to demonstrate their knowledge. Care must be taken when scoring these types of items so that it is as objective as possible. (See "rubrics and scoring guides.")

Correlation
A correlation is a statistical index that demonstrates whether or not a relationship exists between two variables and provides an indication of the strength of that relationship. Correlations range from negative one (-1), indicating that a perfect negative relationship exists between the variables (when one variable is high, the other is low), to positive one (+1), indicating a perfect positive relationship exists between the variables (when one variable is high, the other is high). A correlation of zero means there is no relationship between the two variables. It is important to keep in mind that a relationship between two variables does not mean that one caused the other.
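
The glossary does not name a particular correlation index. The sketch below assumes the widely used Pearson coefficient and uses a hypothetical function name, pearson_r, to show how a value between -1 and +1 could be computed from two lists of paired scores.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from each mean.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable does not vary")
    return cross / sqrt(ss_x * ss_y)
```

Paired scores that rise together give a value near +1, and scores that move in opposite directions give a value near -1.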

Criterion-referenced
A criterion-referenced test compares individual student performance to a standard. Its purpose is not to rank order a student's performance in relation to other students, but to classify or categorize students in relation to the standard, such as into pass/fail or by levels of proficiency. On criterion-referenced assessments, it is possible for all students to "pass" (i.e., meet the standard) or for all students to "fail" (i.e., not meet the standard). A criterion-referenced test does not spread students out in a normal distribution the way a norm-referenced test does. (See "cut score.")

Cut score
The cut score is the score on a criterion-referenced test that determines pass/fail or the level of proficiency. For a test that assigns only pass or fail status, there will be one cut score. Students who score at or above the cut score will pass; those scoring below the cut score will fail. Other tests may use several cut scores. For example, to classify students into one of three groups, such as advanced, proficient or basic on MSA, there are two cut scores.
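
As a minimal sketch, the classification described above could be written in Python as follows. The cut score values 380 and 430 are hypothetical and are not actual MSA cut scores; the function name performance_level is also invented for illustration.

```python
from bisect import bisect_right


def performance_level(score, cut_scores, labels):
    """Return the label for a score, given cut scores in ascending order.

    A score at or above a cut score falls into the higher category.
    """
    if len(labels) != len(cut_scores) + 1:
        raise ValueError("need exactly one more label than cut scores")
    return labels[bisect_right(cut_scores, score)]


# Hypothetical values chosen only for illustration.
CUTS = [380, 430]
LABELS = ["basic", "proficient", "advanced"]
```

With two cut scores there are three categories; a pass/fail test would use a single cut score and two labels.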

Distracter
Distracters are the incorrect answer choices that are presented along with the correct answer in multiple choice assessment items. Distracters should be designed so that students who know the content get the item correct, while students who do not know the content get the item incorrect. Distracters should not be designed to trick or fool students. The best distracters are often common misconceptions, overgeneralizations, or simplifications about the content. Students who have not had experience with some distracters, such as "none of the above" and "best as it is," can be drawn to those distracters even though they might know the material. Writing effective distracters can be one of the biggest challenges to generating good multiple-choice test items.

Extended or brief constructed response/essay
In constructed response assessment items, students generate original responses. Answers for students to select are not included in the test. A constructed response item can be brief (BCR -- several sentences) or extended (ECR -- several paragraphs). Constructed responses are most effective when assessing student organization and writing skills, understanding of sophisticated concepts, and application of concepts to real-life situations, as well as when students must justify or explain a point of view or how they solved a problem. Scoring can be subjective and should use rubrics or scoring guides.

Formative assessment
Formative assessment is not a type of assessment but a way of using assessment results. Formative assessments are administered for the purpose of measuring progress toward a goal. Formative assessments should occur often enough so that teachers can discover when instruction has not been effective in time to correct it. This continual monitoring of progress prevents students from going too long before a weakness or conceptual misunderstanding is detected and addressed.

Grade equivalent
The grade equivalent is the only score that helps educators look at student performance across grade levels. The grade equivalent score expresses performance as a year and month of school. A grade equivalent score of 3.6, for example, represents performance typical of a student in the sixth month of third grade. Grade equivalent scores are by far the most difficult to understand. A grade equivalent of 4.8 means that the student earned the same score on the assessment as a student in the norm group who was in the eighth month of the fourth grade and took the identical assessment. If a second grader obtains a grade equivalent of 4.8 on a second grade reading test, the student did well, but it does not mean that the student has the skills of a typical student in the eighth month of the fourth grade. It means that the student's performance on the second grade test is theoretically equivalent to the typical performance of students in the norm group who had completed eight months of grade 4. (The fourth graders in the norm group took a second grade test, so they are expected to do well on it.) Therefore, grade equivalents should not be used to place students in grade levels corresponding to their test scores. Most test publishers recommend against using grade equivalent scores to report results to parents as they are so easily misinterpreted.

Holistic scoring
In holistic scoring, one combined grade is assigned that takes into account all the components of a task. Holistic scoring is not as informative as analytic scoring, but is used sometimes in the interest of time or to evaluate the quality of a product or assignment as a whole. Because holistic scoring provides no specific information to the student or teacher that can be used for improvement, a combination of holistic and analytic scoring is best when a single grade is needed.

Inter-rater reliability
Inter-rater reliability is the extent to which scores assigned by different raters to the same brief or extended constructed response or performance assessment are consistent. A scoring guide or rubric, containing a set of "look fors," is usually used to guide the rater and help the rater score each student's response as objectively as possible. Team members can develop increased inter-rater reliability by all scoring the same papers and then comparing the scores. If team members all assign the same score to a given response, inter-rater reliability is high. If scores vary (usually by more than one point on a four-point rubric), team members should discuss their differences and come to consensus. If scores among raters consistently vary, intensive dialogue is necessary about the characteristics of the performance or piece of writing that are necessary for it to be called "proficient."
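
One simple way to summarize the score comparison described above is to compute how often two raters agree exactly and how often they are within one point of each other. The sketch below is an illustration only; the function name agreement_rates is hypothetical, and other indices of inter-rater reliability exist.

```python
def agreement_rates(rater_a, rater_b):
    """Return (exact agreement, within-one-point agreement) as proportions.

    rater_a and rater_b are the scores two raters gave to the same papers,
    listed in the same order.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty score lists of equal length")
    pairs = list(zip(rater_a, rater_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    within_one = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, within_one
```

For example, if two raters score four papers as 4, 3, 2, 1 and 4, 2, 2, 3, they agree exactly on half of the papers and are within one point on three of the four; the remaining paper would be a candidate for discussion.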

Item analysis
Item analysis is the process of looking at the item-by-item responses of a test. Analysis should center on the difficulty of the items and the extent to which the distracters discriminate appropriately. (See "item difficulty" and "item discrimination.") Ineffective or confusing test items can be identified through this process. Students' error patterns can provide invaluable information to inform classroom instructional planning. Item analysis also gives an opportunity to evaluate the effectiveness of multiple choice item distracters. (See "distracters.") Item analysis can be used to study the performance of various groups to see if there is potential bias built into the question. (See "item bias.")

Item bias
Biased test items function differently for various cultural, racial, or gender groups of students on a systematic basis. This may be because of a lack of a group's exposure to a concept, word, or idea through no fault of their own. One way to reduce test bias is to analyze the content of all test items in order to make sure they are likely to be fair to every student group. Test publishers go to great expense to have all test items reviewed carefully by panels of experts for social, cultural, religious, ethnic, and racial bias. Teachers need to be cognizant of their students' backgrounds and prior knowledge and do everything they can to make sure that no student is disadvantaged by a test item.

Item difficulty
Item difficulty is an index of how challenging an item is for the students taking the test. Item difficulty indices range from zero to 1.0. Item difficulty is calculated by taking the number of students answering an item correctly and dividing it by the total number of students answering the item. For instance, if half of the students in a testing group answer an item correctly, the item difficulty is .50. If all of the students answer the item correctly, the item difficulty is 1.0. The more difficult the item, the less likely a large percentage of students will answer it correctly, and the lower the item difficulty statistic will be.
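
The calculation above can be sketched in Python as follows. The function name item_difficulty is hypothetical, and each response is assumed to be recorded as True (correct) or False (incorrect).

```python
def item_difficulty(responses):
    """Proportion of students who answered the item correctly.

    responses is a list of booleans, one per student who answered the item.
    """
    if not responses:
        raise ValueError("at least one response is required")
    return sum(responses) / len(responses)
```

If two of four students answer correctly, the function returns 0.5; if all four answer correctly, it returns 1.0.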

Item discrimination
Item discrimination is related to item difficulty because it is an indicator of how well the item discriminates between students who know the content and students who don't. Item discrimination is determined using students' total test scores as an indicator of their competence with the content being assessed. High-scoring students would be expected to answer items correctly, and low scoring students would be expected to answer items incorrectly. When the low-scoring students answer an item correctly, and the high-scoring students answer it incorrectly, the item did not function as expected. It didn't discriminate appropriately. An item with an item difficulty of 1.0 does not distinguish between students at all, because everyone answers it correctly. (It is an easy item.) An item with a difficulty of zero doesn't discriminate between students either, because it is so difficult everyone answers it incorrectly.
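
The glossary does not give a formula for item discrimination. One common approach, assumed here, compares the proportion answering correctly among the highest-scoring students with the proportion among the lowest-scoring students. The function name discrimination_index and the default group fraction of 0.27 are illustrative choices, not part of the glossary.

```python
def discrimination_index(total_scores, item_correct, fraction=0.27):
    """Upper-group minus lower-group proportion correct on one item.

    total_scores: each student's total test score.
    item_correct: booleans, True if that student answered the item correctly.
    fraction: share of students placed in each of the two comparison groups.
    """
    n = len(total_scores)
    if n != len(item_correct) or n < 2:
        raise ValueError("need matching lists with at least two students")
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be greater than 0 and at most 0.5")
    k = max(1, round(n * fraction))
    # Student positions ordered from lowest to highest total score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    low_group, high_group = order[:k], order[-k:]
    p_high = sum(item_correct[i] for i in high_group) / k
    p_low = sum(item_correct[i] for i in low_group) / k
    return p_high - p_low
```

A value near 1.0 means high scorers answered correctly and low scorers did not. A value of zero, as when everyone answers correctly, means the item does not distinguish between the groups, and a negative value means the item did not function as expected.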

Matching
Matching is a type of selected response item used most effectively to assess definitions, functions, and relationships. Matching items require students to look at two columns and match the information in one to the corresponding information in the other. The columns may contain words and their definitions, categories and examples of them, element names and their scientific notations, etc. Like multiple choice items, students need only recognize the information; they do not have to produce it on their own.

Mean
Often, groups of student test scores are reported as means, or averages. To obtain the mean score, all the student scores are added up and the total is divided by the number of students in the group. Means can be very misleading under certain circumstances. Means can have the impact of "hiding" or de-emphasizing the scores of individual students. They can also be influenced by a few very high or very low scores, "pulling" the mean up or down. For example, average or mean scores can be high, but there may be students with low performance in the group that would not be evident.
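
A short sketch of the calculation, using hypothetical scores, shows how a respectable mean can hide one low-performing student. The function name mean_score is invented for illustration.

```python
def mean_score(scores):
    """Add up all scores and divide by the number of students."""
    if not scores:
        raise ValueError("at least one score is required")
    return sum(scores) / len(scores)


# Hypothetical class scores: four strong results and one very low one.
scores = [95, 92, 90, 88, 35]
class_mean = mean_score(scores)
students_below_60 = [s for s in scores if s < 60]
```

Here the class mean is 80, yet one student scored 35, which would not be evident from the mean alone.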

Measurement error
Each time students are assessed, the result is actually an estimate of what they know and can do. How good this estimate is will be determined by the amount of measurement error in the score. In test theory, every score is made up of two independent components: the "true" score and the random "measurement error" score. But since all scores include measurement error, the "true" score is never known.

Median
The median is the score that divides a distribution of scores in half. By definition, half of the students score above the median, and half of them score below the median. For example, if a third grade class had a median score of the 75th percentile on a norm-referenced reading test, half of the students had scores above the 75th percentile. On a norm-referenced test given to a normal distribution of students, the expectation would be that half of the students score above the 50th percentile. Students in this third grade example performed considerably better than expected.
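
As a minimal sketch, the median could be found in Python as follows. The function name median_score is hypothetical, and averaging the two middle scores when the count is even is a common convention assumed here.

```python
def median_score(scores):
    """Middle score of the sorted list (average of the two middle scores if even)."""
    if not scores:
        raise ValueError("at least one score is required")
    ordered = sorted(scores)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2
```

Unlike the mean, the median is not pulled up or down by a few very high or very low scores.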

Multiple choice
In a multiple choice question, students are presented with a problem in the "stem" of the item and are asked to choose the one best, or correct, answer from a list of four to five possible answers. Multiple choice items can be written to assess basic facts or higher level problem solving.

No Child Left Behind
In 2002, the federal government passed the No Child Left Behind Act (NCLB), a complex piece of legislation that includes higher standards for teachers and yearly assessments to demonstrate progress for individual students. Although the legislation is specific and prescriptive, each state designs its own program components, such as content standards, performance standards, and assessments, which are then approved by the federal government.

Norm group
The norm group is a group of students that was given a test prior to its publication and that is representative of the students expected to take the assessment after it is published. The performance of all students taking the test at any time or place in the future is compared to the performance of this norm group. A student with a high score on a norm-referenced test has performed better than many of the students in the norm group. A student who performs at the same level as the average student in the norm group will receive an average score. A student whose performance is below that of most of the students in the norm group will receive a low score. In analyzing norm-referenced test results, it is important to know how long it has been since the test was "normed." (See "norm-referenced.")

Norm-referenced
A norm-referenced test ranks, or orders, the scores of students who take it by comparing them to others. After a test is written by a publisher but before it is used in schools, it is given to a sample of students called a "norm group." (See "norm group.") Scores from all students who take the norm-referenced test in the future -- wherever and whenever they take it -- are compared to the performance of the initial norm group, not to others taking the test at the same time. The purpose of the norm-referenced test is to compare student performance against that of the norm group by spreading scores out in a distribution.

Number correct (See "raw score.")

Percentage correct score
The percentage correct score is calculated by taking the raw score (the number correct or number of points earned) divided by the total number of items on the test (or total possible points). The result is multiplied by 100 to convert it to a percentage.
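
The calculation above can be sketched in Python as follows; the function name percentage_correct is hypothetical.

```python
def percentage_correct(raw_score, total_possible):
    """Raw score divided by the total possible, expressed as a percentage."""
    if total_possible <= 0:
        raise ValueError("total_possible must be positive")
    if not 0 <= raw_score <= total_possible:
        raise ValueError("raw_score must be between 0 and total_possible")
    return 100 * raw_score / total_possible
```

A raw score of 10 on a 10-item test gives 100 percent, while a raw score of 45 on a 50-item test gives 90 percent.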

Percentage passing/mastering/proficient/advanced
No Child Left Behind requires data on the percentage of students meeting state proficiency standards in order to determine if adequate yearly progress has been made. Therefore, states are using cut scores to report the percentage of students scoring in each category required by the law. (See "cut score.") Teachers may use the same technique to monitor the progress of their students and help determine instructional groupings. Percentage scores are also easy to implement and to explain to parents. They are determined by taking the number of students who improved, met the standard, or were not successful, etc., and dividing it by the total number in the class or grade level. The result may be multiplied by 100 to convert it to a percentage.
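
A teacher applying the same technique to a class could sketch the calculation as follows. The function name percent_meeting_standard and the cut score used in the example are hypothetical.

```python
def percent_meeting_standard(scores, cut_score):
    """Percentage of students scoring at or above the cut score."""
    if not scores:
        raise ValueError("at least one score is required")
    met = sum(1 for s in scores if s >= cut_score)
    return 100 * met / len(scores)
```

For scores of 300, 400, 410, and 450 with a hypothetical cut score of 400, three of the four students meet the standard, or 75 percent.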

Percentile rank
A percentile rank describes a student's performance relative to other students who took the same test. On a norm-referenced test, the percentile rank will indicate where the student scored in comparison to the norm group who took the test prior to its publication. (See "norm group.") Percentile ranks range from one to 99. By definition, the score tells the percentage of the norm group that the student scored better than. For example, a student at the 75th percentile scored better than 75 percent of the norm group. A percentile of 50 means that the student scored better than half of the norm group, and the other half of the norm group performed as well as or better than the student. Percentile ranks can be determined for classroom assessments too. A common misinterpretation of a percentile rank is that it is the same as the percentage correct.
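
As a rough sketch, a percentile rank could be computed from a list of norm group scores as follows. The function name percentile_rank is hypothetical, and the rule used here (the percentage of norm group scores strictly below the student's score, limited to the range 1 to 99) is an assumption; test publishers may handle tied scores differently.

```python
def percentile_rank(score, norm_scores):
    """Percentage of norm group scores below the given score, limited to 1-99."""
    if not norm_scores:
        raise ValueError("the norm group must contain at least one score")
    below = sum(1 for s in norm_scores if s < score)
    pct = round(100 * below / len(norm_scores))
    return min(99, max(1, pct))
```

Note that the result depends only on how the student compares with the norm group, not on how many items the student answered correctly.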

Performance assessment
The premise behind performance assessment is that students should be asked to demonstrate their skills and knowledge in a way that is as close as possible to how they will be called upon to use them in their everyday life. (See "authentic assessment.") For example, when evaluating a music student's ability to play the trumpet, the student is asked to play the instrument, and a judgment is made about the quality of the performance. Performance assessments can take the form of speeches, debates, story writing, science laboratory experiments, etc. There are several challenges to using performance assessment in the classroom. These include that each student must be given the opportunity to demonstrate the targeted knowledge or skills, scoring can be subjective if scoring guides are not carefully prepared, and the strategy is not time efficient for sampling many objectives. However, using performance assessment, when appropriate, can provide a more complete understanding of the full range of student achievement.

Proficiency score
A proficiency score is intended to provide information concerning a student's level of achievement. Cut scores are established and used to divide students into groups, and labels or descriptions are developed to illustrate the level and characteristics of each group's typical performance.

Prompt
A prompt is the question or problem presented to the student in a constructed response assessment item. The prompt must be specific and clear so that students can understand exactly what is expected of them.

Quartile
Norm-referenced percentile ranks can easily be divided into four equal groups or "quartiles." Typically, quartiles include scores from the 76th percentile to the 99th percentile, the 51st to the 75th, the 26th to the 50th, and the 1st to 25th percentile. Reporting the percentages of students scoring in each quartile helps to make the performance of all students visible in the data.
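
The grouping above can be sketched in Python as follows. The function name quartile is hypothetical, and quartile 1 is taken here to be the lowest group (1st to 25th percentile).

```python
def quartile(percentile_rank):
    """Map a percentile rank (1-99) to a quartile number, 1 = lowest, 4 = highest."""
    if not 1 <= percentile_rank <= 99:
        raise ValueError("percentile rank must be between 1 and 99")
    # Ranks 1-25 -> 1, 26-50 -> 2, 51-75 -> 3, 76-99 -> 4.
    return (percentile_rank - 1) // 25 + 1
```

Counting how many students fall into each quartile then gives the percentages described above.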

Raw score (number correct)
The raw score is the starting point for all other test scores. It is a numerical count of the number of items the student answered correctly, or the sum of all of the points awarded to the students' responses. By itself, the raw score has very little meaning unless the score is accompanied by additional information, such as the total number of items on the test. For example, two brothers each come home with their science test scores. Ed's score is a "10" correct and Frank's score is a "50" correct. At first, their parents think that Frank has done much better than Ed. But, after more closely inspecting the test papers, they find that Ed answered every item correctly and Frank answered some items incorrectly. A percent correct score would have provided more information because it takes into account the total number of points on the test.

Reliability
Reliability is the consistency of a measurement. If students could be given an assessment multiple times with no change in their motivation and/or knowledge and skill from one administration to the other, and no practice effects, reliable score results would be consistent over the measurements. Reliability is a characteristic of the scores. It is not a characteristic of the test. When reliability coefficients (the term used to describe an index of reliability) are reported, they relate to a specific set of test scores. A test score may be reliable for one group of students and not reliable for another. The reliability coefficient is an index ranging from 0 to 1.0, with zero indicating no relationship between the two sets of test scores and 1.0 indicating a perfect relationship. The higher the reliability coefficient, the more reliable the scores will be.

Root cause
The root cause of the current levels of student performance is the deepest, underlying cause or causes that can reasonably be identified, that educators have control to influence, and that, if modified, would result in increased student achievement. Often, the first explanation offered for student achievement data is not the true root cause. The root cause is the underlying reason that must be addressed before significant and lasting change can occur. Discovering a root cause takes courageous and honest conversation, but the root cause must be addressed if improvement is expected.

Rubric
A rubric is a specific type of scoring criteria used for performance assessments or constructed response items that will be scored with more than two score points (as opposed to correct or incorrect). The rubric provides a description of the characteristics of the response for each possible score point. For example, a performance to be scored using a four-point scale (one to four, with four being the highest) will have four descriptions. The rubric is typically specific to a given item on the assessment. A "generic" rubric is more general and can be applied to a variety of items.

Safe Harbor
If a subgroup in a school does not meet or exceed its AMO (annual measurable objective) in reading or mathematics, adequate yearly progress (AYP) can still be met if:

  • The school meets all participation requirements; and
  • The school meets all annual measurable objectives in the aggregate; and
  • The student group that did not meet the AMO reduced its percentage of students scoring below proficient by at least 10% from the previous year.
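
The three conditions above can be sketched as a single check. This is an illustration only: the function name meets_safe_harbor is hypothetical, and the third condition is assumed to mean that the group's percentage of students below proficient fell by at least 10% of the previous year's percentage.

```python
def meets_safe_harbor(participation_met, aggregate_amo_met,
                      pct_below_prior, pct_below_current):
    """Return True if all three safe harbor conditions hold.

    pct_below_prior and pct_below_current are the student group's percentages
    of students scoring below proficient in the previous and current year.
    """
    # Assumed reading: current <= 90% of prior, written without division
    # so that whole-number percentages compare exactly.
    improved_enough = 10 * pct_below_current <= 9 * pct_below_prior
    return participation_met and aggregate_amo_met and improved_enough
```

Under this reading, a group that moves from 40 percent below proficient to 36 percent has improved by exactly 10% and would satisfy the third condition.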

Scale score
A scale score is a linear transformation of a raw score. Scale scores are equal interval scores. This means that the distance between any scale score and the next scale score is the same. Therefore, regardless of where the score falls on the scale (top, middle or bottom), distances between scores are equal and can be compared. The most widely used scale score is the SAT. Each component of the SAT (critical reading, mathematics, and writing) has a mean of 500 and a standard deviation of 100. The equal interval property means that the difference between a score of 800 and 750 is the same as the distance between a score of 500 and 550, or between 300 and 350. The property of equal intervals also means that scale scores can be manipulated mathematically (summed, averaged, etc.). Scale scores are used to make sure that the expectations on a test remain the same over time.
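
As a simplified sketch of a linear transformation, the function below converts a raw score to a scale with a chosen mean and standard deviation. The function name to_scale_score is hypothetical, and actual testing programs use more elaborate scaling procedures; the defaults of 500 and 100 simply mirror the example above.

```python
def to_scale_score(raw, raw_mean, raw_sd, scale_mean=500, scale_sd=100):
    """Linearly map a raw score onto a scale with the given mean and SD."""
    if raw_sd <= 0:
        raise ValueError("raw_sd must be positive")
    return scale_mean + scale_sd * (raw - raw_mean) / raw_sd
```

A raw score equal to the raw mean maps to 500, and a raw score one standard deviation above the mean maps to 600. Equal raw-score differences always produce equal scale-score differences under this mapping.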

Scoring guide
A scoring guide is a document used to assign scores to student work. The purpose of a scoring guide is to increase the objectivity of scoring for constructed response items. The guide includes the criteria used to determine performance (usually in the form of rubrics), as well as sample responses that illustrate typical responses for each score point.

Scoring key
A scoring key consists of the correct answers to the items on a selected response or short answer test. Student responses are compared to the key to determine which items were answered correctly and which were not. Sometimes scoring keys also provide instruction to the teacher about how to calculate summary test performance (raw scores, percentages, etc.).

Selected response
In selected response (SR) items, students are presented with response options from which they must select the one best answer. The most commonly used selected response items are multiple choice, true/false, and matching. Scoring of SR items is fast and objective. Selected response items are most appropriate for assessing facts, relationships, and content knowledge. While selected response items are efficient because they require a short response time, they are limited to testing student ability to recognize the correct answer, not to produce it. Selected response items are also seen as an "artificial" way for students to demonstrate knowledge or skills.

Short answer
A type of constructed response item, the short answer, provides the student with a question or problem and requires the student to provide a brief answer (such as a word, phrase, sentence, or number).

Standard (objective, indicator, learning goal, target, outcome)
The term standard is used in several different ways in education and is also used synonymously with several other terms. A content standard is a broad and general statement of skills or knowledge in a content area that students are expected to acquire (what students are to know and be able to do) by the end of grade 12. In the K-8 Maryland State Curriculum, standards are subdivided into indicators, objectives, and assessment limits. In the grades 9-12 Maryland State Curriculum, standards are divided into expectations, indicators, and assessment limits. In other curricula, standards may be subdivided into learning goals, targets, or outcomes. A performance standard is the extent to which students are expected to meet the content standards by a given year and is expressed as the AMO (annual measurable objective).

Standard deviation
Standard deviation is a measure of the dispersion of assessment scores. It indicates the spread of the scores, or how far the scores typically fall from the mean score. The larger the standard deviation, the farther the scores are from the mean. A small standard deviation means that the scores tend to be grouped together, very close to the mean. The standard deviation provides additional information that the mean does not and, together, the mean and standard deviation can describe a data set better than either measure alone.
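
As a minimal sketch, the standard deviation of a complete set of scores could be computed as follows. The function name standard_deviation is hypothetical, and the population formula (dividing by the number of scores) is assumed; a sample formula would divide by one less.

```python
from math import sqrt


def standard_deviation(scores):
    """Population standard deviation: root of the mean squared distance from the mean."""
    if not scores:
        raise ValueError("at least one score is required")
    n = len(scores)
    mean = sum(scores) / n
    return sqrt(sum((s - mean) ** 2 for s in scores) / n)
```

Two classes can share the same mean and still differ sharply: scores of 70, 80, and 90 have a much larger standard deviation than scores of 79, 80, and 81.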

Standard error
Standard error is an estimate of how much measurement error exists in each individual student's test score. Because of measurement error, we can never know a student's true score. Each measurement is therefore an approximation. If we were able to take many measurements of the same student, the test scores would cluster around the true score. If the test scores were very reliable, the scores would be close to the true score. If the scores were not reliable, they would fall away from the true score. The best way to use a standard error of measurement is to add the value of the standard error of measurement to the score and subtract the standard error of measurement from the score. This provides a range that probably includes the student's true score.
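
The range described above can be sketched in Python as follows. The function name score_band and the example values are hypothetical.

```python
def score_band(score, standard_error):
    """Return (low, high): the score minus and plus one standard error of measurement."""
    if standard_error < 0:
        raise ValueError("standard_error cannot be negative")
    return (score - standard_error, score + standard_error)
```

For a hypothetical score of 410 with a standard error of measurement of 12, the range runs from 398 to 422.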

Standardized
An assessment that is standardized consists of a set of items administered to students under the same conditions, scored in the same way, and with results interpreted in the same way. Since every tested student is treated in exactly the same way, the test is said to be "standardized." These standardized conditions allow comparisons to be made between students and groups of students.

Stanine score
Stanine stands for "standard nine." This is a scale with nine points, three describing scores above average (7, 8, 9), three at average (4, 5, 6), and three scores below average (1, 2, 3). They are popular because they provide a broad indication of achievement that is easy to understand and does not over-emphasize small differences between scores. Stanines are much less precise than the percentile rank score with its ninety-nine possible scores. The "average" stanines (4 to 6) encompass a lot of the scores -- from percentile ranks of 23 to 76. This is most of the scores. However, the "4" is a low "average" and the "6" is a high "average."
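
As a sketch, a percentile rank could be converted to a stanine as follows. The function name stanine is hypothetical, and the boundary values are the conventional ones, assumed here because they agree with the 23 to 76 range given above for stanines 4 to 6; individual publishers may use slightly different boundaries.

```python
from bisect import bisect_right

# Lowest percentile rank belonging to stanines 2 through 9 (conventional values).
STANINE_LOWER_BOUNDS = [4, 11, 23, 40, 60, 77, 89, 96]


def stanine(percentile_rank):
    """Map a percentile rank (1-99) to a stanine (1-9)."""
    if not 1 <= percentile_rank <= 99:
        raise ValueError("percentile rank must be between 1 and 99")
    return bisect_right(STANINE_LOWER_BOUNDS, percentile_rank) + 1
```

Under these boundaries, percentile ranks of 23 and 76 both fall in the "average" stanines, as 4 and 6 respectively, even though they are more than fifty percentile points apart.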

Summative assessment
Summative assessment is administered for the purpose of obtaining a final, comprehensive evaluation of student knowledge and skills, often for accountability purposes, rather than for short-term instructional decision making. A few examples of summative assessments are course final exams, MSA, HSA, and SAT.

Test bias
Test bias is present when the scores are valid for some groups and not valid for others. For example, a test biased against Asian students would yield valid scores for other student groups, but scores for Asian students that were not valid because they did not accurately represent what those students know or can do. An important distinction is that a score difference reflects bias only when the subgroups in question are actually equivalent on the material being assessed but do not earn similar scores on the test. In that case, the score difference is due to bias inherent in the measurement instrument. Conversely, if girls in a class really are better writers than the boys, then their different test scores are valid and do not represent bias. (See "item bias.")

Test-taking skills
Students who know content but have difficulty demonstrating their knowledge on a test are often said to lack test-taking skills. Test-taking skills are thinking and problem-solving skills and strategies, and they are often learned through experience with various types of test item formats. These are skills that some students need assistance to develop. Instruction in test-taking skills should give students the information about the test that they need in order to show what they know. If the teaching of test-taking skills threatens the validity of the assessment (i.e., students are taught "tricks" to help them answer correctly when they do not know the information being assessed), then test preparation has gone too far.

Trend analysis
Education is focused on long-term growth in student achievement. To document this growth, or to discover a lack of it, data must be analyzed and monitored over time (the school year, or year to year). Looking at data over time is a trend analysis. The No Child Left Behind legislation is based on the analysis of trends in student performance over time (from year to year over a several-year period). However, trend data are also useful to teachers when they adjust classroom instruction based on their analysis.

Triangulation
Triangulation, or the use of multiple indicators, involves looking at and acting on the basis of more than one indicator of student achievement. Just as a test becomes more reliable as items are added, because students have more opportunities to demonstrate what they know, looking at a variety of assessment results provides a more complete and accurate picture of what students know and can do. In the classroom, teachers triangulate data when they use a combination of tests, projects, classroom discussion, homework, research papers, etc., to generate students' report card grades.

True/false
A type of selected response assessment item, a true/false question consists of a statement that the student must judge as either "true" or "false." Since students have a fifty percent chance of answering correctly even if they guess, true/false items are not widely recommended as good test items.
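
The effect of guessing can be illustrated with a short calculation. The sketch below assumes a hypothetical ten-item true/false quiz on which a student guesses every answer independently, and uses the binomial formula to find the chance of getting at least seven items right. The function name prob_at_least is illustrative only.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of at least k correct answers out of n items when
    each item is answered correctly with probability p (pure guessing
    on a true/false item gives p = 0.5)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical quiz: 10 true/false items, 7 or more correct by guessing.
chance = prob_at_least(7, 10)
print(f"Chance of 7 or more correct by guessing: {chance:.1%}")
```

Under these assumptions, a student who knows none of the content still has roughly a one-in-six chance of scoring 70 percent or better.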

True score
The true score is the most accurate assessment score and is therefore the one it would be ideal to have. Theoretically, it is the mean of an infinite number of repeated measurements of the same content, assuming no changes in motivation, knowledge, and/or skill level. It can be thought of as the "real" picture of the student's achievement. Because every sample of student behavior has some measurement error associated with it, the true score can never be observed directly. Instead, each test result is a combination of the true score and measurement error. The smaller the measurement error, the closer the test result is to the student's true score.
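
The idea that each test result combines the true score with measurement error can be sketched as a small simulation. Everything in the example is hypothetical: a true score of 75, normally distributed error with a standard deviation of 5, and the function name simulate_observed_scores. Real measurement error need not follow this model.

```python
import random
from statistics import mean

def simulate_observed_scores(true_score, error_sd, n_tests, seed=0):
    """Simulate n_tests results, each equal to the true score plus
    random measurement error (assumed normal with mean 0)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [true_score + rng.gauss(0, error_sd) for _ in range(n_tests)]

# Hypothetical student with a true score of 75 and an error SD of 5.
few = simulate_observed_scores(75, 5, 3)
many = simulate_observed_scores(75, 5, 10_000)

print(f"mean of 3 results:      {mean(few):.2f}")
print(f"mean of 10,000 results: {mean(many):.2f}")
```

No single simulated result equals 75, but the mean of a very large number of results settles close to it, which mirrors the theoretical definition above.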

Validity
Validity asks the questions, "Does the test assess what it was designed to assess?" and "Did I get the information I intended to get?" Like reliability, validity is not a characteristic of the test; validity is a characteristic of the test score interpretation. A test score interpretation may be valid for one student in the class and not at all valid for another. For example, a reading test given in English may support a valid interpretation for an English-speaking student, but not for a student who reads fluently in a language other than English. Validity is the key to the appropriate interpretation of assessment data and to its usefulness in informing instruction to meet student needs.

  • 27 Definition adapted from Preuss, P. (2003). School leader's guide to root cause analysis. Larchmont, NY: Eye on Education.