Validity

Mrs. Rodriquez teaches 8th grade mathematics. She notices great variability in the performance of her students, and worries that she is not able to give the lower-performing students the help they need while keeping a pace that is beneficial for the entire class. To start the next school year, the district has agreed to give all incoming 8th graders a test designed to measure math competency. Students scoring below a set cut point on the test will be assigned to a remedial math class where they can receive more attention, boost their knowledge, and ultimately work their way back into the mainstream classroom. Mrs. Rodriquez wonders if the test being used is valid; the words she uses in her conversations with others, though, are words like appropriate, right, and meaningful.

Assuming the test has been designed for use with a population of 8th graders for which this school is representative, the situation above appears to illustrate a valid use of this math competency test. A test is said to be valid if it measures the knowledge, behavior, or characteristic that it is intended to measure. Validity is not a property of the test itself, but rather of the interpretation of the scores resulting from the test. Therefore, proper selection and use of a test depends upon the population for which the test is intended. Use of a test with a population other than the one intended by the test developer (which should be noted in the test manual) could lead to incorrect or misleading interpretations. Likewise, a test that is valid for one subset of a population may not be valid for another subset of the same population. Such a test is said to lack generalizability. For example, the math competency test used by Mrs. Rodriquez may be valid for her eighth grade mathematics students, but it would not be valid for third grade mathematics students.

Because standardized tests are continuously being revised (a test is typically revised every 4 to 7 years) and because the characteristics of testing populations (ethnic makeup, values, etc.) change over time, test validation is a continual concern. If there is reason to believe that a testing population differs from the one for which a test was intended, or if substantial time has passed since a test was last evaluated for validity, then it is necessary to analyze the test to ensure that it is indeed valid. The interpretations drawn from the test data must be sound and accurate with regard to the testing population.

Validity Overview

A traditional view of validity recognizes three main sources of evidence for the validity of a test: criterion, content, and construct evidence. Criterion-related evidence is empirical evidence of the ability of a test to estimate performance on some other measure, such as how well a student will perform on the ACT or SAT. Content evidence is the degree to which the items on a test fairly represent the items that could or should be on the test. Construct evidence is the degree to which a test measures the trait or characteristic it is designed to measure.

Content-Related Evidence

The first area of validity evidence that will be covered is content validity. Content validity evidence is needed when the test user wants to draw inferences from examinee test scores to a larger domain of items similar to those on the test itself. For example, a teacher seeking validity evidence for an 8th grade math test would first define the domain of interest, in this case 8th grade math. The domain would then be broken down into its different areas, skills, topics, or types of problems. If the items on the assessment match that domain well, the match is considered content-related validity evidence.

In test development, a structured framework called a table of specifications is created to guide the process of matching items to the performance domain. These specifications include a test description and a test blueprint that guides the building of the test. The test description includes who will be tested and what the purpose of the test is. It can also include the overall test length, time limit, and item types (e.g., multiple choice, essay). The blueprint identifies the objectives and skills to be measured by the test, as well as the relative weight given to each. The table of specifications is used to determine how items are distributed across the set standards and cognitive domains, and how well the test content matches the content and cognitive-domain standards set for the students.

Finally, data from the matching process are collected and summarized to see whether the test's content adequately meets the test specifications. This matching process is sometimes referred to as alignment.
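The alignment check described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the objectives, weights, item pool, and the 0.05 tolerance are all hypothetical and do not come from any real test manual. The sketch compares the share of items written for each objective with the weight the blueprint assigns to it.

```python
from collections import Counter

# Hypothetical blueprint: objective -> intended share of test items.
blueprint = {
    "number sense": 0.25,
    "algebra": 0.35,
    "geometry": 0.25,
    "data analysis": 0.15,
}

# Hypothetical item pool: the objective each written item was matched to.
item_objectives = (
    ["number sense"] * 5
    + ["algebra"] * 9
    + ["geometry"] * 5
    + ["data analysis"] * 1
)


def alignment_report(blueprint, item_objectives, tolerance=0.05):
    """Compare the observed share of items per objective with the blueprint.

    Returns {objective: (intended, observed, within_tolerance)}.
    The tolerance is an arbitrary illustrative threshold, not a standard.
    """
    counts = Counter(item_objectives)
    total = len(item_objectives)
    report = {}
    for objective, intended in blueprint.items():
        observed = counts.get(objective, 0) / total
        within = abs(observed - intended) <= tolerance
        report[objective] = (intended, round(observed, 2), within)
    return report


report = alignment_report(blueprint, item_objectives)
for objective, (intended, observed, ok) in report.items():
    flag = "ok" if ok else "review"
    print(f"{objective:14s} intended={intended:.2f} observed={observed:.2f} {flag}")
```

Objectives flagged for review are those whose share of items differs from the blueprint weight by more than the tolerance. A real alignment study would also rely on expert judgment about whether each item truly measures the objective it was matched to.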

Classroom teachers often use this process when developing their own tests. Teachers create tables of specifications before composing items to help ensure that a classroom assessment has content validity. Their tables are typically based on their own judgments about the appropriate domain for testing, what they taught, how much time was spent on a topic, and outlines from their lectures or the textbook.

Criterion-Related Validity Evidence

Another type of validity evidence is criterion validity. This type is particularly important when a test user wants to draw inferences from examinee test scores to performance on some other test or real-world behavioral variable of practical importance. Criterion evidence is typically gathered by determining the strength of the association between test scores and the criteria of interest. The indices computed to reflect the size of these relationships are correlation coefficients, called validity coefficients in this context.

The process of determining criterion validity begins with identifying a criterion of interest, such as grades in an English course. Then an appropriate sample of examinees is selected that represents the testing population (e.g., a random sample of 6th grade English students). The test is then administered and scored. After the criterion data are collected, they are matched to each examinee in the sample. Finally, the strength of the relationship between the test scores and the criterion performance is analyzed.
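The matching and analysis steps can be illustrated with a short Python sketch. It assumes the validity coefficient is a Pearson correlation, which is a common choice but not the only possible index. The examinee IDs, test scores, and course grades are made-up values, and the function names are hypothetical.

```python
import math

# Hypothetical data: examinee ID -> value. All numbers are invented.
test_scores = {"s01": 52, "s02": 61, "s03": 70, "s04": 45, "s05": 88, "s06": 77}
course_grades = {"s01": 2.3, "s02": 2.7, "s03": 3.0, "s04": 2.0, "s05": 3.8, "s07": 3.1}


def match_examinees(scores, criterion):
    """Pair test and criterion values for examinees present in both sets."""
    ids = sorted(set(scores) & set(criterion))
    return [scores[i] for i in ids], [criterion[i] for i in ids]


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


# Examinees s06 and s07 are dropped because each lacks one of the two measures.
matched_scores, matched_grades = match_examinees(test_scores, course_grades)
validity_coefficient = pearson_r(matched_scores, matched_grades)
print(f"n = {len(matched_scores)}, validity coefficient = {validity_coefficient:.3f}")
```

With a sample this small the coefficient is very unstable; an actual validation study would use a much larger representative sample and report the uncertainty around the estimate.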

If the test is used to predict future achievement, the criterion evidence should establish predictive criterion validity. For example, the SAT and ACT are designed to produce scores that predict first-year college grades so that colleges can estimate the probability that applicants will succeed in their course of study. Thus empirical evidence that SAT and ACT scores do predict college grades is their central piece of validity evidence.

Concurrent validity

If the test is meant to replace some other test, the criterion evidence should establish concurrent criterion validity. Good evidence would be to establish that test scores on the new test correlate highly with scores on the old test.

Concurrent validity is a type of criterion validity evidence. It demonstrates that the scores on one test are related to scores on another test, which could be administered at the same time or in place of the other test. Two different forms of a state assessment, for example, should have concurrent validity.

Predictive validity

Predictive validity is also a type of evidence for criterion validity. It demonstrates that scores on one test are related to scores on another test, which cannot (or will not) be administered until sometime in the future. For example, classroom unit tests should predict scores on the annual state assessment. Tests used for admissions purposes or selection purposes should provide evidence of predictive validity.

Construct-Related Validity Evidence

A construct is an invisible trait that is hypothesized to exist but cannot be directly observed. When tests are created, it is usually not the test score that is of greatest interest, but the underlying trait that we intend to measure. Construct validity is the gathering of empirical evidence that shows what test scores are and are not related to. In this way we can determine whether we are truly measuring the construct of interest.
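One way to picture "what test scores are and are not related to" is to correlate a new test with a measure it should track and with a measure it should not. The Python sketch below is a minimal illustration under stated assumptions: all data are invented, the labels "convergent" and "discriminant" are common measurement terms rather than terms used in this lesson, and the 0.7 and 0.3 thresholds are arbitrary.

```python
from statistics import mean, pstdev

# Hypothetical scores for the same six examinees. All values are invented.
new_math_test = [10, 20, 30, 40, 50, 60]
established_math_measure = [12, 18, 33, 38, 52, 58]  # expected to be related
shoe_size = [7, 9, 8, 8, 9, 7]                       # expected to be unrelated


def correlation(x, y):
    """Pearson correlation computed from population standard deviations."""
    mx, my = mean(x), mean(y)
    covariance = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return covariance / (pstdev(x) * pstdev(y))


def construct_evidence(test, related, unrelated, high=0.7, low=0.3):
    """Summarize whether the correlations match the hypothesized pattern.

    The thresholds `high` and `low` are illustrative assumptions only.
    """
    convergent_r = correlation(test, related)
    discriminant_r = correlation(test, unrelated)
    return {
        "convergent_r": convergent_r,
        "discriminant_r": discriminant_r,
        "consistent_with_hypothesis": (
            convergent_r >= high and abs(discriminant_r) <= low
        ),
    }


summary = construct_evidence(new_math_test, established_math_measure, shoe_size)
print(summary)
```

A pattern that matches the hypothesis is supporting evidence, not proof; rival explanations for the same pattern of correlations still need to be considered.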

There are several strategies for gathering construct validity evidence. A common method is first to formulate and define the psychological construct (e.g., intelligence) and then to show that it is related to other relevant characteristics, such as specific criteria. Perhaps there already exists an instrument that is an accepted measure of the construct in question; if so, we administer it. After the instrument is given and measures of other relevant criteria that might support the proposed relationship are collected, the data are evaluated to see whether they are consistent with the hypothesis. It is always important at this stage to consider alternative hypotheses and rival theories.

Unified view of validity

Since the 1980s there has been a growing view of validity as the systematic provision of evidence to support inferences made from test scores. That evidence may take many forms and accumulates over time. We can continue to learn more even about tests that have been used for a long time. In essence, the validation of inferences about scores for a test is an ongoing scientific enterprise: test validation is a process of gathering multiple sources of evidence. These inferences fall into two categories: interpretive inferences (what do these scores mean?) and action inferences (what actions should we take based on these scores?).

Because validity studies take time and can be expensive, the particular types of evidence gathered and the order in which evidence is gathered will vary depending on the purposes and size of the testing program. For tests that have critical consequences, it is important to gather and document more evidence sooner.

Impact and Consequences

Upon reviewing the mid-semester grades for students in her mainstream math class and those in the remedial class, Mrs. Rodriquez is surprised to find that the students in the remedial class are actually performing worse than they would have been expected to perform in the mainstream class, based on previous years' data. She wonders why this could be.

What is the reason for this anomaly? Suppose, for instance, that students placed in the remedial class decide to stop trying to learn because they think they are "dummies" for being in the "slow" class. They may perceive their failure as an inescapable fate, something beyond their control. Perhaps this leads to a downward spiral in which the students, because they expect less of themselves, put in less effort, leading to poorer results (thus causing a further drop in expectations, etc.). Alternatively, perhaps the remedial class is taught by the least experienced teacher. Either could explain the unanticipated poor performance of these students.

The consequential evaluation of validity focuses on the intended and unintended effects that testing, score interpretation, and subsequent actions have on the population being tested. Both types of effects, those that are expected and those that are not, must be taken into consideration when evaluating the worth or validity of a test for a specific purpose.

"What impact will this have on the population being tested?" is a critical question to ask when considering the consequential evaluation of a test's validity. In the initial example, the impact of the use of the math competency test does not stop at the placement of the students in either the remedial or mainstream class. The ultimate goal is improvement in math competency. Because this goal is not being achieved, the actions taken as a result of the use and interpretation of this test must be reassessed.

Consider other potential negative consequences from this example. If a test is not well aligned with the curriculum (because it regularly leaves out important topics, perhaps because they are hard to assess with a test), some teachers may limit their instruction by teaching only what they know will be on the math competency test. Compared to hypothetical scores on the desired construct (covering math topics across the full curriculum), this will artificially boost students' scores. Another example might be a student with high test anxiety who performs poorly on the math competency test, even though his school work shows a high level of understanding in mathematics.

In this particular example, the school district might be better served by basing its classification of remedial students on more than just the math competency test scores, perhaps making use of past grades and input from previous teachers. While the test may be a valid way of assessing math competency, the ultimate goal of improving math competency is not being met under the current system. Instead of removing the low-scoring students from the class, the school could provide additional help and tutoring to these students while they remain in Mrs. Rodriquez's classroom.

Decisions made from the interpretation of test scores can have both positive and negative consequences. Serious consequences - those that cannot be reversed - will have a heavy impact on the individual being tested (for instance, a college entrance exam that determines whether or not a student is admitted to a certain school). For less serious consequences, the impact may be reversible or recoverable.

Summary

It is important to realize that the assessment of validity is an ongoing process that does not stop once the test has been administered. The validity of a test is dependent upon the decisions made from the test results and the consequences of those decisions, both intended and unintended.

To say a test is valid, or, more correctly, to say a test produces scores that are valid, is to say a lot. It means a test produces scores that fairly reflect the domain of tasks that could have been sampled. The scores are a direct reflection of the level of the construct being measured and, when relevant, correlate well with performance on other tests and measures. It also means that the test does not produce undesirable unintended consequences and that those who take the assessment are helped by the experience, not harmed.

Principles of Measurement