Presentation: Validity |
Assuming the test has been designed for use with a population of 8th graders for which this school is representative, the situation above appears to illustrate a valid use of this math competency test. A test is said to be valid if it measures the knowledge, behavior, or characteristic that it is intended to measure. Validity is not a property of the test itself, but rather of the interpretation of the scores resulting from the test. Therefore, proper selection and use of a test depends upon the population for which the test is intended. Using a test with a population other than the one intended by the test developer (which should be noted in the test manual) could lead to incorrect or misleading interpretations. Likewise, a test that is valid for one subset of a population may not be valid for another subset of the same population. Such a test is said to lack generalizability. For example, the math competency test used by Mrs. Rodriquez is valid for her eighth grade mathematics students. The test would not be valid, however, for third grade mathematics students.
|
A traditional view of validity recognizes three main sources of evidence for the validity of a test: criterion, content, and construct evidence. Criterion evidence is empirical evidence of the ability of a test to estimate performance on some other measure, such as how well a student will perform on the ACT or SAT. Content evidence is the degree to which the items on a test fairly represent the items that could or should be on the test. Construct evidence is the degree to which a test measures the trait or characteristic it is designed to measure.
|
Content-Related Evidence
The first area of validity evidence that will be covered is content validity. Content validity evidence is needed when the test user wants to draw inferences from examinee test scores to a larger domain of items similar to those on the test itself. For example, if a teacher wanted to find validity evidence for an 8th grade math test, the teacher would first define the domain of interest. In this case it would be 8th grade math. Then the domain would be defined in terms of the different areas, skills, topics, or types of problems. If the items on the assessment seem to match well with the domain, that would be considered content-related validity evidence.
In test development, a structured framework called a table of specifications is created for the process of matching items to the performance domain. These specifications include a test description and a test blueprint that guides the building of the test. The test description includes who will be tested and what the purpose of the test is. It can also include the overall test length, time limit, and item types (e.g., multiple choice, essay). The blueprint identifies the objectives and skills to be measured by the test, as well as the relative weight given to each. The purpose of the table of specifications is to determine the distribution of items meeting set standards and/or fulfilling certain cognitive domains, and the match between the content and the cognitive domain standards set for the students. Finally, data from the matching process are collected and summarized to see if the test's content adequately meets the test specifications. This matching process is sometimes referred to as alignment.
Classroom teachers often use this process when developing their own tests. Teachers create tables of specifications before composing items to ensure a classroom assessment that is content valid. Their tables are typically based on their own judgments about the appropriate domain for testing, what they taught, how much time was spent on a topic, and outlines from their lectures or the textbook.
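The alignment check described above can be sketched in a few lines of Python. This is only an illustration: the content areas, blueprint weights, and item classifications are hypothetical, and the function name `alignment_report` is an assumption made for this example, not part of any standard procedure.

```python
from collections import Counter

# Hypothetical blueprint: intended share of items per content area (sums to 1.0).
blueprint = {
    "number sense": 0.30,
    "algebra": 0.40,
    "geometry": 0.20,
    "data analysis": 0.10,
}

# Hypothetical draft test: the content area each of 20 items was judged to measure.
item_areas = (
    ["number sense"] * 6
    + ["algebra"] * 8
    + ["geometry"] * 3
    + ["data analysis"] * 3
)


def alignment_report(blueprint, item_areas):
    """Return {area: (intended_share, actual_share, actual - intended)}."""
    counts = Counter(item_areas)
    total = len(item_areas)
    report = {}
    for area, intended in blueprint.items():
        actual = counts.get(area, 0) / total
        report[area] = (intended, actual, round(actual - intended, 3))
    return report


report = alignment_report(blueprint, item_areas)
for area, (intended, actual, diff) in report.items():
    print(f"{area:14s} intended={intended:.2f} actual={actual:.2f} diff={diff:+.2f}")
```

Areas with a large positive or negative difference are over- or under-represented relative to the blueprint. How large a difference is acceptable is a judgment call for the test developer; this sketch does not set a threshold.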
|
Criterion-Related Validity Evidence
Another type of validity evidence is criterion validity. This type is particularly important when a test user wants to draw inferences from examinee test scores to performance on some other test or on a real-world behavioral variable of practical importance. It is typically evaluated by determining the strength of association between test scores and the criteria of interest. The indices computed to reflect the size of these relationships are called correlation coefficients (or validity coefficients). The process of determining criterion validity begins by identifying a criterion of interest, such as grades in an English course. Then an appropriate sample of examinees is selected that represents the testing population (e.g., a random sample of 6th grade English students). The test is then administered and scored. After the criterion data are collected, they are matched to each examinee in the sample. Finally, the strength of the relationship between the test scores and the criterion performance is analyzed. If the test is used to predict future achievement, the criterion evidence should establish predictive criterion validity. For example, the SAT and ACT are designed to produce scores that predict first-year college grades so colleges can estimate the probability that applicants will succeed in their course of study. Thus, empirical evidence that SAT and ACT scores do predict college grades is their central piece of validity evidence.
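As a minimal sketch of the final analysis step, the validity coefficient can be computed as a Pearson correlation between matched test scores and criterion values. The scores and grades below are made-up illustrative data, and `pearson_r` is a helper written for this example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable has no variance")
    return sxy / sqrt(sxx * syy)


# Hypothetical matched data: one test score and one course grade per examinee.
test_scores = [52, 61, 70, 74, 85, 90]
course_grades = [2.0, 2.4, 2.9, 3.0, 3.6, 3.8]

validity_coefficient = pearson_r(test_scores, course_grades)
print(f"validity coefficient: {validity_coefficient:.3f}")
```

A coefficient near 1 indicates that higher test scores go with higher criterion performance in this sample; a coefficient near 0 indicates little linear relationship. A real study would use a far larger, representative sample than six examinees.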
|
Concurrent validity
Concurrent validity is a type of criterion validity evidence. It demonstrates that the scores on one test are related to scores on another test, which could be administered at the same time or in place of the other test. If the test is meant to replace some other test, the criterion evidence should establish concurrent criterion validity. Good evidence would be that scores on the new test correlate highly with scores on the old test. Two different forms of a state assessment, for example, should have concurrent validity.
|
Predictive validity
Predictive validity is also a type of criterion validity evidence. It demonstrates that scores on one test are related to scores on another test, which cannot (or will not) be administered until sometime in the future. For example, classroom unit tests should predict scores on the annual state assessment. Tests used for admissions or selection purposes should provide evidence of predictive validity.
|
Construct-Related Validity Evidence
A construct is an invisible trait that is hypothesized to exist but cannot be directly observed. When tests are created, it is usually not the test score that is of greatest interest, but the underlying trait that we intend to measure. Construct validation is the gathering of empirical evidence that shows what test scores are and are not related to. In this way we can determine whether we are truly measuring the construct of interest. There are several strategies for gathering construct validity evidence. A common method is first to formulate and define the psychological construct (e.g., intelligence) and then to show that it is related to other relevant characteristics, such as specific criteria. Perhaps an instrument already exists that is an accepted measure of the construct in question. If so, we administer that instrument as well. After the instruments are given and measures of other relevant criteria that might support the proposed relationship are collected, the data are evaluated to see if they are consistent with the hypothesis. It is always important at this stage to consider alternative hypotheses and rival theories.
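One way to picture "what test scores are and are not related to" is to correlate a new test with an accepted measure of the same construct (where a strong relationship is hypothesized) and with a measure of something expected to be unrelated (where a weak one is hypothesized). The sketch below does this with invented data; the variable names and the choice of an "unrelated" measure are assumptions for illustration only.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for the same six examinees on three measures.
new_test = [10, 12, 15, 18, 20, 24]          # the test being validated
accepted_measure = [11, 13, 14, 19, 21, 23]  # established measure of the same construct
unrelated_measure = [8, 6, 9, 7, 9, 7]       # a trait hypothesized to be unrelated

r_related = pearson_r(new_test, accepted_measure)
r_unrelated = pearson_r(new_test, unrelated_measure)

print(f"correlation with accepted measure:  {r_related:.2f}")
print(f"correlation with unrelated measure: {r_unrelated:.2f}")
```

A high first correlation together with a low second one is consistent with the hypothesis that the new test measures the intended construct. It does not rule out rival explanations, which still have to be considered separately.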
|
Since the 1980s there has been a growing view of validity as the systematic provision of evidence to support inferences made from test scores. That evidence may take many forms and accumulates over time. We can continue to learn more even about tests that have been used for a long time. In essence, the validation of inferences about scores for a test is an ongoing scientific enterprise: test validation is a process of gathering multiple sources of evidence. These inferences fall into two categories: interpretive inferences (what do these scores mean?) and action inferences (what actions should we take based on these scores?). Because validity studies take time and can be expensive, the particular types of evidence gathered and the order in which they are gathered will vary depending on the purposes and size of the testing program. For tests that have critical consequences, it is important to gather and document more evidence sooner.
| Principles of Measurement |