Are You Using the Right Evaluation Tool to Assess Learners? Putting Validity on Trial

As medical educators, we often rely on assessment tools to evaluate our learners. Whether in the form of a post-lecture survey or a checklist for a standardized patient assessment, tools are used throughout medical training and beyond. How do we know the tool we are using is appropriate? Is it assessing the right things? Do the scores have any meaning? We often search for tools that have been “validated” and then feel more confident applying the results. But what makes a tool “valid”? With a few simple concepts, we can better choose and create our assessment tools and therefore better tailor our teaching to the needs of learners.

What is Validity?

Defining validity is challenging, especially when the terms seem to be redefined just as soon as we get comfortable with them. Even before validity, we must address the construct (what the instrument is intended to measure). Most constructs, such as physician attitudes or patient symptoms, have no inherent standard of normal or abnormal. Messick defines construct as “an intangible collection of abstract concepts inferred from behavior and used to measure validity” [1]. Defining the construct is the first step in creating an assessment tool or choosing an appropriate existing tool.

When applied to assessments, validity is a hypothesis, not a statement of fact. A hypothesis requires evidence, either in support or in opposition. A tool itself is not “valid” or “invalid”; rather, validity describes whether the interpretation of the data the tool produces is justified [2]. A tool may have validity in one context but not another, or with one type of learner but not another. Dr. David Cook, an expert in medical education validation research, defines validity as “the degree to which the interpretations of scores resulting from an assessment activity are ‘well-grounded or justifiable’” [3].

The Courtroom Analogy

Validity lies on a continuum and relies on four foundational concepts:

Propositions
Evidence
Argument
Decisions

Cook uses the analogy of the courtroom to simplify these concepts [3].

Propositions

Start with the prosecution, who proposes that the defendant is guilty. This proposition is the basis of the trial, and evidence will be presented to support it. Similarly, propositions guide the collection of validity evidence and are essential when evaluating the validity of a tool. Propositions are to the validity hypothesis as objectives are to the goal. For example, I propose that my leadership assessment tool will include elements identified as essential to effective leadership in a resuscitation.

Evidence

Next, the prosecution presents a collection of evidence. One eyewitness does not make a case. However, an eyewitness, DNA evidence, and a motive might seal the deal. In the assessment of validity there are five main types of evidence, defined by Messick [1, 2].

Content evidence asks whether the instrument completely represents the construct. To return to the leadership assessment tool example, does the tool truly measure leadership skills? Are the questions, or items, important and necessary? Are there too many or too few?
Internal structure refers to the reliability of the instrument, including interrater reliability and test-retest reliability. Is score variation among participants expected? We would expect novice leaders to score lower than seasoned attendings. Are scores from different observers similar?
Response process refers to the relationship between the intended construct and the thought process of the subjects or observers. Do those being assessed understand the items on the tool as intended? If not, the tool is not assessing your construct as expected.
Consequences, intended and unintended, of an assessment can affect the tool’s validity. Do low scores lead to remediation and therefore improved performance? Or alternatively, do low scores cause self-doubt and decreased confidence, leading to poor performance?
Relation to other variables, previously known as construct validity, refers to the correlation of scores to other tools that assess the same construct. How does my tool compare to other leadership assessment tools?

Just like in a courtroom, more corroborating evidence is better, but you don’t need evidence in every category to get a conviction or acquittal.
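
Two of the evidence types above lend themselves to simple calculations. The following is a minimal Python sketch, not a validated analysis. It computes Cohen's kappa, one common statistic for agreement between two raters (internal structure), and a Pearson correlation between scores on a new tool and an existing one (relation to other variables). The function names and every score below are made up for illustration.

```python
from collections import Counter
from math import sqrt


def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters on categorical ratings, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of learners.")
    n = len(rater_a)
    # Proportion of learners on whom the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("Need two score lists of equal length (at least 2).")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("Correlation is undefined when all scores are identical.")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: two observers rate the same 8 learners as pass/fail.
rater_a = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.47

# Hypothetical data: 5 learners scored on a new tool and an established tool.
new_tool = [12, 15, 9, 20, 17]
established_tool = [14, 16, 10, 22, 15]
print(round(pearson_r(new_tool, established_tool), 2))
```

Here the raters agree on 6 of 8 learners (75%), but kappa is lower because some of that agreement would be expected by chance alone. Neither number proves validity on its own; each is one piece of evidence to weigh in the overall argument.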

Argument

After collecting the evidence, each side has an opportunity to make its argument. As Cook states, “the evidence doesn’t speak for itself,” and a strong validity argument requires the structured presentation of evidence [3].

Decisions

How will the tool be used and what is the effect of its use? The amount of evidence needed before using an instrument in a given environment or with a given group of learners depends on how the scores will be interpreted. Educational assessments of learners used primarily for the educator’s benefit to develop a curriculum are arguably low stakes and require less evidence before use. However, assessment tools with significant long-term consequences, such as remediation or a failing grade, are high stakes and require stronger validity evidence prior to application [3].

How to Choose a Tool

When assessing an evaluation tool to be used with learners, consider the above concepts of validity and specifically the validity evidence provided. Ideally, the tool has published evidence in multiple categories and was studied in a population and environment similar to your own, although this is not always possible. Creating your own tool may be necessary [4]. When doing this, consider testing the instrument’s validity before applying the tool. Then consider the potential outcome of applying the tool and its significance to the learner. A high-stakes outcome, such as pass/fail or granting of increased autonomy, requires assessment tools with large amounts of validity evidence to be applied with confidence. Unfortunately, in medical education we are guilty of frequently using assessment tools that do not meet these standards. If we pause and ask some basic questions about our instrument and what it is assessing, we can better choose and create tools that truly benefit our learners.
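
If you do build your own tool, one common way to start gathering content evidence is Lawshe's content validity ratio (CVR), discussed by Ayre and Scally in the Additional Reading list. A panel of experts rates each item as essential or not, and CVR = (n_e - N/2) / (N/2), where n_e is the number of experts rating the item essential and N is the panel size. The sketch below is illustrative only: the item names, the vote counts, and the cut-off are hypothetical, and a real study should use published critical values for its panel size.

```python
def content_validity_ratio(n_essential, n_experts):
    """Lawshe's CVR: -1 if no expert rates the item essential, +1 if all do."""
    if n_experts <= 0 or not 0 <= n_essential <= n_experts:
        raise ValueError("n_essential must be between 0 and n_experts (> 0).")
    half = n_experts / 2
    return (n_essential - half) / half


# Hypothetical panel of 10 experts reviewing draft leadership items.
# Each value is the number of experts who rated the item "essential".
N_EXPERTS = 10
essential_votes = {
    "Assigns team roles": 10,
    "Uses closed-loop communication": 9,
    "Summarizes the case aloud": 5,
    "Stands at the foot of the bed": 3,
}

# Placeholder cut-off for illustration; not a published critical value.
HYPOTHETICAL_CUTOFF = 0.75

cvr_by_item = {
    item: content_validity_ratio(votes, N_EXPERTS)
    for item, votes in essential_votes.items()
}
retained = [item for item, cvr in cvr_by_item.items() if cvr >= HYPOTHETICAL_CUTOFF]

for item, cvr in cvr_by_item.items():
    print(f"{item}: CVR = {cvr:+.2f}")
print("Retained:", retained)
```

An item rated essential by exactly half the panel has a CVR of 0, so only items with clear majority support approach +1. Dropping low-CVR items addresses only content evidence; the other evidence types still need to be considered before high-stakes use.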

References:

1. Messick S. Validity. In: Linn RL, ed. Educational Measurement. 3rd ed. New York, NY: American Council on Education and Macmillan; 1989:13-104.
2. Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments: theory and application. Am J Med. 2006;119(2): doi:10.1016/j.amjmed.2005.10.036. PMID: 16443422
3. Cook DA. When I say… validity. Med Educ. 2014;48(10):948-949. doi:10.1111/medu.12401. PMID: 25200015
4. Reid J, Stone K, Brown J, et al. The Simulation Team Assessment Tool (STAT): development, reliability and validation. Resuscitation. 2012;83(7):879-886. doi:10.1016/j.resuscitation.2011.12.012. PMID: 22198422

Additional Reading

ALiEM Education Theories Made Practice eBooks [ALiEM Library]
Downing SM. Validity: on meaningful interpretation of assessment data. Med Educ. 2003;37(9):830-837. doi:10.1046/j.1365-2923.2003.01594.x. PMID: 14506816
Zamanzadeh V, Ghahramanian A, Rassouli M, Abbaszadeh A, Alavi-Majd H, Nikanfar AR. Design and implementation content validity study: development of an instrument for measuring patient-centered communication. J Caring Sci. 2015;4(2):165-178. doi:10.15171/jcs.2015.017. PMID: 26161370
Kessler CS, Kalapurayil PS, Yudkowsky R, Schwartz A. Validity evidence for a new checklist evaluating consultations, the 5Cs model. Acad Med. 2012;87(10):1408-1412. doi:10.1097/ACM.0b013e3182677944. PMID: 22914527
Ayre C, Scally AJ. Critical values for Lawshe’s content validity ratio: revisiting the original methods of calculation. Measurement and Evaluation in Counseling and Development. 2014;47(1):79-86. doi:10.1177/0748175613513808

About Shannon Flood, MD