Are You Using the Right Evaluation Tool to Assess Learners? Putting Validity on Trial


As medical educators, we often rely on assessment tools to evaluate our learners. Whether in the form of a post-lecture survey or a checklist completed during a standardized patient encounter, tools are used throughout medical training and beyond. How do we know the tool we are using is appropriate? Is it assessing the right things? Do the scores have any meaning? We often search for tools that have been “validated” and feel more confident applying the results. But what makes a tool “valid”? With a few simple concepts, we can better choose and create our assessment tools and therefore better tailor our education to the needs of learners.

What is Validity?

Defining validity is challenging, especially when the terms seem to be redefined just as soon as we get comfortable with them. Even before addressing validity, we must define the construct (what the instrument is intended to measure). Most often the construct, such as physician attitudes or patient symptoms, has no standard or inherent normal or abnormal value. Messick defines construct as “an intangible collection of abstract concepts inferred from behavior and used to measure validity”. [1] Defining the construct is the first step in creating an assessment tool or choosing an appropriate existing tool.

When applied to assessments, validity is a hypothesis, not a statement of fact. A hypothesis requires evidence, either in support or in opposition. A tool itself is not “valid” or “invalid”; rather, it is the interpretation of the data it produces that has or lacks validity [2]. A tool may have validity in one context but not another, or with one type of learner but not another. Dr. David Cook, an expert in medical education validation research, defines validity as “the degree to which the interpretations of scores resulting from an assessment activity are ‘well-grounded or justifiable’” [3].

The Courtroom Analogy

Validity lies on a continuum and relies on 4 foundational concepts:

Propositions
Evidence
Argument
Decisions

Cook uses the analogy of the courtroom to simplify these concepts [3].

Propositions

Start with the prosecution, who proposes that the defendant is guilty. This proposition is the basis of the trial and evidence will be presented to support this proposition. Similarly, propositions guide the collection of validity evidence and are essential when evaluating the validity of a tool. Propositions are to the validity hypothesis as objectives are to the goal. For example, I propose my leadership assessment tool will include elements identified as essential to effective leadership in a resuscitation.

Evidence

Next, the prosecution presents its evidence or a collection of evidence. One eyewitness does not make a case. However, an eyewitness, DNA evidence, and a motive might seal the deal. In the assessment of validity there are 5 main types of evidence, defined by Messick [1, 2].

Content evidence asks whether the instrument completely represents the construct. To return to the leadership assessment tool example, does the tool truly measure leadership skills? Are the questions, or items, important and necessary? Are there too many or too few?
Internal structure refers to the reliability of the instrument, including interrater reliability and test-retest reliability. Is score variation among participants expected? We would expect novice leaders to score lower than seasoned attendings. Are scores from different observers similar?
Response process refers to the relationship between the intended construct and the thought process of the subjects or observers. Do those being assessed understand the items on the tool as intended? If not, the tool is not assessing your construct as expected.
Consequences, intended and unintended, of an assessment can affect the tool’s validity. Do low scores lead to remediation and therefore improved performance? Or alternatively, do low scores cause self-doubt and decreased confidence, leading to poor performance?
Relation to other variables, previously known as construct validity, refers to the correlation of scores to other tools that assess the same construct. How does my tool compare to other leadership assessment tools?

Just like in a courtroom, more corroborating evidence is better, but you don’t need evidence in every category to get a conviction or acquittal.
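Two of these evidence types lend themselves to simple calculations. The sketch below is only an illustration, not a method taken from the cited papers: it assumes two observers scoring the same ten checklist items as done (1) or not done (0), and five learners scored with both a new tool and an established one. All of the scores are made up. Cohen's kappa is one common statistic for agreement between two raters (internal structure), and a Pearson correlation is one way to compare scores from two tools (relation to other variables).

```python
from collections import Counter
from statistics import mean


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on categorical ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items.")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same rating.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("Need two equal-length lists with at least two scores.")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: two observers rating the same 10 checklist items.
observer_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
observer_2 = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]
kappa = cohens_kappa(observer_1, observer_2)

# Hypothetical data: five learners scored on a new and an established tool.
new_tool_scores = [12, 15, 9, 20, 17]
established_tool_scores = [14, 16, 10, 22, 18]
r = pearson_r(new_tool_scores, established_tool_scores)

print(f"Cohen's kappa: {kappa:.2f}")
print(f"Pearson r: {r:.2f}")
```

Neither number settles the question on its own. Each is one piece of evidence to weigh alongside the other categories.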

Argument

After the evidence is collected, each side has an opportunity to make its argument. As Cook states, “the evidence doesn’t speak for itself,” and a strong validity argument requires the structured presentation of evidence [3].

Decision

How will the tool be used and what is the effect of its use? How much evidence is necessary to use the instrument in a certain environment or with a group of learners depends on how the scores will be interpreted. Educational assessments of learners used primarily for the educator’s benefit to develop a curriculum are arguably low stakes and require less evidence before use. However, assessment tools with significant long-term consequences, such as remediation or a failing grade, are high stakes and require stronger validity evidence prior to application [3].

How to Choose a Tool

When assessing an evaluation tool to be used with learners, consider the above concepts of validity and specifically the validity evidence provided. The ideal tool has evidence in multiple categories and was studied in a population and environment similar to your own, although this is not always possible. Creating your own tool may be necessary [4]. When doing this, consider testing the instrument’s validity before applying the tool. Then consider the potential outcome of applying the tool and its significance to the learner. A high-stakes outcome, such as pass/fail or granting of increased autonomy, requires assessment tools with large amounts of validity evidence to be applied with confidence. Unfortunately, in medical education we are frequently guilty of using assessment tools that do not meet these standards. If we pause and ask some basic questions about our instrument and what it is assessing, we can better choose and create tools that truly benefit our learners.
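If you do build your own tool, one way to start gathering content evidence is Lawshe's content validity ratio (CVR), discussed in the Ayre and Scally paper under Additional Reading. A panel of experts rates each item as essential or not, and CVR = (n_e − N/2) / (N/2), where n_e is the number of panelists rating the item essential and N is the panel size. The sketch below is a minimal illustration. The item names and panel counts are hypothetical, and the cutoff for retaining an item should come from published critical-value tables rather than from this example.

```python
def content_validity_ratio(n_essential, n_panelists):
    """Lawshe's CVR for one item: ranges from -1 (none) to +1 (all) rating it essential."""
    if n_panelists <= 0 or not 0 <= n_essential <= n_panelists:
        raise ValueError("n_essential must be between 0 and n_panelists.")
    half = n_panelists / 2
    return (n_essential - half) / half


# Hypothetical panel of 10 experts reviewing a leadership checklist.
panel_size = 10
essential_votes = {
    "Assigns team roles": 10,
    "Uses closed-loop communication": 9,
    "Wears a name badge": 4,
}

cvr_by_item = {
    item: content_validity_ratio(votes, panel_size)
    for item, votes in essential_votes.items()
}

# List items from strongest to weakest panel support.
for item, cvr in sorted(cvr_by_item.items(), key=lambda pair: pair[1], reverse=True):
    print(f"{item}: CVR = {cvr:+.2f}")
```

Items with a low or negative CVR are candidates for revision or removal before the tool is piloted with learners.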

References:

1. Messick S. Validity. In: Linn RL, ed. Educational Measurement. 3rd ed. New York, NY: American Council on Education and Macmillan; 1989:13-104.
2. Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments: theory and application. Am J Med. 2006;119(2): doi:10.1016/j.amjmed.2005.10.036. PMID: 16443422
3. Cook DA. When I say… validity. Med Educ. 2014;48(10):948-949. doi:10.1111/medu.12401. PMID: 25200015
4. Reid J, Stone K, Brown J, et al. The Simulation Team Assessment Tool (STAT): development, reliability and validation. Resuscitation. 2012;83(7):879-886. doi:10.1016/j.resuscitation.2011.12.012. PMID: 22198422

Additional Reading

ALiEM Education Theories Made Practice eBooks [ALiEM Library]
Downing SM. Validity: on meaningful interpretation of assessment data. Med Educ. 2003;37(9):830-837. doi:10.1046/j.1365-2923.2003.01594.x. PMID: 14506816
Zamanzadeh V, Ghahramanian A, Rassouli M, Abbaszadeh A, Alavi-Majd H, Nikanfar AR. Design and Implementation Content Validity Study: Development of an instrument for measuring Patient-Centered Communication. J Caring Sci. 2015;4(2):165-178. doi:10.15171/jcs.2015.017. PMID: 26161370
Kessler CS, Kalapurayil PS, Yudkowsky R, Schwartz A. Validity evidence for a new checklist evaluating consultations, the 5Cs model. Acad Med. 2012;87(10):1408-1412. doi:10.1097/ACM.0b013e3182677944. PMID: 22914527
Ayre C, Scally AJ. Critical values for Lawshe’s content validity ratio: revisiting the original methods of calculation. Measurement and Evaluation in Counseling and Development. 2014;47(1):79-86. doi:10.1177/0748175613513808

By Jason Woods, MD | Jun 19, 2021 | Medical Education

