This past December, the Harvard Crimson reported that the median grade at that prestigious university was an A-.1 A flood of articles followed bemoaning grade inflation at educational institutions, with a former Harvard President noting cheekily that “the most unique honor you could graduate with was none”.2 This might be alright if well-developed criterion-based instruments were used to grade the students, but given the variability in courses taught at the University and the difficulty of developing such tools, that is unlikely. That being the case, if the median is an A-, one wonders how sub-par performance must be to fail.
Like Harvard University students, medical students and residents are an exceptional bunch who have succeeded in highly competitive application processes and are expected to perform well. However, the problem of grade inflation and assessment in medical education has also been acknowledged. For example, a recent survey of US internal medicine clerkship directors found that 78% felt that it was a serious problem, while 38% had passed students on their rotations whom they thought should have failed.3 This is problematic because accurate and reliable assessment will be necessary for competency-based medical education to have a future.4 Substantial work has been done on developing and validating assessment instruments. Unfortunately, faculty frequently fail to note deficiencies in trainee performance and their assessments have poor inter-rater reliability.5 Faculty development efforts designed to improve these skills, despite their substantial costs, do not seem to be very effective.6 That’s depressing. For a broader exploration of the reasons for poor inter-rater reliability, check out the latest KeyLIME podcast (episode 59)7 and the related article.8
Adjusting assessment instruments
These problems have led to the development of various perspectives in the burgeoning field of rater cognition. Some educators focus heavily on qualitative elements while others attempt to improve the consistency of quantitative instruments using criterion-based scales and rater training. Dr. Keith Baker, the Program Director of the Massachusetts General Hospital Anesthesia residency program, presented another approach to these problems at the recent Harvard Macy Institute course: A Systems Approach to Assessment in Health Professions Education. Rather than trying to teach his faculty to provide accurate, consistent assessment, he measured how each faculty member tended to score and normalized their results accordingly. His project incorporated >14,000 evaluations over a 2-year period and was published in 2011.
Baker K. Determining resident clinical performance: getting beyond the noise. Anesthesiology. 2011 Oct;115(4):862-78. PMID: 21795965.
The assessment instrument
An assessment instrument with multiple components was developed. Each week, it was sent to each attending anesthesiologist for every resident they had worked with. The program aimed to complete 60% of the evaluations and, on average, each resident received >70 evaluations from >40 faculty during the study period. The instrument included space for free-text comments along with four quantitative components, including:
- Relative performance designations for each ACGME milestone ranging from 1 (distinctly below peer level) to 5 (distinctly above peer level).
- Anchored competency designations for each ACGME milestone ranging from 1 (needed significant attending assistance, input or correction) to 7 (expert and able to serve as a resource to fully trained anesthesiologists).
- A list of eight increasingly difficult cases (ranging from a skin biopsy in a healthy patient to a repair of a ruptured AAA in a patient with CHF and atrial fibrillation). For each, the attending was asked if they were confident that the learner could perform anesthesia independently and unsupervised. These cases were similar to entrustable professional activities that have been broadly recommended for competency-based medical education.9
The quantitative scores for each resident evaluation were normalized with a Z-score based on how that evaluator had scored residents in the past. The number of cases the attending was confident that the resident could perform was normalized in the same way.
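The per-rater normalization described above can be sketched roughly as follows. This is only an illustration and not the method from the article: the function names, the made-up sample data, and the choice of the population standard deviation are all assumptions, and for simplicity the sketch normalizes against all of a rater's scores rather than only their earlier ones.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_by_rater(evaluations):
    """Convert raw scores to Z-scores relative to each rater's own scoring.

    `evaluations` is a list of (rater, resident, raw_score) tuples.
    Hypothetical helper for illustration only.
    """
    scores_by_rater = defaultdict(list)
    for rater, _resident, score in evaluations:
        scores_by_rater[rater].append(score)

    # Assumption: use the population standard deviation of all of a rater's scores.
    rater_stats = {
        rater: (mean(scores), pstdev(scores))
        for rater, scores in scores_by_rater.items()
    }

    normalized = []
    for rater, resident, score in evaluations:
        mu, sigma = rater_stats[rater]
        # A rater who always gives the same score cannot rank residents,
        # so their evaluations are mapped to 0 here (an assumption).
        z = (score - mu) / sigma if sigma > 0 else 0.0
        normalized.append((rater, resident, z))
    return normalized


def mean_z_by_resident(normalized):
    """Average the normalized scores each resident received."""
    z_by_resident = defaultdict(list)
    for _rater, resident, z in normalized:
        z_by_resident[resident].append(z)
    return {resident: mean(zs) for resident, zs in z_by_resident.items()}


# Made-up data: one lenient and one strict rater scoring the same residents.
sample = [
    ("lenient", "A", 4), ("lenient", "B", 5), ("lenient", "C", 3),
    ("strict", "A", 2), ("strict", "B", 3), ("strict", "C", 1),
]
normalized = normalize_by_rater(sample)
resident_means = mean_z_by_resident(normalized)
```

In this made-up data, a raw 4 from the lenient rater and a raw 2 from the strict rater both become a Z-score of 0, because each sits exactly at that rater's own average.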
The article presents too much data to cover in detail here. However, some important findings included:
- Positive bias: Despite well-anchored normative scales (ranging from 1-5 with 3 considered ‘at peer level’; evaluators without a positive bias would have a mean score of 3), evaluators had a positive bias that increased over the course of the program with average scores of 3.36, 3.51, and 3.68 in years one, two, and three, respectively. The amount of bias varied by the faculty member (e.g. a score of 4 from one was easier to get than a score of 4 from another).
- Score consistency: When a faculty member scored the same resident twice, the previous evaluation predicted only 23.1% of the variance in subsequent scores. This suggests that single evaluations are inconsistent (however, as noted by Dr. Holmboe [@boedudley] in his expert peer review of this post, it is more complicated than that). In contrast, the average scores of the evaluations for each resident were remarkably consistent over time.
- Performing procedures: The evaluators’ confidence that residents could perform procedures increased throughout the residency substantially more than the relative scores. This finding would be expected if these scores measured increasing performance with additional training as they were intended to.
- Predictive power: Several outcome measures were used to demonstrate the predictive ability of the system. Tests of clinical knowledge (in-training exams) correlated weakly to moderately (r = 0.3-0.38; r² = 0.09-0.14) with scores on the instrument. Low scores on the instrument predicted referral to the Clinical Competency Committee for remediation (OR = 27).
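To put the correlation and variance figures in the findings above on a common footing, here is a small sketch of the arithmetic linking the two. It assumes a simple linear relationship with a single predictor, where the proportion of variance explained is the square of the correlation coefficient. The function names are hypothetical, and the article may have computed its figures differently.

```python
import math


def variance_explained(r):
    """Proportion of variance explained by a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1 and 1")
    return r * r


def correlation_magnitude(r_squared):
    """Magnitude of the correlation implied by a proportion of variance explained."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("r_squared must be between 0 and 1")
    return math.sqrt(r_squared)


# Figures quoted in the findings above.
exam_low = variance_explained(0.30)
exam_high = variance_explained(0.38)
repeat_eval_r = correlation_magnitude(0.231)
```

Squaring 0.30 and 0.38 gives roughly 0.09 and 0.14, which matches the reported range. Under the same assumption, the 23.1% of variance predicted by a previous evaluation corresponds to a correlation of roughly 0.48.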
Evaluation is difficult. The highlights of the system outlined by Dr. Baker are the use of multiple measurement instruments (relative, anchored, and performance-based scales), the inclusion of qualitative evaluation with each component, frequent low-stakes evaluation based on direct observation by multiple raters, and the removal of inter-rater variability and positive bias using simple mathematical principles. This article is particularly relevant to emergency medicine (EM) educators because our teaching and learning environment is similar to anesthesia’s in several ways. In both fields, attending physicians work directly with specific residents for predefined periods.
While I am aware that anchored competency assessment tools have been developed by EM residency programs, I have not read about any that incorporate normalization. The normalization method used in this article could be criticized because it assumes that the variability between evaluations is a function of consistent rater characteristics, and the study did not demonstrate the extent to which this is the case. That said, a previous study found that 67% of the variance in their online encounter cards was due to the rater.10
Questions and Google Hangout
Dr. Baker and I met to discuss his article on Google Hangouts.
I am interested in hearing your thoughts, specifically:
- As faculty, do you think this type of an assessment system would work in your context? Why or why not?
- As a resident, how would you feel about a similar assessment system being implemented in your program?