USMLE Scores – Debunking Common Myths
Lots of people have misguided notions about the true nature of USMLE scores and what exactly they represent. In my opinion, this occurs in part due to a lack of interest in understanding the logistics of the exam. Another contributing factor could be the borderline brainless, numerically zeroed-out scientific culture most exam-goers happen to be cultivated in. Many if not most of these candidates, in their naive wisdom, got into Medicine hoping to rid themselves of numerical burdens forever!
The following, I hope, will help debunk some of these common myths.
Percentile? Uh…what percentile?
This myth is, without doubt, the king of them all. It isn’t uncommon to find a candidate basking in the self-righteous glory of having scored a ’99 percent’ or, worse, a ’99 percentile’. The USMLE at one point used to provide percentile scores. That stopped sometime in the mid to late ’90s. Why? Well, the USMLE organization believed that scores were being given more weight in medics’ careers than they deserved. This test is a licensure exam, period. That has always been the motto. Among other things, when residency programs started using the exam as a yardstick to differentiate and rank students, the USMLE saw this as contrary to its primary purpose and said enough is enough. To make such rankings difficult, the USMLE no longer provides percentile scores to exam takers.
The USMLE does have an extremely detailed FAQ on what the 2-digit (which people confuse as a percentage or percentile) and 3-digit scores mean. I strongly urge all test-takers to take a hard look at it and ponder about some of the stuff said therein.
Simply put, the exam is designed to measure a candidate’s level of knowledge and report it as a 3-digit score. This 3-digit score is an unfiltered indication of an individual’s USMLE know-how that, in theory, shouldn’t be influenced by variations in the content of the exam, be it across space (another exam center and/or questions from a different content pool) or time (exam content from the future or past). This means that, provided a person’s knowledge remains constant, he or she should in theory achieve the same 3-digit score regardless of where and when he or she took the test. Or, supposedly so. The minimum 3-digit score required to ‘pass’ the exam is revised on an annual basis to preserve this space-time independent nature of the score. For the last couple of years, the passing score has hovered around 185. A ‘pass’ score makes you eligible to apply for a license.
What then is the 2-digit score? For god knows what reason, the Federation of State Medical Boards (the people who license medics in the US based on their USMLE scores) has a 2-digit format for a ‘pass’ score on the USMLE exam. Unlike the 3-digit passing score, this passing score is fixed at 75 and isn’t revised every year.
How does one convert a 3-digit score to a 2-digit score? The exact conversion algorithm hasn’t been disclosed (among lots of other things). But for matters of simplicity, I’m going to use a very crude approach to illustrate:
Equate the passing 3-digit score to 75. So if the passing 3-digit score is 180, then 180 = 75. 185 = 80, 190 = 85 … and so on.
I’m sure the relationship isn’t linear as shown above. For one, by very definition, a 2-digit score ends at 99. 100 is a 3-digit number! So let’s see what happens with our example above:
190 = 85, 195 = 90, 204 = 99. We’ve reached the 2-digit limit at this point. Any score higher than 204 will also be equated to 99. It doesn’t matter if you scored a 240 or 260 on the 3-digit scale. You immediately fall under the 99 bracket along with the lesser folk!
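To make the capping effect concrete, here is a minimal Python sketch of the crude rule above. It is not the real conversion, which hasn't been disclosed; the function name `crude_two_digit`, the one-point-for-one-point mapping and the passing score of 180 are all assumptions made purely for illustration.

```python
def crude_two_digit(three_digit: int, passing: int = 180) -> int:
    """Crude, hypothetical conversion: the passing score maps to 75,
    each extra 3-digit point adds one 2-digit point, capped at 99.
    The real USMLE conversion is undisclosed and surely differs."""
    return min(99, 75 + (three_digit - passing))


for score in (180, 185, 190, 240, 260):
    print(score, "->", crude_two_digit(score))
```

Under these assumptions, 240 and 260 both come out as 99, which is exactly why the 2-digit scale can't separate high scorers.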
These distortions and constraints make the 2-digit score an unjust system to rank test-takers and today, most residency programs use the 3-digit score to compare people. Because the 3-digit to 2-digit scale conversion changes every year, it makes sense to stick to the 3-digit scale which makes comparisons between old-timers and new-timers possible, besides the obvious advantage in helping comparisons between candidates who deal/dealt with different exam content.
Making Assumptions And Approximate Guesses
The USMLE does provide Means and Standard Deviations on students’ score cards. But these statistics don’t strictly apply to them because they are derived from different test populations. The score card specifically mentions that these statistics are “for recent” instances of the test.
Each instance of an exam is directed at a group of people that forms its test population. Each population has its own characteristics, such as whether or not it’s governed by Gaussian statistics, whether there is skew or kurtosis in its distribution, etc. The summary statistics, such as the mean and standard deviation, will also vary between different test populations. So unless you know the exact summary statistics and the nature of the distribution that describes the test population from which a candidate comes, you can’t possibly assign him/her a percentile rank. And because Joe and Jane can be from two entirely different test populations, percentiles in the end don’t carry much meaning. It’s that simple, folks.
You could, however, make assumptions and draw arbitrary conclusions about percentile ranks. Say, for argument’s sake, that all populations have a mean equal to 220 and a standard deviation equal to 20 and conform to Gaussian statistics. Then a 3-digit score of:
220 = 50th percentile
220 + 20 = 84th percentile
220 + 20 + 20 = 97th percentile
[Going back to our ’99 percentile’ myth and with the specific example we used, don’t you see how a score equal to 260 (with its 2-digit 99 equivalent) still doesn’t reach the 99 percentile? It’s amazing how severely people can delude themselves. A 99 percentile rank is no joke and I find it particularly fascinating to observe how hundreds of thousands of people ludicrously claim to have reached this magic rank with a 2-digit 99 score. I mean, doesn’t the sheer commonality hint that something in their thinking is off?]
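If you'd rather check these numbers yourself, here is a small Python sketch using the standard library's `statistics.NormalDist`. The mean of 220 and standard deviation of 20 are the made-up values from the argument above, not real USMLE statistics, and the helper name `percentile` is my own.

```python
from statistics import NormalDist

# Assumed population: Gaussian with mean 220 and SD 20 (illustrative only).
assumed = NormalDist(mu=220, sigma=20)


def percentile(score: float) -> float:
    """Percentile rank of a score under the assumed Gaussian."""
    return 100 * assumed.cdf(score)


for score in (220, 240, 260):
    print(score, round(percentile(score), 1))
```

Under these assumptions a 260 lands a little below the 98th percentile, still short of the 99th.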
This calculator makes it easy to calculate a percentile based on known Mean and Standard Deviations for Gaussian distributions. Just enter the values for Mean and Standard Deviation on the left, and in the ‘Probability’ field enter a percentile value in decimal form (97th percentile corresponds to 0.97 and so forth). Hit the ‘Compute x’ button and you will be given the corresponding value of ‘x’.
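The same "compute x from a probability" step can be sketched in Python with the standard library's `NormalDist.inv_cdf`. Again, the mean of 220 and standard deviation of 20 are assumed values for illustration, and `score_at` is a hypothetical helper name.

```python
from statistics import NormalDist

# Assumed Gaussian population (illustrative values, not official statistics).
assumed = NormalDist(mu=220, sigma=20)


def score_at(probability: float) -> float:
    """3-digit score sitting at the given percentile (in decimal form)."""
    return assumed.inv_cdf(probability)


for p in (0.97, 0.99):
    print(p, round(score_at(p), 1))
```

With these assumed values, the 99th percentile sits somewhere past 266, well above the point where the crude 2-digit scale has already maxed out.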
99th Percentile Ain’t Cake
Another point of note about a Gaussian distribution:
The distance from the 1st percentile to the 25th percentile is equal to the distance between the 75th and the 99th percentile (strictly speaking, the 0th and 100th percentiles of a Gaussian lie infinitely far out, so I’m using the 1st and 99th instead). Let’s say this distance is x. The distance between the 25th percentile and the 50th percentile is also equal to the distance between the 50th percentile and the 75th percentile. Let’s say this distance is y.
It so happens that x is much larger than y (roughly two and a half times, and it keeps growing the further out into the tails you go). In a crude sense, this means that it is disproportionately tougher for you to score extreme values than to stay closer to the mean. Going from a 50th percentile baseline, scoring a 99th percentile is disproportionately tougher than scoring a 75th percentile. If you aim to score a 99 percentile, you’re gonna have to seriously sweat it out!
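Here is a rough Python sketch of that asymmetry, comparing the score gap from the 50th to the 75th percentile with the gap from the 75th to the 99th. The Gaussian with mean 220 and SD 20 is once more an assumption for illustration only.

```python
from statistics import NormalDist

# Assumed Gaussian population (illustrative values only).
assumed = NormalDist(mu=220, sigma=20)

# Scores at the 50th, 75th and 99th percentiles.
p50, p75, p99 = (assumed.inv_cdf(p) for p in (0.50, 0.75, 0.99))

gap_50_to_75 = p75 - p50  # points needed to climb the first 25 percentiles
gap_75_to_99 = p99 - p75  # points needed to climb the next 24 percentiles

print(round(gap_50_to_75, 1), round(gap_75_to_99, 1))
```

Under these assumptions, the first climb costs roughly 13.5 points while the second costs about 33, even though it covers fewer percentiles.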
It’s the interval, stupid
Say there are infinitely many clones of you in this world and you’re all like the Borg. Each of you is mentally indistinguishable from the other – possessing identical copies of USMLE know-how. Say that each of you took the USMLE and we then plotted the frequencies of these scores on a graph. We’re going to end up with a Gaussian curve depicting this sample of clones, with its own mean score and standard deviation. This process is called ‘parametric sampling’ and the distribution obtained is called a ‘sampling distribution’.
The idea behind what we just did is to determine the variation that we would expect in scores even if knowhow remained constant – either due to a flaw in the test or by random chance.
The standard deviation of a sampling distribution is also called ‘standard error’. As you’ll probably learn during your USMLE preparation, knowing the standard error helps calculate what are called ‘confidence intervals’.
A confidence interval for a given score can be calculated as follows (using the Z-statistic):-
True score = Measured score +/- 1.96 (standard error of measurement) … for 95% confidence
True score = Measured score +/- 2.58 (standard error of measurement) … for 99% confidence
For many recent tests, the standard error for the 3-digit scale has been 6 [every score card quotes a certain SEM (Standard Error of Measurement) for the 3-digit scale]. This means that given a measured score of 240, we can be 95% certain that the true value of your performance lies between a low of 240 – 1.96 (6) ≈ 228 and a high of 240 + 1.96 (6) ≈ 252. Similarly, we can say with 99% confidence that the true score lies between 240 – 2.58 (6) ≈ 225 and 240 + 2.58 (6) ≈ 255. Keep in mind that under the Gaussian error model used here, these intervals are not probabilistically flat: true score values near the measured score are more plausible than values near the edges of the interval.
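The arithmetic is easy to script. A minimal Python sketch, assuming the SEM of 6 quoted above and the usual Z multipliers (the function name `confidence_interval` is my own):

```python
def confidence_interval(measured: float, sem: float = 6.0, z: float = 1.96):
    """Return (low, high) for measured +/- z * SEM.
    z = 1.96 gives a 95% interval, z = 2.58 a 99% interval."""
    margin = z * sem
    return measured - margin, measured + margin


low95, high95 = confidence_interval(240)
low99, high99 = confidence_interval(240, z=2.58)
print(round(low95, 2), round(high95, 2))
print(round(low99, 2), round(high99, 2))
```

For a measured 240, this gives roughly 228 to 252 at 95% confidence and roughly 225 to 255 at 99%.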
What this means is that, when you compare two individuals and see their scores side by side, you ought to consider what’s going on with their respective confidence intervals. Do they overlap? Even a nanometer of overlapping between CIs makes the two, statistically speaking, indistinguishable, even if in reality there is a difference. As far as the test is concerned, when two CIs overlap, the test failed to detect any difference between these two individuals (some statisticians disagree. How to interpret statistical significance when two or more CIs overlap is still a matter of debate! I’ve used the view of the authors of the Kaplan lecture notes here). Capiche?
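A quick way to apply this overlap rule is sketched below in Python. It assumes an SEM of 6 and 95% intervals, and it follows the strict "any overlap means indistinguishable" view described above, which, as noted, not every statistician shares. The helper names `ci` and `intervals_overlap` are hypothetical.

```python
def ci(score: float, sem: float = 6.0, z: float = 1.96):
    """95% confidence interval (by default) around a measured score."""
    margin = z * sem
    return score - margin, score + margin


def intervals_overlap(a, b) -> bool:
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


print(intervals_overlap(ci(240), ci(250)))  # 10 points apart
print(intervals_overlap(ci(240), ci(265)))  # 25 points apart
```

With these assumptions, two scores need to be more than 2 × 1.96 × 6 ≈ 23.5 points apart before their 95% intervals stop touching.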
Beating competitors by intervals rather than pinpoint scores is a good idea to make sure you really did do better than them. The wider the distance separating two CIs, the larger is the difference between them.
There’s a special scenario that we need to think about here. What about the poor fellow who just missed the passing mark? For a passing mark of 180, what of the guy who scored, say 175? Given a standard error of 6, his 95% CI definitely does include 180 and there is no statistically significant (using a 5% margin of doubt) difference between him and another guy who scored just above 180. Yet this guy failed while the other passed! How do we account for this? I’ve been wondering about it and I think that perhaps, the pinpoint cutoffs for passing used by the USMLE exist as a matter of practicality. Using intervals to decide passing/failing results might be tedious, and maybe scientific endeavor ends at this point. Anyhow, I leave this question out in the void with the hope that it sparks discussions and clarifications.
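For the curious, the near-miss case can be checked with a few lines of Python. The passing mark of 180 and the SEM of 6 are the example values used above, and `interval_contains` is just an illustrative helper name.

```python
def interval_contains(score: float, target: float,
                      sem: float = 6.0, z: float = 1.96) -> bool:
    """True if the confidence interval around `score` includes `target`."""
    margin = z * sem
    return score - margin <= target <= score + margin


# Does the 95% CI around a failing 175 include the passing mark of 180?
print(interval_contains(175, 180))
# And what about a 165?
print(interval_contains(165, 180))
```

Under these assumptions, the interval around 175 runs from about 163 to 187 and so includes 180, while the interval around 165 stops short of it.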
If you care to give it a thought, the graphical subject-wise profile bands on the score card are actually confidence intervals (95%, 99% ?? I don’t know). This is why the score card clearly states that if any two subject-wise profile bands overlap, performance in these subjects should be deemed equal.
I hope you’ve found this post interesting, if not useful. Please feel free to leave behind your valuable suggestions, corrections, remarks or comments. Anything!
Copyright © 2006 – 2008 Firas MR. All rights reserved.