Archive for the ‘Medical Statistics’ Category
Let’s Face It, We Are Numskulls At Math!
Noted mathematician Timothy Gowers talks about the importance of math
I’ve often written about Mathematics before^{Footnotes}. As much as math helps us better understand our world (Modern Medicine’s recent strides have a lot to do with applied math, for example), it also tells us how severely limited our everyday thinking is.
Humans, and yes some animals too, are born with or soon develop an innate ability for understanding numbers. Yet, just like animals, our proficiency with numbers seems to stop short of the stuff that goes beyond our immediate activities of daily living (ADL) and survival. Because we are a higher form of being (or allegedly so, depending on your point of view), our ADLs are a lot more sophisticated than, say, those of canaries or hamsters. Consequently you can expect to see a little more refined arithmetic being used by us. But fundamentally, we share this important trait – being able to work with numbers from an early stage. A man with a family knows almost by instinct that having two kids to look after means breakfast, lunch and dinner times two in terms of putting food on the table. He has to buy two sets of clothes for his kids. A kid soon learns that he has two parents. And so on. It’s almost natural. And when someone can’t manage simple counting or arithmetic, we know that something might be wrong. In Medicine, we have a term for this: acalculia, which often indicates the presence of a neuropsychiatric disorder.
It’s easy for ‘normal’ people to do 2 + 2 in their heads. Two oranges AND two oranges make a TOTAL of four oranges. This basic stuff helps us get by day-to-day. But how many people can wrap their heads around 1 divided by 0? If you went to school, sure, your teachers probably hammered an answer into you: infinity (strictly speaking, division by zero is undefined – it’s the limit of 1/x as x approaches 0 from the right that grows without bound). But how do you visualize it? Yes, I know it’s possible. But it takes unusual work. I think you can see my point, even with this simple example. We haven’t even begun to speak about probability, wave functions, symmetries, infinite kinds of infinities, multiple space dimensions, time’s arrow, quantum mechanics, the Higgs field or any of that stuff yet!
As a species, we quite obviously aren’t good at math. It’s almost as if we construct our views of the universe through a tunneled vision that helps us in our day-to-day tasks, but fails otherwise.
We tend to think of using math as an ability when really it should be thought of as a sensory organ – something as vital to understanding our surroundings as our eyes, ears, noses, tongues and skin. And despite lacking this sense, we tend to go about living as though we somehow understand everything. That we are aware of what it is to be aware of. This can often lead to trouble down the road. I’ve talked before about numerous PhDs failing at the Monty Hall Paradox. But a recent talk I watched touched upon something with serious consequences: people being wrongfully convicted because of stunted interpretations of DNA evidence, fingerprint evidence, etc. by none other than “expert” witnesses. In other words, serious life-and-death issues. So much for our expertise as a species, eh?!
How the human mind struggles with math!
We recently also learned that the hullabaloo over the H1N1 pandemic influenza had a lot to do with our naive understanding of math, the pitfalls of corporate-driven public-interest research notwithstanding.
Anyhow, one of my main feelings is that honing one’s math not only helps us survive better, but can also teach us about our place in the universe, because we can then begin to fully use it as a sensory organ in its own right. This is why a lot of pure scientists have argued that doing math for math’s own sake can not only be great fun (if done the right way, of course :P) but should also be considered necessary: such research has the potential to reveal entirely new vistas that can enchant us and surprise us at the same time (take Cantor’s work on infinity, for example). For in the end, discovery, really, is far more enthralling than invention.
UPDATE 1: Check out the Khan Academy for a virtually A-to-Z education on math — and all of it for free! This is an especially great resource for those of us who can’t even recall the principles of addition, subtraction, etc., let alone calculus or any of the more advanced stuff.
Copyright © Firas MR. All rights reserved.
# Footnotes:
The Doctor’s Apparent Ineptitude
As a fun project, I’ve decided to frame this post as an abstract.
AIMS/OBJECTIVES:
To elucidate factors influencing perceived incompetence on the part of the doctor by the layman/patient/patient’s caregiver.
MATERIALS & METHODS:
Armchair pontification and a little gedankenexperiment based on prior experience with patients as a medical trainee.
RESULTS:
Preliminary analyses indicate widespread suspicion among patients about the ineptitude of doctors, no matter what the level of training. This is amply demonstrated in the following figure:
As one can see, perceived ineptitude forms a wide spectrum – from most severe (med student) to least severe (attending). The underlying perceptions of incompetence do not seem to abate at any level however, and eyewitness testimonies include phrases such as ‘all doctors are inept; some more so than others’. At the med student level, exhausted patients find their anxious questions being greeted with a variety of responses ranging from the dumb ‘I don’t know’, to the dumber ‘well, I’m not the attending’, to the dumbest ‘uhh…mmmm..hmmm <eyes glazed over, pupils dilated>’. Escape routes will be meticulously planned in advance both by patients and more importantly by med students to avert catastrophe.
As for more senior medics such as attendings, evasion seems to be just a matter of hiding behind statistics. A gedankenexperiment was conducted to demonstrate this. The settings were two patients A and B, undergoing a certain surgical procedure and their respective caregivers, CA and CB.
Patient A
Consent & Preop
CA: (anxious) Hey doc, ya think he’s gonna make it?
Doc: It’s difficult to say and I don’t know that at the moment. There are studies indicating that 95% live and 5% die during the procedure though.
CA: ohhh kay (slightly confused) (murmuring)…’All this stuff about knowing medicine. What does he know? One simple question and he gives me this? What the heck has this guy spent all these years studying for?!’
Postop & Recovery
CA: Ah, I just heard! He made it! Thank you doctor!
Doc: You’re welcome (smug, god complex)! See, I told ya 95% live. There was no reason for you to worry!
CA: (sarcastic murmur) ‘Yeah, right. Let him go through the pain of not knowing and he’ll see. Look at him, so full of himself – as if he did something special; luck was on our side anyway. Heights of incompetence!’
Patient B
Consent & Preop
CB: (anxious) Hey doc, ya think he’s gonna make it?
Doc: It’s difficult to say and I don’t know that at the moment. There are studies indicating that 95% live and 5% die during the procedure though.
CB: ohhh kay (slightly confused) (murmuring)…’All this stuff about knowing medicine. What does he know? One simple question and he gives me this? What the heck has this guy spent all these years studying for?!’
Postop & Recovery
CB: (angry, shouting numerous expletives) What?! He died on the table?!
Doc: Well, I did mention that there was a 5% death rate.
CB: (angry, shouting numerous expletives).. You (more expletives) incompetent quack! (murmuring) “How convenient! A lawsuit should fix him for good!”
The Doctor’s Coping Strategy
Although numerous psychology models can be applied to understand physician behavior, the Freudian model reveals some interesting material. Common defense strategies that help doctors include:
Isolation of affect: e.g. Resident tells Fellow, “you know that patient with the …well, she had a massive MI and went into VFib..died despite ACLS..poor soul…so hey, I hear they’re serving pizza today at the conference…(the conference about commercializing healthcare and increasing physician pay grades for ‘a better and healthier tomorrow’)”
Intellectualization: eg. Attending tells Fellow, “so you understand why that particular patient bled to death? Yeah it was DIC in the setting of septic shock….plus he had a prior MI with an Ejection Fraction of 33% so there was that component as well..but we couldn’t really figure out why the antibiotics didn’t work as expected…ID gave clearance….(ad infinitum)…so let’s present this at our M&M conference this week..”
Displacement: e.g. Caregiver yells at Fellow, “<expletives>”. Fellow yells at intern, “You knew that this was a case that I had a special interest in and yet you didn’t bother to page me? Unacceptable!…” Intern then yells at med student, “Go <expletives> disimpact Mr. X’s bowels…if I don’t see that done within the next 15 minutes, you’re in for a class! Go go go…clock’s ticking…tck tck tck!”
We believe there are other coping mechanisms that are important too, but in our observations these appear to be the most common. Of the uncommon ones, we think med students as a group in particular, are the most vulnerable to Regression & Dissociation, duly accounting for confounding factors.
All of these form a systematic ego-syntonic pattern of behavior which, for reasons we are still exploring, is not included in the DSM-IV manual’s section on Personality Disorders.
CONCLUSIONS:
Patients and their caregivers seem to think that ALL doctors are fundamentally inept, period. Ineptitude follows a wide spectrum, however – ranging from the bizarre to the mundane. Further studies (including but not limited to armchair pontification) need to be carried out to corroborate these startling results and the factors that we have reported. Other studies need to elucidate remedial measures that can be employed to save the doctor-patient relationship.
–
NOTE: I wrote this piece as a reminder of how the doctor-patient relationship is experienced from the patient’s side. In our business-as-usual frenzy, we as medics often don’t think about these things. And these things often DO matter a LOT to our patients!
–
Copyright © Firas MR. All rights reserved.
USMLE – Designing The Ultimate Questions
There are strategies that examiners can employ to frame questions designed to stump you on an exam such as the USMLE. Many of these strategies are listed in the Kaplan Qbook and I’m sure this stuff will be familiar to many. My favorite techniques are the ‘multi-step’ and the ‘bait-and-switch’.
The Multi-Step
Drawing on principles of probability theory, examiners will often frame questions that require you to know multiple facts and concepts to get the answer right. As a crude example:
“This inherited disease exclusive to females is associated with acquired microcephaly and the medical management includes __________________.”
Such a question would be reframed as a clinical scenario (an outpatient visit) with other relevant clinical data such as a pedigree chart. To get the answer right, you would need:
 Knowledge of how to interpret pedigree charts and identify that the disease manifests exclusively in females.
 Knowledge of Mendelian inheritance patterns of genetic diseases.
 Knowledge of conditions that might be associated with acquired microcephaly.
 Knowledge of medical management options for such patients.
Now, taken individually, each of these steps – 1, 2, 3 and 4 – has a probability of 50% that you could get it right purely by random guessing. Combined together however, which is what is necessary to get the answer, the probability would be 50% * 50% * 50% * 50% = 6.25% [the combined probability of independent events]. So now you know why they actually prefer multi-step questions over one- or two-liners! :) Notice that this doesn’t necessarily have anything to do with testing your intelligence, as some might think. It’s just being able to recollect hard facts and then being able to put them together. They aren’t asking you to prove a math theorem or calculate the trajectory of a space satellite :P !
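If you want to check the arithmetic yourself, the combined probability is just the product of the per-step probabilities (using the same 50%-per-step simplification as above, which is of course a crude stand-in for real recall rates):

```python
# Chance of answering a multi-step question correctly by guesswork alone,
# assuming each of the four steps is an independent 50/50 guess
# (the simplification used in the text, not real USMLE data).
p_step = 0.5
steps = 4
p_correct = p_step ** steps  # independent events multiply
print(f"{p_correct:.2%}")    # 6.25%
```

Doubling the number of steps to eight would drop the guess-through rate to about 0.4%, which is exactly why longer chains of reasoning make questions harder to luck into.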
The Bait-and-Switch
Another strategy is to stuff the question chock-full of irrelevant data. You could have paragraph after paragraph describing demographic characteristics, anthropometric data, and ‘bait’ data that’s planted there to persuade you to think along certain lines; and as you grind yourself to ponder over these things, you are suddenly presented with an entirely unrelated sentence at the very end, asking a completely unrelated question! Imagine being presented with the multi-step question above with one added fly in the ointment. As you finally finish the half-page-length question, it ends with ‘<insert similar disease> is associated with the loss of this enzyme and/or body part: _______________’. Very tricky!

Questions like these give flashbacks and déjà vu of days from 2nd-year med school, when that patient with a neck lump begins by giving you his demographic and occupational history. As an inexperienced med student you immediately begin thinking: ‘hmmm..okay, could the lump be related to his occupation? …hmm…’. But wait! You haven’t even finished the physical exam yet, let alone the investigations. As medics progress along their careers they tend to phase out this kind of analysis in favor of more refined ‘heuristics’, as Harrison’s puts it. A senior medic will often wait to formulate opinions until the investigations are done, focusing on triaging problems and asking whether management options are going to change them. The keyword here is ‘triage’. Just as a patient’s clinical information in a real office visit is filled with much irrelevant data, so too are many USMLE questions. That’s not to say that demographic data, etc. are irrelevant under all conditions. Certainly, an occupational history of being employed at an asbestos factory would be relevant in a case that looks like a respiratory disorder.
If the case looks like a respiratory disorder, but the question mentions an occupational history of being employed as an office clerk, then this is less likely to be relevant to the case. Similarly if it’s a case that overwhelmingly looks like an acute abdomen, then a stray symptom of foot pain is less likely to be relevant. Get my point? That is why many recommend reading the last sentence or two of a USMLE question before reading the entire thing. It helps you establish what exactly is the main problem that needs to be addressed.
Hope readers have found the above discussion interesting :). Adios for now!
–
Copyright © Firas MR. All rights reserved.
Decision Tree Questions In Genetics And The USMLE
Just a quick thought. It just occurred to me that some of the questions on the USMLE involving pedigree analysis in genetics are actually typical decision tree questions. The probability that a certain individual, A, has a given disease (e.g. Huntington’s disease) purely by random chance is simply the disease’s prevalence in the general population. But what if you considered the following questions:
 How much genetic code do A and B share if they are third cousins?
 If you suddenly knew that B has Huntington’s disease, what is the new probability for A?
 What is the disease probability for A’s children, given how much genetic code they share with B?
When I’d initially written about decision trees, it did not at all occur to me at the time how this stuff was so familiar to me already!
Apply a little Bayesian strategy to these questions and your mind is suddenly filled with all kinds of probability questions ripe for decision tree analysis:
 If the genetic test I utilize to detect Huntington’s disease has a false-positive rate x and a false-negative rate y, now what is the probability for A?
 If the pre-test likelihood is m and the post-test likelihood is n, now what is the probability for A?
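To make the first of those questions concrete, here’s a small Python sketch of the Bayesian update involved. Every number below (the pre-test probability, x and y) is an invented placeholder, purely for illustration:

```python
# Hypothetical Bayesian update: P(disease | positive test) for individual A,
# given a false-positive rate x, a false-negative rate y, and a pre-test
# probability from the pedigree. All figures below are made up.
pretest = 0.25          # assumed prior risk for A from the pedigree
x = 0.02                # false-positive rate: P(test + | no disease)
y = 0.05                # false-negative rate: P(test - | disease)
sensitivity = 1 - y     # P(test + | disease)

# Total probability of a positive test, then Bayes' theorem
p_positive = sensitivity * pretest + x * (1 - pretest)
posttest = sensitivity * pretest / p_positive
print(f"post-test probability for A: {posttest:.3f}")
```

Note how a modestly elevated prior combined with a specific test pushes the post-test probability up dramatically; this is the same branch-multiplication a decision tree encodes graphically.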
I find it truly amazing how so many geneticists and genetic counselors accomplish such complex calculations using decision trees without even realizing it! Don’t you :) ?
Copyright © Firas MR. All rights reserved.
Why Equivalence Studies Are So Fascinating
Objectives and talking points:
 To recap basic concepts of hypothesis testing in scientific experiments. Readers should read up on hypothesis testing in reference works.
 To contrast drug vs. placebo and drug vs. standard drug study designs.
 To contrast nonequivalence and equivalence studies.
 To understand implications of these study designs, in terms of interpreting study results.
——————————————————————————————————–
Howdy readers! Today I’m going to share with you some very interesting concepts from a fabulous book that I finished recently – “Designing Clinical Research – An Epidemiologic Approach” by Stephen Hulley et al. The book speaks fairly early on about what are called “equivalence studies”. Equivalence studies are truly fascinating. Let’s see how.
When a new drug is tested for efficacy, there are multiple ways for us to do so.
A Nonequivalence Study Of Drug vs. Placebo
A drug can be compared to something that doesn’t have any treatment effect whatsoever – a ‘placebo’. Examples of placebos include sugar tablets, distilled water, inert substances, etc. Because pharmaceutical companies try hard to make drugs that have a treatment effect and that are thus different from placebos, the objective of such a comparison is to answer the following question:
Is the new drug any different from the placebo?
Note the emphasis on ‘any different’. As is usually the case, a study of this kind is designed to test for differences between drug and placebo effects in both directions^{1}. That is:
Is the new drug better than the placebo?
OR
Is the new drug worse than the placebo?
The boolean operator ‘OR’, is key here.
Since we cannot conduct such an experiment on all people in the target ‘population’ (e.g. all people with diabetes in the whole country), we conduct it on a random and representative ‘sample’ of this population (e.g. randomly selected diabetes patients from the whole country). Because of this, we cannot directly extrapolate our findings to the target population without doing some fancy roundabout thinking and a lot of voodoo first – a.k.a. ‘hypothesis testing’. Hypothesis testing is crucial to take into account random chance (error) effects that might have crept into the experiment.
In this experiment:
 The null hypothesis is that the drug and the placebo DO NOT differ in the real world^{2}.
 The alternative hypothesis is that the drug and the placebo DO differ in the real world.
So off we go, with our experiment with an understanding that our results might be influenced by random chance (error) effects. Say that, before we start, we take the following error rates to be acceptable:
 Even if the null hypothesis is true in the real world, we would find that the drug and the placebo DO NOT differ 95% of the time, purely by random chance. [Although this rate doesn't have a name, it is equal to (1 − Type 1 error).]
 Even if the null hypothesis is true in the real world, we would find that the drug and the placebo DO differ 5% of the time, purely by random chance. [This rate is also called our Type 1 error, or critical level of significance, or critical α level, or critical 'p' value.]
 Even if the alternative hypothesis is true in the real world, we would find that the drug and the placebo DO differ only 80% of the time, purely by random chance. [This rate is also called the 'Power' of the experiment. It is equal to (1 − Type 2 error).]
 Even if the alternative hypothesis is true in the real world, we would find that the drug and the placebo DO NOT differ 20% of the time, purely by random chance. [This rate is also called our Type 2 error.]
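These error rates aren't just bookkeeping; you can watch the Type 1 error rate emerge from a quick simulation (my own illustration, not from Hulley's book). Below, the drug truly has no effect, yet a simple two-sample z-test still "finds" a difference about 5% of the time:

```python
import random
from statistics import NormalDist

random.seed(0)
norm = NormalDist()

def two_sample_z_p(a, b):
    """Two-tailed p-value for a difference in sample means, assuming
    both samples come from unit-variance normals (equal group sizes)."""
    n = len(a)
    diff = sum(a) / n - sum(b) / n
    z = diff / (2 / n) ** 0.5
    return 2 * (1 - norm.cdf(abs(z)))

n, trials, alpha = 50, 2000, 0.05
false_positives = 0
for _ in range(trials):
    placebo = [random.gauss(0, 1) for _ in range(n)]
    drug = [random.gauss(0, 1) for _ in range(n)]  # null is true: no effect
    if two_sample_z_p(drug, placebo) < alpha:
        false_positives += 1

print(f"Type 1 error rate: {false_positives / trials:.3f}")
```

Over 2000 simulated experiments the rejection rate hovers near 0.05, which is exactly the critical α we agreed to tolerate up front.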
The strategy of the experiment is this:
If we are able to accept these error rates and show in our experiment that the null hypothesis is false (that is, ‘reject’ it), the only other hypothesis left on the table is the alternative hypothesis. That hypothesis has then GOT to be true, and we thus ‘accept’ the alternative hypothesis.
Q: With what degree of uncertainty?
A: With the uncertainty that we might arrive at such a conclusion 5% of the time, even if the null hypothesis is true in the real world.
Q: In English please!
A: With the uncertainty that we might arrive at a conclusion that the drug DOES differ from the placebo 5% of the time, even if the drug DOES NOT differ from the placebo in the real world.
Our next question would be:
Q: How do we reject the null hypothesis?
A: We proceed by initially assuming that the null hypothesis is true in the real world (i.e. the drug’s effect DOES NOT differ from the placebo’s effect in the real world). We then use a ‘test of statistical significance’ to calculate the probability of observing a difference in treatment effect as large as or larger than the one actually observed in the experiment, assuming the null hypothesis holds. If this probability is <5%, we reject the null hypothesis. We do this with the belief that such a conclusion is within our preselected margin of error. Our preselected margin of error, as mentioned previously, is that we would be wrong about rejecting the null hypothesis 5% of the time (our Type 1 error rate)^{3}.
If we fail to show that this calculated probability is <5%, we ‘fail to reject’ the null hypothesis and conclude that a difference in effect has not been proven^{4}.
A lot of scientific literature out there is riddled with drug vs. placebo studies. This kind of thing is good if we do not already have an effective drug for our needs. Usually though, we already have a standard drug that we know works well. It is of more interest to see how a new drug compares to our standard drug.
A Nonequivalence Study Of Drug vs. Standard Drug
These studies are conceptually the same as drug vs. placebo studies and the same reasoning for inference is applied. These studies ask the following question:
Is the new drug any different from the standard drug?
Note the emphasis on ‘any different’. As is often the case, a study of this kind is designed to test the difference between the two drugs in both directions^{1}. That is:
Is the new drug better than the standard drug?
OR
Is the new drug worse than the standard drug?
Again, the boolean operator ‘OR’, is key here.
In this kind of experiment:
 The null hypothesis is that the new drug and the standard drug DO NOT differ in the real world^{2}.
 The alternative hypothesis is that the new drug and the standard drug DO differ in the real world.
Exactly as we discussed before, we initially assume that the null hypothesis is true in the real world (i.e. the new drug’s effect DOES NOT differ from the standard drug’s effect in the real world). We then use a ‘test of statistical significance’ to calculate the probability of observing a difference in treatment effect as large as or larger than the one actually observed in the experiment, assuming the null hypothesis holds. If this probability is <5%, we reject the null hypothesis – with the belief that such a conclusion is within our preselected margin of error. Just to repeat ourselves here: our preselected margin of error is that we would be wrong about rejecting the null hypothesis 5% of the time (our Type 1 error rate)^{3}.
If we fail to show that this calculated probability is <5%, we ‘fail to reject’ the null hypothesis and conclude that a difference in effect has not been proven^{4}.
An Equivalence Study Of Drug vs. Standard Drug
Sometimes all you want is a drug that is as good as the standard drug. This can be for various reasons: the standard drug is just too expensive, just too difficult to manufacture, just too difficult to administer, and so on. The new drug might not have these undesirable qualities, yet retain the same treatment effect.
In an equivalence study, the incentive is to prove that the two drugs are the same. Like we did before, let’s explicitly formulate our two hypotheses:
 The null hypothesis is that the new drug and the standard drug DO NOT differ in the real world^{2}.
 The alternative hypothesis is that the new drug and the standard drug DO differ in the real world.
We are mainly interested in proving the null hypothesis. Since this can’t be done^{4}, we’ll be content with ‘failing to reject’ the null hypothesis. Our strategy is to design a study powerful enough to detect a difference close to 0, and then ‘fail to reject’ the null hypothesis. In doing so, although we can’t ‘prove’ for sure that the null hypothesis is true, we can nevertheless be more comfortable saying that it in fact is true.
In order to detect a difference close to 0, we have to increase the Power of the study from the usual 80% to something like 95% or higher. We want to maximize power so as to detect the smallest difference possible. Usually though, it’s enough if we are able to detect the largest difference that doesn’t have clinical meaning (e.g. a difference of 4 mmHg in a BP measurement). This way we can compromise a little on Power and choose a less extreme figure, say 88% or so.
And then, just as in our previous examples, we proceed with the assumption that the null hypothesis is true in the real world. We then use a ‘test of statistical significance’ to calculate the probability of observing a difference in treatment effect as large as or larger than the one actually observed in the experiment, assuming the null hypothesis holds. If this probability is <5%, we reject the null hypothesis – with the belief that such a conclusion is within our preselected margin of error. And to repeat ourselves yet again (boy, do we like doing this :P ), our preselected margin of error is that we would be wrong about rejecting the null hypothesis 5% of the time (our Type 1 error rate)^{3}.
If we fail to show that this calculated probability is <5%, we ‘fail to reject’ the null hypothesis and conclude that although a difference in effect has not been proven, we can be reasonably comfortable saying that there is in fact no difference in effect.
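As a back-of-the-envelope illustration of why chasing smaller differences at higher power inflates the required sample, here is the standard normal-approximation sample-size formula for comparing two means. This is a textbook approximation, not a calculation from Hulley's book; the 10 mmHg standard deviation is an invented number, and the 4 mmHg margin simply echoes the BP example above:

```python
from statistics import NormalDist

norm = NormalDist()

def n_per_group(delta, sigma, alpha=0.05, power=0.95):
    """Approximate subjects per group needed to detect a true mean
    difference `delta`, with two-tailed significance level `alpha`,
    the given power, and outcome standard deviation `sigma`.
    Uses the normal-approximation formula n = 2*((z_a + z_b)*sigma/delta)^2."""
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_beta = norm.inv_cdf(power)
    return 2 * ((z_alpha + z_beta) * sigma / delta) ** 2

# Detecting a 4 mmHg BP difference (SD 10 mmHg) at 95% vs. the usual 80% power:
print(round(n_per_group(4, 10, power=0.95)))  # roughly 162 per group
print(round(n_per_group(4, 10, power=0.80)))  # roughly 98 per group
```

Pushing power from 80% to 95% costs about two-thirds more subjects per group here, and shrinking the margin delta would inflate the sample quadratically, which is exactly the design pressure equivalence studies face.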
So Where Are The Gotchas?
If your study isn’t designed or conducted properly (e.g. without enough power, inadequate sample size, improper randomization, loss of subjects to follow-up, inaccurate measurements, etc.), you might end up ‘failing to reject’ the null hypothesis purely because of random chance (error) effects; had you taken the necessary precautions, you might have come to the opposite conclusion. Such improper study designs usually dampen any obvious differences in treatment effect in the experiment.
In a nonequivalence study, researchers, whose incentive is to reject the null hypothesis, are thus forced to make sure that their designs are rigorous.
In an equivalence study, this isn’t the case. Since researchers are motivated to ‘fail to reject’ the null hypothesis from the get-go, it becomes easy to fall into the trap of conducting a study with all kinds of design flaws and very conveniently coming to the conclusion that one has ‘failed to reject’ the null hypothesis!
Hence, it is extremely important, more so in equivalence studies than in nonequivalence studies, to have a critical and alert mind during all phases of the experiment. Interpreting an equivalence study published in a journal is hard, because one needs to know the very guts of everything the research team did!
Even though we have discussed these concepts with drugs as an example, you could apply the same reasoning to many other forms of treatment interventions.
Hope you’ve found this post interesting :) . Do send in your suggestions, corrections and comments!
Adios for now!
Copyright © Firas MR. All rights reserved.
Readability grades for this post:
Automated readability index: 8.1
Flesch-Kincaid grade level: 7.4
Coleman-Liau index: 9
Gunning fog index: 11.8
SMOG index: 11
–
1. An alternative hypothesis for such a study is called a ‘two-tailed alternative hypothesis’. A study that tests for differences in only one direction has an alternative hypothesis that is called a ‘one-tailed alternative hypothesis’.
2. This situation is a good example of a ‘null’ hypothesis also being a ‘nil’ hypothesis. A null hypothesis is usually a nil hypothesis, but it’s important to realize that this isn’t always the case.
4. Note that we never use the term ‘accept the null hypothesis’.
Does Changing Your Answer In The Exam Help?
The Monty Hall Paradox
One of the 3 doors hides a car. The other two hide a goat each. In search of a new car, the player picks a door, say 1. The game host then opens one of the other doors, say 3, to reveal a goat, and offers to let the player pick door 2 instead of door 1. Is there an advantage if the player decides to switch? (Courtesy: Wikipedia)
Hola amigos! Yes, I’m back! It’s been eons and I’m sure many of you may have been wondering why I was MIA. Let’s just say it was academia as usual.
This post is unique as it’s probably the first where I’ve actually learned something from contributors and feedback. A very critical audience and pure awesome discussion. The main thrust was going to be an analysis of the question, “If you had to pick an answer in an MCQ randomly, does changing your answer alter the probabilities to success?” and it was my hope to use decision trees to attack the question. I first learned about decision trees and decision analysis in Dr. Harvey Motulsky’s great book, “Intuitive Biostatistics“. I do highly recommend his book. As I pondered over the question, I drew a decision tree that I extrapolated from his book. Thanks to initial feedback from BrownSandokan (my venerable computer scientist friend from yore :P) and Dr. Motulsky himself, who was so kind as to write back to just a random reader, it turned out that my diagram was wrong and so was the original analysis. The problem with the original tree (that I’m going to maintain for other readers to see and reflect on here) was that the tree in the book is specifically for a math (or rather logic) problem called the Monty Hall Paradox. You can read more about it here. As you can see, the Monty Hall Paradox is a special kind of unequal conditional probability problem, in which knowing something for sure, influences the probabilities of your guesstimates. It’s a very interesting problem, and has bewildered thousands of people, me included. When it was originally circulated in a popular magazine, “nearly 1000 PhDs” (cf. Wikipedia) wrote back to say that the solution put forth was wrong, prompting numerous psychoanalytical studies to understand human behavior. A decision tree for such a problem is conceptually different from a decision tree for our question and so my original analysis was incorrect.
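If the paradox still feels wrong to you, a brute-force simulation settles it quickly. Here's a short sketch of my own (not from Motulsky's book) showing that the switching player wins about two-thirds of the time:

```python
import random

random.seed(42)

def monty_trial(switch):
    """One round of the Monty Hall game; returns True if the player wins."""
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # Host opens a goat door that is neither the player's pick nor the car
    opened = random.choice([d for d in doors if d not in (pick, car)])
    if switch:
        # Switch to the one remaining unopened door
        pick = next(d for d in doors if d not in (pick, opened))
    return pick == car

trials = 10_000
stay = sum(monty_trial(False) for _ in range(trials)) / trials
swap = sum(monty_trial(True) for _ in range(trials)) / trials
print(f"stay: {stay:.2f}, switch: {swap:.2f}")
```

With 10,000 trials the stay strategy lands near 1/3 and the switch strategy near 2/3, because the host's reveal concentrates the remaining 2/3 probability onto the unopened door.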
So what the heck are decision trees anyway? They are basically conceptual tools that help you make the right decisions given a couple of known probabilities. You draw a line to represent a decision, and explicitly label it with a corresponding probability. To find the final probability for a number of decisions (or lines) in sequence, you multiply or add their individual probabilities. It takes skill and a critical mind to build a correct tree, as I learned. But once you have a tree in front of you, it’s easier to see the whole picture.
Let’s just ignore decision trees completely for the moment and think in the usual sense. How good an idea is it to change an answer on an MCQ exam such as the USMLE? The Kaplan lecture notes will tell you that your chances of being correct are better off if you don’t. Let’s analyze this. If every question has 1 correct option and 4 incorrect options (the total number of options being 5), then any single try on a random choice gives you a probability of 20% for the correct choice and 80% for the incorrect choice. The odds are higher that on any given attempt, you’ll get the answer wrong. If your choice was correct the first time, it still doesn’t change these basic odds. You are still likely to pick the incorrect choice 80% of the time. Borrowing from the concept of “regression towards the mean” (repeated measurements of something yield values closer to said thing’s mean), we can apply the same reasoning to this problem. Since the outcomes in question are categorical (binomial to be exact), the measure of central tendency used is the Mode (defined as the most commonly or frequently occurring thing in a series). In a categorical series – cat, dog, dog, dog, cat – the mode is ‘dog’. Since the Mode in this case happens to be the category “incorrect”, if you pick a random answer and repeat this multiple times, you are more likely to pick an incorrect answer! See, it all makes sense :) ! It’s not voodoo after all :D !
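One caveat worth simulating: if both your first pick and your second pick are purely random, with no new information in between (nobody opens a 'goat door' for you), the odds don't move at all. A quick sketch under the same five-option assumption as above:

```python
import random

random.seed(7)
OPTIONS = list(range(5))  # pretend option 0 is the correct answer

def attempt(switch):
    """Pick randomly; optionally switch blindly to one of the other four."""
    pick = random.choice(OPTIONS)
    if switch:
        pick = random.choice([o for o in OPTIONS if o != pick])
    return pick == 0

n = 100_000
stay = sum(attempt(False) for _ in range(n)) / n
swap = sum(attempt(True) for _ in range(n)) / n
print(f"stay: {stay:.3f}, switch: {swap:.3f}")
```

Both strategies hover near 20%, since a blind switch wins only when the first pick was wrong (80%) and the new pick happens to hit the right one of the remaining four (25%), and 0.8 × 0.25 = 0.2. The interesting cases are the ones where switching follows new information, as discussed later in this post.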
Coming back to decision analysis, just as there’s a way to prove the solution to the Monty Hall Paradox using decision trees, there’s also a way to prove our point on the MCQ problem using decision trees. While I study to polish my understanding of decision trees, building them for either of these problems will be a work in progress. And when I’ve figured it all out, I’ll put them up here. A decision tree for the Monty Hall Paradox can be accessed here.
To end this post, I'm going to complicate our main question a little bit and leave it out in the void. What if on your initial attempt you have no idea which of the answers is correct or incorrect, but on your second attempt your mind suddenly focuses on a structural flaw in one or more of the options? Assuming that an option with a structural flaw can't be correct, wouldn't this be akin to Monty showing the goat? One possible structural flaw could be an option that doesn't make grammatical sense when combined with the stem of the question. Does that mean you should switch? Leave your comments below!
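Just to tease that scenario a little further, here's a hedged simulation of it. It assumes (my assumptions, not facts about any real exam) that exactly one incorrect, unpicked option is revealed as flawed, and that you then switch to one of the remaining options at random. Under those assumptions, switching edges ahead (4/5 × 1/3 = 4/15 ≈ 26.7% vs. 20%), just as in Monty's game:

```python
import random

def mcq_switch_sim(n_options=5, trials=100_000, switch=True):
    """Five options, one correct. After the first pick, one incorrect,
    unpicked option is revealed as flawed; optionally switch at random."""
    wins = 0
    for _ in range(trials):
        correct = random.randrange(n_options)
        pick = random.randrange(n_options)
        # reveal a flawed option that is neither your pick nor the answer
        flawed = random.choice(
            [o for o in range(n_options) if o not in (pick, correct)]
        )
        if switch:
            pick = random.choice(
                [o for o in range(n_options) if o not in (pick, flawed)]
            )
        wins += (pick == correct)
    return wins / trials

print(f"stick : {mcq_switch_sim(switch=False):.3f}")  # ~0.200
print(f"switch: {mcq_switch_sim(switch=True):.3f}")   # ~0.267
```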
Hope you’ve found this post interesting. Adios for now!
Copyright © Firas MR. All rights reserved.
Readability grades for this post:
Flesch reading ease score: 72.4
Automated readability index: 7.8
Flesch-Kincaid grade level: 7.3
Coleman-Liau index: 8.5
Gunning fog index: 11.4
SMOG index: 10.7
Intuitive Biostatistics, by Harvey Motulsky
Powered by ScribeFire.
USMLE Scores – Debunking Common Myths
Lots of people have misguided notions as to the true nature of USMLE scores and what exactly they represent. In my opinion, this occurs in part due to a lack of interest in understanding the logistical considerations of the exam. Another contributing factor could be the borderline-brainless, mentally zeroed-out scientific culture most exam-goers happen to be cultivated in. Many if not most of these candidates, in their naive wisdom, got into Medicine hoping to rid themselves of numerical burdens forever!
The following, I hope, will help debunk some of these common myths.
Percentile? Uh…what percentile?
This myth is, without doubt, the king of them all :) . It isn't uncommon to find a candidate basking in the self-righteous glory of having scored a '99 percent' or worse, a '99 percentile'. The USMLE at one point used to provide percentile scores. That stopped sometime in the mid-to-late '90s. Why? Well, the USMLE organization believed that scores were being given more weightage in medics' careers than they ought to have. This test is a licensure exam, period. That has always been the motto. Among other things, when residency programs started using the exam as a yardstick to differentiate and rank students, the USMLE saw this as contrary to its primary purpose and said enough is enough. To make such rankings difficult, the USMLE no longer provides percentile scores to exam takers.
The USMLE does have an extremely detailed FAQ on what the 2-digit (which people confuse for a percentage or percentile) and 3-digit scores mean. I strongly urge all test-takers to take a hard look at it and ponder over some of the stuff said therein.
Simply put, the way the exam is designed, it measures a candidate's level of knowledge and provides a 3-digit score with an important meaning. This 3-digit score is an unfiltered indication of an individual's USMLE know-how, one that in theory shouldn't be influenced by variations in the content of the exam, be it across space (another exam center and/or questions from a different content pool) or time (exam content from the future or past). This means that provided a person's knowledge remains constant, he or she should in theory achieve the same 3-digit score regardless of where and when he or she took the test. Or supposedly so. The minimum 3-digit score required to 'pass' the exam is revised on an annual basis to preserve this space- and time-independent nature of the score. For the last couple of years, the passing score has hovered around 185. A 'pass' score makes you eligible to apply for a license.
What then is the 2-digit score? For god knows what reason, the Federation of State Medical Boards (the people who grant medics in the US licenses based on their USMLE scores) has a 2-digit format for a 'pass' score on the USMLE exam. Unlike the 3-digit score, this passing score is fixed at 75 and isn't revised every year.
How does one convert a 3-digit score to a 2-digit score? The exact conversion algorithm hasn't been disclosed (among lots of other things). But for matters of simplicity, I'm going to use a very crude approach to illustrate:
Equate the passing 3-digit score to 75. So if the passing 3-digit score is 180, then 180 = 75, 185 = 80, 190 = 85 … and so on.
I'm sure the relationship isn't linear as shown above. For one, by very definition, a 2-digit score ends at 99; 100 is a 3-digit number! So let's see what happens with our example above:
190 = 85, 195 = 90 … 204 = 99. We've reached the 2-digit limit at this point. Any score higher than 204 will also be equated to 99. It doesn't matter if you scored a 240 or a 260 on the 3-digit scale; you immediately fall under the 99 bracket along with the lesser folk!
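To make the crude linear rule explicit, here's a toy Python version of it, capped at the 2-digit ceiling of 99 (purely illustrative; remember, the real conversion algorithm is undisclosed):

```python
def two_digit(three_digit_score, passing=180):
    """Crude linear mapping: the passing 3-digit score maps to 75,
    each extra point adds one, and the result is capped at 99.
    Illustrative only; the real algorithm is not public."""
    return min(75 + (three_digit_score - passing), 99)

for s in (180, 185, 190, 204, 240, 260):
    print(s, "->", two_digit(s))
# 180 -> 75, 185 -> 80, 190 -> 85, 204 -> 99, 240 -> 99, 260 -> 99
```

Notice how a 240 and a 260 become indistinguishable once the cap kicks in, which is exactly the distortion discussed below.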
These distortions and constraints make the 2-digit score an unjust system to rank test-takers, and today most residency programs use the 3-digit score to compare people. Because the 3-digit to 2-digit scale conversion changes every year, it makes sense to stick to the 3-digit scale, which makes comparisons between old-timers and new-timers possible, besides the obvious advantage of helping comparisons between candidates who deal/dealt with different exam content.
Making Assumptions And Approximate Guesses
The USMLE does provide Means and Standard Deviations on students’ score cards. But these statistics don’t strictly apply to them because they are derived from different test populations. The score card specifically mentions that these statistics are “for recent” instances of the test.
Each instance of an exam is directed at a group of people which form its test population. Each population has its own characteristics such as whether or not it’s governed by Gaussian statistics, whether there is skew or kurtosis in its distribution, etc. The summary statistics such as the mean and standard deviation will also vary between different test populations. So unless you know the exact summary statistics and the nature of the distribution that describes the test population from which a candidate comes, you can’t possibly assign him/her a percentile rank. And because Joe and Jane can be from two entirely different test populations, percentiles in the end don’t carry much meaning. It’s that simple folks.
You could however make assumptions and draw arbitrary conclusions about percentile ranks. Say, for argument's sake, that all populations have a mean equal to 220, a standard deviation equal to 20, and conform to Gaussian statistics. Then a 3-digit score of:
220 = 50th percentile
220 + 20 = 84th percentile
220 + 20 + 20 = 97th percentile
[Going back to our '99 percentile' myth and with the specific example we used, don't you see how a score equal to 260 (with its 2-digit 99 equivalent) still doesn't reach the 99th percentile? It's amazing how severely people can delude themselves. A 99th percentile rank is no joke and I find it particularly fascinating to observe how hundreds of thousands of people ludicrously claim to have reached this magic rank with a 2-digit 99 score. I mean, doesn't the sheer commonality hint that something in their thinking is off?]
This calculator makes it easy to calculate a percentile based on known Mean and Standard Deviations for Gaussian distributions. Just enter the values for Mean and Standard Deviation on the left, and in the ‘Probability’ field enter a percentile value in decimal form (97th percentile corresponds to 0.97 and so forth). Hit the ‘Compute x’ button and you will be given the corresponding value of ‘x’.
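If you'd rather skip the calculator, the same percentile arithmetic takes a couple of lines of Python using the Gaussian cumulative distribution function (the mean of 220 and SD of 20 are, again, our made-up assumptions):

```python
from math import erf, sqrt

def percentile(score, mean=220.0, sd=20.0):
    """Percentile rank of a score under an assumed Gaussian distribution."""
    z = (score - mean) / sd
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))  # Gaussian CDF, scaled to %

print(f"{percentile(220):.1f}")  # 50.0
print(f"{percentile(240):.1f}")  # 84.1
print(f"{percentile(260):.1f}")  # 97.7 -- still short of the 99th percentile
```

Under these assumed population parameters, even a 260 lands at the 97.7th percentile, not the 99th.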
99th Percentile Ain’t Cake
Another point of note about a Gaussian distribution:
The distance from the 0th percentile to the 25th percentile is equal to the distance between the 75th and the 100th percentile; let's say this distance is x. The distance between the 25th and the 50th percentile is equal to the distance between the 50th and the 75th percentile; let's say this distance is y.
It so happens that x >>> y (strictly speaking, the extreme percentiles of a true Gaussian lie infinitely far out in the tails). In a crude sense, this means that it is disproportionately tougher to score extreme values than to stay closer to the mean. Going from a 50th percentile baseline, scoring a 99th percentile is disproportionately tougher than scoring a 75th percentile. If you aim to score a 99th percentile, you're gonna have to seriously sweat it out!
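You can put rough numbers on this with the inverse Gaussian CDF. A sketch (since a true Gaussian's 0th and 100th percentiles run off to infinity, I'm comparing the 50th-to-75th hop against the 75th-to-99th hop instead):

```python
from statistics import NormalDist

nd = NormalDist(mu=0, sigma=1)  # work in z-units; distances scale with the SD
z50, z75, z99 = (nd.inv_cdf(p) for p in (0.50, 0.75, 0.99))
print(f"50th to 75th percentile: {z75 - z50:.2f} SDs")  # ~0.67
print(f"75th to 99th percentile: {z99 - z75:.2f} SDs")  # ~1.65
```

The second hop of 25 percentile points (well, 24) costs roughly two and a half times as many standard deviations as the first, which is the "disproportionately tougher" part in numbers.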
It’s the interval, stupid
Say there are infinitely many clones of you in this world and you're all like the Borg. Each of you is mentally indistinguishable from the others – possessing ditto copies of USMLE know-how. Say that each of you took the USMLE and we then plot the frequencies of these scores on a graph. We're going to end up with a Gaussian curve depicting this sample of clones, with its own mean score and standard deviation. This process is called 'parametric sampling' and the distribution obtained is called a 'sampling distribution'.
The idea behind what we just did is to determine the variation that we would expect in scores even if knowhow remained constant – either due to a flaw in the test or by random chance.
The standard deviation of a sampling distribution is also called ‘standard error’. As you’ll probably learn during your USMLE preparation, knowing the standard error helps calculate what are called ‘confidence intervals’.
A confidence interval for a given score can be calculated as follows (using the Z-statistic):
True score = Measured score ± 1.96 (standard error of measurement) … for 95% confidence
True score = Measured score ± 2.58 (standard error of measurement) … for 99% confidence
For many recent tests, the standard error for the 3-digit scale has been 6 [every score card quotes a certain SEM (Standard Error of Measurement) for the 3-digit scale]. This means that given a measured score of 240, we can be 95% certain that the true value of your performance lies between a low of 240 − 1.96(6) and a high of 240 + 1.96(6). Similarly, we can say with 99% confidence that the true score lies between 240 − 2.58(6) and 240 + 2.58(6). These score intervals are probabilistically flat when graphed – each true-score value within the calculated interval has an equal chance of being the right one.
What this means is that, when you compare two individuals and see their scores side by side, you ought to consider what’s going on with their respective confidence intervals. Do they overlap? Even a nanometer of overlapping between CIs makes the two, statistically speaking, indistinguishable, even if in reality there is a difference. As far as the test is concerned, when two CIs overlap, the test failed to detect any difference between these two individuals (some statisticians disagree. How to interpret statistical significance when two or more CIs overlap is still a matter of debate! I’ve used the view of the authors of the Kaplan lecture notes here). Capiche?
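Here's a minimal sketch of that comparison in Python (the SEM of 6 and the example scores are illustrative):

```python
def confidence_interval(score, sem=6.0, z=1.96):
    """95% CI for a measured score, given the standard error of measurement."""
    return (score - z * sem, score + z * sem)

def overlap(ci_a, ci_b):
    """True if two confidence intervals share any common ground."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

print(confidence_interval(240))                                     # (228.24, 251.76)
print(overlap(confidence_interval(240), confidence_interval(250)))  # True
print(overlap(confidence_interval(240), confidence_interval(265)))  # False
```

So a 240 and a 250 are statistically indistinguishable under this reading, while a 240 and a 265 are genuinely apart.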
Beating competitors by intervals rather than pinpoint scores is a good idea to make sure you really did do better than them. The wider the distance separating two CIs, the larger is the difference between them.
There’s a special scenario that we need to think about here. What about the poor fellow who just missed the passing mark? For a passing mark of 180, what of the guy who scored, say 175? Given a standard error of 6, his 95% CI definitely does include 180 and there is no statistically significant (using a 5% margin of doubt) difference between him and another guy who scored just above 180. Yet this guy failed while the other passed! How do we account for this? I’ve been wondering about it and I think that perhaps, the pinpoint cutoffs for passing used by the USMLE exist as a matter of practicality. Using intervals to decide passing/failing results might be tedious, and maybe scientific endeavor ends at this point. Anyhow, I leave this question out in the void with the hope that it sparks discussions and clarifications.
If you care to give it a thought, the graphical subject-wise profile bands on the score card are actually confidence intervals (95%? 99%? I don't know). This is why the score card clearly states that if any two subject-wise profile bands overlap, performance in these subjects should be deemed equal.
I hope you’ve found this post interesting if not useful. Please feel free to leave behind your valuable suggestions, corrections, remarks or comments. Anything :) !
–
Readability grades for this post:
Kincaid: 8.8
ARI: 9.4
Coleman-Liau: 11.4
Flesch Index: 64.3/100 (plain English)
Fog Index: 12.0
Lix: 40.3 = school year 6
SMOG Grading: 11.1
–
Powered by Kubuntu Linux 8.04
–
Copyright © 2006 – 2008 Firas MR. All rights reserved.
How Examinations And Diagnostic Tests Are Similar
“You are more than a score”, or so the saying goes. But how much of that comes out as an emotional outburst as opposed to objective and rational thinking? Let’s try to see why the above is totally true, scientifically speaking.
In medicine, we’ve learned a lot about diagnostic tests, right? In fact everything investigative in nature can be considered a diagnostic test. Be it a screening exam for cervical cancer, that blood test for glucose, an Xray for a broken arm, or your palpating hand feeling for that enlarged liver. Heck, even an entire research study could be considered a diagnostic test. The ‘null hypothesis’ technique often used in analytical research studies is nothing more than a diagnostic test of sorts.
When considering the dynamics of a diagnostic test, a fundamental underlying principle is that we separate what is observed via the test from the actual truth. In the case of tangible phenomena like death, disease and disability, it is quite easy to distinguish the actual truth from what the test predicts. Because of this, you have terms like 'false positives', 'false negatives' and the like. A pregnancy test for example could be positive, but you could easily compare that prediction to the actual outcome (pregnancy vs. non-pregnancy) and say that this particular test has such-and-such a false-positive rate. More or less, all tests have the following attributes in this regard:
– Sensitivity
– Specificity
– Positive Predictive Value
– Negative Predictive Value
– Validity/Accuracy
– Reliability/Precision
We ought to think about examinations such as the USMLE, etc. in this manner as well. Why? Well, because they are investigations too! Think of them as X-rays to diagnose your intelligence or whatever, if that metaphor helps. And as a consequence, notions about false positives, false negatives and all of the other things on that list also apply to them. Intelligence being the abstract, intangible thing it is, it is impossible to know its true value. And because there's no way to compare prediction versus truth, it is impossible to say for sure what the false-positive or false-negative rates (or any item on that list) for an exam are. And that's why 'you are more than a score'! Statistically speaking, examinations are just so lame!
Do send in your comments!
–
Readability grades for this post:
Kincaid: 9.3
ARI: 9.8
Coleman-Liau: 12.9
Flesch Index: 58.0/100
Fog Index: 13.2
Lix: 42.9 = school year 7
SMOG Grading: 12.0
–
Powered by BlogJet and Ubuntu Linux 7.04
–
Copyright © 2006 – 2008 Firas MR. All rights reserved.
PostDavidson’s – ROC graphs and medical statistics
The much-awaited 20th edition of Davidson's Principles & Practice of Medicine is here with a bang, and besides a host of new additions the book boldly claims to have taken the content up a notch – both in terms of depth and breadth, keeping the needs of the MRCP examinee especially borne in mind. The very first chapter, Good Medical Practice, is a welcome addition and will likely provide the new generation/edition reader an edge over his or her senior counterparts. Studded with key topics in medical ethics, basic biostatistics and Bayes' theorem, besides an array of handy tips pertaining to the everyday practice of medicine, this chapter fits like a necklace around the other contents of the book. The logic behind restricting the number of Dx tests to a reasonable minimum in order to decrease the overall false positives is just one example of what this treasure trove holds. Readers interested in medical ethics will want to use Kumar & Clark's Clinical Medicine 6th ed to fill any minor gaps, particularly things such as the medical relevance of the European Convention on Human Rights, etc.
The trade-off between sensitivity and specificity has been depicted using a Receiver Operating Characteristic graph, which in itself is supposed to be a sophisticated biostatistical concept. Readers will find it interesting to note the following about the history of the ROC curve, courtesy Wikipedia:
“….ROC curves are used to evaluate the results of a prediction and were first employed in the study of discriminator systems for the detection of radio signals in the presence of noise in the 1940s, following the attack on Pearl Harbor. The initial research was motivated by the desire to determine how the US RADAR “receiver operators” had missed the Japanese aircraft.
In the 1960s they began to be used in psychophysics, to assess human (and occasionally animal) detection of weak signals. They also proved to be useful for the evaluation of machine learning results, such as the evaluation of Internet search engines….”
ROC curves are extensively used within the domain of signal detection theory.
Notice that Davidson's ROC curve plots sensitivity vs. specificity, and that the same curve can be obtained by plotting sensitivity vs. (1 − specificity); only the sequence of the numbering along the x-axis needs rearrangement.
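To see how such a curve comes about, here's a bare-bones sketch (with entirely made-up test values) that sweeps a cut-off across a diseased and a healthy group and collects the (1 − specificity, sensitivity) pairs an ROC curve is plotted from:

```python
def roc_points(scores_diseased, scores_healthy, thresholds):
    """TPR (sensitivity) and FPR (1 - specificity) at each cut-off,
    calling 'positive' any score at or above the threshold."""
    points = []
    for t in thresholds:
        tpr = sum(s >= t for s in scores_diseased) / len(scores_diseased)
        fpr = sum(s >= t for s in scores_healthy) / len(scores_healthy)
        points.append((fpr, tpr))
    return points

diseased = [7, 8, 8, 9, 6]  # made-up test values for ill patients
healthy = [3, 4, 5, 6, 2]   # made-up test values for well patients
print(roc_points(diseased, healthy, thresholds=[2, 5, 7, 10]))
# [(1.0, 1.0), (0.4, 1.0), (0.0, 0.8), (0.0, 0.0)]
```

Lowering the threshold buys sensitivity at the cost of specificity, and vice versa, which is precisely the trade-off the Davidson's figure illustrates.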
"A simple example of a ROC curve" is maintained by the University of Nebraska Medical Center and quite succinctly explains the entire dynamics of an ROC curve.