No, JAMA Internal Medicine, acupuncture should not be considered an option for preventing migraines.

…And no further research is needed.

These 3 excellent articles provide some background for my blog, but their titles alone are worth leading with:

Acupuncture is astrology practiced with needles.

Acupuncture: 3000 studies and more research is not needed.

Acupuncture is a theatrical placebo.

Each of these articles helps highlight an important distinction between evidence-based medicine and science-based medicine perspectives on acupuncture that will be discussed here.

A recent article in the prestigious JAMA Internal Medicine concluded:

“Acupuncture should be considered as one option for migraine prophylaxis in light of our findings.”

The currently freely accessible article can be found here.

A pay-walled editorial from Dr. Amy Gelfand can be found here.

The trial was registered long after patient recruitment had started, and the trial protocol can be found here.

[Aside: What is the value of registering a trial long after recruitment has commenced? Do journals have a responsibility to acknowledge that a trial-registration link they publish is for registration that occurred after the trial commenced? Is trial registration just another ritual, like acupuncture?]

Uncritical reports of the results of the trial as interpreted by the authors echoed through both the lay and physician-aimed media.

Coverage by Reuters was somewhat more interesting than the rest. The trial authors’ claim that acupuncture for preventing migraines was ready for prime time was paired with some reservations expressed in the accompanying editorial.

“Placebo response is strong in migraine treatment studies, and it is possible that the Deqi sensation . . . that was elicited in the true acupuncture group could have led to a higher degree of placebo response because there was no attempt made to elicit the Deqi sensation in the sham acupuncture group,” Dr. Amy Gelfand writes in an accompanying editorial.

Come on, Dr. Gelfand: if you checked the article, you would have seen that Deqi was not measured. If you checked the literature, even proponents concede that Deqi remains a vague, highly subjective judgment, in this case being made by an unblinded acupuncturist. Basically, the acupuncturist persisted in whatever was being done until there was an indication that a sensation of soreness, numbness, distention, or radiating had been elicited from the patient. What part of a subjective response to acupuncture, with or without Deqi, would you consider NOT a placebo response?

Dr. Gelfand also revealed some reasons why she would bother to write an editorial for a treatment with an incoherent, implausible, nonscientific rationale.

“When I’m a researcher, placebo response is kind of a troublesome thing, because it makes it difficult to separate signal from noise,” she said. But when she’s thinking as a doctor about the patient in front of her, placebo response is welcome, Gelfand said.

“You know, what I really want is my patient to feel better, and to be improved and not be in pain. So, as long as something is safe, even if it’s working through a placebo mechanism, it may still be something that some patients might want to use,” she said.

Let’s contemplate the implications of this. This editorial in JAMA Internal Medicine accompanies an article in which the trial authors suggest acupuncture is ready to become a standard treatment for migraine. There is nothing in the article which suggests that the unscientific basis of acupuncture has been addressed, only that it might have achieved a placebo response. Is Dr. Gelfand suggesting that would be sufficient, despite some problems in the trial? What if that became the standard for recommending medications and medical procedures?

With increasing success in getting acupuncture and other now-called “integrative medicine” approaches ensconced in cancer centers and reimbursed by insurance, we will be facing again and again some of the issues that prompted this blog post. Is acupuncture doing no obvious harm reason enough to reimburse it? Trials like this one can be cited in support of reimbursement.

The JAMA Internal Medicine report of an RCT of acupuncture for preventing migraines

Participants were randomly assigned to one of three groups: true acupuncture, sham acupuncture, or a waiting-list control group.

Participants in the true acupuncture and sham acupuncture groups received treatment 5 days per week for 4 weeks for a total of 20 sessions.

Participants in the waiting-list group did not receive acupuncture but were informed that 20 sessions of acupuncture would be provided free of charge at the end of the trial.

As the editorial comment noted, this is incredibly intensive treatment that burdens patients with coming in five days a week for four weeks. Yet the effects were quite modest in terms of the number of migraine attacks, even if statistically significant:

The mean (SD) change in frequency of migraine attacks differed significantly among the 3 groups at 16 weeks after randomization (P < .001); the mean (SD) frequency of attacks decreased in the true acupuncture group by 3.2 (2.1), in the sham acupuncture group by 2.1 (2.5), and the waiting-list group by 1.4 (2.5); a greater reduction was observed in the true acupuncture than in the sham acupuncture group (difference of 1.1 attacks; 95%CI, 0.4-1.9; P = .002) and in the true acupuncture vs waiting-list group (difference of 1.8 attacks; 95%CI, 1.1-2.5; P < .001). Sham acupuncture was not statistically different from the waiting-list group (difference of 0.7 attacks; 95%CI, −0.1 to 1.4; P = .07).
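As a rough plausibility check on the headline comparison, the reported difference and confidence interval can be approximated from the group means and SDs. The sketch below assumes roughly 80 participants per group (group sizes are not stated in this excerpt) and a simple normal approximation:

```python
from math import sqrt
from statistics import NormalDist

def diff_ci(mean1, sd1, mean2, sd2, n_per_group, alpha=0.05):
    """Approximate CI for a difference of two group means,
    using a normal approximation to the two-sample comparison."""
    diff = mean1 - mean2
    se = sqrt(sd1 ** 2 / n_per_group + sd2 ** 2 / n_per_group)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, (diff - z * se, diff + z * se)

# True vs sham acupuncture: reductions of 3.2 (SD 2.1) vs 2.1 (SD 2.5) attacks
diff, (lo, hi) = diff_ci(3.2, 2.1, 2.1, 2.5, n_per_group=80)
print(f"difference = {diff:.1f}, 95% CI ({lo:.1f}, {hi:.1f})")
# → difference = 1.1, 95% CI (0.4, 1.8)
```

With that assumed n, the computed interval lands close to the reported 0.4–1.9, consistent with groups of roughly this size; the point is how modest a difference of about one attack is after 20 sessions of treatment.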

There were no group-by-time differences in use of medication for migraine. Receiving “true” versus sham acupuncture did not matter.

Four acupoints were used per treatment. All patients received acupuncture on 2 obligatory points, including GB20 and GB8. The 2 other points were chosen according to the syndrome differentiation of meridians in the headache region. The potential acupoints included SJ5, GB34, BL60, SI3, LI4, ST44, LR3, and GB40.20. The use of additional acupoints other than the prescribed ones was not allowed. We chose the prescriptions as a result of a systematic review of ancient and modern literature,22,23 consensus meetings with clinical experts, and experience from our previous study.

Note that the “headache region” is not the region of the head where headaches occur, and there is no scientific basis for its selection. Since when does such a stir fry of ancient and contemporary wisdom, consensus meetings with experts, and the clinical experience of the investigators become the basis of a mechanism justifying study in a clinical trial published in a prestigious American medical journal?

What was sham about the sham acupuncture (SA) treatment?

The number of needles, electric stimulation, and duration of treatment in the SA group were identical in the TA group except that an attempt was not made to induce the Deqi sensation. Four nonpoints were chosen according to our previous studies.

From the trial protocol, we learn that the effort to induce the Deqi sensation involves the acupuncturist twirling and rotating the needles.

In a manner that can easily escape notice, the authors indicate that the acupuncture was administered with electrostimulation.

In the methods section, they abruptly state:

Electrostimulation generates an analgesic effect, as manual acupuncture does.21

I wonder if the reviewers or the editorialist checked this reference. It is to an article that provides the insight that “meridians” (the 365 designated acupuncture points) are identified on a particular patient by

feeling for 12 organ-specific pulses located on the wrists and with cosmological interpretations including a representation of five elements: wood, water, metal, earth, and fire.

The authors further state that they undertook a program of research to counter the perception in the United States in the 1970s that acupuncture was quackery and even “Oriental hypnosis.” Their article describes some of the experiments they conducted, including one in which the benefits to a rabbit of having received finger-pressure acupuncture were transferred to another rabbit via a transfusion of cerebrospinal fluid.

In discussing the results of the present study in JAMA Internal Medicine, the authors again comment in passing:

We added electrostimulation to manual acupuncture because manual acupuncture requires more time until it reaches a similar analgesic effect as electrical stimulation.27 Previous studies have reported that electrostimulation is better than manual acupuncture in relieving pain27-30 and could induce a longer lasting effect.28

The citations are to methodologically poor laboratory studies in which dramatic results are often obtained with very small cell sizes (n = 10).

Can we dispense with the myth that the acupuncture provided in this study is an extension of traditional Chinese needle therapy?

It is high time that we dispense with the notion that acupuncture applied to migraines and other ailments represents a traditional Chinese medicine that is therefore not subject to any effort to critique its plausibility and its status as a science-based treatment. Even if we dispense with that idea, we still have to confront how unscientific and nonsensical the rationale is for the highly ritualized treatment provided in this study.

An excellent article by Ben Kavoussi offers a carefully documented debunking of:

 reformed and “sanitized” acupuncture and the makeshift theoretical framework of Maoist China that have flourished in the West as “Traditional,” “Chinese,” “Oriental,” and most recently as “Asian” medicine.

Kavoussi, who studied to become an acupuncturist, notes that:

Traditional theories for selecting points and means of stimulation are not based on an empirical rationale, but on ancient cosmology, astrology and mythology. These theories significantly resemble those that underlined European and Islamic astrological medicine and bloodletting in the Middle-Ages. In addition, the alleged predominance of acupuncture amongst the scholarly medical traditions of China is not supported by evidence, given that for most of China’s long medical history, needling, bloodletting and cautery were largely practiced by itinerant and illiterate folk-healers, and frowned upon by the learned physicians who favored the use of pharmacopoeia.

In the early 1930s a Chinese pediatrician by the name of Cheng Dan’an (承淡安, 1899-1957) proposed that needling therapy should be resurrected because its actions could potentially be explained by neurology. He therefore repositioned the points towards nerve pathways and away from blood vessels-where they were previously used for bloodletting. His reform also included replacing coarse needles with the filiform ones in use today.38 Reformed acupuncture gained further interest through the revolutionary committees in the People’s Republic of China in the 1950s and 1960s along with a careful selection of other traditional, folkloric and empirical modalities that were added to scientific medicine to create a makeshift medical system that could meet the dire public health and political needs of Maoist China while fitting the principles of Marxist dialectics. In deconstructing the events of that period, Kim Taylor in her remarkable book on Chinese medicine in early communist China, explains that this makeshift system has achieved the scale of promotion it did because it fitted in, sometimes in an almost accidental fashion, with the ideals of the Communist Revolution. As a result, by the 1960s acupuncture had passed from a marginal practice to an essential and high-profile part of the national health-care system under the Chinese Communist Party, who, as Kim Taylor argues, had laid the foundation for the institutionalized and standardized format of modern Chinese medicine and acupuncture found in China and abroad today.39 This modern construct was also a part of the training of the “barefoot doctors,” meaning peasants with an intensive three- to six-month medical and paramedical training, who worked in rural areas during the nationwide healthcare disarray of the Cultural Revolution era.40 They provided basic health care, immunizations, birth control and health education, and organized sanitation campaigns. 
Chairman Mao believed, however, that ancient natural philosophies that underlined these therapies represented a spontaneous and naive dialectical worldview based on social and historical conditions of their time and should be replaced by modern science.41 It is also reported that he did not use acupuncture and Chinese medicine for his own ailments.42

What is a suitable comparison/control group for a theatrical administration of a placebo?

A randomized double-blind crossover pilot study published in NEJM highlights some of the problems arising from poorly chosen control groups. The study compared an inhaled albuterol bronchodilator to one of three control conditions: a placebo inhaler, sham acupuncture, or no intervention. Subjective self-report measures of perceived improvement in asthma symptoms and perceived credibility of the treatments revealed only that the no-intervention condition was inferior to the active treatment of inhaled albuterol and the two placebo conditions; no difference was found between the active treatment and the placebo conditions. However, strong differences were found between the active treatment and the three comparison/control conditions on an objective measure of physiological response: improvement in forced expiratory volume (FEV1), measured with spirometry.

One take-away lesson is that we should be careful about accepting subjective self-report measures when objective measures are available. One objective measure in the present study was the taking of medication for migraines, and there were no differences between groups. This point is missed in both the target article in JAMA Internal Medicine and the accompanying editorial.

The editorial does comment on the acupuncturists being unblinded: they clearly knew when they were providing the preferred “true” acupuncture and when they were providing sham. They had some instructions to avoid creating a Deqi sensation in the sham group, but some latitude to keep working until it was achieved in the “true” group. Unblinded treatment providers are always a serious risk of bias in clinical trials, but here we have a trial where the primary outcomes are subjective, the scientific status of Deqi is dubious, and the providers might be seen as highly motivated to promote the “true” treatment.

I’m not sure why the editorialist was not stopped in her tracks by the unblinded acupuncturists, or, for that matter, why the journal published this article. But let’s ponder a bit the difficulties in coming up with a suitable comparison/control group for what is, until proven otherwise, a theatrical and highly ritualized placebo. If a treatment has no scientifically valid active ingredient, how do we construct a comparison/control group that differs only in the absence of that ingredient but is otherwise equivalent?

There is a long history of futile efforts to apply sham acupuncture, defined by needling what practitioners consider the inappropriate meridians. An accumulation of failures to distinguish such sham from “true” acupuncture in clinical trials has led to arguments that the distinction may not be valid: the efficacy of acupuncture may depend only on the procedure, not the choice of a correct meridian. Other studies would seem to show some advantage for the active or “true” treatments, but these are generally clinical trials with high risk of bias, especially the inability to blind practitioners as to which treatment they are providing.

There have been some clever efforts to develop sham acupuncture techniques that can fool even experienced practitioners. A recent PLOS One article tested needles that collapse into themselves.

Up to 68% of patients and 83% of acupuncturists correctly identified the treatment, but for patients the distribution was not far from 50/50. Also, there was a significant interaction between actual or perceived treatment and the experience of de qi (p = 0.027), suggesting that the experience of de qi and possible non-verbal clues contributed to correct identification of the treatment. Yet, of the patients who perceived the treatment as active or placebo, 50% and 23%, respectively, reported de qi. Patients’ acute pain levels did not influence the perceived treatment. In conclusion, acupuncture treatment was not fully double-blinded which is similar to observations in pharmacological studies. Still, the non-penetrating needle is the only needle that allows some degree of practitioner blinding. The study raises questions about alternatives to double-blind randomized clinical trials in the assessment of acupuncture treatment.

This PLOS One study is supplemented by a recent review in PLOS One, Placebo Devices as Effective Control Methods in Acupuncture Clinical Trials:

Thirty-six studies were included for qualitative analysis while 14 were in the meta-analysis. The meta-analysis does not support the notion of either the Streitberger or the Park Device being inert control interventions while none of the studies involving the Takakura Device was included in the meta-analysis. Sixteen studies reported the occurrence of adverse events, with no significant difference between verum and placebo acupuncture. Author-reported blinding credibility showed that participant blinding was successful in most cases; however, when blinding index was calculated, only one study, which utilised the Park Device, seemed to have an ideal blinding scenario. Although the blinding index could not be calculated for the Takakura Device, it was the only device reported to enable practitioner blinding. There are limitations with each of the placebo devices and more rigorous studies are needed to further evaluate their effects and blinding credibility.
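For readers unfamiliar with the “blinding index” the review mentions, one common version (Bang’s per-arm index) is simple to compute. A minimal sketch, assuming the standard definition in which “don’t know” responses count toward the denominator:

```python
def bang_blinding_index(n_correct, n_incorrect, n_dont_know):
    """Bang blinding index for one trial arm: ranges from -1 (everyone
    guesses wrong) through 0 (random guessing, consistent with intact
    blinding) to +1 (everyone correctly identifies their treatment)."""
    total = n_correct + n_incorrect + n_dont_know
    return (n_correct - n_incorrect) / total

# If 68 of 100 patients correctly identify their assignment (no
# "don't know" answers), the index suggests partial unblinding:
print(bang_blinding_index(68, 32, 0))   # 0.36
print(bang_blinding_index(50, 50, 0))   # 0.0 — consistent with blinding
```

Applied to the earlier figure of up to 68% of patients guessing correctly, the index is well above zero, which is why “not far from 50/50” is a generous gloss.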

Really, must we await better technology that more successfully fools acupuncturists and their patients as to whether the needles are actually penetrating the skin?

Results of the present study in JAMA Internal Medicine are seemingly contradicted by the results of a large German trial that found:

Results Between baseline and weeks 9 to 12, the mean (SD) number of days with headache of moderate or severe intensity decreased by 2.2 (2.7) days from a baseline of 5.2 (2.5) days in the acupuncture group compared with a decrease to 2.2 (2.7) days from a baseline of 5.0 (2.4) days in the sham acupuncture group, and by 0.8 (2.0) days from a baseline of 5.4 (3.0) days in the waiting list group. No difference was detected between the acupuncture and the sham acupuncture groups (0.0 days, 95% confidence interval, −0.7 to 0.7 days; P = .96) while there was a difference between the acupuncture group compared with the waiting list group (1.4 days; 95% confidence interval; 0.8-2.1 days; P<.001). The proportion of responders (reduction in headache days by at least 50%) was 51% in the acupuncture group, 53% in the sham acupuncture group, and 15% in the waiting list group.

Conclusion Acupuncture was no more effective than sham acupuncture in reducing migraine headaches although both interventions were more effective than a waiting list control.

I welcome someone with more time on their hands to compare and contrast the results of these two studies and decide which one has more credibility.

Maybe we should step back and ask: why does anyone care about such questions, when there is such doubt that a plausible scientific mechanism is in play?

Time for JAMA Internal Medicine to come clean

The JAMA Internal Medicine article on acupuncture for prophylaxis of migraines is yet another example of a publication where revelation of earlier drafts, reviewer critiques, and author responses would be enlightening. Just what standard are the authors being held to? What issues were raised in the review process? Beyond unresolved crucial limitations like the blinding of acupuncturists, under what conditions would the journal conclude that studies of acupuncture in general are too scientifically unsound and medically irrelevant to warrant publication in a prestigious JAMA journal?

Alternatively, is the journal willing to go on record that it is sufficient to establish that patients are satisfied with a pain treatment in terms of self-reported subjective experiences? Could we then simply close the issue of whether a plausible scientific mechanism is involved where the existence of one can be seriously doubted? If so, why stop with evaluations in which subjective pain or days without pain is the primary outcome?

We must question the wisdom of JAMA Internal Medicine in inviting Dr. Amy Gelfand for editorial comment. She is apparently willing, as a clinician, to accept demonstration of a placebo response as sufficient. She is also attached to the University of California, San Francisco Headache Center, which offers “alternative medicine, such as acupuncture, herbs, massage and meditation for treating headaches.” Endorsement of acupuncture as effective in a prestigious journal becomes part of the evidence considered for its reimbursement. I think there are enough editorial commentators out there without such conflicts of interest.



‘Replace male doctors with female ones and save at least 32,000 lives each year’?

The authors of a recent article in JAMA Internal Medicine

“Physician Gender and Outcomes of Hospitalized Medicare Beneficiaries in the U.S.,” by Yusuke Tsugawa, Anupam B. Jena, Jose F. Figueroa, E. John Orav, Daniel M. Blumenthal, and Ashish K. Jha, JAMA Internal Medicine, online December 19, 2016, doi:10.1001/jamainternmed.2016.7875

stirred lots of attention in the media, with direct quotes like these:

“If we had a treatment that lowered mortality by 0.4 percentage points or half a percentage point, that is a treatment we would use widely. We would think of that as a clinically important treatment we want to use for our patients,” said Ashish Jha, professor of health policy at the Harvard School of Public Health. The estimate that 32,000 patients’ lives could be saved in the Medicare population alone is on par with the number of deaths from vehicle crashes each year.

Washington Post: Women really are better doctors, study suggests.

LA Times: How to save at least 32,000 lives each year: Replace male doctors with female ones.

NPR: Patients cared for by female doctors fare better than those treated by men.

My immediate reactions after looking at the abstract were only confirmed when I delved deeper.

Basically, we have a large, but limited and very noisy data set. It is unlikely that these data allow us to be confident about the strength of any signal concerning the relationship between physician gender and patient outcome that is so important to the authors. The small apparent differences could be just more noise on which the authors have zeroed in so that they can make a statement about the injustice of gender differences in physician pay.

 I am unwilling to relax methodological and statistical standards to manufacture support for such a change. There could be unwanted consequences of accepting that arguments can be made with such weak evidence, even for a good cause.

What if the authors had found the same small differences in noisy data in the reverse direction? Would they argue that we should preserve gender differences in physician pay? What if the authors had focused on a different variable in all this noise and concluded that the lower pay women receive was associated with reduced mortality? Would we then advocate reducing the pay of both male and female physicians in order to improve patient outcomes?

Despite all the excitement that the claim of an effect of physician gender on patient mortality is generating, it is most likely that we are dealing with noise, arising from overinterpretation of complex analyses that assume more completeness and precision than can be found in the data being analyzed.

These claims are not just a matter of causal relationships being spun from correlation. Rather, they are causal claims being made on the basis of partial correlations emerging in complex multivariate relationships found in an administrative data set.

  • Administrative data sets, particularly Medicare data sets like this one, are not constructed with such research questions in mind. There are severe constraints on which variables can be isolated and which potential confounds can be identified and tested.
  • Administrative data sets consist of records, not actual behaviors. It’s reasonable to infer a patient death from a record of a death. Associating a physician’s gender with a particular record is more problematic, as we will see. Even if we accept the association found in these records, it does not necessarily mean that physicians engaged in any particular behaviors or that physician behavior is associated with the pattern of deaths emerging in these multivariate analyses.
  • The authors start out with a statement about differences in how female and male physicians practice. In the actual article and in the media, they have referred to variables like communication skills, providing evidence-based treatments, and encouraging health-related behaviors. None of these variables is remotely accessible in a Medicare data set.
  • Analyses of such administrative data sets do not allow isolation of the effects of physician gender from the effects of the contexts in which their practice occurs and relevant associated variables. We are not talking about a male or female physician encountering a particular patient being associated with a death or not, but an administrative record of physician gender arising in a particular context being interpreted as associated with a death. Male and female physicians may differ in being found in particular contexts in nonrandom fashion. It’s likely that these differences will dwarf any differences in outcomes. There will be a real challenge in even confidently attributing those outcomes to whether patients had an attending male or female physician.

The validity of complex multivariate analyses is strongly threatened by specification bias and residual confounding. The analyses must assume that all of the relevant confounds have been identified and measured without error. Departures from these ideal conditions can lead to spurious results, and generally do. Examination of the limitations of the variables available in a Medicare data set and how they were coded can quickly undermine any claim to validity.
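How residual confounding manufactures spurious results can be shown with a toy simulation (the numbers are entirely hypothetical, not drawn from the study): an unmeasured factor drives both the “exposure” and the outcome, and adjusting for an error-laden proxy of that factor leaves a clearly nonzero estimated “effect” of an exposure that in truth does nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

u = rng.normal(size=n)                           # true confounder (unobserved)
x = (u + rng.normal(size=n) > 0).astype(float)   # "exposure" driven by u
y = 0.5 * u + rng.normal(size=n)                 # outcome depends ONLY on u, never on x
c = u + rng.normal(size=n)                       # u measured with error (the proxy we adjust for)

def adjusted_coef(outcome, exposure, covariate):
    """OLS coefficient on `exposure`, adjusting for `covariate`."""
    design = np.column_stack([np.ones(n), exposure, covariate])
    beta, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return beta[1]

print(adjusted_coef(y, x, u))  # ~0: adjusting for the TRUE confounder removes the association
print(adjusted_coef(y, x, c))  # clearly nonzero: residual confounding survives adjustment
```

The second estimate is spurious by construction; in real administrative data we never get to run the first regression, because the true confounders are unmeasured or measured with error.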

Acceptance of claims about the effects of particular variables like female physician gender arising in complex multivariate analyses involves assumptions of “all other things being equal.” If we attempt to move from statistical manipulation to inference about a real-world encounter, we are no longer talking about a particular female physician, but a construction that may be very different from particular physicians interacting with particular patients in particular contexts.

The potential for counterfactual statements can be seen if we move from the study to one of science nerds and basketball players and hypothesize that if John and Jason were of equivalent height, John would not study so hard.

Particularly in complex social situations, it is usually a fantasy that we can change one variable, and only one variable, leaving the others untouched. Just how did John and Jason get to be of equal height? And how are they now otherwise different?

Associations discovered in administrative data sets most often do not translate into effects observed in randomized trials. I’m not sure how we could get a representative sample of patients to disregard their preferences and accept random assignment to a male or female physician. It would have to be a very large study to detect the effect sizes reported in this observational study, and I’m skeptical that a sufficiently strong signal would emerge from all of the noise.
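To put rough numbers on “very large”: a standard back-of-the-envelope sample-size calculation for comparing two proportions, applied to the reported adjusted mortality rates, with a conventional two-sided alpha of .05 and 80% power (these design assumptions are mine, not the authors’):

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for a two-sided
    comparison of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    pooled_var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * pooled_var / (p1 - p2) ** 2)

# Adjusted 30-day mortality of 11.07% vs 11.49% from the observational study:
n = n_per_arm(0.1107, 0.1149)
print(n)   # ≈ 89,000 patients PER ARM, i.e. nearly 180,000 randomized patients
```

A trial of that size, with patients randomized to physician gender against their preferences, is not going to happen, which is why these observational estimates are unlikely ever to be confirmed experimentally.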

We might relax our standards and accept a quasi-experimental design that would be smaller but encompass a wider range of relevant variables. For instance, it is conceivable that we could construct a large sample in which physicians varied in terms of whether they had formal communication skills training. We might examine whether communications training influenced subsequent patient mortality, independent of physician gender, and vice versa. This would be a reasonable translation of the authors’ hypothesis that communication skills differences between male and female physicians account for what the authors believe is the observed association between physician gender and mortality. I know of no such study having been done. I know of no study demonstrating that physician communication training affects patient mortality. I’m skeptical that the typical communication training is so powerful in its effects. If such a study required substantial resources, rather than relying on data already at hand, the strength of the results of the present study would not encourage me to marshal those resources.

What I saw when I looked at the article

We are dealing with very small adjusted differences in percentages arising in a large sample.

Patients treated by female physicians had lower 30-day mortality (adjusted mortality, 11.07% vs 11.49%; adjusted risk difference, –0.43%; 95% CI, –0.57% to –0.28%; P < .001; number needed to treat to prevent 1 death, 233).
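The reported number needed to treat follows directly from the adjusted risk difference, and a quick arithmetic check shows how sensitive such figures are to rounding:

```python
from math import ceil

def nnt(risk_difference):
    """Number needed to treat: reciprocal of the absolute risk
    difference, conventionally rounded up to a whole patient."""
    return ceil(1 / abs(risk_difference))

print(nnt(0.0043))            # 233 — from the reported adjusted risk difference of 0.43%
print(nnt(0.1149 - 0.1107))   # 239 — from subtracting the rounded mortality percentages instead
```

A shift in the fourth decimal place of the risk difference moves the NNT by several patients, which is one more reminder of how small the claimed effect is relative to the noise in the data.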

Assignment of a particular patient to a particular physician is done with a lot of noise.

We assigned each hospitalization to a physician based on the National Provider Identifier in the Carrier File that accounted for the largest amount of Medicare Part B spending during that hospitalization.25 Part B spending comprises professional and other fees determined by the physician. On average, these physicians were responsible for 51.1% of total Part B spending for a given hospitalization.

One commentator quoted in a news article noted:

William Weeks, a professor of psychiatry at Dartmouth’s Geisel School of Medicine, said that the researchers had done a good job of trying to control for other factors that might influence the outcome. He noted that one caveat is that hospital care is usually done by a team. That fact was underscored by the method the researchers used to identify the doctor who led the care for patients in the study. To identify the gender of the physician, they looked for the doctor responsible for the biggest chunk of billing for hospital services — which was, on average, about half. That means that almost half of the care was provided by others.

Actually, much of the care is not provided by the attending physician, but by other staff, including nurses and residents.

The authors undertook the study to call attention to gender disparities in physician pay. But could those disparities show up in male physicians being able to claim more billable procedures, that is, greater administrative credit for what is done with patients during hospitalization, including by other physicians? This might explain at least some of the gender differences, but it could undermine the validity of this key variable in relating physician gender to differences in patient outcome.

The statistical control of differences in patient and physician characteristics afforded by variables in this data set is inadequate.

Presumably, a full range of patient variables is related to whether patients die within 30 days of a hospitalization. Recall the key assumption: that all of the relevant confounds have been identified and measured without error by the variables used to characterize patients:

Patient characteristics included patient age in 5-year increments (the oldest group was categorized as ≥95 years), sex, race/ethnicity (non-Hispanic white, non-Hispanic black, Hispanic, and other), primary diagnosis (Medicare Severity Diagnosis Related Group), 27 coexisting conditions (determined using the Elixhauser comorbidity index28), median annual household income estimated from residential zip codes (in deciles), an indicator variable for Medicaid coverage, and indicator variables for year.

Note that the comorbidity index collapses 27 variables into a single number. This simplifies the statistics, yes, but at a tremendous loss of information.

Recall the assumption that this set of variables represents not just what happens to be available in an administrative data set, but all of the patient characteristics relevant to dying within 30 days after discharge from the hospital. Are we really willing to accept this assumption?

For the physician variables displayed at the top of Table 1, there are huge differences between male and female physicians, relative to the modest difference in adjusted patient mortality (11.07% vs 11.49%).

smaller table of patient characteristics

These authors encourage us to think of the results as simulating a randomized trial, with statistical controls serving the function that randomization of patients to physician gender would serve. We are being asked to accept that these differences in the baseline characteristics of female versus male physicians’ practices can be eliminated through statistics. We would never accept that argument in a randomized trial.

Addressing criticisms of the authors’ interpretation of their results.

The senior author provided a pair of blog posts in which he acknowledges criticism of his study but attempts to defuse the key objections. It’s unfortunate that the sources of these objections are not identified, so we are dependent on the author’s out-of-context summary. I think the key responses address straw-man versions of the objections.

Correlation, Causation, and Gender Differences in Patient Outcomes

Do women make better doctors than men?

Correlation is not causation.

We often make causal inferences based on observational data – and here’s the kicker: sometimes, we should.  Think smoking and lung cancer.  Remember the RCT that assigned people to smoking (versus not) to see if it really caused lung cancer?  Me neither…because it never happened.  So, if you are a strict “correlation is not causation” person who thinks observational data only create hypotheses that need to be tested using RCTs, you should only feel comfortable stating that smoking is associated with lung cancer but it’s only a hypothesis for which we await an RCT.  That’s silly.  Smoking causes lung cancer.

No, it is this argument that is silly. We can now look back on the data concerning smoking and lung cancer with the hindsight provided by years of sorting out smoking as a risk factor from potential confounds. I recall that at some point, drinking coffee was correlated with lung cancer in the United States, whereas drinking tea was correlated with it in the UK. Of course, if we did not know that smoking was the culprit, we might miss that in the US, smoking was done while drinking coffee, whereas in the UK, it was done while drinking tea.

And isolating smoking as a risk factor, rather than just a marker for risk, is so much simpler than isolating whatever risk factors for death are hidden behind physician gender as a marker for risk of mortality.

Coming up with alternative explanations for the apparent link between physician gender and patient mortality.

The final issue – alternative explanations – has been brought up by nearly every critic. There must be an alternative explanation! There must be confounding!  But the critics have mostly failed to come up with what a plausible confounder could be.  Remember, a variable, in order to be a confounder, must be correlated both with the predictor (gender) and outcome (mortality).

This is similarly a fallacious argument. I am not arguing for alternative substantive explanations; I am proposing that the spurious results were produced by pervasive specification bias, including measurement error. There is no particular confounder I have to identify. I am simply arguing that the small differences in mortality are dwarfed by specification and measurement error.
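The measurement-error point can be illustrated with a toy simulation (all numbers here are invented for illustration; this is a sketch of residual confounding, not a model of the actual study). Mortality depends only on patient severity, physician gender is correlated with severity, and severity is recorded with error. Adjusting for the error-laden measure shrinks the gender gap but does not remove it:

```python
import math
import random
from collections import defaultdict

random.seed(1)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Simulate 200,000 hospitalizations. Physician gender has NO causal effect
# on mortality; both gender assignment and death depend on patient severity.
rows = []
for _ in range(200_000):
    severity = random.gauss(0, 1)                      # true confounder
    male = random.random() < sigmoid(0.5 * severity)   # assumed: sicker patients drift to male physicians
    died = random.random() < sigmoid(-2.5 + severity)  # mortality driven by severity alone
    measured = severity + random.gauss(0, 1)           # severity recorded with error
    rows.append((male, died, measured))

def gender_gap(rows, stratum):
    """Weighted average of the (male - female) mortality difference within strata."""
    cells = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for male, died, measured in rows:
        cell = cells[stratum(measured)][male]
        cell[0] += died
        cell[1] += 1
    num = den = 0.0
    for c in cells.values():
        (md, mn), (fd, fn) = c[True], c[False]
        if mn and fn:
            num += (md / mn - fd / fn) * (mn + fn)
            den += mn + fn
    return num / den

crude = gender_gap(rows, lambda m: 0)            # no adjustment
adjusted = gender_gap(rows, lambda m: round(m))  # stratify on the noisy measure

# The adjusted gap shrinks but stays positive: residual confounding
# masquerades as a physician-gender effect.
print(f"crude gap {crude:.4f}, adjusted gap {adjusted:.4f}")
```

No amount of adjustment for the mismeasured confounder removes the spurious association, which is exactly the concern about adjusting for coarse administrative covariates.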

This tiny difference is actually huge in its implications.

Several critics have brought up the point that statistical significance and clinical significance are not the same thing.  This too is epidemiology 101.  Something can be statistically significant but clinically irrelevant.  Is a 0.43 percentage point difference in mortality rate clinically important? This is not a scientific or a statistical question.  This is a clinical question. A policy and public health question.  And people can reasonably disagree.  From a public health point of view, a 0.43 percentage point difference in mortality for Medicare beneficiaries admitted for medical conditions translates into potentially 32,000 additional deaths. You might decide that this is not clinically important. I think it is. It’s a judgment call and we can disagree.

The author takes a small difference and magnifies its importance by applying it to a larger population. He attributes the “additional deaths” to patients being treated by men. I feel he hasn’t made the case that physician gender is the culprit, and so nothing is accomplished except the shock and awe of amplifying a small effect into its implications for a larger population.
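For scale, the 32,000 figure is simply the adjusted risk difference multiplied by a population denominator. Back-solving makes the implied denominator explicit (it is inferred from the quote here, not taken from the paper):

```python
adjusted_risk_difference = 0.0043   # 0.43 percentage points, as reported
excess_deaths = 32_000              # the figure cited by the author

# Implied annual number of Medicare hospitalizations to which the
# risk difference is being applied
implied_hospitalizations = excess_deaths / adjusted_risk_difference
print(f"{implied_hospitalizations:,.0f}")  # roughly 7.4 million
```

The whole of the rhetorical amplification is in that multiplication: the per-patient difference is unchanged.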

In response to a journalist, the author makes a parallel argument:

The estimate that 32,000 patients’ lives could be saved in the Medicare population alone is on par with the number of deaths from vehicle crashes each year.

In addition to what I have already argued, if we know the same number of deaths are attributable to automobile crashes, we at least know how to take steps to reduce these crashes and the mortality associated with them. We don’t know how to change the mortality the authors claim is associated with physician gender. We don’t even know that the author’s claims are valid.

Searching for meaning where no meaning is to be found.

In framing the study and interpreting the results for the media, the authors undertook a search of the literature with a heavy confirmation bias, ignoring the many contradictions that a systematic search would uncover. For instance, one commentator on the senior author’s blog notes:

It took me about 5 minutes of Google searching to find a Canadian report suggesting that female physicians in that country have workloads around 75% to 80% of male physicians:

If US data is even vaguely similar, that factor would be a serious omission from your article.

But the authors were looking for what supported their results, not for studies that potentially challenged or contradicted them. They were looking to strengthen a narrative, not to expose it to refutation.

Is there a call to action here?

As consumers of health services, we could all switch to being cared for by female physicians. I suspect that some of the systemic and structural issues behind the appearance that care by male physicians is inferior would then simply be spread among female physicians, including increased workloads. The bias in male physicians’ ability to claim credit for the work of others would be redistributed to women. Neither change would improve patient mortality.

We should push for reduction in inequalities in pay related to gender. But we don’t need results of this study to encourage us.

I certainly know health care professionals and researchers who have more confidence in communication learning modules producing clinically significant changes in physician behavior. I don’t know any of them who could produce evidence that these changes include measurable reductions in patient mortality. If someone produces such data, I’m capable of being persuaded. But the present study adds nothing to my confidence in that likelihood.

If we are uncomfortable with the communication skills or attention to evidence that our personal physicians display, we should replace them. But I don’t think this study provides additional evidence for doing so, beyond the legitimacy of acting on our own preferences.

In the end, this article reminds us to stick to our standards and not be tempted to relax them to make socially acceptable points.