Lessons we need to learn from a Lancet Psychiatry study of the association between exercise and mental health

The closer we look at a heavily promoted study of exercise and mental health, the more its flaws become obvious. There is little support for the most basic claims being made – despite the authors marshaling enormous attention to the study.

Apparently, the editor of Lancet Psychiatry and reviewers did not give the study a close look before it was accepted.

The article was used to raise funds for a startup company in which one of the authors was heavily invested. This was disclosed, but doesn’t let the authors off the hook for promoting a seriously flawed study. Nor should the editor of Lancet Psychiatry or reviewers escape criticism, nor the large number of people on Twitter who thoughtlessly retweeted and “liked” a series of tweets from the last author of the study.

This blog post is intended to raise consciousness about bad science appearing in prestigious journals and to allow citizen scientists to evaluate their own critical thinking skills in terms of their ability to detect misleading and exaggerated claims.

1. Sometimes a disclosure of extensive conflicts of interest alerts us not to pay serious attention to a study. Instead, we should question why the study got published in a prestigious peer-reviewed journal when it had such an obvious risk of bias.

2. We need citizen scientists with critical thinking skills to identify such promotional efforts and alert others in their social network that hype and hokum are being delivered.

3. We need to stand up to authors who use scientific papers for commercial purposes, especially when they troll critics.

Read on and you will see what a skeptical look at the paper and its promotion revealed.

  • The study failed to capitalize on the potential of multiple years of data for developing and evaluating statistical models. Bigger is not necessarily better. Combining multiple years of data was wasteful and served only the purpose of providing the authors bragging rights and the impressive, but meaningless p-values that come from overly large samples.
  • The study relied on an unvalidated and inadequate measure of mental health that confounded recurring stressful environmental conditions in the work or home with mental health problems, even where validated measures of mental health would reveal no effects.
  • The study used an odd measure of history of mental health problems that undoubtedly exaggerated past history.
  • The study confused physical activity with (planned) exercise. The authors amplified their confusion by relying on an exceedingly odd strategy for getting an estimate of how much participants exercised: estimates of time spent in a single activity were used in analyses of total time spent exercising. All other physical activity was ignored.
  • The study made a passing acknowledgment of the problems interpreting simple associations as causal, but then went on to selectively sample the existing literature to make the case that interventions to increase exercise improve mental health.
  • Taken together, a skeptical assessment of this article provides another demonstration that disclosure of substantial financial conflicts of interest should alert readers to a high likelihood of a hyped, inaccurately reported study.
  • The article was paywalled, so anyone interested in evaluating the authors’ claims for themselves had to write to the author or have access to the article through a university library site. I am waiting for the authors to reply to my requests for the supplementary tables that are needed to make full sense of their claims. In the meantime, I’ll just complain about authors with significant conflicts of interest heavily promoting studies that they hide behind paywalls.

I welcome you to examine the author’s thread of tweets. Request the actual article from the author if you want to evaluate my claims independently. This can be great material for a master’s or honors class on critical appraisal, whether in psychology or journalism.

Let me know if you think that I’ve been too hard on this study.

A thread of tweets from the last author celebrated the success of a well-orchestrated publicity campaign for a new article concerning exercise and mental health in Lancet Psychiatry.

The thread started:

Our new @TheLancetPsych paper was the biggest ever study of exercise and mental health. it caused quite a stir! here’s my guided tour of the paper, highlighting some of our excitements and apprehensions along the way [thread] 1/n

And ended with a pitch for the author’s do-good startup company:

Where do we go from here? Over @spring_health – our mental health startup in New York City – we’re using these findings to develop personalized exercise plans. We want to help every individual feel better—faster, and understand exactly what each patient needs the most.

I wasn’t long into the thread before my skepticism was stimulated. The fourth tweet in the thread had a figure that didn’t get any comments about how bizarre it was.

The tweet

It looks like those differences mattered. for example, people who exercised for about 45 minutes seemed to have better mental health than people who exercised for less than 30, or more than 60 minutes. — a sweet spot for mental health, perhaps?

[Figure: graphs from the paper]

Apparently the author does not comment on an anomaly either. Housework appears to be better for mental health than a summary score of all exercise and looks equal to or better than cycling or jogging. But how did housework slip into the category “exercise”?

I began wondering what the authors meant by “exercise” and whether they’d given the definition serious consideration when constructing their key variable from the survey data.

But then that tweet was followed by another one that generated more confusion, with a graph that seemingly contradicted the figures in the last one:

the type of exercise people did seems important too! People doing team sports or cycling had much better mental health than other sports. But even just walking or doing household chores was better than nothing!

Then a self-congratulatory tweet for a promotional job well done.

for sure — these findings are exciting, and it has been overwhelming to see the whole world talking openly and optimistically about mental health, and how we can help people feel better. It isn’t all plain sailing though…

The author’s next tweet revealed a serious limitation to the measure of mental health used in the study in a screenshot.

[Screenshot of the tweet showing the mental health variable]

The author acknowledged the potential problem, sort of:

(1b- this might not be the end of the world. In general, most peple have a reasonable understanding of their feelings, and in depressed or anxious patients self-report evaluations are highly correlated with clinician-rated evaluations. But we could be more precise in the future)

“Not the end of the world?” Since when does an author of a paper in the Lancet family of journals so casually brush off a serious methodological issue? A lot of us who have examined the validity of mental health measures would be skeptical of this dismissal of a potentially fatal limitation.

No validation is provided for this measure. On the face of it, respondents could endorse it on the basis of facing recurring stressful situations that had no consequences for their mental health. This reflects the ambiguity of the term “stress” for both laypersons and scientists. “Stress” could variously refer to an environmental situation, a subjective experience of stress, or an adaptational outcome. Waitstaff could consider Thursday, when the chef is off, a recurrent weekly stress. Persons with diagnosable persistent depressive disorder would presumably endorse more days than not as being a mental health challenge. But they would mean something entirely different.

The author acknowledged that the association between exercise and mental health might be bidirectional in terms of causality:

[Screenshot of tweet: lots of reasons to believe the relationship goes both ways]

But then made a strong claim for increased exercise leading to better mental health.

[Screenshot of tweet: exercise increases mental health]

[Actually, as we will see, the evidence from randomized trials of exercise to improve mental health is modest, and it entirely disappears when one limits oneself to the high-quality studies.]

The author then goes off the rails with the claim that the benefits of exercise exceed the benefits of having greater than poverty-level income.

[Screenshot of tweet: why are we so excited]

I could not resist responding.

Stop comparing adjusted correlations obtained under different circumstances as if they demonstrated what would be obtained in RCT. Don’t claim exercising would have more effect than poor people getting more money.

But I didn’t get a reply from the author.

Eventually, the author got around to plugging his startup company.

I didn’t get it. Just how did this heavily promoted study advance the science of such “personalized recommendations”?

Important things I learned from others’ tweets about the study

I follow @BrendonStubbs on Twitter and you should too. Brendon often makes wise critical observations of studies that most everyone else is uncritically praising. But he also identifies some studies that I otherwise would miss and says very positive things about them.

He started his own thread of tweets about the study on a positive note, but then he identified a couple of critical issues.

First, he took issue with the author’s weak claim to have identified a tipping point, below which exercise is beneficial and above which exercise could prove detrimental to mental health.

4/some interpretations are troublesome. Most confusing, are the assumptions that higher PA is associated/worsens your MH. Would we say based on cross sect data that those taking most medication/using CBT most were making their MH worse?

A postdoctoral fellow, @joefirth7, seconded that concern:

I agree @BrendonStubbs: idea of high PA worsening mental health limited to observation studies. Except in rare cases of athletes overtraining, there’s no exp evidence of ‘tipping point’ effect. Cross-sect assocs of poor MH <–> higher PA likely due to multiple other factors…

Ouch! But then Brendon follows up with concerns that the measure of physical activity has not been adequately validated, noting that such self-report measures often prove to be invalid.

5/ one consideration not well discussed, is self report measures of PA are hopeless (particularly in ppl w mental illness). Even those designed for population level monitoring of PA https://journals.humankinetics.com/doi/abs/10.1123/jpah.6.s1.s5 … it is also not clear if this self report PA measure has been validated?

As we will soon see, the measure used in this study is quite flawed in its conceptualization and in its odd methodology of requiring participants to estimate the time spent exercising for only one activity, chosen from 75 options.

Next, Brendon points to a particular problem with using self-reported physical activity in persons with mental disorders and gives an apt reference:

6/ related to this, self report measures of PA shown to massively overestimate PA in people with mental ill health/illness – so findings of greater PA linked with mental illness likely bi-product of over-reporting of PA in people with mental illness e.g Validity and Value of Self-reported Physical Activity and Accelerometry in People With Schizophrenia: A Population-Scale Study of the UK Biobank [ https://academic.oup.com/schizophreniabulletin/advance-article/doi/10.1093/schbul/sbx149/4563831 ]

7/ An additional point he makes: anyone working in field of PA will immediately realise there is confusion & misinterpretation about the concepts of exercise & PA in the paper, which is distracting. People have been trying to prevent this happening over 30 years

Again, Brendon provides a spot-on citation clarifying the distinction between physical activity and exercise: Physical activity, exercise, and physical fitness: definitions and distinctions for health-related research.

The mysterious pseudonymous Zad Chow @dailyzad called attention to a blog post they had just uploaded. Let’s take a look at some of the key points.

Lessons from a blog post: Exercise, Mental Health, and Big Data

Zad Chow is quite balanced in dispensing praise and criticism of the Lancet Psychiatry paper. They noted the ambiguity of any causality in a cross-sectional correlation and then investigated the literature on their own.

So what does that evidence say? Meta-analyses of randomized trials seem to find that exercise has large and positive treatment effects on mental health outcomes such as depression.

Study Name           # of Randomized Trials   Effects (SMD) + Confidence Intervals
Schuch et al. 2016   25                       1.11 (95% CI, 0.79 to 1.43)
Gordon et al. 2018   33                       0.66 (95% CI, 0.48 to 0.83)
Krogh et al. 2017    35                       −0.66 (95% CI, −0.86 to −0.46)

But, when you only pool high-quality studies, the effects become tiny.

“Restricting this analysis to the four trials that seemed less affected of bias, the effect vanished into −0.11 SMD (−0.41 to 0.18; p=0.45; GRADE: low quality).” – Krogh et al. 2017

Hmm, would you have guessed this from the Lancet Psychiatry author’s thread of tweets?

Zad Chow showed the hype and untrustworthiness of the press coverage in prestigious media with a sampling of screenshots.

[Zad Chow’s screenshots of press coverage]

I personally checked and don’t see that Zad Chow’s selection of press coverage was skewed. Coverage in the media all seemed to be saying the same thing. I found the distortion continued in the Wall Street Journal, with uncritical parroting (a.k.a. churnalism) of the claims of the Lancet Psychiatry authors.

The WSJ repeated a number of the author’s claims that I’ve already thrown into question and added a curiosity:

In a secondary analysis, the researchers found that yoga and tai chi—grouped into a category called recreational sports in the original analysis—had a 22.9% reduction in poor mental-health days. (Recreational sports included everything from yoga to golf to horseback riding.)

And NHS England totally got it wrong:

[Screenshot: NHS England getting it wrong]

So, we learned that the broad category “recreational sports” covers yoga and tai chi, as well as golf and horseback riding. This raises serious questions about the lumping and splitting of categories of physical activity in the analyses that are being reported.

I needed to access the article in order to uncover some important things 

I’m grateful for the clues that I got from Twitter, and especially from Zad Chow, which I used in examining the article itself.

I got hung up on the title proclaiming that the study involved 1·2 million individuals. When I checked the article, I saw that the authors used three waves of publicly available data to get that number. Having that many participants gave them no real advantage except for bragging rights and the likelihood that modest associations could be expressed in spectacular p-values, like p<2·2 × 10⁻¹⁶. I don’t understand why the authors didn’t conduct analyses with one wave and cross-validate the results in another.
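To make the point about sample size concrete, here is a minimal sketch of how the p-value for one and the same weak correlation shrinks as the sample grows. The correlation of 0.02 and the sample sizes are made up for illustration; they are not taken from the paper. The function name is my own, and it uses a normal approximation to the t statistic, which should be reasonable at these sample sizes.

```python
import math

def correlation_p_value(r: float, n: int) -> float:
    """Two-sided p-value for a Pearson correlation r in a sample of size n.

    Uses the normal approximation to the t statistic, which is
    reasonable for samples in the hundreds or larger.
    """
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    return math.erfc(abs(t) / math.sqrt(2))

# Hypothetical correlation, chosen only for illustration.
r = 0.02
variance_explained = r ** 2  # share of variance one variable accounts for in the other

# Same tiny association, increasingly "impressive" p-values.
for n in (1_000, 400_000, 1_200_000):
    p = correlation_p_value(r, n)
    print(f"n={n:>9,}  r={r}  variance explained={variance_explained:.2%}  p={p:.3g}")
```

The association explains the same trivial share of variance in every row. Only the p-value changes, which is why a p-value from a sample of over a million says little about whether an effect matters.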

The obligatory Research in Context box made it sound like a systematic search of the literature had been undertaken. Maybe, but the authors were highly selective in what they chose to comment upon, as seen in its contradiction by the brief review of Zad Chow. The authors would have us believe that the existing literature is quite limited and inconclusive, supporting the need for a study like theirs.

[Screenshot of the Research in Context box]

Caveat lector: a strong confirmation bias is likely ahead in this article.

Questions accumulated quickly as to the appropriateness of the items available from a national survey undoubtedly constructed with other purposes. Certainly these items would not have been selected if the original investigators were interested in the research question at the center of this article.

Participants self-reported a previous diagnosis of depression or depressive episode on the basis of the following question: “Has a doctor, nurse, or other health professional EVER told you that you have a depressive disorder, including depression, major depression, dysthymia, or minor depression?”

Our own work has cast serious doubt on the correspondence of reports of a history of depression in response to a brief question embedded in a larger survey with results of a structured interview in which respondents’ answers can be probed. We found that answers to such questions were more related to current distress than to actual past diagnoses and treatment of depression. However, the survey question used in the Lancet Psychiatry study introduced further ambiguity and invalidity by adding “or minor depression.” I am not sure under what circumstances a health care professional would disclose a diagnosis of “minor depression” to a patient, but I doubt it would be in a context in which the professional felt treatment was needed.

Despite the skepticism that I was developing about the usefulness of the survey data, I was unprepared for the assessment of “exercise.”

“Other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise?” Participants who answered yes to this question were then asked: “What type of physical activity or exercise did you spend the most time doing during the past month?” A total of 75 types of exercise were represented in the sample, which were grouped manually into eight exercise categories to balance a diverse representation of exercises with the need for meaningful cell sizes (appendix).

Participants indicated the number of times per week or month that they did this exercise and the number of minutes or hours that they usually spend exercising in this way each time.

I had already been tipped off by the discussion on Twitter that there would be a thorough confusion of planned exercise and mere physical activity. But now that was compounded. Why was physical activity during employment excluded? What if participants were engaged in a number of different physical activities, like both jogging and bicycling? If so, the survey obtained data for only one of these activities, with the other excluded, and the choice of which one the participant identified as the one to be counted could have been quite arbitrary.

Anyone who has ever constructed surveys would be alert to the problems posed by participants’ awareness that saying “yes” to exercising would require contemplating 75 different options and arbitrarily choosing one of them for a further question about how much time they spent on this activity. Unless participants were strongly motivated, there was an incentive to simply say no, they didn’t exercise.
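To see how much a single-activity question can miss, here is a small sketch with an entirely hypothetical respondent. The activities, session counts, and minutes are invented for illustration and are not from the survey data.

```python
# Hypothetical weekly routine for one respondent:
# activity -> (sessions per week, minutes per session).
routine = {
    "jogging": (3, 30),
    "cycling": (2, 45),
    "gardening": (1, 60),
}

def weekly_minutes(sessions: int, minutes: int) -> int:
    """Total minutes per week spent on one activity."""
    return sessions * minutes

totals = {name: weekly_minutes(*spec) for name, spec in routine.items()}

# What the respondent actually did across all activities.
actual_total = sum(totals.values())

# What a "most time" question captures: a single activity. Jogging and
# cycling tie here, so which one gets reported is arbitrary; max() simply
# returns the first of the tied entries.
reported_activity = max(totals, key=totals.get)
captured_total = totals[reported_activity]

print(f"actual: {actual_total} min/week")
print(f"captured: {captured_total} min/week from {reported_activity!r}")
print(f"share captured: {captured_total / actual_total:.0%}")
```

In this made-up case the survey would record well under half of the respondent's activity, and a tie between two activities decides which "type of exercise" the person is filed under.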

I suppose I could go on, but it was my judgment that any validity to what the authors were claiming had been ruled out. As someone once said on an NIH grant review panel, there are no vital signs left, let’s move on to the next item.

But let’s refocus just a bit on the overall intention of these authors. They want to use a large data set to make statements about the association between physical activity and a measure of mental health. They have used matching and statistical controls to equate participants. But that strategy effectively eliminates consideration of crucial contextual variables. Persons’ preferences and opportunities to exercise are powerfully shaped by their personal and social circumstances, including finances and competing demands on their time. Said differently, people are embedded in contexts that a lot of statistical maneuvering has sought to eliminate.

To suggest a small number of the many complexities: how much physical activity participants get in their employment may be an important determinant of choices for additional activity, as may how much time is left outside of work. If work typically involves a lot of physical exertion, people may simply be left too tired for additional planned physical activity, a.k.a. exercise, and their physical health may require it less. Environments differ greatly in terms of the opportunities for and the safety of engaging in various kinds of physical activities. Team sports require other people being available. Etc., etc.

What I learned from the editorial accompanying the Lancet Psychiatry article

The brief editorial accompanying the article aroused my curiosity as to whether someone assigned to reading and commenting on this article would catch things that the editor and reviewers apparently missed.

Editorial commentators are chosen to praise, not to bury articles. There are strong social pressures to say nice things. However, this editorial leaked a number of serious concerns.

First

In presenting mental health as a workable, unified concept, there is a presupposition that it is possible and appropriate to combine all the various mental disorders as a single entity in pursuing this research. It is difficult to see the justification for this approach when these conditions differ greatly in their underlying causes, clinical presentation, and treatment. Dementia, substance misuse, and personality disorder, for example, are considered as distinct entities for research and clinical purposes; capturing them for study under the combined banner of mental health might not add a great deal to our understanding.

The problem here of categorisation is somewhat compounded by the repeated uncomfortable interchangeability between mental health and depression, as if these concepts were functionally equivalent, or as if other mental disorders were somewhat peripheral.

Then:

A final caution pertains to how studies approach a definition of exercise. In the current study, we see the inclusion of activities such as childcare, housework, lawn-mowing, carpentry, fishing, and yoga as forms of exercise. In other studies, these activities would be excluded for not fulfilling the definition of exercise as offered by the American College of Sports Medicine: “planned, structured and repetitive bodily movement done to improve or maintain one or more components of physical fitness.” 11 The study by Chekroud and colleagues, in its all-encompassing approach, might more accurately be considered a study in physical activity rather than exercise.

The authors were listening for a theme song with which they could promote their startup company in a very noisy data set. They thought they had a hit. I think they had noise.

The authors’ extraordinary disclosure of interests (see below this blog post) should have precluded publication of this seriously flawed piece of work, either simply because of the high likelihood of bias or because it prompted the editor and reviewers to look more carefully at the serious flaws hiding in plain sight.

Postscript: Send in the trolls.

On Twitter, Adam Chekroud announced he felt no need to respond to critics. Instead, he retweeted and “liked” trolling comments directed at critics from the twitter accounts of his brother, his mother, and even the official Twitter account of a local fried chicken joint @chickenlodge, that offered free food for retweets and suggested including Adam Chekroud’s twitter handle if you wanted to be noticed.

[Screenshot of the @chickenlodge tweet]

Really, Adam, if you can’t stand the heat, don’t go near where they are frying chicken.

The Declaration of Interests from the article.

[Screenshot: Declaration of Interests, part 1]

[Screenshot: Declaration of Interests, part 2]

 

Headspace mindfulness training app no better than a fake mindfulness procedure for improving critical thinking, open-mindedness, and well-being.

The Headspace app increased users’ critical thinking and open-mindedness. So did practicing a sham mindfulness procedure: participants simply sat with their eyes closed, but thought they were meditating.

Results call into question claims about Headspace coming from other studies that did not have such a credible, active control group comparison.

Results also call into question the widespread use of standardized self-report measures of mindfulness to establish whether someone is in the state of mindfulness. These measures don’t distinguish between the practice of standard versus fake mindfulness.

Results can be seen as further evidence that practicing mindfulness depends on nonspecific factors (AKA placebo), rather than any active, distinctive ingredient.

Hopefully this study will prompt better studies evaluating the Headspace App, as well as evaluations of mindfulness training more generally, using credible active treatments, rather than no treatment or waitlist controls.

Maybe it is time for a moratorium on trials of mindfulness without such an active control, or at least a tempering of claims based on poorly controlled trials.

This study points to the need for development of more psychometrically sophisticated measures of mindfulness that are not so vulnerable to experimenter expectations and demand characteristics.

Until the accumulation of better studies with better measures, claims about the effects of practicing mindfulness ought to be recognized as based on relatively weak evidence.

The study

Noone, C. & Hogan, M. Randomised active-controlled trial of effects of online mindfulness intervention on executive control, critical thinking and key thinking dispositions. BMC Psychology, 2018.

Trial registration

The study was initially registered in the AEA Social Science Registry before the recruitment was initiated (RCT ID: AEARCTR-0000756; 14/11/2015) and retrospectively registered in the ISRCTN registry (RCT ID: ISRCTN16588423) in line with requirements for publishing the study protocol.

Excerpts from the Abstract

The aim of this study was…investigating the effects of an online mindfulness intervention on executive function, critical thinking skills, and associated thinking dispositions.

Method

Participants recruited from a university were randomly allocated, following screening, to either a mindfulness meditation group or a sham meditation group. Both the researchers and the participants were blind to group allocation. The intervention content for both groups was delivered through the Headspace online application, an application which provides guided meditations to users.

And

Primary outcome measures assessed mindfulness, executive functioning, critical thinking, actively open-minded thinking, and need for cognition. Secondary outcome measures assessed wellbeing, positive and negative affect, and real-world outcomes.

Results

Significant increases in mindfulness dispositions and critical thinking scores were observed in both the mindfulness meditation and sham meditation groups. However, no significant effects of group allocation were observed for either primary or secondary measures. Furthermore, mediation analyses testing the indirect effect of group allocation through executive functioning performance did not reveal a significant result and moderation analyses showed that the effect of the intervention did not depend on baseline levels of the key thinking dispositions, actively open-minded thinking, and need for cognition.

The authors conclude

While further research is warranted, claims regarding the benefits of mindfulness practice for critical thinking should be tempered in the meantime.

The active control condition

The sham treatment control condition was embarrassingly straightforward and simple. But as we will see, participants found it credible.

This condition presented the participants with guided breathing exercises. Each session began by inviting the participants to sit with their eyes closed. These exercises were referred to as meditation but participants were not given guidance on how to control their awareness of their body or breath. This approach was designed to control for the effects of expectations surrounding mindfulness and physiological relaxation to ensure that the effect size could be attributed to mindfulness practice specifically. This content was also delivered by Andy Puddicombe and was developed based on previous work by Zeidan and colleagues [55, 57, 58].

What can we conclude about the standard self-report measures of the state of mindfulness?

The study used the Five Facet Mindfulness Questionnaire, which is widely used to assess whether people are in a state of mindfulness. It has been cited almost 4000 times.

Participants assigned to the mindfulness condition had significant changes for all five facets from baseline to follow-up: observing, non-reactivity, non-judgment, acting with awareness, and describing. In the absence of a comparison with change in the sham mindfulness group, these pre-post results would seem to suggest that the measure was sensitive to whether participants had practiced mindfulness. However, there were no differences between the changes observed for the participants assigned to mindfulness and those for participants who were simply asked to sit with their eyes closed.
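The logic of that comparison can be sketched with a toy example. The gain scores below are made up for illustration and are NOT data from the trial; the two helper functions are my own names for a standard one-sample t statistic on gains and a Welch t statistic between groups.

```python
import math
import statistics

# Made-up gain scores (follow-up minus baseline); NOT data from the trial.
mindfulness_gains = [3, 5, 4, 6, 2, 4, 5, 3]
sham_gains = [4, 3, 5, 5, 3, 4, 6, 2]

def paired_t(gains):
    """t statistic for testing whether the mean pre-post gain differs from zero."""
    se = statistics.stdev(gains) / math.sqrt(len(gains))
    return statistics.mean(gains) / se

def welch_t(a, b):
    """Welch t statistic comparing mean gains between two independent groups."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

# Each group on its own looks like a clear pre-post improvement...
print("pre-post t, mindfulness:", round(paired_t(mindfulness_gains), 2))
print("pre-post t, sham:", round(paired_t(sham_gains), 2))

# ...but the groups do not differ from each other at all.
print("between-group t:", round(welch_t(mindfulness_gains, sham_gains), 2))
```

A study with only the first group, or with a waitlist control that does not improve, would report the pre-post gain as an effect of mindfulness. Only the between-group comparison against a credible sham shows that nothing specific to mindfulness has been demonstrated.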

I asked Chris Noone about the questionnaires his group used to assess mindfulness:

The participants genuinely thought they were meditating in the sham condition so I think both non-specific and demand characteristics were roughly equivalent across both groups. I’m also skeptical regarding the ability of the Five-Facet Mindfulness Questionnaire (or any mindfulness questionnaire for that matter) to capture anything other than “perceived mindfulness”. The items used in these questionnaires feature similar content to the scripts used by the people delivering the mindfulness (and sham) guided meditations. The improvement in critical thinking across both groups is just a mix of learning across a semester and habituation to the task (as the same problems were posed at both measurements).

What I like about this trial

The trial provides a critical test of a key claim for mindfulness:

Mindfulness should facilitate critical thinking in higher-education, based on early Buddhist conceptualizations of mindfulness as clarity of thought.

The trial was registered before recruitment and departures from protocol were noted.

Sample size was determined by power analysis.

The study had a closely matched, active control condition, a sham mindfulness treatment.

The credibility and equivalence of this sham condition versus the active treatment under study was repeatedly assessed.

“Manipulation checks were carried out to assess intervention acceptability, technology acceptance and meditation quality 2 weeks after baseline and 4 weeks after baseline.”

The study tested some a priori hypotheses about mediation and moderation.

Analyses were intention to treat.

How the study conflicts with past studies

Previous studies claimed to show positive effects of mindfulness on aspects of executive functioning [25, 26].

How the contradiction of past studies by these results is resolved

 “There are many studies using guided meditations similar to those in our mindfulness meditation condition, delivered through smartphone applications [49, 50, 52, 90, 91], websites [92, 93, 94, 95, 96, 97] and CDs [98, 99], which show effects on measures of outcomes reliably associated with increases in mindfulness such as depression, anxiety, stress, wellbeing and compassion. There are two things to note about these studies – they tend not to include a measure of dispositional mindfulness (e.g. only 4% of all mindfulness intervention studies reviewed in a recent meta-analysis include such measures at baseline and follow-up; [54]) and they usually employ a weak form of control group such as a no-treatment control or waitlist control [54]. Therefore, even when change in mindfulness is assessed in mindfulness meditation intervention studies, it is usually overestimated and this must be borne in mind when comparing the results of this study with those of previous studies. This combined with generally only moderate correlations with behavioural outcomes [54] suggests that when mindfulness interventions are effective, dispositional measures do not fully capture what has changed.”

The broader take away messages

“Our results show that, for most outcomes, there were significant changes from baseline to follow-up but none which can be specifically attributed to the practice of mindfulness.”

This creative use of a sham mindfulness control condition is a breakthrough that should be widely followed. First, it allowed a fair test of whether mindfulness is any better than another active, credible treatment. Second, because the active treatment was a sham, results provide a challenge to the notion that apparent effects of mindfulness on critical thinking are anything more than a placebo effect.

The Headspace App is enormously popular and successful, based on claims about what benefits its use will provide. Some of these claims may need to be tempered, in terms not only of critical thinking but also of effects on well-being.

The Headspace App platform lends itself to such critical evaluations with respect to a sham treatment with a degree of standardization that is not readily possible with face-to-face mindfulness training. This opportunity should be exploited further with other active control groups constructed on the basis of specific hypotheses.

There is far too much research on the practice of mindfulness being done that does not advance understanding of what works or how it works. We need a lot fewer studies, and more with adequate control/comparison groups.

Perhaps we should have a moratorium on evaluations of mindfulness without adequate control groups.

Perhaps articles that make enthusiastic claims to their audiences for the benefits of mindfulness should routinely note whether these claims are based on adequately controlled studies. Most are not.

When psychotherapy trials have multiple flaws…

Multiple flaws pose more threats to the validity of psychotherapy studies than would be inferred when the individual flaws are considered independently.

We can learn to spot features of psychotherapy trials that are likely to lead to exaggerated claims of efficacy for treatments or claims that will not generalize beyond the sample that is being studied in a particular clinical trial. We can look to the adequacy of sample size, and spot what the Cochrane Collaboration has defined as risk of bias in its handy assessment tool.

We can look at the case-mix in the particular sites where patients were recruited. We can examine the adequacy of diagnostic criteria that were used for entering patients into a trial. We can examine how blinded the trial was, in terms not only of whoever assigned patients to particular conditions, but also of what the patients, the treatment providers, and their evaluators knew about which condition particular patients were assigned to.

And so on. But what about combinations of these factors?

We typically do not pay enough attention to multiple flaws in the same trial. I include myself among the guilty. We may suspect that flaws are seldom simply additive in their effect, but we don’t consider whether there may even be synergism in their negative effects on the validity of a trial. As we will see in this analysis of a clinical trial, multiple flaws can pose more threats to the validity of a trial than what we might infer when the individual flaws are considered independently.

The particular paper we are probing is described in its discussion section as the “largest RCT to date testing the efficacy of group CBT for patients with CFS.” It also takes on added importance because two of the authors, Gijs Bleijenberg and Hans Knoop, are considered leading experts in the Netherlands. The treatment protocol was developed over time by the Dutch Expert Centre for Chronic Fatigue (NKCV, http://www.nkcv.nl; Knoop and Bleijenberg, 2010). Moreover, these senior authors dismiss any criticism and even ridicule critics. This study is cited as support for their overall assessment of their own work.  Gijs Bleijenberg claims:

Cognitive behavioural therapy is still an effective treatment, even the preferential treatment for chronic fatigue syndrome.

But

Not everybody endorses these conclusions, however their objections are mostly baseless.

Spoiler alert

This is a long read blog post. I will offer a summary for those who don’t want to read through it, but who still want the gist of what I will be saying. However, as always, I encourage readers to be skeptical of what I say and to look to my evidence and arguments and decide for themselves.

Authors of this trial stacked the deck to demonstrate that their treatment is effective. They are striving to support the extraordinary claim that group cognitive behavior therapy fosters not only better adaptation, but actually recovery from what is internationally considered a physical condition.

There are some obvious features of the study that contribute to the likelihood of a positive effect, but these features need to be considered collectively, in combination, to appreciate the strength of this effort to guarantee positive results.

This study represents the perfect storm of design features that operate synergistically:

Referral bias – Trial conducted in a single specialized treatment setting known for advocating the view that psychological factors maintain physical illness.

Strong self-selection bias of a minority of patients enrolling in the trial seeking a treatment they otherwise cannot get.

Broad, overinclusive diagnostic criteria for entry into the trial.

Active treatment condition carries a strong message that patients should respond to outcome assessments by reporting improvement.

An unblinded trial with a waitlist control lacking the nonspecific elements (placebo) that confound the active treatment.

Subjective self-report outcomes.

Specifying a clinically significant improvement that required only that a primary outcome be less than the score needed for entry into the trial.

Deliberate exclusion of relevant objective outcomes.

Avoidance of any recording of negative effects.

Despite the prestige attached to this trial in Europe, the US Agency for Healthcare Research and Quality (AHRQ) excludes this trial from providing evidence for its database of treatments for chronic fatigue syndrome/myalgic encephalomyelitis. We will see why in this post.

The take away message: Although not many psychotherapy trials incorporate all of these factors, most trials have some. We should be more sensitive to when multiple factors occur in the same trial, like bias in the site for patient recruitment; lack of blinding; lack of balance between active treatment and control condition in terms of nonspecific factors; and subjective self-report measures.

The article reporting the trial is

Wiborg JF, van Bussel J, van Dijk A, Bleijenberg G, Knoop H. Randomised controlled trial of cognitive behaviour therapy delivered in groups of patients with chronic fatigue syndrome. Psychotherapy and Psychosomatics. 2015;84(6):368-76.

Unfortunately, the article is currently behind a pay wall. Perhaps readers could contact the corresponding author Hans.knoop@radboudumc.nl  and request a PDF.

The abstract

Background: Meta-analyses have been inconclusive about the efficacy of cognitive behaviour therapies (CBTs) delivered in groups of patients with chronic fatigue syndrome (CFS) due to a lack of adequate studies. Methods: We conducted a pragmatic randomised controlled trial with 204 adult CFS patients from our routine clinical practice who were willing to receive group therapy. Patients were equally allocated to therapy groups of 8 patients and 2 therapists, 4 patients and 1 therapist or a waiting list control condition. Primary analysis was based on the intention-to-treat principle and compared the intervention group (n = 136) with the waiting list condition (n = 68). The study was open label. Results: Thirty-four (17%) patients were lost to follow-up during the course of the trial. Missing data were imputed using mean proportions of improvement based on the outcome scores of similar patients with a second assessment. Large and significant improvement in favour of the intervention group was found on fatigue severity (effect size = 1.1) and overall impairment (effect size = 0.9) at the second assessment. Physical functioning and psychological distress improved moderately (effect size = 0.5). Treatment effects remained significant in sensitivity and per-protocol analyses. Subgroup analysis revealed that the effects of the intervention also remained significant when both group sizes (i.e. 4 and 8 patients) were compared separately with the waiting list condition. Conclusions: CBT can be effectively delivered in groups of CFS patients. Group size does not seem to affect the general efficacy of the intervention which is of importance for settings in which large treatment groups are not feasible due to limited referral

The trial registration

http://www.isrctn.com/ISRCTN15823716

Who was enrolled into the trial?

Who gets into a psychotherapy trial is a function of the particular treatment setting of the study, the diagnostic criteria for entry, and patient preferences for getting their care through a trial, rather than what is being routinely provided in that setting.

We need to pay particular attention to when patients enter psychotherapy trials hoping they will receive a treatment they prefer and will not be assigned to the other condition. Patients may be in a clinical trial for the betterment of science, but in some settings, they are willing to enroll because of a probability of getting treatment they otherwise could not get. This in turn affects their evaluation both of the condition in which they get the preferred treatment and of the condition in which they are denied it. Simply put, they register being pleased with what they wanted or not being pleased if they did not get what they wanted.

The setting is relevant to evaluating who was enrolled in a trial.

The authors’ own outpatient clinic at the Radboud University Medical Center was the site of the study. The group has an international reputation for promoting the biopsychosocial model, in which psychological factors are assumed to be the decisive factor in maintaining somatic complaints.

All patients were referred to our outpatient clinic for the management of chronic fatigue.

There is thus a clear referral bias or case-mix bias, but we are not provided a ready basis for quantifying it or even estimating its effects.

The diagnostic criteria.

The article states:

In accordance with the US Center for Disease Control [9], CFS was defined as severe and unexplained fatigue which lasts for at least 6 months and which is accompanied by substantial impairment in functioning and 4 or more additional complaints such as pain or concentration problems.

Actually, the US Center for Disease Control would now reject this trial because these entry criteria are considered obsolete, overinclusive, and not sufficiently exclusive of other conditions that might be associated with chronic fatigue.*

There is a real paradigm shift happening in America. Both the 2015 IOM Report and the Centers for Disease Control and Prevention (CDC) website emphasize Post Exertional Malaise and getting more ill after any effort with M.E. CBT is no longer recommended by the CDC as treatment.

The only mandatory symptom for inclusion in this study is fatigue lasting 6 months. Most properly, this trial targets chronic fatigue [period] and not the condition, chronic fatigue syndrome.

Current US CDC recommendations (see box 7-1 from the IoM document, above) for diagnosis require postexertional malaise for a diagnosis of myalgic encephalomyelitis (ME). See below.

Patients meeting the current American criteria for ME would be eligible for enrollment in this trial, but it’s unclear what proportion of the patients enrolled actually met the American criteria. Because of the over-inclusiveness of the entry diagnostic criteria, it is doubtful whether the results would generalize to an American sample. A look at patient flow into the study will be informative.

Patient flow

Let’s look at what is said in the text, but also in the chart depicting patient flow into the trial for any self-selection that might be revealed.

In total, 485 adult patients were diagnosed with CFS during the inclusion period at our clinic (fig. 1). One hundred and fifty-seven patients were excluded from the trial because they declined treatment at our clinic, were already asked to participate in research incompatible with inclusion (e.g. research focusing on individual CBT for CFS) or had a clinical reason for exclusion (i.e. they received specifically tailored interventions because they were already unsuccessfully treated with individual CBT for CFS outside our clinic or were between 18 and 21 years of age and the family had to be involved in the therapy). Of the 328 patients who were asked to engage in group therapy, 99 (30%) patients indicated that they were unwilling to receive group therapy. In 25 patients, the reason for refusal was not recorded. Two hundred and four patients were randomly allocated to one of the three trial conditions. Baseline characteristics of the study sample are presented in table 1. In total, 34 (17%) patients were lost to follow-up. Of the remaining 170 patients, 1 patient had incomplete primary outcome data and 6 patients had incomplete secondary outcome data.

We see that the investigators invited two thirds of the patients attending the clinic to enroll in the trial. Of these, 38% declined (124 of 328). We don’t know the reason for some of the refusals, but almost a third of the patients approached declined because they did not want group therapy. The authors were left able to randomize 42% of the patients coming to the clinic, or less than two thirds of the patients they actually asked. Of these patients, a little more than two thirds received the treatment to which they were randomized and were available for follow-up.

These patients, who received the treatment to which they were randomized and were available for follow-up, are a self-selected minority of the patients coming to the clinic. This self-selection process likely reduced the proportion of patients with myalgic encephalomyelitis. It is estimated that 25% of patients meeting the American criteria are housebound and 75% are unable to work. It’s reasonable to infer that patients meeting the full criteria would opt out of a treatment that requires regular attendance at group sessions.

The trial is biased toward ambulatory patients with fatigue and not ME. Their fatigue is likely due to some combination of factors such as multiple co-morbidities, as-yet-undiagnosed medical conditions, drug interactions, and the common mild and subsyndromal anxiety and depressive symptoms that characterize primary care populations.
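For readers who want to check the patient-flow proportions themselves, here is a minimal sketch that uses only the counts given in the passage quoted from the paper. The variable names and the `pct` helper are mine, and the flow chart itself is not reproduced, so any category that appears only in the chart is left out.

```python
# Counts taken from the passage quoted from the paper (not from the flow chart).
diagnosed = 485    # adults diagnosed with CFS during the inclusion period
excluded = 157     # excluded before being asked to join group therapy
unwilling = 99     # asked, but unwilling to receive group therapy
no_reason = 25     # asked, declined, reason not recorded
randomized = 204   # allocated to one of the three trial conditions
lost = 34          # lost to follow-up

asked = diagnosed - excluded        # 328
declined = unwilling + no_reason    # 124
# The quoted counts are internally consistent.
assert asked - declined == randomized

def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

flow = {
    "asked / diagnosed": pct(asked, diagnosed),
    "declined / asked": pct(declined, asked),
    "unwilling to do group therapy / asked": pct(unwilling, asked),
    "randomized / diagnosed": pct(randomized, diagnosed),
    "randomized / asked": pct(randomized, asked),
    "second assessment / randomized": pct(randomized - lost, randomized),
    "second assessment / diagnosed": pct(randomized - lost, diagnosed),
}

for label, value in flow.items():
    print(f"{label}: {value}%")
```

The last two rows count patients who had a second assessment, which is not the same thing as having received the allocated treatment.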

The treatment being evaluated

Group cognitive behavior therapy for chronic fatigue syndrome, either delivered in a small (4 patients and 1 therapist) or larger (8 patients and 2 therapists) group format.

The intervention consisted of 14 group sessions of 2 h within a period of 6 months followed by a second assessment. Before the intervention started, patients were introduced to their group therapist in an individual session. The intervention was based on previous work of our research group [4,13] and included personal goal setting, fixing sleep-wake cycles, reducing the focus on bodily symptoms, a systematic challenge of fatigue-related beliefs, regulation and gradual increase in activities, and accomplishment of personal goals. A formal exercise programme was not part of the intervention.

Patients received a workbook with the content of the therapy. During sessions, patients were explicitly invited to give feedback about fatigue-related cognitions and behaviours to fellow patients. This aspect was introduced to facilitate a pro-active attitude and to avoid misperceptions of the sessions as support group meetings which have been shown to be insufficient for the treatment of CFS.

And note:

In contrast to our previous work [4], we communicated recovery in terms of fatigue and disabilities as general goal of the intervention.

Some impressions of the intensity of this treatment. This is a rather intensive treatment with patients having considerable opportunities for interactions with providers. This factor alone distinguishes being assigned to the intervention group versus being left in the wait list control group and could prove powerful. It will be difficult to distinguish intensity of contact from any content or active ingredients of the therapy.

I’ll leave for another time a fuller discussion of the extent to which what was labeled as cognitive behavior therapy in this study is consistent with cognitive therapy as practiced by Aaron Beck and other leaders of the field. However, a few comments are warranted. What is offered in this trial does not sound like cognitive therapy as Americans practice it. What is offered in this trial seems to emphasize challenging beliefs and pushing patients to get more active, along with psychoeducational activities. I don’t see indications of the supportive, collaborative relationship in which patients are encouraged to work on what they want to work on, engage in outside activities (homework assignments) and get feedback.

What is missing in this treatment is what Beck calls collaborative empiricism, “a systemic process of therapist and patient working together to establish common goals in treatment, has been found to be one of the primary change agents in cognitive-behavioral therapy (CBT).”

Importantly, in Beck’s approach, the therapist does not assume cognitive distortions on the part of the patient. Rather, in collaboration with the patient, the therapist introduces alternatives to the interpretations that the patient has been making and encourages the patient to consider the difference. In contrast, rather than eliciting goal statements from patients, therapists in this study impose the goal of increased activity. Therapists in this study also seem ready to impose their views that the patients’ fatigue-related beliefs are maladaptive.

The treatment offered in this trial is complex, with multiple components making multiple assumptions that seem quite different from what is called cognitive therapy or cognitive behavioral therapy in the US.

The authors’ communication of recovery from fatigue and disability seems a radical departure not only from cognitive behavior therapy for anxiety and depression and pain, but also from cognitive behavior therapy offered for adaptation to acute and chronic physical illnesses. We will return to this “communication” later.

The control group

Patients not randomized to group CBT were placed on a waiting list.

Think about it! What do patients think about having gotten involved in all the inconvenience and burden of a clinical trial in hope that they would get treatment and then being assigned to a control group that just waits? Not only are they going to be disappointed and register that in their subjective responses to the outcome assessments, but they may also worry about jeopardizing their right to the treatment they are waiting for if they overly endorse positive outcomes. There is a potential for a nocebo effect, compounding the placebo effect of assignment to the CBT active treatment groups.

What are informative comparisons between active treatments and control conditions?

We need to ask more often what inclusion of a control group accomplishes for the evaluation of a psychotherapy. In doing so, we need to keep in mind that psychotherapies do not have effect sizes; only comparisons of psychotherapies and control conditions have effect sizes.

A pre-post evaluation of psychotherapy from baseline to follow-up includes the effects of any active ingredient in the psychotherapy, a host of nonspecific (placebo) factors, and any changes that would’ve occurred in the absence of the intervention. These include regression to the mean: patients are more likely to enter a clinical trial now, rather than later or previously, if there has been exacerbation of their symptoms.

So, a proper comparison/control condition includes everything that the patients randomized to the intervention group get except for the active treatment. Ideally, the intervention and the comparison/control group are equivalent on all these factors, except the active ingredient of the intervention.

That is clearly not what is happening in this trial. Patients randomized to the intervention group get the intervention, the added intensity and frequency of contact with professionals that the intervention provides, all the support that goes with it, and the positive expectations that come with getting a therapy that they wanted.

Attempts to evaluate the group CBT versus the wait-list control group involved confounding the active ingredients of the CBT and all these nonspecific effects. The deck is clearly being stacked in favor of CBT.
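To make the confounding concrete, here is a sketch with entirely hypothetical numbers; none of them come from the trial. It assumes, for illustration only, that mean improvement in an arm is a simple sum of the components that arm receives, and it adds a hypothetical contact-matched support group as a third arm to show what a control for nonspecific factors would isolate.

```python
# All numbers are hypothetical and chosen only to illustrate the logic.
# Assumption: mean improvement in an arm is the sum of the components it receives.
COMPONENTS = {
    "natural_change": 3.0,     # regression to the mean, spontaneous change
    "nonspecific": 5.0,        # attention, support, positive expectations
    "active_ingredient": 2.0,  # whatever is specific to the therapy
}

ARMS = {
    "group_cbt": ["natural_change", "nonspecific", "active_ingredient"],
    "waitlist": ["natural_change"],
    "support_group": ["natural_change", "nonspecific"],  # hypothetical matched control
}

def mean_improvement(arm):
    """Sum the hypothetical components that an arm receives."""
    return sum(COMPONENTS[name] for name in ARMS[arm])

def contrast(treated, control):
    """Between-group difference in mean improvement."""
    return mean_improvement(treated) - mean_improvement(control)

vs_waitlist = contrast("group_cbt", "waitlist")
vs_support = contrast("group_cbt", "support_group")

print("pre-post change in CBT arm:", mean_improvement("group_cbt"))
print("CBT vs waitlist:", vs_waitlist)
print("CBT vs support group:", vs_support)
```

Under these made-up values, the waitlist contrast credits the therapy with both the nonspecific and the specific components, while the matched contrast leaves only the specific one. A nocebo effect in the waitlist arm, which this sketch does not model, would widen the first contrast further.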

This may be a randomized trial, but properly speaking, this is not a randomized controlled trial, because the comparison group does not control for nonspecific factors, which are imbalanced.

The unblinded nature of the trial

In RCTs of psychotropic drugs, the ideal is to compare the psychotropic drug to an inert pill placebo, with providers, patients, and evaluators being blinded as to whether the patients received the psychotropic drug or the comparison pill.

While it is difficult to achieve a comparable level of blindness in a psychotherapy trial, more of an effort to achieve blindness is desirable. For instance, in this trial, the authors took pains to distinguish the CBT from what would’ve happened in a support group. A much more adequate comparison would therefore be CBT versus either a professional or peer-led support group with equivalent amounts of contact time. Further blinding would be possible if patients were told only two forms of group therapy were being compared. If that was the information available to patients contemplating consenting to the trial, it wouldn’t have been so obvious from the outset to the patients being randomly assigned that one group was preferable to the other.

Subjective self-report outcomes.

The primary outcomes for the trial were the fatigue subscale of the Checklist Individual Strength; the physical functioning subscale of the Short Form Health Survey 36 (SF-36); and overall impairment as measured by the Sickness Impact Profile (SIP).

Realistically, self-report outcomes are often all that is available in many psychotherapy trials. Commonly these are self-report assessments of anxiety and depressive symptoms, although these may be supplemented by interviewer-based assessments. We don’t have objective biomarkers with which to evaluate psychotherapy.

These three self-report measures are relatively nonspecific, particularly in a population that is not characterized by ME. Self-reported fatigue in a primary care population lacks discriminative validity with respect to pain, anxiety and depressive symptoms, and general demoralization.  The measures are susceptible to receipt of support and re-moralization, as well as gratitude for obtaining a treatment that was sought.

Self-report entry criteria include a score of 35 or higher on the fatigue severity subscale. Yet, a score of less than 35 on this scale at follow up is part of what is defined as a clinically significant improvement with a composite score from combined self-report measures.
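The overlap between the entry criterion and the fatigue part of the improvement criterion can be illustrated with a short sketch. Only the cutoff of 35 comes from the text; the function names and the example scores are hypothetical, and the other components of the trial's composite definition are not modelled.

```python
CUTOFF = 35  # fatigue severity cutoff reported in the text

def eligible(baseline_score):
    """Entry criterion on the fatigue subscale: 35 or higher."""
    return baseline_score >= CUTOFF

def meets_fatigue_component(followup_score):
    """Fatigue part of the improvement definition: below 35 at follow-up."""
    return followup_score < CUTOFF

# Hypothetical patients: (baseline score, follow-up score).
patients = {"A": (35, 34), "B": (50, 36), "C": (50, 34)}

for name, (baseline, followup) in patients.items():
    change = baseline - followup
    print(name, eligible(baseline), change, meets_fatigue_component(followup))
```

In this illustration, patient A changes by a single point and satisfies the fatigue component, while patient B changes by 14 points and does not.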

We know from medical trials that differences can be observed with subjective self-report measures that will not be found with objective measures. Thus, mildly asthmatic patients will fail to distinguish in their subjective self-reports between the effective inhalant albuterol, an inert inhalant, and sham acupuncture, but will rate their improvement with any of these as better than with no intervention. However, albuterol will show a strong advantage over the other three conditions on an objective measure, maximum forced expiratory volume in 1 second (FEV1), as assessed with spirometry.

The suppression of objective outcome measures

We cannot let the authors of this trial off the hook for their dependence on subjective self-report outcomes. They are instructing patients that recovery is the goal, which implies that it is an attainable goal. We can reasonably be skeptical about a claim of recovery based on changes in self-report measures. Were the patients actually able to exercise? What was their exercise capacity, as objectively measured? Did they return to work?

These authors have included such objective measurements in past studies, but have not included them as primary outcomes, nor, in some cases, even reported them in the main paper reporting the trial.

Wiborg JF, Knoop H, Stulemeijer M, Prins JB, Bleijenberg G. How does cognitive behaviour therapy reduce fatigue in patients with chronic fatigue syndrome? The role of physical activity. Psychol Med. 2010 Jan 5:1

The senior authors’ review fails to mention their three studies using actigraphy that did not find effects for CBT. I am unaware of any studies that did find enduring effects.

Perhaps this is what they mean when they say the protocol has been developed over time – they removed what they found to be threats to the findings that they wanted to claim.

Dismissing of any need to consider negative effects of treatment

Most psychotherapy trials fail to assess any adverse effects of treatment, but this is usually done discreetly, without mention. In contrast, this article states

Potential harms of the intervention were not assessed. Previous research has shown that cognitive behavioural interventions for CFS are safe and unlikely to produce detrimental effects.

Patients who meet stringent criteria for ME would be put at risk by pressure to exert themselves. By definition they are vulnerable to postexertional malaise (PEM). Any trial of this nature needs to assess that risk. Maybe no adverse effects would be found. If that were so, it would strongly indicate the absence of patients with appropriate diagnoses.

Timing of assessment of outcomes varied between intervention and control group.

I at first did not believe what I was reading when I encountered this statement in the results section.

The mean time between baseline and second assessment was 6.2 months (SD = 0.9) in the control condition and 12.0 months (SD = 2.4) in the intervention group. This difference in assessment duration was significant (p < 0.001) and was mainly due to the fact that the start of the therapy groups had to be frequently postponed because of an irregular patient flow and limited treatment capacities for group therapy at our clinic. In accordance with the treatment manual, the second assessment was postponed until the fourteenth group session was accomplished. The mean time between the last group session and the second assessment was 3.3 weeks (SD = 3.5).

So, outcomes were assessed for the intervention group shortly after completion of therapy, when nonspecific (placebo) effects would be stronger, but a mean of six months later than for patients assigned to the control condition.

Post-hoc statistical controls are not sufficient to rescue the study from this important group difference, and it compounds other problems in the study.

Take away lessons

Pay more attention to how the limitations of any clinical trial may compound each other, so that the trial provides exaggerated estimates of the effects of treatment or of the generalizability of the results to other settings.

Be careful of loose diagnostic criteria, because the results of a trial may not generalize to the same criteria being applied in settings that differ in terms of either patient population or the availability of different treatments. This is particularly important when a treatment setting has a bias in referrals and only a minority of the patients invited to participate in the trial actually agree and are enrolled.

Ask questions about just what information is obtained by comparing the active treatment group in a study to its control/comparison group. For a start, just what is being controlled and how might that affect the estimates of the effectiveness of the active treatment?

Pay particular attention to the potent combination of the trial being unblinded, a weak comparison/control, and an active treatment that is not otherwise available to patients.

Note

*The means of determining whether the six months of fatigue might be accounted for by other medical factors was specific to the setting. Note that a review of medical records was sufficient for an unknown proportion of patients, with no further examination or medical tests.

The Department of Internal Medicine at the Radboud University Medical Center assessed the medical examination status of all patients and decided whether patients had been sufficiently examined by a medical doctor to rule out relevant medical explanations for the complaints. If patients had not been sufficiently examined, they were seen for standard medical tests at the Department of Internal Medicine prior to referral to our outpatient clinic. In accordance with recommendations by the Centers for Disease Control, sufficient medical examination included evaluation of somatic parameters that may provide evidence for a plausible somatic explanation for prolonged fatigue [for a list, see [9]. When abnormalities were detected in these tests, additional tests were made based on the judgement of the clinician of the Department of Internal Medicine who ultimately decided about the appropriateness of referral to our clinic. Trained therapists at our clinic ruled out psychiatric comorbidity as potential explanation for the complaints in unstructured clinical interviews.

Power pose: I. Demonstrating that replication initiatives won’t salvage the trustworthiness of psychology

An ambitious multisite initiative showcases how inefficient and ineffective replication is in correcting bad science.

Bad publication practices keep good scientists unnecessarily busy, as in replicability projects. – Bjoern Brembs

Psychologists need to reconsider the pitfalls of relying exclusively on replication to improve lay persons’ trust in their field.

Despite the consistency of null findings across seven attempted replications of the original power pose study, editorial commentaries in Comprehensive Results in Social Psychology left some claims intact and called for further research.

Editorial commentaries on the seven null studies set the stage for continued marketing of self-help products, mainly to women, grounded in junk psychological pseudoscience.

Watch for repackaging and rebranding in next year’s new and improved model. Marketing campaigns will undoubtedly include direct quotes from the commentaries as endorsements.

We need to re-examine basic assumptions behind replication initiatives. Currently, these efforts suffer from prioritizing the reputations and egos of those misusing psychological science to market junk and quack claims over protecting the consumers whom these gurus target.

In the absence of a critical response from within the profession to these persons prominently identifying themselves as psychologists, it is inevitable that the void be filled by those outside the field who have no investment in preserving the image of psychology research.

In the case of power posing, watchdog critics might be recruited from:

Consumer advocates concerned about just another effort to defraud consumers.

Science-based skeptics who see in the marketing of power posing familiar quackery, in the same category as that of hawkers using pseudoscience to promote homeopathy, acupuncture, and detox supplements.

Feminists who decry the message that women need to get some balls (testosterone) if they want to compete with men and overcome gender disparities in pay. Feminists should be further outraged by the marketing of junk science to vulnerable women with an ugly message of self-blame: It is so easy to meet and overcome social inequalities that they have only themselves to blame if they do not do so by power posing.

As reported in Comprehensive Results in Social Psychology, a coordinated effort to examine the replicability of results reported in Psychological Science concerning power posing left the phenomenon a candidate for future research.

I will be blogging more about that later, but for now let’s look at a commentary from three of the over 20 authors that reveals an inherent limitation of such ambitious initiatives in tackling the untrustworthiness of psychology.

Cesario J, Jonas KJ, Carney DR. CRSP special issue on power poses: what was the point and what did we learn? Comprehensive Results in Social Psychology. 2017.

 

Let’s start with the wrap up:

The very costly expense (in terms of time, money, and effort) required to chip away at published effects, needed to attain a “critical mass” of evidence given current publishing and statistical standards, is a highly inefficient use of resources in psychological science. Of course, science is to advance incrementally, but it should do so efficiently if possible. One cannot help but wonder whether the field would look different today had peer-reviewed preregistration been widely implemented a decade ago.

We should consider the first sentence with some recognition of just how much untrustworthy psychological science is out there. Must we mobilize similar resources in every instance, or can we develop some criteria to decide what is worthy of replication? As I have argued previously, there are excellent reasons for deciding that the original power pose study could not contribute a credible effect size to the literature. There is no there to replicate.

The authors assume preregistration of the power pose study would have solved problems. In clinical and health psychology, long-standing recommendations to preregister trials are acquiring new urgency. But the record is that motivated researchers routinely ignore requirements to preregister and ignore the primary outcomes and analytic plans to which they have committed themselves. Editors and journals let them get away with it.

What measures do the replicationados have to ensure the same things are not being said about bad psychological science a decade from now? Rather than urging uniform adoption and enforcement of preregistration, replicationados urged the gentle nudge of badges for studies which are preregistered.

Just prior to the last passage:

Moreover, it is obvious that the researchers contributing to this special issue framed their research as a productive and generative enterprise, not one designed to destroy or undermine past research. We are compelled to make this point given the tendency for researchers to react to failed replications by maligning the intentions or integrity of those researchers who fail to support past research, as though the desires of the researchers are fully responsible for the outcome of the research.

There are multiple reasons not to give the authors of the power pose paper such a break. There is abundant evidence of undeclared conflicts of interest in the huge financial rewards for publishing false and outrageous claims. Psychological Science allowed the abstract of the original paper to leave out any embarrassing details of the study design and results and end with a marketing slogan:

That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications.

 Then the Association for Psychological Science gave a boost to the marketing of this junk science with a Rising Star Award to two of the authors of this paper for having “already made great advancements in science.”

As seen in this special issue of Comprehensive Results in Social Psychology, the replicationados share responsibility with Psychological Science and APS for keeping this system of perverse incentives intact. At least they are guaranteeing plenty of junk science in the pipeline to replicate.

But in the next installment on power posing I will raise the question of whether early career researchers are hurting their prospects for advancement by getting involved in such efforts.

How many replicationados does it take to change a lightbulb? Who knows, but a multisite initiative can be combined with a Bayesian meta-analysis to give a tentative and unsatisfying answer.

Coyne JC. Replication initiatives will not salvage the trustworthiness of psychology. BMC Psychology. 2016 May 31;4(1):28.

The following can be interpreted as a declaration of financial interests or a sales pitch:

I will soon be offering e-books providing skeptical looks at positive psychology and mindfulness, as well as scientific writing courses on the web, as I have been doing face-to-face for almost a decade.

Sign up at my website to get advance notice of the forthcoming e-books and web courses, as well as upcoming blog posts at this and other blog sites. Lots to see at CoyneoftheRealm.com.

 

‘Replace male doctors with female ones and save at least 32,000 lives each year’?

The authors of a recent article in JAMA Internal Medicine

“Physician Gender and Outcomes of Hospitalized Medicare Beneficiaries in the U.S.,” Yusuke Tsugawa, Anupam B. Jena, Jose F. Figueroa, E. John Orav, Daniel M. Blumenthal, Ashish K. Jha, JAMA Internal Medicine, online December 19, 2016, doi: 10.1001/jamainternmed.2016.7875

stirred lots of attention in the media with direct quotes like these:

“If we had a treatment that lowered mortality by 0.4 percentage points or half a percentage point, that is a treatment we would use widely. We would think of that as a clinically important treatment we want to use for our patients,” said Ashish Jha, professor of health policy at the Harvard School of Public Health. The estimate that 32,000 patients’ lives could be saved in the Medicare population alone is on par with the number of deaths from vehicle crashes each year.

Washington Post: Women really are better doctors, study suggests.

LA Times: How to save at least 32,000 lives each year: Replace male doctors with female ones.

NPR: Patients cared for by female doctors fare better than those treated by men.

My immediate reactions after looking at the abstract were only confirmed when I delved deeper.

Basically, we have a large, but limited and very noisy data set. It is unlikely that these data allow us to be confident about the strength of any signal concerning the relationship between physician gender and patient outcome that is so important to the authors. The small apparent differences could be just more noise on which the authors have zeroed in so that they can make a statement about the injustice of gender differences in physician pay.

 I am unwilling to relax methodological and statistical standards to manufacture support for such a change. There could be unwanted consequences of accepting that arguments can be made with such weak evidence, even for a good cause.

What if the authors had found the same small differences in noisy data in the reverse direction? Would they argue that we should preserve gender differences in physician pay? What if the authors had focused on a different variable in all this noise and concluded that the lower pay which women receive was associated with reduced mortality? Would we then advocate reducing the pay of both male and female physicians in order to improve patient outcomes?

Despite all the excitement that the claim about an effect of physician gender on patient mortality is generating, it is most likely that we are dealing with noise arising from overinterpretation of complex analyses that assume more completeness and precision than can be found in the data being analyzed.

These claims are not just a matter of causal relationships being spun from correlation. Rather, they are causal claims being made on the basis of partial correlations emerging in complex multivariate relationships found in an administrative data set.

  • Administrative data sets, particularly Medicare data sets like this one, are not constructed with such research questions in mind. There are severe constraints on what variables can be isolated and which potential confounds can be identified and tested.
  • Administrative data sets consist of records, not actual behaviors. It’s reasonable to infer a patient death from a record of a death. Linking a physician’s gender to a particular record is more problematic, as we will see. Even if we accept the association found in these records, it does not necessarily mean that physicians engaged in any particular behaviors or that physician behavior is associated with the pattern of deaths emerging in these multivariate analyses.
  • The authors start out with a statement about differences in how female and male physicians practice. In the actual article and the media, they have referred to variables like communication skills, providing evidence-based treatments, and encouraging health-related behaviors. None of these variables are remotely accessible in a Medicare data set.
  • Analyses of such administrative data sets do not allow isolation of the effects of physician gender from the effects of the contexts in which their practice occurs and relevant associated variables. We are not talking about a male or female physician encountering a particular patient being associated with a death or not, but an administrative record of physician gender arising in a particular context being interpreted as associated with a death. Male and female physicians may differ in being found in particular contexts in nonrandom fashion. It’s likely that these differences will dwarf any differences in outcomes. There will be a real challenge in even confidently attributing those outcomes to whether patients had an attending male or female physician.

The validity of complex multivariate analyses is strongly threatened by specification bias and residual confounding. The analyses must assume that all of the relevant confounds have been identified and measured without error. Departures from these ideal conditions can lead to spurious results, and generally do. Examination of the limitations in the variables available in a Medicare data set and how they were coded can quickly undermine any claim to validity.

Acceptance of claims about effects of particular variables like female physician gender arising in complex multivariate analyses involves assumptions of “all-other-things-being-equal.” If we attempt to move from statistical manipulation to inference about a real world encounter, we are no longer talking about a particular female physician, but a construction that may be very different from particular physicians interacting with particular patients in particular contexts.

The potential for counterfactual statements can be seen if we move from the study to one of science nerds and basketball players and hypothesize that if John and Jason were of equivalent height, John would not study so hard.

Particularly in complex social situations, it is usually a fantasy that we can change one variable, and only one variable, and not others. Just how did John and Jason get to be of equal height? And how are they now otherwise different?

Associations discovered in administrative data sets most often do not translate into effects observed in randomized trials. I’m not sure how we could get a representative sample of patients to disregard their preferences and accept random assignment to a male or female physician. It would have to be a very large study to detect the effect sizes reported in this observational study, and I’m skeptical that a sufficiently strong signal would emerge from all of the noise.

We might relax our standards and accept a quasi-experimental design that would be smaller but encompass a wider range of relevant variables. For instance, it is conceivable that we could construct a large sample in which physicians varied in terms of whether they had formal communication skills training. We might examine whether communications training influenced subsequent patient mortality, independent of physician gender, and vice versa. This would be a reasonable translation of the authors’ hypothesis that communication skills differences between male and female physicians account for what the authors believe is the observed association between physician gender and mortality. I know of no such study having been done. I know of no study demonstrating that physician communication training affects patient mortality. I’m skeptical that the typical communication training is so powerful in its effects. If such a study required marshaling substantial resources, rather than relying on data on hand, the strength of the results of the present study would not encourage me to invest in it.

What I saw when I looked at the article

We are dealing with very small adjusted differences in percentages arising in a large sample.

Patients treated by female physicians had lower 30-day mortality (adjusted mortality, 11.07% vs 11.49%; adjusted risk difference, –0.43%; 95% CI, –0.57% to –0.28%; P < .001; number needed to treat to prevent 1 death, 233).
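The number needed to treat in the passage just quoted follows directly from the adjusted risk difference. Here is a minimal sketch of that arithmetic, assuming the usual convention of taking the reciprocal of the absolute risk difference and rounding up. The function name is hypothetical, and applying the same formula to the confidence limits is my own illustration, not something reported in the article.

```python
import math


def number_needed_to_treat(risk_difference_pct: float) -> int:
    """Return the NNT for an absolute risk difference given in percentage points.

    Assumes the common convention NNT = ceil(1 / |risk difference|).
    """
    absolute_risk_reduction = abs(risk_difference_pct) / 100.0
    return math.ceil(1.0 / absolute_risk_reduction)


# Point estimate quoted from the article: adjusted risk difference of -0.43%.
point_estimate = number_needed_to_treat(-0.43)

# Illustration only: the same formula applied to the quoted 95% CI limits.
ci_bounds = (number_needed_to_treat(-0.57), number_needed_to_treat(-0.28))

print("NNT at the point estimate:", point_estimate)
print("NNT at the CI limits:", ci_bounds)
```

Under these assumptions the point estimate reproduces the quoted 233, while the confidence limits correspond to roughly 176 to 358 patients. Such a narrow interval reflects the very large sample, not the validity of the underlying comparison.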

Assignment of a particular patient to a particular physician is done with a lot of noise.

We assigned each hospitalization to a physician based on the National Provider Identifier in the Carrier File that accounted for the largest amount of Medicare Part B spending during that hospitalization. Part B spending comprises professional and other fees determined by the physician. On average, these physicians were responsible for 51.1% of total Part B spending for a given hospitalization.

One commentator quoted in a news article noted:

William Weeks, a professor of psychiatry at Dartmouth’s Geisel School of Medicine, said that the researchers had done a good job of trying to control for other factors that might influence the outcome. He noted that one caveat is that hospital care is usually done by a team. That fact was underscored by the method the researchers used to identify the doctor who led the care for patients in the study. To identify the gender of the physician, they looked for the doctor responsible for the biggest chunk of billing for hospital services — which was, on average, about half. That means that almost half of the care was provided by others.

Actually, much of the care is not provided by the attending physician, but other staff, including nurses and residents.

The authors undertook the study to call attention to gender disparities in physician pay. But could disparities show up in males being able to claim more billable procedures – greater credit administratively for what is done with patients during hospitalization, including by other physicians? This might explain at least some of the gender differences, but could undermine the validity of this key variable in relating physician gender to differences in patient outcome.

The statistical control of differences in patient and physician characteristics afforded by variables in this data set is inadequate.

Presumably, a full range of patient variables is related to whether patients die within 30 days of a hospitalization. Recall the key assumption that all of the relevant confounds have been identified and assessed without error in considering the variables used to characterize patient characteristics:

Patient characteristics included patient age in 5-year increments (the oldest group was categorized as ≥95 years), sex, race/ethnicity (non-Hispanic white, non-Hispanic black, Hispanic, and other), primary diagnosis (Medicare Severity Diagnosis Related Group), 27 coexisting conditions (determined using the Elixhauser comorbidity index), median annual household income estimated from residential zip codes (in deciles), an indicator variable for Medicaid coverage, and indicator variables for year.

Note that the comorbidity index is based on collapsing 27 other variables into one number. Simplifies the statistics, yes, but with a tremendous loss of information.

Recall the assumption that this set of variables represents not just what is available in an administrative data set, but all the patient characteristics relevant to their dying within 30 days after discharge from the hospital. Are we really willing to accept this assumption?

For the physician variables displayed at the top of Table 1, there are huge differences between male and female physicians, relative to the modest difference in patient mortality, adjusted mortality, 11.07% vs 11.49%.

[Figure: smaller table of patient characteristics]

These authors encourage us to think about the results as simulating a randomized trial, except that statistical controls are serving the function that randomization of patients to physician gender would serve. We are being asked to accept that these differences in baseline characteristics of the practices of female versus male physicians can be eliminated through statistics. We would never accept that argument in a randomized trial.

Addressing criticisms of the authors’ interpretation of their results

The senior author provided a pair of blog posts in which he acknowledges criticism of his study, but attempts to defuse key objections. It’s unfortunate that the sources of these objections are not identified, and so we are dependent on the author’s summary out of context. I think the key responses are to straw man objections.

Correlation, Causation, and Gender Differences in Patient Outcomes

Do women make better doctors than men?

Correlation is not causation.

We often make causal inferences based on observational data – and here’s the kicker: sometimes, we should.  Think smoking and lung cancer.  Remember the RCT that assigned people to smoking (versus not) to see if it really caused lung cancer?  Me neither…because it never happened.  So, if you are a strict “correlation is not causation” person who thinks observational data only create hypotheses that need to be tested using RCTs, you should only feel comfortable stating that smoking is associated with lung cancer but it’s only a hypothesis for which we await an RCT.  That’s silly.  Smoking causes lung cancer.

No, it is this argument that is silly. We can now look back on the data concerning smoking and lung cancer and benefit from the hindsight provided by years of sorting smoking as a risk factor from potential confounds. I recall that at some point drinking coffee was related to lung cancer in the United States, whereas drinking tea was correlated with it in the UK. Of course, if we don’t know that smoking is the culprit, we might miss that in the US smoking was done while drinking coffee, whereas in the UK it was done while drinking tea.

And isolating smoking as a risk factor, rather than just a marker for risk, is so much simpler than isolating whatever risk factors for death are hidden behind physician gender as a marker for risk of mortality.

Coming up with alternative explanations for the apparent link between physician gender and patient mortality.

The final issue – alternative explanations – has been brought up by nearly every critic. There must be an alternative explanation! There must be confounding!  But the critics have mostly failed to come up with what a plausible confounder could be.  Remember, a variable, in order to be a confounder, must be correlated both with the predictor (gender) and outcome (mortality).

This is similarly a fallacious argument. I am not arguing for alternative substantive explanations; I’m proposing that spurious results were produced by pervasive specification bias, including measurement error. There is no potential confounder I have to identify. I am simply arguing that the small differences in mortality are dwarfed by specification and measurement error.

This tiny difference is actually huge in its implications.

Several critics have brought up the point that statistical significance and clinical significance are not the same thing.  This too is epidemiology 101.  Something can be statistically significant but clinically irrelevant.  Is a 0.43 percentage point difference in mortality rate clinically important? This is not a scientific or a statistical question.  This is a clinical question. A policy and public health question.  And people can reasonably disagree.  From a public health point of view, a 0.43 percentage point difference in mortality for Medicare beneficiaries admitted for medical conditions translates into potentially 32,000 additional deaths. You might decide that this is not clinically important. I think it is. It’s a judgment call and we can disagree.

The author is taking a small difference and magnifying its importance by applying it to a larger population. He is attributing the “additional deaths” to patients being treated by men. I feel he hasn’t made a case that physician gender is the culprit, and so nothing is accomplished except introducing shock and awe by amplifying the small effect into its implications for the larger population.
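The scaling behind the 32,000 figure can be made explicit with back-of-envelope arithmetic. The sketch below back-calculates the base of hospitalizations implied by the two quoted numbers (0.43 percentage points and 32,000 deaths). That base is an inference from the quote, not a number taken from the article, and the authors’ actual calculation may differ. The function names are hypothetical.

```python
def implied_hospitalizations(deaths: float, risk_difference_pct: float) -> float:
    """Back-calculate the base implied by a death count and a risk difference."""
    return deaths / (abs(risk_difference_pct) / 100.0)


def deaths_at_scale(hospitalizations: float, risk_difference_pct: float) -> float:
    """Scale a risk difference (in percentage points) up to a base of hospitalizations."""
    return hospitalizations * abs(risk_difference_pct) / 100.0


# Assumption: the 32,000 figure is simply 0.43 percentage points times some base.
base = implied_hospitalizations(32_000, 0.43)

# Illustration only: the same base scaled by the quoted 95% CI limits.
low = deaths_at_scale(base, 0.28)
high = deaths_at_scale(base, 0.57)

print(f"Implied hospitalizations per year: {base:,.0f}")
print(f"Deaths at the CI limits: {low:,.0f} to {high:,.0f}")
```

On these assumptions the headline number requires a base of roughly 7.4 million hospitalizations, and the same multiplication at the confidence limits gives roughly 21,000 to 42,000. The multiplication enlarges the number but cannot add any support for a causal role of physician gender.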

In response to a journalist, the author makes a parallel argument:

The estimate that 32,000 patients’ lives could be saved in the Medicare population alone is on par with the number of deaths from vehicle crashes each year.

In addition to what I have already argued, if we know the same number of deaths are attributable to automobile crashes, we at least know how to take steps to reduce these crashes and the mortality associated with them. We don’t know how to change the mortality the authors claim is associated with physician gender. We don’t even know that the author’s claims are valid.

Searching for meaning where no meaning is to be found

In framing the study and interpreting the results to the media, the authors undertake a search of the literature with a heavy confirmation bias, ignoring the many contradictions that are uncovered with a systematic search. For instance, one commentator on the senior author’s blog notes:

It took me about 5 minutes of Google searching to find a Canadian report suggesting that female physicians in that country have workloads around 75% to 80% of male physicians:

https://secure.cihi.ca/free_products/PracticingPhysicianCommunityCanada.pdf

If US data is even vaguely similar, that factor would be a serious omission from your article.

But the authors were looking for what supported the results, not for studies that potentially challenged or contradicted their results. They are looking to strengthen a narrative, not expose it to refutation.

Is there a call to action here?

As consumers of health services, we could all switch to being cared for by female physicians. I suspect that some of the systems and structural issues associated with the appearance that care by male physicians is inferior would be spread among females, including increased workloads. The bias in the ability of male physicians to claim credit for the work of others would be redistributed to women. Neither would improve patient mortality.

We should push for reduction in inequalities in pay related to gender. But we don’t need results of this study to encourage us.

I certainly know health care professionals and researchers who have more confidence than I do in communication learning modules producing clinically significant changes in physician behavior. I don’t know any of them who could produce evidence that these changes include measurable reductions in patient mortality. If someone produces such data, I’m capable of being persuaded. But the present study adds nothing to my confidence in that likelihood.

If we are uncomfortable with the communication skills or attention to evidence that our personal physicians display, we should replace them. But I don’t think this study provides additional evidence for us doing so, beyond the legitimacy of us acting on our preferences.

In the end, this article reminds us to stick to our standards and not be tempted to relax them to make socially acceptable points.

 

 

 

 

 

An open-minded, skeptical look at the success of “zero suicides”: Any evidence beyond the rhetoric?

  • Claims are spreading across social media that a goal of zero suicides can be achieved by radically re-organizing resources in health systems and communities. Extraordinary claims require extraordinary evidence.
  • I thoroughly searched for evidence backing claims of “zero suicides” being achieved.
  • The claims came up short, after expectations were initially raised by some statistics and a provocative graph. But any persuasiveness to these details quickly dissipated when they were scrutinized. Lesson: Abstract numbers and graphs are not necessarily quality evidence and dazzling ones can obscure a lack of evidence.
  • The goal of “zero suicides” has attracted support of Pharma and generated programs around the world, with little fidelity to the original concept developed in the Henry Ford Health System in Detroit. In many contexts in which it is now being invoked, “zero suicides” is a vacuous buzz term, not a coherent, organizational strategy.
  • Preventing suicide is a noble goal to which a lot of emotion gets attached. It also creates lucrative financial opportunities and attracts vested interests which often simply repackage existing programs for resale.
  • How can anyone oppose the idea that we should eliminate suicide? Clever sloganeering can stifle criticism and suppress embarrassing evidence to the contrary.
  • Yet, we should not be bullied, nor distracted by slogans from our usual, skeptical insistence on those who make strong claims having the burden to provide strong evidence.
  • Deaths by suicide are statistically infrequent, poorly predicted events that occur in troubled contexts of interpersonal and institutional breakdown. These aspects can frustrate efforts to eliminate suicide entirely – or even accurately track these deaths.
  • Eliminating deaths by suicide is only very loosely analogous to wiping out polio and lots of pitfalls await those who get confused by a false equivalence.
  • Pursuit of the goal of “zero suicides,” particularly in under-resourced and not well-organized community settings can have unintended, negative consequences.
  • “Zero suicides” is likely a fad, to be replaced by next year’s fashion or maybe a few years after.
  • We need to step back and learn from the rise and fall of slogans and the unintended impact on distribution of scarce resources and the costs to human well-being.
  • My take away message is that increasingly sophisticated and even coercive communications about clinical and public health policies often harness the branding of prestigious medical journals. Interpreting these claims requires a matching skepticism, critical thinking skills, and renewed demands for evidence.

Beginning the search for evidence for the slogan “Zero Suicide”

Numerous gushy tweets about achieving “zero suicides” drew me into a search for more information. I easily traced the origins of the campaign to a program at the Henry Ford Health System, a Detroit-based HMO, but the concept has now gone thoroughly international. My first Google Scholar search did not yield quality evidence from any program evaluations, but a subsequent Google search produced exceptionally laudatory and often self-congratulatory statements.

I briefly diverted my efforts to contacting authorities whom I expected might comment about “zero suicides.” Some indicated a lack of familiarity prevented them from commenting, but others were as evasive as establishment Republicans asked about Donald Trump. One expert, however, was forthcoming with an interesting article, which proved to have just the right tone. I recommend:

Kutcher S, Wei Y, Behzadi P. School- and Community-Based Youth Suicide Prevention Interventions: Hot Idea, Hot Air, or Sham? The Canadian Journal of Psychiatry. 2016 Jul 12:0706743716659245.

Continuing my search, I found numerous links to other articles, including a laudatory, Medical News and Perspectives opinion piece in JAMA behind a readily circumvented pay wall. There was also a more accessible source with a branding by New England Journal of Medicine.

Clicking on these links, I found editorial and even blatantly promotional material, not randomized trials or other quality evidence.

This kind of non-evidence-based publicity in highly visible medical journals is extraordinary in itself, although not unprecedented. Increasingly, the brand of particular medical journals is sold and harnessed to bestow special credibility on political and financial interests, as seen in 1 and 2.

NEJM Catalyst: How We Dramatically Reduced Suicide.

 NEJM Catalyst is described as bringing

Health care executives, clinician leaders, and clinicians together to share innovative ideas and practical applications for enhancing the value of health care delivery.

[Figure: “zero suicide” takeaway, from NEJM Catalyst]

The claim of “zero suicides” originated in the Perfect Care for Depression in a division of the Henry Ford Health System.

The audacious goal of zero suicides was part of the Behavioral Health Services division’s larger goal to develop a system of perfect care for depression. Our roadmap for transformation was the Quality Chasm report, which defined six dimensions of perfect care: safety, timeliness, effectiveness, efficiency, equity, and patient-centeredness. We set perfection goals and metrics for each dimension, with zero suicides being the perfection goal for effectiveness. Very quickly, however, our team seized on zero suicides as the overarching goal for our entire transformation.

The strategies:

We used three key strategies to achieve this goal. The first two — improving access to care and restricting access to lethal means of suicide — are evidence-based interventions to reduce suicide risk. While we had pursued these strategies in the past, setting the target at zero suicides injected our team with gumption. To improve access to care, we developed, implemented, and tested new models of care, such as drop-in group visits, same-day evaluations by a psychiatrist, and department-wide certification in cognitive behavior therapy. This work, once messy and arduous for the PDC team, became creative, fun, and focused. To reduce access to lethal means of suicide, we partnered with patients and families to develop new protocols for weapons removal. We also redesigned the structure and content of patient encounters to reflect the assumption that every patient with a mental illness, even if that illness is in remission, is at increased risk of suicide. Therefore, we eliminated suicide screens and risk stratification tools that yielded non-actionable results, freeing up valuable time. Eventually, each of these approaches was incorporated into the electronic health record as decision support.

The third strategy:

…The pursuit of perfection was not possible without a just culture for our internal team. Ultimately, we found this the most important strategy in achieving zero suicides. Since our goal was to achieve radical transformation, not just to tweak the margins, PDC staff couldn’t justly be punished if they came up short on these lofty goals. We adopted a root cause analysis process that treated suicide events equally as tragedies and learning opportunities.

Process of patient care described in JAMA

What happens to a patient being treated in the context of Perfect Depression Care is described in the JAMA  piece:

Each patient seen through the BHS is first assessed and stratified on the basis of suicide risk: acute, moderate, or low. “Everyone is at risk. It’s just a matter of whether it’s acute or whether it requires attention but isn’t emergent,” said Coffey. A patient considered to be at high risk undergoes a psychiatric evaluation the same day. A patient at low risk is evaluated within 7 days. Group sessions for patients also allow individuals to connect and offer support to one another, not unlike the supportive relationships between sponsors and “sponsees” in 12-step programs.

The claim of Zero Suicides, in numbers and a graph

…A dramatic and statistically significant 80% reduction in suicide that has been maintained for over a decade, including one year (2009) when we actually achieved the perfection goal of zero suicides (see the figure below). During the PDC initiative, the annual HMO network membership ranged from 182,183 to 293,228, of which approximately 60% received care through Behavioral Health Services. From 1999 to 2010, there were 160 suicides among HMO members. In 1999, as we launched PDC, the mean annual suicide rate for these mental health patients was 110.3 per 100,000. During the 11 years of the initiative, the mean annual suicide rate dropped to 36.21 per 100,000. This decrease is statistically significant and, moreover, took place while the suicide rate actually increased among non–mental health patients and among the general population of the state of Michigan.

[Figure: Improved suicide rates among Henry Ford Medical Group HMO members]

[This graph conflicts somewhat with a graph in NEJM Catalyst indicating that there were zero suicides in the health care system in 2008 and that this continued through the first quarter of 2010.]

It is clear that rates of suicide fluctuate greatly from year-to-year in the health system. It also appears from the graph that for most years during the program, rates of suicide among patients in the Henry Ford Health System were substantially greater than those of the general population in Michigan, which were relatively flat. Any comparisons between the program and the general statistics for the state of Michigan are not particularly informative. Michigan is a state of enormous health care disparities. During this period, there was a large insured population. Demographics differ greatly, but patients receiving care within an HMO were a substantially more privileged group than the general population of Michigan. During this time, there were many uninsured and a lot of annual movement in and out of the Henry Ford Health System. At any one time, only 60% of the patients within the health system were enrolled in the behavioral health system in which the depression program occurred.

A substantial proportion of suicides occur among individuals who are not previously known to health systems. Such persons are more represented in the statistics for the state of Michigan. Another substantial proportion of suicides occur in individuals with weakened or recently broken contact with health systems. We don’t know how the statistics reported for the health system accommodated biased departures from the health system or simply missing data. We don’t know whether behavior related to risk of suicide affected migration into the health care system or to the small group receiving behavioral healthcare through the health system. For instance, what became of patients with a psychiatric disorder and a comorbid substance use disorder? Those who were incarcerated?

Basically, the success of the program is not obvious within the noisy fluctuation of suicides within the Henry Ford Health System or the smaller behavioral health program. We cannot control for basic confounding factors, for selective enrollment and disenrollment in the health care system, or even for the expulsion of persons at risk from the behavioral health system.
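As a rough arithmetic check on the figures quoted above, here is a minimal sketch that computes the relative reduction implied by the two mean annual rates. The function name is my own, and treating the 1999 rate as the baseline for comparison is an assumption, since the quoted passage does not say how its 80% figure was derived.

```python
# Mean annual suicide rates per 100,000, as quoted from the NEJM Catalyst article.
baseline_rate = 110.3  # 1999, at the launch of the program
program_rate = 36.21   # mean over the 11 years of the initiative


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after` (0.5 means a 50% reduction)."""
    return (before - after) / before


reduction = relative_reduction(baseline_rate, program_rate)
print(f"Relative reduction implied by the two quoted means: {reduction:.1%}")
```

On these two means alone the drop works out to roughly two-thirds, so the 80% headline figure presumably rests on a different baseline or time window than the one assumed here.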

 “Zero suicides” as a literal and serious goal?

The NEJM Catalyst article gave the originator of the program free rein for self-praise.

The most unexpected hurdles were skepticism that perfection goals like zero suicides were reasonable or feasible (some objected that it was “setting us up for failure”), and disbelief in the dramatic improvements obtained (we heard comments like “results from quality improvement projects aren’t scientifically rigorous”). We addressed these concerns by ensuring the transparency of our results and lessons, by collaborating with others to continually improve our methodological issues, and by supporting teams across the world who wish to pursue similar initiatives.

Our team challenged this assumption and asked, If zero is not the right goal for suicide occurrence, then what number is? Two? Twelve? Which twelve? In spite of its radicalism — indeed because of it — the goal of zero suicides became the galvanizing force behind an effort that achieved one of the most dramatic and sustained reductions in suicide in the clinical literature.

Will the Henry Ford program prove sustainable?

Edward Coffey moved to the position of President, CEO, and Chief of Staff at the Menninger Clinic 18 months before his article in the NEJM Catalyst. I am curious as to which aspects of his Zero Suicides/Perfect Depression Care Program are still maintained at Henry Ford. As it is described, the program was designed with admirably short waiting times for referral to behavioral healthcare. If the program persists as originally described, many professionals are kept vigilant and engaged in activities to reduce suicide without any statistical likelihood of having the opportunity to actually prevent one.

In decades of work within health systems, I have found that once demonstration projects have run their initial course, their goals are replaced by new organizational ones and resources are redistributed. Sooner or later, competing demands for scarce resources are promoted by new slogans.

What if Perfect Depression Care has to compete for scarce resources with Perfect Diabetes Care or alleviation of gross ethnic disparities in cardiovascular outcomes?

A lot of well-meant slogans ultimately have unintended, negative consequences. “Make pain the 5th vital sign” led to more attention being paid to previously ignored and poorly managed pain. This was followed by mandated routine assessment and intervention, which led to unnecessary procedures and an unprecedented epidemic of addiction and death from prescribed opioids. “Stamp out distress” has led to mandated screening and intervention programs for psychological distress in cancer care, with high rates of antidepressant prescription without proper diagnosis or follow-up.

If taken literally and seriously, a lofty but abstract goal like Zero Suicide becomes a threat to any “just culture” in a healthcare organization. If the slogan is taken seriously as resources are inevitably withdrawn, a culture of blame will emerge, along with pressures to distort easily manipulated statistics. Patients posing threats to the goal of zero suicide will be excluded from the system, with unknown but negative consequences for their morbidity and mortality.

 Bottom line – we can’t have slogan-driven healthcare policies that will likely have negative implications and conflict with evidence.

 Enter Big Pharma

Not unexpectedly, Big Pharma is getting involved in promoting Zero Suicides:

Eli Lilly and Company Foundation donates $250,000 to expand Community Health Network’s Zero Suicides prevention initiative,

Major gift will save Hoosier lives through a suicide prevention network that responds to a critical Indiana healthcare issue.

 According to press coverage, the funds will go to:

The Lilly Foundation donation also provides resources needed to build a Central Indiana crisis network that will include Indiana’s schools, foster care system, juvenile justice program, primary and specialty healthcare providers, policy makers and suicide survivors. These partners will be trained to identify people at risk of attempting suicide, provide timely intervention and quickly connect them with Community’s crisis providers. Indiana’s state government is a key partner in building the statewide crisis network.

I’m sure this effort is good for the profits of Pharma. Dissemination of screening programs into settings that are not directly connected to quality depression care is inevitably ineffective. The main healthcare consequences are an increase in antidepressant prescriptions without appropriate diagnoses, patient education, and follow-up. Substantial overtreatment results from people being identified without proper diagnosis who otherwise would not be seeking treatment. Care for depression in the community is hardly Perfect Depression Care.

It is great publicity for Eli Lilly and the community receiving the gift will surely be grateful.

Launching Zero Suicides in English communities and elsewhere

My academic colleagues in the UK assure me that we can simply dismiss an official UK government press release about the goal of zero suicides from Nick Clegg. It has been rendered obsolete by subsequent political events. A number commented that they never took it seriously, regardless.

Nick Clegg calls for new ambition for zero suicides across the NHS

The claims in the press release stand in stark contrast to long waiting times for mental health services and important gaps in responses to serious mental health crises, including lethal suicide attempts. However, another web link is to an announcement:

Centre for Mental Health was commissioned by the East of England Strategic Clinical Networks to evaluate activity taking place in four local areas in the region through a pilot programme to extend suicide prevention into communities.

The ‘zero suicide’ initiative is based on an approach developed by Dr Ed Coffey in Detroit, Michigan. The approach aims to prevent suicides by creating a more open environment for people to talk about suicidal thoughts and enabling others to help them. It particularly aims to reach people who have not been reached through previous initiatives and to address gaps in existing provision.

Four local areas in the East of England (Bedfordshire, Cambridgeshire & Peterborough, Essex and Hertfordshire) were selected in 2013 as pathfinder sites to develop new approaches to suicide prevention. Centre for Mental Health evaluated the work of the sites during 2015.

The evaluation found an impressive range of activities that had taken suicide prevention activities out into local communities. They included:

• Training key public service staff such as GPs, police officers, teachers and housing officers
• Training others who may encounter someone at risk of taking their own life, such as pub landlords, coroners, private security staff, faith groups and gym workers
• Creating ‘community champions’ to put local people in control of activities
• Putting in place practical suicide prevention measures in ‘hot spots’ such as bridges and railways
• Working with local newspapers, radio and social media to raise awareness in the wider community
• Supporting safety planning for people at risk of suicide, involving families and carers throughout the process
• Linking with local crisis services to ensure people get speedy access to evidence-based treatments.

The report noted that some of the people who received the training had already saved lives:

“I saved a man’s life using the skills you taught us on the course. I cannot find words to properly express the gratitude I have for that. Without the training I would have been in bits. It was a very public place, packed with people – but, to onlookers, we just looked like two blokes sitting on a bench talking.”

“Déjà vu all over again”, as Yogi Berra would say. This effort also recalls Bill Murray in the movie Groundhog Day, where he is trapped into repeating the same day over and over again.

A few years ago I was a scientific advisor for a European Union-funded project to disseminate multilevel suicide prevention programs across Europe. One UK site was among those targeted in this report. Implementation of the EU program had already failed before the plate of snacks was removed from a poorly attended event. The effort quickly failed because it did not attract the support of local GPs.

Years later, I recognize many of the elements of what we tried to implement, described in language almost identical to ours. There is no mention of the training materials we left behind or of the quick failure of our attempt at implementation.

Many of the proposed measures in the UK plan serve to generate publicity and do not have any evidence that they reduce suicides. For instance, training people in the community who might conceivably come in contact with a suicidal person accomplishes little other than producing good publicity. Uptake of such training is abysmally low and is not likely to affect the probability that a person in a suicidal crisis will encounter anyone who can make a difference.

Broad efforts to increase uptake of mental health services in the UK strain a system already suffering from unacceptably long waiting times for services. People with any likelihood of attempting suicide, however poorly predicted, are likely to be lost among persons seeking services with less serious or pressing needs.

Thoughts I have accumulated from years of evaluating depression screening programs and suicide intervention efforts

Staying mobilized around preventing suicide is difficult because it is an infrequent event and most activations of resources will prove to be false positives.

It can be tedious and annoying for both staff and patients to keep focused on an infrequent event, particularly for the vast majority of patients who rightfully believe they are not at risk for suicide.

Resources can be drained off from less frequent, but higher-risk situations that require sustained intensity of response, pragmatic innovation, and flexibility of rules.

Heightened efforts to detect mental health problems increase access for people already successfully accessing services and decrease resources for those needing special efforts. The net result can be an increase in disparities.

Suicide data are easily manipulated by ignoring selective loss to follow-up. Many suicides occur at breaks in the system, where getting follow-up data is also problematic.

Finally, death by suicide is a health outcome that is multiply determined. It does not lend itself to targeted public health approaches like eliminating polio, tempting though invoking the analogy may be.
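The point above about false positives can be made concrete with a small sketch of positive predictive value for a rare event. The base rate is the mean annual rate of 36.21 per 100,000 quoted earlier; the sensitivity and specificity are hypothetical values chosen only for illustration and do not describe any real screening tool.

```python
def positive_predictive_value(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Share of positive flags that are true positives (Bayes' rule)."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Base rate taken from the rate quoted earlier in this post.
annual_rate = 36.21 / 100_000

# Hypothetical screen performance, assumed purely for illustration.
sensitivity = 0.80
specificity = 0.90

ppv = positive_predictive_value(annual_rate, sensitivity, specificity)
false_alarms_per_true_case = (1.0 - ppv) / ppv

print(f"Positive predictive value: {ppv:.2%}")
print(f"False alarms per true case: {false_alarms_per_true_case:.0f}")
```

Under these assumed numbers, well under 1% of flagged patients would be true cases, with several hundred false alarms for each one. A more accurate screen shifts the numbers, but the low base rate keeps false positives in the majority.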

Postscript

It is likely that I exposed anyone reaching this postscript to a new and disconcerting perspective. What I have been saying is discrepant with the publicity about “zero suicides” available in the media. The portrayal of “zero suicides” is quite persuasive because it is sophisticated and well-crafted. Its dissemination is well resourced and often financed by individuals and institutions with barely discernible – if at all – conflicts of financial and political interests. Just try to find any dissenters or skeptical assessments.

My takeaway message: It’s best to process claims about suicide prevention with a high level of skepticism, an insistent demand for evidence, and a preparedness for discovering that seemingly well trusted sources are not without agendas. They are usually providing propaganda rather than evidence-based arguments.

Relaxing vs Stimulating Acupressure for Fatigue Among Breast Cancer Patients: Lessons to be Learned

  • A chance to test your rules of thumb for quickly evaluating clinical trials of alternative or integrative medicine in prestigious journals.
  • A chance to increase your understanding of the importance of well-defined control groups and blinding in evaluating the risk of bias of clinical trials.
  • A chance to understand the difference between merely evidence-based treatments versus science-based treatments.
  • Lessons learned can be readily applied to many wasteful evaluations of psychotherapy with shared characteristics.

A press release from the University of Michigan about a study of acupressure for fatigue in cancer patients was churnaled – echoed – throughout the media. It was reproduced dozens of times, with little more than an editor’s title change from one report to the next.

Fortunately, the article that inspired all the fuss was freely available from the prestigious JAMA Oncology. But when I gained access, I quickly saw that it was not worth my attention, based on what I already knew or, as I often say, my prior probabilities. “Rules of thumb” is a good enough term.

So the article became another occasion for us to practice our critical appraisal skills, including, importantly, being able to make reliable and valid judgments that some attention in the media is worth dismissing out of hand, even when tied to an article in a prestigious medical journal.

The press release is here: Acupressure reduced fatigue in breast cancer survivors: Relaxing acupressure improved sleep, quality of life.

A sampling of the coverage:

[Image: sample of media coverage]

As we’ve come to expect, the UK Daily Mail editor added its own bit of spin:

[Image: Daily Mail headline]

Here is the article:

Zick SM, Sen A, Wyatt GK, Murphy SL, Arnedt J, Harris RE. Investigation of 2 Types of Self-administered Acupressure for Persistent Cancer-Related Fatigue in Breast Cancer Survivors: A Randomized Clinical Trial. JAMA Oncol. Published online July 07, 2016. doi:10.1001/jamaoncol.2016.1867.

Here is the Trial registration:

All I needed to know was contained in a succinct summary at the Journal website:

[Image: Key Points summary from the journal website]

This is a randomized clinical trial (RCT) in which two active treatments that

  • Lacked credible scientific mechanisms
  • Were predictably shown to be better than
  • A routine care that lacked the positive expectations and support.
  • A primary outcome assessed by subjective self-report amplified the illusory effectiveness of the treatments.

But wait!

The original research appeared in a prestigious peer-reviewed journal published by the American Medical Association, not a disreputable journal on Beall’s List of Predatory Publishers.

Maybe this means publication in a peer-reviewed prestigious journal is insufficient to erase our doubts about the validity of claims.

The original research was performed with a $2.65 million peer-reviewed grant from the National Cancer Institute.

Maybe NIH is wasting scarce money on useless research.

What is acupressure?

According to the article:

Acupressure, a method derived from traditional Chinese medicine (TCM), is a treatment in which pressure is applied with fingers, thumbs, or a device to acupoints on the body. Acupressure has shown promise for treating fatigue in patients with cancer,23 and in a study24 of 43 cancer survivors with persistent fatigue, our group found that acupressure decreased fatigue by approximately 45% to 70%. Furthermore, acupressure points termed relaxing (for their use in TCM to treat insomnia) were significantly better at improving fatigue than another distinct set of acupressure points termed stimulating (used in TCM to increase energy).24 Despite such promise, only 5 small studies24–28 have examined the effect of acupressure for cancer fatigue.

[Image: acupuncture point Hegu (LI 4)]

You can learn more about acupressure here. It is a derivative of acupuncture that does not involve needles but uses the same pressure points, or acupoints, as acupuncture.

Don’t be fooled by references to traditional Chinese medicine (TCM) as a basis for claiming a scientific mechanism.

See Chairman Mao Invented Traditional Chinese Medicine.

Chairman Mao is quoted as saying “Even though I believe we should promote Chinese medicine, I personally do not believe in it. I don’t take Chinese medicine.”

 

Alan Levinovitz, author of the Slate article, further argues:

 

In truth, skepticism, empiricism, and logic are not uniquely Western, and we should feel free to apply them to Chinese medicine.

After all, that’s what Wang Qingren did during the Qing Dynasty when he wrote Correcting the Errors of Medical Literature. Wang’s work on the book began in 1797, when an epidemic broke out in his town and killed hundreds of children. The children were buried in shallow graves in a public cemetery, allowing stray dogs to dig them up and devour them, a custom thought to protect the next child in the family from premature death. On daily walks past the graveyard, Wang systematically studied the anatomy of the children’s corpses, discovering significant differences between what he saw and the content of Chinese classics.

And nearly 2,000 years ago, the philosopher Wang Chong mounted a devastating (and hilarious) critique of yin-yang five phases theory: “The horse is connected with wu (fire), the rat with zi (water). If water really conquers fire, [it would be much more convincing if] rats normally attacked horses and drove them away. Then the cock is connected with ya (metal) and the hare with mao (wood). If metal really conquers wood, why do cocks not devour hares?” (The translation of Wang Chong and the account of Wang Qingren come from Paul Unschuld’s Medicine in China: A History of Ideas.)

Trial design

A 10-week randomized, single-blind trial comparing self-administered relaxing acupressure with stimulating acupressure once daily for 6 weeks vs usual care with a 4-week follow-up was conducted. There were 5 research visits: at screening, baseline, 3 weeks, 6 weeks (end of treatment), and 10 weeks (end of washout phase). The Pittsburgh Sleep Quality Index (PSQI) and Long-Term Quality of Life Instrument (LTQL) were administered at baseline and weeks 6 and 10. The Brief Fatigue Inventory (BFI) score was collected at baseline and weeks 1 through 10.

Note that the trial was “single-blind.” It compared two forms of acupressure, relaxing versus stimulating. Patients were blinded only to which of these two treatments they were receiving; they clearly knew whether or not they had been randomized to usual care. The providers were not blinded; they were carefully supervised by the investigators and given feedback on their performance.

The combination of providers not being blinded, patients knowing whether they were randomized to routine care, and subjective self-report outcomes together are the makings of a highly biased trial.

Interventions

Usual care was defined as any treatment women were receiving from health care professionals for fatigue. At baseline, women were taught to self-administer acupressure by a trained acupressure educator.29 The 13 acupressure educators were taught by one of the study’s principal investigators (R.E.H.), an acupuncturist with National Certification Commission for Acupuncture and Oriental Medicine training. This training included a 30-minute session in which educators were taught point location, stimulation techniques, and pressure intensity.

Relaxing acupressure points consisted of yin tang, anmian, heart 7, spleen 6, and liver 3. Four acupoints were performed bilaterally, with yin tang done centrally. Stimulating acupressure points consisted of du 20, conception vessel 6, large intestine 4, stomach 36, spleen 6, and kidney 3. Points were administered bilaterally except for du 20 and conception vessel 6, which were done centrally (eFigure in Supplement 2). Women were told to perform acupressure once per day and to stimulate each point in a circular motion for 3 minutes.

Note that the control/comparison condition was an ill-defined usual care in which it is not clear that patients received any attention and support for their fatigue. As I have discussed before, we need to ask just what was being controlled by this condition. There is no evidence presented that patients had similar positive expectations and felt similar support in this condition to what was provided in the two active treatment conditions. There is no evidence of equivalence of time with a provider devoted exclusively to the patients’ fatigue. Unlike patients assigned to usual care, patients assigned to one of the acupressure conditions received a ritual delivered with enthusiasm by a supervised educator.

Note the absurdity of the naming of the acupressure points, for which the authority of traditional Chinese medicine is invoked, not evidence. This absurdity is reinforced by a look at a diagram of acupressure points provided as a supplement to the article.

[Figures: relaxing acupressure points; stimulating acupressure points]

 

Among the many problems with “acupuncture pressure points” is that sham stimulation generally works as well as actual stimulation, especially when the sham is delivered with appropriate blinding of both providers and patients. Another is that targeting places of the body that are not defined as acupuncture pressure points can produce the same results. For more elaborate discussion see Can we finally just say that acupuncture is nothing more than an elaborate placebo?

 Worth looking back at credible placebo versus weak control condition

In a recent blog post I discussed an unusual study in the New England Journal of Medicine that compared an established active treatment for asthma to two credible control conditions: one an inert spray that was indistinguishable from the active treatment, and the other acupuncture. Additionally, the study involved a no-treatment control. For subjective self-report outcomes, the active treatment, the inert spray, and acupuncture were indistinguishable, but all were superior to the no-treatment control condition. However, for the objective outcome measure, the active treatment was more effective than all three comparison conditions. The message is that credible placebo control conditions are superior to control conditions lacking positive expectations, including no treatment and, I would argue, ill-defined usual care that lacks positive expectations. A further message is ‘beware of relying on subjective self-report measures to distinguish between active treatments and placebo control conditions’.

Results

At week 6, the change in BFI score from baseline was significantly greater in relaxing acupressure and stimulating acupressure compared with usual care (mean [SD], −2.6 [1.5] for relaxing acupressure, −2.0 [1.5] for stimulating acupressure, and −1.1 [1.6] for usual care; P < .001 for both acupressure arms vs usual care), and there was no significant difference between acupressure arms (P = .29). At week 10, the change in BFI score from baseline was greater in relaxing acupressure and stimulating acupressure compared with usual care (mean [SD], −2.3 [1.4] for relaxing acupressure, −2.0 [1.5] for stimulating acupressure, and −1.0 [1.5] for usual care; P < .001 for both acupressure arms vs usual care), and there was no significant difference between acupressure arms (P > .99) (Figure 2). The mean percentage fatigue reductions at 6 weeks were 34%, 27%, and −1% in relaxing acupressure, stimulating acupressure, and usual care, respectively.
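To put the week-6 changes quoted above on a common scale, here is a minimal sketch that converts the reported means and SDs into standardized differences between arms. The pooled SD assumes equal arm sizes, which is my simplification because arm sizes are not given in the quoted passage; the function and variable names are also mine, not the authors'.

```python
from math import sqrt

# Week-6 change in BFI score from baseline as (mean, SD), from the quoted results.
week6_change = {
    "relaxing": (-2.6, 1.5),
    "stimulating": (-2.0, 1.5),
    "usual care": (-1.1, 1.6),
}


def standardized_difference(arm_a, arm_b):
    """Difference in mean change (a minus b) divided by a pooled SD.

    Assumes equal arm sizes, so the pooled variance is a simple average.
    """
    (mean_a, sd_a), (mean_b, sd_b) = arm_a, arm_b
    pooled_sd = sqrt((sd_a**2 + sd_b**2) / 2)
    return (mean_a - mean_b) / pooled_sd


comparisons = [
    ("relaxing", "usual care"),
    ("stimulating", "usual care"),
    ("relaxing", "stimulating"),
]
effects = {
    pair: standardized_difference(week6_change[pair[0]], week6_change[pair[1]])
    for pair in comparisons
}
for (arm_a, arm_b), d in effects.items():
    print(f"{arm_a} vs {arm_b}: d = {d:.2f}")
```

Negative values mean a larger drop in fatigue scores in the first arm. A standardized difference against usual care says nothing about whether the gap reflects the acupoints themselves or the expectations and attention that usual care lacked.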

These are entirely expectable results. Nothing new was learned in this study.

The bottom line for this study is that there was absolutely nothing to be gained by comparing one inert placebo condition with another inert placebo condition and with an uninformative control condition, without clear evidence that the control condition offered control of nonspecific factors – positive expectations, support, and attention. This was a waste of patient time and effort, as well as government funds, and produced results that were potentially misleading to patients. Namely, the results are likely to be misinterpreted as showing that acupressure is an effective, evidence-based treatment for cancer-related fatigue.

How the authors explained their results

Why might both acupressure arms significantly improve fatigue? In our group’s previous work, we had seen that cancer fatigue may arise through multiple distinct mechanisms.15 Similarly, it is also known in the acupuncture literature that true and sham acupuncture can improve symptoms equally, but they appear to work via different mechanisms.40 Therefore, relaxing acupressure and stimulating acupressure could elicit improvements in symptoms through distinct mechanisms, including both specific and nonspecific effects. These results are also consistent with TCM theory for these 2 acupoint formulas, whereby the relaxing acupressure acupoints were selected to treat insomnia by providing more restorative sleep and improving fatigue and the stimulating acupressure acupoints were chosen to improve daytime activity levels by targeting alertness.

How could acupressure lead to improvements in fatigue? The etiology of persistent fatigue in cancer survivors is related to elevations in brain glutamate levels, as well as total creatine levels in the insula.15 Studies in acupuncture research have demonstrated that brain physiology,41 chemistry,42 and function43 can also be altered with acupoint stimulation. We posit that self-administered acupressure may have similar effects.

Among the fallacies of the authors’ explanation is the key assumption that they are dealing with a specific, active treatment effect rather than a nonspecific placebo intervention. Supposed differences between relaxing and stimulating acupressure arise in trials with a high risk of bias due to unblinded providers of treatment and inadequate control/comparison conditions. ‘There is no there there’ to be explained, to paraphrase a quote attributed to Gertrude Stein.

How much did this project cost?

According to the NIH Research Portfolio Online Reporting Tools website, this five-year project involved support by the federal government of $2,265,212 in direct and indirect costs. The NCI program officer for investigator-initiated R01CA151445 is Ann O’Mara, who serves in a similar role for a number of integrative medicine projects.

How can expenditure of this money be justified for determining whether so-called stimulating acupressure is better than relaxing acupressure for cancer-related fatigue?

 Consider what could otherwise have been done with these monies.

 Evidence-based versus science based medicine

Proponents of unproven “integrative cancer treatments” can claim on the basis of the study that acupressure is an evidence-based treatment. Future Cochrane Collaboration Reviews may even cite this study as evidence for this conclusion.

I normally label myself as an evidence-based skeptic. I require evidence for claims of the efficacy of treatments and am skeptical of the quality of the evidence that is typically provided, especially when it comes from enthusiasts of particular treatments. However, in other contexts, I describe myself as a science-based medicine skeptic. The stricter criterion for this term is that I require not only evidence of efficacy for treatments but also evidence for the plausibility of the science-based claims of mechanism. Acupressure might be defined by some as an evidence-based treatment, but it is certainly not a science-based treatment.

For further discussion of this important distinction, see Why “Science”-Based Instead of “Evidence”-Based?

Broader relevance to psychotherapy research

The efficacy of psychotherapy is often overestimated because of overreliance on RCTs that involve inadequate comparison/control groups. Adequately powered studies of the comparative efficacy of psychotherapy that include active comparison/control groups are infrequent and uniformly provide lower estimates of just how efficacious psychotherapy is. Most psychotherapy research includes subjective patient self-report measures as the primary outcomes, although some RCTs provide independent, blinded interview measures. A dependence on subjective patient self-report measures amplifies the bias associated with inadequate comparison/control groups.

I have raised these issues with respect to mindfulness-based stress reduction (MBSR) for physical health problems and for prevention of relapse and recurrence in patients being tapered from antidepressants.

However, there is a broader relevance to trials of psychotherapy provided to medically ill patients with a comparison/control condition that is inadequate in terms of positive expectations and support, along with a reliance on subjective patient self-report outcomes. The relevance is particularly important to note for conditions in which objective measures are appropriate, but not obtained, or obtained but suppressed in reports of the trial in the literature.