How to get a flawed systematic review and meta-analysis withdrawn from publication: a detailed example

Cochrane normally requires authors to agree to withdraw completed reviews that have been published. This withdrawal in the face of resistance from the authors is extraordinary.

There is a lot to be learned from this letter and the accompanying documents, in which Courtney calmly and methodically lays out a compelling case for withdrawal of a review with important clinical practice and policy implications.


Robert Courtney’s wonderfully detailed cover letter probably proved decisive in getting the Cochrane review withdrawn, along with the work of another citizen scientist/patient advocate, Tom Kindlon.


Especially take a look at the exchanges with the author Lillebeth Larun that are included in the letter.

Excerpt from the cover letter below:

It is my opinion that the published Cochrane review unfortunately fails to meet the standards expected by the public of Cochrane in terms of publishing rigorous, unbiased, transparent and independent analysis, so I would very much appreciate it if you could investigate all of the problems I raised in my submitted comments and ensure that corrections are made or, at the very least, that responses are provided which allow readers to understand exactly why Cochrane believes that no corrections are required, with reference to Cochrane guidelines.

On this occasion, in certain respects, I consider the review to lack rigour, to lack clarity, to be misleading, and to be flawed. I also consider the review (including the discussions, some of the analyses, and unplanned changes to the protocol) to indicate bias in favour of the treatments which it investigates.

Another key excerpt summarized Courtney’s four comments on the Cochrane review that had not yet succeeded in getting the review withdrawn:

In summary, my four submissions focus on, but are not restricted to, the following issues:

  • The review authors switched their primary outcomes in the review, and used unplanned analyses, which has had the effect of substantially transforming some of the interpretation and reporting of the primary outcomes of the review;

  • The review fails to prominently explain and describe the primary outcome switching and to provide a prominent sensitivity analysis. In my opinion, the review also fails to justify the primary outcome switching;

  • The review fails to clearly report that there were no significant treatment effects at follow-up for any pooled outcomes in any measures of health (except for sleep, a secondary outcome), but instead the review gives the impression that most follow-up outcomes indicated significant improvements, and that the treatments were largely successful at follow-up;

  • The review uses some unpublished and post-hoc data from external studies, despite the review-authors claiming that they have included only formally published data and pre-specified outcome data. Using post-hoc and unpublished data, which contradicts the review’s protocol and stated methodology, may have had a significant effect on the review outcomes, possibly even changing the review outcomes from non-significant to significant;

  • The main discussion sections in the review include incorrect and misleading reports of the review’s own outcomes, giving a false overall impression of the efficacy of the reviewed therapies;

  • The review includes an inaccurate assessment of bias (according to the Cochrane guidelines for reporting bias) with respect to some of the studies included in the review’s analyses.

These are all serious issues that I believe we should not be seeing in a Cochrane review.

Digression: My Correspondence with Tom Kindlon regarding this blog post

James Coyne <jcoynester@gmail.com>

Oct 18, 2018, 12:45 PM

to Tom

I’m going to be doing a couple of blog posts about Bob: one about the details of the lost year of his life (2017), which he shared with me in February 2018, shortly before he died. The other blog post is going to be basically this long email posted with commentary. I am concerned that you get your proper recognition as fully sharing the honors with him for ultimately forcing the withdrawal of the exercise review. Can you give me some suggestion as to how that might be assured? References? Blogs?

Do you know the details of Bob ending his life? I know it was a deliberate decision, but was it an accompanied suicide? More people need to know about his involuntary hospitalization and stupid diagnosis of anorexia.

Kind regards


Tom Kindlon’s reply to me

Tom Kindlon

Oct 18, 2018, 1:01 PM

Hi James/Jim,

It is great you’re going to write on this.

I submitted two long comments on the Cochrane review of exercise therapy for CFS, which can be read here:

<https://www.cochranelibrary.com/cdsr/doi/10.1002/14651858.CD003200.pub7/detailed-comment/en?messageId=157054020>

<https://www.cochranelibrary.com/cdsr/doi/10.1002/14651858.CD003200.pub7/detailed-comment/en?messageId=157052118>

Robert Courtney then also wrote comments. When he was not satisfied with the responses, he made a complaint.

All the comments can be read on the review here:

<https://www.cochranelibrary.com/cdsr/doi/10.1002/14651858.CD003200.pub7/read-comments>

but as I recall the comments by people other than Robert and myself were not substantial.

I will ask what information can be given out about Bob’s death.

Thanks again for your work on this,

Tom

The Cover Letter: Did it break the impasse about withdrawing the review?

from: Bob <brightonbobbob@yahoo.co.uk>

to: James Coyne <jcoynester@gmail.com>

date: Feb 18, 2018, 5:06 PM

subject: Fw: Formal complaint – Cochrane review CD003200

THIS IS A COPY OF A FORMAL COMPLAINT SENT TO DR DAVID TOVEY.

Formal Complaint

12th February 2018

From:

Robert Courtney.

UK

To:

Dr David Tovey

Editor in Chief of the Cochrane Library

Cochrane Editorial Unit

020 7183 7503

dtovey@cochrane.org

Complaint with regards to:

Cochrane Database of Systematic Reviews.

Larun L, Brurberg KG, Odgaard-Jensen J, Price JR. Exercise therapy for chronic fatigue syndrome. Cochrane Database Syst Rev. 2017; CD003200. DOI: 10.1002/14651858.CD003200.pub7

Dear Dr David Tovey,

This is a formal complaint with respect to the current version of “Exercise therapy for chronic fatigue syndrome” by L. Larun et al. (Cochrane Database Syst Rev. 2017; CD003200.)

First of all, I would like to apologise for the length of my submissions relating to this complaint. The issues are technical and complex and I hope that I have made them easy to read and understand despite the length of the text.

I have attached four PDF files to this email which outline the details of my complaint. In 2016, I submitted each of these documents as part of the Cochrane comments facility. They have now been published in the updated version of the review. (For your convenience, the details of these submissions are listed at the end of this email with a weblink to an online copy of each document.)

I have found the responses to my comments, by L. Larun, the lead author of the review, to be inadequate, especially considering the seriousness of some of the issues raised.

It is my opinion that the published Cochrane review unfortunately fails to meet the standards expected by the public of Cochrane in terms of publishing rigorous, unbiased, transparent and independent analysis, so I would very much appreciate it if you could investigate all of the problems I raised in my submitted comments and ensure that corrections are made or, at the very least, that responses are provided which allow readers to understand exactly why Cochrane believes that no corrections are required, with reference to Cochrane guidelines.

On this occasion, in certain respects, I consider the review to lack rigour, to lack clarity, to be misleading, and to be flawed. I also consider the review (including the discussions, some of the analyses, and unplanned changes to the protocol) to indicate bias in favour of the treatments which it investigates.

Exercise as a therapy for chronic fatigue syndrome is a highly controversial subject, and so there may be more of a need for independent oversight and scrutiny of this Cochrane review than might usually be the case.

In addition to the technical/methodological issues raised in my four submitted comments, I would also like you to consider whether there may be a potential lack of independence on the part of the authors of this review.

All of the review authors, bar Price, are currently working in collaboration on another Cochrane project with some of the authors of the studies included in this review. (The project involves co-authoring a protocol for a future Cochrane review) [2]. One of the meetings held to develop the protocol for this new review was funded by Peter White’s academic fund [1]. White is the Primary Investigator for the PACE trial (a study included in this Cochrane review).

It is important that Cochrane is seen to uphold high standards of independence, transparency and rigour.

Please refer to my four separate submissions (attached) for the details of my complaint regarding the contents of the review. By way of introduction only, I will also briefly discuss below some of the points I have raised in my four documents.

In summary, my four submissions focus on, but are not restricted to, the following issues:

  • The review authors switched their primary outcomes in the review, and used unplanned analyses, which has had the effect of substantially transforming some of the interpretation and reporting of the primary outcomes of the review;
  • The review fails to prominently explain and describe the primary outcome switching and to provide a prominent sensitivity analysis. In my opinion, the review also fails to justify the primary outcome switching;
  • The review fails to clearly report that there were no significant treatment effects at follow-up for any pooled outcomes in any measures of health (except for sleep, a secondary outcome), but instead the review gives the impression that most follow-up outcomes indicated significant improvements, and that the treatments were largely successful at follow-up;
  • The review uses some unpublished and post-hoc data from external studies, despite the review-authors claiming that they have included only formally published data and pre-specified outcome data. Using post-hoc and unpublished data, which contradicts the review’s protocol and stated methodology, may have had a significant effect on the review outcomes, possibly even changing the review outcomes from non-significant to significant;
  • The main discussion sections in the review include incorrect and misleading reports of the review’s own outcomes, giving a false overall impression of the efficacy of the reviewed therapies;
  • The review includes an inaccurate assessment of bias (according to the Cochrane guidelines for reporting bias) with respect to some of the studies included in the review’s analyses.

These are all serious issues that I believe we should not be seeing in a Cochrane review.

These issues have already caused misunderstanding and misreporting of the review in academic discourse and publishing. (See an example of this below.)

All of the issues listed above are explained in full detail in the four PDF files attached to this email. They should be considered to be the basis of this complaint.

For the purposes of this correspondence, I will illustrate some specific issues in more detail.

In the review, the following health indicators were used as outcomes to assess treatment effects: fatigue, physical function, overall health, pain, quality of life, depression, anxiety, and sleep. All of these health indicators, except uniquely for sleep (a secondary outcome), demonstrated a non-significant outcome for pooled treatment effects at follow-up for exercise therapy versus passive control. But a reader would not be aware of this from reading any of the discussion in the review. I undertook a lengthy and detailed analysis of the data in the review before I could comprehend this. I would like these results to be placed in a prominent position in the review, and reported correctly and with clarity, so that a casual reader can quickly understand these important outcomes. These outcomes cannot be understood from reading the discussion, and some outcomes have been reported incorrectly in the discussion. In my opinion, Cochrane is not maintaining its expected standards.

Unfortunately, there is a prominent and important error in the review, which I believe helps to give the false impression that the investigated therapies were broadly effective. Physical function and overall health (both at follow-up) have been misreported in the main discussion as positive outcomes at follow-up, when in fact they were non-significant outcomes. This seems to be an important failing of the review that I would like to be investigated and corrected.

Regarding one of the points listed above, copied here:

“The review fails to clearly report that there were no significant treatment effects at follow-up for any pooled outcomes in any measures of health (except for sleep, a secondary outcome), but instead the review gives the impression that most follow-up outcomes indicated significant improvements, and that the treatments were largely successful at follow-up”

This is one of the most substantial issues that I have highlighted. This issue is related to the primary outcome switching in the review.

(This relates to assessing fatigue at long-term follow-up for exercise therapy vs passive control.)

An ordinary (i.e. casual) reader of the review may easily be left with the impression that the review demonstrates that the investigated treatment has almost universal beneficial health effects. However, there were no significant treatment effects for pooled outcome analyses at follow-up for any health outcomes except for sleep (a secondary outcome). The lack of universal treatment efficacy at follow-up is not at all clear from a casual read of the review, or even from a thorough read. Instead, a careful analysis of the data is necessary to understand the outcomes. I believe that the review is unhelpful in the way it has presented the outcomes, and lacks clarity.

These follow-up outcomes are a very important issue for medical, patient and research communities, but I believe that they have been presented in a misleading and unhelpful way in the discussions of the review. This issue is discussed mainly in my submission no.4 (see my list of PDF documents at the bottom of this correspondence), and also a little in submission no.3.

I will briefly explain some of the specific details, by way of introduction, but please refer to my attached documents for the full details.

The pre-specified primary outcomes were pooled treatment effects (i.e. using pooled data from all eligible studies) immediately after treatment and at follow-up.

However, for fatigue, this pre-specified primary outcome (i.e. pooled treatment effects for the combination of data from all eligible studies) was abandoned/switched (for what I consider to be questionable reasons) and replaced with a non-pooled analysis. The new unplanned analysis did not pool the data from all eligible studies but analysed data from studies grouped together by the specific measure used to assess fatigue (i.e. grouped by the various different fatigue questionnaire assessments).

Looking at these post-hoc grouped outcomes for fatigue at follow-up, two out of the three grouped outcomes had significant treatment effects, and the other outcome was a non-significant effect. This post-hoc analysis indicates that the majority of outcomes (i.e. two out of three) demonstrated a significant treatment effect; however, this does not mean that the pre-specified pooled analysis of all eligible studies would have demonstrated a positive treatment effect. Therefore switching outcomes, and using a post-hoc analysis, allows for the potential introduction of bias to the review. Indeed, on careful inspection of the minutiae of the review, the pre-specified analysis of pooled outcomes demonstrates a non-significant treatment effect for fatigue at follow-up (exercise therapy versus passive control).

The (non-significant) outcome of this pre-specified pooled analysis of fatigue at follow-up is somewhat buried within the data tables of the review, and is very difficult to find; it is not discussed prominently or highlighted. Furthermore, the explanation that the primary outcome was switched is only briefly mentioned and can easily be missed. Uniquely for the main outcomes, there is no table outlining the details of the pre-specified pooled analysis of fatigue at follow-up. In contrast, the post-hoc analysis, which has mainly positive outcomes, has been given high prominence throughout the review with little explanation that it is a post-hoc outcome.

So, to reiterate, the (two out of three significant, and one non-significant) post-hoc outcomes for fatigue at follow-up were reported as primary outcomes instead of the (non-significant) pre-specified pooled treatment effect for all eligible studies. Two out of three post-hoc outcomes were significant in effect; however, the pre-specified pooled treatment effect for the same measures was not significant (for fatigue at follow-up, exercise therapy versus passive control). Thus, the outcome switching transformed one of the main outcomes of the review from a non-significant effect to a mainly significant effect.
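The distinction at issue can be made concrete. The sketch below shows a minimal inverse-variance (fixed-effect) pooling of the kind a pre-specified pooled analysis would use, with entirely hypothetical per-study effect estimates and standard errors (none of these numbers come from the review):

```python
import math

def pool_fixed_effect(studies):
    """Inverse-variance (fixed-effect) pooling of per-study effect estimates.

    studies: list of (effect, standard_error) pairs. All values used below
    are hypothetical, not data from the Cochrane review.
    """
    weights = [1.0 / se ** 2 for _, se in studies]
    pooled = sum(w * e for (e, _), w in zip(studies, weights)) / sum(weights)
    se_pooled = math.sqrt(1.0 / sum(weights))
    # 95% confidence interval for the pooled estimate
    return pooled, (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)

# Three hypothetical studies: the single pooled estimate and its 95% CI
# constitute the one pre-specified primary result.
studies = [(-0.9, 0.5), (-0.4, 0.6), (-0.2, 0.4)]
effect, (lo, hi) = pool_fixed_effect(studies)
significant = not (lo <= 0.0 <= hi)  # CI crossing zero => non-significant
```

The point of contention is which analysis is reported as primary: one pooled estimate like this, or several per-instrument estimates whose individual confidence intervals can tell a different story.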

Furthermore, for exercise therapy versus passive control at follow-up, all the other health outcomes were non-significant (except sleep – a secondary outcome), but I believe the casual reader would be unaware of this because it is not explained clearly or prominently in the discussion, and some outcomes have been reported erroneously in the discussion as indicating a significant effect.

All of the above is outlined in my four PDF submissions, with detailed reference to specific sections of the review and specific tables etc.

I believe that the actual treatment effects at follow-up are different to the impression gained from a casual read of the review, or even a careful read of the review. It’s only by an in-depth analysis of the entire review that these issues would be noticed.

In what I believe to be a reasonable request in my submissions, I asked the reviewers to: “Clearly and unambiguously explain that all but one health indicator (i.e. fatigue, physical function, overall health, pain, quality of life, depression, and anxiety, but not sleep) demonstrated a non-significant outcome for pooled treatment effects at follow-up for exercise therapy versus passive control”. My request was not acted upon.

The Cochrane reviewers did provide a reason for the change to the protocol, from a pooled analysis to analyses of groups of mean difference values: “We realise that the standardised mean difference (SMD) is much more difficult to conceptualise and interpret than the normal mean difference (MD) […]”.

However, this is a questionable and unsubstantiated claim, and in my opinion isn’t an adequate explanation or justification for changing the primary outcomes; personally, I find it easier to interpret a single pooled analysis than a group of different analyses with each analysis using a different non-standardised scale to measure fatigue.

Using an SMD is standard practice for Cochrane reviews; Cochrane’s guidance recommends using pooled analyses when the outcomes use different measures, which was the case in this review. Thus I struggle to understand why (in an unplanned change to methodology) using an SMD was considered unhelpful by the reviewers in this case. My PDF document no.4 challenges the reviewers’ reason, with reference to the official Cochrane reviewers’ guidelines.
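For readers unfamiliar with the two statistics, the difference can be sketched with hypothetical group summaries (the numbers below are illustrative only, not from any trial in the review):

```python
import math

def mean_difference(m_treat, m_control):
    # Raw mean difference (MD): interpretable only on one specific scale.
    return m_treat - m_control

def standardised_mean_difference(m1, sd1, n1, m2, sd2, n2):
    # SMD (Cohen's d with a pooled SD): expresses the effect in SD units,
    # which is what allows trials using different fatigue scales to be pooled.
    sd_pooled = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                          / (n1 + n2 - 2))
    return (m1 - m2) / sd_pooled

# Hypothetical trial scored on a 33-point fatigue scale:
md = mean_difference(20.0, 23.0)                                  # -3.0 points
smd = standardised_mean_difference(20.0, 5.0, 50, 23.0, 5.0, 50)  # -0.6 SDs
```

The Cochrane Handbook recommends the SMD precisely when included studies measure the same outcome on different scales, which is the situation Courtney describes.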

This review has already led to an academic misunderstanding and misreporting of its outcomes, as demonstrated in the following published letter from one of the co-authors of the IPD protocol…

CMAJ (Canada) recommends exercise for CFS [http://www.cmaj.ca/content/188/7/510/tab-e-letters ]

The letter claims: “We based the recommendations on the Cochrane systematic review which looked at 8 randomised trials of exercise for chronic fatigue, and together showed a consistent modest benefit of exercise across the different patient groups included. The clear and consistent benefit suggests indication rather than contraindication of exercise.”

However, there was not a “consistent modest benefit of exercise” and there was not a “clear and consistent benefit” considering that there were no significant treatment effects for any pre-specified (pooled) health outcomes at follow-up, except for sleep. The actual outcomes of the review seem to contradict the interpretation expressed in the letter.

Even if we include the unplanned analyses in our considerations, then it would still be the case that most outcomes did not indicate a beneficial treatment effect at follow-up for exercise therapy versus passive control. Furthermore, one of the most important outcomes, physical function, did not indicate a significant improvement at follow up (despite the discussion erroneously stating that it was a significant effect).

Two of my submissions discuss other issues, which I will outline below.

My first submission is in relation to the following…

The review states that all the analysed data had previously been formally published and was pre-specified in the relevant published studies. However, the review includes an analysis of external data that had not been formally published and is post-hoc in nature, despite alternative data being available that has been formally published and had been pre-specified in the relevant study. The post-hoc data relates to the FINE trial (Wearden 2010). The use of this data was not in accordance with the Cochrane review’s protocol and also contradicts the review’s stated methodology and the discussion of the review.

Specifically, the fatigue data taken from the FINE trial was not pre-specified for the trial and was not included in the original FINE trial literature. Instead, the data had been informally posted on a BMJ rapid response by the FINE trial investigators[3].

The review analyses post-hoc fatigue data from the FINE trial which is based on the Likert scoring system for the Chalder fatigue questionnaire, whereas the formally published FINE trial literature uses the same Chalder fatigue questionnaires but with the bimodal scoring system, giving different outcomes for the same patient questionnaires. The FINE trial’s post-hoc Likert fatigue data (used in the review) was initially published by the FINE authors only in a BMJ rapid response post [3], apparently as an afterthought.
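The two scoring systems apply to the same 11-item Chalder Fatigue Questionnaire: each item is answered 0–3; Likert scoring sums the raw responses (range 0–33), while bimodal scoring collapses each item to 0/1 (range 0–11). A minimal sketch with hypothetical answers shows how the same questionnaire yields two different totals:

```python
def likert_total(responses):
    # Likert scoring: sum the raw 0-3 item responses (range 0-33).
    return sum(responses)

def bimodal_total(responses):
    # Bimodal scoring: responses of 0/1 count as 0, responses of 2/3 count
    # as 1 (range 0-11).
    return sum(1 if r >= 2 else 0 for r in responses)

# Hypothetical answers to the 11 Chalder items:
responses = [2, 1, 3, 2, 0, 1, 2, 3, 1, 2, 2]
likert = likert_total(responses)    # 19 on the 0-33 scale
bimodal = bimodal_total(responses)  # 7 on the 0-11 scale
```

Because the two totals sit on different ranges and compress the responses differently, group means and effect estimates computed from them need not agree, which is why switching between them can matter for a meta-analysis.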

This is the response to my first letter…

Larun said she was “extremely concerned and disappointed” with the Cochrane editors’ actions. “I disagree with the decision and consider it to be disproportionate and poorly justified,” she said.

———————-

Larun said:

Dear Robert Courtney

Thank you for your detailed comments on the Cochrane review ‘Exercise Therapy for Chronic Fatigue Syndrome’. We have the greatest respect for your right to comment on and disagree with our work. We take our work as researchers extremely seriously and publish reports that have been subject to rigorous internal and external peer review. In the spirit of openness, transparency and mutual respect we must politely agree to disagree.

The Chalder Fatigue Scale was used to measure fatigue. The results from the Wearden 2010 trial show a statistically significant difference in favour of pragmatic rehabilitation at 20 weeks, regardless whether the results were scored bi-modally or on a scale from 0-3. The effect estimate for the 70 week comparison with the scale scored bi-modally was -1.00 (CI-2.10 to +0.11; p =.076) and -2.55 (-4.99 to -0.11; p=.040) for 0123 scoring. The FINE data measured on the 33-point scale was published in an online rapid response after a reader requested it. We therefore knew that the data existed, and requested clarifying details from the authors to be able to use the estimates in our meta-analysis. In our unadjusted analysis the results were similar for the scale scored bi-modally and the scale scored from 0 to 3, i.e. a statistically significant difference in favour of rehabilitation at 20 weeks and a trend that does not reach statistical significance in favour of pragmatic rehabilitation at 70 weeks. The decision to use the 0123 scoring did does not affect the conclusion of the review.

Regards,

Lillebeth Larun

——————

In her response, above, Larun discusses the FINE trial and quotes an effect size for post-hoc outcome data (fatigue at follow-up) from the FINE trial that is included in the review. Her quoted figures accurately reflect the data quoted by the FINE authors in their BMJ rapid-response comment [3] but, confusingly, these are slightly different from the data in the Cochrane review. In her response, Larun states that the FINE trial effect size for fatigue at 70 weeks using Likert data is -2.55 (-4.99 to -0.11; p=.040), whereas the Cochrane Review states that it is -2.12 [-4.49, 0.25].

This inconsistency makes the discussion confusing. Unfortunately there is no authoritative source for the data because it had not been formally published when the Cochrane review was published.

It seems that, in her response, Larun has quoted the BMJ rapid response data by Wearden et al.[3], rather than her own review’s data. Referring to her review’s data, Larun says that in “our unadjusted analysis the results were similar for the scale scored bi-modally and the scale scored from 0 to 3, i.e. a statistically significant difference in favour of rehabilitation at 20 weeks and a trend that does not reach statistical significance in favour of pragmatic rehabilitation at 70 weeks”.

It is not clear exactly why there are now two different Likert effect sizes, for fatigue at 70 weeks, but we can be sure that the use of this data undermines the review’s claim that “for this updated review, we have not collected unpublished data for our outcomes…”

This confusion, perhaps, demonstrates one of the pitfalls of using unpublished data. The difference between the data published in the review and the data quoted by Larun in her response (which are both supposedly the same unpublished data from the FINE trial) raises the question of exactly what data has been analysed in the review, and what exactly its source is. If it is unpublished data, and seemingly variable in nature, how are readers expected to scrutinise or trust the Cochrane analysis?

With respect to the FINE trial outcomes (fatigue at 70-week follow-up), Larun has provided the mean differences (effect sizes) for the (pre-specified) bimodal data and for the (post-hoc) Likert data. These two different scoring methods (bimodal and Likert) are used for identical patient Chalder fatigue questionnaires and provide different effect sizes, so switching the fatigue scoring methods may possibly have had an impact on the review’s primary outcomes for fatigue.

Larun hasn’t provided the effect estimates for fatigue at end-of-treatment, but these would also demonstrate variance between bimodal and Likert scoring, so switching the outcomes might have had a significant impact on the primary outcome of the Cochrane review at end-of-treatment, as well as at follow-up.

Note that the effect estimates outlined in this correspondence, for the FINE trial, are mean differences (this is the data taken from the FINE trial), rather than standardised mean differences (which are sometimes used in the meta-analyses in the Cochrane review); it is important not to confuse the two different statistical analyses.

Larun said: “The decision to use the 0123 [i.e. Likert] scoring did does [sic] not affect the conclusion of the review.”

But it is not possible for a reader to verify that, because Larun has not provided any evidence to demonstrate that switching outcomes has had no effect on the conclusion of the review; i.e. there is no sensitivity analysis, despite the review switching outcomes and using unpublished post-hoc data instead of published pre-specified data. This change in methodology means that the review does not conform to its own protocol and stated methodology. This seems like a significant issue.

Are we supposed to accept the word of the author, rather than review the evidence for ourselves? This is a Cochrane review – renowned for rigour and impartiality.

Note that Larun has acknowledged that I am correct with respect to the FINE trial data used in the review (i.e. that the data was unpublished and not part of the formally published FINE trial study, but was simply posted informally in a BMJ rapid response). Larun confirms that: “…the 33-point scale was published in an online rapid response after a reader requested it. We therefore knew that the data existed, and requested clarifying details from the authors…” But then Larun confusingly (for me) says we must “agree to disagree”.

Larun has not amended her literature to resolve the situation; Larun has not changed her unplanned analysis back to her planned analyses (i.e. to use published pre-specified data as per the review protocol, rather than unpublished post-hoc data); nor has she amended the text of the review so that it clearly and prominently indicates that the primary outcomes were switched. Neither has a sensitivity analysis been published using the FINE trial’s published pre-specified data.

Note the difference in the effect estimates at 70 weeks for bimodal scoring [-1.00 (CI -2.10 to +0.11; p =.076)] vs Likert scoring [-2.55 (-4.99 to -0.11; p=.040)] (as per the Cochrane analysis) or -2.12 [-4.49, 0.25] (also Likert scoring) as per Larun’s response and the BMJ rapid response where the data was initially presented to the public.

Confusingly, there are two different effect sizes for the same (Likert) data; one shows a significant treatment effect and the other shows a non-significant treatment effect. This seems like a rather chaotic situation for a Cochrane review. The data is neither consistent nor transparent. The unplanned Cochrane analysis uses data which has not been published and cannot be scrutinised.
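The significant/non-significant discrepancy can be checked directly from the two quoted confidence intervals: a two-sided 95% CI that excludes zero corresponds to p < 0.05. A minimal check using the figures quoted earlier in this correspondence:

```python
def ci_excludes_zero(lo, hi):
    # A 95% CI that excludes zero implies statistical significance at
    # p < 0.05 (two-sided), for an effect whose null value is zero.
    return not (lo <= 0.0 <= hi)

# Fatigue at 70 weeks, Likert scoring, as quoted in this correspondence:
rapid_response_ci = (-4.99, -0.11)   # BMJ rapid response estimate: -2.55
cochrane_review_ci = (-4.49, 0.25)   # Cochrane review estimate: -2.12

assert ci_excludes_zero(*rapid_response_ci)       # significant
assert not ci_excludes_zero(*cochrane_review_ci)  # non-significant
```

The same effect, sourced two ways, thus lands on opposite sides of the conventional significance threshold, which is exactly the inconsistency Courtney highlights.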

Furthermore, we now have three sets of data for the same outcomes. Because an unplanned analysis was used in the review, it is nearly impossible to work out what is what.

In her response, above, Larun says that both fatigue outcomes (i.e. bimodal & Likert scoring systems) at 70 weeks are non-significant. This is true of the data published in the Cochrane review but, confusingly, this isn’t true if we consider the data that Larun has provided in her response, above. The bimodal and Likert data (fatigue at 70 weeks) presented in the review both have a non-significant effect, however, the Likert data quoted in Larun’s correspondence (which reflects the data in the FINE trial authors’ BMJ rapid response) shows a significant outcome. This may reflect the use of adjusted vs unadjusted data, but it isn’t clear.

Using post-hoc data may allow bias to creep into the review; for example, the Cochrane reviewers might have seen the post-hoc data for the FINE trial, because it was posted in an open-access BMJ rapid response [3] prior to the Cochrane review publication date. I am not accusing the authors of conscious bias, but Cochrane guidelines are put in place to avoid doubt and to maintain rigour and transparency. Hypothetically, a biased author may have seen that a post-hoc Likert analysis allowed for better outcomes to be reported for the FINE trial. The Cochrane guidelines are established in order to avoid such potential pitfalls and bias, and to avoid the confusion that is inherent in this review.

Note that the review still incorrectly says that all the data is previously published data – even though Larun admits in the letter that it isn't. (i.e. the data are not formally published in a peer-reviewed journal; I assume that the review wasn't referring to data that might be informally published in blogs or magazines etc., because the review purports to analyse formally published data only.)

The authors have practically dismissed my concerns and have not amended anything in the review, despite admitting in the response that they’ve used post-hoc data.

The fact that this is all highly confusing, even after I have studied it in detail, demonstrates that these issues need to be straightened out and fixed.

It surely shouldn't be the case, in a Cochrane review, that we have three sets of results (for the same outcomes) being bandied about, that the data used in a post-hoc analysis seem to vary over time, and that they change from a non-significant treatment effect to a significant treatment effect depending on where they are quoted. Because the data are unpublished, independent scrutiny is made more difficult.

For your information, the BMJ rapid response (Wearden et al.) includes the following data : “Effect estimates [95% confidence intervals] for 20 week comparisons are: PR versus GPTAU -3.84 [-6.17, -1.52], SE 1.18, P=0.001; SL versus GPTAU +0.30 [-1.73, +2.33], SE 1.03, P=0.772. Effect estimates [95% confidence intervals] for 70 week comparisons are: PR versus GPTAU -2.55 [-4.99,-0.11], SE 1.24, P=0.040; SL versus GPTAU +0.36 [-1.90, 2.63], SE 1.15, P=0.752.”
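As an arithmetic aside (mine, not part of Courtney's letter), the significance of each quoted estimate can be checked directly from its 95% confidence interval, since under the usual normal approximation the interval implies the standard error. A minimal sketch:

```python
from math import erf, sqrt

def se_and_p(estimate, ci_low, ci_high):
    """Recover the standard error and two-sided p-value implied by a point
    estimate and its 95% CI, under the usual normal approximation."""
    se = (ci_high - ci_low) / (2 * 1.959964)    # half-width of a 95% CI is ~1.96 * SE
    z = abs(estimate) / se
    p = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # two-sided normal tail probability
    return se, p

# Likert fatigue scoring at 70 weeks, as quoted in the BMJ rapid response:
se_bmj, p_bmj = se_and_p(-2.55, -4.99, -0.11)   # SE ~ 1.24, p ~ 0.040 (significant)

# The other quoted Likert estimate for the same outcome:
se_alt, p_alt = se_and_p(-2.12, -4.49, 0.25)    # SE ~ 1.21, p ~ 0.080 (non-significant)
```

On these numbers, the -2.55 estimate is significant (p ≈ 0.04) and the -2.12 estimate is not (p ≈ 0.08), reproducing the discrepancy discussed above.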

My second submission was in relation to the following…

I believe that properly applying the official Cochrane guidelines would require the review to categorise the PACE trial (White 2011) data as ‘unplanned’ rather than ‘pre-specified’, and would require the risk of bias in relation to ‘selective reporting’ to be categorised accordingly. The Cochrane review currently categorises the risk of ‘selective reporting’ bias for the PACE trial as “low”, whereas the official Cochrane guidelines indicate (unambiguously) that the risk of bias for the PACE data should be “high”. I believe that my argument is fairly robust and water-tight.

This is the response to my second letter…

———————–

Larun said:

Dear Robert Courtney

Thank you for your detailed comments on the Cochrane review ‘Exercise Therapy for Chronic Fatigue Syndrome’. We have the greatest respect for your right to comment on and disagree with our work. We take our work as researchers extremely seriously and publish reports that have been subject to rigorous internal and external peer review. In the spirit of openness, transparency and mutual respect we must politely agree to disagree.

Cochrane reviews aim to report the review process in a transparent way, for example, are reasons for the risk of bias stated. We do not agree that Risk of Bias for the Pace trial (White 2011) should be changed, but have presented it in a way so it is possible to see our reasoning. We find that we have been quite careful in stating the effect estimates and the certainty of the documentation. We note that you read this differently.

Regards,

Lillebeth

————————-

I do not understand what is meant by: “We do not agree that Risk of Bias for the Pace trial (White 2011) should be changed, but have presented it in a way so it is possible to see our reasoning.” …

The review does not discuss the issue of the PACE data being unplanned and I, for one, do not understand the reasoning for not correcting the category for the risk of selective reporting bias. The response to my submission fails to engage with the substantive and serious issues that I raised.

To date, nearly all the issues raised in my letters have been entirely dismissed by Larun. I find this surprising, especially considering that some of the points that I have made were factual (i.e. not particularly open to interpretation) and difficult to dispute. Indeed, Larun’s response even accepts the factual point that I made, in relation to the FINE data, but then confusingly dismisses my request for the issue to be remedied.

There is more detail in the four PDF submissions which are attached to this email, and which have now been published in the latest version of the Cochrane review. I will stop this email now so as not to overwhelm you, and so I don't repeat myself.

Again, I apologise for the complexity. My four submissions, attached to this email as PDF files, form the central basis of my complaint, so I ask you to consider them as such. I hope that they will be sufficiently clear.

I trust that you will wish to investigate these issues, with a view to upholding the high standards expected from a Cochrane review.

I look forward to hearing from you in due course. Please feel free to email me at any time with any questions, or if you believe it would be helpful to discuss any of the issues raised.

Regards,

Robert Courtney.

My ‘comments’ (submitted to the Cochrane review authors):

Please note that the four attached PDF documents form the basis of this complaint.

For your convenience, I have included a weblink to a downloadable online copy of each document, and I have attached copies to this email as PDF files, and the comments have now been published in the latest updated version of the review.

The dates refer to the date the comments were submitted to Cochrane.

  1. Query re use of post-hoc unpublished outcome data: Scoring system for the Chalder fatigue scale, Wearden 2010.

Robert Courtney

16th April 2016

https://sites.google.com/site/mecfsnotes/submissions-to-the-cochrane-review-of-exercise-therapy-for-chronic-fatigue-syndrome/fine-trial-unpublished-data

  2. Assessment of Selective Reporting Bias in White 2011.

Robert Courtney

1st May 2016

https://sites.google.com/site/mecfsnotes/submissions-to-the-cochrane-review-of-exercise-therapy-for-chronic-fatigue-syndrome/pace-trial-selective-reporting-bias

  3. A query regarding the way outcomes for physical function and overall health have been described in the abstract, conclusion and discussions of the review.

Robert Courtney

12th May 2016

https://sites.google.com/site/mecfsnotes/submissions-to-the-cochrane-review-of-exercise-therapy-for-chronic-fatigue-syndrome/misreporting-of-outcomes-for-physical-function

  4. Concerns regarding the use of unplanned primary outcomes in the Cochrane review.

Robert Courtney

3rd June 2016

https://sites.google.com/site/mecfsnotes/submissions-to-the-cochrane-review-of-exercise-therapy-for-chronic-fatigue-syndrome/primary-outcome-switching

References:

  1. Quote from Cochrane reference CD011040:

“Acknowledgements[…]The author team held three meetings in 2011, 2012 and 2013 which were funded as follows: […]2013 via Peter D White’s academic fund (Professor of Psychological Medicine, Centre for Psychiatry, Wolfson Institute of Preventive Medicine, Barts and The London School of Medicine and Dentistry, Queen Mary University of London).”

  2. Larun L, Odgaard-Jensen J, Brurberg KG, Chalder T, Dybwad M, Moss-Morris RE, Sharpe M, Wallman K, Wearden A, White PD, Glasziou PP. Exercise therapy for chronic fatigue syndrome (individual patient data) (Protocol). Cochrane Database of Systematic Reviews 2014, Issue 4. Art. No.: CD011040.

http://onlinelibrary.wiley.com/doi/10.1002/14651858.CD011040/abstract

http://www.cochrane.org/CD011040/DEPRESSN_exercise-therapy-for-chronic-fatigue-syndrome-individual-patient-data


  3. Wearden AJ, Dowrick C, Chew-Graham C, et al. Fatigue scale. BMJ Rapid Response. 2010.

http://www.bmj.com/rapid-response/2011/11/02/fatigue-scale-0 (accessed Feb 21, 2016).

End.

Cochrane complaints procedure:

http://www.cochranelibrary.com/help/the-cochrane-library-complaints-procedure.html

Lessons we need to learn from a Lancet Psychiatry study of the association between exercise and mental health

The closer we look at a heavily promoted study of exercise and mental health, the more its flaws become obvious. There is little support for the most basic claims being made – despite the authors marshaling enormous attention to the study.


Apparently, the editor of Lancet Psychiatry and reviewers did not give the study a close look before it was accepted.

The article was used to raise funds for a startup company in which one of the authors was heavily invested. This was disclosed, but doesn’t let the authors off the hook for promoting a seriously flawed study. Nor should the editor of Lancet Psychiatry or reviewers escape criticism, nor the large number of people on Twitter who thoughtlessly retweeted and “liked” a series of tweets from the last author of the study.

This blog post is intended to raise consciousness about bad science appearing in prestigious journals and to allow citizen scientists to evaluate their own critical thinking skills in terms of their ability to detect misleading and exaggerated claims.

1. Sometimes a disclosure of extensive conflicts of interest alerts us not to pay serious attention to a study. Instead, we should question why the study got published in a prestigious peer-reviewed journal when it had such an obvious risk of bias.

2. We need citizen scientists with critical thinking skills to identify such promotional efforts and alert others in their social network that hype and hokum are being delivered.

3. We need to stand up to authors who use scientific papers for commercial purposes, especially when they troll critics.

Read on and you will see what a skeptical look at the paper and its promotion revealed.

  • The study failed to capitalize on the potential of multiple years of data for developing and evaluating statistical models. Bigger is not necessarily better. Combining multiple years of data was wasteful and served only to provide the authors with bragging rights and the impressive but meaningless p-values that come from overly large samples.
  • The study relied on an unvalidated and inadequate measure of mental health that confounded recurring stressful environmental conditions in the work or home with mental health problems, even where validated measures of mental health would reveal no effects.
  • The study used an odd measure of history of mental health problems that undoubtedly exaggerated past history.
  • The study confused physical activity with (planned) exercise. The authors amplified their confusion by relying on an exceedingly odd strategy for getting an estimate of how much participants exercised: estimates of time spent in a single activity were used in analyses of total time spent exercising. All other physical activity was ignored.
  • The study made a passing acknowledgment of the problems interpreting simple associations as causal, but then went on to selectively sample the existing literature to make the case that interventions to increase exercise improve mental health.
  • Taken together, a skeptical assessment of this article provides another demonstration that disclosure of substantial financial conflicts of interest should alert readers to a high likelihood of a hyped, inaccurately reported study.
  • The article was paywalled so that anyone interested in evaluating the authors' claims for themselves had to write to the author or have access to the article through a university library site. I am waiting for the authors to reply to my requests for the supplementary tables that are needed to make full sense of their claims. In the meantime, I'll just complain about authors with significant conflicts of interest heavily promoting studies that they hide behind paywalls.

I welcome you to examine the author's thread of tweets. Request the actual article from the author if you want to evaluate my claims independently. This can be great material for a masters or honors class on critical appraisal, whether in psychology or journalism.


Let me know if you think that I’ve been too hard on this study.

A thread of tweets from the last author celebrated the success of a well-orchestrated publicity campaign for a new article concerning exercise and mental health in Lancet Psychiatry.

The thread started:

Our new @TheLancetPsych paper was the biggest ever study of exercise and mental health. it caused quite a stir! here’s my guided tour of the paper, highlighting some of our excitements and apprehensions along the way [thread] 1/n

And ended with a pitch for the author's do-good startup company:

Where do we go from here? Over @spring_health – our mental health startup in New York City – we’re using these findings to develop personalized exercise plans. We want to help every individual feel better—faster, and understand exactly what each patient needs the most.

I wasn’t long into the thread before my skepticism was stimulated. The fourth tweet in the thread had a figure that didn’t get any comments about how bizarre it was.

The tweet:

It looks like those differences mattered. for example, people who exercised for about 45 minutes seemed to have better mental health than people who exercised for less than 30, or more than 60 minutes. — a sweet spot for mental health, perhaps?

[Figure: graphs from the paper]

Apparently the author does not comment on an anomaly either. Housework appears to be better for mental health than a summary score of all exercise and looks equal to or better than cycling or jogging. But how did housework slip into the category “exercise”?

I began wondering what the authors meant by “exercise” or if they'd given the definition serious consideration when constructing their key variable from the survey data.

But then that tweet was followed by another one that generated more confusion, with a graph that seemingly contradicted the figures in the last one:

the type of exercise people did seems important too! People doing team sports or cycling had much better mental health than other sports. But even just walking or doing household chores was better than nothing!

Then a self-congratulatory tweet for a promotional job well done.

for sure — these findings are exciting, and it has been overwhelming to see the whole world talking openly and optimistically about mental health, and how we can help people feel better. It isn’t all plain sailing though…

The author’s next tweet revealed a serious limitation to the measure of mental health used in the study in a screenshot.

[Screenshot: tweet describing the mental health variable]

The author acknowledged the potential problem, sort of:

(1b- this might not be the end of the world. In general, most peple have a reasonable understanding of their feelings, and in depressed or anxious patients self-report evaluations are highly correlated with clinician-rated evaluations. But we could be more precise in the future)

“Not the end of the world?” Since when does the author of a paper in the Lancet family of journals so casually brush off a serious methodological issue? A lot of us who have examined the validity of mental health measures would be skeptical of this dismissal of a potentially fatal limitation.

No validation is provided for this measure. On the face of it, respondents could endorse it on the basis of facing recurring stressful situations that had no consequences for their mental health. This reflects the ambiguity of the term stress for both laypersons and scientists. “Stress” could variously refer to an environmental situation, a subjective experience of stress, or an adaptational outcome. Waitstaff could consider Thursday, when the chef is off, a recurrent weekly stress. Persons with diagnosable persistent depressive disorder would presumably endorse more days than not as being a mental health challenge. But they would mean something entirely different.

The author acknowledged that the association between exercise and mental health might be bidirectional in terms of causality.

[Screenshot: tweet acknowledging lots of reasons to believe the relationship goes both ways]

But then made a strong claim for increased exercise leading to better mental health.

[Screenshot: tweet claiming exercise increases mental health]

[Actually, as we will see, the evidence from randomized trials of exercise to improve mental health is modest, and entirely disappears once one limits oneself to the high-quality studies.]

The author then runs off the rails with the claim that the benefits of exercise exceed the benefits of having a greater than poverty-level income.

[Screenshot: tweet, “why are we so excited”]

I could not resist responding.

Stop comparing adjusted correlations obtained under different circumstances as if they demonstrated what would be obtained in RCT. Don’t claim exercising would have more effect than poor people getting more money.

But I didn’t get a reply from the author.

Eventually, the author got around to plugging his startup company.

I didn't get it. Just how did this heavily promoted study advance the science of such “personalized recommendations”?

Important things I learned from others’ tweets about the study

I follow @BrendonStubbs on Twitter and you should too. Brendon often makes wise critical observations of studies that most everyone else is uncritically praising. But he also identifies some studies that I otherwise would miss and says very positive things about them.

He started his own thread of tweets about the study on a positive note, but then he identified a couple of critical issues.

First, he took issue with the author's tweet claiming to have identified a tipping point, below which exercise is beneficial and above which exercise could prove detrimental to mental health.

4/some interpretations are troublesome. Most confusing, are the assumptions that higher PA is associated/worsens your MH. Would we say based on cross sect data that those taking most medication/using CBT most were making their MH worse?

A postdoctoral fellow @joefirth7 seconded that concern:

I agree @BrendonStubbs: idea of high PA worsening mental health limited to observation studies. Except in rare cases of athletes overtraining, there’s no exp evidence of ‘tipping point’ effect. Cross-sect assocs of poor MH <–> higher PA likely due to multiple other factors…

Ouch! But then Brendon follows up with concerns that the measure of physical activity has not been adequately validated, noting that such self-report measures prove to be invalid.

5/ one consideration not well discussed, is self report measures of PA are hopeless (particularly in ppl w mental illness). Even those designed for population level monitoring of PA https://journals.humankinetics.com/doi/abs/10.1123/jpah.6.s1.s5 … it is also not clear if this self report PA measure has been validated?

As we will soon see, the measure used in this study is quite flawed in its conceptualization and in its odd methodology of requiring participants to estimate the time spent exercising for only one activity, chosen from 75 options.

Next, Brendon points to a particular problem using self-reported physical activity in persons with mental disorder and gives an apt reference:

6/ related to this, self report measures of PA shown to massively overestimate PA in people with mental ill health/illness – so findings of greater PA linked with mental illness likely bi-product of over-reporting of PA in people with mental illness e.g Validity and Value of Self-reported Physical Activity and Accelerometry in People With Schizophrenia: A Population-Scale Study of the UK Biobank [ https://academic.oup.com/schizophreniabulletin/advance-article/doi/10.1093/schbul/sbx149/4563831 ]

7/ An additional point he makes: anyone working in field of PA will immediately realise there is confusion & misinterpretation about the concepts of exercise & PA in the paper, which is distracting. People have been trying to prevent this happening over 30 years

Again, Brendon provides a spot-on citation clarifying the distinction between physical activity and exercise: Physical activity, exercise, and physical fitness: definitions and distinctions for health-related research.

The mysterious pseudonymous Zad Chow @dailyzad called attention to a blog post they had just uploaded; let's take a look at some of its key points.

Lessons from a blog post: Exercise, Mental Health, and Big Data

Zad Chow is quite balanced in dispensing praise and criticism of the Lancet Psychiatry paper. They noted the ambiguity of any causality in a cross-sectional correlation and investigated the literature on their own.

So what does that evidence say? Meta-analyses of randomized trials seem to find that exercise has large and positive treatment effects on mental health outcomes such as depression.

Study Name           # of Randomized Trials    Effect (SMD) [95% CI]
Schuch et al. 2016   25                        1.11 (0.79 to 1.43)
Gordon et al. 2018   33                        0.66 (0.48 to 0.83)
Krogh et al. 2017    35                        −0.66 (−0.86 to −0.46)

But, when you only pool high-quality studies, the effects become tiny.

“Restricting this analysis to the four trials that seemed less affected of bias, the effect vanished into −0.11 SMD (−0.41 to 0.18; p=0.45; GRADE: low quality).” – Krogh et al. 2017
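The Krogh et al. quote illustrates a general mechanic worth seeing in code: in a fixed-effect inverse-variance meta-analysis, a pooled estimate dominated by many high-risk-of-bias trials can shrink to near zero when only the low-risk-of-bias trials are pooled. A minimal sketch, using hypothetical study-level SMDs and standard errors (not the actual trial data):

```python
# Sketch of fixed-effect inverse-variance pooling -- the core of how such
# meta-analyses combine study-level SMDs. The study values below are
# hypothetical, NOT the actual Krogh et al. data.

def pool(effects, ses):
    """Fixed-effect inverse-variance pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    pooled_se = (1.0 / sum(weights)) ** 0.5
    return pooled, pooled_se

# Hypothetical: five high-risk-of-bias trials with large effects, plus
# four low-risk-of-bias trials with near-null effects (negative favors exercise).
effects = [-1.1, -0.9, -0.8, -0.7, -1.0, -0.15, -0.05, 0.02, -0.10]
ses     = [0.2] * 9

overall, _  = pool(effects, ses)          # pooled over all trials: moderate effect
low_bias, _ = pool(effects[5:], ses[5:])  # low-risk-of-bias trials only: near zero
```

With these made-up inputs, the overall pooled SMD is about −0.53 while the low-risk-of-bias subset pools to about −0.07, mirroring the pattern Krogh et al. report.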

Hmm, would you have guessed this from the Lancet Psychiatry author’s thread of tweets?

Zad Chow showed the hype and untrustworthiness of the press coverage in prestigious media with a sampling of screenshots.

[Images: Zad Chow's screenshots of press coverage]

I personally checked and don't see that Zad Chow's selection of press coverage was skewed. Coverage in the media all seemed to be saying the same thing. I found the distortion to continue with uncritical parroting – a.k.a. churnalism – of the claims of the Lancet Psychiatry authors in the Wall Street Journal.

The WSJ repeated a number of the author’s claims that I’ve already thrown into question and added a curiosity:

In a secondary analysis, the researchers found that yoga and tai chi—grouped into a category called recreational sports in the original analysis—had a 22.9% reduction in poor mental-health days. (Recreational sports included everything from yoga to golf to horseback riding.)

And NHS England totally got it wrong:

[Screenshot: NHS England coverage]

So, we learned that the broad category “recreational sports” covers yoga and tai chi, as well as golf and horseback riding. This raises serious questions about the lumping and splitting of categories of physical activity in the analyses that are being reported.

I needed to access the article in order to uncover some important things 

I'm grateful for the clues that I got from Twitter, and especially from Zad Chow, which I used in examining the article itself.

I got hung up on the title proclaiming that the study involved 1.2 million individuals. When I checked the article, I saw that the authors used three waves of publicly available data to get that number. Having that many participants gave them no real advantage except for bragging rights and the likelihood that modest associations could be expressed in spectacular p-values, like p < 2.2 × 10−16. I don't understand why the authors didn't conduct analyses in one wave and cross-validate the results in another.
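To see why such p-values say nothing about practical importance, consider a hypothetical illustration (my numbers, not the paper's analysis): with a sample of 1.2 million, even a negligible correlation tests as overwhelmingly significant.

```python
from math import erf, sqrt

# Hypothetical illustration (not the paper's analysis): with n = 1.2 million,
# even a trivially small correlation is wildly "significant".
n = 1_200_000
r = 0.01                                    # explains only 0.01% of the variance

# Large-sample normal approximation to the test of r = 0
z = r * sqrt(n - 2) / sqrt(1 - r ** 2)
p = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # two-sided p-value
```

Here z comes out near 11 and the p-value underflows toward zero, an “impressive” number attached to a practically meaningless effect.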

The obligatory Research in Context box made it sound like a systematic search of the literature had been undertaken. Maybe, but the authors were highly selective in what they chose to comment upon, as seen in the contradiction by Zad Chow's brief review. The authors would have us believe that the existing literature is quite limited and inconclusive, supporting the need for a study like theirs.

[Image: the article's Research in Context box]

Caveat lector: a strong confirmation bias likely lies ahead in this article.

Questions accumulated quickly as to the appropriateness of the items available from a national survey undoubtedly constructed for other purposes. Certainly these items would not have been selected if the original investigators had been interested in the research question at the center of this article.

Participants self-reported a previous diagnosis of depression or depressive episode on the basis of the following question: “Has a doctor, nurse, or other health professional EVER told you that you have a depressive disorder, including depression, major depression, dysthymia, or minor depression?”

Our own work has cast serious doubt on the correspondence between reports of a history of depression in response to a brief question embedded in a larger survey and the results of a structured interview in which respondents' answers can be probed. We found that answers to such questions were more related to current distress than to actual past diagnoses and treatment of depression. However, the survey question used in the Lancet Psychiatry study added further ambiguity and invalidity with the phrase “or minor depression.” I am not sure under what circumstances a health care professional would disclose a diagnosis of “minor depression” to a patient, but I doubt it would be in a context in which the professional felt treatment was needed.

Despite the skepticism that I was developing about the usefulness of the survey data, I was unprepared for the assessment of “exercise.”

Other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise?” Participants who answered yes to this question were then asked: “What type of physical activity or exercise did you spend the most time doing during the past month?” A total of 75 types of exercise were represented in the sample, which were grouped manually into eight exercise categories to balance a diverse representation of exercises with the need for meaningful cell sizes (appendix).

Participants indicated the number of times per week or month that they did this exercise and the number of minutes or hours that they usually spend exercising in this way each time.

I had already been tipped off by the discussion on Twitter that there would be a thorough confusion of planned exercise and mere physical activity. But now that was compounded. Why was physical activity during employment excluded? What if participants were engaged in a number of different physical activities, like both jogging and bicycling? If so, the survey obtained data for only one of these activities, with the other excluded, and the choice could have been quite arbitrary as to which one the participant identified as the one to be counted.

Anyone who has ever constructed surveys would be alert to the problems posed by participants' awareness that saying “yes” to exercising would require contemplating 75 different options and arbitrarily choosing one of them for a further question about how much time the participant engaged in this activity. Unless participants were strongly motivated, there was an incentive to simply say no, they didn't exercise.

I suppose I could go on, but it was my judgment that any validity to what the authors were claiming had been ruled out. As someone once said on an NIH grant review panel: there are no vital signs left, let's move on to the next item.

But let's refocus just a bit on the overall intention of these authors. They want to use a large data set to make statements about the association between physical activity and a measure of mental health. They have used matching and statistical controls to equate participants. But that strategy effectively eliminates consideration of crucial contextual variables. Persons' preferences and opportunities to exercise are powerfully shaped by their personal and social circumstances, including finances and competing demands on their time. Said differently, people are embedded in contexts that a lot of statistical maneuvering has sought to eliminate.

To suggest a small number of the many complexities: how much physical activity participants get in their employment may be an important determinant of their choices for additional activity, as well as of how much time is left outside of work. If work typically involves a lot of physical exertion, people may simply be left too tired for additional planned physical activity, a.k.a. exercise, and their physical health may require less of it. Environments differ greatly in terms of the opportunities for and the safety of engaging in various kinds of physical activities. Team sports require other people being available. Etc., etc.

What I learned from the editorial accompanying the Lancet Psychiatry article

The brief editorial accompanying the article aroused my curiosity as to whether someone assigned to reading and commenting on this article would catch things that apparently the editor and reviewer missed.

Editorial commentators are chosen to praise, not to bury articles. There are strong social pressures to say nice things. However, this editorial leaked a number of serious concerns.

First

In presenting mental health as a workable, unified concept, there is a presupposition that it is possible and appropriate to combine all the various mental disorders as a single entity in pursuing this research. It is difficult to see the justification for this approach when these conditions differ greatly in their underlying causes, clinical presentation, and treatment. Dementia, substance misuse, and personality disorder, for example, are considered as distinct entities for research and clinical purposes; capturing them for study under the combined banner of mental health might not add a great deal to our understanding.

The problem here of categorisation is somewhat compounded by the repeated uncomfortable interchangeability between mental health and depression, as if these concepts were functionally equivalent, or as if other mental disorders were somewhat peripheral.

Then:

A final caution pertains to how studies approach a definition of exercise. In the current study, we see the inclusion of activities such as childcare, housework, lawn-mowing, carpentry, fishing, and yoga as forms of exercise. In other studies, these activities would be excluded for not fulfilling the definition of exercise as offered by the American College of Sports Medicine: “planned, structured and repetitive bodily movement done to improve or maintain one or more components of physical fitness.” 11 The study by Chekroud and colleagues, in its all-encompassing approach, might more accurately be considered a study in physical activity rather than exercise.

The authors were listening for a theme song with which they could promote their startup company in a very noisy data set. They thought they had a hit. I think they had noise.

The authors' extraordinary disclosure of interests (see below this blog post) should have precluded publication of this seriously flawed piece of work, either simply by reason of the high likelihood of bias, or it should have prompted the editor and reviewers to look more carefully at the serious flaws hiding in plain sight.

Postscript: Send in the trolls.

On Twitter, Adam Chekroud announced he felt no need to respond to critics. Instead, he retweeted and “liked” trolling comments directed at critics from the twitter accounts of his brother, his mother, and even the official Twitter account of a local fried chicken joint @chickenlodge, that offered free food for retweets and suggested including Adam Chekroud’s twitter handle if you wanted to be noticed.

[Screenshot: @chickenlodge tweet]

Really, Adam, if you can't stand the heat, don't go near where they are frying chicken.

The Declaration of Interests from the article.

[Images: the article's Declaration of Interests, parts 1 and 2]


Headspace mindfulness training app no better than a fake mindfulness procedure for improving critical thinking, open-mindedness, and well-being.

The Headspace app increased users' critical thinking and open-mindedness. So did practicing a sham mindfulness procedure: participants simply sat with their eyes closed, but thought they were meditating.


Results call into question claims about Headspace coming from other studies that did not have such a credible, active control group comparison.

Results also call into question the widespread use of standardized self-report measures of mindfulness to establish whether someone is in the state of mindfulness. These measures don’t distinguish between the practice of standard versus fake mindfulness.

Results can be seen as further evidence that the apparent benefits of practicing mindfulness depend on nonspecific factors (AKA placebo), rather than on any active, distinctive ingredient.

Hopefully this study will prompt better studies evaluating the Headspace App, as well as evaluations of mindfulness training more generally, using credible active treatments, rather than no treatment or waitlist controls.

Maybe it is time for a moratorium on trials of mindfulness without such an active control, or at least a tempering of claims based on poorly controlled trials.

This study points to the need to develop more psychometrically sophisticated measures of mindfulness that are not so vulnerable to experimenter expectations and demand characteristics.

Until the accumulation of better studies with better measures, claims about the effects of practicing mindfulness ought to be recognized as based on relatively weak evidence.

The study

Noone, C., & Hogan, M. Randomised active-controlled trial of effects of online mindfulness intervention on executive control, critical thinking and key thinking dispositions. BMC Psychology, 2018.

Trial registration

The study was initially registered in the AEA Social Science Registry before the recruitment was initiated (RCT ID: AEARCTR-0000756; 14/11/2015) and retrospectively registered in the ISRCTN registry (RCT ID: ISRCTN16588423) in line with requirements for publishing the study protocol.

Excerpts from the Abstract

The aim of this study was…investigating the effects of an online mindfulness intervention on executive function, critical thinking skills, and associated thinking dispositions.

Method

Participants recruited from a university were randomly allocated, following screening, to either a mindfulness meditation group or a sham meditation group. Both the researchers and the participants were blind to group allocation. The intervention content for both groups was delivered through the Headspace online application, an application which provides guided meditations to users.

And

Primary outcome measures assessed mindfulness, executive functioning, critical thinking, actively open-minded thinking, and need for cognition. Secondary outcome measures assessed wellbeing, positive and negative affect, and real-world outcomes.

Results

Significant increases in mindfulness dispositions and critical thinking scores were observed in both the mindfulness meditation and sham meditation groups. However, no significant effects of group allocation were observed for either primary or secondary measures. Furthermore, mediation analyses testing the indirect effect of group allocation through executive functioning performance did not reveal a significant result and moderation analyses showed that the effect of the intervention did not depend on baseline levels of the key thinking dispositions, actively open-minded thinking, and need for cognition.

The authors conclude

While further research is warranted, claims regarding the benefits of mindfulness practice for critical thinking should be tempered in the meantime.

Headspace being used on an iPhone

The active control condition

The sham treatment control condition was embarrassingly straightforward and simple. But as we will see, participants found it credible.

This condition presented the participants with guided breathing exercises. Each session began by inviting the participants to sit with their eyes closed. These exercises were referred to as meditation but participants were not given guidance on how to control their awareness of their body or breath. This approach was designed to control for the effects of expectations surrounding mindfulness and physiological relaxation to ensure that the effect size could be attributed to mindfulness practice specifically. This content was also delivered by Andy Puddicombe and was developed based on previous work by Zeidan and colleagues [55, 57, 58].

What can we conclude about the standard self-report measures of the state of mindfulness?

The study used the Five Facet Mindfulness Questionnaire, which is widely used to assess whether people are in a state of mindfulness. It has been cited almost 4000 times.

Participants assigned to the mindfulness condition had significant changes for all five facets from baseline to follow-up: observing, non-reactivity, non-judgment, acting with awareness, and describing. In the absence of a comparison with change in the sham mindfulness group, these pre-post results would seem to suggest that the measure was sensitive to whether participants had practiced mindfulness. However, there were no differences between the changes observed for participants assigned to mindfulness and the changes for those who were simply asked to sit with their eyes closed.
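The pattern described here (significant within-group change in both arms, but no between-group difference) can be illustrated with a toy calculation. The scores below are invented for illustration and are not data from the study:

```python
# Toy illustration with invented scores: both arms improve from
# baseline, so a pre-post analysis of either arm alone suggests an
# "effect", yet the between-group comparison of change scores is nil.
from statistics import mean

mindfulness_pre  = [2.0, 2.2, 2.4, 2.1, 2.3]
mindfulness_post = [2.8, 3.0, 3.1, 2.9, 3.2]
sham_pre  = [2.1, 2.3, 2.2, 2.0, 2.4]
sham_post = [2.9, 3.1, 3.0, 2.8, 3.2]

def mean_change(pre, post):
    """Average pre-to-post change across participants."""
    return mean(b - a for a, b in zip(pre, post))

mindfulness_change = mean_change(mindfulness_pre, mindfulness_post)
sham_change = mean_change(sham_pre, sham_post)

print(round(mindfulness_change, 2))  # within-arm improvement, mindfulness arm
print(round(sham_change, 2))         # the sham arm improves just as much
print(round(mindfulness_change - sham_change, 2))  # between-arm effect: nil
```

A pre-post analysis alone would credit the app; only the between-arm contrast reveals that the sham did just as well.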

I asked Chris Noone about the questionnaires his group used to assess mindfulness:

The participants genuinely thought they were meditating in the sham condition so I think both non-specific and demand characteristics were roughly equivalent across both groups. I’m also skeptical regarding the ability of the Five-Facet Mindfulness Questionnaire (or any mindfulness questionnaire for that matter) to capture anything other than “perceived mindfulness”. The items used in these questionnaires feature similar content to the scripts used by the people delivering the mindfulness (and sham) guided meditations. The improvement in critical thinking across both groups is just a mix of learning across a semester and habituation to the task (as the same problems were posed at both measurements).

What I like about this trial

The trial provides a critical test of a key claim for mindfulness:

Mindfulness should facilitate critical thinking in higher-education, based on early Buddhist conceptualizations of mindfulness as clarity of thought.

The trial was registered before recruitment and departures from protocol were noted.

Sample size was determined by power analysis.

The study had a closely matched, active control condition, a sham mindfulness treatment.

The credibility and equivalence of this sham condition versus the active treatment under study was repeatedly assessed.

“Manipulation checks were carried out to assess intervention acceptability, technology acceptance and meditation quality 2 weeks after baseline and 4 weeks after baseline.”

The study tested some a priori hypotheses about mediation and moderation.

Analyses were intention to treat.
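As an aside on what such a power analysis involves, here is a normal-approximation sketch of a per-arm sample-size calculation for a two-group comparison. The effect size, alpha, and power below are conventional placeholder values, not the parameters the authors actually used:

```python
# Hedged sketch of an a priori sample-size calculation for a two-arm
# trial, using the normal approximation. The effect size, alpha, and
# power are conventional placeholders, not the authors' actual values.
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate participants needed per arm for a two-sample comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided criterion
    z_beta = NormalDist().inv_cdf(power)
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

print(round(n_per_group(0.5)))  # roughly 63 per arm for a medium effect
```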

 How the study conflicts with past studies

Previous studies claimed to show positive effects of mindfulness on aspects of executive functioning [25, 26].

How the contradiction of past studies by these results is resolved

 “There are many studies using guided meditations similar to those in our mindfulness meditation condition, delivered through smartphone applications [49, 50, 52, 90, 91], websites [92, 93, 94, 95, 96, 97] and CDs [98, 99], which show effects on measures of outcomes reliably associated with increases in mindfulness such as depression, anxiety, stress, wellbeing and compassion. There are two things to note about these studies – they tend not to include a measure of dispositional mindfulness (e.g. only 4% of all mindfulness intervention studies reviewed in a recent meta-analysis include such measures at baseline and follow-up; [54]) and they usually employ a weak form of control group such as a no-treatment control or waitlist control [54]. Therefore, even when change in mindfulness is assessed in mindfulness meditation intervention studies, it is usually overestimated and this must be borne in mind when comparing the results of this study with those of previous studies. This combined with generally only moderate correlations with behavioural outcomes [54] suggests that when mindfulness interventions are effective, dispositional measures do not fully capture what has changed.”

The broader take away messages

“Our results show that, for most outcomes, there were significant changes from baseline to follow-up but none which can be specifically attributed to the practice of mindfulness.”

This creative use of a sham mindfulness control condition is a breakthrough that should be widely followed. First, it allowed a fair test of whether mindfulness is any better than another active, credible treatment. Second, because the active treatment was a sham, results provide a challenge to the notion that apparent effects of mindfulness on critical thinking are anything more than a placebo effect.

The Headspace App is enormously popular and successful, based on claims about what benefits its use will provide. Some of these claims may need to be tempered, not only in terms of critical thinking, but also in terms of effects on well-being.

The Headspace App platform lends itself to such critical evaluations with respect to a sham treatment with a degree of standardization that is not readily possible with face-to-face mindfulness training. This opportunity should be exploited further with other active control groups constructed on the basis of specific hypotheses.

There is far too much research on the practice of mindfulness being done that does not advance understanding of what works or how it works. We need a lot fewer studies, and more with adequate control/comparison groups.

Perhaps we should have a moratorium on evaluations of mindfulness without adequate control groups.

Perhaps articles that make enthusiastic claims for the benefits of mindfulness should routinely note whether those claims are based on adequately controlled studies. Most are not.

When psychotherapy trials have multiple flaws…

Multiple flaws pose more threats to the validity of psychotherapy studies than would be inferred when the individual flaws are considered independently.


We can learn to spot features of psychotherapy trials that are likely to lead to exaggerated claims of efficacy for treatments, or to claims that will not generalize beyond the sample being studied in a particular clinical trial. We can look to the adequacy of sample size, and spot what the Cochrane Collaboration has defined as risks of bias in its handy assessment tool.

We can look at the case-mix in the particular sites where patients were recruited. We can examine the adequacy of the diagnostic criteria used for entering patients into a trial. We can examine how blinded the trial was: who assigned patients to particular conditions, and whether the patients, the treatment providers, and the evaluators knew which condition particular patients were assigned to.

And so on. But what about combinations of these factors?

We typically do not pay enough attention to multiple flaws in the same trial. I include myself among the guilty. We may suspect that flaws are seldom simply additive in their effects, but we don’t consider whether there may even be synergism in their negative effects on the validity of a trial. As we will see in this analysis of a clinical trial, multiple flaws can pose more threats to a trial’s validity than we might infer when the individual flaws are considered independently.

The particular paper we are probing is described in its discussion section as the “largest RCT to date testing the efficacy of group CBT for patients with CFS.” It also takes on added importance because two of the authors, Gijs Bleijenberg and Hans Knoop, are considered leading experts in the Netherlands. The treatment protocol was developed over time by the Dutch Expert Centre for Chronic Fatigue (NKCV, http://www.nkcv.nl; Knoop and Bleijenberg, 2010). Moreover, these senior authors dismiss any criticism and even ridicule critics. This study is cited as support for their overall assessment of their own work.  Gijs Bleijenberg claims:

Cognitive behavioural therapy is still an effective treatment, even the preferential treatment for chronic fatigue syndrome.

But

Not everybody endorses these conclusions, however their objections are mostly baseless.

Spoiler alert

This is a long read blog post. I will offer a summary for those who don’t want to read through it, but who still want the gist of what I will be saying. However, as always, I encourage readers to be skeptical of what I say and to look to my evidence and arguments and decide for themselves.

Authors of this trial stacked the deck to demonstrate that their treatment is effective. They are striving to support the extraordinary claim that group cognitive behavior therapy fosters not only better adaptation, but actually recovery from what is internationally considered a physical condition.

There are some obvious features of the study that contribute to the likelihood of a positive effect, but these features need to be considered collectively, in combination, to appreciate the strength of this effort to guarantee positive results.

This study represents the perfect storm of design features that operate synergistically:

perfect storm

 Referral bias – Trial conducted in a single specialized treatment setting known for advocating psychological factors maintaining physical illness.

Strong self-selection bias of a minority of patients enrolling in the trial seeking a treatment they otherwise cannot get.

Broad, overinclusive diagnostic criteria for entry into the trial.

The active treatment condition carries a strong message about how patients should respond to outcome assessments, namely with reports of improvement.

An unblinded trial with a waitlist control lacking the nonspecific elements (placebo) that confound the active treatment.

Subjective self-report outcomes.

Specifying a clinically significant improvement that required only that a primary outcome score fall below the threshold needed for entry into the trial.

Deliberate exclusion of relevant objective outcomes.

Avoidance of any recording of negative effects.

Despite the prestige attached to this trial in Europe, the US Agency for Healthcare Research and Quality (AHRQ) excludes this trial from providing evidence for its database of treatments for chronic fatigue syndrome/myalgic encephalomyelitis. We will see why in this post.

The take-away message: Although not many psychotherapy trials incorporate all of these factors, most trials have some. We should be more sensitive to when multiple factors occur in the same trial, such as bias in the site for patient recruitment, lack of blinding, lack of balance between active treatment and control condition in terms of nonspecific factors, and subjective self-report measures.

The article reporting the trial is

Wiborg JF, van Bussel J, van Dijk A, Bleijenberg G, Knoop H. Randomised controlled trial of cognitive behaviour therapy delivered in groups of patients with chronic fatigue syndrome. Psychotherapy and Psychosomatics. 2015;84(6):368-76.

Unfortunately, the article is currently behind a pay wall. Perhaps readers could contact the corresponding author Hans.knoop@radboudumc.nl  and request a PDF.

The abstract

Background: Meta-analyses have been inconclusive about the efficacy of cognitive behaviour therapies (CBTs) delivered in groups of patients with chronic fatigue syndrome (CFS) due to a lack of adequate studies. Methods: We conducted a pragmatic randomised controlled trial with 204 adult CFS patients from our routine clinical practice who were willing to receive group therapy. Patients were equally allocated to therapy groups of 8 patients and 2 therapists, 4 patients and 1 therapist or a waiting list control condition. Primary analysis was based on the intention-to-treat principle and compared the intervention group (n = 136) with the waiting list condition (n = 68). The study was open label. Results: Thirty-four (17%) patients were lost to follow-up during the course of the trial. Missing data were imputed using mean proportions of improvement based on the outcome scores of similar patients with a second assessment. Large and significant improvement in favour of the intervention group was found on fatigue severity (effect size = 1.1) and overall impairment (effect size = 0.9) at the second assessment. Physical functioning and psychological distress improved moderately (effect size = 0.5). Treatment effects remained significant in sensitivity and per-protocol analyses. Subgroup analysis revealed that the effects of the intervention also remained significant when both group sizes (i.e. 4 and 8 patients) were compared separately with the waiting list condition. Conclusions: CBT can be effectively delivered in groups of CFS patients. Group size does not seem to affect the general efficacy of the intervention which is of importance for settings in which large treatment groups are not feasible due to limited referral

The trial registration

http://www.isrctn.com/ISRCTN15823716

Who was enrolled into the trial?

Who gets into a psychotherapy trial is a function of the particular treatment setting of the study, the diagnostic criteria for entry, and patient preferences for getting their care through a trial, rather than what is being routinely provided in that setting.

 We need to pay particular attention to when patients enter psychotherapy trials hoping they will receive a treatment they prefer and not to be assigned to the other condition. Patients may be in a clinical trial for the betterment of science, but in some settings, they are willing to enroll because of a probability of getting treatment they otherwise could not get. This in turn also affects the evaluation of both the condition in which they get the preferred treatment, but also their evaluation of the condition in which they are denied it. Simply put, they register being pleased with what they wanted or not being pleased if they did not get what they wanted.

The setting is relevant to evaluating who was enrolled in a trial.

The authors’ own outpatient clinic at the Radboud University Medical Center was the site of the study. The group has an international reputation for promoting the biopsychosocial model, in which psychological factors are assumed to be the decisive factor in maintaining somatic complaints.

All patients were referred to our outpatient clinic for the management of chronic fatigue.

There is thus a clear referral bias or case-mix bias, but we are not provided a ready basis for quantifying it or even estimating its effects.

The diagnostic criteria.

The article states:

In accordance with the US Center for Disease Control [9], CFS was defined as severe and unexplained fatigue which lasts for at least 6 months and which is accompanied by substantial impairment in functioning and 4 or more additional complaints such as pain or concentration problems.

Actually, the US Center for Disease Control would now reject this trial because these entry criteria are considered obsolete, overinclusive, and not sufficiently exclusive of other conditions that might be associated with chronic fatigue.*

There is a real paradigm shift happening in America. Both the 2015 IOM Report and the Centers for Disease Control and Prevention (CDC) website emphasize Post Exertional Malaise and getting more ill after any effort with M.E. CBT is no longer recommended by the CDC as treatment.

The only mandatory symptom for inclusion in this study is fatigue lasting 6 months. Most properly, this trial targets chronic fatigue [period] and not the condition, chronic fatigue syndrome.

Current US CDC recommendations  (See box  7-1 from the IoM document, above) for diagnosis require postexertional malaise for a diagnosis of myalgic encephalomyelitis (ME). See below.

Patients meeting the current American criteria for ME would be eligible for enrollment in this trial, but it is unclear what proportion of the patients enrolled actually met the American criteria. Because of the over-inclusiveness of the entry diagnostic criteria, it is doubtful whether the results would generalize to an American sample. A look at patient flow into the study will be informative.

Patient flow

Let’s look at what is said in the text, but also in the chart depicting patient flow into the trial for any self-selection that might be revealed.

In total, 485 adult patients were diagnosed with CFS during the inclusion period at our clinic (fig. 1). One hundred and fifty-seven patients were excluded from the trial because they declined treatment at our clinic, were already asked to participate in research incompatible with inclusion (e.g. research focusing on individual CBT for CFS) or had a clinical reason for exclusion (i.e. they received specifically tailored interventions because they were already unsuccessfully treated with individual CBT for CFS outside our clinic or were between 18 and 21 years of age and the family had to be involved in the therapy). Of the 328 patients who were asked to engage in group therapy, 99 (30%) patients indicated that they were unwilling to receive group therapy. In 25 patients, the reason for refusal was not recorded. Two hundred and four patients were randomly allocated to one of the three trial conditions. Baseline characteristics of the study sample are presented in table 1. In total, 34 (17%) patients were lost to follow-up. Of the remaining 170 patients, 1 patient had incomplete primary outcome data and 6 patients had incomplete secondary outcome data.

flow chart

We see that the investigators invited two thirds of the patients attending the clinic to enroll in the trial. Of those asked, 124 (38%) were not randomized. We don’t know the reason for some of the refusals, but almost a third of the patients approached declined because they did not want group therapy. The authors were left able to randomize 42% of the patients coming to the clinic, or less than two thirds of the patients they actually asked. Of these patients, 83% (170 of 204) received the treatment to which they were randomized and were available for follow-up.
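The proportions in this paragraph can be recomputed directly from the counts quoted from the paper:

```python
# Patient-flow counts as reported in the paper.
diagnosed = 485   # diagnosed with CFS at the clinic during inclusion
excluded = 157    # never asked (declined treatment, other research, clinical reasons)
asked = diagnosed - excluded   # asked to engage in group therapy
randomised = 204  # allocated across the three trial arms
lost = 34         # lost to follow-up

print(asked)                                # 328 patients asked
print(round(100 * asked / diagnosed))       # ~68%: about two thirds invited
print(round(100 * randomised / diagnosed))  # 42% of clinic patients randomised
print(round(100 * randomised / asked))      # 62%: less than two thirds of those asked
print(round(100 * lost / randomised))       # 17% lost to follow-up
```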

These patients, who received the treatment to which they were randomized and were available for follow-up, are a self-selected minority of the patients coming to the clinic. This self-selection process likely reduced the proportion of patients with myalgic encephalomyelitis. It is estimated that 25% of patients meeting the American criteria are housebound and 75% are unable to work. It is reasonable to infer that patients meeting the full criteria would opt out of a treatment that requires regular attendance at group sessions.

The trial is thus biased toward ambulatory patients with fatigue, not ME. Their fatigue is likely due to some combination of factors such as multiple co-morbidities, as-yet-undiagnosed medical conditions, drug interactions, and the common mild and subsyndromal anxiety and depressive symptoms that characterize primary care populations.

The treatment being evaluated

Group cognitive behavior therapy for chronic fatigue syndrome, either delivered in a small (4 patients and 1 therapist) or larger (8 patients and 2 therapists) group format.

The intervention consisted of 14 group sessions of 2 h within a period of 6 months followed by a second assessment. Before the intervention started, patients were introduced to their group therapist in an individual session. The intervention was based on previous work of our research group [4,13] and included personal goal setting, fixing sleep-wake cycles, reducing the focus on bodily symptoms, a systematic challenge of fatigue-related beliefs, regulation and gradual increase in activities, and accomplishment of personal goals. A formal exercise programme was not part of the intervention.

Patients received a workbook with the content of the therapy. During sessions, patients were explicitly invited to give feedback about fatigue-related cognitions and behaviours to fellow patients. This aspect was introduced to facilitate a pro-active attitude and to avoid misperceptions of the sessions as support group meetings which have been shown to be insufficient for the treatment of CFS.

And note:

In contrast to our previous work [4], we communicated recovery in terms of fatigue and disabilities as general goal of the intervention.

Some impressions of the intensity of this treatment. This is a rather intensive treatment with patients having considerable opportunities for interactions with providers. This factor alone distinguishes being assigned to the intervention group versus being left in the wait list control group and could prove powerful. It will be difficult to distinguish intensity of contact from any content or active ingredients of the therapy.

I’ll leave for another time a fuller discussion of the extent to which what was labeled as cognitive behavior therapy in this study is consistent with cognitive therapy as practiced by Aaron Beck and other leaders of the field. However, a few comments are warranted. What is offered in this trial does not sound like cognitive therapy as Americans practice it. It seems to emphasize challenging beliefs and pushing patients to get more active, along with psychoeducational activities. I don’t see indications of the supportive, collaborative relationship in which patients are encouraged to work on what they want to work on, engage in outside activities (homework assignments), and get feedback.

What is missing in this treatment is what Beck calls collaborative empiricism, “a systemic process of therapist and patient working together to establish common goals in treatment, has been found to be one of the primary change agents in cognitive-behavioral therapy (CBT).”

Importantly, in Beck’s approach, the therapist does not assume cognitive distortions on the part of the patient. Rather, in collaboration with the patient, the therapist introduces alternatives to the interpretations the patient has been making and encourages the patient to consider the difference. In contrast, rather than eliciting goal statements from patients, the therapists in this study impose the goal of increased activity. They also seem ready to impose their view that the patients’ fatigue-related beliefs are maladaptive.

The treatment offered in this trial is complex, with multiple components making multiple assumptions that seem quite different from what is called cognitive therapy or cognitive behavioral therapy in the US.

The authors’ communication of recovery from fatigue and disability seems a radical departure not only from cognitive behavior therapy for anxiety and depression and pain, but for cognitive behavior therapy offered for adaptation to acute and chronic physical illnesses. We will return to this “communication” later.

The control group

Patients not randomized to group CBT were placed on a waiting list.

Think about it! What do patients think about having endured all the inconvenience and burden of a clinical trial in hope of getting treatment, only to be assigned to the control group and left waiting? Not only are they going to be disappointed and register that in their subjective evaluations at outcome assessment; they may also worry that endorsing overly positive outcomes would jeopardize their right to the treatment they are waiting for. There is potential for a nocebo effect, compounding the placebo effect of assignment to the CBT active treatment groups.

What are informative comparisons between active treatments and  control conditions?

We need to ask more often what inclusion of a control group accomplishes for the evaluation of a psychotherapy. In doing so, we need to keep in mind that psychotherapies do not have effect sizes; only comparisons of psychotherapies with control conditions have effect sizes.

A pre-post evaluation of psychotherapy from baseline to follow-up includes the effects of any active ingredient in the psychotherapy, a host of nonspecific (placebo) factors, and any changes that would have occurred in the absence of the intervention. These include regression to the mean: patients are more likely to enter a clinical trial now, rather than earlier or later, if there has been an exacerbation of their symptoms.

So, a proper comparison/control condition includes everything that the patients randomized to the intervention group get except for the active treatment. Ideally, the intervention and the comparison/control group are equivalent on all these factors, except the active ingredient of the intervention.
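To make the point concrete: a standardized effect size such as Cohen's d is computed from the difference between two arms, so it quantifies a comparison, never a therapy in isolation. A minimal sketch with invented change scores:

```python
# Cohen's d is a property of a between-group comparison, not of a
# therapy by itself. The change scores below are invented for
# illustration only.
from math import sqrt
from statistics import mean, stdev

def cohens_d(group1, group2):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_sd = sqrt(((n1 - 1) * stdev(group1) ** 2 +
                      (n2 - 1) * stdev(group2) ** 2) / (n1 + n2 - 2))
    return (mean(group1) - mean(group2)) / pooled_sd

# Fatigue improvement (baseline minus follow-up) in each arm. Both
# arms improve; d reflects only how much MORE one arm improved.
cbt_change = [10, 12, 9, 11, 13, 10]
waitlist_change = [8, 9, 7, 10, 9, 8]
print(round(cohens_d(cbt_change, waitlist_change), 2))
```

If the comparison arm soaks up none of the nonspecific factors, a large d says nothing about which ingredient of the intervention produced it.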

That is clearly not what is happening in this trial. Patients randomized to the intervention group get the intervention, the added intensity and frequency of contact with professionals that the intervention provides, and all the support that goes with it; and the positive expectations that come with getting a therapy that they wanted.

Attempts to evaluate group CBT against the wait-list control group thus confound the active ingredients of CBT with all these nonspecific effects. The deck is clearly stacked in favor of CBT.

This may be a randomized trial, but properly speaking, this is not a randomized controlled trial, because the comparison group does not control for nonspecific factors, which are imbalanced.

The unblinded nature of the trial

In RCTs of psychotropic drugs, the ideal is to compare the drug to an inert pill placebo, with providers, patients, and evaluators blinded as to whether a patient received the drug or the comparison pill.

While it is difficult to achieve a comparable level of blinding in a psychotherapy trial, more of an effort to achieve blinding is desirable. For instance, in this trial, the authors took pains to distinguish the CBT from what would have happened in a support group. A much more adequate comparison would therefore be CBT versus either a professionally led or peer-led support group with an equivalent amount of contact time. Further blinding would be possible if patients were told only that two forms of group therapy were being compared. If that were the information available to patients contemplating consenting to the trial, it would not have been so obvious from the outset which assignment was preferable.

Subjective self-report outcomes.

The primary outcomes for the trial were the fatigue subscale of the Checklist Individual Strength, the physical functioning subscale of the Short Form Health Survey (SF-36), and overall impairment as measured by the Sickness Impact Profile (SIP).

Realistically, self-report outcomes are often all that is available in many psychotherapy trials. Commonly these are self-report assessments of anxiety and depressive symptoms, although these may be supplemented by interviewer-based assessments. We don’t have objective biomarkers with which to evaluate psychotherapy.

These three self-report measures are relatively nonspecific, particularly in a population that is not characterized by ME. Self-reported fatigue in a primary care population lacks discriminative validity with respect to pain, anxiety and depressive symptoms, and general demoralization.  The measures are susceptible to receipt of support and re-moralization, as well as gratitude for obtaining a treatment that was sought.

The self-report entry criteria include a score of 35 or higher on the fatigue severity subscale. Yet a score of less than 35 on this same scale at follow-up is part of what is defined as clinically significant improvement, within a composite score combining self-report measures.
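The circularity is easy to see in code, using the stated fatigue threshold of 35; the actual composite criterion involves additional measures not modeled here:

```python
# Entry to the trial requires severe fatigue (score >= 35); the
# "clinically significant improvement" criterion requires, in part,
# a score below 35 at follow-up. A one-point drop therefore crosses
# the bar. The composite's other components are not modeled here.
ENTRY_THRESHOLD = 35

def eligible(fatigue_score):
    """Entry criterion on the fatigue severity subscale."""
    return fatigue_score >= ENTRY_THRESHOLD

def meets_improvement_component(fatigue_score):
    """The fatigue component of 'clinically significant improvement'."""
    return fatigue_score < ENTRY_THRESHOLD

print(eligible(35))                     # True: just severe enough to enter
print(meets_improvement_component(34))  # True: one point below entry level
```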

We know from medical trials that differences can be observed with subjective self-report measures that will not be found with objective measures. Thus, mildly asthmatic patients will fail to distinguish in their subjective self-reports between the effective inhalant albuterol, an inert inhalant, and sham acupuncture, although they will rate all three as better than getting no intervention. However, on an objective measure, maximum forced expiratory volume in 1 second (FEV1) as assessed with spirometry, albuterol shows a strong advantage over the other three conditions.

The suppression of objective outcome measures

We cannot let the authors of this trial off the hook for their dependence on subjective self-report outcomes. They instructed patients that recovery is the goal, which implies that it is an attainable goal. We can reasonably be skeptical about acclaim of recovery based on changes in self-report measures. Were the patients actually able to exercise? What was their exercise capacity, as objectively measured? Did they return to work?

These authors have included such objective measurements in past studies, but did not include them as primary outcomes, nor, in some cases, even report them in the main paper reporting the trial.

Wiborg JF, Knoop H, Stulemeijer M, Prins JB, Bleijenberg G. How does cognitive behaviour therapy reduce fatigue in patients with chronic fatigue syndrome? The role of physical activity. Psychol Med. 2010 Jan 5:1

The senior authors’ review fails to mention their three studies using actigraphy that did not find effects for CBT. I am unaware of any studies that did find enduring effects.

Perhaps this is what they mean when they say the protocol has been developed over time – they removed what they found to be threats to the findings that they wanted to claim.

Dismissing of any need to consider negative effects of treatment

Most psychotherapy trials fail to assess any adverse effects of treatment, but this is usually done discreetly, without mention. In contrast, this article states:

Potential harms of the intervention were not assessed. Previous research has shown that cognitive behavioural interventions for CFS are safe and unlikely to produce detrimental effects.

Patients who meet stringent criteria for ME would be put at risk by pressure to exert themselves. By definition they are vulnerable to post-exertional malaise (PEM). Any trial of this nature needs to assess that risk. Maybe no adverse effects would be found. If that were so, it would strongly suggest the absence of patients with appropriate diagnoses.

Timing of assessment of outcomes varied between intervention and control group.

I at first did not believe what I was reading when I encountered this statement in the results section.

The mean time between baseline and second assessment was 6.2 months (SD = 0.9) in the control condition and 12.0 months (SD = 2.4) in the intervention group. This difference in assessment duration was significant (p < 0.001) and was mainly due to the fact that the start of the therapy groups had to be frequently postponed because of an irregular patient flow and limited treatment capacities for group therapy at our clinic. In accordance with the treatment manual, the second assessment was postponed until the fourteenth group session was accomplished. The mean time between the last group session and the second assessment was 3.3 weeks (SD = 3.5).

So, outcomes were assessed for the intervention group shortly after completion of therapy, when nonspecific (placebo) effects would be stronger, but a mean of six months later than for patients assigned to the control condition.

Post-hoc statistical controls are not sufficient to rescue the study from this important group difference, and it compounds other problems in the study.

Take away lessons

Pay more attention to how the limitations of any clinical trial may compound each other, leading the trial to provide exaggerated estimates of the effects of treatment or of the generalizability of the results to other settings.

Be careful of loose diagnostic criteria, because a trial may not generalize to the same criteria being applied in settings that differ either in patient population or in the availability of different treatments. This is particularly important when a treatment setting has a bias in referrals and only a minority of patients invited to participate in the trial actually agree and are enrolled.

Ask questions about just what information is obtained in comparing the active treatment group in a study to its control/comparison group. For a start, just what is being controlled, and how might that affect the estimates of the effectiveness of the active treatment?

Pay particular attention to the potent combination of the trial being unblinded, a weak comparison/control, and an active treatment that is not otherwise available to patients.

Note

*The means of determining whether the six months of fatigue might be accounted for by other medical factors was specific to the setting. Note that a review of medical records was deemed sufficient for an unknown proportion of patients, with no further examination or medical tests.

The Department of Internal Medicine at the Radboud University Medical Center assessed the medical examination status of all patients and decided whether patients had been sufficiently examined by a medical doctor to rule out relevant medical explanations for the complaints. If patients had not been sufficiently examined, they were seen for standard medical tests at the Department of Internal Medicine prior to referral to our outpatient clinic. In accordance with recommendations by the Centers for Disease Control, sufficient medical examination included evaluation of somatic parameters that may provide evidence for a plausible somatic explanation for prolonged fatigue (for a list, see [9]). When abnormalities were detected in these tests, additional tests were made based on the judgement of the clinician of the Department of Internal Medicine who ultimately decided about the appropriateness of referral to our clinic. Trained therapists at our clinic ruled out psychiatric comorbidity as potential explanation for the complaints in unstructured clinical interviews.


Power pose: I. Demonstrating that replication initiatives won’t salvage the trustworthiness of psychology

An ambitious multisite initiative showcases how inefficient and ineffective replication is in correcting bad science.

 


Bad publication practices keep good scientists unnecessarily busy, as in replicability projects. - Bjoern Brembs

Psychologists need to reconsider the pitfalls of an exclusive reliance on this strategy to improve lay persons’ trust in their field.

Despite the consistency of null findings across seven attempted replications of the original power pose study, editorial commentaries in Comprehensive Results in Social Psychology left some claims intact and called for further research.

Editorial commentaries on the seven null studies set the stage for continued marketing of self-help products, mainly to women, grounded in junk psychological pseudoscience.

Watch for repackaging and rebranding in next year’s new and improved model. Marketing campaigns will undoubtedly include direct quotes from the commentaries as endorsements.

We need to re-examine basic assumptions behind replication initiatives. Currently, these efforts suffer from prioritizing the reputations and egos of those misusing psychological science to market junk and quack claims over protecting the consumers whom these gurus target.

In the absence of a critical response from within the profession to these persons prominently identifying themselves as psychologists, it is inevitable that the void will be filled by those outside the field who have no investment in preserving the image of psychology research.

In the case of power posing, watchdog critics might be recruited from:

Consumer advocates concerned about just another effort to defraud consumers.

Science-based skeptics who see in the marketing of power posing familiar quackery, in the same category as hawkers using pseudoscience to promote homeopathy, acupuncture, and detox supplements.

Feminists who decry the message that women need to get some balls (testosterone) if they want to compete with men and overcome gender disparities in pay. Feminists should be further outraged by the marketing of junk science to vulnerable women with an ugly message of self-blame: It is so easy to meet and overcome social inequalities that they have only themselves to blame if they do not do so by power posing.

As reported in Comprehensive Results in Social Psychology,  a coordinated effort to examine the replicability of results reported in Psychological Science concerning power posing left the phenomenon a candidate for future research.

I will be blogging more about that later, but for now let’s look at a commentary from three of the more than 20 authors that reveals an inherent limitation of such ambitious initiatives in tackling the untrustworthiness of psychology.

Cesario J, Jonas KJ, Carney DR. CRSP special issue on power poses: what was the point and what did we learn?.  Comprehensive Results in Social Psychology. 2017

 

Let’s start with the wrap up:

The very costly expense (in terms of time, money, and effort) required to chip away at published effects, needed to attain a “critical mass” of evidence given current publishing and statistical standards, is a highly inefficient use of resources in psychological science. Of course, science is to advance incrementally, but it should do so efficiently if possible. One cannot help but wonder whether the field would look different today had peer-reviewed preregistration been widely implemented a decade ago.

We should consider the first sentence with some recognition of just how much untrustworthy psychological science is out there. Must we mobilize similar resources in every instance, or can we develop some criteria to decide what is worthy of replication? As I have argued previously, there are excellent reasons for deciding that the original power pose study could not contribute a credible effect size to the literature. There is no there there to replicate.

The authors assume preregistration of the power pose study would have solved its problems. In clinical and health psychology, long-standing recommendations to preregister trials are acquiring new urgency. But the record is that motivated researchers routinely ignore requirements to preregister, and deviate from the primary outcomes and analytic plans to which they have committed themselves. Editors and journals let them get away with it.

What measures do the replicationados have to ensure the same things are not being said about bad psychological science a decade from now? Rather than urging uniform adoption and enforcement of preregistration, the replicationados urge the gentle nudge of badges for studies that are preregistered.

Just prior to the last passage:

Moreover, it is obvious that the researchers contributing to this special issue framed their research as a productive and generative enterprise, not one designed to destroy or undermine past research. We are compelled to make this point given the tendency for researchers to react to failed replications by maligning the intentions or integrity of those researchers who fail to support past research, as though the desires of the researchers are fully responsible for the outcome of the research.

There are multiple reasons not to give the authors of the power pose paper such a break. There is abundant evidence of undeclared conflicts of interest in the huge financial rewards for publishing false and outrageous claims. Psychological Science allowed the abstract of the original paper to leave out any embarrassing details of the study design and results, and to end with a marketing slogan:

That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications.

 Then the Association for Psychological Science gave a boost to the marketing of this junk science with a Rising Star Award to two of the authors of this paper for having “already made great advancements in science.”

As seen in this special issue of Comprehensive Results in Social Psychology, the replicationados share responsibility with Psychological Science and APS for keeping this system of perverse incentives intact. At least they are guaranteeing plenty of junk science in the pipeline to replicate.

But in the next installment on power posing I will raise the question of whether early career researchers are hurting their prospects for advancement by getting involved in such efforts.

How many replicationados does it take to change a lightbulb? Who knows, but a multisite initiative can be combined with a Bayesian meta-analysis to give a tentative and unsatisfying answer.

Coyne JC. Replication initiatives will not salvage the trustworthiness of psychology. BMC Psychology. 2016 May 31;4(1):28.

The following can be interpreted as a declaration of financial interests or a sales pitch:

I will soon be offering e-books providing skeptical looks at positive psychology and mindfulness, as well as scientific writing courses on the web, as I have been doing face-to-face for almost a decade.

Sign up at my website to get advance notice of the forthcoming e-books and web courses, as well as upcoming blog posts at this and other blog sites. Lots to see at CoyneoftheRealm.com.

 

‘Replace male doctors with female ones and save at least 32,000 lives each year’?

The authors of a recent article in JAMA Internal Medicine

“Physician Gender and Outcomes of Hospitalized Medicare Beneficiaries in the U.S.,” Yusuke Tsugawa, Anupam B. Jena, Jose F. Figueroa, E. John Orav, Daniel M. Blumenthal, Ashish K. Jha, JAMA Internal Medicine, online December 19, 2016, doi: 10.1001/jamainternmed.2016.7875

stirred lots of attention in the media with direct quotes like these:

“If we had a treatment that lowered mortality by 0.4 percentage points or half a percentage point, that is a treatment we would use widely. We would think of that as a clinically important treatment we want to use for our patients,” said Ashish Jha, professor of health policy at the Harvard School of Public Health. The estimate that 32,000 patients’ lives could be saved in the Medicare population alone is on par with the number of deaths from vehicle crashes each year.

Washington Post: Women really are better doctors, study suggests.

LA  Times: How to save at least 32,000 lives each year: Replace male doctors with female ones.

NPR: Patients cared for by female doctors fare better than those treated by men.

My immediate reactions after looking at the abstract were only confirmed when I delved deeper.

Basically, we have a large, but limited and very noisy data set. It is unlikely that these data allow us to be confident about the strength of any signal concerning the relationship between physician gender and patient outcome that is so important to the authors. The small apparent differences could be just more noise on which the authors have zeroed in so that they can make a statement about the injustice of gender differences in physician pay.

 I am unwilling to relax methodological and statistical standards to manufacture support for such a change. There could be unwanted consequences of accepting that arguments can be made with such weak evidence, even for a good cause.

What if the authors had found the same small differences in noisy data in the reverse direction? Would they argue that we should preserve gender differences in physician pay? What if the authors had focused on a different variable in all this noise and concluded that the lower pay women receive was associated with reduced mortality? Would we then advocate reducing the pay of both male and female physicians in order to improve patient outcomes?

Despite all the excitement that the claim of an effect of physician gender on patient mortality is generating, it is most likely that we are dealing with noise arising from overinterpretation of complex analyses that assume more completeness and precision than can be found in the data being analyzed.

These claims are not just a matter of causal relationships being spun from correlation. Rather, they are causal claims being made on the basis of partial correlations emerging in complex multivariate relationships found in an administrative data set.

  • Administrative data sets, particularly Medicare data sets like this one, are not constructed with such research questions in mind. There are severe constraints on what variables can be isolated and which potential confounds can be identified and tested.
  • Administrative data sets consist of records, not actual behaviors. It’s reasonable to infer a patient death from a record of a death. Associating a physician’s gender with a particular record is more problematic, as we will see. Even if we accept the association found in these records, it does not necessarily mean that physicians engaged in any particular behaviors or that physician behavior is associated with the pattern of deaths emerging in these multivariate analyses.
  • The authors start out with a statement about differences in how female and male physicians practice. In the actual article and the media, they have referred to variables like communication skills, providing evidence-based treatments, and encouraging health-related behaviors. None of these variables is remotely accessible in a Medicare data set.
  • Analyses of such administrative data sets do not allow isolation of the effects of physician gender from the effects of the contexts in which their practice occurs and relevant associated variables. We are not talking about a male or female physician encountering a particular patient being associated with a death or not, but an administrative record of physician gender arising in a particular context being interpreted as associated with a death. Male and female physicians may differ in being found in particular contexts in nonrandom fashion. It’s likely that these differences will dwarf any differences in outcomes. There will be a real challenge in even confidently attributing those outcomes to whether patients had an attending male or female physician.

The validity of complex multivariate analyses is strongly threatened by specification bias and residual confounding. The analyses must assume that all of the relevant confounds have been identified and measured without error. Departures from these ideal conditions can lead to spurious results, and generally do. Examination of the limitations of the variables available in a Medicare data set, and of how they were coded, can quickly undermine any claim to validity.

Acceptance of claims about effects of particular variables like female physician gender arising in complex multivariate analyses involves assumptions of “all other things being equal.” If we attempt to move from statistical manipulation to inference about a real-world encounter, we are no longer talking about a particular female physician, but about a construction that may be very different from particular physicians interacting with particular patients in particular contexts.

The potential for counterfactual statements can be seen if we move from the study to an example of science nerds and basketball players, and hypothesize that if John and Jason were of equivalent height, John would not study so hard.

Particularly in complex social situations, it is usually a fantasy that we can change one variable, and only one variable, leaving the others untouched. Just how did John and Jason come to be of equal height? And how are they now otherwise different?

Associations discovered in administrative data sets most often do not translate into effects observed in randomized trials. I’m not sure how we could get a representative sample of patients to disregard their preferences and accept random assignment to a male or female physician. It would have to be a very large study to detect the effect sizes reported in this observational study, and I’m skeptical this sufficiently strong signal would emerge from all of the noise.
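To put a number on just how large such a randomized trial would need to be, we can apply the standard two-proportion sample size formula to the reported mortality rates. The alpha and power values below are conventional design assumptions of mine, not figures from the paper:

```python
import math

# Reported adjusted 30-day mortality: 11.07% (female physicians) vs 11.49% (male)
p1, p2 = 0.1107, 0.1149

# Conventional design assumptions (ours, not the study's):
z_alpha = 1.96    # two-sided alpha = 0.05
z_beta = 0.8416   # power = 0.80

# Standard per-group sample size for comparing two independent proportions
n_per_group = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2

print(f"≈ {math.ceil(n_per_group):,} patients per arm")
```

Roughly 90,000 patients per arm, which underscores why a confirmatory randomized trial of this tiny effect is implausible.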

We might relax our standards and accept a quasi-experimental design that would be smaller but encompass a wider range of relevant variables. For instance, it is conceivable that we could construct a large sample in which physicians varied in terms of whether they had formal communication skills training. We might examine whether communications training influenced subsequent patient mortality, independent of physician gender, and vice versa. This would be a reasonable translation of the authors’ hypothesis that communication skills differences between male and female physicians account for what the authors believe is the observed association between physician gender and mortality. I know of no such study having been done. I know of no study demonstrating that physician communication training affects patient mortality. I’m skeptical that the typical communication training is so powerful in its effects. If such a study required substantial resources, rather than relying on data already at hand, the strength of the results of the present study would not encourage me to marshal them.

What I saw when I looked at the article

We are dealing with very small adjusted differences in percentages arising in a large sample.

Patients treated by female physicians had lower 30-day mortality (adjusted mortality, 11.07% vs 11.49%; adjusted risk difference, –0.43%; 95% CI, –0.57% to –0.28%; P < .001; number needed to treat to prevent 1 death, 233).
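The quoted numbers hang together arithmetically, as a quick check shows. The implied number of hospitalizations in the last step is a back-calculation of mine from the quoted figures, not a number stated in the paper:

```python
# Reported adjusted risk difference: -0.43 percentage points
risk_difference = 0.0043

# Number needed to treat = 1 / absolute risk difference
nnt = 1 / risk_difference
print(f"NNT ≈ {round(nnt)}")  # matches the reported 233

# The headline figure of 32,000 deaths implies that the risk difference was
# applied to roughly this many hospitalizations (a back-calculation on our
# part, not a figure stated in the paper):
implied_hospitalizations = 32_000 / risk_difference
print(f"Implied hospitalizations ≈ {implied_hospitalizations:,.0f}")
```

The arithmetic is internally consistent; the question is whether a 0.43 percentage point adjusted difference is signal rather than noise.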

Assignment of a particular patient to a particular physician is done with a lot of noise.

We assigned each hospitalization to a physician based on the National Provider Identifier in the Carrier File that accounted for the largest amount of Medicare Part B spending during that hospitalization.25 Part B spending comprises professional and other fees determined by the physician. On average, these physicians were responsible for 51.1% of total Part B spending for a given hospitalization.

One commentator quoted in a news article noted:

William Weeks, a professor of psychiatry at Dartmouth’s Geisel School of Medicine, said that the researchers had done a good job of trying to control for other factors that might influence the outcome. He noted that one caveat is that hospital care is usually done by a team. That fact was underscored by the method the researchers used to identify the doctor who led the care for patients in the study. To identify the gender of the physician, they looked for the doctor responsible for the biggest chunk of billing for hospital services — which was, on average, about half. That means that almost half of the care was provided by others.

Actually, much of the care is not provided by the attending physician, but other staff, including nurses and residents.

The authors undertook the study to call attention to gender disparities in physician pay. But could those disparities show up in males being able to claim more billable procedures, that is, greater administrative credit for what is done with patients during hospitalization, including by other physicians? This might explain at least some of the gender differences, but it would undermine the validity of this key variable in relating physician gender to differences in patient outcome.

The statistical control of differences in patient and physician characteristics afforded by variables in this data set is inadequate.

Presumably, a full range of patient variables is related to whether patients die within 30 days of a hospitalization. Recall the key assumption that all of the relevant confounds have been identified and assessed without error in considering the variables used to characterize patient characteristics:

Patient characteristics included patient age in 5-year increments (the oldest group was categorized as ≥95 years), sex, race/ethnicity (non-Hispanic white, non-Hispanic black, Hispanic, and other), primary diagnosis (Medicare Severity Diagnosis Related Group), 27 coexisting conditions (determined using the Elixhauser comorbidity index28), median annual household income estimated from residential zip codes (in deciles), an indicator variable for Medicaid coverage, and indicator variables for year.

Note that the comorbidity index is based on collapsing 27 other variables into one number. This simplifies the statistics, yes, but at a tremendous loss of information.

Recall the assumption that this set of variables represents not just what is available in an administrative data set, but all the patient characteristics relevant to dying within 30 days after discharge from the hospital. Are we really willing to accept this assumption?

For the physician variables displayed at the top of Table 1, there are huge differences between male and female physicians, relative to the modest difference in patient mortality, adjusted mortality, 11.07% vs 11.49%.


These authors encourage us to think about the results as simulating a randomized trial, except that statistical controls are serving the function that randomization of patients to physician gender would serve. We are being asked to accept that these differences in baseline characteristics of the practices of female versus male physicians can be eliminated through statistics. We would never accept that argument in a randomized trial.

Addressing criticisms of the authors’ interpretation of their results

The senior author provided a pair of blog posts in which he acknowledges criticism of his study but attempts to defuse key objections. It is unfortunate that the sources of these objections are not identified, so we are dependent on the author’s summary, out of context. I think the key responses are to straw-man objections.

Correlation, Causation, and Gender Differences in Patient Outcomes

Do women make better doctors than men?

Correlation is not causation.

We often make causal inferences based on observational data – and here’s the kicker: sometimes, we should.  Think smoking and lung cancer.  Remember the RCT that assigned people to smoking (versus not) to see if it really caused lung cancer?  Me neither…because it never happened.  So, if you are a strict “correlation is not causation” person who thinks observational data only create hypotheses that need to be tested using RCTs, you should only feel comfortable stating that smoking is associated with lung cancer but it’s only a hypothesis for which we await an RCT.  That’s silly.  Smoking causes lung cancer.

No, it is this argument that is silly. We can now look back on the data concerning smoking and lung cancer and benefit from the hindsight provided by years of sorting out smoking as a risk factor from potential confounds. I recall that at some point, drinking coffee was related to lung cancer in the United States, whereas drinking tea was in the UK. Of course, if we did not know that smoking was the culprit, we might miss that in the US smoking was done while drinking coffee, whereas in the UK it was done while drinking tea.

And isolating smoking as a risk factor, rather than just a marker for risk, is so much simpler than isolating whatever risk factors for death are hidden behind physician gender as a marker for risk of mortality.

Coming up with alternative explanations for the apparent link between physician gender and patient mortality.

The final issue – alternative explanations – has been brought up by nearly every critic. There must be an alternative explanation! There must be confounding!  But the critics have mostly failed to come up with what a plausible confounder could be.  Remember, a variable, in order to be a confounder, must be correlated both with the predictor (gender) and outcome (mortality).

This is similarly a fallacious argument. I am not arguing for alternative substantive explanations; I am proposing that spurious results were produced by pervasive specification bias, including measurement error. There is no potential confounder I have to identify. I am simply arguing that the small differences in mortality are dwarfed by specification and measurement error.
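The point about residual confounding can be made concrete with a toy simulation, entirely my construction and not the study's data: give the exposure (standing in for "female physician") no true effect on mortality, let patient severity drive both mortality and which physicians see which patients, and adjust only for a noisy severity proxy, analogous to a collapsed comorbidity index. A spurious adjusted "effect" of the exposure survives:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# Toy setup (entirely our construction, not the study's data):
# patient severity drives mortality risk; the exposure has NO true effect
# but is correlated with severity through practice context.
severity = rng.normal(0, 1, n)
exposure = (severity + rng.normal(0, 2, n) < 0).astype(float)
mortality_risk = 0.11 + 0.02 * severity + rng.normal(0, 0.05, n)

# The analyst can only adjust for a noisy severity proxy, analogous to a
# comorbidity index that collapses 27 conditions into one number.
proxy = severity + rng.normal(0, 1, n)

# OLS of mortality risk on exposure + proxy: residual confounding leaves a
# spurious adjusted "effect" of exposure even though its true effect is zero.
X = np.column_stack([np.ones(n), exposure, proxy])
beta, *_ = np.linalg.lstsq(X, mortality_risk, rcond=None)
print(f"Adjusted 'effect' of exposure: {beta[1]:+.4f} (true effect: 0)")
```

Under these assumptions the adjusted coefficient lands in the same ballpark as the 0.43 percentage point difference the authors report, without any true effect of the exposure at all. No named confounder is needed; imperfect measurement of severity is enough.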

This tiny difference is actually huge in its implications.

Several critics have brought up the point that statistical significance and clinical significance are not the same thing.  This too is epidemiology 101.  Something can be statistically significant but clinically irrelevant.  Is a 0.43 percentage point difference in mortality rate clinically important? This is not a scientific or a statistical question.  This is a clinical question. A policy and public health question.  And people can reasonably disagree.  From a public health point of view, a 0.43 percentage point difference in mortality for Medicare beneficiaries admitted for medical conditions translates into potentially 32,000 additional deaths. You might decide that this is not clinically important. I think it is. It’s a judgment call and we can disagree.

The author takes a small difference and magnifies its importance by applying it to a larger population. He is attributing the “additional deaths” to patients being treated by men. I feel he has not made a case that physician gender is the culprit, so nothing is accomplished except introducing shock and awe by amplifying the small effect into its implications for the larger population.

In response to a journalist, the author makes a parallel argument:

The estimate that 32,000 patients’ lives could be saved in the Medicare population alone is on par with the number of deaths from vehicle crashes each year.

In addition to what I have already argued, if we know the same number of deaths are attributable to automobile crashes, we at least know how to take steps to reduce these crashes and the mortality associated with them. We don’t know how to change the mortality the authors claim is associated with physician gender. We don’t even know that the author’s claims are valid.

Searching for meaning where no meaning is to be found

In framing the study and interpreting the results to the media, the authors undertook a search of the literature with a heavy confirmation bias, ignoring the many contradictions that would be uncovered by a systematic search. For instance, one commentator on the senior author’s blog notes:

It took me about 5 minutes of Google searching to find a Canadian report suggesting that female physicians in that country have workloads around 75% to 80% of male physicians:

https://secure.cihi.ca/free_products/PracticingPhysicianCommunityCanada.pdf

If US data is even vaguely similar, that factor would be a serious omission from your article.

But the authors were looking for what supported the results, not for studies that potentially challenged or contradicted their results. They are looking to strengthen a narrative, not expose it to refutation.

Is there a call to action here?

As consumers of health services, we could all switch to being cared for by female physicians. I suspect that some of the systems and structural issues associated with the appearance that care by male physicians is inferior would then be spread among female physicians, including increased workloads. The bias in the ability of male physicians to claim credit for the work of others would be redistributed to women. Neither would improve patient mortality.

We should push for reduction in inequalities in pay related to gender. But we don’t need results of this study to encourage us.

I certainly know health care professionals and researchers who have more confidence in communication learning modules producing clinically significant changes in physician behavior. I don’t know any of them who could produce evidence that these changes include measurable reductions in patient mortality. If someone produces such data, I’m capable of being persuaded. But the present study adds nothing to my confidence in that likelihood.

If we are uncomfortable with the communication skills or attention to evidence that our personal physicians display, we should replace them. But I don’t think this study provides additional evidence for doing so, beyond the legitimacy of acting on our preferences.

In the end, this article reminds us to stick to our standards and not be tempted to relax them to make socially acceptable points.

An open-minded, skeptical look at the success of “zero suicides”: Any evidence beyond the rhetoric?

  • Claims are spreading across social media that a goal of zero suicides can be achieved by radically re-organizing resources in health systems and communities. Extraordinary claims require extraordinary evidence.
  • I thoroughly searched for evidence backing claims of “zero suicides” being achieved.
  • The claims came up short, after expectations were initially raised by some statistics and a provocative graph. But any persuasiveness to these details quickly dissipated when they were scrutinized. Lesson: Abstract numbers and graphs are not necessarily quality evidence and dazzling ones can obscure a lack of evidence.
  • The goal of “zero suicides” has attracted support of Pharma and generated programs around the world, with little fidelity to the original concept developed in the Henry Ford Health System in Detroit. In many contexts in which it is now being invoked, “zero suicides” is a vacuous buzz term, not a coherent organizational strategy.
  • Preventing suicide is a noble goal to which a lot of emotion gets attached. It also creates lucrative financial opportunities and attracts vested interests which often simply repackage existing programs for resale.
  • How can anyone oppose the idea that we should eliminate suicide? Clever sloganeering can stifle criticism and suppress embarrassing evidence to the contrary.
  • Yet, we should not be bullied, nor distracted by slogans, from our usual skeptical insistence that those who make strong claims bear the burden of providing strong evidence.
  • Deaths by suicide are statistically infrequent, poorly predicted events that occur in troubled contexts of interpersonal and institutional breakdown. These aspects can frustrate efforts to eliminate suicide entirely – or even accurately track these deaths.
  • Eliminating deaths by suicide is only very loosely analogous to wiping out polio and lots of pitfalls await those who get confused by a false equivalence.
  • Pursuit of the goal of “zero suicides,” particularly in under-resourced and not well-organized community settings can have unintended, negative consequences.
  • “Zero suicides” is likely a fad, to be replaced by next year’s fashion or maybe a few years after.
  • We need to step back and learn from the rise and fall of slogans and the unintended impact on distribution of scarce resources and the costs to human well-being.
  • My take away message is that increasingly sophisticated and even coercive communications about clinical and public health policies often harness the branding of prestigious medical journals. Interpreting these claims requires a matching skepticism, critical thinking skills, and renewed demands for evidence.

Beginning the search for evidence for the slogan “Zero Suicide.”

Numerous gushy tweets about achieving “zero suicides” drew me into a search for more information. I easily traced the origins of the campaign to a program at the Henry Ford Health System, a Detroit-based HMO, but the concept has now gone thoroughly international. My first Google Scholar search did not yield quality evidence from any program evaluations, but a subsequent Google search produced exceptionally laudatory and often self-congratulatory statements.

I briefly diverted my efforts to contacting authorities whom I expected might comment about “zero suicides.” Some indicated a lack of familiarity prevented them from commenting, but others were as evasive as establishment Republicans asked about Donald Trump. One expert, however, was forthcoming with an interesting article, which proved to have just the right tone. I recommend:

Kutcher S, Wei Y, Behzadi P. School- and Community-Based Youth Suicide Prevention Interventions: Hot Idea, Hot Air, or Sham? The Canadian Journal of Psychiatry. 2016 Jul 12:0706743716659245.

Continuing my search, I found numerous links to other articles, including a laudatory Medical News and Perspectives opinion piece in JAMA behind a readily circumvented paywall. There was also a more accessible source with branding by the New England Journal of Medicine.

Clicking on these links, I found editorial and even blatantly promotional material, not randomized trials or other quality evidence.

This kind of non-evidence-based publicity in highly visible medical journals is extraordinary in itself, although not unprecedented. Increasingly, the brand of particular medical journals is sold and harnessed to bestow special credibility on political and financial interests, as seen in 1 and 2.

NEJM Catalyst: How We Dramatically Reduced Suicide.

 NEJM Catalyst is described as bringing

Health care executives, clinician leaders, and clinicians together to share innovative ideas and practical applications for enhancing the value of health care delivery.

[Figure: “zero suicide takeaways” graphic, from NEJM Catalyst]

The claim of “zero suicides” originated in the Perfect Depression Care initiative within a division of the Henry Ford Health System.

The audacious goal of zero suicides was part of the Behavioral Health Services division’s larger goal to develop a system of perfect care for depression. Our roadmap for transformation was the Quality Chasm report, which defined six dimensions of perfect care: safety, timeliness, effectiveness, efficiency, equity, and patient-centeredness. We set perfection goals and metrics for each dimension, with zero suicides being the perfection goal for effectiveness. Very quickly, however, our team seized on zero suicides as the overarching goal for our entire transformation.

The strategies:

We used three key strategies to achieve this goal. The first two — improving access to care and restricting access to lethal means of suicide — are evidence-based interventions to reduce suicide risk. While we had pursued these strategies in the past, setting the target at zero suicides injected our team with gumption. To improve access to care, we developed, implemented, and tested new models of care, such as drop-in group visits, same-day evaluations by a psychiatrist, and department-wide certification in cognitive behavior therapy. This work, once messy and arduous for the PDC team, became creative, fun, and focused. To reduce access to lethal means of suicide, we partnered with patients and families to develop new protocols for weapons removal. We also redesigned the structure and content of patient encounters to reflect the assumption that every patient with a mental illness, even if that illness is in remission, is at increased risk of suicide. Therefore, we eliminated suicide screens and risk stratification tools that yielded non-actionable results, freeing up valuable time. Eventually, each of these approaches was incorporated into the electronic health record as decision support.

The third strategy:

…The pursuit of perfection was not possible without a just culture for our internal team. Ultimately, we found this the most important strategy in achieving zero suicides. Since our goal was to achieve radical transformation, not just to tweak the margins, PDC staff couldn’t justly be punished if they came up short on these lofty goals. We adopted a root cause analysis process that treated suicide events equally as tragedies and learning opportunities.

Process of patient care described in JAMA

What happens to a patient being treated in the context of Perfect Depression Care is described in the JAMA  piece:

Each patient seen through the BHS is first assessed and stratified on the basis of suicide risk: acute, moderate, or low. “Everyone is at risk. It’s just a matter of whether it’s acute or whether it requires attention but isn’t emergent,” said Coffey. A patient considered to be at high risk undergoes a psychiatric evaluation the same day. A patient at low risk is evaluated within 7 days. Group sessions for patients also allow individuals to connect and offer support to one another, not unlike the supportive relationships between sponsors and “sponsees” in 12-step programs

The claim of Zero Suicides, in numbers and a graph

…A dramatic and statistically significant 80% reduction in suicide that has been maintained for over a decade, including one year (2009) when we actually achieved the perfection goal of zero suicides (see the figure below). During the PDC initiative, the annual HMO network membership ranged from 182,183 to 293,228, of which approximately 60% received care through Behavioral Health Services. From 1999 to 2010, there were 160 suicides among HMO members. In 1999, as we launched PDC, the mean annual suicide rate for these mental health patients was 110.3 per 100,000. During the 11 years of the initiative, the mean annual suicide rate dropped to 36.21 per 100,000. This decrease is statistically significant and, moreover, took place while the suicide rate actually increased among non–mental health patients and among the general population of the state of Michigan.

[Graph: Improved suicide rates among Henry Ford Medical Group HMO members]

[This graph conflicts somewhat with a graph in NEJM Catalyst, which indicates that suicides in the health care system were zero for 2008 and remained so through the first quarter of 2010.]

It is clear that rates of suicide fluctuate greatly from year to year in the health system. It also appears from the graph that for most years during the program, rates of suicide among patients in the Henry Ford Health System were substantially greater than those of the general population in Michigan, which were relatively flat. Any comparisons between the program and the general statistics for the state of Michigan are not particularly informative. Michigan is a state of enormous health care disparities, with many uninsured residents during this period. Demographics differ greatly, and patients receiving care within an HMO were a substantially more privileged group than the general population of Michigan. There was also a lot of annual movement in and out of the Henry Ford Health System. At any one time, only 60% of the patients within the health system were enrolled in the behavioral health system in which the depression program occurred.

A substantial proportion of suicides occur among individuals who are not previously known to health systems. Such persons are more represented in the statistics for the state of Michigan. Another substantial proportion of suicides occur in individuals with weakened or recently broken contact with health systems. We don’t know how the statistics reported for the health system accommodated biased departures from the health system or simply missing data. We don’t know whether behavior related to risk of suicide affected migration into the health care system or into the smaller group receiving behavioral healthcare through the health system. For instance, what became of patients with a psychiatric disorder and a comorbid substance use disorder? Those who were incarcerated?

Basically, the success of the program is not obvious within the noisy fluctuation of suicides within the Henry Ford Health System or the smaller behavioral health program. We cannot control for basic confounding factors, for selective enrollment and disenrollment in the health care system, or even for the expulsion of persons at risk from the behavioral health system.
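The noisy-fluctuation point can be made with a back-of-the-envelope simulation. This is a minimal sketch, not the Henry Ford data: the expected annual count below is a hypothetical figure chosen only to match the order of magnitude implied by 160 suicides among HMO members over roughly 12 years.

```python
import math
import random

random.seed(0)

expected_per_year = 13   # assumed stable rate: ~160 deaths / 12 years
years = 12

def poisson(lam):
    """Draw one Poisson-distributed count (Knuth's algorithm)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

counts = [poisson(expected_per_year) for _ in range(years)]
print("simulated annual counts:", counts)

# For a Poisson count, sd/mean = 1/sqrt(mean): with ~13 expected deaths
# per year, swings of ~28% or more from year to year are routine even
# with NO change in the true underlying risk.
print(f"expected relative fluctuation: {1 / math.sqrt(expected_per_year):.0%}")
```

With counts this small, a run of good years is consistent with pure chance, which is exactly why comparisons against the program's own fluctuating baseline prove so little.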

 “Zero suicides” as a literal and serious goal?

The NEJM Catalyst article gave the originator of the program free rein for self-praise.

The most unexpected hurdles were skepticism that perfection goals like zero suicides were reasonable or feasible (some objected that it was “setting us up for failure”), and disbelief in the dramatic improvements obtained (we heard comments like “results from quality improvement projects aren’t scientifically rigorous”). We addressed these concerns by ensuring the transparency of our results and lessons, by collaborating with others to continually improve our methodological issues, and by supporting teams across the world who wish to pursue similar initiatives.

Our team challenged this assumption and asked, If zero is not the right goal for suicide occurrence, then what number is? Two? Twelve? Which twelve? In spite of its radicalism — indeed because of it — the goal of zero suicides became the galvanizing force behind an effort that achieved one of the most dramatic and sustained reductions in suicide in the clinical literature.

Will the Henry Ford program prove sustainable?

Edward Coffey moved to become President, CEO, and Chief of Staff of the Menninger Clinic 18 months before his article in NEJM Catalyst appeared. I am curious as to which aspects of his Zero Suicides/Perfect Depression Care program are still maintained at Henry Ford. As described, the program was designed with admirably short waiting times for referral to behavioral healthcare. If the program persists as originally described, many professionals are kept vigilant and engaged in activities to reduce suicide without any statistical likelihood of having the opportunity to actually prevent one.

In decades of work within health systems, I have found that once demonstration projects have run their initial course, their goals are replaced by new organizational ones and resources are redistributed. Sooner or later, competing demands for scarce resources are promoted by new slogans.

What if Perfect Depression Care has to compete for scarce resources with Perfect Diabetes Care or alleviation of gross ethnic disparities in cardiovascular outcomes?

A lot of well-meant slogans ultimately have unintended, negative consequences. “Make pain the 5th vital sign” led to more attention being paid to previously ignored and poorly managed pain. This was followed by mandated routine assessment and intervention, which led to unnecessary procedures and an unprecedented epidemic of addiction and death from prescribed opioids. “Stamp out distress” has led to mandated screening and intervention programs for psychological distress in cancer care, with high rates of antidepressant prescription without proper diagnosis or follow-up.

If taken literally and seriously, a lofty but abstract goal like Zero Suicide becomes a threat to any “just culture” in a healthcare organization. If the slogan is taken seriously as resources are inevitably withdrawn, a culture of blame will emerge, along with pressure to distort easily manipulated statistics. Patients posing threats to the goal of zero suicides will be excluded from the system, with unknown but likely negative consequences for their morbidity and mortality.

Bottom line – we can’t have slogan-driven healthcare policies that will likely have negative consequences and conflict with evidence.

 Enter Big Pharma

Not unexpectedly, Big Pharma is getting involved in promoting Zero Suicides:

Eli Lilly and Company Foundation donates $250,000 to expand Community Health Network’s Zero Suicides prevention initiative,

Major gift will save Hoosier lives through a suicide prevention network that responds to a critical Indiana healthcare issue.

 According to press coverage, the funds will go to:

The Lilly Foundation donation also provides resources needed to build a Central Indiana crisis network that will include Indiana’s schools, foster care system, juvenile justice program, primary and specialty healthcare providers, policy makers and suicide survivors. These partners will be trained to identify people at risk of attempting suicide, provide timely intervention and quickly connect them with Community’s crisis providers. Indiana’s state government is a key partner in building the statewide crisis network.

I’m sure this effort is good for the profits of Pharma. Dissemination of screening programs into settings that are not directly connected to quality depression care is inevitably ineffective. The main healthcare consequences are an increase in antidepressant prescriptions without appropriate diagnoses, patient education, and follow-up. Substantial overtreatment results from people being identified without proper diagnosis who otherwise would not be seeking treatment. Care for depression in the community is hardly Perfect Depression Care.

It is great publicity for Eli Lilly and the community receiving the gift will surely be grateful.

Launching Zero Suicides in English communities and elsewhere

My academic colleagues in the UK assure me that we can simply dismiss an official UK government press release about the goal of zero suicides from Nick Clegg. It has been rendered obsolete by subsequent political events. A number commented that they never took it seriously, regardless.

Nick Clegg calls for new ambition for zero suicides across the NHS

The claims in the press release stand in stark contrast to long waiting times for mental health services and important gaps in responses to serious mental health crises, including lethal suicide attempts. However, another web link is to an announcement:

Centre for Mental Health was commissioned by the East of England Strategic Clinical Networks to evaluate activity taking place in four local areas in the region through a pilot programme to extend suicide prevention into communities.

The ‘zero suicide’ initiative is based on an approach developed by Dr Ed Coffey in Detroit, Michigan. The approach aims to prevent suicides by creating a more open environment for people to talk about suicidal thoughts and enabling others to help them. It particularly aims to reach people who have not been reached through previous initiatives and to address gaps in existing provision.

Four local areas in the East of England (Bedfordshire, Cambridgeshire & Peterborough, Essex and Hertfordshire) were selected in 2013 as pathfinder sites to develop new approaches to suicide prevention. Centre for Mental Health evaluated the work of the sites during 2015.

The evaluation found an impressive range of activities that had taken suicide prevention activities out into local communities. They included:

• Training key public service staff such as GPs, police officers, teachers and housing officers
• Training others who may encounter someone at risk of taking their own life, such as pub landlords, coroners, private security staff, faith groups and gym workers
• Creating ‘community champions’ to put local people in control of activities
• Putting in place practical suicide prevention measures in ‘hot spots’ such as bridges and railways
• Working with local newspapers, radio and social media to raise awareness in the wider community
• Supporting safety planning for people at risk of suicide, involving families and carers throughout the process
• Linking with local crisis services to ensure people get speedy access to evidence-based treatments.

The report noted that some of the people who received the training had already saved lives:

“I saved a man’s life using the skills you taught us on the course. I cannot find words to properly express the gratitude I have for that. Without the training I would have been in bits. It was a very public place, packed with people – but, to onlookers, we just looked like two blokes sitting on a bench talking.”

“Déjà vu all over again”, as Yogi Berra would say. This effort also recalls Bill Murray in the movie Groundhog Day, where he is trapped into repeating the same day over and over again.

A few years ago I was a scientific advisor for a European Union–funded project to disseminate multilevel suicide prevention programs across Europe. One UK site was among those targeted in this report. Implementation of the EU program had already failed before the plate of snacks was removed from a poorly attended event. The effort foundered quickly because it failed to attract the support of local GPs.

Years later, I recognize many of the elements of what we tried to implement, described in language almost identical to ours. There is no mention of the training materials we left behind or of the quick failure of our attempt at implementation.

Many of the proposed measures in the UK plan serve to generate publicity, and there is no evidence that they reduce suicides. For instance, training people in the community who might conceivably come in contact with a suicidal person accomplishes little other than producing good publicity. Uptake of such training is abysmally low and is not likely to affect the probability that a person in a suicidal crisis will encounter anyone who can make a difference.

Broad efforts to increase uptake of mental health services in the UK strain a system already suffering from unacceptably long waiting times for services. People with any likelihood of attempting suicide, however poorly predicted, are likely to be lost among persons seeking services with less serious or pressing needs.

Thoughts I have accumulated from years of evaluating depression screening programs and suicide intervention efforts

Staying mobilized around preventing suicide is difficult because it is an infrequent event and most activations of resources will prove to be false positives.
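The false-positive problem is plain base-rate arithmetic. This sketch uses the annual rate of roughly 36 per 100,000 cited above for the clinic population; the 80% sensitivity and 90% specificity figures are purely hypothetical assumptions, chosen to be generous to any screening approach.

```python
# Illustrative base-rate arithmetic (assumed accuracy figures, not from
# any study): even a good screening flag mostly misfires on rare events.
prevalence = 36 / 100_000   # annual rate cited for the clinic population
sensitivity = 0.80          # assumed: 80% of eventual suicides flagged
specificity = 0.90          # assumed: only 10% of non-cases flagged

true_pos = sensitivity * prevalence
false_pos = (1 - specificity) * (1 - prevalence)
ppv = true_pos / (true_pos + false_pos)

# Roughly 3 in 1,000 flags correspond to an actual death by suicide,
# so over 99.7% of activations are false positives.
print(f"positive predictive value: {ppv:.2%}")
```

Under these assumptions the positive predictive value comes out under one third of one percent, which is why staff are asked to stay vigilant for an alarm that almost never signals a real event.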

It can be tedious and annoying for both staff and patients to keep focused on an infrequent event, particularly for the vast majority of patients who rightfully believe they are not at risk for suicide.

Resources can be drained away from less frequent but higher-risk situations that require sustained intensity of response, pragmatic innovation, and flexibility of rules.

Heightened efforts to detect mental health problems increase access for people already successfully accessing services and decrease resources for those needing special efforts. The net result can be an increase in disparities.

Suicide data are easily manipulated by ignoring selective loss to follow-up. Many suicides occur at breaks in the system, where getting follow-up data is also problematic.

Finally, death by suicide is a health outcome that is multiply determined. It does not lend itself to targeted public health approaches like eliminating polio, tempting though invoking the analogy may be.

Postscript

It is likely that I have exposed anyone reaching this postscript to a new and disconcerting perspective. What I have been saying is discrepant with the publicity about “zero suicides” available in the media. The portrayal of “zero suicides” is quite persuasive because it is sophisticated and well crafted. Its dissemination is well resourced and often financed by individuals and institutions with barely discernible – if discernible at all – conflicts of financial and political interest. Just try to find any dissenters or skeptical assessments.

My takeaway message: it’s best to process claims about suicide prevention with a high level of skepticism, an insistent demand for evidence, and a preparedness to discover that seemingly well-trusted sources are not without agendas. They are usually providing propaganda rather than evidence-based arguments.