How to get a flawed systematic review and meta-analysis withdrawn from publication: a detailed example

Cochrane normally requires authors to agree to withdraw completed reviews that have been published. This withdrawal in the face of resistance from the authors is extraordinary.

There is a lot to be learned from this letter and the accompanying documents in terms of Courtney calmly and methodically laying out a compelling case for withdrawal of a review with important clinical practice and policy implications.

Robert Courtney’s wonderfully detailed cover letter probably proved decisive in getting the Cochrane review withdrawn, along with the work of another citizen scientist/patient advocate, Tom Kindlon.

Especially take a look at the exchanges with the author Lillebeth Larun that are included in the letter.

Excerpt from the cover letter below:

It is my opinion that the published Cochrane review unfortunately fails to meet the standards expected by the public of Cochrane in terms of publishing rigorous, unbiased, transparent and independent analysis, so I would very much appreciate it if you could investigate all of the problems I raised in my submitted comments and ensure that corrections are made or, at the very least, that responses are provided which allow readers to understand exactly why Cochrane believe that no corrections are required, with reference to Cochrane guidelines.

On this occasion, in certain respects, I consider the review to lack rigour, to lack clarity, to be misleading, and to be flawed. I also consider the review (including the discussions, some of the analyses, and unplanned changes to the protocol) to indicate bias in favour of the treatments which it investigates.

Another key excerpt summarized Courtney’s four comments on the Cochrane review that had not yet succeeded in getting the review withdrawn:

In summary, my four submissions focus on, but are not restricted to, the following issues:

  • The review authors switched their primary outcomes in the review, and used unplanned analyses, which has had the effect of substantially transforming some of the interpretation and reporting of the primary outcomes of the review;

  • The review fails to prominently explain and describe the primary outcome switching and to provide a prominent sensitivity analysis. In my opinion, the review also fails to justify the primary outcome switching;

  • The review fails to clearly report that there were no significant treatment effects at follow-up for any pooled outcomes in any measures of health (except for sleep, a secondary outcome), but instead the review gives the impression that most follow-up outcomes indicated significant improvements, and that the treatments were largely successful at follow-up;

  • The review uses some unpublished and post-hoc data from external studies, despite the review-authors claiming that they have included only formally published data and pre-specified outcome data. Using post-hoc and unpublished data, which contradicts the review’s protocol and stated methodology, may have had a significant effect on the review outcomes, possibly even changing the review outcomes from non-significant to significant;

  • The main discussion sections in the review include incorrect and misleading reports of the review’s own outcomes, giving a false overall impression of the efficacy of the reviewed therapies;

  • The review includes an inaccurate assessment of bias (according to the Cochrane guidelines for reporting bias) with respect to some of the studies included in the review’s analyses.

These are all serious issues that I believe we should not be seeing in a Cochrane review.

Digression: My Correspondence with Tom Kindlon regarding this blog post

James Coyne <jcoynester@gmail.com>

Oct 18, 2018, 12:45 PM

to Tom

I’m going to be doing a couple of blog posts about Bob, one of them about the details of the lost year of his life (2017) which he shared with me in February 2018, shortly before he died. But the other blog post is going to be basically this long email posted with commentary. I am concerned that you get your proper recognition as fully sharing the honors with him for ultimately forcing the withdrawal of the exercise review. Can you give me some suggestion of how that might be assured? References? Blogs?

Do you know the details of Bob ending his life? I know it was a deliberate decision, but was it an accompanied suicide? More people need to know about his involuntary hospitalization and stupid diagnosis of anorexia.

Kind regards

Tom Kindlon

Tom Kindlon’s reply to me

Tom Kindlon

Oct 18, 2018, 1:01 PM

Hi James/Jim,

It is great you’re going to write on this.

I submitted two long comments on the Cochrane review of exercise therapy for CFS, which can be read here:

<https://www.cochranelibrary.com/cdsr/doi/10.1002/14651858.CD003200.pub7/detailed-comment/en?messageId=157054020>

<https://www.cochranelibrary.com/cdsr/doi/10.1002/14651858.CD003200.pub7/detailed-comment/en?messageId=157052118>

Robert Courtney then also wrote comments. When he was not satisfied with the responses, he made a complaint.

All the comments can be read on the review here:

<https://www.cochranelibrary.com/cdsr/doi/10.1002/14651858.CD003200.pub7/read-comments>

but as I recall the comments by people other than Robert and myself were not substantial.

I will ask what information can be given out about Bob’s death.

Thanks again for your work on this,

Tom

The Cover Letter: Did it break the impasse about withdrawing the review?

from:     Bob <brightonbobbob@yahoo.co.uk>

to:            James Coyne <jcoynester@gmail.com>

date:     Feb 18, 2018, 5:06 PM

subject:                Fw: Formal complaint – Cochrane review CD003200

THIS IS A COPY OF A FORMAL COMPLAINT SENT TO DR DAVID TOVEY.

Formal Complaint

12th February 2018

From:

Robert Courtney.

UK

To:

Dr David Tovey

Editor in Chief of the Cochrane Library

Cochrane Editorial Unit

020 7183 7503

dtovey@cochrane.org

Complaint with regards to:

Cochrane Database of Systematic Reviews.

Larun L, Brurberg KG, Odgaard-Jensen J, Price JR. Exercise therapy for chronic fatigue syndrome. Cochrane Database Syst Rev. 2017; CD003200. DOI: 10.1002/14651858.CD003200.pub7

Dear Dr David Tovey,

This is a formal complaint with respect to the current version of “Exercise therapy for chronic fatigue syndrome” by L. Larun et al. (Cochrane Database Syst Rev. 2017; CD003200.)

First of all, I would like to apologise for the length of my submissions relating to this complaint. The issues are technical and complex and I hope that I have made them easy to read and understand despite the length of the text.

I have attached four PDF files to this email which outline the details of my complaint. In 2016, I submitted each of these documents as part of the Cochrane comments facility. They have now been published in the updated version of the review. (For your convenience, the details of these submissions are listed at the end of this email with a weblink to an online copy of each document.)

I have found the responses to my comments, by L. Larun, the lead author of the review, to be inadequate, especially considering the seriousness of some of the issues raised.

It is my opinion that the published Cochrane review unfortunately fails to meet the standards expected by the public of Cochrane in terms of publishing rigorous, unbiased, transparent and independent analysis, so I would very much appreciate it if you could investigate all of the problems I raised in my submitted comments and ensure that corrections are made or, at the very least, that responses are provided which allow readers to understand exactly why Cochrane believe that no corrections are required, with reference to Cochrane guidelines.

On this occasion, in certain respects, I consider the review to lack rigour, to lack clarity, to be misleading, and to be flawed. I also consider the review (including the discussions, some of the analyses, and unplanned changes to the protocol) to indicate bias in favour of the treatments which it investigates.

Exercise as a therapy for chronic fatigue syndrome is a highly controversial subject, and so there may be more of a need for independent oversight and scrutiny of this Cochrane review than might usually be the case.

In addition to the technical/methodological issues raised in my four submitted comments, I would also like you to consider whether there may be a potential lack of independence on the part of the authors of this review.

All of the review authors, bar Price, are currently working in collaboration on another Cochrane project with some of the authors of the studies included in this review. (The project involves co-authoring a protocol for a future Cochrane review) [2]. One of the meetings held to develop the protocol for this new review was funded by Peter White’s academic fund [1]. White is the Primary Investigator for the PACE trial (a study included in this Cochrane review).

It is important that Cochrane is seen to uphold high standards of independence, transparency and rigour.

Please refer to my four separate submissions (attached) for the details of my complaint regarding the contents of the review. By way of introduction only, I will also briefly discuss, below, some of the points I have raised in my four documents.

In summary, my four submissions focus on, but are not restricted to, the following issues:

  • The review authors switched their primary outcomes in the review, and used unplanned analyses, which has had the effect of substantially transforming some of the interpretation and reporting of the primary outcomes of the review;
  • The review fails to prominently explain and describe the primary outcome switching and to provide a prominent sensitivity analysis. In my opinion, the review also fails to justify the primary outcome switching;
  • The review fails to clearly report that there were no significant treatment effects at follow-up for any pooled outcomes in any measures of health (except for sleep, a secondary outcome), but instead the review gives the impression that most follow-up outcomes indicated significant improvements, and that the treatments were largely successful at follow-up;
  • The review uses some unpublished and post-hoc data from external studies, despite the review-authors claiming that they have included only formally published data and pre-specified outcome data. Using post-hoc and unpublished data, which contradicts the review’s protocol and stated methodology, may have had a significant effect on the review outcomes, possibly even changing the review outcomes from non-significant to significant;
  • The main discussion sections in the review include incorrect and misleading reports of the review’s own outcomes, giving a false overall impression of the efficacy of the reviewed therapies;
  • The review includes an inaccurate assessment of bias (according to the Cochrane guidelines for reporting bias) with respect to some of the studies included in the review’s analyses.

These are all serious issues that I believe we should not be seeing in a Cochrane review.

These issues have already caused misunderstanding and misreporting of the review in academic discourse and publishing. (See an example of this below.)

All of the issues listed above are explained in full detail in the four PDF files attached to this email. They should be considered to be the basis of this complaint.

For the purposes of this correspondence, I will illustrate some specific issues in more detail.

In the review, the following health indicators were used as outcomes to assess treatment effects: fatigue, physical function, overall health, pain, quality of life, depression, anxiety, and sleep. All of these health indicators, except uniquely for sleep (a secondary outcome), demonstrated a non-significant outcome for pooled treatment effects at follow-up for exercise therapy versus passive control. But a reader would not be aware of this from reading any of the discussion in the review. I undertook a lengthy and detailed analysis of the data in the review before I could comprehend this. I would like these results to be placed in a prominent position in the review, and reported correctly and with clarity, so that a casual reader can quickly understand these important outcomes. These outcomes cannot be understood from reading the discussion, and some outcomes have been reported incorrectly in the discussion. In my opinion, Cochrane is not maintaining its expected standards.

Unfortunately, there is a prominent and important error in the review, which I believe helps to give the misleading impression that the investigated therapies were broadly effective. Physical function and overall health (both at follow-up) have been mis-reported in the main discussion as being positive outcomes at follow-up, when in fact they were non-significant outcomes. This seems to be an important failing of the review that I would like to be investigated and corrected.

Regarding one of the points listed above, copied here:

“The review fails to clearly report that there were no significant treatment effects at follow-up for any pooled outcomes in any measures of health (except for sleep, a secondary outcome), but instead the review gives the impression that most follow-up outcomes indicated significant improvements, and that the treatments were largely successful at follow-up”

This is one of the most substantial issues that I have highlighted. This issue is related to the primary outcome switching in the review.

(This relates to assessing fatigue at long-term follow-up for exercise therapy vs passive control.)

An ordinary (i.e. casual) reader of the review may easily be left with the impression that the review demonstrates that the investigated treatment has almost universal beneficial health effects. However, there were no significant treatment effects for pooled outcome analyses at follow-up for any health outcomes except for sleep (a secondary outcome). The lack of universal treatment efficacy at follow-up is not at all clear from a casual read of the review, or even from a thorough read. Instead, a careful analysis of the data is necessary to understand the outcomes. I believe that the review is unhelpful in the way it has presented the outcomes, and lacks clarity.

These follow-up outcomes are a very important issue for medical, patient and research communities, but I believe that they have been presented in a misleading and unhelpful way in the discussions of the review. This issue is discussed mainly in my submission no.4 (see my list of PDF documents at the bottom of this correspondence), and also a little in submission no.3.

I will briefly explain some of the specific details, by way of an introduction, but please refer to my attached documents for the full details.

The pre-specified primary outcomes were pooled treatment effects (i.e. using pooled data from all eligible studies) immediately after treatment and at follow-up.

However, for fatigue, this pre-specified primary outcome (i.e. pooled treatment effects for the combination of data from all eligible studies) was abandoned/switched (for what I consider to be questionable reasons) and replaced with a non-pooled analysis. The new unplanned analysis did not pool the data from all eligible studies but analysed data from studies grouped together by the specific measure used to assess fatigue (i.e. grouped by the various different fatigue questionnaire assessments).

Looking at these post-hoc grouped outcomes for fatigue at follow-up, two out of the three grouped outcomes had significant treatment effects, and the other outcome was a non-significant effect. This post-hoc analysis indicates that the majority of outcomes (i.e. two out of three) demonstrated a significant treatment effect; however, this does not mean that the pre-specified pooled analysis of all eligible studies would have demonstrated a positive treatment effect. Therefore switching outcomes, and using a post-hoc analysis, allows for the potential introduction of bias into the review. Indeed, on careful inspection of the minutiae of the review, the pre-specified analysis of pooled outcomes demonstrates a non-significant treatment effect for fatigue at follow-up (exercise therapy versus passive control).

The (non-significant) outcome of this pre-specified pooled analysis of fatigue at follow-up is somewhat buried within the data tables of the review, and is very difficult to find; it is not discussed prominently or highlighted. Furthermore, the explanation that the primary outcome was switched is only briefly mentioned and can easily be missed. Uniquely, for the main outcomes, there is no table outlining the details of the pre-specified pooled analysis of fatigue at follow-up. In contrast, the post-hoc analysis, which has mainly positive outcomes, has been given high prominence throughout the review with little explanation that it is a post-hoc outcome.

So, to reiterate, the (two out of three significant, and one non-significant) post-hoc outcomes for fatigue at follow-up were reported as primary outcomes instead of the (non-significant) pre-specified pooled treatment effect for all eligible studies. Two out of three post-hoc outcomes were significant in effect; however, the pre-specified pooled treatment effect, for the same measures, was not significant (for fatigue at follow-up – exercise therapy versus passive control). Thus, the outcome switching transformed one of the main outcomes of the review, from a non-significant effect to a mainly significant effect.

Furthermore, for exercise therapy versus passive control at follow-up, all the other health outcomes were non-significant (except sleep – a secondary outcome), but I believe the casual reader would be unaware of this because it is not explained clearly or prominently in the discussion, and some outcomes have been reported erroneously in the discussion as indicating a significant effect.

All of the above is outlined in my four PDF submissions, with detailed reference to specific sections of the review and specific tables etc.

I believe that the actual treatment effects at follow-up are different to the impression gained from a casual read of the review, or even a careful read of the review. It’s only by an in-depth analysis of the entire review that these issues would be noticed.

In what I believe to be a reasonable request in my submissions, I asked the reviewers to: “Clearly and unambiguously explain that all but one health indicator (i.e. fatigue, physical function, overall health, pain, quality of life, depression, and anxiety, but not sleep) demonstrated a non-significant outcome for pooled treatment effects at follow-up for exercise therapy versus passive control”. My request was not acted upon.

The Cochrane reviewers did provide a reason for the change to the protocol, from a pooled analysis to analyses of groups of mean difference values: “We realise that the standardised mean difference (SMD) is much more difficult to conceptualise and interpret than the normal mean difference (MD) […]”.

However, this is a questionable and unsubstantiated claim, and in my opinion isn’t an adequate explanation or justification for changing the primary outcomes; personally, I find it easier to interpret a single pooled analysis than a group of different analyses with each analysis using a different non-standardised scale to measure fatigue.

Using an SMD is standard practice for Cochrane reviews; Cochrane’s guidance recommends using pooled analyses when the outcomes use different measures, which was the case in this review; thus I struggle to understand why (in an unplanned change to methodology) using an SMD was considered unhelpful by the reviewers in this case. My PDF document no.4 challenges the reviewers’ reason, with reference to the official Cochrane reviewers’ guidelines.
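To make the pooled-versus-grouped distinction concrete, here is a minimal sketch of the kind of inverse-variance, fixed-effect pooling of SMDs that a pre-specified pooled analysis involves. The study summaries below are invented purely for illustration (they are not the review’s data); the formulas are the standard Cohen’s d and its usual approximate variance.

```python
import math

# Hypothetical study summaries (NOT the review's actual data). Each study
# measures fatigue on a different scale, which is exactly why a
# standardised mean difference (SMD) is needed to pool them.
studies = [
    # (n_treat, mean_treat, sd_treat, n_ctrl, mean_ctrl, sd_ctrl)
    (60, 24.0, 6.0, 60, 26.5, 6.5),
    (45, 14.0, 4.0, 45, 15.2, 4.2),
    (80, 30.0, 8.0, 75, 31.0, 7.5),
]

def smd_and_variance(nt, mt, st, nc, mc, sc):
    """Cohen's d (negative favours treatment here) and its approximate variance."""
    pooled_sd = math.sqrt(((nt - 1) * st**2 + (nc - 1) * sc**2) / (nt + nc - 2))
    d = (mt - mc) / pooled_sd
    var = (nt + nc) / (nt * nc) + d**2 / (2 * (nt + nc))
    return d, var

# Inverse-variance fixed-effect pooling across all eligible studies.
ds, ws = [], []
for s in studies:
    d, v = smd_and_variance(*s)
    ds.append(d)
    ws.append(1.0 / v)

pooled = sum(w * d for w, d in zip(ws, ds)) / sum(ws)
se = math.sqrt(1.0 / sum(ws))
ci_low, ci_high = pooled - 1.96 * se, pooled + 1.96 * se
print(f"pooled SMD = {pooled:.2f}, 95% CI ({ci_low:.2f}, {ci_high:.2f})")
# If the 95% CI includes zero, the pooled effect is non-significant at p < 0.05.
```

A single pooled estimate of this kind is what a pre-specified pooled primary outcome delivers; reporting per-scale mean differences separately, as the review did instead, produces several estimates that can point in different directions.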

This review has already led to an academic misunderstanding and mis-reporting of its outcomes, which is demonstrated in the following published letter from one of the co-authors of the IPD protocol…

CMAJ (Canada) recommends exercise for CFS [http://www.cmaj.ca/content/188/7/510/tab-e-letters ]

The letter claims: “We based the recommendations on the Cochrane systematic review which looked at 8 randomised trials of exercise for chronic fatigue, and together showed a consistent modest benefit of exercise across the different patient groups included. The clear and consistent benefit suggests indication rather than contraindication of exercise.”

However, there was not a “consistent modest benefit of exercise” and there was not a “clear and consistent benefit” considering that there were no significant treatment effects for any pre-specified (pooled) health outcomes at follow-up, except for sleep. The actual outcomes of the review seem to contradict the interpretation expressed in the letter.

Even if we include the unplanned analyses in our considerations, then it would still be the case that most outcomes did not indicate a beneficial treatment effect at follow-up for exercise therapy versus passive control. Furthermore, one of the most important outcomes, physical function, did not indicate a significant improvement at follow up (despite the discussion erroneously stating that it was a significant effect).

Two of my submissions discuss other issues, which I will outline below.

My first submission is in relation to the following…

The review states that all the analysed data had previously been formally published and was pre-specified in the relevant published studies. However, the review includes an analysis of external data that had not been formally published and is post-hoc in nature, despite alternative data being available that has been formally published and had been pre-specified in the relevant study. The post-hoc data relates to the FINE trial (Wearden 2010). The use of this data was not in accordance with the Cochrane review’s protocol and also contradicts the review’s stated methodology and the discussion of the review.

Specifically, the fatigue data taken from the FINE trial was not pre-specified for the trial and was not included in the original FINE trial literature. Instead, the data had been informally posted on a BMJ rapid response by the FINE trial investigators[3].

The review analyses post-hoc fatigue data from the FINE trial which is based on the Likert scoring system for the Chalder fatigue questionnaire, whereas the formally published FINE trial literature uses the same Chalder fatigue questionnaires but uses the bimodal scoring system, giving different outcomes for the same patient questionnaires. The FINE trial’s post-hoc Likert fatigue data (used in the review) was initially published by the FINE authors only in a BMJ rapid response post [3], apparently as an after-thought.
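The two scoring systems can be shown side by side. The Chalder fatigue questionnaire has 11 items, each answered on a four-point response: Likert scoring counts each item 0–3 (giving the 33-point scale referred to in Larun’s response), while bimodal scoring collapses each item to 0 or 1 (an 11-point scale). The answers below are hypothetical; the point is only that one set of responses yields two different totals.

```python
# One patient's hypothetical answers to the 11 Chalder items (each 0-3).
responses = [2, 3, 1, 2, 2, 3, 2, 1, 2, 2, 3]

likert = sum(responses)                        # 0-3 per item -> 0-33 scale
bimodal = sum(1 for r in responses if r >= 2)  # 0,0,1,1 per item -> 0-11 scale

print(likert, bimodal)  # prints: 23 9
```

Because the two totals sit on different scales, a trial’s effect size (and whether it reaches significance) can differ depending on which scoring is analysed, which is the crux of the objection.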

This is the response to my first letter…

Larun said she was “extremely concerned and disappointed” with the Cochrane editors’ actions. “I disagree with the decision and consider it to be disproportionate and poorly justified,” she said.

———————-

Larun said:

Dear Robert Courtney

Thank you for your detailed comments on the Cochrane review ‘Exercise Therapy for Chronic Fatigue Syndrome’. We have the greatest respect for your right to comment on and disagree with our work. We take our work as researchers extremely seriously and publish reports that have been subject to rigorous internal and external peer review. In the spirit of openness, transparency and mutual respect we must politely agree to disagree.

The Chalder Fatigue Scale was used to measure fatigue. The results from the Wearden 2010 trial show a statistically significant difference in favour of pragmatic rehabilitation at 20 weeks, regardless whether the results were scored bi-modally or on a scale from 0-3. The effect estimate for the 70 week comparison with the scale scored bi-modally was -1.00 (CI-2.10 to +0.11; p =.076) and -2.55 (-4.99 to -0.11; p=.040) for 0123 scoring. The FINE data measured on the 33-point scale was published in an online rapid response after a reader requested it. We therefore knew that the data existed, and requested clarifying details from the authors to be able to use the estimates in our meta-analysis. In our unadjusted analysis the results were similar for the scale scored bi-modally and the scale scored from 0 to 3, i.e. a statistically significant difference in favour of rehabilitation at 20 weeks and a trend that does not reach statistical significance in favour of pragmatic rehabilitation at 70 weeks. The decision to use the 0123 scoring did does not affect the conclusion of the review.

Regards,

Lillebeth Larun

——————

In her response, above, Larun discusses the FINE trial and quotes an effect size for post-hoc outcome data (fatigue at follow-up) from the FINE trial that is included in the review. Her quoted figures accurately reflect the data quoted by the FINE authors in their BMJ rapid-response comment [3] but, confusingly, these are slightly different from the data in the Cochrane review. In her response, Larun states that the FINE trial effect size for fatigue at 70 weeks using Likert data is -2.55 (-4.99 to -0.11; p=.040), whereas the Cochrane Review states that it is -2.12 [-4.49, 0.25].

This inconsistency makes this discussion confusing. Unfortunately there is no authoritative source for the data because it had not been formally published when the Cochrane review was published.

It seems that, in her response, Larun has quoted the BMJ rapid response data by Wearden et al.[3], rather than her own review’s data. Referring to her review’s data, Larun says that in “our unadjusted analysis the results were similar for the scale scored bi-modally and the scale scored from 0 to 3, i.e. a statistically significant difference in favour of rehabilitation at 20 weeks and a trend that does not reach statistical significance in favour of pragmatic rehabilitation at 70 weeks”.

It is not clear exactly why there are now two different Likert effect sizes, for fatigue at 70 weeks, but we can be sure that the use of this data undermines the review’s claim that “for this updated review, we have not collected unpublished data for our outcomes…”

This confusion, perhaps, demonstrates one of the pitfalls of using unpublished data. The difference between the data published in the review and the data quoted by Larun in her response (which are both supposedly the same unpublished data from the FINE trial) raises the question of exactly what data has been analysed in the review, and what exactly its source is. If it is unpublished data, and seemingly variable in nature, how are readers expected to scrutinise or trust the Cochrane analysis?

With respect to the FINE trial outcomes (fatigue at 70 week follow-up), Larun has provided the mean differences (effect size) for the (pre-specified) bimodal data and for the (post-hoc) Likert data. These two different scoring methods (bimodal and Likert) are used for identical patient Chalder fatigue questionnaires, and provide different effect sizes, so switching the fatigue scoring methods may possibly have had an impact on the review’s primary outcomes for fatigue.

Larun hasn’t provided the effect estimates for fatigue at end-of-treatment, but these would also demonstrate variance between bimodal and Likert scoring, so switching the outcomes might have had a significant impact on the primary outcome of the Cochrane review at end-of-treatment, as well as at follow-up.

Note that the effect estimates outlined in this correspondence, for the FINE trial, are mean differences (this is the data taken from the FINE trial), rather than standardised mean differences (which are sometimes used in the meta-analyses in the Cochrane review); it is important not to get confused between the two different statistical analyses.

Larun said: “The decision to use the 0123 [i.e. Likert] scoring did does [sic] not affect the conclusion of the review.”

But it is not possible for a reader to verify that, because Larun has not provided any evidence to demonstrate that switching outcomes has had no effect on the conclusion of the review; i.e. there is no sensitivity analysis, despite the review switching outcomes and using unpublished post-hoc data instead of published pre-specified data. This change in methodology means that the review does not conform to its own protocol and stated methodology. This seems like a significant issue.

Are we supposed to accept the word of the author, rather than review the evidence for ourselves? This is a Cochrane review – renowned for rigour and impartiality.

Note that Larun has acknowledged that I am correct with respect to the FINE trial data used in the review (i.e. that the data was unpublished and not part of the formally published FINE trial study, but was simply posted informally in a BMJ rapid response). Larun confirms that: “…the 33-point scale was published in an online rapid response after a reader requested it. We therefore knew that the data existed, and requested clarifying details from the authors…” But then Larun confusingly (for me) says we must “agree to disagree”.

Larun has not amended her literature to resolve the situation; Larun has not changed her unplanned analysis back to her planned analyses (i.e. to use published pre-specified data as per the review protocol, rather than unpublished post-hoc data); nor has she amended the text of the review so that it clearly and prominently indicates that the primary outcomes were switched. Neither has a sensitivity analysis been published using the FINE trial’s published pre-specified data.

Note the difference in the effect estimates at 70 weeks for bimodal scoring [-1.00 (CI -2.10 to +0.11; p =.076)] vs Likert scoring [-2.55 (-4.99 to -0.11; p=.040)] (as per the Cochrane analysis) or -2.12 [-4.49, 0.25] (also Likert scoring) as per Larun’s response and the BMJ rapid response where the data was initially presented to the public.

Confusingly, there are two different effect sizes for the same (Likert) data; one shows a significant treatment effect and the other shows a non-significant treatment effect. This seems like a rather chaotic situation for a Cochrane review. The data is neither consistent nor transparent. The unplanned Cochrane analysis uses data which has not been published and cannot be scrutinised.
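The significance readings above follow mechanically from the confidence intervals: for a mean difference, a two-sided 95% CI that excludes zero corresponds to p < 0.05. A few lines of Python applied to the two Likert estimates quoted for fatigue at 70 weeks make the discrepancy explicit.

```python
def significant_at_5pct(ci_low, ci_high):
    """A mean-difference estimate is significant (two-sided, 5% level)
    exactly when its 95% confidence interval excludes zero."""
    return not (ci_low <= 0.0 <= ci_high)

# BMJ rapid-response figure: -2.55 [-4.99, -0.11]
print(significant_at_5pct(-4.99, -0.11))  # prints: True

# Figure in the Cochrane review: -2.12 [-4.49, 0.25]
print(significant_at_5pct(-4.49, 0.25))   # prints: False
```

So the two versions of supposedly the same Likert data sit on opposite sides of the significance threshold, which is what makes the inconsistency more than a rounding quibble.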

Furthermore, we now have three sets of data for the same outcomes. Because an unplanned analysis was used in the review, it is nearly impossible to work out what is what.

In her response, above, Larun says that both fatigue outcomes (i.e. bimodal & Likert scoring systems) at 70 weeks are non-significant. This is true of the data published in the Cochrane review but, confusingly, this isn’t true if we consider the data that Larun has provided in her response, above. The bimodal and Likert data (fatigue at 70 weeks) presented in the review both have a non-significant effect, however, the Likert data quoted in Larun’s correspondence (which reflects the data in the FINE trial authors’ BMJ rapid response) shows a significant outcome. This may reflect the use of adjusted vs unadjusted data, but it isn’t clear.

Using post-hoc data may allow bias to creep into the review; for example, the Cochrane reviewers might have seen the post-hoc data for the FINE trial, because it was posted in an open-access BMJ rapid response [3] prior to the Cochrane review publication date. I am not accusing the authors of conscious bias, but Cochrane guidelines are put in place to avoid doubt and to maintain rigour and transparency. Hypothetically, a biased author may have seen that a post-hoc Likert analysis allowed for better outcomes to be reported for the FINE trial. The Cochrane guidelines are established in order to avoid such potential pitfalls and bias, and to avoid the confusion that is inherent in this review.

Note that the review still incorrectly says that all the data is previously published data – even though Larun admits in the letter that it isn’t. (That is, the data are not formally published in a peer-reviewed journal; I assume the review wasn’t referring to data that might be informally published in blogs or magazines etc., because the review purports to analyse formally published data only.)

The authors have practically dismissed my concerns and have not amended anything in the review, despite admitting in the response that they’ve used post-hoc data.

The fact that this is all highly confusing, even after I have studied it in detail, demonstrates that these issues need to be straightened out and fixed.

It surely shouldn’t be the case, in a Cochrane review, that we have (for the same outcomes) three sets of results being bandied about, with the data used in a post-hoc analysis seeming to vary over time, changing from a non-significant treatment effect to a significant treatment effect depending on where it is quoted. Because the data is unpublished, independent scrutiny is made more difficult.

For your information, the BMJ rapid response (Wearden et al.) includes the following data: “Effect estimates [95% confidence intervals] for 20 week comparisons are: PR versus GPTAU -3.84 [-6.17, -1.52], SE 1.18, P=0.001; SL versus GPTAU +0.30 [-1.73, +2.33], SE 1.03, P=0.772. Effect estimates [95% confidence intervals] for 70 week comparisons are: PR versus GPTAU -2.55 [-4.99,-0.11], SE 1.24, P=0.040; SL versus GPTAU +0.36 [-1.90, 2.63], SE 1.15, P=0.752.”
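As a sanity check on these figures, the p-values can be recovered from the effect estimates and standard errors (and a standard error can be recovered from a 95% confidence interval) using a simple normal approximation. This is a back-of-the-envelope sketch, not the trial authors’ actual computation:

```python
import math

def ci_to_se(lower, upper):
    """Recover a standard error from a 95% confidence interval."""
    return (upper - lower) / (2 * 1.96)

def p_value(estimate, se):
    """Two-sided p-value for an effect estimate, normal approximation."""
    z = abs(estimate / se)
    return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))

# 70-week Likert data as quoted in the BMJ rapid response: significant.
p_value(-2.55, 1.24)                   # ~0.040, matching the reported P=0.040

# 70-week Likert data as quoted in Larun's response: non-significant.
p_value(-2.12, ci_to_se(-4.49, 0.25))  # ~0.08; the CI crosses zero
```

This makes the inconsistency concrete: a confidence interval that excludes zero (−4.99 to −0.11) corresponds to p < .05, while one that crosses zero (−4.49 to 0.25) does not.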

My second submission was in relation to the following…

I believe that properly applying the official Cochrane guidelines would require the review to categorise the PACE trial (White 2011) data as ‘unplanned’ rather than ‘pre-specified’, and would require the risk of bias in relation to ‘selective reporting’ to be categorised accordingly. The Cochrane review currently categorises the risk of ‘selective reporting’ bias for the PACE trial as “low”, whereas the official Cochrane guidelines indicate (unambiguously) that the risk of bias for the PACE data should be “high”. I believe that my argument is fairly robust and water-tight.

This is the response to my second letter…

———————–

Larun said:

Dear Robert Courtney

Thank you for your detailed comments on the Cochrane review ‘Exercise Therapy for Chronic Fatigue Syndrome’. We have the greatest respect for your right to comment on and disagree with our work. We take our work as researchers extremely seriously and publish reports that have been subject to rigorous internal and external peer review. In the spirit of openness, transparency and mutual respect we must politely agree to disagree.

Cochrane reviews aim to report the review process in a transparent way, for example, are reasons for the risk of bias stated. We do not agree that Risk of Bias for the Pace trial (White 2011) should be changed, but have presented it in a way so it is possible to see our reasoning. We find that we have been quite careful in stating the effect estimates and the certainty of the documentation. We note that you read this differently.

Regards,

Lillebeth

————————-

I do not understand what is meant by: “We do not agree that Risk of Bias for the Pace trial (White 2011) should be changed, but have presented it in a way so it is possible to see our reasoning.” …

The review does not discuss the issue of the PACE data being unplanned and I, for one, do not understand the reasoning for not correcting the category for the risk of selective reporting bias. The response to my submission fails to engage with the substantive and serious issues that I raised.

To date, nearly all the issues raised in my letters have been entirely dismissed by Larun. I find this surprising, especially considering that some of the points that I have made were factual (i.e. not particularly open to interpretation) and difficult to dispute. Indeed, Larun’s response even accepts the factual point that I made, in relation to the FINE data, but then confusingly dismisses my request for the issue to be remedied.

There is more detail in the four PDF submissions which are attached to this email, and which have now been published in the latest version of the Cochrane review. I will stop this email now so as not to overwhelm you, and so I don’t repeat myself.

Again, I apologise for the complexity. My four submissions, attached to this email as PDF files, form the central basis of my complaint, so I ask you to consider them as such. I hope that they will be sufficiently clear.

I trust that you will wish to investigate these issues, with a view to upholding the high standards expected from a Cochrane review.

I look forward to hearing from you in due course. Please feel free to email me at any time with any questions, or if you believe it would be helpful to discuss any of the issues raised.

Regards,

Robert Courtney.

My ‘comments’ (submitted to the Cochrane review authors):

Please note that the four attached PDF documents form the basis of this complaint.

For your convenience, I have included a weblink to a downloadable online copy of each document; I have also attached copies to this email as PDF files, and the comments have now been published in the latest updated version of the review.

The dates refer to the date the comments were submitted to Cochrane.

  1. Query re use of post-hoc unpublished outcome data: Scoring system for the Chalder fatigue scale, Wearden 2010.

Robert Courtney

16th April 2016

https://sites.google.com/site/mecfsnotes/submissions-to-the-cochrane-review-of-exercise-therapy-for-chronic-fatigue-syndrome/fine-trial-unpublished-data

  2. Assessment of Selective Reporting Bias in White 2011.

Robert Courtney

1st May 2016

https://sites.google.com/site/mecfsnotes/submissions-to-the-cochrane-review-of-exercise-therapy-for-chronic-fatigue-syndrome/pace-trial-selective-reporting-bias

  3. A query regarding the way outcomes for physical function and overall health have been described in the abstract, conclusion and discussions of the review.

Robert Courtney

12th May 2016

https://sites.google.com/site/mecfsnotes/submissions-to-the-cochrane-review-of-exercise-therapy-for-chronic-fatigue-syndrome/misreporting-of-outcomes-for-physical-function

  4. Concerns regarding the use of unplanned primary outcomes in the Cochrane review.

Robert Courtney

3rd June 2016

https://sites.google.com/site/mecfsnotes/submissions-to-the-cochrane-review-of-exercise-therapy-for-chronic-fatigue-syndrome/primary-outcome-switching

References:

  1. Quote from Cochrane reference CD011040:

“Acknowledgements[…]The author team held three meetings in 2011, 2012 and 2013 which were funded as follows: […]2013 via Peter D White’s academic fund (Professor of Psychological Medicine, Centre for Psychiatry, Wolfson Institute of Preventive Medicine, Barts and The London School of Medicine and Dentistry, Queen Mary University of London).”

  2. Larun L, Odgaard-Jensen J, Brurberg KG, Chalder T, Dybwad M, Moss-Morris RE, Sharpe M, Wallman K, Wearden A, White PD, Glasziou PP. Exercise therapy for chronic fatigue syndrome (individual patient data) (Protocol). Cochrane Database of Systematic Reviews 2014, Issue 4. Art. No.: CD011040.

http://onlinelibrary.wiley.com/doi/10.1002/14651858.CD011040/abstract

http://www.cochrane.org/CD011040/DEPRESSN_exercise-therapy-for-chronic-fatigue-syndrome-individual-patient-data

 

  3. Wearden AJ, Dowrick C, Chew-Graham C, et al. Fatigue scale. BMJ Rapid Response. 2010.

http://www.bmj.com/rapid-response/2011/11/02/fatigue-scale-0 (accessed Feb 21, 2016).

End.

Cochrane complaints procedure:

http://www.cochranelibrary.com/help/the-cochrane-library-complaints-procedure.html

Lessons we need to learn from a Lancet Psychiatry study of the association between exercise and mental health

The closer we look at a heavily promoted study of exercise and mental health, the more its flaws become obvious. There is little support for the most basic claims being made – despite the authors marshaling enormous attention to the study.


Apparently, the editor of Lancet Psychiatry and reviewers did not give the study a close look before it was accepted.

The article was used to raise funds for a startup company in which one of the authors was heavily invested. This was disclosed, but doesn’t let the authors off the hook for promoting a seriously flawed study. Nor should the editor of Lancet Psychiatry or reviewers escape criticism, nor the large number of people on Twitter who thoughtlessly retweeted and “liked” a series of tweets from the last author of the study.

This blog post is intended to raise consciousness about bad science appearing in prestigious journals and to allow citizen scientists to evaluate their own critical thinking skills in terms of their ability to detect misleading and exaggerated claims.

1. Sometimes a disclosure of extensive conflicts of interest alerts us not to pay serious attention to a study. Instead, we should question why the study got published in a prestigious peer-reviewed journal when it had such an obvious risk of bias.

2. We need citizen scientists with critical thinking skills to identify such promotional efforts and alert others in their social network that hype and hokum are being delivered.

3. We need to stand up to authors who use scientific papers for commercial purposes, especially when they troll critics.

Read on and you will see what a skeptical look at the paper and its promotion revealed.

  • The study failed to capitalize on the potential of multiple years of data for developing and evaluating statistical models. Bigger is not necessarily better. Combining multiple years of data was wasteful and served only the purpose of providing the authors bragging rights and the impressive, but meaningless p-values that come from overly large samples.
  • The study relied on an unvalidated and inadequate measure of mental health that confounded recurring stressful environmental conditions in the work or home with mental health problems, even where validated measures of mental health would reveal no effects.
  • The study used an odd measure of history of mental health problems that undoubtedly exaggerated past history.
  • The study confused physical activity with (planned) exercise. The authors amplified their confusion by relying on an exceedingly odd strategy for estimating how much participants exercised: estimates of time spent in a single activity were used in analyses of total time spent exercising. All other physical activity was ignored.
  • The study made a passing acknowledgment of the problems interpreting simple associations as causal, but then went on to selectively sample the existing literature to make the case that interventions to increase exercise improve mental health.
  • Taken together, a skeptical assessment of this article provides another demonstration that disclosure of substantial financial conflicts of interest should alert readers to a high likelihood of a hyped, inaccurately reported study.
  • The article was paywalled, so anyone interested in evaluating the authors’ claims for themselves had to write to the author or have access to the article through a university library site. I am waiting for the authors to reply to my requests for the supplementary tables that are needed to make full sense of their claims. In the meantime, I’ll just complain about authors with significant conflicts of interest heavily promoting studies that they hide behind paywalls.

I welcome you to examine the author’s thread of tweets. Request the actual article from the author if you want to evaluate my claims independently. This can be great material for a master’s or honors class on critical appraisal, whether in psychology or journalism.

title of article

Let me know if you think that I’ve been too hard on this study.

A thread of tweets from the last author celebrated the success of a well-orchestrated publicity campaign for a new article concerning exercise and mental health in Lancet Psychiatry.

The thread started:

Our new @TheLancetPsych paper was the biggest ever study of exercise and mental health. it caused quite a stir! here’s my guided tour of the paper, highlighting some of our excitements and apprehensions along the way [thread] 1/n

And ended with pitch for the author’s do-good startup company:

Where do we go from here? Over @spring_health – our mental health startup in New York City – we’re using these findings to develop personalized exercise plans. We want to help every individual feel better—faster, and understand exactly what each patient needs the most.

I wasn’t long into the thread before my skepticism was stimulated. The fourth tweet in the thread had a figure that didn’t get any comments about how bizarre it was.

The tweet read:

It looks like those differences mattered. for example, people who exercised for about 45 minutes seemed to have better mental health than people who exercised for less than 30, or more than 60 minutes. — a sweet spot for mental health, perhaps?

graphs from paper

Apparently the author does not comment on an anomaly either. Housework appears to be better for mental health than a summary score of all exercise and looks equal to or better than cycling or jogging. But how did housework slip into the category “exercise”?

I began wondering what the authors meant by “exercise”, and whether they’d given the definition serious consideration when constructing their key variable from the survey data.

But then that tweet was followed by another one that generated more confusion, with a graph that seemingly contradicted the figures in the last one:

the type of exercise people did seems important too! People doing team sports or cycling had much better mental health than other sports. But even just walking or doing household chores was better than nothing!

Then a self-congratulatory tweet for a promotional job well done.

for sure — these findings are exciting, and it has been overwhelming to see the whole world talking openly and optimistically about mental health, and how we can help people feel better. It isn’t all plain sailing though…

The author’s next tweet revealed a serious limitation to the measure of mental health used in the study in a screenshot.

screenshot of tweet with mental health variable

The author acknowledged the potential problem, sort of:

(1b- this might not be the end of the world. In general, most peple have a reasonable understanding of their feelings, and in depressed or anxious patients self-report evaluations are highly correlated with clinician-rated evaluations. But we could be more precise in the future)

“Not the end of the world?” Since when does the author of a paper in the Lancet family of journals so casually brush off a serious methodological issue? A lot of us who have examined the validity of mental health measures would be skeptical of this dismissal of a potentially fatal limitation.

No validation is provided for this measure. On the face of it, respondents could endorse it on the basis of facing recurring stressful situations that had no consequences for their mental health. This reflects the ambiguity of the term “stress” for both laypersons and scientists: “stress” could variously refer to an environmental situation, a subjective experience of stress, or an adaptational outcome. Waitstaff could consider Thursday, when the chef is off, a recurrent weekly stress. Persons with diagnosable persistent depressive disorder would presumably endorse more days than not as being a mental health challenge. But they would mean something entirely different.

The author acknowledged that the association between exercise and mental health might be bidirectional in terms of causality:

adam on lots of reasons to believe relationship goes both ways.PNG

But then made a strong claim for increased exercise leading to better mental health.

exercise increases mental health.PNG

[Actually, as we will see, the evidence from randomized trials of exercise to improve mental health is modest, and entirely disappears once one limits oneself to the high-quality studies.]

The author then runs off the rails with the claim that the benefits of exercise exceed the benefits of having a greater-than-poverty-level income.

why are we so excited.PNG

I could not resist responding.

Stop comparing adjusted correlations obtained under different circumstances as if they demonstrated what would be obtained in RCT. Don’t claim exercising would have more effect than poor people getting more money.

But I didn’t get a reply from the author.

Eventually, the author got around to plugging his startup company.

I didn’t get it. Just how did this heavily promoted study advance the science of such “personalized recommendations”?

Important things I learned from others’ tweets about the study

I follow @BrendonStubbs on Twitter and you should too. Brendon often makes wise critical observations of studies that most everyone else is uncritically praising. But he also identifies some studies that I otherwise would miss and says very positive things about them.

He started his own thread of tweets about the study on a positive note, but then he identified a couple of critical issues.

First, he took issue with the author’s tweet claiming to have identified a tipping point, below which exercise is beneficial, and above which exercise could prove detrimental to mental health.

4/some interpretations are troublesome. Most confusing, are the assumptions that higher PA is associated/worsens your MH. Would we say based on cross sect data that those taking most medication/using CBT most were making their MH worse?

A postdoctoral fellow @joefirth7 seconded that concern:

I agree @BrendonStubbs: idea of high PA worsening mental health limited to observation studies. Except in rare cases of athletes overtraining, there’s no exp evidence of ‘tipping point’ effect. Cross-sect assocs of poor MH <–> higher PA likely due to multiple other factors…

Ouch! But then Brendon follows up with concerns that the measure of physical activity has not been adequately validated, noting that such self-report measures often prove invalid.

5/ one consideration not well discussed, is self report measures of PA are hopeless (particularly in ppl w mental illness). Even those designed for population level monitoring of PA https://journals.humankinetics.com/doi/abs/10.1123/jpah.6.s1.s5 … it is also not clear if this self report PA measure has been validated?

As we will soon see, the measure used in this study is quite flawed in its conceptualization and in its odd methodology of requiring participants to estimate the time spent exercising for only one activity, chosen from 75 options.

Next, Brendon points to a particular problem using self-reported physical activity in persons with mental disorder and gives an apt reference:

6/ related to this, self report measures of PA shown to massively overestimate PA in people with mental ill health/illness – so findings of greater PA linked with mental illness likely bi-product of over-reporting of PA in people with mental illness e.g Validity and Value of Self-reported Physical Activity and Accelerometry in People With Schizophrenia: A Population-Scale Study of the UK Biobank [ https://academic.oup.com/schizophreniabulletin/advance-article/doi/10.1093/schbul/sbx149/4563831 ]

7/ An additional point he makes: anyone working in field of PA will immediately realise there is confusion & misinterpretation about the concepts of exercise & PA in the paper, which is distracting. People have been trying to prevent this happening over 30 years

Again, Brendon provides a spot-on citation clarifying the distinction between physical activity and exercise: Physical activity, exercise, and physical fitness: definitions and distinctions for health-related research.

The mysterious pseudonymous Zad Chow @dailyzad called attention to a blog post they had just uploaded. Let’s take a look at some of its key points.

Lessons from a blog post: Exercise, Mental Health, and Big Data

Zad Chow is quite balanced in dispensing praise and criticism of the Lancet Psychiatry paper. They noted the ambiguity of any causality in a cross-sectional correlation and investigated the literature on their own.

So what does that evidence say? Meta-analyses of randomized trials seem to find that exercise has large and positive treatment effects on mental health outcomes such as depression.

Study Name            # of Randomized Trials    Effects (SMD) [95% CI]

Schuch et al. 2016    25                        1.11 (0.79 to 1.43)
Gordon et al. 2018    33                        0.66 (0.48 to 0.83)
Krogh et al. 2017     35                        −0.66 (−0.86 to −0.46)

But, when you only pool high-quality studies, the effects become tiny.

“Restricting this analysis to the four trials that seemed less affected of bias, the effect vanished into −0.11 SMD (−0.41 to 0.18; p=0.45; GRADE: low quality).” – Krogh et al. 2017
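To see why a pooled effect can shrink so much when only the low-bias trials are kept, here is a minimal fixed-effect inverse-variance pooling sketch (the standard weighting scheme for combining standardized mean differences; the numbers are made up for illustration, not taken from Krogh et al.):

```python
import math

def pool_fixed_effect(studies):
    """Fixed-effect inverse-variance pooling.
    `studies` is a list of (smd, se) pairs; returns (pooled SMD, 95% CI)."""
    weights = [1.0 / se ** 2 for _, se in studies]  # precise trials weigh more
    pooled = sum(w * smd for (smd, _), w in zip(studies, weights)) / sum(weights)
    pooled_se = 1.0 / math.sqrt(sum(weights))
    return pooled, (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)

# Two small trials with large apparent effects plus one large, low-bias
# trial near zero: the precise trial dominates the weighted average.
pooled, ci = pool_fixed_effect([(-0.9, 0.40), (-0.8, 0.35), (-0.05, 0.08)])
# pooled is roughly -0.12, and the confidence interval crosses zero
```

Drop the two small, imprecise trials and next to nothing remains, which is exactly the pattern Krogh et al. describe.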

Hmm, would you have guessed this from the Lancet Psychiatry author’s thread of tweets?

Zad Chow showed the hype and untrustworthiness of the press coverage in prestigious media with a sampling of screenshots.

zad chou screenshots of press coverage

I personally checked and don’t see that Zad Chow’s selection of press coverage was skewed. Coverage in the media all seemed to be saying the same thing. I found the distortion to continue with uncritical parroting – a.k.a. churnalism – of the claims of the Lancet Psychiatry authors in the Wall Street Journal.

The WSJ repeated a number of the author’s claims that I’ve already thrown into question and added a curiosity:

In a secondary analysis, the researchers found that yoga and tai chi—grouped into a category called recreational sports in the original analysis—had a 22.9% reduction in poor mental-health days. (Recreational sports included everything from yoga to golf to horseback riding.)

And the NHS England totally got it wrong:

NHS getting it wrong.PNG

So, we learned that the broad category “recreational sports” covers yoga and tai chi, as well as golf and horseback riding. This raises serious questions about the lumping and splitting of categories of physical activity in the analyses that are being reported.

I needed to access the article in order to uncover some important things 

I’m grateful for the clues I got from Twitter, especially from Zad Chow, which I used in examining the article itself.

I got hung up on the title proclaiming that the study involved 1.2 million individuals. When I checked the article, I saw that the authors used three waves of publicly available data to get that number. Having that many participants gave them no real advantage except for bragging rights and the likelihood that modest associations could be expressed in spectacular p-values, like p < 2.2 × 10⁻¹⁶. I don’t understand why the authors didn’t conduct analyses in one wave and cross-validate the results in another.
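The point about spectacular p-values from overly large samples is easy to demonstrate: with n in the millions, even a practically meaningless correlation sails past p < 2.2 × 10⁻¹⁶. A rough sketch (my own illustration, not the paper’s analysis):

```python
import math

def corr_p_value(r, n):
    """Approximate two-sided p-value for a Pearson correlation r at
    sample size n (normal approximation to the t statistic; fine for large n)."""
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return math.erfc(abs(t) / math.sqrt(2))

corr_p_value(0.01, 1_200_000)  # far below 2.2e-16: "spectacularly significant"
corr_p_value(0.01, 100)        # ~0.9: the same trivial correlation, unremarkable
```

The effect size is identical in both calls; only the sample size manufactures the headline p-value.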

The obligatory Research in Context box made it sound like a systematic search of the literature had been undertaken. Maybe, but the authors were highly selective in what they chose to comment upon, as seen in its contradiction by Zad Chow’s brief review. The authors would have us believe that the existing literature is quite limited and inconclusive, supporting the need for a study like theirs.

research in context

Caveat lector: a strong confirmation bias likely lies ahead in this article.

Questions accumulated quickly as to the appropriateness of the items available from a national survey undoubtedly constructed with other purposes. Certainly these items would not have been selected if the original investigators were interested in the research question at the center of this article.

Participants self-reported a previous diagnosis of depression or depressive episode on the basis of the following question: “Has a doctor, nurse, or other health professional EVER told you that you have a depressive disorder, including depression, major depression, dysthymia, or minor depression?”

Our own work has cast serious doubt on the correspondence between reports of a history of depression in response to a brief question embedded in a larger survey and the results of a structured interview in which respondents’ answers can be probed. We found that answers to such questions were more related to current distress than to actual past diagnoses and treatment of depression. The survey question used in the Lancet Psychiatry study added further ambiguity and invalidity with the phrase “or minor depression.” I am not sure under what circumstances a health care professional would disclose a diagnosis of “minor depression” to a patient, but I doubt it would be in a context in which the professional felt treatment was needed.

Despite the skepticism that I was developing about the usefulness of the survey data, I was unprepared for the assessment of “exercise.”

Other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise?” Participants who answered yes to this question were then asked: “What type of physical activity or exercise did you spend the most time doing during the past month?” A total of 75 types of exercise were represented in the sample, which were grouped manually into eight exercise categories to balance a diverse representation of exercises with the need for meaningful cell sizes (appendix).

Participants indicated the number of times per week or month that they did this exercise and the number of minutes or hours that they usually spend exercising in this way each time.

I had already been tipped off by the discussion on Twitter that there would be a thorough confusion of planned exercise and mere physical activity. But now that was compounded. Why was physical activity during employment excluded? What if participants were engaged in a number of different physical activities, like both jogging and bicycling? If so, the survey obtained data for only one of these activities, with the other excluded, and the choice could have been quite arbitrary as to which one the participant identified as the one to be counted.

Anyone who has ever constructed surveys would be alert to the problems posed by participants’ awareness that saying “yes” to exercising would require contemplating 75 different options and arbitrarily choosing one of them for a further question about how much time they engaged in that activity. Unless participants were strongly motivated, there was an incentive to simply say no, they didn’t exercise.

I suppose I could go on, but it was my judgment that any validity to what the authors were claiming had been ruled out. As someone once said on an NIH grant review panel: there are no vital signs left, let’s move on to the next item.

But let’s refocus just a bit on the overall intention of these authors. They want to use a large data set to make statements about the association between physical activity and a measure of mental health. They have used matching and statistical controls to equate participants. But that strategy effectively eliminates consideration of crucial contextual variables. Persons’ preferences and opportunities to exercise are powerfully shaped by their personal and social circumstances, including finances and competing demands on their time. Said differently, people are embedded in contexts that a lot of statistical maneuvering has sought to eliminate.

To suggest a small number of the many complexities: how much physical activity participants get in their employment may be an important determinant of their choices for additional activity, as well as how much time is left outside of work. If work typically involves a lot of physical exertion, people may simply be left too tired for additional planned physical activity, a.k.a. exercise, and their physical health may require it less. Environments differ greatly in the opportunities for, and safety of, engaging in various kinds of physical activities. Team sports require other people being available. Etc., etc.

What I learned from the editorial accompanying the Lancet Psychiatry article

The brief editorial accompanying the article aroused my curiosity as to whether someone assigned to reading and commenting on this article would catch things that apparently the editor and reviewer missed.

Editorial commentators are chosen to praise, not to bury articles. There are strong social pressures to say nice things. However, this editorial leaked a number of serious concerns.

First

In presenting mental health as a workable, unified concept, there is a presupposition that it is possible and appropriate to combine all the various mental disorders as a single entity in pursuing this research. It is difficult to see the justification for this approach when these conditions differ greatly in their underlying causes, clinical presentation, and treatment. Dementia, substance misuse, and personality disorder, for example, are considered as distinct entities for research and clinical purposes; capturing them for study under the combined banner of mental health might not add a great deal to our understanding.

The problem here of categorisation is somewhat compounded by the repeated uncomfortable interchangeability between mental health and depression, as if these concepts were functionally equivalent, or as if other mental disorders were somewhat peripheral.

Then:

A final caution pertains to how studies approach a definition of exercise. In the current study, we see the inclusion of activities such as childcare, housework, lawn-mowing, carpentry, fishing, and yoga as forms of exercise. In other studies, these activities would be excluded for not fulfilling the definition of exercise as offered by the American College of Sports Medicine: “planned, structured and repetitive bodily movement done to improve or maintain one or more components of physical fitness.” 11 The study by Chekroud and colleagues, in its all-encompassing approach, might more accurately be considered a study in physical activity rather than exercise.

The authors were listening for a theme song with which they could promote their startup company in a very noisy data set. They thought they had a hit. I think they had noise.

The authors’ extraordinary disclosure of interests (see below this blog post) should have precluded publication of this seriously flawed piece of work, either simply by reason of the high likelihood of bias, or because it should have prompted the editor and reviewers to look more carefully at the serious flaws hiding in plain sight.

Postscript: Send in the trolls.

On Twitter, Adam Chekroud announced he felt no need to respond to critics. Instead, he retweeted and “liked” trolling comments directed at critics from the twitter accounts of his brother, his mother, and even the official Twitter account of a local fried chicken joint @chickenlodge, that offered free food for retweets and suggested including Adam Chekroud’s twitter handle if you wanted to be noticed.

chicken lodge

Really, Adam, if you can’t stand the heat, don’t go near where they are frying chicken.

The Declaration of Interests from the article.

declaration of interest 1

declaration of interest 2

 

Hazards of pointing out bad meta-analyses of psychological interventions

 

A cautionary tale

Psychology has a meta-analysis problem. And that’s contributing to its reproducibility problem. Meta-analyses are wallpapering over many research weaknesses, instead of being used to systematically pinpoint them. – Hilda Bastian

  • Meta-analyses of psychological interventions are often unreliable because they depend on a small number of poor quality, underpowered studies.
  • It is surprisingly easy to screen the studies being assembled for a meta-analysis and quickly determine that the literature is not suitable because it does not have enough quality studies. Apparently, the authors of many published meta-analyses either did not undertake such a brief assessment or were not deterred by its results from proceeding anyway.
  • We can’t tell how many efforts at meta-analyses were abandoned because of the insufficiencies of the available literature. But we can readily see that many published meta-analyses offer summary effect sizes for interventions that can’t be expected to be valid or generalizable.
  • We are left with a glut of meta-analyses of psychological interventions that convey inflated estimates of the efficacy of interventions and on this basis, make unwarranted recommendations that broad classes of interventions are ready for dissemination.
  • Professional organizations and promoters of particular treatments have strong vested interests in portraying their psychological interventions as effective. They will use their resources to resist efforts to publish critiques of their published meta-analyses and even fight the teaching of basic critical skills for appraising meta-analysis.
  • Publication of thorough critiques has little or no impact on the subsequent citation or influence of meta-analyses; such critiques are largely ignored.
  • Debunking bad meta-analyses of psychological interventions can be frustrating at best, and, at worst, hazardous to careers.
  • You should engage in such activities if you feel it is right to do so. It will be a valuable learning experience. And you can only hope that someone at some point will take notice.

3 Simple screening questions to decide whether a meta-analysis is worth delving into.

I’m sick and tired of spending time trying to make sense of meta-analyses of psychological interventions that should have been dismissed out of hand. The likelihood of any contribution to the literature was ruled out by repeated, gross misapplication of meta-analysis by some authors or, more often, by the pathetic quality and quantity of the literature available for meta-analysis.

Just recently, Retraction Watch reported the careful scrutiny of a pair of meta-analyses by two psychology graduate students, Paul-Christian Bürkner and Donald Williams. Coverage in Retraction Watch focused on their inability to get credit for the retraction of one of the papers, which had occurred because of their critique.

But I was more saddened by their having spent so much time on the second meta-analysis, “A meta-analysis and theoretical critique of oxytocin and psychosis: Prospects for attachment and compassion in promoting recovery.” The authors of this meta-analysis had themselves acknowledged that the literature was quite deficient, but they proceeded anyway and published a paper that has already been cited 13 times.

The graduate students, as well as the original authors, could simply have taken a quick look at the study’s Table 1: the seven included studies had from 9 to 35 patients exposed to oxytocin. The study with 35 patients was an outlier. It also provided only a within-subject effect size, which should not have been entered into the meta-analysis with the results of the other studies.

The six remaining studies had an average sample size of 14 in the intervention group. I doubt that anyone would expect a study of psychotic patients inhaling oxytocin to generate a robust estimate of effect size with only 9, 10, or 11 patients. It is unclear why the original investigators stopped accruing patients when they did.

Without having specified their sample size ahead of time (there is no evidence that the investigators did), the original investigators could simply have stopped when a peek at the data revealed statistically significant findings, or they could have kept accruing patients when a peek revealed only nonsignificant findings. Or they could have dropped some patients. Regardless, the reported samples are so small that adding only one or two more patients could substantially change the results.
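The problem with this kind of “peeking” is easy to demonstrate with a quick simulation. This is only an illustrative sketch: the maximum sample size and the peeking schedule below are invented, not taken from any of the studies discussed here. With no true effect at all, repeatedly testing the accumulating data and stopping at the first p < .05 inflates the false-positive rate well above the nominal 5%.

```python
# Simulation of "peeking" at accumulating data under the null hypothesis.
# Illustrative sketch: max_n and peek_every are invented for illustration.
import random
from math import sqrt, erf

def p_two_sided(z):
    """Two-sided p-value of a z statistic (normal approximation)."""
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def peeking_trial(rng, max_n=30, peek_every=2):
    """Accrue patients one at a time, test after every few, and stop at
    the first 'significant' result. Returns True if the trial ends with
    a false positive (the true mean is 0, so any rejection is false)."""
    xs = []
    for _ in range(max_n):
        xs.append(rng.gauss(0, 1))
        n = len(xs)
        if n >= 6 and n % peek_every == 0:
            z = (sum(xs) / n) * sqrt(n)  # one-sample z-test against 0
            if p_two_sided(z) < 0.05:
                return True
    return False

rng = random.Random(1)
false_positive_rate = sum(peeking_trial(rng) for _ in range(4000)) / 4000
print(false_positive_rate)  # well above the nominal 0.05
```

The more looks investigators take, the further the false-positive rate drifts above 5%, which is why samples accrued without a prespecified stopping rule cannot be taken at face value.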

Furthermore, if the investigators were struggling to get enough patients, the study was probably under-resourced and compromised in other ways. Small sample sizes compound the problems posed by poor methodology and reporting. The authors conducting this particular meta-analysis could confirm for only one of the studies that data from all patients who were randomized were analyzed, i.e., that there was an intention-to-treat analysis. Reporting was that bad, and worse. Again, think of the effects of losing the data of one or a few patients from the analysis; the loss could be decisive for the results, particularly when it was not random.

Overall, the authors of the original meta-analysis conceded that the seven studies they were entering into the meta-analyses had a high risk of bias.

It should be apparent that authors cannot take a set of similarly flawed studies and integrate their effect sizes with a meta-analysis and expect to get around the limitations. Bottom line – readers should just dismiss the meta-analysis and get on to other things…

These well-meaning graduate students were wasting their time and talent carefully scrutinizing a pair of meta-analyses that were unworthy of their sustained attention. Think of what they could be doing more usefully. There is so much other bad science out there to uncover.

Everybody – I recommend not putting a lot of effort into analyzing obviously flawed meta-analyses, other than maybe posting a warning notice on PubMed Commons or ranting in a blog post, or both.

Detecting Bad Meta-Analyses

Over a decade ago, I developed some quick assessment tools by which I can reliably determine that some meta-analyses are not worth our attention. You can see more about the quickly answered questions here.

To start such an assessment, go directly to the table describing the studies that were included in a published meta-analysis.

  1. Ask: “To what extent are the studies dominated by cell sample sizes of less than 35?” Studies of this size have only a power of .50 to detect a moderate-sized effect. So, even if an effect were present, it would be detected only 50% of the time if all studies were reported.
  2. Next, check to see whether whoever did the meta-analysis rated the included studies for risk of bias and how, if at all, risk of bias was taken into account in the meta-analyses.
  3. Finally, does the meta-analysis adequately deal with the clinical heterogeneity of the included studies? Is there a basis for giving a meaningful interpretation to a single summary effect size?

Combining studies may be inappropriate for a variety of the following reasons: differences in patient eligibility criteria in the included trials, different interventions and outcomes, and other methodological differences or missing information.  Moher et al., 1998

I have found that this quick exercise often reveals that meta-analyses of psychological interventions are dominated by underpowered studies of low methodological quality that produce positive effects for interventions at a greater rate than would be expected. There is little reason to proceed to calculate a summary effect size.
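The 50% power figure behind screening question 1 can be checked with a back-of-the-envelope calculation. This is a sketch using the normal approximation to the two-sample t-test, and it assumes “moderate effect” means Cohen’s d = 0.5 with a two-sided alpha of .05:

```python
# Approximate power of a two-sample comparison of means via the normal
# approximation: power ≈ Phi(d * sqrt(n/2) - z_crit).
# Assumption: "moderate effect" = Cohen's d of 0.5, two-sided alpha = .05.
from math import sqrt, erf

def normal_cdf(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

def approx_power(n_per_group, d=0.5, z_crit=1.96):
    return normal_cdf(d * sqrt(n_per_group / 2) - z_crit)

print(round(approx_power(35), 2))  # ~0.55: roughly a coin flip
print(round(approx_power(15), 2))  # ~0.28: typical of the small trials above
print(round(approx_power(64), 2))  # ~0.8 requires about 64 per group
```

In other words, a cell size of 35 really does leave the detection of a medium effect near chance, and the 9-to-15-patient cells common in these literatures do far worse.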

The potholed road from a presentation to a publication.

My colleagues and I applied these criteria in a 2008 presentation to a packed audience at the European Health Psychology Conference in Bath. We undertook a similar exercise with four meta-analyses of behavioral interventions for adults (Dixon, Keefe, Scipio, Perri, & Abernethy, 2007; Hoffman, Papas, Chatkoff, & Kerns, 2007; Irwin, Cole, & Nicassio, 2006; and Jacobsen, Donovan, Vadaparampil, & Small, 2007) that appeared in a new section of Health Psychology, Evidence-Based Treatment Reviews.

A sampling of what we found:

Irwin et al. The Irwin et al. meta-analysis had the stated objective of

comparing responses in studies that exclusively enrolled persons who were 55 years of age or older versus outcomes in randomized controlled trials that enrolled adults who were, on average, younger than 55 years of age (p. 4).

A quick assessment revealed that excluding small trials (n < 35) would have eliminated all studies of older adults; five studies included 15 or fewer participants per condition. Of the 15 studies of younger adults, only one would have remained.

Hoffman et al. We found that 17 of the 22 included studies fell below n = 35 per group. In response to our request, the authors graciously shared a table of the methodological quality of the included studies.

In 60% of the studies, intervention and control groups were not comparable on key variables at baseline.

Less than half provided adequate information concerning number of patients enrolled, treatment drop-out and reasons for drop-outs.

Only 15% of trials provided intent-to-treat analyses.

In a number of studies, the psychological intervention was part of a multicomponent package, so that its unique contribution could not be determined. Often the psychological intervention was minimal. For instance, one study noted: “a lecture to give the patient an understanding that ordinary physical activity would not harm the disk and a recommendation to use the back and bend it.”

The only studies comparing a psychological intervention to an active control condition were three underpowered studies in which the effects of the psychological component could not be separated from the rest of the package in which it was embedded. In one of the studies, massage was the psychological intervention, but in another, it was the control condition.

Nonetheless, Hoffman et al. concluded: “The robust nature of these findings should encourage confidence among clinicians and researchers alike.”

As I readily demolished the meta-analyses to the delight of the audience, I remarked something to the effect that I was glad the editor of Health Psychology was not there to hear what I was saying about articles published in the journal he edits.

But Robert Kaplan was there. He invited me for a beer as I left the symposium. He said that such critical probing was sorely lacking in the journal, and he invited my colleagues and me to submit an article. Eventually it would be published as:

Coyne JC, Thombs BD, Hagedoorn M. Ain’t necessarily so: Review and critique of recent meta-analyses of behavioral medicine interventions in health psychology. Health Psychology. 2010 Mar;29(2):107.

However, Kaplan first had an Associate Editor send the manuscript out for review. The manuscript was rejected on the basis of a pair of reviews that were not particularly informative. One reviewer stated:

The authors level very serious accusations against fellow scientists and claim to have identified significant shortcomings in their published work. When this is done in public, the authors must have done their homework, dotted all the i’s, and crossed all the t’s. Instead, they reveal “we do not redo these meta-analyses or offer a comprehensive critique, but provide a preliminary evaluation of the adequacy of the conduct, reporting and clinical recommendations of these meta-analyses”. To be frank, this is just not enough when one accuses colleagues of mistakes, poor judgment, false inferences, incompetence, and perhaps worse.

In what he would later describe as the only time he did this in his term as editor of Health Psychology, Bob Kaplan overruled the unanimous recommendations of his associate editor and the two reviewers. He accepted a revision of our manuscript in which we tried to be clearer about the bases of our judgments.

According to Google Scholar, our “Ain’t necessarily so…” has been cited 53 times. Apparently it had little effect on the reception of the four meta-analyses. Hoffman et al. has been cited 599 times.

From a well-received workshop to a workshop canceled in order to celebrate a bad meta-analysis.

Mariët Hagedoorn and I gave a well-received workshop at the annual meeting of the Society for Behavioral Medicine the next year. A member of SBM’s Evidence-Based Behavioral Medicine Committee invited us to the committee meeting held immediately after the workshop. We were invited to give the workshop again in two years, and I became a member of the committee. I offered to be involved in future meta-analyses, learning that a number were planned.

I actually thought that I was involved in a meta-analysis of interventions for depressive symptoms among cancer patients. I immediately identified a study of problem-solving therapy for cancer patients with effect sizes so improbably large that it should be excluded from any meta-analysis as an extreme outlier. The suggestion was appreciated.

But I heard nothing further about the meta-analysis until I was contacted by one of the authors, who said that my permission was needed for me to be acknowledged in the accepted manuscript. I refused. When I finally saw the published version of the manuscript in the prestigious Journal of the National Cancer Institute, I published a scathing critique, which you can read here. My critique has so far been cited once; the meta-analysis, eighty times.

Only a couple of months before our workshop was scheduled to occur, I was told it had been canceled in order to clear the schedule for full press coverage of a new meta-analysis. I learned of this only when I emailed the committee concerning the specific timing of the workshop. The reply came from the first author of the new meta-analysis.

I have subsequently made the case that that meta-analysis was horribly done and horribly misleading to consumers in two blog posts:

Faux Evidence-Based Behavioral Medicine at Its Worst (Part I)

Faux Evidence-Based Behavioral Medicine Part 2

Some highlights:

The authors boasted of “robust findings” of “substantial rigor” in a meta-analysis that provided “strong evidence for psychosocial pain management approaches.” They claimed their findings supported the “systematic implementation” of these techniques.

The meta-analysis depended heavily on small trials. Of the 38 trials, 19 had fewer than 35 patients in the intervention or control group and so would have been excluded by application of this criterion.

Some of the smaller trials were quite small. One had 7 patients receiving an education intervention; another had 10 patients getting hypnosis; another, 15 patients getting education; another, 15 patients getting self-hypnosis; and still another, 8 patients getting relaxation and 8 patients getting CBT plus relaxation.

Two of what were by far the largest trials should have been excluded because they involved a complex intervention. Patients received telephone-based collaborative care, which had a number of components, including support for adherence to medication.

It appears that listening to music, being hypnotized during a medical procedure, and being taught self-hypnosis over 52 sessions all fall under the rubric of skills training. Similarly, interactive educational sessions are considered equivalent to passing out informational materials, simple pamphleteering.

But here’s what most annoyed me about clinical and policy decisions being made on the basis of this meta-analysis:

Perhaps most importantly from a cancer pain control perspective, there was no distinguishing of whether the cancer pain was procedural, acute, or chronic. These types of pain take very different management strategies. In preparation for surgery or radiation treatment, it might be appropriate to relax or hypnotize the patient or provide soothing music. The efficacy could be examined in a randomized trial. But the management of acute pain is quite different and best achieved with medication. Here is where the key gap exists between the known efficacy of medication and the poor control in the community, due to professional and particularly patient attitudes. Control of chronic pain, months after any painful procedures, is a whole different matter, and based on studies of noncancer pain, I would guess that here is another place for psychosocial intervention, but that should be established in randomized trials.

Getting shushed about the sad state of research on couples interventions for cancer patients

One of the psychologists present at the SBM meeting published a meta-analysis of couples interventions in which I was thanked for my input in an acknowledgment. I had not given permission, and this notice was subsequently retracted.

Ioana Cristea, Nilufer Kafescioglu, and I subsequently submitted a critique to Psycho-Oncology. We were initially told it would be accepted as a letter to the editor, but it was then subjected to an extraordinary six uninformative reviews and rejected. The article that we critiqued was given special status as a featured article and distributed free by the otherwise paywalled journal.

A version of our critique was relegated to a blog post.

The complicated politics of meta-analyses supported by professional organizations.

Starting with our “Ain’t necessarily so…” effort, we were taking aim at meta-analyses making broad, enthusiastic claims about the efficacy and readiness for dissemination of psychological interventions. The Society for Behavioral Medicine was enjoying a substantial increase in membership, but as in other associations dominated by psychologists, the new members were primarily clinicians, not academic researchers. SBM wanted to offer a branding of “evidence-based” to the psychological interventions for which the clinicians were seeking reimbursement. At the time, insurance companies were challenging whether licensed psychologists should be reimbursed for psychological interventions that were not administered to patients with psychiatric diagnoses.

People involved with the governance of SBM at the time cannot help but be aware of an ugly side to the politics back then. A small amount of money had been given by NCI to support meta-analyses and it was quite a struggle to control its distribution. That the SBM-sponsored meta-analyses were oddly published in the APA journal, Health Psychology, rather than SBM’s Annals of Behavioral Medicine reflected the bid for presidency of APA’s Division of Health Psychology by someone who had been told that she could not run for president of SBM. But worse, there was a lot of money and undeclared conflicts of interest in play.

Someone originally involved in the meta-analysis of interventions for depressive symptoms among cancer patients had received a $10 million grant from Pfizer to develop a means of monitoring cancer surgeons’ inquiring about psychological distress and their offering of interventions. The idea (which was actually later mandated) was that cancer surgeons could not close their electronic records until they had indicated that they had asked the patient about psychological distress. If the patient reported distress, the surgeons had to indicate what intervention was offered to the patient. Only then could they close the medical record. Of course, these requirements could be met simply by asking if a breast cancer patient was distressed and offering her an antidepressant without any formal diagnosis or follow-up. These procedures were mandated as part of accreditation of facilities providing cancer care.

Psycho-Oncology, the journal with which we skirmished about the meta-analysis of couples interventions, was the official publication of the International Psycho-Oncology Society, another organization dominated by clinicians seeking reimbursement for services to cancer patients.

You can’t always get what you want.

I nonetheless encourage others, particularly early-career investigators, to take up the tools that I offer. Please scrutinize meta-analyses that would otherwise have clinical and public policy recommendations attached to their findings. You may have trouble getting published, and you will be sorely disappointed if you expect to influence the reception of already published meta-analyses. You can always post your critiques at PubMed Commons.

You will learn important skills and the politics of trying to publish critiques of papers that are protected as having been “peer reviewed.” If enough of you do this and visibly complain about how ineffectual your efforts have been, we may finally overcome the incumbent advantage and protection from further criticism that goes with getting published.

And bloggers like myself and Hilda Bastian will recognize you and express appreciation.

 

 

Should have seen it coming: Once high-flying Psychological Science article lies in pieces on the ground

Life is too short for wasting time probing every instance of professional organizations promoting bad science when they have an established record of doing just that.

There were lots of indicators that this was what we were dealing with in the recent campaign by the Association for Psychological Science (APS) for the now discredited and retracted ‘sadness prevents us from seeing blue’ article.

A quick assessment of the press release should have led us to dismiss the claims being presented and convinced us to move on.

Readers can skip my introductory material by jumping down this blog post to [*] to see my analysis of the APS press release.

Readers can also still access the original press release, which has now disappeared from the web, here. Some may want to read the press release and form their own opinions before proceeding into this blog post.

What, I’ve stopped talking about the PACE trial? Yup, at least at Mind the Brain, for now. But you can go here for the latest in my continued discussion of the PACE trial of CBT for chronic fatigue syndrome, in which I moved from critical observer to activist a while ago.

Before we were so rudely interrupted by the bad science and bad media coverage of the PACE trial, I was focusing on how readers can learn to make quick assessments of hyped media coverage of dubious scientific studies.

In “Sex and the single amygdala”  I asked:

Can skeptics who are not specialists, but who are science-minded and have some basic skills, learn to quickly screen and detect questionable science in the journals and its media coverage?

The counterargument, of course, is Chris Mooney telling us “You Have No Business Challenging Scientific Experts”. He cites

Jenny McCarthy, who once remarked that she began her autism research at the “University of Google.”

But while we are on the topic of autism, how about the counterexample of The Lancet’s coverage of the link between vaccines and autism? This nonsense continues to take its toll on American children whose parents, often higher-income and more educated than the rest, refused to vaccinate them on the basis of a story that started in The Lancet. Editor Richard Horton had to concede:

horton on lancet autism failure

 

 

 

If we accept Chris Mooney‘s position, we are left at the mercy of press releases cranked out by professional organizations like the Association for Psychological Science (APS) that repeatedly demand we revise our thinking about human nature and behavior, as well as change our behavior if we want to extend our lives and live happier, all on the basis of a single “breakthrough” study. Rarely do APS press releases have any follow-up as to the fate of a study they promoted. One has to hope that PubPeer or PubMed Commons picks up on the article touted in the press release so we can see what a jury of post-publication peers decides.

As we have seen in my past Mind the Brain posts, there are constant demands on our attention from press releases generated from professional organizations, university press officers, and even NIH alerting us to supposed breakthroughs in psychological and brain science. Few such breakthroughs hold up over time.

Are there no alternatives?

Are there no alternatives to our simply deferring to the expertise being offered or taking the time to investigate for ourselves claims that are likely to prove exaggerated or simply false?

We should approach press releases from the APS – or from its rival American Psychological Association – using prior probabilities to set our expectations. The Open Science Collaboration: Psychology (OSC) article  in Science presented results of a systematic attempt to replicate 100 findings from prestigious psychological journals, including APS’ s Psychological Science and APA’s Journal of Personality and Social Psychology. Less than half of the findings were replicated. Findings from the APS and APA journals fared worse than the others.

So, our prior probabilities are that declarations of newsworthy, breakthrough findings trumpeted in press releases from psychological organizations are likely to be false or exaggerated – unless we assume that the publicity machines prefer the trustworthy over the exciting and newsworthy in the articles they select to promote.

I will guide readers through a quick assessment of the APS press release, which I started on this post before getting swept up in the PACE controversy. In the intervening time, however, there have been some extraordinary developments, which I will then briefly discuss. We can use these developments to validate my evaluation, and yours, of the press release that was available earlier. Surprisingly, there is little overlap between the issues I note in the press release and what concerned post-publication commentators.

*A running commentary based on screening the press release

What once was a link to the “feeling blue and seeing blue” article now takes one only to

retraction press release

Fortunately, the original press release can still be reached here. The original article is preserved here.

My skepticism was already high after I read the opening two paragraphs of the press release:

The world might seem a little grayer than usual when we’re down in the dumps and we often talk about “feeling blue” — new research suggests that the associations we make between emotion and color go beyond mere metaphor. The results of two studies indicate that feeling sadness may actually change how we perceive color. Specifically, researchers found that participants who were induced to feel sad were less accurate in identifying colors on the blue-yellow axis than those who were led to feel amused or emotionally neutral.

“Our results show that mood and emotion can affect how we see the world around us,” says psychology researcher Christopher Thorstenson of the University of Rochester, first author on the research. “Our work advances the study of perception by showing that sadness specifically impairs basic visual processes that are involved in perceiving color.”

What Anglocentric nonsense. First, blue as a metaphor for sadness does not occur across most languages other than English and Serbian. In German, to call someone blue is to suggest the person is drunk. In Russian, you are suggesting that the person is gay. In Arabic, if you say you are having a blue day, it is a bad one. But if you say in Portuguese that “everything is blue,” you are suggesting everything is fine.

In Indian culture, blue is more associated with happiness than sadness, probably traceable to the blue-skinned Krishna being associated with divine and human love in Hinduism. In Catholicism, the Virgin Mary is often depicted wearing blue, and so the color has come to be associated with calmness and truth.

We are off to a bad start. Going to the authors’ description of their first of two studies, we learn:

In one study, the researchers had 127 undergraduate participants watch an emotional film clip and then complete a visual judgment task. The participants were randomly assigned to watch an animated film clip intended to induce sadness or a standup comedy clip intended to induce amusement. The emotional effects of the two clips had been validated in previous studies and the researchers confirmed that they produced the intended emotions for participants in this study.

Oh no! This is not a study of clinical depression, but another study of normal college students “made sad” with a mood induction.

So-called mood induction tasks don’t necessarily change actual mood state, but they do convey to research participants what is expected of them and how they are supposed to act. In one of the earliest studies I ever did, we described a mood induction procedure to subjects without actually having them experience it, and then asked them to respond as if they had received it. Their responses were indistinguishable from those of subjects who actually underwent the induction. We concluded that we could not rule out that what were considered effects of a mood induction task were simply demand characteristics: what research participants perceive as instructions about how they should behave.

It was fashionable way back then for psychology researchers who were isolated in departments without access to clinically depressed patients to claim that they were nonetheless conducting analog studies of depression. Subjecting students to an unsolvable anagram task or uncontrollable loud noises was seen as inducing learned helplessness in them, thereby allowing investigators an analog study of depression. We demonstrated a problem with that idea. If students believed that the next task they were administered was part of the same experiment, they performed poorly, as if they were in a state of learned helplessness or depression. However, if they believed that the second task was unrelated to the first, they showed no such deficits. Their negative state of helplessness or depression was confined to their performance in what they thought was the same setting in which the induction had occurred. Shortly after our experiments, Marty Seligman wisely stopped doing studies “inducing” learned helplessness in humans, but he continued to make the same claims about the studies he had done.

Analog studies of depression disappeared for a while, but I guess they have come back into fashion.

But the sad/blue experiment could also be seen as a priming experiment. The research participants were primed by the film clip, and their response to a color-naming task was then examined.

It is fascinating that neither the press release nor the article itself ever mentioned the word priming. It was only a few years ago that APS press releases were crowing about priming studies. For instance, a 2011 press release entitled “Life is one big priming experiment…” declared:

One of the most robust ideas to come out of cognitive psychology in recent years is priming. Scientists have shown again and again that they can very subtly cue people’s unconscious minds to think and act certain ways. These cues might be concepts—like cold or fast or elderly—or they might be goals like professional success; either way, these signals shape our behavior, often without any awareness that we are being manipulated.

Whoever wrote that press release should be embarrassed today. In the interim, priming effects have not proven robust. Priming studies that cannot be replicated have figured heavily in the assessment that the psychological literature is untrustworthy. Priming studies also figure heavily in the 56 retracted studies of fraudster psychologist Diederik Stapel. He claims that he turned to inventing data when his experiments failed to demonstrate priming effects that he knew were there. Yet, once he resorted to publishing studies with fabricated data, others claimed to replicate his work.

I made up research, and wrote papers about it. My peers and the journal editors cast a critical eye over it, and it was published. I would often discover, a few months or years later, that another team of researchers, in another city or another country, had done more or less the same experiment, and found the same effects.  My fantasy research had been replicated. What seemed logical was true, once I’d faked it.

So, we have an APS press release reporting a study that assumes the association between sadness and the color blue is so hardwired and culturally universal that it is reflected in basic visual processes. Yet the study does not involve clinical depression, only an analog mood induction, and a closer look reveals that once again APS is pushing a priming study. I think it’s time to move on. But let’s read on:

The results cannot be explained by differences in participants’ level of effort, attention, or engagement with the task, as color perception was only impaired on the blue-yellow axis.

“We were surprised by how specific the effect was, that color was only impaired along the blue-yellow axis,” says Thorstenson. “We did not predict this specific finding, although it might give us a clue to the reason for the effect in neurotransmitter functioning.”

The researchers note that previous work has specifically linked color perception on the blue-yellow axis with the neurotransmitter dopamine.

The press release tells us that the finding is very specific, occurring only on the blue-yellow axis, not the red-green axis, and that differences are not found in level of effort, attention, or engagement with the task. The researchers did not expect such a specific finding; they were surprised.

The press release wants to convince us of an exciting story of novelty and breakthrough. A skeptic sees it differently: this is an isolated finding, unanticipated by the researchers, getting all dressed up. See, we should’ve moved on.

The evidence is meant to excite us precisely because it is specific and novel. The researchers are celebrating the specificity of their finding, but the blue-yellow axis effect may be the only statistically significant one simply because of chance or an artifact.

And bringing up unmeasured “neurotransmitter functioning” is pretentious and unwise. I challenge the researchers to show that the effects of watching a brief movie clip register in measurable changes in neurotransmitters. I’m skeptical even whether persons drawn from the community or outpatient samples reliably differ from non-depressed persons in measures of the neurotransmitter dopamine.

“This is new work and we need to take time to determine the robustness and generalizability of this phenomenon before making links to application,” he concludes.

Claims in APS press releases are not known for their “robustness and generalizability.” I don’t think this particular claim should prompt an effort at independent replication when scientists have so many more useful things to keep them busy.

Maybe, these investigators should have checked robustness and generalizability before rushing into print. Maybe APS should stop pestering us with findings that surprise researchers and that have not yet been replicated.

A flying machine in pieces on the ground

Sadness impairs color perception was sent soaring high, lifted by an APS press release now removed from the web, but still available here. The press release was initially echoed uncritically, usually cut-and-pasted or outright churnaled, in over two dozen media mentions.

But, alas, Sadness impairs color perception is now a flying machine in pieces on the ground.

Notice of the article’s problems seems to have started with chatter among skeptically-minded individuals on Twitter, which led to comments at PubPeer, where the article was torn to pieces. What unfolded was a wonderful demonstration of crowdsourced post-publication peer review in action. Lesson: PubPeer rocks and can overcome the failures of pre-publication peer review to keep bad stuff out of the literature.

You can follow the thread of comments at PubPeer.

  • An anonymous skeptic started off by pointing out an apparent lack of a significant statistical effect where one was claimed.
  • There was an immediate call for a retraction, but it seemed premature.
  • Soon re-analyses of the data from the paper were being reported, confirming the lack of a significant statistical effect when analyses were done appropriately and reported transparently.
  • The data set for the article was mysteriously changed after it had been uploaded.
  • Doubts were expressed about the integrity of the data – had they been tinkered with?
  • The data disappeared.
  • There was an announcement of a retraction.

The retraction notice indicated that the researchers were still convinced of the validity of their hypothesis, despite deciding to retract their paper.

We remain confident in the proposition that sadness impairs color perception, but would like to acquire clearer evidence before making this conclusion in a journal the caliber of Psychological Science.

The retraction note also carries a curious Editor’s note:

Although I believe it is already clear, I would like to add an explicit statement that this retraction is entirely due to honest mistakes on the part of the authors.

Since then, doubts have been expressed about whether retraction was a sufficient response or whether something more is needed. Some of the participants in the PubPeer discussion drafted a letter to the editor incorporating their reanalyses and prepared to submit it to Psychological Science. Unfortunately, having succeeded in getting the bad science retracted, these authors reduced the likelihood of their reanalysis being accepted by Psychological Science. As of this date, their fascinating account remains unpublished but available on the web.

Postscript

Next time you see an APS or APA press release, what will be your starting probabilities about the trustworthiness of the article being promoted? Do you agree with Chris Mooney that you should simply defer to the expertise of the professional organization?

Why would professional organizations risk embarrassment with these kinds of press releases? Apparently they are worth the risk. Such press releases can echo through the conventional and social media and attract early attention to an article. The game is increasing the journal impact factor (JIF).

Although it is unclear precisely how journal impact factors are calculated, the number reflects the average number of citations an article obtains within two years of publication. However, if press releases promote “early releases” of articles, the journal can acquire citations before the clock starts ticking on those two years. APS and APA are in intense competition for the prestige of their journals and memberships. It matters greatly to them which organization can claim the most prestigious journals, as demonstrated by their JIFs.
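The arithmetic behind the two-year JIF is simple enough to sketch. Assuming the standard definition (citations received in year Y to items published in years Y−1 and Y−2, divided by the citable items published in those two years), a toy calculation with made-up numbers shows why citations harvested early matter:

```python
def two_year_jif(citations_in_year, citable_items):
    """Two-year JIF: citations received in year Y to items published in
    years Y-1 and Y-2, divided by citable items from those two years."""
    return citations_in_year / citable_items

# Hypothetical journal: 200 citable items across the two prior years.
baseline = two_year_jif(citations_in_year=1000, citable_items=200)

# If press-released "early releases" pick up 150 extra citations that
# land inside the two-year window, the JIF climbs accordingly.
boosted = two_year_jif(citations_in_year=1150, citable_items=200)

print(baseline, boosted)  # 5.0 5.75
```

The denominator stays fixed, so any citations that an early-release article banks before formal publication are pure gain for the numerator.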

So, press releases are important for garnering early attention. Apparently breakthroughs, innovations, and “first ever” matter more than trustworthiness. The professional organizations hope we won’t remember the fate of past claims.

 

Sex and the single amygdala: A tale almost saved by a peek at the data

So sexy! Was bringing up ‘risky sex’ merely a strategy to publish questionable and uninformative science?

My continuing question: Can skeptics who are not specialists, but who are science-minded and have some basic skills, learn to quickly screen and detect questionable science in the journals and media coverage?

“You don’t need a weatherman to know which way the wind blows.” – Bob Dylan

I hope so. One goal of my blogging is to arouse readers’ skepticism and provide them some tools so that they can decide for themselves what to believe, what to reject, and what needs a closer look or a check against trusted sources.

Skepticism is always warranted in science, but it is particularly handy when confronting the superficial application of neuroscience to every aspect of human behavior. Neuroscience is increasingly being brought into conversations to sell ideas and products when it is neither necessary nor relevant. Many claims about how the brain is involved are false or exaggerated not only in the media, but in the peer-reviewed journals themselves.

A while ago I showed how a neuroscientist and a workshop guru teamed up to try to persuade clinicians with functional magnetic resonance imaging (fMRI) data that a couples therapy was more sciencey than the rest. Although I took a look at some complicated neuroscience, a lot of my reasoning [1, 2, 3] merely involved applying basic knowledge of statistics and experimental design. I raised sufficient skepticism to dismiss the neuroscientist and psychotherapy guru’s claims, even putting aside the excellent specialist insights provided by Neurocritic and his friend Magneto.

In this issue of Mind the Brain, I’m pursuing another tip from Neurocritic about some faulty neuroscience in need of debunking.

The paper

Victor, E. C., Sansosti, A. A., Bowman, H. C., & Hariri, A. R. (2015). Differential Patterns of Amygdala and Ventral Striatum Activation Predict Gender-Specific Changes in Sexual Risk Behavior. The Journal of Neuroscience, 35(23), 8896-8900.

Unfortunately, the paper is behind a paywall. If you can’t get it through a university library portal, you can send a request for a PDF to the corresponding author, elizabeth.victor@duke.edu.

The abstract

Although the initiation of sexual behavior is common among adolescents and young adults, some individuals express this behavior in a manner that significantly increases their risk for negative outcomes including sexually transmitted infections. Based on accumulating evidence, we have hypothesized that increased sexual risk behavior reflects, in part, an imbalance between neural circuits mediating approach and avoidance in particular as manifest by relatively increased ventral striatum (VS) activity and relatively decreased amygdala activity. Here, we test our hypothesis using data from seventy 18- to 22-year-old university students participating in the Duke Neurogenetics Study. We found a significant three-way interaction between amygdala activation, VS activation, and gender predicting changes in the number of sexual partners over time. Although relatively increased VS activation predicted greater increases in sexual partners for both men and women, the effect in men was contingent on the presence of relatively decreased amygdala activation and the effect in women was contingent on the presence of relatively increased amygdala activation. These findings suggest unique gender differences in how complex interactions between neural circuit function contributing to approach and avoidance may be expressed as sexual risk behavior in young adults. As such, our findings have the potential to inform the development of novel, gender-specific strategies that may be more effective at curtailing sexual risk behavior.

My thought processes

Hmm, sexual risk behavior – meaning number of partners? How many new partners during a follow-up period constitutes “risky,” and does it matter whether safe sex was practiced? Well, ignoring these issues and calling it “sexual risk behavior” allows the authors to claim relevance to hot topics like HIV prevention….

But let’s cut to the chase: I’m always skeptical about a storyline depending on a three-way statistical interaction. These effects are highly unreliable, particularly in a sample of only N = 70. I’m suspicious when investigators stake their claims ahead of time on a three-way interaction, not something simpler. I will be looking for evidence that they started with this hypothesis in mind, rather than cooking it up after peeking at the data.

Three-way interactions involve dividing a sample up into eight boxes, in this case 2 × 2 × 2. Such interactions can be mind-boggling to interpret, and this one is no exception:

Although relatively increased VS activation predicted greater increases in sexual partners for both men and women, the effect in men was contingent on the presence of relatively decreased amygdala activation and the effect in women was contingent on the presence of relatively increased amygdala activation.

And then the “simple” interpretation?

These findings suggest unique gender differences in how complex interactions between neural circuit function contributing to approach and avoidance may be expressed as sexual risk behavior in young adults.

And the public health implications?

As such, our findings have the potential to inform the development of novel, gender-specific strategies that may be more effective at curtailing sexual risk behavior.

Just how should these data inform public health strategies beyond what we knew before we stumbled upon this article? Really, should we stick people’s heads in a machine and gather fMRI data before offering them condoms? Should we encourage computer dating services to post, along with a recent headshot, recent fMRI images showing that prospective dates do not have their risky behavior center in the amygdala activated? Or encourage young people to get their heads examined with an fMRI before deciding whether it’s wise to sleep with somebody new?

So it’s difficult to see the practical relevance of these findings, but let’s stick around and consider the paragraph that Neurocritic singled out.

The paragraph

The majority of the sample reported engaging in vaginal sex at least once in their lifetime (n = 42, 60%). The mean number of vaginal sexual partners at baseline was 1.28 (SD =0.68). The mean increase in vaginal sexual partners at the last follow-up was 0.71 (SD = 1.51). There were no significant differences between men and women in self-reported baseline or change in self-reported number of sexual partners (t=0.05, p=0.96; t=1.02, p= 0.31, respectively). Although there was not a significant association between age and self-reported number of partners at baseline (r = 0.17, p= 0.16), younger participants were more likely to report a greater increase in partners over time (r =0.24, p =0.04). Notably, distribution analyses revealed two individuals with outlying values (3 SD from M; both subjects reported an increase in 8 partners between baseline and follow up). Given the low rate of sexual risk behavior reported in the sample, these outliers were not excluded, as they likely best represent young adults engaging in sexual risk behavior.

What triggers skepticism?

This paragraph is quite revealing if we just ponder it a bit.

First, notice there is only a single significant correlation (p = .04), and it comes from an exploratory analysis. Differences between men and women were examined, with no significant findings for either baseline or change in number of sexual partners over the length of the observation. The authors then went on to explore changes in number of partners over time in relation to age and, bingo, there was their p = .04.

Whoa! Age was never mentioned in the abstract. We are now beyond the 2 x 2 x 2 interaction mentioned in the abstract and rooting through another dimension, younger versus older.

But, worse, getting that significance required retaining two participants with eight new sexual partners each during the follow-up period. The decision to retain these participants was made after the pattern of results was examined with and without the outliers included. The authors say so, and essentially admit that they decided because it made a better story.

The only group means and standard deviations reported included these two participants. Even with them included, the average number of new sexual partners was less than one over the follow-up. We have no idea whether that one partner was risky or not. It’s a safer assumption that having eight new partners is risky, but even that we don’t know for sure.

Keep in mind for future reference: Investigators are supposed to make decisions about outliers without reference to the fate of the hypothesis being studied. And knowing nothing about this particular study, most authorities would say if two people out of 70 are way out there on a particular variable that otherwise has little variance, you should exclude them.

It is considered a Questionable Research Practice to make inclusion/exclusion decisions based on what story the outcome allows the authors to tell. It is p-hacking and significance-chasing.

And note the distribution of numbers of vaginal sex partners. Twenty-eight participants had none at the end of the study. Most accumulated fewer than one new partner during the follow-up, and even that mean was distorted by the two participants with eight partners each. Hmm, it is going to be hard to get multivariate statistics to behave appropriately when we get to the fancy neuroscience data. We could go off on discussions of multivariate normal or Poisson distributions, or just think a bit.
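The distortion can be checked directly from the summary statistics reported in the quoted paragraph (M = 0.71, SD = 1.51, n = 70, two outliers at +8 each). This is a back-of-the-envelope sketch from those reported values, not a reanalysis of the raw data:

```python
n, mean, sd = 70, 0.71, 1.51   # summary statistics reported in the paper
outliers = [8, 8]              # the two retained outliers

# Total increase in partners implied by the reported mean.
total = mean * n               # ~49.7

# Mean recomputed with the two outliers removed.
trimmed_mean = (total - sum(outliers)) / (n - len(outliers))
print(round(trimmed_mean, 2))  # ~0.5 -- the outliers inflate the mean by roughly 40%

# How extreme is a value of 8 on this distribution?
z = (8 - mean) / sd
print(round(z, 1))             # ~4.8 SD, well beyond the "3 SD" cutoff the authors cite
```

Two cases out of seventy account for about a third of all the new partners in the sample, and each sits nearly five standard deviations from the mean.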

We can do a little detective work and determine that one outlier was a male, another a female. (*1) Let’s go back to our eight little boxes of participants that are involved in the interpretation of the three-way interaction. It’s going to make a great difference exactly where the deviant male and female are dropped into one of the boxes or whether they are left out.
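To see how little data each of those eight boxes holds, assume a median split on each factor (gender × high/low VS activation × high/low amygdala activation), which is one common way such interactions get probed; this split is an illustration, not the authors’ actual analysis. With 70 participants, only 24 of them men per the methods section, a single extreme case dominates its cell:

```python
# Illustrative 2 x 2 x 2 split: gender x high/low VS x high/low amygdala.
n_men, n_women = 24, 46

men_per_cell = n_men / 4        # 6.0 men per cell under an even split
women_per_cell = n_women / 4    # 11.5 women per cell

# A hypothetical cell of six men, five of whom gained 0-1 new partners,
# plus the male outlier with 8 new partners:
cell = [0, 0, 1, 1, 1, 8]
mean_with = sum(cell) / len(cell)                 # ~1.83
mean_without = sum(cell[:-1]) / (len(cell) - 1)   # 0.6

print(mean_with, mean_without)
```

One participant roughly triples the cell mean, so which box each outlier lands in (or whether they are excluded) can flip the apparent pattern of the interaction.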

And think about sampling issues. What if, for reasons having nothing to do with the study, neither of these outliers had shown up? Or if only one of them had shown up, it would have skewed the results in a particular direction, depending on whether that participant was the male or the female.

Okay, if we were wasting our time continuing to read the article after finding what we did in the abstract, we are certainly wasting more of our time by continuing after reading this paragraph. But let’s keep poking around as an educational exercise.

The rest of the methods and results sections

We learn from the methods section that there was an ethnically diverse sample with a highly variable follow-up, from zero days to 3.19 years (M = 188.72 d, SD = 257.15; range = 0 d–3.19 years). And there were only 24 men in the paper’s original sample of 70 participants.

We don’t know whether these two outliers had eight sexual partners within a week of the first assessment or were the ones captured by extending the study to almost four years. That matters somewhat, but we also have to worry whether this sample – with so few participants in the first place and even fewer who had sex by the end of the study – and this length of follow-up were appropriate for such a study. The mean follow-up of about six months, with its huge standard deviation, suggests there is not a lot of evidence of risky behavior, at least in terms of casual vaginal sex.

This is all getting very funky.

So I wondered about the larger context of the study, with increasing doubts that the authors had gone to all this trouble just to test an a priori hypothesis about risky sex.

We are told that the larger context is the ongoing “Duke Neurogenetics Study (DNS), which assesses a wide range of behavioral and biological traits.” The extensive list of inclusions and exclusions suggests a much more ambitious study. If we had more time, we could look up the Duke Neurogenetics Study and see if that’s the case. But I have a strong suspicion that the study was not organized around the specific research questions of this paper (*2). I really can’t tell without preregistration of this particular paper, but I certainly have questions about how much Hypothesizing After the Results are Known (HARKing) is going on here in the refining of hypotheses and measures, and in decisions about which data to report.

Further explorations of the results section

I remind readers that I know little about fMRI data. Put that aside, and we can still discover some interesting things reading through the brief results section.

Main effects of task

As expected, our fMRI paradigms elicited robust affect-related amygdala and reward-related VS activity across the entire parent sample of 917 participants (Fig. 1). In our substudy sample of 70 participants, there were no significant effects of gender (t(70) values < 0.88, p values >0.17) or age (r values < 0.22; p values > 0.07) on VS or amygdala activity in either hemisphere.

Hmm, let’s focus on the second sentence first. The authors tell us absolutely nothing is going on in terms of differences in amygdala and reward-related VS activity in relation to age and gender in the sample of 70 participants in the current study. In fact, we don’t even need to know what “amygdala and reward-related VS activity” is to wonder why the first sentence of this paragraph directs us to a graph not of the 70 participants, but of a larger sample of 917 participants. And when we go to Figure 1, we see some wild wowie-zowie, hit-the-reader-between-the-eyes differences (in technical terms, intraocular trauma) for women. And claims of p < 0.000001, twice. But wait! One might think significance of that magnitude would have to come from the 917 participants, except the labeling of the x-axis must come from the substudy of the 70 participants for whom data concerning number of sex partners were collected. Maybe the significance comes from the anchoring of one of the graph lines by the one way-out outlier.

Note that the outlier woman with eight partners anchors the blue line for High Left Amygdala. Without inclusion of that single woman, the nonsignificant trends between women with High Left Amygdala versus women with Low Left Amygdala would be reversed.

The authors make much of the differences between Figure 1, showing results for women, and Figure 2, showing results for men. The comparison seems dramatic except that, once again, the one outlier sends the red line for Low Left Amygdala off from the blue line for High Left Amygdala. Otherwise, there is no story to tell. Mind-boggling, but I think we can safely conclude that something is amiss in these Frankenstein graphs.

Okay, we should stop beating a corpse of an article. There are no vital signs left.

Alternatively, we could probe the section on Poisson regressions and minimally note some details. There is the flash of some strings of zeros in the p-values, but it seems complicated, and then we are warned off with “no factors survive Bonferroni correction.” And then in the next paragraph, we get to the exploration of dubious interactions. And there is the final insult of the authors bringing in a two-way interaction trending toward significance among men, p = .051.
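That warning about Bonferroni correction is worth unpacking. Bonferroni simply divides the significance threshold by the number of tests, so with many exploratory comparisons, even an impressive-looking nominal p-value must clear a much stricter bar. A sketch with hypothetical p-values (the counts here are illustrative, not taken from the paper):

```python
def bonferroni_survivors(p_values, alpha=0.05):
    """Return the p-values that survive a Bonferroni correction:
    each must fall below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p for p in p_values if p < threshold]

# Hypothetical batch of 16 exploratory tests: a nominal p = .004 looks
# impressive, but the corrected threshold is 0.05 / 16 = 0.003125.
ps = [0.004, 0.051, 0.12, 0.33] + [0.5] * 12
print(bonferroni_survivors(ps))  # [] -- nothing survives
```

That is the point of the correction: the more places you look, the less any single nominally significant result should impress you.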

But we were never told how all this would lead, as promised at the end of the abstract, “to the development of novel, gender-specific strategies that may be more effective at curtailing sexual risk behavior.”

Rushing through the discussion section, we note the disclosure that

The nature of these unexpected gender differences is unclear and warrants further consideration.

So, the authors confess that they did not start with expectations of finding a gender difference. They had nothing to report from a subset of data from an ambitious project put together for other purposes, with a follow-up (and even an experimental task) ill-suited for the research question. They made a decision to include two outliers, salvaged some otherwise weak and inconsistent differences, and then constructed a story that depended on their inclusion. Bingo: they could ride confirmation bias and get published.

Readers might have been left with just their skepticism about the three-way interaction described in the abstract. However, the authors implicated themselves by disclosing in the article their examination of the distribution and their reasons for including the outliers. Then they further disclosed that they did not start with a hypothesis about gender differences.

Why didn’t the editor and reviewers at Journal of Neuroscience (impact factor 6.344) do their job and cry foul? Questionable research practices (QRPs) are brought to us courtesy of questionable publication practices (QPPs).

And then we end with the confident

These limitations notwithstanding, our current results suggest the importance of considering gender-specific patterns of interactions between functional neural circuits supporting approach and avoidance in the expression of sexual risk behavior in young adults.

Yet despite this vague claim, the authors still haven’t explained how this research could be translated to practice.

Takeaway points for the future

Without a tip from Neurocritic, I might not have zeroed in on the dubious complex statistical interaction on which the storyline in the abstract depended. I also benefited from the authors, for whatever reason, telling us that they had peeked at the data, and telling us further in the discussion that they had not anticipated the gender difference. Given current standards for transparency and the absence of preregistration for such studies, it would have been easy to miss what was done, because the authors did not need to alert us. Until more and better standards are enforced, we just need to be extra skeptical of claims about the application of neuroscience to everyday life.

Trust your skepticism.

Apply whatever you know about statistics and experimental methods. You probably know more than you think you do.

Beware of modest-sized neuroscience studies for which authors develop storylines from patterns they discover in their data, not from a priori hypotheses suggested by theory. If you keep looking around in the scientific literature and media coverage of it, I think you will find a lot of these QRPs and QPPs.

Don’t go into a default believe-it mode just because an article is peer-reviewed.

Notes

  1. If both the outliers were of the same gender, it would have been enough for that gender to have had significantly more sex partners than the other.
  2. Later we are told in the Discussion section that the particular stimuli for which fMRI data were available were not chosen for their relevance to the research question claimed for this paper.

We did not measure VS and amygdala activity in response to sexually provocative stimuli but rather to more general representations of reward and affective arousal. It is possible that variability in VS and amygdala activity to such explicit stimuli may have different or nonexistent gender-specific patterns that may or may not map onto sexual risk behaviors.

Special thanks to Neurocritic for suggesting this blog post and for feedback, as well as to Neuroskeptic, Jessie Sun, and Hayley Jach for helpful feedback. However, @CoyneoftheRealm bears sole responsibility for any excesses or errors in this post.