Stalking a Cheshire cat: Figuring out what happened in a psychotherapy intervention trial

John Ioannidis, the “scourge of sloppy science,” has documented again and again that the safeguards being introduced into the biomedical literature against untrustworthy findings are usually ineffective. In Ioannidis’ most recent report, his group:

…Assessed the current status of reproducibility and transparency addressing these indicators in a random sample of 441 biomedical journal articles published in 2000–2014. Only one study provided a full protocol and none made all raw data directly available.

As reported in a recent post in Retraction Watch, Did a clinical trial proceed as planned? New project finds out, psychiatrist Ben Goldacre has a new project with

…The relatively straightforward task of comparing reported outcomes from clinical trials to what the researchers said they planned to measure before the trial began. And what they’ve found is a bit sad, albeit not entirely surprising.

Ben Goldacre specifically excludes psychotherapy studies from this project. But there are reasons to believe that the psychotherapy literature is less trustworthy than the biomedical literature because psychotherapy trials are less frequently registered, adherence to CONSORT reporting standards is less strict, and investigators more routinely refuse to share data when requested.

Untrustworthiness of information provided in the psychotherapy literature can have important consequences for patients, clinical practice, and public health and social policy.

The study that I will review twice switched outcomes in its reports, had a poorly chosen comparison/control group and flawed analyses, and its protocol was registered after the study started. Yet the study will likely provide data for decision-making about what to do with primary care patients with a few unexplained medical symptoms. The recommendation of the investigators is to deny these patients medical tests and workups and instead provide them with an unvalidated psychiatric diagnosis and a treatment that encourages them to believe that their concerns are irrational.

In this post I will attempt to track what should have been an orderly progression from (a) registration of a psychotherapy trial to (b) publishing of its protocol to (c) reporting of the trial’s results in the peer-reviewed literature. This exercise will show just how difficult it is to make sense of studies in a poorly documented psychological intervention literature.

  • I find lots of surprises, including outcome switching in both reports of the trial.
  • The second article reporting results of the trial does not acknowledge its registration, minimally cites the first report of outcomes, and hides important shortcomings of the trial. Yet the authors inadvertently expose crucial new shortcomings without comment.
  • Detecting important inconsistencies between registration and protocols and reports in the journals requires an almost forensic attention to detail to assess the trustworthiness of what is reported. Some problems hide in plain sight if one takes the time to look, but others require a certain clinical connoisseurship, a well-developed appreciation of the subtle means by which investigators spin outcomes to get novel and significant findings.
  • Outcome switching and inconsistent cross-referencing of published reports of a clinical trial will bedevil any effort to integrate the results of the trial into the larger literature in a systematic review or meta-analysis.
  • Two journals – Psychosomatic Medicine and particularly Journal of Psychosomatic Research – failed to provide adequate peer review of articles based on this trial, in terms of trial registration, outcome switching, and allowing multiple reports of what could be construed as primary outcomes from the same trial into the literature.
  • Despite serious problems in their interpretability, results of this study are likely to be cited and influence far-reaching public policies.
  • The generalizability of results of my exercise is unclear, but my findings encourage skepticism more generally about published reports of results of psychotherapy interventions. It is distressing that more alarm bells have not been sounded about the reports of this particular study.

The publicly accessible registration of the trial is:

Cognitive Behaviour Therapy for Abridged Somatization Disorder (Somatic Symptom Index [SSI] 4,6) patients in primary care. Current controlled trials ISRCTN69944771

The publicly accessible full protocol is:

Magallón R, Gili M, Moreno S, Bauzá N, García-Campayo J, Roca M, Ruiz Y, Andrés E. Cognitive-behaviour therapy for patients with Abridged Somatization Disorder (SSI 4, 6) in primary care: a randomized, controlled study. BMC Psychiatry. 2008 Jun 22;8(1):47.

The second report of treatment outcomes in Journal of Psychosomatic Research

Readers can more fully appreciate the problems that I uncovered if I work backwards from the second published report of outcomes from the trial. Published in Journal of Psychosomatic Research, the article is behind a paywall, but readers can write to the corresponding author for a PDF. This person is also the corresponding author for the first paper, in Psychosomatic Medicine, and so readers might want to request both papers.

Gili M, Magallón R, López-Navarro E, Roca M, Moreno S, Bauzá N, García-Campayo J. Health related quality of life changes in somatising patients after individual versus group cognitive behavioural therapy: A randomized clinical trial. Journal of Psychosomatic Research. 2014 Feb 28;76(2):89-93.

The title is misleading in its ambiguity because “somatising” does not refer to an established diagnostic category. In this article, it refers to an unvalidated category that encompasses a considerable proportion of primary care patients, usually those with comorbid anxiety or depression. More about that later.

PubMed, which usually reliably attaches a trial registration number to abstracts, doesn’t do so for this article.

The article does not list the registration, and does not provide the citation when indicating that a trial protocol is available. The only subsequent citations of the trial protocol are ambiguous:

More detailed design settings and study sample of this trial have been described elsewhere [14,16], which explain the effectiveness of CBT reducing number and severity of somatic symptoms.

The above quote is also the sole citation of a key previous paper that presents outcomes for the trial. Only an alert and motivated reader would catch this. No opportunity within the article is provided for comparing and contrasting results of the two papers.

The brief introduction displays a decided puffer fish phenomenon, exaggerating the prevalence and clinical significance of the unvalidated “abridged somatization disorder.” Essentially, the authors invoke the problematic but accepted psychiatric diagnostic categories of somatoform and somatization disorders in claiming validity for a diagnosis with much less stringent criteria. Oddly, the category has different criteria when applied to men and women: men require four unexplained medical symptoms, whereas women require six.

I haven’t previously encountered the term “abridged” in psychiatric diagnosis. Maybe the authors mean “subsyndromal,” as in “subsyndromal depression.” This labeling is dubious because it suggests that not all characteristics needed for the full diagnosis are present, some of which may be crucial. Think of it: is a persistent cough subsyndromal lung cancer, or maybe emphysema? References to symptoms being “subsyndromal” often occur in contexts where exaggerated claims about prevalence are being made, with inappropriate, non-evidence-based inferences about treatment of milder cases from the more severe.

A casual reader might infer that the authors are evaluating a psychiatric treatment with wide applicability to as many as 20% of primary care patients. As we will see, the treatment focuses on discouraging any diagnostic medical tests and trying to convince the patient that their concerns are irrational.

The introduction identifies the primary outcome of the trial:

The aim of our study is to assess the efficacy of a cognitive behavioural intervention program on HRQoL [health-related quality of life] of patients with abridged somatization disorder in primary care.

This primary outcome is inconsistent with what was reported in the registration, the published protocol, and the first article reporting outcomes. The earlier report does not even mention the inclusion of a measure of HRQoL, measured by the SF-36. It is listed in the study protocol as a “secondary variable.”

The opening of the methods section declares that the trial is reported in this paper consistent with the Consolidated Standards of Reporting Clinical Trials (CONSORT). This is not true because the flowchart describing patients from recruitment to follow-up is missing. We will see that when it is reported in another paper, some important information is contained in that flowchart.

The methods section reports only three measures were administered: a Standardized Polyvalent Psychiatric Interview (SPPI), a semistructured interview developed by the authors with minimal validation; a screening measure for somatization administered by primary care physicians to patients whom they deemed appropriate for the trial; and the SF-36.

Crucial details are withheld about the screening and diagnosis of “abridged somatization disorder.” If these details had been presented, a reader would further doubt the validity of this unvalidated and idiosyncratic diagnosis.

Few readers, even primary care physicians or psychiatrists, will know what to make of the Smith’s guidelines (Googling it won’t yield much), which is essentially a matter of simply sending a letter to the referring GP. Sending such a letter is a notoriously ineffective intervention in primary care. It mainly indicates that patients referred to a trial did not get assigned to an active treatment. As I will document later, the authors were well aware that this would be an ineffectual control/comparison intervention, but using it as such guarantees that their preferred intervention would look quite good in terms of effect size.

The two active interventions are individual- and group-administered CBT which is described as:

Experimental or intervention group: implementation of the protocol developed by Escobar [21,22] that includes ten weekly 90-min sessions. Patients were assessed at 4 time points: baseline, post-treatment, 6 and 12 months after finishing the treatment. The CBT intervention mainly consists of two major components: cognitive restructuring, which focuses on reducing pain-specific dysfunctional cognitions, and coping, which focuses on teaching cognitive and behavioural coping strategies. The program is structured as follows. Session 1: the connection between stress and pain. Session 2: identification of automated thoughts. Session 3: evaluation of automated thoughts. Session 4: questioning the automatic thoughts and constructing alternatives. Session 5: nuclear beliefs. Session 6: nuclear beliefs on pain. Session 7: changing coping mechanisms. Session 8: coping with ruminations, obsessions and worrying. Session 9: expressive writing. Session 10: assertive communication.

There is sparse presentation of data from the trial in the results section, but some fascinating details await a skeptical, motivated reader.

Table 1 displays social demographic and clinical variables. Psychiatric comorbidity is highly prevalent. Readers can’t tell exactly what is going on, because the authors’ own interview schedule is used to assess comorbidity. But it appears that all but a small minority of patients diagnosed with “abridged somatization disorder” have substantial anxiety and depression. Whether these symptoms meet formal criteria cannot be determined. There is no mention of physical comorbidities.

But there is something startling awaiting an alert reader in Table 2.

[Table 2 from Gili et al.: SF-36 scores by treatment group]

There is something very odd going on here, very likely a breakdown of randomization. Baseline differences between groups in the key outcome measure, the SF-36, are substantially greater than any within-group change. The treatment as usual (TAU) condition has much lower functioning [lower scores mean lower functioning] than the group CBT condition, which in turn is substantially below the individual CBT condition.

If we compare the scores to adult norms, all three groups of patients are poorly functioning, but those “randomized” to TAU are unusually impaired, strikingly more so than the other two groups.

Keep in mind that evaluations of active interventions, in this case CBT, in randomized trials always involve a difference between groups, not just the change observed within a particular group. That’s because a comparison/control group is supposed to be equivalent for nonspecific factors, including natural recovery. This trial is going to be very biased in its evaluation of individual CBT, a group within which patients started much higher in physical functioning and ended up much higher. Statistical controls fail to correct for such baseline differences. We simply do not have an interpretable clinical trial here.
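To see why within-group change can mislead when baselines diverge, consider a toy calculation. The numbers below are entirely hypothetical, not taken from the trial; they merely reproduce the pattern at issue:

```python
# Hypothetical SF-36-style means (higher = better functioning) for three arms
baseline = {"TAU": 40.0, "group CBT": 52.0, "individual CBT": 60.0}
followup = {"TAU": 42.0, "group CBT": 55.0, "individual CBT": 66.0}

for arm in baseline:
    print(f"{arm}: within-group change = {followup[arm] - baseline[arm]:+.1f}")

# Every within-group change is a few points; the baseline gap between arms
# is 20 points. When arms start that far apart, between-group comparisons
# at follow-up are uninterpretable -- the signature of broken randomization.
gap = max(baseline.values()) - min(baseline.values())
print(f"baseline gap between arms = {gap:.1f}")
```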

The first report of treatment outcomes in Psychosomatic Medicine

Moreno S, Gili M, Magallón R, Bauzá N, Roca M, del Hoyo YL, Garcia-Campayo J. Effectiveness of group versus individual cognitive-behavioral therapy in patients with abridged somatization disorder: a randomized controlled trial. Psychosomatic Medicine. 2013 Jul 1;75(6):600-8.

The title indicates that the patients are selected on the basis of “abridged somatization disorder.”

The abstract prominently indicates the trial registration number (ISRCTN69944771), which can be plugged into Google to reach the publicly accessible registration.

If a reader is unaware of the lack of validation for “abridged somatization disorder,” they probably won’t infer that from the introduction. The rationale given for the study is that

A recently published meta-analysis (18) has shown that there has been ongoing research on the effectiveness of therapies for abridged somatization disorder in the last decade.

Checking that meta-analysis, I found that it included only a single null trial for treatment of abridged somatization disorder. This seems like a gratuitous, ambiguous citation.

I was surprised to learn that in three of the five provinces in which the study was conducted, patients

…Were not randomized on a one-to-one basis but in blocks of four patients to avoid a long delay between allocation and the onset of treatment in the group CBT arm (where the minimal group size required was eight patients). This has produced, by chance, relatively big differences in the sizes of the three arms.

This departure from one-to-one randomization was not mentioned in the second article reporting results of the study, and seems an outright contradiction of what is presented there. Nor is it mentioned in the study protocol. This allocation strategy may have been the source of the lack of baseline equivalence between the TAU and the two intervention groups.
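A small simulation makes the quoted passage concrete. The sketch below is hypothetical (the arm names, patient count, and seed are illustrative, not from the trial): when whole blocks of four patients are allocated to a single randomly chosen arm at a time, arm sizes can drift apart by chance, exactly as the authors report:

```python
import random

random.seed(1)  # fixes this illustration's output; any seed shows the same effect

ARMS = ("TAU", "group CBT", "individual CBT")

def block_randomize(n_patients, arms=ARMS, block_size=4):
    """Allocate patients in whole blocks: each block of `block_size`
    consecutive patients goes to one randomly chosen arm."""
    counts = {arm: 0 for arm in arms}
    allocated = 0
    while allocated < n_patients:
        arm = random.choice(arms)
        counts[arm] += min(block_size, n_patients - allocated)
        allocated += block_size
    return counts

# Arm sizes come out as multiples of 4 and frequently differ substantially
print(block_randomize(168))
```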

For the vigilant skeptic, the authors’ calculation of sample size is an eye-opener. Sample size estimation was based on the effectiveness of TAU in primary care visits, which has been assumed to be very low (approximately 10%).

Essentially, the authors are justifying a modest sample size because they expect the TAU intervention to be largely ineffective. How could the authors believe there is equipoise, that the comparison/control and active treatments could be expected to be equally effective? The authors seem to say that they don’t believe this. Yet equipoise is an ethical and practical requirement for a clinical trial for which human subjects are being recruited. In terms of trial design, do the authors really think this poor treatment provides an adequate comparison/control?

In the methods section, the authors also provide a study flowchart, which was required for the second paper to claim adherence to CONSORT standards but was missing there. Note the flow at the end of the study for the TAU comparison/control condition at the far right: there was substantially more dropout in this group. The authors chose to estimate missing scores with the Last Observation Carried Forward (LOCF) method, which assumes the last available observation can be substituted for every subsequent one. This is a discredited technique and particularly inappropriate in this context. Think about it: the TAU condition was expected by the authors to be quite poor care. Not surprisingly, more patients assigned to it dropped out. But they might have dropped out while deteriorating, and so carrying forward the last observation obtained is particularly inappropriate. Certainly it cannot be assumed that the smaller number of dropouts from the other conditions left for the same reasons. We have a methodological and statistical mess on our hands, but it was hidden from view in the second report.
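To make LOCF concrete, here is a minimal sketch in plain Python (the scores are hypothetical, not the trial's data): a patient who deteriorates and then drops out has their last, already-declining score copied forward, so the imputed series freezes the decline rather than following it:

```python
def locf(scores):
    """Last Observation Carried Forward: replace each missing value (None)
    with the most recent observed value."""
    imputed, last = [], None
    for s in scores:
        if s is not None:
            last = s
        imputed.append(last)
    return imputed

# Hypothetical SF-36-style scores (higher = better) at baseline,
# post-treatment, 6 and 12 months; the patient drops out after 6 months.
observed = [45, 40, 35, None]
print(locf(observed))  # [45, 40, 35, 35]: the 12-month score pretends no further decline
```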



Six measures are mentioned: (1) the Othmer-DeSouza screening instrument used by clinicians to select patients; (2) the Screening for Somatoform Disorders (SOMS), a 39-item questionnaire that includes all bodily symptoms and criteria relevant to somatoform disorders according to either DSM-IV or ICD-10; (3) a Visual Analog Scale of somatic symptoms (Severity of Somatic Symptoms scale) that patients use to assess changes in severity in each of 40 symptoms; (4) the authors’ own SPPI semistructured psychiatric interview for diagnosis of psychiatric morbidity in primary care settings; (5) the clinician-administered Hamilton Anxiety Rating Scale; and (6) the Hamilton Depression Rating Scale.

We are never actually told what the primary outcome is for the study, but it can be inferred from the opening of the discussion:

The main finding of the trial is a significant improvement regardless of CBT type compared with no intervention at all. CBT was effective for the relief of somatization, reducing both the number of somatic symptoms (Fig. 2) and their intensity (Fig. 3). CBT was also shown to be effective in reducing symptoms related to anxiety and depression.

But I noticed something else here, after a couple of readings. The measures used to select patients and identify them with “abridged somatization disorder” reference 39 or 40 symptoms, with men needing only four and women only six symptoms for a diagnosis. That means that most pairs of patients receiving the diagnosis will not have a single symptom in common. Whatever “abridged somatization disorder” means, patients who received this diagnosis are likely to differ from each other in terms of somatic symptoms, but probably have other characteristics in common. They are basically depressed and anxious patients, but these mood problems are not being addressed directly.

Comparison of this report to the outcomes paper reviewed earlier shows that none of these measures are mentioned there as being assessed, and certainly not as outcomes.

Comparison of this report to the published protocol reveals that number and intensity of somatic symptoms are two of the three main outcomes, but this article makes no mention of the third, utilization of healthcare.

Readers can find something strange in Table 2, which presents what seems to be one of the primary outcomes, severity of symptoms. In this table the order is TAU, group CBT, and individual CBT. Note the large difference in baseline symptoms, with group CBT being much more severe. It is difficult to make sense of the 12-month follow-up because there was differential dropout and reliance on an inappropriate LOCF imputation of missing data. But if we accept the imputation as the authors did, it appears that there were no differences between TAU and group CBT. That is what the authors reported, relying on inappropriate analyses of covariance.

[Table 2 from Moreno et al.: severity of somatic symptoms by treatment group]

The authors’ cheerful take away message?

This trial, based on a previous successful intervention proposed by Sumathipala et al. (39), presents the effectiveness of CBT applied at individual and group levels for patients with abridged somatization (somatic symptom indexes 4 and 6).

But hold on! In the introduction, the authors’ justification for their trial was:

Evidence for the group versus individual effectiveness of cognitive-behavioral treatment of medically unexplained physical symptoms in the primary care setting is not yet available.

And let’s take a look at Sumathipala et al.

Sumathipala A, Siribaddana S, Hewege S, Sumathipala K, Prince M, Mann A. Understanding the explanatory model of the patient on their medically unexplained symptoms and its implication on treatment development research: a Sri Lanka Study. BMC Psychiatry. 2008 Jul 8;8(1):54.

The article presents speculations based on an observational study, not an intervention study, so no successful intervention is reported there.

The formal registration 

The registration of psychotherapy trials typically provides sparse details. The curious must consult the more elaborate published protocol. Nonetheless, the registration can often provide grounds for skepticism, particularly when it is compared to any discrepant details in the published protocol, as well as subsequent publications.

The registration declares

Study hypothesis

Patients randomized to cognitive behavioural therapy significantly improve in measures related to quality of life, somatic symptoms, psychopathology and health services use.

Primary outcome measures

Severity of Clinical Global Impression scale at baseline, 3 and 6 months and 1-year follow-up

Secondary outcome measures

The following will be assessed at baseline, 3 and 6 months and 1-year follow-up:
1. Quality of life: 36-item Short Form health survey (SF-36)
2. Hamilton Depression Scale
3. Hamilton Anxiety Scale
4. Screening for Somatoform Symptoms [SOMS]

Overall trial start date


Overall trial end date


The published protocol 

Primary outcome

Main outcome variables:

– SSS (Severity of somatic symptoms scale) [22]: a scale of 40 somatic symptoms assessed by a 7-point visual analogue scale.

– SSQ (Somatic symptoms questionnaire) [22]: a scale made up of 40 items on somatic symptoms and patients’ illness behaviour.

When I searched the protocol for the Severity of Clinical Global Impression scale, the primary outcome declared in the registration, I could find no reference to it.

The protocol was submitted on May 14, 2008 and published on June 22, 2008. This suggests that the protocol was submitted after the start of the trial.

To calculate the sample size we consider that the effectiveness of usual treatment (Smith’s norms) is rather low, estimated at about 20% in most of the variables [10,11]. We aim to assess whether the new intervention is at least 20% more effective than usual treatment.

Comparison group

Control group or standardized recommended treatment for somatization disorder in primary care (Smith’s norms) [10,11]: standardized letter to the family doctor with Smith’s norms that includes: 1. Provide brief, regularly scheduled visits. 2. Establish a strong patient-physician relationship. 3. Perform a physical examination of the area of the body where the symptom arises. 4. Search for signs of disease instead of relying of symptoms. 5. Avoid diagnostic tests and laboratory or surgical procedures. 6. Gradually move the patient to being “referral ready”.

Basically, TAU, the comparison/control group, involves simply sending a letter to referring physicians encouraging them to meet regularly with the patients while discouraging diagnostic tests or medical procedures. Keep in mind that patients for this study were selected by the physicians because they found them particularly frustrating to treat. Despite the authors’ repeated claims about the high prevalence of “abridged somatization disorder,” they relied on a large number of general practice settings to each contribute only a few patients. These patients are very heterogeneous in terms of somatic symptoms, but most share anxiety or depressive symptoms.

There is an uncontrolled selection bias here that makes generalization from results of the study problematic. Just who are these patients? I wonder if these patients have some similarity to the frustrating GOMERs (Get Out Of My Emergency Room) in the classic House of God, a book described by Amazon as “an unvarnished, unglorified, and amazingly forthright portrait revealing the depth of caring, pain, pathos, and tragedy felt by all who spend their lives treating patients and stand at the crossroads between science and humanity.”

Imagine the disappointment of the referring physicians and the patients when consent to participate in this study simply left the patients back in routine care provided by the same physicians. It’s no wonder that the patients deteriorated and that patients assigned to this condition were more likely to drop out.

Whatever active ingredients the individual and group CBT have, they also include some nonspecific factors missing from the TAU comparison group: frequency and intensity of contact, reassurance and support, attentive listening, and positive expectations. These nonspecific factors can readily be confused with active ingredients and may account for any differences between the active treatments and the TAU comparison. What a terrible study.

The two journals providing reports of the studies failed in their responsibility to the readership and to the larger audience seeking clinical and public policy relevance. Authors have ample incentive to engage in questionable publication practices, including ignoring and even suppressing registration, switching outcomes, and exaggerating the significance of their results. Journals of necessity must protect authors from their own inclinations, as well as protect the readers and the larger medical community from untrustworthy reports. Psychosomatic Medicine and Journal of Psychosomatic Research failed miserably in their peer review of these articles. Neither journal is likely to be the first choice for authors seeking to publish findings from well-designed and well-reported trials. Who knows, maybe the journals’ standards are compromised by the need to attract randomized trials for what is construed, at least by the psychiatric community, as a psychosomatic condition.

Regardless, it’s futile to require registration and posting of protocols for psychotherapy trials if editors and reviewers ignore these resources in evaluating articles for publication.

Postscript: imagine what will be done with the results of this study

You can’t fix with a meta-analysis what investigators bungled by design.

In a recent blog post, I examined a registration for a protocol for a systematic review and meta-analysis of interventions to address medically unexplained symptoms. The review protocol was inadequately described, had undisclosed conflicts of interest, and one of the senior investigators had a history of switching outcomes in his own study and refusing to share data for independent analysis. Undoubtedly, the study we have been discussing meets the vague criteria for inclusion in this meta-analysis. But which outcomes will be chosen, particularly when there should be only one outcome per study? And will it be recognized that these two reports come from the same study? Will key problems in the designation of the TAU control group, with its likely inflation of treatment effects, be recognized when it is used to calculate effect sizes?

As you can see, it took a lot of effort to compare and contrast documents that should have been in alignment. Do you really expect those who conduct subsequent meta-analyses to make those multiple comparisons, or will they simply extract multiple effect sizes from the two papers so far reporting results?

Obviously, every time we encounter a report of a psychotherapy trial in the literature, we won’t have the time or inclination to undertake such a cross-comparison of articles, registration, and protocol. But maybe we should be skeptical of authors’ conclusions without such checks.

I’m curious what a casual reader would infer after encountering one of these reports of the clinical trial I have reviewed in a literature search, but not the other.




Amazingly spun mindfulness trial in British Journal of Psychiatry: How to publish a null trial

Since when is “mindfulness therapy is not inferior to routine primary care” newsworthy?


Spinning makes null results a virtue to be celebrated…and publishable.

An article reporting a RCT of group mindfulness therapy

Sundquist, J., Lilja, Å., Palmér, K., Memon, A. A., Wang, X., Johansson, L. M., & Sundquist, K. (2014). Mindfulness group therapy in primary care patients with depression, anxiety and stress and adjustment disorders: randomised controlled trial. The British Journal of Psychiatry.

was previously reviewed in Mental Elf. You might want to consider their briefer evaluation before beginning mine. I am going to be critical not only of the article, but the review process that got it into British Journal of Psychiatry (BJP).

I am an Academic Editor of PLOS One,* where we have the laudable goal of publishing all papers that are transparently reported and not technically flawed. Beyond that, we leave decisions about scientific quality to post-publication commentary of the many, not a couple of reviewers whom the editor has handpicked. Yet, speaking for myself, and not PLOS One, I would have required substantial revisions or rejected the version of this paper that got into the presumably highly selective, even vanity journal BJP**.

The article is paywalled, but you can get a look at the abstract here  and write to the corresponding author for a PDF at

As always, examine the abstract carefully when you suspect spin, but expect that you will not fully appreciate the extent of spin until you have digested the whole paper. This abstract declares

Mindfulness-based group therapy was non-inferior to treatment as usual for patients with depressive, anxiety or stress and adjustment disorders.

“Non-inferior” meaning ‘no worse than routine care?’ How could that null result be important enough to get into a journal presumably having a strong confirmation bias? The logic sounds just like US Senator George Aiken famously proposing getting America out of the war it was losing in Vietnam by declaring America had won and going home.

There are hints of other things going on, like no reporting of how many patients were retained for analysis or whether there were intention-to-treat analyses. And then the weird mention of outcomes being analyzed with “ordinal mixed models.”  Have you ever seen that before? And finally, do the results hold for patients with any of those disorders or only a particular sample of unknown mix and maybe only representing those who could be recruited from specific settings? Stay tuned…

What is a non-inferiority trial and when should one conduct one?

An NHS website explains

The objective of non-inferiority trials is to compare a novel treatment to an active treatment with a view of demonstrating that it is not clinically worse with regards to a specified endpoint. It is assumed that the comparator treatment has been established to have a significant clinical effect (against placebo). These trials are frequently used in situations where use of a superiority trial against a placebo control may be considered unethical.

Noninferiority trials (NIs) have a bad reputation. Consistent with a large literature, a recent systematic review of NI HIV trials found the overall methodological quality to be poor, with a high risk of bias. The people who brought you CONSORT saw fit to develop special reporting standards for NIs so that misuse of the design in the service of getting publishable results is more readily detected. You might want to download the CONSORT checklist for NI trials and apply it to the trial under discussion. Right away, you can see how deficient the reporting is in the abstract of the paper under discussion.

Basically, an NI RCT commits investigators and readers to accepting null results as support for a new treatment because it is no worse than an existing one. Suspicions are immediately raised as to why investigators might want to make that point.
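
To make that logic concrete: in an NI analysis, the new treatment is declared non-inferior only if the confidence bound on its difference from the reference treatment stays above a prespecified margin. A minimal sketch with hypothetical numbers (this is an illustration of the general method, not the trial’s actual analysis):

```python
from statistics import NormalDist

def noninferior(mean_new, mean_ref, se_diff, margin, alpha=0.025):
    """One-sided non-inferiority test: the new treatment is declared
    non-inferior if the lower confidence bound of (new - reference)
    lies above the prespecified margin -delta."""
    z = NormalDist().inv_cdf(1 - alpha)          # e.g. 1.96 for alpha = 0.025
    lower = (mean_new - mean_ref) - z * se_diff  # lower bound of the difference
    return lower > -margin

# Hypothetical: identical mean outcomes; whether "non-inferiority" is
# declared depends entirely on precision relative to the chosen margin.
print(noninferior(mean_new=10.0, mean_ref=10.0, se_diff=0.2, margin=0.5))  # True
print(noninferior(mean_new=10.0, mean_ref=10.0, se_diff=0.3, margin=0.5))  # False
```

Note how much rides on the margin being prespecified: chosen after seeing the data, it can be set wide enough to guarantee "success."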

Conflicts of interest could be a reason. Demonstration that the treatment is as good as existing treatments might warrant marketing of the new treatment or dissemination into existing markets. There could be financial rewards, or simply promoters and enthusiasts favoring what they find interesting. Yup, some bandwagons, fads, and fashions in psychotherapy are in large part due to promoters simply seeking the new and different, without evidence that a treatment is better than existing ones.

Suspicions are reduced when the new treatment has other advantages, like greater acceptability or a lack of side effects, or when the existing treatments are so good that an RCT of the new treatment with a placebo-control condition would be unethical.

We should evaluate whether authors have an adequate rationale for conducting an NI RCT, rather than relying on the conventional test of whether the null hypothesis of no differences between the intervention and a control condition can be rejected. Suitable support would be a strong record of efficacy for a well-defined control condition. It would also help if the trial were pre-registered as NI, quieting concerns that it was declared as such after peeking at the data.

The first things I noticed in the methods section…trouble

  • The recruitment procedure is strangely described, but seems to indicate that the therapists providing mindfulness training were present during recruitment, probably weren’t blinded to group assignment, and conceivably could influence it. The study thus does not have clear evidence of an appropriate randomization procedure and initial blinding. Furthermore, the GPs administering concurrent treatment were also not blinded and might take group assignment into account in subsequent prescribing and monitoring of medication.
  • During the recruitment procedure, GPs assessed whether medication was needed and prescribed it before randomization occurred. We are not told in the methods section, but I suspect a lot of medication is being given to both intervention and control patients. That is going to complicate interpretation of results.
  • In terms of diagnosis, a truly mixed group of patients was recruited. Patients experiencing stress or adjustment reactions were thrown in with patients who had mild or moderate depression or anxiety disorders. Patients were excluded who were considered severe enough to need psychiatric care.
  • Patients receiving any psychotherapy at the start of the trial were excluded, but the authors ignored whether patients were receiving medication.

This appears to be a mildly distressed sample that is likely to show some recovery in the absence of any treatment. The authors’ failure to control for the medication received is going to be a big problem later. Readers won’t be able to tell whether any improvement in the intervention condition is due to its more intensive support and encouragement resulting in better adherence to medication.

  • The authors go overboard in defending their use of multiple overlapping measures and overboard in praising the validity of their measures. For instance, the Hospital Anxiety and Depression Scale (HADS) is a fatally flawed instrument, even if still widely used. I consider the instrument dead in terms of reliability and validity, but like Elvis, it is still being cited.

Okay, the authors claim these measures are great, and attach clinical importance to cut points that others no longer consider valid. But then, why do they decide that the scales are ordinal, not interval? Basically, they are saying the scales are so bad that the difference between one scale point and the next cannot be assumed equal across the scale. This is getting weird. If the scales are as good as the authors claim, why do the authors take the unusual step of treating them as psychometrically inadequate?

I know, I’m getting technical to the point that I risk losing some readers, but the authors are setting readers up to be comfortable with a decision to focus on medians, not mean scores – making it more difficult to detect any differences between the mindfulness therapy and routine care. Spin, spin!

There are lots of problems with the ill-described control condition, treatment as usual (TAU). My standing gripe with this choice is that TAU varies greatly across settings, and often is so inadequate that at best the authors are comparing whether mindfulness therapy is better than some unknown mix of no treatment and inadequate treatment.

We know enough about mindfulness therapy at this point not to worry about whether it is better than nothing at all; we should be focusing on whether it is better than another active treatment and whether its effectiveness is due to particular factors. The authors state that most of the control patients were receiving CBT, but don’t indicate how they knew that, except from case records. Notoriously, a lot of the therapy done in primary care that is labeled by practitioners as CBT does not pass muster. I would be much more comfortable with some sort of control over what patients were receiving in the control arm, or at least better specification.


I’m again trying to avoid getting very technical here, but will point out, for those who have a developed interest in statistics, that there were strange things going on.

  • Particular statistical analyses (depending on group medians, rather than means) are chosen that are less likely to reveal differences between intervention and control group than the parametric statistics that are typically done.
  • Complicated decisions justify throwing away data and then using multivariate techniques to estimate what the missing data would have been. The multivariate techniques require assumptions that are not tested.
  • The power analysis is not conducted to detect differences between groups, but to be able to provide a basis for saying that mindfulness does not differ from routine care. Were the authors really interested in that question rather than whether mindfulness is better than routine care in initially designing a study and its analytic plan? Without pre-registration, we cannot know.


There are extraordinary revelations in table 1, baseline characteristics.


  • The intervention and control group initially differed for two of the four outcome variables before they even received the intervention. Thus, intervention and control conditions are not comparable in important baseline characteristics. This is in itself a risk of bias, but also raises further questions about the adequacy of the randomization procedure and blinding.
  • We are told nothing about the distribution of diagnoses across the intervention and control group, which is very important in interpreting results and considering what generalizations can be made.
  • Most patients in both the intervention and control groups were receiving antidepressants, and about a third in either condition were receiving a “tranquilizer” or had missing data for that variable.

Signals that there is something amiss in this study are growing stronger. Given the mildness of disturbance and high rates of prescription of medication, we are likely dealing with a primary care sample where medications are casually distributed and poorly monitored. Yet, this study is supposedly designed to inform us whether adding mindfulness to this confused picture produces outcomes that are not worse.

Table 5 adds to the suspicions. There were comparable, significant changes in both the intervention and control condition over time. But we can’t know if that was due to the mildness of distress or effectiveness of both treatments.

table 5

Twice as many patients assigned to mindfulness dropped out of treatment, compared to those assigned to routine care. Readers are given some information about how many sessions of mindfulness patients attended, but not the extent to which they practiced mindfulness.

Discussion

We are told

The main finding of the present RCT is that mindfulness group therapy given in a general practice setting, where a majority of patients with depression, anxiety, and stress and adjustment disorders are treated, is non-inferior to individual-based therapy, including CBT. To the best of our knowledge, this is the first RCT performed in a general practice setting where the effect of mindfulness group therapy was compared with an active control group.

Although a growing body of research has examined the effect of mindfulness on somatic as well as psychiatric conditions, scientific knowledge from RCT studies is scarce. For example, a 2007 review…

It’s debatable whether the statement was true in 2007, but a lot has happened since then. Recent reviews suggest that mindfulness therapy is better than nothing and better than inactive control conditions that do not provide comparable levels of positive expectations and support. Studies are accumulating that indicate mindfulness therapy is not consistently better than active control conditions. Differences become less likely when the alternative treatments are equivalent in positive expectations conveyed to patients and providers, support, and intensity in terms of frequency and amount of contact. Resolving this latter question of whether mindfulness is better than reasonable alternatives is now critical, and this study provides no relevant data.

An Implications section states

Patients who receive antidepressants have a reported remission rate of only 35–40%.41 Additional treatment is therefore needed for non-responders as well as for those who are either unable or unwilling to engage in traditional psychotherapy.

The authors are being misleading to the point of being irresponsible in making this statement in the context of discussing the implications of their study. The reference is to the American STAR*D treatment study, which dealt with a very different, more chronically and unremittingly depressed population.

An appropriately referenced statement about primary care populations like the one this study recruited would point to the lack of diagnosis on which prescription of medication was based, unnecessary treatment with medication of patients who would not be expected to benefit from it, and poor monitoring and follow-up of patients who could conceivably benefit from medication if appropriately monitored. The statement would reflect the poor state of routine care for depression in the community, but would undermine claims that the control group received an active treatment with suitable specification that would allow any generalizations about the efficacy of mindfulness.


This RCT has numerous flaws in its conduct and reporting that preclude making any contribution to the current literature about mindfulness therapy. What is extraordinary is that, as a null trial, it got published in BJP. Maybe its publication in its present form represents incompetent reviewing and editing, or maybe a strategic, but inept decision to publish a flawed study with null findings because it concerns the trendy topic of mindfulness and GPs to whom British psychiatrists want to reach out.

An RCT of mindfulness psychotherapy is attention-getting. Maybe the BJP is willing to sacrifice trustworthiness of the interpretation of results for newsworthiness. BJP will attract readership it does not ordinarily get with publication of this paper.

What is most fascinating is that the study was framed as a noninferiority trial and therefore null results are to be celebrated. I challenge anyone to find similar instances of null results for a psychotherapy trial being published in BJP except in the circumstances that make a lack of effect newsworthy because it suggests that investment in the dissemination of a previously promising treatment is not justified. I have a strong suspicion that this particular paper got published because the results were dressed up as a successful demonstration of noninferiority.

I would love to see the reviews this paper received, almost as much as any record of what the authors intended when they planned the study.

Will this be the beginning of a trend? Does BJP want to encourage submission of noninferiority psychotherapy studies? Maybe the simple explanation is that the editor and reviewers do not understand what a noninferiority trial is and what it can conceivably conclude.

Please, some psychotherapy researcher with a null trial sitting in the drawer, test the waters by dressing the study up as a noninferiority trial and submitting it to BJP.

How bad is this study?

The article provides a non-intention-to-treat analysis of a comparison of mindfulness to an ill-specified control condition that would not qualify as an active treatment. The comparison does not allow generalization to other treatments in other settings. The intervention and control conditions had significant differences in key characteristics at baseline. The patient population is ill-described in ways that do not allow generalization to other patient populations. The high rate of co-treatment confounding due to antidepressants and tranquilizers precludes determination of any effects of the mindfulness therapy. We don’t know if there were any effects, or if both the mindfulness therapy and control condition benefited from the natural decline in distress of a patient population largely without psychiatric diagnoses. Without a control group like a waiting list, we can’t tell if these patients would have improved anyway. I could go on but…

This study was not needed and may be unethical

The accumulation of literature is such that we need less mindfulness therapy research, not more. We need comparisons with well-specified active control groups that can answer the question of whether mindfulness therapy offers any advantage over alternative treatments, not only in efficacy, but in the ability to retain patients so they get an adequate exposure to the treatment. We need mindfulness studies with cleverly chosen comparison conditions that allow determination of whether it is the mindfulness component of mindfulness group therapy that has any effectiveness, rather than relaxation that mindfulness therapy shares with other treatments.

To conduct research in patient populations, investigators must have hypotheses and methods with the likelihood of making a meaningful contribution to the literature commensurate with all the extra time and effort they are asking of patients. This particular study fails this ethical test.

Finally, the publication of this null trial as a noninferiority trial pushes the envelope in terms of the need for preregistration of design and analytic plans for trials. If authors are going to claim a successful demonstration of non-inferiority, we need to know that is what they set out to do, rather than just being stuck with null findings they could not otherwise publish.

*DISCLAIMER: This blog post presents solely the opinions of the author, and not necessarily PLOS. Opinions about the publishability of papers reflect only the author’s views and not necessarily an editorial decision for a manuscript submitted to PLOS One.

**I previously criticized the editorial process at BJP, calling for the retraction of a horribly flawed meta-analysis of the mental health effects of abortion written by an American antiabortion activist. I have pointed out how another flawed review of the efficacy of long-term psychodynamic psychotherapy represented duplicate publication . But both of these papers were published under the last editor. I still hope that the current editor can improve the trustworthiness of what is published at BJP. I am not encouraged by this particular paper, however.

Positive psychology interventions for depressive symptoms

I recently talked with a junior psychiatrist about whether she should undertake a randomized trial of positive psychology interventions with depressed primary care patients. I had concerns about whether positive psychology interventions would be acceptable to clinically depressed primary care patients or off-putting and even detrimental.

Going back to my first publication almost 40 years ago, I’ve been interested in the inept strategies that other people adopt to try to cheer up depressed persons. The risk of positive psychology interventions is that depressed primary care patients would perceive the exercises as more ineffectual pressures on them to think good thoughts, be optimistic and snap out of their depression. If depressed persons try these exercises without feeling better, they are accumulating more failure experiences and further evidence that they are defective, particularly in the context of glowing claims in the popular media of the power of simple positive psychology interventions to transform lives.  Some depressed people develop acute sensitivity to superficial efforts to make them feel better. Their depression is compounded by their sense of coercion and invalidation of what they are so painfully feeling. This is captured in the hilarious Ren & Stimpy classic



Happy Helmet Joy Joy song video


Something borrowed, something blue

By positive psychology interventions, my colleague and I didn’t have in mind techniques that positive psychology borrowed from cognitive therapy for depression. Ambitious positive psychology school-based interventions like the UK Resilience Program incorporate these techniques. They have been validated for use with depressed patients as part of Beck’s cognitive therapy, but are largely ineffective when used with nonclinical populations that are not sufficiently depressed to register an improvement. Rather, we had in mind interventions and exercises that are distinctly positive psychology.

Dr. Joan Cook, Dr. Beck, and Jim Coyne

I surveyed the positive psychology literature to get some preliminary impressions, forcing myself to read the Journal of Positive Psychology and even the Journal of Happiness Studies. I sometimes had to take breaks and go see dark movies as an antidote, such as A Most Wanted Man and The Drop, both of which I heartily recommend. I will soon blog about the appropriateness of positive psychology exercises for depressed patients. But this post concerns a particular meta-analysis that I stumbled upon. It is open access and downloadable anywhere in the world. You can obtain the article and form your own opinions before considering mine or double check mine:

Bolier, L., Haverman, M., Westerhof, G. J., Riper, H., Smit, F., & Bohlmeijer, E. (2013). Positive psychology interventions: a meta-analysis of randomized controlled studies. BMC Public Health, 13(1), 119.

I had thought this meta-analysis just might be the comprehensive, systematic assessment of the literature for which I had searched. I was encouraged that it excluded positive psychology interventions borrowed from cognitive therapy. Instead, the authors sought studies that evaluated

the efficacy of positive psychology interventions such as counting your blessings [29,30], practicing kindness [31], setting personal goals [32,33], expressing gratitude [30,34] and using personal strengths [30] to enhance well-being, and, in some cases, to alleviate depressive symptoms [30].

But my enthusiasm was dampened by the wishy-washy conclusion prominently offered in the abstract:

The results of this meta-analysis show that positive psychology interventions can be effective in the enhancement of subjective well-being and psychological well-being, as well as in helping to reduce depressive symptoms. Additional high-quality peer-reviewed studies in diverse (clinical) populations are needed to strengthen the evidence-base for positive psychology interventions.

Can be? With apologies to Louis Jordan, is they or ain’t they effective? And just why is additional high-quality research needed to strengthen conclusions? Because there are only a few studies or because there are many studies, but mostly of poor quality?

I’m so disappointed when authors devote the time and effort that meta-analysis requires and then beat around the bush with such wimpy, noncommittal conclusions.

A first read alerted me to some bad decisions that the authors had made from the outset. Further reads showed me how effects of these decisions were compounded by the poor quality of the literature of which they had to make sense.

I understand the dilemma the authors faced. The positive psychology intervention literature has developed in collective defiance of established standards for evaluating interventions intended to benefit people, especially interventions sold to people who trust they are beneficial. To have something substantive to say about positive psychology interventions, the authors of this meta-analysis had to lower their standards for selecting and interpreting studies. But they could have done a better job of integrating acknowledgment of problems in the quality of this literature into their evaluation of it. Any evaluation should come with a prominent warning label about the poor quality of studies and evidence of publication bias.

The meta-analysis

Meta-analyses involve (1) systematic searches of the literature; (2) selection of studies meeting particular criteria; and (3) calculation of standardized effect sizes to allow integration of results of studies with different measures of the same construct. Conclusions are qualified by (4) quality ratings of the individual studies and by (5) calculation of the overall statistical heterogeneity of the study results.
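
Step (3), combining standardized effect sizes, is conventionally done with inverse-variance weighting, so that larger, more precise studies count for more. A minimal fixed-effect sketch (the numbers in the usage lines are made up for illustration):

```python
def pooled_effect(effects, variances):
    """Inverse-variance (fixed-effect) pooling of standardized effect sizes.

    Each study's effect is weighted by the reciprocal of its variance,
    so larger, more precise studies contribute more to the pooled estimate.
    """
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_variance = 1.0 / sum(weights)
    return pooled, pooled_variance

# Two hypothetical studies with d = .20 and d = .40 and equal precision
d, var = pooled_effect([0.20, 0.40], [0.10, 0.10])  # d = 0.30
```

The same weighted effects also feed the heterogeneity statistics discussed below.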

The authors searched

PsychInfo, PubMed and the Cochrane Central Register of Controlled Trials, covering the period from 1998 (the start of the positive psychology movement) to November 2012. The search strategy was based on two key components: there should be a) a specific positive psychology intervention, and b) an outcome evaluation.

They also found additional studies by crosschecking references of previous evaluations of positive psychology interventions.

To be selected, a study had to

  • Be developed within the theoretical tradition of positive psychology.
  • Be a randomized controlled study.
  • Measure outcomes of subjective well-being (such as positive affect), personal well-being (such as hope), or depressive symptoms (such as the Beck Depression Inventory).
  • Have results reported in a peer-reviewed journal.
  • Provide sufficient statistics to allow calculation of standardized effect sizes.

I’m going to focus on evaluation of interventions in terms of their ability to reduce depressive symptoms. But I think my conclusions hold for the other outcomes.

The authors indicated their way of assessing the quality of studies (0 to 6) was based on a count derived from an adaptation of the risk of bias items of the Cochrane collaboration. I’ll discuss their departures from the Cochrane criteria later, but these authors’ six criteria were

  • Adequacy of concealment of randomization.
  • Blinding of subjects to which condition they had been assigned.
  • Baseline comparability of groups at the beginning of the study.
  • Whether there was an adequate power analysis or  at least 50 participants in the analysis.
  • Completeness of follow up data: clear attrition analysis and loss to follow up < 50%.
  • Handling of missing data: the use of intention-to-treat analysis, as opposed to analysis of only completers.

The authors used two indicators to assess heterogeneity

  • The Q-statistic. When significant, it calls for rejection of the null hypothesis of homogeneity and indicates that the true effect size probably does vary from study to study.
  • The I²-statistic, which is a percentage indicating the study-to-study dispersion of effect sizes due to real differences, beyond sampling error.

[I know, this is getting technical, but I will try to explain as we go. Basically, the authors estimated the extent to which the effect size they obtained could generalize back to the individual studies. When individual studies vary greatly, an overall effect size for a set of studies can be very different from that of any individual intervention. So without figuring out the nature of this heterogeneity and resolving it, the effect sizes do not adequately represent individual studies or interventions.]
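
For readers who want the arithmetic, both indicators are straightforward to compute from the per-study effects and variances. A sketch under the usual inverse-variance weighting, with invented numbers:

```python
def heterogeneity(effects, variances):
    """Cochran's Q and the I² statistic for a set of study effect sizes.

    Q sums the weighted squared deviations of each study from the pooled
    effect; I² re-expresses the excess of Q over its degrees of freedom
    as a percentage of Q (floored at zero).
    """
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i2

# Hypothetical: two equally precise studies that disagree sharply
q, i2 = heterogeneity([0.0, 1.0], [0.10, 0.10])  # Q = 5.0, I² = 80%
```

High I², as here, is the signal that a single pooled number is papering over real differences between studies.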

One way of reducing heterogeneity is to identify outlier studies that have much larger or smaller effect sizes than the rest. These studies can simply be removed from consideration or sensitivity analyses can be conducted, in which analyses are compared that retain or remove outlier studies.

The authors expected big differences across the studies and so set the threshold for excluding a study as an outlier at a Cohen’s d (standardized difference) between intervention and control group of 2.5 standard deviations. That is huge. The average psychological intervention for depression differs from a waitlist or no-treatment group by .62, but from another active treatment by only d = .20. How could these authors think that even an effect size of 1.0 with largely nonclinical populations could be expected for positive psychology interventions? They are at risk of letting in a lot of exaggerated and nonreplicable results. But stay tuned.

The authors also examined the likelihood that there was publication bias in the studies that they were able to find, using funnel plots, Orwin’s fail-safe number, and the Trim and Fill method. I will focus on the funnel plot because it is graphic, but the other approaches provide similar results. The authors of this meta-analysis state

A funnel plot is a graph of effect size against study size. When publication bias is absent, the observed studies are expected to be distributed symmetrically around the pooled effect size.

Hypothetical funnel plot indicating bias
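
Of the three approaches, Orwin’s fail-safe number is the easiest to reproduce by hand: it asks how many unretrieved null studies it would take to drag the pooled effect down to some "trivial" level. A sketch, with illustrative numbers of my own choosing rather than the authors’ exact inputs:

```python
def orwin_failsafe(k, mean_effect, criterion, missing_effect=0.0):
    """Orwin's fail-safe N: number of missing studies with average effect
    `missing_effect` (usually zero) needed to reduce the pooled effect of
    `k` observed studies from `mean_effect` down to `criterion`."""
    return k * (mean_effect - criterion) / (criterion - missing_effect)

# e.g., 14 studies averaging d = .23: how many unpublished null studies
# would drag the pooled effect down to a trivial d = .10?
n_missing = orwin_failsafe(14, 0.23, 0.10)  # about 18
```

A small fail-safe N means the headline effect is fragile: a file drawer of modest size would erase it.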



At the end of the next two sections, I will conclude that the authors were overly generous in their evaluation of positive psychology interventions. The quality of the available studies precludes deciding whether positive psychology interventions are effective. But don’t accept this conclusion without my documenting my reasons for it. Please read on.


The systematic search identified 40 articles presenting results of 39 studies. The overall quality ratings of the studies were quite low [see Table 1 in the article]. There was a mean score of 2.5 (SD = 1.25). Twenty studies were rated as low quality (<3), 18 as medium quality (3-4), and one received a rating of 5. The studies with the lowest quality had the largest effect sizes (Table 4).

Fourteen effect sizes were available for depressive symptoms. The authors report an overall small effect size of positive psychology interventions on depressive symptoms of .23. Standards for evaluating effect sizes are arbitrary, but this one would generally be considered small.

There were multiple indications of publication bias, including funnel plots of these effect sizes, and it was estimated that 5 negative findings were missing. According to the authors

Funnel plots were asymmetrically distributed in such a way that the smaller studies often showed the more positive results (in other words, there is a certain lack of small insignificant studies).

When the effect sizes for the missing studies were imputed (estimated), the adjusted overall effect size for depressive symptoms was reduced to a nonsignificant .19.

To provide some perspective, let’s examine what an effect size of approximately .20 means. There is a 56% probability (as opposed to a 50/50 probability) that a person assigned to a positive psychology intervention would be better off than a person assigned to the control group.

Created by Kristoffer Magnusson.
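
The 56% figure is the "common-language effect size," which converts Cohen’s d into the probability that a randomly chosen treated person outscores a randomly chosen control, assuming roughly normal outcomes with equal variances. A quick check of the arithmetic:

```python
from math import sqrt
from statistics import NormalDist

def prob_superiority(d):
    """Common-language effect size: P(a random treated person scores better
    than a random control person), given Cohen's d and normal,
    equal-variance outcomes."""
    return NormalDist().cdf(d / sqrt(2))

print(round(prob_superiority(0.20), 2))  # 0.56
```

In other words, the intervention moves the needle barely past a coin flip.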

But let’s give a closer look to a forest plot of the studies with depressive symptoms as an outcome.

As can be seen in the figure below, each study has a horizontal line in the forest plot and most have a square box in the middle. The line represents the 95% confidence interval for the standardized mean difference between the positive psychology intervention and its control group, and the darkened square is the mean difference.

forest plot

Note that two studies, Fava (2005) and Seligman, study 2 (2006) have long lines with an arrow at the right, but no darkened squares. The arrow indicates the line for each extends beyond what is shown in the graph. The long line for each indicates wide confidence intervals and imprecision in the estimated effect. Implications? Both studies are extreme outliers with large, but imprecise estimates of effect sizes. We will soon see why.
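
The arrow-length intervals for these tiny studies follow directly from the usual large-sample variance formula for a standardized mean difference: with 10 or so participants per arm, even a large d is estimated very imprecisely. A sketch (the d values here are hypothetical, chosen only to show the scaling):

```python
from math import sqrt

def smd_ci(d, n1, n2, z=1.96):
    """Approximate 95% confidence interval for Cohen's d, using the
    standard large-sample variance: (n1+n2)/(n1*n2) + d^2 / (2*(n1+n2))."""
    var = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    half = z * sqrt(var)
    return d - half, d + half

# A 10-vs-10 trial: the interval around even d = 1.5 spans roughly 2 units
lo, hi = smd_ci(1.5, 10, 10)
# A 94-vs-99 comparison is several times more precise
lo2, hi2 = smd_ci(0.23, 94, 99)
```

This is why the forest plot’s outliers carry so little information despite their dramatic point estimates.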

There are also vertical lines in the graph. One is marked 0,00 and indicates no difference between the intervention and control group. If the line for an individual study crosses it, the difference between the intervention and control group was not significant.

Among the things to notice are:

  • Ten of the 14 effect sizes available for depressive symptoms cross the 0,00 line, indicating that the individual effect sizes were not significant.
  • The four lines that don’t cross this line, and therefore had significant effects, were Fava (2005), Hurley, Mongrain, and Seligman (2006, study 2).

Checking Table 2 for characteristics of the studies, we find that Fava compared 10 people receiving the positive psychology intervention to a control group of 10. Seligman had 11 people in the intervention group and 9 in the control group. Hurley is listed as comparing 94 people receiving the intervention to 99 controls. But I checked the actual study and these numbers represent a substantial loss of participants from the 151 intervention and 164 control participants who started the study. Hurley lost 39% of participants by the Time 2 assessment and analyzed only completers, without intent-to-treat analyses or imputation (which would have been inappropriate anyway because of the high proportion of missing data).

I cannot make sense of Mongrain’s studies being counted as positive. A check with Table 1 indicates that 4 studies with Mongrain as an author were somehow combined. Yet, when I looked them up, one study reports no significant differences between intervention and control conditions for depression, with the authors explicitly indicating that they failed to replicate Seligman et al. (2006). A second study reports

In terms of depressive symptoms, no significant effects were found for time or time x condition. Thus, participant reports of depressive symptoms did not change significantly over time, or over time as a function of the condition that they were assigned to.

A third study reported significant effects for completers, but nonsignificant effects in multilevel modeling analyses that attempted to compensate for attrition. The fourth study again failed to find that the decline in depressive symptoms over time was a function of the group to which participants were assigned, in multilevel analyses attempting to compensate for attrition.

So, Mongrain’s studies should not be counted as having a positive effect size for depressive symptoms unless perhaps we accept a biased completer analysis over multilevel modeling. We are left with Fava and Seligman’s quite small studies and Hurley’s study relying on completer analyses without adjustment for substantial attrition.

By the authors’ ratings, the quality of these studies was poor. Fava and Seligman both scored 1 out of 6 in the quality assessments. Hurley scored 2. Mongrain scored 4, and the other negative studies had a mean score of 2.6. So, any claim from individual studies that positive psychology interventions have an effect on depressive symptoms depends on two grossly underpowered studies and another study with analysis of only completers in the face of substantial attrition. And the positive studies tend to be of lower quality.

But the literature concerning positive psychology interventions is worse than it first looks.

The authors’ quality ratings are too liberal.

  • Item 3, Baseline comparability of groups at the beginning of the study, is essential if effect sizes are to be meaningful. But it becomes meaningless if such grossly underpowered studies are included. For instance, it would take a large difference in baseline characteristics of Fava’s 8 intervention versus 8 control participants to reach significance. That there were no significant differences in baseline characteristics provides only weak assurance that individual or combined baseline characteristics did not account for any differences that were observed.
  • Item 4, Whether there was an adequate power analysis or at least 50 participants in the analysis, can be met in either of two ways. But we have no evidence that the power analyses were conducted prior to the trials, and having at least 50 participants does not reduce bias if there is substantial attrition.
  • Item 5, Completeness of follow-up data: clear attrition analysis and loss to follow-up < 50%, allows studies with substantial loss to follow-up to score positive. Hurley’s loss of over a third of the participants who were randomized rules out generalization of results back to the original sample, much less an effect size that can be integrated with those from studies that did not lose so many participants.

The authors of this meta-analysis chose to “adapt,” rather than simply accept, the validated Cochrane Collaboration risk of bias assessment. Seen here, one Cochrane criterion is whether the randomization procedure is described in sufficient detail to decide that the intervention and control groups would be comparable except for group assignment. These studies typically did not provide sufficient details of any care having been taken to ensure this, or any details whatsoever beyond the statement that the study was randomized.

Another criterion is whether there is evidence of selective outcome reporting. I would not score any of these studies as demonstrating that all outcomes were reported. The issue is that authors can assess participants with a battery of psychological measures, and then pick those that differed significantly between groups to be highlighted.

The Cochrane Collaboration includes a final criterion, “other sources of bias.” In doing meta-analyses of psychological intervention studies, considering investigator allegiance is crucial because the intervention for which the investigator is rooting almost always does better. My group’s agitation about financial conflicts of interest won us the Bill Silverman award from the Cochrane Collaboration. The collaboration is now revising its other-sources-of-bias criterion so that conflicts of interest are taken into account. Some authors of articles about positive psychology interventions profit immensely from marketing positive psychology merchandise. I am not aware of any of the studies included in the meta-analysis having disclosures of conflict of interest.

If you think I am being particularly harsh in my evaluation of positive psychology interventions, you need only consult my numerous other blog posts about meta-analyses and see the consistency with which I apply standards. And I have not even gotten to my pet peeves in evaluating intervention research – overly small cell sizes and “control groups” that are not clear on what is being controlled.

The number of participants in some of these studies is so small that the intended effects of randomization cannot be assured and any positive findings are likely to be false positives. If the number of participants in either the intervention or control group is less than 35, there is less than a 50% probability of detecting a moderate-sized positive effect, even if it is actually there. Put differently, there is a more than 50% probability that any significant finding will be a false positive. Inclusion of studies with so few participants undermines the validity of the other quality ratings. We cannot tell why Fava or Seligman did not have one more or one fewer participant. These are grossly underpowered studies, and adding or dropping a single participant in either group could substantially change results.
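The arithmetic behind that 35-participant figure can be checked with a back-of-the-envelope power calculation. This is only a sketch using the normal approximation to the two-sample t-test (a dedicated power package would do the exact noncentral-t computation), with “moderate” taken as Cohen’s d = 0.5:

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(n_per_group, d, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison for a
    standardized effect size d, via the normal approximation."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)  # critical z for two-sided alpha
    return nd.cdf(d * sqrt(n_per_group / 2) - z_crit)


# A moderate effect (d = 0.5) with 35 per group: power barely above a coin flip.
print(round(two_sample_power(35, 0.5), 2))  # ~0.55
# With 10 per group, as in the smallest trials discussed here: power ~0.20.
print(round(two_sample_power(10, 0.5), 2))
```

With cells of 8–11 participants, a real moderate effect would be detected only about one time in five, so any “significant” result in such a trial deserves deep suspicion.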

Then there is the question of control groups. While some studies simply indicate a waitlist, others had an undefined treatment as usual or no treatment, and a number of others indicate “placebo,” apparently following Seligman et al. (2005):

Placebo control exercise: Early memories. Participants were asked to write about their early memories every night for one week.

As Mongrain correctly noted, this is not a “placebo.” Seligman et al. and the studies modeled after it failed to include any elements of positive expectation, support, or attention that are typically provided in conditions labeled “placebo.” Mongrain and her colleagues attempted to provide such elements in their control condition, and perhaps this contributed to their negative findings.

A revised conclusion for this meta-analysis

Instead of the wimpy conclusion the authors presented in their abstract, I would suggest acknowledging that

The existing literature does not provide robust support for the efficacy of positive psychology interventions for depressive symptoms. The absence of evidence is not necessarily evidence of an absence of an effect. However, more definitive conclusions await better quality studies with adequate sample sizes and suitable control of possible risk of bias. Widespread dissemination of positive psychology interventions, particularly with glowing endorsements and strong claims of changing lives, is premature in the absence of evidence they are effective.

Can the positive psychology intervention literature be saved from itself?

Studies of positive psychology interventions are conducted, published, and evaluated in a gated community where vigorous peer review is neither sought nor apparently effective in identifying and correcting major flaws in manuscripts before they are published. Many within the positive psychology movement find this supportive environment an asset, but it has failed to produce a quality literature demonstrating that positive interventions can indeed contribute to human well-being. Positive psychology intervention research has been insulated from widely accepted standards for doing intervention research. There is little evidence that any of the manuscripts reporting these studies were submitted with completed CONSORT checklists, which are now required by most journals. There is little evidence of awareness of Cochrane risk of bias assessment or of steps being taken to reduce bias.

In what other area of intervention research are claims for effectiveness so dependent on such small studies of such low methodological quality published in journals in which there is only limited independent peer review and such strong confirmatory bias?

As seen on its Friends of Positive Psychology listserv, the positive psychology community is averse to criticism, even constructive criticism from within its ranks. There is dictatorial one-person rule on the listserv. Dissenters routinely vanish without any due process or notice to the rest of the listserv community, much like disappearances under a Latin American dictatorship.

There are many in the positive psychology movement who feel that the purpose of positive psychology research is to uphold the tenets of the movement and to show, not test, the effectiveness of its interventions for changing lives. Investigators who want to evaluate positive psychology interventions need to venture beyond the safety and support of the Journal of Positive Psychology and the Journal of Happiness Studies to seek independent peer review, informed by widely accepted standards for evaluating psychological interventions.

Salvaging psychotherapy research: a manifesto

“Everybody has won, and all must have prizes.” Chapter 3 of Lewis Carroll’s Alice’s Adventures in Wonderland

NOTE: Additional documentation and supplementary links and commentary are available at What We Need to Do to Redeem Psychotherapy Research.

Fueling Change in Psychotherapy Research with Greater Scrutiny and Public Accountability

John Ioannidis’s declarations that most positive findings are false and that most breakthrough discoveries are exaggerated or fail to replicate apply as much to psychotherapy as they do to biomedicine.

We should take a few tips from Ben Goldacre’s Bad Pharma and clean up the psychotherapy literature, paralleling what is being accomplished with pharmaceutical trials. Sure, much remains to be done to ensure the quality and transparency of drug studies and to get all of the data into public view. But the psychotherapy literature lags far behind and is far less reliable than the pharmaceutical literature.

As it now stands, the psychotherapy literature does not provide a dependable guide to policy makers, clinicians, and consumers attempting to assess the relative costs and benefits of choosing a particular therapy over others. If such stakeholders uncritically depend upon the psychotherapy literature to evaluate the evidence-supported status of treatments, they will be confused or misled.

Psychotherapy research is scandalously bad.

Many RCTs are underpowered, yet consistently obtain positive results by redefining the primary outcomes after results are known. The typical RCT is a small, methodologically flawed study conducted by investigators with strong allegiances to one of the treatments being evaluated. Which treatment is preferred by investigators is a better predictor of the outcome of the trial than the specific treatment being evaluated.

Many positive findings are created by spinning a combination of confirmatory bias; flexible rules of design, data analysis, and reporting; and significance chasing.

Many studies considered positive, including those that become highly cited, are basically null trials for which results for the primary outcome are ignored and post-hoc analyses of secondary outcomes and subgroup analyses are emphasized. Spin starts in abstracts, and the results reported there are almost always positive.

The bulk of psychotherapy RCTs involve comparisons between a single active treatment and an inactive or neutral control group such as a wait list, no treatment, or “routine care,” which is typically left undefined and in which exposure to treatment of adequate quality and intensity is not assured. At best these studies can tell us whether a treatment is better than doing nothing at all, or than patients expecting treatment because they enrolled in a trial and not getting it (nocebo).


Meta-analyses of psychotherapy often do not qualify conclusions by grade of evidence, ignore clinical and statistical heterogeneity, inadequately address investigator allegiance, downplay the domination by small trials with statistically improbable rates of positive findings, and ignore the extent to which positive effect sizes occur mainly in comparisons between active and inactive treatments.

Meta-analyses of psychotherapies are strongly biased toward concluding that treatments work, especially when conducted by those who have undeclared conflicts of interest, including developers and promoters of treatments that stand to gain financially from their branding as “evidence-supported.”

Overall, meta-analyses depend too heavily on underpowered, flawed studies conducted by investigators with strong allegiances to a particular treatment or to finding that psychotherapy is in general efficacious. When controls are introduced for risk of bias or investigator allegiance, effects greatly diminish or even disappear.

Conflicts of interest associated with authors having substantial financial benefits at stake are rarely disclosed in the studies that are reviewed or the meta-analyses themselves.

Designations of Treatments as Evidence-Supported

There are low thresholds for professional groups such as the American Psychological Association Division 12 or governmental organizations such as the US Substance Abuse and Mental Health Services Administration (SAMHSA) declaring treatments to be “evidence-supported.” Seldom are any treatments deemed ineffective or harmful by these groups.

Professional groups have conflicts of interest in wanting their members to be able to claim the treatments they practice are evidence-supported, while not wanting to restrict practitioner choice with labels of treatment as ineffective. Other sources of evaluation like SAMHSA depend heavily and uncritically on what promoters of particular psychotherapies submit in applications for “evidence supported status.”

“Everybody has won, and all must have prizes.” Chapter 3 of Lewis Carroll’s Alice’s Adventures in Wonderland

The possibility that there are no consistent differences among standardized, credible treatments across clinical problems is routinely ridiculed as the “dodo bird verdict” and rejected without systematic consideration of the literature for particular clinical problems. Yes, some studies find differences between two active, credible treatments in the absence of clear investigator allegiance, but these are unusual.

The Scam of Continuing Education Credit

Requirements that therapists obtain continuing education credit are intended to protect consumers from outdated, ineffective treatments. But there is inadequate oversight of the scientific quality of what is offered. Bogus treatments are promoted with pseudoscientific claims. Organizations like the American Psychological Association (APA) prohibit groups of their members from making statements protesting the quality of what is being offered, and APA continues to allow CE credit for bogus and unproven treatments like thought field therapy and somatic experiencing.

Providing opportunities for continuing education credit is a lucrative business for both accrediting agencies and sponsors. In the competitive world of workshops and trainings, entertainment value trumps evidence. Training in delivery of manualized evidence-supported treatments has little appeal when alternative trainings emphasize patient testimonials and dramatic displays of sudden therapeutic gain in carefully edited videotapes, often with actors rather than actual patients.

Branding treatments as evidence supported is used to advertise workshops and trainings in which the particular crowd-pleasing interventions that are presented are not evidence supported.

Those who attend Acceptance and Commitment Therapy (ACT) workshops may see videotapes in which the presenter cries with patients, recalling his own childhood. They should ask themselves: “Entertaining, moving perhaps, but is this an evidence-supported technique?”

Psychotherapies with some support from evidence are advocated for conditions for which there is no evidence for their efficacy. What would be disallowed as “off label applications” for pharmaceuticals is routinely accepted in psychotherapy workshops.

We Know We Can Do Better

Psychotherapy research has achieved considerable sophistication in design, analyses, and strategies to compensate for missing data and elucidate mechanisms of change.

Psychotherapy research lags behind pharmaceutical research, but nonetheless has CONSORT recommendations and requirements for trial preregistration, including specification of primary outcomes; completion of CONSORT checklists to ensure basic details of trials are reported; and preregistration of meta-analyses and systematic reviews at sites like PROSPERO, as well as completion of the PRISMA checklist for adequacy of reporting of meta-analyses and systematic reviews.

Declarations of conflicts of interest are rare, and exposure of authors who routinely fail to disclose conflicts of interest is even rarer.

Departures from preregistered protocols in published reports of RCTs are common, and there is little checking of abstracts for discrepancies from the results that were actually obtained or promised in preregistration. There is inconsistent and incomplete adherence to these requirements. There is little likelihood that noncompliant authors will be held accountable, and a high incentive to report positive findings if a study is to be published in a prestigious journal such as the APA’s Journal of Consulting and Clinical Psychology (JCCP). Examining the abstracts of papers published in JCCP gives the impression that trials are almost always positive, even when seriously underpowered.

Psychotherapy research is conducted and evaluated within a club, a mutual admiration society in which members are careful not to disparage others’ results or enforce standards that they themselves might want relaxed when it comes to publishing their own research. There are rivalries between tribes like psychodynamic therapy and cognitive behavior therapy, but within the tribes there is suppression of criticism and a strenuous effort to create the appearance that members only do what works.

Reform from Without

Journals and their editors have often resisted changes such as adoption of CONSORT, structured abstracts, and preregistration of trials. The Communications and Publications Board of the American Psychological Association made APA one of the last major holdout publishers to endorse CONSORT, and initially provided an escape clause that CONSORT applied only to articles explicitly labeled as randomized trials. The board also blocked a push by the Editor of Health Psychology for structured abstracts that reliably reported the details needed to evaluate what had actually been done in trials and the results that were obtained. In both instances, the board was most concerned about the implications for the major outlet for clinical trials among its journals, the Journal of Consulting and Clinical Psychology.

Although generally not an outlet for psychotherapy trials, the journals of the Association for Psychological Science (APS) show signs of being even worse offenders in terms of ignoring standards and commitment to confirmatory bias. For instance, it takes a reader a great deal of probing to discover that a high-profile paper by Barbara Fredrickson in Psychological Science was actually a randomized trial, and further detective work to discover that it was a null trial. There is no sign that a CONSORT checklist was ever filed for the study. And despite Fredrickson using the spun Psychological Science trial report to promote her workshops, no conflict of interest was declared.

The new APS journal Clinical Psychological Science shows signs of even more selective publication and confirmatory bias than the APA journals, producing newsworthy articles to the exclusion of null and modest findings. There will undoubtedly be a struggle between APS and APA clinical journals for top position in a hierarchy that publishes only papers that are attention-grabbing, even if flawed, while leaving the publishing of negative trials and failed replications to journals considered less prestigious.

If there is to be reform, pressures must come from outside the field of psychotherapy, from those without vested interest in promoting particular treatments or the treatments offered by members of professional organizations. Pressures must come from skeptical external review by consumers and policymakers equipped to understand the games that psychotherapy researchers play in creating the appearance that all treatments work, but the dodo bird is dead.

Specific journals are reluctant to publish criticism of their publishing practices. If we cannot at first gain publication of our concerns in the offending journals, we can rely on blogs and Twitter to call out editors and demand explanations of lapses in peer review and the upholding of quality.

We need to raise stakeholders’ levels of skepticism, disseminate critical appraisal skills widely, and provide for their application in evaluating exaggerated claims and methodological flaws in articles published in prestigious, high-impact journals. Bad science in the evaluation of psychotherapy must be recognized as the current norm, not an anomaly.

We could get far by enforcing rules that we already have.

We need to continually expose journals’ failures to enforce rules about preregistration, disclosure of conflicts of interest, and discrepancies between published clinical trials and their preregistration.

There are too many blatant examples of investigators failing to deliver what they promised in the preregistration, registering after trials have started to accrue patients, and reviewers apparently not ever checking if the primary outcomes and analyses promised in trial registration are actually delivered.

Editors should

  • Require an explicit statement of whether the trial has been registered and where.
  • Insist that reviewers consult trial registration, including modifications, and comment on any deviation.
  • Explicitly label registration dated after patient accrual has started.

CONSORT for abstracts should be disseminated and enforced. A lot of hype and misrepresentation in the media starts with authors’ own spin in the abstract. Editors should insist that main analyses for the preregistered primary outcome be presented in the abstract and highlighted in any interpretation of results.

No more should underpowered exploratory pilot feasibility studies be passed off as RCTs when they achieve positive results. An orderly sequence of treatment development should occur before conducting what are essentially Phase 3 randomized trials.

Here as elsewhere in reforming psychotherapy research, there is something to be learned from drug trials. A process of intervention development that establishes the feasibility and basic parameters of clinical trials needs to precede Phase 3 randomized trials, but such preliminary studies cannot be expected to serve as Phase 3 trials or to provide effect sizes for the purposes of demonstrating efficacy or comparison to other treatments.

Use of wait list, no treatment, and ill-defined routine care as control groups should be discouraged. For clinical conditions for which there are well-established treatments, head-to-head comparisons should be conducted, as well as comparisons with control groups that might elucidate mechanism. A key example of the latter would be structured, supportive therapy that controls for attention and positive expectation. There is little to be gained by further accumulation of studies in which the efficacy of the preferred treatment is assured by comparison to a lame control group that lacks any conceivable element of effective care.

Evaluations of treatment effects should take into account the prior probabilities suggested by the larger literature on comparisons between two active, credible treatments. The well-studied depression treatment literature suggests some parameters: effect sizes associated with a treatment are greatly reduced when comparisons are restricted to credible, active treatments and better-quality studies, and when controls are introduced for investigator allegiance. It is unlikely that initial claims about a breakthrough treatment exceeding the efficacy of existing treatments will be sustained in larger studies conducted by investigators independent of developers and promoters.

Disclosure of conflict of interest should be enforced and nondisclosure identified in correction statements and further penalized. Investigator allegiance should be considered in assessing risk of bias.

Developers of treatments and persons with significant financial gain from a treatment being declared “evidence-supported” should be discouraged from conducting meta-analyses of their own treatments.

Trials should be conducted with sample sizes adequate to detect at least moderate effects. When positive findings from underpowered studies are published, readers should scrutinize the literature for similarly underpowered trials that achieve similarly positive effects.

Meta-analyses of psychotherapy should incorporate techniques for detecting p-hacking, to evaluate whether the pattern of significant findings exceeds what the studies’ power makes probable.
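One simple version of such a check is an excess-significance test in the spirit of Ioannidis and Trikalinos: given the included studies’ estimated power, how improbable is the observed count of “significant” results? The sketch below assumes a single common power estimate for all studies, which a real implementation would estimate study by study:

```python
from math import comb


def excess_significance_p(n_studies, n_significant, est_power):
    """Binomial tail probability of observing at least n_significant
    'positive' results out of n_studies if each study's true power is
    est_power. A small value suggests the literature contains more
    significant findings than its power can plausibly produce."""
    return sum(
        comb(n_studies, k) * est_power**k * (1 - est_power) ** (n_studies - k)
        for k in range(n_significant, n_studies + 1)
    )


# Ten trials with ~20% power each should not yield eight significant
# results; the tail probability is well under .001.
print(excess_significance_p(10, 8, 0.2) < 0.001)  # True
```

A literature dominated by tiny trials that nearly always report significant effects fails this kind of test badly, which is exactly the pattern the meta-analysis above should have flagged.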

Adverse events and harms should routinely be reported, including lost opportunity costs such as failure to obtain more effective treatment.

We need to shift the culture of doing and reporting psychotherapy research: away from praising exaggerated claims about treatments and faux evidence generated to promote opportunities for therapists and their professional organizations, and toward making robust, sustainable, even if more modest, claims and calling out hype and hokum in ways that preserve the credibility of psychotherapy.


The alternative is to continue protecting psychotherapy research from stringent criticism and enforcement of standards for conducting and reporting research. We can simply allow the branding of psychotherapies as “evidence supported” to fall into appropriate disrepute.