Delusional? Trial in Lancet Psychiatry claims brief CBT reduces paranoid delusions

In this issue of Mind the Brain, I demonstrate a quick assessment of the conduct and reporting of a clinical trial. The authors claimed in Lancet Psychiatry a “first ever” in targeting “worries” with brief cognitive therapy as a way of reducing persistent persecutory delusions in psychotic persons. A Guardian article written by the first author claims effects were equivalent to what is obtained with antipsychotic medication. Lancet Psychiatry allowed the authors a sidebar to their article presenting glowing testimonials of 3 patients making extraordinary gains. Oxford University lent its branding* to the first author’s workshop promoted with a video announcing a status of “evidence-based” for the treatment.

There is much claiming to be new here. Is it a breakthrough in treatment of psychosis and in standards for reporting a clinical trial? Or is what is new not praiseworthy?

I identify the kinds of things that I sought in first evaluating the Lancet Psychiatry article and what additional information needed to be consulted to assess the contribution to the field and relevance to practice.

The article is available open access.

Its publication was coordinated with the first author’s extraordinarily self-promotional article in The Guardian.

The Guardian article makes the claim that

benefits were what scientists call “moderate” – not a magic bullet, but with meaningful effects nonetheless – and are comparable with what’s seen with many anti-psychotic medications.

The advertisement for the workshop is here


The Lancet Psychiatry article also cites the author’s self-help book for lay persons. There was no conflict of interest declared.

Probing the article’s Introduction

Reports of clinical trials should be grounded in a systematic review of the existing literature. This allows readers to place the study in the context of existing research and the unsolved clinical and research problems the literature poses. This background prepares the reader to evaluate the contribution the particular trial can make.

Just by examining the references for the introduction, we can find signs of a very skewed presentation.

The introduction cites 13 articles, 10 of which are written by the author and an eleventh is written by a close associate. The remaining 2 citations are more generic, to a book and an article about causality.

Either the author is at the world center of this kind of research or seriously deficient in his attention to the larger body of evidence. At the outset, the author announces a bold reconceptualization of the role of worry in causing psychotic symptoms:

Worry is an expectation of the worst happening. It consists of repeated negative thoughts about potential adverse outcomes, and is a psychological component of anxiety. Worry brings implausible ideas to mind, keeps them there, and increases the level of distress. Therefore we have postulated that worry is a causal factor in the development and maintenance of persecutory delusions, and have tested this theory in several studies.

This is controversial, to say the least. The everyday experience of worrying is being linked to persecutory delusions. A simple continuum seems to be proposed – people can start off with everyday worrying and end up with a psychotic delusion and twenty years of receiving psychiatric services. Isn’t this too simplistic or just plain wrong?

Has no one but the author done relevant work or even reacted to the author’s work? The citations provided in the introduction suggest the author’s work is all we need in order to interpret this study in the larger context of what is known about psychotic persecutory delusions.

Contrast my assessment with the author’s own:

Panel 2: Research in context
Systematic review We searched the ISRCTN trial registry and the PubMed database with the search terms “worry”, “delusions”, “persecutory”, “paranoia”, and “schizophrenia” without date restrictions, for English-language publications of randomised controlled trials investigating the treatment of worry in patients with persecutory delusions. Other than our pilot investigation [12] there were no other such clinical trials in the medical literature. We also examined published meta-analyses on standard cognitive behavioural therapy (CBT) for persistent delusions or hallucinations, or both.

The problem is that “worry” is a nonspecific colloquial term, not a widely used scientific one. For the author to require that studies have “worry” as a keyword in order to be retrieved is a silly restriction.

I welcome readers to redo the PubMed search dropping this term. Next replace “worry” with “anxiety.” Furthermore, the author makes unsubstantiated assumptions about a causal role for worry/anxiety in development of delusions. Drop the “randomized controlled trial” restriction from the PubMed search and you find a large relevant literature. Persons with schizophrenia and persecutory delusions are widely acknowledged to be anxious. But you won’t find much suggestion in this literature that the anxiety is causal or that people progress from worrying about something to developing schizophrenia and persecutory delusions. This seems a radical version gone wild of the idea that normal and psychotic experiences are on a continuum, concocted with a careful avoidance of contrary evidence.

Critical appraisal of clinical trials often skips examination of whether the background literature cited to justify the study is accurate and balanced. I think this brief foray has demonstrated that it can be important in establishing whether an investigator is claiming false authority for a view with cherry picking and selective attention to the literature.

Basic design of the study

The 150 patients randomized in this study are around 40 years old. Half of the sample has been in psychiatric services for 11 or more years, with 29% of the patients in the intervention group and 19% in the control group receiving services for more than 20 years. The article notes in passing that all patients were prescribed antipsychotic medication at the outset of the study except 1 in the intervention group and 9 in the control group – 1:9? It is puzzling how such differences emerged if randomization was successful in controlling for baseline differences. Maybe it demonstrates the limitations of block randomization.
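One rough way to gauge how surprising a 1 versus 9 split would be under successful randomization is a Fisher exact test. The sketch below assumes arm sizes of 73 and 77: 73 is the number assigned to the intervention given in the patient-comments panel discussed later, and 77 is inferred as 150 minus 73. The function names are illustrative and do not come from the trial report.

```python
from math import comb

# Assumed arm sizes: 73 intervention (from the patient-comments panel),
# 77 control (inferred as 150 - 73; not stated directly in this post).
N_INTERVENTION, N_CONTROL = 73, 77
UNMEDICATED_INTERVENTION, UNMEDICATED_CONTROL = 1, 9


def hypergeom_pmf(k, total, marked, drawn):
    """P(exactly k marked items among `drawn` items taken without replacement)."""
    return comb(marked, k) * comb(total - marked, drawn - k) / comb(total, drawn)


def fisher_exact_two_sided(k_obs, total, marked, drawn):
    """Sum the probabilities of all splits no more likely than the observed one."""
    lowest = max(0, drawn - (total - marked))
    highest = min(marked, drawn)
    p_obs = hypergeom_pmf(k_obs, total, marked, drawn)
    probs = (hypergeom_pmf(k, total, marked, drawn) for k in range(lowest, highest + 1))
    # Small tolerance so floating-point ties with the observed split are counted.
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))


total = N_INTERVENTION + N_CONTROL
unmedicated = UNMEDICATED_INTERVENTION + UNMEDICATED_CONTROL
p_value = fisher_exact_two_sided(UNMEDICATED_INTERVENTION, total, unmedicated, N_INTERVENTION)
print(f"Two-sided Fisher exact p = {p_value:.3f}")
```

Chance imbalance on one of many baseline variables is always possible, so a result like this is a prompt for questions, not proof that randomization failed.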

The intervention is decidedly low intensity for what is presumably a long-standing symptom in a chronically psychotic population.

We aimed to provide the CBT worry-reduction intervention in six sessions over 8 weeks. Each session lasted roughly an hour and took place in NHS clinics or at patients’ homes.

The six sessions were organized around booklets shared by the patient and therapist.

The main techniques were psychoeducation about worry, identification and reviewing of positive and negative beliefs about worry, increasing awareness of the initiation of worry and individual triggers, use of worry periods, planning activity at times of worry (which could include relaxation), and learning to let go of worry.

Patients were expected to practice exercises from the author’s self-help book for lay persons.

The two main practical techniques to reduce worry were then introduced: the use of worry periods (confining worry to about a 20 minute set period each day) and planning of activities at peak worry times. Worry periods were implemented flexibly. For example, most patients set up one worry period a day, but they could choose to have two worry periods a day or, in severe instances, patients instead aimed for a worry-free period. Ideally, the worry period was then substituted with a problem-solving period.

Compared to what?

The treatment of the control group was ill-defined routine care “delivered according to national and local service protocols and guidelines.” Readers are not told how much treatment the patients received or whether their care was actually congruent with these guidelines. Routine care of mental health patients in the community is notoriously deficient. That over half of these patients had been in services for more than a decade suggests that treatment for many of them had tapered off and was being delivered with no expectation of improvement.

To accept this study as an evaluation of the author’s therapy approach, we need to know how much in the way of other treatment was received by patients in both the intervention and control group. Were patients in the routine care condition, as I suspect, largely being ignored? The intervention group got 6 sessions of therapy over 8 weeks. Is that a substantial increase in psychotherapy or even in time to talk with a professional over what they would otherwise receive? Did being assigned to the intervention also increase patients’ other contact with mental health services? If the intervention therapists heard that a patient was having problems with medication or had serious unmet medical needs, how did they respond?

The authors report collecting data concerning receipt of services with the Client Service Receipt Inventory, but nowhere is that reported.

Most basically, we don’t know what elements the comparison/control group controlled. We have no reason to presume that the amount of contact time and basic relationship with a treatment provider was controlled.

As I have argued before, it is inappropriate and arguably unethical to use ill-defined routine care or treatment-as-usual in the evaluation of a psychological intervention. We cannot tell if any apparent benefits to patients having been assigned to the intervention are due to correcting the inadequacies of routine care, including its missing of basic elements of support, attention, and encouragement. We therefore cannot tell if there are effective elements to the intervention other than these nonspecific factors.

We cannot tell if any positive results of this trial should encourage dissemination and implementation of the intervention, or only improvement of the likely deficiencies in the treatment received by patients in long term psychiatric care.

In terms of quickly evaluating articles reporting clinical trials, we see that simply asking “compared to what” and jumping to the comparison/control condition revealed a lot of deficiencies at the outset in what this trial could reveal.

Measuring outcomes

Two primary outcomes were declared – changes in the Penn State Worry Questionnaire and the Psychotic Symptoms Rating Scale- Delusion (PSYRATS-delusion) subscale. The authors use multivariate statistical techniques to determine whether patients assigned to the intervention group improved more on either of these measures, and whether specifically reduction in worry caused reductions in persecutory delusions.

Understand what is at stake here: the authors are trying to convince us that this is a groundbreaking study that shows that reducing worry with a brief intervention reduces long standing persecutory delusions.

The authors lose substantial credibility if we look closely at their primary measures, including their items, not just the scale names.

The Penn State Worry Questionnaire (PSWQ) is a 16 item questionnaire widely used with college student, community and clinical samples. Items include

When I am under pressure I worry a lot.

I am always worrying about something.

And reverse direction items scored so greater endorsement indicates less worrying –

I do not tend to worry about things.

I never worry about anything.

I know, how many times does basically the same question have to be asked?

The questionnaire is meant to be general. It focuses on a single complaint that could be a symptom of anxiety. While the questionnaire could be used to screen for anxiety disorders, it does not provide a diagnosis of a mental disorder, which requires other symptoms be present. Actually, worry is only one of three components of anxiety. The others are physiological – like racing heart, sweating, or trembling – and behavioral – like avoidance or procrastination.

But “worry” is also a feature of depressed mood. Another literature discusses “worry” as “rumination.” We should not be surprised to find this questionnaire functions reasonably well as a screen for depression.

But past research has shown that even in nonclinical populations, using a cutpoint to designate high versus low worriers results in unstable classification. Without formal intervention, many of those who are “high” become  “low” over time.

In order to be included in this study, patients had to have a minimum score of 44 on the PSWQ. If we skip to the results of the study we find that the patients in the intervention group dropped from 64.8 to 56.1 and those receiving only routine care dropped from 64.5 to 59.8. The average patient in either group would have still qualified for inclusion in the study at the end of follow up.
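The arithmetic behind that comparison can be laid out explicitly. The sketch below uses only the group means and the entry cutoff quoted above; the variable names are illustrative.

```python
# PSWQ group means as reported in the post: (baseline, end of follow-up).
PSWQ_CUTOFF = 44  # minimum score required to enter the study

pswq_means = {
    "intervention": (64.8, 56.1),
    "routine care": (64.5, 59.8),
}

# Change within each arm (negative = less worry).
changes = {
    group: round(after - before, 1)
    for group, (before, after) in pswq_means.items()
}

# Difference between the arms in how much they changed.
net_difference = round(changes["intervention"] - changes["routine care"], 1)

# Would the average patient in each arm still meet the entry criterion?
still_eligible = {
    group: after >= PSWQ_CUTOFF
    for group, (_, after) in pswq_means.items()
}

for group in pswq_means:
    print(f"{group}: change {changes[group]:+.1f}, still eligible: {still_eligible[group]}")
print(f"net difference between arms: {net_difference:+.1f}")
```

The between-arm difference in change is the quantity to weigh against the 16 to 80 range of the questionnaire, not the intervention arm's drop taken alone.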

The second outcome measure, the Psychotic Symptoms Rating Scale- Delusion subscale has six items: duration and frequency of preoccupation; intensity of distress; amount of distressing content; conviction and disruption. Each item is scored 0-4, with 0 = no problem and 4 = maximum severity.

The items are so diverse that interpretation of a change in the context of an intervention trial targeting worry becomes difficult. Technically speaking, the lack of comparability among items is so great that the measure cannot be considered an interval scale for which conventional parametric statistics could be used. We cannot reasonably assume that a change in one item is equivalent to a change in another.

It would seem, for instance, that amount of preoccupation with delusions and amount and intensity of distress are very different matters. The intervention group changed from a mean of 18.7 on a scale with a possible score of 24 to 13.6 at 24 weeks; the control group from 18.0 to 16.4. This change could simply represent reduction in the amount and intensity of distress, not in patients’ preoccupation with the delusions, their conviction that the delusions are true, or the disruption in their lives. Overall, the PSYRATS-delusion subscale is not a satisfactory measure on which to make strong claims about reducing worry reducing delusions. The measure is too contaminated with content similar to the worries questionnaire. We might only be finding “changes in worries result in changes in worries.”
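To see why a total score hides what actually changed, consider a toy example. The item scores below are invented purely for illustration; the trial report gives only totals. Two hypothetical patients start from the same profile and show the same drop in total score, yet in one of them only the distress items moved.

```python
# Six PSYRATS-delusion items, each scored 0-4 (item names from the text above).
ITEMS = [
    "duration of preoccupation",
    "frequency of preoccupation",
    "intensity of distress",
    "amount of distressing content",
    "conviction",
    "disruption",
]

# HYPOTHETICAL item profiles, invented for illustration only.
baseline = [3, 3, 4, 4, 3, 2]
distress_only = [3, 3, 2, 1, 3, 2]  # only the two distress items improve
broad_change = [2, 2, 3, 3, 3, 1]   # small improvements spread across items


def total(scores):
    """Sum of item scores after checking that each is in the 0-4 range."""
    assert len(scores) == len(ITEMS) and all(0 <= s <= 4 for s in scores)
    return sum(scores)


def changed_items(before, after):
    """Names of the items whose score differs between two assessments."""
    return [name for name, b, a in zip(ITEMS, before, after) if a != b]


for label, follow_up in [("distress only", distress_only), ("broad change", broad_change)]:
    drop = total(baseline) - total(follow_up)
    print(f"{label}: total drops by {drop}; items changed: {changed_items(baseline, follow_up)}")
```

Identical totals, very different clinical pictures: only item-level reporting could distinguish them.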

Checking primary outcomes is important in evaluating a clinical trial, but in this case, it was crucial to examine what the measures assessed at an item content level. Too often reviewers uncritically accept the name of an instrument as indicating what it validly measures when used as an outcome measure.

The fancy multivariate analyses do not advance our understanding of what went on in the study. The complex statistical analyses might simply be demonstrating patients were less worried as seen in questionnaires and interview ratings based on what patients say when asked whether they are distressed.

My summary assessment is that a low intensity intervention is being evaluated against an ill-defined treatment as usual. The outcome measures are too nonspecific and overlapping to be helpful. We may simply be seeing effects of contact and reassurance among patients who are not getting much of either. So what?

Bring on the patient endorsements

Panel 1: Patient comments on the intervention presents glowing endorsements from 3 of the 73 patients assigned to the intervention group. The first patient describes the treatment as “extremely helpful” and as providing a “breakthrough.” The second patient describes starting treatment feeling lost and without self-confidence, but now being relaxed at times of the day that had previously been stressful. The third patient declared

“The therapy was very rewarding. There wasn’t anything I didn’t like. I needed that kind of therapy at the time because if I didn’t have that therapy at that time, I wouldn’t be here.”

Wow, but these dramatic gains seem inconsistent with the modest gains registered with the quantitative primary outcome measures. We are left guessing how these endorsements were elicited – were they obtained in a context where patients were expected to express gratitude for the extra attention they received? – and the criteria by which the particular quotes were selected from what is presumably a larger pool.

Think of the outcry if Lancet Psychiatry extended this innovation in the reporting of clinical trials to evaluations of medications by their developers. If such side panels are going to be retained in the future in the reporting of a clinical trial, maybe it would be best that they be marked “advertisement” and accompanied by a declaration of conflict of interest.

A missed opportunity to put the authors’ intervention to a fair test

In the Discussion section the authors state

although we think it highly unlikely that befriending or supportive counselling [sic] would have such persistent effects on worry and delusions, this possibility will have to be tested specifically in this group.

Actually, the authors don’t have much evidence of anything but a weak effect that might well have been achieved with befriending or supportive counseling delivered by persons with less training. We should be careful of accepting claims of any clinically significant effects on delusions. At best, the authors have evidence that distress associated with delusions was reduced, and any coordination in scores between the two measures may simply reflect confounding of the two outcome measures.

It is a waste of scarce research funds, and an unethical waste of patients’ willingness to contribute to science, to compare this low intensity psychotherapy to ill-described, unquantified treatment as usual. Another low intensity treatment like befriending or supportive counseling might provide sufficient elements of attention, support, and raised expectations to achieve comparable results.

Acknowledging the Supporting Cast

In evaluating reports of clinical trials, it is often informative to look to footnotes and acknowledgments, as well as the main text. This article acknowledges Anthony Morrison as a member of the Trial Steering Committee and Douglas Turkington as a member of the Data Monitoring and Ethics Committee. Readers of Mind the Brain might recognize Morrison as first author of a Lancet trial that I critiqued for exaggerated claims and Turkington as the first author of a trial that became an internet sensation when post-publication reviewers pointed out fundamental problems in the reporting of data.  Turkington and an editor of the journal in which the report of the trial was published counterattacked.

All three of these trials involve exaggerated claims based on a comparison between CBT and an ill-defined routine care. Like the present one, Morrison’s trial failed to report the data it collected concerning receipt of services. And in an interview with Lancet, Morrison admitted to avoiding a comparison between CBT and anything but routine care out of concern that differences might not be found with any treatment providing a supportive relationship, even basic supportive counseling.

A note to funders

This project (09/160/06) was awarded by the Efficacy and Mechanism Evaluation (EME) Programme, and is funded by the UK Medical Research Council (MRC) and managed by the UK NHS National Institute for Health Research (NIHR) on behalf of the MRC-NIHR partnership.

Really, UK MRC, you are squandering scarce funds on methodologically poor, often small trials for which investigators make extravagant claims and that don’t include a comparison group allowing control for nonspecific effects. You really ought to insist on better attention to the existing literature in justifying another trial and adequate controls for amount of contact time, attention and support.

Don’t you see the strong influence of investigator allegiance dictating reporting of results consistent with the advancement of the investigators’ product?

I don’t understand why you allowed the investigator group to justify the study with such an idiosyncratic, highly selective review of the literature driven by substituting a colloquial term “worry” for more commonly used search terms.

Do you have independent review of grants by persons who are more accepting of the usual conventions of conducting and reporting trials? Or are you faced with the problems of a small group of reviewers giving out money to like-minded friends and family? Note that the German Federal Ministry of Education and Research (BMBF) has effectively dealt with inbred old boy networks by excluding Germans from the panels of experts reviewing German grants. Might you consider the same strategy in getting more serious about funding projects with some potential for improving patient care? Get with it, insist on rigor and reproducibility in what you fund.

*We should not make too much of Oxford lending its branding to this workshop. Look at the workshops to which Harvard Medical School lends its labels.

Busting foes of post-publication peer review of a psychotherapy study

As described in the last issue of Mind the Brain, peaceful post-publication peer reviewers (PPPRs) were ambushed by an author and an editor. They used the usual home team advantages that journals have – they had the last word in an exchange that was not peer-reviewed.

As also promised, I will team up in this issue with Magneto to bust them.

Attacks on PPPRs threaten a desperately needed effort to clean up the integrity of the published literature.

The attacks are getting more common and sometimes vicious. Vague threats of legal action caused an open access journal to remove an article delivering fair and balanced criticism.

In a later issue of Mind the Brain, I will describe an incident in which authors of a published paper had uploaded their data set, but then modified it without notice after PPPRs used the data for re-analyses. The authors then used the modified data for new analyses and then claimed the PPPRs were grossly mistaken. Fortunately, the PPPRs retained time-stamped copies of both data sets. You may like to think that such precautions are unnecessary, but just imagine what critics of PPPR would be saying if they had not saved this evidence.

Until journals get more supportive of post publication peer review, we need repeated vigilante actions, striking from Twitter, Facebook pages, and blogs. Unless readers acquire basic critical appraisal skills and take the time to apply them, they will have to keep turning to the social media for credible filters of all the crap that is flooding the scientific literature.

I’ve enlisted Magneto because he is a mutant. He does not have any extraordinary powers of critical appraisal. To the contrary, he unflinchingly applies what we should all acquire. As a mutant, he can apply his critical appraisal skills without the mental anguish and physiological damage that could beset humans appreciating just how bad the literature really is. He doesn’t need to maintain his faith in the scientific literature or the dubious assumption that what he is seeing is just a matter of repeat offender authors, editors, and journals making innocent mistakes.

Humans with critical appraisal skills risk demoralization and too often shrink from the task of telling it like it is. Some who used their skills too often were devastated by what they found and fled academia. More than a few are now working in California in espresso bars and escort services.

Thank you, Magneto. And yes, I again apologize for having tipped off Jim Coan about our analyses of his spinning and statistical manipulations of his work to get newsworthy findings. Sure, it was an accomplishment to get a published apology and correction from him and Susan Johnson. I am so proud of Coan’s subsequent condemnation of me on Facebook as the Deepak Chopra of Skepticism that I will display it as an endorsement on my webpage. But it was unfortunate that PPPRs had to endure his nonsensical Negative Psychology rant, especially without readers knowing what precipitated it.

The following commentary on the exchange in Journal of Nervous and Mental Disease makes direct use of your critique. I have interspersed gratuitous insults generated by Literary Genius’ Shakespearean insult generator and Reocities’ Random Insult Generator.

How could I maintain the pretense of scholarly discourse when I am dealing with an author who repeatedly violates basic conventions like ensuring tables and figures correspond to what is claimed in the abstract? Or an arrogant editor who responds so nastily when his slipups are gently brought to his attention and won’t fix the mess he is presenting to his readership?

As a mere human, I needed all the help I could get in keeping my bearings amidst such overwhelming evidence of authorial and editorial ineptness. A little Shakespeare and Monty Python helped.

The statistical editor for this journal is a saucy full-gorged apple-john.


Cognitive Behavioral Techniques for Psychosis: A Biostatistician’s Perspective

Domenic V. Cicchetti, PhD, quintessential biostatistician

Domenic V. Cicchetti, you may be, as your website claims,

 A psychological methodologist and research collaborator who has made numerous biostatistical contributions to the development of major clinical instruments in behavioral science and medicine, as well as the application of state-of-the-art techniques for assessing their psychometric properties.

But you must have been out of “the quintessential role of the research biostatistician” when you drafted your editorial. Please reread it. Anyone armed with an undergraduate education in psychology and Google Scholar can readily cut through your ridiculous pomposity, you undisciplined sliver of wild belly-button fluff.

You make it sound like the Internet PPPRs misunderstood Jacob Cohen’s designation of effect sizes as small, medium, and large. But if you read a much-accessed article that one of them wrote, you will find a clear exposition of the problems with these arbitrary distinctions. I know, it is in an open access journal, but what you say is sheer bollocks about it paying reviewers. Do you get paid by Journal of Nervous and Mental Disease? Why otherwise would you be a statistical editor for a journal with such low standards? Surely, someone who has made “numerous biostatistical contributions” has better things to do, thou dissembling swag-bellied pignut.

More importantly, you ignore that Jacob Cohen himself said

The terms ‘small’, ‘medium’, and ‘large’ are relative . . . to each other . . . the definitions are arbitrary . . . these proposed conventions were set forth throughout with much diffidence, qualifications, and invitations not to employ them if possible.

Cohen J. Statistical power analysis for the behavioural sciences. Second edition, 1988. Hillsdale, NJ: Lawrence Erlbaum Associates. p. 532.

Could it be any clearer, Dommie?


You suggest that the internet PPPRs were disrespectful of Queen Mother Kraemer in not citing her work. Have you recently read it? Ask her yourself, but she seems quite upset about the practice of using effects generated from feasibility studies to estimate what would be obtained in an adequately powered randomized trial.

Pilot studies cannot estimate the effect size with sufficient accuracy to serve as a basis of decision making as to whether a subsequent study should or should not be funded or as a basis of power computation for that study.

Okay you missed that, but how about:

A pilot study can be used to evaluate the feasibility of recruitment, randomization, retention, assessment procedures, new methods, and implementation of the novel intervention. A pilot study is not a hypothesis testing study. Safety, efficacy and effectiveness are not evaluated in a pilot. Contrary to tradition, a pilot study does not provide a meaningful effect size estimate for planning subsequent studies due to the imprecision inherent in data from small samples. Feasibility results do not necessarily generalize beyond the inclusion and exclusion criteria of the pilot design.

A pilot study is a requisite initial step in exploring a novel intervention or an innovative application of an intervention. Pilot results can inform feasibility and identify modifications needed in the design of a larger, ensuing hypothesis testing study. Investigators should be forthright in stating these objectives of a pilot study.

Dommie, although you never mention it, surely you must appreciate the difference between a within-group effect size and a between-group effect size.

  1. Interventions do not have meaningful effect sizes, between-group comparisons do.
  2. As I have previously pointed out

 When you calculate a conventional between-group effect size, it takes advantage of randomization and controls for background factors, like placebo or nonspecific effects. So, you focus on what change went on in a particular therapy, relative to what occurred in patients who didn’t receive it.
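The distinction can be made concrete with a small calculation. The scores below are hypothetical, invented for illustration and not taken from any trial; lower scores mean fewer symptoms. Both arms improve over time, so the within-group effect size for the treated arm looks far more impressive than the between-group effect size, which credits the intervention only with the improvement beyond what the control arm showed.

```python
from math import sqrt
from statistics import mean, stdev

# HYPOTHETICAL symptom scores (lower = better), invented for illustration.
treat_pre = [20, 18, 19, 21, 17, 19, 20, 18]
treat_post = [15, 14, 15, 17, 13, 14, 16, 13]
ctrl_pre = [19, 20, 18, 21, 17, 19, 20, 18]
ctrl_post = [16, 17, 15, 18, 14, 16, 17, 15]


def within_group_d(pre, post):
    """Standardized pre-post change in one arm; no comparison group involved."""
    return (mean(pre) - mean(post)) / stdev(pre)


def between_group_d(treat, ctrl):
    """Standardized difference between arms at follow-up, using the pooled SD."""
    n1, n2 = len(treat), len(ctrl)
    pooled_var = ((n1 - 1) * stdev(treat) ** 2 + (n2 - 1) * stdev(ctrl) ** 2) / (n1 + n2 - 2)
    return (mean(ctrl) - mean(treat)) / sqrt(pooled_var)


d_within_treat = within_group_d(treat_pre, treat_post)
d_within_ctrl = within_group_d(ctrl_pre, ctrl_post)
d_between = between_group_d(treat_post, ctrl_post)

print(f"within-group d, treated arm: {d_within_treat:.2f}")
print(f"within-group d, control arm: {d_within_ctrl:.2f}")
print(f"between-group d at follow-up: {d_between:.2f}")
```

The control arm's own within-group effect size is the giveaway: much of the treated arm's change would have happened anyway.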

Turkington recruited a small, convenience sample of older patients from community care who averaged over 20 years of treatment. It is likely that they were not getting much support and attention anymore, whether or not they ever were. The intervention in Turkington’s study provided that attention. Maybe some or all of any effects were due to simply compensating for what was missing from inadequate routine care. So, aside from all the other problems, anything going on in Turkington’s study could have been nonspecific.

Recall that in promoting his ideas that antidepressants are no better than acupuncture for depression, Irving Kirsch tried to pass off within-group as equivalent to between-group effect sizes, despite repeated criticisms. Similarly, long term psychodynamic psychotherapists tried to use effect sizes from wretched case series for comparison with those obtained in well conducted studies of other psychotherapies. Perhaps you should send such folks a call for papers so that they can find an outlet in Journal of Nervous and Mental Disease with you as a Special Editor in your quintessential role as biostatistician.

Douglas Turkington’s call for a debate

Professor Douglas Turkington: “The effect size that got away was this big.”

Doug, as you requested, I sent you a link to my Google Scholar list of publications. But you still did not respond to my offer to come to Newcastle and debate you. Maybe you were not impressed. Nor did you respond to Keith Laws’ repeated requests to debate. Yet you insulted internet PPPR Tim Smits with the taunt,

[Image: screenshot of the taunt]


You congealed accumulation of fresh cooking fat.

I recommend that you review the recording of the Maudsley debate. Note how the moderator Sir Robin Murray boldly announced at the beginning that the vote on the debate was rigged by your cronies.

Do you really think Laws and McKenna got their asses whipped? Then why didn’t you accept Laws’ offer to debate you at a British Psychological Society event, after he offered to pay your travel expenses?

High-Yield Cognitive Behavioral Techniques for Psychosis Delivered by Case Managers…

Dougie, we were alerted that bollocks would follow with the “high yield” of the title. Just what distinguishes this CBT approach from any other intervention to justify “high yield” except your marketing effort? Certainly, not the results you have obtained from an earlier trial, which we will get to.

Where do I begin? Can you dispute what I said to Dommie about the folly of estimating effect sizes for an adequately powered randomized trial from a pathetically small feasibility study?

I know you were looking for a convenience sample, but how did you get from Newcastle, England to rural Ohio and recruit such an unrepresentative sample of 40-year-olds with 20 years of experience with mental health services? You don’t tell us much about them, not even a breakdown of their diagnoses. But would you really expect that the routine care they were currently receiving was even adequate? Sure, why wouldn’t you expect to improve upon that with your nurses? But what would you be demonstrating?

The PPPR boys from the internet made noise about Table 2, made passing reference to the totally nude Figure 5, and noted how claims in the abstract had no apparent relationship to what was presented in the results section. And how nowhere did you provide means or standard deviations. But they did not get to Figure 2. Notice anything strange?

Despite what you claim in the abstract, none of the outcomes appear significant. Did you really mean standard errors of the mean (SEMs), not standard deviations (SDs)? The people to whom I showed the figure did not think so.

And I found this advice on the internet:

If you want to create persuasive propaganda:

If your goal is to emphasize small and unimportant differences in your data, show your error bars as SEM,  and hope that your readers think they are SD.

If your goal is to cover up large differences, show the error bars as the standard deviations for the groups, and hope that your readers think they are standard errors.
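Why the choice of error bar matters so much comes down to simple arithmetic: the standard error of the mean is the standard deviation divided by the square root of the sample size, so SEM bars are always narrower. Here is a minimal sketch in Python. The scores are made up for illustration and have nothing to do with the data in the article.

```python
import math
import statistics

def sd_and_sem(values):
    """Return (sample standard deviation, standard error of the mean)."""
    sd = statistics.stdev(values)          # spread of individual scores
    sem = sd / math.sqrt(len(values))      # uncertainty of the group mean
    return sd, sem

# Hypothetical symptom scores for one small group (not data from any trial).
scores = [12, 15, 9, 20, 14, 11, 18, 13, 16]

sd, sem = sd_and_sem(scores)
print(f"n = {len(scores)}, SD = {sd:.2f}, SEM = {sem:.2f}")
print(f"SD bars are {sd / sem:.1f} times as wide as SEM bars")
```

With nine scores, the SD bars are three times as wide as the SEM bars, so a reader who mistakes one for the other will badly misjudge how much the groups overlap.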

Why did you expect to be able to talk about effect sizes of the kind you claim you were seeking? The best meta-analysis suggests an effect size of only .17 with blind assessment of outcome. Did you expect that unblinding assessors would lead to that much more improvement? Oh yeh, you cited your own previous work in support:

That intervention improved overall symptoms, insight, and depression and had a significant benefit on negative symptoms at follow-up (Turkington et al., 2006).

Let’s look at Table 1 from Turkington et al., 2006.

A consistent spinning of results

Don’t you just love those three-digit significance levels that allow us to see that p = .099 for overall symptoms meets the apparent criterion of p < .10 in this large sample? Clever, but it doesn’t work for depression with p = .128. But you have a track record of being sloppy with tables. Maybe we should give you the benefit of the doubt and ignore the table.

But Dougie, this is not some social priming experiment with college students getting course credit. This is a study that took up the time of patients with serious mental disorder. You left some of them in the squalor of inadequate routine care after gaining their consent with the prospect that they might get more attention from nurses. And then with great carelessness, you put the data into tables that had no relationship to the claims you were making in the abstract. Or in your attempts to get more funding for future such ineptitude. If you drove your car like you write up clinical trials, you’d lose your license, if not go to jail.

The 2014 Lancet study of cognitive therapy for patients with psychosis

Forgive me for missing, until Magneto reminded me, that you were an author on the, ah, controversial paper:

Morrison, A. P., Turkington, D., Pyle, M., Spencer, H., Brabban, A., Dunn, G., … & Hutton, P. (2014). Cognitive therapy for people with schizophrenia spectrum disorders not taking antipsychotic drugs: a single-blind randomised controlled trial. The Lancet, 383(9926), 1395-1403.

But with more authors than patients remaining in the intervention group at follow up, it is easy to lose track.

You and your co-authors made some wildly inaccurate claims about having shown that cognitive therapy was as effective as antipsychotics. Why, by the end of the trial, most of the patients remaining in follow up were on antipsychotic medication. Is that how you obtained your effectiveness?

In our exchange of letters in The Lancet, you finally had to admit

We claimed the trial showed that cognitive therapy was safe and acceptable, not safe and effective.

Maybe you should similarly be retreating from your claims in the Journal of Nervous and Mental Disease article? Or just take refuge in the figures and tables being uninterpretable.

No wonder you don’t want to debate Keith Laws or me.

A retraction for High-Yield Cognitive Behavioral Techniques for Psychosis…?

The Turkington article meets the Committee on Publication Ethics (COPE) guidelines for an immediate retraction.

But neither a retraction nor even a formal expression of concern has appeared.

Maybe matters can be left as they now are. In the social media, we can point to the many problems of the article like a clogged toilet warning that Journal of Nervous and Mental Disease is not a fit place to publish – unless you are seeking exceedingly inept or nonexistent editing and peer review.




Vigilantes can periodically tweet Tripadvisor style warnings, like

toilets still not working



Now, Dommie and Dougie, before you again set upon some PPPRs just trying to do their jobs for little respect or incentive, consider what happened this time.

Special thanks are due to Magneto, but Jim Coyne has sole responsibility for the final content. It does not necessarily represent the views of PLOS blogs or other individuals or entities, human or mutant.

Cognitive behavior and psychodynamic therapy no better than routine care for anorexia.

Putting a positive spin on an ambitious, multisite trial doomed from the start.

I announced in my last blog post that this one would be about bad meta-analyses of weak data used to secure insurance reimbursement for long-term psychotherapy. But that is postponed so that I can give timely coverage to the report in Lancet of results of the Anorexia Nervosa Treatment of OutPatients (ANTOP) randomized clinical trial (RCT). The trial, proclaimed the largest ever of its kind, compared cognitive behavior therapy, focal psychodynamic therapy, and “optimized” routine care for the treatment of anorexia.

This post is an apt sequel to my last one. I had expressed a lot of enthusiasm for an RCT comparing cognitive behavior therapy (CBT) to psychoanalytic therapy for bulimia. I was impressed with its design and execution and the balanced competing investigator allegiances. The article’s reporting was transparent, substantially reducing risk of bias and allowing a clear message. You will not see me very often being so positive about a piece of research in this blog, although I did note some limitations.

Hands down, CBT did better than psychoanalytic therapy in reducing binging and purging, despite there being only five months of cognitive therapy and two years of psychoanalysis. This difference seems to be a matter of psychoanalysis doing quite poorly, and not of CBT doing so well.

However, on my Facebook wall, Ioana Cristea, a known contrarian and evidence-based skeptic like myself, posted a comment about my blog:

Did you see there’s also a recent very similar Lancet study for anorexia? With different results, of course.

She was referring to

Zipfel, Stephan, Beate Wild, Gaby Groß, Hans-Christoph Friederich, Martin Teufel, Dieter Schellberg, Katrin E. Giel et al. Focal psychodynamic therapy, cognitive behaviour therapy, and optimised treatment as usual in outpatients with anorexia nervosa (ANTOP study): randomised controlled trial. The Lancet (2013).

The abstract of the Lancet article is available here, but the full text is behind a pay wall. Fortunately, the registered trial protocol for the study is available open access here. You can at least get the details of what the authors said they were going to do, ahead of doing it.

For an exceedingly quick read, try the press release for the trial here, entitled

Largest therapy trial worldwide: Psychotherapy treats anorexia effectively.

Or an example of a thorough uncritical churnalling of this press release in the media here.

What we are told about anorexia

Media portrayals of anorexia often show the extreme self-starvation associated with the severe disorder, but this study recruited women with mild to moderate anorexia.

The introduction of the ANTOP article states

  • Anorexia nervosa is associated with serious medical morbidity and pronounced psychosocial comorbidity.
  • It has the highest mortality rate of all mental disorders, and relapse happens frequently.
  • The course of illness is very often chronic, particularly if left untreated.

A sobering accompanying editorial in Lancet stated

The evidence base for anorexia nervosa treatment is meagre [1, 2, 3] considering the extent to which this disorder erodes quality of life and takes far too many lives prematurely [4]. But clinical trials for anorexia nervosa are difficult to conduct, attributable partly to some patients’ deep ambivalence about recovery, the challenging task of offering a treatment designed to remove symptoms that patients desperately cling to, the fairly low prevalence of the disorder, and high dropout rates. The combination of high dropout and low treatment acceptability has led some researchers to suggest that we pause large-scale clinical trials for anorexia nervosa until we resolve these fundamental obstacles.

What the authors claim that this study found.

The press release states

“Overall, the two new types of therapy demonstrated advantages compared to the optimized therapy as usual,” said Prof. Zipfel. “At the end of our study, focal psychodynamic therapy proved to be the most successful method, while the specific cognitive behavior therapy resulted in more rapid weight gain.”

And the abstract

At the end of treatment, BMI [body mass index] had increased in all study groups (focal psychodynamic therapy 0·73 kg/m², enhanced cognitive behavior therapy 0·93 kg/m², optimised treatment as usual 0·69 kg/m²); no differences were noted between groups (mean difference between focal psychodynamic therapy and enhanced cognitive behaviour therapy –0·45, 95% CI –0·96 to 0·07; focal psychodynamic therapy vs optimised treatment as usual –0·14, –0·68 to 0·39; enhanced cognitive behaviour therapy vs optimised treatment as usual –0·30, –0·22 to 0·83). At 12-month follow-up, the mean gain in BMI had risen further (1·64 kg/m², 1·30 kg/m², and 1·22 kg/m², respectively), but no differences between groups were recorded (0·10, –0·56 to 0·76; 0·25, –0·45 to 0·95; 0·15, –0·54 to 0·83, respectively). No serious adverse events attributable to weight loss or trial participation were recorded.

How can we understand results presented in terms of changes in BMI?

You can find out more about BMI [body mass index] here and you can calculate your own here. But note that BMI is a controversial measure, does not directly assess body fat, and is not particularly accurate for people who are large- or small-framed or fit or athletic.

These patients had to have been quite underweight to be diagnosed with anorexia, so how much weight did they gain as a result of treatment? The authors should have given us the results in numbers that make sense to most people.

The young adult women in the study averaged 46.7 kg or 102.7 pounds at the beginning of the study. I had to do some calculations to translate the changes in BMI reported by these authors with the assumption that they were an average height of 5’6”, like other German women.

Four months after beginning the 10 month treatment, the women had gained an average of 5 pounds and at 12 months after the end of treatment (so 22 months after beginning treatment), they had gained another 3 pounds.
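For anyone who wants to check my arithmetic, here is a rough sketch of the conversion in Python. BMI is weight in kilograms divided by height in meters squared, so a change in BMI multiplied by height squared gives a change in weight. The height of 5’6” is my assumption, as noted above, and the simple unweighted average across the three arms is a shortcut of mine, not something the authors report.

```python
INCH_TO_M = 0.0254
KG_TO_LB = 2.20462

def bmi_gain_to_pounds(bmi_gain, height_m):
    """BMI = kg / m^2, so a BMI change times height squared is a weight change in kg."""
    return bmi_gain * height_m ** 2 * KG_TO_LB

# Assumption: every participant is 5'6" (66 inches) tall.
HEIGHT_M = 66 * INCH_TO_M

# Mean BMI gains (kg/m^2) as quoted from the abstract.
end_of_treatment = {"focal psychodynamic": 0.73, "enhanced CBT": 0.93, "optimised TAU": 0.69}
follow_up = {"focal psychodynamic": 1.64, "enhanced CBT": 1.30, "optimised TAU": 1.22}

def mean_gain_lb(bmi_gains):
    """Unweighted average across arms, converted to pounds (ignores group sizes)."""
    pounds = [bmi_gain_to_pounds(gain, HEIGHT_M) for gain in bmi_gains.values()]
    return sum(pounds) / len(pounds)

for label, gains in (("end of treatment", end_of_treatment),
                     ("12-month follow-up", follow_up)):
    print(f"{label}: about {mean_gain_lb(gains):.1f} lb on average")
```

Under these assumptions the gains come out in the neighborhood of the figures I give above. A different assumed height or a weighting by group size would shift them somewhat, but not enough to change the picture.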

On average, the women participating in the trial were still underweight 22 months after the trial’s start and would have still qualified for entering the trial, at least according to the weight criterion.

How the authors explain their results.

Optimised treatment as usual, combining psychotherapy and structured care from a family doctor, should be regarded as solid baseline treatment for adult outpatients with anorexia nervosa. Focal psychodynamic therapy proved advantageous in terms of recovery at 12-month follow-up, and enhanced cognitive behaviour therapy was more effective with respect to speed of weight gain and improvements in eating disorder psychopathology. Long-term outcome data will be helpful to further adapt and improve these novel manual-based treatment approaches.

My assessment after reading this article numerous times and consulting supplementary material:

  • Anorexia was treated with two therapies, each compared to an unusual control condition termed “optimized” treatment as usual. When the study was over and even in follow-up, anorexia won and the treatments lost.
  • In interpreting these results, note that the study involved a sample of young women with mostly only mild to moderate anorexia. Only a little more than half had full syndrome anorexia.
  • In post hoc “exploratory analyses,” the authors emphasized a single measure at a single time point that favored focal psychodynamic therapy, despite null findings with most other standard measures at all time points.
  • The authors expressed their outcomes in within-group effect sizes. This is an unusual approach that exaggerates results, particularly when comparisons are made to the effect sizes reported for other studies.
  • Put another way, results of the trial were very likely spun, starting with the abstract, and continuing in the results and press release.
  • The study demonstrates the difficulty treating anorexia and evaluating this treatment. Only modest increases in body weight were obtained despite intensive treatment.  Interpretation of what happened is complicated by high rates of dropping out of therapy and loss to follow-up, and the necessity of inpatient stays and other supplementary treatment.
  • The optimized routine care condition involved ill-described, uncontrolled  psychotherapeutic and medical interventions. Little sense can be made of this clinical trial except that availability of manualized treatment proved no better (or no worse), and none of the treatments, including routine care, did particularly well.
  • The study is best understood as testing the effectiveness of treating anorexia in some highly unusual circumstances in Germany, not an efficacy trial testing the strength of the two treatments. Results are not generalizable to either of the psychotherapies administered by themselves in other contexts.
  • The study probably demonstrates that  meaningful RCTs of the treatment of anorexia cannot be conducted in Germany with generalizable results.
  • Maybe this trial is just another demonstration that we do not know enough to undertake a randomized study of the treatment of anorexia that would yield readily interpretable findings.

Sad, sad, sad. So you can stop here if all you wanted was my evaluation. Or you can continue reading to find out how I arrived at it and whether you agree.

Outcomes for the trial: why am I so unimpressed?

On average, the women were still underweight at follow up, despite having had only mild to moderate anorexia at the start of the study. The sample was quite heterogeneous at baseline. We don’t know how much of the modest weight gain, and of the minority of women who were considered “fully recovered,” represents small improvements in women starting with higher BMI and milder, subsyndromal anorexia at baseline.

Any discussion of outcomes has to take into account the substantial number of women not completing treatment and lost to follow up.

Missing data can be estimated with fancy imputational techniques. But they are not magic, and involve some assumptions that cannot be tested with loss of patients to follow up in such small treatment groups. And yet, we need some way to account for all patients initially entering a clinical trial (termed an intent-to-treat analysis) for valid, generalizable results. So, we cannot ignore these problems and simply concentrate just on the women completing treatment and remaining available.

And then there is the issue of nonstudy treatment, including inpatient stays. The study has no way of taking them into account, other than reporting them. Inpatient stays could have occurred for different reasons across the three conditions. We cannot determine if the inpatient stays contributed to the results that were observed or maybe interfered with the outpatient treatment. But here too, we cannot simply ignore this factor.

We certainly cannot assume that failures to complete treatment, loss to follow up and the necessity of inpatient stays are randomly distributed between groups. We cannot convincingly rule out that some combination of these factors is decisive for the results that were obtained.

The spinning of the trial in favor of focal psychodynamic treatment.

The preregistration of the trial listed BMI at the end of treatment as the primary outcome. That means the investigators staked any claims about the trial on this outcome at this time point. There were no overall differences.

The preregistration also listed numerous secondary outcomes: the Morgan-Russell criteria; general axis I psychopathology (SCID I); eating disorder specific psychopathology (SIAB-Ex; Eating Disorder Inventory-2); severity of depressive comorbidity (PHQ-9); and quality of life according to the SF-36. Not all of these outcomes are reported in the article, and of the ones that are reported, almost none differ significantly at any timepoint.

The authors’ failure to designate one or two of these variables a priori (ahead of time) sets them up for pick-the-best hypothesizing after results are known, or HARKing. We do not actually know what was done, but there is a high risk of bias.

We should in general be highly skeptical about post hoc exploratory analyses of variables that were not pre-designated as outcomes for a clinical trial, in either primary or secondary analyses.

In table 3 of their article, the investigators present within-group effect sizes that portray the manualized treatments as doing impressively well.

Yet, as I will discuss in forthcoming blogs, within-group effect sizes are highly misleading compared to the usually reported between-group effect sizes. These within-group effect sizes attribute all changes that occurred in a particular group to the effects of the intervention. That includes claiming credit for nonspecific effects common across conditions, as well as any improvement due to positive expectations or patients bouncing back after having enrolled in the study at a particularly bad time.

The conventional strategy is to provide between-group effect sizes comparing a treatment to what was obtained in the other groups. This preserves the effects of randomization and makes use of what can be learned from comparison/control conditions. Treatments do not have effect sizes, but comparisons of treatments do.

As an example, we do not pay much attention to the within-group effect size for antidepressants in a particular study, because these numbers do not take into account how the antidepressants did relative to a pill placebo condition. Presumably the pill placebo is chemically inert, but it is provided with the same attention from clinicians, positive expectations, and support that come with the antidepressant. Once these factors shared by both the antidepressant and pill placebo conditions are taken into account, the effect size for antidepressant decreases.

Take a look at weight gain by the end of the 12 month follow-up among patients receiving focal psychodynamic therapy. In Table 3, the within-group effect size for focal psychodynamic therapy is a whopping 1.6, p < .001. But the more appropriate between-group effect size for comparing focal psychodynamic therapy to treatment as usual shown in Table 2 is a wimpy, nonsignificant .13, p < .48 (!)
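To see how the same trial can yield both a huge and a trivial number, consider a small sketch in Python. The BMI values below are entirely made up, and the formulas are common textbook conventions (change divided by the baseline SD for the within-group figure, Cohen's d with a pooled SD for the between-group figure), not necessarily what the ANTOP authors computed.

```python
import math
import statistics

def within_group_d(pre, post):
    """Mean change divided by the SD of baseline scores (one common convention)."""
    return (statistics.mean(post) - statistics.mean(pre)) / statistics.stdev(pre)

def between_group_d(treated, control):
    """Cohen's d: difference in group means divided by the pooled SD."""
    n_t, n_c = len(treated), len(control)
    pooled_var = ((n_t - 1) * statistics.variance(treated) +
                  (n_c - 1) * statistics.variance(control)) / (n_t + n_c - 2)
    return (statistics.mean(treated) - statistics.mean(control)) / math.sqrt(pooled_var)

# Hypothetical BMI values: both groups gain weight, and by nearly the same amount.
treated_pre = [16.0, 16.5, 17.0, 17.5, 18.0]
treated_post = [17.5, 18.1, 18.4, 19.2, 19.6]
control_post = [17.3, 17.9, 18.5, 18.9, 19.5]

print(f"within-group d (treated):  {within_group_d(treated_pre, treated_post):.2f}")
print(f"between-group d at follow-up: {between_group_d(treated_post, control_post):.2f}")
```

The within-group figure is large because it takes credit for everything that happened to the treated group over time. The between-group figure is small because the control group improved almost as much.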

An extraordinary “optimized” treatment as usual.

Descriptions in the preregistered study protocol, press releases, and methods section of the article do not do justice to the “optimized” treatment as usual. The method section did not rouse particular concern from me. It described patients assigned to the treatment as usual being provided with a list of psychotherapists specializing in the treatment of eating disorders and their family physicians assuming an active role in monitoring and providing actual treatment. This does not sound particularly unusual for a comparison/control group. After all, it would be unethical to leave women with such a threatening, serious disorder on a waiting list just to allow a comparison.

But then I came across this shocking description of the optimized routine care condition in the discussion section:

Under close guidance from their family doctor—eg, regular weight monitoring and essential blood testing—and with close supervision of their respective study centre, patients allocated optimised treatment as usual were able to choose their favourite treatment approach and setting (intensity, inpatient, day patient, or outpatient treatment) and their therapist, in accordance with German national treatment guidelines for anorexia nervosa.11 Moreover, comparisons of applied dosage and intensity of treatment showed that all patients— irrespective of treatment allocation—averaged a similar number of outpatient sessions over the course of the treatment and follow-up periods (about 40 sessions). These data partly reflect an important achievement of the German health-care system: that access to psychotherapy treatment is covered by insurance. However, patients allocated optimised treatment as usual needed additional inpatient treatment more frequently (41%) than either those assigned focal psychodynamic therapy (23%) or enhanced cognitive behaviour therapy (35%).

OMG! I have never seen such intensive treatment-as-usual in a clinical trial. I doubt anything like this treatment would be available elsewhere in the world as standard care.

This description raises a number of disturbing questions about the trial:

Why would any German women with anorexia enroll in the clinical trial? Although a desire to contribute to science is sometimes a factor, the main reasons patients enter clinical trials are that they think they will get better treatment and maybe that they think they can get a preferred treatment which they cannot get elsewhere. But, if this is the situation of routine care in Germany, why would eligible women not just remain in routine care without the complications of being in a clinical trial?

At one point, the authors claim that 1% of the population has a diagnosis of anorexia. That represents a lot of women. Yet, they were only able to randomize 242 patients, despite a massive two-year effort to recruit patients involving 10 German departments of psychotherapy and psychosomatic medicine. It appears that a very small minority of the available patients were recruited, raising questions about the representativeness of the sample.

Patients had little incentive to remain in the clinical trial rather than dropping out. Dropping out of the clinical trial would still give them access to free treatment–without the hassle of remaining in the trial.

In a more typical trial, patients assigned to treatment as usual are provided with a list of referrals. Often few bother to complete a referral or remain in treatment, and so we can assume that the treatment-as-usual condition usually represents minimal treatment, a comparison against which a more active, free treatment readily shows a positive outcome. In the United States, patients enrolling in clinical trials often either do not have health insurance or can find only providers who will not accept what health insurance they have for the treatment they want. Patients in the United States enter a clinical trial just to get the possibility of treatment, very different circumstances than in Germany.

Overall, no matter what condition patients were assigned, all received about the same amount of outpatient psychotherapy, about 40 sessions. How could these authors have expected to find a substantial difference between the two manualized treatments and this intensity of routine care? Differences between groups of the magnitude they assumed in calculating sample sizes under these conditions would be truly extraordinary.

A lot of attention and support is provided in 40 sessions of such psychotherapy, making it difficult to detect the specific effects provided by the manualized therapies, above and beyond the attention and support they provide.

In short, the manualized treatments were doomed to null findings in comparison to treatment as usual. The only thing really unexpected about this trial is that all three conditions did so poorly.

What is a comparison/control group supposed to accomplish, anyway?

Investigators undertaking randomized controlled trials of psychotherapies know about the necessity of comparison/control groups, but they generally understand less the implication of their choice of a comparison/control group.

Most evidence-based treatments earned their status by proving superior in a clinical trial to a control group such as wait list or no treatment at all. Such comparisons provide the backbone to claims of evidence-based treatments, but are not particularly informative. It may simply be that many manualized, structured treatments are no better than other active treatments when patients have similar intensity of treatment, positive expectations, and attention and support.

Some investigators, however, are less interested in establishing the efficacy of treatments than in demonstrating the effectiveness of particular treatments over what is already being done in the community. Effectiveness studies typically find smaller effects than those obtained in straw-man comparisons between treatments and the weak effects observed in control groups.

But even if their intention is to conduct an effectiveness study, investigators need to better describe the nature of treatment as usual, if they are to make reasonable generalizations to other clinical and health system contexts.

We know that the optimized treatment as usual was exceptionally intensive, but we have no idea from the published article what it entailed, except lots of treatment, as much as what was provided in the active treatment conditions. It may even be that some of the women assigned to optimized treatment obtained therapists providing much the same treatment.

Again, if all of the conditions had done well in terms of improved patient outcomes, then we could have concluded that introducing manualized treatment does not accomplish much in Germany at least. But my assessment is that none of the three conditions did particularly well.

The optimized treatment as usual is intensive but not evidence-based. In my last blog post, we viewed a situation in which less treatment proved better than more. Maybe the availability of intensive and extensive treatment discourages women from taking responsibility for their health threatening condition. They do not improve, simply because they can always get more treatment. That is simply a hypothesis, but Germany is spending lots of money assuming that it is incorrect.

Why Germany may not be the best place to do a clinical trial for treatment of anorexia.

Germany may not be an appropriate place to do a clinical trial of treatment for anorexia for a number of reasons:

  • The ready availability of free, intensive treatment prevents recruitment of a large, representative sample of women with anorexia to a potentially burdensome clinical trial.
  • There is less incentive for women to remain in the study once they are enrolled because they can always drop out and get the same intensity of treatment elsewhere.
  • The control/comparison group of “optimized” treatment as usual complied with the extensive requirements of the German national treatment guidelines for anorexia nervosa. But these standards are not evidence-based and appear to have produced mediocre outcomes in at least this trial.
  • Treatment as usual available to everyone is not necessarily effective, but it precludes detecting incremental improvements obtained by less intensive, but focused treatments.

Prasad and Ioannidis have recently called attention to the pervasiveness of non-evidence-based medical treatments and practice guidelines that are neither cost-effective, nor ensure good patient outcomes, nor avoid unnecessary risks. They propose de-implementing such unproven practices, but acknowledge the likelihood that cultural values, vested interests, and politics can interfere with efforts to subject established but unproven practices to empirical test.

Surely, that would be the case in any effort to de-implement guidelines for the treatment of anorexia in Germany.

The potentially life-threatening nature of anorexia may discourage any temporary suspension of treatment guidelines until evidence can be obtained. But we need only look to the example of similarly life-threatening cancers, where improved treatments came about only when investigators were able to suspend well-established but unproven treatments and conduct randomized trials.

It would be unethical to assign women with anorexia to a waitlist control or no treatment when free treatment is readily available in the community. So, there may be no other option but to use treatment as usual as a control condition.

If so, a finding of no differences between groups is almost certainly guaranteed. And given the poor performance of routine care observed in this study, such results would not represent the familiar Dodo Bird Verdict for comparisons between psychotherapies, in which all of the treatments are winners and all get prizes.

Why it may be premature to conduct randomized trials of treatment of anorexia.

This may well be, as the investigators proclaim in their press release, the largest ever RCT of treatment for anorexia. But it is very difficult to make sense of it, other than to conclude that no treatments, including treatment as usual, had particularly impressive results.

For me, this study highlights the enormous barriers to conducting a well-controlled RCT for anorexia with patients representative of the kinds that would seek treatment in real-world clinical contexts.

There are unsolved issues of patient dropout and retention for follow-up that seriously threaten the integrity of any results. We just do not know how to recruit a representative sample of patients with anorexia and keep them in therapy and around for follow-up.

Maybe we should ask women with anorexia what they think. Maybe we could enlist some of them to assist in the design of a randomized trial, or at least of a treatment in which investigators could retain sufficient numbers of them to conduct a randomized trial.

I am not sure how we would otherwise get this understanding without involving women with anorexia in the design of treatment in future clinical trials.

There are unsolved issues of medical surveillance and co-treatment confounding. Anorexia poses physical health problems in the threats associated with sudden weight loss. But we do not have evidence-based protocols in place for standardizing surveillance and decision-making.

Before we undertake massive randomized trials such as ANTOP, we need to get information to set basic parameters from nonrandomized but nonetheless informative small-scale studies. Obviously the investigators in this study could not even estimate effect sizes in order to set sample sizes.

Well, presumably having made it through this long read, what do you think?




Coordinating depression treatment from afar: Are results credible?

A March 7, 2013 article in JAMA Internal Medicine claimed that depressed heart patients improved with a treatment involving “centralized, stepped, patient preference-based treatment” and that benefits were substantially greater than seen in past intervention studies. The trial, called the Comparison of Depression Interventions after Acute Coronary Syndrome (CODIACS), should get a lot of attention. The article is already being picked up on the web with headlines like “Distance Program Helps Depressed Heart Patients” and “Treating Post-ACS Depression Effective, Cost-Neutral.”

Reading the article a number of times left me with doubts that the trial actually demonstrated the efficacy of the intervention in ways that could be generalized to real world settings, despite the trial having been conducted in the community. But the article also prompted me to think about the dysfunctional, fragmented system of care for depression in general medical settings in America, how poor the treatment is that patients get there, and the difficulty of doing meaningful research in this context.

The report of the study, along with a thoughtful editorial commentary, is available open access from the journal. I encourage you to read the article now, before proceeding, or read it along with this blog post, and see if you can see what I saw and whether you agree with my assessments. I am going to be offering an unconventional, but hopefully, in the end, persuasive, analysis that concludes that this trial did not show much that we did not already know. Regardless, we can learn something from this article about interpreting the results of clinical trials for depression, with the added bonus of this article showing how to effectively write a report of a clinical trial in a way that captures attention. This article is a truly masterful example.

Patients. The 150 patients were recruited across five sites, with 73 randomized to the intervention group and 77 to the routine care group. To be eligible, patients had to have elevated scores on the Beck Depression Inventory, a self-report depression questionnaire, and to be two to six months after hospitalization for an acute coronary syndrome, which could be a myocardial infarction (heart attack) or unstable angina.

There were no formal diagnostic interviews conducted, so patients were not required to have a diagnosable depression of the kind for which there are established treatment guidelines. Entry into the study required a score of 10 or greater on the Beck Depression Inventory on at least two occasions or a score >15 on one occasion, but this yields a sample with many, and perhaps most, patients not actually being clinically depressed.

The treatment. Patients assigned to the intervention group obtained treatment from an interdisciplinary team of professionals including a local physician or advanced practice nurse and a therapist, with psychiatric and psychological supervisors monitoring treatment and patient progress from a centralized location. Patients got their choice of treatments: problem-solving therapy, medication, a combination, or neither. Problem-solving treatment is a form of cognitive behavior therapy that is practically oriented, and provides patients with the tools to tackle everyday problems that they identify as being tied to their depression. If patients chose this therapy, it was first delivered over the Internet by way of interactive video phone calls. Subsequent sessions were provided by video calls or telephones at either the clinic or the patient’s home, depending on the patient’s preference.

Patients in the active treatment group who chose antidepressant medication were interviewed by a local physician or nurse, with the patient and this local health provider having to reach agreement on the appropriate medication based on the patient’s past experience with antidepressants and current symptoms. The patients were first interviewed face to face at 1 to 2 week intervals and then every three weeks.

The intervention has some state of the art features, and if we think of patients assigned to it as getting a Mercedes, then the patients assigned to the routine care condition got a skateboard. Their primary care physician or cardiologist was simply informed of their participation in this study and their score on the depression questionnaire. The patients were then free to obtain whatever depression care they could from the physician or another health care provider, but, as we will see, there were substantial barriers to their getting treatment, and few who were not already receiving treatment subsequently did so.

Given that the study was conducted in the United States, it is important to know whether treatment was free or if patients had to pay for it, either out of pocket or with insurance and an often substantial co-pay. This information is not provided in the article, but I emailed the first author, who indicated that treatment was free in the intervention condition, but patients had to pay for any treatment if they were assigned to routine care. There are several issues here. First, patients may have been motivated to enroll in the study rather than simply get treatment through their health care provider solely because of the 50:50 chance of getting treatment that they could not otherwise afford. Second, being assigned to the intervention rather than the control group meant patients not only getting a complex intervention probably not readily available elsewhere, but getting it free. So, any difference in outcomes between the intervention and control groups could be due to patients getting free treatment they wanted at their choice of their home or a clinic, not the specifics of the treatment. Maybe all we need to improve the outcome of depression is to make treatment free and readily available, without all these bells and whistles. Finally, differences in outcomes might reflect patients assigned to the control group registering their disappointment at not being assigned to the intervention group, not the benefits of the intervention. The outcomes for the patients assigned to routine care could thus be artificially lowered so that the intervention looked more effective.

Treatment already being received. Rates of treatment with antidepressants in the United States and western European countries are high, and the number of people on antidepressants probably exceeds the number of people who are depressed, even allowing for lots of depressed people not getting treatment. In this study, 27 of the patients assigned to the intervention arm of the study and 26 of the patients assigned to routine care were already receiving an antidepressant. So, we need to take into account any increases in this number. It is to the authors’ credit that they even reported this. Most studies of enhanced care for depression do not disclose the extent to which patients being enrolled are already in treatment, so readers are left either assuming that patients were not already in treatment or having to guess the extent to which they were, without much information to go on.

Much of the antidepressant treatment patients were already receiving was inappropriate or ineffective. It is estimated that 40% of patients in general medical settings who receive an antidepressant derive no benefit over simply remaining on a waiting list. That is because some of the treatment is provided to patients who are too mildly depressed to show benefit or who are simply not even depressed, and among patients who are sufficiently depressed to benefit, there is inadequate patient education and follow up. It is important to emphasize that antidepressants do not make unhappy people happy. Rather, the effectiveness of these drugs is limited to persons who have a diagnosable depression, who, in the case of this study, may have been in a minority.

Routine management of depression in general medical settings in the United States is so poor that patients receiving antidepressants often do not obtain the benefit that they would have gotten from assignment to a pill placebo condition in a clinical trial. Practice guidelines dictate that patients be contacted at five weeks after starting an antidepressant. If improvement is not apparent, their dose should be adjusted, they should be switched to another antidepressant, or maybe they should just be given some education about the need for adherence. Guidelines dictate this, but they are notoriously ineffective in ensuring that patients get the necessary follow up. Patients getting an initial prescription for an antidepressant from a primary care physician may very well simply disappear into the community, often without even renewing their prescription.

In contrast, patients assigned to a pill placebo condition in a clinical trial get much more than a sugar pill: they get positive expectations and a lot of regular attention and support. Any differences found between an antidepressant and a pill placebo condition in a clinical trial have to be in addition to this attention and support, because patients are blinded as to whether they are getting an antidepressant or pill placebo.

Figure 2

Results of the trial. Patients assigned to the intervention group dropped an average of 3.5 points more on the depression questionnaire, relative to patients assigned to routine care. Take a look at Figure 2 from the article, which compares this trial to other data concerning treatment for depression. It involves a blobbogram or forest plot of this and past studies. To understand what that means, you can click on this link, or you can go to the excellent, readily understandable discussion on pages 14 to 18 of Ben Goldacre’s Bad Pharma. But for our purposes, it would not be an outrageous distortion to think of this forest plot as being a snapshot of a horse race. (I know, an oversimplification, and past co-authors on meta-analyses, please forgive me.) The horse out in front represents results of this CODIACS trial, and the only other horse almost neck and neck represents results of the COPES randomized controlled trial, which served as preliminary work for the CODIACS trial and was conducted by the same authors.

Three and a half points on a Beck Depression Inventory with a range of possible scores from 0 to 63 does not sound like much, but this could be an exceptional finding. Figure 2 shows that this trial, along with an earlier study done in preparation for it, exceeds the effects found in meta-analyses (a statistical tool for integrating results of different studies) for other complex (collaborative care) interventions in medical settings, represented by the horse at the top of the diagram; meta-analyses of published and unpublished clinical trials of SSRI antidepressants, represented as the next horse down; meta-analyses of only published trials of SSRI antidepressants, which, because of publication bias, are higher, represented by the next horse down; and a variety of single trials. We are therefore talking about an apparently big effect, especially for a group with low depressive symptoms to begin with.

Yet, before you conclude “wow!”  we need to ask if this is really that impressive. It is important to note that effect sizes obtained in clinical trials depend not only on changes observed in the intervention group, but also changes observed in the control group. A control group showing little change can make an otherwise mediocre intervention look impressive. In the case of this trial, there was no change in the routine care group. So, we might simply be comparing doing a whole lot with this intervention to doing basically nothing, without getting at what of the “whole lot” mattered for outcome.
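To make the dependence on the control group concrete, here is a minimal sketch in Python. It treats the intervention group's drop as roughly 3.5 points and routine care as unchanged, which is an approximation of what is described above. The second scenario, in which routine care patients improve by 2 points, is purely hypothetical and chosen only for illustration.

```python
def between_group_difference(change_intervention, change_control):
    """Difference in mean change scores (negative = intervention improved more)."""
    return change_intervention - change_control

# Approximation of the trial as described in this post: about a 3.5-point
# drop with the intervention and essentially no change in routine care.
observed = between_group_difference(-3.5, 0.0)

# Hypothetical scenario (NOT from the trial): routine care improves by 2 points.
hypothetical = between_group_difference(-3.5, -2.0)

print(f"Observed difference: {observed:+.1f} points")
print(f"If routine care had improved by 2 points: {hypothetical:+.1f} points")
```

The intervention group does exactly the same in both scenarios; only the comparison group changes, and the apparent effect shrinks with it.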

If we go to the meta-analysis of collaborative care trials represented by the top horse, we find that it considered 37 trials involving 12,355 depressed patients. The results were strong enough to recommend implementation, but, do check this out, only in the United States (!). Trials conducted outside the United States did not demonstrate a significant effect on patient outcomes of improving routine care for depression in this manner. Why? Because the other countries are too primitive to benefit? Hardly. The other countries are mainly the United Kingdom and the Netherlands, where routine care for depression is less fragmented and poses less of a financial burden on patients. So, we might infer that collaborative care works best in contexts, like the United States, in which there is lots of room for improvement in routine care for depression, including making it less costly to patients.

Returning to our discussion of the CODIACS trial, we can find an alternative expression of its results, namely, that 24 patients in the active treatment group achieved remission of depression, compared with 16 in the usual care group. Thus, 49 patients in the active treatment group and 57 patients in the routine care group remained depressed. This is not atypical, and shows just how far we have to go in getting better outcomes for treatment of depression in the community, better even than claimed for this CODIACS trial. Despite being the horse out ahead of the pack, the study left most patients still with their depression.
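For readers who like to see the arithmetic, the remission counts can be turned into rates, a risk difference, and a number needed to treat. This is a rough sketch, not the authors' analysis. It assumes the denominators are simply the remitted plus still-depressed counts given above (73 in each group), which is fewer than the 77 randomized to routine care, so some patients presumably lacked outcome data.

```python
def remission_summary(remitted_tx, depressed_tx, remitted_ctl, depressed_ctl):
    """Remission rates, risk difference and number needed to treat (NNT)."""
    n_tx = remitted_tx + depressed_tx      # assumed denominator, intervention
    n_ctl = remitted_ctl + depressed_ctl   # assumed denominator, routine care
    rate_tx = remitted_tx / n_tx
    rate_ctl = remitted_ctl / n_ctl
    risk_diff = rate_tx - rate_ctl
    # NNT is only meaningful when the intervention does better than control.
    nnt = 1 / risk_diff if risk_diff > 0 else float("inf")
    return {"rate_tx": rate_tx, "rate_ctl": rate_ctl,
            "risk_diff": risk_diff, "nnt": nnt}

# Counts as reported in this post.
summary = remission_summary(24, 49, 16, 57)

print(f"Remission, intervention: {summary['rate_tx']:.1%}")
print(f"Remission, routine care: {summary['rate_ctl']:.1%}")
print(f"Risk difference:         {summary['risk_diff']:.1%}")
print(f"Number needed to treat:  {summary['nnt']:.1f}")
```

On these counts, roughly a third of intervention patients and a bit over a fifth of routine care patients remitted, so about nine patients would need to receive the intervention for one additional remission.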

Treatment received after patients were randomized to intervention or control group. In the intervention group, the number of patients on antidepressants went from 27 to 37 out of 73, and for the routine care group, the number went from 26 to 28 out of 76. These are not impressive numbers. The number of patients in the intervention group receiving psychotherapy increased from 6 to 48, and the number in the control group increased from 7 to 14. These are more impressive numbers, and consistent with the view that patients identified as depressed in general medical care often have difficulties completing appropriate, affordable psychotherapy, unless there is some assistance. Note too that the therapy for the intervention group was provided wherever the patient preferred, either in the convenience of their home or at a clinic.
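The same counts can be put on a common scale as percentage-point changes in uptake. The sketch below is only an illustration using the numbers in the paragraph above, with the group sizes of 73 and 76 taken as stated there; the function name and the data layout are my own choices.

```python
def uptake_change(before, after, n):
    """Percentage-point change in the share of a group receiving a treatment."""
    return 100 * (after - before) / n

# (before, after, group size) as reported in this post.
counts = {
    ("antidepressant", "intervention"): (27, 37, 73),
    ("antidepressant", "routine care"): (26, 28, 76),
    ("psychotherapy", "intervention"): (6, 48, 73),
    ("psychotherapy", "routine care"): (7, 14, 76),
}

changes = {key: uptake_change(*value) for key, value in counts.items()}

for (treatment, group), change in changes.items():
    print(f"{treatment:15s} {group:13s} {change:+.1f} percentage points")
```

Laid out this way, the jump in psychotherapy in the intervention group dwarfs every other change, which is the contrast the raw counts are pointing at.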

A number of factors could explain these differences in the therapy received by the intervention versus the control group. Patients in the intervention group could have been getting more psychotherapy at follow-up because they were encouraged to do so, because it was free, or because it was more readily available and convenient than what patients got in the routine care group, where they might not be able to find a therapist with fees they could afford, even if they looked hard.

So, we can conclude that the intervention modestly increased the number of patients on antidepressants and rather substantially increased the number of patients getting Internet-provided psychotherapy, whereas being assigned to the control group meant not much change in treatment. The net effect was a change in depression scores, but one largely driven by not much happening in the routine care group.

Cost analysis. The authors concluded that healthcare did not cost more for patients assigned to the intervention group versus the control group. They arrived at this conclusion by combining the cost of mental health care, which was higher for the intervention group, with the cost of general medical care, which was lower for the intervention group. We cannot take this assessment of no increase in cost too seriously. These estimates are based on very small numbers of patients and they do not take into account the considerable cost of setting up, staffing, and maintaining this complex centralized system. I think these costs would leave this complex intervention, like many other complex, collaborative care interventions for depression, feasible within a research trial, but not sustainable in routine care. A commentary on a cost analysis of the authors’ earlier COPES study noted:

One cannot compare the cost of coq au vin and a glass of pinot noir at a local French restaurant to the same meal in Paris without including the cost of airfare and hotel. The cost of enhanced depression care is like the cost of the French meal; the real cost must include all other expenses for the trip to get us there.

And it is probably unrealistic to assume we can improve the treatment of depression with no increase in costs, anyway.

From this article, we don’t really learn

  • How many of these patients were actually clinically depressed.
  • Whether treatment with antidepressants was appropriate for these patients
  • The quality of care patients were receiving in routine care, either before randomization or after, but there is reason to believe that it was quite inadequate.
  • Whether encouragement, cost, or accessibility led to more psychotherapy being received by the patients in the intervention group
  • Whether any of the active components of the intervention decisively mattered, rather than positive expectations, attention and support that the patients received.

Bottom line: in the context of the usually poor care for depression in general medical settings in the United States, we don’t know if this kind of intensity is needed to achieve better outcomes or if lower intensity interventions could have achieved much the same effect.

Maybe we are indeed observing the effects of a powerful intervention in the CODIACS trial. But we cannot rule out that what we are seeing is a false signal among a lot of noise. To really detect a powerful effect due to the intervention, we would need different patients in a comparison group drawn from a different setting. A trial reporting an intervention being better than routine care for depression does not demonstrate that the intervention is good, nor does it identify the intervention as the source of the apparent effect.

Just what is this routine care being provided anyway? Three of the authors of this article have co-authored an excellent paper on how having routine care as the comparison group for an intervention does not necessarily allow us to say much about the effectiveness of the intervention. It needs to be established whether the routine care was adequate, or whether the intervention simply compensated for the inadequacy of routine care in a way that lots of interventions could have. And then there is the ethical requirement of equipoise. It dictates that researchers have a reasonable assumption that the treatments they are offering to patients in a clinical trial are reasonably equivalent. Could the investigator team justify the needed assumption that they were offering equivalent treatment to the two groups?

If routine care is a car with bad spark plugs, all one needs is new spark plugs. However, it would be a mistake to generalize from a trial in which the intervention was the equivalent of providing new spark plugs that the same intervention would work for other cars that are not in need of such repairs.

How common is it that quite modestly sized studies like CODIACS and its predecessor, the COPES trial, finish way ahead of the pack of other, mostly larger studies, only to not be replicated? The accompanying editorial commentary by Gregory Simon, who has been involved in a number of important studies of collaborative care, indicates this pattern is all too common. John Ioannidis has demonstrated that it is more generally common in medicine in a paper aptly entitled “Why most discovered true associations are inflated.” Other authors have referred to this pattern as the decline effect.

Before clinicians and policy makers invest in this complex intervention, we should see what happens if we simply offer free, better quality and more appropriate antidepressant treatment, along with access to this kind of psychotherapy, which is difficult for general medical care patients to obtain. I would not be surprised if the same effects could be achieved simply by providing this access without this complex intervention. Or, before we conclude that this intervention has features that were particularly effective, we should at least similarly pay for the treatment of patients in both the intervention and control groups, and that might in itself take away any differences between the intervention and the control group.

Will publicizing this study encourage overtreatment with antidepressants of patients who are not depressed? Maybe. One thing that I am uncomfortable with in the study is that decisions that patients were “depressed,” and very likely even the type of treatment offered, were based solely on questionnaires. Sure, physicians and nurses discussed treatment with the patients, but primary-care providers notoriously don’t do a formal diagnostic interview or even ask very many questions before coming to a decision that a patient is depressed. The exclusive reliance on the questionnaires in this study sends the wrong message and is a poor model for routine care. We are already faced with the problem of considerable overtreatment with antidepressants combined with undermanagement, and if this study is taken seriously, it could contribute to that trend.

The packaging and selling of this trial. I teach scientific writing in Europe and I constantly remind participants in my workshops of the need for Europeans to do a better job of promoting their work. I coach them in doing elevator talks promoting their work and themselves, which some participants find difficult and counter to ingrained cultural prohibitions against self-promotion. The Dutch, for instance, warn their children, and Dutch professionals warn their colleagues, about the “tall poppy syndrome”: kop boven het maaiveld uitsteken! This roughly translates as ‘the tall poppy gets cut’. I tell my European participants that they need to emulate the Americans in some ways, even if they would not want to become Americans.

This article provides a masterful example of how to promote a study. It is no surprise, because the list of authors includes a former journal editor and associate editors and an NIH program officer. The artful persuasion starts in the abstract and introduction, which do not take for granted that the reader already has a sense of the importance of evaluating this particular intervention. In the abstract and introduction, the authors spell out just what a serious problem depression is, how it is suboptimally treated, and what the consequences can be of it being left inadequately treated.

Starting in the abstract, results that are actually not all that clear cut are presented as strong effects. The abstract ends with a call for a larger, more ambitious study (funding agencies and journalists, please take note). This is no boring “further research is needed,” but a specific call to action for a more ambitious study. A reader can quickly lose sight of our not even knowing how many of these patients were actually depressed and that, in the end, we don’t know if a specific component of the intervention was needed to get the difference in outcomes. We don’t even know if the apparent effects of the intervention depend largely on the poor quality care being received by the patients assigned to the control condition. But the article does not call attention to these issues.

The argument that this intensive intervention will not cost more is seemingly compelling, when there is actually little basis for it.

In the discussion section, the possibility is raised that improving depression can lead to improved physical health outcomes among cardiac patients, including longer lives. This is a dubious assertion because there are at present no data to support it, but pointing to the possibility is quite important in promoting the significance of this study. I’m not saying anything the authors do here is inappropriate, but it does go far beyond the data. So, Europeans, please study how these authors open with an impressive statement of the importance of their work and close with an elaboration of that, but also maybe where, as the Dutch would say, they go over the top.

What are we being sold in this article? I think that the CODIACS intervention has some promising elements, but represents a complex intervention that is more expensive and less sustainable than the authors note, when administrative and infrastructure costs are taken into account. Its efficacy remains to be demonstrated in trials with patients appropriately selected for clinical depression. A credible demonstration of its efficacy will require pitting it against a routine care that provides greater likelihood that depressed patients can readily access affordable treatment that is acceptable to them and that they can get the minimal professional attention to facilitate that access. The validity of routine care as an appropriate comparison/control group needs to be demonstrated, not assumed, and this requires showing that assignment of patients to it leads to at least a modest increase in utilization of treatment and at least modest improvements in depression scores.

I’d be interested in hearing from readers whether they also zeroed in on what I found important about this article and whether they agree with me about just how ambiguous these findings are.