Were any interventions to prevent teen suicide effective in the SEYLE trial?

Disclaimer: I’ve worked closely with some of the SEYLE investigators on other projects. I have great respect for their work. Saving and Empowering Young Lives in Europe was a complex, multisite suicide prevention project of historical size and scale that was exceptionally well implemented.

However, I don’t believe that The Lancet article reported primary outcomes in a way that allows their clinical and public health significance to be fully and accurately appreciated. Some seemingly positive results were reported with a confirmation bias. Important negative findings were reported in ways that make them likely to be ignored, losing important lessons for the future.

I don’t think we benefit from minimizing the great difficulty of showing that any intervention works to prevent death by suicide, particularly in a relatively low-risk group like teens. We don’t benefit from exaggerating the strength of evidence for particular approaches.

The issue of strength of evidence is compounded by Danuta Wasserman, the first author of the trial report, also being among the authors of a systematic review:

Zalsman G, Hawton K, Wasserman D, van Heeringen K, Arensman E, Sarchiapone M, Carli V, Höschl C, Barzilay R, Balazs J, Purebl G. Suicide prevention strategies revisited: 10-year systematic review. The Lancet Psychiatry. 2016 Jul 31;3(7):646-59.

In a post at Mental Elf, psychiatrist and expert on suicidology  Stanley Kutcher pointed to a passage in the abstract of the systematic review:

The review’s abstract notes that YAM (one of the study arms) “was associated with a significant reduction of incident suicide attempts (odds ratios [OR] 0.45, 95% CI 0.24 to 0.85; p=0.014) and severe suicidal ideation (0.50, 0.27 to 0.92; p=0.025)”. If this analysis seems familiar to the reader that is because this is the information also provided in the Zalsman abstract! This analysis refers to the SELYE study ONLY! However, the way in which the Zalsman abstract is written suggests this analysis refers to all school based suicide awareness programs the reviewers evaluated. Misleading at best. Conclusion supporting, not at all.

[Another reminder that authors of major studies should not also be authors on systematic reviews and meta-analyses that review their work. But tell that to the Cochrane Collaboration, which now has a policy of inviting authors of studies from which individual patient data are needed. But that is for another blog post.]

The article reporting the trial is currently available open access here.

Wasserman D, Hoven CW, Wasserman C, Wall M, Eisenberg R, Hadlaczky G, Kelleher I, Sarchiapone M, Apter A, Balazs J, Bobes J. School-based suicide prevention programmes: the SEYLE cluster-randomised, controlled trial. The Lancet. 2015 Apr 24;385(9977):1536-44.

The trial protocol is available here.

Wasserman D, Carli V, Wasserman C, et al. Saving and empowering young lives in Europe (SEYLE): a randomized controlled trial. BMC Public Health 2010; 10: 192.


From the abstract of the Lancet paper:

Methods. The Saving and Empowering Young Lives in Europe (SEYLE) study is a multicentre, cluster-randomised controlled trial. The SEYLE sample consisted of 11 110 adolescent pupils, median age 15 years (IQR 14–15), recruited from 168 schools in ten European Union countries. We randomly assigned the schools to one of three interventions or a control group. The interventions were: (1) Question, Persuade, and Refer (QPR), a gatekeeper training module targeting teachers and other school personnel, (2) the Youth Aware of Mental Health Programme (YAM) targeting pupils, and (3) screening by professionals (ProfScreen) with referral of at-risk pupils. Each school was randomly assigned by random number generator to participate in one intervention (or control) group only and was unaware of the interventions undertaken in the other three trial groups. The primary outcome measure was the number of suicide attempt(s) made by 3 month and 12 month follow-up…

No significant differences between intervention groups and the control group were recorded at the 3 month follow-up. At the 12 month follow-up, YAM was associated with a significant reduction of incident suicide attempts (odds ratios [OR] 0·45, 95% CI 0·24–0·85; p=0·014) and severe suicidal ideation (0·50, 0·27–0·92; p=0·025), compared with the control group. 14 pupils (0·70%) reported incident suicide attempts at the 12 month follow-up in the YAM versus 34 (1·51%) in the control group, and 15 pupils (0·75%) reported incident severe suicidal ideation in the YAM group versus 31 (1·37%) in the control group. No participants completed suicide during the study period.

What can be noticed right away: (1) this is a four-armed study in which three interventions are compared to a control group; (2) apparently there were no effects observed at three months; (3) results are not reported at 12 months for two of the three interventions, only the difference between one intervention group and the control group; (4) the differences between that intervention group and the control group were numerically small; (5) despite enrolling over 11,000 students, no suicides were observed in any of the groups.

[A curious thing about the abstract, to be discussed later in the post: what is identified as the statistical effect of YAM on self-reported suicide attempts is expressed as an odds ratio and a significance level. No actual numbers are given there.

Yet effects on suicidal ideation are expressed in absolute numbers, with a small number of students identified as having severe ideation and a small absolute difference between YAM and the control group. Presumably, there were fewer suicide attempts than students with severe ideation. Like me, are you wondering how many self-reported attempts we are talking about?]
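For the attempts, the findings paragraph quoted above does give counts (14 versus 34), and a quick back-of-the-envelope check shows how they fit with the odds ratio. The denominators below are my back-calculations from the reported percentages, not figures taken from the paper, and the result is the crude, unadjusted odds ratio rather than the model-based OR of 0.45:

```python
# Crude odds ratio for incident suicide attempts, YAM vs control.
# Denominators are assumptions back-calculated from the reported percentages
# (14 / 0.70% ~ 2,000 pupils in YAM; 34 / 1.51% ~ 2,252 in control).
yam_events, yam_n = 14, 2000
ctrl_events, ctrl_n = 34, 2252

odds_yam = yam_events / (yam_n - yam_events)
odds_ctrl = ctrl_events / (ctrl_n - ctrl_events)
print(f"crude OR = {odds_yam / odds_ctrl:.2f}")   # ~0.46, near the reported 0.45
```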

This study did not target actual suicides. That decision is appropriate, because even with 11,000 students there were no suicides. The significance of the lack of suicides is that even with this many students followed for a year, there might not be a single suicide, so one cannot expect to observe an actual decrease in suicides, and certainly not a statistically significant one.
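For scale: assuming a teen suicide rate somewhere on the order of 5 to 10 per 100,000 per year (a rough figure for illustration, not taken from the paper), a cohort of 11,110 pupils followed for one year would be expected to yield about 11,110 × (5 to 10)/100,000 ≈ 0.6 to 1.1 deaths. An entire trial of this size can easily pass without a single suicide in any arm.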

We should keep this in mind the next time we encounter claims that teen suicide is an epidemic, or the expectation that an intervention in a particular community will lead to an observable reduction in teen suicides.

We should also keep this in mind when we see in the future that a community has implemented suicide prevention programs after some spike in suicides. It’s very likely that a reduction in suicides will then be observed, but that’s simply regression to the mean: the community returned to its more typical rate of suicide.
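A minimal simulation makes the selection effect concrete. The numbers are invented purely for illustration: suppose many communities each have a stable expected rate of two teen suicides a year, and we single out the ones that happen to have a "spike":

```python
import numpy as np

rng = np.random.default_rng(1)
year1 = rng.poisson(2.0, 100_000)   # counts in the year a spike is noticed
year2 = rng.poisson(2.0, 100_000)   # next year's counts, same true rate
spike = year1 >= 5                  # communities selected for a "spike"

print(year1[spike].mean())   # ~5.4: elevated, by construction of the selection
print(year2[spike].mean())   # ~2.0: back to typical, with no intervention at all
```

Any program introduced right after the spike would appear to have cut suicides by more than half.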

Rather than actual suicides, the study specified suicidal ideation and self-reported suicidal acts as outcomes. We have to be cautious about inferring changes in suicide from changes in these surrogate outcomes. Changes in surrogate outcomes don’t necessarily translate into changes in the outcomes that we are most interested in, but for whatever reason are not measuring. In this study, the investigators were convinced that even with such a large sample, a reduction in suicides would not be observed. That is hardly a reason to argue that whatever reduction is observed in surrogate outcomes would translate into a reduction in deaths.

Let’s temporarily put aside the issue of suicidal acts being self-reported and subject both to unreliability and to a likely overestimate of life-threatening acts. I would estimate from other studies that one would have to prevent a hundred documented suicide attempts in order to prevent one actual suicide.

But these are self-report measures.

Pupils were identified as having severe suicidal ideation if they answered “sometimes, often, very often or always” to the question: “During the past 2 weeks, have you reached the point where you seriously considered taking your life, or perhaps made plans how you would go about doing it?”

So any endorsement of any of these categories was lumped together as “severe ideation.” We might not agree with that designation, but without this lumping, a sample of 11,000 students does not yield differences in occurrences of “severe suicidal ideation.”

Readers are not given a breakdown of the endorsements of suicidality across categories, but I think we can reasonably make some extrapolations about the skewness of the distribution from a study that I blogged about, in which 10,000 postpartum women were screened with a single-item question:

In the sample of 10 000 women who underwent screening, 319 (3.2%) had thoughts of self-harm, including 8 who endorsed “yes, quite often”; 65, “sometimes”; and 246, “hardly ever.”

We can be confident that most instances of “severe suicidal ideation” in the SEYLE study did not indicate a strong likelihood of a teen making a suicide attempt. Such self-report measures are more related to other depressive symptoms than to attempted suicide.

This is yet another reminder of the difficulty of targeting suicide as a public health outcome. It’s very difficult to show an effect.

The abstract of the article prominently features a claim that one of the three interventions differed from the control group on severe suicidal ideation and suicide attempts at 12 months, but not at three months.

We should be left pondering what happened at 12 months with respect to the other two interventions. The interventions were carefully selected, and we have the opportunity to examine what effect they had. After all, we may not get another opportunity to evaluate such interventions in such a large sample in the near future. We might simply assume these interventions had no effect at 12 months, but the abstract is written to distract from that potentially important finding, which has significance for future trials.

But there is another problem in the reporting of outcomes. The results section states:

Analyses of the interaction between intervention groups and time (3 months and 12 months) showed no significant effect on incident suicide attempts in the three intervention groups, compared with the control group at the 3 month follow-up.

And

After analyses of the interaction between intervention groups and time (3 months and 12 months), we noted the following results for severe suicidal ideation: at the 3 month follow-up, there were no significant effects of QPR, YAM, or ProfScreen compared with the control group.

It’s not appropriate to focus on the difference between one of the interventions and the control group without taking into account the context of a four-armed trial, a 4 (condition) × 2 (3 month or 12 month follow-up) design.

In the absence of a clearly specified a priori hypothesis, we should first look to the condition × time interaction effect. If we can reject the null hypothesis of no interaction effect, we should then examine where the effect occurred, more confident that there is something to be explained. However, if we do what was done in the abstract, we need to appreciate the high likelihood of spurious effects when we single out one difference between one intervention group and the control group at one of the two time points.
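To put a rough number on that likelihood: with three intervention-versus-control comparisons at each of two time points, there are six chances for a spurious finding. If the tests were independent (they are not, so treat this only as an illustration), the probability of at least one false positive at the conventional threshold would be 1 − (1 − 0.05)^6 ≈ 0.26, roughly a one-in-four chance of at least one “significant” difference arising by chance alone.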

Let’s delve into a table of results for suicide attempts:

[Table: self-reported incident suicide attempts by trial group and follow-up]

These results demonstrate that we should not make too much of YAM being statistically significant, compared to the two other active intervention groups.

We’re talking about a difference of only a few suicide attempts between students assigned to YAM and students in the other two active intervention groups.

On the basis of these differences, are we willing to say that YAM represents best practices, an empirically based approach to preventing suicides in schools, whereas the other two interventions are ineffective?

Note that even the difference between YAM and the control group has a broad confidence interval around a difference significant at p=0.014.

It gets worse. Note that these are not differences in actual attempts but results obtained with an imputation:

A multiple imputation procedure (50 imputations with full conditional specification for dichotomous variables) was used to manage missing values of individual characteristics (<1% missing for each individual characteristic), so that all pupils with an outcome at 3 months or 12 months were included in the GLMMs. Additional models, including sex-by-intervention group interactions, and age-by-intervention group interactions were tested for differential intervention effects by sex and age. To assess the robustness of the findings, tests for intervention group differences were redone including only the subset of pupils with complete outcome data at both 3 months and 12 months.
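For readers unfamiliar with the machinery, here is a minimal sketch of the generic pooling step (Rubin’s rules) that sits behind multiple imputation procedures like the 50-imputation full conditional specification described above. The estimates and variances below are invented for illustration:

```python
import numpy as np

# Each imputed dataset yields its own estimate (e.g., a log odds ratio)
# and within-imputation variance; these numbers are made up.
est = np.array([-0.80, -0.74, -0.85, -0.79, -0.77])
var = np.array([0.105, 0.110, 0.098, 0.102, 0.107])

m = len(est)
pooled = est.mean()                       # pooled estimate across imputations
within = var.mean()                       # average within-imputation variance
between = est.var(ddof=1)                 # between-imputation variance
total = within + (1 + 1 / m) * between    # Rubin's total variance
print(pooled, total ** 0.5)               # estimate and its pooled standard error
```

The point to notice is that everything downstream, including the p values featured in the abstract, inherits whatever assumptions went into generating the imputed values.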

Overall, we are dealing with small numbers of events, likely assessed with considerable measurement error, processed through multiple imputation procedures that carry the possibility of specification error based on assumptions that cannot be tested with such a small number of events. Then, we have the broad overlapping confidence intervals for the three interventions. Finally, there is the problem of not taking into account the multiple pairwise comparisons that were possible in this 4 (condition) × 2 (time) design, in which the critical overall treatment × time interaction was not significant.

Misclassification of just a couple of events or  a recovery of data that were thought to be lost and therefore had to be estimated with imputation could alter significance levels – as if they really matter in such a large trial, anyway.
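To see how little it would take, here is a sketch using Fisher’s exact test on the reported counts (with denominators back-calculated from the reported percentages, an assumption on my part), first as reported and then with just two events reclassified:

```python
from scipy.stats import fisher_exact

reported = [[14, 2000 - 14], [34, 2252 - 34]]   # YAM vs control, as reported
shifted  = [[16, 2000 - 16], [32, 2252 - 32]]   # two events reclassified

for table in (reported, shifted):
    odds, p = fisher_exact(table)
    print(f"OR = {odds:.2f}, p = {p:.3f}")
# The first table is significant at p < 0.05; the second is not.
```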

Let’s return to the issue of the systematic review in which the first author of the SEYLE trial participated. The text borrowed without attribution from the abstract of the SEYLE study reflects a bit of overenthusiasm, or at least premature enthusiasm, for the author’s own results.

Let’s look at the interventions that were actually evaluated. The three active interventions:

The Screening by Professionals programme (ProfScreen)…is a selective or indicated intervention based on responses to the SEYLE baseline questionnaire. When pupils had completed the baseline assessment, health professionals reviewed their answers and pupils who screened at or above pre-established cutoff points were invited to participate in a professional mental health clinical assessment and subsequently referred to clinical services, if needed.

Question, Persuade, and Refer (QPR) is a manualized gatekeeper programme, developed in the USA. In SEYLE, QPR was used to train teachers and other school personnel to recognise the risk of suicidal behaviour in pupils and to enhance their communication skills to motivate and help pupils at risk of suicide to seek professional care. QPR training materials included standard PowerPoint presentations and a 34-page booklet distributed to all trainees.

Teachers were also given cards with local health-care contact information for distribution to pupils identified by them as being at risk. Although QPR targeted all school staff, it was, in effect, a selective approach, because only pupils recognised as being at suicidal risk were approached by the gatekeepers (trained school personnel).

YAM

The Youth Aware of Mental Health Programme (YAM) was developed for the SEYLE study and is a manualised, universal intervention targeting all pupils, which includes 3 h of role-play sessions with interactive workshops combined with a 32-page booklet that pupils could take home, six educational posters displayed in each participating classroom and two 1 h interactive lectures about mental health at the beginning and end of the intervention. YAM aimed to raise mental health awareness about risk and protective factors associated with suicide, including knowledge about depression and anxiety, and to enhance the skills needed to deal with adverse life events, stress, and suicidal behaviours.

This programme was implemented at each site by instructors trained in the methodology through a detailed 31 page instruction manual.

I of course could be criticized as offering my predictions about effects of these interventions after results are known. Nonetheless, I think my skepticism is well known and the criticisms I have of these interventions might be anticipated.

ProfScreen is basically a screening and referral effort. Its vulnerability is the lack of evidence that screening instruments have adequate positive predictive value. None of the available screening measures proved useful in a recent large-scale study. Armed with screening instruments that don’t work particularly well, health professionals are going to refer a lot of students for further evaluation and treatment, with a lot of false positives. I would anticipate that it is already difficult to get a timely appointment for adolescent mental health treatment. These referrals could only further clog the system. Given the performance of the instruments, it is not clear that students who screen positive should be given priority over other adolescents with known serious mental health problems.

I am sure a lot of activists and advocates for reducing teen suicide were rooting for screening and referral efforts. A clear statement of the lack of any evidence in this large-scale study for the effectiveness of such an approach would be invaluable and might prevent misdirection of resources.

The effectiveness of QPR would depend on raising the awareness of a school gatekeeper so that the gatekeeper was in a position to act at a rare but decisive moment with a student otherwise inclined to life-threatening self-harm, and prevent the progression to self-harm from occurring.

Observing such a sequence and being able to intervene is going to be an infrequent occurrence. Of course, there’s the further doubtful assumption that suicidality is going to be so obvious that it can be recognized.

The YAM intervention is the only one that actually involves live interaction with students, but it is only 3 hours of role playing, added to lectures and posters. Nice, but I would not think that would have prevented suicide attempts, although maybe it would affect self-reports.

I recall, way back, being asked by NIMH program officers to apply for funding for a suicide prevention intervention study targeting primary care physicians serving older adults. That focus was specifically being required by the then Senate Majority Leader Harry Reid (Democrat, Nevada), whose father had died by suicide after an encounter with a primary care physician in which the father’s being at risk was not uncovered. Senator Reid was demanding that NIMH conduct a clinical trial showing that such deaths could be averted.

I told the program officers that I was sorry for the loss of Senator Reid’s father, but that given the rate of suicide even in a relatively high-risk group like elderly men, a primary care physician would have a relevant encounter with an elderly, potentially suicidal patient only about once every 18 months. It was difficult to conceive of an intervention whose effectiveness in reducing suicide could be demonstrated under those circumstances. I didn’t believe that suicidal ideation was a suitable surrogate, but the trial that got funded focused on reducing suicidal ideation as its primary outcome. The entire large, multisite trial had only one suicide during the trial and follow-up period, and it happened to be someone in the intervention group. Not much can be inferred from that.

What can we learn from SEYLE, given that it cannot define best practices for preventing teen suicide?

Do we undertake a bigger trial and hope the stars align so that one intervention is shown to be better than others? If we don’t get that result, do we resort to hocus pocus multiple imputation methods and insist the result is really there, we just can’t see it?

Of course, some will say we have to do something, we just can’t let more teens die by suicide. So, do we proceed without the benefit  of strong evidence?

I will soon be offering e-books providing skeptical looks at mindfulness and positive psychology, as well as scientific writing courses on the web as I have been doing face-to-face for almost a decade.

Sign up at my new website to get advance notice of the forthcoming e-books and web courses, as well as upcoming blog posts at this and other blog sites. Lots to see at CoyneoftheRealm.com.

 

COBRA study would have shown homeopathy can be substituted for cognitive behavior therapy for depression

If The Lancet COBRA study had evaluated homeopathy rather than behavioural activation (BA), homeopathy would likely have similarly been found “non-inferior” to cognitive behavior therapy.

This is not an argument for treating depression with homeopathy, but an argument that the 14 talented authors of The Lancet COBRA study stacked the deck for their conclusion that BA could be substituted for CBT in routine care for depression without loss of effectiveness. Conflict of interest and catering to politics intruded on science in the COBRA trial.

If a study like COBRA produces phenomenally similar results with treatments based on distinct mechanisms of change, one possibility is that background nonspecific factors are dominating the results. Insert homeopathy, a bogus treatment with strong nonspecific effects, in place of BA, and non-inferiority may well have been shown.

Why homeopathy?

Homeopathy involves diluting a substance so thoroughly that no molecules of it are likely to be present in what is administered to patients. The original substance is first diluted to one part per 100 parts alcohol or distilled water. This process is repeated six times, leaving the original material at 100^-6 = 10^-12 of its initial concentration.

Nonetheless, a super-diluted and essentially inert substance is selected and delivered within a complex ritual. The choice of the particular substance being diluted and the extent of its dilution are determined through detailed questioning of patients about their background, lifestyle, and personal functioning. Naïve and unskeptical patients are likely to perceive themselves as receiving exceptionally personalized medicine delivered by a sympathetic and caring provider. Homeopathy thus has potentially strong nonspecific (placebo) elements that may be lacking in the briefer and less attentive encounters of routine medical care.

As an academic editor at PLOS One, I received considerable criticism for having accepted a failed trial of homeopathy for depression. The study had been funded by the German government and had fallen miserably short of recruiting the intended sample size. I felt the study should be published in PLOS One to provide evidence bearing on whether such studies should be undertaken in the future. But I also wanted readers to have the opportunity to see what I had learned from the article about just how ritualized homeopathy can be, with a strong potential for placebo effects.

Presumably, readers would then be better equipped to evaluate claims made in other contexts that homeopathy is effective, based on clinical trials with inadequate control of nonspecific effects. But that is also a pervasive problem in psychotherapy trials [ 1,  2 ] that do not have a suitable comparison/control group.

I have tried to reinforce this message in the evaluation of complementary or integrative treatments in Relaxing vs Stimulating Acupressure for Fatigue Among Breast Cancer Patients: Lessons to be Learned.

The Lancet COBRA study

The Lancet COBRA study has received extraordinary promotion as evidence for the cost-effectiveness of substituting behavioural activation therapy (BA) delivered by minimally trained professionals for cognitive behaviour therapy (CBT) for depression. The study  is serving as the basis for proposals to cut costs in the UK National Health Service by replacing more expensive clinical psychologists with less trained and experienced providers.

Coached by the Science Media Centre, the authors of The Lancet study focused our attention on their finding of no inferiority of BA to CBT. They are distracting us from the more important question of whether either treatment had any advantage over nonspecific interventions in the unusual context in which they were evaluated.

The editorial accompanying the COBRA study suggests that BA involves a simple message delivered by providers with very little training:

“Life will inevitably throw obstacles at you, and you will feel down. When you do, stay active. Do not quit. I will help you get active again.”

I encourage readers to stop and think how depressed persons suffering substantial impairment, including reduced ability to experience pleasure, would respond to such suggestions. It sounds all too much like the “Snap out of it, Debbie” they may have already heard from people around them or in their own self-blame.

[Image: “Snap out of it, Debbie” (from South Park)]

 BA by any other name…

Actually, this kind of activation is routinely provided in primary care in some countries as a first-stage treatment in a stepped care approach to depression.

In such a system, when emergent mild to moderate depressive symptoms are uncovered in a primary medical care setting, providers are encouraged neither to initiate an active treatment nor even make a formal psychiatric diagnosis of a condition that could prove self-limiting with a brief passage of time. Rather, providers are encouraged to defer diagnosis and schedule a follow-up appointment. This is more than simple watchful waiting. Until the next appointment, providers encourage patients to undertake some guided self-help, including engagement in pleasant activities of their choice, much as apparently done in the BA condition in the COBRA study. Increasingly, they may encourage Internet-based therapy.

In a few parts of the UK, general practitioners may refer patients to a green gym.


It’s now appreciated that to have any effectiveness, such prescriptions have to be made in a relationship of supportive accountability. For patients to adhere adequately to such prescriptions and not feel they are simply being dismissed by the provider and sent away, they need to have a sense that the prescription is occurring within the context of a relationship with someone who cares whether they carry out and benefit from the prescription.

Used in this way, this BA component of stepped care could possibly be part of reducing unnecessary medication and the need for more intensive treatment. However, evaluation of cost effectiveness is complicated by the need for a support structure in which treatment can be monitored, including any antidepressant medication that is subsequently prescribed. Otherwise, the needs of a substantial number of patients needing more intensive, quality care for depression would be neglected.

The shortcomings of COBRA as an evaluation of BA in context

COBRA does not provide an evaluation of any system offering BA to the large pool of patients who do not require more intensive treatment, within a system where they would be provided appropriate, timely evaluation and onward referral.

It is in the nature of mild to moderate depressive symptoms presenting in primary care, especially when patients are not specifically seeking mental health treatment, that the threshold for a formal diagnosis of major depression is often met with the minimum five required symptoms or only one more. Diagnoses are of necessity unreliable, in part because the judgment of whether particular symptoms meet a minimal threshold of severity is unreliable. After a brief passage of time and in the absence of formal treatment, a substantial proportion of patients will no longer meet diagnostic criteria.

COBRA also does not evaluate BA versus CBT in the more select population that participates in clinical trials of treatment for depression. Sir David Goldberg is credited  with first describing the filters that operate on the pathway of patients from presenting a complex combination of problems in living and psychiatric symptoms in primary medical care to treatment in specialty settings.

Results of the COBRA study cannot be meaningfully integrated into the existing literature concerning BA as a component of stepped care or treatment for depression that is sufficient in itself.

More recently, I reviewed The Lancet COBRA study in detail, highlighting how one of the most ambitious and heavily promoted psychotherapy studies ever was noninformative. The authors’ claim that it would be wise to substitute BA delivered by minimally trained providers for cognitive behavior therapy delivered by clinical psychologists was unwarranted.

I refer readers to that blog post for further elaboration of some points I will be making here. For instance, some readers might want to refresh their sense of how a noninferiority trial differs from a conventional comparison of two treatments.

Risk of bias in noninferiority trials

 Published reports of clinical trials are notoriously unreliable and biased in terms of the authors’ favored conclusions.

With the typical evaluation of an active treatment versus a control condition, the risk of bias is that reported results will favor the active treatment. However, the issue of bias in a noninferiority trial is more complex. The investigators’ interest is in demonstrating that, within certain limits, there are no significant differences between two treatments. Yet, although it is not always tested directly, the intention is to show that this lack of difference is due to both treatments being effective, rather than ineffective.

In COBRA, the authors’ clear intention was to show that less expensive BA was not inferior to CBT, with the assumption that both were effective. Biases can emerge from building features into the design, analysis, and interpretation of the study that minimize differences between the two treatments. But bias can also arise from a study design in which nonspecific effects are distributed across interventions so that any difference in active ingredients is obscured by shared features of the circumstances in which the interventions are delivered. As in Alice in Wonderland [https://en.wikipedia.org/wiki/Dodo_bird_verdict ], the race is rigged so that almost everybody can get a prize.

Why COBRA could have shown almost any treatment with nonspecific effects was noninferior to CBT for depression

1. The investigators chose a population and a recruitment strategy that increased the likelihood that patients participating in the trial would get better with the minimal support and contact available in either of the two conditions, BA or CBT.

The recruited patients were not actively seeking treatment. They were identified from GP records as having had a diagnosis of depression, but were required not to be currently in psychotherapy.

GP recording of a diagnosis of depression has poor concordance with a formal, structured interview-based diagnosis, with considerable overdiagnosis and overtreatment.

A recent Dutch study found that persons meeting interview-based criteria for major depression in the community who do not have a past history of treatment mostly are not found to be depressed upon re-interview.

To be eligible for participation in the study, the patients also had to meet criteria for major depression in a semi-structured research interview (the Structured Clinical Interview for the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition [SCID]). Diagnoses with the SCID obtained under these circumstances are also likely to include a considerable proportion of false positives.

A dirty secret from someone who has supervised thousands of SCID interviews of medical patients: the developers of the SCID recognized that it yielded a lot of false positives and inflated rates of disorder among patients who are not seeking mental health care.

They attempted to compensate by requiring that respondents not only endorse symptoms, but indicate that the symptoms are a source of impairment. This is the so-called clinical significance criterion. Respondents automatically meet the criterion if they are seeking mental health treatment. Those who are not seeking treatment are asked directly whether the symptoms impair them. This is a particularly unvalidated aspect of the SCID, and such patients typically do not endorse their symptoms as a source of impairment.

When we asked breast cancer patients who otherwise met SCID criteria for depression whether the depressive symptoms impaired them, they uniformly said something like ‘No, my cancer impairs me.’ When we conducted a systematic study of the clinical significance criterion, we found that whether or not it was endorsed substantially affected individual diagnoses and overall rates of diagnosis. Robert Spitzer, who developed the SCID interview along with his wife Janet Williams, conceded to me in a symposium that application of the clinical significance criterion was a failure.

What is the relevance of this to the COBRA study? I would wager that the authors, like most investigators who use the SCID, did not inquire about the clinical significance criterion, and as a result they had a lot of false positives.

The population sampled and the recruitment strategy used in COBRA are likely to yield a sample unrepresentative of patients participating in the usual trials of psychotherapy and medication for depression.

2. Most patients participating in COBRA reported already receiving antidepressants at baseline, but their adherence and follow-up are unknown and likely to be inadequate.

Notoriously, patients receiving a prescription for an antidepressant in primary care actually take the medication inconsistently and for only a short time, if at all. They receive inadequate follow-up and reassessment. Their depression outcomes may actually be poorer than for patients receiving a pill placebo in the context of a clinical trial, where there is blinding and a high degree of positive expectations, attention and support.

Studies, including one by an author of the COBRA study, suggest that augmenting adequately managed antidepressant treatment with psychotherapy is unlikely to improve outcomes.

We’re stumbling upon one of the messier features of COBRA. Most patients had already been prescribed medication at baseline, but their adherence and follow-up are left unreported and are likely to be poor. The prescription is likely to have been made up to two years before baseline.

It would not be cost-effective to introduce psychotherapy to such a sample without reassessing whether they were adequately receiving medication. Such a sample would also be highly susceptible to nonspecific interventions providing positive expectations, support, and attention that they are not receiving in their antidepressant treatment. There are multiple ways in which nonspecific effects could improve outcomes – perhaps by improving adherence, but perhaps because of the healing effects of support on mild depressive symptoms.

3. The COBRA authors’ way of dealing with co-treatment with antidepressants blocked readers’ ability to independently evaluate main effects and interactions with BA versus CBT.

The authors used antidepressant treatment as a stratification factor, ensuring that the 70% of patients receiving antidepressants were evenly distributed across the BA and CBT conditions. This strategy made it more difficult to separate the effects of antidepressants. However, the problem is compounded by the authors’ failure to provide subgroup analyses based on whether patients had received an antidepressant prescription, as well as their failure to provide any description of the extent to which patients received management of their antidepressants at baseline or during active psychotherapy and follow-up. The authors incorporated data concerning the cost of medication into their economic analyses, but did not report the data in a way that could be scrutinized.

I anticipate requesting these data from the authors to find out more, although they have not responded to my previous query concerning anomalies in the reporting of how long since patients had first received a prescription for antidepressants.

4. The 12 month assessment designated for the primary outcome capitalized on natural recovery patterns, unreliability of initial diagnosis, and simple regression to the mean.

Depression identified in the community and in primary care patient populations is variable in its course, but typically resolves within nine months. Assessing primary outcomes at 12 months increases the likelihood that effects of active ingredients of the two treatments would be lost in a natural recovery process.

5. The intensity of treatment offered in the study (an allowable 20 sessions, plus four additional sessions) exceeded what is available in typical psychotherapy trials and exceeded what was actually accessed by patients.

Allowing this level of intensity of treatment generates a lot of noise in any interpretation of the resulting data. Offering so much treatment encourages patients to drop out, with the loss of their follow-up data. We can’t tell whether they dropped out because they had received what they perceived as sufficient treatment or because they were dissatisfied. This intensity of offered treatment reduces generalizability to what actually occurs in routine care and complicates comparing and contrasting the results of the COBRA study with the existing literature.

6. The low rates of actual uptake of psychotherapy and of retention of patients for follow-up present serious problems for interpreting the results of the COBRA study.

Intent-to-treat analyses with imputation of missing data are simply voodoo statistics with so much missing data. Imputation and other multivariate techniques assume that data are missing at random, but as I just noted, that is an improbable assumption here. [I refer readers who want to learn more about intent-to-treat versus per-protocol analyses back to my previous blog post.]

The authors cite past literature in their choice to emphasize the per-protocol analyses. That means they based their interpretation of the results on the 135 of 221 patients (61%) originally assigned to BA and the 151 of 219 patients (69%) originally assigned to CBT who met per-protocol criteria. This is a messy approach and precludes generalizing back to the original assignment. That is why intent-to-treat analyses are emphasized in conventional evaluations of psychotherapy.

A skeptical view of what will be done with the COBRA data

The authors’ clear intent was to produce data supporting an argument that more expensive clinical psychologists could be replaced by less trained clinicians providing a simplified treatment. The striking lack of differences between BA and CBT might be seen as strong evidence that BA could replace CBT. Yet I am suggesting that the striking lack of differences could also indicate features built into the design that swamped any differences and limited any generalizability to what would happen if all depressed patients were referred to BA delivered by clinicians with little training versus CBT. I’m arguing that homeopathy would have done as well.

BA is already being implemented in the UK and elsewhere as part of stepped care initiatives for depression. Inclusion of BA is inadequately evaluated, as is the overall strategy of stepped care. See here for an excellent review of stepped care initiatives and a tentative conclusion that they are moderately effective, but that many questions remain.

If the COBRA authors were most committed to improving the quality of depression care in the UK, they would have either designed their study as a fairer test of substituting BA for CBT or tackled the more urgent task of rigorously evaluating whether stepped care initiatives work.

Years ago, collaborative care programs for depression were touted as reducing overall costs. These programs, which were found to be robustly effective in many contexts, involved placing depression care managers in primary care to assist the GPs in improved monitoring and management of treatment. Often the most immediate and effective improvement was that patients got adequate follow-up, where previously they were simply being ignored. Collaborative care programs did not prove to be cheaper, and not surprisingly: better care is often more expensive than ineptly provided, inadequate care.

We should be extremely skeptical of experienced investigators who claim that they demonstrate that they can cut costs and maintain quality with a wholesale reduction in the level of training of providers treating depression, a complex and heterogeneous disorder, especially when their expensive study fails to deal with this complexity and heterogeneity.

 

A skeptical look at The Lancet behavioural activation versus CBT for depression (COBRA) study

A skeptical look at:

Richards DA, Ekers D, McMillan D, Taylor RS, Byford S, Warren FC, Barrett B, Farrand PA, Gilbody S, Kuyken W, O’Mahen H, et al. Cost and Outcome of Behavioural Activation versus Cognitive Behavioural Therapy for Depression (COBRA): a randomised, controlled, non-inferiority trial. The Lancet. 2016 Jul 23.

 

All the Queen’s horses and all the Queen’s men (and a few women) can’t put a flawed depression trial back together again.

Were they working below their pay grade? The 14 authors of the study collectively have impressive expertise. They claim to have obtained extensive consultation in designing and implementing the trial. Yet they produced:

  • A study doomed from the start by serious methodological problems that precluded any scientifically valid and generalizable results.
  • Instead, they produced tortured results that pander to policymakers seeking an illusory cheap fix.

 

Why the interests of persons with mental health problems are not served by translating the hype from a wasteful project into clinical practice and policy.

Maybe you were shocked and awed, as I was, by the publicity campaign mounted by The Lancet on behalf of a terribly flawed article in The Lancet Psychiatry about whether locked inpatient wards fail suicidal patients.

It was a minor league effort compared to the campaign orchestrated by the Science Media Centre for a recent article in The Lancet. The study concerned a noninferiority trial of behavioural activation (BA) versus cognitive behaviour therapy (CBT) for depression. The message echoing through social media without any critical response was that behavioural activation for depression delivered by minimally trained mental health workers was cheaper but just as effective as cognitive behavioural therapy delivered by clinical psychologists.

Reflecting the success of the campaign, the immediate reactions to the article are like nothing I have recently seen. Here are the published altmetrics for an article with an extraordinary overall score of 696 (!) as of August 24, 2016.


Here is the press release.

Here is the full article reporting the study, which nobody in the Twitter storm seems to have consulted.

[Images: examples of news coverage]

Here are supplementary materials.

Here is the well-orchestrated, uncritical response from tweeters, UK academics, and policy makers.


The Basics of the study

The study was an open-label, two-armed non-inferiority trial of behavioural activation therapy (BA) versus cognitive behavioural therapy (CBT) for depression, with no non-specific comparison/control treatment.

The primary outcome was depression symptoms measured with the self-report PHQ-9 at 12 months.

Delivery of both BA and CBT followed written manuals for a maximum of 20 60-minute sessions over 16 weeks, but with the option of four additional booster sessions if the patients wanted them. Receipt of eight sessions was considered an adequate exposure to the treatments.

The BA was delivered by

Junior mental health professionals—graduates trained to deliver guided self-help interventions, but with neither professional mental health qualifications nor formal training in psychological therapies—delivered an individually tailored programme re-engaging participants with positive environmental stimuli and developing depression management strategies.

CBT, in contrast, was delivered by

Professional or equivalently qualified psychotherapists, accredited as CBT therapists with the British Association of Behavioural and Cognitive Psychotherapy, with a postgraduate diploma in CBT.

The interpretation provided by the journal article:

Junior mental health workers with no professional training in psychological therapies can deliver behavioural activation, a simple psychological treatment, with no lesser effect than CBT has and at less cost. Effective psychological therapy for depression can be delivered without the need for costly and highly trained professionals.

A non-inferiority trial

An NHS website explains non-inferiority trials:

The objective of non-inferiority trials is to compare a novel treatment to an active treatment with a view of demonstrating that it is not clinically worse with regards to a specified endpoint. It is assumed that the comparator treatment has been established to have a significant clinical effect (against placebo). These trials are frequently used in situations where use of a superiority trial against a placebo control may be considered unethical.
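To make that logic concrete, here is a minimal sketch of the decision rule in code. The margin and summary statistics are placeholders I invented for illustration; they are not COBRA’s actual numbers:

```python
# Noninferiority decision rule on a difference in mean PHQ-9 scores.
# All numbers here are hypothetical.
margin = 1.9        # noninferiority margin in PHQ-9 points (assumed)
mean_diff = 0.3     # new treatment minus comparator; positive = new is worse
se_diff = 0.6       # standard error of that difference (assumed)

ci_upper = mean_diff + 1.96 * se_diff   # upper bound of the 95% CI
print(f"upper bound = {ci_upper:.2f}")
# Noninferiority is declared if even the worst plausible deficit is
# smaller than the margin.
print("noninferior" if ci_upper < margin else "inconclusive")
```

Note what the rule never asks: whether either treatment beat doing nothing.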

I have previously critiqued  [ 1,   2 ] noninferiority psychotherapy trials. I will simply reproduce a passage here:

Noninferiority trials (NIs) have a bad reputation. Consistent with a large literature, a recent systematic review of NI HIV trials  found the overall methodological quality to be poor, with a high risk of bias. The people who brought you CONSORT saw fit to develop special reporting standards for NIs  so that misuse of the design in the service of getting publishable results is more readily detected.

Basically, an NI RCT commits investigators and readers to accepting null results as support for a new treatment because it is no worse than an existing one. Suspicions are immediately raised as to why investigators might want to make that point.

Noninferiority trials are very popular among Pharma companies marketing rivals to popular medications. They use noninferiority trials to show that their brand is no worse than the already popular medication. But by not including a nonspecific control group, the trialists don’t bother to show that either of the medications is more effective than placebo under the conditions in which they were administered in these trials. Often, the medication dominating the market had achieved FDA approval with evidence of being only modestly effective. So, potatoes are noninferior to spuds.

Compounding the problems of a noninferiority trial many times over

Let’s not dwell on this trial being a noninferiority trial, although I will return to the problem of not knowing what would happen in the absence of either intervention or with a credible, nonspecific control group. Let’s focus instead on some other features of the trial that seriously compromised an already compromised trial.

Essentially, we will see that the investigators reached out to primary care patients who were mostly already receiving treatment with antidepressants, but unlikely to be receiving the support and positive expectations, or even showing the adherence, necessary to obtain benefit. By providing these nonspecific factors, any psychological intervention would be likely to prove effective in the short run.

The total amount of treatment offered substantially exceeded what is typically provided in clinical trials of CBT. However, uptake and actual receipt of treatment are likely to be low in a population recruited by outreach rather than actively seeking treatment. So, noise is being introduced by offering so much treatment.

A considerable proportion of primary care patients identified as depressed won’t accept treatment or will not accept the full intensity available. However, without careful consideration of data that are probably not available for this trial, it will be ambiguous whether the amount of treatment received by particular patients represents dropping out prematurely or simply stopping when satisfied with the benefits received. Undoubtedly, failures to receive a minimal intensity of treatment, and variation in the overall amount of treatment received, are substantial and complexly determined, but nonrandom, and differ between patients.

Dropping out of treatment is often associated with dropping out of a study – further data not being available for follow-up. These conditions set the stage for considerable challenges in analyzing and generalizing from whatever data are available. Clearly, the assumption of data being missing at random will be violated. But that is the key assumption required by multivariate statistical strategies that attempt to compensate for incomplete data.
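A toy simulation, with numbers invented for illustration, shows why the violation matters: if patients with worse outcomes are more likely to be missing, the observed cases paint too rosy a picture, and no method that assumes random missingness can recover the truth:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(10, 5, 100_000)                 # hypothetical 12 month scores
p_missing = 1 / (1 + np.exp(-(scores - 10) / 2))    # worse score -> more dropout
observed = scores[rng.random(scores.size) > p_missing]

print(scores.mean())     # ~10.0: the answer we want
print(observed.mean())   # noticeably lower: completers look healthier than the cohort
```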

12 months – the time point designated for assessment of primary outcomes – is likely to exceed the duration of a depressive episode in a primary care population, which is approximately 9 months. In the absence of a nonspecific active comparison/control or even a waitlist control group, recovery that would have occurred in the absence of treatment will be ascribed to the two active interventions being studied.

12 months is also likely to substantially exceed the end of any treatment being received, and so effects of any active treatments are likely to have dissipated. The design allowed for up to four booster sessions. However, access to booster sessions was not controlled; it was not assigned and cannot be assumed to be random. As we will see when we examine the CONSORT flowchart for the study, there was no increase in the number of patients receiving an adequate exposure to psychotherapy from 6 to 12 months. That likely indicates that most active treatment had ended within the first six months.

Focusing on 12 month outcomes, rather than 6 month outcomes, increases the unreliability of any analyses because more 12 month outcomes will be missing than were available at six months.

Taken together, the excessively long 12 month follow-up designated for the primary outcome and the unusually large amount of treatment offered, but not necessarily accepted, create substantial problems: missing data that cannot be compensated for by typical imputation and multivariate methods; difficulties interpreting results in terms of the amount of treatment actually received; and difficulties comparing results with the primary outcomes of typical psychotherapy trials, in which treatment is offered to patients actively seeking it.

The authors’ multivariate analysis strategy was inappropriate, given the amount of missing data and the violation of the assumption that data are missing at random.

Surely the more experienced of the 14 authors of The Lancet study should have anticipated these problems and the low likelihood that this study would produce generalizable results.

Recruitment of patients

The article states:

 We recruited participants by searching the electronic case records of general practices and psychological therapy services for patients with depression, identifying potential participants from depression classification codes. Practices or services contacted patients to seek permission for researcher contact. The research team interviewed those that responded, provided detailed information on the study, took informed written consent, and assessed people for eligibility.

Eligibility criteria

Eligible participants were adults aged 18 years or older who met diagnostic criteria for major depressive disorder assessed by researchers using a standard clinical interview (Structured Clinical Interview for the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition [SCID]). We excluded people at interview who were receiving psychological therapy, were alcohol or drug dependent, were acutely suicidal or had attempted suicide in the previous 2 months, or were cognitively impaired, or who had bipolar disorder or psychosis or psychotic symptoms.

Table 3 (Patient Characteristics) reveals a couple of things about co-treatment with antidepressants that must be taken into consideration in evaluating the design and interpreting the results.

[Table excerpts: antidepressant use and stratification at baseline]

So, the investigators did not wait for patients to refer themselves or be referred by physicians to the trial; they reached out to them. Applying their exclusion criteria, the investigators obtained a sample that mostly had been prescribed antidepressants, with no indication that the prescription had ended. The length of time for which the 70% of patients on antidepressants had been taking them was highly skewed, with a mean of 164 weeks and a median of 19 weeks. These figures strain credibility. I have reached out to the authors with a question about whether there is an error in the table and await clarification.

We cannot assume that patients whose records indicate they were prescribed an antidepressant were refilling their prescriptions at the time of recruitment, were faithfully adhering, or were even being monitored. The length of time since the initial prescription increases skepticism about whether there was adequate exposure to antidepressants at the time of recruitment to the study.

The inadequacy of antidepressant treatment in routine primary care

Refilling of first prescriptions of antidepressants in primary care, adherence, and monitoring and follow-up by providers are notoriously low.

Guideline-congruent treatment with antidepressants in the United States requires a five week follow-up visit, which is only infrequently received in routine care.

Rates of improvement in depression associated with prescription of an antidepressant in routine care approximate those achieved with pill placebo in antidepressant trials. The reasons for this are complex, but center on depression being of mild to moderate severity in primary care. Perhaps more important is that the attention, positive expectations, and support provided in routine primary care are lower than what is provided in the blinded pill-placebo condition of clinical trials. In blinded trials, neither the provider nor the patient knows whether the active medication or a pill placebo is being administered. The famous NIMH National Collaborative Study found, not surprisingly, that response in the pill-placebo condition was predicted by the quality of the therapeutic alliance between patient and provider.

In The Lancet study, readers are not provided with important baseline characteristics of the patients that are crucial to interpreting the results and their generalizability. We don’t know the baseline or subsequent adequacy of antidepressant treatment, or the quality of the routine care being provided for it. Given that antidepressants are not the first-line treatment for mild to moderate depression, we don’t know why these patients were not receiving psychotherapy. We don’t even know whether the recruited patients were previously offered psychotherapy and with what uptake, except that they were not receiving it two months prior to recruitment.

There is a fascinating missing story about why these patients were not receiving psychotherapy at the start of the study and why and with what accuracy they were described as taking antidepressants.

Readers are not told what happened to antidepressant treatment during the trial. To what extent did patients who were not receiving antidepressants begin doing so? As a result of the more frequent contact and support provided in the psychotherapy, to what extent was there improvement in adherence, as well as in the ongoing support and attention from primary care providers?

Depression identified in primary care is a highly heterogeneous condition, more so than among patients recruited from treatment in specialty mental health settings. Much of the depression has only the minimum number of symptoms required for a diagnosis, or one more. The reliability of diagnosis is therefore lower than in specialty mental health settings. Much of the depression and anxiety identified with semi-structured research instruments in populations not selected for having sought treatment resolves without formal intervention.

The investigators were using less than ideal methods to recruit patients from a population in which major depressive disorder is highly heterogeneous and subject to recovery in the absence of treatment by the time point designated for assessment of the primary outcome. They did not sufficiently address the problem of a high level of co-treatment having been prescribed long before the beginning of the study. They did not even assess the extent to which that prescribed treatment had patient adherence or provider monitoring and follow-up. The 12 month follow-up allowed the influence of many factors beyond the direct effects of the active ingredients of the two interventions being compared in the absence of a control group.

[Figure: decline in scores]

Examination of a table presented in the supplementary materials suggests that most change occurred in the first six months after enrollment and little thereafter. We don’t know the extent to which there was any treatment beyond the first six months or what effect it had. In a population with clinically significant depression drawn from specialty care, some deterioration can be expected after withdrawal of active treatment. In a primary care population, such a pattern could be produced in large part by the recovery from depression that would be observed in the absence of any active treatment.
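To make that last point concrete, below is a minimal simulation of how natural remission alone can generate a curve of declining mean scores that flattens after six months, with no treatment effect simulated at all. The score scale, remission probabilities, and noise are invented for illustration and are not estimates from the trial.

```python
# Illustrative sketch only: untreated remission mimicking "response".
# All numbers are assumptions, not estimates from The Lancet study.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
baseline = rng.normal(17, 3, n)  # hypothetical baseline symptom scores

def mean_score(cum_remit_prob: float) -> float:
    """Mean score once a given fraction of patients has remitted untreated."""
    remitted = rng.random(n) < cum_remit_prob
    followup = np.where(remitted,
                        rng.normal(5, 2, n),             # remitted: low scores
                        baseline + rng.normal(0, 2, n))  # others: noise only
    return followup.mean()

print(f"baseline:  {baseline.mean():.1f}")
print(f"6 months:  {mean_score(0.50):.1f}")  # assumed: half remit by 6 months
print(f"12 months: {mean_score(0.60):.1f}")  # assumed: little later remission
```

The simulated means drop steeply by six months and change little thereafter, reproducing the shape of the observed trajectory without any treatment.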


Cost-effectiveness analyses reported in the study address the wrong question. These analyses considered only the relative cost of the two active treatments, leaving unaddressed the more basic question of whether it is cost-effective to offer either treatment at this intensity. It might be more cost-effective to have a person with even less mental health training contact patients, inquire about adherence, side effects, and clinical outcomes, and prompt patients to accept another appointment with the GP if an algorithm indicates that would be appropriate.
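To be clear about what I mean by an algorithm, here is a hypothetical sketch of such a rule. The fields, thresholds, and the rule itself are my inventions for illustration; they are not drawn from the trial or from any guideline.

```python
# Hypothetical monitoring rule (all thresholds assumed for illustration):
# a low-cost worker records a brief check-in and flags patients for a GP
# appointment on non-adherence, side effects, or lack of response.
from dataclasses import dataclass

@dataclass
class CheckIn:
    phq9: int                     # current symptom score (0-27)
    baseline_phq9: int            # score when treatment began
    adherent: bool                # taking the antidepressant as prescribed
    troubling_side_effects: bool

def needs_gp_appointment(c: CheckIn) -> bool:
    """Flag non-adherence, side effects, or persisting unchanged symptoms."""
    non_response = c.phq9 >= 10 and (c.baseline_phq9 - c.phq9) < 5
    return (not c.adherent) or c.troubling_side_effects or non_response

# Example: adherent, no side effects, but essentially unchanged symptoms.
print(needs_gp_appointment(
    CheckIn(phq9=15, baseline_phq9=17, adherent=True,
            troubling_side_effects=False)))  # -> True
```

Whether something this simple would match the outcomes of either active treatment is exactly the question the reported cost-effectiveness analyses leave unanswered.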

The intensity of treatment being offered and received

The 20 sessions plus 4 booster sessions of psychotherapy offered in this trial is considerably more than the 12 to 16 sessions offered in the typical RCT for depression. Having more sessions available than is typical introduces complications. Results are not comparable to what is found in trials offering less treatment. And in a primary care population not actively seeking psychotherapy for depression, there is the further complication that many patients will not access the full 20 sessions. Interpreting results in terms of intensity of treatment becomes difficult because of the heterogeneity of reasons for getting less treatment, as the simulation below illustrates. Effectively, offering so much therapy to a group that is less inclined to accept psychotherapy introduces a lot of noise into the data, particularly when cost-effectiveness is at issue.
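A toy simulation, with all quantities invented, shows the interpretive trap: when sicker patients attend fewer sessions, a naive comparison makes sessions look effective even when they contribute nothing.

```python
# Toy example of confounding by indication (all quantities invented):
# sessions have NO causal effect on improvement in this model, yet the
# naive comparison below suggests that more sessions help.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
severity = rng.normal(0, 1, n)  # latent severity
# Sicker patients attend fewer sessions (an assumed selection mechanism).
sessions = np.clip(np.round(12 - 4 * severity + rng.normal(0, 3, n)), 0, 24)
# Improvement depends only on severity, never on sessions.
improvement = 10 - 3 * severity + rng.normal(0, 2, n)

high = sessions >= 12
print(f">=12 sessions: mean improvement {improvement[high].mean():.1f}")
print(f" <12 sessions: mean improvement {improvement[~high].mean():.1f}")
```

The gap between the two groups is produced entirely by who attends, not by anything the sessions do.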

This excerpt from the CONSORT flowchart demonstrates the multiple problems created by offering so much treatment to a population that was not actively seeking it, while still needing twelve-month data to interpret the results of the trial.

[Figure: CONSORT chart excerpt]

The number of patients with no data increased between six and 12 months. There was apparently no corresponding increase in the number of patients receiving an adequate exposure to psychotherapy.

Why the interests of people with mental health problems are not served by translating the results claimed by these investigators into clinical practice

The UK National Health Service (NHS) is seriously underfunding mental health services. Patients referred for psychotherapy from primary care face waiting periods that often exceed the expected length of an episode of depression in primary care. Simply waiting for depression to remit without treatment is not necessarily cost-effective, because of the unneeded suffering, role impairment, and associated social and personal costs of an episode that persists. Moreover, there is a subgroup of depressed patients in primary care who need more intensive or different treatment. Guidelines recommending assessment after five weeks are not usually reflected in actual clinical practice.

There’s a desperate search for ways in which costs can be further reduced in the NHS. The Lancet study is being interpreted to suggest that more expensive clinical psychologists can be replaced by less expensive and less trained mental health workers. Accepted uncritically and literally, the message is that clinical psychologists working half-time on particular common clinical problems can be replaced by less expensive mental health workers achieving the same effects in the same amount of time.

The pragmatic translation of these claims into practice is to replace clinical psychologists with cheaper mental health workers. I don’t think it’s cynical to anticipate the NHS seizing upon an opportunity to reduce costs while ignoring the effects on overall quality of care.

Care for the severely mentally ill in the NHS is already seriously compromised for other reasons. Patients experiencing an acute or chronic breakdown in psychological and social functioning often do not get even the minimal support and contact time needed to avoid more intensive and costly interventions like hospitalization. I think it would be naïve to expect that the resources freed up by replacing a substantial portion of clinical psychologists with minimally trained mental health workers would be put into addressing the unmet needs of the severely mentally ill.

Although not always labeled as such, some form of BA is integral to stepped-care approaches to depression in primary care. Before being prescribed antidepressants or referred to psychotherapy, patients are encouraged to increase pleasant activities. In Scotland, they may even be given free movie passes for participating in the cleanup of parks.

A stepped-care approach is attractive, but evaluating its cost-effectiveness is complicated by the need for adequate management of antidepressants for those patients who go on to that level of care.

If we are considering a sample of primary care patients mostly already receiving antidepressants, the relevant comparator is the introduction of a depression care manager.

Furthermore, there are questions about how adequately the needs of patients who do not benefit from lower-intensity care are addressed. Is the lack of improvement at low levels of care adequately monitored and acted upon? Is escalation to a higher level of care adequately supported so that referrals are completed? The sketch below shows the escalation logic those questions imply.
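For concreteness, here is a sketch of that escalation logic. The step labels, review rule, and referral handling are assumptions for illustration, not a description of the trial’s procedures or of any guideline.

```python
# Hedged sketch of stepped-care escalation (step names and rules assumed).
STEPS = [
    "watchful waiting plus activity scheduling",
    "guided self-help / low-intensity behavioural activation",
    "antidepressant with care management",
    "high-intensity psychotherapy",
]

def next_step(current: int, improved: bool, referral_completed: bool) -> int:
    """Escalate when review shows no improvement; chase incomplete referrals
    rather than silently losing the patient between steps."""
    if improved:
        return current                       # stay at the current step
    if not referral_completed:
        return current                       # re-attempt the referral first
    return min(current + 1, len(STEPS) - 1)  # step up one level

step = next_step(current=1, improved=False, referral_completed=True)
print(f"escalate to: {STEPS[step]}")  # -> antidepressant with care management
```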

The results of The Lancet study don’t tell us very much about the adequacy of the care that patients enrolled in the study were receiving, whether BA is as effective as CBT as a stand-alone treatment, or whether nonspecific treatments would have done as well. We don’t even know whether patients assigned to a waitlist control would have shown as much improvement by 12 months, and we have reason to suspect that many would.

I’m sure that the administrators of the NHS are delighted with the positive reception of this study. I think it should be greeted with considerable skepticism. I am disappointed that the huge resources that went into conducting this study could not have been put into more informative and useful research.

I end with two questions for the 14 authors: Can you recognize the shortcomings of your study and of the interpretation you have offered? Are you at least a little uncomfortable with the uses to which these results will be put?