Why PhD students should not evaluate a psychotherapy for their dissertation project

  • Things some clinical and health psychology students wish they had known before they committed themselves to evaluating a psychotherapy for their dissertation study.
  • A well designed pilot study addressing feasibility and acceptability issues in conducting and evaluating psychotherapies is preferable to an underpowered study which won’t provide a valid estimate of the efficacy of the intervention.
  • PhD students would often be better off as research parasites – making use of existing published data – rather than attempting to organize their own original psychotherapy study, if their goal is to contribute meaningfully to the literature and patient care.
  • Reading this blog, you will encounter a link to free, downloadable software that allows you to make quick determinations of the number of patients needed for an adequately powered psychotherapy trial.

I so relish the extra boost of enthusiasm that many clinical and health psychology students bring to their PhD projects. They not only want to complete a thesis of which they can be proud, they want their results to be directly applicable to improving the lives of their patients.

Many students are particularly excited about a new psychotherapy for which extravagant claims are being made that it is better than its rivals.

I have seen lots of fads and fashions come and go: third wave, new wave, and no wave therapies. When I was a PhD student, progressive relaxation was in. Then it died, mainly because it was so boring for the therapists who had to mechanically provide it. Client-centered therapy was fading amid doubts that anyone else could achieve the results of Carl Rogers or that his three facilitative conditions of unconditional positive regard, empathy, and congruence (genuineness) were actually distinguishable enough to study. Gestalt therapy was supercool because of the charisma of Fritz Perls, who distracted us with his showmanship from the utter lack of evidence for its efficacy.

I hate to see PhD students demoralized when their grand plans prove unrealistic. Inevitably, circumstances force them to compromise in ways that limit the usefulness of their project, and maybe even threaten their getting done within a reasonable time period. Overly ambitious plans are the formidable enemy of the completed dissertation.

The numbers are stacked against a PhD student conducting an adequately powered evaluation of a new psychotherapy.

This blog post argues against PhD students taking on the evaluation of a new therapy in comparison to an existing one, if they expect to complete their projects and make a meaningful contribution to the literature and to patient care.

I’ll be drawing on some straightforward analysis done by Pim Cuijpers to identify what PhD students are up against when trying to demonstrate that any therapy is better than treatments that are already available.

Pim has literally done dozens of meta-analyses, mostly of treatments for depression and anxiety. He commands a particular credibility, given the quality of this work. The way Pim and his colleagues present a meta-analysis is so straightforward and transparent that you can readily examine the basis of what he says.

Disclosure: I collaborated with Pim and a group of other authors in conducting a meta-analysis as to whether psychotherapy was better than a pill placebo. We drew on all the trials allowing a head-to-head comparison, even though nobody ever really set out to pit the two conditions against each other as their first agenda.

Pim tells me that the brief and relatively obscure letter, New Psychotherapies for Mood and Anxiety Disorders: Necessary Innovation or Waste of Resources?, on which I will draw is among his most unpopular pieces of work. Lots of people don’t like its inescapable message. But I think that if PhD students pay attention to it, they might avoid a lot of pain and disappointment.

But first…

Note how many psychotherapies have been claimed to be effective for depression and anxiety. Anyone trying to make sense of this literature has to contend with claims based on a lot of underpowered trials – too small in sample size to reasonably detect the effects that investigators claim – and that are otherwise compromised by methodological limitations.

Some investigators were simply naïve about clinical trial methodology and the difficulties of doing research with clinical populations. They may not have understood statistical power.

But many psychotherapy studies end up in bad shape because the investigators were unrealistic about the feasibility of what they were undertaking and the low likelihood that they could recruit patients in the numbers that they had planned in the time that they had allotted. After launching the trial, they had to change strategies for recruitment, maybe relax their selection criteria, or even change the treatment so it was less demanding of patients’ time. And they had to make difficult judgments about what features of the trial to drop when resources ran out.

Declaring a psychotherapy trial to be a “preliminary” or a “pilot study” after things go awry

The titles of more than a few articles reporting psychotherapy trials contain the apologetic qualifier after a colon: “a preliminary study” or “a pilot study”. But the studies weren’t intended at the outset to be preliminary or pilot studies. The investigators are making excuses post-hoc – after the fact – for not having been able to recruit sufficient numbers of patients and for having had to compromise their design from what they had originally planned. The best they can hope is that the paper will somehow be useful in promoting further research.

Too many studies from which effect sizes are entered into meta-analyses should have been left as pilot studies and not considered tests of the efficacy of treatments. The rampant problem in the psychotherapy literature is that almost no one treats small-scale trials as mere pilot studies. In a recent blog post, I provided readers with some simple screening rules to identify meta-analyses of psychotherapy studies that they could dismiss from further consideration. One was whether there were sufficient numbers of adequately powered studies. Often there are not.

Readers take the inflated claims from small studies seriously, when these estimates should be seen as unrealistic and unlikely to be replicated, given a study’s sample size. The large effect sizes that are claimed are likely the product of p-hacking and the confirmation bias required to get published. With enough alternative outcome variables to choose from and enough flexibility in analyzing and interpreting data, almost any intervention can be made to look good.

The problem is readily seen in the extravagant claims about acceptance and commitment therapy (ACT), which are so heavily dependent on small, under-resourced studies supervised by promoters of ACT that should not have been used to generate effect sizes.

Back to Pim Cuijpers’ brief letter. He argues, based on his numerous meta-analyses, that it is unlikely that a new treatment will be substantially more effective than an existing credible, active treatment. There are some exceptions, like relaxation training versus cognitive behavior therapy for some anxiety disorders, but mostly only small differences of no more than d = .20 are found between two active, credible treatments. If you search the broader literature, you can find occasional exceptions, like CBT versus psychoanalysis for bulimia, but most that you find prove to be false positives, usually based on investigator bias in conducting and interpreting a small, underpowered study.

You can see this for yourself using the freely downloadable G*Power program: plug in d = 0.20 and calculate the number of patients needed for a study. To be safe, add more patients to allow for the expectable 25% dropout rate that has occurred across trials. The number you get would require a larger study than has ever been done in the past, including the well-financed NIMH Collaborative trial.

[Screenshot: G*Power analyses]
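Here is a minimal sketch of that calculation in Python, using the statsmodels package as a stand-in for G*Power; the two-sided alpha of .05 and 80% power are my assumptions, not figures from Cuijpers’ letter:

```python
# Minimal sketch: sample size for detecting d = 0.20 between two active treatments.
from math import ceil
from statsmodels.stats.power import TTestIndPower

d = 0.20        # typical difference between two active, credible treatments
dropout = 0.25  # expectable dropout rate across trials

n_per_group = TTestIndPower().solve_power(effect_size=d, alpha=0.05,
                                          power=0.80, alternative='two-sided')
n_recruited = ceil(n_per_group / (1 - dropout))  # inflate recruitment for dropout

print(f"~{ceil(n_per_group)} completers per group")    # ~394
print(f"~{n_recruited} recruited per group")           # ~525
print(f"~{2 * n_recruited} patients recruited total")  # ~1050
```

Over a thousand patients in total, which dwarfs almost any psychotherapy trial ever conducted.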

Even more patients would be needed for the ideal situation in which a third comparison group allowed the investigator to show that the active comparison treatment had actually performed better than a nonspecific treatment, delivered with the same effectiveness that it had shown in earlier trials. Otherwise, a defender of the established therapy might argue that the older treatment had not been properly implemented.

So, unless warned off, the PhD student plans a study to show not only that the null hypothesis can be rejected that the new treatment is no better than the existing one, but that in the same study the existing treatment had been shown to be better than a waitlist. Oh my, just try to find an adequately powered, properly analyzed example of a comparison of two active treatments plus a control comparison group in the existing published literature. The few examples of three-group designs in which a new psychotherapy had come out better than an effectively implemented existing treatment are grossly underpowered.

These calculations so far have all been based on what would be needed to reject the null hypothesis of no difference between the active treatment and a more established one. But if the claim is that the new treatment is superior to the existing treatment, our PhD student now needs to conduct a superiority trial in which some criterion is pre-set (such as greater than a moderate difference, d = .30) and the null hypothesis is that the advantage of the new treatment is less than that. We are now way out into the fantasyland of breakthrough, but never-completed, dissertation studies.
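To see just how far out this is, here is a rough sketch (my own normal-theory approximation, not anything from Cuijpers’ letter): once the null hypothesis is an advantage of at least d = .30 rather than zero, what drives the sample size is the gap between the true advantage and that margin.

```python
# Rough sketch of a superiority-by-margin calculation (normal-theory approximation).
from math import ceil
from statsmodels.stats.power import TTestIndPower

true_advantage = 0.50  # generous assumption; Cuijpers' work suggests ~0.20 is typical
margin = 0.30          # pre-set superiority criterion
effective_d = true_advantage - margin  # the difference the test actually has to detect

n_per_group = TTestIndPower().solve_power(effect_size=effective_d, alpha=0.05,
                                          power=0.80, alternative='larger')
print(f"~{ceil(n_per_group)} completers per group, before allowing for dropout")  # ~310
# If the true advantage is only ~0.20, no sample size will do:
# the null hypothesis (advantage below 0.30) is then actually true.
```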

Two take away messages

The first take away message is that we should be skeptical of claims that a new treatment is better than past ones, except when the claim occurs in a well-designed study with some assurance that it is free of investigator bias. But the claim also has to arise in a trial that is larger than almost any psychotherapy study that has ever been done. Yup, most comparative psychotherapy studies are underpowered, and we cannot expect claims that one treatment is superior to another to be robust.

But for PhD students doing a dissertation project, the second take away message is that they should not attempt to show that one treatment is superior to another without resources they probably don’t have.

The psychotherapy literature does not need another study with too few patients to support its likely exaggerated claims.

An argument can be made that it is unfair and even unethical to enroll patients in a psychotherapy RCT with an insufficient sample size. Some of the patients will be randomized to a control condition that is not what attracted them to the trial. All of the patients will be denied having been in a trial that makes a meaningful contribution to the literature and to better care for patients like themselves.

What should the clinical or health psychology PhD student do, besides maybe curb their enthusiasm? One opportunity to make a meaningful contribution to the literature is by conducting small studies testing hypotheses that can lead to improvements in the feasibility or acceptability of treatments to be tested in studies with more resources.

Think of what would have been accomplished if PhD students had determined in modest studies that it is tough to recruit and retain patients in an Internet therapy study without some communication to the patients that they are involved in a human relationship – without their having what Pim Cuijpers calls supportive accountability. Patients may stay involved with the Internet treatment when it proves frustrating only because they have support from, and accountability to, someone beyond their encounter with an impersonal computer. Somewhere out there, there is a human being who supports them in sticking it out with the Internet psychotherapy and will be disappointed if they don’t.

A lot of resources have been wasted in Internet therapy studies in which patients have not been convinced that what they are doing is meaningful or that they have the support of a human being. They drop out or fail to do diligently any homework expected of them.

Similarly, mindfulness studies are routinely being conducted without anyone establishing that patients actually practice mindfulness in everyday life or what they would need to do so more consistently. The assumption is that patients assigned to the mindfulness condition diligently practice mindfulness daily. A PhD student could make a valuable contribution to the literature by examining the rates at which patients actually practice mindfulness when they have been assigned to it in a psychotherapy study, along with the barriers and facilitators of their doing so. A discovery that the patients are not consistently practicing mindfulness might explain weaker findings than anticipated. One could even suggest that any apparent effects of practicing mindfulness were actually nonspecific: patients getting caught up in the enthusiasm of being offered a treatment they had sought, but not actually practicing mindfulness.

An unintended example: How not to recruit cancer patients for a psychological intervention trial

Sometimes PhD students just can’t be dissuaded from undertaking an evaluation of a psychotherapy. I was a member of a PhD committee of a student who at least produced a valuable paper concerning how not to recruit cancer patients for a trial evaluating problem-solving therapy, even though the project fell far short of conducting an adequately powered study.

The PhD student was aware that claims of the effectiveness of problem-solving therapy reported in the prestigious Journal of Consulting and Clinical Psychology were exaggerated. The developer of problem-solving therapy for cancer patients (and current JCCP Editor) claimed a huge effect size – 3.8 if only the patient were involved in treatment and an even better 4.4 if the patient had an opportunity to involve a relative or friend as well. Effect sizes from this trial have subsequently had to be excluded from at least four meta-analyses as extreme outliers (1, 2, 3, 4).

The student adopted the much more conservative assumption that a moderate effect size of .6 would be obtained in comparison with a waitlist control. You can use G*Power to see that 50 patients would be needed per group, 60 if allowance is made for dropouts.
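For readers without G*Power at hand, roughly the same numbers can be reproduced in Python; the 80 to 85% power and the two-sided alpha of .05 are my assumptions, and the exact figure shifts a little with the power level chosen:

```python
# Minimal sketch of the student's calculation: d = 0.6 versus a waitlist control.
from math import ceil
from statsmodels.stats.power import TTestIndPower

n_80 = TTestIndPower().solve_power(effect_size=0.6, alpha=0.05,
                                   power=0.80, alternative='two-sided')
n_85 = TTestIndPower().solve_power(effect_size=0.6, alpha=0.05,
                                   power=0.85, alternative='two-sided')

print(ceil(n_80))               # ~45 completers per group at 80% power
print(ceil(n_85))               # ~50 completers per group at 85% power
print(ceil(n_80 / (1 - 0.25)))  # ~60 per group recruited, allowing ~25% dropout
```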

Such a basically inert control group, of course, has a greater likelihood of seeming to demonstrate that a treatment is effective than when the comparison is another active treatment. Yet such a control group also does not allow a determination of whether it was the active ingredient of the treatment that made the difference, or just the attention, positive expectations, and support that were not available in the waitlist control group.

But PhD students should have the same option as their advisors to contribute another comparison between an active treatment and a waitlist control to the literature, even if it does not advance our knowledge of psychotherapy. They can take the same low road to a successful career that so many others have traveled.

This particular student was determined to make a different contribution to the literature. Notoriously, studies of psychotherapy with cancer patients often fail to recruit samples that are distressed enough to register any effect. The typical breast cancer patient, for instance, who seeks to enroll in a psychotherapy or support group trial does not have clinically significant distress. The prevalence of positive effects claimed for interventions with cancer patients in published studies likely reflects confirmation bias.

The student wanted to address this issue by limiting the patients whom she enrolled in the study to those with clinically significant distress. Enlisting colleagues, she set up screening of consecutive cancer patients in the oncology units of local hospitals. Patients were first screened for self-reported distress and, if they were distressed, for whether they were interested in services. Those who met both criteria were then re-contacted to see if they would be willing to participate in a psychological intervention study, without the intervention being identified. As I reported in the previous blog post:

  • Combining results of the two screenings, 423 of 970 patients reported distress, of whom 215 patients indicated need for services.
  • Only 36 (4% of 970) patients consented to trial participation.
  • We calculated that 27 patients needed to be screened to recruit a single patient, with 17 hours of time required for each patient recruited.
  • 41% (n = 87) of the 215 distressed patients who had initially indicated a need for services later reported no need for psychosocial services, mainly because they felt better or thought that their problems would disappear naturally.
  • Finally, 36 patients were eligible and willing to be randomized, representing 17% of 215 distressed patients with a need for services.
  • This represents 8% of all 423 distressed patients, and 4% of the 970 screened patients (see the sketch just below for the arithmetic).
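The arithmetic behind these percentages is simple enough to verify; a minimal sketch (the counts are taken from the bullets above):

```python
# Recruitment funnel from the screening study (counts from the bullets above).
screened   = 970
distressed = 423
need       = 215  # distressed patients indicating a need for services
randomized = 36

print(f"distressed: {distressed / screened:.1%} of those screened")      # 43.6%
print(f"screened per patient recruited: {screened / randomized:.0f}")    # 27
print(f"randomized: {randomized / need:.1%} of those reporting a need")  # 16.7%
print(f"randomized: {randomized / distressed:.1%} of the distressed")    # 8.5%
print(f"randomized: {randomized / screened:.1%} of everyone screened")   # 3.7%
```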

So, the PhD student’s heroic effort did not yield the sample size that she had anticipated. But she ended up making a valuable contribution to the literature that challenges some of the basic assumptions being made about cancer patients in psychotherapy research – that all or most are distressed. She also ended up producing some valuable evidence that the minority of cancer patients who report psychological distress are not necessarily interested in psychological interventions.

Fortunately, she had been prepared to collect systematic data about these research questions, not just scramble within a collapsing effort at a clinical trial.

Becoming a research parasite as an alternative to PhD students attempting an under-resourced study of their own

Psychotherapy trials represent an enormous investment of resources, not only in the public funding that is often provided for them, but in the time, inconvenience, and exposure to ineffective treatments experienced by patients who participate in the trials. Increasingly, funding agencies require that investigators who get money to do a psychotherapy study at some point make their data available for others to use. The 14 prestigious medical journals whose editors make up the International Committee of Medical Journal Editors (ICMJE) each published earlier in 2016 a declaration that:

there is an ethical obligation to responsibly share data generated by interventional clinical trials because participants have put themselves at risk.

These statements proposed that as a condition for publishing a clinical trial, investigators would be required to share with others appropriately de-identified data not later than six months after publication. Further, the statements proposed that investigators describe their plans for sharing data in the registration of trials.

Of course, a proposal is only exactly that, a proposal, and these requirements were intended to take effect only after the document was circulated and ratified. The incomplete and inconsistent adoption of previous proposals for registering trials in advance and for investigators making declarations of conflicts of interest does not encourage a lot of enthusiasm that we will see uniform implementation of this bold proposal anytime soon.

Some editors of medical journals are already expressing alarm over the prospect of data sharing becoming required. The editors of the New England Journal of Medicine were lambasted in social media for raising worries about “research parasites” exploiting the availability of data:

a new class of research person will emerge — people who had nothing to do with the design and execution of the study but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited. There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”

Richard Lehman’s Journal Review at The BMJ’s blog delivered a brilliantly sarcastic response to these concerns that concludes:

I think we need all the data parasites we can get, as well as symbionts and all sorts of other creatures which this ill-chosen metaphor can’t encompass. What this piece really shows, in my opinion, is how far the authors are from understanding and supporting the true opportunities of clinical data sharing.

However, lost in all the outrage that The New England Journal of Medicine editorial generated was a more conciliatory proposal at the end:

How would data sharing work best? We think it should happen symbiotically, not parasitically. Start with a novel idea, one that is not an obvious extension of the reported work. Second, identify potential collaborators whose collected data may be useful in assessing the hypothesis and propose a collaboration. Third, work together to test the new hypothesis. Fourth, report the new findings with relevant coauthorship to acknowledge both the group that proposed the new idea and the investigative group that accrued the data that allowed it to be tested. What is learned may be beautiful even when seen from close up.

The PLOS family of journals has gone on record as requiring that all data for papers published in their journals be publicly available without restriction. A February 24, 2014 announcement, PLOS’ New Data Policy: Public Access to Data, declared:

In an effort to increase access to this data, we are now revising our data-sharing policy for all PLOS journals: authors must make all data publicly available, without restriction, immediately upon publication of the article. Beginning March 3rd, 2014, all authors who submit to a PLOS journal will be asked to provide a Data Availability Statement, describing where and how others can access each dataset that underlies the findings. This Data Availability Statement will be published on the first page of each article.

Many of us are aware of the difficulties in achieving this lofty goal. I am holding my breath and turning blue, waiting for some specific data.

The BMJ has expanded its previous requirements for data being made available:

Loder E, Groves T. The BMJ requires data sharing on request for all trials. BMJ. 2015 May 7;350:h2373.

The movement to make data from clinical trials widely accessible has achieved enormous success, and it is now time for medical journals to play their part. From 1 July The BMJ will extend its requirements for data sharing to apply to all submitted clinical trials, not just those that test drugs or devices. The data transparency revolution is gathering pace.

I am no longer heading dissertation committees after one that I am currently supervising is completed. But if any PhD students asked my advice about a dissertation project concerning psychotherapy, I would strongly encourage them to enlist their advisor to identify and help them negotiate access to a data set appropriate to the research questions they want to investigate.

Most well-resourced psychotherapy trials have unpublished data concerning how they were implemented, with what bias, and with which patient groups ending up underrepresented or inadequately exposed to the intensity of treatment presumed to be needed for benefit. A story awaits to be told. The data available from a published trial are usually much more adequate than any data a graduate student could collect with the limited resources available for a dissertation project.

I look forward to the day when such data are put into a repository where anyone can access them.

In this blog post I have argued that PhD students should not take on responsibility for developing and testing a new psychotherapy for their dissertation project. I think that using data from existing published trials is a much better alternative. However, PhD students may currently find it difficult, but certainly not impossible, to get appropriate data sets. I certainly am not recruiting them to be front-line infantry in advancing the cause of routine data sharing. But they can make an effort to obtain such data, and they deserve all the support they can get from their dissertation committees, both in obtaining data sets and in recognizing realistically when data are not being made available, even when availability was promised as a condition of publication. Advisors, please request the data from published trials for your PhD students and protect them from the heartache of trying to collect such data themselves.

 

Cognitive behavior and psychodynamic therapy no better than routine care for anorexia.

Putting a positive spin on an ambitious, multisite trial doomed from the start.

I announced in my last blog post that this one would be about bad meta-analyses of weak data used to secure insurance reimbursement for long-term psychotherapy. But that is postponed so that I can give timely coverage to the report in The Lancet of the results of the Anorexia Nervosa Treatment of OutPatients (ANTOP) randomized clinical trial (RCT). The trial, proclaimed the largest ever of its kind, compared cognitive behavior therapy, focal psychodynamic therapy, and “optimized” routine care for the treatment of anorexia.

This post is an apt sequel to my last one. I had expressed a lot of enthusiasm for an RCT comparing cognitive behavior therapy (CBT) to psychoanalytic therapy for bulimia. I was impressed with its design and execution and the balance of competing investigator allegiances. The article’s reporting was transparent, substantially reducing the risk of bias and allowing a clear message. You will not see me very often being so positive about a piece of research in this blog, although I did note some limitations.

Hands down, CBT did better than psychoanalytic therapy in reducing binging and purging, despite there being only five months of cognitive therapy compared with two years of psychoanalysis. This difference seems to be a matter of psychoanalysis doing quite poorly, not of CBT doing so well.

However, on my Facebook wall, Ioana Cristea, a known contrarian and evidence-based skeptic like myself, posted a comment about my blog:

Did you see there’s also a recent very similar Lancet study for anorexia? With different results, of course.

She was referring to

Zipfel, Stephan, Beate Wild, Gaby Groß, Hans-Christoph Friederich, Martin Teufel, Dieter Schellberg, Katrin E. Giel et al. Focal psychodynamic therapy, cognitive behaviour therapy, and optimised treatment as usual in outpatients with anorexia nervosa (ANTOP study): randomised controlled trial. The Lancet (2013).

The abstract of the Lancet article is available here, but the full text is behind a pay wall. Fortunately, the registered trial protocol for the study is available open access here. You can at least get the details of what the authors said they were going to do, ahead of doing it.

For an exceedingly quick read, try the press release for the trial here, entitled

Largest therapy trial worldwide: Psychotherapy treats anorexia effectively.

Or see an example of thoroughly uncritical churnalling of this press release in the media here.

What we are told about anorexia

Media portrayals of anorexia often show the extreme self-starvation associated with the severe disorder, but this study recruited women with mild to moderate anorexia.

The introduction of the ANTOP article states

  • Anorexia nervosa is associated with serious medical morbidity and pronounced psychosocial comorbidity.
  • It has the highest mortality rate of all mental disorders, and relapse happens frequently.
  • The course of illness is very often chronic, particularly if left untreated.

A sobering accompanying editorial in Lancet stated

The evidence base for anorexia nervosa treatment is meagre [1, 2, 3] considering the extent to which this disorder erodes quality of life and takes far too many lives prematurely [4]. But clinical trials for anorexia nervosa are difficult to conduct, attributable partly to some patients’ deep ambivalence about recovery, the challenging task of offering a treatment designed to remove symptoms that patients desperately cling to, the fairly low prevalence of the disorder, and high dropout rates. The combination of high dropout and low treatment acceptability has led some researchers to suggest that we pause large-scale clinical trials for anorexia nervosa until we resolve these fundamental obstacles.

What the authors claim this study found.

The press release states

“Overall, the two new types of therapy demonstrated advantages compared to the optimized therapy as usual,” said Prof. Zipfel. “At the end of our study, focal psychodynamic therapy proved to be the most successful method, while the specific cognitive behavior therapy resulted in more rapid weight gain.”

And the abstract

At the end of treatment, BMI [body mass index] had increased in all study groups (focal psychodynamic therapy 0·73 kg/m², enhanced cognitive behavior therapy 0·93 kg/m², optimised treatment as usual 0·69 kg/m²); no differences were noted between groups (mean difference between focal psychodynamic therapy and enhanced cognitive behaviour therapy –0·45, 95% CI –0·96 to 0·07; focal psychodynamic therapy vs optimised treatment as usual –0·14, –0·68 to 0·39; enhanced cognitive behaviour therapy vs optimised treatment as usual 0·30, –0·22 to 0·83). At 12-month follow-up, the mean gain in BMI had risen further (1·64 kg/m², 1·30 kg/m², and 1·22 kg/m², respectively), but no differences between groups were recorded (0·10, –0·56 to 0·76; 0·25, –0·45 to 0·95; 0·15, –0·54 to 0·83, respectively). No serious adverse events attributable to weight loss or trial participation were recorded.

How can we understand results presented in terms of changes in BMI?

You can find out more about BMI (body mass index, calculated as weight in kilograms divided by the square of height in meters) here, and you can calculate your own here. But note that BMI is a controversial measure: it does not directly assess body fat, and it is not particularly accurate for people who are large- or small-framed, or fit, or athletic.

These patients had to have been quite underweight to be diagnosed with anorexia, so how much weight did they gain as a result of treatment? The authors should have given us the results in numbers that make sense to most people.

The young adult women in the study averaged 46.7 kg, or 102.7 pounds, at the beginning of the study. I had to do some calculations to translate the changes in BMI reported by these authors, assuming that the women were of an average height of 5’6”, like other German women.

By the end of the 10-month treatment, the women had gained an average of about 5 pounds, and at 12 months after the end of treatment (so 22 months after beginning treatment), they had gained roughly another 3 to 4 pounds.
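Here is a minimal sketch of that back-of-envelope arithmetic; the 5’6” (1.68 m) height and the averaging of BMI changes across the three arms are my assumptions, and the exact poundage shifts a little with those choices:

```python
# Converting the reported BMI changes into weight gained, assuming an average
# height of 5'6" (1.68 m). BMI = kg / m^2, so weight change (kg) = dBMI * height^2.
KG_TO_LB = 2.2046
height_m = 1.68

dbmi_end_of_treatment = (0.73 + 0.93 + 0.69) / 3  # mean change across the three arms
dbmi_follow_up        = (1.64 + 1.30 + 1.22) / 3

lb_end = dbmi_end_of_treatment * height_m ** 2 * KG_TO_LB
lb_fup = dbmi_follow_up * height_m ** 2 * KG_TO_LB

print(f"end of treatment: ~{lb_end:.1f} lb gained")                   # ~4.9 lb
print(f"12-month follow-up: ~{lb_fup:.1f} lb gained in total")        # ~8.6 lb
print(f"additional gain after treatment: ~{lb_fup - lb_end:.1f} lb")  # ~3.8 lb
```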

On average, the women participating in the trial were still underweight 22 months after the trial’s start and would have still qualified for entering the trial, at least according to the weight criterion.

How the authors explain their results.

Optimised treatment as usual, combining psychotherapy and structured care from a family doctor, should be regarded as solid baseline treatment for adult outpatients with anorexia nervosa. Focal psychodynamic therapy proved advantageous in terms of recovery at 12-month follow-up, and enhanced cognitive behaviour therapy was more effective with respect to speed of weight gain and improvements in eating disorder psychopathology. Long-term outcome data will be helpful to further adapt and improve these novel manual-based treatment approaches.

My assessment after reading this article numerous times and consulting supplementary material:

  • Anorexia was treated with two therapies, each compared to an unusual control condition termed “optimized” treatment as usual. When the study was over and even in follow-up, anorexia won and the treatments lost.
  • In interpreting these results, note that the study involved a sample of young women with mostly only mild to moderate anorexia. Only a little more than half had full syndrome anorexia.
  • In post hoc “exploratory analyses,” the authors emphasized a single measure at a single time point that favored focal psychodynamic therapy, despite null findings with most other standard measures at all time points.
  • The authors expressed their outcomes as within-group effect sizes. This is an unusual way of presenting results that exaggerates them, particularly when comparisons are made to the effect sizes reported for other studies.
  • Put another way, results of the trial were very likely spun, starting with the abstract, and continuing in the results and press release.
  • The study demonstrates the difficulty of treating anorexia and of evaluating its treatment. Only modest increases in body weight were obtained despite intensive treatment. Interpretation of what happened is complicated by high rates of dropping out of therapy and loss to follow-up, and by the necessity of inpatient stays and other supplementary treatment.
  • The optimized routine care condition involved ill-described, uncontrolled psychotherapeutic and medical interventions. Little sense can be made of this clinical trial except that availability of the manualized treatments proved no better (or no worse) than routine care, and none of the treatments, including routine care, did particularly well.
  • The study is best understood as testing the effectiveness of treating anorexia in some highly unusual circumstances in Germany, not an efficacy trial testing the strength of the two treatments. Results are not generalizable to either of the psychotherapies administered by themselves in other contexts.
  • The study probably demonstrates that  meaningful RCTs of the treatment of anorexia cannot be conducted in Germany with generalizable results.
  • Maybe this trial is just another demonstration that we do not know enough to undertake a randomized study of the treatment of anorexia that would yield readily interpretable findings.

Sad, sad, sad. So you can stop here if all you wanted was my evaluation. Or you can continue reading to find out how I arrived at it and whether you agree.

Outcomes for the trial: why am I so unimpressed?

On average, the women were still underweight at follow-up, despite having had only mild to moderate anorexia at the start of the study. The sample was quite heterogeneous at baseline. We don’t know how much of the modest weight gain, and of the minority of women who were considered “fully recovered,” represents small improvements in women starting with higher BMI and milder, subsyndromal anorexia at baseline.

Any discussion of outcomes has to take into account the substantial number of women not completing treatment and lost to follow up.

Missing data can be estimated with fancy imputation techniques. But these are not magic, and they involve assumptions that cannot be tested when patients are lost to follow-up in such small treatment groups. And yet we need some way to account for all patients initially entering a clinical trial (termed an intent-to-treat analysis) for valid, generalizable results. So we cannot ignore these problems and simply concentrate on the women completing treatment and remaining available.

And then there is the issue of nonstudy treatment, including inpatient stays. The study has no way of taking them into account, other than reporting them. Inpatient stays could have occurred for different reasons across the three conditions. We cannot determine if the inpatient stays contributed to the results that were observed or maybe interfered with the outpatient treatment. But here too, we cannot simply ignore this factor.

We certainly cannot assume that failures to complete treatment, loss to follow-up, and the necessity of inpatient stays are randomly distributed between groups. We cannot convincingly rule out that some combination of these factors was decisive for the results that were obtained.

The spinning of the trial in favor of focal psychodynamic treatment.

The preregistration of the trial listed BMI at the end of treatment as the primary outcome. That means the investigators staked any claims about the trial on this outcome at this time point. There were no overall differences.

The preregistration also listed numerous secondary outcomes: the Morgan-Russell criteria; general axis I psychopathology (SCID I); eating disorder-specific psychopathology (SIAB-Ex; Eating Disorder Inventory-2); severity of depressive comorbidity (PHQ-9); and quality of life according to the SF-36. Not all of these outcomes are reported in the article, and of those that are reported, almost all do not differ significantly between groups at any time point.

The authors’ failure to designate one or two of these variables a priori (ahead of time) sets them up to pick the best, i.e., hypothesizing after the results are known (HARKing). We do not actually know what was done, but there is a high risk of bias.

We should in general be highly skeptical about post hoc exploratory analyses of variables that were not pre-designated as outcomes for a clinical trial, in either primary or secondary analyses.

In table 3 of their article, the investigators present within-group effect sizes that portray the manualized treatments as doing impressively well.

[Table 3 from the ANTOP article: within-group effect sizes]

Yet, as I will discuss in forthcoming blog posts, within-group effect sizes are highly misleading compared to the usually reported between-group effect sizes. Within-group effect sizes attribute all change that occurred in a particular group to the effects of the intervention. That includes claiming credit for nonspecific effects common across conditions, as well as any improvement due to positive expectations or patients bouncing back after having enrolled in the study at a particularly bad time.

The conventional strategy is to provide between-group effect sizes comparing a treatment to what was obtained in the other groups. This preserves the effects of randomization and makes use of what can be learned from comparison/control conditions. Treatments do not have effect sizes, but comparisons of treatments do.

As an example, we do not pay much attention to the within-group effect size for antidepressants in a particular study, because these numbers do not take into account how the antidepressants did relative to a pill placebo condition. Presumably the pill placebo is chemically inert, but it is provided with the same attention from clinicians, positive expectations, and support that come with the antidepressant. Once these factors shared by both the antidepressant and pill placebo conditions are taken into account, the effect size for antidepressant decreases.

Take a look at weight gain by the end of the 12-month follow-up among patients receiving focal psychodynamic therapy. In Table 3, the within-group effect size for focal psychodynamic therapy is a whopping 1.6, p < .001. But the more appropriate between-group effect size for comparing focal psychodynamic therapy to treatment as usual, shown in Table 2, is a wimpy, nonsignificant .13, p < .48 (!)
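The contrast is easy to demonstrate with a minimal sketch using hypothetical numbers (not the ANTOP data): both arms improve substantially from baseline, but they barely differ from each other.

```python
# Hypothetical illustration (not the ANTOP data): within- vs. between-group effect sizes.

def d_within(mean_pre, mean_post, sd_baseline):
    """Change from baseline scaled by the baseline SD; this credits the treatment
    with every source of improvement, specific or not."""
    return (mean_post - mean_pre) / sd_baseline

def d_between(change_treatment, change_control, sd_pooled):
    """Difference between arms in their change scores, scaled by a pooled SD;
    this is the comparison that randomization actually licenses."""
    return (change_treatment - change_control) / sd_pooled

sd = 1.5                 # assumed SD of the outcome (e.g., BMI)
change_treatment = 1.6   # improvement in the treatment arm
change_control   = 1.4   # improvement in the control arm

print(d_within(0.0, change_treatment, sd))              # ~1.07, looks "whopping"
print(d_between(change_treatment, change_control, sd))  # ~0.13, actually "wimpy"
```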

An extraordinary “optimized” treatment as usual.

Descriptions in the preregistered study protocol, the press releases, and the methods section of the article do not do justice to the “optimized” treatment as usual. The methods section did not raise particular concern for me. It described patients assigned to treatment as usual as being provided with a list of psychotherapists specializing in the treatment of eating disorders, with their family physicians assuming an active role in monitoring and providing actual treatment. This does not sound particularly unusual for a comparison/control group. After all, it would be unethical to leave women with such a threatening, serious disorder on a waiting list just to allow a comparison.

But then I came across this shocker of a description of the optimized routine care condition in the discussion section:

Under close guidance from their family doctor—eg, regular weight monitoring and essential blood testing—and with close supervision of their respective study centre, patients allocated optimised treatment as usual were able to choose their favourite treatment approach and setting (intensity, inpatient, day patient, or outpatient treatment) and their therapist, in accordance with German national treatment guidelines for anorexia nervosa.11 Moreover, comparisons of applied dosage and intensity of treatment showed that all patients— irrespective of treatment allocation—averaged a similar number of outpatient sessions over the course of the treatment and follow-up periods (about 40 sessions). These data partly reflect an important achievement of the German health-care system: that access to psychotherapy treatment is covered by insurance. However, patients allocated optimised treatment as usual needed additional inpatient treatment more frequently (41%) than either those assigned focal psychodynamic therapy (23%) or enhanced cognitive behaviour therapy (35%).

OMG! I have never seen such intensive treatment-as-usual in a clinical trial. I doubt anything like this treatment would be available elsewhere in the world as standard care.

This description raises a number of disturbing questions about the trial:

Why would any German woman with anorexia enroll in the clinical trial? Although a desire to contribute to science is sometimes a factor, the main reasons patients enter clinical trials are that they think they will get better treatment, and maybe that they can get a preferred treatment which they cannot get elsewhere. But if this is the situation of routine care in Germany, why would eligible women not just remain in routine care without the complications of being in a clinical trial?

At one point, the authors claim that 1% of the population has a diagnosis of anorexia. That represents a lot of women. Yet they were able to randomize only 242 patients, despite a massive two-year effort to recruit patients involving 10 German departments of psychotherapy and psychosomatic medicine. It appears that only a very small minority of the available patients were recruited, raising questions about the representativeness of the sample.

Patients had little incentive to remain in the clinical trial rather than dropping out. Dropping out of the clinical trial would still give them access to free treatment – without the hassle of remaining in the trial.

In a more typical trial, patients assigned to treatment as usual are provided with a list of referrals. Often few bother to complete a referral or remain in treatment, and so we can assume that the treatment-as-usual condition usually represents minimal treatment, providing a suitable comparison with a positive outcome for the more active, free treatment. In the United States, patients enrolling in clinical trials often either do not have health insurance or can find only providers who will not accept what health insurance they have for the treatment they want. Patients in the United States enter a clinical trial just to get the possibility of treatment, very different circumstances than in Germany.

Overall, no matter what condition patients were assigned to, all received about the same amount of outpatient psychotherapy, about 40 sessions. How could these authors have expected to find a substantial difference between the two manualized treatments and this intensity of routine care? Differences between groups of the magnitude they assumed in calculating sample sizes would, under these conditions, be truly extraordinary.

A lot of attention and support is provided in 40 sessions of such psychotherapy, making it difficult to detect the specific effects of the manualized therapies above and beyond the attention and support they provide.

In short, the manualized treatments were doomed to null findings in comparison to treatment as usual. The only thing really unexpected about this trial is that all three conditions did so poorly.

What is a comparison/control group supposed to accomplish, anyway?

Investigators undertaking randomized controlled trials of psychotherapies know about the necessity of comparison/control groups, but they generally understand less well the implications of their choice of a comparison/control group.

Most evidence-based treatments earned their status by proving superior in a clinical trial to a control group such as a waitlist or no treatment at all. Such comparisons provide the backbone to claims of evidence-based treatments, but they are not particularly informative. It may simply be that many manualized, structured treatments are no better than other active treatments in which patients get a similar intensity of treatment, positive expectations, and attention and support.

Some investigators, however, are less interested in establishing the efficacy of treatments than in demonstrating the effectiveness of particular treatments over what is already being done in the community. Effectiveness studies typically find smaller effects than those obtained in straw-man comparisons between treatments and weak control groups.

But even if their intention is to conduct an effectiveness study, investigators need to better describe the nature of treatment as usual if they are to make reasonable generalizations to other clinical and health-system contexts.

We know that the optimized treatment as usual was exceptionally intensive, but we have no idea from the published article what it entailed, except lots of treatment, as much as was provided in the active treatment conditions. It may even be that some of the women assigned to optimized treatment as usual found therapists providing much the same treatment.

Again, if all of the conditions had done well in terms of improved patient outcomes, then we could have concluded that introducing manualized treatment does not accomplish much, in Germany at least. But my assessment is that none of the three conditions did particularly well.

The optimized treatment as usual is intensive but not evidence-based. In my last blog post, we saw a situation in which less treatment proved better than more. Maybe the availability of intensive and extensive treatment discourages women from taking responsibility for their health-threatening condition. They do not improve, simply because they can always get more treatment. That is simply a hypothesis, but Germany is spending lots of money assuming that it is incorrect.

Why Germany may not be the best place to do a clinical trial for treatment of anorexia.

Germany may not be an appropriate place to do a clinical trial of treatment for anorexia for a number of reasons:

  • The ready availability of free, intensive treatment prevents recruitment of a large, representative sample of women with anorexia to a potentially burdensome clinical trial.
  • There is less incentive for women to remain in the study once they are enrolled because they can always drop out and get the same intensity of treatment elsewhere.
  • The control/comparison group of “optimized” treatment as usual complied with the extensive requirements of the German national treatment guidelines for anorexia nervosa. But these standards are not evidence-based and appear to have produced mediocre outcomes in at least this trial.
  • Treatment as usual available to everyone is not necessarily effective, but it precludes detecting incremental improvements obtained by less intensive, but focused treatments.

Prasad and Ioannidis have recently called attention to the pervasiveness of non-evidence-based medical treatments and practice guidelines that are not cost-effective, do not ensure good patient outcomes, or do not avoid unnecessary risks. They propose de-implementing such unproven practices, but acknowledge the likelihood that cultural values, vested interests, and politics can interfere with efforts to subject established but unproven practices to empirical test.

Surely, that would be the case in any effort to de-implement guidelines for the treatment of anorexia in Germany.

The potentially life-threatening nature of anorexia may discourage any temporary suspension of treatment guidelines until evidence can be obtained. But we need only look to the example of similarly life-threatening cancers, where improved treatments came about only when investigators were able to suspend well-established but unproven treatments and conduct randomized trials.

It would be unethical to assign women with anorexia to a waitlist control or no treatment when free treatment is readily available in the community. So, there may be no option but to use treatment as usual as a control condition.

If so, a finding of no differences between groups is almost guaranteed. And given the poor performance of routine care observed in this study, such results would not represent the familiar Dodo Bird Verdict for comparisons between psychotherapies, in which all of the treatments are winners and all get prizes.

Why it may be premature to conduct randomized trials of treatment of anorexia.

This may well be, as the investigators proclaim in their press release, the largest ever RCT of treatment for anorexia. But it is very difficult to make sense of it, other than to conclude that no treatments, including treatment as usual, had particularly impressive results.

For me, this study highlights the enormous barriers to conducting a well-controlled RCT for anorexia with patients representative of the kinds that would seek treatment in real-world clinical contexts.

There are unsolved issues of patient dropout and retention for follow-up that seriously threaten the integrity of any results. We just do not know how to recruit a representative sample of patients with anorexia and keep them in therapy and around for follow-up.

Maybe we should ask women with anorexia about what they think. Maybe we could enlist some of them to assist in a design of a randomized trial or at least a treatment investigators could retain sufficient numbers of them to conduct a randomized trial

I am not sure how we would otherwise get this understanding without involving women with anorexia in the design of treatment in future clinical trials.

There are unsolved issues of medical surveillance and co-treatment confounding. Anorexia poses physical health problems, including the threats associated with sudden weight loss. But we do not have evidence-based protocols in place for standardizing surveillance and decision-making.

Before we undertake massive randomized trials such as ANTOP, we need to get information to set basic parameters from nonrandomized but nonetheless informative small-scale studies. Obviously, the investigators in this study could not even estimate effect sizes accurately in order to set sample sizes.

Well, presuming you have made it through this long read, what do you think?