Lessons we need to learn from a Lancet Psychiatry study of the association between exercise and mental health

The closer we look at a heavily promoted study of exercise and mental health, the more its flaws become obvious. There is little support for the most basic claims being made – despite the authors marshaling enormous attention to the study.


Apparently, the editor of Lancet Psychiatry and reviewers did not give the study a close look before it was accepted.

The article was used to raise funds for a startup company in which one of the authors was heavily invested. This was disclosed, but doesn’t let the authors off the hook for promoting a seriously flawed study. Nor should the editor of Lancet Psychiatry or reviewers escape criticism, nor the large number of people on Twitter who thoughtlessly retweeted and “liked” a series of tweets from the last author of the study.

This blog post is intended to raise consciousness about bad science appearing in prestigious journals and to allow citizen scientists to evaluate their own critical thinking skills in terms of their ability to detect misleading and exaggerated claims.

1. Sometimes a disclosure of extensive conflicts of interest alerts us not to pay serious attention to a study. Instead, we should question why the study got published in a prestigious peer-reviewed journal when it had such an obvious risk of bias.

2. We need citizen scientists with critical thinking skills to identify such promotional efforts and alert others in their social network that hype and hokum are being delivered.

3. We need to stand up to authors who use scientific papers for commercial purposes, especially when they troll critics.

Read on and you will see what a skeptical look at the paper and its promotion revealed.

  • The study failed to capitalize on the potential of multiple years of data for developing statistical models in one wave and validating them in another. Bigger is not necessarily better. Combining multiple years of data was wasteful and served only to provide the authors bragging rights and the impressive but meaningless p-values that come from overly large samples.
  • The study relied on an unvalidated and inadequate measure of mental health that confounded recurring stressful environmental conditions in the work or home with mental health problems, even where validated measures of mental health would reveal no effects.
  • The study used an odd measure of history of mental health problems that undoubtedly exaggerated past history.
  • The study confused physical activity with (planned) exercise. The authors amplified their confusion by relying on an exceedingly odd strategy for getting an estimate of how much participants exercised: estimates of time spent in a single activity were used in analyses of total time spent exercising. All other physical activity was ignored.
  • The study made a passing acknowledgment of the problems interpreting simple associations as causal, but then went on to selectively sample the existing literature to make the case that interventions to increase exercise improve mental health.
  • Taken together, a skeptical assessment of this article provides another demonstration that disclosure of substantial financial conflicts of interest should alert readers to a high likelihood of a hyped, inaccurately reported study.
  • The article was paywalled, so anyone interested in evaluating the authors’ claims for themselves had to write to the author or have access to the article through a university library site. I am waiting for the authors to reply to my requests for the supplementary tables that are needed to make full sense of their claims. In the meantime, I’ll just complain about authors with significant conflicts of interest heavily promoting studies that they hide behind paywalls.

I welcome you to examine the author’s thread of tweets. Request the actual article from the author if you want to evaluate my claims independently. This could be great material for a masters or honors class on critical appraisal, whether in psychology or journalism.

[Image: title of the article]

Let me know if you think that I’ve been too hard on this study.

A thread of tweets from the last author celebrated the success of a well-orchestrated publicity campaign for a new article on exercise and mental health in Lancet Psychiatry.

The thread started:

Our new @TheLancetPsych paper was the biggest ever study of exercise and mental health. it caused quite a stir! here’s my guided tour of the paper, highlighting some of our excitements and apprehensions along the way [thread] 1/n

And ended with a pitch for the author’s do-good startup company:

Where do we go from here? Over @spring_health – our mental health startup in New York City – we’re using these findings to develop personalized exercise plans. We want to help every individual feel better—faster, and understand exactly what each patient needs the most.

I wasn’t long into the thread before my skepticism was stimulated. The fourth tweet in the thread had a figure, but no comment on how bizarre that figure was.

The tweet:

It looks like those differences mattered. for example, people who exercised for about 45 minutes seemed to have better mental health than people who exercised for less than 30, or more than 60 minutes. — a sweet spot for mental health, perhaps?

[Image: graphs from the paper]

Apparently the author did not comment on an anomaly either. Housework appears to be better for mental health than a summary score of all exercise, and looks equal to or better than cycling or jogging. But how did housework slip into the category “exercise”?

I began wondering what the authors meant by “exercise” and whether they had given the definition serious consideration when constructing their key variable from the survey data.

But then that tweet was followed by another that generated more confusion, with a graph that seemingly contradicted the figures in the previous one:

the type of exercise people did seems important too! People doing team sports or cycling had much better mental health than other sports. But even just walking or doing household chores was better than nothing!

Then a self-congratulatory tweet for a promotional job well done.

for sure — these findings are exciting, and it has been overwhelming to see the whole world talking openly and optimistically about mental health, and how we can help people feel better. It isn’t all plain sailing though…

The author’s next tweet revealed, in a screenshot, a serious limitation of the measure of mental health used in the study.

[Screenshot: tweet showing the mental health survey item]

The author acknowledged the potential problem, sort of:

(1b- this might not be the end of the world. In general, most peple have a reasonable understanding of their feelings, and in depressed or anxious patients self-report evaluations are highly correlated with clinician-rated evaluations. But we could be more precise in the future)

“Not the end of the world?” Since when does the author of a paper in the Lancet family of journals so casually brush off a serious methodological issue? A lot of us who have examined the validity of mental health measures would be skeptical of this dismissal of a potentially fatal limitation.

No validation is provided for this measure. On the face of it, respondents could endorse it on the basis of facing recurring stressful situations that had no consequences for their mental health. This reflects the ambiguity of the term “stress” for both laypersons and scientists: “stress” can variously refer to an environmental situation, a subjective experience of stress, or an adaptational outcome. Waitstaff could consider Thursdays, when the chef is off, a recurrent weekly stress. Persons with diagnosable persistent depressive disorder would presumably endorse more days than not as being a mental health challenge. But they would mean something entirely different.

The author acknowledged that the association between exercise and mental health might be bidirectional in terms of causality.

[Image: tweet on reasons to believe the relationship goes both ways]

But then made a strong claim for increased exercise leading to better mental health.

[Image: tweet claiming exercise increases mental health]

[Actually, as we will see, the evidence from randomized trials of exercise to improve mental health is modest, and it entirely disappears once one limits oneself to the high-quality studies.]

The author then runs off the rails with the claim that the benefits of exercise exceed the benefits of having a greater-than-poverty-level income.

[Image: “why are we so excited” tweet]

I could not resist responding.

Stop comparing adjusted correlations obtained under different circumstances as if they demonstrated what would be obtained in RCT. Don’t claim exercising would have more effect than poor people getting more money.

But I didn’t get a reply from the author.

Eventually, the author got around to plugging his startup company.

I didn’t get it. Just how did this heavily promoted study advance the science of such “personalized” recommendations?

Important things I learned from others’ tweets about the study

I follow @BrendonStubbs on Twitter and you should too. Brendon often makes wise critical observations of studies that most everyone else is uncritically praising. But he also identifies some studies that I otherwise would miss and says very positive things about them.

He started his own thread of tweets about the study on a positive note, but then he identified a couple of critical issues.

First, he took issue with the author’s weak claim to have identified a tipping point, below which exercise is beneficial and above which exercise could prove detrimental to mental health.

4/some interpretations are troublesome. Most confusing, are the assumptions that higher PA is associated/worsens your MH. Would we say based on cross sect data that those taking most medication/using CBT most were making their MH worse?

A postdoctoral fellow @joefirth7  seconded that concern:

I agree @BrendonStubbs: idea of high PA worsening mental health limited to observation studies. Except in rare cases of athletes overtraining, there’s no exp evidence of ‘tipping point’ effect. Cross-sect assocs of poor MH <–> higher PA likely due to multiple other factors…

Ouch! But then Brendon follows up with concerns that the measure of physical activity has not been adequately validated, noting that such self-report measures often prove to be invalid.

5/ one consideration not well discussed, is self report measures of PA are hopeless (particularly in ppl w mental illness). Even those designed for population level monitoring of PA https://journals.humankinetics.com/doi/abs/10.1123/jpah.6.s1.s5 … it is also not clear if this self report PA measure has been validated?

As we will soon see, the measure used in this study is quite flawed in its conceptualization and in its odd methodology of requiring participants to estimate the time spent exercising for only one activity, chosen from 75 options.

Next, Brendon points to a particular problem using self-reported physical activity in persons with mental disorder and gives an apt reference:

6/ related to this, self report measures of PA shown to massively overestimate PA in people with mental ill health/illness – so findings of greater PA linked with mental illness likely bi-product of over-reporting of PA in people with mental illness e.g Validity and Value of Self-reported Physical Activity and Accelerometry in People With Schizophrenia: A Population-Scale Study of the UK Biobank [ https://academic.oup.com/schizophreniabulletin/advance-article/doi/10.1093/schbul/sbx149/4563831 ]

7/ An additional point he makes: anyone working in field of PA will immediately realise there is confusion & misinterpretation about the concepts of exercise & PA in the paper, which is distracting. People have been trying to prevent this happening over 30 years

Again, Brendon provides a spot-on citation clarifying the distinction between physical activity and exercise: Physical activity, exercise, and physical fitness: definitions and distinctions for health-related research

The mysterious pseudonymous Zad Chow @dailyzad called attention to a blog post they had just uploaded. Let’s take a look at some of its key points.

Lessons from a blog post: Exercise, Mental Health, and Big Data

Zad Chow is quite balanced in dispensing praise and criticism of the Lancet Psychiatry paper. They noted the ambiguity of any causal claim from a cross-sectional correlation and then investigated the literature on their own.

So what does that evidence say? Meta-analyses of randomized trials seem to find that exercise has large and positive treatment effects on mental health outcomes such as depression.

Study (number of randomized trials): effect size (SMD) with 95% confidence interval

Schuch et al. 2016 (25 trials): 1.11 (0.79 to 1.43)

Gordon et al. 2018 (33 trials): 0.66 (0.48 to 0.83)

Krogh et al. 2017 (35 trials): −0.66 (−0.86 to −0.46)

But, when you only pool high-quality studies, the effects become tiny.

“Restricting this analysis to the four trials that seemed less affected of bias, the effect vanished into −0.11 SMD (−0.41 to 0.18; p=0.45; GRADE: low quality).” – Krogh et al. 2017
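As a quick sanity check on what “the effect vanished” means, here is my own back-of-the-envelope calculation (a sketch under a normal approximation, using only the SMDs and 95% confidence intervals quoted above, nothing else from the papers):

```python
# Back-of-envelope check of the quoted effect sizes (my own arithmetic under a
# normal approximation; the SMDs and 95% CIs are those quoted above).
from scipy.stats import norm

def p_from_ci(smd, lower, upper):
    se = (upper - lower) / (2 * 1.96)   # recover the standard error from the 95% CI
    z = smd / se
    return 2 * norm.sf(abs(z))          # two-sided p-value

print(p_from_ci(1.11, 0.79, 1.43))      # Schuch et al. 2016: p far below 0.001
print(p_from_ci(-0.11, -0.41, 0.18))    # Krogh et al. 2017, low risk-of-bias trials: p of about 0.46
```

An effect of −0.11 whose confidence interval straddles zero is, for practical purposes, no demonstrated effect at all.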

Hmm, would you have guessed this from the Lancet Psychiatry author’s thread of tweets?

Zad Chow showed the hype and untrustworthiness of the press coverage in prestigious media with a sampling of screenshots.

[Images: Zad Chow’s screenshots of press coverage]

I personally checked and don’t see that Zad Chow’s selection of press coverage was skewed. Coverage in the media all seemed to be saying the same thing. I found the distortion continued with uncritical parroting, a.k.a. churnalism, of the Lancet Psychiatry authors’ claims in the Wall Street Journal.

The WSJ repeated a number of the author’s claims that I’ve already thrown into question and added a curiosity:

In a secondary analysis, the researchers found that yoga and tai chi—grouped into a category called recreational sports in the original analysis—had a 22.9% reduction in poor mental-health days. (Recreational sports included everything from yoga to golf to horseback riding.)

And NHS England totally got it wrong:

[Image: NHS coverage getting it wrong]

So, we learned that the broad category “recreational sports” covers yoga and tai chi, as well as golf and horseback riding. This raises serious questions about the lumping and splitting of categories of physical activity in the analyses that are being reported.

I needed to access the article in order to uncover some important things 

I’m grateful for the clues I got from Twitter, especially from Zad Chow, which I used in examining the article itself.

I got hung up on the title proclaiming that the study involved 1·2 million individuals. When I checked the article, I saw that the authors combined three waves of publicly available data to get that number. Having that many participants gave them no real advantage except for bragging rights and the likelihood that modest associations could be expressed in spectacular p-values, like p < 2·2 × 10⁻¹⁶. I don’t understand why the authors didn’t conduct analyses in one wave and cross-validate the results in another.
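To illustrate why sample size alone can manufacture such p-values, here is a minimal simulation of my own (random numbers, not the authors’ data or analysis): with 1·2 million observations, even a negligible association comes out far below the p < 2·2 × 10⁻¹⁶ threshold.

```python
# Minimal simulation (my own random numbers, not the authors' data): with
# n = 1.2 million, a negligible association still yields an astronomically small p-value.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 1_200_000
x = rng.normal(size=n)
y = 0.01 * x + rng.normal(size=n)   # true correlation of roughly 0.01

r, p = pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.1e}")  # r is trivial, yet p falls far below 2.2e-16
```

The p-value certifies only that the association is not exactly zero; it says nothing about whether the association is large enough to matter.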

The obligatory Research in Context box made it sound as though a systematic search of the literature had been undertaken. Maybe, but the authors were highly selective in what they chose to comment upon, as can be seen from the contradiction with Zad Chow’s brief review. The authors would have us believe that the existing literature is quite limited and inconclusive, supporting the need for a study like theirs.

[Image: the article’s Research in Context box]

Caveat lector: a strong confirmation bias likely lies ahead in this article.

Questions accumulated quickly as to the appropriateness of the items available from a national survey undoubtedly constructed for other purposes. Certainly these items would not have been selected if the original investigators had been interested in the research question at the center of this article.

Participants self-reported a previous diagnosis of depression or depressive episode on the basis of the following question: “Has a doctor, nurse, or other health professional EVER told you that you have a depressive disorder, including depression, major depression, dysthymia, or minor depression?”

Our own work has cast serious doubt on the correspondence between reports of a history of depression in response to a brief question embedded in a larger survey and the results of a structured interview in which respondents’ answers can be probed. We found that answers to such questions were more related to current distress than to actual past diagnoses and treatment of depression. However, the survey question used in the Lancet Psychiatry study added further ambiguity and invalidity with the phrase “or minor depression.” I am not sure under what circumstances a health care professional would disclose a diagnosis of “minor depression” to a patient, but I doubt it would be in a context in which the professional felt treatment was needed.

Despite the skepticism that I was developing about the usefulness of the survey data, I was unprepared for the assessment of “exercise.”

“Other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise?” Participants who answered yes to this question were then asked: “What type of physical activity or exercise did you spend the most time doing during the past month?” A total of 75 types of exercise were represented in the sample, which were grouped manually into eight exercise categories to balance a diverse representation of exercises with the need for meaningful cell sizes (appendix).

Participants indicated the number of times per week or month that they did this exercise and the number of minutes or hours that they usually spend exercising in this way each time.

I had already been tipped off by the discussion on Twitter that there would be a thorough confusion of planned exercise with mere physical activity. But now that was compounded. Why was physical activity during employment excluded? What if participants were engaged in a number of different physical activities, like both jogging and bicycling? If so, the survey obtained data for only one of these activities, with the other excluded, and the choice of which one the participant identified as the one to be counted could have been quite arbitrary.

Anyone who has ever constructed surveys would be alert to the problems posed by participants’ awareness that saying “yes” to exercising would require contemplating 75 different options and arbitrarily choosing one of them for a further question about how much time they spent on that activity. Unless participants were strongly motivated, there was an incentive to simply say no, they didn’t exercise.

I suppose I could go on, but it was my judgment that any validity to what the authors were claiming had been ruled out. As someone once said on an NIH grant review panel: there are no vital signs left, let’s move on to the next item.

But let’s refocus just a bit on the overall intention of these authors. They want to use a large data set to make statements about the association between physical activity and a measure of mental health. They have used matching and statistical controls to equate participants. But that strategy effectively eliminates consideration of crucial contextual variables. Persons’ preferences and opportunities to exercise are powerfully shaped by their personal and social circumstances, including finances and competing demands on their time. Said differently, people are embedded in contexts that a lot of statistical maneuvering has sought to eliminate.

To suggest a small number of the many complexities: how much physical activity participants get in their employment may be an important determinant of their choices for additional activity, as well as of how much time is left outside of work. If work typically involves a lot of physical exertion, people may simply be left too tired for additional planned physical activity, a.k.a. exercise, and their physical health may require it less. Environments differ greatly in terms of the opportunities for, and the safety of, engaging in various kinds of physical activities. Team sports require other people being available. Etc., etc.

What I learned from the editorial accompanying the Lancet Psychiatry article

The brief editorial accompanying the article aroused my curiosity as to whether someone assigned to read and comment on the article would catch things that the editor and reviewers apparently missed.

Editorial commentators are chosen to praise, not to bury articles. There are strong social pressures to say nice things. However, this editorial leaked a number of serious concerns.

First:

In presenting mental health as a workable, unified concept, there is a presupposition that it is possible and appropriate to combine all the various mental disorders as a single entity in pursuing this research. It is difficult to see the justification for this approach when these conditions differ greatly in their underlying causes, clinical presentation, and treatment. Dementia, substance misuse, and personality disorder, for example, are considered as distinct entities for research and clinical purposes; capturing them for study under the combined banner of mental health might not add a great deal to our understanding.

The problem here of categorisation is somewhat compounded by the repeated uncomfortable interchangeability between mental health and depression, as if these concepts were functionally equivalent, or as if other mental disorders were somewhat peripheral.

Then:

A final caution pertains to how studies approach a definition of exercise. In the current study, we see the inclusion of activities such as childcare, housework, lawn-mowing, carpentry, fishing, and yoga as forms of exercise. In other studies, these activities would be excluded for not fulfilling the definition of exercise as offered by the American College of Sports Medicine: “planned, structured and repetitive bodily movement done to improve or maintain one or more components of physical fitness.” 11 The study by Chekroud and colleagues, in its all-encompassing approach, might more accurately be considered a study in physical activity rather than exercise.

The authors were listening for a theme song with which they could promote their startup company in a very noisy data set. They thought they had a hit. I think they had noise.

The authors’ extraordinary disclosure of interests (see the end of this blog post) should have precluded publication of this seriously flawed piece of work, either simply by reason of the high likelihood of bias or by prompting the editor and reviewers to look more carefully at the serious flaws hiding in plain sight.

Postscript: Send in the trolls.

On Twitter, Adam Chekroud announced he felt no need to respond to critics. Instead, he retweeted and “liked” trolling comments directed at critics from the Twitter accounts of his brother, his mother, and even the official Twitter account of a local fried chicken joint, @chickenlodge, which offered free food for retweets and suggested including Adam Chekroud’s Twitter handle if you wanted to be noticed.

[Screenshot: tweet from the Chicken Lodge]

Really, Adam, if you can’t stand the heat, don’t go near where they are frying chicken.

The Declaration of Interests from the article.

[Images: the article’s Declaration of Interests]

 

Why a Lancet Psychiatry study didn’t show that locked inpatient wards are ineffective in reducing suicide

  • A well-orchestrated publicity campaign for a Lancet Psychiatry article promoted the view that locked inpatient wards are ineffective in reducing suicide.
  • This interpretation is not supported by data in the actual paper, but plays to some entrenched political stances and prejudices.
  • Hype and distortions in conventional and social media about this article are traceable directly to quotes from the authors in press releases from Lancet and from their university.
  • Mental Elf  posted a blog the day the embargo on reporting this study was lifted. The blog post and an associated Twitter campaign generated lots of social media attention. Yet, there is no indication that the blogger went beyond what was in press releases or compared the press releases to what was in the actual article.
  • Not many of the re-tweets and “likes” were likely from people who had read the original research.
  • The publicity orchestrated for this study raises issues about the ethics of promoting clinical and public policy with claims of being evidence-based when the audience does not have the ability to evaluate independently the claims by actually reading the peer-reviewed article.
[Image: King of Hearts movie poster]
As seen in the popularity of this movie, many of us had romanticized views of emancipating psychiatric inpatients in the 60s – 70s. De-institutionalization, with the neglect of huge numbers of homeless persons with psychosis, was the unanticipated result.

I obtained the article from interlibrary loan and the supplementary material from the authors. I appreciate the authors’ immediate responsiveness to my request.

[I delayed this blog post for a week because of indications that the article would be released from behind the pay wall, but apparently it has not been freed.]

In this blog post I identify important contradictions between the authors’ claims in the article and what they promoted in the media. The contradictions are obvious enough that someone other than the authors – the Lancet Psychiatry editor and reviewers – should have immediately caught them.

Spoiler: Claims supposedly based on sophisticated multivariate techniques applied to data from hundreds of thousands of patients were actually based on a paltry 75 completed suicides. These were a subsample of at least 174 that occurred in 21 hospital settings over the course of 15 years. Throwing away a chunk of the data and applying multivariate analyses to such a small, arbitrarily chosen subsample is grossly inappropriate. Any interpretations are likely to be invalid and unreliable.

No one else seems to be commenting on these key features of the study, nor on the other serious problems I uncovered when I actually examined the paper and supplements. Join me in the discovery process and see if you agree with me. Please let me know if you don’t agree with my assessment.

The promotion of the study can be seen as a matter of ideologically-driven mistreatment of data with the intention of promoting clinical and public policies that put severely disturbed persons at risk for suicide.

Regardless of where one stands as to whether severely disturbed persons should be prevented from hurting or killing themselves, this attempted manipulation of public policy should be viewed as objectionable.

In presenting what may be controversial points, I’ll start with editorials that were easily accessible. I’ll then delve into the paywalled article itself.

The press release from the authors’ University of Basel

This press release, Psychiatry on closed and open wards: The suicide risk remains the same, provided limited details of the study but misrepresented the study’s finding of risk for suicide as being based on 350,000 patients.

The study’s last author declared his agenda in promoting the study:

Focus on ethical standards

“Our results are important for the destigmatization, participation and emancipation of patients, as well as for psychiatric care in general,” comments last author Undine Lang, Director of the Adult Psychiatric Clinic at UPK Basel. The results will also have an influence on legal issues that arise when clinics adopt an open door policy. In future, treatment should focus more on ethical standards that ensure patients retain their autonomy as far as possible, says Undine Lang. Efforts should also be made to strengthen the therapeutic relationship and joint decision-making with patients.

The press release from The Lancet

Distributed while the article was still embargoed, Locking doors in mental health hospitals does not lower suicide rate provided more details of the study, but also more editorializing grounded in direct quotes from the authors:

Locking the doors of mental health hospitals does not reduce the risk of suicide or of patients leaving without permission, according to a study published in The Lancet Psychiatry.

Authorities around the world are increasingly using locked-door policies to keep patients safe from harm, but locked doors also restrict personal freedom.

European countries tend to follow traditional approaches in caring for patients in psychiatric care, because there has been little evidence so far that one method is better than another.

Similar outcomes whether doors are open or locked.

Of 349,574 patients, they selected 72,869 cases from each hospital type, or 145,738 cases altogether. Creating matched pairs enabled a direct comparison between hospitals.

Translation: to prepare the data for the statistical analyses the authors had planned, they threw away 203,836 cases, or 58.3% of the available cases.

And they concluded:

Findings revealed similar rates of suicide and attempted suicide, regardless of whether a hospital had a locked door policy or not. Furthermore, hospitals with an open door policy did not have higher rates of absconding, either with or without return. Patients who left an open door hospital without permission were more likely to return than those from a closed facility.

The press release next raised a dramatic question. But could these data answer it?

Do locked doors unnecessarily create a sense of oppression?

Given the similarity of outcomes between the two types of hospital, the researchers propose that an open door policy might be preferable.

“These findings suggest that locked door policies may not help to improve the safety of patients in psychiatric hospitals, and are not generally successful in preventing people from absconding. In fact, a locked door policy probably imposes a more oppressive atmosphere, which could reduce the effectiveness of treatments, resulting in longer stays in hospital. The practice may even lend motivation for patients to abscond.” -Dr. Christian Huber, of the Universitäre Psychiatrische Kliniken Basel, Switzerland

Of course, the study did not assess anything like a “sense of oppression” and so cannot answer this question. As we will see when I discuss what I found in the actual paper, Dr. Huber’s characterization of his findings is untrue. Patients on locked wards did not actually have longer stays.

Since each hospital serves a specific location, there was no chance of higher-risk patients being allocated to hospitals with locked wards. This reduced the risk of bias.

This is also not true. An unknown proportion of the hospitals, probably most, had both locked and unlocked wards. There could easily have been strong selection bias in which patients were referred to a locked ward. We are not told whether patients could be referred into other catchment areas, but this information would be useful in interpreting the authors’ claims.

The authors warn that an open door strategy might not be appropriate everywhere, as mental health care provision differs in other ways, too, for example, how many beds are available, the percentage of acutely ill patients, and how long they are treated for.

Germany has around 1.1 psychiatric care beds for every 1,000 people, compared with 0.5 beds per 1,000 in the United Kingdom and 0.3 in the United States. Where there are fewer beds, patients who receive treatment are more likely to be severely ill and more at risk.

So, Germany has more than three times the psychiatric beds per 1,000 people of the USA and more than twice the availability of beds in the UK. We can learn from other sources:

Germany is one of the countries with highest expenditure for mental health care in the world. However, in contrast to other western European countries, psychiatric treatment in Germany is still mainly provided by psychiatric hospitals, outpatient clinics and office based psychiatrists and only rarely by community mental health teams. As mental health policy, except the provision of pharmaceutical treatment, is the responsibility of the federal states, no national mental health plan exists. Therefore, community mental health care systems vary widely with regard to conceptual, organisational and economic conditions across the country. Moreover, the fact that different components of community mental health care are funded by different payers (and on different legal bases) hampers coordination and integration of services.

Studies largely conducted in other countries, with organizations of care different from Germany’s, have consistently concluded that Assertive Community Treatment (ACT) programs are effective in reducing the need for inpatient treatment.

In order to keep the level of psychiatric inpatient treatment and institutional care as low as possible these services should be provided by multi-professional community mental health teams organized according to the principles of Assertive Community Treatment (ACT).

ACT programs keep persons with psychosis from being placed in psychiatric inpatient units like those studied in the Lancet Psychiatry article, and they lead to shorter hospital stays.

The Lancet Psychiatry article makes no mention of ACT in Germany. My inference is that implementation was not widespread during the study. If there are ACT programs in Germany, their influence on this data set is through an invisible hand.

Inpatient psychiatric beds are quite scarce in the US, even for patients and families willing to pay out of pocket. To deal with demand that is not met by psychiatric facilities, the Los Angeles jail has become the largest locked facility. Whether or not it was the intention of the Lancet Psychiatry article, the ideology with which it is infused has served to make inpatient beds less available in the United States, with greater reliance on jails instead of less restrictive, more supportive settings for protecting persons with psychosis.

[Images: news headlines, “as Alabama cuts” and “Alabama sheriff”]
Just one of numerous results of a movement of resources away from mental health services for the severely impaired and vulnerable.

Inpatient hospitalizations in the United States are much shorter than in Germany. In some states, the mean length of stay is five days. Hospitalization has a different goal in the US: only stabilization of the patient’s condition.

The means of killing oneself are also different between the US and Germany. Firearms are much more readily available in the US than in Germany, suggesting different means-restriction strategies for reducing suicide.

So, I cannot see the generalizability of the findings from the Lancet Psychiatry study to the US – or the UK, for that matter. Can you?

The Mental Elf: Locked wards vs open wards: does control = safety?

The Mental Elf advertises itself as offering “no bias, no misinformation, just what you need.” Its coverage of the study appeared the same day the embargo was lifted and uncritically echoed what was in the press releases, adding some emotional and ideologically driven amplification.

The reason usually given for wards being locked is that the people within them need to be kept safe; safe from harming themselves and safe from committing harm to others. Of course these are very real fears, but they are often wrongly magnified by a still sadly stigmatising media and public perception of severe mental illness.

There is certainly an uneasy tension between the Mental Health Act Code of Practice and the reality of locking up severely ill mental health patients, which is brought into sharp focus when we consider the lack of evidence for locked wards. The literature is primarily made up of expert opinion that insists safety is paramount, but fails to provide any compelling evidence that locking people up actually increases safety.

Let’s examine Mental Elf’s claim of the lack of “any compelling evidence that locking people up actually increases safety.” Presumably, he is referring to the lack of RCTs.

[Image: “another without a parachute” cartoon]

I have been a scientific advisor to experimental studies like the US PROSPECT study and to quasi-experimental European studies attempting to test whether suicidality could be reduced. Any such studies suffer from the serious practical limitation that suicide is an infrequent event. But to say there is no compelling evidence for restricting opportunities for acutely suicidal persons to hurt themselves is akin to BMJ’s spoof systematic review finding no evidence from RCTs that parachutes reduce deaths when jumping out of planes.

Neither RCTs nor the propensity analyses of administrative data that Mental Elf favors can produce “compelling data.”  As I will soon show, this study displays the pitfalls of propensity analyses.

We can systematically examine the contextual circumstances of particular deaths by suicide when they do occur, and make suggestions as to whether some sort of means restriction, including access to a locked inpatient unit, would have made a difference. We can also hold professionals in a decision-making capacity legally responsible when they fail to avail themselves of such facilities, and we should.

The Mental Elf wrapped up on a rousing, uncritical, and ultimately nonsensical note:

This is a novel and compelling study, conducted in Germany, but very relevant to any Western country that has a secure system for mentally ill inpatients.

Our obsession with security and safety in an ever more dangerous world is justified if you watch the TV news channels for any prolonged period of time. The world is after all full of war, terrorism, violent crime, child abuse; or so we’re led to believe.

I spent a very enjoyable day at City University last week, participating in the #COCAPPimpact discussions, which included some rich and very constructive conversations about therapeutic relationships. It doesn’t take much to appreciate that relationships (therapeutic or otherwise) are stronger and more equitable on open wards.

The Mental Elf website claims (8/5/2016) 215 responses to this post. All but a very few were approving tweets that did not depend on the tweeter having read the study.

The reference to TV news channels is at the level of evidence of a Donald Trump tweet in which he refers to something he saw on TV.

Taking a look at the actual article and its supplementary information.

Christian G. Huber, Andres R. Schneeberger, Eva Kowalinski, Daniela Fröhlich, Stefanie von Felten, Marc Walter, Martin Zinkler, Karl Beine, Andreas Heinz, Stefan Borgwardt, and Undine E. Lang. Suicide Risk and Absconding in Psychiatric Hospitals with and without Open Door Policies: A 15-year Naturalistic Observational Study. The Lancet Psychiatry, 2016 DOI: 10.1016/S2215-0366(16)30168-7

At the time of the media campaign, most people who wanted to access the article could only obtain its abstract, which you can click here.

Why were there only 75 suicides being explained?

Much ado is being made of 75 suicides that occurred over a 15-year period across 21 hospitals. Suicides are an infrequent event, even in high-risk populations. But why were only 75 available for analysis from a sample that initially consisted of 350,000 admissions over this amount of time?

Let’s start with the 350,000 admissions that are misrepresented as “cases” in the official press releases. The article states:

The resulting dataset contained 349 574 hospital admissions from 177 295 patients.

[Image: models for locked wards]

Presumably, a considerable proportion of these patients had multiple admissions over the 15 years. Suicides were probably concentrated in the group with multiple admissions. But some patients had only one admission. Moreover, some patients may have been admitted to different types of facilities – locked versus unlocked – on different occasions. Confusion is being generated, bias is being introduced, and valuable information is being lost about the non-independence of observations – i.e., admissions.

How many suicides occurred among these 349 574 hospital admissions? Readers cannot tell from the article. Table 4 states that multivariate analyses were based on predicting 79 suicides. Yet, going to the supplementary materials, Table S1 indicates that when the analyses were done without the matching requirements imposed by propensity analyses, there were 174 suicides to explain. The authors aren’t particularly clear, but it appears that in order to meet the requirements of their propensity analysis, they threw away data on most of the suicides.

The exaggerated power of propensity analyses

The authors extol the virtues of propensity analyses:

We used propensity score matching and generalised linear mixed-effects models to achieve the strongest causal inference possible without an experimental design. Since patients were not randomly allocated to the different hospital types, causal inference between hospital type and outcomes might be biased—potential confounders could affect both the probability of relevant outcomes and the probability of a case having been admitted to a specific hospital type. The propensity score of patients reflects their probability of having been admitted to a hospital with an open-door policy rather than one with a locked-door policy.15 By matching cases from both hospital types based on their propensity score, datasets with similar distributions of confounders can be generated. These allow stronger causal inference when analysed.15

A full discussion of propensity analyses is beyond the scope of this blog post. I worry that I would lose a lot of readers here if I attempted one. But here is a very readable, accessible source:

Glynn RJ, Schneeweiss S, Stürmer T. Indications for propensity scores and review of their use in pharmacoepidemiology. Basic & Clinical Pharmacology & Toxicology. 2006 Mar 1;98(3):253-9.

It states:

It remains unclear whether, and if so when, use of propensity scores provides estimates of drug effects that are less biased than those obtained from conventional multivariate models. In the great majority of published studies that have used both approaches, estimated effects from propensity score and regression methods have been similar.

And

Use of propensity scores will not correct biases from unmeasured confounders, but can aid in understanding determinants of drug use and lead to improved estimates of drug effects in some settings.

One problem with applying analysis of propensity scores to the data set used in the Lancet Psychiatry article is that there was a great deal of difficulty matching the admissions to different settings. Moreover, because it was an administrative data set, there were numerous unmeasured but particularly crucial confounds that could not be included in the propensity matching or in the generalised linear mixed-effects model analyses thereafter. So, in using propensity analysis, the authors threw away most of their data without being able to achieve adequate statistical control for confounds.
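For readers unfamiliar with the mechanics, here is a bare-bones sketch of 1:1 propensity score matching. It is a generic illustration with made-up column names, not the authors’ code, but it shows where the discarded cases go: any admission that cannot be matched within the caliper is simply dropped from the analysis.

```python
# Generic sketch of 1:1 propensity score matching (illustration only; the column
# names are made up and assumed to be numerically coded; not the authors' code).
from sklearn.linear_model import LogisticRegression

def match_on_propensity(df, treatment="open_door",
                        covariates=("age", "prior_suicide_attempt", "voluntary_admission"),
                        caliper=0.05):
    # df is expected to be a pandas DataFrame with one row per admission.
    covariates = list(covariates)

    # 1. Model each admission's probability of being to an open-door hospital.
    model = LogisticRegression(max_iter=1000).fit(df[covariates], df[treatment])
    df = df.assign(pscore=model.predict_proba(df[covariates])[:, 1])

    treated = df[df[treatment] == 1]
    available = df[df[treatment] == 0].copy()

    # 2. Greedy nearest-neighbour matching without replacement, within a caliper.
    pairs = []
    for idx, row in treated.iterrows():
        if available.empty:
            break
        diffs = (available["pscore"] - row["pscore"]).abs()
        if diffs.min() <= caliper:
            match_idx = diffs.idxmin()
            pairs.append((idx, match_idx))
            available = available.drop(match_idx)

    # 3. Everything that found no match is discarded from the analysis,
    #    which is how 349,574 admissions can shrink to 145,738 matched cases.
    return pairs
```

Matching balances only the covariates that go into the model; anything unmeasured, such as the clinical judgment that sent a patient to a locked ward in the first place, is left untouched.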

We calculated propensity scores for all cases based on a model that included all clinical characteristics before admission as exploratory variables (age, sex, marital status, housing situation, living together with others, employment situation, main diagnosis, comorbid substance use disorder, comorbid personality disorder, comorbid mental retardation, self-injuring behaviour before admission, suicidal ideation before admission, suicide attempt before admission, type of admission, and voluntary admission). These calculations were done on a complete case basis, therefore 36 300 (10·4%) cases with missing covariate were excluded.

There is the temptation to ask “what is the harm in adjustments that involve the loss of only 10.4% of cases, particularly if better statistical control is achieved?” Well,

Overall, 72,869 pairs of matched cases could be created, resulting in a total matched set consisting of 145,738 cases from 87,640 individual patients for the analyses themselves.

So, the authors have lost a nonrandom selection of more than half the admissions with which they started, and their shrunken data set obscures the non-independence of observations. Just look at the ratio of 145,738 “cases” to the 87,640 individual patients from which they came. A lot of valuable data about the fate of individual patients when hospitalized in different settings is being suppressed.
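As a rough illustration of what that ratio implies, here is my own back-of-envelope design-effect calculation; the intra-class correlations are hypothetical, and only the case and patient counts come from the paper.

```python
# Back-of-envelope design-effect calculation (my own illustration; the intra-class
# correlations are hypothetical, only the case and patient counts come from the paper).
cases = 145_738
patients = 87_640
admissions_per_patient = cases / patients       # about 1.66 admissions per patient

for icc in (0.1, 0.3, 0.5):                     # hypothetical within-patient correlations
    design_effect = 1 + (admissions_per_patient - 1) * icc   # Kish approximation
    print(f"ICC = {icc}: effective sample size roughly {cases / design_effect:,.0f}")
```

However you set the within-patient correlation, 145,738 admissions are worth considerably fewer than 145,738 independent observations, and the rare outcome, suicide, becomes rarer still.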

How complete is the data available for matching and control of statistical confounds?

We calculated propensity scores for all cases based on a model that included all clinical characteristics before admission as exploratory variables (age, sex, marital status, housing situation, living together with others, employment situation, main diagnosis, comorbid substance use disorder, comorbid personality disorder, comorbid mental retardation, self-injuring behaviour before admission, suicidal ideation before admission, suicide attempt before admission, type of admission, and voluntary admission).

[Image: Table 1, full clinical characteristics]

Let’s look at the baseline characteristics in Table 1 of the Lancet Psychiatry article. These are the only variables that are available for matching or controlling for statistical confounds.

Recall that the effectiveness of statistical control assumes that all relevant variables have been measured with perfect precision. Statistical control is supposed to eliminate crucial differences among patients so they can be assumed to be otherwise equivalent in their likelihood of being admitted to a locked or unlocked ward for the purposes of analysis and interpretation. Statistical control is supposed to equip us to make “all-other-things-being-equal” judgments about the effects of being in a locked or unlocked ward.
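A toy simulation of my own (made-up numbers, nothing from the paper) shows why that assumption matters: when the analysis can only adjust for a noisy proxy of the real confounder, a spurious “ward effect” survives the adjustment even though the ward has no effect at all.

```python
# Toy simulation (my own made-up numbers): adjusting for a noisy proxy of a
# confounder leaves residual confounding, so a spurious "ward effect" remains.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 50_000
severity = rng.normal(size=n)                          # true confounder, e.g. illness severity
locked = rng.binomial(1, 1 / (1 + np.exp(-severity)))  # sicker patients end up on locked wards
outcome = 0.5 * severity + rng.normal(size=n)          # the ward itself has NO effect on the outcome

noisy_severity = severity + rng.normal(size=n)         # what an administrative dataset records

for label, design in (("unadjusted", locked.reshape(-1, 1)),
                      ("adjusted for noisy proxy", np.column_stack([locked, noisy_severity]))):
    fit = sm.OLS(outcome, sm.add_constant(design)).fit()
    print(label, "-> estimated ward 'effect':", round(fit.params[1], 3))  # the truth is 0
```

Real statistical control would require the true severity variable, measured without error; an administrative data set rarely contains anything close to that.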

 

Zero in on main and comorbid diagnoses. What kind of statistical voodoo can possibly be expected to level other differences between patients at higher risk for suicide, like the 49% minority with schizophrenia spectrum or affective disorder, versus the others at considerably lower risk? How does it help that this large minority of higher-risk patients is thrown in with lower-risk patients with organic mental disorder (dementia or mental retardation) and “neurotic, stress-related and somatoform disorders”?*

If there’s any rationality to the German system of care (and I assume there is), at least some crude risk assessment would guide patients with lower risk into less restrictive settings.

And then there is the question of substance use disorder, which was the primary diagnosis for 67,811 (25·5%) of the patients going into locked facilities and 14,621 (18·7%) of those going into unlocked facilities.

Substance use disorder was a comorbidity for another 100 128 (36·9%) going into locked facilities and 28 363 (36·2%) going into unlocked facilities. Issues around substance use disorder and exit security on psychiatric wards are very different from those for patients without such disorders. These issues, in relation to absconding or dying by suicide, are not going to be sorted out by entering diagnosis into a propensity analysis or into generalised linear mixed-effects model analyses of a data set shrunken by matching in a propensity analysis.

Postscript

I conclude that the data set is much less impressive and relevant than it first appears. There are not a lot of suicides. They occurred in a heterogeneous population over a length of time in which the patterning of circumstances associated with these characteristics likely changed. Because it was an administrative data set, there were restricted opportunities for matching of patients or control of confounds. Any substantive interpretation of the multivariate results requires dubious, unsubstantiated assumptions.

But more importantly, the data set does not provide much evidence for the ideologically saturated claims of the authors or their promoter, Mental Elf. They can pound their drums, but it is not evidence that they are announcing. And patients and their families in Germany and elsewhere could suffer if the recommendations are taken seriously.

Note

*The “neurotic, stress-related and somatoform disorders” admissions to inpatient units are a distinctly German phenomenon. Persons from the community claiming “burnout” can be admitted to facilities overseen by departments of psychotherapy and psychosomatics. There is ample insurance coverage for what can be a spa-like experience with massage and integrative medicine approaches.

 

 

Uninterpretable: Fatal flaws in PACE Chronic Fatigue Syndrome follow-up study

Earlier decisions by the investigator group preclude valid long-term follow-up evaluation of CBT for chronic fatigue syndrome (CFS).

[Image: CFS, “think of the worst”]

At the outset, let me say that I’m skeptical whether we can hold the PACE investigators responsible for the outrageous headlines that have been slapped on their follow-up study and on the comments they have made in interviews.

The Telegraph screamed

Chronic Fatigue Syndrome sufferers ‘can overcome symptoms of ME with positive thinking and exercise’

Oxford University has found ME is not actually a chronic illness

My own experience critiquing media interpretation of scientific studies suggests that neither researchers nor even journalists necessarily control shockingly inaccurate headlines placed on otherwise unexceptional media coverage. On the other hand, much distorted and exaggerated media coverage starts with statements made by researchers and by press releases from their institutions.

The one specific quote attributed to a PACE investigator is unfortunate because of its potential to be misinterpreted by professionals, persons who suffer from chronic fatigue syndrome, and the people around them affected by their functioning.

“It’s wrong to say people don’t want to get better, but they get locked into a pattern and their life constricts around what they can do. If you live within your limits that becomes a self-fulfilling prophesy.”

It suggests that willfulness causes CFS sufferers’ impaired functioning. This is as ridiculous as the application of the discredited concept of fighting spirit to cancer patients’ failure to triumph against their life-altering and life-threatening condition. Let’s practice the principle of charity and assume this is not the intention of the PACE investigator, particularly when there is so much more for which we should give them responsibility.

Go here for a fuller evaluation that I endorse of the Telegraph coverage of PACE follow-up study.

Having read the PACE follow-up study carefully, my assessment is that the data presented are uninterpretable. We can temporarily suspend critical thinking and some basic rules for conducting randomized trials (RCTs) and follow-up studies and for analyzing the subsequent data. Even if we do, we should reject some of the interpretations offered by the PACE investigators as unfairly spun to fit what is already a distorted positive interpretation of the results.

It is important to note that the PACE follow-up study can only be as good as the original data it’s based on. And in the case of the PACE study itself, a recent longread critique by UC Berkeley journalism and public health lecturer David Tuller has arguably exposed such indefensible flaws that any follow-up is essentially meaningless. See it for yourself [1, 2, 3 ].

This week’s report of the PACE long-term follow-up study and a commentary are available free at the Lancet Psychiatry website after a free registration. I encourage everyone to download a copy before reading further. Unfortunately, some crucial details of the article are highly technical, and some details crucial to interpreting the results are not presented.

I will provide practical interpretations of the most crucial technical details so that they are more understandable to the nonspecialist. Let me know where I fail.

[Image: When Cherished Beliefs Clash with Evidence]

To encourage proceeding with this longread, but also to satisfy those who are unwilling or unable to proceed, I’ll reveal my main points here:

  • The PACE investigators sacrificed any possibility of meaningful long-term follow-up by breaking protocol and issuing patient testimonials about CBT before accrual was even completed.
  • This already fatal flaw was compounded with a loose recommendation for treatment after the intervention phase of the trial ended. The investigators provide poor documentation of which treatment was taken up by which patients and whether there was crossover in the treatment being received during follow up.
  • Investigators’ attempts to correct methodological issues with statistical strategies lapse into voodoo statistics.
  • The primary outcome self-report variables are susceptible to manipulation, investigator preferences for particular treatments, peer pressure, and confounding with mental health variables.
  • The PACE investigators exploited ambiguities in the design and execution of their trial with self-congratulatory confirmatory bias.

The Lancet Psychiatry summary/abstract of the article

Background. The PACE trial found that, when added to specialist medical care (SMC), cognitive behavioural therapy (CBT), or graded exercise therapy (GET) were superior to adaptive pacing therapy (APT) or SMC alone in improving fatigue and physical functioning in people with chronic fatigue syndrome 1 year after randomisation. In this pre-specified follow-up study, we aimed to assess additional treatments received after the trial and investigate long-term outcomes (at least 2 years after randomisation) within and between original treatment groups in those originally included in the PACE trial.

Findings Between May 8, 2008, and April 26, 2011, 481 (75%) participants from the PACE trial returned questionnaires. Median time from randomisation to return of long-term follow-up assessment was 31 months (IQR 30–32; range 24–53). 210 (44%) participants received additional treatment (mostly CBT or GET) after the trial; with participants originally assigned to SMC alone (73 [63%] of 115) or APT (60 [50%] of 119) more likely to seek treatment than those originally assigned to GET (41 [32%] of 127) or CBT (36 [31%] of 118; p<0·0001). Improvements in fatigue and physical functioning reported by participants originally assigned to CBT and GET were maintained (within-group comparison of fatigue and physical functioning, respectively, at long-term follow-up as compared with 1 year: CBT –2·2 [95% CI –3·7 to –0·6], 3·3 [0·02 to 6·7]; GET –1·3 [–2·7 to 0·1], 0·5 [–2·7 to 3·6]). Participants allocated to APT and to SMC alone in the trial improved over the follow-up period compared with 1 year (fatigue and physical functioning, respectively: APT –3·0 [–4·4 to –1·6], 8·5 [4·5 to 12·5]; SMC –3·9 [–5·3 to –2·6], 7·1 [4·0 to 10·3]). There was little evidence of differences in outcomes between the randomised treatment groups at long-term follow-up.

Interpretation The beneficial effects of CBT and GET seen at 1 year were maintained at long-term follow-up a median of 2·5 years after randomisation. Outcomes with SMC alone or APT improved from the 1 year outcome and were similar to CBT and GET at long-term follow-up, but these data should be interpreted in the context of additional therapies having being given according to physician choice and patient preference after the 1 year trial final assessment. Future research should identify predictors of response to CBT and GET and also develop better treatments for those who respond to neither.

Note the contradiction here, which will persist throughout the paper, the official Oxford University press release, quotes from the PACE investigators to the media, and media coverage. On the one hand we are told:

Improvements in fatigue and physical functioning reported by participants originally assigned to CBT and GET were maintained…

Yet we are also told:

There was little evidence of differences in outcomes between the randomised treatment groups at long-term follow-up.

Which statement is to be given precedence? To the extent that the features of a randomized trial have been preserved in the follow-up (which, as we will see, is not actually the case), a lack of between-group differences at follow-up should be given precedence over any persistence of change within groups from baseline. That is not a controversial point for interpreting clinical trials.

A statement about group differences at follow-up should precede and qualify any statement about within-group change over follow-up. Otherwise, why bother with an RCT in the first place?

The statement in the Interpretation section of the summary/abstract puts an unsubstantiated spin on the results in favor of the investigators’ preferred interventions.

Outcomes with SMC alone or APT improved from the 1 year outcome and were similar to CBT and GET at long-term follow-up, but these data should be interpreted in the context of additional therapies having being given according to physician choice and patient preference after the 1 year trial final assessment.

If we’re going to be cautious and qualified in our statements, there are lots of other explanations for similar outcomes in the intervention and control groups that are more plausible. Simply put and without unsubstantiated assumptions, any group differences observed earlier have dissipated. Poof! Any advantages of CBT and GET are not sustained.

How the PACE investigators destroyed the possibility of an interpretable follow-up study

Neither the Lancet Psychiatry article nor any recent statements by the PACE investigators acknowledge how these investigators destroyed any possibility of meaningful analyses of follow-up data.

Before the intervention phase of the trial was even completed, indeed before accrual of patients was complete, the investigators published a newsletter in December 2008 directed at trial participants. One article appropriately reminds participants of the upcoming two-and-a-half-year follow-up. It then acknowledges difficulty accruing patients, noting that additional funding had been received from the MRC to extend recruitment. And then glowing testimonials about the effects of the interventions appear on p. 3 of the newsletter.

“Being included in this trial has helped me tremendously. (The treatment) is now a way of life for me, I can’t imagine functioning fully without it. I have nothing but praise and thanks for everyone involved in this trial.”

“I really enjoyed being a part of the PACE Trial. It helped me to learn more about myself, especially (treatment), and control factors in my life that were damaging. It is difficult for me to gauge just how effective the treatment was because 2007 was a particularly strained, strange and difficult year for me but I feel I survived and that the trial armed me with the necessary aids to get me through. It was also hugely beneficial being part of something where people understand the symptoms and illness and I really enjoyed this aspect.”

These testimonials are a horrible breach of protocol. Taken together with the acknowledgment of the difficulty accruing patients, the testimonials solicit expressions of gratitude and apply pressure on participants to endorse the trial by providing a positive account of their outcome. Some minimal effort is made to disguise the conditions from which the testimonials come. However, references to a therapist and, in the final quote above, to “control factors in my life that were damaging” leave no doubt that the CBT and GET favored by the investigators are being credited with positive results.

Probably more than in most chronic illnesses, CFS sufferers turn to each other for support in the face of bewildering and often stigmatizing responses from the medical community. These testimonials represent a form of peer pressure for positive evaluations of the trial.

Any investigator group that would deliberately violate protocol in this manner deserves further scrutiny for other violations and threats to the validity of their results. I challenge defenders of the PACE study to cite other precedents for this kind of manipulation of clinical trial participants. What would they have thought if a drug company had done this in the evaluation of its medication?

The breakdown of randomization as further destruction of the interpretability of follow-up results

Returning to the Lancet Psychiatry article itself, note the following:

After completing their final trial outcome assessment, trial participants were offered an additional PACE therapy if they were still unwell, they wanted more treatment, and their PACE trial doctor agreed this was appropriate. The choice of treatment offered (APT, CBT, or GET) was made by the patient’s doctor, taking into account both the patient’s preference and their own opinion of which would be most beneficial. These choices were made with knowledge of the individual patient’s treatment allocation and outcome, but before the overall trial findings were known. Interventions were based on the trial manuals, but could be adapted to the patient’s needs.

Readers who are methodologically inclined might be interested in a paper in which I discuss incorporating patient preference in randomized trials, as well as another paper describing a clinical trial conducted with German colleagues in which we incorporated patient preference into the evaluation of antidepressants and psychotherapy for depression in primary care. Patient preference can certainly be accommodated in a clinical trial in ways that preserve the benefits of randomization, but not as the PACE investigators have done.

Following completion of the treatment to which particular patients were randomly assigned, the PACE trial offered a complex negotiation between patient and trial physician about further treatment. This represents a thorough breakdown of the benefits of a randomized controlled trial for the evaluation of treatments. Any focus on the long-term effects of initial randomization is sacrificed by what could be substantial departures from that randomization. Any attempts at statistical correction will fail.

Of course, investigators cannot ethically prevent research participants from seeking additional treatment. But in the case of PACE, the investigators encouraged departures from the randomized treatment yet did not adequately take into account the decisions that were made. An alternative would have been to continue with the randomized treatment, taking into account and quantifying any crossover into another treatment arm.

Voodoo statistics in dealing with incomplete follow-up data

Between May 8, 2008, and April 26, 2011, 481 (75%) participants from the PACE trial returned questionnaires.

This is a very good rate of retention of participants for follow-up. The serious problem is that none of the following is random:

  • loss to follow-up;
  • whether participants received further treatment; and
  • whether the treatment received during follow-up crossed over from the arm to which participants were originally randomized.

Furthermore, any follow-up data are biased by the exhortation of the newsletter.

No statistical controls can restore the quality of the follow-up data to what would’ve been obtained with preservation of the initial randomization. Nothing can correct for the exhortation.

Nonetheless, the investigators tried to correct for loss of participants to follow-up and subsequent treatment. They described their effort in a technically complex passage, which I will subsequently interpret:

We assessed the differences in the measured outcomes between the original randomised treatment groups with linear mixed-effects regression models with the 12, 24, and 52 week, and long-term follow-up measures of outcomes as dependent variables and random intercepts and slopes over time to account for repeated measures.

We included the following covariates in the models: treatment group, trial stratification variables (trial centre and whether participants met the international chronic fatigue syndrome criteria, London myalgic encephalomyelitis criteria, and DSM IV depressive disorder criteria), time from original trial randomisation, time by treatment group interaction term, long-term follow-up data by treatment group interaction term, baseline values of the outcome, and missing data predictors (sex, education level, body-mass index, and patient self-help organisation membership), so the differences between groups obtained were adjusted for these variables.

Nearly half (44%; 210 of 479) of all the follow-up study participants reported receiving additional trial treatments after their final 1 year outcome assessment (table 2; appendix p 2). The number of participants who received additional therapy differed between the original treatment groups, with more participants who were originally assigned to SMC alone (73 [63%] of 115) or to APT (60 [50%] of 119) receiving additional therapy than those assigned to GET (41 [32%] of 127) or CBT (36 [31%] of 118; p<0·0001).

In the trial analysis plan we defined an adequate number of therapy sessions as ten of a maximum possible of 15. Although many participants in the follow-up study had received additional treatment, few reported receiving this amount (table 2). Most of the additional treatment that was delivered to this level was either CBT or GET.

The “linear mixed-effects regression models” are rather standard techniques that make use of all of the available data to compensate for missing observations. The problem is that this approach assumes the missing data are missing at random, an untestable assumption that is unlikely to be true in this study.
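
For readers who want to see concretely what such a model involves, here is a minimal sketch using hypothetical column names (not the PACE dataset or the investigators’ actual code). The key point is that the model is fitted to whatever observations happen to be present, which yields unbiased estimates only if the missing assessments are missing at random.

```python
# Minimal sketch of a repeated-measures mixed model of the kind described.
# Column names ('fatigue', 'group', 'months', 'baseline_fatigue', 'centre',
# 'participant_id') are hypothetical, not taken from the PACE dataset.
import statsmodels.formula.api as smf

def fit_followup_model(df):
    """Fit a linear mixed-effects model: fixed effects for treatment group,
    time, their interaction, and covariates; a random intercept and slope
    over time for each participant. Rows with missing outcomes are simply
    dropped, so estimates are unbiased only if those assessments are
    missing at random."""
    model = smf.mixedlm(
        "fatigue ~ group * months + baseline_fatigue + centre",
        data=df,
        groups="participant_id",   # random effects grouped by participant
        re_formula="~months",      # random intercept and slope over time
    )
    return model.fit(reml=True)
```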

The inclusion of “covariates” is an effort to control for possible threats to the validity of the overall analyses by taking into account what is known about participants. There are numerous problems here. We can’t be assured that the results are any more robust and reliable than what would be obtained without these efforts at statistical control. The best publishing practice is to make the unadjusted outcome variables available and let readers decide. Greatest confidence in results is obtained when there is no difference between the results in the adjusted and unadjusted analyses.

Methodologically inclined readers should consult an excellent recent article by clinical trial expert Helena Kraemer, “A Source of False Findings in Published Research Studies: Adjusting for Covariates.”

The effectiveness of statistical controls depends on certain assumptions being met about patterns of variation within the control variables. There is no indication that any diagnostic analyses were done to determine whether possible candidate control variables should be eliminated in order to avoid a violation of assumptions about the multivariate distribution of covariates. With so many control variables, spurious results are likely. Apparent results could change radically with the arbitrary addition or subtraction of control variables. See here for a further explanation of this problem.

We don’t even know how this set of covariate/control variables, rather than some other set, was established. Notoriously, investigators often try out various combinations of control variables and present only those that make their trial look best. Readers are protected from this questionable research practice only by pre-specification of analyses before investigators know their results – and in an unblinded trial, researchers often know the result trends long before they see the actual numbers.

See JP Simmons’ hilarious demonstration that briefly listening to the Beatles’ “When I’m 64” can leave research participants nearly a year and a half younger than listening to “Kalimba” – at least when investigators have free rein to manipulate analyses to get the results they want in a study without pre-registration of analytic plans.
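
To see how much mischief free choice of covariates can cause, here is a toy simulation (mine, not a reanalysis of any trial data): the outcome has no true relationship to treatment, yet scanning across all possible subsets of a handful of irrelevant covariates and keeping the most favorable specification routinely yields a much smaller p-value than the single prespecified model.

```python
# Toy simulation of covariate fishing: no true treatment effect exists,
# but cherry-picking the best of many covariate specifications inflates
# the false-positive rate well above the nominal 5%.
import itertools
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "treat": rng.integers(0, 2, n),                       # randomized "treatment"
    "outcome": rng.normal(size=n),                        # unrelated to treatment
    **{f"c{i}": rng.normal(size=n) for i in range(8)},    # irrelevant covariates
})

covs = [f"c{i}" for i in range(8)]
pvals = []
for k in range(len(covs) + 1):
    for subset in itertools.combinations(covs, k):
        formula = "outcome ~ treat" + "".join(f" + {c}" for c in subset)
        pvals.append(smf.ols(formula, df).fit().pvalues["treat"])

base_p = smf.ols("outcome ~ treat", df).fit().pvalues["treat"]
print(f"prespecified model: p = {base_p:.3f}")
print(f"best of {len(pvals)} specifications: p = {min(pvals):.3f}")
```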

Finally, the efficacy of complex statistical controls is widely overestimated and depends on unrealistic assumptions. First, it is assumed that all relevant variables that need to be controlled have been identified. Second, even when this unrealistic assumption has been met, it is assumed that all statistical control variables have been measured without error. When that is not the case, results can appear significant when they actually are not. See a classic paper by Andrew Phillips and George Davey Smith for further explanation of the problem of measurement error producing spurious findings.
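
A second toy simulation (again mine, purely illustrative and in the spirit of the Phillips and Davey Smith argument) shows the measurement-error problem: a confounder drives both “exposure” and outcome, the exposure has no causal effect, and adjusting for an error-laden measurement of the confounder still leaves a large, highly “significant” spurious exposure effect.

```python
# Toy illustration of residual confounding from a covariate measured with error:
# adjustment only partially removes the confounding, so a null exposure
# still appears strongly "significant".
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
confounder = rng.normal(size=n)
exposure = confounder + rng.normal(size=n)        # exposure driven by the confounder
outcome = confounder + rng.normal(size=n)         # outcome driven by the confounder only
noisy_confounder = confounder + rng.normal(scale=1.5, size=n)  # measured with error

X = sm.add_constant(np.column_stack([exposure, noisy_confounder]))
fit = sm.OLS(outcome, X).fit()
print(f"adjusted 'effect' of exposure: {fit.params[1]:.2f} (p = {fit.pvalues[1]:.1e})")
# Replacing noisy_confounder with the error-free confounder drives the
# exposure coefficient back toward zero.
```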

What the investigators claim the study shows

In an intact clinical trial, investigators can analyze outcome data with and without adjustments and readers can decide which to emphasize. However, this is far from an intact clinical trial and these results are not interpretable.

The investigators nonetheless make the following claims in addition to what was said in the summary/abstract.

In the results the investigators state

The improvements in fatigue and physical functioning reported by participants allocated to CBT or GET at their 1 year trial outcome assessment were sustained.

This was followed by

The improvements in impairment in daily activities and in perceived change in overall health seen at 1 year with these treatments were also sustained for those who received GET and CBT (appendix p 4). Participants originally allocated to APT reported further improvements in fatigue, physical functioning, and impairment in daily activities from the 1 year trial outcome assessment to long-term follow-up, as did those allocated to SMC alone (who also reported further improvements in perceived change in overall health; figure 2; table 3; appendix p 4).

If the investigators are taking their RCT design seriously, they should give precedence to the null findings for group differences at follow-up. They should not be emphasizing the sustaining of benefits within the GET and CBT groups.

The investigators increase their positive spin on the trial in the opening sentence of the Discussion

The main finding of this long-term follow-up study of the PACE trial participants is that the beneficial effects of the rehabilitative CBT and GET therapies on fatigue and physical functioning observed at the final 1 year outcome of the trial were maintained at long-term follow-up 2·5 years from randomisation.

This is incorrect. The main finding is that any reported advantages of CBT and GET at the end of the trial were lost by long-term follow-up. Because an RCT is designed to focus on between-group differences, the statement about the sustaining of benefits within groups is post-hoc.

The Discussion further states

In so far as the need to seek additional treatment is a marker of continuing illness, these findings support the superiority of CBT and GET as treatments for chronic fatigue syndrome.

This makes the unwarranted and self-serving assumption that treatment choice was mainly driven by the need for further treatment, when decision-making was contaminated by investigator preference, as expressed in the newsletter. Note also that CBT is a novel treatment for research participants and more likely to be chosen on the basis of novelty alone, in the face of overall modest improvement rates for the trial and a lack of improvement on objective measures. Whether or not the investigators designate a limited range of self-report measures as primary, participant decision-making may be driven by other, more objective measures.

Regardless, investigators have yet to present any data concerning how decisions for further treatment were made, if such data exist.

The investigators further congratulate themselves with

There was some evidence from an exploratory analysis that improvement after the 1 year trial final outcome was not associated with receipt of additional treatment with CBT or GET, given according to need. However this finding must be interpreted with caution because it was a post-hoc subgroup analysis that does not allow the separation of patient and treatment factors that random allocation provides.

However, why is this analysis singled out as exploratory and to be interpreted with caution because it is a post-hoc subgroup analysis, when similar post-hoc subgroup analyses are presented elsewhere without such caution?

The investigators finally get around to reporting what should be their primary finding, but do so in a dismissive fashion.

Between the original groups, few differences in outcomes were seen at long-term follow-up. This convergence in outcomes reflects the observed improvement in those originally allocated to SMC and APT, the possible reasons for which are listed above.

The discussion then discloses a limitation of the study that should have informed earlier presentation and discussion of results

First, participant response was incomplete; some outcome data were missing. If these data were not missing at random it could have led to either overestimates or underestimates of the actual differences between the groups.

This minimizes the implausibility of the assumption that data are missing at random, as well as the problems introduced by the complex attempts to control for confounds statistically.

And then there is an unsubstantiated statement that is sure to upset persons who suffer from CFS and those who care for them.

the outcomes were all self-rated, although these are arguably the most pertinent measures in a condition that is defined by symptoms.

I could double the length of this already lengthy blog post if I fully discussed this. But let me raise a few issues.

  1. The self-report measures do not necessarily capture subjective experience, only forced choice responses to a limited set of statements.
  2. One of the two primary outcome measures, the physical functioning scale of the SF-36, requires forced-choice responses to a limited set of statements selected for general utility across all mental and physical conditions. Despite its wide use, the SF-36 suffers from problems of internal consistency and confounding with mental health variables. Anyone inclined to get excited about it should examine its items and response options closely. Ask yourself: do differences in scores reliably capture clinically and personally significant changes in the experience and functioning associated with the full range of symptoms of CFS?
  3. The validity of the other primary outcome measure, the Chalder Fatigue Scale, depends heavily on research conducted by this investigator group, and there is inadequate validation of its sensitivity to change in objective measures of functioning.
  4. Such self-report measures are inextricably confounded with morale and nonspecific mental health symptoms, reflecting a large, unwanted tendency to endorse negative self-statements that is not necessarily correlated with objective measures.

Although it was a long time ago, I recall well my first meeting with Professor Simon Wessely. It was at a closed retreat sponsored by NIH to develop a consensus about the assessment of fatigue by self-report questionnaire. I listened to a lot of nonsense that was not well thought out. Then, I presented slides demonstrating a history of failed attempts to distinguish somatic complaints from mental health symptoms by self-report. Much later, this would become my “Stalking bears, finding bear scat in the woods” slide show.

But then Professor Wessely arrived at the meeting late, claiming to be grumbly because of jet lag and flight delays. Without slides and with devastating humor, he upstaged me in completing the demolition of any illusions that we could create more refined self-report measures of fatigue.

I wonder what he would say now.

But alas, people who suffer from CFS have to contend with a lot more than fatigue. Just ask them.

[To be continued later if there is interest in my doing so. If there is, I will discuss the disappearance of objective measures of functioning from the PACE study and you will find out why you should find some 3-D glasses if you are going to search for reports of these outcomes.]

Delusional? Trial in Lancet Psychiatry claims brief CBT reduces paranoid delusions

In this issue of Mind the Brain, I demonstrate a quick assessment of the conduct and reporting of a clinical trial. The authors claimed in Lancet Psychiatry a “first ever” in targeting “worries” with brief cognitive therapy as a way of reducing persistent persecutory delusions in psychotic persons. A Guardian article written by the first author claims effects were equivalent to what is obtained with antipsychotic medication. Lancet Psychiatry allowed the authors a sidebar to their article presenting glowing testimonials from 3 patients making extraordinary gains. Oxford University lent its branding* to the first author’s workshop, promoted with a video announcing the treatment’s “evidence-based” status.

There is much claiming to be new here. Is it a breakthrough in treatment of psychosis and in standards for reporting a clinical trial? Or is what is new not praiseworthy?

I identify the kinds of things that I sought in first evaluating the Lancet Psychiatry article and what additional information needed to be consulted to assess the contribution to the field and relevance to practice.

The article is available open access.

Its publication was coordinated with the first author’s extraordinarily self-promotional article in The Guardian.

The Guardian article makes the claim that

benefits were what scientists call “moderate” – not a magic bullet, but with meaningful effects nonetheless – and are comparable with what’s seen with many anti-psychotic medications.

The advertisement for the workshop is here

 

The Lancet Psychiatry article also cites the author’s self-help book for lay persons. There was no conflict of interest declared.

Probing the article’s Introduction

Reports of clinical trials should be grounded in a systematic review of the existing literature. This allows readers to place the study in the context of existing research and the unsolved clinical and research problems the literature poses. This background prepares the reader to evaluate the contribution the particular trial can make.

Just by examining the references for the introduction, we can find signs of a very skewed presentation.

The introduction cites 13 articles, 10 of which were written by the author and an eleventh by a close associate. The remaining 2 citations are more generic: a book and an article about causality.

Either the author is at the world center of this kind of research or seriously deficient in his attention to the larger body of evidence. At the outset, the author announces a bold reconceptualization of the role of worry in causing psychotic symptoms:

Worry is an expectation of the worst happening. It consists of repeated negative thoughts about potential adverse outcomes, and is a psychological component of anxiety. Worry brings implausible ideas to mind, keeps them there, and increases the level of distress. Therefore we have postulated that worry is a causal factor in the development and maintenance of persecutory delusions, and have tested this theory in several studies.

This is controversial, to say the least. The everyday experience of worrying is being linked to persecutory delusions. A simple continuum seems to be proposed – people can start off with everyday worrying and end up with a psychotic delusion and twenty years of receiving psychiatric services. Isn’t this too simplistic or just plain wrong?

Has no one but the author done relevant work or even reacted to the author’s work? The citations provided in the introduction suggest the author’s work is all we need in order to interpret this study in the larger context of what is known about psychotic persecutory delusions.

Contrast my assessment with the author’s own:

Panel 2: Research in context
Systematic review We searched the ISRCTN trial registry and the PubMed database with the search terms “worry”, “delusions”, “persecutory”, “paranoia”, and “schizophrenia”, without date restrictions, for English-language publications of randomised controlled trials investigating the treatment of worry in patients with persecutory delusions. Other than our pilot investigation, there were no other such clinical trials in the medical literature. We also examined published meta-analyses on standard cognitive behavioural therapy (CBT) for persistent delusions or hallucinations, or both.

The problem is that “worry” is a nonspecific colloquial term, not a widely used scientific one. For the author to require that studies have “worry” as a keyword in order to be retrieved is a silly restriction.

I welcome readers to redo the PubMed search dropping this term. Next, replace “worry” with “anxiety.” Furthermore, the author makes unsubstantiated assumptions about a causal role for worry/anxiety in the development of delusions. Drop the “randomized controlled trial” restriction from the PubMed search and you find a large relevant literature. Persons with schizophrenia and persecutory delusions are widely acknowledged to be anxious. But you won’t find much suggestion in this literature that the anxiety is causal or that people progress from worrying about something to developing schizophrenia and persecutory delusions. This seems a radically overextended version of the idea that normal and psychotic experiences lie on a continuum, concocted with careful avoidance of contrary evidence.
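
For anyone who wants to try this at home, a short script against NCBI’s public E-utilities will do. The query strings below are mine, for illustration only, not the authors’ registered search strategy.

```python
# Compare PubMed hit counts for a narrow, "worry"-keyed query against a broader
# anxiety-based one, using NCBI's public E-utilities. Query strings are
# illustrative only, not the authors' registered search strategy.
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_count(term):
    """Return the number of PubMed records matching a query term."""
    resp = requests.get(ESEARCH, params={"db": "pubmed", "term": term, "retmode": "json"})
    resp.raise_for_status()
    return int(resp.json()["esearchresult"]["count"])

narrow = "worry AND persecutory AND delusions AND randomised controlled trial"
broad = "anxiety AND persecutory AND delusions"
print("narrow query:", pubmed_count(narrow), "hits")
print("broad query: ", pubmed_count(broad), "hits")
```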

Critical appraisal of clinical trials often skips examination of whether the background literature cited to justify the study is accurate and balanced. I think this brief foray has demonstrated that it can be important in establishing whether an investigator is claiming false authority for a view with cherry picking and selective attention to the literature.

Basic design of the study

The 150 patients randomized in this study are around 40 years old. Half of the sample has been in psychiatric services for 11 or more years, with 29% of the patients in the intervention group and 19% in the control group receiving services for more than 20 years. The article notes in passing that all patients were prescribed antipsychotic medication at the outset of the study except 1 in the intervention group and 9 in the control group – 1 versus 9? It is puzzling how such differences emerged if randomization was successful in controlling for baseline differences. Maybe it demonstrates the limitations of block randomization.

The intervention is decidedly low intensity for what is presumably a long-standing symptom in a chronically psychotic population.

We aimed to provide the CBT worry-reduction intervention in six sessions over 8 weeks. Each session lasted roughly an hour and took place in NHS clinics or at patients’ homes.

The six sessions were organized around booklets shared by the patient and therapist.

The main techniques were psychoeducation about worry, identification and reviewing of positive and negative beliefs about worry, increasing awareness of the initiation of worry and individual triggers, use of worry periods, planning activity at times of worry (which could include relaxation), and learning to let go of worry.

Patients were expected to practice exercises from the author’s self-help book for lay persons.

The two main practical techniques to reduce worry were then introduced: the use of worry periods (confining worry to about a 20 minute set period each day) and planning of activities at peak worry times. Worry periods were implemented flexibly. For example, most patients set up one worry period a day, but they could choose to have two worry periods a day or, in severe instances, patients instead aimed for a worry-free period. Ideally, the worry period was then substituted with a problem-solving period.

Compared to what?

The treatment of the control group was ill-defined routine care “delivered according to national and local service protocols and guidelines.” Readers are not told how much treatment the patients received or whether their care was actually congruent with these guidelines. Routine care of mental health patients in the community is notoriously deficient. That over half of these patients had been in services for more than a decade suggests that treatment for many of them had tapered off and was being delivered with no expectation of improvement.

To accept this study as an evaluation of the author’s therapy approach, we need to know how much other treatment was received by patients in both the intervention and control groups. Were patients in the routine care condition, as I suspect, largely being ignored? The intervention group got 6 sessions of therapy over 8 weeks. Is that a substantial increase in psychotherapy, or even in time to talk with a professional, over what they would otherwise receive? Did being assigned to the intervention also increase patients’ other contact with mental health services? If the intervention therapists heard that a patient was having problems with medication or serious unmet medical needs, how did they respond?

The authors report collecting data concerning receipt of services with the Client Service Receipt Inventory, but those data are reported nowhere.

Most basically, we don’t know what elements the comparison/control group controlled. We have no reason to presume that the amount of contact time and basic relationship with a treatment provider was controlled.

As I have argued before, it is inappropriate and arguably unethical to use ill-defined routine care or treatment-as-usual in the evaluation of a psychological intervention. We cannot tell whether any apparent benefits of being assigned to the intervention are due to correcting the inadequacies of routine care, including its lack of basic elements of support, attention, and encouragement. We therefore cannot tell whether the intervention has effective elements other than these nonspecific factors.

We cannot tell whether any positive results from this trial should encourage dissemination and implementation of the intervention, or merely point to improving likely deficiencies in the treatment received by patients in long-term psychiatric care.

In terms of quickly evaluating articles reporting clinical trials, we see that simply asking “compared to what” and jumping to the comparison/control condition revealed a lot of deficiencies, at the outset, in what this trial could reveal.

Measuring outcomes

Two primary outcomes were declared – changes in the Penn State Worry Questionnaire and in the Psychotic Symptoms Rating Scale-Delusion (PSYRATS-delusion) subscale. The authors use multivariate statistical techniques to determine whether patients assigned to the intervention group improved more on either of these measures, and whether reduction in worry specifically caused reductions in persecutory delusions.
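
In outline, the logic being tested looks something like the sketch below (a crude regression version with hypothetical column names, not the authors’ actual modelling): does assignment to the intervention reduce worry, and does the change in worry statistically account for the change in delusion ratings?

```python
# Crude sketch of the mediation-style question at stake, using hypothetical
# change-score columns. The authors' actual analyses were more elaborate;
# this only shows the logical structure of the claim.
import statsmodels.formula.api as smf

def crude_mediation_check(df):
    # Path a: does treatment assignment predict change in worry (PSWQ)?
    path_a = smf.ols("pswq_change ~ treat", df).fit()
    # Paths b and c': does change in worry predict change in delusion ratings,
    # over and above treatment assignment?
    path_bc = smf.ols("psyrats_change ~ pswq_change + treat", df).fit()
    return {
        "a (treat -> worry)": path_a.params["treat"],
        "b (worry -> delusions | treat)": path_bc.params["pswq_change"],
        "c' (direct treat effect)": path_bc.params["treat"],
    }
```

Of course, if the two outcome measures overlap heavily in content, a strong “b” path tells us very little, which is exactly the problem taken up below.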

Understand what is at stake here: the authors are trying to convince us that this is a groundbreaking study that shows that reducing worry with a brief intervention reduces long standing persecutory delusions.

The authors lose substantial credibility if we look closely at their primary measures, including their items, not just the scale names.

The Penn State Worry Questionnaire (PSWQ) is a 16-item questionnaire widely used with college student, community, and clinical samples. Items include

When I am under pressure I worry a lot.

I am always worrying about something.

And reverse direction items scored so greater endorsement indicates less worrying –

I do not tend to worry about things.

I never worry about anything.

I know, how many times does basically the same question have to be asked?

The questionnaire is meant to be general. It focuses on a single complaint that could be a symptom of anxiety. While the questionnaire could be used to screen for anxiety disorders, it does not provide a diagnosis of a mental disorder, which requires that other symptoms be present. Actually, worry is only one of three components of anxiety. The others are physiological – like racing heart, sweating, or trembling – and behavioral – like avoidance or procrastination.

But “worry” is also a feature of depressed mood. Another literature discusses “worry” as “rumination.” We should not be surprised to find this questionnaire functions reasonably well as a screen for depression.

But past research has shown that even in nonclinical populations, using a cutpoint to designate high versus low worriers results in unstable classification. Without formal intervention, many of those who are “high” become  “low” over time.

In order to be included in this study, patients had to have a minimum score of 44 on the PSWQ. If we skip to the results of the study we find that the patients in the intervention group dropped from 64.8 to 56.1 and those receiving only routine care dropped from 64.5 to 59.8. The average patient in either group would have still qualified for inclusion in the study at the end of follow up.

The second outcome measure, the Psychotic Symptoms Rating Scale-Delusion subscale, has six items: amount and duration of preoccupation, amount and intensity of distress, conviction, and disruption. Each item is scored 0–4, with 0 = no problem and 4 = maximum severity.

The items are so diverse that interpretation of a change in the context of an intervention trial targeting worry becomes difficult. Technically speaking, the lack of comparability among items is so great that the measure cannot be considered an interval scale for which conventional parametric statistics could be used. We cannot reasonably assume that a change on one item is equivalent to a change on another.

It would seem, for instance, that amount of preoccupation with delusions, amount and intensity of distress, and conviction that the delusions are true are very different matters. The intervention group changed from a mean of 18.7 on a scale with a possible score of 24 to 13.6 at 24 weeks; the control group from 18.0 to 16.4. This change could simply represent a reduction in the amount and intensity of distress, not in patients’ preoccupation with the delusions, their conviction that the delusions are true, or the disruption in their lives. Overall, the PSYRATS-delusion subscale is not a satisfactory measure on which to base strong claims that reducing worry reduces delusions. The measure is too contaminated with content similar to the worry questionnaire. We might only be finding that “changes in worry result in changes in worry.”
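
A toy example (my numbers, not the trial’s item-level data, which were not reported) makes the aggregation problem concrete: the same five-point drop in the total can come entirely from the distress items, with conviction and disruption untouched.

```python
# Toy illustration of the aggregation problem with a PSYRATS-delusions total:
# item labels are paraphrased and the scores are invented for illustration.
items    = ["preoccupation_amount", "preoccupation_duration",
            "distress_amount", "distress_intensity", "conviction", "disruption"]
baseline = [3, 3, 4, 4, 4, 3]   # total 21
followup = [3, 3, 2, 1, 4, 3]   # only the distress items change; total 16

print("baseline total:", sum(baseline), "| follow-up total:", sum(followup))
for name, before, after in zip(items, baseline, followup):
    if before != after:
        print(f"changed item: {name} ({before} -> {after})")
```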

Checking primary outcomes is important in evaluating a clinical trial, but in this case, it was crucial to examine what the measures assessed at an item content level. Too often reviewers uncritically accept the name of an instrument as indicating what it validly measures when used as an outcome measure.

The fancy multivariate analyses do not advance our understanding of what went on in the study. The complex statistical analyses might simply be demonstrating patients were less worried as seen in questionnaires and interview ratings based on what patients say when asked whether they are distressed.

My summary assessment is that a low intensity intervention is being evaluated against an ill-defined treatment as usual. The outcome measures are too nonspecific and overlapping to be helpful. We may simply be seeing effects of contact and reassurance among patients who are not getting much of either. So what?

Bring on the patient endorsements

Panel 1: Patient comments on the intervention presents glowing endorsements from 3 of the 73 patients assigned to the intervention group. The first patient describes the treatment as “extremely helpful” and as providing a “breakthrough.” The second patient describes starting treatment feeling lost and without self-confidence, but now being relaxed at times of the day that had previously been stressful. The third patient declared

“The therapy was very rewarding. There wasn’t anything I didn’t like. I needed that kind of therapy at the time because if I didn’t have that therapy at that time, I wouldn’t be here.”

Wow, but these dramatic gains seem inconsistent with the modest gains registered on the quantitative primary outcome measures. We are left guessing how these endorsements were elicited – were they obtained in a context where patients were expected to express gratitude for the extra attention they received? – and the criteria by which these particular quotes were selected from what is presumably a larger pool.

Think of the outcry if Lancet Psychiatry extended this innovation in the reporting of clinical trials to evaluations of medications by their developers. If such side panels are going to be retained in future reports of clinical trials, maybe it would be best that they be marked “advertisement” and accompanied by a declaration of conflict of interest.

A missed opportunity to put the authors’ intervention to a fair test

In the Discussion section the authors state

although we think it highly unlikely that befriending or supportive counselling [sic] would have such persistent effects on worry and delusions, this possibility will have to be tested specifically in this group.

Actually, the authors don’t have much evidence of anything but a weak effect that might well have been achieved with befriending or supportive counseling delivered by persons with less training. We should be careful about accepting claims of any clinically significant effects on delusions. At best, the authors have evidence that distress associated with delusions was reduced, and any correspondence in scores between the two measures may simply reflect confounding of the two outcome measures.

It is a waste of scarce research funds, and an unethical waste of patients’ willingness to contribute to science, to compare this low-intensity psychotherapy to an ill-described, unquantified treatment as usual. Another low-intensity treatment like befriending or supportive counseling might provide sufficient elements of attention, support, and raised expectations to achieve comparable results.

Acknowledging the Supporting Cast

In evaluating reports of clinical trials, it is often informative to look to footnotes and acknowledgments, as well as the main text. This article acknowledges Anthony Morrison as a member of the Trial Steering Committee and Douglas Turkington as a member of the Data Monitoring and Ethics Committee. Readers of Mind the Brain might recognize Morrison as first author of a Lancet trial that I critiqued for exaggerated claims and Turkington as the first author of a trial that became an internet sensation when post-publication reviewers pointed out fundamental problems in the reporting of data.  Turkington and an editor of the journal in which the report of the trial was published counterattacked.

All three of these trials involve exaggerated claims based on a comparison between CBT and ill-defined routine care. Like the present one, Morrison’s trial failed to report the data it collected concerning receipt of services. And in an interview with Lancet, Morrison admitted to avoiding a comparison between CBT and anything but routine care out of concern that differences might not be found with any treatment providing a supportive relationship, even basic supportive counseling.

A note to funders

This project (09/160/06) was awarded by the Efficacy and Mechanism Evaluation (EME) Programme, and is funded by the UK Medical Research Council (MRC) and managed by the UK NHS National Institute for Health Research (NIHR) on behalf of the MRC-NIHR partnership.

Really, UK MRC, you are squandering scarce funds on methodologically poor, often small trials for which investigators make extravagant claims and that don’t include a comparison group allowing control for nonspecific effects. You really ought to insist on better attention to the existing literature in justifying another trial and adequate controls for amount of contact time, attention and support.

Don’t you see the strong influence of investigator allegiance dictating reporting of results consistent with the advancement of the investigators’ product?

I don’t understand why you allowed the investigator group to justify the study with such idiosyncratic, highly selective review of the literature driven by substituting a colloquial term “worry” for more commonly used search terms.

Do you have independent review of grants by persons who are more accepting of the usual conventions of conducting and reporting trials? Or are you faced with the problems of a small group of reviewers giving out money to like-minded friends and family? Note that the German Federal Ministry of Education and Research (BMBF) has effectively dealt with inbred old boy networks by excluding Germans from the panels of experts reviewing German grants. Might you consider the same strategy in getting more serious about funding projects with some potential for improving patient care? Get with it: insist on rigor and reproducibility in what you fund.

*We should not make too much of Oxford lending its branding to this workshop. Look at the workshops to which Harvard Medical School lends its labels.