Lessons we need to learn from a Lancet Psychiatry study of the association between exercise and mental health

The closer we look at a heavily promoted study of exercise and mental health, the more its flaws become obvious. There is little support for the most basic claims being made – despite the authors marshaling enormous attention to the study.

Apparently, the editor of Lancet Psychiatry and reviewers did not give the study a close look before it was accepted.

The article was used to raise funds for a startup company in which one of the authors was heavily invested. This was disclosed, but doesn’t let the authors off the hook for promoting a seriously flawed study. Nor should the editor of Lancet Psychiatry or reviewers escape criticism, nor the large number of people on Twitter who thoughtlessly retweeted and “liked” a series of tweets from the last author of the study.

This blog post is intended to raise consciousness about bad science appearing in prestigious journals and to allow citizen scientists to evaluate their own critical thinking skills in terms of their ability to detect misleading and exaggerated claims.

1. Sometimes a disclosure of extensive conflicts of interest alerts us not to pay serious attention to a study. Instead, we should question why the study got published in a prestigious peer-reviewed journal when it had such an obvious risk of bias.

2. We need citizen scientists with critical thinking skills to identify such promotional efforts and alert others in their social network that hype and hokum are being delivered.

3. We need to stand up to authors who use scientific papers for commercial purposes, especially when they troll critics.

Read on and you will see what a skeptical look at the paper and its promotion revealed.

  • The study failed to capitalize on the potential of multiple years of data for developing and evaluating statistical models. Bigger is not necessarily better. Combining multiple years of data was wasteful and served only to give the authors bragging rights and the impressive but meaningless p-values that come with overly large samples.
  • The study relied on an unvalidated and inadequate measure of mental health that confounded recurring stressful conditions at work or home with mental health problems, even when validated measures of mental health would reveal no effects.
  • The study used an odd measure of history of mental health problems that undoubtedly exaggerated past history.
  • The study confused physical activity with (planned) exercise. The authors amplified their confusion by relying on an exceedingly odd strategy for getting an estimate of how much participants exercised: estimates of time spent in a single activity were used in analyses of total time spent exercising. All other physical activity was ignored.
  • The study made a passing acknowledgment of the problems interpreting simple associations as causal, but then went on to selectively sample the existing literature to make the case that interventions to increase exercise improve mental health.
  • Taken together, a skeptical assessment of this article provides another demonstration that disclosure of substantial financial conflicts of interest should alert readers to a high likelihood of a hyped, inaccurately reported study.
  • The article was paywalled, so anyone interested in evaluating the authors’ claims for themselves had to write to the author or have access to the article through a university library site. I am waiting for the authors to reply to my requests for the supplementary tables that are needed to make full sense of their claims. In the meantime, I’ll just complain about authors with significant conflicts of interest heavily promoting studies that they hide behind paywalls.

I welcome you to examine the author’s thread of tweets. Request the actual article from the author if you want to independently evaluate my claims. This could be great material for a master’s or honors class on critical appraisal, whether in psychology or journalism.

[Screenshot: the title of the article, Chekroud and colleagues’ Lancet Psychiatry study of exercise and mental health in 1·2 million individuals]

Let me know if you think that I’ve been too hard on this study.

A thread of tweets from the last author celebrated the success of a well-orchestrated publicity campaign for a new article concerning exercise and mental health in Lancet Psychiatry.

The thread started:

Our new @TheLancetPsych paper was the biggest ever study of exercise and mental health. it caused quite a stir! here’s my guided tour of the paper, highlighting some of our excitements and apprehensions along the way [thread] 1/n

And it ended with a pitch for the author’s do-good startup company:

Where do we go from here? Over @spring_health – our mental health startup in New York City – we’re using these findings to develop personalized exercise plans. We want to help every individual feel better—faster, and understand exactly what each patient needs the most.

I wasn’t long into the thread before my skepticism was stimulated. The fourth tweet in the thread included a figure but offered no comment on how bizarre it was.

The tweet read:

It looks like those differences mattered. for example, people who exercised for about 45 minutes seemed to have better mental health than people who exercised for less than 30, or more than 60 minutes. — a sweet spot for mental health, perhaps?

[Screenshot: graphs from the paper]

Nor does the author comment on an anomaly: housework appears to be better for mental health than a summary score of all exercise, and looks equal to or better than cycling or jogging. But how did housework slip into the category “exercise”?

I began wondering what the authors meant by “exercise,” and whether they had given the definition serious consideration when constructing their key variable from the survey data.

But then that tweet was followed by another one that generated more confusion, with a graph that seemingly contradicted the figures in the previous one:

the type of exercise people did seems important too! People doing team sports or cycling had much better mental health than other sports. But even just walking or doing household chores was better than nothing!

Then a self-congratulatory tweet for a promotional job well done.

for sure — these findings are exciting, and it has been overwhelming to see the whole world talking openly and optimistically about mental health, and how we can help people feel better. It isn’t all plain sailing though…

The author’s next tweet revealed a serious limitation to the measure of mental health used in the study in a screenshot.

[Screenshot of the tweet showing the mental health variable]

The author acknowledged the potential problem, sort of:

(1b- this might not be the end of the world. In general, most peple have a reasonable understanding of their feelings, and in depressed or anxious patients self-report evaluations are highly correlated with clinician-rated evaluations. But we could be more precise in the future)

“Not the end of the world?” Since when does the author of a paper in the Lancet family of journals so casually brush off a serious methodological issue? A lot of us who have examined the validity of mental health measures would be skeptical of this dismissal of a potentially fatal limitation.

No validation is provided for this measure. On the face of it, respondents could endorse it on the basis of facing recurring stressful situations that had no consequences for their mental health. This reflects the ambiguity of the term “stress” for both laypersons and scientists: “stress” can variously refer to an environmental situation, a subjective experience of stress, or an adaptational outcome. Waitstaff could consider Thursdays, when the chef is off, a recurrent weekly stress. Persons with diagnosable persistent depressive disorder would presumably endorse more days than not as being a mental health challenge. But they would mean something entirely different.

The author acknowledged that the association between exercise and mental health might be bidirectional in terms of causality:

[Screenshot of tweet: lots of reasons to believe the relationship goes both ways]

But he then made a strong claim for increased exercise leading to better mental health:

[Screenshot of tweet claiming that exercise increases mental health]

[Actually, as we will see, the evidence from randomized trials of exercise to improve mental health is modest, and it entirely disappears when one limits oneself to the high-quality studies.]

The author then ran off the rails with the claim that the benefits of exercise exceed the benefits of having a greater-than-poverty-level income.

[Screenshot of tweet: why we are so excited]

I could not resist responding.

Stop comparing adjusted correlations obtained under different circumstances as if they demonstrated what would be obtained in RCT. Don’t claim exercising would have more effect than poor people getting more money.

But I didn’t get a reply from the author.

Eventually, the author got around to plugging his startup company.

I didn’t get it. Just how did this heavily promoted study advance the science of such “personalized” recommendations?

Important things I learned from others’ tweets about the study

I follow @BrendonStubbs on Twitter and you should too. Brendon often makes wise critical observations of studies that most everyone else is uncritically praising. But he also identifies some studies that I otherwise would miss and says very positive things about them.

He started his own thread of tweets about the study on a positive note, but then he identified a couple of critical issues.

First, he took issue with the author’s claim to have identified a tipping point, below which exercise is beneficial and above which exercise could prove detrimental to mental health.

4/some interpretations are troublesome. Most confusing, are the assumptions that higher PA is associated/worsens your MH. Would we say based on cross sect data that those taking most medication/using CBT most were making their MH worse?

A postdoctoral fellow @joefirth7  seconded that concern:

I agree @BrendonStubbs: idea of high PA worsening mental health limited to observation studies. Except in rare cases of athletes overtraining, there’s no exp evidence of ‘tipping point’ effect. Cross-sect assocs of poor MH <–> higher PA likely due to multiple other factors…

Ouch! But then Brendon followed up with concerns that the measure of physical activity had not been adequately validated, noting that such self-report measures often prove to be invalid.

5/ one consideration not well discussed, is self report measures of PA are hopeless (particularly in ppl w mental illness). Even those designed for population level monitoring of PA https://journals.humankinetics.com/doi/abs/10.1123/jpah.6.s1.s5 … it is also not clear if this self report PA measure has been validated?

As we will soon see, the measure used in this study is quite flawed in its conceptualization and in its odd methodology of requiring participants to estimate the time spent exercising for only one activity, chosen from 75 options.

Next, Brendon pointed to a particular problem with using self-reported physical activity in persons with mental disorder and gave an apt reference:

6/ related to this, self report measures of PA shown to massively overestimate PA in people with mental ill health/illness – so findings of greater PA linked with mental illness likely bi-product of over-reporting of PA in people with mental illness e.g Validity and Value of Self-reported Physical Activity and Accelerometry in People With Schizophrenia: A Population-Scale Study of the UK Biobank [ https://academic.oup.com/schizophreniabulletin/advance-article/doi/10.1093/schbul/sbx149/4563831 ]

7/ An additional point he makes: anyone working in field of PA will immediately realise there is confusion & misinterpretation about the concepts of exercise & PA in the paper, which is distracting. People have been trying to prevent this happening over 30 years

Again, Brendon provides a spot-on citation clarifying the distinction between physical activity and exercise: Physical activity, exercise, and physical fitness: definitions and distinctions for health-related research.

The mysterious, pseudonymous Zad Chow @dailyzad called attention to a blog post they had just uploaded. Let’s take a look at some of its key points.

Lessons from a blog post: Exercise, Mental Health, and Big Data

Zad Chow is quite balanced in dispensing praise and criticism of the Lancet Psychiatry paper. They noted the ambiguity of any causal claim based on a cross-sectional correlation, and they investigated the literature on their own.

So what does that evidence say? Meta-analyses of randomized trials seem to find that exercise has large and positive treatment effects on mental health outcomes such as depression.

Study name            Randomized trials     Effect size, SMD (95% CI)
Schuch et al. 2016    25                    1.11 (0.79 to 1.43)
Gordon et al. 2018    33                    0.66 (0.48 to 0.83)
Krogh et al. 2017     35                    −0.66 (−0.86 to −0.46)

But, when you only pool high-quality studies, the effects become tiny.

“Restricting this analysis to the four trials that seemed less affected of bias, the effect vanished into −0.11 SMD (−0.41 to 0.18; p=0.45; GRADE: low quality).” – Krogh et al. 2017

Hmm, would you have guessed this from the Lancet Psychiatry author’s thread of tweets?
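
To see mechanically how restricting to the better trials can deflate a pooled effect, here is a minimal inverse-variance pooling sketch in Python. The effect sizes and standard errors are purely hypothetical numbers chosen for illustration; none of them come from the trials in the table above.

```python
def pooled_smd(effects, ses):
    """Fixed-effect inverse-variance pooled SMD and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    estimate = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return estimate, (1.0 / sum(weights)) ** 0.5

# Hypothetical trials: four small, high risk-of-bias trials with large effects
# and three larger, low risk-of-bias trials with effects near zero.
all_effects = [-1.2, -0.9, -0.8, -0.7, -0.10, 0.00, -0.05]
all_ses     = [0.45, 0.40, 0.35, 0.40, 0.20, 0.18, 0.22]

print(pooled_smd(all_effects, all_ses))          # ~(-0.25, ...): apparent benefit with all trials
print(pooled_smd(all_effects[4:], all_ses[4:]))  # ~(-0.05, ...): near zero with low-bias trials only
```

Under these made-up numbers, the handful of small, imprecise trials with big apparent effects is enough to pull the pooled estimate well away from the near-null answer given by the better trials, which is the same pattern Krogh and colleagues describe.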

Zad Chow showed the hype and untrustworthiness of the press coverage in prestigious media with a sampling of screenshots.

[Zad Chow’s screenshots of press coverage]

I personally checked and don’t see that Zad Chow’s selection of press coverage was skewed. Coverage in the media all seemed to be saying the same thing. I found the distortion to continue with uncritical parroting – a.k.a. churnaling – of the claims of the Lancet Psychiatry authors in the Wall Street Journal. 

The WSJ repeated a number of the author’s claims that I’ve already thrown into question and added a curiosity:

In a secondary analysis, the researchers found that yoga and tai chi—grouped into a category called recreational sports in the original analysis—had a 22.9% reduction in poor mental-health days. (Recreational sports included everything from yoga to golf to horseback riding.)

And the NHS England totally got it wrong:

[Screenshot: NHS England getting it wrong]

So, we learned that the broad category “recreational sports” covers yoga and tai chi, as well as golf and horseback riding. This raises serious questions about the lumping and splitting of categories of physical activity in the analyses being reported.

I needed to access the article in order to uncover some important things 

I’m grateful for the clues that I got from Twitter, especially from Zad Chow, which I used in examining the article itself.

I got hung up on the title proclaiming that the study involved 1·2 million individuals. When I checked the article, I saw that the authors used three waves of publicly available data to get that number. Having that many participants gave them no real advantage except for bragging rights and the likelihood that modest associations could be expressed in spectacular p-values, like p < 2·2 × 10⁻¹⁶. I don’t understand why the authors didn’t conduct analyses in one wave and cross-validate the results in another.
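
As a back-of-the-envelope illustration of why such p-values carry no weight as evidence of importance, consider a hypothetical split of roughly 1·2 million respondents into two equal groups differing by a trivial standardized difference. The numbers below are mine, chosen for illustration, not the study’s.

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical: two groups of 600,000 respondents each, differing by a
# standardized mean difference of only d = 0.02 (a trivially small effect).
n_per_group = 600_000
d = 0.02
se = sqrt(2.0 / n_per_group)        # approximate standard error of the SMD
z = d / se
p = 2 * norm.sf(z)                  # two-sided p-value
print(f"z = {z:.1f}, p = {p:.1e}")  # z ~ 11, p ~ 6e-28 -- far below 2.2e-16
```

With samples this large, a difference nobody would care about clinically still produces a p-value far smaller than the one the authors advertised.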

The obligatory Research in Context box made it sound like a systematic search of the literature had been undertaken. Maybe, but the authors were highly selective in what they chose to comment upon, as can be seen from the contradiction with Zad Chow’s brief review. The authors would have us believe that the existing literature is quite limited and inconclusive, supporting the need for a study like theirs.

[Screenshot of the article’s Research in Context box]

Caveat lector: a strong confirmation bias likely lies ahead in this article.

Questions accumulated quickly as to the appropriateness of the items available from a national survey undoubtedly constructed for other purposes. Certainly these items would not have been selected if the original investigators had been interested in the research question at the center of this article.

Participants self-reported a previous diagnosis of depression or depressive episode on the basis of the following question: “Has a doctor, nurse, or other health professional EVER told you that you have a depressive disorder, including depression, major depression, dysthymia, or minor depression?”

Our own work has cast serious doubt on the correspondence between reports of a history of depression in response to a brief question embedded in a larger survey and the results of a structured interview in which respondents’ answers can be probed. We found that answers to such questions were more related to current distress than to actual past diagnoses and treatment of depression. However, the survey question used in the Lancet Psychiatry study added further ambiguity and invalidity with the phrase “or minor depression.” I am not sure under what circumstances a health care professional would disclose a diagnosis of “minor depression” to a patient, but I doubt it would be in a context in which the professional felt treatment was needed.

Despite the skepticism that I was developing about the usefulness of the survey data, I was unprepared for the assessment of “exercise.”

“Other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise?” Participants who answered yes to this question were then asked: “What type of physical activity or exercise did you spend the most time doing during the past month?” A total of 75 types of exercise were represented in the sample, which were grouped manually into eight exercise categories to balance a diverse representation of exercises with the need for meaningful cell sizes (appendix).

Participants indicated the number of times per week or month that they did this exercise and the number of minutes or hours that they usually spend exercising in this way each time.

I had already been tipped off by the discussion on Twitter that there would be a thorough confusion of planned exercise and mere physical activity. But now that confusion was compounded. Why was physical activity during employment excluded? What if participants were engaged in a number of different physical activities, like both jogging and bicycling? If so, the survey obtained data for only one of these activities, with the other excluded, and the choice of which activity the participant identified as the one to be counted could have been quite arbitrary.

Anyone who has ever constructed surveys would be alert to the problems posed by participants’ awareness that saying “yes” to exercising would require contemplating 75 different options and arbitrarily choosing one of them for a further question about how much time they spent on that activity. Unless participants were strongly motivated, there was an incentive to simply say no, they didn’t exercise.

I suppose I could go on, but it was my judgment that any validity to what the authors were claiming had been ruled out. As someone once said on an NIH grant review panel: there are no vital signs left, let’s move on to the next item.

But let’s refocus just a bit on the overall intention of these authors. They want to use a large data set to make statements about the association between physical activity and a measure of mental health. They have used matching and statistical controls to equate participants. But that strategy effectively eliminates consideration of crucial contextual variables. Persons’ preferences and opportunities to exercise are powerfully shaped by their personal and social circumstances, including finances and competing demands on their time. Said differently, people are embedded in contexts that a lot of statistical maneuvering has sought to eliminate.

To suggest a small number of the many complexities: how much physical activity participants get in their employment may be an important determinant of their choices for additional activity, as is how much time is left outside of work. If work typically involves a lot of physical exertion, people may simply be left too tired for additional planned physical activity, a.k.a. exercise, and their physical health may require it less. Environments differ greatly in terms of the opportunities for, and the safety of, engaging in various kinds of physical activities. Team sports require other people being available. Etc., etc.

What I learned from the editorial accompanying the Lancet Psychiatry article

The brief editorial accompanying the article aroused my curiosity as to whether someone assigned to reading and commenting on this article would catch things that the editor and reviewers apparently missed.

Editorial commentators are chosen to praise, not to bury articles. There are strong social pressures to say nice things. However, this editorial leaked a number of serious concerns.

First

In presenting mental health as a workable, unified concept, there is a presupposition that it is possible and appropriate to combine all the various mental disorders as a single entity in pursuing this research. It is difficult to see the justification for this approach when these conditions differ greatly in their underlying causes, clinical presentation, and treatment. Dementia, substance misuse, and personality disorder, for example, are considered as distinct entities for research and clinical purposes; capturing them for study under the combined banner of mental health might not add a great deal to our understanding.

The problem here of categorisation is somewhat compounded by the repeated uncomfortable interchangeability between mental health and depression, as if these concepts were functionally equivalent, or as if other mental disorders were somewhat peripheral.

Then:

A final caution pertains to how studies approach a definition of exercise. In the current study, we see the inclusion of activities such as childcare, housework, lawn-mowing, carpentry, fishing, and yoga as forms of exercise. In other studies, these activities would be excluded for not fulfilling the definition of exercise as offered by the American College of Sports Medicine: “planned, structured and repetitive bodily movement done to improve or maintain one or more components of physical fitness.” 11 The study by Chekroud and colleagues, in its all-encompassing approach, might more accurately be considered a study in physical activity rather than exercise.

The authors were listening for a theme song with which they could promote their startup company in a very noisy data set. They thought they had a hit. I think they had noise.

The authors’ extraordinary disclosure of interests (see below) should have precluded publication of this seriously flawed piece of work, whether simply by reason of the high likelihood of bias or because it should have prompted the editor and reviewers to look more carefully at the serious flaws hiding in plain sight.

Postscript: Send in the trolls.

On Twitter, Adam Chekroud announced he felt no need to respond to critics. Instead, he retweeted and “liked” trolling comments directed at critics from the Twitter accounts of his brother, his mother, and even the official Twitter account of a local fried chicken joint, @chickenlodge, which offered free food for retweets and suggested including Adam Chekroud’s Twitter handle if you wanted to be noticed.

[Screenshot of the Chicken Lodge tweet]

Really, Adam, if you can’t stand the heat, don’t go near  where they are frying chicken.

The Declaration of Interests from the article.

[Screenshots of the article’s declaration of interests]

 

Hazards of pointing out bad meta-analyses of psychological interventions

 

A cautionary tale

Psychology has a meta-analysis problem. And that’s contributing to its reproducibility problem. Meta-analyses are wallpapering over many research weaknesses, instead of being used to systematically pinpoint them. – Hilda Bastian

  • Meta-analyses of psychological interventions are often unreliable because they depend on a small number of poor quality, underpowered studies.
  • It is surprisingly easy to screen the studies being assembled for a meta-analysis and quickly determine that the literature is not suitable because it does not have enough quality studies. Apparently, the authors of many published meta-analyses did not undertake such a brief assessment or were undeterred by it from proceeding anyway.
  • We can’t tell how many efforts at meta-analyses were abandoned because of the insufficiencies of the available literature. But we can readily see that many published meta-analyses offer summary effect sizes for interventions that can’t be expected to be valid or generalizable.
  • We are left with a glut of meta-analyses of psychological interventions that convey inflated estimates of the efficacy of interventions and on this basis, make unwarranted recommendations that broad classes of interventions are ready for dissemination.
  • Professional organizations and promoters of particular treatments have strong vested interests in portraying their psychological interventions as effective. They will use their resources to resist efforts to publish critiques of their published meta-analyses and even fight the teaching of basic critical skills for appraising meta-analysis.
  • Publication of thorough critiques has little or no impact on the subsequent citation or influence of flawed meta-analyses; the critiques themselves are largely ignored.
  • Debunking bad meta-analyses of psychological interventions can be frustrating at best, and, at worst, hazardous to careers.
  • You should engage in such activities if you feel it is right to do so. It will be a valuable learning experience. And you can only hope that someone at some point will take notice.

Three simple screening questions to decide whether a meta-analysis is worth delving into.

I’m sick and tired of spending time trying to make sense of meta-analyses of psychological interventions that should have been dismissed out of hand. The likelihood of any contribution to the literature was ruled out by repeated, gross misapplication of meta-analysis by some authors  or, more often, the pathetic quality and quantity of literature available for meta-analysis.

Just recently, Retraction Watch reported the careful scrutiny of a pair of meta-analyses by two psychology graduate students, Paul-Christian Bürkner and Donald Williams. Coverage in Retraction Watch focused on their inability to get credit for the retraction of one of the papers that had occurred because of their critique.

But I was more saddened by their having spent so much time on the second meta-analysis, “A meta-analysis and theoretical critique of oxytocin and psychosis: Prospects for attachment and compassion in promoting recovery.” The authors of this meta-analysis had themselves acknowledged that the literature was quite deficient, but they proceeded anyway and published a paper that has already been cited 13 times.

The graduate students, as well as the original authors, could simply have taken a quick look at the study’s Table 1: the seven included studies had from 9 to 35 patients exposed to oxytocin. The study with 35 patients was an outlier. This study also provided only a within-subject effect size, which should not have been entered into the meta-analysis with the results of the other studies.

The six remaining studies had an average sample size of 14 in the intervention group. I doubt that anyone would have undertaken a study of psychotic patients inhaling oxytocin to generate a robust estimate of effect size with only 9, 10, or 11 patients. It’s unclear why the original investigators stopped accruing patients when they did.

Without having specified their sample size ahead of time (there is no evidence that the investigators did), the original investigators could simply have stopped when a peek at the data revealed statistically significant findings, or they could have kept accruing patients when a peek revealed only nonsignificant findings. Or they could have dropped some patients. Regardless, the reported samples are so small that adding only one or two more patients could substantially change the results.

Furthermore, if the investigators were struggling to get enough patients, the study was probably under-resourced and compromised in other ways. Small sample sizes compound the problems posed by poor methodology and reporting. The authors conducting this particular meta-analysis could confirm for only one of the studies that data from all patients who were randomized were analyzed, i.e., that there was an intention-to-treat analysis. Reporting was that bad, and worse. Again, think of the effect of losing the data of one or a few patients from the analysis: it could be decisive for the results, particularly when the loss was not random.
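
To put some numbers on how little such tiny trials can tell us, here is a minimal sketch using the standard large-sample standard-error formula for a standardized mean difference. The observed effect of d = 0.5 and the 14 patients per arm are illustrative assumptions, not figures taken from the oxytocin trials.

```python
from math import sqrt
from scipy.stats import norm

def smd_ci(d, n1, n2, alpha=0.05):
    """Approximate confidence interval for a standardized mean difference."""
    se = sqrt(1.0 / n1 + 1.0 / n2 + d ** 2 / (2.0 * (n1 + n2)))
    z = norm.ppf(1 - alpha / 2)
    return d - z * se, d + z * se

# Hypothetical trial: an observed d of 0.5 with 14 patients per arm.
low, high = smd_ci(0.5, 14, 14)
print(f"95% CI: ({low:.2f}, {high:.2f})")   # roughly (-0.25, 1.25)
```

A confidence interval that wide is compatible with anything from no effect at all to a very large one, which is exactly why a meta-analysis built on such trials cannot rescue them.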

Overall, the authors of the original meta-analysis conceded that the seven studies they were entering into the meta-analyses had a high risk of bias.

It should be apparent that authors cannot take a set of similarly flawed studies and integrate their effect sizes with a meta-analysis and expect to get around the limitations. Bottom line – readers should just dismiss the meta-analysis and get on to other things…

These well-meaning graduate students were wasting their time and talent carefully scrutinizing a pair of meta-analyses that were unworthy of their sustained attention. Think of what they could be doing more usefully. There is so much other bad science out there to uncover.

Everybody – I recommend not putting a lot of effort into analyzing obviously flawed meta-analyses, other than maybe posting a warning notice on PubMed Commons or ranting in a blog post, or both.

Detecting bad meta-analyses

Over a decade ago, I developed some quick assessment tools by which I can reliably determine that some meta-analyses are not worth our attention. You can see more about the quickly answered questions here.

To start such an assessment, go directly to the table describing the studies that were included in a published meta-analysis.

  1. Ask: “To what extent are the studies dominated by cell sample sizes of less than 35?” Studies of this size have only a power of about .50 to detect a moderate-sized effect. So, even if an effect were present, it would be detected only about half the time, even if all studies were being reported. (A quick power sketch appears after the quotation below.)
  2. Next, check whether whoever did the meta-analysis rated the included studies for risk of bias and how, if at all, risk of bias was taken into account in the meta-analysis.
  3. Finally, does the meta-analysis adequately deal with the clinical heterogeneity of the included studies? Is there a basis for giving a meaningful interpretation to a single summary effect size?

Combining studies may be inappropriate for a variety of the following reasons: differences in patient eligibility criteria in the included trials, different interventions and outcomes, and other methodological differences or missing information.  Moher et al., 1998
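
Here is the power sketch promised in item 1: a simple normal-approximation calculation of my own (not drawn from any of the meta-analyses discussed here) of the chance that a two-arm comparison detects a moderate effect of d = 0.5 with a two-sided alpha of 0.05.

```python
from math import sqrt
from scipy.stats import norm

def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power for a two-arm comparison of means."""
    se = sqrt(2.0 / n_per_group)        # standard error of the standardized difference
    z_crit = norm.ppf(1 - alpha / 2)    # two-sided critical value
    return norm.sf(z_crit - d / se)     # approximate probability of detecting the effect

print(round(approx_power(0.5, 35), 2))  # ~0.55: roughly a coin flip
print(round(approx_power(0.5, 15), 2))  # ~0.28 for the really small trials
```

With 35 per group the trial is roughly a coin flip for detecting a moderate effect, and the much smaller trials that dominate many of these literatures fare far worse.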

I have found this quick exercise often reveals that meta-analyses of psychological interventions are dominated by underpowered studies of low methodological quality that produce positive effects for interventions at a greater rate than would be expected. There is little reason to proceed to calculate a summary effect size.

The potholed road from a presentation to a publication.

My colleagues and I applied these criteria in a 2008 presentation to a packed audience at the European Health Psychology Conference in Bath. We undertook a similar exercise with four meta-analyses of behavioral interventions for adults (Dixon, Keefe, Scipio, Perri, & Abernethy, 2007; Hoffman, Papas, Chatkoff, & Kerns, 2007; Irwin, Cole, & Nicassio, 2006; and Jacobsen, Donovan, Vadaparampil, & Small, 2007) that had appeared in a new section of Health Psychology, Evidence-Based Treatment Reviews.

A sampling of what we found:

Irwin et al. The Irwin et al. meta-analysis had the stated objective of

comparing responses in studies that exclusively enrolled persons who were 55 years of age or older versus outcomes in randomized controlled trials that enrolled adults who were, on average, younger than 55 years of age (p. 4).

A quick assessment revealed that excluding small trials (n < 35) would have eliminated all of the studies of older adults; five studies included 15 or fewer participants per condition. Of the 15 studies of younger adults, only one would have remained.

Hoffman et al. We found that 17 of the 22 included trials fell below n = 35 per group. In response to our request, the authors graciously shared a table of the methodological quality of the included studies.

In 60% of the studies, intervention and control groups were not comparable on key variables at baseline.

Less than half provided adequate information concerning number of patients enrolled, treatment drop-out and reasons for drop-outs.

Only 15% of trials provided intent-to-treat analyses.

In a number of studies, the psychological intervention was part of the multicomponent package so that its unique contribution could not be determined. Often the psychological intervention was minimal. For instance, one study noted: “a lecture to give the patient an understanding that ordinary physical activity would not harm the disk and a recommendation to use the back and bend it.”

The only studies comparing a psychological intervention to an active control condition were three underpowered studies in which the effects of the psychological component could not be separated from the rest of the package in which it was embedded. In one of the studies, massage was the psychological intervention, but in another, it was the control condition.

Nonetheless, Hoffman et al. concluded: “The robust nature of these findings should encourage confidence among clinicians and researchers alike.”

As I readily demolished the meta-analyses, to the delight of the audience, I remarked something to the effect that I was glad the editor of Health Psychology was not there to hear what I was saying about articles published in the journal he edits.

But Robert Kaplan was there. He invited me for a beer as I left the symposium. He said that such critical probing was sorely lacking in the journal, and he invited my colleagues and me to submit an article. Eventually it would be published as:

Coyne JC, Thombs BD, Hagedoorn M. Ain’t necessarily so: Review and critique of recent meta-analyses of behavioral medicine interventions in health psychology. Health Psychology. 2010 Mar;29(2):107.

However, Kaplan first had an Associate Editor send out the manuscript for review. The manuscript was rejected based on a pair of reviews that were not particularly informative. One reviewer stated:

The authors level very serious accusations against fellow scientists and claim to have identified significant shortcomings in their published work. When this is done in public, the authors must have done their homework, dotted all the i’s, and crossed all the t’s. Instead, they reveal “we do not redo these meta-analyses or offer a comprehensive critique, but provide a preliminary evaluation of the adequacy of the conduct, reporting and clinical recommendations of these meta-analyses”. To be frank, this is just not enough when one accuses colleagues of mistakes, poor judgment, false inferences, incompetence, and perhaps worse.

In what he would later describe as the only time he did this in his term as editor of Health Psychology, Bob Kaplan overruled the unanimous recommendations of his associate editor and the two reviewers. He accepted a revision of our manuscript in which we tried to be clearer about the bases of our judgments.

According to Google Scholar, our “Ain’t necessarily so…” has been cited 53 times. Apparently it had little effect on the reception of the four meta-analyses. Hoffman et al. has been cited 599 times.

From a well-received workshop to a workshop canceled in order to celebrate a bad meta-analysis.

Mariët Hagedoorn and I gave a well-received workshop at the annual meeting of the Society of Behavioral Medicine the next year. A member of SBM’s Evidence-Based Behavioral Medicine Committee invited us to their committee meeting, held immediately after the workshop. We were invited to give the workshop again in two years, and I also became a member of the committee. I offered to be involved in future meta-analyses, learning that a number were planned.

I actually thought that I was involved in a meta-analysis of interventions for depressive symptoms among cancer patients. I immediately identified a study of problem-solving therapy for cancer patients that had such improbably large effect sizes that it should be excluded from any meta-analysis as an extreme outlier. The suggestion was appreciated.

But I heard nothing further about the meta-analysis until I was contacted by one of the authors, who said that my permission was needed for me to be acknowledged in the accepted manuscript. I refused. When I finally saw the published version of the manuscript in the prestigious Journal of the National Cancer Institute, I published a scathing critique, which you can read here. My critique has so far been cited once, the meta-analysis eighty times.

Only a couple of months before our workshop was scheduled to occur, I was told it had been canceled in order to clear the schedule for full press coverage of a new meta-analysis. I only learned of this when I emailed the committee concerning the specific timing of the workshop. The reply came from the first author of the new meta-analysis.

I have subsequently made the case that that meta-analysis was horribly done and horribly misleading to consumers in two blog posts:

Faux Evidence-Based Behavioral Medicine at Its Worst (Part I)

Faux Evidence-Based Behavioral Medicine Part 2

Some highlights:

The authors boasted of “robust findings” of “substantial rigor” in a meta-analysis that provided “strong evidence for psychosocial pain management approaches.” They claimed their findings supported the “systematic implementation” of these techniques.

The meta-analysis depended heavily on small trials. Of the 38 trials, 19 had fewer than 35 patients in the intervention or control group and so would have been excluded by application of this criterion.

Some of the smaller trials were quite small. One had 7 patients receiving an education intervention; another had 10 patients getting hypnosis; another, 15 patients getting education; another, 15 patients getting self-hypnosis; and still another, 8 patients getting relaxation and 8 patients getting CBT plus relaxation.

Two of what were by far the largest trials should have been excluded because they involved a complex intervention: patients received telephone-based collaborative care, which had a number of components, including support for adherence to medication.

It appears that listening to music, being hypnotized during a medical procedure, and being taught self-hypnosis over 52 sessions all fall under the rubric of skills training. Similarly, interactive educational sessions are considered equivalent to passing out informational materials and simply pamphleteering.

But here’s what most annoyed me about clinical and policy decisions being made on the basis of this meta-analysis:

Perhaps most importantly from a cancer pain control perspective, there was no distinguishing of whether the cancer pain was procedural, acute, or chronic. These types of pain take very different management strategies. In preparation for surgery or radiation treatment, it might be appropriate to relax or hypnotize the patient or provide soothing music. The efficacy could be examined in a randomized trial. But the management of acute pain is quite different and best achieved with medication. Here is where the key gap exists between the known efficacy of medication and the poor control in the community, due to professional and particularly patient attitudes. Control of chronic pain, months after any painful procedures, is a whole different matter, and based on studies of noncancer pain, I would guess that here is another place for psychosocial intervention, but that should be established in randomized trials.

Getting shushed about the sad state of research on couples interventions for cancer patients

One of the psychologists present at the SBM meeting published a meta-analysis of couples interventions in which I was thanked for my input in an acknowledgment. I did not give permission, and this notice was subsequently retracted.

Ioana Cristea, Nilufer Kafescioglu, and I subsequently submitted a critique to Psycho-Oncology. We were initially told it would be accepted as a letter to the editor, but then it was subjected to an extraordinary six uninformative reviews and rejected. The article that we critiqued was given special status as a featured article and distributed free by the otherwise paywalled journal.

A version of our critique was relegated to a blog post.

The complicated politics of meta-analyses supported by professional organizations.

Starting with our “Ain’t necessarily so…” effort, we were taking aim at meta-analyses making broad, enthusiastic claims about the efficacy and readiness for dissemination of psychological interventions. The Society of Behavioral Medicine was enjoying a substantial increase in membership, but like other associations dominated by psychologists, its new members were clinicians, not primarily academic researchers. SBM wanted to offer a branding of “evidence-based” to the psychological interventions for which these clinicians were seeking reimbursement. At the time, insurance companies were challenging whether licensed psychologists would be reimbursed for psychological interventions that were not administered to patients with psychiatric diagnoses.

People involved with the governance of SBM at the time cannot help but be aware of an ugly side to the politics back then. A small amount of money had been given by NCI to support meta-analyses and it was quite a struggle to control its distribution. That the SBM-sponsored meta-analyses were oddly published in the APA journal, Health Psychology, rather than SBM’s Annals of Behavioral Medicine reflected the bid for presidency of APA’s Division of Health Psychology by someone who had been told that she could not run for president of SBM. But worse, there was a lot of money and undeclared conflicts of interest in play.

Someone originally involved in the meta-analysis of interventions for depressive symptoms among cancer patients had received a $10 million grant from Pfizer to develop a means of monitoring cancer surgeons’ inquiring about psychological distress and their offering of interventions. The idea (which was actually later mandated) was that cancer surgeons could not close their electronic records until they had indicated that they had asked the patient about psychological distress. If a patient reported distress, the surgeons had to indicate what intervention was offered to the patient. Only then could they close the medical record. Of course, these requirements could be met simply by asking if a breast cancer patient was distressed and offering her an antidepressant without any formal diagnosis or follow-up. These procedures were mandated as part of accreditation of facilities providing cancer care.

Psycho-Oncology, the journal with which we skirmished about the meta-analysis of couples interventions, was the official publication of the International Psycho-Oncology Society, another organization dominated by clinicians seeking reimbursement for services to cancer patients.

You can’t always get what you want.

I nonetheless encourage others, particularly early-career investigators, to take up the tools that I offer. Please scrutinize meta-analyses that would otherwise have clinical and public policy recommendations attached to their findings. You may have trouble getting published, and you will be sorely disappointed if you expect to influence the reception of an already published meta-analysis. You can always post your critiques at PubMed Commons.

You will learn important skills and the politics of trying to publish critiques of papers that are protected as having been “peer reviewed.” If enough of you do this and visibly complain about how ineffectual your efforts have been, we may finally overcome the incumbent advantage and protection from further criticism that goes with getting published.

And bloggers like myself and Hilda Bastian will recognize you and express appreciation.

 

 

Should have seen it coming: Once high-flying Psychological Science article lies in pieces on the ground

Life is too short for wasting time probing every instance of professional organizations promoting bad science when they have an established record of doing just that.

There were lots of indicators that that’s what we were dealing with in the Association for Psychological Science’s (APS) recent campaign for the now discredited and retracted ‘sadness prevents us from seeing blue’ article.

A quick assessment of the press release should have led us to dismiss the claims being presented and convinced us to move on.

Readers can skip my introductory material by jumping down this blog post to [*] to see my analysis of the APS press release.

Readers can also still access the original press release, which has now disappeared from the web, here. Some may want to read the press release and form their own opinions before proceeding into this blog post.

What, I’ve stopped talking about the PACE trial? Yup, at least at Mind the Brain, for now. But you can go here for the latest in my continued discussion of the PACE trial of CBT for chronic fatigue syndrome, in which I moved from critical observer to activist a while ago.

Before we were so rudely interrupted  by the bad science and bad media coverage of the PACE trial, I was focusing on how readers can learn to make quick assessments of hyped media coverage of dubious scientific studies.

In “Sex and the single amygdala”  I asked:

Can skeptics who are not specialists, but who are science-minded and have some basic skills, learn to quickly screen and detect questionable science in the journals and its media coverage?

The counter argument of course is Chris Mooney telling us “You Have No Business Challenging Scientific Experts”. He cites

“Jenny McCarthy, who once remarked that she began her autism research at the “University of Google.”

But while we are on the topic of autism, how about the counter example of The Lancet’s coverage of the link between vaccines and autism? This nonsense continues to take its toll on American children whose parents – often higher income and more educated than the rest – refused to vaccinate them on the basis of a story that started in The Lancet. Editor Richard Horton had to concede

horton on lancet autism failure

 

 

 

If we accept Chris Mooney’s position, we are left at the mercy of press releases cranked out by professional organizations like the Association for Psychological Science (APS) that repeatedly demand that we revise our thinking about human nature and behavior, as well as change our behavior if we want to extend our lives and live happier, all on the basis of a single “breakthrough” study. Rarely do APS press releases have any follow-up as to the fate of a study they promoted. One has to hope that PubPeer or PubMed Commons picks up on the article touted in the press release, so we can see what a jury of post-publication peers decides.

As we have seen in my past Mind the Brain posts, there are constant demands on our attention from press releases generated from professional organizations, university press officers, and even NIH alerting us to supposed breakthroughs in psychological and brain science. Few such breakthroughs hold up over time.

Are there no alternatives?

Are there no alternatives to our simply deferring to the expertise being offered or taking the time to investigate for ourselves claims that are likely to prove exaggerated or simply false?

We should approach press releases from the APS – or from its rival, the American Psychological Association – using prior probabilities to set our expectations. The Open Science Collaboration: Psychology (OSC) article in Science presented the results of a systematic attempt to replicate 100 findings from prestigious psychological journals, including APS’s Psychological Science and APA’s Journal of Personality and Social Psychology. Fewer than half of the findings were replicated. Findings from the APS and APA journals fared worse than the others.

So, our prior probability is that declarations of newsworthy, breakthrough findings trumpeted in press releases from psychological organizations are likely to be false or exaggerated – unless we assume that the publicity machines prefer the trustworthy over the exciting and newsworthy in the articles they select to promote.
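
A toy Bayes calculation can make the point about prior probabilities concrete. The base rate of 0.4 is a rough stand-in for the “fewer than half” replication figure above; the selection probabilities are assumptions I am inventing purely for illustration.

```python
# Toy illustration of setting expectations with prior probabilities.
p_replicates = 0.4               # assumed base rate, standing in for "fewer than half"
p_touted_if_replicates = 0.3     # assumption: solid findings get touted less often
p_touted_if_not = 0.7            # assumption: surprising, fragile findings get touted more

numerator = p_replicates * p_touted_if_replicates
denominator = numerator + (1 - p_replicates) * p_touted_if_not
print(round(numerator / denominator, 2))   # ~0.22: prior that a touted "breakthrough" holds up
```

Under these made-up but plausible-looking assumptions, the probability that a press-released finding would survive replication drops well below even the discouraging base rate.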

I will guide readers through a quick assessment of the APS press release, which I started for this post before getting swept up in the PACE controversy. However, in the intervening time, there have been some extraordinary developments, which I will then briefly discuss. We can use these developments to validate my evaluation of the press release, and your own. Surprisingly, there is little overlap between the issues I note in the press release and what concerned the post-publication commentators.

*A running commentary based on screening the press release

What once was a link to the “feeling blue and seeing blue” article now takes one only to

[Screenshot: retraction notice]

Fortunately, the original press release can still be reached here. The original article is preserved here.

My skepticism was already high after I read the opening two paragraphs of the press release:

The world might seem a little grayer than usual when we’re down in the dumps and we often talk about “feeling blue” — new research suggests that the associations we make between emotion and color go beyond mere metaphor. The results of two studies indicate that feeling sadness may actually change how we perceive color. Specifically, researchers found that participants who were induced to feel sad were less accurate in identifying colors on the blue-yellow axis than those who were led to feel amused or emotionally neutral.

“Our results show that mood and emotion can affect how we see the world around us,” says psychology researcher Christopher Thorstenson of the University of Rochester, first author on the research. “Our work advances the study of perception by showing that sadness specifically impairs basic visual processes that are involved in perceiving color.”

What Anglocentric nonsense. First, blue as a metaphor for sadness does not occur in most languages other than English and Serbian. In German, calling someone blue suggests the person is drunk. In Russian, it suggests the person is gay. In Arabic, if you say you are having a blue day, it is a bad one. But if you say in Portuguese that “everything is blue,” it means everything is fine.

In Indian culture, blue is more associated with happiness than sadness, probably traceable to the blue-skinned Krishna being associated with divine and human love in Hinduism. In Catholicism, the Virgin Mary is often depicted wearing blue, and so the color has come to be associated with calmness and truth.

We are off to a bad start. Going to the authors’ description of their first of two studies, we learn:

In one study, the researchers had 127 undergraduate participants watch an emotional film clip and then complete a visual judgment task. The participants were randomly assigned to watch an animated film clip intended to induce sadness or a standup comedy clip intended to induce amusement. The emotional effects of the two clips had been validated in previous studies and the researchers confirmed that they produced the intended emotions for participants in this study.

Oh no! This is not a study of clinical depression, but another study of normal college students “made sad” with a mood induction.

So-called mood induction tasks don’t necessarily change actual mood state, but they do convey to research participants what is expected of them and how they are supposed to act. In one of the earliest studies I ever did, we described a mood induction procedure to subjects without actually having them experience it. We then asked them to respond as if they had received it. Their responses were indistinguishable from those of participants who actually underwent the induction. We concluded that we could not rule out that what were considered effects of a mood induction task were simply demand characteristics, that is, what research participants perceive as instructions as to how they should behave.

It was fashionable way back then for psychology researchers who were isolated in departments that did not have access to clinically depressed patients to claim that they were nonetheless conducting analog studies of depression. Subjecting students to unsolvable anagram tasks or uncontrollable loud noises was seen as inducing learned helplessness in them, thereby allowing investigators an analog study of depression. We demonstrated a problem with that idea. If students believed that the next task that they were administered was part of the same experiment, they performed poorly, as if they were in a state of learned helplessness or depression. However, if they believed that the second task was unrelated to the first, they would show no such deficits. Their negative state of helplessness or depression was confined to their performance in what they thought was the same setting in which the induction had occurred. Shortly after our experiments, Marty Seligman wisely stopped doing studies “inducing” learned helplessness in humans, but he continued to make the same claims about the studies he had done.

Analog studies of depression disappeared for a while, but I guess they have come back into fashion.

But the sad/blue experiment could also be seen as a priming experiment. The research participants were primed by the film clip, and their responses to a color-naming task were then examined.

It is fascinating that neither the press release nor the article itself ever mentioned the word priming. It was only a few years ago that APS press releases were crowing about priming studies. For instance, a 2011 press release entitled “Life is one big priming experiment…” declared:

One of the most robust ideas to come out of cognitive psychology in recent years is priming. Scientists have shown again and again that they can very subtly cue people’s unconscious minds to think and act certain ways. These cues might be concepts—like cold or fast or elderly—or they might be goals like professional success; either way, these signals shape our behavior, often without any awareness that we are being manipulated.

Whoever wrote that press release should be embarrassed today. In the interim, priming effects have not proven robust. Priming studies that cannot be replicated have figured heavily in the assessment that the psychological literature is untrustworthy. Priming studies also figure heavily in the 56 retracted studies of fraudster psychologist Diederik Stapel. He claims that he turned to inventing data when his experiments failed to demonstrate priming effects that he knew were there. Yet, once he resorted to publishing studies with fabricated data, others claimed to replicate his work.

I made up research, and wrote papers about it. My peers and the journal editors cast a critical eye over it, and it was published. I would often discover, a few months or years later, that another team of researchers, in another city or another country, had done more or less the same experiment, and found the same effects.  My fantasy research had been replicated. What seemed logical was true, once I’d faked it.

So, we have an APS press release reporting a study that assumes the association between sadness and the color blue is so hardwired and culturally universal that it is reflected in basic visual processes. Yet the study does not involve clinical depression, only an analog mood induction, and a closer look reveals that once again APS is pushing a priming study. I think it’s time to move on. But let’s read on:

The results cannot be explained by differences in participants’ level of effort, attention, or engagement with the task, as color perception was only impaired on the blue-yellow axis.

“We were surprised by how specific the effect was, that color was only impaired along the blue-yellow axis,” says Thorstenson. “We did not predict this specific finding, although it might give us a clue to the reason for the effect in neurotransmitter functioning.”

The researchers note that previous work has specifically linked color perception on the blue-yellow axis with the neurotransmitter dopamine.

The press release tells us that the finding is very specific, occurring only on the blue-yellow axis and not the red-green axis, and that differences are not found in level of effort, attention, or engagement with the task. The researchers did not expect such a specific finding; they were surprised.

The press release wants to convince us of an exciting story of novelty and breakthrough. A skeptic sees it differently: this is an isolated finding, unanticipated by the researchers, getting all dressed up. See, we should’ve moved on.

The press release wants us to find the evidence exciting because it is specific and novel. The researchers are celebrating the specificity of their finding, but the blue-yellow axis effect may be the only statistically significant one simply because of chance or an artifact.

And bringing up unmeasured “neurotransmitter functioning” is pretentious and unwise. I challenge the researchers to show that the effects of watching a brief movie clip register in measurable changes in neurotransmitters. I’m skeptical even about whether depressed persons drawn from community or outpatient samples reliably differ from non-depressed persons on measures of the neurotransmitter dopamine.

“This is new work and we need to take time to determine the robustness and generalizability of this phenomenon before making links to application,” he concludes.

Claims in APS press releases are not known for their “robustness and generalizability.” I don’t think this particular claim should prompt an effort at independent replication when scientists have so many more useful things to keep them busy.

Maybe these investigators should have checked robustness and generalizability before rushing into print. Maybe APS should stop pestering us with findings that surprise researchers and that have not yet been replicated.

A flying machine in pieces on the ground

Sadness impairs color perception was sent soaring high, lifted by an APS press release that has since been removed from the web but is still available here. The press release was initially echoed uncritically, usually cut-and-pasted or outright churnaled, in over two dozen media mentions.

But, alas, Sadness impairs color perception is now a flying machine in pieces on the ground.

Notice of the article’s problems seems to have started with some chatter among skeptically-minded individuals on Twitter, which led to comments at PubPeer, where the article was torn to pieces. What unfolded was a wonderful demonstration of crowdsourced post-publication peer review in action. Lesson: PubPeer rocks and can overcome the failure of pre-publication peer review to keep bad stuff out of the literature.

You can follow the thread of comments at PubPeer.

  • An anonymous skeptic started off by pointing out an apparent lack of a significant statistical effect where one was claimed.
  • There was an immediate call for a retraction, but it seemed premature.
  • Soon re-analyses of the data from the paper were being reported, confirming the lack of a significant statistical effect when analyses were done appropriately and reported transparently.
  • The data set for the article was mysteriously changed after it had been uploaded.
  • Doubts were expressed about the integrity of the data – had they been tinkered with?
  • The data disappeared.
  • There was an announcement of a retraction.

The retraction notice indicated that the researchers were still convinced of the validity of their hypothesis, despite deciding to retract their paper.

We remain confident in the proposition that sadness impairs color perception, but would like to acquire clearer evidence before making this conclusion in a journal the caliber of Psychological Science.

The retraction note also carries a curious Editor’s note:

Although I believe it is already clear, I would like to add an explicit statement that this retraction is entirely due to honest mistakes on the part of the authors.

Since then, doubts have been expressed about whether retraction was a sufficient response or whether something more is needed. Some of the participants in the PubPeer discussion drafted a letter to the editor incorporating their reanalyses and prepared to submit it to Psychological Science. Unfortunately, having succeeded in getting the bad science retracted, these authors reduced the likelihood of their reanalysis being accepted by Psychological Science. As of this date, their fascinating account remains unpublished but is available on the web.

Postscript

Next time you see an APS or APA press release, what will be your starting probabilities about the trustworthiness of the article being promoted? Do you agree with Chris Mooney that you should simply defer to the expertise of the professional organization?

Why would professional organizations risk embarrassment with these kinds of press releases? Apparently it is worth the risk. Such press releases can echo through the conventional and social media and attract early attention to an article. The game is raising the journal impact factor (JIF).

Although it is unclear precisely how journal impact factors are calculated, the number reflects the average number of citations an article obtains within two years of publication. However, if press releases promote “early releases” of articles, the journal can acquire citations before the clock starts ticking on those two years. APS and APA are in intense competition for the prestige of their journals and for membership. It matters greatly to them which organization can claim the most prestigious journals, as demonstrated by their JIFs.
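For readers who want the arithmetic, here is a minimal sketch of the standard two-year calculation; the citation and article counts below are invented for illustration, not drawn from any real journal.

```python
# Minimal sketch of the standard two-year journal impact factor (JIF).
# The counts below are invented for illustration only.

def impact_factor(citations_to_prior_two_years, citable_items_prior_two_years):
    """JIF for year Y: citations in Y to items published in Y-1 and Y-2,
    divided by the number of citable items published in Y-1 and Y-2."""
    return citations_to_prior_two_years / citable_items_prior_two_years

# A hypothetical journal with 300 citable articles over two years and 1,800
# citations to them in the current year would report a JIF of 6.0.
print(impact_factor(1800, 300))

# Attention drummed up around an "early release" generates citations before the
# two-year window formally opens, which is the gaming described above.
```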

So, press releases are important for garnering early attention. Apparently breakthroughs, innovations, and “first ever” claims matter more than trustworthiness. And the professional organizations hope we won’t remember the fate of past claims.

 

Sex and the single amygdala: A tale almost saved by a peek at the data

So sexy! Was bringing up ‘risky sex’ merely a strategy to publish questionable and uninformative science?

My continuing question: Can skeptics who are not specialists, but who are science-minded and have some basic skills, learn to quickly screen and detect questionable science in the journals and media coverage?

“You don’t need a weatherman to know which way the wind blows.” – Bob Dylan

I hope so. One goal of my blogging is to arouse readers’ skepticism and provide them some tools so that they can decide for themselves what to believe, what to reject, and what needs a closer look or a check against trusted sources.

Skepticism is always warranted in science, but it is particularly handy when confronting the superficial application of neuroscience to every aspect of human behavior. Neuroscience is increasingly being brought into conversations to sell ideas and products when it is neither necessary nor relevant. Many claims about how the brain is involved are false or exaggerated not only in the media, but in the peer-reviewed journals themselves.

A while ago I showed how a neuroscientist and a workshop guru teamed up to try to persuade clinicians with functional magnetic resonance imaging (fMRI) data that a couples therapy was more sciencey than the rest. Although I took a look at some complicated neuroscience, a lot of my reasoning [1, 2, 3] merely involved applying basic knowledge of statistics and experimental design. I raised sufficient skepticism to dismiss the neuroscientist and psychotherapy guru’s claims, even putting aside the excellent specialist insights provided by Neurocritic and his friend Magneto.

In this issue of Mind the Brain, I’m pursuing another tip from Neurocritic about some faulty neuroscience in need of debunking.

The paper

Victor, E. C., Sansosti, A. A., Bowman, H. C., & Hariri, A. R. (2015). Differential Patterns of Amygdala and Ventral Striatum Activation Predict Gender-Specific Changes in Sexual Risk Behavior. The Journal of Neuroscience, 35(23), 8896-8900.

Unfortunately, the paper is behind a pay wall. If you can’t get it through a university library portal, you can send a request for a PDF to the corresponding author, elizabeth.victor@duke.edu.

The abstract

Although the initiation of sexual behavior is common among adolescents and young adults, some individuals express this behavior in a manner that significantly increases their risk for negative outcomes including sexually transmitted infections. Based on accumulating evidence, we have hypothesized that increased sexual risk behavior reflects, in part, an imbalance between neural circuits mediating approach and avoidance in particular as manifest by relatively increased ventral striatum (VS) activity and relatively decreased amygdala activity. Here, we test our hypothesis using data from seventy 18- to 22-year-old university students participating in the Duke Neurogenetics Study. We found a significant three-way interaction between amygdala activation, VS activation, and gender predicting changes in the number of sexual partners over time. Although relatively increased VS activation predicted greater increases in sexual partners for both men and women, the effect in men was contingent on the presence of relatively decreased amygdala activation and the effect in women was contingent on the presence of relatively increased amygdala activation. These findings suggest unique gender differences in how complex interactions between neural circuit function contributing to approach and avoidance may be expressed as sexual risk behavior in young adults. As such, our findings have the potential to inform the development of novel, gender-specific strategies that may be more effective at curtailing sexual risk behavior.

My thought processes

Hmm, sexual risk behavior – meaning number of partners? How many new partners during a follow-up period constitutes “risky,” and does it matter whether safe sex was practiced? Well, ignoring these issues and calling it “sexual risk behavior” allows the authors to claim relevance to hot topics like HIV prevention….

But let’s cut to the chase: I’m always skeptical about a storyline depending on a three-way statistical interaction. These effects are highly unreliable, particularly in a sample size of only N = 70. I’m suspicious when investigators stake their claims ahead of time on a three-way interaction rather than something simpler. I will be looking for evidence that they started with this hypothesis in mind, rather than cooking it up after peeking at the data.
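To see just how unreliable, here is a minimal simulation sketch. Nothing in it comes from the paper: the variable names, the assumed small effect, and the noise are all hypothetical. It simply asks how often a three-way interaction reaches p < .05 in repeated samples of N = 70.

```python
# Minimal simulation sketch (hypothetical data, not the authors' data):
# how often does a three-way interaction reach p < .05 with N = 70?
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n, true_b = 70, 0.15          # small true three-way effect, assumed for illustration
pvals = []
for _ in range(1000):
    df = pd.DataFrame({
        "vs": rng.standard_normal(n),    # "ventral striatum" activity, standardized
        "amy": rng.standard_normal(n),   # "amygdala" activity, standardized
        "male": rng.integers(0, 2, n),   # gender indicator
    })
    df["partners"] = true_b * df.vs * df.amy * df.male + rng.standard_normal(n)
    fit = smf.ols("partners ~ vs * amy * male", data=df).fit()
    pvals.append(fit.pvalues["vs:amy:male"])

print("share of samples with p < .05:", np.mean(np.array(pvals) < 0.05))
# With n = 70 and a small true effect, only a minority of samples reach p < .05,
# so any single significant three-way interaction is a shaky basis for a storyline.
```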

Three-way interactions involve dividing a sample up into eight boxes, in this case 2 x 2 x 2. Such interactions can be mind-boggling to interpret, and this one is no exception:

Although relatively increased VS activation predicted greater increases in sexual partners for both men and women, the effect in men was contingent on the presence of relatively decreased amygdala activation and the effect in women was contingent on the presence of relatively increased amygdala activation.

And then the “simple” interpretation?

These findings suggest unique gender differences in how complex interactions between neural circuit function contributing to approach and avoidance may be expressed as sexual risk behavior in young adults.

And the public health implications?

As such, our findings have the potential to inform the development of novel, gender-specific strategies that may be more effective at curtailing sexual risk behavior.

Just how should these data inform public health strategies beyond what we knew before we stumbled upon this article? Really, should we stick people’s heads in a machine and gather fMRI data before offering them condoms? Should we encourage computer dating services to post, along with a recent headshot, recent fMRI images showing that prospective dates do not have their risky-behavior centers in the amygdala activated? Or encourage young people to get their heads examined with an fMRI before deciding whether it’s wise to sleep with somebody new?

So it’s difficult to see the practical relevance of these findings, but let’s stick around and consider the paragraph that Neurocritic singled out.

The paragraph

The majority of the sample reported engaging in vaginal sex at least once in their lifetime (n = 42, 60%). The mean number of vaginal sexual partners at baseline was 1.28 (SD =0.68). The mean increase in vaginal sexual partners at the last follow-up was 0.71 (SD = 1.51). There were no significant differences between men and women in self-reported baseline or change in self-reported number of sexual partners (t=0.05, p=0.96; t=1.02, p= 0.31, respectively). Although there was not a significant association between age and self-reported number of partners at baseline (r = 0.17, p= 0.16), younger participants were more likely to report a greater increase in partners over time (r =0.24, p =0.04). Notably, distribution analyses revealed two individuals with outlying values (3 SD from M; both subjects reported an increase in 8 partners between baseline and follow up). Given the low rate of sexual risk behavior reported in the sample, these outliers were not excluded, as they likely best represent young adults engaging in sexual risk behavior.

What triggers skepticism?

This paragraph is quite revealing if we just ponder it a bit.

First, notice there is only a single significant correlation (p = .04), and it comes from an exploratory analysis. Differences between men and women were examined, with no significant findings for either baseline number of sexual partners or change over the length of observation. Setting that aside, the authors went on to examine change in number of partners over time in relation to age and, bingo, there was their p = 0.04, with younger participants reporting a greater increase.

Whoa! Age was never mentioned in the abstract. We are now beyond the 2 x 2 x 2 interaction mentioned in the abstract and rooting through another dimension, younger versus older.

But, worse, getting that significance required retaining two participants with eight new sexual partners each during the follow-up period. The decision to retain these participants was made after the pattern of results was examined with and without inclusion of these outliers. The authors say so and essentially say they decided because it made a better story.

The only group means and standard deviations reported include these two participants. Even with them included, the average number of new sexual partners over the follow-up was less than one. We have no idea whether that one was risky or not. It’s a safer assumption that having eight new partners is risky, but even that we don’t know for sure.

Keep in mind for future reference: Investigators are supposed to make decisions about outliers without reference to the fate of the hypothesis being studied. And knowing nothing about this particular study, most authorities would say if two people out of 70 are way out there on a particular variable that otherwise has little variance, you should exclude them.
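For concreteness, here is a minimal sketch of what a pre-specified rule looks like. The numbers are invented, not the study’s data; the 3 SD cutoff simply mirrors the one the authors mention.

```python
# Sketch of a pre-specified 3-SD outlier rule, applied before looking at any
# hypothesis tests. The data below are invented for illustration.
import numpy as np

change_in_partners = np.array([0] * 40 + [1] * 20 + [2] * 8 + [8, 8])  # n = 70, two extreme values

z = (change_in_partners - change_in_partners.mean()) / change_in_partners.std(ddof=1)
keep = np.abs(z) < 3

print("flagged as outliers:", change_in_partners[~keep])        # the two 8s
print("mean with outliers:    %.2f" % change_in_partners.mean())
print("mean without outliers: %.2f" % change_in_partners[keep].mean())
# The point is the order of operations: decide the rule first, then apply it,
# rather than choosing after seeing which choice produces the better story.
```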

It is considered a Questionable Research Practice to make decisions about inclusion/exclusion based on what story the outcome of this decision allows the authors to tell. It is p-hacking, and significance chasing.

And note the distribution of numbers of vaginal sex partners. Twenty-eight participants had had none by the end of the study. Most accumulated fewer than one new partner during the follow-up, and even that mean was distorted by the two participants with eight partners each. Hmm, it is going to be hard to get multivariate statistics to work appropriately when we get to the fancy neuroscience data. We could go off on discussions of multivariate normal or Poisson distributions, or just think a bit.
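A quick back-of-the-envelope check, using only the figures reported in the quoted paragraph, shows how much those two participants pull the mean around.

```python
# Back-of-the-envelope check using the reported figures:
# mean increase in partners = 0.71 across n = 70, with two participants at +8 each.
n, reported_mean, outlier_total = 70, 0.71, 2 * 8

total_new_partners = reported_mean * n                              # roughly 50 new partners in all
mean_without_outliers = (total_new_partners - outlier_total) / (n - 2)
print(round(mean_without_outliers, 2))                              # roughly 0.50

# Two people contribute roughly a third of all the "new partners" in the sample,
# which is why the group mean is a poor summary of this skewed count variable.
```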

We can do a little detective work and determine that one outlier was a male, the other a female. (*1) Let’s go back to our eight little boxes of participants involved in the interpretation of the three-way interaction. It is going to make a great difference exactly which boxes the deviant male and female are dropped into, or whether they are left out.

And think about sampling issues. What if, for reasons having nothing to do with the study, neither of these outliers had shown up? Or if only one of them had shown up, the results would be skewed in a particular direction, depending on whether that participant was the male or the female.

Okay, if we were wasting our time continuing to read the article after finding what we did in the abstract, we are certainly wasting more of our time by continuing after reading this paragraph. But let’s keep poking around as an educational exercise.

The rest of the methods and results sections

We learn from the methods section that there was an ethnically diverse sample with a highly variable follow-up, from zero days to more than three years (M = 188.72 d, SD = 257.15; range = 0 d–3.19 years). And there were only 24 men in the paper’s sample of 70 participants.

We don’t know whether these two outliers had eight sexual partners within a week of the first assessment or whether they were among those followed the longest. That matters somewhat, but we also have to worry whether this was an appropriate sample – with so few participants in the first place and even fewer who had sex by the end of the study – and an appropriate length of follow-up for such a study. The mean follow-up of about six months, with a huge standard deviation, suggests there is not a lot of evidence of risky behavior, at least in terms of casual vaginal sex.

This is all getting very funky.

So I wondered about the larger context of the study, with increasing doubts that the authors had gone to all this trouble just to test an a priori hypothesis about risky sex.

We are told that the larger context is the ongoing “Duke Neurogenetics Study (DNS), which assesses a wide range of behavioral and biological traits.” The extensive list of inclusions and exclusions suggests a much more ambitious study. If we had more time, we could look up the Duke Neurogenetics Study and see if that’s the case. But I have a strong suspicion that the study was not organized around the specific research questions of this paper (*2). I really can’t tell without preregistration of this particular paper, but I certainly have questions about how much Hypothesizing after the Results Are Known (HARKing) is going on here in the refining of hypotheses and measures, and in decisions about which data to report.

Further explorations of the results section

I remind readers that I know little about fMRI data. Put that aside, and we can still discover some interesting things reading through the brief results section.

Main effects of task

As expected, our fMRI paradigms elicited robust affect-related amygdala and reward-related VS activity across the entire parent sample of 917 participants (Fig. 1). In our substudy sample of 70 participants, there were no significant effects of gender (t(70) values < 0.88, p values >0.17) or age (r values < 0.22; p values > 0.07) on VS or amygdala activity in either hemisphere.

Hmm, let’s focus on the second sentence first. The authors tell us absolutely nothing is going on in terms of differences in amygdala and reward-related VS activity in relation to age and gender in the sample of 70 participants in the current study. In fact, we don’t even need to know what “amygdala and reward-related VS activity” is to wonder why the first sentence of this paragraph directs us to a graph not of the 70 participants, but of a larger sample of 917 participants. And when we go to Figure 1, we see some wild wowie zowie, hit-the-reader-between-the-eyes differences (in technical terms, intraocular trauma) for women. And claims of p < 0.000001, twice. But wait! One might think significance of that magnitude would have to come from the 917 participants, except that the labeling of the x-axis must come from the substudy of the 70 participants for whom data concerning number of sex partners were collected. Maybe the significance comes from the anchoring of one of the graph lines by the one way-out outlier.

Note that the outlier woman with eight partners anchors the blue line for High Left Amygdala. Without inclusion of that single woman, the nonsignificant trends between women with High Left Amygdala versus women with Low Left Amygdala would be reversed.

The authors make much of the differences between Figure 1 showing Results for Women and Figure 2 showing Results for Men. The comparison seems dramatic except that, once again, the one outlier sends the red line for Low Left Amygdala off from the blue line for High Left Amygdala. Otherwise, there is no story to tell. Mind-boggling, but I think we can safely conclude that something is amiss in these Frankenstein graphs.

Okay, we should stop beating a corpse of an article. There are no vital signs left.

Alternatively, we could probe the section on Poisson regressions and minimally note some details. There is the flash of some strings of zeros in the p values, but it seems complicated, and then we are warned off with “no factors survive Bonferroni correction.” And then in the next paragraph, we get to exploring dubious interactions. And there is the final insult of the authors bringing in a two-way interaction trending toward significance among men, p = .051.
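For readers unfamiliar with the correction being invoked, a minimal sketch; the number of tests is hypothetical, since the paper’s full count of comparisons is not spelled out here.

```python
# Minimal sketch of a Bonferroni correction. With m tests, each p value must
# fall below alpha / m to be declared significant. The m here is hypothetical.
alpha = 0.05
m = 12                      # assumed number of comparisons, for illustration only
threshold = alpha / m
print(round(threshold, 4))  # 0.0042

# Even a seemingly flashy p value has to clear this much stricter bar, and the
# two-way interaction at p = .051 fails even the uncorrected 0.05 threshold.
```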

But we were never told how all this would lead, as we were promised at the end of the abstract, “to the development of novel, gender-specific strategies that may be more effective at curtailing sexual risk behavior.”

Rushing through the discussion section, we note the disclosure that

The nature of these unexpected gender differences is unclear and warrants further consideration.

So, the authors confess that they did not start with expectations of finding a gender difference. They had nothing to report from a subset of data from an ambitious project put together for other purposes, with a follow-up (and even an experimental task) ill-suited to the research question. They made a decision to include two outliers, salvaged some otherwise weak and inconsistent differences, and then constructed a story that depended on their inclusion. Bingo, with confirmation bias working in their favor, they get published.

Readers might have been left with just their skepticism about the three-way interaction described in the abstract. However, the authors implicated themselves by disclosing in the article their examination of the distribution and their reasons for including the outliers. They further disclosed that they did not start with a hypothesis about gender differences.

Why didn’t the editor and reviewers at Journal of Neuroscience (impact factor 6.344) do their job and cry foul? Questionable research practices (QRPs) are brought to us courtesy of questionable publication practices (QPPs).

And then we end with the confident

These limitations notwithstanding, our current results suggest the importance of considering gender-specific patterns of interactions between functional neural circuits supporting approach and avoidance in the expression of sexual risk behavior in young adults.

Yet despite this vague claim, the authors still haven’t explained how this research could be translated to practice.

Takeaway points for the future

Without the tip from Neurocritic, I might not otherwise have zeroed in on the dubious complex statistical interaction on which the storyline in the abstract depended. I also benefited from the authors, for whatever reason, telling us that they had peeked at the data, and telling us further in the discussion that they had not anticipated the gender difference. With current standards for transparency and no preregistration of such studies, it would have been easy to miss what was done, because the authors did not need to alert us. Until more and better standards are enforced, we just need to be extra skeptical of claims about the application of neuroscience to everyday life.

Trust your skepticism.

Apply whatever you know about statistics and experimental methods. You probably know more than you think you do.

Beware of modest-sized neuroscience studies in which the authors develop storylines from patterns they discover in their data, not from a priori hypotheses suggested by theory. If you keep looking around the scientific literature and media coverage of it, I think you will find a lot of these QRPs and QPPs.

Don’t go into a default believe-it mode just because an article is peer-reviewed.

Notes

  1. If both the outliers had been of the same gender, it would have been enough for that gender to have had significantly more sex partners than the other.
  2. Later we are told in the Discussion section that the particular stimuli for which fMRI data were available were not chosen for their relevance to the research question claimed for this paper.

We did not measure VS and amygdala activity in response to sexually provocative stimuli but rather to more general representations of reward and affective arousal. It is possible that variability in VS and amygdala activity to such explicit stimuli may have different or nonexistent gender-specific patterns that may or may not map onto sexual risk behaviors.

Special thanks to Neurocritic for suggesting this blog post and for feedback, as well as to Neuroskeptic, Jessie Sun, and Hayley Jach for helpful feedback. However, @CoyneoftheRealm bears sole responsibility for any excesses or errors in this post.