“It’s certainly not bareknuckle:” Comments to a journalist about a critique of mindfulness research

We can’t assume authors of mindfulness studies are striving to do the best possible science, including being prepared for the possibility of being proven incorrect by their results.

I recently had a Skype interview with science journalist Peter Hess concerning an article in Psychological Science.

Peter was exceptionally prepared and had a definite point of view, but was open to what I said. In the end, he seemed to be persuaded by me on a number of points. The resulting article in Inverse faithfully conveyed my perspective and juxtaposed quotes from me with those from an author of the Psych Science piece in a kind of debate.

My point of view

When evaluating an article about mindfulness in a peer-reviewed journal, we need to take into account that authors may not necessarily be striving to do the best science, but to maximally benefit their particular brand of mindfulness, their products, or the settings in which they operate. Many studies of mindfulness are little more than infomercials, weak research intended only to get mindfulness promoters’ advertisement of themselves into print or to allow the labeling of claims as “peer-reviewed”. Caveat Lector.

We cannot assume authors of mindfulness studies are striving to do the best possible science, including being prepared for the possibility of being proven incorrect by their results. Rather, they may simply be trying to get the strongest possible claims through peer review, ignoring best research practices and best publication practices.

Psychologists Express Growing Concern With Mindfulness Meditation

“It’s not bare-knuckle, that’s for sure.”

There was much from the author of the Psych Science article with which I would agree:

“In my opinion, there are far too many organizations, companies, and therapists moving forward with the implementation of ‘mindfulness-based’ treatments, apps, et cetera before the research can actually tell us whether it actually works, and what the risk-reward ratio is,” corresponding author and University of Melbourne research fellow Nicholas Van Dam, Ph.D. tells Inverse.

Bravo! And

“People are spending a lot of money and time learning to meditate, listening to guest speakers about corporate integration of mindfulness, and watching TED talks about how mindfulness is going to supercharge their brain and help them live longer. Best case scenario, some of the advertising is true. Worst case scenario: very little to none of the advertising is true and people may actually get hurt (e.g., experience serious adverse effects).”

But there were some statements that renewed the discomfort and disappointment I experienced when I read the original article in Psychological Science:

 “I think the biggest concern among my co-authors and I is that people will give up on mindfulness and/or meditation because they try it and it doesn’t work as promised,” says Van Dam.

“There may really be something to mindfulness, but it will be hard for us to find out if everyone gives up before we’ve even started to explore its best potential uses.”

So, how long before we “give up” on thousands of studies pouring out of an industry? In the meantime, should consumers act on what seem to be extravagant claims?

The Inverse article segued into some quotes from me after delivering another statement from the author with which I could agree:

The authors of the study make their attitudes clear when it comes to the current state of the mindfulness industry: “Misinformation and poor methodology associated with past studies of mindfulness may lead public consumers to be harmed, misled, and disappointed,” they write. And while this comes off as unequivocal, some think they don’t go far enough in calling out specific instances of quackery.

“It’s not bare-knuckle, that’s for sure. I’m sure it got watered down in the review process,” James Coyne, Ph.D., an outspoken psychologist who’s extensively criticized the mindfulness industry, tells Inverse.

Coyne agrees with the conceptual issues outlined in the paper, specifically the fact that many mindfulness therapies are based on science that doesn’t really prove their efficacy, as well as the fact that researchers with copyrights on mindfulness therapies have financial conflicts of interest that could influence their research. But he thinks the authors are too concerned with tone policing.

“I do appreciate that they acknowledged other views, but they kept out anybody who would have challenged their perspective,” he says.

Regarding Coyne’s criticism about calling out individuals, Van Dam says the authors avoided doing that so as not to alienate people and stifle dialogue.

“I honestly don’t think that my providing a list of ‘quacks’ would stop people from listening to them,” says Van Dam. “Moreover, I suspect my doing so would damage the possibility of having a real conversation with them and the people that have been charmed by them.” If you need any evidence of this, look at David “Avocado” Wolfe, whose notoriety as a quack seems to make him even more popular as a victim of “the establishment.” So yes, this paper may not go so far as some would like, but it is a first step toward drawing attention to the often flawed science underlying mindfulness therapies.

To whom is the dialogue directed about unwarranted claims from the mindfulness industry?

As one of the authors of an article claiming to be an authoritative review from a group of psychologists with diverse expertise, Van Dam says he is speaking to consumers. Why won’t he and his co-authors provide citations and name names so that readers can evaluate for themselves what they are being told? Is the risk of reputational damage and embarrassment to the psychologists so great as to cause Van Dam to protect them rather than protecting consumers from the exaggerated and even fraudulent claims of psychologists hawking their products branded as ‘peer-reviewed psychological and brain science’?

I use the term ‘quack’ sparingly outside of discussing unproven and unlikely-to-be-proven products supposed to promote physical health and well-being or to prevent or cure disease and distress.

I think Harvard psychologist Ellen Langer deserves the term “quack” for her selling of expensive trips to spas in Mexico to women with advanced cancer so that they can change their mind set to reverse the course of their disease. Strong evidence, please! Given that this self-proclaimed mother of mindfulness gets her claims promoted through the Association for Psychological Science website, I think it particularly appropriate for Van Dam and his coauthors to name her in their publication in an APS journal. Were they censored or only censoring themselves?

Let’s put aside psychologists who can be readily named as quacks. How about Van Dam and co-authors naming names of psychologists claiming to alter the brains and immune systems of cancer patients with mindfulness practices so that they improve their physical health and fight cancer, not just cope better with a life-altering disease?

I simply don’t buy Van Dam’s suggestion that to name names promotes quackery any more than I believe exposing anti-vaxxers promotes the anti-vaccine cause.

Is Van Dam only engaged in a polite discussion with fellow psychologists that needs to be strictly tone-policed to avoid offense or is he trying to reach, educate, and protect consumers as citizen scientists looking after their health and well-being? Maybe that is where we parted ways.

Calling out pseudoscience, radically changing the conversation about Amy Cuddy’s power posing paper

Part 1: Reviewed as the clinical trial that it is, the power posing paper should never have been published.

Has too much already been written about Amy Cuddy’s power pose paper? The conversation should not be stopped until its focus shifts and we change our ways of talking about psychological science.

The dominant narrative is now that a junior scientist published an influential paper on power posing and was subject to harassment and shaming by critics, pointing to the need for greater civility in scientific discourse.

Attention has shifted away from the scientific quality of the paper and the dubious products the paper has been used to promote, and onto the behavior of its critics.

Amy Cuddy and powerful allies are given forums to attack and vilify critics, accusing them of damaging the environment in which science is done and discouraging prospective early career investigators from entering the field.

Meanwhile, Amy Cuddy commands large speaking fees and has a top-selling book claiming the original paper provides strong science for simple behavioral manipulations altering mind-body relations and producing socially significant behavior.

This misrepresentation of psychological science does potential harm to consumers and the reputation of psychology among lay persons.

This blog post is intended to restart the conversation with a reconsideration of the original paper as a clinical and health psychology randomized trial (RCT) and, on that basis, identifying the kinds of inferences that are warranted from it.

In the first of a two post series, I argue that:

The original power pose article in Psychological Science should never have been published.

-Basically, we have a therapeutic analog intervention delivered in 2 1-minute manipulations by unblinded experimenters who had flexibility in what they did, what they communicated to participants, and which data they chose to analyze and how.

-It’s unrealistic to expect that 2 1-minute behavioral manipulations would have robust and reliable effects on salivary cortisol or testosterone 17 minutes later.

-It’s absurd to assume that the hormones mediated changes in behavior in this context.

-If Amy Cuddy retreats to the idea that she is simply manipulating “felt power,” we are solidly in the realm of trivial nonspecific and placebo effects.

The original power posing paper

Carney DR, Cuddy AJ, Yap AJ. Power posing brief nonverbal displays affect neuroendocrine levels and risk tolerance. Psychological Science. 2010 Oct 1;21(10):1363-8.

The Psychological Science article can be construed as a brief mind-body intervention consisting of 2 1-minute behavioral manipulations. Central to the attention that the paper attracted is the argument that this manipulation affected psychological state and social performance via the effects of the manipulation on the neuroendocrine system.

The original study is, in effect, a disguised randomized clinical trial (RCT) of a biobehavioral intervention. Once this is recognized, a host of standards can come into play for reporting this study and interpreting the results.

CONSORT

All major journals and publishers, including the Association for Psychological Science, have adopted the Consolidated Standards of Reporting Trials (CONSORT). Any submission of a manuscript reporting a clinical trial is required to be accompanied by a checklist indicating that the article reports particular details of how the trial was conducted. Item 1 on the checklist specifies that both the title and abstract indicate the study was a randomized trial. This is important and intended to aid readers in evaluating the study, but also for the study to be picked up in systematic searches for reviews that depend on screening of titles and abstracts.

I can find no evidence that Psychological Science adheres to CONSORT. For instance, my colleagues and I provided a detailed critique of a widely promoted study of loving-kindness meditation that was published in Psychological Science the same year as Cuddy’s power pose study. We noted that it was actually a poorly reported null trial with switched outcomes. With that recognition, we went on to identify serious conceptual, methodological and statistical problems. After overcoming considerable resistance, we were able to publish a muted version of our critique. Apparently reviewers of the original paper had failed to evaluate it in terms of it being an RCT.

The submission of the completed CONSORT checklist has become routine in most journals considering manuscripts for studies of clinical and health psychology interventions. Yet, additional CONSORT requirements that developed later about what should be included in abstracts are largely being ignored.

It would be unfair to single out Psychological Science and the Cuddy article for noncompliance to CONSORT for abstracts. However, the checklist can be a useful frame of reference for noting just how woefully inadequate the abstract was as a report of a scientific study.

CONSORT for abstracts

Hopewell S, Clarke M, Moher D, Wager E, Middleton P, Altman DG, Schulz KF, CONSORT Group. CONSORT for reporting randomized controlled trials in journal and conference abstracts: explanation and elaboration. PLOS Medicine. 2008 Jan 22;5(1):e20.

Journal and conference abstracts should contain sufficient information about the trial to serve as an accurate record of its conduct and findings, providing optimal information about the trial within the space constraints of the abstract format. A properly constructed and well-written abstract should also help individuals to assess quickly the validity and applicability of the findings and, in the case of abstracts of journal articles, aid the retrieval of reports from electronic databases.

Even if CONSORT for abstracts did not exist, we could argue that readers, starting with the editor and reviewers, were faced with an abstract with extraordinary claims that required better substantiation. The lack of basic details left them unable to evaluate these claims.

In effect, the abstract reduces the study to an experimercial for products about to be marketed in corporate talks and workshops, but let’s persist in evaluating it as the abstract of a scientific study.

Humans and other animals express power through open, expansive postures, and they express powerlessness through closed, contractive postures. But can these postures actually cause power? The results of this study confirmed our prediction that posing in high-power nonverbal displays (as opposed to low-power nonverbal displays) would cause neuroendocrine and behavioral changes for both male and female participants: High-power posers experienced elevations in testosterone, decreases in cortisol, and increased feelings of power and tolerance for risk; low-power posers exhibited the opposite pattern. In short, posing in displays of power caused advantaged and adaptive psychological, physiological, and behavioral changes, and these findings suggest that embodiment extends beyond mere thinking and feeling, to physiology and subsequent behavioral choices. That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications.

I don’t believe I have ever encountered in an abstract the extravagant claims with which this abstract concludes. But readers are not provided any basis for evaluating the claim until the Methods section. Undoubtedly, many holding opinions about the paper did not read that far.

Namely:

Forty-two participants (26 females and 16 males) were randomly assigned to the high-power-pose or low-power-pose condition.

Testosterone levels were in the normal range at both Time 1 (M = 60.30 pg/ml, SD = 49.58) and Time 2 (M = 57.40 pg/ml, SD = 43.25). As would be suggested by appropriately taken and assayed samples (Schultheiss & Stanton, 2009), men were higher than women on testosterone at both Time 1, F(1, 41) = 17.40, p < .001, r = .55, and Time 2, F(1, 41) = 22.55, p < .001, r = .60. To control for sex differences in testosterone, we used participant’s sex as a covariate in all analyses. All hormone analyses examined changes in hormones observed at Time 2, controlling for Time 1. Analyses with cortisol controlled for testosterone, and vice versa.

Too small a study to provide an effect size

Hold on! First, only 42 participants (26 females and 16 males) would readily be recognized as insufficient for an RCT, particularly in an area of research without past RCTs.

After decades of witnessing the accumulation of strong effect sizes from underpowered studies, many of us have reacted by requiring 35 participants per group as the minimum acceptable level for a generalizable effect size. Actually, that could be an overly liberal criterion. Why?

Many RCTs are underpowered, yet a lack of enforcement of preregistration allows investigators to obtain positive results by redefining the primary outcomes after results are known. A psychotherapy trial with 30 or fewer patients in the smallest cell has less than a 50% probability of detecting a moderate-sized significant effect, even if it is present (Coyne, Thombs, & Hagedoorn, 2010). Yet an examination of the studies mustered for treatments deemed evidence-supported by APA Division 12 ( http://www.div12.org/empirically-supported-treatments/ ) indicates that many studies were too underpowered to be reliably counted as evidence of efficacy, but were included without comment about this problem. Taking an overview, it is striking the extent to which the literature continues to depend on small, methodologically flawed RCTs conducted by investigators with strong allegiances to one of the treatments being evaluated. Yet, which treatment is preferred by investigators is a better predictor of the outcome of the trial than the specific treatment being evaluated (Luborsky et al., 2006).
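For readers who want to check the "less than 50%" figure themselves, here is a rough sketch. It assumes a two-sided test at alpha = .05, a moderate effect of d = 0.5, and a normal approximation to the two-sample t-test (the exact t-based power is slightly lower). The function name `approx_power` is mine, and the even 21/21 split of the power pose sample is an assumption, not something taken from the paper.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    Normal approximation; ignores the negligible chance of a significant
    result in the wrong direction.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)          # about 1.96 for alpha = .05
    noncentrality = d * sqrt(n_per_group / 2)  # expected z statistic under d
    return z.cdf(noncentrality - z_crit)


# 30 per cell, moderate effect (d = 0.5)
power_30 = approx_power(0.5, 30)

# 42 participants, assuming an even 21/21 split across the two poses
power_21 = approx_power(0.5, 21)

print(f"n = 30 per group: power is about {power_30:.2f}")
print(f"n = 21 per group: power is about {power_21:.2f}")
```

Under these assumptions, 30 per cell comes out just under one half, and a 21-per-group trial fares worse still.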

Earlier, my colleagues and I had argued for the non-accumulative nature of evidence from small RCTs:

Kraemer, Gardner, Brooks, and Yesavage (1998) propose excluding small, underpowered studies from meta-analyses. The risk of including studies with inadequate sample size is not limited to clinical and pragmatic decisions being made on the basis of trials that cannot demonstrate effectiveness when it is indeed present. Rather, Kraemer et al. demonstrate that inclusion of small, underpowered trials in meta-analyses produces gross overestimates of effect size due to substantial, but unquantifiable confirmatory publication bias from non-representative small trials. Without being able to estimate the size or extent of such biases, it is impossible to control for them. Other authorities voice support for including small trials, but generally limit their argument to trials that are otherwise methodologically adequate (Sackett & Cook, 1993; Schulz & Grimes, 2005). Small trials are particularly susceptible to common methodological problems…such as lack of baseline equivalence of groups; undue influence of outliers on results; selective attrition and lack of intent-to-treat analyses; investigators being unblinded to patient allotment; and not having a pre-determined stopping point so investigators are able to stop a trial when a significant effect is present.
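The overestimation that Kraemer et al. describe can be illustrated with a toy simulation. This is only a sketch under assumptions of my own choosing: a true effect of d = 0.3, 20 participants per group, and a crude publication filter that lets through only positive results beyond a z-like cutoff of 1.96. None of these numbers come from Kraemer et al.

```python
import random
from math import sqrt
from statistics import mean, stdev


def simulate_published_effects(true_d=0.3, n_per_group=20, n_trials=2000, seed=1):
    """Simulate many small two-group trials and a "significant only" filter.

    Returns (mean estimate over all trials,
             mean estimate over "published" trials,
             number of "published" trials).
    """
    rng = random.Random(seed)
    all_estimates, published = [], []
    for _ in range(n_trials):
        treated = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        pooled_sd = sqrt((stdev(treated) ** 2 + stdev(control) ** 2) / 2)
        d_hat = (mean(treated) - mean(control)) / pooled_sd
        all_estimates.append(d_hat)
        # Assumed filter: normal-approximation cutoff, not an exact t-test
        if d_hat * sqrt(n_per_group / 2) > 1.96:
            published.append(d_hat)
    return mean(all_estimates), mean(published), len(published)


all_mean, published_mean, n_published = simulate_published_effects()
print(f"mean effect, all trials:       {all_mean:.2f}")
print(f"mean effect, published trials: {published_mean:.2f}")
print(f"trials passing the filter:     {n_published} of 2000")
```

With 20 per group, a trial can only pass this filter if its estimated effect exceeds roughly 0.62, so the "published" average must sit well above the true 0.3 no matter how many trials are pooled.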

In the power posing paper, there was the control for sex in all analyses because a peek at the data revealed baseline sex differences in testosterone dwarfing any other differences. What do we make of investigators conducting a study depending on testosterone mediating a behavioral manipulation who did not anticipate large baseline sex differences in testosterone?

In a PubPeer comment leading up to this post, I noted:

We are then told “men were higher than women on testosterone at both Time 1, F(1, 41) = 17.40, p < .001, r = .55, and Time 2, F(1, 41) = 22.55, p < .001, r = .60. To control for sex differences in testosterone, we used participant’s sex as a covariate in all analyses. All hormone analyses examined changes in hormones observed at Time 2, controlling for Time 1. Analyses with cortisol controlled for testosterone, and vice versa.”

The findings alluded to in the abstract should be recognizable as weird and uninterpretable. Most basically, how could the 16 males be distributed across the two groups so that the authors could confidently say that differences held for both males and females? Especially when all analyses control for sex? Sex is highly correlated with testosterone and so an analysis that controlled for both the variables, sex and testosterone would probably not generalize to testosterone without such controls.

We are never given the basic statistics in the paper to independently assess what the authors are doing. We do not get the correlation between cortisol and testosterone, only differences in time 2 cortisol controlling for time 1 cortisol, time 1 testosterone and gender. These multivariate statistics are not very generalizable in a sample with 42 participants distributed across 2 groups. Certainly not for the 26 females and 16 males taken separately.

The behavioral manipulation

The original paper reports:

Participants’ bodies were posed by an experimenter into high-power or low-power poses. Each participant held two poses for 1 min each. Participants’ risk taking was measured with a gambling task; feelings of power were measured with self-reports. Saliva samples, which were used to test cortisol and testosterone levels, were taken before and approximately 17 min after the power-pose manipulation.

And then elaborates:

To configure the test participants into the poses, the experimenter placed an electrocardiography lead on the back of each participant’s calf and underbelly of the left arm and explained, “To test accuracy of physiological responses as a function of sensor placement relative to your heart, you are being put into a certain physical position.” The experimenter then manually configured participants’ bodies by lightly touching their arms and legs. As needed, the experimenter provided verbal instructions (e.g., “Keep your feet above heart level by putting them on the desk in front of you”). After manually configuring participants’ bodies into the two poses, the experimenter left the room. Participants were videotaped; all participants correctly made and held either two high-power or two low-power poses for 1 min each. While making and holding the poses, participants completed a filler task that consisted of viewing and forming impressions of nine faces.

The behavioral task and subjective self-report assessment

Measure of risk taking and powerful feelings. After they finished posing, participants were presented with the gambling task. They were endowed with $2 and told they could keep the money—the safe bet—or roll a die and risk losing the $2 for a payoff of $4 (a risky but rational bet; odds of winning were 50/50). Participants indicated how “powerful” and “in charge” they felt on a scale from 1 (not at all) to 4 (a lot).

An imagined bewildered review from someone accustomed to evaluating clinical trials

Although the authors don’t seem to know what they’re doing, we have an underpowered therapy analogue study with extraordinary claims. It’s unconvincing that the 2 1-minute behavioral manipulations would change subsequent psychological states and behavior with any extralaboratory implications.

The manipulation poses a puzzle to research participants, challenging them to figure out what is being asked of them. The $2 gambling task presumably is meant to simulate effects on real-world behavior. But the low stakes could mean that participants believed the task evaluated whether they “got” the purpose of the intervention and behaved accordingly. Within that perspective, the unvalidated subjective self-report rating scale would serve as a clue to the intentions of the experimenter and an opportunity to show the participants were smart. The manipulation of putting participants into a low power pose is even more unconvincing as a contrasting active intervention or a control condition. Claims that this manipulation did anything but communicate experimenter expectancies are even less credible.

This is a very weak form of evidence: A therapy analogue study with such a brief, low intensity behavioral manipulation followed by assessments of outcomes that might just inform participants of what they needed to do to look smart (i.e., demand characteristics). Add in that the experimenters were unblinded and undoubtedly had flexibility in how they delivered the intervention and what they said to participants. As a grossly underpowered trial, the study cannot make a contribution to the literature and certainly not an effect size.

Furthermore, if the authors had even a basic understanding of gender differences in social status or sex differences in testosterone, they would have stratified the study with respect to participant gender, not attempted to obtain control by post hoc statistical manipulation.

I could comment on signs of p-hacking and widespread signs of inappropriate naming, use, and interpretation of statistics, but why bother? There are no vital signs of a publishable paper here.

Is power posing salvaged by fashionable hormonal measures?

Perhaps the skepticism of the editor and reviewers was overcome by the introduction of mind-body explanations of what some salivary measures supposedly showed. Otherwise, we would be left with a single subjective self-report measure and a behavioral task susceptible to demand characteristics and nonspecific effects.

We recognize that the free availability of powerful statistical packages risks people using them without any idea of the appropriateness of their use or interpretation. The same observation should be made of the ready availability of means of collecting spit samples from research participants to be sent off to outside laboratories for biochemical analysis.

The clinical health psychology literature is increasingly filled with studies incorporating easily collected saliva samples intended to establish that psychological interventions influence mind-body relations. These have become particularly applied in attempts to demonstrate that mindfulness meditation and even tai chi can have beneficial effects on physical health and even cancer outcomes.

Often inaccurately described as “biomarkers,” rather than merely as biological measurements, there is seldom much learned by inclusion of such measures that is generalizable within participants or across studies.

Let’s start with salivary-based cortisol measures.

A comprehensive review suggests that:

  • A single measurement on a participant or a pre-post pair of assessments would not be informative.
  • Single measurements are unreliable and large intra- and inter-individual differences not attributable to intervention can be in play.
  • Minor variations in experimental procedures can have large, unwanted effects.
  • The current standard is cortisol awakening response and the diurnal slope over more than one day, which would not make sense for the effects of 2 1-minute behavioral manipulations.
  • Even with sophisticated measurement strategies there is low agreement across and even within studies and low agreement with behavioral and self-report data.
  • The idea that collecting saliva samples would serve the function the investigators intended is an unscientific, but attractive illusion.

Another relevant comprehensive theoretical review and synthesis of cortisol reactivity was available at the time the power pose study was planned. The article identifies no basis for anticipating that experimenters putting participants into 1-minute expansive poses would lower cortisol. And certainly no basis for assuming that putting participants into a 1-minute slumped position would raise cortisol. Or what such findings could possibly mean.

But we are clutching at straws. The authors’ interpretations of their hormonal data depend on bizarre post hoc decisions about how to analyze their data in a small sample in which participant sex is treated in incomprehensible fashion. The process of trying to explain spurious results risks giving the results a credibility that authors have not earned for them. And don’t even try to claim we are getting signals of hormonal mediation from this study.

Another system failure: The incumbent advantage given to a paper that should not have been published.

Even when publication is based on inadequate editorial oversight and review, any likelihood of correction is diminished by published results having been blessed as “peer reviewed” and accorded an incumbent advantage over whatever follows.

A succession of editors have protected the power pose paper from post-publication peer review. Postpublication review has been relegated to other journals and social media, including PubPeer and blogs.

Soon after publication of the power pose paper, a critique was submitted to Psychological Science, but it was desk rejected. The editor informally communicated to the author that the critique read like a review and the original article had already been peer reviewed.

The critique by Steven J. Stanton nonetheless eventually appeared in Frontiers in Behavioral Neuroscience and is worth a read.

Stanton took seriously the science being invoked in the claims of the power pose paper.

A sampling:

Carney et al. (2010) collapsed over gender in all testosterone analyses. Testosterone conforms to a bimodal distribution when including both genders (see Figure 13; Sapienza et al., 2009). Raw testosterone cannot be considered a normally distributed dependent or independent variable when including both genders. Thus, Carney et al. (2010) violated a basic assumption of the statistical analyses that they reported, because they used raw testosterone from pre- and post-power posing as independent and dependent variables, respectively, with all subjects (male and female) included.

And

Mean cortisol levels for all participants were reported as 0.16 ng/mL pre-posing and 0.12 ng/mL post-posing, thus showing that for all participants there was an average decrease of 0.04 ng/mL from pre- to post-posing, regardless of condition. Yet, Figure 4 of Carney et al. (2010) shows that low-power posers had mean cortisol increases of roughly 0.025 ng/mL and high-power posers had mean cortisol decreases of roughly 0.03 ng/mL. It is unclear given the data in Figure 4 how the overall cortisol change for all participants could have been a decrease of 0.04 ng/mL.
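Stanton's arithmetic point is easy to verify: an overall mean change is a weighted average of the two group means, so it has to fall between them. A minimal sketch, using the rough group changes quoted above and trying every possible split of the 42 participants (the function name `pooled_change` is mine, and the group values are approximate readings from a figure):

```python
def pooled_change(n_high, n_low, change_high=-0.03, change_low=0.025):
    """Overall mean change (ng/mL) implied by the two group means."""
    return (n_high * change_high + n_low * change_low) / (n_high + n_low)


total = 42
# Every possible allocation of 42 participants to the two conditions
possible = [pooled_change(n_high, total - n_high) for n_high in range(total + 1)]

most_negative = min(possible)
most_positive = max(possible)
print(f"overall change must lie between {most_negative:.3f} and {most_positive:.3f} ng/mL")
print("a decrease of 0.04 ng/mL is attainable:", most_negative <= -0.04)
```

Even if all 42 participants had been in the high-power condition, the overall decrease could be no larger than roughly 0.03 ng/mL.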

Another editor of Psychological Science received a critical comment from Marcus Crede and Leigh A. Phillips. After the first round of reviews, Crede and Phillips removed references to changes in the published power pose paper from earlier drafts that they had received from the first author, Dana Carney. However, Crede and Phillips withdrew their critique when asked to respond to a review by Amy Cuddy in a second resubmission.

The critique is now forthcoming in Social Psychological and Personality Science:

Revisiting the Power Pose Effect: How Robust Are the Results Reported by Carney, Cuddy and Yap (2010) to Data Analytic Decisions

The article investigates effects of choices made in p-hacking in the original paper. An excerpt from the abstract:

In this paper we use multiverse analysis to examine whether the findings reported in the original paper by Carney, Cuddy, and Yap (2010) are robust to plausible alternative data analytic specifications: outlier identification strategy; the specification of the dependent variable; and the use of control variables. Our findings indicate that the inferences regarding the presence and size of an effect on testosterone and cortisol are highly sensitive to data analytic specifications. We encourage researchers to routinely explore the influence of data analytic choices on statistical inferences and also encourage editors and reviewers to require explicit examinations of the influence of alternative data analytic specifications on the inferences that are drawn from data.
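To give a feel for what a multiverse analysis involves, here is a toy sketch of my own. It is not Crede and Phillips's analysis or data. The participants are synthetic, generated with no true effect of condition, and the hormone-like values, the outlier cutoffs, and the crude sex adjustment (averaging within-sex differences) are all made-up assumptions. The same data are run through every combination of the three kinds of analytic choice named in the abstract.

```python
import random
from itertools import product
from statistics import mean, stdev


def make_toy_data(n=42, seed=0):
    """Synthetic participants with NO true effect of condition (illustration only)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        male = i % 3 == 0
        base = rng.gauss(90 if male else 40, 20)  # made-up hormone-like values
        rows.append({
            "condition": "high" if i % 2 == 0 else "low",
            "male": male,
            "pre": base,
            "post": base + rng.gauss(0, 15),
        })
    return rows


def drop_outliers(rows, key, sd_cutoff):
    """Drop rows more than sd_cutoff SDs from the mean; None keeps everyone."""
    if sd_cutoff is None:
        return rows
    values = [key(r) for r in rows]
    m, s = mean(values), stdev(values)
    return [r for r in rows if abs(key(r) - m) <= sd_cutoff * s]


def group_difference(rows, key):
    high = [key(r) for r in rows if r["condition"] == "high"]
    low = [key(r) for r in rows if r["condition"] == "low"]
    return mean(high) - mean(low)


def estimate(rows, key, adjust_for_sex):
    if not adjust_for_sex:
        return group_difference(rows, key)
    # Crude adjustment: average the high-minus-low difference within each sex
    strata = [[r for r in rows if r["male"] == flag] for flag in (True, False)]
    return mean(group_difference(s, key) for s in strata)


dv_specs = {
    "post": lambda r: r["post"],               # Time 2 level
    "change": lambda r: r["post"] - r["pre"],  # Time 2 minus Time 1
}
outlier_specs = [None, 3.0, 2.0]
control_specs = [False, True]


def run_multiverse(rows):
    results = {}
    for (dv_name, key), cutoff, adjust in product(
            dv_specs.items(), outlier_specs, control_specs):
        kept = drop_outliers(rows, key, cutoff)
        results[(dv_name, cutoff, adjust)] = estimate(kept, key, adjust)
    return results


results = run_multiverse(make_toy_data())
for spec, diff in sorted(results.items(), key=lambda kv: kv[1]):
    print(spec, round(diff, 2))
```

Even with only three choices, there are twelve defensible-looking analyses of one small data set. A researcher free to report whichever one "worked" has considerable room to find an effect that is not there.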

Dana Carney, the first author of the paper, has now posted an explanation of why she no longer believes the originally reported findings are genuine and why “the evidence against the existence of power poses is undeniable.” She discloses a number of important confounds and important “researcher degrees of freedom” in the analyses reported in the published paper.

Coming Up Next

A different view of Amy Cuddy’s TED talk in terms of its selling of pseudoscience to consumers and its acknowledgment of a strong debt to Cuddy’s adviser Susan Fiske.

A disclosure of some of the financial interests that distort discussion of the scientific flaws of the power pose.

How the reflexive response of the replicationados inadvertently reinforced the illusion that the original pose study provided meaningful effect sizes.

How Amy Cuddy and her allies marshalled the resources of the Association for Psychological Science to vilify and intimidate critics of bad science and of the exploitation of consumers by psychological pseudoscience.

How journalists played into this vilification.

What needs to be done to avoid a future fiasco for psychology like the power pose phenomenon and protect reformers of the dissemination of science.

Note: Time to reiterate that all opinions expressed here are solely those of Coyne of the Realm and not necessarily of PLOS blogs, PLOS One or his other affiliations.

Should have seen it coming: Once high-flying Psychological Science article lies in pieces on the ground

Life is too short for wasting time probing every instance of professional organizations promoting bad science when they have an established record of doing just that.

There were lots of indicators that this is what we were dealing with in the Association for Psychological Science’s (APS) recent campaign for the now discredited and retracted ‘sadness prevents us from seeing blue’ article.

A quick assessment of the press release should have led us to dismiss the claims being presented and convinced us to move on.

Readers can skip my introductory material by jumping down this blog post to [*] to see my analysis of the APS press release.

Readers can also still access the original press release, which has now disappeared from the web, here. Some may want to read the press release and form their own opinions before proceeding into this blog post.

What, I’ve stopped talking about the PACE trial? Yup, at least at Mind the Brain, for now. But you can go here for the latest in my continued discussion of the PACE trial of CBT for chronic fatigue syndrome, in which I moved from critical observer to activist a while ago.

Before we were so rudely interrupted  by the bad science and bad media coverage of the PACE trial, I was focusing on how readers can learn to make quick assessments of hyped media coverage of dubious scientific studies.

In “Sex and the single amygdala”  I asked:

Can skeptics who are not specialists, but who are science-minded and have some basic skills, learn to quickly screen and detect questionable science in the journals and its media coverage?

The counter argument of course is Chris Mooney telling us “You Have No Business Challenging Scientific Experts”. He cites

“Jenny McCarthy, who once remarked that she began her autism research at the ‘University of Google.’”

But while we are on the topic of autism, how about the counter example of The Lancet’s coverage of the link between vaccines and autism? This nonsense continues to take its toll on American children whose parents – often higher income and more educated than the rest – refused to vaccinate them on the basis of a story that started in The Lancet. Editor Richard Horton had to concede

[Image: Richard Horton on The Lancet’s autism failure]

If we accept Chris Mooney’s position, we are left at the mercy of press releases cranked out by professional organizations like the Association for Psychological Science (APS) that repeatedly demand that we revise our thinking about human nature and behavior, as well as change our behavior if we want to extend our lives and live more happily, all on the basis of a single “breakthrough” study. Rarely do APS press releases have any follow-up as to the fate of a study they promoted. One has to hope that PubPeer or PubMed Commons picks up on the article touted in the press release and see what a jury of post-publication peers decides.

As we have seen in my past Mind the Brain posts, there are constant demands on our attention from press releases generated from professional organizations, university press officers, and even NIH alerting us to supposed breakthroughs in psychological and brain science. Few such breakthroughs hold up over time.

Are there no alternatives?

Are there no alternatives to our simply deferring to the expertise being offered or taking the time to investigate for ourselves claims that are likely to prove exaggerated or simply false?

We should approach press releases from the APS, or from its rival the American Psychological Association, using prior probabilities to set our expectations. The Open Science Collaboration: Psychology (OSC) article in Science presented results of a systematic attempt to replicate 100 findings from prestigious psychological journals, including APS’s Psychological Science and APA’s Journal of Personality and Social Psychology. Fewer than half of the findings were replicated. Findings from the APS and APA journals fared worse than the others.

So, our prior probabilities are that declarations of newsworthy, breakthrough findings trumpeted in press releases from psychological organizations are likely to be false or exaggerated – unless we assume that the publicity machines prefer the trustworthy over the exciting and newsworthy in the article they selected to promote.

I will guide readers through a quick assessment of the APS press release, which I started on this post before getting swept up into the PACE controversy. However, in the intervening time, there have been some extraordinary developments, which I will then briefly discuss. We can use these developments to validate my evaluation, and yours, of the press release as it was available earlier. Surprisingly, there is little overlap between the issues I note in the press release and what concerned post-publication commentators.

*A running commentary based on screening the press release

What once was a link to the “feeling blue and seeing blue” article now takes one only to

[Image: retraction press release]

Fortunately, the original press release can still be reached here. The original article is preserved here.

My skepticism was already high after I read the opening two paragraphs of the press release

The world might seem a little grayer than usual when we’re down in the dumps and we often talk about “feeling blue” — new research suggests that the associations we make between emotion and color go beyond mere metaphor. The results of two studies indicate that feeling sadness may actually change how we perceive color. Specifically, researchers found that participants who were induced to feel sad were less accurate in identifying colors on the blue-yellow axis than those who were led to feel amused or emotionally neutral.

“Our results show that mood and emotion can affect how we see the world around us,” says psychology researcher Christopher Thorstenson of the University of Rochester, first author on the research. “Our work advances the study of perception by showing that sadness specifically impairs basic visual processes that are involved in perceiving color.”

What Anglocentric nonsense. First, blue as a metaphor for sad does not occur across most languages other than English and Serbian. In German, to call someone blue is suggesting the person is drunk. In Russian, you are suggesting that the person is gay. In Arabic, if you say you are having a blue day, it is a bad one. But if you say in Portuguese that “everything is blue”, it suggests everything is fine.

In Indian culture, blue is more associated with happiness than sadness, probably traceable to the blue-blooded Krishna being associated with divine and human love in Hinduism. In Catholicism, the Virgin Mary is often wearing blue and so the color has come to be associated with calmness and truth.

We are off to a bad start. Going to the authors’ description of their first of two studies, we learn:

In one study, the researchers had 127 undergraduate participants watch an emotional film clip and then complete a visual judgment task. The participants were randomly assigned to watch an animated film clip intended to induce sadness or a standup comedy clip intended to induce amusement. The emotional effects of the two clips had been validated in previous studies and the researchers confirmed that they produced the intended emotions for participants in this study.

Oh no! This is not a study of clinical depression, but another study of normal college students “made sad” with a mood induction.

So-called mood induction tasks don’t necessarily change actual mood state, but they do convey to research participants what is expected of them and how they are supposed to act. In one of the earliest studies I ever did, we described a mood induction procedure to subjects without actually having them experience it. We then asked them to respond as if they had received it. Their responses were indistinguishable. We concluded that we could not rule out that what were considered effects of a mood induction task were simply demand characteristics, what research participants perceive as instructions as to how they should behave.

It was fashionable way back then for psychology researchers who were isolated in departments that did not have access to clinically depressed patients to claim that they were nonetheless conducting analog studies of depression. Subjecting students to unsolvable anagram tasks or uncontrollable loud noises was seen as inducing learned helplessness in them, thereby allowing investigators an analog study of depression. We demonstrated a problem with that idea. If students believed that the next task that they were administered was part of the same experiment, they performed poorly, as if they were in a state of learned helplessness or depression. However, if they believed that the second task was unrelated to the first, they would show no such deficits. Their negative state of helplessness or depression was confined to their performance in what they thought was the same setting in which the induction had occurred. Shortly after our experiments, Marty Seligman wisely stopped doing studies “inducing” learned helplessness in humans, but he continued to make the same claims about the studies he had done.

Analog studies of depression disappeared for a while, but I guess they have come back into fashion.

But the sad/blue experiment could also be seen as a priming  experiment. The research participants were primed by the film clip and their response to a color naming task was then examined.

It is fascinating that neither the press release nor the article itself ever mentioned the word priming. It was only a few years ago that APS press releases were crowing about priming studies. For instance, a 2011 press release entitled “Life is one big priming experiment…” declared:

One of the most robust ideas to come out of cognitive psychology in recent years is priming. Scientists have shown again and again that they can very subtly cue people’s unconscious minds to think and act certain ways. These cues might be concepts—like cold or fast or elderly—or they might be goals like professional success; either way, these signals shape our behavior, often without any awareness that we are being manipulated.

Whoever wrote that press release should be embarrassed today. In the interim, priming effects have not proven robust. Priming studies that cannot be replicated have figured heavily in the assessment that the psychological literature is untrustworthy. Priming studies also figure heavily in the 56 retracted studies of fraudster psychologist Diederik Stapel. He claims that he turned to inventing data when his experiments failed to demonstrate priming effects that he knew were there. Yet, once he resorted to publishing studies with fabricated data, others claimed to replicate his work.

I made up research, and wrote papers about it. My peers and the journal editors cast a critical eye over it, and it was published. I would often discover, a few months or years later, that another team of researchers, in another city or another country, had done more or less the same experiment, and found the same effects.  My fantasy research had been replicated. What seemed logical was true, once I’d faked it.

So, we have an APS press release reporting a study that assumes that the association between sadness and the color blue is so hardwired and culturally universal that it is reflected in basic visual processes. Yet the study does not involve clinical depression, only an analog mood induction, and a closer look reveals that once again APS is pushing a priming study. I think it’s time to move on. But let’s read on:

The results cannot be explained by differences in participants’ level of effort, attention, or engagement with the task, as color perception was only impaired on the blue-yellow axis.

“We were surprised by how specific the effect was, that color was only impaired along the blue-yellow axis,” says Thorstenson. “We did not predict this specific finding, although it might give us a clue to the reason for the effect in neurotransmitter functioning.”

The researchers note that previous work has specifically linked color perception on the blue-yellow axis with the neurotransmitter dopamine.

The press release tells us that the finding is very specific, occurring only on the blue-yellow axis, not the red-green axis, and that differences between conditions are not found in level of effort, attention, or engagement with the task. The researchers did not expect such a specific finding; they were surprised.

The press release wants to convince us of an exciting story of novelty and breakthrough. A skeptic sees it differently: This is an isolated finding, unanticipated by the researchers, getting all dressed up. See, we should’ve moved on.

The press release wants to convince us that the evidence is exciting because it is specific and novel. The researchers are celebrating the specificity of their finding, but the blue-yellow axis finding may be the only statistically significant one because it is due to chance or an artifact.
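How easily can chance alone produce one isolated "significant" result? Here is a minimal sketch, assuming independent tests at the conventional 0.05 threshold and no real effects anywhere. The number of tests is my assumption for illustration; I am not claiming to know how many comparisons these researchers ran.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent null tests
    comes out 'significant' at the given alpha."""
    return 1 - (1 - alpha) ** num_tests

# Hypothetical numbers of comparisons (e.g. color axes x emotion conditions).
for num_tests in (1, 2, 4, 8):
    print(num_tests, round(chance_of_false_positive(num_tests), 3))
```

With just two tests the chance of at least one false positive is already close to 10 percent, and it keeps climbing with every additional comparison. A lone surprising hit is exactly what this process produces.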

And bringing up unmeasured “neurotransmitter functioning” is pretentious and unwise. I challenge the researchers to show that the effects of watching a brief movie clip register in measurable changes in neurotransmitters. I’m skeptical even whether depressed persons drawn from community or outpatient samples reliably differ from non-depressed persons in measures of the neurotransmitter dopamine.

“This is new work and we need to take time to determine the robustness and generalizability of this phenomenon before making links to application,” he concludes.

Claims in APS press releases are not known for their “robustness and generalizability.” I don’t think this particular claim should prompt an effort at independent replication when scientists have so many more useful things to keep them busy.

Maybe these investigators should have checked robustness and generalizability before rushing into print. Maybe APS should stop pestering us with findings that surprise researchers and that have not yet been replicated.

A flying machine in pieces on the ground

Sadness impairs color perception was sent soaring high, lifted by an APS press release that has since been removed from the web but is still available here. The press release was initially echoed uncritically, usually cut-and-pasted or outright churnaled, in over two dozen media mentions.

But, alas, Sadness impairs color perception is now a flying machine in pieces on the ground 

Notice of the article’s problems seems to have started with some chatter among skeptically minded individuals on Twitter, which led to comments at PubPeer where the article was torn to pieces. What unfolded was a wonderful demonstration of crowdsourced post-publication peer review in action. Lesson: PubPeer rocks and can overcome the failures of pre-publication peer review to keep bad stuff out of the literature.

You can follow the thread of comments at PubPeer.

  • An anonymous skeptic started off by pointing out an apparent lack of a significant statistical effect where one was claimed.
  • There was an immediate call for a retraction, but it seemed premature.
  • Soon re-analyses of the data from the paper were being reported, confirming the lack of a significant statistical effect when analyses were done appropriately and reported transparently.
  • The data set for the article was mysteriously changed after it had been uploaded.
  • Doubts were expressed about the integrity of the data – had they been tinkered with?
  • The data disappeared.
  • There was an announcement of a retraction.

The retraction notice  indicated that the researchers were still convinced of the validity of their hypothesis, despite deciding to retract their paper.

We remain confident in the proposition that sadness impairs color perception, but would like to acquire clearer evidence before making this conclusion in a journal the caliber of Psychological Science.

The retraction note also carries a curious Editor’s note:

Although I believe it is already clear, I would like to add an explicit statement that this retraction is entirely due to honest mistakes on the part of the authors.

Since then, doubts have been expressed about whether retraction was a sufficient response or whether something more is needed. Some of the participants in the PubPeer discussion drafted a letter to the editor incorporating their reanalyses and prepared to submit it to Psychological Science. Unfortunately, having succeeded in getting the bad science retracted, these authors reduced the likelihood of their reanalysis being accepted by Psychological Science. As of this date, their fascinating account remains unpublished but available on the web.

Postscript

Next time you see an APS or APA press release, what will be your starting probabilities about the trustworthiness of the article being promoted? Do you agree with Chris Mooney that you should simply defer to the expertise of the professional organization?

Why would professional organizations risk embarrassment with these kinds of press releases? Apparently they are worth the risk. Such press releases can echo through conventional and social media and attract early attention to an article. The game is increasing the journal’s impact factor (JIF).

Although it is unclear precisely how journal impact factors are calculated, the number reflects the average number of citations an article obtains within two years of publication. However, if press releases  promote “early releases” of articles,  the journal can acquire citations before the clock starts ticking for the two years. APS and APA are in intense competition for prestige of their journals and membership. It matters greatly to them which organization can claim the most prestigious journals, as demonstrated by their JIFs.
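To make the incentive concrete, here is a minimal sketch of the commonly stated two-year definition: citations received in a given year to items the journal published in the previous two years, divided by the number of citable items it published in those two years. What counts as a "citable item" is one of the murky parts, and all the counts below are hypothetical numbers of my own, not figures for any real journal.

```python
def impact_factor(citations_to_prior_two_years, citable_items_prior_two_years):
    """Two-year impact factor: citations this year to the previous two years'
    items, divided by the citable items published in those two years."""
    return citations_to_prior_two_years / citable_items_prior_two_years

# Hypothetical journal: 200 citable items over two years, 800 citations to them.
baseline = impact_factor(800, 200)

# Same journal, assuming early attention shifts 100 extra citations into the
# counting window (a made-up figure for illustration).
with_early_attention = impact_factor(900, 200)

print(baseline, with_early_attention)
```

The denominator does not change when an article gets a head start on attention, so any citations that arrive sooner translate directly into a higher number.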

So, press releases are important for garnering early attention. Apparently breakthroughs, innovations, and “first ever” matter more than trustworthiness. And the professional organizations hope we won’t remember the fate of past claims.