Did a placebo affect allergic reactions to a pin prick, or only in the authors' minds?

Can placebo effects be harnessed to improve treatment outcomes? Stories of a placebo changing bodily function are important in promoting mind-body medicine, but mostly turn out to be false positives. Was this one an exception?

A lesson in critical appraisal: How to screen complicated studies in order to decide whether to put the time and energy into a closer look.

The study:

Howe LC, Goyer JP, Crum AJ. Harnessing the Placebo Effect: Exploring the Influence of Physician Characteristics on Placebo Response. Health Psychology. 2017;36(11):1074-1082. http://dx.doi.org/10.1037/hea0000499

From the Abstract:

After inducing an allergic reaction in participants through a histamine skin prick test, a health care provider administered a cream with no active ingredients and set either positive expectations (cream will reduce reaction) or negative expectations (cream will increase reaction).

The provider demonstrated either high or low warmth, or either high or low competence.

Results: The impact of expectations on allergic response was enhanced when the provider acted both warmer and more competent and negated when the provider acted colder and less competent.

Conclusion: This study suggests that placebo effects should be construed not as a nuisance variable with mysterious impact but instead as a psychological phenomenon that can be understood and harnessed to improve treatment outcomes.

Why I dismissed this study

The small sample size was set by a power analysis based on the authors' hopes of finding a moderate effect size, not on any existing results. With only 20 participants per cell, most significant findings are likely to be false positives.
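
To see just how thin 20 per cell is, here is a minimal power calculation, assuming that a "moderate" effect means Cohen's d = 0.5 and that cells are compared with a simple two-sided t-test at alpha = .05 (illustrative assumptions on my part, not the authors' exact contrasts):

```python
# Minimal power check for a two-group comparison with 20 participants per cell.
# Assumes a "moderate" true effect of Cohen's d = 0.5 and a two-sided test at
# alpha = .05 (illustrative assumptions, not the authors' exact analysis).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
power = analysis.power(effect_size=0.5, nobs1=20, alpha=0.05)
print(f"Power with 20 per cell: {power:.2f}")  # roughly 0.34

# Participants per group needed to reach the conventional 80% power:
n_needed = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05)
print(f"Needed per group: {n_needed:.0f}")  # about 64
```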

The authors had a complicated design with multiple manipulations and time points. They examined two physiological measures, but reported results for only one of them in the paper, the one with the stronger results.

The authors did not report a key overall test of whether there was a significant main effect or interaction. Without such a finding, jumping straight to significant comparisons between groups is likely to produce false positives.

The authors did not adjust for multiple comparisons, despite conducting a huge number of them.

The authors did not report raw mean differences for comparisons, only differences at two time points controlling for gender, race, and the first two time points. No rationale was given for these adjustments.

The authors used language like ‘marginally significant’ and ‘different, but not significantly so,’ which might suggest they were chasing and selectively reporting significant findings.

The phenomenon under study was a mild allergic reaction over the short term: three time points spanning 9-15 minutes after the skin prick, with data from the two earlier time points not reported as outcomes. It is unclear by what mechanism an experimental manipulation could have an observable effect on such a mild reaction in such a short period of time.

Overview

Claims of placebo effects figure heavily in discussions of the power of the mind over the body. Yet this power is greatly exaggerated by laypersons, in the lay press, and on social media. Effects of a placebo manipulation on objective physiological measures, as opposed to subjective self-report measures, are uncommon and usually turn out to be false positives.

A New England Journal of Medicine review of 130 clinical trials found:

Little evidence in general that placebos had powerful clinical effects. Although placebos had no significant effects on objective or binary outcomes, they had possible small benefits in studies with continuous subjective outcomes and for the treatment of pain. Outside the setting of clinical trials, there is no justification for the use of placebos.

I often cite another great NEJM study showing the sharp contrast between positive results obtained with subjective self-report measures and negative results obtained with objective physical functioning measures.

That is probably the case with this recent report of effects of expectancies and the provider's interpersonal style on a mild allergic reaction induced by a histamine skin prick test (SPT). The study involved manipulation of the perceived warmth and competence of the provider, as well as whether research participants were told that an inert cream being applied would have a positive or negative effect.

The authors invoke their results as support for the claim that psychological variables do indeed influence a mild allergic reaction. Examining all of the numerous pairwise comparisons would be a long and tedious task. However, I decided from some details of the design and analysis of the study that I would not proceed.

Some notable features of the study:

The key manipulations of high versus low warmth and high versus low competence were in the behavior of a single unblinded experimenter.

The design is described as 2x2x2 with a cell size of n = 20 (19 in one cell).

It is more properly described as 2x2x2x(5) because of the 5 time points after the provider administered the skin prick:

(T1 = 3 min post-SPT; T2 = 6 min post-SPT, with the cream administered directly afterward; T3 = 9 min post-SPT and 3 min post-cream; T4 = 12 min post-SPT and 6 min post-cream; T5 = 15 min post-SPT and 9 min post-cream).

The small number of participants per cell was set in a power analysis based on the hope that a moderate effect size could be shown, not on past results.

The physiological reaction was measured in terms of size of a wheal (raised bump) and size of the flare (redness surrounding the bump).

Numerous other physiological measures were obtained, including blood pressure and pre- and post-session saliva samples. It is not stated what was done with these data, but they could have been used to further evaluate the manipulation of experimenter behavior.

No simple correlation between participants' perceptions of warmth and competence is reported, which would have been helpful in interpreting the 2×2 crossing of warmth and competence.

In the supplementary materials, readers are told ratings of itchiness and mood were obtained after the skin prick. No effects of the experimental manipulation were observed, which would seem not to support the effectiveness of the intervention.

No overall ANOVA or test for significance of interactions is presented.

Instead, numerous post hoc pairwise comparisons are presented without correction for multiplicity.
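
A back-of-the-envelope calculation shows why this matters. Treating each comparison as an independent test at alpha = .05 (a simplification, since I have not tallied the exact number of comparisons in the paper), the chance of at least one false positive climbs quickly:

```python
# Family-wise error rate for k uncorrected, independent tests at alpha = .05.
# k = 10 and k = 20 are illustrative values, not counts taken from the paper.
alpha = 0.05
for k in (10, 20):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k} comparisons: P(at least one false positive) = {fwer:.2f}")
# 10 comparisons -> 0.40; 20 comparisons -> 0.64
```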

Further comparisons were conducted with a sample that was constructed post hoc:

To better understand the mechanism by which expectations differed, within a setting of high warmth and high competence, we compared the wheal and flare size for the positive and negative expectations conditions to a follow-up sample who received neutral expectations. This resulted in a total sample of N=62.

Differences found using this sample were discussed, despite significance levels of p = .095 and p = .155.

Raw mean scores are neither presented nor discussed. Instead, all comparisons controlled for gender, race, and the size of the wheal at Times 1 and 2.

Only the size of the wheal is reported in the body of the paper, but it was reported that:

The results on the flare of the reaction were mostly similar (see the supplemental material available online).

Actually, the results reported in the supplemental material were considerably weaker, with claims of differences being merely marginally significant and with results favored that were significant only at particular time points.

So, what do you think? If you are interested, take a look at the study and let me know if I was premature to dismiss it.

Preorders are being accepted for e-books providing skeptical looks at mindfulness and positive psychology, and arming citizen scientists with critical thinking skills. Right now there is a special offer for free access to a Mindfulness Master Class. But hurry, it won’t last.

I will also be offering scientific writing courses on the web as I have been doing face-to-face for almost a decade. I want to give researchers the tools to get into the journals where their work will get the attention it deserves.

Sign up at my website to get advance notice of the forthcoming e-books and web courses, as well as upcoming blog posts at this and other blog sites. Lots to see at CoyneoftheRealm.com.

Power pose: II. Could early career investigators' participation in replication initiatives hurt their advancement?

Participation in attempts to replicate seriously flawed studies might be seen as bad judgment, when there are many more opportunities to demonstrate independent, critical thinking.

This is the second blog post concerning the special issue of Comprehensive Results in Social Psychology devoted to replicating Amy Cuddy's original power pose study in Psychological Science.

Some things for early career investigators to think about.

Participating in attempts to replicate seriously flawed studies might be seen as bad judgment, when there are many more opportunities to demonstrate independent, critical thinking.

I have long argued that there should be better incentives for early career (as well as more senior) investigators (ECRs) to participate in efforts to improve the trustworthiness of science.

ECRs should be encouraged, and expected, to engage in post-publication peer review at PubPeer and PubMed Commons, and ways should be developed in which such activity can be listed on the CV.

The Pottery Barn rule should be extended so that ECRs can publish critical commentaries in the journals that publish the original flawed papers. Retraction notices should indicate whose complaints led to the retraction.

Rather than being pressured to publish more underpowered, under-resourced studies, ECRs should be rewarded for 'research parasite' activity. They should be assisted in obtaining data sets from already published studies. With those data, they should conduct exploratory, secondary analyses aimed at understanding what went wrong in larger-scale studies that left them methodologically compromised and with shortfalls in recruitment.

But I wonder if we should counsel ECRs that participating in multisite replication initiatives like the one directed at the power pose effect might not contribute to their career advancement and may even hurt it.

MturkI’ve been critical of the value of replication initiatives  as the primary means of addressing the own trustworthiness of psychology, particularly in areas with claims of clinical and public health relevance. To add to the other reservations I have, I can point that the necessary economy and efficiency of reliance on MTurk and other massive administrations of experimental manipulations can force the efforts to improve the trustworthiness of psychology into less socially significant and may be less representative areas.

I certainly wouldn't penalize an early career investigator for involvement in a multisite replication. I appreciate there is room for disagreement with my skepticism about the value of such initiatives. I would recognize the commitment to better research practices that such involvement would represent.

But I think early career investigators need to consider that some senior investigators and members of hiring and promotion committees (HPCs) might give a low rating to publications coming from such initiatives in judging a candidate's potential for original, creative, risk-taking research. That might be so even if these committee members appreciate the need to improve the trustworthiness of psychology.

Here are some conceivable comments that could be made in such a committee’s deliberations.

“Why did this candidate get involved in a modest scale study so focused on two saliva assessments of cortisol? Even if it is not their area of expertise, shouldn't they have consulted the literature and seen how uninformative a pair of assessments of cortisol is, given the well-known problems of intra-individual and inter-individual variation in cortisol and its sensitivity to uncontrolled contextual variables?…They should have powered their study to find cortisol differences amidst all the noise.”

“Were they unaware that testosterone levels differ between men and women by a factor of five or six? How do they expect that discontinuity in distributions to be overcome in any statistical analyses combining men and women? What basis was there in the literature to suggest that a brief, seemingly trivial manipulation of posture would have such enduring effects on hormones? Why did they specifically anticipate that differences would be registered in women? Overall, their involvement in this initiative demonstrates a willingness to commit considerable time and resources to ideas that could have been ruled out by a search of the relevant literature.”

“There seems to be a lemming quality to this large group of researchers pursuing some bad hypotheses with inappropriate methods. Why didn't this investigator have the independence of mind to object? Can we expect similar going along with the herd after fashionable research topics over the next few years?”

“While I appreciate the motivation of this investigator, I believe there was a violation of the basic principle of ‘stop and think before you undertake a study’ that does not bode well for how they will spend their time when faced with the demands of teaching and administration as well as doing research.”

Readers may think that these comments represent horrible, cruel sentiments and that it would be a great injustice if they influenced hiring and promotion decisions. But anyone who has ever been on a hiring and promotion committee knows that deliberations are full of such horrible comments and that such processes are not fair or just or even rational.


Creating illusions of wondrous effects of yoga and meditation on health: A skeptic exposes tricks

The tour of the sausage factory is starting; here's your brochure telling you what you'll see.

A recent review has received a lot of attention, with it being used to claim that mind-body interventions have distinct molecular signatures that point to potentially dramatic health benefits for those who take up these practices.

What Is the Molecular Signature of Mind–Body Interventions? A Systematic Review of Gene Expression Changes Induced by Meditation and Related Practices.  Frontiers in Immunology. 2017;8.

Few who are tweeting about this review or its press coverage are likely to have read it, or to understand it if they did read it. Most of the new-agey coverage on social media does nothing more than echo or amplify the message of the review's press release. Lazy journalists and bloggers can simply pass on direct quotes from the lead author or even just the press release's title, ‘Meditation and yoga can ‘reverse’ DNA reactions which cause stress, new study suggests’:

“These activities are leaving what we call a molecular signature in our cells, which reverses the effect that stress or anxiety would have on the body by changing how our genes are expressed.”

And

“Millions of people around the world already enjoy the health benefits of mind-body interventions like yoga or meditation, but what they perhaps don’t realise is that these benefits begin at a molecular level and can change the way our genetic code goes about its business.”

[The authors of this review actually identified some serious shortcomings of the studies they reviewed. I'll be getting to some excellent points at the end of this post that run quite counter to the hype. But the lead author's press release emphasized unwarranted positive conclusions about the health benefits of these practices. That is what is most popular in media coverage, especially from those who have stuff to sell.]

Interpretation of the press release and review authors’ claims requires going back to the original studies, which most enthusiasts are unlikely to do. If readers do go back, they will have trouble interpreting some of the deceptive claims that are made.

Yet, a lot is at stake. This review is being used to recommend mind-body interventions for people having or who are at risk of serious health problems. In particular, unfounded claims that yoga and mindfulness can increase the survival of cancer patients are sometimes hinted at, but occasionally made outright.

This blog post is written with the intent of protecting consumers from such false claims and providing tools so they can spot pseudoscience for themselves.

Discussion of the review in the media speaks broadly of alternative and complementary interventions. The coverage is aimed at inspiring confidence in this broad range of treatments and at encouraging people who are facing health crises to invest time and money in outright quackery. Seemingly benign recommendations for yoga, tai chi, and mindfulness (after all, what's the harm?) often become the entry point to more dubious and expensive treatments that substitute for established treatments. Once they are drawn to centers for integrative health care for classes, cancer patients are likely to spend hundreds or even thousands of dollars on other products and services that are unlikely to benefit them. One study reported:

More than 72 oral or topical, nutritional, botanical, fungal and bacterial-based medicines were prescribed to the cohort during their first year of IO care…Costs ranged from $1594/year for early-stage breast cancer to $6200/year for stage 4 breast cancer patients. Of the total amount billed for IO care for 1 year for breast cancer patients, 21% was out-of-pocket.

Coming up, I will take a skeptical look at the six randomized trials that were highlighted by this review.  But in this post, I will provide you with some tools and insights so that you do not have to make such an effort in order to make an informed decision.

Like many of the other studies cited in the review, these randomized trials were quite small and underpowered. But I will focus on the six because they are as good as it gets. Randomized trials are considered a higher form of evidence than simple observational studies or case reports. [It is too bad the authors of the review don't even highlight which studies are randomized trials. They are lumped with the others as "longitudinal studies."]

As a group, the six studies do not actually add any credibility to the claims that mind-body interventions – specifically yoga, tai chi, and mindfulness training or retreats – improve health by altering DNA. We can be no more confident with the trials than we would have been had they never been done.

I found the task of probing and interpreting the studies quite labor-intensive and ultimately unrewarding.

I had to get past poor reporting of what was actually done in the trials, to which patients, and with what results. My task often involved seeing through cover-ups, with authors exercising considerable flexibility in reporting what measures they actually collected and what analyses were attempted, before arriving at the best possible tale of the wondrous effects of these interventions.

Interpreting clinical trials should not be so hard, because they should be honestly and transparently reported, with a registered protocol that is adhered to. These reports of trials were sorely lacking. The full extent of the problems took some digging to uncover, but some things emerged before I got to the methods and results.

The introductions of these studies consistently exaggerated the strength of existing evidence for the effects of these interventions on health, even while somehow coming to the conclusion that this particular study was urgently needed and might even be the "first ever". The introductions to the six papers typically cross-referenced each other, without giving any indication of how poor the quality of the evidence from the other papers was. What a mutual admiration society these authors are.

One giveaway is how the introductions referred to the biggest, most badass, comprehensive, and well-done review, that of Goyal and colleagues.

That review clearly states that the evidence for the effects of mindfulness is of poor quality because of the lack of comparisons with credible active treatments. The typical randomized trial of mindfulness involves a comparison with no treatment, a waiting list, or patients remaining in routine care where the target problem is likely to be ignored. If we depend on the bulk of the existing literature, we cannot rule out the likelihood that any apparent benefits of mindfulness are due to having more positive expectations, attention, and support than simply getting nothing. Only a handful of the hundreds of trials of mindfulness include appropriate, active treatment comparison/control groups. The results of those studies are not encouraging.

One of the first things I do in probing the introduction of a study claiming health benefits for mindfulness is to see how the authors deal with the Goyal et al. review. Did the study cite it, and if so, how accurately? How did the authors deal with its message, which undermines claims of the uniqueness or specificity of any benefits of practicing mindfulness?

For yoga, we cannot yet rule out that it is no better than regular exercise – in groups or alone – with relaxing routines. The literature concerning tai chi is even smaller and of poorer quality, but there is the same need to show that practicing tai chi has any benefits over exercising in groups with comparable positive expectations and support.

Even more than mindfulness, yoga and tai chi attract a lot of pseudoscientific mumbo jumbo about integrating Eastern wisdom and Western science. We need to look past that and insist on evidence.

Like their introductions, the discussion sections of these articles are quite prone to exaggerating how strong and consistent the evidence is from existing studies. The discussion sections cherry-pick positive findings in the existing literature, sometimes recklessly distorting them. The authors then discuss how their own positively spun findings fit with what is already known, while minimizing or outright neglecting discussion of any of their negative findings. I was not surprised to see one trial of mindfulness for cancer patients obtain no effects on depressive symptoms or perceived stress, but then go on to argue that mindfulness might powerfully affect the expression of DNA.

If you want to dig into the details of these studies, the going can get rough and the yield for doing a lot of mental labor is low. For instance, these studies involved drawing blood and analyzing gene expression. Readers will inevitably encounter passages like:

In response to KKM treatment, 68 genes were found to be differentially expressed (19 up-regulated, 49 down-regulated) after adjusting for potentially confounded differences in sex, illness burden, and BMI. Up-regulated genes included immunoglobulin-related transcripts. Down-regulated transcripts included pro-inflammatory cytokines and activation-related immediate-early genes. Transcript origin analyses identified plasmacytoid dendritic cells and B lymphocytes as the primary cellular context of these transcriptional alterations (both p < .001). Promoter-based bioinformatic analysis implicated reduced NF-κB signaling and increased activity of IRF1 in structuring those effects (both p < .05).

Intimidated? Before you defer to the “experts” doing these studies, I will show you some things I noticed in the six studies and how you can debunk the relevance of these studies for promoting health and dealing with illness. Actually, I will show that even if these 6 studies got the results that the authors claimed (and they did not), at best the effects would be trivial and lost among the other things going on in patients' lives.

Fortunately, there are lots of signs that you can dismiss such studies and go on to something more useful, if you know what to look for.

Some general rules:

  1. Don't accept claims of efficacy/effectiveness based on underpowered randomized trials. Dismiss them. A reliable rule of thumb is to dismiss trials that have fewer than 35 patients in the smallest group. Such studies will miss true moderate-sized effects more than half the time, even when those effects are actually there.

Due to publication bias, most of the positive effects that are published from trials of this size will be false positives and won't hold up in well-designed, larger trials.

When significant positive effects from such trials are reported in published papers, they have to be large to have reached significance. If not outright false, these effect sizes won't be matched in larger trials. So, significant positive effect sizes from small trials are likely to be false positives, exaggerated, and unlikely to replicate. For that reason, we can consider small studies to be pilot or feasibility studies, but not sources of estimates of how large an effect size we should expect from a larger study. Investigators do it all the time, but they should not: They do power calculations estimating how many patients they need for a larger trial from the results of such small studies. No, no, no! A short calculation at the end of this rule makes the arithmetic concrete.

Having spent decades examining clinical trials, I am generally comfortable dismissing effect sizes that come from trials with fewer than 35 patients in the smallest group. I agree with the suggestion that if two larger trials are available in a given literature, go with those and ignore the smaller studies. If there are not at least two larger studies, keep the jury out on whether there is a significant effect.

Applying the Rule of 35, 5 of the 6 trials can be dismissed, and the sixth is ambiguous because of loss of patients to follow-up. If promoters of mind-body interventions want to convince us that these interventions have beneficial effects on physical health by conducting trials like these, they have to do better. None of the individual trials should increase our confidence in their claims. Collectively, the trials collapse in a mess without providing a single credible estimate of effect size. This attests to the poor quality of evidence and disrespect for methodology that characterize this literature.
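
Here are two quick calculations behind this rule, assuming two groups of 20 patients, a two-sided t-test at alpha = .05, and a true effect of Cohen's d = 0.5 when an effect exists (illustrative assumptions on my part, not figures taken from the six trials):

```python
# (a) The smallest observed effect such a small trial can declare "significant".
from scipy import stats

n = 20                                   # per-group size, illustrative
t_crit = stats.t.ppf(0.975, 2 * n - 2)   # critical t for two-sided alpha = .05
d_min = t_crit * (2 / n) ** 0.5          # minimum Cohen's d that reaches p < .05
print(f"Smallest significant d: {d_min:.2f}")  # about 0.64 -- necessarily large

# (b) Rough positive predictive value: among the significant results such trials
# produce, the share expected to reflect a real effect, given power of about
# 0.34 for a true d = 0.5 and an assumed prior probability that the hypothesis
# being tested is true.
alpha, power = 0.05, 0.34
for prior in (0.10, 0.25):
    ppv = power * prior / (power * prior + alpha * (1 - prior))
    print(f"Prior {prior:.2f}: PPV = {ppv:.2f}")  # about 0.43 and 0.69
# Publication bias and flexible analyses push the real-world share of false
# positives higher still.
```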

  2. Don't be taken in by titles to peer-reviewed articles that are themselves an announcement that these interventions work. Titles may not be telling the truth.

What I found extraordinary is that five of the six randomized trials had a title indicating that a positive effect was found. I suspect that most people encountering the title will not actually go on to read the study. So, they will be left with the false impression that positive results were indeed obtained. It's quite a clever trick to make the title of an article, by which most people will remember it, into a false advertisement for what was actually found.

For a start, we can simply remind ourselves that with these underpowered studies, investigators should not even be making claims about efficacy/effectiveness. So, one trick of the developing skeptic is to check whether the claims being made in the title fit with the size of the study. However, actually going to the results section, one can find other evidence of discrepancies between what was found and what is being claimed.

I think it's a general rule of thumb that we should be wary of titles for reports of randomized trials that declare results. Even when what is claimed in the title fits with the actual results, it often creates the illusion of a greater consistency with what already exists in the literature. Furthermore, even when future studies inevitably fail to replicate what is claimed in the title, the false claim lives on, because failing to replicate key findings is almost never a condition for retracting a paper.

  3. Check the institutional affiliations of the authors. These 6 trials serve as a depressing reminder that we can't rely on researchers' institutional affiliations or their having federal grants to reassure us of the validity of their claims. These authors are not from Quack-Quack University, and they get funding for their research.

In all cases, the investigators had excellent university affiliations, mostly in California. Most studies were conducted with some form of funding, often federal grants. A quick check of Google would reveal that at least one of the authors on a study, usually more, had federal funding.

  4. Check the conflicts of interest, but don't expect the declarations to be informative, and be skeptical of what you find. It is also disappointing that a check of conflict of interest statements for these articles would be unlikely to arouse the suspicion that the results that were claimed might have been influenced by financial interests. One cannot readily see that the studies were generally done in settings promoting alternative, unproven treatments that would benefit from the publicity generated by the studies. One cannot see that some of the authors have lucrative book contracts and speaking tours that require making claims for dramatic effects of mind-body treatments that could not possibly be supported by transparent reporting of the results of these studies. As we will see, one of the studies was actually conducted in collaboration with Deepak Chopra and with money from his institution. That would definitely raise flags in the skeptic community. But the dubious tie might be missed by patients and their families vulnerable to unwarranted claims and unrealistic expectations of what can be obtained outside of conventional medicine, like chemotherapy, surgery, and pharmaceuticals.

Based on what I found probing these six trials, I can suggest some further rules of thumb. (1) Don't assume for articles about health effects of alternative treatments that all relevant conflicts of interest are disclosed. Check the setting in which the study was conducted and whether an integrative [complementary and alternative, meaning mostly unproven] care setting was used for recruiting or running the trial. Not only would this represent potential bias on the part of the authors, it would represent selection bias in the recruitment of patients and in their responsiveness to placebo effects consistent with the marketing themes of these settings. (2) Google the authors and see if they have lucrative pop psychology book contracts, TED talks, or speaking gigs at positive psychology or complementary and alternative medicine gatherings. None of these lucrative activities are typically expected to be disclosed as conflicts of interest, but all require making strong claims that are not supported by available data. Such rewards are perverse incentives for authors to distort and exaggerate positive findings and to suppress negative findings in peer-reviewed reports of clinical trials. (3) Check and see if known quacks have prepared recruitment videos for the study, informing patients what will be found (Seriously, I was tipped off to look and I found that).

  5. Look for the usual suspects. A surprisingly small, tight, interconnected group is generating this research. You could look the authors up on Google or Google Scholar, or browse through my previous blog posts and see what I have said about them. As I will point out in my next blog, one got withering criticism for her claim that drinking carbonated sodas, but not sweetened fruit drinks, shortened your telomeres, so that drinking soda was worse than smoking. My colleagues and I re-analyzed the data of another of the authors. Contrary to what he claimed, we found no support for the claim that pursuing meaning, rather than pleasure, in your life affected gene expression related to immune function. We also showed that substituting randomly generated data worked as well as what he got from blood samples in replicating his original results. I don't think it is ad hominem to point out that both of these authors have a history of making implausible claims. It speaks to source credibility.
  6. Check and see if there is a trial registration for a study, but don't stop there. You can quickly check with PubMed whether a report of a randomized trial is registered. Trial registration is intended to ensure that investigators commit themselves to a primary outcome, or maybe two, so that readers can see whether that is what they emphasized in their paper. You can then check to see if what is said in the report of the trial fits with what was promised in the protocol. Unfortunately, I could find that only one of these trials was registered. The trial registration was vague on what outcome variables would be assessed and did not mention the outcome emphasized in the published paper (!). The registration also said the sample would be larger than what was reported in the published study. When researchers have difficulty with recruitment, their study is often compromised in other ways. I'll show how this study was compromised.

Well, it looks like applying these generally useful rules of thumb is not always so easy with these studies. I think the small sample sizes across all of the studies would be enough to decide that this research has yet to yield meaningful results and certainly does not support the claims that are being made.

But readers who are motivated to put in the time to probe deeper will come up with strong signs of p-hacking and questionable research practices.

  7. Check the report of the randomized trial and see if you can find any declaration of one or two primary outcomes and a limited number of secondary outcomes. What you will find instead is that the studies always have more outcome variables than patients receiving these interventions. The opportunities for cherry-picking positive findings and discarding the rest are huge, especially because it is so hard to assess what data were collected but not reported.
  8. Check and see if you can find tables of unadjusted primary and secondary outcomes. Honest and transparent reporting involves giving readers a look at simple statistics so they can decide if results are meaningful. For instance, if effects on stress and depressive symptoms are claimed, are the results impressive and clinically relevant? In almost all cases, there is no peeking allowed. Instead, authors provide analyses and statistics with lots of adjustments made. They break lots of rules in doing so, especially with such small samples. These authors are virtually assured of getting results to crow about.

Famously, Joe Simmons and Leif Nelson hilariously published claims that briefly listening to the Beatles' “When I'm 64” left students a year and a half younger than if they were assigned to listening to “Kalimba.” Simmons and Nelson knew this was nonsense, but their intent was to show what researchers can do if they have free rein in how they analyze their data and what they report. They revealed the tricks they used, but those tricks were minor league and amateurish compared to what the authors of these trials consistently did in claiming that yoga, tai chi, and mindfulness modified expression of DNA.

Stay tuned for my next blog post, where I go through the six studies. But consider this if you or a loved one have to make an immediate decision about whether to plunge into the world of woo-woo, unproven medicine in hopes of altering DNA expression: I will show that the authors of these studies did not get the results they claimed. But who should care if they did? The effects were laughably trivial. As the authors of the review about which I have been complaining noted:

One other problem to consider are the various environmental and lifestyle factors that may change gene expression in similar ways to MBIs [Mind-Body Interventions]. For example, similar differences can be observed when analyzing gene expression from peripheral blood mononuclear cells (PBMCs) after exercise. Although at first there is an increase in the expression of pro-inflammatory genes due to regeneration of muscles after exercise, the long-term effects show a decrease in the expression of pro-inflammatory genes (55). In fact, 44% of interventions in this systematic review included a physical component, thus making it very difficult, if not impossible, to discern between the effects of MBIs from the effects of exercise. Similarly, food can contribute to inflammation. Diets rich in saturated fats are associated with pro-inflammatory gene expression profile, which is commonly observed in obese people (56). On the other hand, consuming some foods might reduce inflammatory gene expression, e.g., drinking 1 l of blueberry and grape juice daily for 4 weeks changes the expression of the genes related to apoptosis, immune response, cell adhesion, and lipid metabolism (57). Similarly, a diet rich in vegetables, fruits, fish, and unsaturated fats is associated with anti-inflammatory gene profile, while the opposite has been found for Western diet consisting of saturated fats, sugars, and refined food products (58). Similar changes have been observed in older adults after just one Mediterranean diet meal (59) or in healthy adults after consuming 250 ml of red wine (60) or 50 ml of olive oil (61). However, in spite of this literature, only two of the studies we reviewed tested if the MBIs had any influence on lifestyle (e.g., sleep, diet, and exercise) that may have explained gene expression changes.

How about taking tango lessons instead? You would at least learn dance steps, get exercise, and decrease any social isolation. And there might even be more benefits than from taking up these other activities.


Calling out pseudoscience, radically changing the conversation about Amy Cuddy’s power posing paper

Part 1: Reviewed as the clinical trial that it is, the power posing paper should never have been published.

Has too much already been written about Amy Cuddy’s power pose paper? The conversation should not be stopped until its focus shifts and we change our ways of talking about psychological science.

The dominant narrative is now that a junior scientist published an influential paper on power posing and was subject to harassment and shaming by critics, pointing to the need for greater civility in scientific discourse.

Attention has shifted away from the scientific quality of the paper and the dubious products the paper has been used to promote, and onto the behavior of its critics.

Amy Cuddy and powerful allies are given forums to attack and vilify critics, accusing them of damaging the environment in which science is done and discouraging prospective early career investigators from entering the field.

Meanwhile, Amy Cuddy commands large speaking fees and has a top-selling book claiming the original paper provides strong science for simple behavioral manipulations altering mind-body relations and producing socially significant behavior.

This misrepresentation of psychological science does potential harm to consumers and the reputation of psychology among lay persons.

This blog post is intended to restart the conversation with a reconsideration of the original paper as a clinical and health psychology randomized trial (RCT) and, on that basis, to identify the kinds of inferences that are warranted from it.

In the first of a two-post series, I argue that:

The original power pose article in Psychological Science should never have been published.

-Basically, we have a therapeutic analog intervention delivered in two 1-minute manipulations by unblinded experimenters who had flexibility in what they did, what they communicated to participants, and which data they chose to analyze and how.

-It's unrealistic to expect that two 1-minute behavioral manipulations would have robust and reliable effects on salivary cortisol or testosterone 17 minutes later.

-It’s absurd to assume that the hormones mediated changes in behavior in this context.

-If Amy Cuddy retreats to the idea that she is simply manipulating “felt power,” we are solidly in the realm of trivial nonspecific and placebo effects.

The original power posing paper

Carney DR, Cuddy AJ, Yap AJ. Power posing: Brief nonverbal displays affect neuroendocrine levels and risk tolerance. Psychological Science. 2010 Oct 1;21(10):1363-8.

The Psychological Science article can be construed as reporting a brief mind-body intervention consisting of two 1-minute behavioral manipulations. Central to the attention that the paper attracted is the argument that this manipulation affected psychological state and social performance via its effects on the neuroendocrine system.

The original study is, in effect, a disguised randomized clinical trial (RCT) of a biobehavioral intervention. Once this is recognized, a host of standards can come into play for reporting this study and interpreting the results.

CONSORT

All major journals and publishers, including the Association for Psychological Science, have adopted the Consolidated Standards of Reporting Trials (CONSORT). Any submission of a manuscript reporting a clinical trial is required to be accompanied by a checklist indicating that the article reports particular details of how the trial was conducted. Item 1 on the checklist specifies that both the title and abstract indicate the study was a randomized trial. This is important and intended to aid readers in evaluating the study, but also for the study to be picked up in systematic searches for reviews that depend on screening of titles and abstracts.

I can find no evidence that Psychological Science adheres to CONSORT. For instance, my colleagues and I provided a detailed critique of a widely promoted study of loving-kindness meditation that was published in Psychological Science the same year as Cuddy's power pose study. We noted that it was actually a poorly reported null trial with switched outcomes. With that recognition, we went on to identify serious conceptual, methodological, and statistical problems. After overcoming considerable resistance, we were able to publish a muted version of our critique. Apparently the reviewers of the original paper had failed to evaluate it in terms of its being an RCT.

The submission of the completed CONSORT checklist has become routine in most journals considering manuscripts for studies of clinical and health psychology interventions. Yet, additional CONSORT requirements that developed later about what should be included in abstracts are largely being ignored.

It would be unfair to single out Psychological Science and the Cuddy article for noncompliance to CONSORT for abstracts. However, the checklist can be a useful frame of reference for noting just how woefully inadequate the abstract was as a report of a scientific study.

CONSORT for abstracts

Hopewell S, Clarke M, Moher D, Wager E, Middleton P, Altman DG, Schulz KF, CONSORT Group. CONSORT for reporting randomized controlled trials in journal and conference abstracts: explanation and elaboration. PLOS Medicine. 2008 Jan 22;5(1):e20.

Journal and conference abstracts should contain sufficient information about the trial to serve as an accurate record of its conduct and findings, providing optimal information about the trial within the space constraints of the abstract format. A properly constructed and well-written abstract should also help individuals to assess quickly the validity and applicability of the findings and, in the case of abstracts of journal articles, aid the retrieval of reports from electronic databases.

Even if CONSORT for abstracts did not exist, we could argue that readers, starting with the editor and reviewers, were faced with an abstract making extraordinary claims that required better substantiation. A lack of basic details prevented them from evaluating these claims.

In effect, the abstract reduces the study to an experimercial for products about to be marketed in corporate talks and workshops, but let's persist in evaluating it as the abstract of a scientific study.

Humans and other animals express power through open, expansive postures, and they express powerlessness through closed, contractive postures. But can these postures actually cause power? The results of this study confirmed our prediction that posing in high-power nonverbal displays (as opposed to low-power nonverbal displays) would cause neuroendocrine and behavioral changes for both male and female participants: High-power posers experienced elevations in testosterone, decreases in cortisol, and increased feelings of power and tolerance for risk; low-power posers exhibited the opposite pattern. In short, posing in displays of power caused advantaged and adaptive psychological, physiological, and behavioral changes, and these findings suggest that embodiment extends beyond mere thinking and feeling, to physiology and subsequent behavioral choices. That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications.

I don’t believe I have ever encountered in an abstract the extravagant claims with which this abstract concludes. But readers are not provided any basis for evaluating the claim until the Methods section. Undoubtedly, many holding opinions about the paper did not read that far.

Namely:

Forty-two participants (26 females and 16 males) were randomly assigned to the high-power-pose or low-power-pose condition.

Testosterone levels were in the normal range at both Time 1 (M = 60.30 pg/ml, SD = 49.58) and Time 2 (M = 57.40 pg/ml, SD = 43.25). As would be suggested by appropriately taken and assayed samples (Schultheiss & Stanton, 2009), men were higher than women on testosterone at both Time 1, F(1, 41) = 17.40, p < .001, r = .55, and Time 2, F(1, 41) = 22.55, p < .001, r = .60. To control for sex differences in testosterone, we used participant’s sex as a covariate in all analyses. All hormone analyses examined changes in hormones observed at Time 2, controlling for Time 1. Analyses with cortisol controlled for testosterone, and vice versa.2

Too small a study to provide an effect size

Hold on! First, only 42 participants (26 females and 16 males) would readily be recognized as insufficient for an RCT, particularly in an area of research without past RCTs.

After decades of witnessing the accumulation of strong effect sizes from underpowered studies, many of us have reacted by requiring 35 participants per group as the minimum acceptable level for a generalizable effect size. Actually, that could be an overly liberal criterion. Why?

Many RCTs are underpowered, yet a lack of enforcement of preregistration allows positive results by redefining the primary outcomes after results are known. A psychotherapy trial with 30 or fewer patients in the smallest cell has less than a 50% probability of detecting a moderate-sized significant effect, even if it is present (Coyne, Thombs, & Hagedoorn, 2010). Yet an examination of the studies mustered for treatments deemed evidence-supported by APA Division 12 ( http://www.div12.org/empirically-supported-treatments/ ) indicates that many studies were too underpowered to be reliably counted as evidence of efficacy, but were included without comment about this problem. Taking an overview, it is striking the extent to which the literature continues to depend on small, methodologically flawed RCTs conducted by investigators with strong allegiances to one of the treatments being evaluated. Yet which treatment is preferred by investigators is a better predictor of the outcome of the trial than the specific treatment being evaluated (Luborsky et al., 2006).

Earlier, my colleagues and I had argued for the non-accumulative nature of evidence from small RCTs:

Kraemer, Gardner, Brooks, and Yesavage (1998) propose excluding small, underpowered studies from meta-analyses. The risk of including studies with inadequate sample size is not limited to clinical and pragmatic decisions being made on the basis of trials that cannot demonstrate effectiveness when it is indeed present. Rather, Kraemer et al. demonstrate that inclusion of small, underpowered trials in meta-analyses produces gross overestimates of effect size due to substantial, but unquantifiable confirmatory publication bias from non-representative small trials. Without being able to estimate the size or extent of such biases, it is impossible to control for them. Other authorities voice support for including small trials, but generally limit their argument to trials that are otherwise methodologically adequate (Sackett & Cook, 1993; Schulz & Grimes, 2005). Small trials are particularly susceptible to common methodological problems…such as lack of baseline equivalence of groups; undue influence of outliers on results; selective attrition and lack of intent-to-treat analyses; investigators being unblinded to patient allotment; and not having a pre-determined stopping point so investigators are able to stop a trial when a significant effect is present.

In the power posing paper, sex was controlled for in all analyses because a peek at the data revealed baseline sex differences in testosterone dwarfing any other differences. What do we make of investigators conducting a study that depends on testosterone mediating a behavioral manipulation who did not anticipate large baseline sex differences in testosterone?

In a PubPeer comment leading up to this post, I noted:

We are then told “men were higher than women on testosterone at both Time 1, F(1, 41) = 17.40, p < .001, r = .55, and Time 2, F(1, 41) = 22.55, p < .001, r = .60. To control for sex differences in testosterone, we used participant’s sex as a covariate in all analyses. All hormone analyses examined changes in hormones observed at Time 2, controlling for Time 1. Analyses with cortisol controlled for testosterone, and vice versa.”

The findings alluded to in the abstract should be recognizable as weird and uninterpretable. Most basically, how could the 16 males be distributed across the two groups so that the authors could confidently say that differences held for both males and females? Especially when all analyses control for sex? Sex is highly correlated with testosterone, and so an analysis that controlled for both variables, sex and testosterone, would probably not generalize to testosterone without such controls.

We are never given the basic statistics in the paper to independently assess what the authors are doing, not even the correlation between cortisol and testosterone, only differences in Time 2 cortisol controlling for Time 1 cortisol, Time 1 testosterone, and gender. These multivariate statistics are not very generalizable in a sample of 42 participants distributed across 2 groups. Certainly not for the 26 females and 16 males taken separately.
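
To make concrete what this kind of adjustment looks like, here is a minimal sketch of the sort of model the paper describes, with hypothetical variable and file names (the authors' data and analysis code are not available):

```python
# Sketch of the adjusted model described in the paper: Time 2 testosterone
# regressed on condition, controlling for Time 1 testosterone, Time 1 cortisol,
# and sex. All column and file names here are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("power_pose_data.csv")  # hypothetical data file

model = smf.ols(
    "testosterone_t2 ~ C(condition) + testosterone_t1 + cortisol_t1 + C(sex)",
    data=df,
).fit()
print(model.summary())

# With 42 participants split across two conditions, and sex nearly collinear
# with baseline testosterone, the adjusted condition coefficient rests on very
# few degrees of freedom, which is why such estimates are unlikely to
# generalize beyond this sample.
```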

The behavioral manipulation

The original paper reports:

Participants’ bodies were posed by an experimenter into high-power or low-power poses. Each participant held two poses for 1 min each. Participants’ risk taking was measured with a gambling task; feelings of power were measured with self-reports. Saliva samples, which were used to test cortisol and testosterone levels, were taken before and approximately 17 min after the power-pose manipulation.

And then elaborates:

To configure the test participants into the poses, the experimenter placed an electrocardiography lead on the back of each participant’s calf and underbelly of the left arm and explained, “To test accuracy of physiological responses as a function of sensor placement relative to your heart, you are being put into a certain physical position.” The experimenter then manually configured participants’ bodies by lightly touching their arms and legs. As needed, the experimenter provided verbal instructions (e.g., “Keep your feet above heart level by putting them on the desk in front of you”). After manually configuring participants’ bodies into the two poses, the experimenter left the room. Participants were videotaped; all participants correctly made and held either two high-power or two low-power poses for 1 min each. While making and holding the poses, participants completed a filler task that consisted of viewing and forming impressions of nine faces.

The behavioral task and subjective self-report assessment

Measure of risk taking and powerful feelings. After they finished posing, participants were presented with the gambling task. They were endowed with $2 and told they could keep the money—the safe bet—or roll a die and risk losing the $2 for a payoff of $4 (a risky but rational bet; odds of winning were 50/50). Participants indicated how “powerful” and “in charge” they felt on a scale from 1 (not at all) to 4 (a lot).

An imagined bewildered review from someone accustomed to evaluating clinical trials

Although the authors don't seem to know what they're doing, we have an underpowered therapy analogue study with extraordinary claims. It's unconvincing that two 1-minute behavioral manipulations would change subsequent psychological states and behavior with any extralaboratory implications.

The manipulation poses a puzzle to research participants, challenging them to figure out what is being asked of them. The $2 gambling task presumably is meant to simulate effects on real-world behavior. But the low stakes could mean that participants believed the task evaluated whether they "got" the purpose of the intervention and behaved accordingly. Within that perspective, the unvalidated subjective self-report rating scale would serve as a clue to the intentions of the experimenter and an opportunity for participants to show they were smart. The manipulation of putting participants into a low-power pose is even more unconvincing as a contrasting active intervention or a control condition. Claims that this manipulation did anything but communicate experimenter expectancies are even less credible.

This is a very weak form of evidence: a therapy analogue study with a brief, low-intensity behavioral manipulation followed by assessments of outcomes that might just inform participants of what they needed to do to look smart (i.e., demand characteristics). Add in that the experimenters were unblinded and undoubtedly had flexibility in how they delivered the intervention and what they said to participants. As a grossly underpowered trial, the study cannot make a contribution to the literature, and it certainly cannot provide a credible effect size.

Furthermore, if the authors had even a basic understanding of gender differences in social status or sex differences in testosterone, they would have stratified the study with respect to participant gender, not attempted to obtain control by post hoc statistical manipulation.

I could comment on signs of p-hacking and widespread signs of inappropriate naming, use, and interpretation of statistics, but why bother? There are no vital signs of a publishable paper here.

Is power posing salvaged by fashionable hormonal measures?

Perhaps the skepticism of the editor and reviewers was overcome by the introduction of mind-body explanations of what some salivary measures supposedly showed. Otherwise, we would be left with a single subjective self-report measure and a behavioral task susceptible to demand characteristics and nonspecific effects.

We recognize that the free availability of powerful statistical packages risks people using them without any idea of the appropriateness of their use or interpretation. The same observation should be made of the ready availability of means of collecting spit samples from research participants to be sent off to outside laboratories for biochemical analysis.

The clinical health psychology literature is increasingly filled with studies incorporating easily collected saliva samples intended to establish that psychological interventions influence mind-body relations. These have become particularly common in attempts to demonstrate that mindfulness meditation and even tai chi can have beneficial effects on physical health and even cancer outcomes.

Such measures are often inaccurately described as "biomarkers," rather than merely as biological measurements, and seldom is anything learned from including them that is generalizable within participants or across studies.

Let’s start with salivary-based cortisol measures.

A comprehensive review suggests that:

  • A single measurement on a participant, or a pre-post pair of assessments, would not be informative.
  • Single measurements are unreliable, and large intra- and inter-individual differences not attributable to the intervention can be in play.
  • Minor variations in experimental procedures can have large, unwanted effects.
  • The current standard is assessing the cortisol awakening response and the diurnal slope over more than one day, which would not make sense for the effects of two 1-minute behavioral manipulations.
  • Even with sophisticated measurement strategies there is low agreement across and even within studies and low agreement with behavioral and self-report data.
  • The idea that collecting saliva samples would serve the function the investigators intended is an unscientific but attractive illusion.

Another relevant comprehensive theoretical review and synthesis of cortisol reactivity was available at the time the power pose study was planned. The article identifies no basis for anticipating that experimenters putting participants into a 1-minute expansive pose would lower cortisol, and certainly no basis for assuming that putting participants into a 1-minute slumped position would raise cortisol, or for saying what such findings could possibly mean.

But we are clutching at straws. The authors’ interpretations of their hormonal data depend on bizarre post hoc decisions about how to analyze their data in a small sample in which participant sex is treated in incomprehensible  fashion. The process of trying to explain spurious results risks giving the results a credibility that authors have not earned for them. And don’t even try to claim we are getting signals of hormonal mediation from this study.

Another system failure: The incumbent advantage given to a paper that should not have been published.

Even when publication is based on inadequate editorial oversight and review, any likelihood of correction is diminished once published results have been blessed as “peer reviewed” and accorded an incumbent advantage over whatever follows.

A succession of editors have protected the power pose paper from post-publication peer review. Postpublication review has been relegated to other journals and social media, including PubPeer and blogs.

Soon after publication of the power pose paper, a critique was submitted to Psychological Science, but it was desk rejected. The editor informally communicated to the author that the critique read like a review and that the original article had already been peer reviewed.

The critique by Steven J. Stanton nonetheless eventually appeared in Frontiers in Behavioral Neuroscience and is worth a read.

Stanton took seriously the science being invoked in the claims of the power pose paper.

A sampling:

Carney et al. (2010) collapsed over gender in all testosterone analyses. Testosterone conforms to a bimodal distribution when including both genders (see Figure 13; Sapienza et al., 2009). Raw testosterone cannot be considered a normally distributed dependent or independent variable when including both genders. Thus, Carney et al. (2010) violated a basic assumption of the statistical analyses that they reported, because they used raw testosterone from pre- and post-power posing as independent and dependent variables, respectively, with all subjects (male and female) included.

And

Mean cortisol levels for all participants were reported as 0.16 ng/mL pre-posing and 0.12 ng/mL post-posing, thus showing that for all participants there was an average decrease of 0.04 ng/mL from pre- to post-posing, regardless of condition. Yet, Figure 4 of Carney et al. (2010) shows that low-power posers had mean cortisol increases of roughly 0.025 ng/mL and high-power posers had mean cortisol decreases of roughly 0.03 ng/mL. It is unclear given the data in Figure 4 how the overall cortisol change for all participants could have been a decrease of 0.04 ng/mL.

Another editor of Psychological Science received a critical comment from Marcus Crede and Leigh A. Phillips. After the first round of reviews, Crede and Phillips removed references to changes between the published power pose paper and earlier drafts that they had received from the first author, Dana Carney. However, Crede and Phillips withdrew their critique when asked to respond to a review by Amy Cuddy in a second resubmission.

The critique is now forthcoming in Social Psychological and Personality Science:

Revisiting the Power Pose Effect: How Robust Are the Results Reported by Carney, Cuddy and Yap (2010) to Data Analytic Decisions

The article investigates the effects of the data analytic choices made in the original paper, that is, the opportunities for p-hacking. An excerpt from the abstract:

In this paper we use multiverse analysis to examine whether the findings reported in the original paper by Carney, Cuddy, and Yap (2010) are robust to plausible alternative data analytic specifications: outlier identification strategy; the specification of the dependent variable; and the use of control variables. Our findings indicate that the inferences regarding the presence and size of an effect on testosterone and cortisol are  highly sensitive to data analytic specifications. We encourage researchers to routinely explore the influence of data analytic choices on statistical inferences and also encourage editors and  reviewers to require explicit examinations of the influence of alternative data analytic  specifications on the inferences that are drawn from data.

Dana Carney, the first author of the paper, has now posted an explanation of why she no longer believes the originally reported findings are genuine and why “the evidence against the existence of power poses is undeniable.” She discloses a number of important confounds and important “researcher degrees of freedom” in the analyses reported in the published paper.

Coming Up Next

A different view of Amy Cuddy’s TED talk in terms of its selling of pseudoscience to consumers and its acknowledgment of a strong debt to Cuddy’s adviser Susan Fiske.

A disclosure of some of the financial interests that distort discussion of the scientific flaws of the power pose.

How the reflexive response of the replicationados inadvertently reinforced the illusion that the original power pose study provided meaningful effect sizes.

How Amy Cuddy and her allies marshalled the resources of the Association for Psychological Science to vilify and intimidate critics of bad science and of the exploitation of consumers by psychological pseudoscience.

How journalists played into this vilification.

What needs to be done to avoid a future fiasco for psychology like the power pose phenomenon and protect reformers of the dissemination of science.

Note: Time to reiterate that all opinions expressed here are solely those of Coyne of the Realm and not necessarily of PLOS blogs, PLOS One or his other affiliations.

Why PhD students should not evaluate a psychotherapy for their dissertation project

  • Things some clinical and health psychology students wish they had known before they committed themselves to evaluating a psychotherapy for their dissertation study.
  • A well designed pilot study addressing feasibility and acceptability issues in conducting and evaluating psychotherapies is preferable to an underpowered study which won’t provide a valid estimate of the efficacy of the intervention.
  • PhD students would often be better off as research parasites – making use of existing published data – rather than attempting to organize their own original psychotherapy study, if their goal is to contribute meaningfully to the literature and patient care.
  • Reading this blog, you will encounter a link to free, downloadable software that allows you to make quick determinations of the number of patients needed for an adequately powered psychotherapy trial.

I so relish the extra boost of enthusiasm that many clinical and health psychology students bring to their PhD projects. They not only want to complete a thesis of which they can be proud, they want their results to be directly applicable to improving the lives of their patients.

Many students are particularly excited about a new psychotherapy about which extravagant claims are being made that it’s better than its rivals.

I have seen lots of fads and fashions come and go: third wave, new wave, and no wave therapies. When I was a PhD student, progressive relaxation was in. Then it died, mainly because it was so boring for the therapists who had to mechanically provide it. Client-centered therapy was fading amid doubts that anyone else could achieve the results of Carl Rogers, or that his three facilitative conditions of unconditional positive regard, genuineness, and accurate empathy were actually distinguishable enough to study. Gestalt therapy was supercool because of the charisma of Fritz Perls, who distracted us with his showmanship from the utter lack of evidence for its efficacy.

I hate to see PhD students demoralized when their grand plans prove unrealistic.  Inevitably, circumstances force them to compromise in ways that limit any usefulness to their project, and maybe even threaten their getting done within a reasonable time period. Overly ambitious plans are the formidable enemy of the completed dissertation.

The numbers are stacked against a PhD student conducting an adequately powered evaluation of a new psychotherapy.

This blog post argues against PhD students taking on the evaluation of a new therapy in comparison to an existing one, if they expect to complete their projects and make a meaningful contribution to the literature and to patient care.

I’ll be drawing on some straightforward analysis done by Pim Cuijpers to identify what PhD students are up against when trying to demonstrate that any therapy is better than treatments that are already available.

Pim has literally done dozens of meta-analyses, mostly of treatments for depression and anxiety. He commands particular credibility, given the quality of this work. The way Pim and his colleagues present a meta-analysis is so straightforward and transparent that you can readily examine the basis of what he says.

Disclosure: I collaborated with Pim and a group of other authors in conducting a meta-analysis as to whether psychotherapy was better than a pill placebo. We drew on all the trials allowing a head-to-head comparison, even though nobody ever really set out to pit the two conditions against each other as their first agenda.

Pim tells me that the brief and relatively obscure letter on which I will draw, New Psychotherapies for Mood and Anxiety Disorders: Necessary Innovation or Waste of Resources?, is among his most unpopular pieces of work. Lots of people don’t like its inescapable message. But I think that if PhD students pay attention to it, they might avoid a lot of pain and disappointment.

But first…

Note how many psychotherapies have been claimed to be effective for depression and anxiety. Anyone trying to make sense of this literature has to contend with claims based on a lot of underpowered trials, too small in sample size to reasonably detect the effects that investigators claim, and otherwise compromised by methodological limitations.

Some investigators were simply naïve about clinical trial methodology and the difficulties of doing research with clinical populations. They may not have understood statistical power.

But many psychotherapy studies end up in bad shape because the investigators were unrealistic about the feasibility of what they were undertaking and the low likelihood that they could recruit patients in the numbers they had planned in the time they had allotted. After launching the trial, they had to change strategies for recruitment, maybe relax their selection criteria, or even change the treatment so it was less demanding of patients’ time. And they had to make difficult judgments about what features of the trial to drop when resources ran out.

Declaring a psychotherapy trial to be a “preliminary” or a “pilot study” after things go awry

The titles of more than a few articles reporting psychotherapy trials contain the apologetic qualifier after a colon: “a preliminary study” or “a pilot study”. But the studies weren’t intended at the outset to be preliminary or pilot studies. The investigators are making excuses post hoc – after the fact – for not having been able to recruit sufficient numbers of patients and for having had to compromise their design from what they had originally planned. The best they can hope for is that the paper will somehow be useful in promoting further research.

Too many studies from which effect sizes are entered into meta-analyses should have been left as pilot studies and not considered tests of the efficacy of treatments. The rampant problem in the psychotherapy literature is that almost no one treats small-scale trials as mere pilot studies. In a recent blog post, I provided readers with some simple screening rules to identify meta-analyses of psychotherapy studies that they could dismiss from further consideration. One was whether there were sufficient numbers of adequately powered studies. Often there are not.

Readers take the inflated claims from small studies seriously, when these estimates should be seen as unrealistic and unlikely to be replicated, given the studies’ sample sizes. The large effect sizes that are claimed are likely the product of p-hacking and the confirmation bias required to get published. With enough alternative outcome variables to choose from and enough flexibility in analyzing and interpreting data, almost any intervention can be made to look good.

The problem is readily seen in the extravagant claims about acceptance and commitment therapy (ACT), which are so heavily dependent on small, under-resourced studies supervised by promoters of ACT that should not have been used to generate effect sizes.

Back to Pim Cuijpers’ brief letter. He argues, based on his numerous meta-analyses, that it is unlikely that a new treatment will be substantially more effective than an existing credible, active treatment. There are some exceptions, like relaxation training versus cognitive behavior therapy for some anxiety disorders, but mostly only small differences of no more than d = .20 are found between two active, credible treatments. If you search the broader literature, you can find occasional exceptions like CBT versus psychoanalysis for bulimia, but most of those you find prove to be false positives, usually based on investigator bias in conducting and interpreting a small, underpowered study.

You can see this yourself with the freely downloadable G*Power program by plugging in d = 0.20 and calculating the number of patients needed for a study. To be safe, add more patients to allow for the expectable 25% dropout rate that has occurred across trials. The number you get would require a larger study than has ever been done in the past, including the well-financed NIMH Collaborative trial.

[Figure: G*Power sample size analyses]
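For readers who would rather script the calculation than point and click, here is a minimal sketch of the same arithmetic in Python, using statsmodels’ TTestIndPower. The two-tailed α = .05 and 80% power settings are conventional choices of mine, not specifications taken from Pim’s letter.

```python
# Patients needed to detect d = 0.20 between two active, credible treatments,
# assuming a two-tailed alpha of .05 and 80% power (conventional assumptions).
from math import ceil

from statsmodels.stats.power import TTestIndPower

d = 0.20
n_per_group = TTestIndPower().solve_power(effect_size=d, alpha=0.05,
                                          power=0.80, alternative='two-sided')
print(ceil(n_per_group))                    # roughly 394 per group
print(ceil(n_per_group / (1 - 0.25)))       # roughly 525 per group after allowing for 25% dropout
print(2 * ceil(n_per_group / (1 - 0.25)))   # over 1,000 patients to recruit for a two-arm trial
```

Over a thousand patients to recruit for a simple two-arm comparison, which is exactly why so few comparative trials of active treatments can support the claims made for them.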

Even more patients would be needed for the ideal situation in which a third comparison group allowed the investigator to show that the active comparison treatment had actually performed better than a nonspecific treatment, delivered with the same effectiveness it had shown in earlier trials. Otherwise, a defender of the established therapy might argue that the older treatment had not been properly implemented.

So, unless warned off, the PhD student plans a study to show not only that the null hypothesis can be rejected that the new treatment is no better than the existing one, but also that in the same study the existing treatment is better than a wait list. Oh my, just try to find an adequately powered, properly analyzed example of a comparison of two active treatments plus a control comparison group in the existing published literature. The few examples of three-group designs in which a new psychotherapy came out better than an effectively implemented existing treatment are grossly underpowered.

These calculations so far have all been based on what would be needed to reject the null hypothesis of no difference between the new treatment and a more established one. But if the claim is that the new treatment is superior to the existing treatment, our PhD student now needs to conduct a superiority trial in which some criterion is pre-set (such as an advantage greater than a moderate difference, d = .30) and the null hypothesis is that the advantage of the new treatment is less than that. We are now way out into the fantasyland of breakthrough, but uncompleted, dissertation studies.
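To see just how far into fantasyland, here is a rough sketch of the arithmetic for a superiority-by-margin test. The assumed true advantage of d = 0.50 and the one-sided α = .05 / 80% power settings are illustrative assumptions of mine; the standard shortcut is to power a one-sided test on the gap between the assumed true effect and the pre-set margin.

```python
# Rough sample size for a superiority-by-margin trial:
# H0: advantage <= 0.30 vs. H1: advantage > 0.30, powered on the gap between
# an (optimistically) assumed true advantage of d = 0.50 and the margin.
# All settings here are illustrative assumptions, not taken from the post.
from math import ceil

from statsmodels.stats.power import TTestIndPower

margin = 0.30
assumed_true_advantage = 0.50
effective_d = assumed_true_advantage - margin   # 0.20

n_per_group = TTestIndPower().solve_power(effect_size=effective_d, alpha=0.05,
                                          power=0.80, alternative='larger')
print(ceil(n_per_group))   # roughly 310 per group, before any dropout allowance
```

And that is under the rosy assumption that the new therapy really does beat the established one by half a standard deviation; with a more realistic true advantage, the required numbers explode.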

Two take away messages

The first take away message is that we should be skeptical of claims that a new treatment is better than past ones, except when the claim occurs in a well-designed study with some assurance that it is free of investigator bias. But the claim also has to arise in a trial larger than almost any psychotherapy study that has ever been done. Yup, most comparative psychotherapy studies are underpowered, and we cannot expect claims that one treatment is superior to another to be robust.

But for PhD students doing a dissertation project, the second take away message is that they should not attempt to show that one treatment is superior to another in the absence of resources they probably don’t have.

The psychotherapy literature does not need another study with too few patients to support its likely exaggerated claims.

An argument can be made that it is unfair and even unethical to enroll patients in a psychotherapy RCT with an insufficient sample size. Some of the patients will be randomized to a control condition that is not what attracted them to the trial. All of the patients will be denied the chance of having been in a trial that makes a meaningful contribution to the literature and to better care for patients like themselves.

What should the clinical or health psychology PhD student do, besides maybe curb their enthusiasm? One opportunity to make a meaningful contribution to the literature is by conducting small studies testing hypotheses that can lead to improvements in the feasibility or acceptability of treatments to be tested in studies with more resources.

Think of what would’ve been accomplished if PhD students had determined in modest studies that it is tough to recruit and retain patients in an Internet therapy study without some communication to the patients that they are involved in a human relationship – without them having what Pim Cuijpers calls supportive accountability. Patients may stay involved with the Internet treatment when it proves frustrating only because they have the support and accountability to someone beyond their encounter with an impersonal computer. Somewhere out there, there is a human being who supports them and sticking it out with the Internet psychotherapy and will be disappointed if they don’t.

A lot of resources have been wasted in Internet therapy studies in which patients have not been convinced that what they’re doing is meaningful or that they have the support of a human being. They drop out or fail to do diligently any homework expected of them.

Similarly, mindfulness studies are routinely being conducted without anyone establishing that patients actually practice mindfulness in everyday life or what they would need in order to do so more consistently. The assumption is that patients assigned to mindfulness diligently practice it daily. A PhD student could make a valuable contribution to the literature by examining the rates at which patients actually practice mindfulness when they have been assigned to it in a psychotherapy study, along with the barriers and facilitators of their doing so. A discovery that the patients are not consistently practicing mindfulness might explain weaker findings than anticipated. One could even suggest that any apparent effects of practicing mindfulness were actually nonspecific: patients getting caught up in the enthusiasm of being offered a treatment they had sought, but not actually practicing mindfulness.

An unintended example: How not to recruit cancer patients for a psychological intervention trial

Sometimes PhD students just can’t be dissuaded from undertaking an evaluation of a psychotherapy. I was a member of a PhD committee of a student who at least produced a valuable paper concerning how not to recruit cancer patients for a trial evaluating problem-solving therapy, even though the project fell far short of conducting an adequately powered study.

The PhD student was aware that claims of effectiveness of problem-solving therapy reported in the prestigious Journal of Consulting and Clinical Psychology were exaggerated. The developer of problem-solving therapy for cancer patients (and current JCCP editor) claimed a huge effect size: 3.8 if only the patient were involved in treatment and an even better 4.4 if the patient had an opportunity to involve a relative or friend as well. Effect sizes from this trial have subsequently had to be excluded from meta-analyses as extreme outliers (1, 2, 3, 4).

The student adopted the much more conservative assumption that a moderate effect size of .6 would be obtained in comparison with a waitlist control. You can use G*Power to see that 50 patients would be needed per group, 60 if allowance is made for dropouts.
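Here is the same kind of check in Python rather than G*Power. The power level is an assumption on my part: at 85% power the calculation reproduces the figures quoted above, while at the conventional 80% it returns roughly 45 per group.

```python
# Patients needed per group for a moderate effect (d = 0.60) versus a waitlist
# control. The two-tailed alpha of .05 and 85% power are my assumptions; the
# dissertation's exact settings are not reported here.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.60, alpha=0.05,
                                          power=0.85, alternative='two-sided')
print(round(n_per_group))          # roughly 50 per group
print(round(n_per_group * 1.2))    # roughly 60 once ~20% dropout is built in
```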

Such a basically inert control group, of course, has a greater likelihood of seeming to demonstrate that a treatment is effective than a comparison with another active treatment. But such a control group also does not allow a determination of whether it was the active ingredient of the treatment that made the difference, or just the attention, positive expectations, and support that were not available in the waitlist control group.

But PhD students should have the same option as their advisors to contribute another comparison between an active treatment and a waitlist control to the literature, even if it does not advance our knowledge of psychotherapy. They can take the same low road to a successful career that so many others have traveled.

This particular student was determined to make a different contribution to the literature. Notoriously, studies of psychotherapy with cancer patients often fail to recruit samples that are distressed enough to register any effect. The typical breast cancer patient, for instance, who seeks to enroll in a psychotherapy or support group trial does not have clinically significant distress. The prevalence of positive effects claimed for interventions with cancer patients in the published literature likely reflects confirmation bias.

The student wanted to address this issue by limiting the patients she enrolled in the study to those with clinically significant distress. Enlisting colleagues, she set up screening of consecutive cancer patients in the oncology units of local hospitals. Patients were first screened for self-reported distress and, if they were distressed, for whether they were interested in services. Those who met both criteria were then re-contacted to see if they would be willing to participate in a psychological intervention study, without the intervention being identified. As I reported in the previous blog post:

  • Combining results of  the two screenings, 423 of 970 patients reported distress, of whom 215 patients indicated need for services.
  • Only 36 (4% of 970) patients consented to trial participation.
  • We calculated that 27 patients needed to be screened to recruit a single patient, with 17 hours of time required for each patient recruited.
  • 41% (n = 87) of the 215 distressed patients who had indicated a need for services later said that they had no need for psychosocial services, mainly because they felt better or thought that their problems would disappear naturally.
  • Finally, 36 patients were eligible and willing to be randomized, representing 17% of 215 distressed patients with a need for services.
  • This represents 8% of all 423 distressed patients, and 4% of 970 screened patients.

So, the PhD student’s heroic effort did not yield the sample size that she anticipated. But she ended up making a valuable contribution to the literature that challenges a basic assumption being made about cancer patients in psychotherapy research: that all or most are distressed. She also ended up producing some valuable evidence that the minority of cancer patients who report psychological distress are not necessarily interested in psychological interventions.

Fortunately, she had been prepared to collect systematic data about these research questions, not just scramble within a collapsing effort at a clinical trial.

Becoming a research parasite as an alternative to PhD students attempting an under-resourced study of their own

Psychotherapy trials represent an enormous investment of resources, not only the public funding that is often provided for them, but also the time, inconvenience, and exposure to ineffective treatments experienced by patients who participate in the trials. Increasingly, funding agencies require that investigators who get money to do a psychotherapy study at some point make their data available for others to use. The 14 prestigious medical journals whose editors make up the International Committee of Medical Journal Editors (ICMJE) each published earlier in 2016 a declaration that:

there is an ethical obligation to responsibly share data generated by interventional clinical trials because participants have put themselves at risk.

These statements proposed that as a condition for publishing a clinical trial, investigators would be required to share with others appropriately de-identified data not later than six months after publication. Further, the statements proposed that investigators describe their plans for sharing data in the registration of trials.

Of course, a proposal is only exactly that, a proposal, and these requirements were intended to take effect only after the document is circulated and ratified. The incomplete and inconsistent adoption of previous proposals for registering trials in advance and for investigators declaring conflicts of interest does not encourage a lot of enthusiasm that we will see uniform implementation of this bold proposal anytime soon.

Some editors of medical journals are already expressing alarm over the prospect of data sharing becoming required. The editors of the New England Journal of Medicine were lambasted in social media for raising worries about “research parasites” exploiting the availability of data:

a new class of research person will emerge — people who had nothing to do with the design and execution of the study but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited. There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”

Richard Lehman’s Journal Review at The BMJ’s blog delivered a brilliantly sarcastic response to these concerns, which concludes:

I think we need all the data parasites we can get, as well as symbionts and all sorts of other creatures which this ill-chosen metaphor can’t encompass. What this piece really shows, in my opinion, is how far the authors are from understanding and supporting the true opportunities of clinical data sharing.

However, lost in all the outrage that The New England Journal of Medicine editorial generated was a more conciliatory proposal at the end:

How would data sharing work best? We think it should happen symbiotically, not parasitically. Start with a novel idea, one that is not an obvious extension of the reported work. Second, identify potential collaborators whose collected data may be useful in assessing the hypothesis and propose a collaboration. Third, work together to test the new hypothesis. Fourth, report the new findings with relevant coauthorship to acknowledge both the group that proposed the new idea and the investigative group that accrued the data that allowed it to be tested. What is learned may be beautiful even when seen from close up.

The PLOS family of journals has gone on record as requiring that all data for papers published in their journals be publicly available without restriction. A February 24, 2014 announcement, PLOS’ New Data Policy: Public Access to Data, declared:

In an effort to increase access to this data, we are now revising our data-sharing policy for all PLOS journals: authors must make all data publicly available, without restriction, immediately upon publication of the article. Beginning March 3rd, 2014, all authors who submit to a PLOS journal will be asked to provide a Data Availability Statement, describing where and how others can access each dataset that underlies the findings. This Data Availability Statement will be published on the first page of each article.

Many of us are aware of the difficulties in achieving this lofty goal. I am holding my breath and turning blue, waiting for some specific data.

The BMJ has expanded their previous requirements for data being available:

Loder E, Groves T. The BMJ requires data sharing on request for all trials. BMJ. 2015 May 7;350:h2373.

The movement to make data from clinical trials widely accessible has achieved enormous success, and it is now time for medical journals to play their part. From 1 July The BMJ will extend its requirements for data sharing to apply to all submitted clinical trials, not just those that test drugs or devices. The data transparency revolution is gathering pace.

I am no longer heading dissertation committees after one that I am currently supervising is completed. But if any PhD students asked my advice about a dissertation project concerning psychotherapy, I would strongly encourage them to enlist their advisor to identify and help them negotiate access to a data set appropriate to the research questions they want to investigate.

Most well-resourced psychotherapy trials have unpublished data concerning how they were implemented, with what biases, and with which patient groups ending up underrepresented or inadequately exposed to the intensity of treatment presumed to be needed for benefit. A story waits to be told. The data available from a published trial are usually much more adequate than any a graduate student could collect with the limited resources available for a dissertation project.

I look forward to the day when such data is put into a repository where anyone can access it.

In this blog post I have argued that PhD students should not take on responsibility for developing and testing a new psychotherapy for their dissertation project. I think that using data from existing published trials is a much better alternative. However, PhD students may currently find it difficult, though certainly not impossible, to get appropriate data sets. I certainly am not recruiting them to be front-line infantry in advancing the cause of routine data sharing. But they can make an effort to obtain such data, and they deserve all the support they can get from their dissertation committees, both in obtaining data sets and in recognizing, realistically, when data are not being made available even though availability was promised as a condition for publishing. Advisors, please request the data from published trials for your PhD students and protect them from the heartache of trying to collect such data themselves.