Creating illusions of wondrous effects of yoga and meditation on health: A skeptic exposes tricks

The tour of the sausage factory is starting, here’s your brochure telling you’ll see.


A recent review has received a lot of attention with it being used for claims that mind-body interventions have distinct molecular signatures that point to potentially dramatic health benefits for those who take up these practices.

What Is the Molecular Signature of Mind–Body Interventions? A Systematic Review of Gene Expression Changes Induced by Meditation and Related Practices.  Frontiers in Immunology. 2017;8.

Few who are tweeting about this review or its press coverage are likely to have read it or to understand it, if they read it. Most of the new agey coverage in social media does nothing more than echo or amplify the message of the review’s press release.  Lazy journalists and bloggers can simply pass on direct quotes from the lead author or even just the press release’s title, ‘Meditation and yoga can ‘reverse’ DNA reactions which cause stress, new study suggests’:

“These activities are leaving what we call a molecular signature in our cells, which reverses the effect that stress or anxiety would have on the body by changing how our genes are expressed.”


“Millions of people around the world already enjoy the health benefits of mind-body interventions like yoga or meditation, but what they perhaps don’t realise is that these benefits begin at a molecular level and can change the way our genetic code goes about its business.”

[The authors of this review actually identified some serious shortcomings to the studies they reviewed. I’ll be getting to some excellent points at the end of this post that run quite counter to the hype. But the lead author’s press release emphasized unwarranted positive conclusions about the health benefits of these practices. That is what is most popular in media coverage, especially from those who have stuff to sell.]

Interpretation of the press release and review authors’ claims requires going back to the original studies, which most enthusiasts are unlikely to do. If readers do go back, they will have trouble interpreting some of the deceptive claims that are made.

Yet, a lot is at stake. This review is being used to recommend mind-body interventions for people having or who are at risk of serious health problems. In particular, unfounded claims that yoga and mindfulness can increase the survival of cancer patients are sometimes hinted at, but occasionally made outright.

This blog post is written with the intent of protecting consumers from such false claims and providing tools so they can spot pseudoscience for themselves.

Discussion in the media of the review speaks broadly of alternative and complementary interventions. The coverage is aimed at inspiring  confidence in this broad range of treatments and to encourage people who are facing health crises investing time and money in outright quackery. Seemingly benign recommendations for yoga, tai chi, and mindfulness (after all, what’s the harm?) often become the entry point to more dubious and expensive treatments that substitute for established treatments.  Once they are drawn to centers for integrative health care for classes, cancer patients are likely to spend hundreds or even thousands on other products and services that are unlikely to benefit them. One study reported:

More than 72 oral or topical, nutritional, botanical, fungal and bacterial-based medicines were prescribed to the cohort during their first year of IO care…Costs ranged from $1594/year for early-stage breast cancer to $6200/year for stage 4 breast cancer patients. Of the total amount billed for IO care for 1 year for breast cancer patients, 21% was out-of-pocket.

Coming up, I will take a skeptical look at the six randomized trials that were highlighted by this review.  But in this post, I will provide you with some tools and insights so that you do not have to make such an effort in order to make an informed decision.

Like many of the other studies cited in the review, these randomized trials were quite small and underpowered. But I will focus on the six because they are as good as it gets. Randomized trials are considered a higher form of evidence than simple observational studies or case reports [It is too bad the authors of the review don’t even highlight what studies are randomized trials. They are lumped with others as “longitudinal studies.]

As a group, the six studies do not actually add any credibility to the claims that mind-body interventions – specifically yoga, tai chi, and mindfulness training or retreats improve health by altering DNA.  We can be no more confident with what the trials provide than we would be without them ever having been done.

I found the task of probing and interpreting the studies quite labor-intensive and ultimately unrewarding.

I had to get past poor reporting of what was actually done in the trials, to which patients, and with what results. My task often involved seeing through cover ups with authors exercising considerable flexibility in reporting what measures were they actually collected and what analyses were attempted, before arriving at the best possible tale of the wondrous effects of these interventions.

Interpreting clinical trials should not be so hard, because they should be honestly and transparently reported and have a registered protocol and stick to it. These reports of trials were sorely lacking, The full extent of the problems took some digging to uncover, but some things emerged before I got to the methods and results.

The introductions of these studies consistently exaggerated the strength of existing evidence for the effects of these interventions on health, even while somehow coming to the conclusion that this particular study was urgently needed and it might even be the “first ever”. The introductions to the six papers typically cross-referenced each other, without giving any indication of how poor quality the evidence was from the other papers. What a mutual admiration society these authors are.

One giveaway is how the introductions  referred to the biggest, most badass, comprehensive and well-done review, that of Goyal and colleagues.

That review clearly states that the evidence for the effects of mindfulness is poor quality because of the lack of comparisons with credible active treatments. The typical randomized trial of mindfulness involves a comparison with no-treatment, a waiting list, or patients remaining in routine care where the target problem is likely to be ignored.  If we depend on the bulk of the existing literature, we cannot rule out the likelihood that any apparent benefits of mindfulness are due to having more positive expectations, attention, and support over simply getting nothing.  Only a handful  of hundreds of trials of mindfulness include appropriate, active treatment comparison/control groups. The results of those studies are not encouraging.

One of the first things I do in probing the introduction of a study claiming health benefits for mindfulness is see how they deal with the Goyal et al review. Did the study cite it, and if so, how accurately? How did the authors deal with its message, which undermines claims of the uniqueness or specificity of any benefits to practicing mindfulness?

For yoga, we cannot yet rule out that it is better than regular exercising – in groups or alone – having relaxing routines. The literature concerning tai chi is even smaller and poorer quality, but there is the same need to show that practicing tai chi has any benefits over exercising in groups with comparable positive expectations and support.

Even more than mindfulness, yoga and tai chi attract a lot of pseudoscientific mumbo jumbo about integrating Eastern wisdom and Western science. We need to look past that and insist on evidence.

Like their introductions, the discussion sections of these articles are quite prone to exaggerating how strong and consistent the evidence is from existing studies. The discussion sections cherry pick positive findings in the existing literature, sometimes recklessly distorting them. The authors then discuss how their own positively spun findings fit with what is already known, while minimizing or outright neglecting discussion of any of their negative findings. I was not surprised to see one trial of mindfulness for cancer patients obtain no effects on depressive symptoms or perceived stress, but then go on to explain mindfulness might powerfully affect the expression of DNA.

If you want to dig into the details of these studies, the going can get rough and the yield for doing a lot of mental labor is low. For instance, these studies involved drawing blood and analyzing gene expression. Readers will inevitably encounter passages like:

In response to KKM treatment, 68 genes were found to be differentially expressed (19 up-regulated, 49 down-regulated) after adjusting for potentially confounded differences in sex, illness burden, and BMI. Up-regulated genes included immunoglobulin-related transcripts. Down-regulated transcripts included pro-inflammatory cytokines and activation-related immediate-early genes. Transcript origin analyses identified plasmacytoid dendritic cells and B lymphocytes as the primary cellular context of these transcriptional alterations (both p < .001). Promoter-based bioinformatic analysis implicated reduced NF-κB signaling and increased activity of IRF1 in structuring those effects (both p < .05).

Intimidated? Before you defer to the “experts” doing these studies, I will show you some things I noticed in the six studies and how you can debunk the relevance of these studies for promoting health and dealing with illness. Actually, I will show that even if these 6 studies got the results that the authors claimed- and they did not- at best, the effects would trivial and lost among the other things going on in patients’ lives.

Fortunately, there are lots of signs that you can dismiss such studies and go on to something more useful, if you know what to look for.

Some general rules:

  1. Don’t accept claims of efficacy/effectiveness based on underpowered randomized trials. Dismiss them. The rule of thumb is reliable to dismiss trials that have less than 35 patients in the smallest group. Over half the time, true moderate sized effects will be missed in such studies, even if they are actually there.

Due to publication bias, most of the positive effects that are published from such sized trials will be false positives and won’t hold up in well-designed, larger trials.

When significant positive effects from such trials are reported in published papers, they have to be large to have reached significance. If not outright false, these effect sizes won’t be matched in larger trials. So, significant, positive effect sizes from small trials are likely to be false positives and exaggerated and probably won’t replicate. For that reason, we can consider small studies to be pilot or feasibility studies, but not as providing estimates of how large an effect size we should expect from a larger study. Investigators do it all the time, but they should not: They do power calculations estimating how many patients they need for a larger trial from results of such small studies. No, no, no!

Having spent decades examining clinical trials, I am generally comfortable dismissing effect sizes that come from trials with less than 35 patients in the smaller group. I agree with a suggestion that if there are two larger trials are available in a given literature, go with those and ignore the smaller studies. If there are not at least two larger studies, keep the jury out on whether there is a significant effect.

Applying the Rule of 35, 5 of the 6 trials can be dismissed and the sixth is ambiguous because of loss of patients to follow up.  If promoters of mind-body interventions want to convince us that they have beneficial effects on physical health by conducting trials like these, they have to do better. None of the individual trials should increase our confidence in their claims. Collectively, the trials collapse in a mess without providing a single credible estimate of effect size. This attests to the poor quality of evidence and disrespect for methodology that characterizes this literature.

  1. Don’t be taken in by titles to peer-reviewed articles that are themselves an announcement that these interventions work. Titles may not be telling the truth.

What I found extraordinary is that five of the six randomized trials had a title that indicating a positive effect was found. I suspect that most people encountering the title will not actually go on to read the study. So, they will be left with the false impression that positive results were indeed obtained. It’s quite a clever trick to make the title of an article, by which most people will remember it, into a false advertisement for what was actually found.

For a start, we can simply remind ourselves that with these underpowered studies, investigators should not even be making claims about efficacy/effectiveness. So, one trick of the developing skeptic is to confirm that the claims being made in the title don’t fit with the size of the study. However, actually going to the results section one can find other evidence of discrepancies between what was found in what is being claimed.

I think it’s a general rule of thumb that we should be careful of titles for reports of randomized that declare results. Even when what is claimed in the title fits with the actual results, it often creates the illusion of a greater consistency with what already exists in the literature. Furthermore, even when future studies inevitably fail to replicate what is claimed in the title, the false claim lives on, because failing to replicate key findings is almost never a condition for retracting a paper.

  1. Check the institutional affiliations of the authors. These 6 trials serve as a depressing reminder that we can’t go on researchers’ institutional affiliation or having federal grants to reassure us of the validity of their claims. These authors are not from Quack-Quack University and they get funding for their research.

In all cases, the investigators had excellent university affiliations, mostly in California. Most studies were conducted with some form of funding, often federal grants.  A quick check of Google would reveal from at least one of the authors on a study, usually more, had federal funding.

  1. Check the conflicts of interest, but don’t expect the declarations to be informative. But be skeptical of what you find. It is also disappointing that a check of conflict of interest statements for these articles would be unlikely to arouse the suspicion that the results that were claimed might have been influenced by financial interests. One cannot readily see that the studies were generally done settings promoting alternative, unproven treatments that would benefit from the publicity generated from the studies. One cannot see that some of the authors have lucrative book contracts and speaking tours that require making claims for dramatic effects of mind-body treatments could not possibly be supported by: transparent reporting of the results of these studies. As we will see, one of the studies was actually conducted in collaboration with Deepak Chopra and with money from his institution. That would definitely raise flags in the skeptic community. But the dubious tie might be missed by patients in their families vulnerable to unwarranted claims and unrealistic expectations of what can be obtained outside of conventional medicine, like chemotherapy, surgery, and pharmaceuticals.

Based on what I found probing these six trials, I can suggest some further rules of thumb. (1) Don’t assume for articles about health effects of alternative treatments that all relevant conflicts of interest are disclosed. Check the setting in which the study was conducted and whether it was in an integrative [complementary and alternative, meaning mostly unproven.] care setting was used for recruiting or running the trial. Not only would this represent potential bias on the part of the authors, it would represent selection bias in recruitment of patients and their responsiveness to placebo effects consistent with the marketing themes of these settings.(2) Google authors and see if they have lucrative pop psychology book contracts, Ted talks, or speaking gigs at positive psychology or complementary and alternative medicine gatherings. None of these lucrative activities are typically expected to be disclosed as conflicts of interest, but all require making strong claims that are not supported by available data. Such rewards are perverse incentives for authors to distort and exaggerate positive findings and to suppress negative findings in peer-reviewed reports of clinical trials. (3) Check and see if known quacks have prepared recruitment videos for the study, informing patients what will be found (Serious, I was tipped off to look and I found that).

  1. Look for the usual suspects. A surprisingly small, tight, interconnected group is generating this research. You could look the authors up on Google or Google Scholar or  browse through my previous blog posts and see what I have said about them. As I will point out in my next blog, one got withering criticism for her claim that drinking carbonated sodas but not sweetened fruit drinks shortened your telomeres so that drinking soda was worse than smoking. My colleagues and I re-analyzed the data of another of the authors. We found contrary to what he claimed, that pursuing meaning, rather than pleasure in your life, affected gene expression related to immune function. We also showed that substituting randomly generated data worked as well as what he got from blood samples in replicating his original results. I don’t think it is ad hominem to point out a history for both of the authors of making implausible claims. It speaks to source credibility.
  1. Check and see if there is a trial registration for a study, but don’t stop there. You can quickly check with PubMed if a report of a randomized trial is registered. Trial registration is intended to ensure that investigators commit themselves to a primary outcome or maybe two and whether that is what they emphasized in their paper. You can then check to see if what is said in the report of the trial fits with what was promised in the protocol. Unfortunately, I could find only one of these was registered. The trial registration was vague on what outcome variables would be assessed and did not mention the outcome emphasized in the published paper (!). The registration also said the sample would be larger than what was reported in the published study. When researchers have difficulty in recruitment, their study is often compromised in other ways. I’ll show how this study was compromised.

Well, it looks like applying these generally useful rules of thumb is not always so easy with these studies. I think the small sample size across all of the studies would be enough to decide this research has yet to yield meaningful results and certainly does not support the claims that are being made.

But readers who are motivated to put in the time of probing deeper come up with strong signs of p-hacking and questionable research practices.

  1. Check the report of the randomized trial and see if you can find any declaration of one or two primary outcomes and a limited number of secondary outcomes. What you will find instead is that the studies always have more outcome variables than patients receiving these interventions. The opportunities for cherry picking positive findings and discarding the rest are huge, especially because it is so hard to assess what data were collected but not reported.
  1. Check and see if you can find tables of unadjusted primary and secondary outcomes. Honest and transparent reporting involves giving readers a look at simple statistics so they can decide if results are meaningful. For instance, if effects on stress and depressive symptoms are claimed, are the results impressive and clinically relevant? Almost in all cases, there is no peeking allowed. Instead, authors provide analyses and statistics with lots of adjustments made. They break lots of rules in doing so, especially with such a small sample. These authors are virtually assured to get results to crow about.

Famously, Joe Simmons and Leif Nelson hilariously published claims that briefly listening to the Beatles’ “When I’m 64” left students a year and a half older younger than if they were assigned to listening to “Kalimba.”  Simmons and Leif Nelson knew this was nonsense, but their intent was to show what researchers can do if they have free reign with how they analyze their data and what they report and  . They revealed the tricks they used, but they were so minor league and amateurish compared to what the authors of these trials consistently did in claiming that yoga, tai chi, and mindfulness modified expression of DNA.

Stay tuned for my next blog post where I go through the six studies. But consider this, if you or a loved one have to make an immediate decision about whether to plunge into the world of woo woo unproven medicine in hopes of  altering DNA expression. I will show the authors of these studies did not get the results they claimed. But who should care if they did? Effects were laughably trivial. As the authors of this review about which I have been complaining noted:

One other problem to consider are the various environmental and lifestyle factors that may change gene expression in similar ways to MBIs [Mind-Body Interventions]. For example, similar differences can be observed when analyzing gene expression from peripheral blood mononuclear cells (PBMCs) after exercise. Although at first there is an increase in the expression of pro-inflammatory genes due to regeneration of muscles after exercise, the long-term effects show a decrease in the expression of pro-inflammatory genes (55). In fact, 44% of interventions in this systematic review included a physical component, thus making it very difficult, if not impossible, to discern between the effects of MBIs from the effects of exercise. Similarly, food can contribute to inflammation. Diets rich in saturated fats are associated with pro-inflammatory gene expression profile, which is commonly observed in obese people (56). On the other hand, consuming some foods might reduce inflammatory gene expression, e.g., drinking 1 l of blueberry and grape juice daily for 4 weeks changes the expression of the genes related to apoptosis, immune response, cell adhesion, and lipid metabolism (57). Similarly, a diet rich in vegetables, fruits, fish, and unsaturated fats is associated with anti-inflammatory gene profile, while the opposite has been found for Western diet consisting of saturated fats, sugars, and refined food products (58). Similar changes have been observed in older adults after just one Mediterranean diet meal (59) or in healthy adults after consuming 250 ml of red wine (60) or 50 ml of olive oil (61). However, in spite of this literature, only two of the studies we reviewed tested if the MBIs had any influence on lifestyle (e.g., sleep, diet, and exercise) that may have explained gene expression changes.

How about taking tango lessons instead? You would at least learn dance steps, get exercise, and decrease any social isolation. And so what if there were more benefits than taking up these other activities?



Pay $1000 to criticize a bad ‘blood test for depression’ article?

pay to play-1No way, call for retraction.

Would you pay $1,000 for the right to criticize bad science in the journal in which it originally appeared? That is what it costs to participate in postpublication peer review at the online Nature Publishing Group (NPG) journal, Translational Psychiatry.

Damn, NPG is a high-fashion brand, but peer review is quite fallible, even at an NPG npgxJournal. Should we have to pay to point out the flawed science that even NPG inevitably delivers? You’d think we were doing them a favor in terms of quality control.

Put differently, should the self-correction on which scientific progress so thoroughly depends require critics be willing to pay, presumably out of their own personal funds? Sure, granting agencies now reimburse publication costs for the research they fund, but a critique is unlikely to qualify.

Take another perspective: Suppose you have asmall data set of patients for whom you have blood samples.  The limited value of the data set was further comporimsed by substantial, nonrandom loss to follow-up. But you nonetheless want to use it to solicit industry funding for a “blood test for depression.” Would you be willing to pay a premium of $3,600-$3,900 to publish your results in a prestigious NPG journal, with the added knowledge that it would be insulated from critics?

I was curious just who would get so worked up about an article that they would pay $1,000 to complain.

So, I put Translational Psychiatry in PUBLICATION NAME at Web of Science. It yielded 379 entries. I then applied the restriction CORRESPONDENCE and that left only two entries.

Both were presenting original data and did not even cite another article in Translational Psychiatry.  Maybe the authors were trying to get a publication into an NPG journal on the cheap, at a discount of $2,600.

P2PinvestIt appears that nobody has ever published a letter to the editor in Translational Psychiatry. Does that mean that there has never ever been anything about which to complain? Is everything we find in Translational Psychiatry perfectly trustworthy?

I recently posted at Mind the Brain and elsewhere about a carefully-orchestrated media campaign promoting some bad science published in Translational Psychiatry. An extraordinary publicity effort disseminated a Northwestern University press release and video to numerous media outlets. There was an explicit appeal for industry funding for the development of what was supposedly a nearly clinic-ready inexpensive blood test for depression.

The Translational Psychiatry website where I learned of these publication costs displays the standard NPG message that becomes mocking by a paywall that effectively blocks critics:

“A key strength of NPG is its close relationship with the scientific community. Working closely with scientists, listening to what they say, and always placing emphasis on quality rather than quantity, has made NPG the leading scientific publisher at finding innovative solutions to scientists’ information needs.”

The website also contains the standard NPG assurances about authors’ disclosures of conflicts of interest:

“The statement must contain an explicit and unambiguous statement describing any potential conflict of interest, or lack thereof, for any of the authors as it relates to the subject of the report”

The authors of this particular paper declared:

“EER is named as an inventor on two pending patent applications, filed and owned by Northwestern University. The remaining authors declare no conflict of interest.”

Does this disclosure give readers much clarity concerning the authors’ potential financial conflict of interest? Check out this marketing effort exploiting the Translational Psychiatry article.

Northwestern Researchers Develop RT-qPCR Assay for Depression Biomarkers, Seek Industry Partners

I have also raised questions about a lack of disclosures of conflicts of interest from promoters of Triple P Parenting. The developers claimed earlier that their program was owned by the University of Queensland, so there was no conflict of interest to declare. Further investigation  of the university website revealed that the promoters got a lucrative third of proceeds. Once that was revealed, a flood of erratum notices disclosing the financial conflicts of interest of Triple P promoters followed – at least 10 so far. For instance

triple P erratum PNG
Please Click to Enlarge

How bad is the bad science?

You can find the full Translational Psychiatry article here. The abstract provides a technical but misleading summary of results:

“Abundance of the DGKA, KIAA1539 and RAPH1 transcripts remained significantly different between subjects with MDD and ND controls even after post-CBT remission (defined as PHQ-9 <5). The ROC area under the curve for these transcripts demonstrated high discriminative ability between MDD and ND participants, regardless of their current clinical status. Before CBT, significant co-expression network of specific transcripts existed in MDD subjects who subsequently remitted in response to CBT, but not in those who remained depressed. Thus, blood levels of different transcript panels may identify the depressed from the nondepressed among primary care patients, during a depressive episode or in remission, or follow and predict response to CBT in depressed individuals.”

This was simplified in a press release that echoed in shamelessly churnalized media coverage. For instance:

“If the levels of five specific RNA markers line up together, that suggests that the patient will probably respond well to cognitive behavioral therapy, Redei said. “This is the first time that we can predict a response to psychotherapy,” she added.”

The unacknowledged problems of the article began with the authors only having 32 depressed primary-care patients at baseline and their diagnostic status not having been  confirmed by gold standard semi-structured interviews by professionals.

But the problems get worse. For the critical comparison of patients who recovered in cognitive behavioral therapy versus those that did not occurred in the subsample of nine recovered versus 13 unrecovered patients remaining after a loss-to-follow-up of 10 patients. Baseline results for the 9 +13= 22 patients in the follow-up sample did not even generalize back to the original full sample. How, then, could the authors argue that the results apply to the 23 million or so depressed patients in the United States? Well, they apparently felt they could better-generalize back to the original sample, if not the United States, by introducing an analysis of covariance that controlled for age, race and sex.  (For those of you who are tracking the more technical aspects of this discussion, contemplate the implications of controlling for three variables in a between-groups comparison of nine versus 13 patients. Apparently the authors believe that readers would accept the adjusted analyses in place of the unadjusted analyses which had obvious problems of generalizability. The reviewers apparently accepted this.).

Finally, treatment with cognitive behavior therapy was confounded with uncontrolled treatment with antidepressants.

I won’t discuss here the other problems of the study noted in my earlier blog posts. But I think you can see that these are analyses of a small data set truly unsuitable for publication in Translational Psychiatry and serving as a basis for seeking industry funding for a blood test for depression.

As I sometimes do, I tried to move from blog posts about what I considered problematic to a formal letter to the editor to which the authors would have an opportunity to reply. It was then that I discovered the publication costs.

So what are the alternatives to a letter to the editor?

Letters to the editor are a particularly weak form of post-publication peer review. There is little evidence that they serve as an effective self-correction mechanism for science. Letters to the editor seldom alter the patterns of citations of the articles about which they complain.

pay to paly2Even if I paid the $1,000 fee, I would only have been entitled to 700 words to make my case that this article is scientifically flawed and misleading. I’m not sure that a similar fee would be required from the authors to reply. Maybe responding to critics is part of the original package that they purchased from NPG. We cannot tell from what appears in the journal because the necessity of responding to a critic has not yet occurred.

It is quite typical across journals, even those not charging for a discussion of published papers, to limit the exchanges to a single letter per correspondent and a single response from the authors. And the window for acceptance of letters is typically limited to a few weeks or months after an article has appeared. While letters to the editor are often peer-reviewed, replies from authors typically do not receive peer review.

A different outcome, maybe

I recently followed up blogging about the serious flaws of a paper published in PNAS by Fredrickson and colleagues with a letter to the editor. They in turn responded. Compare our letters to see why the uninformed reader might infer that only confusion had been generated by either of them. But stay tuned…

The two letters would have normally ended any exchange.

However, this time my co-authors and I thoroughly re-analyzed the Fredrickson et al data and PNAS allowed us to publish our results. This time, we did not mince words:

“Not only is Fredrickson et al.’s article conceptually deficient, but more crucially statistical analyses are fatally flawed, to the point that their claimed results are in fact essentially meaningless.”

In the supplementary materials, we provided in excruciating detail our analytic strategy and results. The authors’ response was again dismissive and confusing.

The authors next refused our offer for an adversarial collaboration in which both parties would lay responses to each other with a mediator in order to allow readers to reach some resolution. However, the strengths of our arguments and reanalysis – which included thousands of regression equations, some with randomly generated data – are such others have now calling for a retraction of the original Fredrickson and Cole paper. If that occurs, it would be an extraordinarily rare event.

Limits on journals impose on post-peer-review commentary severely constrain the ability of science to self-correct.

The Reproducibility Project: Psychology is widely being hailed as a needed corrective for the crisis of credibility in science. But replications of studies such as this one involving pre-post sampling of genomic expression from an intervention trial are costly and unlikely to be undertaken. And why attempt a “replication” of findings that have no merit in the first place? After all, the authors’ results for baseline assessments did not replicate in the baseline results of patients still available at follow-up. That suggests a table problem, and that attempts at replication would be futile.

plosThe PLOS journals have introduced the innovation of allowing comments to be placed directly at the journal article’s webpage, with their existence acknowledged on the article itself. Anyone can respond and participate in a post-publication peer review process that can go on for the life of the interest in a particular article. The next stage in furthering post-publication peer review is that such comments be indexed and citable and counted in traditional metrics, as well as altmetrics. This would recognize citizen scientists’ contributions to cleaning up what appears to be a high rate of false positives and outright nonsense in the current literature.

pubmedcommonsPubMed Commons offers the opportunity to post comments on any of the over 23 million entries in PubMed, expanding the PLOS initiative to all journals, even those of the Nature Publishing Group. Currently, the only restriction is that someone attempting to place a comment have authored any of the 23,000,000+ entries in PubMed, even a letter to the editor. This represents progress.

But similar to the PLOS initiative, PubMed Commons will get more traction when it can provide conventional academic credit –countable citations– to contributors identifying and critiquing bad science. Currently authors can get credit for putting bad science into the literature that no one can get for  helping getting it recognized as such.

So, the authors of this particular article have made indefensibly bad claims about having made substantial progress toward developing an inexpensive blood test for depression. It’s not unreasonable to assume their motive is to cultivate financial support from industry for further development. What’s a critic to do?

In this case, the science is bad enough and the damage to the public and professionals’ perception of the state of the science of ‘blood test for depression’ sufficient for  a retraction is warranted. Stay tuned – unless Nature Publishing Group requires a $1,000 payment for investigating whether an article warrants retraction.

Postscript: As I was finishing this post, I discovered that the journals published by the  Modern Language Society requires  payment of a $3,000 membership fee to publish a letter to the editor in one of their journals. I guess they need to keep the discussion within the club. 

Views expressed in this blog post are entirely those of the author and not necessarily those of PLOS or its staff.

Special thanks to Skeptical Cat.skeptical sleuth cat 8 30 14-1

How to critique claims of a “blood test for depression”

Special thanks to Ghassan El-baalbaki and John Stewart for their timely assistance. Much appreciated.

“I hope it is going to result in licensing, investing, or any other way that moves it forward…If it only exists as a paper in my drawer, what good does it do?” – Eva Redei, PhD, first author.

video screenshotMedia coverage of an article in Translational Psychiatry uniformly passed on the authors’ extravagant claims in a press release from Northwestern University that declared that a simple blood test for depression had been found. That is, until I posted a critique of these claims at my secondary blog. As seen on Twitter, the tide of opinion suddenly shifted and considerable skepticism was expressed.

I am now going to be presenting a thorough critique of the article itself. More importantly,translational psychiatry I will be pointing to how, with some existing knowledge and basic tools, many of you can learn to critically examine the credibility of such claims that will inevitably arise in the future. Biomarkers for depression are a hot topic, and John Ioannidis has suggested that means a lot of exaggerated claims about flawed studies are more likely to be the result than real progress.

The article can be downloaded here and the Northwestern University press release here. When I last blogged about this article, I had not seen the 1:58 minute video that is embedded in the press release. I encourage you to view it before my critique and then view it again if you believe that it has any remaining credibility. I do not know where the dividing line is between unsubstantiated claims about scientific research and sheer quackery, but this video tests the boundaries, when evaluated in light of the evidence actually presented in the article.

I am sure that many journalists, medical and mental health professionals, laypersons were intimidated by the mention of “blood transcriptomic biomarkers” in the title of this peer-reviewed article. Surely, the published article had survived evaluation by an editor and reviewers with better, relevant expertise. What is there for an unarmed person to argue about?

Start with the numbers and basic statistics

Skepticism about the study is encouraged by a look at the small numbers of patients involved in the study, which was limited to

  • 64 total participants, 32 depressed patients from a clinical trial and 32 controls.
  • 5 patients were lost from baseline  to follow up.
  • 5 more were lost  from 18 week blood draws, leaving
  • 22 remaining patients –
  • 9 classified as in remission, 13 not in remission.

The authors were interested in differences in 20 blood transcriptomic biomarkers in 2 comparisons: the 32 depressed patients versus 32 controls and the 9 patients who remitted at the end of the trial versus 13 who did not. The authors committed themselves to looking for a clinically significant difference or effect size, which, they tell readers, is defined as .45. We can use a program readily available on the web for a power analysis, which indicates the likelihood of obtaining a statistically significant result (p <.05) for any one of these biomarkers, if differences existed between depressed patients and controls or between the patients who improved in the study versus those who did not. Before even putting these numbers into the calculator, we would expect the likelihood is low because of the size of the sample.

We find that there is only a power of 0.426 for finding one of these individual biomarkers significant, even if it really distinguishes between depressed patients and controls and a power of 0.167 for finding a significant difference in the comparison of the patients who improved versus those who did not.

Bottom line is that this is much too small a sample to address the questions in which the authors are interested – less than 50-50 for identifying a biomarker that actually distinguished between depressed patients and controls and less than 1 in 6 in finding a biomarker actually distinguishing those patients who improved versus those who did not. So, even if the authors really have stumbled upon a valid biomarker, they are unlikely to detect it in these samples.

But there are more problems. For instance, it takes a large difference between groups to achieve statistical significance with such small numbers, so any significant result will be quite large. Yet, with such small numbers, statistical significance is unstable: dropping or adding a few or even a single patient or control or reclassifying a patient as improved or not improved will change the results. And notice that there was some loss of patients to follow-up and to determining whether they improved or not. Selective loss to follow-up is a possible explanation of any differences between the patients considered improved and those who are not considered improved. Indeed, near the end of the discussion, the authors note that patients who were retained for a second blood draw differed in gene transcription from those who did not. This should have tempered claims of finding differences in improved versus unimproved patients, but it did not.

So what I am getting at is that this small sample is likely to produce strong results that will not be replicated in other samples. But it gets still worse –

Samples of 32 depressed patients and 32 controls chosen because they match on age, gender, and race – as they were selected in the current study – can still differ on lots of variables.  The depressed patients are probably more likely to be smokers and to be neurotic. So the authors made only be isolating blood transcriptomic biomarkers associated with innumerable such variables, not depression.

There can be single, unmeasured variables that are the source of any differences or some combination of multiple variables that do not make much difference by themselves, but do so when they are together present in a sample. So,  in such a small sample a few differences affecting a few people can matter greatly. And it does no good to simply do a statistical test between the two groups, because any such test is likely to be underpowered and miss influential differences that are not by themselves so extremely strong that they meet conditions for statistical significance in a small sample.

The authors might be tempted to apply some statistical controls – they actually did in a comparison of the nine versus 13 patients – but that would only compound the problem. Use of statistical controls requires much larger samples, and would likely produce spurious – erroneous – results in such a small sample. Bottom line is that the authors cannot rule out lots of alternative explanations for any differences that they find.

The authors nonetheless claim that 9 of the 20 biomarkers they examined distinguish depressed patients and 3 of these distinguish patients who improve. This is statistically improbable and unlikely to be replicated in subsequent studies.

And then there is the sampling issue. We are going to come back to that later in the blog, but just consider how random or systematic differences can arise between this sample of 32 patients versus 32 controls and what might be obtained with another sampling of the same or a different population. The problem is even more serious when we get down to the 9 versus 13 comparison of patients who completed the trial. A different intervention or a different sample or better follow-up could produce very different results.

So, just looking at the number of available patients and controls, we are not expecting much good science to come out of this study that is pursuing significance levels to define results. I think that many persons familiar with these issues would simply dismissed this paper out of hand after looking at these small numbers.

The authors were aware of the problems in examining 20 biomarkers in such small comparisons. They announced that they would commit themselves to adjusting significance levels for multiple comparisons. With such low ratios of participants in the comparison groups to variables examined, this remains a dubious procedure.  However, when this correction eliminated any differences between the improved and unimproved patients, they simply ignored having done this procedure and went on to discuss results as significant. If you return to the press release and the video, you can see no indication that the authors had applied a procedure that eliminated their ability to claim results as significant. By their own standards, they are crowing about being able to distinguish ahead of time patients who will improve versus those who will not when they did not actually find any biomarkers that did so.

What does the existing literature tell us we should expect?

Our skepticism aroused, we might next want to go to Google Scholar and search for topicspull down menu such as genetics depression, biomarkers depression, blood test depression, etc. [Hint: when you put a set of terms into the search box and click, then pull down the menu on the far right to get an advanced search.]

I could say this takes 25 minutes because that is how much time I spent, but that would be misleading. I recall a jazz composer who claim to write a song in 25 minutes. When the interviewer expressed skepticism, the composer said “Yeah, 25 minutes and 25 years of experience.” I had the advantage of knowing what I was looking for.”

The low heritability of liability for MDD implies an important role for environmental risk factors. Although genotype X environment interaction cannot explain the so-called ‘missing heritability’,52 it can contribute to small effect sizes. Although genotype X environment studies are conceptually attractive, the lessons learned from the most studied genotype X environment hypothesis for MDD (5HTTLPR and stressful life event) are sobering.


Whichever way we look at it, and whether risk variants are common or rare, it seems that the challenge for MDD will be much harder than for the less prevalent more heritable psychiatric disorders. Larger samples are required whether we attempt to identify associated variants with small effect across average backgrounds or attempt to enhance detectable effects sizes by selection of homogeneity of genetic or environmental background. In the long-term, a greater understanding of the etiology of MDD will require large prospective, longitudinal, uniformly and broadly phenotyped and genotyped cohorts that allow the joint dissection of the genetic and environmental factors underlying MDD.

[Update suggested on Twitter by Nese Direk, MD] A subsequent even bigger search for the elusive depression gene reported

We analyzed more than 1.2 million autosomal and X chromosome single-nucleotide polymorphisms (SNPs) in 18 759 independent and unrelated subjects of recent European ancestry (9240 MDD cases and 9519 controls). In the MDD replication phase, we evaluated 554 SNPs in independent samples (6783 MDD cases and 50 695 controls)…Although this is the largest genome-wide analysis of MDD yet conducted, its high prevalence means that the sample is still underpowered to detect genetic effects typical for complex traits. Therefore, we were unable to identify robust and replicable findings. We discuss what this means for genetic research for MDD.

So, there is not much encouragement for the present tiny study.

baseline gene expression may contain too much individual variation to identify biomarkers with a given disease, as was suggested by the studies’ authors.

Furthermore it noted that other recent studies had identified markers that either performed poorly in replication studies or were simply not replicated.

Again, not much encouragement for the tiny present study.

[According to Wiktionary, Omics refers to  related measurements or data from such interrelated fields as genomics, proteomics. transcriptomic or other fields.]

The report came about because of numerous concerns expressed by statisticians and bioinformatics scientists concerning the marketing of gene expression-based tests by Duke University. The complaints concerned the lack of an orderly process for validating such tests and the likelihood that these test would not perform as advertised. In response, the IOM convened an expert panel, which noted that many of the studies that became the basis for promoting commercial tests were small, methodological flawed, and relied on statistics that were inappropriate for the size of the samples and the particular research questions.

The committee came up with some strong recommendations for discovering, validating, and evaluating such tests in clinical practice. By these evidence-based standards, the efforts of the authors of the Translational Psychiatry are woefully inadequate and irresponsible in jumping from their modest size study to the claims they are making to the media and possible financial backers, particularly from such a preliminary small study without further replication in an independent sample.

Given that the editor and reviewers of Translational Psychiatry nonetheless accepted this paper for publication, they should be required to read the IOM report. And all of the journalists who passed on ridiculous claims about this article should also read the IOM book.

If we google the same search terms, we come up with lots of press coverage of what work previously claimed as breakthroughs. Almost none of them pan out in replication, despite the initial fanfare. Failures to replicate are much less newsworthy than false discoveries, but once in a while a statement of resignation makes it into the media. For instance,

Depression gene search disappoints



Click to expand

Looking for love biomarkers in all the wrong places

The existing literature suggests that the investigators have a difficult task looking for what is probably a weak signal with a lot of false positives in the context of a lot of noise. Their task would be simpler if they had a well-defined, relatively homogeneous sample of depressed patients. That is so these patients would be relatively consistent in whatever signal they each gave.

With those criteria, the investigators chose was probably the worst possible sample. They obtained their small sample of 32 depressed patients from a clinical trial comparing face-to-face to Internet cognitive behavioral therapy in a sample recruited from primary medical care.

Patients identified as depressed in primary care are a very mixed group. Keep in mind that the diagnostic criteria require that five of nine symptoms be present for at least two weeks. Many depressed patients in primary care have only five or six symptoms, which are mild and ambiguous. For instance, most women experience sleep disturbance weeks after given birth to an infant. But probing them readily reveals that their sleep is being disturbed by the infant. Similarly, one cardinal symptom of depression is the loss of the ability to experience pleasure, but that is confusing item for primary care patients who do not understand that the loss of the ability is supposed to be due to not being able to experience pleasure, rather than not been able to do things that are previously given them pleasure.

And two weeks is not a long time. It is conceivable that symptoms can be maintained that long in a hostile, unsupportive environment but immediately dissipate when the patient is removed from that environment.

Primary care physicians, if they even adhere to diagnostic criteria, are stuck with the challenge of making a diagnosis based on patients having the minimal number of symptoms, with the required  symptoms often being very mild and ambiguous in themselves.

So, depression in primary care is inherently noisy in terms of its inability to give a clear signal of a single biomarker or a few. It is likely that if a biomarker ever became available, many patients considered depressed now, would not have the biomarker. And what would we make of patients who had the biomarker but did not report symptoms of depression. Would we overrule them and insist that they were really depressed? Or what about patients who exhibited classic symptoms of depression, but did not have the biomarker. When we tell them they are merely miserable and not depressed?

The bottom line is that depression in primary care can be difficult to diagnose and to do so requires a careful interview or maybe the passage of time. In Europe, many guidelines discourage aggressive treatment of mild to moderate depression, particularly with medication. Rather, the suggestion is to wait a few weeks with vigilant monitoring of symptoms and  encouraging the patient to try less intensive interventions, like increased social involvement or behavioral activation. Only with the failure of those interventions to make a difference and the failure of symptoms to resolve the passive time, should a diagnosis and initiation of treatment be considered.

Most researchers agree that rather than looking to primary care, we should look to more severe depression in tertiary care settings, like inpatient or outpatient psychiatry. Then maybe go back and see the extent to which these biomarkers are found in a primary care population.

And then there is the problem by which the investigators defined depression. They did not make a diagnosis with a gold standard, semi structured interview, like the Structured Clinical Interview for DSM Disorders (SCID) administered by trained clinicians. Instead, they relied on a rigid simple interview, the Mini International Neuropsychiatry Interview, more like a questionnaire, that was administered by bachelor-level research assistants. This would hardly pass muster with the Food and Drug Administration (FDA). The investigators had available scores on the interview-administered Hamilton Depression Scale (HAM-D), to measure improvement, but instead relied on the self-report Personal Health Questionnaire (PHQ-9). The reason why they chose this instrument is not clear, but it would again not pass muster with the FDA.

Oh, and finally, the investigators talk about a possible biomarker predicting improvement in psychotherapy. But most of the patients in this study were also receiving antidepressant medication. This means we do not know if the improvement was due to the psychotherapy or the medication, but the general hope for a biomarker is that it can distinguish which patients will respond to one versus the other treatment. The bottom line is that this sample is hopelessly confounded when it comes to predicting response to the psychotherapy.

Why get upset about this study?

I could go on about other difficulties in the study, but I think you can get the picture that this is not a credible study and one that can serve as the basis in search for a blood base, biomarker for depression. It simply absurd to present it as such. But why get upset?

  1. Publication of such low quality research and high profile attempts to pass it off as strong evidence of damage the credibility of all evidence-based efforts to establish the efficacy of diagnostic tools and treatments. This study adds to the sense that much of what we read in the scientific journals and is echoed in the media is simply exaggerated or outright false.
  2. Efforts to promote this article are particularly pernicious in suggesting that primary care physicians can make diagnoses of depression without careful interviewing of patients. The physicians do not need to talk to the patients, they can simply draw blood or give out questionnaires.
  3. Implicit in the promotion of their results has evidence for a blood test of depression is the assumption that depression is a biological phenomenon, strongly influenced by genetic expression, not the environment. Aside from being patently wrong and inconsistent with available evidence, it leads to a reliance on biomedical treatments.
  4. Wide dissemination of the article and press release’s claims serve to reinforce laypersons and clinicians’ belief in the validity of commercially available blood tests of dubious value. These tests can cost as much as $475 per administration and there is no credible evidence, by IOM standards, that they perform superior to simply talking to patients.

At the present time, there is no strong evidence that antidepressants are on average superior in their effects on typical primary care patients, relative to, say, interpersonal psychotherapy (IPT). IPT assumes that regardless of how depression comes about, patient improvement can come about by understanding and renegotiating significant interpersonal relationships. All of the trash talk of these authors contradicts this evidence-based assumption. Namely, they are suggesting that we may soon be approaching an era where even the mild and moderate depression of primary care can be diagnosed and treated without talking to the patient. I say bollocks and shame on the authors who should know better.