How to critique claims of a “blood test for depression”

Special thanks to Ghassan El-baalbaki and John Stewart for their timely assistance. Much appreciated.

“I hope it is going to result in licensing, investing, or any other way that moves it forward…If it only exists as a paper in my drawer, what good does it do?” – Eva Redei, PhD, first author.

Media coverage of an article in Translational Psychiatry uniformly passed on the authors’ extravagant claims in a press release from Northwestern University that declared that a simple blood test for depression had been found. That is, until I posted a critique of these claims at my secondary blog. As seen on Twitter, the tide of opinion suddenly shifted and considerable skepticism was expressed.

I am now going to be presenting a thorough critique of the article itself. More importantly, I will be pointing to how, with some existing knowledge and basic tools, many of you can learn to critically examine the credibility of such claims, which will inevitably arise in the future. Biomarkers for depression are a hot topic, and John Ioannidis has suggested that this means a lot of exaggerated claims from flawed studies are more likely to result than real progress.

The article can be downloaded here and the Northwestern University press release here. When I last blogged about this article, I had not seen the 1:58 minute video that is embedded in the press release. I encourage you to view it before my critique and then view it again if you believe that it has any remaining credibility. I do not know where the dividing line is between unsubstantiated claims about scientific research and sheer quackery, but this video tests the boundaries, when evaluated in light of the evidence actually presented in the article.

I am sure that many journalists, medical and mental health professionals, and laypersons were intimidated by the mention of “blood transcriptomic biomarkers” in the title of this peer-reviewed article. Surely, the published article had survived evaluation by an editor and reviewers with better, more relevant expertise. What is there for an unarmed person to argue about?

Start with the numbers and basic statistics

Skepticism about the study is encouraged by a look at the small numbers of patients involved in the study, which was limited to

  • 64 total participants: 32 depressed patients from a clinical trial and 32 controls.
  • 5 patients were lost from baseline to follow-up.
  • 5 more were lost from the 18-week blood draws, leaving
  • 22 remaining patients:
  • 9 classified as in remission, 13 not in remission.

The authors were interested in differences in 20 blood transcriptomic biomarkers in 2 comparisons: the 32 depressed patients versus 32 controls and the 9 patients who remitted at the end of the trial versus 13 who did not. The authors committed themselves to looking for a clinically significant difference or effect size, which, they tell readers, is defined as .45. We can use a program readily available on the web for a power analysis, which indicates the likelihood of obtaining a statistically significant result (p <.05) for any one of these biomarkers, if differences existed between depressed patients and controls or between the patients who improved in the study versus those who did not. Before even putting these numbers into the calculator, we would expect the likelihood is low because of the size of the sample.

We find that there is only a power of 0.426 for finding one of these individual biomarkers significant, even if it really distinguishes between depressed patients and controls and a power of 0.167 for finding a significant difference in the comparison of the patients who improved versus those who did not.

Bottom line is that this is much too small a sample to address the questions in which the authors are interested – less than 50-50 for identifying a biomarker that actually distinguished between depressed patients and controls and less than 1 in 6 in finding a biomarker actually distinguishing those patients who improved versus those who did not. So, even if the authors really have stumbled upon a valid biomarker, they are unlikely to detect it in these samples.
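
For readers who want to check such numbers themselves, the two power figures above can be reproduced in a few lines of Python using scipy’s noncentral t distribution. This is a sketch, not the web calculator mentioned above: the 0.45 effect size and the group sizes come from the article, and the third decimal may differ slightly from one calculator to another.

```python
import math
from scipy import stats

def two_sample_power(d, n1, n2, alpha=0.05):
    """Power of a two-sided, two-sample t-test for a standardized effect size d."""
    ncp = d * math.sqrt(n1 * n2 / (n1 + n2))   # noncentrality parameter
    df = n1 + n2 - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)    # two-sided critical value
    # probability that the test statistic lands in either rejection region
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

# effect size 0.45 is the article's own "clinically significant" threshold
print(round(two_sample_power(0.45, 32, 32), 3))  # depressed vs. controls, ~0.42
print(round(two_sample_power(0.45, 9, 13), 3))   # remitters vs. non-remitters, ~0.17
```

Either way the calculation is done, the conclusion is the same: the study was underpowered for both comparisons before a single blood sample was drawn.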

But there are more problems. For instance, it takes a large difference between groups to achieve statistical significance with such small numbers, so any significant result will be quite large. Yet, with such small numbers, statistical significance is unstable: dropping or adding a few or even a single patient or control or reclassifying a patient as improved or not improved will change the results. And notice that there was some loss of patients to follow-up and to determining whether they improved or not. Selective loss to follow-up is a possible explanation of any differences between the patients considered improved and those who are not considered improved. Indeed, near the end of the discussion, the authors note that patients who were retained for a second blood draw differed in gene transcription from those who did not. This should have tempered claims of finding differences in improved versus unimproved patients, but it did not.

So what I am getting at is that this small sample is likely to produce strong results that will not be replicated in other samples. But it gets still worse –
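
A small simulation makes this concrete. Assuming, generously, a true effect of 0.45 and the study’s own 9-versus-13 split (the normally distributed scores below are invented for illustration), the only way such a comparison reaches p < .05 is by overestimating the effect, roughly doubling it on average:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n1, n2, true_d, alpha = 9, 13, 0.45, 0.05  # group sizes from the article

sig_effects = []
for _ in range(5000):
    remitters = rng.normal(true_d, 1.0, n1)      # true standardized difference is 0.45
    non_remitters = rng.normal(0.0, 1.0, n2)
    _, p = stats.ttest_ind(remitters, non_remitters)
    if p < alpha:
        # pooled-SD standardized difference (Cohen's d) for this "significant" study
        sp = np.sqrt(((n1 - 1) * remitters.var(ddof=1)
                      + (n2 - 1) * non_remitters.var(ddof=1)) / (n1 + n2 - 2))
        sig_effects.append(abs(remitters.mean() - non_remitters.mean()) / sp)

print(f"share of simulated studies reaching p < .05: {len(sig_effects) / 5000:.2f}")
print(f"mean |d| among those: {np.mean(sig_effects):.2f}  (the true d is only 0.45)")
```

With these group sizes, a result cannot be significant unless the estimated |d| exceeds roughly 0.9, so every “discovery” is an exaggeration of the assumed true effect. That is exactly why such results shrink or vanish on replication.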

Samples of 32 depressed patients and 32 controls chosen because they match on age, gender, and race – as they were selected in the current study – can still differ on lots of variables. The depressed patients are probably more likely to be smokers and to be neurotic. So the authors may only be isolating blood transcriptomic biomarkers associated with innumerable such variables, not depression.

There can be single, unmeasured variables that are the source of any differences or some combination of multiple variables that do not make much difference by themselves, but do so when they are together present in a sample. So, in such a small sample a few differences affecting a few people can matter greatly. And it does no good to simply do a statistical test between the two groups, because any such test is likely to be underpowered and miss influential differences that are not by themselves so extremely strong that they meet conditions for statistical significance in a small sample.

The authors might be tempted to apply some statistical controls – they actually did in a comparison of the nine versus 13 patients – but that would only compound the problem. Use of statistical controls requires much larger samples, and would likely produce spurious – erroneous – results in such a small sample. Bottom line is that the authors cannot rule out lots of alternative explanations for any differences that they find.

The authors nonetheless claim that 9 of the 20 biomarkers they examined distinguish depressed patients and 3 of these distinguish patients who improve. This is statistically improbable and unlikely to be replicated in subsequent studies.

And then there is the sampling issue. We are going to come back to that later in the blog, but just consider how random or systematic differences can arise between this sample of 32 patients versus 32 controls and what might be obtained with another sampling of the same or a different population. The problem is even more serious when we get down to the 9 versus 13 comparison of patients who completed the trial. A different intervention or a different sample or better follow-up could produce very different results.

So, just looking at the number of available patients and controls, we are not expecting much good science to come out of a study that relies on significance testing to define its results. I think that many persons familiar with these issues would simply dismiss this paper out of hand after looking at these small numbers.

The authors were aware of the problems in examining 20 biomarkers in such small comparisons. They announced that they would commit themselves to adjusting significance levels for multiple comparisons. With such low ratios of participants in the comparison groups to variables examined, even this remains a dubious procedure. However, when this correction eliminated any differences between the improved and unimproved patients, they simply ignored having done the procedure and went on to discuss results as significant. If you return to the press release and the video, you can see no indication that the authors had applied a procedure that eliminated their ability to claim results as significant. By their own standards, they are crowing about being able to distinguish ahead of time patients who will improve versus those who will not, when they did not actually find any biomarkers that did so.
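
To see how consequential the abandoned correction is, consider a set of invented p-values (not taken from the paper) for 20 biomarker comparisons; a Bonferroni adjustment simply divides the .05 threshold by the number of tests:

```python
# Hypothetical p-values for 20 biomarker comparisons -- invented for illustration,
# not taken from the Translational Psychiatry article.
pvals = [0.012, 0.030, 0.045] + [0.20, 0.35, 0.50] + [0.60] * 14
alpha = 0.05
m = len(pvals)  # 20 comparisons

uncorrected = sum(p < alpha for p in pvals)
bonferroni = sum(p < alpha / m for p in pvals)  # per-test threshold becomes 0.0025

print(uncorrected)  # 3 markers look "significant" before correction
print(bonferroni)   # none survive the correction the authors said they would apply
```

A marker with p = .03 looks impressive in isolation, but with 20 shots on goal it is exactly what chance alone would deliver, which is the whole point of committing to the correction in advance.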

What does the existing literature tell us we should expect?

Our skepticism aroused, we might next want to go to Google Scholar and search for topics such as genetics depression, biomarkers depression, blood test depression, etc. [Hint: when you put a set of terms into the search box and click, then pull down the menu on the far right to get an advanced search.]

I could say this takes 25 minutes because that is how much time I spent, but that would be misleading. I recall a jazz composer who claimed to have written a song in 25 minutes. When the interviewer expressed skepticism, the composer said, “Yeah, 25 minutes and 25 years of experience.” I had the advantage of knowing what I was looking for.

The low heritability of liability for MDD implies an important role for environmental risk factors. Although genotype X environment interaction cannot explain the so-called ‘missing heritability’,52 it can contribute to small effect sizes. Although genotype X environment studies are conceptually attractive, the lessons learned from the most studied genotype X environment hypothesis for MDD (5HTTLPR and stressful life event) are sobering.


Whichever way we look at it, and whether risk variants are common or rare, it seems that the challenge for MDD will be much harder than for the less prevalent more heritable psychiatric disorders. Larger samples are required whether we attempt to identify associated variants with small effect across average backgrounds or attempt to enhance detectable effects sizes by selection of homogeneity of genetic or environmental background. In the long-term, a greater understanding of the etiology of MDD will require large prospective, longitudinal, uniformly and broadly phenotyped and genotyped cohorts that allow the joint dissection of the genetic and environmental factors underlying MDD.

[Update suggested on Twitter by Nese Direk, MD] A subsequent even bigger search for the elusive depression gene reported

We analyzed more than 1.2 million autosomal and X chromosome single-nucleotide polymorphisms (SNPs) in 18 759 independent and unrelated subjects of recent European ancestry (9240 MDD cases and 9519 controls). In the MDD replication phase, we evaluated 554 SNPs in independent samples (6783 MDD cases and 50 695 controls)…Although this is the largest genome-wide analysis of MDD yet conducted, its high prevalence means that the sample is still underpowered to detect genetic effects typical for complex traits. Therefore, we were unable to identify robust and replicable findings. We discuss what this means for genetic research for MDD.

So, there is not much encouragement for the present tiny study.

baseline gene expression may contain too much individual variation to identify biomarkers with a given disease, as was suggested by the studies’ authors.

Furthermore, it noted that other recent studies had identified markers that either performed poorly in replication studies or were simply not replicated.

Again, not much encouragement for the tiny present study.

[According to Wiktionary, omics refers to related measurements or data from such interrelated fields as genomics, proteomics, and transcriptomics.]

The Institute of Medicine (IOM) report on omics-based tests came about because of numerous concerns expressed by statisticians and bioinformatics scientists concerning the marketing of gene expression-based tests by Duke University. The complaints concerned the lack of an orderly process for validating such tests and the likelihood that these tests would not perform as advertised. In response, the IOM convened an expert panel, which noted that many of the studies that became the basis for promoting commercial tests were small, methodologically flawed, and relied on statistics that were inappropriate for the size of the samples and the particular research questions.

The committee came up with some strong recommendations for discovering, validating, and evaluating such tests in clinical practice. By these evidence-based standards, the efforts of the authors of the Translational Psychiatry article are woefully inadequate, and it is irresponsible to jump from such a preliminary, small study, without replication in an independent sample, to the claims they are making to the media and possible financial backers.

Given that the editor and reviewers of Translational Psychiatry nonetheless accepted this paper for publication, they should be required to read the IOM report. And all of the journalists who passed on ridiculous claims about this article should read it as well.

If we google the same search terms, we come up with lots of press coverage of work previously claimed as breakthroughs. Almost none of these claims pan out in replication, despite the initial fanfare. Failures to replicate are much less newsworthy than false discoveries, but once in a while a statement of resignation makes it into the media. For instance,

Depression gene search disappoints




Looking for love biomarkers in all the wrong places

The existing literature suggests that the investigators have a difficult task: looking for what is probably a weak signal, with a lot of false positives, in the context of a lot of noise. Their task would be simpler if they had a well-defined, relatively homogeneous sample of depressed patients, so that these patients would be relatively consistent in whatever signal they each gave.

By those criteria, the investigators chose probably the worst possible sample. They obtained their small sample of 32 depressed patients from a clinical trial comparing face-to-face with Internet cognitive behavioral therapy in a sample recruited from primary medical care.

Patients identified as depressed in primary care are a very mixed group. Keep in mind that the diagnostic criteria require that five of nine symptoms be present for at least two weeks. Many depressed patients in primary care have only five or six symptoms, which are mild and ambiguous. For instance, most women experience sleep disturbance in the weeks after giving birth to an infant. But probing readily reveals that their sleep is being disturbed by the infant. Similarly, one cardinal symptom of depression is the loss of the ability to experience pleasure, but that is a confusing item for primary care patients, who do not understand that the loss is supposed to be of the ability to experience pleasure, rather than of the opportunity to do things that previously gave them pleasure.

And two weeks is not a long time. It is conceivable that symptoms can be maintained that long in a hostile, unsupportive environment but immediately dissipate when the patient is removed from that environment.

Primary care physicians, if they even adhere to diagnostic criteria, are stuck with the challenge of making a diagnosis based on patients having the minimal number of symptoms, with the required symptoms often being very mild and ambiguous in themselves.

So, depression in primary care is inherently noisy, unlikely to give a clear signal from a single biomarker or a few. It is likely that if a biomarker ever became available, many patients considered depressed now would not have the biomarker. And what would we make of patients who had the biomarker but did not report symptoms of depression? Would we overrule them and insist that they were really depressed? Or what about patients who exhibited classic symptoms of depression but did not have the biomarker? Would we tell them they are merely miserable and not depressed?

The bottom line is that depression in primary care can be difficult to diagnose, and doing so requires a careful interview or maybe the passage of time. In Europe, many guidelines discourage aggressive treatment of mild to moderate depression, particularly with medication. Rather, the suggestion is to wait a few weeks with vigilant monitoring of symptoms, encouraging the patient to try less intensive interventions, like increased social involvement or behavioral activation. Only with the failure of those interventions to make a difference, and the failure of symptoms to resolve with the passage of time, should a diagnosis and initiation of treatment be considered.

Most researchers agree that rather than looking to primary care, we should look to more severe depression in tertiary care settings, like inpatient or outpatient psychiatry. Then maybe go back and see the extent to which these biomarkers are found in a primary care population.

And then there is the problem of how the investigators defined depression. They did not make a diagnosis with a gold-standard, semi-structured interview, like the Structured Clinical Interview for DSM Disorders (SCID) administered by trained clinicians. Instead, they relied on a rigid, simple interview, the Mini-International Neuropsychiatric Interview (MINI), more like a questionnaire, that was administered by bachelor-level research assistants. This would hardly pass muster with the Food and Drug Administration (FDA). The investigators had available scores on the interview-administered Hamilton Depression Scale (HAM-D) to measure improvement, but instead relied on the self-report Patient Health Questionnaire (PHQ-9). The reason why they chose this instrument is not clear, but it would again not pass muster with the FDA.

Oh, and finally, the investigators talk about a possible biomarker predicting improvement in psychotherapy. But most of the patients in this study were also receiving antidepressant medication. This means we do not know if the improvement was due to the psychotherapy or the medication, but the general hope for a biomarker is that it can distinguish which patients will respond to one versus the other treatment. The bottom line is that this sample is hopelessly confounded when it comes to predicting response to the psychotherapy.

Why get upset about this study?

I could go on about other difficulties in the study, but I think you get the picture: this is not a credible study, nor one that can serve as the basis for a search for a blood-based biomarker for depression. It is simply absurd to present it as such. But why get upset?

  1. Publication of such low-quality research, and high-profile attempts to pass it off as strong evidence, damage the credibility of all evidence-based efforts to establish the efficacy of diagnostic tools and treatments. This study adds to the sense that much of what we read in the scientific journals, and what is echoed in the media, is simply exaggerated or outright false.
  2. Efforts to promote this article are particularly pernicious in suggesting that primary care physicians can make diagnoses of depression without careful interviewing of patients. The physicians do not need to talk to the patients, they can simply draw blood or give out questionnaires.
  3. Implicit in the promotion of their results as evidence for a blood test of depression is the assumption that depression is a biological phenomenon, strongly influenced by genetic expression, not the environment. Aside from being patently wrong and inconsistent with available evidence, this leads to an overreliance on biomedical treatments.
  4. Wide dissemination of the article and press release’s claims serves to reinforce laypersons’ and clinicians’ belief in the validity of commercially available blood tests of dubious value. These tests can cost as much as $475 per administration, and there is no credible evidence, by IOM standards, that they perform better than simply talking to patients.

At the present time, there is no strong evidence that antidepressants are on average superior in their effects on typical primary care patients, relative to, say, interpersonal psychotherapy (IPT). IPT assumes that regardless of how depression comes about, patient improvement can come about by understanding and renegotiating significant interpersonal relationships. All of the trash talk of these authors contradicts this evidence-based assumption. Namely, they are suggesting that we may soon be approaching an era where even the mild and moderate depression of primary care can be diagnosed and treated without talking to the patient. I say bollocks and shame on the authors who should know better.

34 thoughts on “How to critique claims of a “blood test for depression””

  1. James
    your usual nice forensic examination of bold claims…
    I was struck by your statement
    ” The investigators had available scores on the interview-administered Hamilton Depression Scale, to measure improvement, but instead relied on the self-report Personal Health Questionnaire (PHQ-9). The reason why they chose this instrument is not clear, but it would again not pass muster with the FDA.”

    I noticed this when I read the paper and thought it a little odd
    One possibility of course, relates to the PHQ-9 being self-rated and the HAM-D being clinician rated. The use of self-rating PHQ-9 at outcome means the study is not blind (especially viz the use of CBT).
    It would have been nice to also see results for clinician ratings (on HAM-D) at outcome to see if they show the same relationship proposed for self-ratings on the PHQ-9
    Leaving aside the many valid issues you raise about the study…’if’ the effect were valid, it raises the possibility that the blood test is predictive of self-rated depression only ….and not clinically defined depression…with manifold implications if it were true.


  2. As always, right on target. I think your blog entry deserves a press release. Here we have a case in which the host institution, and not the media per se, is responsible for disseminating research that lacks credibility. It is easy to explain: within universities, the search for quick prominence and media coverage leads communication departments and administrators to this sort of game. As always, your blog entries inspire those of us in other realms to continue speaking up.


    1. Thanks, Gonzalo. As I alluded to in my blog post, the Institute of Medicine convened a committee after Duke University was charged with prematurely promoting a biomarker test for response to chemotherapy. The final report clearly implicated institutions, not just individual investigators, in the promulgation of premature and unsubstantiated claims for such tests. Too bad no one at Northwestern University seems to have read that report.


  3. Very nice blog post. There are so many things wrong with this paper that your post could have gone on and on. But you’ve covered the most glaring things. You mention the lack of independent replication briefly, but I think it should be further emphasised. It’s impossible to make claims about how well a test does at discriminating between condition A and condition B without applying it to an independent sample. If you get a significant result in such a small sample, of course it will be able to discriminate well in that set. If you build up a polygenic predictor from a GWAS and use it to see how it discriminates cases and controls in the same sample, it will do a tremendous job. Not so much in an independent sample. It is reprehensible that the authors are promoting this as something that is of any immediate utility.


  4. Thank you for a brilliant critique. Given the potential for harm arising from this research, it would be nice if Translational Psychiatry published an editorial response from you connected to the original paper. Might this be possible?


  5. As Dr. Coyne indicates, there are many fundamental flaws in this study, not the least of which is the statistical analysis. Dr. Redei is not a clinician; rather, she is a laboratory researcher who has done most of her work on rats, as a review of her recent publications will indicate. The chronic problem with this kind of study is that clinicians rely on the laboratory members of the team to perform the biological measurements, and the laboratory members of the team rely on the clinicians for psychopathology measurements; e.g., the HAM-D or the PHQ-9, neither really understanding the other’s contribution. As well, no one was minding the statistical design–sample size, etc.–so that the result is a small, meaningless study that generated the hype that universities today are eager to invoke about whatever might lead to some kind of commercialization. Nothing new under the sun…


    1. Thanks for checking out Dr. Redei’s background, which perhaps explains her lame clinical pronouncements. And you are correct also in flagging Northwestern University’s commercial interests in her patents.


  6. I agree with 99% of this article but there is a subtle point. A power analysis is a prediction of the probability of a successful study. If the well-done study, and well-done analysis, hits a significant p, it’s a success. Postdicting the power analysis and coming up with 0.4 should not raise skepticism, since skepticism should already be at a maximum. It calls for a replication – but one should always call for a replication – so power analysis contributes little.
    The reason replication is so important is that our samples are convenience samples rather than representative samples. Generalization is premature.
    Unfortunately our granting agencies and journals want hot news rather than judicious science.


  7. Good point. The question is where did the projected .4 effect size come from. From the authors who claim it was set a priori. But on the basis of what wishful thinking? Certainly it was not derived from past research. I think this is a particularly bad example of authors setting an effect size based on what would make their study look good, not on the basis of prior probabilities.

    I agree, though, that the level of suspicion about this study is already so high that one needs to dismiss it. Sure, the authors deserve the bulk of the blame, but so does Northwestern for releasing the press release, and Tom Insel for setting the absurd requirement that applications for grants to do psychotherapy have to specify biomarkers. This sets the stage for authors making all sorts of unjustified claims in the interest of obtaining funding.


  8. Another huge problem with the study is the lack of a psychiatric control group. How do we know the findings (as problematic as they are) tell us anything specific about depression, vs psychopathology in general? Especially with depressed mood being so nonspecific.


  9. my question is what is the motive? is it to get federal funding because psychotherapy studies must now involve biomarkers? is it to get funding from say private equity to commercialize the product? i note there is one company that has a proprietary test for depression. my take from the researcher’s video is that this test would be in the public domain. am i wrong?


    1. We can only speculate as to the authors’ motives. However, the clinical trial from which data were obtained was an R21 Exploratory Grant. Presumably the authors want to go on to a full R01. But now NIMH has made the requirement that R01 applications specify treatment-by-individual-characteristic interactions, and that these explorations of interaction effects get at underlying biological mechanisms, i.e., biomarkers. This may be an absurd expectation, but if the results of the authors were accepted, they would be on their way to making a suitable application.

      However, as I indicate in my secondary blog, the authors have clearly made statements to the press that indicate that they are seeking the support of industry for further developing their “biomarkers.” A rival group has already marketed a $475 blood test, but in the absence of appropriate validation. See


  10. Many of your critiques are legitimate, and I share your skepticism about the claims of this study. However, some of your points seem dubious to me:

    1. The low power of the study could lead to false negatives, but not false positives. Also, a power of 0.426 is not tiny, and in any case that is based on an assumption about effect sizes; if the actual effect sizes are larger, it becomes much more likely that the effects will be detected.

    2. You make much of the fact that large-scale searches for genetic association with depression have not yielded much. However, this study was looking for something quite different, so the comparison is inapt. Even for something with zero heritable component, we might well find strong associations with expression levels of various genes.


    1. Thanks for your comments.

      Low power does not lead to more positive findings, but to a greater likelihood that any positive findings are false. That distinction has recently been noted in the literature. A power of .426 is generally seen as unacceptably low, in terms of missing most associations even if they are indeed present. Are you suggesting that because the study is grossly underpowered, we should be more accepting of positive findings?

      Gene expression is different from identifying genes associated with depression, but given the lack of replicable findings in huge searches for genes, a “strong association” of depression with gene expression in such a pitifully small sample is highly unlikely. Furthermore, recent findings of differences in gene expression have not been replicated, as would be expected. The Institute of Medicine clearly wants us to be on guard against accepting what are likely to be false positives in the absence of replication.

      Note further that the tiny sample of recovered versus unrecovered patients was a subset of patients that significantly differed in gene expression from the larger original sample. Loss to follow-up was not random. This adds to the weight of evidence that some sort of nonsense is going on here that should not be given substantive interpretation.

      Transcriptomic biomarkers are different than genes, but the failure to identify genes in exhaustive, huge searches is certainly discouraging of the expectation that gene expressions will be found in such a tiny sample. A “strong association” with gene expression in such a small sample should be dismissed in the absence of replications. Recent “strong associations” have been found to be false positives, as would be expected.


  11. “Are you suggesting that because the study is grossly underpowered, we should be more accepting of positive findings?”

    Huh? Where did that come from? I am suggesting that it is not legitimate to dismiss statistically significant results because the sample size is small. Statistical tests take sample size into account. I understand that their statistical analysis is suspect, but that is an orthogonal point.

    It is not unthinkable to obtain a result that there is a 42.6% chance of obtaining. They might even argue that all of their 20 markers contain information, and they only found an effect for about 42.6% of them (9/20 = 45%).

    Furthermore, the power estimate is for a more or less arbitrary effect size. Who says the effect size is not actually much larger than this?

    “Transcriptomic biomarkers are different than genes, but the failure to identify genes in exhaustive, huge searches is certainly discouraging of the expectation that gene expressions will be found in such a tiny sample.”

    Why? There might be zero heritable component, but large expression differences due to purely environmental effects.

    I’m not defending the claims of the paper. I am pointing out weaknesses in some of your arguments.
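    The arithmetic behind the 9-of-20 point can be checked with a short script. This is purely illustrative: it assumes 20 independent tests, each run at the roughly 0.466 power figure quoted in this thread.

```python
from math import comb

def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# If all 20 markers carried a real signal and each test had power ~0.466,
# observing 9 or more significant markers would be unremarkable:
print(round(binom_tail(20, 0.466, 9), 2))  # roughly 0.64
```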


    1. I think we best let readers decide if you have pointed out “weaknesses in some of my arguments.”

      Most of us would feel quite comfortable in dismissing findings that were “statistically significant” under these circumstances.

      If authors who had posited an a priori power of .466 went on to get excited about finding effects for 45% of markers under these circumstances, most of us would find it inappropriate post hoc exuberance.

      Similarly, if authors with the prior probabilities of the present study were to defend their results with the argument that they were dealing with transcriptomic biomarkers with zero heritable component, they would similarly be dismissed.


  12. This power issue is likely to confuse everybody, but is really unimportant. If a grossly low-powered study comes up with a significant finding, it may be a fluke, or maybe they fell upon a fact. In either case, replication is central. For instance, the early panic studies of imipramine effects were feeble, power-wise, but they withstood independent replication, although it was 17 years later.
    This example indicates the central role of replication, not just one trial, as well as the unfortunate fact of non-replication. I ascribed this to the desire of granting agencies and journals to go for hot news, but was reminded that the urge for commercial advantage is surely important.


  13. I’m puzzled. Statisticians don’t disregard statistically significant results just because the sample size is small. In fact they have developed many tests to deal specifically with small sample size. The point about power makes no sense. Among other things, the authors stated that they would consider only effects >0.45 to be statistically significant, but you are assuming that the effect cannot be greater than 0.45.

    You are also assuming a connection between expression differences and genetic differences that there is no reason to expect.

    ‘I think we best let readers decide if you have pointed out “weaknesses in some of my arguments.” ‘

    Come now. If you can declare something to be “low quality research” (you didn’t wait for readers to decide), you can’t take umbrage at somebody saying that your arguments have weaknesses.


    1. Josh, I don’t understand your lack of understanding of this article and guess I have to get back to basics. Of necessity, scientists disregard statistically significant results all the time. This context is an observational study comparing a small number of depressed patients whose depression was not ascertained by gold standard methods. The comparison group was matched on only three variables out of thousands of potentially relevant ones. Essentially, it is fugitive epidemiology, and when recognized as such, most people would see the folly of assuming that any differences that were found could serve as a trait marker for depression beyond the study. Think spurious association. So they would not accept the statistical significance of a t-test as deciding the question.

      The authors chose their 20 dependent variables based on expected ties to genes, although, yes, the association between genes and gene expression is complex. They are apparently at a loss to indicate a plausible biological mechanism by which any significant results might be related to depression.

      The larger context generating prior probabilities is that the search for both genes and gene transcription associated with depression has been quite disappointing, with no consistent replicable findings. And the situation in omics in general is so bad that the IOM convened an expert panel to make recommendations for evaluating any new claims about omics-based markers. This article violates those recommendations.

      Ask any experts and I am sure that they will tell you that this effort was grossly underpowered, given prior probabilities. Any effect sizes it generates are likely to prove exaggerated or not replicated. Without replication, this study does not alter prior probabilities and the literature already has enough unreplicated junk to be vigilant about adding more.

      Bottom line is that these authors have no business making substantive claims about having discovered biomarkers.
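      The role of prior probability here can be made concrete with an Ioannidis-style positive predictive value calculation. The 10% prior and the power figures below are illustrative assumptions, not values taken from the paper:

```python
def ppv(prior, power, alpha=0.05):
    """P(effect is real | test was significant),
    assuming independent tests and no bias."""
    return (power * prior) / (power * prior + alpha * (1 - prior))

# With a 10% prior and ~45% power, a 'significant' marker is a coin flip:
print(round(ppv(0.10, 0.45), 2))  # 0.5
# A well-powered replication raises the yield considerably:
print(round(ppv(0.10, 0.90), 2))  # 0.67
```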


  14. There are so many things wrong with this study that it is easy to confuse the issues.
    The bottom line is whether there is enough here to incite replication or whether we should just ignore it. The claim of a factual, definitive demonstration because of a significant p value is plainly wrong.
    Is the post hoc estimate of low power, derived from their findings, sufficient to throw out their significant finding? I don’t think so, for the same reason a significant finding may be a Type 1 error: a low-powered finding may be a Type 2 error.
    But I think the power issue is a red herring after the fact. Any significant finding is viewed within the context of design and other knowledge.
    Naturalistic contrasts make one suspicious because there can be so many unrecognized biasing variables. Nonetheless, if a significant finding appears to cast light on a heuristic or practical question, the incentive to test it by replication is increased.
    Experimental randomized protocols can generate a power statement, prior to the experiment, if one assumes a sample size, an effect size, and an alpha value.
    Its value, when low, is to suggest a larger N, more optimism about the effect size, a less stringent alpha, or forgetting the whole thing.
    Once the experiment is done, a significant finding is sufficient to incite interest, despite post hoc low power, if the requirements for a randomized trial that allow generalizability are met.
    My general point is that this is hardly ever the case, since we rely on samples of convenience rather than random selection from defined populations.
    Therefore, neither naturalistic nor experimental findings should incite confidence, but they should incite replication if meaningful heuristically or practically.
    This last applies to such mind-blowing work as the detection of the Higgs particle.
    I do not understand their statistics. It is quite possible that they have confirmed their finding by any number of techniques, such as resampling or split samples, that amount to replications. I am sure they have detailed their methods somewhere; I plead ignorance and hope somebody will cite the appropriate reference. It is even possible that their samples are random subsets of defined populations.
    But my take-home point is that any “finding” is not a definitive fact without adequate replication. Further, post hoc low power is insufficient to invalidate a significant finding or, for that matter, to increase skepticism, which primarily derives from other issues that usually set skepticism at a maximum.
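    The kind of prospective power statement described above, computed from an assumed sample size, effect size, and alpha, can be sketched with a normal approximation. The d = 0.4 effect size and the group sizes below are illustrative assumptions, not the paper's values:

```python
from math import erf, sqrt

def normal_cdf(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

def two_sample_power(d, n_per_group, z_crit=1.96):
    """Approximate power of a two-sided, two-sample comparison at alpha = .05,
    using the normal approximation to the t distribution."""
    ncp = d * sqrt(n_per_group / 2)  # noncentrality parameter
    return normal_cdf(ncp - z_crit) + normal_cdf(-ncp - z_crit)

print(round(two_sample_power(0.4, 32), 2))   # roughly 0.36: badly underpowered
print(round(two_sample_power(0.4, 100), 2))  # a larger N repairs the design
```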


  15. That fails to respond to my points, throws in some irrelevant stuff, and attempts to argue by condescension (especially that ludicrous first sentence). My points stand. Pointing out the other things wrong with the paper doesn’t change that.


    1. Guess we differ on what is relevant, and you accept the p-value fallacy. Most epidemiologists are reluctant to accept statistical significance as determining the validity of a conclusion. Most people working in biomarkers are dismissive of statistical significance in small case-control studies, etc. Most accept the relevance of prior probabilities, etc., etc. I guess your point stands for you.


  16. If you accept null hypothesis significance testing in the first place (which is no longer fashionable, I guess, but that doesn’t seem to be the basis for your critique), then we should not be more confident of positive results with small sample sizes and large observed effects, nor should we be more confident of positive results with large sample sizes and small observed effects. If the p-values are the same, we should be *exactly* as accepting of positive results with small samples as with large ones. With this sample, you need an observed effect of .43 in order to claim that the actual effect is greater than zero (i.e. positive) at p < .10. This would not be a significant result, but the observed effect would still be moderate and positive. A larger sample with the same effect would have a smaller and more stable p-value, and the confidence interval around the observed effect would be narrower. A similar study with a larger sample, a smaller effect, and the same p-value would also have a narrower confidence interval, so instead of being confident that the true effect is between .05 and .8, we’d be 95% confident that the true effect was between .02 and .4. We would be exactly, precisely, as confident that the true effect was not zero.
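  The point about equal p-values but unequal interval widths is easy to demonstrate. The sample sizes and the fixed z statistic below are hypothetical, chosen only to make the two studies share a p-value:

```python
from math import atanh, tanh, sqrt

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a correlation via Fisher's z transform."""
    z, se = atanh(r), 1 / sqrt(n - 3)
    return tanh(z - z_crit * se), tanh(z + z_crit * se)

# Two hypothetical studies engineered to share the SAME z statistic
# (hence the same p-value), one small and one ten times larger:
z_stat = 2.0
r_small = tanh(z_stat / sqrt(30 - 3))    # large observed effect, n = 30
r_large = tanh(z_stat / sqrt(300 - 3))   # small observed effect, n = 300

lo_s, hi_s = fisher_ci(r_small, 30)
lo_l, hi_l = fisher_ci(r_large, 300)
# Both intervals just exclude zero; the large-n interval is far narrower:
print(round(hi_s - lo_s, 2), round(hi_l - lo_l, 2))  # 0.63 0.22
```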


    1. This is a strange analysis and leads to an incorrect recommendation that we accept improbably large effect sizes from small studies. From a risk-of-bias perspective, a large effect in a small trial is less likely to be reproducible, because studies of this size are more vulnerable to deliberate loss or retention of a few or even a single data point, to other decisions about final “design” or analyses after a look at the data, and to publication bias. Furthermore, assumptions in any attempt to correct for incomplete data are less testable in already small data sets.

      Larger samples are simply always better because they provide more accurate estimates of a population effect size. It’s not just about the p-value; it’s the combination of sample size, p-value, and effect size that should inform our judgment of how reliable something is. Tiny samples are known to hugely overestimate effect sizes due to publication bias. Note the authors’ avoidance of presenting simple associations in favor of overcontrolled analyses with arbitrary covariates. Smells fishy. Unlike me, they attach a lot of resolving power to simply having statistically significant results to report. But by their own criteria, the portion of the baseline data available at follow-up does not replicate the results obtained from the full set of baseline data.

      We need to be particularly vigilant about secondary analyses of small clinical data sets with missing data for which analyses were not preregistered.


  17. Jim,
    I think you are incorrectly focused on the projected 0.4 effect size. No doubt the effect size in power analyses is often manipulated, out of the blue, so that the project looks feasible and fundable. But that has nothing to do with the found significance of the study.
    Also, I did not dismiss the study; rather, I called for an independent replication. If I had dismissed it, say because of incoherence, my suggestion would be not to bother with replication.


    1. Don, thanks for your comments. I think even in new areas of research, we are rarely without prior expectations that can guide power analyses. Large-scale studies attempting to relate diagnoses of clinical depression to biomarkers at best find associations in the .2 range. We additionally have guidance from both the genetic and genomic expression literatures to suggest that any associations are likely to be quite modest and to require much larger studies to establish. The present convenience sample, by the authors’ own criteria, did not replicate from the original baseline sample to the portion retained for the small follow-up study. If the smaller sample became the basis for identifying predictors of treatment response, we are even further from anything likely to be reproducible.

      Note also that the authors seem to be looking for blood transcriptomic biomarkers in the wrong places. One would think that it is necessary to start with a cleaner, more well-defined phenotype. Instead, they have a convenience sample of patients recruited from the community, for whom gold standard diagnoses based on semi-structured interviews are not available. It is quite a noisy sample in which to look for a weak signal. Note also that there was no control for receipt of antidepressant medication or adherence to it.

      Exploration of blood transcriptomic biomarkers might still be seen as promising, but I think that subsequent independent investigators should not take the results of this study too seriously as the primary focus of their work, and should look beyond the narrow range of biomarkers identified here. Bets are that whatever biomarkers a larger study with a better-defined phenotype would yield would not overlap with the results of the present study. At best, the study shows the limited feasibility of collecting blood samples from participants in a trial of Internet psychotherapy and obtaining second samples after completion of therapy. It doesn’t yield, in my opinion, substantive, specific, reproducible results. We now have the same knowledge that we had before the authors conducted their study.

