Special thanks to Ioana Cristea and Nilufer Kafescioglu, who co-authored with me the commentary discussed in this blog post.
I recall a Monty Python skit in which clever government officials solved a housing crisis by hiring the famed hypnotist El Mystico to put up 50 hypnosis-induced 25 story blocks. Of course, the illusion was always threatened by someone falling out of trance and noticing that there was actually no housing. If residents stopped believing, the entire housing complex would fall down. El Mystico was kept busy, watching for people coming out of trance and hypnotizing them again.
Tenant: we received a note from the Council saying that if we ceased to believe in this building it would fall down.
Voice Over: You don’t mind living in a figment of another man’s imagination?
Tenant: No, it’s much better than where we used to live.
This skit can serve as a metaphor for whole areas of psychotherapy and psychosocial intervention research that are dominated by small, similarly flawed studies, but in which the illusion of a solid body of work is nervously protected against anyone noticing differently.
My recent blog post examining the Triple P Parenting Program literature found that expensive implementations of that program were being justified by data that did not actually support its effectiveness. In this particular case, the illusion was preserved by the undeclared financial conflicts of interest of those who were generating these little studies and who also dominated the peer review process. Null trials were kept from being published or spun to look like positive trials, and any criticism was suppressed by negative peer reviews recommending rejection.
Most often, in psychotherapy research at least, there are no such obvious financial interests in play. Peer review typically draws upon persons who are identified as experts in an area of research. That sounds reasonable, except that in areas of research dominated by similarly flawed studies, we cannot reasonably expect peer reviewers to be overly critical of studies that share the same flaws as their own.
And then there is the problem of peer reviewers who should be fairer, but whose objectivity is overridden by worry that the credibility of the field would be damaged by any tough tell-it-like-it-is critique. Such well-meaning reviewers may recommend rejection of a manuscript solely on the basis of the authors not playing nice by offering constructive suggestions, rather than commenting on the flaws in the literature that no one else is willing to acknowledge. Conspiracies of silence can develop so that no one comments on the obvious, and anyone inclined to do so is kept out of the published literature.
Systematic reviews and meta-analyses provide opportunities for recognizing larger patterns in a literature and acknowledging the difficulty or impossibility of drawing firm conclusions as to whether interventions actually work from available studies. Yet, too often reviewers simply put lipstick on a pig of a literature, and comment how beautiful it is. Once such summaries are published, the likelihood decreases that anyone will go back to the primary studies and find the flaws, rather than relying on the secondary source that is now available.
In this blog post, I am going to focus on studies of couples interventions for cancer patients, a literature that is dominated by small studies that share similar flaws. Recently, a meta-analysis appeared in Psycho-Oncology that discreetly avoided commenting on the important limitations of available studies. My colleagues and I attempted to publish a brief commentary on it, but we got sandbagged by defensive reviewers. I’ll be discussing both the meta-analysis and the sandbagging. Here is the abstract of the meta-analysis, which is unfortunately not open access. But if you write to the lead author email@example.com, she will surely send you a PDF.
The authors’ systematic search succeeded in identifying 23 randomized trials, 20 of which could provide effect sizes for inclusion in a meta-analysis. The authors provided a table in which every single study reported at least one positive finding. Does that suggest that all is well in this literature, with positive results consistently occurring?
It so happens that one of the couple studies in the table was already the focus of a critical analysis in another of my blog posts. I identified inflated claims in its abstract and I described how, more generally, abstracts often distort findings reported in the results sections of articles. I gave some tips as to what to look for. Because many readers only look at an abstract without actually going to the article, distorted abstracts can perpetuate hype that does not get corrected. I’m quite sure that many of the other studies reported in the table are summarized on the basis of what appears in their abstracts, not their actual results.
As an example, one group of investigators apparently radically altered their design after results were known. The two-group design became a three-group design, with a new group created from patients assigned to the intervention who did not receive any exposure to it. The abstract reported positive effects for couples who received the intervention, leaving out those who were assigned to the intervention, but for whatever reason did not receive it.
Reproduced below is Figure 1 from the article, which presents forest plots of findings that differ from the glowing assessments offered in the table in the article. For instance, the overall effect size obtained from summarizing all the studies of patients’ distress was small (.25). Yet, curiously, only two studies, rather than all of them, individually obtained a significant effect. Not all studies that are listed in the table report effects for distress, but all that are reported are positive.
One of these studies was small (30 couples per group) and obtained only what the original study authors characterized as a “small” effect for patient distress (d= .22), not the 5 times bigger (d = 1.1) one displayed in the forest plot. The other had an adequately sized sample and a significant effect on patient outcomes. Unfortunately, when we and others have attempted to enter this study into meta-analyses, we consistently find that it is such an outlier that it cannot be considered a member of the same class as the rest of the interventions. This raises questions about its validity, unless the intervention happens to be spectacularly effective. Arguing against this possibility, the authors did not succeed in getting a significant effect for this intervention with another sample of cancer patients. This trial should have been excluded, as it has been in other meta analyses, leaving the one small positive trial.
As also seen in the figure above, a forest plot of findings for effects on spouse levels of distress similarly finds a small, significant effect (.21), with only one underpowered individual study being able to reject the null hypothesis of no effect.
Our critique that did not get published
The critique that we attempted to publish pointed out that 9 of the 20 studies entered into the meta-analysis had 35 or fewer patients per group/cell. The ability of an intervention trial of that size to detect a moderately sized effect as significant, even if it is present, is roughly 50% or less per trial. So, even if these interventions were amazingly and consistently effective, only about half of these trials should report a significant effect. Think of it: what we see reported for these studies is the equivalent of flipping a coin 9 times and getting all heads. That all of these trials were described as positive in the authors’ table 2 suggests that there is confirmatory publication bias.
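The arithmetic behind the coin-flip comparison can be sketched in a few lines of Python. This is a rough illustration, not the calculation from our commentary: it assumes a two-group trial with equal group sizes, a two-sided test at alpha = .05, a normal approximation to the t-test, and a "moderate" effect taken to be d = 0.5. The function names are made up for the example.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-group comparison (normal approximation).

    Assumes equal group sizes and a two-sided test; d is the true
    standardized mean difference.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = .05
    shift = d * sqrt(n_per_group / 2)   # expected z statistic under the true effect
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def prob_all_significant(power, n_trials):
    """Chance that every one of n independent trials reaches significance."""
    return power ** n_trials


# Power for a moderate effect at several small group sizes.
for n in (20, 30, 35):
    print(n, round(approx_power(0.5, n), 2))

# Nine significant results out of nine when each trial has 50% power.
print(prob_all_significant(0.5, 9))
```

Under these assumptions, power hovers around one half for groups of 30 to 35 and is lower for smaller groups. Nine significant results out of nine at 50% power has a probability of 1 in 512.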
When we examined these 20 trials for methodological quality, we found pervasive deficiencies that are typically associated with bias. Almost all of the trials suffered loss of couples from follow up, and results were based on whichever patients could be contacted. Any of a number of statistical strategies that could have reduced this particular bias were not applied. There was thus a lack of intention-to-treat analyses, which would have provided more conservative and accurate estimates of patient outcomes.
There was almost no evidence that any of these trials had specified a primary outcome ahead of time. Rather, investigators typically administered a number of measures and were free to pick the one that made the trial look best. That is termed selective outcome reporting. Because it had been happening so much in the medical literature, high impact medical journals now require investigators to register their designs and their primary outcomes in a publicly accessible place before they even run the first patient. No pre-registration means no publication in the journal. No such reforms have taken hold in the psychotherapy literature.
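To see why an unspecified primary outcome matters, consider a back-of-the-envelope calculation. It assumes, purely for illustration, that the outcome measures are statistically independent and that the intervention has no effect at all. Real outcome measures are correlated, so the inflation in an actual trial would typically be smaller than this sketch suggests.

```python
def chance_of_some_significant_outcome(n_outcomes, alpha=0.05):
    """Probability that at least one of n independent null outcomes is 'significant'."""
    return 1 - (1 - alpha) ** n_outcomes


# How the chance of a spurious "positive" finding grows with the number
# of outcome measures the investigator is free to choose from.
for k in (1, 3, 5, 10):
    print(k, round(chance_of_some_significant_outcome(k), 2))
```

The point is only directional: the more measures there are to choose from after the fact, the easier it is to find one that makes an ineffective intervention look effective.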
Finally, none of these trials presented evidence that the investigators had decided upon a particular sample size ahead of time and achieved that sample. The issue is that the investigators could monitor incoming data and either cease data collection when a significant result appeared or keep collecting data in the hope that one would appear.
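A small simulation shows what peeking does to the false positive rate. This is a sketch under simplifying assumptions that are mine, not taken from any of the trials: two groups with no true difference, normally distributed outcomes with known standard deviation 1, a z-test at each look, and looks after every 20 patients per group up to 100. The function names and the seed are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided test at alpha = .05


def trial_is_significant(rng, looks):
    """Simulate one null trial; True if any look crosses |z| > Z_CRIT."""
    sum_a = sum_b = 0.0
    n = 0
    for target in looks:
        # Recruit patients in both groups until the next planned look.
        while n < target:
            sum_a += rng.gauss(0, 1)
            sum_b += rng.gauss(0, 1)
            n += 1
        z = (sum_a / n - sum_b / n) / sqrt(2 / n)
        if abs(z) > Z_CRIT:
            return True  # investigator stops and reports a "positive" trial
    return False


def false_positive_rate(looks, n_trials=2000, seed=1):
    """Share of simulated null trials declared significant at some look."""
    rng = random.Random(seed)
    hits = sum(trial_is_significant(rng, looks) for _ in range(n_trials))
    return hits / n_trials


print("single planned analysis:", false_positive_rate([100]))
print("peeking five times:     ", false_positive_rate([20, 40, 60, 80, 100]))
```

With a single planned analysis the rate stays near the nominal 5%. With five looks and freedom to stop at the first significant one, it rises well above that, even though nothing in the simulated data is real.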
We did not invent these criteria for risk of bias. They are routinely used for assessing bias in the studies being entered in the highly respected Cochrane collaboration meta-analyses. We were simply invoking the standards that are accepted elsewhere for randomized trials.
What we thus found was that this entire literature was at high risk of bias because investigators were in a position to manipulate what they claimed were the results: by stopping the trial when results were at their best, by selectively analyzing the patients who stuck around to provide data, and by selecting, after the trial had been run, which outcome measures should be designated for emphasis. All of these problems are compounded by sample sizes too small to obtain consistently a moderate-sized effect, if it is there. These biases would be eliminated if investigators were required to register their designs ahead of time, including primary outcomes and sample size, and if they were also not allowed to peek at their data as it came in and to HARK (hypothesize after the results are known).
The evidence is that a published collection of small trials paradoxically often reports bigger effects than a collection of larger trials. There can be a number of reasons for this, but an obvious one is publication bias. If a small trial does not get an effect, reviewers and editors will reject the trial for publication because none should have been expected. However, if by chance or bias such a trial can be made to appear to have gotten a positive effect, it is published and the intervention is celebrated as so powerful that it can demonstrate effectiveness even when one would not expect it.
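The paradox can be reproduced with a toy simulation. The assumptions here are illustrative only: every trial tests the same intervention with a small true effect (d = 0.2), the observed effect is drawn from a normal distribution with standard error sqrt(2/n), and only trials with a significant positive result get published. None of these numbers come from the couples literature.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

Z_CRIT = NormalDist().inv_cdf(0.975)


def mean_published_effect(true_d, n_per_group, n_trials=20000, seed=7):
    """Average observed effect among trials that came out positive and significant."""
    rng = random.Random(seed)
    se = sqrt(2 / n_per_group)  # approximate standard error of d
    observed = (rng.gauss(true_d, se) for _ in range(n_trials))
    # Publication filter: only significant results in the hoped-for direction survive.
    published = [d for d in observed if d / se > Z_CRIT]
    return mean(published) if published else float("nan")


print("30 per group: ", round(mean_published_effect(0.2, 30), 2))
print("200 per group:", round(mean_published_effect(0.2, 200), 2))
```

Because a trial with 30 per group can only reach significance if its observed effect is above roughly d = 0.5, the published small trials all report large effects, although the true effect in the simulation is 0.2. The larger trials overstate it far less.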
There are various statistical strategies for compensating for the presence of small studies in a meta-analysis that also incorporates larger studies. But these require a larger number of studies to work with, and, most importantly, there is no means of compensating for such a preponderance of studies that are not only small, but similarly flawed. You cannot take studies of pervasively low methodological quality and combine them in a meta-analysis in a way that overcomes their flaws.
There is good evidence that we got sandbagged by reviewers, and I think the editors suspected that in continuing to solicit reviews until five had been received.
One reviewer simply recommended against publication because we did not make a significant contribution to the literature but did not explain that judgment.
Another reviewer got personal:
Coyne was cited in the acknowledgements of Badr and Krebs’ paper, meaning that he reviewed it prior to it going for peer review, which raises serious questions about what Coyne’s real ‘agenda’ is and the credibility of his argument.
This is a cheap shot. I was not even an author on the commentary, I did not see the final version of the manuscript, and I did not grant approval of being listed in the acknowledgment. Most journals require anyone being listed in the acknowledgments to provide approval. This requirement came about because of authors listing people in the acknowledgments without their permission. This strategy was used to exclude these people as possible reviewers and also to suggest to other reviewers that someone was endorsing the manuscript. After I complained, the journal removed my name from the acknowledgments in the PDF available at the website.
The reviewer continued with a technical defense of meta-analysis as being able to overcome the bias of small studies:
….the underlying premise of meta-analysis is that even though any or all individual studies might be flawed in one respect or another, when pooled together, they all provide an estimate of the true effect size. Third, Badr and Krebs used Hedges’ g, which corrects for small sample bias. Fourth, Badr and Krebs assessed fail-safe N values, which indicated that a substantial number of ‘hidden’ null studies would have been required to reduce the summary effect sizes to non-significance.
I commend this reviewer for knowing the lingo, but he misapplies it. Hedges’ g offers some correction for small trials, but cannot correct for their sharing methodological flaws. And the meta-analysis authors did not assess failsafe N, which involves estimating the number of null studies that had to remain unpublished to counter impressions based on only published studies. Calculating failsafe N was once quite fashionable in psychology, but it has long been abandoned elsewhere because it provides inflated estimates of the strength and quality of a literature.
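For readers who want to see what these two statistics actually do, here is a minimal sketch. It uses the common approximation for the Hedges small-sample correction and Rosenthal's formula for failsafe N; the inputs are hypothetical and are not taken from the meta-analysis.

```python
from statistics import NormalDist


def hedges_correction(n1, n2):
    """Approximate small-sample correction factor J applied to Cohen's d."""
    df = n1 + n2 - 2
    return 1 - 3 / (4 * df - 1)


def hedges_g(d, n1, n2):
    """Hedges' g: Cohen's d shrunk by the small-sample correction factor."""
    return d * hedges_correction(n1, n2)


def rosenthal_failsafe_n(z_scores, alpha=0.05):
    """Rosenthal's failsafe N: number of unpublished studies averaging z = 0
    needed to bring the pooled one-tailed result down to non-significance."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    k = len(z_scores)
    return sum(z_scores) ** 2 / z_alpha ** 2 - k


# A hypothetical trial with 30 couples per group and d = 0.5.
print(hedges_g(0.5, 30, 30))

# Twenty hypothetical studies, each with z = 1.0 (none individually significant).
print(rosenthal_failsafe_n([1.0] * 20))
```

With 30 per group the correction factor is 1 - 3/231, so g is little more than 1% smaller than d. It adjusts for a statistical artifact of small samples and says nothing about attrition, outcome switching, or early stopping. The failsafe N line takes twenty studies, none of them individually significant, and still returns a figure above one hundred, which gives some sense of why the statistic is so easily read as reassurance.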
And then there was the criticism from one reviewer that we should have reconsidered whether our standard criteria should even be applied to couples studies. Apparently, there should be a special dispensation granted. We
….point out, rightly so, that the failure to explicitly assign a primary outcome a priori gives researchers license to select positive findings and under-report null or adverse findings. While it is true that many intervention trials are prone to these temptations (in and outside of psycho-oncology), it is also true that the notion of a “primary outcome” may need to be reconsidered when testing relational, systemic interventions that target individual and dyad-level variables.
Another reviewer objected that we should not be criticizing studies in journals where they did not appear. Okay, but are we also disallowed from criticizing a meta-analysis that depended on them?
It not clear that this is the best forum for critiquing studies in other journals, which were published up to 10 years ago and before the guidelines for reporting randomized trials were often required by journals.
And then there was the damning of us as being negative without offering positive suggestions.
The major weakness of this paper is that it reads somewhat like an unfavorable manuscript review, and falls short of offering creative and forward-thinking ways of addressing the unique challenges of couple and family intervention research.
Whither our attempted criticism of couples research?
Although the reviews were quite harsh, only one of the five reviewers recommended outright rejection of our commentary, but a number suggested that we be limited to a 400 word letter to the editor with one reference. Based on their near unanimity, the editor rejected our appeal and so we will have the thankless exercise of condensing all our concerns into 400 words. Such strict limits on post publication commentary arose in an era when paper journals were worried about using up their scarce pages on letters to the editor. However, this particular journal no longer publishes a paper edition, and so the editor really should reconsider the tokenism of a 400 word letter.
But we got a symposium accepted for the July 2013 European Health Psychology conference in Bordeaux, France where we will discuss this whole affair. Be there! And we are negotiating an extended publication about this at another journal, based on this blog post.
Whither research concerning couples interventions for cancer patients?
The sad state of research concerning couples interventions for cancer patients is that results of underpowered, similarly flawed studies are being spun to create the illusion of a strong body of evidence that these interventions are effective. Meta-analyses such as the one we critiqued falsely reassure patients, clinicians, other researchers, and policy makers that couples interventions are effective when there is not yet quality evidence for this claim.
Those who are responsible for the illusion should consider an unanticipated consequence of it. It is inevitable that sooner or later critics with less of an investment in the status quo will expose the limitations of this research. In the interim, time will have been lost accumulating methodologically sophisticated research because of the appearance that no further research is needed. I can point to the loss of NIMH interest in funding research concerning couples intervention for depression a few decades ago because of the illusion that these interventions had already proved effective.
It is challenging to do meaningful studies of couples interventions for cancer patients. A while ago I corresponded with the author of one of the studies included in the meta-analysis, which happened to be her PhD thesis, and inquired what completion of the study entailed. She said
Difficulty recruiting couples is one major issue, and taking steps that enhance this process can be like a second project in itself. Over the course of the project I drove 140,000 kilometers, and a further two therapists drove around 20,000 kilometers each.
She noted the many features of the study that contributed to the relatively high rate of recruitment in her study: social marketing strategies to publicize the study, use of chart reviews to identify women potentially eligible for the study who could be personally approached and contacted by study staff, enlistment of oncologists in expressing enthusiasm for patients’ participation in the study, and the tailoring of the home-based intervention to the schedules and preferences of the women and their husbands. These are heroic efforts, perhaps crucial to the completion of the project, but are unlikely to occur in routine community cancer care. And most couples researchers do not adopt them.
Yet there is a larger issue in this couples research that presents greater challenges. A consensus is emerging in the literature concerning psychosocial interventions that studies that do not select patients for heightened distress are unlikely to be able to show that interventions are effective. There were signals in results of previous null trials, but they were obscured by spinning and confirmatory bias. Negative trials were recast to be positive and got published.
The null trials occur because most patients who are drawn to participate in trials are not sufficiently distressed to register clinically significant improvement, which is called a floor effect. Intervention studies are now struggling to recruit sufficient numbers of distressed cancer patients and discovering that only a minority of cancer patients found to be distressed in routine screening are interested in participating in intervention trials, even when treatment is free. So far, few studies restrict samples to distressed patients, and it is likely that multisite trials will be necessary in the future to recruit adequate samples.
Investigators of couples interventions are still hiding from the likelihood that these trials are recruiting patients who are insufficiently distressed to demonstrate that interventions are effective. The implications are more challenging for couples than for individual interventions. The prevalent expectation that couples interventions can improve both patient and spouse distress levels may be unrealistic. There is a reliable phenomenon that gender predicts distress levels in couples with cancer better than knowing whether someone in a couple with cancer is the patient or the spouse. Husbands of breast cancer patients on average are not more distressed than married men whose partner is not a cancer patient. Similarly, wives of prostate patients tend to be more distressed than the men themselves, who have low levels of psychological distress. Couples in which both partners have clinically significant levels of distress are in a small minority, in part because of men not reaching clinically significant levels of distress, regardless of whether they are the patients or the spouses.
Reform is possible. While it would take a while to implement preregistration of trials, steps could be taken immediately to bring claims in abstracts into greater alignment with the results that are actually reported. Furthermore, journals can cease to publish underpowered trials or at least insist upon acknowledgment of their limitations and that authors avoid hype. They could make a greater effort to publish transparent reporting of null trials and stop punishing honesty. And they could insist that reviewers who evaluate meta-analyses have some minimal level of competency.
In future blogs I will give more attention to sandbagging, which is just one of a number of significant risks of relying exclusively on peer review and expecting a fair and accurate evaluation of the evidence in the published literature. Sandbagging helps explain John Ioannidis’ observation that consistency of published findings is often only an index of the popularity of a particular point of view, not the strength of evidence.
Sandbagged manuscripts were once buried and forgotten. Fortunately, there are now blogs like this one in which we can present our criticisms of the literature, expose sandbagging reviewers and alert readers to the editorial decisions that keep criticism suppressed. Blogging will have to suffice until journals provide the support for the post-publication commentary needed to correct for the inevitable failings of peer review.