Power pose: I. Demonstrating that replication initiatives won’t salvage the trustworthiness of psychology

An ambitious multisite initiative showcases how inefficient and ineffective replication is in correcting bad science.

 


“Bad publication practices keep good scientists unnecessarily busy, as in replicability projects.” – Bjoern Brembs

Psychologists need to reconsider the pitfalls of an exclusive reliance on this strategy to improve lay persons’ trust in their field.

Despite the consistency of null findings across seven attempted replications of the original power pose study, editorial commentaries in Comprehensive Results in Social Psychology left some claims intact and called for further research.

Editorial commentaries on the seven null studies set the stage for continued marketing of self-help products, mainly to women, grounded in junk psychological pseudoscience.

Watch for repackaging and rebranding in next year’s new and improved model. Marketing campaigns will undoubtedly include direct quotes from the commentaries as endorsements.

We need to re-examine basic assumptions behind replication initiatives. Currently, these efforts suffer from prioritizing the reputations and egos of those who misuse psychological science to market junk and quack claims over protecting the consumers whom these gurus target.

In the absence of a critical response from within the profession to these persons prominently identifying themselves as psychologists, it is inevitable that the void will be filled by those outside the field who have no investment in preserving the image of psychology research.

In the case of power posing, watchdog critics might be recruited from:

Consumer advocates concerned about just another effort to defraud consumers.

Science-based skeptics who see in the marketing of the power posing familiar quackery in the same category as hawkers using pseudoscience to promote homeopathy, acupuncture, and detox supplements.

Feminists who decry the message that women need to get some balls (testosterone) if they want to compete with men and overcome gender disparities in pay. Feminists should be further outraged by the marketing of junk science to vulnerable women with an ugly message of self-blame: It is so easy to meet and overcome social inequalities that they have only themselves to blame if they do not do so by power posing.

As reported in Comprehensive Results in Social Psychology,  a coordinated effort to examine the replicability of results reported in Psychological Science concerning power posing left the phenomenon a candidate for future research.

I will be blogging more about that later, but for now let’s look at a commentary from three of the over 20 authors that reveals an inherent limitation to such ambitious initiatives in tackling the untrustworthiness of psychology.

Cesario J, Jonas KJ, Carney DR. CRSP special issue on power poses: what was the point and what did we learn? Comprehensive Results in Social Psychology. 2017.

 

Let’s start with the wrap up:

The very costly expense (in terms of time, money, and effort) required to chip away at published effects, needed to attain a “critical mass” of evidence given current publishing and statistical standards, is a highly inefficient use of resources in psychological science. Of course, science is to advance incrementally, but it should do so efficiently if possible. One cannot help but wonder whether the field would look different today had peer-reviewed preregistration been widely implemented a decade ago.

We should consider the first sentence with some recognition of just how much untrustworthy psychological science is out there. Must we mobilize similar resources in every instance, or can we develop some criteria to decide what is worthy of replication? As I have argued previously, there are excellent reasons for deciding that the original power pose study could not contribute a credible effect size to the literature. There is no there to replicate.

The authors assume preregistration of the power pose study would have solved problems. In clinical and health psychology, long-standing recommendations to preregister trials are acquiring new urgency. But the record is that motivated researchers routinely ignore requirements to preregister and ignore the primary outcomes and analytic plans to which they have committed themselves. Editors and journals let them get away with it.

What measures do the replicationados have to ensure the same things are not being said about bad psychological science a decade from now? Rather than urging uniform adoption and enforcement of preregistration, replicationados urged the gentle nudge of badges for studies which are preregistered.

Just prior to the last passage:

Moreover, it is obvious that the researchers contributing to this special issue framed their research as a productive and generative enterprise, not one designed to destroy or undermine past research. We are compelled to make this point given the tendency for researchers to react to failed replications by maligning the intentions or integrity of those researchers who fail to support past research, as though the desires of the researchers are fully responsible for the outcome of the research.

There are multiple reasons not to give the authors of the power pose paper such a break. There is abundant evidence of undeclared conflicts of interest in the huge financial rewards for publishing false and outrageous claims. Psychological Science allowed the abstract of the original paper to leave out any embarrassing details of the study design and results and end with a marketing slogan:

That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications.

 Then the Association for Psychological Science gave a boost to the marketing of this junk science with a Rising Star Award to two of the authors of this paper for having “already made great advancements in science.”

As seen in this special issue of Comprehensive Results in Social Psychology, the replicationados share responsibility with Psychological Science and APS for keeping this system of perverse incentives intact. At least they are guaranteeing plenty of junk science in the pipeline to replicate.

But in the next installment on power posing I will raise the question of whether early career researchers are hurting their prospects for advancement by getting involved in such efforts.

How many replicationados does it take to change a lightbulb? Who knows, but a multisite initiative can be combined with a Bayesian meta-analysis to give a tentative and unsatisfying answer.

Coyne JC. Replication initiatives will not salvage the trustworthiness of psychology. BMC Psychology. 2016 May 31;4(1):28.

The following can be interpreted as a declaration of financial interests or a sales pitch:

I will soon be offering e-books providing skeptical looks at positive psychology and mindfulness, as well as scientific writing courses on the web as I have been doing face-to-face for almost a decade.

Sign up at my website to get advance notice of the forthcoming e-books and web courses, as well as upcoming blog posts at this and other blog sites. Lots to see at CoyneoftheRealm.com.

 

‘Replace male doctors with female ones and save at least 32,000 lives each year’?

The authors of a recent article in JAMA Internal Medicine

“Physician Gender and Outcomes of Hospitalized Medicare Beneficiaries in the U.S.,” Yusuke Tsugawa, Anupam B. Jena, Jose F. Figueroa, E. John Orav, Daniel M. Blumenthal, Ashish K. Jha, JAMA Internal Medicine, online December 19, 2016, doi: 10.1001/jamainternmed.2016.7875

stirred lots of attention in the media with direct quotes like these:

“If we had a treatment that lowered mortality by 0.4 percentage points or half a percentage point, that is a treatment we would use widely. We would think of that as a clinically important treatment we want to use for our patients,” said Ashish Jha, professor of health policy at the Harvard School of Public Health. The estimate that 32,000 patients’ lives could be saved in the Medicare population alone is on par with the number of deaths from vehicle crashes each year.

Washington Post: Women really are better doctors, study suggests.

LA Times: How to save at least 32,000 lives each year: Replace male doctors with female ones.

NPR: Patients cared for by female doctors fare better than those treated by men.

My immediate reactions after looking at the abstract were only confirmed when I delved deeper.

Basically, we have a large, but limited and very noisy data set. It is unlikely that these data allow us to be confident about the strength of any signal concerning the relationship between physician gender and patient outcome that is so important to the authors. The small apparent differences could be just more noise on which the authors have zeroed in so that they can make a statement about the injustice of gender differences in physician pay.

 I am unwilling to relax methodological and statistical standards to manufacture support for such a change. There could be unwanted consequences of accepting that arguments can be made with such weak evidence, even for a good cause.

What if the authors had found the same small differences in noisy data in the reverse direction? Would they argue that we should preserve gender differences in physician pay? What if the authors had focused on a different variable in all this noise and concluded that the lower pay which women receive was associated with reduced mortality? Would we then advocate reducing the pay of both male and female physicians in order to improve patient outcomes?

Despite all the excitement that the claim about an effect of physician gender on patient mortality is generating, it is most likely that we are dealing with noise arising from overinterpretation of complex analyses that assume more completeness and precision than can be found in the data being analyzed.

These claims are not just a matter of causal relationships being spun from correlation. Rather, they are causal claims being made on the basis of partial correlations emerging in complex multivariate relationships found in an administrative data set.

  • Administrative data sets, particularly Medicare data sets like this one, are not constructed with such research questions in mind. There are severe constraints on what variables can be isolated and which potential confounds can be identified and tested.
  • Administrative data sets consist of records, not actual behaviors. It’s reasonable to infer a patient death from a record of a death. Linking a physician’s gender to a particular record is more problematic, as we will see. Even if we accept the association found in these records, it does not necessarily mean that physicians engaged in any particular behaviors or that physician behavior is associated with the pattern of deaths emerging in these multivariate analyses.
  • The authors start out with a statement about differences in how female and male physicians practice. In the actual article and the media, they have referred to variables like communication skills, providing evidence-based treatments, and encouraging health-related behaviors. None of these variables are remotely accessible in a Medicare data set.
  • Analyses of such administrative data sets do not allow isolation of the effects of physician gender from the effects of the contexts in which their practice occurs and relevant associated variables. We are not talking about a male or female physician encountering a particular patient being associated with a death or not, but an administrative record of physician gender arising in a particular context being interpreted as associated with a death. Male and female physicians may differ in being found in particular contexts in nonrandom fashion. It’s likely that these differences will dwarf any differences in outcomes. There will be a real challenge in even confidently attributing those outcomes to whether patients had an attending male or female physician.

The validity of complex multivariate analyses is strongly threatened by specification bias and residual confounding. The analyses must assume that all of the relevant confounds have been identified and measured without error. Departures from these ideal conditions can lead to spurious results, and generally do. Examination of the limitations in the variables available in a Medicare data set and how they were coded can quickly undermine any claim to validity.
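A small simulation can make the measurement-error point concrete. The sketch below uses entirely made-up data, not the study's: a hypothetical binary exposure has no effect at all on the outcome, but both depend on a confounder. Adjusting for the confounder as it truly is recovers an effect near zero; adjusting for a noisy measurement of it does not. All names and parameter values here are assumptions chosen for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def residualize(values, control):
    """Remove the part of `values` that is linearly explained by `control`."""
    mv, mc = mean(values), mean(control)
    cov = sum((v - mv) * (c - mc) for v, c in zip(values, control))
    var = sum((c - mc) ** 2 for c in control)
    slope = cov / var
    return [(v - mv) - slope * (c - mc) for v, c in zip(values, control)]


def adjusted_effect(outcome, exposure, control):
    """OLS coefficient on `exposure` after adjusting for `control`.

    Uses the Frisch-Waugh-Lovell result: regress the residualized outcome
    on the residualized exposure.
    """
    ry = residualize(outcome, control)
    rx = residualize(exposure, control)
    return sum(x * y for x, y in zip(rx, ry)) / sum(x * x for x in rx)


rng = random.Random(0)  # fixed seed so the toy example is repeatable
n = 20_000

# Hypothetical confounder (think: unmeasured patient severity).
confounder = [rng.gauss(0, 1) for _ in range(n)]
# Binary exposure that depends on the confounder.
exposure = [1.0 if c + rng.gauss(0, 1) > 0 else 0.0 for c in confounder]
# Outcome depends on the confounder only: the true exposure effect is zero.
outcome = [c + rng.gauss(0, 1) for c in confounder]
# What an administrative record might hold: the confounder plus error.
noisy_confounder = [c + rng.gauss(0, 1) for c in confounder]

effect_clean = adjusted_effect(outcome, exposure, confounder)
effect_noisy = adjusted_effect(outcome, exposure, noisy_confounder)

print(f"adjusting for the true confounder:  {effect_clean:.3f}")
print(f"adjusting for the noisy confounder: {effect_noisy:.3f}")
```

The second estimate sits well away from zero even though the exposure does nothing in this toy world. Statistical adjustment only removes as much confounding as the measured variable captures.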

Acceptance of claims about effects of particular variables like female physician gender arising in complex multivariate analyses involves assumptions of “all-other-things-being-equal.” If we attempt to move from statistical manipulation to inference about a real world encounter, we are no longer talking about a particular female physician, but a construction that may be very different from particular physicians interacting with particular patients in particular contexts.

How readily such statements become counterfactual can be seen if we move from the study to an example of science nerds and basketball players and hypothesize that, if John and Jason were of equivalent height, John would not study so hard.

Particularly in complex social situations, it is usually a fantasy that we can change one variable, and only one variable, not others. Just how did John and Jason get to be of equal height? And how are they now otherwise different?

Associations discovered in administrative data sets most often do not translate into effects observed in randomized trials. I’m not sure how we could get a representative sample of patients to disregard their preferences and accept random assignment to a male or female physician. It would have to be a very large study to detect the effect sizes reported in this observational study, and I’m skeptical this sufficiently strong signal would emerge from all of the noise.
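For a rough sense of what “very large” means, here is a back-of-the-envelope sketch using the standard normal-approximation formula for comparing two proportions. The two mortality rates are the adjusted figures reported in the article; the 5% two-sided alpha and 80% power are conventional defaults assumed here, not anything the authors specified, and the function name is made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate patients needed per arm to detect p1 vs p2.

    Normal-approximation formula for a two-sided test of two
    independent proportions with equal allocation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Adjusted 30-day mortality reported in the article: 11.49% vs 11.07%.
patients_per_arm = n_per_arm(0.1149, 0.1107)
print(f"patients needed per arm: {patients_per_arm:,}")
```

Under these assumptions the requirement runs to many tens of thousands of patients in each arm, all of whom would have to accept random assignment of their physician's gender.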

We might relax our standards and accept a quasi-experimental design that would be smaller but encompass a wider range of relevant variables. For instance, it is conceivable that we could construct a large sample in which physicians varied in terms of whether they had formal communication skills training. We might examine whether communications training influenced subsequent patient mortality, independent of physician gender, and vice versa. This would be a reasonable translation of the authors’ hypothesis that communication skills differences between male and female physicians account for what the authors believe is the observed association between physician gender and mortality. I know of no such study having been done. I know of no study demonstrating that physician communication training affects patient mortality. I’m skeptical that the typical communication training is so powerful in its effects. If such a study required substantial resources, rather than relying on data on hand, the strength of the results of the present study would not encourage me to marshal those resources.

What I saw when I looked at the article

We are dealing with very small adjusted differences in percentage arising in a large sample.

Patients treated by female physicians had lower 30-day mortality (adjusted mortality, 11.07% vs 11.49%; adjusted risk difference, –0.43%; 95% CI, –0.57% to –0.28%; P < .001; number needed to treat to prevent 1 death, 233).
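The number needed to treat in that passage is just the reciprocal of the absolute risk difference. A minimal sketch of the arithmetic, using only the figures quoted above (the function name is made up for illustration):

```python
def number_needed_to_treat(risk_difference_pct):
    """NNT is 1 divided by the absolute risk difference (as a proportion)."""
    return 1 / (abs(risk_difference_pct) / 100)


# Point estimate and 95% CI bounds for the adjusted risk difference, in
# percentage points, as quoted from the article.
nnt_point = number_needed_to_treat(-0.43)
nnt_ci = (number_needed_to_treat(-0.57), number_needed_to_treat(-0.28))

print(f"NNT at the point estimate: {nnt_point:.1f}")
print(f"NNT across the 95% CI: {nnt_ci[0]:.1f} to {nnt_ci[1]:.1f}")
```

Rounding the point estimate gives the reported 233. The same reciprocal applied to the confidence limits shows how widely the NNT swings on sampling error alone, before any of the measurement problems discussed here enter the picture.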

Assignment of a particular patient to a particular physician is done with a lot of noise.

We assigned each hospitalization to a physician based on the National Provider Identifier in the Carrier File that accounted for the largest amount of Medicare Part B spending during that hospitalization. Part B spending comprises professional and other fees determined by the physician. On average, these physicians were responsible for 51.1% of total Part B spending for a given hospitalization.

One commentator quoted in a news article noted:

William Weeks, a professor of psychiatry at Dartmouth’s Geisel School of Medicine, said that the researchers had done a good job of trying to control for other factors that might influence the outcome. He noted that one caveat is that hospital care is usually done by a team. That fact was underscored by the method the researchers used to identify the doctor who led the care for patients in the study. To identify the gender of the physician, they looked for the doctor responsible for the biggest chunk of billing for hospital services — which was, on average, about half. That means that almost half of the care was provided by others.

Actually, much of the care is not provided by the attending physician, but other staff, including nurses and residents.

The authors undertook the study to call attention to gender disparities in physician pay. But could disparities show up in males being able to claim more billable procedures – greater credit administratively for what is done with patients during hospitalization, including by other physicians? This might explain at least some of the gender differences, but could undermine the validity of this key variable in relating physician gender to differences in patient outcome.

The statistical control of differences in patient and physician characteristics afforded by variables in this data set is inadequate.

Presumably, a full range of patient variables is related to whether patients die within 30 days of a hospitalization. Recall the key assumption that all of the relevant confounds have been identified and assessed without error in considering the variables used to characterize patient characteristics:

Patient characteristics included patient age in 5-year increments (the oldest group was categorized as ≥95 years), sex, race/ethnicity (non-Hispanic white, non-Hispanic black, Hispanic, and other), primary diagnosis (Medicare Severity Diagnosis Related Group), 27 coexisting conditions (determined using the Elixhauser comorbidity index), median annual household income estimated from residential zip codes (in deciles), an indicator variable for Medicaid coverage, and indicator variables for year.

Note that the comorbidity index is based on collapsing 27 other variables into one number. Simplifies the statistics, yes, but with a tremendous loss of information.

Recall the assumption that this set of variables represents not just what is available in an administrative data set, but all the patient characteristics relevant to their dying within 30 days after discharge from the hospital. Are we really willing to accept this assumption?

For the physician variables displayed at the top of Table 1, there are huge differences between male and female physicians, relative to the modest difference in patient mortality, adjusted mortality, 11.07% vs 11.49%.


These authors encourage us to think about the results as simulating a randomized trial, except that statistical controls are serving the function that randomization of patients to physician gender would serve. We are being asked to accept that these differences in baseline characteristics of the practices of female versus male physicians can be eliminated through statistics. We would never accept that argument in a randomized trial.

Addressing criticisms of the authors’ interpretation of their results.

The senior author provided a pair of blog posts in which he acknowledges criticism of his study, but attempts to defuse key objections. It’s unfortunate that the sources of these objections are not identified, and so we are dependent on the author’s summary out of context. I think the key responses are to straw man objections.

Correlation, Causation, and Gender Differences in Patient Outcomes

Do women make better doctors than men?

Correlation is not causation.

We often make causal inferences based on observational data – and here’s the kicker: sometimes, we should.  Think smoking and lung cancer.  Remember the RCT that assigned people to smoking (versus not) to see if it really caused lung cancer?  Me neither…because it never happened.  So, if you are a strict “correlation is not causation” person who thinks observational data only create hypotheses that need to be tested using RCTs, you should only feel comfortable stating that smoking is associated with lung cancer but it’s only a hypothesis for which we await an RCT.  That’s silly.  Smoking causes lung cancer.

No, it is this argument that is silly. We can now look back on the data concerning smoking and lung cancer and benefit from the hindsight provided by years of sorting smoking as a risk factor from potential confounds. I recall at some point, drinking coffee being related to lung cancer in the United States, whereas drinking tea was correlated in the UK. Of course, if we don’t know that smoking is the culprit, we might miss that in the US, smoking was done while drinking coffee, whereas in the UK, it was done while drinking tea.

And isolating smoking as a risk factor, rather than just a marker for risk, is so much simpler than isolating whatever risk factors for death are hidden behind physician gender as a marker for risk of mortality.

Coming up with alternative explanations for the apparent link between physician gender and patient mortality.

The final issue – alternative explanations – has been brought up by nearly every critic. There must be an alternative explanation! There must be confounding!  But the critics have mostly failed to come up with what a plausible confounder could be.  Remember, a variable, in order to be a confounder, must be correlated both with the predictor (gender) and outcome (mortality).

This is similarly a fallacious argument. I am not arguing for alternative substantive explanations; I’m proposing that spurious results were produced by pervasive specification bias, including measurement error. There is no potential confounder I have to identify. I am simply arguing that the small differences in mortality are dwarfed by specification and measurement error.

This tiny difference is actually huge in its implications.

Several critics have brought up the point that statistical significance and clinical significance are not the same thing.  This too is epidemiology 101.  Something can be statistically significant but clinically irrelevant.  Is a 0.43 percentage point difference in mortality rate clinically important? This is not a scientific or a statistical question.  This is a clinical question. A policy and public health question.  And people can reasonably disagree.  From a public health point of view, a 0.43 percentage point difference in mortality for Medicare beneficiaries admitted for medical conditions translates into potentially 32,000 additional deaths. You might decide that this is not clinically important. I think it is. It’s a judgment call and we can disagree.

The author is taking a small difference and magnifying its importance by applying it to a larger population. He is attributing the “additional deaths” to patients being treated by men. I feel he hasn’t made a case that physician gender is the culprit, and so nothing is accomplished except introducing shock and awe by amplifying the small effect into its implications for the larger population.
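The amplification is simple multiplication, and it can be run in reverse. The sketch below back-calculates the number of hospitalizations implied by pairing the 32,000 figure with the 0.43 percentage point difference, then pushes the reported confidence limits through the same multiplication. The implied hospitalization count is an inference from those two quoted numbers, not a figure taken from the article.

```python
HEADLINE_DEATHS = 32_000
RISK_DIFFERENCE = 0.43 / 100             # adjusted risk difference, as a proportion
CI_BOUNDS = (0.28 / 100, 0.57 / 100)     # reported 95% CI, as proportions

# Hospitalizations per year implied by the two quoted numbers (an inference).
implied_hospitalizations = HEADLINE_DEATHS / RISK_DIFFERENCE

# The same scaling applied to the confidence limits.
deaths_range = tuple(round(implied_hospitalizations * b) for b in CI_BOUNDS)

print(f"implied hospitalizations: {implied_hospitalizations:,.0f}")
print(f"deaths at the CI limits:  {deaths_range[0]:,} to {deaths_range[1]:,}")
```

Scaling up does not add information. The headline figure inherits every uncertainty in the small difference it is built on, and this range reflects sampling error only, not the specification and measurement error argued above.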

In response to a journalist, the author makes a parallel argument:

The estimate that 32,000 patients’ lives could be saved in the Medicare population alone is on par with the number of deaths from vehicle crashes each year.

In addition to what I have already argued, if we know the same number of deaths are attributable to automobile crashes, we at least know how to take steps to reduce these crashes and the mortality associated with them. We don’t know how to change the mortality the authors claim is associated with physician gender. We don’t even know that the author’s claims are valid.

Searching for meaning where no meaning is to be found.

In framing the study and interpreting the results to the media, the authors undertake a search of the literature with a heavy confirmation bias, ignoring the many contradictions that are uncovered with a systematic search. For instance, one commentator on the senior author’s blog notes

It took me about 5 minutes of Google searching to find a Canadian report suggesting that female physicians in that country have workloads around 75% to 80% of male physicians:

https://secure.cihi.ca/free_products/PracticingPhysicianCommunityCanada.pdf

If US data is even vaguely similar, that factor would be a serious omission from your article.

But the authors were looking for what supported the results, not for studies that potentially challenged or contradicted their results. They are looking to strengthen a narrative, not expose it to refutation.

Is there a call to action here?

As consumers of health services, we could all switch to being cared for by female physicians. I suspect that some of the systems and structural issues associated with the appearance that care by male physicians is inferior would be spread among females, including increased workloads. The bias in the ability of male physicians to claim credit for the work of others would be redistributed to women. Neither would improve patient mortality.

We should push for reduction in inequalities in pay related to gender. But we don’t need results of this study to encourage us.

I certainly know health care professionals and researchers who have more confidence in communication learning modules producing clinically significant changes in physician behavior. I don’t know any of them who could produce evidence that these changes include measurable reductions in patient mortality. If someone produces such data, I’m capable of being persuaded. But the present study adds nothing to my confidence in that likelihood.

If we are uncomfortable with the communication skills or attention to evidence that our personal physicians display, we should replace them. But I don’t think this study provides additional evidence for us doing so, beyond the legitimacy of us acting on our preferences.

In the end, this article reminds us to stick to our standards and not be tempted to relax them to make socially acceptable points.

 

 

 

 

 

Sex and the single amygdala: A tale almost saved by a peek at the data

So sexy! Was bringing up ‘risky sex’ merely a strategy to publish questionable and uninformative science?

My continuing question: Can skeptics who are not specialists, but who are science-minded and have some basic skills, learn to quickly screen and detect questionable science in the journals and media coverage?

“You don’t need a weatherman to know which way the wind blows.” – Bob Dylan

I hope so. One goal of my blogging is to arouse readers’ skepticism and provide them some tools so that they can decide for themselves what to believe, what to reject, and what needs a closer look or a check against trusted sources.

Skepticism is always warranted in science, but it is particularly handy when confronting the superficial application of neuroscience to every aspect of human behavior. Neuroscience is increasingly being brought into conversations to sell ideas and products when it is neither necessary nor relevant. Many claims about how the brain is involved are false or exaggerated not only in the media, but in the peer-reviewed journals themselves.

A while ago I showed how a neuroscientist and a workshop guru teamed up to try to persuade clinicians with functional magnetic resonance imaging (fMRI) data that a couples therapy was more sciencey than the rest. Although I took a look at some complicated neuroscience, a lot of my reasoning [1, 2, 3] merely involved applying basic knowledge of statistics and experimental design. I raised sufficient skepticism to dismiss the neuroscientist and psychotherapy guru’s claims, even putting aside the excellent specialist insights provided by Neurocritic and his friend Magneto.

In this issue of Mind the Brain, I’m pursuing another tip from Neurocritic about some faulty neuroscience in need of debunking.

The paper

Victor, E. C., Sansosti, A. A., Bowman, H. C., & Hariri, A. R. (2015). Differential Patterns of Amygdala and Ventral Striatum Activation Predict Gender-Specific Changes in Sexual Risk Behavior. The Journal of Neuroscience, 35(23), 8896-8900.

Unfortunately, the paper is behind a pay wall. If you can’t get it through a university library portal, you can send a request for a PDF to the corresponding author, elizabeth.victor@duke.edu.

The abstract

Although the initiation of sexual behavior is common among adolescents and young adults, some individuals express this behavior in a manner that significantly increases their risk for negative outcomes including sexually transmitted infections. Based on accumulating evidence, we have hypothesized that increased sexual risk behavior reflects, in part, an imbalance between neural circuits mediating approach and avoidance in particular as manifest by relatively increased ventral striatum (VS) activity and relatively decreased amygdala activity. Here, we test our hypothesis using data from seventy 18- to 22-year-old university students participating in the Duke Neurogenetics Study. We found a significant three-way interaction between amygdala activation, VS activation, and gender predicting changes in the number of sexual partners over time. Although relatively increased VS activation predicted greater increases in sexual partners for both men and women, the effect in men was contingent on the presence of relatively decreased amygdala activation and the effect in women was contingent on the presence of relatively increased amygdala activation. These findings suggest unique gender differences in how complex interactions between neural circuit function contributing to approach and avoidance may be expressed as sexual risk behavior in young adults. As such, our findings have the potential to inform the development of novel, gender-specific strategies that may be more effective at curtailing sexual risk behavior.

My thought processes

Hmm, sexual risk behavior, meaning number of partners? How many new partners during a follow-up period constitutes “risky,” and does it matter whether safe sex was practiced? Well, ignoring these issues and calling it “sexual risk behavior” allows the authors to claim relevance to hot topics like HIV prevention….

But let’s cut to the chase: I’m always skeptical about a storyline depending on a three-way statistical interaction. These effects are highly unreliable, particularly with a sample size of only N = 70. I’m suspicious of why investigators would stake their claims ahead of time on a three-way interaction rather than something simpler. I will be looking for evidence that they started with this hypothesis in mind, rather than cooking it up after peeking at the data.

Three-way interactions involve dividing a sample up into at least eight boxes, in this case, 2 x (2) x (2). Such interactions can be mind-boggling to interpret, and this one is no exception:

Although relatively increased VS activation predicted greater increases in sexual partners for both men and women, the effect in men was contingent on the presence of relatively decreased amygdala activation and the effect in women was contingent on the presence of relatively increased amygdala activation.
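To put those eight boxes in perspective, here is a quick sketch of how thin N = 70 gets when spread across them. It takes the 2 x 2 x 2 framing at face value and assumes, purely for illustration, that participants fall evenly into the cells; the level labels are placeholders, and the real cell counts are not reported in the abstract.

```python
from itertools import product

N = 70  # sample size reported in the abstract

# Three two-level factors; the level labels are illustrative placeholders.
factors = {
    "gender": ["men", "women"],
    "VS activation": ["lower", "higher"],
    "amygdala activation": ["lower", "higher"],
}

cells = list(product(*factors.values()))
average_per_cell = N / len(cells)

for cell in cells:
    print(cell)
print(f"{len(cells)} cells, about {average_per_cell:.2f} participants each")
```

Even in the best case of a perfectly even split, each box holds fewer than nine people, so a couple of unusual participants can carry an entire cell.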

And then the “simple” interpretation?

These findings suggest unique gender differences in how complex interactions between neural circuit function contributing to approach and avoidance may be expressed as sexual risk behavior in young adults.

And the public health implications?

As such, our findings have the potential to inform the development of novel, gender-specific strategies that may be more effective at curtailing sexual risk behavior.

Just how should these data inform public health strategies beyond what we knew before we stumbled upon this article? Really, should we stick people’s heads in a machine and gather fMRI data before offering them condoms? Should we encourage computer dating services to post along with a recent headshot, recent fMRI images showing that prospective dates do not have their risky behavior center in the amygdala activated? Or encourage young people to get their heads examined with an fMRI before deciding whether it’s wise to sleep with somebody new?

So it’s difficult to see the practical relevance of these findings, but let’s stick around and consider the paragraph that Neurocritic singled out.

The paragraph

The majority of the sample reported engaging in vaginal sex at least once in their lifetime (n = 42, 60%). The mean number of vaginal sexual partners at baseline was 1.28 (SD =0.68). The mean increase in vaginal sexual partners at the last follow-up was 0.71 (SD = 1.51). There were no significant differences between men and women in self-reported baseline or change in self-reported number of sexual partners (t=0.05, p=0.96; t=1.02, p= 0.31, respectively). Although there was not a significant association between age and self-reported number of partners at baseline (r = 0.17, p= 0.16), younger participants were more likely to report a greater increase in partners over time (r =0.24, p =0.04). Notably, distribution analyses revealed two individuals with outlying values (3 SD from M; both subjects reported an increase in 8 partners between baseline and follow up). Given the low rate of sexual risk behavior reported in the sample, these outliers were not excluded, as they likely best represent young adults engaging in sexual risk behavior.

What triggers skepticism?

This paragraph is quite revealing if we just ponder it a bit.

First, notice there is only a single significant correlation (p = .04) in a subgroup analysis. Differences between men and women were examined, with nothing significant in either baseline number or change in number of sexual partners over the length of the observation. However, disregarding that finding, the authors went on to explore changes in number of partners over time among the younger participants and, bingo, there was their p = 0.04.
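To see just how marginal that lone correlation is, here is a quick sketch that converts the reported r = 0.24 into a t statistic. It assumes all 70 participants entered the correlation (so df = 68) and uses a rounded two-tailed critical value of about 2.0; both are my assumptions, not figures from the paper.

```python
import math

# Figures quoted from the paper's paragraph
r = 0.24   # correlation between age and change in number of partners
n = 70     # assumption: every participant contributed to this correlation

df = n - 2

# Standard conversion of a Pearson r to a t statistic
t_stat = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

# Assumption: the two-tailed critical t at alpha = .05 is roughly 2.0 for df = 68
t_crit_approx = 2.0

# Smallest r that would clear that approximate threshold with this df
r_needed = t_crit_approx / math.sqrt(t_crit_approx ** 2 + df)

print(f"t({df}) = {t_stat:.2f}, approximate critical t = {t_crit_approx}")
print(f"smallest significant r is about {r_needed:.3f}; reported r = {r}")
```

Under these assumptions the reported correlation clears the bar by a hair, so it would not take much of a change in the data to lose it.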

Whoa! Age was never mentioned in the abstract. We are now beyond the 2 x 2 x 2 interaction mentioned in the abstract and rooting through another dimension, younger versus older.

But, worse, getting that significance required retaining two participants with eight new sexual partners each during the follow-up period. The decision to retain these participants was made after the pattern of results was examined with and without inclusion of these outliers. The authors say so and essentially say they decided because it made a better story.

The only group means and standard deviations reported included these two participants. Even including the participants, the average number of new sexual partners was less than one during follow-up. We have no idea whether that one was risky or not. It’s a safer assumption that having eight new partners is risky, but even that we don’t know for sure.
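A little arithmetic shows how much those two participants matter. The sketch below uses only the figures quoted in the paragraph and assumes the reported mean increase of 0.71 was computed over all 70 participants; the paper does not report a mean without the outliers, so the "without" figure is my back-calculation.

```python
# Figures quoted from the paper's paragraph
n = 70                # participants in the substudy
mean_change = 0.71    # reported mean increase in partners at last follow-up
sd_change = 1.51      # reported SD of that increase
outlier_change = 8    # increase reported by each of the two outliers
n_outliers = 2

# Assumption: the reported mean is taken over all 70 participants
total_change = mean_change * n
outlier_total = outlier_change * n_outliers

# Share of all new partners in the sample contributed by two people
outlier_share = outlier_total / total_change

# Back-calculated mean for the remaining 68 participants
mean_without = (total_change - outlier_total) / (n - n_outliers)

# How far out the outliers sit, in units of the reported SD
z_outlier = (outlier_change - mean_change) / sd_change

print(f"share of new partners from the two outliers: {outlier_share:.0%}")
print(f"mean change without them: {mean_without:.2f}")
print(f"outlier distance from the mean: {z_outlier:.1f} SD")
```

On these assumptions, two people out of 70 account for roughly a third of all the new partners in the sample, and the other 68 average about half a new partner each.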

Keep in mind for future reference: Investigators are supposed to make decisions about outliers without reference to the fate of the hypothesis being studied. And knowing nothing about this particular study, most authorities would say if two people out of 70 are way out there on a particular variable that otherwise has little variance, you should exclude them.

It is considered a Questionable Research Practice to make decisions about inclusion/exclusion based on what story the outcome of this decision allows the authors to tell. It is p-hacking, and significance chasing.

And note the distribution of numbers of vaginal sex partners. Twenty-eight participants had none at the end of the study. Most accumulated fewer than one during the follow-up, and even that mean number was distorted by two having eight partners. Hmm, it is going to be hard to get multivariate statistics to work appropriately when we get to the fancy neuroscience data. We could go off on discussions of multivariate normal or Poisson distributions or just think a bit.

We can do a little detective work and determine that one outlier was a male, another a female. (*1) Let’s go back to our eight little boxes of participants that are involved in the interpretation of the three-way interaction. It’s going to make a great difference exactly where the deviant male and female are dropped into one of the boxes or whether they are left out.

And think about sampling issues. What if, for reasons having nothing to do with the study, neither of these outliers had shown up? Or, if only one of them had shown up, it would skew the results in a particular direction, depending on whether the participant was the male or the female.

Okay, if we were wasting our time continuing to read the article after finding what we did in the abstract, we are certainly wasting more of our time by continuing after reading this paragraph. But let’s keep poking around as an educational exercise.

The rest of the methods and results sections

We learn from the methods section that there was an ethnically diverse sample with a highly variable follow-up, from zero days to 3.19 years (M = 188.72 d, SD = 257.15; range = 0 d–3.19 years). And there were only 24 men in the original sample for the paper of 70 participants.

We don’t know whether these two outliers had eight sexual partners within a week of the first assessment or they were the ones captured in extending the study to more than 3 years. That matters somewhat, but we also have to worry whether this was an appropriate sample (with so few participants in it in the first place and even fewer who had sex by the end of the study) and length of follow-up to do such a study. The mean follow-up of about six months and huge standard deviation suggest there is not a lot of evidence of risky behavior, at least in terms of casual vaginal sex.
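To make the "so few participants" worry concrete, here is a rough sketch of how thin the eight boxes get. It assumes each gender is split evenly into high/low VS and high/low amygdala groups, which is my simplification for illustration and not necessarily how the authors modeled the data. The typical change of 0.5 partners per person is likewise a hypothetical round number.

```python
# Figures from the methods section
n_total = 70
n_men = 24
n_women = n_total - n_men

# Assumption: even high/low splits on VS and amygdala within each gender
cells_per_gender = 2 * 2
men_per_cell = n_men / cells_per_gender
women_per_cell = n_women / cells_per_gender


def cell_mean_with_outlier(cell_size, typical_change, outlier_change=8):
    """Mean change in a box where one member is an outlier and the
    rest all report the same (hypothetical) typical change."""
    return ((cell_size - 1) * typical_change + outlier_change) / cell_size


# Hypothetical: everyone else in the box gained 0.5 partners on average
typical = 0.5
men_cell_mean = cell_mean_with_outlier(int(men_per_cell), typical)

print(f"men per box: {men_per_cell}, women per box: {women_per_cell}")
print(f"mean of a six-man box with one outlier: {men_cell_mean}")
```

With about six men per box, a single participant reporting eight new partners can more than triple that box's mean in this toy setup, which is why it matters so much where the two outliers land.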

This is all getting very funky.

So I wondered about the larger context of the study, with increasing doubts that the authors had gone to all this trouble just to test an a priori hypothesis about risky sex.

We are told that the larger context is the ongoing “Duke Neurogenetics Study (DNS), which assesses a wide range of behavioral and biological traits.” The extensive list of inclusions and exclusions suggests a much more ambitious study. If we had more time, we could go look up the Duke Neurogenetics Study and see if that’s the case. But I have a strong suspicion that the study was not organized around the specific research questions of this paper (*2). I really can’t tell without any preregistration of this particular paper but I certainly have questions about how much Hypothesizing after the Results Are Known (HARKing) is going on here in the refining of hypotheses and measures, and decisions about which data to report.

Further explorations of the results section

I remind readers that I know little about fMRI data. Put it aside and we can discover some interesting things reading through the brief results section.

Main effects of task

As expected, our fMRI paradigms elicited robust affect-related amygdala and reward-related VS activity across the entire parent sample of 917 participants (Fig. 1). In our substudy sample of 70 participants, there were no significant effects of gender (t(70) values < 0.88, p values >0.17) or age (r values < 0.22; p values > 0.07) on VS or amygdala activity in either hemisphere.

Hmm, let’s focus on the second sentence first. The authors tell us absolutely nothing is going on in terms of differences in amygdala and reward-related VS activity in relation to age and gender in the sample of 70 participants in the current study. In fact, we don’t even need to know what “amygdala and reward-related VS activity” is to wonder why the first sentence of this paragraph directs us to a graph not of the 70 participants, but a larger sample of 917 participants. And when we go to figure 1, we see some wild wowie zowie, hit-the-reader-between-the-eyes differences (in technical terms, intraocular trauma) for women. And claims of p < 0.000001 twice. But wait! One might think significance of that magnitude would have to come from the 917 participants, except the labeling of the X-axis must come from the substudy of the 70 participants for whom data concerning number of sex partners was collected. Maybe the significance comes from the anchoring of one of the graph lines by the one way-out outlier.

Note that the outlier woman with eight partners anchors the blue line for High Left Amygdala. Without inclusion of that single woman, the nonsignificant trends between women with High Left Amygdala versus women with Low Left Amygdala would be reversed.
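Here is a toy illustration of how one extreme case can anchor a fitted line and flip its direction. The numbers are made up for the purpose and are NOT the paper's data; `activation` and `new_partners` are hypothetical stand-ins for a brain measure and the change in number of partners.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up data: nine participants, little variance in the outcome
activation = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
new_partners = [1, 1, 1, 0, 1, 0, 0, 0, 0]

slope_without = ols_slope(activation, new_partners)

# Add one hypothetical outlier: high activation, eight new partners
slope_with = ols_slope(activation + [2.0], new_partners + [8])

print(f"slope without the outlier: {slope_without:.2f}")
print(f"slope with the outlier:    {slope_with:.2f}")
```

In this toy data the trend is mildly negative until the single extreme participant is added, at which point it turns positive. The same mechanism is what I suspect is at work in the figures.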

The authors make much of the differences between Figure 1 showing Results for Women and Figure 2 showing Results for Men. The comparison seems dramatic except that, once again, the one outlier sends the red line for Low Left Amygdala off from the blue line for High Left Amygdala. Otherwise, there is no story to tell. Mind-boggling, but I think we can safely conclude that something is amiss in these Frankenstein graphs.

Okay, we should stop beating a corpse of an article. There are no vital signs left.

Alternatively, we could probe the section on Poisson regressions and minimally note some details. There is the flash of some strings of zeros in the P values, but it seems complicated and then we are warned off with “no factors survive Bonferroni correction.” And then in the next paragraph, we get to exploring dubious interactions. And there is the final insult of the authors bringing in a two-way interaction trending toward significance among men, p =.051.

But we were never told how all this would lead, as we were promised at the end of the abstract, “to the development of novel, gender-specific strategies that may be more effective at curtailing sexual risk behavior.”

Rushing through the discussion section, we note the disclosure that

The nature of these unexpected gender differences is unclear and warrants further consideration.

So, the authors confess that they did not start with expectations of finding a gender difference. They had nothing to report from a subset of data from an ambitious project put together for other purposes with an ill-suited follow-up for the research question (and even an ill-suited experimental task). They made a decision to include two outliers, salvaged some otherwise weak and inconsistent differences, and then constructed a story that depended on their inclusion. Bingo, they can survive confirmation bias and get published.

Readers might have been left with just their skepticism about the three-way interaction described in the abstract. However, the authors implicated themselves by disclosing in the article their examination of a distribution and their reasons for including the outliers. Then they further disclosed they did not start with a hypothesis about gender differences.

Why didn’t the editor and reviewers at Journal of Neuroscience (impact factor 6.344) do their job and cry foul? Questionable research practices (QRPs) are brought to us courtesy of questionable publication practices (QPPs).

And then we end with the confident

These limitations notwithstanding, our current results suggest the importance of considering gender-specific patterns of interactions between functional neural circuits supporting approach and avoidance in the expression of sexual risk behavior in young adults.

Yet despite this vague claim, the authors still haven’t explained how this research could be translated to practice.

Takeaway points for the future.

Without a tip from NeuroCritic, I might not have otherwise zeroed in on the dubious complex statistical interaction on which the storyline in the abstract depended. I also benefited from the authors for whatever reason telling us that they had peeked at the data and telling us further in the discussion that they had not anticipated the gender difference. With current standards for transparency and no preregistration of such studies, it would’ve been easy for us to miss what was done because the authors did not need to alert us. Until there are more and better standards enforced, we just need to be extra skeptical of claims of the application of neuroscience to everyday life.

Trust your skepticism.

Apply whatever you know about statistics and experimental methods. You probably know more than you think you do.

Beware of modest-sized neuroscience studies for which authors develop storylines from the patterns they can discover in their data, not from a priori hypotheses suggested by a theory. If you keep looking around in the scientific literature and media coverage of it, I think you will find a lot of this QRP and QPP.

Don’t go into a default believe-it mode just because an article is peer-reviewed.

Notes

  1. If both the outliers were of the same gender, it would have been enough for that gender to have had significantly more sex partners than the other.
  2. Later we are told in the Discussion section that the particular stimuli for which fMRI data were available were not chosen for relevance to the research question claimed for this paper.

We did not measure VS and amygdala activity in response to sexually provocative stimuli but rather to more general representations of reward and affective arousal. It is possible that variability in VS and amygdala activity to such explicit stimuli may have different or nonexistent gender-specific patterns that may or may not map onto sexual risk behaviors.

Special thanks to Neurocritic for suggesting this blog post and for feedback, as well as to Neuroskeptic, Jessie Sun, and Hayley Jach for helpful feedback. However, @CoyneoftheRealm bears sole responsibility for any excesses or errors in this post.