NOTE: After I wrote this blog post, I received via PNAS the reply from Steve Cole and Barbara Fredrickson to our article. I have not had time to thoroughly digest it, but will address it in a future blog post. My preliminary impression is that their reply is, ah…a piece of work. For a start, they attack our mechanical bitmapping of their data as an unvalidated statistical procedure. But calling it a statistical procedure is like Sarah Palin calling Africa a country. And they again assert the validity of their scoring of a self-report questionnaire without documentation. As seen below, I had already offered to donate $100 to charity if they can produce the unpublished analyses that justified this idiosyncratic scoring. The offer stands. They claim that our factor analyses were inappropriate because the sample size was too small, but we used their data, which they claimed to have factor analyzed. Geesh. But more on their reply later.
Our new PNAS article questions the reliability of results and interpretations in a high profile previous PNAS article.
Fredrickson, Barbara L., Karen M. Grewen, Kimberly A. Coffey, Sara B. Algoe, Ann M. Firestine, Jesusa MG Arevalo, Jeffrey Ma, and Steven W. Cole. “A functional genomic perspective on human well-being.” Proceedings of the National Academy of Sciences 110, no. 33 (2013): 13684-13689.
Was the original article a matter of “science” made for press release? Our article raises questions about the gullibility of the scientific community and journalists regarding claims of breakthrough discoveries from small studies with provocative but fuzzy theorizing, and with complicated methodologies and statistical analyses that apparently even the authors themselves do not understand.
- Multiple analyses of original data do not find separate factors indicating striving for pleasure versus purpose
- Random number generators yield best predictors of gene expression from the original data
[Warning, numbers ahead. This blog post contains some excerpts from the results section that contain lots of numbers and require some sophistication to interpret. I encourage readers to at least skim these sections, to allow independent evaluation of some of the things that I will say in the rest of the blog.]
A well-orchestrated media blitz for the PNAS article had triggered my skepticism. The Economist, CNN, The Atlantic Monthly and countless newspapers seemingly sang praise in unison for the significance of the article.
Maybe the research reported in PNAS was, as one of the authors, Barbara Fredrickson, claimed, a major breakthrough in behavioral genomics, a science-based solution to the age-old philosophical problem of how to lead one’s life. Or maybe, as she later claimed in a July 2014 talk in Amsterdam, the PNAS article provided an objective basis for moral philosophy.
Maybe it showed
People who are happy but have little to no sense of meaning in their lives—proverbially, simply here for the party—have the same gene expression patterns as people who are responding to and enduring chronic adversity.
Skeptical? Maybe you are paying too much attention to your conscious mind. What does it know? According to author Steve Cole—
What this study tells us is that doing good and feeling good have very different effects on the human genome, even though they generate similar levels of positive emotion… “Apparently, the human genome is much more sensitive to different ways of achieving happiness than are conscious minds.”
Or maybe this PNAS article was an exceptional example of the kind of nonsense, pure bunk, you can find in a prestigious journal.
Assembling a Team.
I blogged about the PNAS article. People whom I had yet to meet expressed concerns similar to mine. We began collaborating, overcoming considerable differences in personal style but taking advantage of complementary skills and backgrounds.
It all started with a very tentative email exchange with Nick Brown. He brought on his co-author from the American Psychologist article demolishing the credibility of a precise positivity ratio, Harris Friedman. Harris in turn brought on Doug McDonald to examine Fredrickson and Cole’s claims that factor analysis supported their clean distinction between two forms of well-being with opposite effects on health.
Manoj Samanta found us by way of my blog post and then a Google search that took him to Nick and Harris’ article with Alan Sokal. Manoj cited my post in his own blog. When Nick saw Manoj’s post, he contacted him. Manoj was working in genomics, attempting to map the common genomic basis for the evolution of electric organs in fish from around the world, but was a physicist in recovery. He was delighted to work with a couple of guys who had co-authored a paper with his hero from grad school, Alan Sokal. Manoj interpreted Fredrickson and Cole’s seemingly unnecessarily complicated approach to genomic analysis. Nick set off to deconstruct and reproduce Cole’s regression analyses predicting genomic expression. He discovered that Cole’s procedure generated statistically significant (but meaningless) results from over two-thirds of the thousands of ways of splitting the psychometric data. Even using random numbers produced huge numbers of junk results.
The final group was Nick, Doug, Manoj, Harris, and myself. Others came and went from our email exchanges, some accepting our acknowledgment in the paper, while others asked us explicitly not to acknowledge them.
The team gave the article an extraordinarily careful look, noting its fuzzy theorizing and conceptual deficiencies, but we did much more than that. We obtained the original data and asked the authors of the original paper about their complex analytic methods. We then reanalyzed the data, following their specific advice. We tried alternative analyses and even re-did the same analyses with randomly generated data. Overall, our hastily assembled group performed and interpreted thousands of analyses, more than many productive labs do in a year.
The embargo on our paper in PNAS is now off.
I can report our conclusion that
Not only is Fredrickson et al.’s article conceptually deficient, but, more crucially, its statistical analyses are fatally flawed, to the point that their claimed results are in fact essentially meaningless.
Fuzzy thinking creates theoretical and general methodological problems
Fredrickson et al. claimed that two types of striving for well-being, eudaimonic and hedonic, have distinct and opposite effects on physical health, by way of “molecular signaling pathways” or genomic expression, despite an unusually high correlation between two supposedly different variables. I had challenged the authors about the validity of their analyses in my earlier blog post and then in a letter to PNAS, but got blown off. Their reply dismissed my concerns, citing analyses that they have never shown, either in the original article or in the reply.
In our article, we noted a subtlety in the distinction between eudaimonia and hedonia.
Eudaimonic well-being, generally defined (including by Fredrickson et al.) in terms of tendencies to strive for meaning, appears to be trait-like, since such striving for meaning is typically an ongoing life strategy.
Hedonic well-being, in contrast, is typically defined in terms of a person’s (recent) affective experiences, and is state-like; regardless of the level of meaning in one’s life, everyone experiences “good” and “bad” days.
The problem is
If well-being is a state, then a person’s level of well-being will change over time and perhaps at a very fast rate. If we only measure well-being at one time point, as Fredrickson et al. did, then unless we obtain a genetic sample at the same time, the likelihood that the well-being score will actually accurately reflect level of genomic expression will be diminished if not eliminated.
“Your experiences today will influence the molecular composition of your body for the next two to three months,” he tells his audience, “or, perhaps, for the rest of your life. Plan your day accordingly.”
Hmm. Really? Evidence?
Eudaimonic and hedonic well-being constructs may have a long history in philosophy, but empirically separating them is an unsolved problem. And taken together, the two constructs by no means capture the complexity of well-being.
Is a scientifically adequate taxonomy of well-being on which to do research even possible? Maybe, but doubts are raised when one considers the overcrowded field of well-being concepts available in the literature—
General well-being, subjective well-being, psychological well-being, ontological well-being, spiritual well-being, religious well-being, existential well-being, chaironic well-being, emotional well-being, and physical well-being—along with the various constructs that are treated as essentially synonymous with well-being, such as self-esteem, life satisfaction, and, lest we forget, happiness.
No one seems to be paying attention to this confusing proliferation of similar constructs and how they are supposed to relate to each other. In the realm of negative emotion, by contrast, the problem is well known and variously referred to as the “big mush” or the “crud factor”. And there is, in fact, a good deal of difficulty separating positive well-being concepts from their obverse, negative well-being concepts.
Fredrickson and colleagues found that eudaimonic and especially hedonic well-being were strongly but negatively related to depression. Their measure of depression qualified as a covariate or confound for their analyses, but somehow disappeared from further consideration. Had it been retained, it would have further reduced the analyses to gobbledygook. Technically speaking, the residual of hedonia-controlling-for-(highly correlated)-eudaimonia-and-depression does not even have a family resemblance to hedonia and is probably nonsense.
Fredrickson et al. measured well-being with what they called the Short Flourishing Scale, better known in the literature as the Mental Health Continuum-Short Form (MHC-SF).
We looked and we were not able to identify any published evidence of a two factor solution in which distinct eudaimonic and hedonic well-being factors adequately characterized MHC-SF data.
The closest thing we could find was
Keyes et al. (10) referred to these groupings of hedonic and eudaimonic items as “clusters,” an ostensibly neutral term that seems to deliberately avoid the word “factor.”
However, Keyes’ split of the MHC-SF items into hedonic and eudaimonic categories appears to have been made mainly to allow arbitrary classification of persons as “languishing” versus “flourishing.” Yup, positive psychology is now replacing the stigma of conventional psychology’s deficiency model of depressed versus not depressed with a strength model of languishing versus flourishing.
In contrast to the rest of the MHC-SF literature, Fredrickson et al. referred to a factor analysis – implicitly in their original PNAS paper, and then explicitly in their reply to my PNAS letter – as yielding two distinct factors (“Hedonic” and “Eudaimonic”), corresponding to Keyes’ languishing versus flourishing diagnoses (i.e., items SF1–SF3 for Hedonic and SF4–SF14 for Eudaimonic).
The data from Fredrickson et al. were mostly in the public domain. After getting further psychometric data from Fredrickson’s lab, we set off on a thorough reanalysis that should have revealed whatever basis for their claims there might be.
In exploratory factor analyses, which we ran using different extraction (e.g., principal axis, maximum likelihood) and rotation (orthogonal, oblique) methods, we found two factors with eigenvalues greater than 1, with all items producing a loading of at least .50 on one factor or the other.
That’s lots of analyses, but results were consistent:
Examination of factor loading coefficients consistently showed that the first factor was comprised of elevated loadings from 11 items (SF1, SF2, SF3, SF4, SF5, SF9, SF10, SF11, SF12, SF13, and SF14), while the second factor housed high loadings from 3 items (SF6, SF7, and SF8).
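For readers who want to see what the eigenvalue-greater-than-1 (Kaiser) criterion looks like in practice, here is a minimal sketch in Python on synthetic data with a structure like the one we found – 11 items on one factor, 3 on another. Everything here (loadings, sample size, seed) is made up for illustration; this is not the actual analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # larger than the study's N = 80, to keep sampling noise down

# Hypothetical two-factor structure mimicking the split we found:
# 11 items load on factor 1, 3 items (think SF6-SF8) on factor 2.
f1 = rng.standard_normal(n)
f2 = rng.standard_normal(n)
loadings = np.zeros((14, 2))
loadings[:11, 0] = 0.7   # items reordered for simplicity
loadings[11:, 1] = 0.7
noise = rng.standard_normal((n, 14)) * np.sqrt(1 - 0.7 ** 2)
items = np.column_stack([f1, f2]) @ loadings.T + noise

# Kaiser criterion: count eigenvalues of the item correlation matrix above 1
corr = np.corrcoef(items, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
print(eigvals[:3])
```

With a genuinely two-factor structure like this, the first two eigenvalues of the item correlation matrix come out well above 1, while the rest fall below; which items load where is then read off the (rotated) loading matrix.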
If this is the factor structure Fredrickson and colleagues claim, eudaimonic well-being would have to be the last three items. But look at them in the figure on the left and particularly look at the qualification below. The items seem to reflect living in a particular kind of environment that is safe and supportive of people like the respondent. Actually, these results seem to lend support to my complaint that positive psychology is mainly for rich people: to flourish, one must live in a special environment. If you languish, it is your fault.
Okay, we did not find much support for the claims of Fredrickson and colleagues, but we gave them another chance with a confirmatory factor analysis (CFA). With this analysis, we would not be looking for the best solution, only testing whether either a one-factor or a two-factor model is defensible.
For the one-factor model, goodness-of-fit statistics indicated grossly inadequate fit (χ2 = 227.64, df = 77, GFI = .73, CFI = .83, RMSEA = .154). Although the equivalent statistics for the correlated two-factor model were slightly better, they still came out as poor (χ2 = 189.40, df = 76, GFI = .78, CFI = .87, RMSEA = .135).
Thus, even though our findings tended to support the view that well-being is best represented as at least a two dimensional construct, we did not confirm Fredrickson et al.’s claim (6) that the MHC-SF produces two factors conforming to hedonic and eudaimonic well-being.
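For readers who want to check the arithmetic: RMSEA can be recovered from a reported χ² and df if one assumes a sample size. The sketch below assumes N = 80 (my assumption, based on the study’s 80 participants); the values come out close to the reported .154 and .135, with the small discrepancies plausibly due to software differences.

```python
import math

def rmsea(chi2, df, n):
    """Steiger's RMSEA: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi2 - df, 0) / (df * (n - 1)))

# Fit statistics quoted above, assuming N = 80
one_factor = rmsea(227.64, 77, 80)   # comes out ~0.157 (reported: .154)
two_factor = rmsea(189.40, 76, 80)   # comes out ~0.137 (reported: .135)
print(round(one_factor, 3), round(two_factor, 3))
```

Either way, both values sit far above the conventional .08 threshold for acceptable fit.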
Hey Houston, we’ve got a problem.
As Ryff and Singer (15) put it, “Lacking evidence of scale validity and reliability, subsequent work is pointless” (p. 276).
Maybe we should have thrown in the towel. But if Fredrickson and colleagues could
nonetheless proceed to multivariate analyses relating the self-report data to genomic expression, we decided that we would follow in the same path.
Relating self-report data to genomic expression: Random can be better
Fredrickson et al.’s analytic approach to genomic expression seemed unnecessarily complicated. They repeated regression analyses 53 times (a procedure we came to call RR53), regressing each of 53 genes of interest on eudaimonic and hedonic well-being and a full range of confounding/control variables. Recall that they had only 80 participants. This approach left them lots of room to capitalize on chance.
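As best we could reconstruct it, the core of RR53 is: run one regression per gene, collect the 53 coefficients for each well-being predictor, then t-test those 53 coefficients against zero as if they were independent observations. Here is a bare-bones sketch of that logic in Python, on purely synthetic data – the variable names, the shared-component structure, and the omission of control variables are all my simplifications, not the authors’ actual code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_genes = 80, 53

# Hypothetical data: genes share a strong common component, so they are
# highly intercorrelated (as the real CTRA genes turned out to be)
shared = rng.standard_normal(n)
genes = 0.8 * shared[:, None] + 0.6 * rng.standard_normal((n, n_genes))
hedonic = rng.standard_normal(n)      # pure-noise predictors
eudaimonic = rng.standard_normal(n)
X = np.column_stack([np.ones(n), hedonic, eudaimonic])

# Step 1: one regression per gene, keeping the two well-being coefficients
coefs = np.array([np.linalg.lstsq(X, genes[:, g], rcond=None)[0][1:]
                  for g in range(n_genes)])   # shape (53, 2)

# Step 2: t-test the 53 per-gene coefficients against zero, treating them
# as if they were 53 independent observations -- they are not, because
# the genes (and hence the coefficients) are correlated
t_hed = stats.ttest_1samp(coefs[:, 0], 0.0)
t_eud = stats.ttest_1samp(coefs[:, 1], 0.0)
print(t_hed.pvalue, t_eud.pvalue)
```

The trouble is that step 2 treats 53 correlated coefficients as independent data points, which can shrink the apparent standard error dramatically; the random-data experiments described below show where that leads.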
So, why not simply regress
the scores for hedonic and eudaimonic well-being on the average expression of the 53 genes of interest, after changing the sign of the values of those genes that were expected to be down-regulated. [?]
After all the authors had said
[T]he goal of this study is to test associations between eudaimonic and hedonic well-being and average levels of expression of specific sets of genes” (p. 1)
We started with our simpler approach.
We conducted a number of such regressions, using different methods of evaluating the “average level of expression” of the 53 CTRA genes of interest (e.g., taking the mean of their raw values, or the mean of their z-scores), but in all cases the model ANOVA was not statistically significant.
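For concreteness, here is a sketch in Python of the kind of composite we used, on synthetic stand-in data (the sign pattern, names, and data are all hypothetical): flip the sign of genes expected to be down-regulated, z-score each gene, average into one CTRA score per participant, and run a single regression instead of 53.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, n_genes = 80, 53

# Hypothetical stand-ins for the real data
expression = rng.standard_normal((n, n_genes))
down_regulated = rng.random(n_genes) < 0.4   # hypothetical sign pattern
hedonic = rng.standard_normal(n)

# Flip the sign of genes expected to be down-regulated, z-score each
# gene, then average into a single CTRA composite per participant
z = stats.zscore(expression, axis=0)
z[:, down_regulated] *= -1
ctra = z.mean(axis=1)

# One simple regression instead of 53: is the composite related to well-being?
result = stats.linregress(hedonic, ctra)
print(result.pvalue)
```

With this one-model-per-hypothesis approach there is a single p-value to inspect, and no opportunity for 53 correlated regressions to masquerade as independent evidence.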
Undaunted, we next applied the RR53 regression procedure to see whether it could, in contrast to our simpler “naive” approach, yield such highly significant results with the factors we had derived.
You can read the more technical description of our procedures in our article and its supplementary materials, but our results were
The t-tests for the regression coefficients corresponding to the predictor variables of interest, namely hedonic and eudaimonic well-being, were almost all non-significant (p > .05 in 104 out of 106 cases; mean p = .567, SD = 0.251), and in the two remaining cases (gene FOSL1, for both “hedonic,” p = .047, and “eudaimonic,” p = .030), the overall model ANOVA was not statistically significant (p = .146).
We felt that drawing any substantive conclusions from these coefficients is inappropriate.
Nonetheless, we continued….
We…created two new variables, which we named PWB (corresponding to items SF1–SF5 and SF9–SF14) and EPSE (corresponding to items SF6–SF8). When we applied Fredrickson et al.’s regression procedure using these variables as the two principal predictor variables of interest (replacing the Hedonic and Eudaimonic factor variables), we discovered that the “effects” of this factor pair were about twice as high as those for the Hedonic and Eudaimonic pair (PWB: up-regulation by 13.6%, p < .001; EPSE: down-regulation by 18.0%, p < .001; see Figures 3 and 4 in the Supporting Information).
Wow, if we accept statistical significance over all other considerations, we actually did better than Fredrickson et al.
Taken seriously, it suggests that the participants’ genes are not only expressing “molecular well-being” but even more vigorously, some other response that we presume Fredrickson et al. might call “molecular social evaluation.”
Or we might conclude that living in a particular kind of environment is good for your genomic expression.
But we were skeptical about whether we could give substantive interpretations of any kind and so we went wild, using the RR53 procedure with every possible way of splitting up the self-report data. Yup, that is a lot of analyses.
Excluding duplicates due to symmetry, there are 8,191 possible such combinations. Of these, we found that 5,670 (69.2%) gave statistically significant results using the method described on pp. 1–2 of Fredrickson et al.’s Supporting Information (7) (i.e., the t-tests of the fold differences corresponding to the two elements of the pair of pseudo-factors were both significant at the .05 level), with 3,680 of these combinations (44.9% of the total) having both components significant at the .001 level.
Furthermore, 5,566 combinations (68.0%) generated statistically significant pairs of fold difference values that were greater in magnitude than Fredrickson et al.’s (6, figure 2A) Hedonic and Eudaimonic factors.
While one possible explanation of these results is that differential gene expression is associated with almost any factor combination of the psychometric data, with the study participants’ genes giving simultaneous “molecular expression” to several thousand factors which psychologists have not yet identified, we suspected that there might be a more parsimonious explanation.
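The 8,191 figure, and the roughly 20 “significant” splits that chance alone would predict if the two t-tests were independent, are simple arithmetic:

```python
# Number of ways to split 14 items into two non-empty groups, counting
# {A, B} and {B, A} as the same split:
n_splits = (2 ** 14 - 2) // 2
print(n_splits)                    # 8191

# If the two t-tests were independent and the null were true, the expected
# number of splits with both p-values below .05 would be:
expected_by_chance = 0.05 ** 2 * n_splits
print(round(expected_by_chance))   # about 20
```

Around 20 significant splits would be unremarkable; 5,670 is something else entirely.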
But we did not stop there. Bring on the random number generator.
As a further test of the validity of the RR53 procedure, we replaced Fredrickson et al.’s psychometric data (6) with random numbers (i.e., every item/respondent cell was replaced by a random integer in the range 0–5) and re-ran the R program. We did this in two different ways.

First, we replaced the psychometric data with normally-distributed random numbers, such that the item-level means and standard deviations were close to the equivalent values for the original data. With these pseudo-data, 3,620 combinations of pseudo-factors (44.2%) gave a pair of fold difference values having t-tests significantly different from zero at the .05 level; of these, 1,478 (18.0% of the total) were both statistically significant at the .001 level. (We note that, assuming independence of up- and down-regulation of genes, the probability of the latter result occurring by chance with random psychometric data if the RR53 regression procedure does indeed identify differential gene expression as a function of psychometric factors, ought to be—literally—one in a million, i.e. 0.001², rather than somewhere between one in five and one in six.)

Second, we used uniformly-distributed random numbers (i.e., all “responses” were equally likely to appear for any given item and respondent). With these “white noise” data, we found that 2,874 combinations of pseudo-factors (35.1%) gave a pair of fold difference values having t-tests statistically significantly different from zero at the .05 level, of which 893 (10.9% of the total) were both significant at the .001 level.

Finally, we re-ran the program once more, using the same uniformly distributed random numbers, but this time excluding the demographic data and control genes; thus, the only non-random elements supplied to the RR53 procedure were the expression values of the 53 CTRA genes.
Despite the total lack of any information with which to correlate these gene expression values, the procedure generated 2,540 combinations of pseudo-factors (31.0%) with a pair of fold difference values having t-tests statistically significantly different from zero at the .05 level, of which 235 (2.9% of the total) were both significant at the .001 level.
Thus, in all cases, we obtained far more statistically significant results using Fredrickson et al.’s methods (6) than would be predicted by chance alone for truly independent variables (i.e., .05² × 8191 ≈ 20), even when the psychometric data were replaced by meaningless random numbers. To try to identify the source of these puzzling results, we ran simple bivariate correlations on the gene expression variables, which revealed moderate to strong correlations between many of them, suggesting that our significant results were mainly the product of shared variance across criterion variables. We therefore went back to the original psychometric data, and “scrambled” the CTRA gene expression data, reassigning each cell value for a given gene to a participant selected at random, thus minimizing any within-participants correlation between these values. When we re-ran the regressions with these data, the number of statistically significant results dropped to just 44 (0.54%).
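The scrambling step is worth spelling out, because it is what finally localized the problem. Here is a sketch in Python on synthetic data (the shared-component structure is my stand-in for the real correlations): permuting each gene’s values across participants independently preserves every gene’s distribution but destroys the between-gene correlations within participants.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_genes = 80, 53

# Hypothetical expression data with a shared component, giving the
# moderate-to-strong between-gene correlations we observed
shared = rng.standard_normal(n)
genes = 0.7 * shared[:, None] + 0.7 * rng.standard_normal((n, n_genes))

def mean_offdiag_corr(m):
    """Mean of the off-diagonal entries of the correlation matrix."""
    c = np.corrcoef(m, rowvar=False)
    return c[np.triu_indices_from(c, k=1)].mean()

before = mean_offdiag_corr(genes)

# Scramble: permute each gene's values across participants independently.
# Each gene's marginal distribution is preserved, but within-participant
# correlations between genes are destroyed.
scrambled = np.column_stack([rng.permutation(genes[:, g])
                             for g in range(n_genes)])
after = mean_offdiag_corr(scrambled)
print(round(before, 2), round(after, 2))
```

The mean between-gene correlation collapses toward zero after scrambling; with it goes almost all of RR53’s apparent “significance,” which is exactly what we found with the real data.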
To summarize: even when fed entirely random psychometric data, the RR53 regression procedure generates large numbers of results that appear, according to these authors’ interpretation, to establish a statistically significant relationship between self-reported well-being and gene expression. We believe that this regression procedure is, simply put, totally lacking in validity. It appears to be nothing more than a mechanism for producing apparently statistically significant effects from non-significant regression coefficients, driven by a high degree of correlation between many of the criterion variables.
Despite exhaustive efforts, we could not replicate the authors’ simple factor structure differentiating hedonic versus eudaimonic well-being, upon which their genomic analyses so crucially depended. Then we showed that the complicated RR53 procedure turned random nonsense into statistically significant results. Poof, there is no there there (as Gertrude Stein once said about Oakland, California) in their paper, no evidence of “molecular signaling pathways that transduce positive psychological states into somatic physiology,” just nonsense.
How, in the taxonomy of bad science, do we classify this slipup and the earlier one in American Psychologist? Poor methodological habits, run-of-the-mill scientific sloppiness, innocent probabilistic error, injudicious hype, or simply unbridled enthusiasm with an inadequate grasp of methods and statistics?
Play nice and avoid the trap of negative psychology?
Our PNAS article exposed the unreliability of the results and interpretation offered in a paper claimed to be a game-changing breakthrough in our understanding of how positive psychology affects health by way of genomic expression. Science is slow and incomplete in correcting itself. And corrections, even of outright nonsense, seldom garner the attention given the original error. It is just not as newsworthy to find that claims of minor adjustments in everyday behavior modifying gene expression are nonsense as it is to make unsustainable claims in the first place.
Given the rewards offered by media coverage and even by prestigious journals, authors can be expected to be incorrigible in giving in to the urge to orchestrate media attention for ill-understood results generated by dubious methods applied in small samples. But the rest of the scientific community, and journalists, need to keep in mind that most breakthrough discoveries are false, unreplicable, or at least wildly exaggerated.
The authors were offered a chance to respond to my muted and tightly constrained letter to PNAS. Cole and Fredrickson made references to analyses they have never presented and offered misinterpretations of the literature that I cited. I consider their response disingenuous and dismissive of any dialogue. I am willing to apologize for this assessment if they produce the factor analyses of the self-report data to which they pointed. I will even donate $100 to the American Cancer Society if they can produce it. I doubt they will.
Concerns about the unreliability of the scientific and biomedical literature have risen to the threshold of precipitating concern from the director of NIH, Francis Collins. On the other hand, a backlash has called out critics for encouraging a negative psychology and warned us to temper our criticism. Evidence of the excesses of critics supposedly includes “’voodoo correlation’ claims, ‘p-hacking’ investigations, websites like Retraction Watch, Neuroskeptic, [and] a handful of other blogs devoted to exposing bad science”, and we are cautioned that “moral outrage has been conflated with scientific rigor.” We are told we are damaging the credibility of science with criticism and that we should engage authors in clarification rather than criticize them. But I think our experience with this PNAS article demonstrates just how much work it takes to deconstruct outrageous claims based on methods and results that authors poorly understand but nonetheless promote in social media campaigns. Certainly, there are grounds for skepticism based on prior probabilities, and to be skeptical is not cynical. But is it not cynical to construct the pseudoscience of a positivity ratio and then a faux objective basis for moral philosophy?