Last year our lab signed up to take part in a registered replication report (at Perspectives on Psychological Science) seeking to replicate a classic study from Fritz Strack and colleagues (Strack, Martin & Stepper, 1988). The original study found that participants who held a pen in their teeth (unconsciously facilitating a smile) gave higher funniness ratings to cartoons than participants who held a pen between their lips (unconsciously inhibiting a smile). Many variants of this study have been done, but never an exact replication. Since the original paper is often presented as good evidence for embodied views of cognition, and has itself been cited over 1000 times, it seemed like a good candidate for replication. Since we are an embodied cognition lab, and I discuss Strack et al.'s original study in my teaching, I thought it would be a good project for me and some students to be involved in.
Once we had signed up, we agreed to the core protocol developed by Wagenmakers, Beek, Dijkhoff and Gronau, and also reviewed by Fritz Strack. At that point the main body of the journal article was already pretty much written, along with the key predictions, the plans for treatment of data, and the analysis plan. There was also little room for deviation from the main protocol, apart from some labs translating the instructions and materials.
However, we also took the opportunity to include some supplementary tasks following the main experiment, so they could not interfere with the experimental manipulation or the key task, which was rating the funniness of four Far Side cartoons. Because we had a number of undergraduate students helping with the study, each developed a short, additional task or measure that could be easily related to the pen in teeth/pen in lips manipulation. I won't say more about those tasks here, as we still hope to write up those results.
Dan Simons, the handling editor for RRRs at Perspectives on Psychological Science, kept us completely up to date with any developments or minor tweaks to the protocol. For example, it turned out there was a huge shortage of the preferred pen type, and so a little extra flexibility had to be granted there. There was also a large amount of secrecy surrounding the project (see opening quote) - we knew other labs were involved, but from this point until very recently we didn't know who they were, or how many there were, and we were reminded repeatedly not to discuss our data or findings with anyone outside our research groups. Life inside registered replication club was simultaneously frustrating, since we couldn't share our findings for months, and kind of exciting to know we were part of a large "covert" research operation.
Image: The infamous Stabilo 68
Once all the data were collected, we had to input the data and code the responses. Because of the nature of the tasks, the entire study was pen-and-paper based. I'm so used to automatically extracting data from Superlab or Qualtrics that this was a surprisingly painful task. Once the data were input, we had to examine the video recordings to ensure each participant performed the pen manipulation correctly. Participants had to rate 4 cartoons, and anyone who held the pen incorrectly for more than one of them was excluded from the analysis. We double-coded each participant's performance, and then a third rater spot-checked for accuracy. We were now ready to conduct our analysis.
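The exclusion rule described above is simple enough to express in a few lines. The following is only a minimal sketch in Python, not the coding procedure or scripts used in the study: the function name, the data layout (one True/False flag per cartoon) and the example participants are all hypothetical.

```python
def should_exclude(performed_correctly, max_errors=1):
    """Return True if a participant should be excluded.

    performed_correctly: one boolean per cartoon, True if the pen was held
    correctly while that cartoon was rated (hypothetical data layout).
    max_errors: number of incorrectly performed cartoons that is tolerated;
    more than one incorrect cartoon leads to exclusion.
    """
    errors = sum(1 for ok in performed_correctly if not ok)
    return errors > max_errors


# Hypothetical codings for three participants, four cartoons each.
codings = {
    "P01": [True, True, True, True],    # no errors: keep
    "P02": [True, False, True, True],   # one error: keep
    "P03": [False, True, False, True],  # two errors: exclude
}

excluded = [pid for pid, flags in codings.items() if should_exclude(flags)]
print(excluded)
```

In practice the flags would come from the agreed double-coded ratings of the video recordings rather than from a hand-written dictionary.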
So what did we find?
We had received a spreadsheet template for our data, along with analysis scripts that could be run in RStudio. When we ran our analysis, we found that a good chunk of participants (32) were excluded for not performing the pen manipulation correctly, but this did not dramatically alter the outcome of the analysis (final N = 126). Overall we found that people who held the pen in their teeth (M = 4.54, SD = 1.42) gave higher ratings for the cartoons than those who held the pen in their lips (M = 4.18, SD = 1.73), with a marginal p-value of .066. As well as conducting traditional t-tests, the analysis script also output the Bayes Factor, to give an indication of whether the data favoured the null hypothesis (i.e., no difference between conditions) or the alternative hypothesis (i.e., that there is a difference in the predicted direction). The Bayes Factor was 0.994; because a value close to 1 means the data are about equally likely under either hypothesis, our results should be considered inconclusive.
So our own lab's results didn't bring any great clarity to the picture at all. Even with a decent sample, we were left unsure if this was a real effect or not.
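To make the "inconclusive" reading of our Bayes Factor concrete, here is a rough sketch in Python. It is not part of the analysis scripts we were sent. The function name is made up, the cut-offs of 3 and 1/3 are one commonly used convention that is assumed here rather than taken from the RRR protocol, and the sketch assumes the Bayes Factor expresses evidence for the alternative relative to the null.

```python
def interpret_bayes_factor(bf10, threshold=3.0):
    """Label a Bayes Factor using a simple symmetric cut-off.

    bf10: evidence for the alternative relative to the null (assumed
    direction; the post does not say how the script expressed it).
    threshold: assumed conventional cut-off; values between 1/threshold
    and threshold are treated as inconclusive.
    """
    if bf10 <= 0:
        raise ValueError("A Bayes Factor must be positive")
    if bf10 >= threshold:
        return "supports the alternative"
    if bf10 <= 1 / threshold:
        return "supports the null"
    return "inconclusive"


# The value reported for our lab.
print(interpret_bayes_factor(0.994))  # prints "inconclusive"
```

A value of 0.994 sits almost exactly at 1, so the label would not change whichever way round the ratio was expressed.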
We then submitted our data to Dan Simons (around February 2016), where the full omnibus analysis across all participating labs would be completed. And so the waiting began. Was our lab's result going to be typical of the other results? Or would we be an embarrassing outlier relative to our anonymous colleagues' efforts? We had to wait several months for the answer.
What did everyone else find?
And so about 4 weeks ago (mid-July) we finally received the almost-complete manuscript with the full results. The meta-analytic effect size was estimated to be 0.03, almost zero, and considerably smaller than the effect of 0.82 observed by Strack and colleagues. For only 2 of the 17 replication effects did the 95% confidence interval overlap with the effect size from the original study. The calculated Bayes Factors further support this pattern, with 13 of 17 replications providing positive support for the null, and 12 of 17 supporting the null in a second analysis with different priors.
How do I feel about the results?
My first reaction was slight surprise, but I didn't feel my world had been turned upside down. It was more that, well, this is the result, now let's get on with doing some more science. In advance of conducting the study, I had predicted that the pattern from the original would replicate, but with a smaller effect size, as is often the case with replications (see Schooler, 2011, on the Decline Effect). Although I work in embodied cognition, I didn't feel dismayed by the results, but rather felt that we had done a good job of fairly testing what many consider a classic finding of embodied cognition research. I do think there may be moderating factors (e.g., participant age, condition difficulty), and that there may be scope for further research on this effect, but I am content that the true effect is much closer to zero than was previously observed.
The full paper is published today (18th August, 2016), along with a commentary from Fritz Strack. I haven't seen the commentary yet, so it will be interesting to see what he thinks. Along with Wolfgang Stroebe, Fritz Strack (2014) noted that direct replications were difficult because "identical operationalizations of variables in studies conducted at different times and with different subject populations might test different theoretical constructs." From his perspective, it is more important to identify the correct theoretical constructs and underlying mechanisms, rather than making sure everything "looks" identical to the original. I don't know whether this argument will wash as an explanation for the difference between the original and replication findings*.
Having been through this multi-site replication attempt, what have I learned, and, just as important, would I do it again? Although I have previously published a pre-registered replication (Lynott et al., 2014), this one was on an even larger scale, and I tip my hat to Dan Simons for coordinating the effort and to EJ Wagenmakers, Laura Dijkhoff, Titia Beek and Quentin Gronau for doing excellent work in developing the protocol and providing such detailed instructions for the participating labs. I'm in no doubt that, for a lead lab, there is a huge amount of work involved in preparing for and implementing a Registered Replication Report. Even as a mere participating lab, this was quite a bit of work, but I'm very glad that we contributed, and I hope that in the near future we'll be involved in more RRRs and in pre-registration more generally. Lastly, it's great to be finally able to talk about it all!
(And if anyone wants to buy some leftover Stabilo 68s, I can do you a good deal.)
*Correction 18/08/2016: Fritz Strack did not review the protocol. Rather, he nominated a colleague to review it.
Edit 19/08/2016 - Link added to in press paper
Edit 12/01/2017 - DOI added to Wagenmakers et al paper
Wagenmakers, E.-J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams, R. B., Jr., Albohn, D. N., Allard, E. S., Benning, S. D., Blouin-Hudon, E.-M., Bulnes, L. C., Caldwell, T. L., Calin-Jageman, R., Capaldi, C. A., Carfagno, N., Chasten, K. T., Cleeremans, A., Connell, L., DeCicco, J. M., Dijkstra, K., Fischer, A. H., Foroni, F., Hess, U., Holmes, K. J., Jones, J. L. H., Klein, O., Koch, C., Korb, S., Lewinski, P., Liao, J. D., Lund, S., Lupiáñez, J., Lynott, D., Nance, C. N., Oosterwijk, S., Özdoğru, A. A., Pacheco-Unguetti, A. P., Pearson, B., Powis, C., Riding, S., Roberts, T.-A., Rumiati, R. I., Senden, M., Shea-Shumsky, N. B., Sobocko, K., Soto, J. A., Steiner, T. G., Talarico, J. M., van Allen, Z. M., Vandekerckhove, M., Wainwright, B., Wayand, J. F., Zeelenberg, R., Zetzer, E., & Zwaan, R. A. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science. DOI: https://doi.org/10.1177/1745691616674458
Lynott, D., Corker, K. S., Wortman, J., Connell, L., Donnellan, M. B., Lucas, R. E., & O’Brien, K. (2014). Replication of “Experiencing physical warmth promotes interpersonal warmth” by Williams and Bargh (2008). Social Psychology, 45, 216-222. DOI: 10.1027/1864-9335/a000187
Schooler, J. W. (2011). Unpublished results hide the decline effect. Nature, 470, 437.
Strack, F., Martin, L. L., & Stepper, S. (1988). Inhibiting and facilitating conditions of the human smile: A nonobtrusive test of the facial feedback hypothesis. Journal of Personality and Social Psychology, 54(5), 768-777.
Stroebe, W., & Strack, F. (2014). The alleged crisis and the illusion of exact replication. Perspectives on Psychological Science, 9, 59-71.