Thursday, August 18, 2016

The first rule of registered replication club...A Reflection on Replicating Strack, Martin & Stepper (1988)

"The first rule of registered replication club is: you do not talk about registered replication club."

Last year our lab signed up to be part of a registered replication report (at Perspectives on Psychological Science) seeking to replicate a classic study from Fritz Strack and colleagues (Strack, Martin, & Stepper, 1988).  The original study found that participants who held a pen in their teeth (unconsciously facilitating a smile) gave higher funniness ratings to cartoons than participants who held a pen between their lips (unconsciously inhibiting a smile).  Many variants of this study have been done, but never an exact replication.  Since the original paper is often presented as good evidence for embodied views of cognition, and is cited over 1000 times, it seemed like a good candidate for replication. Since we are an embodied cognition lab, and I discuss Strack et al.'s original study in my teaching, I thought it would be a good project for me and some students to be involved in.

Once we had signed up, we agreed to the core protocol developed by Wagenmakers, Beek, Dijkhoff, and Gronau, which was also reviewed by an expert colleague nominated by Fritz Strack.  At that stage, the main body of the journal article was pretty much written, along with the key predictions, plans for treatment of data, and the analysis plan. There was also little room for deviation from the main protocol, apart from some labs translating the instructions and materials.

However, we also took the opportunity to include some supplementary tasks after the main experiment, so that they could not interfere with the experimental manipulation or the key task, which was rating the funniness of four Far Side cartoons.  Because we had a number of undergraduate students helping with the study, each student developed a short, additional task or measure that could be easily related to the pen-in-teeth/pen-in-lips manipulation. I won't say more about those tasks here, as we still hope to write up those results.

Dan Simons, the handling editor for RRRs at Perspectives on Psychological Science, kept us completely up to date with any developments or minor tweaks to the protocol. For example, it turned out there was a huge shortage of the preferred pen type, and so a little extra flexibility had to be granted there.  There was also a large amount of secrecy surrounding the project (see opening quote) - we knew other labs were involved, but from that point until very recently we didn't know who they were, or how many there were, and we were reminded repeatedly not to discuss our data or findings with anyone outside our research groups.  Life inside registered replication club was simultaneously frustrating, since we couldn't share our findings for months, and kind of exciting, knowing we were part of a large "covert" research operation.

The infamous Stabilo 68
For us, running the study was only really possible with some additional funding provided by the APS, which covered our equipment costs (web cams to record participant performance, disinfectant wipes, hair clips, tissues for drool wiping, and lots and lots of Stabilo 68 black ink pens) and about half of our participant expenses.  The remaining participants were recruited from our undergraduate participant pool, where students receive course credit. We originally planned to recruit over 200 participants, but due to unavoidable delays in starting testing and then the unexpected closure of Lancaster University in December last year thanks to Storm Desmond, we recruited 158 participants.  Recruiting even this smaller sample was still a mammoth task, as each participant had to be tested individually. It often took several attempts for participants to understand how to hold the pen correctly in their teeth/lips, with testing sessions lasting about 30-45 mins from participant arrival to departure.

Once all the data were collected, we had to input the data and code responses. Because of the nature of the tasks, the entire study was pen-and-paper based.  I'm so used to automatically extracting data from Superlab or Qualtrics that this was a surprisingly painful task. Once the data were input, we had to examine the video recordings to ensure each participant performed the task correctly. Each participant rated four cartoons, and anyone who performed the pen manipulation incorrectly for more than one cartoon was excluded from the analysis. We double-coded each participant's performance, and a third rater then spot-checked for accuracy. We were now ready to conduct our analysis.

So what did we find? 

We had received a spreadsheet template for our data, along with analysis scripts that could be run in R Studio. When we ran our analysis, we found that a good chunk of participants (32) were excluded for not performing the pen manipulation correctly, but this did not dramatically alter the outcome of the analysis (final N = 126). Overall, we found that people who held the pen in their teeth (M = 4.54, SD = 1.42) gave higher ratings for the cartoons than those who held the pen in their lips (M = 4.18, SD = 1.73), with a marginal p-value of .066.  As well as traditional t-tests, the analysis script also output a Bayes Factor, to indicate whether the results were more likely under the null hypothesis (i.e., no difference between conditions) or the alternative hypothesis (i.e., a difference in the predicted direction). The Bayes Factor was 0.994; since a value close to 1 favours neither hypothesis, the data should be considered inconclusive.
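Our analysis used the R scripts supplied with the protocol, so the sketch below is only a rough illustration of the kind of test involved: a Welch's t-test computed in Python from the summary statistics above. The equal group sizes (63/63) are my assumption for illustration, and the stdlib normal approximation stands in for a proper t distribution, so the numbers won't exactly reproduce the reported p = .066.

```python
from math import sqrt
from statistics import NormalDist

def welch_t(m1, sd1, n1, m2, sd2, n2):
    """Welch's t statistic and degrees of freedom from summary statistics."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2          # per-group variance of the mean
    t = (m1 - m2) / sqrt(v1 + v2)              # t statistic
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))  # Welch-Satterthwaite df
    return t, df

# Reported means/SDs; equal group sizes are an illustrative assumption
t, df = welch_t(4.54, 1.42, 63, 4.18, 1.73, 63)

# One-tailed p via a normal approximation (the stdlib has no t distribution)
p_approx = 1 - NormalDist().cdf(t)
print(f"t({df:.0f}) = {t:.2f}, one-tailed p approx. {p_approx:.2f}")
```

A Bayes Factor would be computed differently (e.g., a default Bayesian t-test, as in the RRR scripts), but the inputs are the same group summaries.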

So our own lab's results didn't bring any great clarity to the picture at all. Even with a decent sample, we were left unsure if this was a real effect or not.

We then submitted our data to Dan Simons (around February 2016), who would complete the full omnibus analysis across all participating labs. And so the waiting began.  Was our lab's result going to be typical of the other results? Or would we be an embarrassing outlier relative to our anonymous colleagues' efforts? We had to wait several months for the answer.

What did everyone else find?

And so, about 4 weeks ago (mid-July), we finally received the almost-complete manuscript with the full results.  The meta-analytic effect size was estimated at 0.03, almost zero, and considerably smaller than the effect of 0.82 observed by Strack and colleagues.  For only 2 of the 17 replication effects did the 95% confidence intervals overlap with the effect size from the original study. The calculated Bayes Factors further supported this pattern, with 13 of 17 replications providing positive support for the null, and 12 of 17 supporting the null in a second analysis with different priors.

How do I feel about the results?

My first reaction was slight surprise, but I didn't feel my world had been turned upside down. It was more that, well, this is the result, now let's get on with doing some more science. In advance of conducting the study, I had expected/predicted that the pattern from the original would replicate, but with a smaller effect size, as is often the case with replications (see Schooler, 2011, on the Decline Effect).  Although I work in embodied cognition, I didn't feel dismayed by the results, but rather felt that we had done a good job of fairly testing what many consider a classic finding of embodied cognition research.  I do think there may be moderating factors (e.g., participant age, condition difficulty), and that there may be scope for further research on this effect, but I am content that the true effect is much closer to zero than was previously observed.

The full paper is published today (18th August, 2016), along with a commentary from Fritz Strack. I haven't seen the commentary yet, so it will be interesting to see what he thinks. Writing with Wolfgang Stroebe, Strack (Stroebe & Strack, 2014) noted that direct replications are difficult because "identical operationalizations of variables in studies conducted at different times and with different subject populations might test different theoretical constructs." From his perspective, it is more important to identify the correct theoretical constructs and underlying mechanisms than to make sure everything "looks" identical to the original. I don't know whether this argument will wash as an explanation for the difference between the original and replication findings*.

Having been through this multi-site replication attempt, what have I learned, and, as important, would I do it again? Although I have previously published a pre-registered replication (Lynott et al., 2014), this one was on an even larger scale, and I tip my hat to Dan Simons for coordinating the effort, and to E.-J. Wagenmakers, Laura Dijkhoff, Titia Beek, and Quentin Gronau for doing excellent work in developing the protocol and providing such detailed instructions for the participating labs.  I'm in no doubt that, as a lead lab, there is a huge amount of work involved in preparing for and implementing a Registered Replication Report. Even as a mere participating lab, this was quite a bit of work, but I'm very glad that we contributed, and I hope that in the near future we'll be involved in more RRRs and in pre-registration more generally.  Lastly, it's great to be finally able to talk about it all!

(And if anyone wants to buy some leftover Stabilo 68s, I can do you a good deal.)

*Correction 18/08/2016: Fritz Strack did not review the protocol. Rather, he nominated a colleague to review it.
Edit 19/08/2016 - Link added to in-press paper
Edit 12/01/2017 - DOI added to Wagenmakers et al paper

Wagenmakers, E.-J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams, R. B., Jr., Albohn, D. N., Allard, E. S., Benning, S. D., Blouin-Hudon, E.-M., Bulnes, L. C., Caldwell, T. L., Calin-Jageman, R., Capaldi, C. A., Carfagno, N., Chasten, K. T., Cleeremans, A., Connell, L., DeCicco, J. M., Dijkstra, K., Fischer, A. H., Foroni, F., Hess, U., Holmes, K. J., Jones, J. L. H., Klein, O., Koch, C., Korb, S., Lewinski, P., Liao, J. D., Lund, S., Lupiáñez, J., Lynott, D., Nance, C. N., Oosterwijk, S., Özdoğru, A. A., Pacheco-Unguetti, A. P., Pearson, B., Powis, C., Riding, S., Roberts, T.-A., Rumiati, R. I., Senden, M., Shea-Shumsky, N. B., Sobocko, K., Soto, J. A., Steiner, T. G., Talarico, J. M., van Allen, Z. M., Vandekerckhove, M., Wainwright, B., Wayand, J. F., Zeelenberg, R., Zetzer, E., Zwaan, R. A. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science. DOI:

Other references
Lynott, D., Corker, K. S., Wortman, J., Connell, L., Donnellan, M. B., Lucas, R. E., & O’Brien, K. (2014). Replication of “Experiencing physical warmth promotes interpersonal warmth” by Williams and Bargh (2008). Social Psychology, 45, 216-222. DOI: 10.1027/1864-9335/a000187

Schooler, J. W. (2011). Unpublished results hide the decline effect. Nature, 470, 437.

Strack, F., Martin, L. L., & Stepper, S. (1988). Inhibiting and facilitating conditions of the human smile: A nonobtrusive test of the facial feedback hypothesis. Journal of Personality and Social Psychology, 54(5), 768.

Stroebe, W., & Strack, F. (2014). The alleged crisis and the illusion of exact replication. Perspectives on Psychological Science, 9, 59-71.


  1. Minor correction: Fritz Strack declined to review the protocol, but he nominated an expert colleague (Ursula Hess) to do so, and she was tremendously helpful in making sure it was done right.

  2. Yes, thanks Dan. I've made that correction.

  3. Thanks to all involved in this effort!

    Given that most RRRs thus far have "failed" to find any evidence for the original findings, I do wonder about a possible consequence of these RRRs.

    More specifically, I wonder how many new attempts at trying to find something related to the original finding will be done given these new findings. I can imagine that the RRR-findings could cause quite a few researchers to try and come up with new ways to test the general theory or to try and find the effect.

    Given the low publication standards (i.e., low-powered, p-hacked studies were, and probably still are, the norm) and, most importantly perhaps, publication bias, I wonder if a whole new set of findings will be released in the coming years, supposedly confirming the effect and/or the general theory behind it.

    I really like RRRs, but as long as researchers/journals let publication bias continue to exist, I think they might even enhance the chances of even more related non-replicable findings and bias entering the literature.

  4. I think RRRs are still finding their feet, and the reason the majority have not replicated original findings is most likely due to publication bias (combined with other factors). Publication standards have been low, but already this is changing with more pre-registration and more adequately powered studies being conducted. So, if that pattern continues, one would expect that as we progress, future RRRs will be more likely to support original findings, since original estimates will then better reflect the true underlying effect size.

    1. Thanks for the reply! To me it's not about supporting or not supporting the original findings. It's about what is done with that information, and what the effect of it is.

      Even if publication standards were to increase (which I highly doubt), there is still the matter of publication bias which, to me, makes everything else pointless.

      It seems impossible to me to decide which findings in the literature are "true", or "replicable", or whatever term you want to use for it. I think that might be largely due to publication bias. Strack states in his reply to the failed RRR that "During the past five years, at least 20 studies have been published demonstrating the predicted effect of the pen procedure on evaluative judgments". Because of publication bias and low standards, I reason that tells us nothing at all. There could have been hundreds of studies performed in the last 5 years which did not show the effect; nobody knows...

      I will pre-register the hypothesis that a lot of new findings will enter the literature in the coming years which will "show" the effect again. Because researchers/journals keep letting low publication standards, and publication bias continue to exist, it will get us no closer to the truth, and perhaps will bring us even further from it.

      Even though I may hold this pessimistic view, I very much appreciate all the effort people put into the RRR's. I think RRR's are very useful in pointing out all that is wrong. As long as there is publication bias, and low publication standards, I do not think they will "get us closer to the truth" concerning the effect/theory, or contribute to theoretical progress.

      Thank you again for all your efforts. I sincerely hope you will take part in other RRR's as well.

  5. It is interesting that Strack points to all these published replications, but doesn't mention that there are also published non-replications of the original effect (e.g., Andreasson & Dimberg, 2008), not to mention the probable file-drawer issue. I completely agree with you that we shouldn't underestimate the problem of publication bias, but I'm choosing to take a more optimistic view that changes in publication practices (such as registered reports) will have a positive impact. The key thing is that people share data for studies conducted, even if they are not published. This will allow researchers to conduct good meta-analyses, which can be (relatively) easily updated as new data become available. Any individual study should never be seen as the final word on the matter, but should be considered alongside the entirety of available evidence.