Friday, June 2, 2017

Boiling rage and warm hearts: Or, how does temperature affect prosocial behaviour?


Temperature is an ever-present feature of the environment, and while we know that climate change is leading to all manner of physical changes to the planet (melting ice-caps, desertification, greater weather volatility), we are still unsure how the experience of temperature might affect people's behaviour. On the one hand, some studies have shown that experiencing higher temperatures is linked with more positive, prosocial behaviours (e.g., greater gift giving, altruism); on the other hand, some have shown that higher temperatures are associated with more negative, antisocial behaviours, like violence and aggression.

As an example of more positive outcomes, a 2008 study by Lawrence Williams and John Bargh found that brief exposure to warm objects, like a cup of hot coffee or a warm therapeutic gel pack, led people to view others in a more positive light and made them more likely to behave prosocially, such as choosing to give a gift to a friend.  In contrast to this pattern, a study by Richard Larrick and colleagues looking at player behaviour in over 11,000 baseball games found that as temperatures increased, so too did the likelihood of pitchers deliberately throwing the ball at the batter or making other retaliatory actions.  In other words, higher temperatures led to more aggressive play. So, it has been unclear whether experiences of higher temperatures, be they brief exposures to objects or longer exposure to particular ambient temperatures, are consistently associated with more or less prosocial behaviour.

In the current study1 we investigated this question by looking at whether higher temperatures are associated with more or less prosocial responding, while also looking at whether brief interactions with hot/cold objects affected people's choices.  At different ambient temperatures, participants took part in a "product evaluation" study of hot or cold therapeutic gel packs.  At the end of the study, each participant could choose between taking a reward for themselves (the self-interested option) or giving the reward to someone else (the prosocial option). While the pack temperatures did not influence the choices people made, we found a weak relationship between the ambient temperature at the time of the study and whether the participant responded prosocially or not: as temperatures increased, participants were more likely to choose the prosocial option (see the graph).
Graph showing the relationship between temperature and prosocial choice. 
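To make the kind of relationship shown in the graph more concrete, here is a minimal sketch of one way such data could be modelled: a logistic regression of the binary prosocial/self-interested choice on ambient temperature. This is illustrative only - it is not the analysis reported in the paper - and the data file and column names are hypothetical.

```python
# Illustrative sketch only; not the analysis from the paper. Assumes a hypothetical
# CSV with one row per participant: `ambient_temp` (degrees C at time of testing)
# and `chose_prosocial` (1 = gave the reward away, 0 = kept it).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("prosocial_temperature.csv")  # hypothetical file

model = smf.logit("chose_prosocial ~ ambient_temp", data=df).fit()
print(model.summary())
# A positive coefficient on ambient_temp corresponds to the pattern in the graph:
# prosocial choices becoming more likely as ambient temperature increases.
```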

However, subsequent analysis suggested that this pattern existed for only one of the two groups of participants we tested (it was present in the UK sample, but not in the US sample), and so we feel this finding should be taken with a pinch of salt, rather than as clear evidence of a link between environmental temperature and people's prosocial behaviour2.  Still, if temperature change does have the capacity to influence human behaviour, it is certainly an issue that merits further research!

Footnotes



1 These data were originally collected as part of a pre-registered replication of the Williams & Bargh (2008) study (Lynott et al., 2014). However, the 2014 paper examined only pack temperature, not ambient temperature.


2 The Bayes Factors for the combined data (both study sites together) indicated support for the effect of ambient temperature, whereas Bayes Factors for the two separate study sites were either inconclusive, or supported the null model (i.e., were not consistent with temperature affecting prosocial choice). 
 

References
Larrick, R. P., Timmerman, T. A., Carton, A. M., & Abrevaya, J. (2011). Temper, temperature, and temptation: Heat-related retaliation in baseball. Psychological Science, 22(4), 423-428.
Lynott, D., Corker, K. S., Connell, L., & O'Brien, K. S. (2017). The effect of haptic and ambient temperature experience on prosocial behavior. Archives of Scientific Psychology, 5(1), 10.
Lynott, D., Corker, K. S., Wortman, J., Connell, L., Donnellan, M. B., Lucas, R. E., & O’Brien, K. (2014). Replication of “Experiencing physical warmth promotes interpersonal warmth” by Williams and Bargh (2008). Social Psychology, 45, 216-222.
Williams, L. E., & Bargh, J. A. (2008). Experiencing physical warmth promotes interpersonal warmth. Science, 322(5901), 606-607.

Thursday, August 18, 2016

The first rule of registered replication club... A Reflection on Replicating Strack, Stepper & Martin (1988)

"The first rule of registered replication club is: you do not talk about registered replication club."

Last year our lab signed up to be part of a registered replication report (at Perspectives on Psychological Science) seeking to replicate a classic study from Fritz Strack and colleagues (Strack, Stepper & Martin, 1988).  The original study found that participants who held a pen in their teeth (unconsciously facilitating a smile) gave higher funniness ratings to cartoons than participants who held a pen between their lips (unconsciously inhibiting a smile).  Many variants of this study have been done, but never an exact replication.  Since the original paper is often presented as good evidence for embodied views of cognition, and is itself cited over 1,000 times, it seemed like a good candidate for replication. Since we are an embodied cognition lab, and I discuss Strack et al.'s original study in my teaching, I thought it would be a good project for me and some students to be involved in.

Once we had signed up, we agreed to the core protocol developed by Wagenmakers, Beek, Dijkhoff and Gronau, and also reviewed by Fritz Strack*.  By that point, the main body of the journal article was pretty much already written, along with the key predictions, plans for treatment of data, and the analysis plan. There was also little room for deviation from the main protocol, apart from some labs translating the instructions and materials.

However, we also took the opportunity to include some supplementary tasks following the main experiment, so that they could not interfere with the experimental manipulation or the key task, which was rating the funniness of four Far Side cartoons.  Because we had a number of undergraduate students helping with the study, each of them developed a short additional task or measure that could be easily related to the pen-in-teeth/pen-in-lips manipulation. I won't say more about those tasks here, as we still hope to write up those results.

Dan Simons, the handling editor for RRRs at Perspectives on Psychological Science, kept us completely up to date with any developments or minor tweaks to the protocol. For example, it turned out there was a huge shortage of the preferred pen type, and so a little extra flexibility had to be granted there.  There was also a large amount of secrecy surrounding the project (see opening quote) - we knew other labs were involved, but from that point until very recently we didn't know who they were or how many there were, and we were reminded repeatedly not to discuss our data or findings with anyone outside our research groups.  Life inside registered replication club was simultaneously frustrating, since we couldn't share our findings for months, and kind of exciting, since we knew we were part of a large "covert" research operation.

The infamous Stabilo 68
For us, running the study was only really possible with some additional funding provided by the APS, which covered our equipment costs (web cams to record participant performance, disinfectant wipes, hair clips, tissues - for drool wiping - and lots and lots of Stabilo 68 black ink pens) and about half of our participant expenses.  For the remaining participants we recruited from our undergraduate participant pool, where students receive course credit. We originally planned to recruit over 200 participants, but due to unavoidable delays in starting testing and then the unexpected closure of Lancaster University in December last year thanks to Storm Desmond, we recruited 158 participants.  Recruiting even this smaller sample was still a mammoth task, as each participant had to be tested individually. It often took several attempts for participants to understand how to hold the pen correctly in their teeth/lips, with testing sessions lasting about 30-45 mins from participant arrival to departure.

Once all the data were collected, we had to input the data and code the responses. Because of the nature of the tasks, the entire study was pen-and-paper based.  I'm so used to automatically extracting data from Superlab or Qualtrics that this was a surprisingly painful task. Once the data were input, we then had to examine the video recordings to ensure each participant performed the task correctly. Each participant rated four cartoons, and anyone who performed the pen manipulation incorrectly for more than one cartoon was excluded from the analysis. We double-coded each participant's performance, and a third rater then spot-checked for accuracy. At that point we were ready to conduct our analysis.

So what did we find? 

We had received a spreadsheet template for our data, along with analysis scripts that could be run in R Studio. When we ran our analysis, we found that a good chunk of participants (32) were excluded for not performing the pen manipulation correctly, but this did not dramatically alter the outcome of the analysis (final N = 126). Overall, we found that people who held the pen in their teeth (M = 4.54, SD = 1.42) gave higher ratings for the cartoons than those who held the pen in their lips (M = 4.18, SD = 1.73), with a marginal p-value of .066.  As well as conducting traditional t-tests, the analysis script also output the Bayes Factor, to give an indication of whether the results were more likely to favour the null (i.e., no difference between conditions) or the alternative hypothesis (i.e., that there is a difference in the predicted direction). The Bayes Factor was 0.994, meaning the data should be considered inconclusive.
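For readers who want to see how the headline comparison works mechanically, here is a minimal sketch that runs a two-sample t-test from the summary statistics above. It is not the RRR analysis script (which was provided in R), and the per-group Ns are an assumption - the post only reports a final N of 126, so equal groups of 63 are used for illustration.

```python
# Sketch only: Welch's t-test from the reported summary statistics.
# Group sizes are assumed (63 per condition); the reported p = .066 came from the
# pre-specified analysis script and the actual group sizes, so values will differ.
from scipy.stats import ttest_ind_from_stats

t, p = ttest_ind_from_stats(mean1=4.54, std1=1.42, nobs1=63,   # pen in teeth
                            mean2=4.18, std2=1.73, nobs2=63,   # pen in lips
                            equal_var=False)
print(f"t = {t:.2f}, two-tailed p = {p:.3f}")
```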

So our own lab's results didn't bring any great clarity to the picture at all. Even with a decent sample, we were left unsure if this was a real effect or not.

We then submitted our data to Dan Simons (around February 2016), so that the full omnibus analysis across all participating labs could be completed. And so the waiting began.  Was our lab's result going to be typical of the other results? Or would we be an embarrassing outlier relative to our anonymous colleagues' efforts? We had to wait several months for the answer.

What did everyone else find?

And so about 4 weeks ago (mid-July) we finally received the almost-complete manuscript with the full results.  The meta-analytic effect size was estimated to be 0.03, almost zero, and considerably smaller than the effect of 0.82 observed by Strack and colleagues.  For only 2 of the 17 replication effects did the 95% confidence intervals overlap with the effect size from the original study. The calculated Bayes Factors further support this pattern, with 13 of 17 replications providing positive support for the null, and 12 of 17 supporting the null in a second analysis with different priors.
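For intuition about how a meta-analytic estimate like this is formed, here is a minimal sketch of inverse-variance pooling. It is a simplified fixed-effect version, not the RRR's pre-specified meta-analytic model, and the lab-level effect sizes and variances are not reproduced here.

```python
# Simplified illustration of inverse-variance (fixed-effect) pooling.
# The actual RRR analysis used its own pre-specified meta-analytic model.
import numpy as np

def pooled_effect(effects, variances):
    """Inverse-variance weighted mean effect size with a 95% confidence interval."""
    d = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    est = np.sum(w * d) / np.sum(w)
    se = np.sqrt(1.0 / np.sum(w))
    return est, (est - 1.96 * se, est + 1.96 * se)

# Usage: pass each lab's standardised effect size and its sampling variance, e.g.
# pooled_effect([0.10, -0.05, ...], [0.03, 0.04, ...])
```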

How do I feel about the results?

My first reaction was slight surprise, but I didn't feel my world had been turned upside down. It was more that, well, this is the result, now let's get on with doing some more science. In advance of conducting the study, I had expected/predicted that the pattern from the original would replicate, but with a smaller effect size, as is often the case with replications (see Schooler, 2011, on the Decline Effect).  Although I work in embodied cognition, I didn't feel dismayed by the results, but rather felt that we had done a good job of fairly testing what many consider a classic finding of embodied cognition research.  I do think there may be moderating factors (e.g., participant age, condition difficulty), and that there may be scope for further research on this effect, but I am content that the true effect is much closer to zero than was previously observed.

The full paper is published today (18th August, 2016), along with a commentary from Fritz Strack. I haven't seen the commentary yet, so it will be interesting to see what he thinks. Writing with Wolfgang Stroebe, Strack previously noted that direct replications are difficult because "identical operationalizations of variables in studies conducted at different times and with different subject populations might test different theoretical constructs" (Stroebe & Strack, 2014). From his perspective, it is more important to identify the correct theoretical constructs and underlying mechanisms than to make sure everything "looks" identical to the original. I don't know whether this argument will wash as an explanation for the difference between the original and replication findings.

Having been through this multi-site replication attempt, what have I learned, and, as important, would I do it again? Although I have previously published a pre-registered replication (Lynott et al., 2014), this one was on an even larger scale, and I tip my hat to Dan Simons for coordinating the effort and to EJ Wagenmakers, Laura Dijkhoff, Titia Beek and Quentin Gronau for doing excellent work in developing the protocol and providing such detailed instructions for the participating labs.  I'm in no doubt that, as a lead lab, there is a huge amount of work involved in preparing for and implementing a Registered Replication Report. Even as a mere participating lab, this was quite a bit of work, but I'm very glad that we contributed, and I hope that in the near future we'll be involved in more RRRs and in pre-registration more generally.  Lastly, it's great to be finally able to talk about it all!

(And if anyone wants to buy some leftover Stabilo 68s, I can do you a good deal.)

*Correction 18/08/2016: Fritz Strack did not review the protocol. Rather, he nominated a colleague to review it.
Edit 19/08/2016 - Link added to in press paper
Edit 12/01/2017 - DOI added to Wagenmakers et al paper

Reference
Wagenmakers, E.-J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams, R. B., Jr., Albohn, D. N., Allard, E. S., Benning, S. D., Blouin-Hudon, E.-M., Bulnes, L. C., Caldwell, T. L., Calin-Jageman, R., Capaldi, C. A., Carfagno, N., Chasten, K. T., Cleeremans, A., Connell, L., DeCicco, J. M., Dijkstra, K., Fischer, A. H., Foroni, F., Hess, U., Holmes, K. J., Jones, J. L. H., Klein, O., Koch, C., Korb, S., Lewinski, P., Liao, J. D., Lund, S., Lupiáñez, J., Lynott, D., Nance, C. N., Oosterwijk, S., Özdoğru, A. A., Pacheco-Unguetti, A. P., Pearson, B., Powis, C., Riding, S., Roberts, T.-A., Rumiati, R. I., Senden, M., Shea-Shumsky, N. B., Sobocko, K., Soto, J. A., Steiner, T. G., Talarico, J. M., van Allen, Z. M., Vandekerckhove, M., Wainwright, B., Wayand, J. F., Zeelenberg, R., Zetzer, E., Zwaan, R. A. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science. DOI: https://doi.org/10.1177/1745691616674458


Other references
Lynott, D., Corker, K. S., Wortman, J., Connell, L., Donnellan, M. B., Lucas, R. E., & O’Brien, K. (2014). Replication of “Experiencing physical warmth promotes interpersonal warmth” by Williams and Bargh (2008). Social Psychology, 45, 216-222. DOI: 10.1027/1864-9335/a000187

Schooler, J. W. (2011). Unpublished results hide the decline effect. Nature, 470, 437.

Strack, F., Martin, L. L., & Stepper, S. (1988). Inhibiting and facilitating conditions of the human smile: A nonobtrusive test of the facial feedback hypothesis. Journal of Personality and Social Psychology, 54(5), 768.

Stroebe, W., & Strack, F. (2014). The alleged crisis and the illusion of exact replication. Perspectives on Psychological Science, 9, 59-71.

Sunday, June 5, 2016

Why staying in the EU might be good for UK research

There is obviously a lot of debate at the moment about the pros and cons of staying in the EU. I'm not going to get into the broader debate, but I thought I'd highlight some data on European Research Council funding successes that suggest why Britain is better off in the EU in this instance.

Here, I take a look just at grants awarded to individuals in the form of European Research Council (ERC) Starter, Consolidator and Advanced grants.  These grants are worth from €1.5 million to €2.5 million each and are generally viewed as being very prestigious. In 2015, for example, ERC grants worth ~€398 million were awarded to researchers at UK institutions.

There are two reasons why the UK is better off in the EU in relation to funding under these grant schemes.

First of all, the UK does very well in all three grant schemes, and in fact has the largest number of grant successes of any European country.

In the 2015 allocations, the UK received the highest number of starter grants (61, compared to second-placed Germany's 53), the highest number of consolidator grants (67, compared to second-placed Germany's 45), and the highest number of advanced grants, with a whopping 69 awards, compared to second-placed Germany on 43 awards. So, in absolute terms, the UK attracts an awful lot of this funding, which means excellent research and researchers are being funded to do their work in the UK.

Secondly, the majority of grants won by the UK are not actually won by UK citizens, but by non-UK citizens who have come here to work, or who are using the grants to come to the UK to do research.

For starter grants, only 28% (17/61) were awarded to UK nationals - the figure below shows the distribution of grantees by country of host institution. For consolidator grants, 36% (24/67) went to UK nationals, while a majority, 65% (45/69), of advanced grants went to UK nationals.  Overall, more than half of all awards (56%) that came to the UK were awarded to non-UK nationals. In monetary terms, approximately €212 million of the €398 million awarded in 2015 was brought to the UK by non-UK citizens.  So, yes, there is excellent research being done in the UK, but a large chunk of it is being done by people who are not originally from the UK.  What's more, the largest proportion of non-UK "grantees" are from elsewhere in the European Union.

ERC Starter grant awards by country (2015)
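As a quick sanity check on the percentages above, the shares can be recomputed directly from the award counts quoted in the post (a small sketch; the counts come from the 2015 ERC statistics linked under Sources).

```python
# Recomputing the nationality shares from the award counts quoted above.
schemes = {
    "Starter":      {"total": 61, "uk_nationals": 17},
    "Consolidator": {"total": 67, "uk_nationals": 24},
    "Advanced":     {"total": 69, "uk_nationals": 45},
}

for name, s in schemes.items():
    print(f"{name:12s}: {s['uk_nationals'] / s['total']:.0%} awarded to UK nationals")

total = sum(s["total"] for s in schemes.values())
non_uk = sum(s["total"] - s["uk_nationals"] for s in schemes.values())
print(f"Overall: {non_uk / total:.0%} of UK-hosted awards went to non-UK nationals")  # ~56%
```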

There's no doubt that the UK does very well out of these funding competitions, and the success rate also speaks to the quality of research and researchers working in the UK.  However, the data also highlight the importance of freedom of movement within the EU for scientists and researchers. Switzerland is outside the European Union but also does very well out of ERC research grants; the condition of its access to these funds (or any other Horizon 2020 funding) is freedom of movement for EU workers. If the UK were to leave the EU and impose additional immigration controls on EU workers, would the UK remain as attractive a place to work in the future?  I'm not so sure.

Sources
Statistics on awards for starter, consolidator and advanced grants in 2015
https://erc.europa.eu/sites/default/files/document/file/erc_2015_stg_statistics.pdf
https://erc.europa.eu/sites/default/files/document/file/erc_2015_cog_statistics.pdf
https://erc.europa.eu/sites/default/files/document/file/erc_2015_adg_statistics.pdf


Monday, July 6, 2015

Meta-analysis Help Needed: Effects of temperature on prosocial and antisocial behaviour

We are currently conducting a meta-analysis looking at the influence of temperature on prosocial and antisocial behaviour.  If you have anything that fits this description (details below) I would be grateful if you could email me (d.lynott@lancaster.ac.uk) or Katie Corker (corkerk@kenyon.edu).  We have already done an extensive literature trawl, so we are mainly looking for unpublished work in this area - this could be work from many years ago, work that is under review/in press or data that has been collected and analysed but not formally written up yet. 

The Specifics

We are specifically interested in experimental (or quasi-experimental) studies that manipulate either ambient temperature (e.g., hot vs cold conditions; across a range of temperatures) or temperature primes (e.g., hot/cold packs, hot/cold drinks, hot/cold seat pads), where the behaviour of interest could be classified as prosocial or antisocial.  An example of prosocial behaviour is Williams & Bargh (2008, Study 2), where participants chose between selecting a gift for themselves (self-interested) or a friend (prosocial). See also IJzerman et al. (2013), where children donate stickers to another child.  An example of antisocial behaviour might be where participants have the opportunity to punish another participant or target individual; for example, Fay and Maner (2014) had participants direct noise blasts at partners.

If you believe your study meets these criteria, we are requesting details of your study, including the study design, sample sizes per condition, means and standard deviations per condition.  Alternatively, if you are happy to provide us with the raw data and information on the coding of variables, we can extract the required data. 

Note: We will only use the data for the purpose of the meta-analysis and we will delete the data afterwards. 

If you have any questions about this study, please do not hesitate to get in touch.  Thank you in advance for your help and any work you might contribute. 

Best wishes, 


Dermot Lynott, Katie Corker, Louise Connell, Kerry O’Brien. 

Monday, December 22, 2014

Bang for buck in REF 2014

The Research Excellence Framework 2014 evaluates research in terms of outputs (e.g., journal publications), impacts (e.g., how research translates to real-world impact) and research environment (e.g., facilities, funding, people).  Universities try to maximise the proportion of 4-star and 3-star ratings they get, which indicate world-leading and internationally excellent research.  To the wider world, the only things that count are the end products of research: outputs and (research's indirect cousin) impacts.  So what is the point of assessing research environment?  One possibility is that it can be used to ascertain which departments are squandering resources and which departments are doing fantastic research on a shoestring, and using that information to guide funding decisions.  For example, if a department is managing to produce a high proportion of 3-4* outputs on a 2* environment, then quality-related (QR) research money should be thrown at them, because one can only imagine what they might achieve with better facilities.  On the other hand, if a department has a 4* environment but its modal outputs tend to be in the 2-3* range, then serious thought should be given to holding back some of their QR money, because they are clearly wasting their excellent research environment on relatively mediocre research.  How can we examine - and reward - greater bang for buck?

We attempt to quantify "bang for buck" by looking at the quality of research produced by departments (outputs and impacts) relative to the quality of their research environment. Are people in 4* research environments consistently producing 4* research outputs? Are people in lower rated research environments producing poorer quality outputs?

To examine this, we calculate each department's grade point average (GPA) based on their combined outputs (weighted at 65% in REF2014) and impact (weighted at 20%) and compare it to the GPA based on their environment alone.  In short, this is the weighted score for outputs and impacts divided by the score for research environment.  This gives us a "Bang for Buck" ratio, where a score of 1 indicates a university/unit is producing research in line with the quality of its environment. A score greater than 1 indicates higher quality outputs relative to the research environment, while a score lower than 1 indicates outputs of a lower quality relative to the environment.  As my interest is in psychology, I've done the analysis for Unit of Assessment 4 (Psychology, Psychiatry and Neuroscience) - if time permits I'll add the analysis for universities overall later.  I'm focussing on top research departments, which I've defined as those that received above average scores for their 3* and 4* outputs/impacts combined (the average score in UoA 4 is ~65%).  Table 1 below shows the rankings according to the Bang for Buck ratio for the top research departments.
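To make the calculation explicit, here is a minimal sketch in Python rather than a spreadsheet. The 65% and 20% weights are renormalised so that outputs and impact together form the numerator GPA; this reproduces the b4b values in Table 1 (e.g., 1.19 for Nottingham Trent).

```python
def bang_for_buck(outputs_gpa: float, impact_gpa: float, environment_gpa: float) -> float:
    """Weighted outputs+impact GPA (REF2014 weights of 65% and 20%, renormalised)
    divided by the environment GPA."""
    w_out, w_imp = 0.65, 0.20
    combined = (w_out * outputs_gpa + w_imp * impact_gpa) / (w_out + w_imp)
    return combined / environment_gpa

# Example: Nottingham Trent University (row 1 of Table 1)
print(round(bang_for_buck(outputs_gpa=2.73, impact_gpa=3.73, environment_gpa=2.50), 2))  # 1.19
```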

Table 1. Bang-for-buck scores for the top 46 institutions in UoA 4. Overall GPA is included for context, but not used in the calculation. 
Rank Institution name Overall GPA Environment GPA Impact GPA Outputs GPA b4b
1 Nottingham Trent University 2.894 2.50 3.73 2.73 1.19
2 University of Stirling 3.269 2.88 4.00 3.14 1.16
3 University of East London 2.931 2.63 4.00 2.67 1.14
4 Goldsmiths' College 2.981 2.75 3.60 2.84 1.10
5 University of Aberdeen 3.221 3.00 3.00 3.34 1.09
6 University of Portsmouth 2.799 2.63 3.53 2.61 1.08
7 University of Dundee 3.182 3.00 3.80 3.03 1.07
8 University of Warwick 3.290 3.13 3.47 3.27 1.06
9 University of East Anglia 3.099 3.00 3.60 2.97 1.04
10 University of Hull 2.805 2.75 3.27 2.68 1.02
11 University of Essex 3.302 3.25 3.53 3.24 1.02
12 University of Plymouth 3.040 3.00 3.20 3.00 1.02
13 Roehampton University 2.785 2.75 3.40 2.60 1.02
14 Bangor University 3.278 3.25 3.73 3.15 1.01
15 Lancaster University 3.014 3.00 3.27 2.94 1.01
16 Queen's University Belfast 2.987 3.00 3.73 2.76 1.00
17 University of Sussex 3.355 3.38 3.84 3.20 0.99
18 City University London 2.728 2.75 2.80 2.70 0.99
19 University of Leicester 3.062 3.13 3.00 3.07 0.98
20 Swansea University 3.176 3.25 4.00 2.91 0.97
21 University of Surrey 2.858 3.00 2.73 2.86 0.94
22 Royal Holloway, University of London 3.439 3.63 3.73 3.31 0.94
23 University of Kent 2.963 3.13 3.40 2.79 0.94
24 University of Liverpool 3.138 3.38 3.80 2.88 0.92
25 University of Southampton 3.212 3.50 3.80 2.96 0.90
26 University of Glasgow 3.205 3.50 3.20 3.14 0.90
27 Imperial College London 3.405 3.75 3.84 3.19 0.89
28 University of Oxford 3.623 4.00 3.86 3.47 0.89
29 University of Reading 3.043 3.38 3.68 2.77 0.88
30 University of York 3.460 3.88 3.47 3.36 0.87
31 University of Exeter 3.223 3.63 3.20 3.14 0.87
32 Cardiff University 3.522 4.00 3.80 3.33 0.86
33 University of Cambridge 3.508 4.00 3.64 3.35 0.86
34 University of Durham 3.038 3.50 3.27 2.86 0.84
35 Birkbeck College 3.462 4.00 4.00 3.17 0.84
36 University of Bristol 3.191 3.75 3.80 2.87 0.82
37 University of Birmingham 3.400 4.00 3.68 3.18 0.82
38 University of Nottingham 3.188 3.75 3.33 3.01 0.82
39 University of Leeds 3.066 3.63 3.36 2.85 0.82
40 Newcastle University 3.366 4.00 3.87 3.07 0.81
41 University of Edinburgh 3.349 4.00 3.82 3.06 0.81
42 King's College London 3.329 4.00 3.78 3.04 0.80
43 University College London 3.307 4.00 3.71 3.02 0.80
44 University of Sheffield 3.196 3.88 3.60 2.92 0.79
45 University of St Andrews 3.281 4.00 3.40 3.08 0.79
46 University of Manchester 3.232 4.00 3.60 2.94 0.77


The first thing that strikes me is the relatively small number of UoA 4 departments that have a bang for buck (b4b) ratio above 1. Only 16 of the top 46 departments are producing better outputs/impacts than their research environments would suggest.  The second thing I notice is that Oxford (.89) and Cambridge (.86), almost always considered to be in the top 3 research departments, are well down the table in terms of b4b.  Now, Oxford and Cambridge departments could hardly be considered underachieving, given their impressive outputs and impacts, but the b4b scores suggest they could be doing still more given the resources at their disposal. 

Across the 35 institutions in Table 1 that did not receive top marks for their environments, 25% of outputs and 58% of impacts are rated as 4*, indicating that huge swathes of world-class research are done outside of traditional research powerhouses and elite institutions.  For me, the real stars in this analysis are Aberdeen, Dundee and Warwick - all three have very impressive GPAs for outputs and impact, but are still scoring very highly on the bang-for-buck measure.  On average, 35% of the outputs and 49% of the impacts from these three institutions are at 4* level, despite these departments not getting top scores for their research environments (averaging only 8% at 4*).

Does size matter?

Scanning the table, it also seems that larger departments tend to do less well in terms of bang-for-buck.  Table 2 shows the 10 largest departments (by number of staff submitted to UoA 4) along with their b4b scores. 

Table 2.  Bang-for-buck scores of the top 10 departments by size
Rank Institution name FTE Staff Submitted b4b
1 University College London 286.6 0.80
2 King's College London 238.9 0.80
3 University of Edinburgh 117.3 0.81
4 University of Oxford 98.3 0.89
5 University of Cambridge 76.0 0.86
6 Cardiff University 69.3 0.86
7 University of Bristol 68.8 0.82
8 University of Manchester 67.7 0.77
9 University of Nottingham 53.2 0.82
10 Newcastle University 51.6 0.81

The mean b4b score of the top 10 largest departments is .825, meaning that the quality of research outputs and impact is considerably below what one might expect given their research environments. There is a general negative relationship between the size of a department and its b4b score (r ~= -0.3). Are the departments in Table 2 too large to function optimally in terms of research outputs?  Is there an ideal department size in terms of producing quality research consistent with the research environment?  Table 3 shows the b4b scores for groups of departments ordered by size (e.g., the median number of staff for the top 10 largest departments is 72.64, for the next 10 largest it is 37.3, and so on).

Table 3. Groups of departments ranked by size with median number of active researchers and b4b scores
Rank by size Median FTE b4b
1-10 72.64 0.825
11-20 37.3 0.908
21-30 29.275 0.908
31-40 21.75 1.033
41-50 14.85 1.038
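For anyone who wants to poke at the size relationship themselves, here is a small sketch correlating submitted FTE with b4b. For illustration it uses only the ten departments in Table 2; the r ~= -0.3 quoted above was calculated across the full set of departments, so the value here will differ.

```python
# Illustration only: FTE vs b4b correlation using the ten rows of Table 2.
# The r ~= -0.3 in the text was computed over all departments in Table 1.
from scipy.stats import pearsonr

fte = [286.6, 238.9, 117.3, 98.3, 76.0, 69.3, 68.8, 67.7, 53.2, 51.6]
b4b = [0.80, 0.80, 0.81, 0.89, 0.86, 0.86, 0.82, 0.77, 0.82, 0.81]

r, p = pearsonr(fte, b4b)
print(f"r = {r:.2f} (n = {len(fte)})")
```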

According to these data, the sweet spot for department size is somewhere between 15 and 22 active researchers. Beyond that, people are not making optimal use of their research environments. It's not clear why this would be the case - it may be that resource management becomes more problematic beyond a certain size, or that a culture of resource wastage becomes normalised in larger departments.  There may be many reasons for such a relationship.

What have we learned?

Overall, I think this analysis suggests a couple of very important points. It is clear that there is huge value in not focusing research funding on a small number of elite institutions. Whatever the flaws of the REF, the overall research profiles clearly indicate that there is high quality research taking place across the UK - there are 46 departments with more than 65% of their research considered world-leading or internationally excellent.  Furthermore, most of the world-leading research in the UK is taking place in departments that lack 4* research environments.  What the bang for buck calculations show is that many departments are punching well above their weight in terms of the research produced and are demonstrating extremely good value for money. In addition, as departments increase in size there is a noticeable decrease in bang-for-buck scores.

In terms of allocating research funding (QR money), there are at least two ways to utilise this information. One way is to allocate extra money to those departments with high bang for buck ratios, because giving them the opportunity to improve their research environments will likely lead to matched improvements in research outputs and impacts.  Another way is to disregard environment completely when allocating QR money, and focus only on outputs and impacts, which rewards departments purely by the research they actually produce, rather than factoring in the environment they already have in place.