How to quantify research quality?

Introduction

You may have recently noticed a number appearing under our blog posts, in a little green square. That’s an attempt to better track the quality of our research, which is, as far as we know, the first system of its kind.

This post explains why we added it, how it works, who does the ratings, and its benefits so far.

Image_0

Figure: Where can you see the score?

Why score our blog posts?

Our key organisational aim this year is to deepen our knowledge of how to maximise the social impact of career choices through research. But how can we know if our research is good or bad?

It’s easy to track the popularity of online content, and we already attempt to track the extent to which our research changes behaviour, but it’s possible for bad research to be both popular and persuasive. How can we gain more confidence in the quality of our research?

One answer is having a research evaluation. In a research evaluation, an evaluator (ideally external and unbiased) qualitatively assesses a list of criteria that constitute good research. For instance, they could consider: “In coming to these conclusions, did they use methods that are likely to be reliable?” and “Are these conclusions important, new information for users?” Finding out whether authoritative, credible people agree with the conclusions is also useful evidence. These are the kind of evaluations GiveWell has used (although they no longer actively seek external evaluations because they were getting detailed feedback from users and it was difficult to find suitable evaluators). This is also the kind of process widely used within universities and in grant applications.

We plan to introduce an external research evaluation process. The problem with such evaluations is that you don’t learn the results until months after you’ve done the work. They take time to organise, so it’s hard to run them more often than every six months. But rapid feedback is really important for improving quickly.

Ozzie Gooen suggested we could instead evaluate each research blog post, since that’s where we initially publish most of our research. We could give a group of external evaluators access to a system to score each blog post on a few important criteria. The combined score could be displayed on the blog. This would create a system for research evaluation that’s rapid, external and transparent.

How does it work?

We’ve found a group of internal and external evaluators (though we would like to move towards a fully external group). Each evaluator is given access to a scoring box that appears at the bottom of every blog post. When they input their scores, these are averaged into a combined score that appears next to the blog post. If you click on the score box, you can see the individual breakdowns and any comments left by the raters. At the end of every week, the 80,000 Hours team reviews the scores we received.

Image_1

Figure: The scoring screen.
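As a rough illustration of the mechanics above, here is a minimal sketch in Python of how rater inputs could be combined and broken down, assuming a simple average and a single overall score per rater (in practice each rater scores several criteria, as described in the next section). The data shapes, names and scale are illustrative assumptions, not the blog’s actual implementation.

```python
from statistics import mean

# Hypothetical rater submissions for one blog post (names and scale assumed).
submissions = [
    {"rater": "A", "score": 7, "comment": "Well referenced, conclusions hedged."},
    {"rater": "B", "score": 6, "comment": "Useful, but the summary is thin."},
]

# Combined score shown next to the post: the average of all raters' scores.
combined = round(mean(s["score"] for s in submissions), 1)

# Clicking the score box reveals the individual breakdown and any comments.
breakdown = [(s["rater"], s["score"], s["comment"]) for s in submissions]

print(combined)   # 6.5
print(breakdown)
```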

What criteria do we use for rating?

Full research posts are judged on the following criteria:

  1. Quality – Given the aims of this type of post in 80,000 Hours’ research program, overall how high quality is this work?

  2. Usefulness – If true, how useful is this information to our typical users in picking careers with social impact?

  3. Reliability – Are the conclusions likely to be true?

  4. Transparency – Could a motivated reader follow the research process?

  5. Clarity – Is the message of this research quick and easy to understand?

Usefulness and reliability measure the immediate value of the research to our users. Transparency and clarity are important instrumental goals. They make it easier to judge the quality of our research, and for others to build upon our work. The quality score is an all-things-considered judgement of how good the research is overall. You can find more detail in this document.

For posts that don’t provide new arguments (interviews, case studies, discussion), the full rubric is less applicable, so we just give an overall quality score (relative to the aims of that type of post).
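To make the two rubrics concrete, here is a small sketch of how they could be represented and how per-criterion averages might be computed across raters. The structure, names and scale are assumptions for illustration, not the actual schema used on the blog.

```python
from statistics import mean

# The five criteria used for full research posts (see the list above).
FULL_RUBRIC = ["quality", "usefulness", "reliability", "transparency", "clarity"]

# Interviews, case studies and discussion posts get an overall quality score only.
QUALITY_ONLY_RUBRIC = ["quality"]

def per_criterion_averages(ratings, rubric):
    """Average each criterion across raters; ratings maps rater -> {criterion: score}."""
    return {
        criterion: round(mean(r[criterion] for r in ratings.values()), 1)
        for criterion in rubric
    }

# Hypothetical scores from two raters for a full research post (scale assumed).
ratings = {
    "A": {"quality": 7, "usefulness": 8, "reliability": 6, "transparency": 7, "clarity": 8},
    "B": {"quality": 6, "usefulness": 7, "reliability": 7, "transparency": 8, "clarity": 7},
}
print(per_criterion_averages(ratings, FULL_RUBRIC))
# {'quality': 6.5, 'usefulness': 7.5, 'reliability': 6.5, 'transparency': 7.5, 'clarity': 7.5}
```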

How do we select the raters?

Selecting the raters is the most difficult challenge, and one of the main reasons why GiveWell stopped actively seeking external evaluation. The difficulty is finding a group of people who are sufficiently engaged with 80,000 Hours to be willing to spend the time reading and scoring our posts, but also sufficiently external to be unbiased and credible. In addition, the group ideally has experience with relevant research.

Currently, all internal team members can also post research quality ratings. That means the combined score is a combination of internal and external evaluation.

We’re still working on building our group of external evaluators. In particular, we would like to involve more people who are less connected with 80,000 Hours. However, we think ratings from the existing group already provide useful additional information over purely internal evaluations. In the future, we’ll remove the internal scores from the overall average.

We would like to add more raters. If you know someone who might be suitable, please let us know!

Who are the external evaluators?

Peter Hurford
Credentials: Peter is exactly the sort of person we’d like to appeal to and help with their career, so we value his opinion on how useful our research is. Peter has been involved with effective altruism from the start, showed good research skills during an internship at Giving What We Can, and writes a high-quality blog.
Disclosure: Peter has received a case study from us, and interned at our sister charity, Giving What We Can.

Mike Webb
Credentials: Michael is a research economist at the Institute for Fiscal Studies (a leading, UK-based, independent economic research institute). He has a Master’s from MIT, and graduated top of his year in economics at Oxford. In September he will start the Economics PhD program at Stanford. Like Peter, he represents our target audience, but he’s not actively engaged in the effective altruism community.
Disclosure: Mike is a friend of Ben from university.

Sam Bankman-Fried
Credentials: Sam graduated from MIT with a major in maths, and has a job in quantitative trading earning to give.
Disclosure: We’ve given Sam advice on his career.

Note (November 17, 2022, 1:00 pm GMT): Until recently, we had highlighted Sam Bankman-Fried on our website as a positive example of someone ambitiously pursuing a high-impact career. To say the least, we no longer endorse that. See our statement for why.

Note: On May 12, 2023, we released a blog post on updates we made to our advice after the collapse of FTX.

Ben Kuhn
Credentials: Ben works on machine learning at a quantitative finance start-up, and has previously interned at GiveWell. He recently finished his undergraduate mathematics studies at Harvard, where he ran Harvard Effective Altruism. He writes about effective altruism and other topics on his excellent blog.
Disclosure: Ben has received a case study from us, is a trustee of CEA USA and was approached about working at 80,000 Hours.

Benefits of the system

The benefits include:

  • It provides quicker feedback for the research team on the quality of their work, and how it’s changing over time.

  • It provides additional data for our full research evaluation.

  • Developing the rubric encouraged us to think hard about what constitutes good research, improving our research processes.

  • Having these criteria written down gives us a checklist to satisfy when producing research. We think this has increased research quality (e.g. ensuring each post has a proper summary, is well referenced, properly hedges its conclusions, etc.).

  • Scoring our posts and discussing the scores each week as a team forces us to regularly self-reflect on the quality of our work (we’ve kept brief notes on the meetings here).

  • It has increased the amount of qualitative feedback we receive each week.

  • We think it makes it easier for our users to judge the quality of our work, and in particular, to find our best posts.

Alternatives

The main alternative is unstructured qualitative feedback, for instance, chatting to people about our work or receiving blog comments. We receive this naturally from our existing users, especially in coaching sessions and in comments on the blog. We view qualitative feedback as complementary. The blog rubric provides an additional source of feedback, and one that’s quick, transparent and easy to analyse.

How was it developed?

  1. We developed the rubric by reflecting on what criteria we want our research to satisfy, and by looking at the processes used in research grant applications. We took the initial rubric through several rounds of feedback and improvement.

  2. Ozzie implemented the scoring system on the blog. You can see a picture of what it looks like below.

  3. We tested it with staff, before adding an initial group of raters.

  4. We scored a selection of old posts and let the system run for several months of new posts.

  5. We also sought advice from Stephanie Wykstra, a former GiveWell employee who now works to promote meta-research.

  6. We implemented another round of changes to the rubric. In particular, we simplified the scoring system and altered the definitions of the scores to spread them out more widely. (More description here).

  7. Currently, we’re looking to build up the group of raters to 4-6 regulars.

  8. We’ll perform another review of the system in a year, focusing on who does the rating, whether to alter the rubric and whether to make the system more anonymous.

Image_2

Figure: The scoring box.