
Evaluation Data

MuSR

Data Description:

MuSR (Multistep Soft Reasoning) is a multistep reasoning evaluation set. Each question is based on a long natural-language narrative (around a thousand words) and belongs to one of three task types: identifying the culprit, locating an object, or assigning tasks to team members.

The test set contains 756 examples in total.

Simplified sample from the source dataset:

{
  "context": "In an adrenaline inducing bungee jumping site, Mack's thrill-seeking adventure came to a gruesome end by a nunchaku ...",
  "questions": [{"question": "Who is the most likely murderer?", "choices": ["Mackenzie", "Ana"]}]
}
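
To illustrate how a record of this shape might be consumed, the following minimal Python sketch turns each question into a lettered multiple-choice prompt. The helper name `build_prompts`, the A/B lettering, and the trailing `Answer:` cue are assumptions made for illustration; they are not the prompt format used by the MuSR authors or by any particular evaluation harness. The context string is kept truncated exactly as in the sample above.

```python
# Simplified record in the same shape as the sample above.
# The context is left truncated ("...") exactly as shown in the sample.
record = {
    "context": "In an adrenaline inducing bungee jumping site, Mack's "
               "thrill-seeking adventure came to a gruesome end by a nunchaku ...",
    "questions": [
        {
            "question": "Who is the most likely murderer?",
            "choices": ["Mackenzie", "Ana"],
        }
    ],
}


def build_prompts(rec):
    """Hypothetical helper: return one multiple-choice prompt per question."""
    prompts = []
    for q in rec["questions"]:
        # Label the choices A, B, C, ... in the order they appear.
        options = "\n".join(
            f"{chr(ord('A') + i)}. {choice}"
            for i, choice in enumerate(q["choices"])
        )
        prompts.append(
            f"{rec['context']}\n\nQuestion: {q['question']}\n{options}\nAnswer:"
        )
    return prompts


prompts = build_prompts(record)
print(prompts[0])
```

Because `questions` is a list, the sketch loops over it rather than assuming a single question per context.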

Paper citation:

MuSR: https://arxiv.org/abs/2310.16049

@inproceedings{
sprague2024musr,
title={Mu{SR}: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning},
author={Zayne Rea Sprague and Xi Ye and Kaj Bostrom and Swarat Chaudhuri and Greg Durrett},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=jenyYQzue1}
}