Evaluation Data
MuSR
Data Description:
MuSR (Multistep Soft Reasoning) is a multistep reasoning evaluation set. Each question is based on a long natural-language narrative (around a thousand words) and belongs to one of three task types: identifying the culprit, locating an object, and assigning tasks to team members.
The test set contains 756 examples in total.
Simplified sample from the source dataset:
{
"context": "In an adrenaline inducing bungee jumping site, Mack's thrill-seeking adventure came to a gruesome end by a nunchaku ...",
"questions": [{"question": "Who is the most likely murderer?", "choices": ["Mackenzie", "Ana"]}]
}
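As a minimal sketch of how a record with this shape could be turned into a multiple-choice prompt, the Python snippet below walks the `questions` list and numbers the choices. The helper name `build_prompts` and the prompt layout are assumptions made for illustration; they are not taken from the MuSR paper or any official evaluation harness. The context string is kept truncated exactly as in the sample.

```python
# Record mirroring the simplified sample above (context left truncated as in the source).
record = {
    "context": "In an adrenaline inducing bungee jumping site, Mack's thrill-seeking "
               "adventure came to a gruesome end by a nunchaku ...",
    "questions": [
        {"question": "Who is the most likely murderer?", "choices": ["Mackenzie", "Ana"]}
    ],
}


def build_prompts(rec):
    """Hypothetical helper: return one prompt string per question in the record."""
    prompts = []
    for q in rec["questions"]:
        # Number the answer choices starting from 1.
        options = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(q["choices"]))
        prompts.append(f"{rec['context']}\n\n{q['question']}\n{options}")
    return prompts


prompts = build_prompts(record)
print(prompts[0])
```

Each prompt places the narrative first, then the question, then the numbered choices, so a model's answer can be mapped back to an index in `choices`.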
Paper citation:
MuSR: https://arxiv.org/abs/2310.16049
@inproceedings{
sprague2024musr,
title={Mu{SR}: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning},
author={Zayne Rea Sprague and Xi Ye and Kaj Bostrom and Swarat Chaudhuri and Greg Durrett},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=jenyYQzue1}
}