Evaluation Data
MuSR
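Data description:
MuSR (Multistep Soft Reasoning) is a multistep reasoning benchmark. Each question comes with a long natural-language narrative (roughly 1,000 words) and belongs to one of three types: murder mysteries (who is the murderer), object placements, and team allocation.
It contains 756 test examples in total.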
Source dataset sample (simplified):
{
"context": "In an adrenaline inducing bungee jumping site, Mack's thrill-seeking adventure came to a gruesome end by a nunchaku ...",
"questions": [{"question": "Who is the most likely murderer?", "choices": ["Mackenzie", "Ana"]}]
}
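As a rough illustration of how a record in the simplified shape above could be turned into a multiple-choice prompt, here is a minimal Python sketch. The function name `format_prompts`, the numbered-option layout, and the trailing "Answer:" cue are assumptions made for this example and are not taken from the MuSR release or from any particular evaluation harness. The context string is left truncated exactly as in the sample.

```python
import json

# Simplified record copied from the sample above (context is truncated in the source).
SAMPLE = """
{
  "context": "In an adrenaline inducing bungee jumping site, Mack's thrill-seeking adventure came to a gruesome end by a nunchaku ...",
  "questions": [{"question": "Who is the most likely murderer?", "choices": ["Mackenzie", "Ana"]}]
}
"""


def format_prompts(record: dict) -> list[str]:
    """Build one prompt per question; choices are numbered from 1 (assumed layout)."""
    prompts = []
    for q in record["questions"]:
        # Render each choice on its own line as "1. Mackenzie", "2. Ana", ...
        options = "\n".join(
            f"{i}. {choice}" for i, choice in enumerate(q["choices"], start=1)
        )
        prompts.append(f"{record['context']}\n\n{q['question']}\n{options}\nAnswer:")
    return prompts


record = json.loads(SAMPLE)
prompts = format_prompts(record)
print(prompts[0])
```

Numbering the choices keeps the expected answer short and easy to match, but a harness could just as well use letters or the choice text itself.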
Citation:
MuSR: https://arxiv.org/abs/2310.16049
@inproceedings{
sprague2024musr,
title={Mu{SR}: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning},
author={Zayne Rea Sprague and Xi Ye and Kaj Bostrom and Swarat Chaudhuri and Greg Durrett},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=jenyYQzue1}
}