
Evaluation Data

MuSR

Data description:

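MuSR (Multistep Soft Reasoning) is a multistep reasoning benchmark. Each question is paired with a long natural-language narrative (roughly a thousand words) and belongs to one of three types: murder mystery (who is the murderer), object placement, and team task allocation.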

It contains 756 test examples in total.

Sample from the source dataset (simplified):

{
  "context": "In an adrenaline inducing bungee jumping site, Mack's thrill-seeking adventure came to a gruesome end by a nunchaku ...",
  "questions": [{"question": "Who is the most likely murderer?", "choices": ["Mackenzie", "Ana"]}]
}
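
To illustrate how a record of this shape might be turned into a multiple-choice prompt, here is a minimal Python sketch. The function name `format_prompts`, the lettered option layout, and the trailing "Answer:" cue are assumptions made for illustration; they are not the prompt format used by the MuSR authors or by any particular evaluation harness. The record reuses the simplified sample above, including its truncated context.

```python
from string import ascii_uppercase

# Simplified record mirroring the sample above (the context is truncated in the source).
sample = {
    "context": "In an adrenaline inducing bungee jumping site, Mack's thrill-seeking "
               "adventure came to a gruesome end by a nunchaku ...",
    "questions": [
        {"question": "Who is the most likely murderer?", "choices": ["Mackenzie", "Ana"]}
    ],
}


def format_prompts(record):
    """Build one (prompt, labels) pair per question in the record.

    Hypothetical helper: the layout below is an assumed convention,
    not an official MuSR prompt template.
    """
    prompts = []
    for q in record["questions"]:
        # Assign letters A, B, C, ... to the answer choices in order.
        labels = list(ascii_uppercase[: len(q["choices"])])
        options = "\n".join(
            f"{label}. {choice}" for label, choice in zip(labels, q["choices"])
        )
        prompt = f"{record['context']}\n\n{q['question']}\n{options}\nAnswer:"
        prompts.append((prompt, labels))
    return prompts


prompts = format_prompts(sample)
print(prompts[0][0])
```

A model's reply can then be compared against the returned labels to score each question.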

Paper citation:

MuSR: https://arxiv.org/abs/2310.16049

@inproceedings{
sprague2024musr,
title={Mu{SR}: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning},
author={Zayne Rea Sprague and Xi Ye and Kaj Bostrom and Swarat Chaudhuri and Greg Durrett},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=jenyYQzue1}
}