Evaluation Data

All of the datasets below are converted into standard evaluation prompts before they are evaluated.

IMDB

#Evaluation metric: Quasi-Exact Match
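
The page names the metric but does not define it. As a rough illustration, the sketch below assumes the common normalize-then-compare reading of quasi-exact match: lowercase both strings, strip punctuation and English articles, collapse whitespace, then test for equality. The function names `normalize_text` and `quasi_exact_match` are illustrative assumptions, not taken from the platform's code, and the platform's actual normalization may differ.

```python
import re
import string


def normalize_text(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def quasi_exact_match(gold: str, pred: str) -> float:
    """Return 1.0 if the normalized strings are identical, otherwise 0.0.

    Assumption: this mirrors a typical quasi-exact match definition; it is
    not confirmed to be the scoring code used for these datasets.
    """
    return 1.0 if normalize_text(gold) == normalize_text(pred) else 0.0


if __name__ == "__main__":
    print(quasi_exact_match("Positive", " positive. "))
    print(quasi_exact_match("Positive", "Negative"))
```

Under this reading, a prediction that differs from the reference only in case, punctuation, or surrounding whitespace still counts as correct.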

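Data description:

IMDB is a large movie review dataset for binary sentiment classification, containing substantially more data than earlier benchmark datasets. It provides 25000 highly polar movie reviews for training and 25000 for testing. Additional unlabeled data is also available.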

Dataset composition and specifications:

Downloaded dataset file size: 84.13 MB; generated dataset size: 133.23 MB; total disk usage: 217.35 MB

Source data size:

Training set (25000), test set (25000), unlabeled data (50000)

Evaluation data size:

The evaluation data consists of the 25000 instances in the source test set.

Data fields:

KEY	EXPLAIN
label	Class ID
text	Review text

Source dataset sample:

{
    "label": 0,
    "text": "Goodbye world2"
}
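
To make the record-to-prompt conversion concrete, here is a minimal sketch that turns a record of the shape shown above into a prompt and a reference answer. The prompt template, the helper names `build_prompt` and `gold_answer`, and the label mapping (0 = negative, 1 = positive) are all assumptions made for illustration; the page does not specify the standard evaluation prompt, so check the mapping against the dataset card before relying on it.

```python
import json

# Assumed mapping from class ID to answer string; not stated in the source.
LABEL_NAMES = {0: "Negative", 1: "Positive"}


def build_prompt(record: dict) -> str:
    """Format one review as a prompt (hypothetical template)."""
    return f"Passage: {record['text']}\nSentiment:"


def gold_answer(record: dict) -> str:
    """Look up the reference answer string for the record's class ID."""
    return LABEL_NAMES[record["label"]]


# The sample record from the section above.
record = json.loads('{"label": 0, "text": "Goodbye world2"}')

if __name__ == "__main__":
    print(build_prompt(record))
    print(gold_answer(record))
```

The model's completion would then be compared with the reference answer using the evaluation metric named for this dataset.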

Paper citation:

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

RAFT

#Evaluation metric: Quasi-Exact Match

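Data description:

The Real-world Annotated Few-shot Tasks (RAFT) dataset is a collection of English-language datasets drawn from real-world settings. Each dataset is associated with a binary or multi-class classification task, with the goal of better understanding how language models perform on concrete tasks that have real-world value. Only 50 labeled examples are provided for each dataset.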

Dataset composition and specifications:

Sub-dataset	Source data size	Sampled data size	Source dataset sample
Ade Corpus V2	Training set (50), test set (5000)	Sampled from the test set (40)	Sentence: No regional side effects were noted. ID: 0 Label: 2
Banking 77	Training set (50), test set (5000)	Sampled from the test set (40)	Query: Is it possible for me to change my PIN number? ID: 0 Label: 23
NeurIPS Impact Statement Risks	Training set (50), test set (150)	Sampled from the test set (40)	Paper title: Auto-Panoptic: Cooperative Multi-Component Architecture Search for Panoptic Segmentation... Paper link: https://proceedings.neurips.cc/paper/2020/file/ec1f764517b7ffb52057af6df18142b7-Paper.pdf... Impact statement: This work makes the first attempt to search for all key components of panoptic pipeline and manages to accomplish this via the p... ID: 0 Label: 1
One Stop English	Training set (50), test set (516)	Sampled from the test set (40)	Article: For 85 years, it was just a grey blob on classroom maps of the solar system. But, on 15 July, Pluto was seen in high resolution ... ID: 0 Label: 3
Overruling	Training set (50), test set (2350)	Sampled from the test set (40)	Sentence: in light of both our holding today and previous rulings in johnson, dueser, and gronroos, we now explicitly overrule dupree.... ID: 0 Label: 2
Semiconductor Org Types	Training set (50), test set (449)	Sampled from the test set (40)	Paper title: 3Gb/s AC-coupled chip-to-chip communication using a low-swing pulse receiver... Organization name: North Carolina State Univ.,Raleigh,NC,USA ID: 0 Label: 3
Systematic Review Inclusion	Training set (50), test set (2243)	Sampled from the test set (40)	Title: Prototyping and transforming facial textures for perception research... Abstract: Wavelet based methods for prototyping facial textures for artificially transforming the age of facial images were described. Pro... Authors: Tiddeman, B.; Burt, M.; Perrett, D. Journal: IEEE Comput Graphics Appl ID: 0 Label: 2
TAI Safety Research	Training set (50), test set (1639)	Sampled from the test set (40)	Title: Malign generalization without internal search Abstract Note: In my last post, I challenged the idea that inner alignment failures should be explained by appealing to agents which perform ex... Url: https://www.alignmentforum.org/posts/ynt9TD6PrYw6iT49m/malign-generalization-without-internal-search... Publication Year: 2020 Item Type: blogPost Author: Barnett, Matthew Publication Title: AI Alignment Forum ID: 0 Label: 1
Terms Of Service	Training set (50), test set (5000)	Sampled from the test set (40)	Sentence: Crowdtangle may change these terms of service, as described above, notwithstanding any provision to the contrary in any agreemen... ID: 0 Label: 2
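
The table lists 40 instances sampled from each sub-dataset's test set but does not say how they are drawn. The sketch below shows one reproducible way to do it, assuming uniform sampling without replacement and a fixed seed; the function name `sample_instances`, the seed value, and the sampling strategy are assumptions, not the documented procedure.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_instances(instances: Sequence[T], k: int = 40, seed: int = 0) -> List[T]:
    """Draw k instances uniformly without replacement, reproducibly.

    Assumption: the sampling method and seed are illustrative only.
    If fewer than k instances exist, all of them are returned.
    """
    if len(instances) <= k:
        return list(instances)
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(list(instances), k)


if __name__ == "__main__":
    # Stand-in for a test set of 150 instances, identified by index.
    test_ids = list(range(150))
    print(len(sample_instances(test_ids)))
```

Fixing the seed keeps the evaluated subset identical across runs, so scores from different models remain comparable.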

Dataset license information:

Dataset	License
Ade Corpus V2	Unlicensed
Banking 77	CC BY 4.0
NeurIPS Impact Statement Risks	MIT License/CC BY 4.0
One Stop English	CC BY-SA 4.0
Overruling	Unlicensed
Semiconductor Org Types	CC BY-NC 4.0
Systematic Review Inclusion	CC BY 4.0
TAI Safety Research	CC BY-SA 4.0
Terms Of Service	Unlicensed
Tweet Eval Hate	Unlicensed
Twitter Complaints	Unlicensed
