Evaluation Data
All datasets below are converted into standard evaluation prompts before evaluation.
IMDB
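Data description:
IMDB is a large movie review dataset for binary sentiment classification, containing substantially more data than previous benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing. Additional unlabeled data is also available.
Dataset composition and specifications:
Downloaded dataset file size: 84.13 MB; generated dataset size: 133.23 MB; total disk usage: 217.35 MB
Source data size:
Training set (25,000), test set (25,000), unlabeled data (50,000)
Evaluation data size:
The evaluation data consists of the 25,000 instances in the test set of the source data.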
Data fields:
Key | Description |
---|---|
label | Class ID |
text | Review text |
Source dataset example:
{
"label": 0,
"text": "Goodbye world2"
}
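The conversion from a source record to an evaluation prompt could look roughly like the sketch below. The prompt wording, the `LABEL_NAMES` mapping (0 as negative, 1 as positive), and the helper names `build_prompt` and `is_correct` are illustrative assumptions; the actual template used for evaluation is not given here.

```python
import json

# Assumption: class ID 0 means negative and 1 means positive.
# The field table above only states that `label` is a class ID.
LABEL_NAMES = {0: "negative", 1: "positive"}

# Hypothetical prompt template, not the one actually used for evaluation.
PROMPT_TEMPLATE = (
    "Review: {text}\n"
    "Is the sentiment of this review positive or negative?\n"
    "Answer:"
)


def build_prompt(record: dict) -> str:
    """Turn one IMDB record ({"label": ..., "text": ...}) into a prompt string."""
    return PROMPT_TEMPLATE.format(text=record["text"].strip())


def is_correct(record: dict, model_answer: str) -> bool:
    """Check whether the model's answer starts with the expected label name."""
    expected = LABEL_NAMES[record["label"]]
    return model_answer.strip().lower().startswith(expected)


# The sample record shown above.
sample = json.loads('{"label": 0, "text": "Goodbye world2"}')
prompt = build_prompt(sample)
print(prompt)
```

Matching on the start of the answer is a deliberately simple scoring rule; a real harness may compare label log-probabilities or parse the answer more strictly.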
Citation:
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
title = {Learning Word Vectors for Sentiment Analysis},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
year = {2011},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {142--150},
url = {http://www.aclweb.org/anthology/P11-1015}
}
RAFT
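Data description:
The Real-world Annotated Few-shot Tasks (RAFT) dataset is a collection of English-language datasets drawn from real-world settings. Each dataset is associated with a binary or multi-class classification task, with the goal of better understanding how language models perform on concrete tasks that have real-world value. Only 50 labeled examples are provided for each dataset.
Dataset composition and specifications:
Sub-dataset | Source data size | Sampled data size | Source dataset example |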
---|---|---|---|
Ade Corpus V2 | Training set (50), test set (5,000) | 40 sampled from the test set | Sentence: No regional side effects were noted. ID: 0 Label: 2 |
Banking 77 | Training set (50), test set (5,000) | 40 sampled from the test set | Query: Is it possible for me to change my PIN number? ID: 0 Label: 23 |
NeurIPS Impact Statement Risks | Training set (50), test set (150) | 40 sampled from the test set | Paper title: Auto-Panoptic: Cooperative Multi-Component Architecture Search for Panoptic Segmentation... Paper link: https://proceedings.neurips.cc/paper/2020/file/ec1f764517b7ffb52057af6df18142b7-Paper.pdf... Impact statement: This work makes the first attempt to search for all key components of panoptic pipeline and manages to accomplish this via the p... ID: 0 Label: 1 |
One Stop English | Training set (50), test set (516) | 40 sampled from the test set | Article: For 85 years, it was just a grey blob on classroom maps of the solar system. But, on 15 July, Pluto was seen in high resolution ... ID: 0 Label: 3 |
Overruling | Training set (50), test set (2,350) | 40 sampled from the test set | Sentence: in light of both our holding today and previous rulings in johnson, dueser, and gronroos, we now explicitly overrule dupree.... ID: 0 Label: 2 |
Semiconductor Org Types | Training set (50), test set (449) | 40 sampled from the test set | Paper title: 3Gb/s AC-coupled chip-to-chip communication using a low-swing pulse receiver... Organization name: North Carolina State Univ.,Raleigh,NC,USA ID: 0 Label: 3 |
Systematic Review Inclusion | Training set (50), test set (2,243) | 40 sampled from the test set | Title: Prototyping and transforming facial textures for perception research... Abstract: Wavelet based methods for prototyping facial textures for artificially transforming the age of facial images were described. Pro... Authors: Tiddeman, B.; Burt, M.; Perrett, D. Journal: IEEE Comput Graphics Appl ID: 0 Label: 2 |
TAI Safety Research | Training set (50), test set (1,639) | 40 sampled from the test set | Title: Malign generalization without internal search Abstract Note: In my last post, I challenged the idea that inner alignment failures should be explained by appealing to agents which perform ex... Url: https://www.alignmentforum.org/posts/ynt9TD6PrYw6iT49m/malign-generalization-without-internal-search... Publication Year: 2020 Item Type: blogPost Author: Barnett, Matthew Publication Title: AI Alignment Forum ID: 0 Label: 1 |
Terms Of Service | Training set (50), test set (5,000) | 40 sampled from the test set | Sentence: Crowdtangle may change these terms of service, as described above, notwithstanding any provision to the contrary in any agreemen... ID: 0 Label: 2 |
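Drawing 40 instances from each test set could be done along the lines of the sketch below. The sampling procedure is not specified in this document, so uniform sampling without replacement, the fixed seed, and the helper name `sample_test_instances` are all assumptions made for illustration.

```python
import random


def sample_test_instances(test_set, k=40, seed=0):
    """Return k distinct instances from test_set, reproducibly.

    Assumption: uniform sampling without replacement with a fixed seed;
    the actual procedure used to pick the 40 instances is not documented.
    """
    if k > len(test_set):
        raise ValueError(f"cannot sample {k} items from {len(test_set)}")
    rng = random.Random(seed)  # local RNG, leaves global random state untouched
    indices = sorted(rng.sample(range(len(test_set)), k))  # keep source order
    return [test_set[i] for i in indices]


# Stand-in data with 150 rows, the size of the smallest test set in the table.
toy_test_set = [{"ID": i, "Sentence": f"example {i}"} for i in range(150)]
subset = sample_test_instances(toy_test_set)
print(len(subset))
```

Fixing the seed keeps the evaluated subset identical across runs, which matters when comparing models on only 40 instances per task.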
Dataset licenses and usage terms:
Dataset | License |
---|---|
Ade Corpus V2 | Unlicensed |
Banking 77 | CC BY 4.0 |
NeurIPS Impact Statement Risks | MIT License/CC BY 4.0 |
One Stop English | CC BY-SA 4.0 |
Overruling | Unlicensed |
Semiconductor Org Types | CC BY-NC 4.0 |
Systematic Review Inclusion | CC BY 4.0 |
TAI Safety Research | CC BY-SA 4.0 |
Terms Of Service | Unlicensed |
Tweet Eval Hate | Unlicensed |
Twitter Complaints | Unlicensed |