
Evaluation Data

All of the datasets below are converted into standard evaluation prompts before evaluation.

IMDB

#Evaluation metric - Quasi-Exact Match
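The source names the metric but does not define it. A minimal sketch of quasi-exact match, assuming the commonly used normalization (lowercasing, removing punctuation and English articles, collapsing whitespace) before comparing strings; the function names `normalize_text` and `quasi_exact_match` are illustrative, not taken from this page:

```python
import re
import string


def normalize_text(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace.

    Assumption: this normalization is not specified in the source document.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def quasi_exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return 1.0 if normalize_text(prediction) == normalize_text(reference) else 0.0
```

Under these assumptions, a model output such as `" Positive. "` would count as matching the reference `"positive"`, while plain exact match would reject it.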

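Data description:

IMDB is a large movie-review dataset for binary sentiment classification, containing substantially more data than earlier benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing. Additional unlabeled data is also available.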

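Dataset composition and specifications:

Downloaded dataset file size: 84.13 MB; generated dataset size: 133.23 MB; total disk usage: 217.35 MB

Source data size:

Training set (25000), test set (25000), unlabeled data (50000)

Evaluation data size:

The evaluation data consists of the 25000 instances in the test set of the source data.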

Data fields:

| KEY | EXPLAIN |
| --- | --- |
| label | Class ID |
| text | Review text |

Source dataset sample:

{
    "label": 0,
    "text": "Goodbye world2"
}
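As a rough illustration of how a record like the one above might be turned into an evaluation prompt, here is a minimal sketch. The template wording (`PROMPT_TEMPLATE`) and the answer strings in `LABEL_NAMES` are hypothetical, since the source does not give the actual prompt format; the mapping of 0 to negative and 1 to positive is assumed to follow the usual convention for this dataset.

```python
import json
from typing import Tuple

# Hypothetical template: the source does not specify the real prompt wording.
PROMPT_TEMPLATE = "Passage: {text}\nSentiment:"

# Assumed label convention: 0 = negative review, 1 = positive review.
LABEL_NAMES = {0: "Negative", 1: "Positive"}


def build_prompt(record: dict) -> Tuple[str, str]:
    """Return (prompt, reference answer) for one IMDB-style record."""
    prompt = PROMPT_TEMPLATE.format(text=record["text"])
    reference = LABEL_NAMES[record["label"]]
    return prompt, reference


# The sample record shown in the source.
record = json.loads('{"label": 0, "text": "Goodbye world2"}')
prompt, reference = build_prompt(record)
print(prompt)
print(reference)
```

The model's completion for `prompt` would then be compared against `reference` by the evaluation metric.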

Citation:

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

RAFT

#Evaluation metric - Quasi-Exact Match

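Data description:

The Real-world Annotated Few-shot Tasks (RAFT) dataset is a collection of English-language datasets drawn from real-world settings. Each dataset is associated with a binary or multi-class classification task, and the aim is to better understand how language models perform on concrete tasks with real-world value. Only 50 labeled examples are provided for each dataset.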

Dataset composition and specifications:

| Sub-dataset | Source data size | Sampled data size | Source dataset sample |
| --- | --- | --- | --- |
| Ade Corpus V2 | Training set (50), test set (5000) | 40 sampled from the test set | Sentence: No regional side effects were noted.<br>ID: 0<br>Label: 2 |
| Banking 77 | Training set (50), test set (5000) | 40 sampled from the test set | Query: Is it possible for me to change my PIN number?<br>ID: 0<br>Label: 23 |
| NeurIPS Impact Statement Risks | Training set (50), test set (150) | 40 sampled from the test set | Paper title: Auto-Panoptic: Cooperative Multi-Component Architecture Search for Panoptic Segmentation...<br>Paper link: https://proceedings.neurips.cc/paper/2020/file/ec1f764517b7ffb52057af6df18142b7-Paper.pdf...<br>Impact statement: This work makes the first attempt to search for all key components of panoptic pipeline and manages to accomplish this via the p...<br>ID: 0<br>Label: 1 |
| One Stop English | Training set (50), test set (516) | 40 sampled from the test set | Article: For 85 years, it was just a grey blob on classroom maps of the solar system. But, on 15 July, Pluto was seen in high resolution ...<br>ID: 0<br>Label: 3 |
| Overruling | Training set (50), test set (2350) | 40 sampled from the test set | Sentence: in light of both our holding today and previous rulings in johnson, dueser, and gronroos, we now explicitly overrule dupree....<br>ID: 0<br>Label: 2 |
| Semiconductor Org Types | Training set (50), test set (449) | 40 sampled from the test set | Paper title: 3Gb/s AC-coupled chip-to-chip communication using a low-swing pulse receiver...<br>Organization name: North Carolina State Univ.,Raleigh,NC,USA<br>ID: 0<br>Label: 3 |
| Systematic Review Inclusion | Training set (50), test set (2243) | 40 sampled from the test set | Title: Prototyping and transforming facial textures for perception research...<br>Abstract: Wavelet based methods for prototyping facial textures for artificially transforming the age of facial images were described. Pro...<br>Authors: Tiddeman, B.; Burt, M.; Perrett, D.<br>Journal: IEEE Comput Graphics Appl<br>ID: 0<br>Label: 2 |
| TAI Safety Research | Training set (50), test set (1639) | 40 sampled from the test set | Title: Malign generalization without internal search<br>Abstract Note: In my last post, I challenged the idea that inner alignment failures should be explained by appealing to agents which perform ex...<br>Url: https://www.alignmentforum.org/posts/ynt9TD6PrYw6iT49m/malign-generalization-without-internal-search...<br>Publication Year: 2020<br>Item Type: blogPost<br>Author: Barnett, Matthew<br>Publication Title: AI Alignment Forum<br>ID: 0<br>Label: 1 |
| Terms Of Service | Training set (50), test set (5000) | 40 sampled from the test set | Sentence: Crowdtangle may change these terms of service, as described above, notwithstanding any provision to the contrary in any agreemen...<br>ID: 0<br>Label: 2 |
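Each sub-dataset contributes 40 instances sampled from its test set. The source does not say how the sample is drawn; the sketch below assumes uniform sampling without replacement with a fixed seed so the subset is reproducible. The function name `sample_instances`, the seed value, and the placeholder records are all illustrative.

```python
import random


def sample_instances(instances: list, k: int = 40, seed: int = 0) -> list:
    """Draw k instances uniformly without replacement, reproducibly.

    Assumption: the actual sampling procedure is not given in the source.
    If fewer than k instances are available, all of them are returned.
    """
    if len(instances) <= k:
        return list(instances)
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return rng.sample(instances, k)


# Placeholder records standing in for a 5000-instance test set.
test_set = [{"ID": i} for i in range(5000)]
subset = sample_instances(test_set)
print(len(subset))
```

Using a dedicated `random.Random(seed)` instance keeps the draw independent of any other code that touches the global random state, so the same 40 instances are selected on every run.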

Dataset license information:

| Dataset | License |
| --- | --- |
| Ade Corpus V2 | Unlicensed |
| Banking 77 | CC BY 4.0 |
| NeurIPS Impact Statement Risks | MIT License/CC BY 4.0 |
| One Stop English | CC BY-SA 4.0 |
| Overruling | Unlicensed |
| Semiconductor Org Types | CC BY-NC 4.0 |
| Systematic Review Inclusion | CC BY 4.0 |
| TAI Safety Research | CC BY-SA 4.0 |
| Terms Of Service | Unlicensed |
| Tweet Eval Hate | Unlicensed |
| Twitter Complaints | Unlicensed |