Evaluation Dataset

All of the following datasets are converted into standard evaluation prompts before evaluation.

Dataset 1（IMDB）

Data description：

IMDB is a large film review dataset for binary sentiment classification, containing substantially more data than previous benchmark datasets. It provides 25,000 distinct film reviews for training and 25,000 for testing. Additional unlabeled data is also available.

Dataset structure：

Size of downloaded dataset files: 84.13 MB

Size of the generated dataset: 133.23 MB

Total amount of disk used: 217.35 MB

Amount of source data：

The dataset is split into train (25,000), test (25,000), and unsupervised (50,000).

Data field：

Key	Description
label	a classification label; possible values are neg (0) and pos (1)
text	a string feature
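
A record with these two fields can be checked before it is used. The following is a minimal sketch rather than part of any official tooling: the names `LABEL_NAMES` and `validate_record` are hypothetical, and the only facts taken from the table above are the two keys and the 0/1 label mapping.

```python
from typing import Any, Dict

# Mapping taken from the field table: 0 is negative, 1 is positive.
LABEL_NAMES = {0: "neg", 1: "pos"}


def validate_record(record: Dict[str, Any]) -> str:
    """Return the label name for a record, or raise ValueError if it is malformed."""
    label = record.get("label")
    text = record.get("text")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(label, int) or isinstance(label, bool) or label not in LABEL_NAMES:
        raise ValueError(f"unexpected label: {label!r}")
    if not isinstance(text, str):
        raise ValueError("text must be a string")
    return LABEL_NAMES[label]
```

Records that carry any other label value are rejected by this sketch; whether that is the desired behaviour depends on which split is being processed.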

Sample of source dataset：

{
    "label": 0,
    "text": "Goodbye world2\n"
}
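
This document states that datasets are turned into evaluation prompts but does not give the prompt format. The sketch below shows one way a record like the sample above could be rendered. The template wording, `PROMPT_TEMPLATE`, and `build_prompt` are all assumptions made for illustration, not the format actually used.

```python
from typing import Dict

# Hypothetical template: the document does not specify the real prompt wording.
PROMPT_TEMPLATE = (
    "Classify the sentiment of the following film review as neg or pos.\n"
    "Review: {text}\n"
    "Sentiment:"
)


def build_prompt(record: Dict[str, object]) -> str:
    """Fill the template with the review text, trimming surrounding whitespace."""
    return PROMPT_TEMPLATE.format(text=str(record["text"]).strip())


if __name__ == "__main__":
    print(build_prompt({"label": 0, "text": "Goodbye world2\n"}))
```

The label is deliberately left out of the prompt so that it can serve as the reference answer during scoring.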

Citation information：

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

Dataset 2（RAFT）

Metric: quasi-exact match
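
Quasi-exact match is not defined in this document. A common reading is exact match after light text normalisation, and the sketch below assumes that reading: lowercase, strip punctuation, drop English articles, and collapse whitespace. The function names and the exact normalisation steps are assumptions, not a confirmed description of the metric used here.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def quasi_exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalised strings are identical, otherwise 0.0."""
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0
```

Under these assumptions a prediction such as "The Answer." counts as matching the reference "answer", while a different label does not.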

Data description：

The Real-world Annotated Few-shot Tasks (RAFT) dataset is an aggregation of English-language datasets found in the real world. Associated with each dataset is a binary or multiclass classification task, intended to improve our understanding of how language models perform on tasks that have concrete, real-world value. Only 50 labeled examples are provided in each dataset.

Dataset structure：

Sub-dataset	Amount of source data	Amount of sampled data	Sample of source dataset
Ade Corpus V2	train（50） test（5000）	Sample from test of source dataset（40）	Sentence: No regional side effects were noted. ID: 0 Label: 2
Banking 77	train（50） test（5000）	Sample from test of source dataset（40）	Query: Is it possible for me to change my PIN number? ID: 0 Label: 23
NeurIPS Impact Statement Risks	train（50） test（150）	Sample from test of source dataset（40）	Paper title: Auto-Panoptic: Cooperative Multi-Component Architecture Search for Panoptic Segmentation... Paper link: https://proceedings.neurips.cc/paper/2020/file/ec1f764517b7ffb52057af6df18142b7-Paper.pdf... Impact statement: This work makes the first attempt to search for all key components of panoptic pipeline and manages to accomplish this via the p... ID: 0 Label: 1
One Stop English	train（50） test（516）	Sample from test of source dataset（40）	Article: For 85 years, it was just a grey blob on classroom maps of the solar system. But, on 15 July, Pluto was seen in high resolution ... ID: 0 Label: 3
Overruling	train（50） test（2350）	Sample from test of source dataset（40）	Sentence: in light of both our holding today and previous rulings in johnson, dueser, and gronroos, we now explicitly overrule dupree.... ID: 0 Label: 2
Semiconductor Org Types	train（50） test（449）	Sample from test of source dataset（40）	Paper title: 3Gb/s AC-coupled chip-to-chip communication using a low-swing pulse receiver... Organization name: North Carolina State Univ.,Raleigh,NC,USA ID: 0 Label: 3
Systematic Review Inclusion	train（50） test（2243）	Sample from test of source dataset（40）	Title: Prototyping and transforming facial textures for perception research... Abstract: Wavelet based methods for prototyping facial textures for artificially transforming the age of facial images were described. Pro... Authors: Tiddeman, B.; Burt, M.; Perrett, D. Journal: IEEE Comput Graphics Appl ID: 0 Label: 2
TAI Safety Research	train（50） test（1639）	Sample from test of source dataset（40）	Title: Malign generalization without internal search Abstract Note: In my last post, I challenged the idea that inner alignment failures should be explained by appealing to agents which perform ex... Url: https://www.alignmentforum.org/posts/ynt9TD6PrYw6iT49m/malign-generalization-without-internal-search... Publication Year: 2020 Item Type: blogPost Author: Barnett, Matthew Publication Title: AI Alignment Forum ID: 0 Label: 1
Terms Of Service	train（50） test（5000）	Sample from test of source dataset（40）	Sentence: Crowdtangle may change these terms of service, as described above, notwithstanding any provision to the contrary in any agreemen... ID: 0 Label: 2
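
The table lists 40 examples sampled from each test split but does not say how they are drawn. A minimal sketch, assuming uniform sampling without replacement and a fixed seed for reproducibility (the function name `sample_test_split` and the seed value are hypothetical):

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_test_split(examples: Sequence[T], k: int = 40, seed: int = 0) -> List[T]:
    """Draw k examples without replacement; a fixed seed keeps the subset reproducible."""
    if k > len(examples):
        raise ValueError("cannot sample more examples than the split contains")
    # A private Random instance avoids touching the global random state.
    return random.Random(seed).sample(examples, k)
```

Calling the function twice with the same input and seed returns the same 40 examples, which keeps evaluation runs comparable.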

Licensing information：

Dataset	License
Ade Corpus V2	Unlicensed
Banking 77	CC BY 4.0
NeurIPS Impact Statement Risks	MIT License/CC BY 4.0
One Stop English	CC BY-SA 4.0
Overruling	Unlicensed
Semiconductor Org Types	CC BY-NC 4.0
Systematic Review Inclusion	CC BY 4.0
TAI Safety Research	CC BY-SA 4.0
Terms Of Service	Unlicensed
Tweet Eval Hate	Unlicensed
Twitter Complaints	Unlicensed
