Evaluation Dataset

The following datasets are all transformed into standard Evaluation Prompts before evaluation.

Dataset 1 (IMDB)

#Metrics-Quasi-Exact match
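
The metric named above is not defined on this page. A minimal sketch follows, assuming the common reading of quasi-exact match in which both strings are normalized (lowercased, punctuation and English articles removed, whitespace collapsed) before an exact comparison. The function names `normalize_text` and `quasi_exact_match` are illustrative and not taken from this page.

```python
import re
import string


def normalize_text(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def quasi_exact_match(gold: str, prediction: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return 1.0 if normalize_text(gold) == normalize_text(prediction) else 0.0


# A model answer that differs only in case, spacing, and punctuation still matches.
print(quasi_exact_match("pos", " Pos. "))
print(quasi_exact_match("pos", "neg"))
```

Under this assumption, a prediction is scored on its normalized form, so formatting noise in a model's output does not count as an error, while a different label still does.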

Data description:

IMDB is a large film review dataset for binary sentiment classification, containing substantially more data than previous benchmark datasets. It provides 25,000 distinct film reviews for training and 25,000 for testing. Additional unlabeled data is also available.

Dataset structure:

Size of downloaded dataset files: 84.13 MB

Size of the generated dataset: 133.23 MB

Total amount of disk used: 217.35 MB

Amount of source data:

The dataset is split into train (25,000), test (25,000), and unsupervised (50,000).

Data field:

| Key | Explanation |
| --- | --- |
| label | a classification label, with possible values neg (0) and pos (1) |
| text | a string feature |

Sample of source dataset:

{
    "label": 0,
    "text": "Goodbye world2\n"
}
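
As a rough illustration of how a record like the one above could be turned into an evaluation prompt, the sketch below parses the sample and pairs a prompt with its expected answer. The label mapping comes from the data field description; the prompt template and the name `record_to_prompt` are hypothetical, since the actual prompt format is not given here.

```python
import json
from typing import Tuple

# Mapping taken from the data field description: neg (0), pos (1).
LABEL_NAMES = {0: "neg", 1: "pos"}

# Hypothetical template; the real evaluation prompt format is not specified.
PROMPT_TEMPLATE = "Review: {text}\nSentiment (neg or pos):"


def record_to_prompt(record: dict) -> Tuple[str, str]:
    """Return (prompt, expected_answer) for one IMDB-style record."""
    prompt = PROMPT_TEMPLATE.format(text=record["text"].strip())
    return prompt, LABEL_NAMES[record["label"]]


# The sample record from the source dataset, as a JSON string.
raw = '{"label": 0, "text": "Goodbye world2\\n"}'
prompt, answer = record_to_prompt(json.loads(raw))
print(prompt)
print(answer)
```

The expected answer is kept as a label name rather than an integer so that it can be compared directly against a model's text output.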

Citation information:

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

Dataset 2 (RAFT)

#Metrics-Quasi-Exact match

Data description:

The Real-world Annotated Few-shot Tasks (RAFT) dataset is an aggregation of English-language datasets found in the real world. Associated with each dataset is a binary or multiclass classification task, intended to improve our understanding of how language models perform on tasks that have concrete, real-world value. Only 50 labeled examples are provided in each dataset.

Dataset structure:

Each sub-dataset is listed with its amount of source data, its amount of sampled data, and a sample of the source dataset.

Sub-dataset: Ade Corpus V2
Amount of source data: train(50), test(5000)
Amount of sampled data: sample from test of source dataset (40)
Sample of source dataset:
    Sentence: No regional side effects were noted.
    ID: 0
    Label: 2

Sub-dataset: Banking 77
Amount of source data: train(50), test(5000)
Amount of sampled data: sample from test of source dataset (40)
Sample of source dataset:
    Query: Is it possible for me to change my PIN number?
    ID: 0
    Label: 23

Sub-dataset: NeurIPS Impact Statement Risks
Amount of source data: train(50), test(150)
Amount of sampled data: sample from test of source dataset (40)
Sample of source dataset:
    Paper title: Auto-Panoptic: Cooperative Multi-Component Architecture Search for Panoptic Segmentation...
    Paper link: https://proceedings.neurips.cc/paper/2020/file/ec1f764517b7ffb52057af6df18142b7-Paper.pdf...
    Impact statement: This work makes the first attempt to search for all key components of panoptic pipeline and manages to accomplish this via the p...
    ID: 0
    Label: 1

Sub-dataset: One Stop English
Amount of source data: train(50), test(516)
Amount of sampled data: sample from test of source dataset (40)
Sample of source dataset:
    Article: For 85 years, it was just a grey blob on classroom maps of the solar system. But, on 15 July, Pluto was seen in high resolution ...
    ID: 0
    Label: 3

Sub-dataset: Overruling
Amount of source data: train(50), test(2350)
Amount of sampled data: sample from test of source dataset (40)
Sample of source dataset:
    Sentence: in light of both our holding today and previous rulings in johnson, dueser, and gronroos, we now explicitly overrule dupree....
    ID: 0
    Label: 2

Sub-dataset: Semiconductor Org Types
Amount of source data: train(50), test(449)
Amount of sampled data: sample from test of source dataset (40)
Sample of source dataset:
    Paper title: 3Gb/s AC-coupled chip-to-chip communication using a low-swing pulse receiver...
    Organization name: North Carolina State Univ.,Raleigh,NC,USA
    ID: 0
    Label: 3

Sub-dataset: Systematic Review Inclusion
Amount of source data: train(50), test(2243)
Amount of sampled data: sample from test of source dataset (40)
Sample of source dataset:
    Title: Prototyping and transforming facial textures for perception research...
    Abstract: Wavelet based methods for prototyping facial textures for artificially transforming the age of facial images were described. Pro...
    Authors: Tiddeman, B.; Burt, M.; Perrett, D.
    Journal: IEEE Comput Graphics Appl
    ID: 0
    Label: 2

Sub-dataset: TAI Safety Research
Amount of source data: train(50), test(1639)
Amount of sampled data: sample from test of source dataset (40)
Sample of source dataset:
    Title: Malign generalization without internal search
    Abstract Note: In my last post, I challenged the idea that inner alignment failures should be explained by appealing to agents which perform ex...
    Url: https://www.alignmentforum.org/posts/ynt9TD6PrYw6iT49m/malign-generalization-without-internal-search...
    Publication Year: 2020
    Item Type: blogPost
    Author: Barnett, Matthew
    Publication Title: AI Alignment Forum
    ID: 0
    Label: 1

Sub-dataset: Terms Of Service
Amount of source data: train(50), test(5000)
Amount of sampled data: sample from test of source dataset (40)
Sample of source dataset:
    Sentence: Crowdtangle may change these terms of service, as described above, notwithstanding any provision to the contrary in any agreemen...
    ID: 0
    Label: 2
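
The listing above states that 40 examples are sampled from each sub-dataset's test split, but not how they are chosen. The sketch below shows one way such a subset could be drawn, assuming uniform random sampling without replacement and a fixed seed for reproducibility. The function name `sample_test_split`, the seed value, and the placeholder rows are all assumptions for illustration.

```python
import random


def sample_test_split(test_rows: list, k: int = 40, seed: int = 0) -> list:
    """Draw k rows without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(test_rows, k)


# Stand-in rows sized like the Ade Corpus V2 test split (5000 rows).
# The sentence text is a placeholder, not real data.
test_rows = [{"ID": i, "Sentence": f"placeholder sentence {i}"} for i in range(5000)]

sampled = sample_test_split(test_rows)
print(len(sampled))
```

Using a dedicated `random.Random` instance keeps the draw independent of any other random state in the program, so the same seed always selects the same 40 rows.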

Licensing information:

| Dataset | License |
| --- | --- |
| Ade Corpus V2 | Unlicensed |
| Banking 77 | CC BY 4.0 |
| NeurIPS Impact Statement Risks | MIT License/CC BY 4.0 |
| One Stop English | CC BY-SA 4.0 |
| Overruling | Unlicensed |
| Semiconductor Org Types | CC BY-NC 4.0 |
| Systematic Review Inclusion | CC BY 4.0 |
| TAI Safety Research | CC BY-SA 4.0 |
| Terms Of Service | Unlicensed |
| Tweet Eval Hate | Unlicensed |
| Twitter Complaints | Unlicensed |