Multi Token Completion - Registry of Open Data on AWS

Description

This dataset provides masked sentences together with the multi-token phrases that were masked out of them. We offer three datasets: a general-purpose dataset extracted from the Wikipedia and Books corpora, and two additional datasets extracted from PubMed abstracts. Please be aware that the PubMed data does not reflect the most current or accurate data available from NLM (it is not being updated).

For these datasets, the columns provided for each data point are as follows:

text - the original sentence
span - the span (phrase) that is masked out
span_lower - the lowercase version of span
range - the range in the text string that is masked out (this matters because span might appear more than once in text)
freq - the corpus frequency of span_lower
masked_text - the masked version of text, in which span is replaced with [MASK]

Additionally, we provide a small (3K) dataset with human annotations.
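The relationship between text, span, range, and masked_text can be illustrated with a minimal Python sketch. It assumes that range is a pair of character offsets (start, end) with an exclusive end; the description does not state the exact encoding, so check the README before relying on this. The example row and the helper name mask_span are hypothetical and are not taken from the dataset.

```python
from typing import Tuple


def mask_span(text: str, span_range: Tuple[int, int], mask_token: str = "[MASK]") -> str:
    """Replace the characters text[start:end] with mask_token.

    Assumption: span_range is (start, end) in character offsets, end exclusive.
    """
    start, end = span_range
    return text[:start] + mask_token + text[end:]


# Hypothetical row in which the span occurs twice in the sentence.
text = "New York is bigger than York."
span = "York"
start = text.rindex(span)  # offset of the second occurrence
row = {
    "text": text,
    "span": span,
    "span_lower": span.lower(),
    "range": (start, start + len(span)),
}

# Masking by range targets the intended occurrence.
masked_text = mask_span(row["text"], row["range"])

# A naive string replacement masks the first occurrence instead,
# which is why the range column is needed.
naive = row["text"].replace(row["span"], "[MASK]", 1)

print(masked_text)
print(naive)
```

Here the range-based result masks the second "York", while the naive replacement masks the first one.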

Update Frequency

Not currently being updated

License

Datasets are published under CC-NC-SA-3.0. Human evaluation is published under CC-SA-4.0.

Documentation

https://multi-token-completion.s3.amazonaws.com/README.txt

Managed By

Amazon

Contact

guyk@amazon.com, orenk@amazon.com

How to Cite

Multi Token Completion was accessed on DATE from https://registry.opendata.aws/multi-token-completion.

Usage Examples

Publications

Simple and Effective Multi-Token Completion from Masked Language Models by Oren Kalinsky, Guy Kushilevitz, Alex Libov & Yoav Goldberg

Resources on AWS

Description

multi-token-completion Datasets

Resource type

S3 Bucket

Amazon Resource Name (ARN)

arn:aws:s3:::multi-token-completion

AWS Region

us-east-1

AWS CLI Access (No AWS account required)

aws s3 ls --no-sign-request s3://multi-token-completion/
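For readers without the AWS CLI, the following Python sketch builds an HTTPS URL for an object in the bucket, using only the standard library. It follows the pattern of the README URL listed under Documentation; that other objects are readable the same way is an assumption, and the helper names object_url and fetch_text are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlopen

BUCKET = "multi-token-completion"


def object_url(key: str) -> str:
    """Build a URL for a key, mirroring the README URL pattern.

    Assumption: public objects are served at https://<bucket>.s3.amazonaws.com/<key>.
    """
    # quote() keeps "/" intact, so nested keys stay readable.
    return f"https://{BUCKET}.s3.amazonaws.com/{quote(key)}"


def fetch_text(key: str, timeout: float = 30.0) -> str:
    """Download an object and decode it as UTF-8 (encoding is an assumption)."""
    with urlopen(object_url(key), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_text("README.txt") needs network access and should return the documentation file referenced above.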