
Multi Token Completion

amazon.science machine learning natural language processing

Description

This dataset provides masked sentences together with the multi-token phrases that were masked out of them. We offer 3 datasets: a general-purpose dataset extracted from the Wikipedia and Books corpora, and 2 additional datasets extracted from PubMed abstracts. Please be aware that the PubMed data does not reflect the most current or accurate data available from NLM, because it is not being updated.

For these datasets, the columns provided for each datapoint are as follows:

  • text: the original sentence
  • span: the span (phrase) which is masked out
  • span_lower: the lowercase version of span
  • range: the range in the text string which will be masked out (this is important because span might appear more than once in text)
  • freq: the corpus frequency of span_lower
  • masked_text: the masked version of text, in which span is replaced with [MASK]

Additionally, we provide a small (3K) dataset with human annotations.
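The relationship between text, span, range and masked_text can be illustrated with a short sketch. It assumes that range is a pair of character offsets (start inclusive, end exclusive); the README linked under Documentation should be checked for the actual encoding. The helper name mask_span and the example sentence are hypothetical and are not taken from the dataset.

```python
def mask_span(text, start, end, mask_token="[MASK]"):
    """Replace text[start:end] with mask_token.

    Assumes (start, end) are character offsets with an exclusive end;
    this is an illustrative guess at how the range column is encoded.
    """
    if not (0 <= start < end <= len(text)):
        raise ValueError("range must lie inside the text")
    return text[:start] + mask_token + text[end:]


# Hypothetical example: the span occurs twice, so the span string alone
# does not say which occurrence was masked. The range removes the ambiguity.
text = "the cat sat near the cat flap"
span = "the cat"
start = text.rfind(span)      # choose the second occurrence
end = start + len(span)

masked_text = mask_span(text, start, end)
span_lower = span.lower()
print(masked_text)
```

A plain text.replace(span, "[MASK]") would mask both occurrences here, which is why the sketch works from the offsets instead of the span string.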

Update Frequency

Not currently being updated

License

Datasets are published under CC-NC-SA-3.0. Human evaluation is published under CC-SA-4.0.

Documentation

https://multi-token-completion.s3.amazonaws.com/README.txt

Managed By

Amazon

Contact

guyk@amazon.com, orenk@amazon.com

How to Cite

Multi Token Completion was accessed on DATE from https://registry.opendata.aws/multi-token-completion.

Usage Examples

Publications

Resources on AWS

  • Description: multi-token-completion Datasets
    Resource type: S3 Bucket
    Amazon Resource Name (ARN): arn:aws:s3:::multi-token-completion
    AWS Region: us-east-1
    AWS CLI Access (No AWS account required): aws s3 ls --no-sign-request s3://multi-token-completion/
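For readers without the AWS CLI, the following is a minimal sketch of anonymous access using only the Python standard library. It assumes the bucket also permits unsigned listing over HTTPS through the S3 REST API, as the --no-sign-request command suggests; the helper names object_url, parse_keys and list_keys are hypothetical. Only README.txt is a key confirmed by this page, so no other file names are assumed.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from typing import List

# Virtual-hosted bucket URL, as used by the README link on this page.
BUCKET_URL = "https://multi-token-completion.s3.amazonaws.com"
# XML namespace used by S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def object_url(key: str) -> str:
    """Build the public HTTPS URL for an object key in the bucket."""
    return f"{BUCKET_URL}/{urllib.parse.quote(key)}"


def parse_keys(listing_xml: str) -> List[str]:
    """Extract object keys from an S3 ListBucketResult XML document."""
    root = ET.fromstring(listing_xml)
    return [el.text for el in root.findall("s3:Contents/s3:Key", S3_NS)]


def list_keys(prefix: str = "") -> List[str]:
    """Fetch one page of keys (S3 returns at most 1000 per request).

    Performs an unsigned network request; assumes anonymous listing is allowed.
    """
    query = urllib.parse.urlencode({"list-type": "2", "prefix": prefix})
    with urllib.request.urlopen(f"{BUCKET_URL}/?{query}") as resp:
        return parse_keys(resp.read().decode("utf-8"))


# Usage (requires network access):
#   for key in list_keys():
#       print(object_url(key))
```

The sketch reads only the first page of results; a complete listing would follow the continuation token that S3 returns when a listing is truncated.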
