Phrase Clustering Dataset (PCD)

Tags: json, natural language processing


This dataset accompanies the paper "McPhraSy: Multi-Context Phrase Similarity and Clustering" by DN Cohen et al. (2022). PCD is intended for evaluating the quality of semantic clustering of noun phrases. The phrases were collected from the Amazon Review Dataset.

Update Frequency

Not updated


License

This data is available for anyone to use under the terms of the CDLA-Permissive license.


Managed By

See all datasets managed by Amazon.


Contact

Post any questions to re:Post and use the AWS Open Data tag.

How to Cite

Phrase Clustering Dataset (PCD) was accessed on DATE from

Resources on AWS

  • Description
    Phrase Clustering Dataset (PCD)
    Resource type
    S3 Bucket
    Amazon Resource Name (ARN)
    AWS Region
    AWS CLI Access (No AWS account required)
    aws s3 ls --no-sign-request s3://amazon-phrase-clustering/
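
For readers without the AWS CLI, the same anonymous listing can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: it assumes the bucket allows unsigned ListObjectsV2 requests over HTTPS (which the `--no-sign-request` command above suggests) and that the global `s3.amazonaws.com` endpoint resolves to the bucket, since this entry does not state the region. The helper names `listing_url`, `parse_listing`, and `list_bucket` are hypothetical and introduced here for illustration only.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Bucket name taken from the CLI command above.
BUCKET = "amazon-phrase-clustering"
# XML namespace used by S3 list responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def listing_url(bucket, prefix=""):
    """Build an unsigned ListObjectsV2 URL.

    Assumption: the global S3 endpoint is used because the bucket's
    region is not given in this entry.
    """
    query = {"list-type": "2"}
    if prefix:
        query["prefix"] = prefix
    return f"https://{bucket}.s3.amazonaws.com/?{urllib.parse.urlencode(query)}"


def parse_listing(xml_data):
    """Return (key, size_in_bytes) pairs from a ListBucketResult document."""
    root = ET.fromstring(xml_data)
    return [
        (item.findtext(f"{S3_NS}Key"), int(item.findtext(f"{S3_NS}Size")))
        for item in root.iter(f"{S3_NS}Contents")
    ]


def list_bucket(bucket=BUCKET, prefix=""):
    """Fetch and parse the first page of the listing (S3 returns at most 1,000 keys per page)."""
    with urllib.request.urlopen(listing_url(bucket, prefix), timeout=30) as resp:
        return parse_listing(resp.read())
```

Calling `list_bucket()` performs one network request and returns the object keys with their sizes; nothing is downloaded until a key is fetched separately. Pagination is deliberately left out to keep the sketch short.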
