
The Massively Multilingual Image Dataset (MMID)

computer vision, machine learning, machine translation, natural language processing


MMID is a large-scale, massively multilingual dataset of images paired with the words they represent, collected at the University of Pennsylvania. The dataset is doubly parallel: for each language, each word is stored alongside images that represent it and alongside its English translation (with that translation's corresponding images).

Update Frequency

Language data is added as it is ready for distribution.


See citation instructions at


Managed By

Penn NLP



How to Cite

The Massively Multilingual Image Dataset (MMID) was accessed on DATE from

Resources on AWS

  • Description
    Images for words in various languages, packaged in .tar archives by language.
    Resource type
    S3 Bucket
    Amazon Resource Name (ARN)
    AWS Region
    AWS CLI Access (No AWS account required)
    aws s3 ls --no-sign-request s3://mmid-pds/
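
For readers who prefer not to install the AWS CLI, the same anonymous listing can be approximated from Python's standard library. This is a minimal sketch, not an official client: it assumes the bucket accepts unauthenticated requests at the virtual-hosted endpoint `https://mmid-pds.s3.amazonaws.com/` (consistent with the `--no-sign-request` command above), and it uses the S3 ListObjectsV2 REST call. The function names `listing_url`, `parse_listing`, and `list_objects` are illustrative helpers, not part of any library. Only the first page of results (up to 1,000 keys) is fetched.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET = "mmid-pds"  # bucket name taken from the CLI command above
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"  # XML namespace of S3 list responses


def listing_url(bucket, prefix="", max_keys=1000):
    """Build an unsigned ListObjectsV2 URL (assumes the virtual-hosted endpoint)."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_data):
    """Return (key, size_in_bytes) pairs from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_data)
    return [
        (item.findtext(f"{S3_NS}Key"), int(item.findtext(f"{S3_NS}Size")))
        for item in root.iter(f"{S3_NS}Contents")
    ]


def list_objects(bucket=BUCKET, prefix=""):
    """Fetch and parse the first page of the bucket listing (no pagination)."""
    with urllib.request.urlopen(listing_url(bucket, prefix)) as response:
        return parse_listing(response.read())


if __name__ == "__main__":
    # Requires network access; prints object sizes and keys.
    for key, size in list_objects():
        print(f"{size:>14}  {key}")
```

Unlike `aws s3 ls`, which groups keys by "/" at the top level, this sketch returns a flat list of object keys. Listings longer than one page would need the continuation token from the response, which is omitted here for brevity.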
