Registry of Open Data on AWS

About

This registry exists to help people discover and share datasets that are available via AWS resources. See recent additions and learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry tagged with labeled.

You are currently viewing a subset of data tagged with labeled.
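As a rough illustration of what viewing a tagged subset means, the sketch below filters a small in-memory list of entries by tag. The dictionary layout, the `with_tags` helper, and the third entry are assumptions made for illustration; the first two names and their tags are taken from this listing, and nothing here reflects how the registry itself is implemented.

```python
# Hypothetical in-memory records. The first two names and their tags come
# from this listing; the dict layout and the last entry are made up.
DATASETS = [
    {"name": "RarePlanes",
     "tags": ["computer vision", "deep learning", "earth observation",
              "geospatial", "labeled", "machine learning", "satellite imagery"]},
    {"name": "PD12M",
     "tags": ["art", "deep learning", "image processing", "labeled",
              "machine learning", "media"]},
    {"name": "Example untagged dataset", "tags": ["geospatial"]},
]


def with_tags(datasets, *required):
    """Return names of datasets that carry every tag in `required`."""
    wanted = {tag.lower() for tag in required}
    return [d["name"] for d in datasets
            if wanted <= {tag.lower() for tag in d["tags"]}]


labeled_only = with_tags(DATASETS, "labeled")
labeled_geo = with_tags(DATASETS, "labeled", "geospatial")
print(labeled_only)
print(labeled_geo)
```

Passing more than one tag narrows the subset, because an entry must carry every requested tag to be returned.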

Add to this registry

If you want to add a dataset or example of how to use a dataset to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository.

Unless specifically stated in the applicable dataset documentation, datasets available through the Registry of Open Data on AWS are not provided and maintained by AWS. Datasets are provided and maintained by a variety of third parties under a variety of licenses. Please check dataset licenses and related documentation to determine if a dataset may be used for your application.

Tell us about your project

If you have a project using a listed dataset, please tell us about it. We may work with you to feature your project in a blog post.

Radiant MLHub

cog, earth observation, environmental, geospatial, labeled, machine learning, satellite imagery, stac

Radiant MLHub is an open library for geospatial training data that hosts datasets generated by Radiant Earth Foundation's team as well as other training data catalogs contributed by Radiant Earth’s partners. Radiant MLHub is open to anyone to access, store, register and/or share their training datasets for high-quality Earth observations. All of the training datasets are stored using a SpatioTemporal Asset Catalog (STAC) compliant catalog and exposed through a common API. Training datasets include pairs of imagery and labels for different types of machine learning problems including image ...
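To make the idea of STAC-described imagery and label pairs concrete, here is a minimal sketch that reads one STAC Item and pairs its imagery asset with its label asset. The item id, the `example.com` hrefs, the asset keys, and the use of the `data` and `labels` roles are all assumptions for illustration; they are not taken from Radiant MLHub, and its API is not called here.

```python
import json

# Hypothetical STAC Item trimmed to the fields used below. The id, hrefs,
# asset keys, and role names are illustrative, not Radiant MLHub values.
ITEM_JSON = """
{
  "type": "Feature",
  "stac_version": "1.0.0",
  "id": "example-chip-0001",
  "bbox": [36.0, -1.5, 36.1, -1.4],
  "properties": {"datetime": "2020-01-15T00:00:00Z"},
  "assets": {
    "image": {"href": "https://example.com/chips/0001/image.tif",
              "type": "image/tiff; application=geotiff",
              "roles": ["data"]},
    "labels": {"href": "https://example.com/chips/0001/labels.geojson",
               "type": "application/geo+json",
               "roles": ["labels"]}
  }
}
"""


def assets_with_role(item, role):
    """Return {asset_key: href} for assets whose 'roles' list contains role."""
    return {
        key: asset["href"]
        for key, asset in item.get("assets", {}).items()
        if role in asset.get("roles", [])
    }


def image_label_pair(item):
    """Pair the first 'data' asset with the first 'labels' asset, else None."""
    images = assets_with_role(item, "data")
    labels = assets_with_role(item, "labels")
    if not images or not labels:
        return None
    return next(iter(images.values())), next(iter(labels.values()))


item = json.loads(ITEM_JSON)
pair = image_label_pair(item)
print(item["id"], pair)
```

A real catalog may organize imagery and labels differently, for example as separate linked items, so the pairing logic would need to follow whatever convention that catalog documents.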

Usage examples

See 8 usage examples →

RarePlanes

computer vision, deep learning, earth observation, geospatial, labeled, machine learning, satellite imagery

RarePlanes is a unique open-source machine learning dataset from CosmiQ Works and AI.Reverie that incorporates both real and synthetically generated satellite imagery. The RarePlanes dataset specifically focuses on the value of AI.Reverie synthetic data to aid computer vision algorithms in their ability to automatically detect aircraft and their attributes in satellite imagery. Although other synthetic/real combination datasets exist, RarePlanes is the largest openly available very-high-resolution dataset built to test the value of synthetic data from an overhead perspective. The real portion ...

Usage examples

Automatically compress and archive satellite imagery for Amazon S3 by Newel Hirst, Joseph Fahimi, and Justin Downes
RarePlanes Codebase by Thomas Hossler and Jacob Shermeyer
Notebook for training and testing YOLTv4 by Adam Van Etten
Announcing YOLTv4: Improved Satellite Imagery Object Detection by Adam Van Etten
RarePlanes: Synthetic Data Takes Flight by Jacob Shermeyer, Thomas Hossler, Adam Van Etten, Daniel Hogan, Ryan Lewis, Daeil Kim

See 6 usage examples →

High resolution, annual cropland and landcover maps for selected African countries

agriculture, cog, deep learning, labeled, land cover, machine learning, satellite imagery

High resolution, annual cropland and landcover maps for selected African countries developed by Clark University's Agricultural Impacts Research Group using various machine learning approaches applied to Planet imagery, including field boundary and cultivated frequency maps, as well as multi-class land cover.

Usage examples

High resolution, annual maps of field boundaries for smallholder-dominated croplands at national scales by Estes et al. (2022)
Accessing and downloading data by Rahebe Abedi
A super-ensemble approach to map land cover types with high resolution over data-sparse African savanna landscapes by Song et al. (2023)
Final report-Phase 1: Creating open agricultural maps and ground truth data to better deliver farm extension services by Estes et al. (2022)
Final report-Phase 2: Creating next generation field boundary and crop type maps: Rigorous multi-scale groundtruth provides sustainable extension services for smallholders by Wussah et al. (2022)

See 5 usage examples →

A region-wide, multi-year set of crop field boundary labels for Africa

agriculture, cog, labeled, land cover, machine learning, satellite imagery

Crop field boundaries digitized in Planet imagery collected across Africa between 2017 and 2023, developed by Farmerline, Spatial Collective, and the Agricultural Impacts Research Group at Clark University, with support from the Lacuna Fund (Estes et al., 2024; ...

Usage examples

A region-wide, multi-year set of crop field boundary labels for Africa by Wussah et al. (2023)
Generalization enhancement strategies to enable cross-year cropland mapping with convolutional neural networks trained using historical samples by Khallaghi et al. (2025)
A region-wide, multi-year set of crop field boundary labels for Africa by Estes et al. (2024)
High resolution, annual maps of field boundaries for smallholder-dominated croplands at national scales by Estes et al. (2022)
Technical report on label development and processing by Wussah et al. (2023)

See 7 usage examples →

PD12M

art, deep learning, image processing, labeled, machine learning, media

PD12M is a collection of 12.4 million CC0/PD image-caption pairs for the purpose of training generative image models.

Usage examples

Hugging Face Dataset by Spawning
PD12M: A Large-Scale Image Captioning Dataset by Jordan Meyer, Nick Padgett, Laura Exline, Cullen Miller
Datasheet by Spawning
Downloading Images by Spawning
Working with the Metadata by Spawning

See 6 usage examples →

Sophos/ReversingLabs 20 Million malware detection dataset

cyber security, deep learning, labeled, machine learning

A dataset intended to support research on machine learning techniques for detecting malware. It includes metadata and EMBER-v2 features for approximately 10 million benign and 10 million malicious Portable Executable files, with disarmed but otherwise complete files for all malware samples. All samples are labeled using Sophos in-house labeling methods, have features extracted using the EMBER-v2 feature set, as well as metadata obtained via the pefile Python library, detection counts obtained via ReversingLabs telemetry, and additional behavioral tags that indicate the rough behavior of the sam...

Usage examples

SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection by Richard Harang and Ethan M Rudd
SOREL-20M quickstart by Richard Harang
SOREL-20M dataset interface code by Richard Harang and Ethan M Rudd

See 3 usage examples →

Consented Activities of People

activity detection, activity recognition, computer vision, labeled, machine learning, privacy, video

The Consented Activities of People (CAP) dataset is a fine-grained activity dataset for visual AI research curated using the Visym Collector platform.

Usage examples

Visym Collector by Visym Labs & Systems & Technology Research
OpenFAD - Open Fine Grained Activity Detection Challenge by Visym Labs & NIST

See 2 usage examples →

Emory Knee Radiograph (MRKR) dataset

bioinformatics, biology, computer vision, csv, health, imaging, labeled, life sciences, machine learning, medical image computing, medical imaging, radiology, x-ray

The Emory Knee Radiograph (MRKR) dataset is a large, demographically diverse collection of 503,261 knee radiographs from 83,011 patients, 40% of whom are African American. This dataset provides imaging data in DICOM format along with detailed clinical information, including patient-reported pain scores, diagnostic codes, and procedural codes, which are not commonly available in similar datasets. The MRKR dataset also features imaging metadata such as image laterality, view type, and presence of hardware, enhancing its value for research and model development. MRKR addresses significant gaps ...

Usage examples

Example Notebook by Emory-HITI
Emory Knee Radiograph Dataset by Brandon Price, Jason Adleberg, Kaesha Thomas, Zach Zaiman, Aawez Mansuri, Beatrice Brown-Mulry, Chima Okecheukwu, Judy Gichoya, Hari Trivedi

See 2 usage examples →

Orcasound - bioacoustic data for marine conservation

biodiversity, biology, coastal, conservation, deep learning, ecosystems, environmental, geospatial, labeled, machine learning, mapping, oceans, open source software, signal processing

Live-streamed and archived audio data (~2018-present) from underwater microphones (hydrophones) containing marine biological signals as well as ambient ocean noise. Hydrophone placement and passive acoustic monitoring effort prioritize detection of orca sounds (calls, clicks, whistles) and potentially harmful noise. Geographic focus is on the US/Canada critical habitat of Southern Resident killer whales (northern CA to central BC) with initial focus on inland waters of WA. In addition to the raw lossy or lossless compressed data, we provide a growing archive of annotated bioacoustic bouts.

Usage examples

GitHub for our open source projects by Orcasound open source community

See 1 usage example →

RSNA Abdominal Trauma Detection (RSNA-ABT)

computed tomography, computer vision, csv, labeled, life sciences, machine learning, medical image computing, medical imaging, radiology, x-ray tomography

Blunt force abdominal trauma is among the most common types of traumatic injury, with the most frequent cause being motor vehicle accidents. Abdominal trauma may result in damage and internal bleeding of the internal organs, including the liver, spleen, kidneys, and bowel. Detection and classification of injuries are key to effective treatment and favorable outcomes. A large proportion of patients with abdominal trauma require urgent surgery. Abdominal trauma often cannot be diagnosed clinically by physical exam, patient symptoms, or laboratory tests. Prompt diagnosis of abdominal trauma using...

Usage examples

The RSNA Abdominal Traumatic Injury CT (RATIC) Dataset by Rudie, Jeffrey D.

See 1 usage example →

RSNA Cervical Spine Fracture Detection (RSNA-CSF) Dataset

computed tomography, computer vision, csv, labeled, life sciences, machine learning, medical image computing, medical imaging, radiology, x-ray tomography

Over 1.5 million spine fractures occur annually in the United States alone, resulting in over 17,730 spinal cord injuries each year. The most common site of spine fracture is the cervical spine. There has been a rise in the incidence of spinal fractures in the elderly, and in this population fractures can be more difficult to detect on imaging due to degenerative disease and osteoporosis. Imaging diagnosis of adult spine fractures is now almost exclusively performed with computed tomography (CT). Quickly detecting and determining the location of any vertebral fractures is essential to prevent ne...

Usage examples

The RSNA Cervical Spine Fracture CT Dataset by Ming, Hui Lin

See 1 usage example →

RSNA Intracranial Hemorrhage Detection

computed tomography, computer vision, csv, labeled, life sciences, machine learning, medical image computing, medical imaging, radiology, x-ray tomography

RSNA assembled this dataset in 2019 for the RSNA Intracranial Hemorrhage Detection AI Challenge (https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection/). De-identified head CT studies were provided by four research institutions. A group of over 60 volunteer expert radiologists recruited by RSNA and the American Society of Neuroradiology labeled over 25,000 exams for the presence and subtype classification of acute intracranial hemorrhage.

Usage examples

Construction of a Machine Learning Dataset through Collaboration: The RSNA 2019 Brain CT Hemorrhage Challenge by Rudie, Jeffrey D.

See 1 usage example →

RSNA Pulmonary Embolism Detection

computed tomography, computer vision, csv, labeled, life sciences, machine learning, medical image computing, medical imaging, radiology, x-ray tomography

RSNA assembled this dataset in 2020 for the RSNA STR Pulmonary Embolism Detection AI Challenge (https://www.kaggle.com/c/rsna-str-pulmonary-embolism-detection/). With more than 12,000 CT pulmonary angiography (CTPA) studies contributed by five international research centers, it is the largest publicly available annotated PE dataset. RSNA collaborated with the Society of Thoracic Radiology to recruit more than 80 expert thoracic radiologists who labeled the dataset with detailed clinical annotations.

Usage examples

The RSNA Pulmonary Embolism CT Dataset by Colak, Errol

See 1 usage example →

Gulfwide Avian Colony Monitoring Survey Photos

biology, conservation, ecosystems, environmental, labeled, object detection

For this project, The Water Institute (the Institute) and subcontractor Colibri Ecological Consulting, LLC (Colibri) utilized established methods and protocols capable of assessing changes of colonial waterbird populations and their important habitats within individual states and the broader northern Gulf of Mexico region. Data collection activities included: Aerial Photographic Nest Surveys: Implementation of fixed-wing aircraft surveys intended to assess waterbird colonies and document associated nesting within select portions of the northern Gulf of Mexico. Additional detail is provide...

RSNA Screening Mammography Breast Cancer Detection (RSNA-SMBC) Dataset

breast cancer, cancer, computer vision, csv, labeled, life sciences, machine learning, mammography, medical image computing, medical imaging, radiology

According to the WHO, breast cancer is the most commonly occurring cancer worldwide. In 2020 alone, there were 2.3 million new breast cancer diagnoses and 685,000 deaths. Yet breast cancer mortality in high-income countries has dropped by 40% since the 1980s when health authorities implemented regular mammography screening in age groups considered at risk. Early detection and treatment are critical to reducing cancer fatalities, and your machine learning skills could help streamline the process radiologists use to evaluate screening mammograms. Currently, early detection of breast cancer requi...

YouTube 8 Million - Data Lakehouse Ready

amazon.science, computer vision, labeled, machine learning, parquet, video

This dataset contains both the original .tfrecords and a Parquet representation of the YouTube 8 Million dataset. YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities. It comes with precomputed audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This dataset also includes the YouTube-8M Segments data from June 2019. This dataset is 'Lakehouse Ready', meaning you can query this data in-place straight out of...

Usage examples

Data Lake as Code Deployment Guide by AWS Industry Blueprints Team
YouTube 8 Million by Google Research

See 2 usage examples →