Registry of Open Data on AWS

About

This registry exists to help people discover and share datasets that are available via AWS resources. See recent additions and learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry tagged with machine learning.

Search datasets (currently 13 matching datasets)

You are currently viewing a subset of data tagged with machine learning.

Add to this registry

If you want to add a dataset or example of how to use a dataset to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository.

Unless specifically stated in the applicable dataset documentation, datasets available through the Registry of Open Data on AWS are not provided and maintained by AWS. Datasets are provided and maintained by a variety of third parties under a variety of licenses. Please check dataset licenses and related documentation to determine if a dataset may be used for your application.

Tell us about your project

If you have a project using a listed dataset, please tell us about it. We may work with you to feature your project in a blog post.

The Human Sleep Project

bioinformaticsdeep learninglife sciencesmachine learningmedicineneurophysiologyneuroscience

The Human Sleep Project (HSP) sleep physiology dataset is a growing collection of clinical polysomnography (PSG) recordings. Beginning with PSG recordings from from ~15K patients evaluated at the Massachusetts General Hospital, the HSP will grow over the coming years to include data from >200K patients, as well as people evaluated outside of the clinical setting. This data is being used to develop CAISR (Complete AI Sleep Report), a collection of deep neural networks, rule-based algorithms, and signal processing approaches designed to provide better-than-human detection of conventional PSG...

Usage examples

Insomnia and morning motor vehicle accidents: A decision analysis of the risk of hypnotics versus the risk of untreated insomnia. Journal of Clinical Psychopharmacology. 2014 Jun;34(3):400-402. PMCID: PMC6794095. by Bianchi MT, Westover MB.
Decision modeling in sleep apnea: the critical roles of pre-test probability, cost of untreated OSA, and time horizon. Journal of Clinical Sleep Medicine. 2016 Mar;12(3):409-418. PMCID: PMC4773629. by Moro M, Westover MB, Kelly J, Bianchi MT.
HIV Increases Sleep-based Brain Age Despite Antiretroviral Therapy. SLEEP. 2021 Mar 30:zsab058. doi: 10.1093/sleep/zsab058. Epub ahead of print. PMCID: PMC8361332. by Leone MJ*, Sun H*, Boutros CL, Liu L, Ye E, Sullivan L, et al.
Dementia Detection from Brain Activity During Sleep. Sleep. Nov 30:zsac286. doi: 10.1093/sleep/zsac286. Epub ahead of print. PMID: 36448766.* by Ye E*, Sun H*, Krishnamurthy PV, Adra N, Ganglberger W, Thomas RJ, et al.
Large-Scale Automated Sleep Staging. Sleep. 2017 Oct 1;40(10). doi: 10.1093/sleep/zsx139. PMCID: PMC6251659. by Sun H, Jian J, Goparaju B, Huang GB, Sourina O, Bianchi MT, et al.

See 37 usage examples →

1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5, 3.7, 4.0, and 4.2

bambioinformaticsbiologycramgeneticgenomicgenotypinglife sciencesmachine learningpopulation geneticsshort read sequencingstructural variationtertiary analysisvariant annotationwhole genome sequencing

This dataset contains alignment files and short nucleotide, copy number (CNV), repeat expansion (STR), structural variant (SV) and other variant call files from the 1000 Genomes Project Phase 3 dataset (n=3202) using Illumina DRAGEN v3.5.7b, v3.7.6, v4.0.3, and v4.2.7 software. All DRAGEN analyses were performed in the cloud using the Illumina Connected Analytics bioinformatics platform powered by Amazon Web Services (see 'Data solution empowering population genomics' for more information). The v3.7.6 and v4.2.7 datasets include results from trio small variant, de novo structural vari...

Usage examples

See 22 usage examples →

Allen Cell Imaging Collections

biologycell biologycell imagingHomo sapiensimage processinglife sciencesmachine learningmicroscopy

This bucket contains multiple datasets (as Quilt packages) created by the Allen Institute for Cell Science. The types of data included in this bucket are listed below:

Field of view or cropped images of cells
Segmentations of structures in the images (e.g., boundaries of cells, DNA, other intracellular structures, etc.)
Processed versions of the above images and segmentations
Machine learning predictions and labels of the data listed above
Models trained on the previously listed data
Additional supporting non-image data related to the above listed data types (e.g., gene expression data, whole genome sequencing data, features derived from the images or model predictions, metadata)
Simulation, analysis, and visualization data of in silico cell structures, cells, and cell populations

Extern...

Usage examples

See 20 usage examples →

NASA Prediction of Worldwide Energy Resources (POWER)

agricultureair qualityanalyticsarchivesatmosphereclimateclimate modeldata assimilationdeep learningearth observationenergyenvironmentalforecastgeosciencegeospatialglobalhistoryimagingindustrymachine learningmachine translationmetadatameteorologicalmodelnetcdfopendapradiationsatellite imagerysolarstatisticssustainabilitytime series forecastingwaterweatherzarr

NASA's goal in Earth science is to observe, understand, and model the Earth system to discover how it is changing, to better predict change, and to understand the consequences for life on Earth. The Applied Sciences Program, within the Earth Science Division of the NASA Science Mission Directorate, serves individuals and organizations around the globe by expanding and accelerating societal and economic benefits derived from Earth science, information, and technology research and development.

The Prediction Of Worldwide Energy Resources (POWER) Project, funded through the Applied Sciences Program at NASA Langley Research Center, gathers NASA Earth observation data and parameters related to the fields of surface solar irradiance and meteorology to serve the public in several free, easy-to-access and easy-to-use methods. POWER helps communities become resilient amid observed climate variability by improving data accessibility, aiding research in energy development, building energy efficiency, and supporting agriculture projects.

The POWER project contains over 380 satellite-derived meteorology and solar energy Analysis Ready Data (ARD) at four temporal levels: hourly, daily, monthly, and climatology. The POWER data archive provides data at the native resolution of the source products. The data is updated nightly to maintain near real time availability (2-3 days for meteorological parameters and 5-7 days for solar). The POWER services catalog consists of a series of RESTful Application Programming Interfaces, geospatial enabled image services, and web mapping Data Access Viewer. These three service offerings support data discovery, access, and distribution to the project’s user base as ARD and as direct application inputs to decision support tools.

The latest data version update includes hourly...

Usage examples

About the Prediction Of Worldwide Energy Resources (POWER) Project ArcGIS StoryMap. by The POWER Project
Accessing and Subsetting POWER data, from AWS S3 using Python. by The POWER Project
Global Surface Solar Energy Anomalies Including El Nino and La Nina Years. by C. H., and Coauthors
POWER Application Programming Interface (API) by The POWER Project
Modeling the potential for thermal concentrating solar power technologies by Zhang, Y., S. J. Smith, G. P. Kyle, and P. W. Stackhouse

See 18 usage examples →

Cell Painting Gallery

bioinformaticsbiologycancercell biologycell imagingcell paintingchemical biologycomputer visioncsvdeep learningfluorescence imaginggenetichigh-throughput imagingimage processingimage-based profilingimaginglife sciencesmachine learningmedicinemicroscopyorganelle

The Cell Painting Gallery is a collection of image datasets created using the Cell Painting assay. The images of cells are captured by microscopy imaging, and reveal the response of various labeled cell components to whatever treatments are tested, which can include genetic perturbations, chemicals or drugs, or different cell types. The datasets can be used for diverse applications in basic biology and pharmaceutical research, such as identifying disease-associated phenotypes, understanding disease mechanisms, and predicting a drug’s activity, toxicity, or mechanism of action (Chandrasekaran et al 2020). This collection is maintained by the Carpenter–Singh lab and the Cimini lab at the Broad Institute. A human-friendly listing of datasets, instructions for accessing them, and other documentation is at the corresponding GitHub page abou...

Usage examples

Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes by Bray M-A, Singh S, Han H, Davis CT, Borgeson B, Hartland C, Kost-Alimova M, Gustafsdottir SM, Gibson CC, & Carpenter AE
Accelerating Drug Discovery with high-throughput Cell Painting on AWS by Chris Kaspar
Center for Open Bioimage Analysis (COBA) YouTube Channel - video tutorials of CellProfiler and other softwares by Multiple Authors
Cell Painting predicts impact of lung cancer variants by Caicedo JC, Arevalo J, Piccioni F, Bray MA, Hartland CL, Wu X, Brooks AN, Berger AH, Boehm JS, Carpenter AE, & Singh S
Systematic morphological profiling of human gene and allele function via Cell Painting by Rohban MH, Singh S, Wu X, Berthet JB, Bray M-A, Shrestha Y, Varelas X, Boehm JS, & Carpenter AE

See 17 usage examples →

ESA WorldCover

agriculturecogdisaster responseearth observationgeospatialland coverland usemachine learningmappingnatural resourcesatellite imagerystacsustainabilitysynthetic aperture radar

The European Space Agency (ESA) WorldCover product provides global land cover maps for 2020 & 2021 at 10 m resolution based on Copernicus Sentinel-1 and Sentinel-2 data. The WorldCover product comes with 11 land cover classes and has been generated in the framework of the ESA WorldCover project, part of the 5th Earth Observation Envelope Programme (EOEP-5) of the European Space Agency. A first version of the product (v100), containing the 2020 map was released in October 2021. The 2021 map was released in October 2022 using an improved algorithm (v200). The WorldCover 2020 and 2021 maps we...

Usage examples

Exploring the dataset (STAC API examples) by VITO
ESA WorldCover 10 m 2020 v100 by Zanaga, D., Van De Kerchove, R., De Keersmaecker, W., Souverijns, N., Brockmann, C., Quast, R., Wevers, J., Grosu, A., Paccini, A., Vergnaud, S., Cartus, O., Santoro, M., Fritz, S., Georgieva, I., Lesiv, M., Carter, S., Herold, M., Li, Linlin, Tsendbazar, N.E., Ramoino, F., Arino, O., 2021.
The world's most populated and greenest megacities (and how we found out) by Michael Dangermond, Emily Meriam
WorldCover Viewer by VITO
ESA WorldCover 10 m 2021 v200 - Product User Manual by VITO

See 15 usage examples →

SpaceNet

computer visiondisaster responseearth observationgeospatialmachine learningsatellite imagery

SpaceNet, launched in August 2016 as an open innovation project offering a repository of freely available imagery with co-registered map features. Before SpaceNet, computer vision researchers had minimal options to obtain free, precision-labeled, and high-resolution satellite imagery. Today, SpaceNet hosts datasets developed by its own team, along with data sets from projects like IARPA’s Functional Map of the World (fMoW).

Usage examples

FCAU-Net for the Semantic Segmentation of Fine-Resolution Remotely Sensed Images by Xuerui Niu, Qiaolin Zeng, Xiaobo Luo and Liangfu Chen
SpaceNet 6: Dataset Release by Jake Shermeyer
The SpaceNet 7 Multi-Temporal Urban Development Challenge: Dataset Release by Adam Van Etten
Getting Started with SpaceNet Data by Adam Van Etten
Solaris: an open source Python library for analyzing overhead imagery with machine learning by Nick Weir

See 15 usage examples →

2021 Amazon Last Mile Routing Research Challenge Dataset

amazon.scienceanalyticsdeep learninggeospatiallast milelogisticsmachine learningoptimizationroutingtransportationurban

The 2021 Amazon Last Mile Routing Research Challenge was an innovative research initiative led by Amazon.com and supported by the Massachusetts Institute of Technology’s Center for Transportation and Logistics. Over a period of 4 months, participants were challenged to develop innovative machine learning-based methods to enhance classic optimization-based approaches to solve the travelling salesperson problem, by learning from historical routes executed by Amazon delivery drivers. The primary goal of the Amazon Last Mile Routing Research Challenge was to foster innovative applied research in r...

Usage examples

Human-Centric Parcel Delivery at Deutsche Post with Operations Research and Machine Learning by Uğur Arikan , Thorsten Kranz, Baris Cem Sal, Severin Schmitt, Jonas Witt
Machine Learning for Data-Driven Last-Mile Delivery Optimization by Sami Serkan Özarik, Paulo da Costa, Alexandre M. Florio
The Driver-Aide Problem Coordinated Logistics for Last-Mile Delivery by S. Raghavan , Rui Zhang
Planning robust drone-truck delivery routes under road traffic uncertainty by Yu Yang, Chiwei Yan, Yufeng Cao, Roberto Roberti
Can language models be used for real-world urban-delivery route optimization? by Yang Liu, Fanyou Wu, Zhiyuan Liu, Kai Wang, Feiyue Wang, Xiaobo Qu

See 17 usage examples →

Low Altitude Disaster Imagery (LADI) Dataset

aerial imagerycoastalcomputer visiondisaster responseearth observationearthquakesgeospatialimage processingimaginginfrastructurelandmachine learningmappingnatural resourceseismologytransportationurbanwater

The Low Altitude Disaster Imagery (LADI) Dataset consists of human and machine annotated airborne images collected by the Civil Air Patrol in support of various disaster responses from 2015-2023. Two key distinctions are the low altitude, oblique perspective of the imagery and disaster-related features, which are rarely featured in computer vision benchmarks and datasets.

Usage examples

Large Scale Organization and Inference of an Imagery Dataset for Public Safety by Jeffrey Liu, David Strohschein, Siddharth Samsi, Andrew Weinert
LADI v1 Tutorials by Andrew Weinert, Jianyu Mao, Kiana Harris, Nae-Rong Chang, Caleb Pennell, Yiming Ren, Ryan Earley, Nadia Dimitrova
Evaluating Multiple Video Understanding and Retrieval Tasks at TRECVID 2021 by George Awad, Asad Butt, Keith Curtis, Jonathan G. Fiscus, Afzal A. Godil, Yooyoung Lee, Andrew Delgado, Eliot Godard, Baptiste Chocot, Lukas Diduch, Jeffrey Liu, Yvette Graham, Gareth Jones, Georges Quenot
Accelerate disaster response with computer vision for satellite imagery using Amazon SageMaker and Amazon Augmented AI by Vamshi Krishna Enabothala, Morgan Dutton, and Sandeep Verma
LADI v2 Overview by Jeffrey Liu, Sam Scheele, Katherine Picchione

See 11 usage examples →

Radiant MLHub

cogearth observationenvironmentalgeospatiallabeledmachine learningsatellite imagerystac

Radiant MLHub is an open library for geospatial training data that hosts datasets generated by Radiant Earth Foundation's team as well as other training data catalogs contributed by Radiant Earth’s partners. Radiant MLHub is open to anyone to access, store, register and/or share their training datasets for high-quality Earth observations. All of the training datasets are stored using a SpatioTemporal Asset Catalog (STAC) compliant catalog and exposed through a common API. Training datasets include pairs of imagery and labels for different types of machine learning problems including image ...

Usage examples

See 8 usage examples →

Materials Project Data

chemistrycloud computingdata assimilationdigital assetsdigital preservationenergyenvironmentalfree softwaregenomeHPCinformation retrievalinfrastructurejsonmachine learningmaterials sciencemolecular dynamicsmoleculeopen source softwarephysicspost-processingx-ray crystallography

Materials Project is an open database of computed materials properties aiming to accelerate materials science research. The resources in this OpenData dataset contain the raw, parsed, and build data products.

Usage examples

See 7 usage examples →

10m Annual Land Use Land Cover (9-class)

cogearth observationenvironmentalgeospatialland coverland usemachine learningmappingplanetarysatellite imagerystacsustainability

This dataset, produced by Impact Observatory, Microsoft, and Esri, displays a global map of land use and land cover (LULC) derived from ESA Sentinel-2 imagery at 10 meter resolution for the years 2017 - 2023. Each map is a composite of LULC predictions for 9 classes throughout the year in order to generate a representative snapshot of each year. This dataset was generated by Impact Observatory, which used billions of human-labeled pixels (curated by the National Geographic Society) to train a deep learning model for land classification. Each global map was produced by applying this model to ...

Usage examples

Mapping the world with unmatched frequency by Mark Hannel
Global land use / land cover with Sentinel 2 and deep learning by K. Karra, C. Kontgis, Z. Statman-Weil, J. C. Mazzariello, M. Mathis and S. P. Brumby
‘Very Dire’: Devastated by Floods, Pakistan Faces Looming Food Crisis by New York Times
The world's most populated and greenest megacities (and how we found out) by Esri
These maps from satellite data show how much Earth has changed in only five years by Fast Company

See 6 usage examples →

Pacific Ocean Sound Recordings

acousticsbiodiversitybiologyclimatecoastaldeep learningecosystemsenvironmentalmachine learningmarine mammalsoceansopen source software

This project offers passive acoustic data (sound recordings) from a deep-ocean environment off central California. Recording began in July 2015, has been nearly continuous, and is ongoing. These resources are intended for applications in ocean soundscape research, education, and the arts.

Usage examples

Seal bomb noise as a potential threat to Monterey Bay harbor porpoise by Simonis et al. (2020)
Reduction of Low-Frequency Vessel Noise in Monterey Bay National Marine Sanctuary During the COVID-19 Pandemic by Ryan et al. (2021)
New Passive Acoustic Monitoring in Monterey Bay National Marine Sanctuary by Ryan et al. (2016)
Tutorials on machine learning and signal processing methods for anthropogenic and cetacean study using the pacific-sound archives by MBARI
Animal-borne metrics enable acoustic detection of blue whale migration by Oestreich et al. (2020)

See 6 usage examples →

RarePlanes

computer visiondeep learningearth observationgeospatiallabeledmachine learningsatellite imagery

RarePlanes is a unique open-source machine learning dataset from CosmiQ Works and AI.Reverie that incorporates both real and synthetically generated satellite imagery. The RarePlanes dataset specifically focuses on the value of AI.Reverie synthetic data to aid computer vision algorithms in their ability to automatically detect aircraft and their attributes in satellite imagery. Although other synthetic/real combination datasets exist, RarePlanes is the largest openly-available very high resolution dataset built to test the value of synthetic data from an overhead perspective. The real portion ...

Usage examples

Notebook for training and testing YOLYv4 by Adam Van Etten
Getting Started with YOLTv4 for Object Detection in Imagery: Getting Training Data by Sophia Parafina
Automatically compress and archive satellite imagery for Amazon S3 by Newel Hirst, Joseph Fahimi, and Justin Downes
RarePlanes: Synthetic Data Takes Flight by Jacob Shermeyer, Thomas Hossler, Adam Van Etten, Daniel Hogan, Ryan Lewis, Daeil Kim
Announcing YOLTv4: Improved Satellite Imagery Object Detection by Adam Van Etten

See 6 usage examples →

Solar Dynamics Observatory (SDO) Machine Learning Dataset

machine learningNASA SMD AI

The v1 dataset includes AIA/HMI observations 2010-2018 and v2 includes AIA/HMI observations 2010-2020 in all 10 wavebands (94A, 131A, 171A, 193A, 211A, 304A, 335A, 1600A, 1700A, 4500A), with 512x512 resolution and 6 minutes cadence; HMI vector magnetic field observations in Bx, By, and Bz components, with 512x512 resolution and 12 minutes cadence; The EVE observations in 39 wavelengths from 2010-05-01 to 2014-05-26, with 10 seconds cadence.

Usage examples

Scripts for generating the SDOMLv2 dataset by Jin, Meng
A Machine-learning Data Set Prepared from the NASA Solar Dynamics Observatory Mission by Galvez, Richard; Fouhey, David F.; Jin, Meng; Szenicer, Alexandre; et al
ML applications based on the SDOMLv2 dataset by Wright, Paul J.
ML applications based on the SDOMLv1 dataset by Salvatelli, Valentina; dos Santos, Luiz F. G.; Bose, Souvik; Neuberg, Brad; Cheung, Mark C. M.; Janvier, Miho; Jin, Meng; Gal, Yarin; Boerner, Paul; Baydin, Atılım Güneş
Scripts for generating the SDOMLv1 dataset by Fouhey, David F.; Jin, Meng

See 6 usage examples →

ESA WorldCover Sentinel-1 and Sentinel-2 10m Annual Composites

agriculturecogdisaster responseearth observationgeospatialland coverland usemachine learningmappingnatural resourcesatellite imagerystacsustainabilitysynthetic aperture radar

The WorldCover 10m Annual Composites were produced, as part of the European Space Agency (ESA) WorldCover project, from the yearly Copernicus Sentinel-1 and Sentinel-2 archives for both years 2020 and 2021. These global mosaics consists of four products composites. A Sentinel-2 RGBNIR yearly median composite for bands B02, B03, B04, B08. A Sentinel-2 SWIR yearly median composite for bands B11 and B12. A Sentinel-2 NDVI yearly percentiles composite (NDVI 90th, NDVI 50th NDVI 10th percentiles). A Sentinel-1 GAMMA0 yearly median composite for bands VV, VH and VH/VV (power scaled). Each product is...

Usage examples

WorldCover Viewer by VITO
ESA WorldCover 10 m 2021 v200 - Product User Manual by VITO
Exploring the datasets by VITO
Release of the 10 m WorldCover map by Ruben Van De Kerchove
ESA WorldCover 10 m 2021 v200 by Zanaga, D., Van De Kerchove, R.,Daems, D.,De Keersmaecker, W., Brockmann, C., Kirches, G., Wevers, J., Cartus, O., Santoro, M., Fritz, S., Lesiv, M., Herold, M., Tsendbazar, N.E., Xu, P., Ramoino, F., Arino, O.

See 5 usage examples →

High resolution, annual cropland and landcover maps for selected African countries

agriculturecogdeep learninglabeledland covermachine learningsatellite imagery

High resolution, annual cropland and landcover maps for selected African countries developed by Clark University's Agricultural Impacts Research Group using various machine learning approaches applied to Planet imagery, including field boundary and cultivated frequency maps, as well as multi-class land cover.

Usage examples

Final report-Phase 2: Creating next generation field boundary and crop type maps: Rigorous multi-scale groundtruth provides sustainable extension services for smallholders by Wussah et al (2022)
High resolution, annual maps of field boundaries for smallholder-dominated croplands at national scales by Estes et al. (2022)
Final report-Phase 1: Creating open agricultural maps and ground truth data to better deliver farm extension services by Estes et al (2022)
A super-ensemble approach to map land cover types with high resolution over data-sparse African savanna landscapes by Song et al. (2023)
Accessing and downloading data by Rahebe Abedi

See 5 usage examples →

MONKEY

cancerclassificationcomputational pathologycomputer visiondeep learningdigital pathologygrand-challenge.orghistopathologyimaginglife sciencesmachine learningmedical image computingmedical imaging

This dataset contains the training data for the Machine learning for Optimal detection of iNflammatory cells in the KidnEY or MONKEY challenge. The MONKEY challenge focuses on the automated detection and classification of inflammatory cells, specifically monocytes and lymphocytes, in kidney transplant biopsies using Periodic acid-Schiff (PAS) stained whole-slide images (WSI). It contains 80 WSI, collected from 4 different pathology institutes, with annotated regions of interest. For each WSI up to 3 different PAS scans and one IHC slide scan are available. This dataset and challenge support th...

Usage examples

See 5 usage examples →

A region-wide, multi-year set of crop field boundary labels for Africa

agriculturecoglabeledland covermachine learningsatellite imagery

Crop field boundaries digitized in Planet imagery collected across Africa between 2017 and 2023, developed by Farmerline, Spatial Collective, and the Agricultural Impacts Research Group at Clark University, with support from the Lacuna Fund (Estes et al, 2024; Details →

Usage examples

A region-wide, multi-year set of crop field boundary labels for Africa by Wussah et al. (2023)
High resolution, annual maps of field boundaries for smallholder-dominated croplands at national scales by Estes et al. (2022)
Generalization enhancement strategies to enable cross-year cropland mapping with convolutional neural networks trained using historical samples by Khallaghi et al. (2025)
A region-wide, multi-year set of crop field boundary labels for Africa by Estes et al. (2024)
Instructions on data access and label-making demonstration notebook by Lyndon Estes

See 7 usage examples →

High Resolution Canopy Height Maps by WRI and Meta

aerial imageryagricultureclimatecogearth observationgeospatialimage processingland covermachine learningsatellite imagery

Global and regional Canopy Height Maps (CHM). Created using machine learning models on high-resolution worldwide Maxar satellite imagery.

Usage examples

Sub-meter resolution canopy height maps using self-supervised learning and a vision transformer trained on Aerial and GEDI Lidar by Jamie Tolan, Hung-I Yang, Ben Nosarzewski, Guillaume Couairon, Huy Vo, John Brandt, Justine Spore, Sayantan Majumdar, Daniel Haziza, Janaki Vamaraju, Theo Moutakanni, Piotr Bojanowski, Tracy Johns, Brian White, Tobias Tiecke, Camille Couprie
Using Artificial Intelligence to Map the Earth’s Forests by Jamie Tolan, Camille Couprie, John Brandt, Justine Spore, Tobias Tiecke, Tracy Johns and Patrick Nease
Every tree counts: Large-scale mapping of canopy height at the resolution of individual trees by Jamie Tolan, Camille Couprie, and Tracy Johns
Global Canopy Height on Earth Engine by Meta and WRI

See 4 usage examples →

OpenCell on AWS

biologycell biologycell imagingcomputer visionfluorescence imagingimaginglife sciencesmachine learningmicroscopy

The OpenCell project is a proteome-scale effort to measure the localization and interactions of human proteins using high-throughput genome engineering to endogenously tag thousands of proteins in the human proteome. This dataset consists of the raw confocal fluorescence microscopy images for all tagged cell lines in the OpenCell library. These images can be interpreted both individually, to determine the localization of particular proteins of interest, and in aggregate, by training machine learning models to classify or quantify subcellular localization patterns.

Usage examples

See 4 usage examples →

Sentinel-2 L2A 120m Mosaic

agriculturecogearth observationgeospatialmachine learningnatural resourcesatellite imagery

Sentinel-2 L2A 120m mosaic is a derived product, which contains best pixel values for 10-daily periods, modelled by removing the cloudy pixels and then performing interpolation among remaining values. As there are some parts of the world, which have lengthy cloudy periods, clouds might be remaining in some parts. The actual modelling script is available here.

Usage examples

See 4 usage examples →

iSDAsoil

agricultureanalyticsbiodiversityconservationdeep learningfood securitygeospatialmachine learningsatellite imagery

iSDAsoil is a resource containing soil property predictions for the entire African continent, generated using machine learning. Maps for over 20 different soil properties have been created at 2 different depths (0-20 and 20-50cm). Soil property predictions were made using machine learning coupled with remote sensing data and a training set of over 100,000 analyzed soil samples. Included in this dataset are images of predicted soil properties, model error and satellite covariates used in the mapping process.

Usage examples

African soil properties and nutrients mapped at 30 m spatial resolution using two-scale ensemble machine learning by Tomislav Hengl, Matthew A. E. Miller, Josip Križan, Keith D. Shepherd, Andrew Sila, Milan Kilibarda, Ognjen Antonijević, Luka Glušica, Achim Dobermann, Stephan M. Haefele, Steve P. McGrath, Gifty E. Acquah, Jamie Collinson, Leandro Parente, Mohammadreza Sheykhmousa, Kazuki Saito, Jean-Martial Johnson, Jordan Chamberlin, Francis B. T. Silatsa, Martin Yemefack, John Wendt, Robert A. MacMillan, Ichsani Wheeler & Jonathan Crouch
iSDAsoil liming demo app on Observable by Jamie Collinson
iSDAsoil Python tutorial by Matt Miller
iSDAsoil homepage - view soil property maps online by iSDA

See 4 usage examples →

Allen Ivy Glioblastoma Atlas

biologycancercomputer visiongene expressiongeneticglioblastomaHomo sapiensimage processingimaginglife sciencesmachine learningneurobiology

This dataset consists of images of glioblastoma human brain tumor tissue sections that have been probed for expression of particular genes believed to play a role in development of the cancer. Each tissue section is adjacent to another section that was stained with a reagent useful for identifying histological features of the tumor. Each of these types of images has been completely annotated for tumor features by a machine learning process trained by expert medical doctors.

Usage examples

See 3 usage examples →

CZ Grand Challenges - Imaging MIT Licensed data and models

biodiversitybioinformaticsbiologybiomolecular modelingbrain imagescell biologycell imagingcziimaginglife sciencesmachine learningmicroscopymodelproteinzarr

This dataset contains a diverse range of imaging biological data and models. The data is sourced and curated by a team of experts at CZI and is made available as part of these datasets only when it is not publicly accessible or requires transformations to support model training.

Usage examples

Quickstart Tutorial for CELL-Diff by CZI
CELL-Diff: Unified diffusion modeling for protein sequences and microscopy images by Zheng Dihan, Bo Huang
Quickstart Tutorial for SubCell by CZI
Documentation for SubCell by CZI
Documentation for CELL-Diff by CZI

See 6 usage examples →

CZ Grand Challenges - Transcriptomic MIT Licensed data and models

biodiversitybiologybiomolecular modelingcell biologyczihdf5life sciencesmachine learningmodelproteintranscriptomics

This dataset contains a transcriptomics biological data and models. The models embed transcriptomic data and facilitate transcriptomic analysis. The data is sourced and curated by a team of experts at CZI and is made available as part of these datasets only when it is not publicly accessible or requires transformations to support model training.

Usage examples

Documentation for Transcriptformer by CZI
scGenePT: Is language all you need for modeling single-cell perturbations? by Ana-Maria Istrate, Donghui Li, Theofanis Karaletsos
Quickstart Tutorial for Transcriptformer by CZI
Documentation for scGenePT by CZI
A Cross-Species Generative Cell Atlas Across 1.5 Billion Years of Evolution: The TranscriptFormer Single-cell Model by Pearce, J. D., et. al.

See 6 usage examples →

I-CARE:International Cardiac Arrest REsearch consortium Electroencephalography Database

bioinformaticsdeep learninglife sciencesmachine learningmedicineneurophysiologyneuroscience

The International Cardiac Arrest REsearch consortium (I-CARE) Database includes baseline clinical information and continuous electroencephalography (EEG) recordings from 1,020 comatose patients with a diagnosis of cardiac arrest who were admitted to an intensive care unit from seven academic hospitals in the U.S. and Europe. Patients were monitored with 18 bipolar EEG channels over hours to days for the diagnosis of seizures and for neurological prognostication. Long-term neurological function was determined using the Cerebral Performance Category scale.

Usage examples

The International Cardiac Arrest Research (I-CARE) Consortium Electroencephalography Database by Amorim E, Zheng WL, Ghassemi MM, Aghaeeaval M, Kandhare P, Karukonda V, et al.
WFDB Software Package by Moody, G., Pollard, T., & Moody, B.
I-CARE:International Cardiac Arrest REsearch consortium Electroencephalography Database by Amorim E, Zheng WL, Ghassemi MM, Aghaeeaval M, Kandhare P, Karukonda V, et al.

See 3 usage examples →

NASA SOHO/LASCO2 comet challenge on AWS

astronomymachine learningNASA SMD AI

The SOHO/LASCO data set (prepared for the challenge hosted in Topcoder) provided here comes from the instrument’s C2 telescope and comprises approximately 36,000 images spread across 2,950 comet observations. The human eye is a very sensitive tool and it is the only tool currently used to reliably detect new comets in SOHO data - particularly comets that are very faint and embedded in the instrument background noise. Bright comets can be easily detected in the LASCO data by relatively simple automated algorithms, but the majority of comets observed by the instrument are extremely faint, noise-...

Usage examples

Winners Selected for the NASA SOHO Comet Search with Artificial Intelligence Open-Science Challenge by Denise Hill
Topcoder NASA Comet Discovery: A Recap by TopCoder
Topcoder Challenge Finds Two New Comets For NASA by Annika Nagy

See 3 usage examples →

National Cancer Institute Imaging Data Commons (IDC) Collections

cancerdigital pathologyfluorescence imagingimage processingimaginglife sciencesmachine learningmicroscopyradiology

Imaging Data Commons (IDC) is a repository within the Cancer Research Data Commons (CRDC) that manages imaging data and enables its integration with the other components of CRDC. IDC hosts a growing number of imaging collections that are contributed by either funded US National Cancer Institute (NCI) data collection activities, or by the individual researchers.Image data hosted by IDC is stored in DICOM format.

Usage examples

See 3 usage examples →

PD12M

artdeep learningimage processinglabeledmachine learningmedia

PD12M is a collection of 12.4 million CC0/PD image-caption pairs for the purpose of training generative image models.

Usage examples

Downloading Images by Spawning
Datasheet by Spawning
Source.Plus by Spawning
Hugging Face Dataset by Spawning
PD12M: A Large-Scale Image Captioning Dataset by Jordan Meyer, Nick Padgett, Laura Exline, Cullen Miller

See 6 usage examples →

SPaRCNet data:Seizures, Rhythmic and Periodic Patterns in ICU Electroencephalography

bioinformaticsdeep learninglife sciencesmachine learningmedicineneurophysiologyneuroscience

The IIIC dataset includes 50,697 labeled EEG samples from 2,711 patients' and 6,095 EEGs that were annotated by physician experts from 18 institutions. These samples were used to train SPaRCNet (Seizures, Periodic and Rhythmic Continuum patterns Deep Neural Network), a computer program that classifies IIIC events with an accuracy matching clinical experts.

Usage examples

SPaRCNet data:Seizures, Rhythmic and Periodic Patterns in ICU Electroencephalography by Jing, J., Ge, W., Struck, A. F., Fernandes, M., Hong, S., An, S., et al.
Development of Expert-Level Classification of Seizures and Rhythmic and Periodic Patterns During EEG Interpretation by Jing J, Ge W, Hong S, Fernandes MB, Lin Z, Yang C et al., et al.
IIIC-SPaRCNet Github Repository by Brain Data Science Platform (BDSP)

See 3 usage examples →

Sophos/ReversingLabs 20 Million malware detection dataset

cyber securitydeep learninglabeledmachine learning

A dataset intended to support research on machine learning techniques for detecting malware. It includes metadata and EMBER-v2 features for approximately 10 million benign and 10 million malicious Portable Executable files, with disarmed but otherwise complete files for all malware samples. All samples are labeled using Sophos in-house labeling methods, have features extracted using the EMBER-v2 feature set, well as metadata obtained via the pefile python library, detection counts obtained via ReversingLabs telemetry, and additional behavioral tags that indicate the rough behavior of the sam...

Usage examples

SOREL-20M quickstart by Richard Harang
SOREL-20M dataset interface code by Richard Harang and Ethan M Rudd
SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection by Richard Harang and Ethan M Rudd

See 3 usage examples →

Wind AI Bench

benchmarkenergymachine learning

This data lake contains multiple datasets related to fundamental problems in wind energy research. This includes data for wind plant power production for various layouts/wind flow scenarios, data for two- and three-dimensional flow around different wind turbine airfoils/blades, wind turbine noise production, among others. The purpose of these datasets is to establish a standard benchmark against which new AI/ML methods can be tested, compared, and deployed. Details regarding the generation and formatting of the data for each dataset is included in the metadata as well as example noteboo...

Usage examples

Airfoil 2k by Dakota Ramos, Andrew Glaws
Wind AI Bench FLORIS PlayGen by Dakota Ramos, Andrew Glaws
Airfoil 9k by Dakota Ramos, Andrew Glaws

See 3 usage examples →

Africa Soil Information Service (AfSIS) Soil Chemistry

agricultureenvironmentalfood securitylife sciencesmachine learning

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. In this release, we include data collected during Phase I (2009-2013.) Georeferenced samples were collected from 19 countries in Sub-Saharan African using a statistically sound sampling scheme, and their soil properties were analyzed using both conventional soil testing methods and spectral methods (infrared diffuse reflectance spectroscopy). The two ...

Usage examples

See 2 usage examples →

AgricultureVision

aerial imageryagriculturecomputer visiondeep learningmachine learning

Agriculture-Vision aims to be a publicly available large-scale aerial agricultural image dataset that is high-resolution, multi-band, and with multiple types of patterns annotated by agronomy experts. The original dataset affiliated with the 2020 CVPR paper includes 94,986 512x512images sampled from 3,432 farmlands with nine types of annotations: double plant, drydown, endrow, nutrient deficiency, planter skip, storm damage, water, waterway and weed cluster. All of these patterns have substantial impacts on field conditions and the final yield. These farmland images were captured between 201...

Usage examples

The 2nd International Workshop and Prize Challenge on Agriculture-Vision, Challenges & Opportunities for Computer Vision in Agricutlure by Humphrey Shi, Naira Hovakimyan, Jennifer Hobbs, Ed Delp, Melba Crawford, Zhen Li, David Clifford, Jim Yuan, Mang Tik Chiu, Xingqian Xu
Agriculture-Vision: A Large Aerial Image Database for Agricultural Pattern Analysis by Mang Tik Chiu, Xingqian Xu, Yunchao Wei, Zilong Huang, Alexander Schwing, Robert Brunner, Hrant Khachatrian, Hovnatan Karapetyan, Ivan Dozier, Greg Rose, David Wilson, Adrian Tudor, Naira Hovakimyan, Thomas S. Huang, Honghui Shi

See 2 usage examples →

Astrophysics Division Galaxy Segmentation Benchmark Dataset

astronomymachine learningNASA SMD AIsegmentation

Pan-STARSS imaging data and associated labels for galaxy segmentation into galactic centers, galactic bars, spiral arms and foreground stars derived from citizen scientist labels from the Galaxy Zoo: 3D project.

Usage examples

Galaxy Zoo: 3D – crowdsourced bar, spiral, and foreground star masks for MaNGA target galaxies by Karen L Masters, Coleman Krawczyk, Shoaib Shamsi, Alexander Todd, Daniel Finnegan, Matthew Bershady, Kevin Bundy, Brian Cherinka, Amelia Fraser-McKelvie, Dhanesh Krishnarao, Sandor Kruk, Richard R Lane, David Law, Chris Lintott, Michael Merrifield, Brooke Simmons, Anne-Marie Weijmans, Renbin Yan
Pan-STARRS Pixel Processing: Detrending, Warping, Stacking by C. Z. Waters, E. A. Magnier, P. A. Price, K. C. Chambers, W. S. Burgett, P. W. Draper, H. A. Flewelling, K. W. Hodapp, M. E. Huber, R. Jedicke, N. Kaiser, R.-P. Kudritzki, R. H. Lupton, N. Metcalfe, A. Rest, W. E. Sweeney, J. L. Tonry, R. J. Wainscoat, and W. M. Wood-Vase

See 2 usage examples →

Aurora Multi-Sensor Dataset

autonomous vehiclescomputer visiondeep learningimage processinglidarmachine learningmappingroboticstraffictransportationurbanweather

The Aurora Multi-Sensor Dataset is an open, large-scale multi-sensor dataset with highly accurate localization ground truth, captured between January 2017 and February 2018 in the metropolitan area of Pittsburgh, PA, USA by Aurora (via Uber ATG) in collaboration with the University of Toronto. The de-identified dataset contains rich metadata, such as weather and semantic segmentation, and spans all four seasons, rain, snow, overcast and sunny days, different times of day, and a variety of traffic conditions.
The Aurora Multi-Sensor Dataset contains data from a 64-beam Velodyne HDL-64E LiDAR sensor and seven 1920x1200-pixel resolution cameras including a forward-facing stereo pair and five wide-angle lenses covering a 360-degree view around the vehicle.
This data can be used to develop and evaluate large-scale long-term approaches to autonomous vehicle localization. Its size and diversity make it suitable for a wide range of research areas such as 3D reconstruction, virtual tourism, HD map construction, and map compression, among others.
The data was first presented at the International Conference on Intelligent Robots an
...

Usage examples

Introduction to Visualizing Sensor Types (Jupyter notebook) by Andrei Bârsan (note: Aurora makes no representations as to the accuracy or functionality of the tutorial)
"Pit30M: A benchmark for global localization in the age of self-driving cars", in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 4477-4484) by Martinez, J., Doubov, S., Fan, J., Bârsan, I. A., Wang, S., Máttyus, G., Urtasun, R.

See 2 usage examples →

Brain Encoding Response Generator (BERG)

brain modelscomputer visiondeep learninglife sciencesmachine learningneuroimagingneuroscience

Brain Encoding Response Generator (BERG) is a resource consisting of multiple pre-trained encoding models of the brain and an accompanying Python package to generate accurate in silico neural responses to arbitrary stimuli with just a few lines of code.

Usage examples

Brain Encoding Response Generator (BERG) by Alessandro Gifford
In-Silico fMRI Data Tutorial by Alessandro Gifford
In-Silico EEG Data Tutorial by Alessandro Gifford
Quickstart Tutorial by Domenic Bersch
The Brain Encoding Response Generator by Alessandro Gifford

See 5 usage examples →

Consented Activities of People

activity detectionactivity recognitioncomputer visionlabeledmachine learningprivacyvideo

The Consented Activities of People (CAP) dataset is a fine grained activity dataset for visual AI research curated using the Visym Collector platform.

Usage examples

Visym Collector by Visym Labs & Systems & Technology Research
OpenFAD - Open Fine Grained Activity Detection Challenge by Visym Labs & NIST

See 2 usage examples →

CryoET Data Portal

cell biologycryo electron tomographyczielectron tomographylife sciencesmachine learningsegmentationstructural biology

Cryo-electron tomography (cryoET) is a powerful technique for visualizing 3D structures of cellular macromolecules at near atomic resolution in their native environment. Observing the inner workings of cells in context enables better understanding about the function of healthy cells and the changes associated with disease. However, the analysis of cryoET data remains a significant bottleneck, particularly the annotation of macromolecules within a set of tomograms, which often requires a laborious and time-consuming process of manual labelling that can take months to complete. Given the current...

Usage examples

See 5 usage examples →

DigitalCorpora

computer forensicscomputer securityCSIcyber securitydigital forensicsimage processingimaginginformation retrievalinternetintrusion detectionmachine learningmachine translationtext analysis

Disk images, memory dumps, network packet captures, and files for use in digital forensics research and education. All of this information is accessible through the digitalcorpora.org website, and made available at s3://digitalcorpora/. Some of these datasets implement scenarios that were performed by students, faculty, and others acting in persona. As such, the information is synthetic and may be used without prior authorization or IRB approval. Details of these datasets can be found at Details →

Usage examples

See 2 usage examples →

EEGDash on AWS

life sciencesmachine learningneuroscience

The EEG-DaSh data archive will establish a data-sharing resource for MEEG (EEG, MEG) data, enabling large-scale computational advancements to preserve and share scientific data from publicly funded research for machine learning and deep learning applications.

Usage examples

See 2 usage examples →

Emory Knee Radiograph (MRKR) dataset

bioinformaticsbiologycomputer visioncsvhealthimaginglabeledlife sciencesmachine learningmedical image computingmedical imagingradiologyx-ray

The Emory Knee Radiograph (MRKR) dataset is a large, demographically diverse collection of 503,261 knee radiographs from 83,011 patients, 40% of which are African American. This dataset provides imaging data in DICOM format along with detailed clinical information, including patient- reported pain scores, diagnostic codes, and procedural codes, which are not commonly available in similar datasets. The MRKR dataset also features imaging metadata such as image laterality, view type, and presence of hardware, enhancing its value for research and model development. MRKR addresses significant gaps ...

Usage examples

Emory Knee Radiograph Dataset by Brandon Price, Jason Adleberg, Kaesha Thomas, Zach Zaiman, Aawez Mansuri, Beatrice Brown-Mulry, Chima Okecheukwu, Judy Gichoya, Hari Trivedi.
Example Notebook by Emory-HITI

See 2 usage examples →

Harvard Electroencephalography Database

bioinformaticsdeep learninglife sciencesmachine learningmedicineneurophysiologyneuroscience

The Harvard EEG Database will encompass data gathered from four hospitals affiliated with Harvard University:Massachusetts General Hospital (MGH), Brigham and Women's Hospital (BWH), Beth Israel Deaconess Medical Center (BIDMC), and Boston Children's Hospital (BCH).

Usage examples

Harvard Electroencephalography Database by Zafar, S., Loddenkemper, T., Lee, J. W., Cole, A., Goldenholz, D., Peters, J., et al.
Harvard-EEG-Database-Tools by Brain Data Science Platform (BDSP)

See 2 usage examples →

Harvard-Emory ECG Database

bioinformaticsdeep learninglife sciencesmachine learningmedicineneurophysiologyneuroscience

The Harvard-Emory ECG database (HEEDB) is a large collection of 12-lead electrocardiography (ECG) recordings, prepared through a collaboration between Harvard University and Emory University investigators.

Usage examples

Harvard Electroencephalography Database by Moura Junior, V.; Reyna, M.; Hong, S.; Gupta, A.; Ghanta, M.; Sameni, R., et al.
WFDB Software Package by Moody, G., Pollard, T., & Moody, B.

See 2 usage examples →

SeeFar V0

biodiversityclimatecoastalearth observationenvironmentalgeospatialglobalmachine learningmappingnatural resourcesatellite imagerysustainability

A collection of multi-resolution satellite images from both public and commercial satellites. The dataset is specifically curated for training geospatial foundation models.

Usage examples

SeeFar: Satellite Agnostic Multi-Resolution Dataset for Geospatial Foundation Models by Lowman, J., Zheng, K. L., Fraser, R., The, J. V. G., & Valipour, M.
Getting Started with SeeFar for Multi-Resolution Geospatial Analysis by James Lowman - Coastal Carbon

See 2 usage examples →

3DCoMPaT: Composition of Materials on Parts of 3D Things

computer visionmachine learning

3D CoMPaT is a richly annotated large-scale dataset of rendered compositions of Materials on Parts of thousands of unique 3D Models. This dataset primarily focuses on stylizing 3D shapes at part-level with compatible materials. Each object with the applied part-material compositions is rendered from four equally spaced views as well as four randomized views. We introduce a new task, called Grounded CoMPaT Recognition (GCR), to collectively recognize and ground compositions of materials on parts of 3D objects. We present two variations of this task and adapt state-of-art 2D/3D deep learning met...

Usage examples

3DCoMPaT: Composition of Materials on Parts of 3D Things by Yuchen Li, Ujjwal Upadhyay, Habib Slim, Ahmed Abdelreheem, Arpit Prajapati, Suhail Pothigara, Peter Wonka & Mohamed Elhoseiny

See 1 usage example →

A2D2: Audi Autonomous Driving Dataset

autonomous vehiclescomputer visiondeep learninglidarmachine learningmappingrobotics

An open multi-sensor dataset for autonomous driving research. This dataset comprises semantically segmented images, semantic point clouds, and 3D bounding boxes. In addition, it contains unlabelled 360 degree camera images, lidar, and bus data for three sequences. We hope this dataset will further facilitate active research and development in AI, computer vision, and robotics for autonomous driving.

Usage examples

Autonomous Driving Data Service (ADDS) by Ajay Vohra, Amazon

See 1 usage example →

AI2 Diagram Dataset (AI2D)

machine learning

4,817 illustrative diagrams for research on diagram understanding and associated question answering.

Usage examples

A Diagram is Worth a Dozen Images by Aniruddha Kembhavi, Michael Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi

See 1 usage example →

AI2 Meaningful Citations Data Set

csvmachine learning

630 paper annotations

Usage examples

Identifying Meaningful Citations by Marco Valenzuela, Vu A. Ha, Oren Etzioni

See 1 usage example →

AI2 Reasoning Challenge (ARC) 2018

csvjsonmachine learning

7,787 multiple choice science questions and associated corpora

Usage examples

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challengg by Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord

See 1 usage example →

Astrophysics Division Galaxy Morphology Benchmark Dataset

astronomymachine learningNASA SMD AIsatellite imagery

Hubble Space Telescope imaging data and associated identification labels for galaxy morphology derived from citizen scientist labels from the Galaxy Zoo: Hubble project.

Usage examples

Galaxy Zoo: morphological classifications for 120 000 galaxies in HST legacy imaging by Kyle W. Willett, Melanie A. Galloway, Steven P. Bamford, Chris J. Lintott, Karen L. Masters, Claudia Scarlata, B. D. Simmons, Melanie Beck, Carolin N. Cardamone, Edmond Cheung, Edward M. Edmondson, Lucy F. Fortson, Roger L. Griffith, Boris Häußler, Anna Han, Ross Hart, Thomas Melvin, Michael Parrish, Kevin Schawinski, R. J. Smethurst, Arfon M. Smith

See 1 usage example →

CHIMERA

cancercomputational pathologycomputer visiondeep learningdigital pathologygrand-challenge.orghistopathologylife sciencesmachine learningmedical image computingmedical imaging

This dataset contains the training data for the CHIMERA - Combining HIstology, Medical imaging (radiology) and molEcular data for medical pRognosis and diAgnosis challenge. The CHIMERA Challenge aims to advance precision medicine in cancer care by addressing the critical need for multimodal data integration. Despite significant progress in AI, integrating transcriptomics, pathology, and radiology across clinical departments remains a complex challenge. Clinicians are faced with large, heterogeneous datasets that are difficult to analyze effectively. AI has the potential to unify multimodal dat...

Usage examples

CHIMERA Challenge by Computational Pathology Group Radboudumc, Nijmegen

See 1 usage example →

Corn Kernel Counting Dataset

agriculturecomputer visionmachine learning

Dataset associated with the March 2021 Frontiers in Robotics and AI paper "Broad Dataset and Methods for Counting and Localization of On-Ear Corn Kernels", DOI: 10.3389/frobt.2021.627009

Usage examples

Broad Dataset and Methods for Counting and Localization of On-Ear Corn Kernels by Jennifer Hobbs, Vachik Khachatryan, Barathwaj Anandan, Harutyun Hovhannisyan, David Wilson

See 1 usage example →

Discrete Reasoning Over the content of Paragraphs (DROP)

machine learningnatural language processing

The DROP dataset contains 96k Question and Answer pairs (QAs) over 6.7K paragraphs, split between train (77k QAs), development (9.5k QAs) and a hidden test partition (9.5k QAs).

Usage examples

DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs by Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, Matt Gardner

See 1 usage example →

High Resolution Population Density Maps + Demographic Estimates by CIESIN and Meta

aerial imagerydemographicsdisaster responsegeospatialimage processingmachine learningpopulationsatellite imagery

Population data for a selection of countries, allocated to 1 arcsecond blocks and provided in a combination of CSV and Cloud-optimized GeoTIFF files. This refines CIESIN’s Gridded Population of the World using machine learning models on high-resolution worldwide Maxar satellite imagery. CIESIN population counts aggregated from worldwide census data are allocated to blocks where imagery appears to contain buildings.

Usage examples

Investigating environmental characteristics of US cities using publicly available ASDI datasets by Darren Ko

See 1 usage example →

Image classification - fast.ai datasets

computer visiondeep learningmachine learning

Some of the most important datasets for image classification research, including CIFAR 10 and 100, Caltech 101, MNIST, Food-101, Oxford-102-Flowers, Oxford-IIIT-Pets, and Stanford-Cars. This is part of the fast.ai datasets collection hosted by AWS for convenience of fast.ai students. See documentation link for citation and license details for each dataset.

Usage examples

Oxford-IIIT Pet Image Classification on Amazon SageMaker by AWS

See 1 usage example →

Longitudinal Nutrient Deficiency

aerial imageryagriculturecomputer visiondeep learningmachine learning

Dataset associated with the 2021 AAAI Paper- Detection and Prediction of Nutrient Deficiency Stress using Longitudinal Aerial Imagery. The dataset contains 3 image sequences of aerial imagery from 386 farm parcels which have been annotated for nutrient deficiency stress.

Usage examples

Detection and Prediction of Nutrient Deficiency Stress using Longitudinal Aerial Imagery by Saba Dadsetan, Gisele Rose, Naira Hovakimyan, Jennifer Hobbs

See 1 usage example →

MAN TruckScenes

autonomous vehiclescomputer visiondeep learningGPSIMUlidarlogisticsmachine learningobject detectionobject trackingperceptionradarroboticstransportation

A large scale multimodal dataset for Autonomous Trucking. Sensor data was recorded with a heavy truck from MAN equipped with 6 lidars, 6 radars, 4 cameras and a high-precision GNSS. MAN TruckScenes allows the research community to come into contact with truck-specific challenges, such as trailer occlusions, novel sensor perspectives, and terminal environments for the first time. It comprises more than 740 scenes of 20s each within a multitude of different environmental conditions. Bounding boxes are available for 27 object classes, 15 attributes, and a range of more than 230m. The scenes are t...

Usage examples

TruckScenes devkit tutorial by Felix Fent, Fabian Kuttenreich, Florian Ruch, Farija Rizwin
TruckScenes devkit by Felix Fent, Fabian Kuttenreich, Florian Ruch, Farija Rizwin
PyPi package by Felix Fent, Fabian Kuttenreich, Florian Ruch, Farija Rizwin
MANTruckScenes: A multimodal dataset for autonomous trucking in diverse conditions by Felix Fent, Fabian Kuttenreich, Florian Ruch, Farija Rizwin, et al

See 4 usage examples →

Mars Spectrometry 2: Gas Chromatography for the Sample Analysis at Mars Data (SAM) Instrument

analyticsarchivesdeep learningmachine learningNASA SMD AIplanetary

NASA missions like the Curiosity and Perseverance rovers carry a rich array of instruments suited to collect data and build evidence towards answering if Mars ever had livable environmental conditions. These rovers can collect rock and soil samples and can take measurements that can be used to determine their chemical makeup.

Because communication between rovers and Earth is severely constrained, with limited transfer rates and short daily communication windows, scientists have a limited time to analyze the data and make difficult inferences about the chemistry in order to prioritize the next operations and send those instructions back to the rover.

This project aimed at building a model to automatically analyze gas chromatography mass spectrometry (GCMS) data collected for Mars exploration in order to help the scientists in their analysis of understanding the past habitability of Mars.

More information are available at https://mars.nasa.gov/msl/spacecraft/instruments/sam/ and the data from Mars are available and described at https://pds-geosciences.wustl.edu/missions/msl/sam.htm.

We request that you cite th...

Usage examples

Mars Spectrometry: Mars Spectrometry 2 (Challenge Results) by DrivenData team and NASA partners

See 1 usage example →

Mars Spectrometry: Detect Evidence for Past Habitability

analyticsarchivesdeep learningmachine learningNASA SMD AIplanetary

NASA missions like the Curiosity and Perseverance rovers carry a rich array of instruments suited to collect data and build evidence towards answering if Mars ever had livable environmental conditions. These rovers can collect rock and soil samples and can take measurements that can be used to determine their chemical makeup.

Because communication between rovers and Earth is severely constrained, with limited transfer rates and short daily communication windows, scientists have a limited time to analyze the data and make difficult inferences about the chemistry in order to prioritize the next operations and send those instructions back to the rover.

This project aimed at building a model to automatically analyze evolved gas analysis mass spectrometry (EGA-MS) data collected for Mars exploration in order to help the scientists in their analysis of understanding the past habitability of Mars.

More information are available at https://mars.nasa.gov/msl/spacecraft/instruments/sam/ and the data from Mars are available and described at https://pds-geosciences.wustl.edu/missions/msl/sam.htm.

We request that you ci...

Usage examples

Mars Spectrometry: Detect Evidence for Past Habitability (Challenge Results) by DrivenData team and NASA partners

See 1 usage example →

Multi-robot, Multi-Sensor, Multi-Environment Event Dataset (M3ED)

autonomous vehiclescomputer visiondeep learningevent cameraglobal shutter cameraGNSSGPSh5hdf5IMUlidarmachine learningperceptionroboticsRTK

M3ED is the first multi-sensor event camera (EC) dataset focused on high-speed dynamic motions in robotics applications. M3ED provides high-quality synchronized data from multiple platforms (car, legged robot, UAV), operating in challenging conditions such as off-road trails, dense forests, and performing aggressive flight maneuvers. M3ED also covers demanding operational scenarios for EC, such as high egomotion and multiple independently moving objects. M3ED includes high-resolution stereo EC (1280×720), grayscale and RGB cameras, a high-quality IMU, a 64-beam LiDAR, and RTK localization.

Usage examples

M3ED: Multi-Robot, Multi-Sensor, Multi-Environment Event Dataset by Chaney K, Cladera F, et al.

See 1 usage example →

NYUMets Brain Dataset

biologycancercomputer visionhealthimage processingimaginglife sciencesmachine learningmagnetic resonance imagingmedical imagingmedicineneurobiologyneuroimagingsegmentation

This dataset contains 8,000+ brain MRIs of 2,000+ patients with brain metastases.

Usage examples

Longitudinal deep neural networks for assessing metastatic brain cancer on a massive open benchmark. by Link et al (2023)

See 1 usage example →

Orcasound - bioacoustic data for marine conservation

biodiversitybiologycoastalconservationdeep learningecosystemsenvironmentalgeospatiallabeledmachine learningmappingoceansopen source softwaresignal processing

Live-streamed and archived audio data (~2018-present) from underwater microphones (hydrophones) containing marine biological signals as well as ambient ocean noise. Hydrophone placement and passive acoustic monitoring effort prioritizes detection of orca sounds (calls, clicks, whistles) and potentially harmful noise. Geographic focus is on the US/Canada critical habitat of Southern Resident killer whales (northern CA to central BC) with initial focus on inland waters of WA. In addition to the raw lossy or lossless compressed data, we provide a growing archive of annotated bioacoustic bouts.

Usage examples

Github for our open source projects by Orcasound open source community

See 1 usage example →

Quoref

machine learningnatural language processing

24K Question/Answer (QA) pairs over 4.7K paragraphs, split between train (19K QAs), development (2.4K QAs) and a hidden test partition (2.5K QAs).

Usage examples

Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning by Pradeep Dasigi, Nelson F. Liu, Ana Marasović, Noah A. Smith, Matt Gardner

See 1 usage example →

RSNA Abdominal Trauma Detection (RSNA-ABT)

computed tomographycomputer visioncsvlabeledlife sciencesmachine learningmedical image computingmedical imagingradiologyx-ray tomography

Blunt force abdominal trauma is among the most common types of traumatic injury, with the most frequent cause being motor vehicle accidents. Abdominal trauma may result in damage and internal bleeding of the internal organs, including the liver, spleen, kidneys, and bowel. Detection and classification of injuries are key to effective treatment and favorable outcomes. A large proportion of patients with abdominal trauma require urgent surgery. Abdominal trauma often cannot be diagnosed clinically by physical exam, patient symptoms, or laboratory tests. Prompt diagnosis of abdominal trauma using...

Usage examples

The RSNA Abdominal Traumatic Injury CT (RATIC) Dataset by Rudie, Jeffrey D.

See 1 usage example →

RSNA Cervical Spine Fracture Detection (RSNA-CSF) Dataset

computed tomographycomputer visioncsvlabeledlife sciencesmachine learningmedical image computingmedical imagingradiologyx-ray tomography

Over 1.5 million spine fractures occur annually in the United States alone resulting in over 17,730 spinal cord injuries annually. The most common site of spine fracture is the cervical spine. There has been a rise in the incidence of spinal fractures in the elderly and in this population, fractures can be more difficult to detect on imaging due to degenerative disease and osteoporosis. Imaging diagnosis of adult spine fractures is now almost exclusively performed with computed tomography (CT). Quickly detecting and determining the location of any vertebral fractures is essential to prevent ne...

Usage examples

The RSNA Cervical Spine Fracture CT Dataset by Ming, Hui Lin

See 1 usage example →

RSNA Intracranial Hemorrhage Detection

computed tomographycomputer visioncsvlabeledlife sciencesmachine learningmedical image computingmedical imagingradiologyx-ray tomography

RSNA assembled this dataset in 2019 for the RSNA Intracranial Hemorrhage Detection AI Challenge (https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection/). De-identified head CT studies were provided by four research institutions. A group of over 60 volunteer expert radiologists recruited by RSNA and the American Society of Neuroradiology labeled over 25,000 exams for the presence and subtype classification of acute intracranial hemorrhage.

Usage examples

Construction of a Machine Learning Dataset through Collaboration: The RSNA 2019 Brain CT Hemorrhage Challenge by Rudie, Jeffrey D.

See 1 usage example →

RSNA Pulmonary Embolism Detection

computed tomographycomputer visioncsvlabeledlife sciencesmachine learningmedical image computingmedical imagingradiologyx-ray tomography

RSNA assembled this dataset in 2020 for the RSNA STR Pulmonary Embolism Detection AI Challenge (https://www.kaggle.com/c/rsna-str-pulmonary-embolism-detection/). With more than 12,000 CT pulmonary angiography (CTPA) studies contributed by five international research centers, it is the largest publicly available annotated PE dataset. RSNA collaborated with the Society of Thoracic Radiology to recruit more than 80 expert thoracic radiologists who labeled the dataset with detailed clinical annotations.

Usage examples

The RSNA Pulmonary Embolism CT Dataset by Colak, Errol

See 1 usage example →

Reasoning Over Paragraph Effects in Situations (ROPES)

jsonmachine learningnatural language processing

14k QA pairs over 1.7K paragraphs, split between train (10k QAs), development (1.6k QAs) and a hidden test partition (1.7k QAs).

Usage examples

Reasoning Over Paragraph Effects in Situations by Kevin Lin, Oyvind Tafjord, Peter Clark, Matt Gardner

See 1 usage example →

Voices Obscured in Complex Environmental Settings (VOiCES)

automatic speech recognitiondenoisingmachine learningspeaker identificationspeech processing

VOiCES is a speech corpus recorded in acoustically challenging settings, using distant microphone recording. Speech was recorded in real rooms with various acoustic features (reverb, echo, HVAC systems, outside noise, etc.). Adversarial noise, either television, music, or babble, was concurrently played with clean speech. Data was recorded using multiple microphones strategically placed throughout the room. The corpus includes audio recordings, orthographic transcriptions, and speaker labels.

Usage examples

Getting started with VOiCES data by M.A. Barrios

See 1 usage example →

AI2 TabMCQ: Multiple Choice Questions aligned with the Aristo Tablestore

machine learningnatural language processing

9092 crowd-sourced science questions and 68 tables of curated facts

AI2 Tablestore (November 2015 Snapshot)

machine learningnatural language processing

68 tables of curated facts

Aristo Mini Corpus

csvjsonmachine learning

1,197,377 science-relevant sentences

Aristo Tuple KB

machine learningnatural language processing

294,000 science-relevant tuples

COCO - Common Objects in Context - fast.ai datasets

computer visiondeep learningmachine learning

COCO is a large-scale object detection, segmentation, and captioning dataset. This is part of the fast.ai datasets collection hosted by AWS for convenience of fast.ai students. If you use this dataset in your research please cite arXiv:1405.0312 [cs.CV].

CZ Grand Challenges - Imaging BSD licensed data and models

biodiversitybioinformaticsbiologybiomolecular modelingbrain imagescell biologycell imagingcziimaginglife sciencesmachine learningmicroscopymodelproteinzarr

This dataset contains a diverse range of imaging biological data and models. The data is sourced and curated by a team of experts at CZI and is made available as part of these datasets only when it is not publicly accessible or requires transformations to support model training.

Usage examples

Quickstart Tutorial for Cytoland by CZI
Documentation for Cytoland by CZI
Cytoland: robust virtual staining of landmark organelles by Liu, Hirata-Miyasaki, et al.

See 3 usage examples →

CZ Grand Challenges - Model Benchmarking

benchmarkbiologybiomolecular modelingcell biologyczilife sciencesmachine learningmodel

This dataset includes data and models relevant to benchmarking multimodal biological models. The data has been sourced and curated by a team of experts at CZI and is provided as part of these datasets only when it is not publicly available or requires transformation to support effective model benchmarking.

Usage examples

The molecular evolution of spermatogenesis across mammals by Murat, F., et al.
Tabula Sapiens reveals transcription factor expression, senescence effects, and sex-specific features in cell types from 28 human organs and tissues by Tabula Sapiens Consortium et al.
Evaluating SubCell and Related Imaging Models by CZI

See 3 usage examples →

Cloud to Street - Microsoft Flood and Clouds Dataset

cogcomputer visiondeep learningearth observationfloodsgeospatialmachine learningsatellite imagerysynthetic aperture radar

This dataset consists of chips of Sentinel-1 and Sentinel-2 satellite data. Each Sentinel-1 chip contains a corresponding label for water and each Sentinel-2 chip contains a corresponding label for water and clouds. Data is stored in folders by a unique event identifier as the folder name. Within each event folder there are subfolders for Sentinel-1 (s1) and Sentinel-2 (s2) data. Each chip is contained in its own sub-folder with the folder name being the source image id, followed by a unique chip identifier consisting of a hyphenated set of 5 numbers. All bands of the satellite data, as well a...

DARPA Invisible Headlights Dataset

autonomous vehiclesbroadbandcomputer visionlidarmachine learningsegmentationus

"The DARPA Invisible Headlights Dataset is a large-scale multi-sensor dataset annotated for autonomous, off-road navigation in challenging off-road environments. It features simultaneously collected off-road imagery from multispectral, hyperspectral, polarimetric, and broadband sensors spanning wave-lengths from the visible spectrum to long-wave infrared and provides aligned LIDAR data for ground-truth shape. Camera calibrations, LiDAR registrations, and traversability annotations for a subset of the data are available."

Gretel Synthetic Safety Alignment Dataset

ai safetymachine learningnatural language processingsynthetic data

A comprehensive dataset designed for aligning language models with safety and ethical guidelines. Contains 8,361 curated triplets of prompts, responses, and safe responses across various risk categories. Each entry includes safety scores, judge reasoning, and harm probability assessments, making it valuable for model alignment, testing, and benchmarking.

Usage examples

See 3 usage examples →

Image localization - fast.ai datasets

computer visiondeep learningmachine learning

Some of the most important datasets for image localization research, including Camvid and PASCAL VOC (2007 and 2012). This is part of the fast.ai datasets collection hosted by AWS for convenience of fast.ai students. See documentation link for citation and license details for each dataset.

KITTI Vision Benchmark Suite

autonomous vehiclescomputer visiondeep learningmachine learningrobotics

Dataset and benchmarks for computer vision research in the context of autonomous driving. The dataset has been recorded in and around the city of Karlsruhe, Germany using the mobile platform AnnieWay (VW station wagon) which has been equipped with several RGB and monochrome cameras, a Velodyne HDL 64 laser scanner as well as an accurate RTK corrected GPS/IMU localization unit. The dataset has been created for computer vision and machine learning research on stereo, optical flow, visual odometry, semantic segmentation, semantic instance segmentation, road segmentation, single image depth predic...

Multimedia Commons

computer visionmachine learningmultimediavideo

The Multimedia Commons is a collection of audio and visual features computed for the nearly 100 million Creative Commons-licensed Flickr images and videos in the YFCC100M dataset from Yahoo! Labs, along with ground-truth annotations for selected subsets. The International Computer Science Institute (ICSI) and Lawrence Livermore National Laboratory are producing and distributing a core set of derived feature sets and annotations as part of an effort to enable large-scale video search capabilities. They have released this feature corpus into the public domain, under Creative Commons License 0, s...

NLP - fast.ai datasets

deep learningmachine learningnatural language processing

Some of the most important datasets for NLP, with a focus on classification, including IMDb, AG-News, Amazon Reviews (polarity and full), Yelp Reviews (polarity and full), Dbpedia, Sogou News (Pinyin), Yahoo Answers, Wikitext 2 and Wikitext 103, and ACL-2010 French-English 10^9 corpus. This is part of the fast.ai datasets collection hosted by AWS for convenience of fast.ai students. See documentation link for citation and license details for each dataset.

Natural Scenes Dataset

computer visionimage processingimaginglife sciencesmachine learningmagnetic resonance imagingneuroimagingneurosciencenifti

Here, we collected and pre-processed a massive, high-quality 7T fMRI dataset that can be used to advance our understanding of how the brain works. A unique feature of this dataset is the massive amount of data available per individual subject. The data were acquired using ultra-high-field fMRI (7T, whole-brain, 1.8-mm resolution, 1.6-s TR). We measured fMRI responses while each of 8 participants viewed 9,000–10,000 distinct, color natural scenes (22,500–30,000 trials) in 30–40 weekly scan sessions over the course of a year. Additional measures were collected including resting-state data, retin...

Open Food Facts Images

image processingmachine learning

A dataset of all images of Open Food Facts, the biggest open dataset of food products in the world.

RSNA Screening Mammography Breast Cancer Detection (RSNA-SMBC) Dataset

breast cancercancercomputer visioncsvlabeledlife sciencesmachine learningmammographymedical image computingmedical imagingradiology

According to the WHO, breast cancer is the most commonly occurring cancer worldwide. In 2020 alone, there were 2.3 million new breast cancer diagnoses and 685,000 deaths. Yet breast cancer mortality in high-income countries has dropped by 40% since the 1980s when health authorities implemented regular mammography screening in age groups considered at risk. Early detection and treatment are critical to reducing cancer fatalities, and your machine learning skills could help streamline the process radiologists use to evaluate screening mammograms. Currently, early detection of breast cancer requi...

Textbook Question Answering (TQA)

machine learning

1,076 textbook lessons, 26,260 questions, 6229 images

The Massively Multilingual Image Dataset (MMID)

computer visionmachine learningmachine translationnatural language processing

MMID is a large-scale, massively multilingual dataset of images paired with the words they represent collected at the University of Pennsylvania. The dataset is doubly parallel: for each language, words are stored parallel to images that represent the word, and parallel to the word's translation into English (and corresponding images.)

ZEST: ZEroShot learning from Task descriptions

machine learningnatural language processing

ZEST is a benchmark for zero-shot generalization to unseen NLP tasks, with 25K labeled instances across 1,251 different tasks.

Amazon Bin Image Dataset

amazon.sciencecomputer visionmachine learning

The Amazon Bin Image Dataset contains over 500,000 images and metadata from bins of a pod in an operating Amazon Fulfillment Center. The bin images in this dataset are captured as robot units carry pods as part of normal Amazon Fulfillment Center operations.

Usage examples

See 2 usage examples →

YouTube 8 Million - Data Lakehouse Ready

amazon.sciencecomputer visionlabeledmachine learningparquetvideo

This both the original .tfrecords and a Parquet representation of the YouTube 8 Million dataset. YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities. It comes with precomputed audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This dataset also includes the YouTube-8M Segments data from June 2019. This dataset is 'Lakehouse Ready'. Meaning, you can query this data in-place straight out of...

Usage examples

Data Lake as Code Deployment Guide by AWS Industry Blueprints Team
YouTube 8 Million by Google Research

See 2 usage examples →

Amazon-PQA

amazon.sciencemachine learningnatural language processing

Amazon product questions and their answers, along with the public product information.

Usage examples

Answering Product-Questions by Utilizing Questions from Other Contextually Similar Products by Ohad Rozen, David Carmel, Avihai Mejer, Vitaly Mirkis, and Yftah Ziser

See 1 usage example →

Answer Reformulation

amazon.sciencemachine learningnatural language processing

Original StackExchange answers and their voice-friendly Reformulation.

Usage examples

Voice-based Reformulation of Community Answers by Simone Filice, Nachshon Cohen & David Carmel

See 1 usage example →

Automatic Speech Recognition (ASR) Error Robustness

amazon.sciencedeep learningmachine learningnatural language processingspeech recognition

Sentence classification datatasets with ASR Errors.

Usage examples

Using Phoneme Representations to Build Predictive Models Robust to ASR Errors by Anjie Fang, Simone Filice, Nut Limsopatham and Oleg Rokhlenko

See 1 usage example →

DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue

amazon.scienceconversation datamachine learningnatural language processing

This bucket contains the checkpoints used to reproduce the baseline results reported in the DialoGLUE benchmark hosted on EvalAI (https://evalai.cloudcv.org/web/challenges/challenge-page/708/overview). The associated scripts for using the checkpoints are located here: https://github.com/alexa/dialoglue. The associated paper describing the benchmark and checkpoints is here: https://arxiv.org/abs/2009.13570. The provided checkpoints include the CONVBERT model, a BERT-esque model trained on a large open-domain conversational dataset. It also includes the CONVBERT-DG and BERT-DG checkpoints descri...

Usage examples

DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue by Shikib Mehri, Mihail Eric, Dilek Hakkani-Tur

See 1 usage example →

Enriched Topical-Chat Dataset for Knowledge-Grounded Dialogue Systems

amazon.scienceconversation datamachine learningnatural language processing

This dataset provides extra annotations on top of the publicly released Topical-Chat dataset(https://github.com/alexa/Topical-Chat) which will help in reproducing the results in our paper "Policy-Driven Neural Response Generation for Knowledge-Grounded Dialogue Systems" (https://arxiv.org/abs/2005.12529?context=cs.CL). The dataset contains 5 files: train.json, valid_freq.json, valid_rare.json, test_freq.json and test_rare.json. Each of these files will have additional annotations on top of the original Topical-Chat dataset. These specific annotations are: dialogue act annotations a...

Usage examples

Policy-Driven Neural Response Generation for Knowledge-Grounded Dialogue Systems by Behnam Hedayatnia, Karthik Gopalakrishnan, Seokhwan Kim, Yang Liu, Mihail Eric & Dilek Hakkani-Tur

See 1 usage example →

Humor Detection from Product Question Answering Systems

amazon.sciencemachine learningnatural language processing

This dataset provides labeled humor detection from product question answering systems. The dataset contains 3 csv files: Humorous.csv containing the humorous product questions, Non-humorous-unbiased.csv containing the non-humorous prodcut questions from the same products as the humorous one, and, Details →

Usage examples

Humor Detection in Product Question Answering Systems. by Yftah Ziser, Elad Kravi & David Carmel

See 1 usage example →

Humor patterns used for querying Alexa traffic

amazon.sciencedialogmachine learningnatural language processing

Humor patterns used for querying Alexa traffic when creating the taxonomy described in the paper "“Alexa, Do You Want to Build a Snowman?” Characterizing Playful Requests to Conversational Agents" by Shani C., Libov A., Tolmach S., Lewin-Eytan L., Maarek Y., and Shahaf D. (CHI LBW 2022). These patterns corrospond to the researchers' hypotheses regarding what humor types are likely to appear in Alexa traffic. These patterns were used for querying Alexa traffic to evaluate these hypotheses.

Usage examples

“Alexa, Do You Want to Build a Snowman?” Characterizing Playful Requests to Conversational Agents by Shani C., Libov A., Tolmach S., Lewin-Eytan L., Maarek Y., and Shahaf D.

See 1 usage example →

Learning to Rank and Filter - community question answering

amazon.sciencemachine learningnatural language processing

This dataset provides product related questions and answers, including answers' quality labels, as as part of the paper 'IR Evaluation and Learning in the Presence of Forbidden Documents'.

Usage examples

IR Evaluation and Learning in the Presence of Forbidden Documents by David Carmel, Nachshon Cohen, Amir Ingber & Elad Kravi

See 1 usage example →

Multi Token Completion

amazon.sciencemachine learningnatural language processing

This dataset provides masked sentences and multi-token phrases that were masked-out of these sentences. We offer 3 datasets: a general purpose dataset extracted from the Wikipedia and Books corpora, and 2 additional datasets extracted from pubmed abstracts. As for the pubmed data, please be aware that the dataset does not reflect the most current/accurate data available from NLM (it is not being updated). For these datasets, the columns provided for each datapoint are as follows: text- the original sentence span- the span (phrase) which is masked out span_lower- the lowercase version of span r...

Usage examples

Simple and Effective Multi-Token Completion from Masked Language Models by Oren Kalinsky, Guy Kushilevitz, Alex Libov & Yoav Goldberg

See 1 usage example →

Pre- and post-purchase product questions

amazon.sciencemachine learningnatural language processing

This dataset provides product related questions, including their textual content and gap, in hours, between purchase and posting time. Each question is also associated with related product details, including its id and title.

Usage examples

"Did you buy it already?", Detecting Users Purchase-State From Their Product-Related Questions by Lital Kuchy, David Carmel, Thomas Huet & Elad Kravi

See 1 usage example →

Product Comparison Dataset for Online Shopping

amazon.sciencemachine learningnatural language processingonline shoppingproduct comparison

The Product Comparison dataset for online shopping is a new, manually annotated dataset with about 15K human generated sentences, which compare related products based on one or more of their attributes (the first such data we know of for product comparison). It covers ∼8K product sets, their selected attributes, and comparison texts.

Usage examples

Generating Explainable Product Comparisons for Online Shopping by Nikhita Vedula, Marcus Collins, Eugene Agichtein and Oleg Rokhlenko

See 1 usage example →

PyEnvs and CallArgs

code completionmachine learning

PyEnvs is a collection of 2814 permissively licensed Python packages along with their isolated development environments. Paired with a program analyzer (e.g. Jedi Language Server), it supports querying for project-related information. CallArgs is a dataset built on top of PyEnvs for function call argument completion. It provides function definition, implementation, and usage information for each function call instance.

Usage examples

Better Context Makes Better Code Language Models: A Case Study on Function Call Argument Completion by Hengzhi Pei, Jinman Zhao, Leonard Lausen, Sheng Zha, George Karypis

See 1 usage example →

VoiSeR

amazon.scienceinformation retrievalmachine learningnatural language processing

Voice-based refinements of product search

Usage examples

VoiSeR: A New Benchmark for Voice-Based Search Refinement by Simone Filice, Giuseppe Castellucci, Marcus Collins, Eugene Agichtein & Oleg Rokhlenko

See 1 usage example →

WikiSum: Coherent Summarization Dataset for Efficient Human-Evaluation

amazon.sciencemachine learningnatural language processing

This dataset provides how-to articles from wikihow.com and their summaries, written as a coherent paragraph. The dataset itself is available at wikisum.zip, and contains the article, the summary, the wikihow url, and an official fold (train, val, or test). In addition, human evaluation results are available at wikisum-human-eval...

Usage examples

WikiSum: Coherent Summarization Dataset for Efficient Human-Evaluation by Nachshon Cohen, Oren Kalinsky, Yftah Ziser & Alessandro Moschitti

See 1 usage example →

Wizard of Tasks

amazon.scienceconversation datadialogmachine learningnatural language processing

Wizard of Tasks (WoT) is a dataset containing conversations for Conversational Task Assistants (CTAs). A CTA is a conversational agent whose goal is to help humans to perform real-world tasks. A CTA can help in exploring available tasks, answering task-specific questions and guiding users through step-by-step instructions. WoT contains about 550 conversations with ~18,000 utterances in two domains, i.e., Cooking and Home Improvement.

Usage examples

Wizard of Tasks: A Novel Conversational Dataset for Solving Real-World Tasks in Conversational Settings by Jason Ingyu Choi, Saar Kuzi, Nikhita Vedula, Jie Zhao, Giuseppe Castellucci, Marcus Collins, Shervin Malmasi, Oleg Rokhlenko and Eugene Agichtein

See 1 usage example →

Airborne Object Tracking Dataset

amazon.sciencecomputer visiondeep learningmachine learning

Airborne Object Tracking (AOT) is a collection of 4,943 flight sequences of around 120 seconds each, collected at 10 Hz in diverse conditions. There are 5.9M+ images and 3.3M+ 2D annotations of airborne objects in the sequences. There are 3,306,350 frames without labels as they contain no airborne objects. For images with labels, there are on average 1.3 labels per image. All airborne objects in the dataset are labelled.

Amazon Berkeley Objects Dataset

amazon.sciencecomputer visiondeep learninginformation retrievalmachine learningmachine translation

Amazon Berkeley Objects (ABO) is a collection of 147,702 product listings with multilingual metadata and 398,212 unique catalog images. 8,222 listings come with turntable photography (also referred as "spin" or "360º-View" images), as sequences of 24 or 72 images, for a total of 586,584 images in 8,209 unique sequences. For 7,953 products, the collection also provides high-quality 3d models, as glTF 2.0 files.

Amazon Seller Contact Intent Sequence

amazon.scienceHawkes Processmachine learningtemporal point process

When sellers need help from Amazon, such as how to create a listing, they often reach out to Amazon seller support through email, chat or phone. For each contact, we assign an intent so that we can manage the request more easily. The data we present in this release includes 548k contacts with 118 intents from 70k sellers sampled from recent years. There are 3 columns. 1. De-identified seller id - seller_id_anon; 2. Noisy inter-arrival time in the unit of hour between contacts - interarrival_time_hr_noisy; 3. An integer that represents the contact intent - contact_intent. Note that, to balance ...

FashionLocalTriplets

amazon.sciencecomputer visionmachine learning

Fine-grained localized visual similarity and search for fashion.

TSBench

benchmarkdeep learningmachine learningmeta learningtime series forecasting

TSBench comprises thousands of benchmark evaluations for time series forecasting methods. It provides various metrics (i.e. measures of accuracy, latency, number of model parameters, ...) of 13 time series forecasting methods across 44 heterogeneous datasets. Time series forecasting methods include both classical and deep learning methods while several hyperparameters settings are evaluated for the deep learning methods.In addition to the tabular data providing the metrics, TSBench includes the probabilistic forecasts of all evaluated methods for all 44 datasets. While the tabular data is smal...