This registry exists to help people discover and share datasets that are available via AWS resources. See recent additions and learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry tagged with machine learning.
You are currently viewing a subset of data tagged with machine learning.
If you want to add a dataset or example of how to use a dataset to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository.
Unless specifically stated in the applicable dataset documentation, datasets available through the Registry of Open Data on AWS are not provided and maintained by AWS. Datasets are provided and maintained by a variety of third parties under a variety of licenses. Please check dataset licenses and related documentation to determine if a dataset may be used for your application.
If you have a project using a listed dataset, please tell us about it. We may work with you to feature your project in a blog post.
bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience
The Human Sleep Project (HSP) sleep physiology dataset is a growing collection of clinical polysomnography (PSG) recordings. Beginning with PSG recordings from ~15K patients evaluated at the Massachusetts General Hospital, the HSP will grow over the coming years to include data from >200K patients, as well as people evaluated outside of the clinical setting. This data is being used to develop CAISR (Complete AI Sleep Report), a collection of deep neural networks, rule-based algorithms, and signal processing approaches designed to provide better-than-human detection of conventional PSG...
bam, bioinformatics, biology, cram, genetic, genomic, genotyping, life sciences, machine learning, population genetics, short read sequencing, structural variation, tertiary analysis, variant annotation, whole genome sequencing
This dataset contains alignment files and short nucleotide, copy number (CNV), repeat expansion (STR), structural variant (SV) and other variant call files from the 1000 Genomes Project Phase 3 dataset (n=3202) using Illumina DRAGEN v3.5.7b, v3.7.6, v4.0.3, and v4.2.7 software. All DRAGEN analyses were performed in the cloud using the Illumina Connected Analytics bioinformatics platform powered by Amazon Web Services (see 'Data solution empowering population genomics' for more information). The v3.7.6 and v4.2.7 datasets include results from trio small variant, de novo structural vari...
biology, cell biology, cell imaging, Homo sapiens, image processing, life sciences, machine learning, microscopy
This bucket contains multiple datasets (as Quilt packages) created by the Allen Institute for Cell Science. The types of data included in this bucket are listed below:
agriculture, air quality, analytics, archives, atmosphere, climate, climate model, data assimilation, deep learning, earth observation, energy, environmental, forecast, geoscience, geospatial, global, history, imaging, industry, machine learning, machine translation, metadata, meteorological, model, netcdf, opendap, radiation, satellite imagery, solar, statistics, sustainability, time series forecasting, water, weather, zarr
NASA's goal in Earth science is to observe, understand, and model the Earth system to discover how it is changing, to better predict change, and to understand the consequences for life on Earth. The Applied Sciences Program, within the Earth Science Division of the NASA Science Mission Directorate, serves individuals and organizations around the globe by expanding and accelerating societal and economic benefits derived from Earth science, information, and technology research and development.
The Prediction Of Worldwide Energy Resources (POWER) Project, funded through the Applied Sciences Program at NASA Langley Research Center, gathers NASA Earth observation data and parameters related to the fields of surface solar irradiance and meteorology to serve the public in several free, easy-to-access and easy-to-use methods. POWER helps communities become resilient amid observed climate variability by improving data accessibility, aiding research in energy development, building energy efficiency, and supporting agriculture projects.
The POWER project contains over 380 satellite-derived meteorology and solar energy Analysis Ready Data (ARD) parameters at four temporal levels: hourly, daily, monthly, and climatology. The POWER data archive provides data at the native resolution of the source products. The data is updated nightly to maintain near real time availability (2-3 days for meteorological parameters and 5-7 days for solar). The POWER services catalog consists of a series of RESTful Application Programming Interfaces, geospatially enabled image services, and a web mapping Data Access Viewer. These three service offerings support data discovery, access, and distribution to the project’s user base as ARD and as direct application inputs to decision support tools.
The latest data version update includes hourly...
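As a sketch of how the RESTful APIs can be queried, the following Python example requests daily values for a single point; the endpoint path follows the public POWER documentation, and the specific parameters, coordinates, and date range are illustrative only:

```python
import requests

# Daily point query against the POWER API. The parameter names, community,
# coordinates, and date range below are illustrative; see the POWER docs
# for the full parameter catalog.
url = "https://power.larc.nasa.gov/api/temporal/daily/point"
params = {
    "parameters": "T2M,ALLSKY_SFC_SW_DWN",  # 2 m temperature, solar irradiance
    "community": "RE",                      # Renewable Energy community
    "latitude": 36.0,
    "longitude": -76.0,
    "start": "20230101",
    "end": "20230131",
    "format": "JSON",
}
resp = requests.get(url, params=params, timeout=60)
resp.raise_for_status()

# Responses are GeoJSON-like; daily values are keyed by date string.
t2m = resp.json()["properties"]["parameter"]["T2M"]
print(t2m)
```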
bioinformatics, biology, cancer, cell biology, cell imaging, cell painting, chemical biology, computer vision, csv, deep learning, fluorescence imaging, genetic, high-throughput imaging, image processing, image-based profiling, imaging, life sciences, machine learning, medicine, microscopy, organelle
The Cell Painting Gallery is a collection of image datasets created using the Cell Painting assay. The images of cells are captured by microscopy imaging, and reveal the response of various labeled cell components to whatever treatments are tested, which can include genetic perturbations, chemicals or drugs, or different cell types. The datasets can be used for diverse applications in basic biology and pharmaceutical research, such as identifying disease-associated phenotypes, understanding disease mechanisms, and predicting a drug’s activity, toxicity, or mechanism of action (Chandrasekaran et al 2020). This collection is maintained by the Carpenter–Singh lab and the Cimini lab at the Broad Institute. A human-friendly listing of datasets, instructions for accessing them, and other documentation is at the corresponding GitHub page abou...
agriculture, cog, disaster response, earth observation, geospatial, land cover, land use, machine learning, mapping, natural resource, satellite imagery, stac, sustainability, synthetic aperture radar
The European Space Agency (ESA) WorldCover product provides global land cover maps for 2020 & 2021 at 10 m resolution based on Copernicus Sentinel-1 and Sentinel-2 data. The WorldCover product comes with 11 land cover classes and has been generated in the framework of the ESA WorldCover project, part of the 5th Earth Observation Envelope Programme (EOEP-5) of the European Space Agency. A first version of the product (v100), containing the 2020 map was released in October 2021. The 2021 map was released in October 2022 using an improved algorithm (v200). The WorldCover 2020 and 2021 maps we...
computer vision, disaster response, earth observation, geospatial, machine learning, satellite imagery
SpaceNet launched in August 2016 as an open innovation project offering a repository of freely available imagery with co-registered map features. Before SpaceNet, computer vision researchers had minimal options to obtain free, precision-labeled, and high-resolution satellite imagery. Today, SpaceNet hosts datasets developed by its own team, along with datasets from projects like IARPA’s Functional Map of the World (fMoW).
amazon.science, analytics, deep learning, geospatial, last mile, logistics, machine learning, optimization, routing, transportation, urban
The 2021 Amazon Last Mile Routing Research Challenge was an innovative research initiative led by Amazon.com and supported by the Massachusetts Institute of Technology’s Center for Transportation and Logistics. Over a period of 4 months, participants were challenged to develop innovative machine learning-based methods to enhance classic optimization-based approaches to solve the travelling salesperson problem, by learning from historical routes executed by Amazon delivery drivers. The primary goal of the Amazon Last Mile Routing Research Challenge was to foster innovative applied research in r...
aerial imagery, coastal, computer vision, disaster response, earth observation, earthquakes, geospatial, image processing, imaging, infrastructure, land, machine learning, mapping, natural resource, seismology, transportation, urban, water
The Low Altitude Disaster Imagery (LADI) Dataset consists of human and machine annotated airborne images collected by the Civil Air Patrol in support of various disaster responses from 2015-2023. Two key distinctions are the low altitude, oblique perspective of the imagery and disaster-related features, which are rarely featured in computer vision benchmarks and datasets.
cog, earth observation, environmental, geospatial, labeled, machine learning, satellite imagery, stac
Radiant MLHub is an open library for geospatial training data that hosts datasets generated by Radiant Earth Foundation's team as well as other training data catalogs contributed by Radiant Earth’s partners. Radiant MLHub is open to anyone to access, store, register and/or share their training datasets for high-quality Earth observations. All of the training datasets are stored using a SpatioTemporal Asset Catalog (STAC) compliant catalog and exposed through a common API. Training datasets include pairs of imagery and labels for different types of machine learning problems including image ...
chemistry, cloud computing, data assimilation, digital assets, digital preservation, energy, environmental, free software, genome, HPC, information retrieval, infrastructure, json, machine learning, materials science, molecular dynamics, molecule, open source software, physics, post-processing, x-ray crystallography
Materials Project is an open database of computed materials properties aiming to accelerate materials science research. The resources in this OpenData dataset contain the raw, parsed, and build data products.
cog, earth observation, environmental, geospatial, land cover, land use, machine learning, mapping, planetary, satellite imagery, stac, sustainability
This dataset, produced by Impact Observatory, Microsoft, and Esri, displays a global map of land use and land cover (LULC) derived from ESA Sentinel-2 imagery at 10 meter resolution for the years 2017 - 2023. Each map is a composite of LULC predictions for 9 classes throughout the year in order to generate a representative snapshot of each year. This dataset was generated by Impact Observatory, which used billions of human-labeled pixels (curated by the National Geographic Society) to train a deep learning model for land classification. Each global map was produced by applying this model to ...
acoustics, biodiversity, biology, climate, coastal, deep learning, ecosystems, environmental, machine learning, marine mammals, oceans, open source software
This project offers passive acoustic data (sound recordings) from a deep-ocean environment off central California. Recording began in July 2015, has been nearly continuous, and is ongoing. These resources are intended for applications in ocean soundscape research, education, and the arts.
computer vision, deep learning, earth observation, geospatial, labeled, machine learning, satellite imagery
RarePlanes is a unique open-source machine learning dataset from CosmiQ Works and AI.Reverie that incorporates both real and synthetically generated satellite imagery. The RarePlanes dataset specifically focuses on the value of AI.Reverie synthetic data to aid computer vision algorithms in their ability to automatically detect aircraft and their attributes in satellite imagery. Although other synthetic/real combination datasets exist, RarePlanes is the largest openly-available very high resolution dataset built to test the value of synthetic data from an overhead perspective. The real portion ...
machine learning, NASA SMD AI
The v1 dataset includes AIA/HMI observations from 2010-2018 and v2 includes AIA/HMI observations from 2010-2020 in all 10 wavebands (94A, 131A, 171A, 193A, 211A, 304A, 335A, 1600A, 1700A, 4500A), with 512x512 resolution and 6-minute cadence; HMI vector magnetic field observations in Bx, By, and Bz components, with 512x512 resolution and 12-minute cadence; and EVE observations in 39 wavelengths from 2010-05-01 to 2014-05-26, with 10-second cadence.
agriculture, cog, disaster response, earth observation, geospatial, land cover, land use, machine learning, mapping, natural resource, satellite imagery, stac, sustainability, synthetic aperture radar
The WorldCover 10m Annual Composites were produced, as part of the European Space Agency (ESA) WorldCover project, from the yearly Copernicus Sentinel-1 and Sentinel-2 archives for both years 2020 and 2021. These global mosaics consist of four composite products: a Sentinel-2 RGBNIR yearly median composite for bands B02, B03, B04, and B08; a Sentinel-2 SWIR yearly median composite for bands B11 and B12; a Sentinel-2 NDVI yearly percentiles composite (NDVI 90th, 50th, and 10th percentiles); and a Sentinel-1 GAMMA0 yearly median composite for bands VV, VH and VH/VV (power scaled). Each product is...
agriculture, cog, deep learning, labeled, land cover, machine learning, satellite imagery
High resolution, annual cropland and landcover maps for selected African countries developed by Clark University's Agricultural Impacts Research Group using various machine learning approaches applied to Planet imagery, including field boundary and cultivated frequency maps, as well as multi-class land cover.
cancer, classification, computational pathology, computer vision, deep learning, digital pathology, grand-challenge.org, histopathology, imaging, life sciences, machine learning, medical image computing, medical imaging
This dataset contains the training data for the Machine learning for Optimal detection of iNflammatory cells in the KidnEY or MONKEY challenge. The MONKEY challenge focuses on the automated detection and classification of inflammatory cells, specifically monocytes and lymphocytes, in kidney transplant biopsies using Periodic acid-Schiff (PAS) stained whole-slide images (WSI). It contains 80 WSI, collected from 4 different pathology institutes, with annotated regions of interest. For each WSI up to 3 different PAS scans and one IHC slide scan are available. This dataset and challenge support th...
agriculture, cog, labeled, land cover, machine learning, satellite imagery
Crop field boundaries digitized in Planet imagery collected across Africa between 2017 and 2023, developed by Farmerline, Spatial Collective, and the Agricultural Impacts Research Group at Clark University, with support from the Lacuna Fund (Estes et al., 2024).
aerial imagery, agriculture, climate, cog, earth observation, geospatial, image processing, land cover, machine learning, satellite imagery
Global and regional Canopy Height Maps (CHM), created using machine learning models on high-resolution worldwide Maxar satellite imagery.
biology, cell biology, cell imaging, computer vision, fluorescence imaging, imaging, life sciences, machine learning, microscopy
The OpenCell project is a proteome-scale effort to measure the localization and interactions of human proteins using high-throughput genome engineering to endogenously tag thousands of proteins in the human proteome. This dataset consists of the raw confocal fluorescence microscopy images for all tagged cell lines in the OpenCell library. These images can be interpreted both individually, to determine the localization of particular proteins of interest, and in aggregate, by training machine learning models to classify or quantify subcellular localization patterns.
agriculture, cog, earth observation, geospatial, machine learning, natural resource, satellite imagery
The Sentinel-2 L2A 120m mosaic is a derived product containing best-pixel values for 10-day periods, modelled by removing cloudy pixels and then interpolating among the remaining values. Because some parts of the world have lengthy cloudy periods, clouds may remain in some areas. The actual modelling script is available here.
agriculture, analytics, biodiversity, conservation, deep learning, food security, geospatial, machine learning, satellite imagery
iSDAsoil is a resource containing soil property predictions for the entire African continent, generated using machine learning. Maps for over 20 different soil properties have been created at 2 different depths (0-20 and 20-50 cm). Soil property predictions were made using machine learning coupled with remote sensing data and a training set of over 100,000 analyzed soil samples. Included in this dataset are images of predicted soil properties, model error and satellite covariates used in the mapping process.
biology, cancer, computer vision, gene expression, genetic, glioblastoma, Homo sapiens, image processing, imaging, life sciences, machine learning, neurobiology
This dataset consists of images of glioblastoma human brain tumor tissue sections that have been probed for expression of particular genes believed to play a role in development of the cancer. Each tissue section is adjacent to another section that was stained with a reagent useful for identifying histological features of the tumor. Each of these types of images has been completely annotated for tumor features by a machine learning process trained by expert medical doctors.
bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience
The International Cardiac Arrest REsearch consortium (I-CARE) Database includes baseline clinical information and continuous electroencephalography (EEG) recordings from 1,020 comatose patients with a diagnosis of cardiac arrest who were admitted to an intensive care unit from seven academic hospitals in the U.S. and Europe. Patients were monitored with 18 bipolar EEG channels over hours to days for the diagnosis of seizures and for neurological prognostication. Long-term neurological function was determined using the Cerebral Performance Category scale.
astronomy, machine learning, NASA SMD AI
The SOHO/LASCO data set (prepared for the challenge hosted in Topcoder) provided here comes from the instrument’s C2 telescope and comprises approximately 36,000 images spread across 2,950 comet observations. The human eye is a very sensitive tool and it is the only tool currently used to reliably detect new comets in SOHO data - particularly comets that are very faint and embedded in the instrument background noise. Bright comets can be easily detected in the LASCO data by relatively simple automated algorithms, but the majority of comets observed by the instrument are extremely faint, noise-...
cancer, digital pathology, fluorescence imaging, image processing, imaging, life sciences, machine learning, microscopy, radiology
Imaging Data Commons (IDC) is a repository within the Cancer Research Data Commons (CRDC) that manages imaging data and enables its integration with the other components of CRDC. IDC hosts a growing number of imaging collections that are contributed by either funded US National Cancer Institute (NCI) data collection activities, or by individual researchers. Image data hosted by IDC is stored in DICOM format.
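Because IDC data is stored as standard DICOM, a downloaded instance can be inspected with common tooling such as pydicom; a minimal sketch (the local file path is hypothetical):

```python
import pydicom

# Read a DICOM instance previously downloaded from IDC (path is hypothetical).
ds = pydicom.dcmread("idc_instance.dcm")
print(ds.Modality, ds.get("SeriesDescription"))

# The pixel data decodes to a NumPy array for downstream image processing.
pixels = ds.pixel_array
print(pixels.shape, pixels.dtype)
```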
art, deep learning, image processing, labeled, machine learning, media
PD12M is a collection of 12.4 million CC0/PD image-caption pairs for the purpose of training generative image models.
bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience
The IIIC dataset includes 50,697 labeled EEG samples from 2,711 patients and 6,095 EEGs that were annotated by physician experts from 18 institutions. These samples were used to train SPaRCNet (Seizures, Periodic and Rhythmic Continuum patterns Deep Neural Network), a computer program that classifies IIIC events with an accuracy matching clinical experts.
cyber security, deep learning, labeled, machine learning
A dataset intended to support research on machine learning techniques for detecting malware. It includes metadata and EMBER-v2 features for approximately 10 million benign and 10 million malicious Portable Executable files, with disarmed but otherwise complete files for all malware samples. All samples are labeled using Sophos in-house labeling methods, have features extracted using the EMBER-v2 feature set, as well as metadata obtained via the pefile python library, detection counts obtained via ReversingLabs telemetry, and additional behavioral tags that indicate the rough behavior of the sam...
benchmark, energy, machine learning
This data lake contains multiple datasets related to fundamental problems in wind energy research. This includes data for wind plant power production for various layouts/wind flow scenarios, data for two- and three-dimensional flow around different wind turbine airfoils/blades, wind turbine noise production, among others. The purpose of these datasets is to establish a standard benchmark against which new AI/ML methods can be tested, compared, and deployed. Details regarding the generation and formatting of the data for each dataset are included in the metadata as well as example noteboo...
agriculture, environmental, food security, life sciences, machine learning
This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. In this release, we include data collected during Phase I (2009-2013). Georeferenced samples were collected from 19 countries in Sub-Saharan Africa using a statistically sound sampling scheme, and their soil properties were analyzed using both conventional soil testing methods and spectral methods (infrared diffuse reflectance spectroscopy). The two ...
aerial imagery, agriculture, computer vision, deep learning, machine learning
Agriculture-Vision aims to be a publicly available large-scale aerial agricultural image dataset that is high-resolution, multi-band, and with multiple types of patterns annotated by agronomy experts. The original dataset affiliated with the 2020 CVPR paper includes 94,986 512x512 images sampled from 3,432 farmlands with nine types of annotations: double plant, drydown, endrow, nutrient deficiency, planter skip, storm damage, water, waterway and weed cluster. All of these patterns have substantial impacts on field conditions and the final yield. These farmland images were captured between 201...
astronomy, machine learning, NASA SMD AI, segmentation
Pan-STARRS imaging data and associated labels for galaxy segmentation into galactic centers, galactic bars, spiral arms and foreground stars, derived from citizen scientist labels from the Galaxy Zoo: 3D project.
autonomous vehicles, computer vision, deep learning, image processing, lidar, machine learning, mapping, robotics, traffic, transportation, urban, weather
The Aurora Multi-Sensor Dataset is an open, large-scale multi-sensor dataset with highly accurate localization ground truth, captured between January 2017 and February 2018 in the metropolitan area of Pittsburgh, PA, USA by Aurora (via Uber ATG) in collaboration with the University of Toronto. The de-identified dataset contains rich metadata, such as weather and semantic segmentation, and spans all four seasons, rain, snow, overcast and sunny days, different times of day, and a variety of traffic conditions.
The Aurora Multi-Sensor Dataset contains data from a 64-beam Velodyne HDL-64E LiDAR sensor and seven 1920x1200-pixel resolution cameras including a forward-facing stereo pair and five wide-angle lenses covering a 360-degree view around the vehicle.
This data can be used to develop and evaluate large-scale long-term approaches to autonomous vehicle localization. Its size and diversity make it suitable for a wide range of research areas such as 3D reconstruction, virtual tourism, HD map construction, and map compression, among others.
The data was first presented at the International Conference on Intelligent Robots an...
activity detection, activity recognition, computer vision, labeled, machine learning, privacy, video
The Consented Activities of People (CAP) dataset is a fine-grained activity dataset for visual AI research, curated using the Visym Collector platform.
cell biology, cryo electron tomography, czi, electron tomography, life sciences, machine learning, segmentation, structural biology
Cryo-electron tomography (cryoET) is a powerful technique for visualizing 3D structures of cellular macromolecules at near atomic resolution in their native environment. Observing the inner workings of cells in context enables better understanding about the function of healthy cells and the changes associated with disease. However, the analysis of cryoET data remains a significant bottleneck, particularly the annotation of macromolecules within a set of tomograms, which often requires a laborious and time-consuming process of manual labelling that can take months to complete. Given the current...
computer forensics, computer security, CSI, cyber security, digital forensics, image processing, imaging, information retrieval, internet, intrusion detection, machine learning, machine translation, text analysis
Disk images, memory dumps, network packet captures, and files for use in digital forensics research and education. All of this information is accessible through the digitalcorpora.org website, and made available at s3://digitalcorpora/. Some of these datasets implement scenarios that were performed by students, faculty, and others acting in persona. As such, the information is synthetic and may be used without prior authorization or IRB approval. Details of these datasets can be found at digitalcorpora.org.
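Since the bucket is public, its contents can be listed without AWS credentials, for example with unsigned boto3 requests:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client; s3://digitalcorpora/ is publicly readable.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List the first few objects in the bucket.
resp = s3.list_objects_v2(Bucket="digitalcorpora", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```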
life sciences, machine learning, neuroscience
The EEG-DaSh data archive will establish a data-sharing resource for MEEG (EEG, MEG) data, enabling large-scale computational advancements to preserve and share scientific data from publicly funded research for machine learning and deep learning applications.
bioinformatics, biology, computer vision, csv, health, imaging, labeled, life sciences, machine learning, medical image computing, medical imaging, radiology, x-ray
The Emory Knee Radiograph (MRKR) dataset is a large, demographically diverse collection of 503,261 knee radiographs from 83,011 patients, 40% of which are African American. This dataset provides imaging data in DICOM format along with detailed clinical information, including patient-reported pain scores, diagnostic codes, and procedural codes, which are not commonly available in similar datasets. The MRKR dataset also features imaging metadata such as image laterality, view type, and presence of hardware, enhancing its value for research and model development. MRKR addresses significant gaps ...
bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience
The Harvard EEG Database will encompass data gathered from four hospitals affiliated with Harvard University: Massachusetts General Hospital (MGH), Brigham and Women's Hospital (BWH), Beth Israel Deaconess Medical Center (BIDMC), and Boston Children's Hospital (BCH).
bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience
The Harvard-Emory ECG database (HEEDB) is a large collection of 12-lead electrocardiography (ECG) recordings, prepared through a collaboration between Harvard University and Emory University investigators.
brain models, computer vision, deep learning, life sciences, machine learning, neuroimaging, neuroscience
Neural Encoding Simulation Toolkit (NEST) is a resource consisting of multiple pre-trained encoding models of the brain and an accompanying Python package to generate accurate in silico neural responses to arbitrary stimuli with just a few lines of code.
biodiversity, climate, coastal, earth observation, environmental, geospatial, global, machine learning, mapping, natural resource, satellite imagery, sustainability
A collection of multi-resolution satellite images from both public and commercial satellites. The dataset is specifically curated for training geospatial foundation models.
computer vision, machine learning
3D CoMPaT is a richly annotated large-scale dataset of rendered compositions of Materials on Parts of thousands of unique 3D Models. This dataset primarily focuses on stylizing 3D shapes at part-level with compatible materials. Each object with the applied part-material compositions is rendered from four equally spaced views as well as four randomized views. We introduce a new task, called Grounded CoMPaT Recognition (GCR), to collectively recognize and ground compositions of materials on parts of 3D objects. We present two variations of this task and adapt state-of-the-art 2D/3D deep learning met...
autonomous vehicles, computer vision, deep learning, lidar, machine learning, mapping, robotics
An open multi-sensor dataset for autonomous driving research. This dataset comprises semantically segmented images, semantic point clouds, and 3D bounding boxes. In addition, it contains unlabelled 360 degree camera images, lidar, and bus data for three sequences. We hope this dataset will further facilitate active research and development in AI, computer vision, and robotics for autonomous driving.
machine learning
4,817 illustrative diagrams for research on diagram understanding and associated question answering.
csv, machine learning
630 paper annotations
csv, json, machine learning
7,787 multiple choice science questions and associated corpora
astronomy, machine learning, NASA SMD AI, satellite imagery
Hubble Space Telescope imaging data and associated identification labels for galaxy morphology derived from citizen scientist labels from the Galaxy Zoo: Hubble project.
cancer, computational pathology, computer vision, deep learning, digital pathology, grand-challenge.org, histopathology, life sciences, machine learning, medical image computing, medical imaging
This dataset contains the training data for the CHIMERA - Combining HIstology, Medical imaging (radiology) and molEcular data for medical pRognosis and diAgnosis challenge. The CHIMERA Challenge aims to advance precision medicine in cancer care by addressing the critical need for multimodal data integration. Despite significant progress in AI, integrating transcriptomics, pathology, and radiology across clinical departments remains a complex challenge. Clinicians are faced with large, heterogeneous datasets that are difficult to analyze effectively. AI has the potential to unify multimodal dat...
agriculture, computer vision, machine learning
Dataset associated with the March 2021 Frontiers in Robotics and AI paper "Broad Dataset and Methods for Counting and Localization of On-Ear Corn Kernels", DOI: 10.3389/frobt.2021.627009
machine learning, natural language processing
The DROP dataset contains 96k Question and Answer pairs (QAs) over 6.7K paragraphs, split between train (77k QAs), development (9.5k QAs) and a hidden test partition (9.5k QAs).
aerial imagery, demographics, disaster response, geospatial, image processing, machine learning, population, satellite imagery
Population data for a selection of countries, allocated to 1 arcsecond blocks and provided in a combination of CSV and Cloud-optimized GeoTIFF files. This refines CIESIN’s Gridded Population of the World using machine learning models on high-resolution worldwide Maxar satellite imagery. CIESIN population counts aggregated from worldwide census data are allocated to blocks where imagery appears to contain buildings.
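Because the rasters are Cloud-optimized GeoTIFFs, a small window can be read over HTTP without downloading an entire tile; a sketch with rasterio (the URL is a placeholder for one of the dataset's COG files):

```python
import rasterio
from rasterio.windows import Window

# Placeholder URL: substitute the path to an actual COG from this dataset.
cog_url = "https://example.com/population_tile.tif"

with rasterio.open(cog_url) as src:
    # COGs serve windowed reads efficiently via HTTP range requests;
    # here we read a 512x512-pixel block from the top-left corner.
    block = src.read(1, window=Window(0, 0, 512, 512))
    print(src.crs, block.shape, block.sum())
```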
computer vision, deep learning, machine learning
Some of the most important datasets for image classification research, including CIFAR 10 and 100, Caltech 101, MNIST, Food-101, Oxford-102-Flowers, Oxford-IIIT-Pets, and Stanford-Cars. This is part of the fast.ai datasets collection hosted by AWS for convenience of fast.ai students. See documentation link for citation and license details for each dataset.
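fastai students typically fetch these mirrors through the library's own download helper; a minimal sketch, assuming a current fastai install (URLs constants exist for several of the datasets listed):

```python
from fastai.data.external import untar_data, URLs

# Download and extract the Oxford-IIIT-Pets mirror hosted for fast.ai;
# other URLs constants cover several of the datasets listed above.
path = untar_data(URLs.PETS)
print(path)
```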
aerial imagery, agriculture, computer vision, deep learning, machine learning
Dataset associated with the 2021 AAAI paper "Detection and Prediction of Nutrient Deficiency Stress using Longitudinal Aerial Imagery". The dataset contains 3 image sequences of aerial imagery from 386 farm parcels which have been annotated for nutrient deficiency stress.
autonomous vehicles, computer vision, deep learning, GPS, IMU, lidar, logistics, machine learning, object detection, object tracking, perception, radar, robotics, transportation
A large scale multimodal dataset for Autonomous Trucking. Sensor data was recorded with a heavy truck from MAN equipped with 6 lidars, 6 radars, 4 cameras and a high-precision GNSS. MAN TruckScenes allows the research community to come into contact with truck-specific challenges, such as trailer occlusions, novel sensor perspectives, and terminal environments for the first time. It comprises more than 740 scenes of 20s each within a multitude of different environmental conditions. Bounding boxes are available for 27 object classes, 15 attributes, and a range of more than 230m. The scenes are t...
analytics, archives, deep learning, machine learning, NASA SMD AI, planetary
NASA missions like the Curiosity and Perseverance rovers carry a rich array of instruments suited to collect data and build evidence towards answering if Mars ever had livable environmental conditions. These rovers can collect rock and soil samples and can take measurements that can be used to determine their chemical makeup.
Because communication between rovers and Earth is severely constrained, with limited transfer rates and short daily communication windows, scientists have a limited time to analyze the data and make difficult inferences about the chemistry in order to prioritize the next operations and send those instructions back to the rover.
This project aimed at building a model to automatically analyze gas chromatography mass spectrometry (GCMS) data collected for Mars exploration in order to help the scientists in their analysis of understanding the past habitability of Mars.
More information is available at https://mars.nasa.gov/msl/spacecraft/instruments/sam/ and the data from Mars are available and described at https://pds-geosciences.wustl.edu/missions/msl/sam.htm.
We request that you cite th...
analytics, archives, deep learning, machine learning, NASA SMD AI, planetary
NASA missions like the Curiosity and Perseverance rovers carry a rich array of instruments suited to collect data and build evidence towards answering if Mars ever had livable environmental conditions. These rovers can collect rock and soil samples and can take measurements that can be used to determine their chemical makeup.
Because communication between rovers and Earth is severely constrained, with limited transfer rates and short daily communication windows, scientists have a limited time to analyze the data and make difficult inferences about the chemistry in order to prioritize the next operations and send those instructions back to the rover.
This project aimed at building a model to automatically analyze evolved gas analysis mass spectrometry (EGA-MS) data collected for Mars exploration in order to help the scientists in their analysis of understanding the past habitability of Mars.
More information is available at https://mars.nasa.gov/msl/spacecraft/instruments/sam/ and the data from Mars are available and described at https://pds-geosciences.wustl.edu/missions/msl/sam.htm.
We request that you ci...
autonomous vehicles, computer vision, deep learning, event camera, global shutter camera, GNSS, GPS, h5, hdf5, IMU, lidar, machine learning, perception, robotics, RTK
M3ED is the first multi-sensor event camera (EC) dataset focused on high-speed dynamic motions in robotics applications. M3ED provides high-quality synchronized data from multiple platforms (car, legged robot, UAV), operating in challenging conditions such as off-road trails, dense forests, and performing aggressive flight maneuvers. M3ED also covers demanding operational scenarios for EC, such as high egomotion and multiple independently moving objects. M3ED includes high-resolution stereo EC (1280×720), grayscale and RGB cameras, a high-quality IMU, a 64-beam LiDAR, and RTK localization.
biology, cancer, computer vision, health, image processing, imaging, life sciences, machine learning, magnetic resonance imaging, medical imaging, medicine, neurobiology, neuroimaging, segmentation
This dataset contains 8,000+ brain MRIs of 2,000+ patients with brain metastases.
biodiversity, biology, coastal, conservation, deep learning, ecosystems, environmental, geospatial, labeled, machine learning, mapping, oceans, open source software, signal processing
Live-streamed and archived audio data (~2018-present) from underwater microphones (hydrophones) containing marine biological signals as well as ambient ocean noise. Hydrophone placement and passive acoustic monitoring effort prioritizes detection of orca sounds (calls, clicks, whistles) and potentially harmful noise. Geographic focus is on the US/Canada critical habitat of Southern Resident killer whales (northern CA to central BC) with initial focus on inland waters of WA. In addition to the raw lossy or lossless compressed data, we provide a growing archive of annotated bioacoustic bouts.
machine learning, natural language processing
24K Question/Answer (QA) pairs over 4.7K paragraphs, split between train (19K QAs), development (2.4K QAs) and a hidden test partition (2.5K QAs).
computed tomography, computer vision, csv, labeled, life sciences, machine learning, medical image computing, medical imaging, radiology, x-ray tomography
Blunt force abdominal trauma is among the most common types of traumatic injury, with the most frequent cause being motor vehicle accidents. Abdominal trauma may result in damage and internal bleeding of the internal organs, including the liver, spleen, kidneys, and bowel. Detection and classification of injuries are key to effective treatment and favorable outcomes. A large proportion of patients with abdominal trauma require urgent surgery. Abdominal trauma often cannot be diagnosed clinically by physical exam, patient symptoms, or laboratory tests. Prompt diagnosis of abdominal trauma using...
computed tomography, computer vision, csv, labeled, life sciences, machine learning, medical image computing, medical imaging, radiology, x-ray tomography
Over 1.5 million spine fractures occur annually in the United States alone resulting in over 17,730 spinal cord injuries annually. The most common site of spine fracture is the cervical spine. There has been a rise in the incidence of spinal fractures in the elderly and in this population, fractures can be more difficult to detect on imaging due to degenerative disease and osteoporosis. Imaging diagnosis of adult spine fractures is now almost exclusively performed with computed tomography (CT). Quickly detecting and determining the location of any vertebral fractures is essential to prevent ne...
computed tomography, computer vision, csv, labeled, life sciences, machine learning, medical image computing, medical imaging, radiology, x-ray tomography
RSNA assembled this dataset in 2019 for the RSNA Intracranial Hemorrhage Detection AI Challenge (https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection/). De-identified head CT studies were provided by four research institutions. A group of over 60 volunteer expert radiologists recruited by RSNA and the American Society of Neuroradiology labeled over 25,000 exams for the presence and subtype classification of acute intracranial hemorrhage.
computed tomography, computer vision, csv, labeled, life sciences, machine learning, medical image computing, medical imaging, radiology, x-ray tomography
RSNA assembled this dataset in 2020 for the RSNA STR Pulmonary Embolism Detection AI Challenge (https://www.kaggle.com/c/rsna-str-pulmonary-embolism-detection/). With more than 12,000 CT pulmonary angiography (CTPA) studies contributed by five international research centers, it is the largest publicly available annotated PE dataset. RSNA collaborated with the Society of Thoracic Radiology to recruit more than 80 expert thoracic radiologists who labeled the dataset with detailed clinical annotations.
json, machine learning, natural language processing
14k QA pairs over 1.7K paragraphs, split between train (10k QAs), development (1.6k QAs) and a hidden test partition (1.7k QAs).
automatic speech recognition, denoising, machine learning, speaker identification, speech processing
VOiCES is a speech corpus recorded in acoustically challenging settings, using distant microphone recording. Speech was recorded in real rooms with various acoustic features (reverb, echo, HVAC systems, outside noise, etc.). Adversarial noise, either television, music, or babble, was concurrently played with clean speech. Data was recorded using multiple microphones strategically placed throughout the room. The corpus includes audio recordings, orthographic transcriptions, and speaker labels.
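The recordings are ordinary audio files and load with standard tooling; a minimal sketch using the soundfile library (the filename is hypothetical):

```python
import soundfile as sf

# Load one VOiCES recording (the filename here is hypothetical).
audio, sample_rate = sf.read("distant_mic_recording.wav")

# Duration in seconds, useful when aligning audio with transcriptions.
print(sample_rate, audio.shape, len(audio) / sample_rate)
```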
machine learning, natural language processing
9092 crowd-sourced science questions and 68 tables of curated facts
machine learning, natural language processing
68 tables of curated facts
computer vision, deep learning, machine learning
COCO is a large-scale object detection, segmentation, and captioning dataset. This is part of the fast.ai datasets collection hosted by AWS for convenience of fast.ai students. If you use this dataset in your research please cite arXiv:1405.0312 [cs.CV].
cog, computer vision, deep learning, earth observation, floods, geospatial, machine learning, satellite imagery, synthetic aperture radar
This dataset consists of chips of Sentinel-1 and Sentinel-2 satellite data. Each Sentinel-1 chip contains a corresponding label for water and each Sentinel-2 chip contains a corresponding label for water and clouds. Data is stored in folders by a unique event identifier as the folder name. Within each event folder there are subfolders for Sentinel-1 (s1) and Sentinel-2 (s2) data. Each chip is contained in its own sub-folder with the folder name being the source image id, followed by a unique chip identifier consisting of a hyphenated set of 5 numbers. All bands of the satellite data, as well a...
autonomous vehicles, broadband, computer vision, lidar, machine learning, segmentation, us
"The DARPA Invisible Headlights Dataset is a large-scale multi-sensor dataset annotated for autonomous, off-road navigation in challenging off-road environments. It features simultaneously collected off-road imagery from multispectral, hyperspectral, polarimetric, and broadband sensors spanning wave-lengths from the visible spectrum to long-wave infrared and provides aligned LIDAR data for ground-truth shape. Camera calibrations, LiDAR registrations, and traversability annotations for a subset of the data are available."
ai safety, machine learning, natural language processing, synthetic data
A comprehensive dataset designed for aligning language models with safety and ethical guidelines. Contains 8,361 curated triplets of prompts, responses, and safe responses across various risk categories. Each entry includes safety scores, judge reasoning, and harm probability assessments, making it valuable for model alignment, testing, and benchmarking.
computer vision, deep learning, machine learning
Some of the most important datasets for image localization research, including Camvid and PASCAL VOC (2007 and 2012). This is part of the fast.ai datasets collection hosted by AWS for convenience of fast.ai students. See documentation link for citation and license details for each dataset.
autonomous vehicles, computer vision, deep learning, machine learning, robotics
Dataset and benchmarks for computer vision research in the context of autonomous driving. The dataset has been recorded in and around the city of Karlsruhe, Germany using the mobile platform AnnieWay (VW station wagon) which has been equipped with several RGB and monochrome cameras, a Velodyne HDL 64 laser scanner as well as an accurate RTK corrected GPS/IMU localization unit. The dataset has been created for computer vision and machine learning research on stereo, optical flow, visual odometry, semantic segmentation, semantic instance segmentation, road segmentation, single image depth predic...
computer vision, machine learning, multimedia, video
The Multimedia Commons is a collection of audio and visual features computed for the nearly 100 million Creative Commons-licensed Flickr images and videos in the YFCC100M dataset from Yahoo! Labs, along with ground-truth annotations for selected subsets. The International Computer Science Institute (ICSI) and Lawrence Livermore National Laboratory are producing and distributing a core set of derived feature sets and annotations as part of an effort to enable large-scale video search capabilities. They have released this feature corpus into the public domain, under Creative Commons License 0, s...
deep learning, machine learning, natural language processing
Some of the most important datasets for NLP, with a focus on classification, including IMDb, AG-News, Amazon Reviews (polarity and full), Yelp Reviews (polarity and full), Dbpedia, Sogou News (Pinyin), Yahoo Answers, Wikitext 2 and Wikitext 103, and ACL-2010 French-English 10^9 corpus. This is part of the fast.ai datasets collection hosted by AWS for convenience of fast.ai students. See documentation link for citation and license details for each dataset.
computer vision, image processing, imaging, life sciences, machine learning, magnetic resonance imaging, neuroimaging, neuroscience, nifti
Here, we collected and pre-processed a massive, high-quality 7T fMRI dataset that can be used to advance our understanding of how the brain works. A unique feature of this dataset is the massive amount of data available per individual subject. The data were acquired using ultra-high-field fMRI (7T, whole-brain, 1.8-mm resolution, 1.6-s TR). We measured fMRI responses while each of 8 participants viewed 9,000–10,000 distinct, color natural scenes (22,500–30,000 trials) in 30–40 weekly scan sessions over the course of a year. Additional measures were collected including resting-state data, retin...
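The preprocessed volumes are distributed as NIfTI files, which load directly with nibabel; a minimal sketch (the filename is hypothetical):

```python
import nibabel as nib

# Load one NIfTI volume from the dataset (filename is hypothetical).
img = nib.load("sub01_session01_bold.nii.gz")

# Voxel grid shape and voxel sizes; the description above indicates
# 1.8-mm resolution for the 7T fMRI data.
print(img.shape, img.header.get_zooms())

data = img.get_fdata()  # decodes the image array as float64
print(data.mean())
```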
image processing, machine learning
A dataset of all images of Open Food Facts, the biggest open dataset of food products in the world.
breast cancer, cancer, computer vision, csv, labeled, life sciences, machine learning, mammography, medical image computing, medical imaging, radiology
According to the WHO, breast cancer is the most commonly occurring cancer worldwide. In 2020 alone, there were 2.3 million new breast cancer diagnoses and 685,000 deaths. Yet breast cancer mortality in high-income countries has dropped by 40% since the 1980s when health authorities implemented regular mammography screening in age groups considered at risk. Early detection and treatment are critical to reducing cancer fatalities, and your machine learning skills could help streamline the process radiologists use to evaluate screening mammograms. Currently, early detection of breast cancer requi...
machine learning
1,076 textbook lessons, 26,260 questions, 6,229 images
computer vision, machine learning, machine translation, natural language processing
MMID is a large-scale, massively multilingual dataset of images paired with the words they represent, collected at the University of Pennsylvania. The dataset is doubly parallel: for each language, words are stored parallel to images that represent the word, and parallel to the word's translation into English (and corresponding images).
machine learning, natural language processing
ZEST is a benchmark for zero-shot generalization to unseen NLP tasks, with 25K labeled instances across 1,251 different tasks.
amazon.science, computer vision, machine learning
The Amazon Bin Image Dataset contains over 500,000 images and metadata from bins of a pod in an operating Amazon Fulfillment Center. The bin images in this dataset are captured as robot units carry pods as part of normal Amazon Fulfillment Center operations.
amazon.science, computer vision, labeled, machine learning, parquet, video
This dataset contains both the original .tfrecords and a Parquet representation of the YouTube 8 Million dataset. YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities. It comes with precomputed audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This dataset also includes the YouTube-8M Segments data from June 2019. This dataset is 'Lakehouse Ready', meaning you can query this data in-place straight out of...
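As a sketch of what 'Lakehouse Ready' querying can look like, pyarrow can scan Parquet directly from S3 without a full download; the bucket prefix below is illustrative, not the dataset's actual key layout:

```python
import pyarrow.dataset as ds
from pyarrow import fs

# Anonymous S3 filesystem; replace the illustrative prefix below with the
# dataset's actual Parquet location before running.
s3 = fs.S3FileSystem(anonymous=True, region="us-east-1")
table_ds = ds.dataset("example-bucket/yt8m/parquet/", filesystem=s3,
                      format="parquet")

# Inspect the schema and pull a handful of rows in place.
print(table_ds.schema)
print(table_ds.head(5))
```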
amazon.science, machine learning, natural language processing
Amazon product questions and their answers, along with the public product information.
amazon.science, machine learning, natural language processing
Original StackExchange answers and their voice-friendly reformulations.
amazon.science, deep learning, machine learning, natural language processing, speech recognition
Sentence classification datasets with ASR errors.
amazon.science, conversation data, machine learning, natural language processing
This bucket contains the checkpoints used to reproduce the baseline results reported in the DialoGLUE benchmark hosted on EvalAI (https://evalai.cloudcv.org/web/challenges/challenge-page/708/overview). The associated scripts for using the checkpoints are located here: https://github.com/alexa/dialoglue. The associated paper describing the benchmark and checkpoints is here: https://arxiv.org/abs/2009.13570. The provided checkpoints include the CONVBERT model, a BERT-esque model trained on a large open-domain conversational dataset. It also includes the CONVBERT-DG and BERT-DG checkpoints descri...
amazon.science, conversation data, machine learning, natural language processing
This dataset provides extra annotations on top of the publicly released Topical-Chat dataset (https://github.com/alexa/Topical-Chat) which will help in reproducing the results in our paper "Policy-Driven Neural Response Generation for Knowledge-Grounded Dialogue Systems" (https://arxiv.org/abs/2005.12529?context=cs.CL). The dataset contains 5 files: train.json, valid_freq.json, valid_rare.json, test_freq.json and test_rare.json. Each of these files has additional annotations on top of the original Topical-Chat dataset. These specific annotations are: dialogue act annotations a...
amazon.science, machine learning, natural language processing
This dataset provides labeled humor detection from product question answering systems. The dataset contains 3 csv files: Humorous.csv, containing the humorous product questions; Non-humorous-unbiased.csv, containing the non-humorous product questions from the same products as the humorous ones; and...
amazon.science, dialog, machine learning, natural language processing
Humor patterns used for querying Alexa traffic when creating the taxonomy described in the paper “Alexa, Do You Want to Build a Snowman?” Characterizing Playful Requests to Conversational Agents by Shani C., Libov A., Tolmach S., Lewin-Eytan L., Maarek Y., and Shahaf D. (CHI LBW 2022). These patterns correspond to the researchers' hypotheses regarding what humor types are likely to appear in Alexa traffic, and were used for querying Alexa traffic to evaluate those hypotheses.
amazon.science, machine learning, natural language processing
This dataset provides product related questions and answers, including answers' quality labels, as part of the paper 'IR Evaluation and Learning in the Presence of Forbidden Documents'.
amazon.science, machine learning, natural language processing
This dataset provides masked sentences and multi-token phrases that were masked-out of these sentences. We offer 3 datasets: a general purpose dataset extracted from the Wikipedia and Books corpora, and 2 additional datasets extracted from PubMed abstracts. As for the PubMed data, please be aware that the dataset does not reflect the most current/accurate data available from NLM (it is not being updated). For these datasets, the columns provided for each datapoint are as follows: text - the original sentence; span - the span (phrase) which is masked out; span_lower - the lowercase version of span; r...
amazon.science, machine learning, natural language processing
This dataset provides product related questions, including their textual content and gap, in hours, between purchase and posting time. Each question is also associated with related product details, including its id and title.
amazon.science, machine learning, natural language processing, online shopping, product comparison
The Product Comparison dataset for online shopping is a new, manually annotated dataset with about 15K human generated sentences, which compare related products based on one or more of their attributes (the first such data we know of for product comparison). It covers ∼8K product sets, their selected attributes, and comparison texts.
code completion, machine learning
PyEnvs is a collection of 2814 permissively licensed Python packages along with their isolated development environments. Paired with a program analyzer (e.g. Jedi Language Server), it supports querying for project-related information. CallArgs is a dataset built on top of PyEnvs for function call argument completion. It provides function definition, implementation, and usage information for each function call instance.
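As a sketch of the kind of query such an analyzer answers, Jedi can complete a partially typed expression and report a call signature; the source fragments being analyzed are hypothetical:

```python
import jedi

# A hypothetical source fragment with a partially typed attribute access.
source = "import os.path\nos.path.jo"
script = jedi.Script(source)

# Completions at the end of line 2: expect names such as "join".
for completion in script.complete(2, len("os.path.jo")):
    print(completion.name)

# Signature help inside an unfinished call, e.g. join(path, *paths).
call_source = 'import os.path\nos.path.join("data", '
for sig in jedi.Script(call_source).get_signatures(2, len('os.path.join("data", ')):
    print(sig.to_string())
```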
amazon.science, information retrieval, machine learning, natural language processing
Voice-based refinements of product search
amazon.science, machine learning, natural language processing
This dataset provides how-to articles from wikihow.com and their summaries, written as a coherent paragraph. The dataset itself is available at wikisum.zip, and contains the article, the summary, the wikihow url, and an official fold (train, val, or test). In addition, human evaluation results are available at wikisum-human-eval...
amazon.science, conversation data, dialog, machine learning, natural language processing
Wizard of Tasks (WoT) is a dataset containing conversations for Conversational Task Assistants (CTAs). A CTA is a conversational agent whose goal is to help humans to perform real-world tasks. A CTA can help in exploring available tasks, answering task-specific questions and guiding users through step-by-step instructions. WoT contains about 550 conversations with ~18,000 utterances in two domains, i.e., Cooking and Home Improvement.
amazon.science, computer vision, deep learning, machine learning
Airborne Object Tracking (AOT) is a collection of 4,943 flight sequences of around 120 seconds each, collected at 10 Hz in diverse conditions. There are 5.9M+ images and 3.3M+ 2D annotations of airborne objects in the sequences. There are 3,306,350 frames without labels as they contain no airborne objects. For images with labels, there are on average 1.3 labels per image. All airborne objects in the dataset are labelled.
amazon.science, computer vision, deep learning, information retrieval, machine learning, machine translation
Amazon Berkeley Objects (ABO) is a collection of 147,702 product listings with multilingual metadata and 398,212 unique catalog images. 8,222 listings come with turntable photography (also referred to as "spin" or "360º-View" images), as sequences of 24 or 72 images, for a total of 586,584 images in 8,209 unique sequences. For 7,953 products, the collection also provides high-quality 3D models, as glTF 2.0 files.
amazon.science, Hawkes Process, machine learning, temporal point process
When sellers need help from Amazon, such as how to create a listing, they often reach out to Amazon seller support through email, chat or phone. For each contact, we assign an intent so that we can manage the request more easily. The data we present in this release includes 548k contacts with 118 intents from 70k sellers sampled from recent years. There are 3 columns: 1. De-identified seller id - seller_id_anon; 2. Noisy inter-arrival time between contacts, in hours - interarrival_time_hr_noisy; 3. An integer that represents the contact intent - contact_intent. Note that, to balance ...
amazon.science, computer vision, machine learning
Fine-grained localized visual similarity and search for fashion.
benchmark, deep learning, machine learning, meta learning, time series forecasting
TSBench comprises thousands of benchmark evaluations for time series forecasting methods. It provides various metrics (i.e. measures of accuracy, latency, number of model parameters, ...) of 13 time series forecasting methods across 44 heterogeneous datasets. Time series forecasting methods include both classical and deep learning methods, while several hyperparameter settings are evaluated for the deep learning methods. In addition to the tabular data providing the metrics, TSBench includes the probabilistic forecasts of all evaluated methods for all 44 datasets. While the tabular data is smal...