bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience
The Human Sleep Project (HSP) sleep physiology dataset is a growing collection of clinical polysomnography (PSG) recordings. Beginning with PSG recordings from ~15K patients evaluated at the Massachusetts General Hospital, the HSP will grow over the coming years to include data from >200K patients, as well as people evaluated outside of the clinical setting. This data is being used to develop CAISR (Complete AI Sleep Report), a collection of deep neural networks, rule-based algorithms, and signal processing approaches designed to provide better-than-human detection of conventional PSG...
Details →
Usage examples
- Automated Sleep Apnea Quantification Based on Respiratory Movement. International Journal of Medical Sciences 2014; 11(8):796-802. PMCID: PMC4057486. by Bianchi MT, Lipoma T, Darling C, Alameddine Y, Westover MB.
- Algorithm for automatic detection of self-similarity and prediction of residual central respiratory events during continuous positive airway pressure. Sleep. 2021 Apr 9;44(4):zsaa215. doi: 10.1093/sleep/zsaa215. PMCID: PMC8631077. by Oppersma E, Ganglberger W, Sun H, Thomas RJ*, Westover MB*
- HIV Increases Sleep-based Brain Age Despite Antiretroviral Therapy. SLEEP. 2021 Mar 30:zsab058. doi: 10.1093/sleep/zsab058. Epub ahead of print. PMCID: PMC8361332. by Leone MJ*, Sun H*, Boutros CL, Liu L, Ye E, Sullivan L, et al.
- Effects of cholinergic neuromodulation on thalamocortical rhythms during NREM sleep: a model study. Frontiers in Computational Neuroscience. 2020 Jan 23;13:100. doi: 10.3389/fncom.2019.00100. eCollection 2019. PMCID: PMC6990259. by Li Q, Song JL, Li SH, Westover MB, Zhang R.
- Decision modeling in sleep apnea: the critical roles of pre-test probability, cost of untreated OSA, and time horizon. Journal of Clinical Sleep Medicine. 2016 Mar;12(3):409-418. PMCID: PMC4773629. by Moro M, Westover MB, Kelly J, Bianchi MT.
See 37 usage examples →
encyclopedic, internet, natural language processing, web archive
A corpus of web crawl data composed of over 50 billion web pages.
Details →
Usage examples
- Of using Common Crawl to play Family Feud by Paul Masurel
- CCAligned: A Massive collection of cross-lingual web-document pairs by Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, Philipp Koehn
- LAION-5B: An open large-scale dataset for training next generation image-text models by Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, et al
- C4Corpus: Multilingual Web-Size Corpus with Free License by Ivan Habernal, Omnia Zayed, Iryna Gurevych
- Defending against neural fake news by Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, et al
See 35 usage examples →
cancer, genomic, life sciences, STRIDES, whole genome sequencing
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. TCGA has analyzed matched tumor and normal tissues from 11,000 patients, allowing for the comprehensive characterization of 33 cancer types and subtypes, including 10 rare cancers.
The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantificati...
Details →
Usage examples
- Comparative Molecular Analysis of Gastrointestinal Adenocarcinomas by Yang Liu, Nilay S. Sethi, et al.
- Oncogenic Signaling Pathways in The Cancer Genome Atlas by Francisco Sanchez-Vega, Marco Mina, et al.
- Integrated Genomic Analysis of the Ubiquitin Pathway across Cancer Types by Zhongqi Ge, Jake S. Leighton, et al.
- Pan-Cancer Analysis of lncRNA Regulation Supports Their Targeting of Cancer Genes in Each Tumor Context by Hua-Sheng Chiu, Sonal Somvanshi, et al.
- Spatial Organization And Molecular Correlation Of Tumor-Infiltrating Lymphocytes Using Deep Learning On Pathology Images by Joel Saltz, Rajarsi Gupta, et al.
See 29 usage examples →
alchemical free energy calculations, biomolecular modeling, coronavirus, COVID-19, foldingathome, health, life sciences, molecular dynamics, protein, SARS-CoV-2, simulations, structural biology
Folding@home is a massively distributed computing project that uses biomolecular simulations to investigate the molecular origins of disease and accelerate the discovery of new therapies. Run by the Folding@home Consortium, a worldwide network of research laboratories focusing on a variety of different diseases, Folding@home seeks to address problems in human health on a scale that is infeasible by any other means, sharing the results of these large-scale studies with the research community through peer-reviewed publications and publicly shared datasets. During the COVID-19 epidemic, Folding@home focused its resources on understanding the vulnerabilities in SARS-CoV-2, the virus that causes COVID-19 disease, and working closely with a number of experimental collaborators to accelerate progress toward effective therapies for treating COVID-19 and ending the pandemic. In the process, it created the world's first exascale distributed computing resource, enabling it to generate valuable scientific datasets of unprecedented size. More information about Folding@home's COVID-19 research activities is available at the Folding@home COVID-19 page. In addition to working directly with experimental collaborators and rapidly sharing new research findings through preprint servers, Folding@home has joined other researchers in committing to rapidly share all COVID-19 research data, and has joined forces with AWS and the Molecular Sciences Software Institute (MolSSI) to share datasets of unprecedented size through the AWS Open Data Registry, indexing these massive datasets via the MolSSI COVID-19 Molecular Structure and Therapeutics Hub. The complete index of all Folding@home datasets can be found here. Th...
Details →
Usage examples
See 24 usage examples →
cancer, genomic, life sciences, STRIDES, whole genome sequencing
Therapeutically Applicable Research to Generate Effective Treatments (TARGET) is the collaborative effort of a large, diverse consortium of extramural and NCI investigators. The goal of the effort is to accelerate molecular discoveries that drive the initiation and progression of hard-to-treat childhood cancers and facilitate rapid translation of those findings into the clinic.
TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen...
Details →
Usage examples
See 24 usage examples →
agriculture, disaster response, earth observation, geospatial, natural resource, satellite imagery, stac
The Sentinel-2 mission is
a land monitoring constellation of two satellites that provide high resolution
optical imagery and provide continuity for the current SPOT and Landsat missions.
The mission provides global coverage of the Earth's land surface every 5 days,
making the data of great use in ongoing studies. L1C data are available from
June 2015 globally. L2A data are available from November 2016 over the Europe
region and globally since January 2017.
Details →
Usage examples
See 23 usage examples →
agriculture, cog, disaster response, earth observation, geospatial, natural resource, satellite imagery, stac
This joint NASA/USGS program provides the longest continuous space-based record of
Earth’s land in existence. Every day, Landsat satellites provide essential information
to help land managers and policy makers make wise decisions about our resources and our environment.
Data is provided for Landsats 1, 2, 3, 4, 5, 7, 8, and 9 (excludes Landsat 6). As of June 28, 2023 (announcement),
the previous single SNS topic arn:aws:sns:us-west-2:673253540267:public-c2-notify
was replaced with
three new SNS topics for different types of scenes.
Details →
Usage examples
See 23 usage examples →
bam, bioinformatics, biology, cram, genetic, genomic, genotyping, life sciences, machine learning, population genetics, short read sequencing, structural variation, tertiary analysis, variant annotation, whole genome sequencing
This dataset contains alignment files and short nucleotide, copy number (CNV), repeat expansion (STR), structural variant (SV) and other variant call files from the 1000 Genomes Project Phase 3 dataset (n=3202) using Illumina DRAGEN v3.5.7b, v3.7.6, v4.0.3, and v4.2.7 software. All DRAGEN analyses were performed in the cloud using the Illumina Connected Analytics bioinformatics platform powered by Amazon Web Services (see 'Data solution empowering population genomics' for more information). The v3.7.6 and v4.2.7 datasets include results from trio small variant, de novo structural vari...
Details →
Usage examples
See 22 usage examples →
biology, cell biology, cell imaging, Homo sapiens, image processing, life sciences, machine learning, microscopy
This bucket contains multiple datasets (as Quilt packages) created by the Allen Institute for Cell Science. The types of data included in this bucket are listed below:
- Field of view or cropped images of cells
- Segmentations of structures in the images (e.g., boundaries of cells, DNA, other intracellular structures, etc.)
- Processed versions of the above images and segmentations
- Machine learning predictions and labels of the data listed above
- Models trained on the previously listed data
- Additional supporting non-image data related to the above listed data types (e.g., gene expression data, whole genome sequencing data, features derived from the images or model predictions, metadata)
- Simulation, analysis, and visualization data of in silico cell structures, cells, and cell populations
Extern...
Details →
Usage examples
- Label-free prediction of three-dimensional fluorescence images from transmitted-light microscopy by Chawin Ounkomol, Sharmishtaa Seshamani, Mary M. Maleckar, Forrest Collman & Gregory R. Johnson
- Integrated Mitotic Stem Cell by Allen Institute for Cell Science
- Integrated intracellular organization and its variations in human iPS cells by Matheus P. Viana, Jianxu Chen*, Theo A. Knijnenburg*, Ritvik Vasan*, Calysta Yan*... Allen Institute for Cell Science... Graham T. Johnson, Ruwanthi N. Gunawardane, Nathalie Gaudreault, Julie A. Theriot & Susanne M. Rafelski
- Colony context and size-dependent compensation mechanisms give rise to variations in nuclear growth trajectories by Julie C. Dixon*, Christopher L. Frick*, Chantelle L. Leveille*, Philip Garrison, Peyton A. Lee, Saurabh S. Mogre, Benjamin Morris, Nivedita Nivedita, Ritvik Vasan, Jianxu Chen, Cameron L. Fraser, Clare R. Gamlin, Leigh K. Harris, Melissa C. Hendershott, Graham T. Johnson, Kyle N. Klein, Sandra A. Oluoch, Derek J. Thirstrup, M. Filip Sluzewski, Lyndsay Wilhelm, Ruian Yang, Daniel M. Toloudis, Matheus P. Viana, Julie A. Theriot & Susanne M. Rafelski
- Cell states beyond transcriptomics: Integrating structural organization and gene expression in hiPSC-derived cardiomyocytes by Kaytlyn A. Gerbin*, Tanya Grancharova*, Rory M. Donovan-Maiye*, Melissa C. Hendershott*, Helen G. Anderson, Jackson M. Brown, Jianxu Chen, Stephanie Q. Dinh, Jamie L. Gehring, Gregory R. Johnson, HyeonWoo Lee, Aditya Nath, Angelique M. Nelson, M. Filip Sluzewski, Matheus P. Viana, Calysta Yan, Rebecca J. Zaunbrecher, Kimberly R. Cordes Metzler, Nathalie Gaudreault, Theo A. Knijnenburg, Susanne M. Rafelski, Julie A. Theriot & Ruwanthi N. Gunawardane
See 20 usage examples →
natural language processing
Japanese dictionaries and pre-trained models (word embeddings and language models) for natural language processing.
SudachiDict is the dictionary for a Japanese tokenizer (morphological analyzer) Sudachi.
chiVe is Japanese pretrained word embeddings (word vectors), trained using the ultra-large-scale web corpus NWJC by National...
Details →
Usage examples
- 形態素解析器『Sudachi』のための大規模辞書開発 by 坂本美保, 川原典子, 久本空海, 髙岡一馬, 内田佳孝
- chiVe: 製品利用可能な日本語単語ベクトル資源の実現へ向けて ~形態素解析器Sudachiと超大規模ウェブコーパスNWJCによる分散表現の獲得と改良~ by 久本空海, 山村崇, 勝田哲弘, 竹林佑斗, 髙岡一馬, 内田佳孝, 岡照晃, 浅原正幸
- chiTra Tutorial by Works Applications
- 詳細化した同義関係をもつ同義語辞書の作成 by 高岡一馬, 岡部裕子, 川原典子, 坂本美保, 内田佳孝
- jdartsclone: TRIE Data Structure using Double-Array by Works Applications
See 20 usage examples →
bioinformatics, cell biology, life sciences, single-cell transcriptomics, transcriptomics
CZ CELLxGENE Discover (cellxgene.cziscience.com) is a free-to-use platform for the exploration, analysis, and retrieval of single-cell data. CZ CELLxGENE Discover hosts the largest aggregation of standardized single-cell data from the major human and mouse tissues, with modalities that include gene expression, chromatin accessibility, DNA methylation, and spatial transcriptomics.
This year, CZ CELLxGENE Discover has made available all of its human and mouse RNA single-cell data through Census (https://chanzuckerberg.github.io/cellxgene-census/) – a free-to-use service with an API and data that...
Details →
Usage examples
See 19 usage examples →
cancer, genetic, genomic, Homo sapiens, life sciences, pediatric, STRIDES, structural birth defect, whole genome sequencing
The NIH Common Fund's Gabriella Miller Kids First Pediatric Research Program’s (“Kids First”) vision is to “alleviate suffering from childhood cancer and structural birth defects by fostering collaborative research to uncover the etiology of these diseases and by supporting data sharing within the pediatric research community.” The program continues to generate and share whole genome sequence data from thousands of children affected by these conditions, ranging from rare pediatric cancers, such as osteosarcoma, to more prevalent diagnoses, such as congenital heart defects. In 2018, Kids Fi...
Details →
Usage examples
See 19 usage examples →
agriculture, air quality, analytics, archives, atmosphere, climate, climate model, data assimilation, deep learning, earth observation, energy, environmental, forecast, geoscience, geospatial, global, history, imaging, industry, machine learning, machine translation, metadata, meteorological, model, netcdf, opendap, radiation, satellite imagery, solar, statistics, sustainability, time series forecasting, water, weather, zarr
NASA's goal in Earth science is to observe, understand, and model the Earth system to discover how it is changing, to better predict change, and to understand the consequences for life on Earth. The Applied Sciences Program, within the Earth Science Division of the NASA Science Mission Directorate, serves individuals and organizations around the globe by expanding and accelerating societal and economic benefits derived from Earth science, information, and technology research and development.
The Prediction Of Worldwide Energy Resources (POWER) Project, funded through the Applied Sciences Program at NASA Langley Research Center, gathers NASA Earth observation data and parameters related to the fields of surface solar irradiance and meteorology to serve the public in several free, easy-to-access and easy-to-use methods. POWER helps communities become resilient amid observed climate variability by improving data accessibility, aiding research in energy development, building energy efficiency, and supporting agriculture projects.
The POWER project contains over 380 satellite-derived meteorology and solar energy Analysis Ready Data (ARD) at four temporal levels: hourly, daily, monthly, and climatology. The POWER data archive provides data at the native resolution of the source products. The data is updated nightly to maintain near real time availability (2-3 days for meteorological parameters and 5-7 days for solar). The POWER services catalog consists of a series of RESTful Application Programming Interfaces, geospatial enabled image services, and web mapping Data Access Viewer. These three service offerings support data discovery, access, and distribution to the project’s user base as ARD and as direct application inputs to decision support tools.
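As a rough illustration of how a request to one of these RESTful APIs might be assembled, the sketch below builds a query URL for the daily temporal level at a single point. The base URL, the `daily/point` path, and the query parameter names (`parameters`, `community`, `start`, `end`, `format`) are not given in this description and are assumptions to be checked against the current POWER API documentation; `T2M` and `RE` are illustrative values only. No network call is made.

```python
from urllib.parse import urlencode

# Assumed base URL for the POWER daily point endpoint (not stated in this listing).
POWER_DAILY_POINT_URL = "https://power.larc.nasa.gov/api/temporal/daily/point"


def build_power_daily_url(lat, lon, start, end, parameters=("T2M",), community="RE"):
    """Build a query URL for daily data at one location.

    start and end are dates written as YYYYMMDD strings (an assumed convention).
    parameters is a sequence of parameter codes; "T2M" is an illustrative value.
    """
    if not (-90.0 <= lat <= 90.0):
        raise ValueError("latitude must be between -90 and 90")
    if not (-180.0 <= lon <= 180.0):
        raise ValueError("longitude must be between -180 and 180")
    query = {
        "parameters": ",".join(parameters),
        "community": community,
        "longitude": lon,
        "latitude": lat,
        "start": start,
        "end": end,
        "format": "JSON",
    }
    return POWER_DAILY_POINT_URL + "?" + urlencode(query)


url = build_power_daily_url(42.36, -71.06, "20240101", "20240107")
print(url)
```

Because the data is described as lagging real time by 2-3 days for meteorological parameters and 5-7 days for solar, an `end` date closer to today than that may return missing values.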
The latest data version update includes hourly...
Details →
Usage examples
See 18 usage examples →
agriculture, disaster response, earth observation, geospatial, meteorological, satellite imagery, weather
NEW GOES-19 Data!!! On April 4, 2025 at 1500 UTC, the GOES-19 satellite will be declared the Operational GOES-East satellite. All products and services, including NODD, for GOES-East will transition to GOES-19 data at that time. GOES-19 will operate out of the GOES-East location of 75.2°W starting on April 1, 2025 and through the operational transition. Until the transition time and during the final stretch of Post Launch Product Testing (PLPT), GOES-19 products are considered non-operational regardless of their validation maturity level. Shortly following the transition of GOES-19 to GOES-East, all data distribution from GOES-16 will be turned off. GOES-16 will drift to the storage location at 104.7°W. GOES-19 data should begin flowing again on April 4th once this maneuver is complete.
NEW GOES 16 Reprocess Data!! The reprocessed GOES-16 ABI L1b data mitigates systematic data issues (including data gaps and image artifacts) seen in the Operational products, and improves the stability of both the radiometric and geometric calibration over the course of the entire mission life. These data were produced by recomputing the L1b radiance products from input raw L0 data using improved calibration algorithms and look-up tables, derived from data analysis of the NIST-traceable, on-board sources. In addition, the reprocessed data products contain enhancements to the L1b file format, including limb pixels and pixel timestamps, while maintaining compatibility with the operational products. The datasets currently available span the operational life of GOES-16 ABI, from early 2018 through the end of 2024. The Reprocessed L1b dataset shows improvement over the Operational L1b products but may still contain data gaps or discrepancies. Please provide feedback to Dan Lindsey (dan.lindsey@noaa.gov) and Gary Lin (guoqing.lin-1@nasa.gov). More information can be found in the [GOES-R ABI Reprocess User Guide](https://github.com/NOAA-Big-Data-Program/nodd-data-docs/blob/main/GOES/GOES-R_ABI_Reprocessed_L1b_User_Guide-v1.1.pdf).
NOTICE: As of January 10th 2023, GOES-18 assumed the GOES-West position and all data files are deemed both operational and provisional, so no ‘preliminary, non-operational’ caveat is needed. GOES-17 is now offline and has been shifted to approximately 105 degrees West, where it will be in on-orbit storage. GOES-17 data will no longer flow into the GOES-17 bucket. Operational GOES-West products can be found in the GOES-18 bucket.
GOES satellites (GOES-16, GOES-17, GOES-18 & GOES-19) provide continuous weather imagery and
monitoring of meteorological and space environment data across North America.
GO
...
Details →
Usage examples
See 18 usage examples →
agriculture, cog, disaster response, earth observation, geospatial, natural resource, satellite imagery, stac
The Sentinel-2 mission is
a land monitoring constellation of two satellites that provide high resolution
optical imagery and provide continuity for the current SPOT and Landsat missions.
The mission provides global coverage of the Earth's land surface every 5 days,
making the data of great use in ongoing studies.
This dataset is the same as the Sentinel-2
dataset, except the JP2K files were converted into Cloud-Optimized GeoTIFFs (COGs).
Additionally, SpatioTemporal Asset Catalog metadata is provided in a JSON file
alongside the data, and a STAC API called Earth-search
is freely available t...
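
A search against a STAC API is usually expressed as a small JSON body. The sketch below only builds such a body, using the `collections`, `bbox`, `datetime`, and `limit` fields defined by the STAC API Item Search specification; it does not contact any server. The collection id `sentinel-2-l2a` and the example bounding box are assumptions for illustration, and the actual Earth-search endpoint URL and collection names should be taken from that service's documentation.

```python
import json


def build_stac_search(bbox, start, end, collection="sentinel-2-l2a", limit=10):
    """Return a STAC Item Search request body as a dict.

    bbox is [west, south, east, north] in degrees.
    start and end are RFC 3339 timestamps, joined into a closed interval.
    The default collection id is an assumption, not taken from this listing.
    """
    if len(bbox) != 4:
        raise ValueError("bbox must be [west, south, east, north]")
    west, south, east, north = bbox
    if south > north:
        raise ValueError("south must not exceed north")
    return {
        "collections": [collection],
        "bbox": [west, south, east, north],
        "datetime": f"{start}/{end}",
        "limit": limit,
    }


body = build_stac_search(
    [5.9, 45.8, 10.5, 47.8],  # illustrative bounding box
    "2023-06-01T00:00:00Z",
    "2023-06-30T23:59:59Z",
)
print(json.dumps(body, indent=2))
```

The resulting body would typically be sent as a POST to the API's `/search` endpoint; each returned item then links to the COG assets, so a client can read only the bands and windows it needs.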
Details →
Usage examples
See 18 usage examples →
bioinformatics, biology, cancer, cell biology, cell imaging, cell painting, chemical biology, computer vision, csv, deep learning, fluorescence imaging, genetic, high-throughput imaging, image processing, image-based profiling, imaging, life sciences, machine learning, medicine, microscopy, organelle
The Cell Painting Gallery is a collection of image datasets created using the Cell Painting assay.
The images of cells are captured by microscopy imaging, and reveal the response of various labeled cell components to whatever treatments are tested, which can include genetic perturbations, chemicals or drugs, or different cell types.
The datasets can be used for diverse applications in basic biology and pharmaceutical research, such as identifying disease-associated phenotypes, understanding disease mechanisms, and predicting a drug’s activity, toxicity, or mechanism of action (Chandrasekaran et al 2020).
This collection is maintained by the Carpenter–Singh lab and the Cimini lab at the Broad Institute.
A human-friendly listing of datasets, instructions for accessing them, and other documentation is at the corresponding GitHub page abou...
Details →
Usage examples
- A dataset of images and morphological profiles of 30 000 small-molecule treatments using the Cell Painting assay by Bray M-A, Gustafsdottir SM, Rohban MH, Singh S, Ljosa V, Sokolnicki KL, Bittker JA, Bodycombe NE, Dancik V, Hasaka TP, Hon CS, Kemp MM, Li K, Walpita D, Wawer MJ, Golub TR, Schreiber SL, Clemons PA, Shamji AF, & Carpenter AE
- Image-based Profiling Handbook - for processing image-based profiling datasets using CellProfiler and pycytominer by Multiple Authors
- Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes by Bray M-A, Singh S, Han H, Davis CT, Borgeson B, Hartland C, Kost-Alimova M, Gustafsdottir SM, Gibson CC, & Carpenter AE
- Image-based Profiling Recipe by Multiple Authors
- Center for Open Bioimage Analysis (COBA) YouTube Channel - video tutorials of CellProfiler and other software by Multiple Authors
See 17 usage examples →
agriculture, earth observation, meteorological, natural resource, weather
Real-time and archival data from the Next Generation Weather Radar (NEXRAD) network.
Details →
Usage examples
See 17 usage examples →
agriculture, disaster response, earth observation, elevation, geospatial
A global dataset providing bare-earth terrain heights, tiled for easy usage and provided on S3.
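
To show what working with tiled terrain data can look like, the sketch below converts a longitude and latitude into tile indices. It assumes the tiles follow the common Web Mercator XYZ ("slippy map") scheme, which this description does not state; the function name is hypothetical and the dataset's own documentation should be checked for the actual tiling scheme and key layout.

```python
import math


def lonlat_to_tile(lon_deg, lat_deg, zoom):
    """Return (x, y) tile indices in the Web Mercator XYZ scheme.

    Assumes the standard convention: x grows eastward from -180 degrees,
    y grows southward from the top of the Mercator square.
    """
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    x = int((lon_deg + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat_deg)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that edge coordinates (for example lon = 180) stay in range.
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return x, y


print(lonlat_to_tile(0.0, 0.0, 1))
```

Under this scheme a tile is identified by the triple (zoom, x, y), so a client can work out which tile to fetch for a point of interest without listing the bucket.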
Details →
Usage examples
See 17 usage examples →
cog, earth observation, geophysics, geospatial, global, ice, netcdf, satellite imagery, stac, zarr
The Inter-mission Time Series of Land Ice Velocity and Elevation (ITS_LIVE) project has a singular mission: to accelerate ice sheet and glacier research by producing globally comprehensive, high resolution, low latency, temporally dense, multi-sensor records of land ice and ice shelf change while minimizing barriers between the data and the user.
ITS_LIVE data currently consists of NetCDF Level 2 scene-pair ice flow products posted to a standard 120 m grid derived from Landsat 4/5/7/8/9, Sentinel-2 optical scenes, and Sentinel-1 SAR scenes. We have processed all land-ice intersecting image pai...
Details →
Usage examples
- Ubiquitous acceleration in Greenland Ice Sheet calving from 1985 to 2022 by Greene, C.A., A.S. Gardner, M. Wood, and J.K. Cuzzone
- A serverless React-Leaflet website to plot and share ITS_LIVE data by Jacob Fahnestock
- ITS_LIVE Point Data Access by Maria Liukis, Alex S. Gardner, Luis A. López, Mark Fahnestock, and Joseph H. Kennedy
- Using xarray to examine cloud-based glacier surface velocity data by Emma Marshall
- Detecting seasonal ice dynamics in satellite images by Greene, C.A., A.S. Gardner, and L.C. Andrews
See 16 usage examples →
agriculture, cog, disaster response, earth observation, geospatial, land cover, land use, machine learning, mapping, natural resource, satellite imagery, stac, sustainability, synthetic aperture radar
The European Space Agency (ESA) WorldCover product provides global land cover maps for 2020 & 2021 at 10 m resolution based on Copernicus Sentinel-1 and Sentinel-2 data. The WorldCover product comes with 11 land cover classes and has been generated in the framework of the ESA WorldCover project, part of the 5th Earth Observation Envelope Programme (EOEP-5) of the European Space Agency. A first version of the product (v100), containing the 2020 map was released in October 2021. The 2021 map was released in October 2022 using an improved algorithm (v200). The WorldCover 2020 and 2021 maps we...
Details →
Usage examples
See 15 usage examples →
bioinformatics, genetic, genomic, life sciences, population, population genetics, short read sequencing, whole genome sequencing
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use.
The v4.1 data set (GRCh38) spans 730,947 exome sequences and 76,215 whole-genome sequences from unrelated individuals of diverse ancestries, sequenced as part of various disease-specific and population genetic studies.
The gnomAD Principal Investigators and team can be found here, and the groups that have contributed data to the current release are listed here.
Sign up for the gnom...
Details →
Usage examples
- Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes. Nature Communications 11, 2539 (2020) by Wang, Q., Pierce-Hoffman, E., Cummings, B. B., Karczewski, K. J., Alföldi, J., Francioli, L. C., Gauthier, L. D., Hill, A. J., O’Donnell-Luria, A. H., Genome Aggregation Database (gnomAD) Production Team, Genome Aggregation Database (gnomAD) Consortium, & MacArthur, D. G.
- A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020) by Collins, R. L., Brand, H., Karczewski, K. J., Zhao, X., Alföldi, J., Francioli, L. C., Khera, A. V., Lowther, C., Gauthier, L. D., Wang, H., Watts, N. A., Solomonson, M., O’Donnell-Luria, A., Baumann, A., Munshi, R., Walker, M., Whelan, C., Huang, Y., Brookings, T., ... Talkowski, M. E.
- gnomAD quality control GitHub repository by gnomAD Production Team
- Evaluating potential drug targets through human loss-of-function genetic variation. Nature 581, 459–464 (2020) by Minikel, E. V., Karczewski, K. J., Martin, H. C., Cummings, B. B., Whiffin, N., Rhodes, D., Alföldi, J., Trembath, R. C., van Heel, D. A., Daly, M. J., Genome Aggregation Database Production Team, Genome Aggregation Database Consortium, Schreiber, S. L., & MacArthur, D. G.
- A genomic mutational constraint map using variation in 76,156 human genomes. Nature 625, 92–100 (2024) by Chen, S., Francioli, L. C., Goodrich, J. K., Collins, R. L., Wang, Q., Alföldi, J., Watts, N. A., Vittal, C., Gauthier, L. D., Poterba, T., Wilson, M. W., Tarasova, Y., Phu, W., Yohannes, M. T., Koenig, Z., Farjoun, Y., Banks, E., Donnelly, S., Gabriel, S., Gupta, N., Ferriera, S., Tolonen, C., Novod, S., Bergelson, L., Roazen, D., Ruano-Rubio, V., Covarrubias, M., Llanwarne, C., Petrillo, N., Wade, G., Jeandet, T., Munshi, R., Tibbetts, K., gnomAD Project Consortium, O’Donnell-Luria, A., Solomonson, M., Seed, C., Martin, A. R., Talkowski, M. E., Rehm, H. L., Daly, M. J., Tiao, G., Neale, B. M., MacArthur, D. G. & Karczewski, K. J.
See 15 usage examples →
broadband, coastal, Continuously Operating Reference Station (CORS), earthquakes, geophysics, geoscience, GNSS, GPS, oceans, RINEX, seismology
GeoNet provides geological hazard information for Aotearoa New Zealand. This dataset contains data and products recorded by the GeoNet sensor network.
GNSS (Global Navigation Satellite System) data include raw data in proprietary and Receiver Independent Exchange Format (RINEX) and local tie-in surveys conducted during equipment changes; more details can be found on the GeoNet geodetic page website.
Coastal gauge data include relative measurements of sea level recorded by tsunami monitoring gauges. Raw and quality control data are provided in CREX format (Character Form for the Representation and eXchange of meteorological data); more details can be found on the GeoNet coastal tsunami monitoring gauges page.
Camera image data include webcam images from the GeoNet Volcano monitoring network and Built Environment Instrumentation Programme; more details can be found on the GeoNet camera page.
Waveform data include raw data from weak and strong motion instruments of the GeoNet seismic networks; more details can be found on the GeoNet seismic waveform page.
Seismic data products include strong motion derived data; more details can be found on the GeoNet Strong Motion products page.
Time Series data products include derived time...
Details →
Usage examples
See 15 usage examples →
computer vision, disaster response, earth observation, geospatial, machine learning, satellite imagery
SpaceNet launched in August 2016 as an open innovation project offering a repository of freely available
imagery with co-registered map features. Before SpaceNet, computer vision researchers had minimal options
to obtain free, precision-labeled, and high-resolution satellite imagery. Today, SpaceNet hosts datasets
developed by its own team, along with datasets from projects like IARPA’s Functional Map of the World (fMoW).
Details →
Usage examples
See 15 usage examples →
bam, bioinformatics, fast5, fasta, fastq, genomic, life sciences, long read sequencing, short read sequencing, transcriptomics
The Singapore Nanopore Expression (SG-NEx) project is an international collaboration to generate reference transcriptomes and a comprehensive benchmark data set for long read Nanopore RNA-Seq. Transcriptome profiling is done using PCR-cDNA sequencing (PCR-cDNA), amplification-free cDNA sequencing (direct cDNA), direct sequencing of native RNA (direct RNA), and short read RNA-Seq. The SG-NEx core data includes 5 of the most commonly used cell lines and it is extended with additional cell lines and samples that cover a broad range of human tissues. All core samples are sequenced with at least 3 ...
Details →
Usage examples
See 15 usage examples →
amazon.science, analytics, deep learning, geospatial, last mile, logistics, machine learning, optimization, routing, transportation, urban
The 2021 Amazon Last Mile Routing Research Challenge was an innovative research initiative led by Amazon.com and supported by the Massachusetts Institute of Technology’s Center for Transportation and Logistics. Over a period of 4 months, participants were challenged to develop innovative machine learning-based methods to enhance classic optimization-based approaches to solve the travelling salesperson problem, by learning from historical routes executed by Amazon delivery drivers. The primary goal of the Amazon Last Mile Routing Research Challenge was to foster innovative applied research in r...
Details →
Usage examples
- Code repository used for the 2021 Amazon Routing Research Challenge (this repository is included for reference and documentation purposes only, you do not need to install it to access the data) by CAVE Lab, MIT Center for Transportation and Logistics
- Human-Centric Parcel Delivery at Deutsche Post with Operations Research and Machine Learning by Uğur Arikan, Thorsten Kranz, Baris Cem Sal, Severin Schmitt, Jonas Witt
- Does parking matter? The impact of parking time on last-mile delivery optimization by Sara Reed, Ann Melissa Campbell, Barrett W. Thomas
- Inverse Optimization for Routing Problems by Pedro Zattoni Scroccaro, Piet van Beek, Peyman Mohajerin Esfahani, Bilge Atasoy
- Integrating driver behavior into last-mile delivery routing - Combining machine learning and optimization in a hybrid decision support framework by Peter Dieter, Matthew Caron, Guido Schryen
See 17 usage examples →
agriculture, climate, meteorological, weather
Near Real Time JPSS data is now flowing! See bucket information on the right side of this page to access products!
Satellites in the JPSS constellation gather global measurements of atmospheric, terrestrial and oceanic conditions, including sea and land surface temperatures, vegetation, clouds, rainfall, snow and ice cover, fire locations and smoke plumes, atmospheric temperature, water vapor and ozone. JPSS delivers key observations for the Nation's essential products and services, including forecasting severe weather like hurricanes, tornadoes and blizzards days in advance, and assessin...
Details →
Usage examples
See 14 usage examples →
coastalcogdeafricaearth observationgeospatialland covernatural resourcesatellite imagerystacsustainability
The Global Mangrove Watch (GMW) dataset is a result of the collaboration between Aberystwyth University (U.K.), solo Earth Observation (soloEO; Japan), Wetlands International, the World Conservation Monitoring Centre (UNEP-WCMC) and the Japan Aerospace Exploration Agency (JAXA). The primary objective of producing this dataset is to provide countries lacking a national mangrove monitoring system with first-cut mangrove extent and change maps, to help safeguard against further mangrove forest loss and degradation.
The Global Mangrove Watch dataset (version 2) consists of a global baseline map of ...
Details →
Usage examples
See 13 usage examples →
agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystac
Digital Earth Africa (DE Africa) provides free and open access to a copy of Landsat Collection 2 Level-2 products over Africa. These products are produced and provided by the United States Geological Survey (USGS).
The Landsat series of Earth Observation satellites, jointly led by USGS and NASA, has been continuously acquiring images of the Earth’s land surface since 1972. DE Africa provides data from the Landsat 5, 7 and 8 satellites, including historical observations dating back to the late 1980s and regularly updated new acquisitions.
New Level-2 Landsat 7 and Landsat 8 data are available after 15...
Details →
Usage examples
See 13 usage examples →
biologyfluorescence imagingimage processingimaginglife sciencesmicroscopyneurobiologyneuroimagingneuroscience
This data set, made available by Janelia's FlyLight project, consists of fluorescence images
of Drosophila melanogaster driver lines, aligned to standard templates, and stored in formats
suitable for rapid searching in the cloud. Additional data will be added as it is published.
Details →
Usage examples
-
File Operations on AWS S3 by Rob Svirskas
-
Tutorial for neuronbridger (R API) by Alexander Bates
-
Scaling Neuroscience Research on AWS by Konrad Rokicki
-
An image resource of subdivided Drosophila GAL4-driver expression patterns for neuron-level searches by Geoffrey W Meissner, Zachary Dorman, Aljoscha Nern, Kaitlyn Forster, Theresa Gibney, Jennifer Jeter, Lauren Johnson, Yisheng He, Kelley Lee, Brian Melton, Brianna Yarbrough, Jody Clements, Cristian Goina, Hideo Otsuna, Konrad Rokicki, Robert R Svirskas, Yoshinori Aso, Gwyneth M Card, Barry J Dickson, Erica Ehrhardt, Jens Goldammer, Masayoshi Ito, Wyatt Korff, Ryo Minegishi, Shigehiro Namiki, Gerald M Rubin, Gabriella Sterne, Tanya Wolff, Oz Malkesman, FlyLight Project Team
-
FlyLight Project Website by Geoffrey Meissner
See 13 usage examples →
agriculturecogdisaster responseearth observationgeospatialglobalicesatellite imagerysynthetic aperture radar
Developed and operated by the Canadian Space Agency, it is Canada's first commercial Earth observation satellite.
Details →
Usage examples
See 12 usage examples →
agricultureclimatecogdeafricaearth observationfood securitygeospatialmeteorologicalsatellite imagerystacsustainability
Digital Earth Africa (DE Africa) provides free and open access to a copy of the Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS) monthly and daily products over Africa. The CHIRPS rainfall maps are produced and provided by the Climate Hazards Center in collaboration with the US Geological Survey, and use both rain gauge and satellite observations.
The CHIRPS-2.0 Africa Monthly dataset is regularly indexed to DE Africa from the CHIRPS monthly data. The CHIRPS-2.0 Africa Daily dataset is likewise indexed from the CHIRPS daily data. Both products have been converted to clou...
Details →
Usage examples
See 11 usage examples →
climatecoastaldeafricaearth observationgeospatialsatellite imagerysustainability
Africa's long and dynamic coastline is subject to a wide range of pressures, including extreme weather and climate, sea level rise and human development. Understanding how the coastline responds to these pressures is crucial to managing this region, from social, environmental and economic perspectives.
The Digital Earth Africa Coastlines (provisional) is a continental dataset that includes annual shorelines and rates of coastal change along the entire African coastline from 2000 to the present.
The product combines satellite data from the Digital Earth Africa program with tidal modelling t...
Details →
Usage examples
See 11 usage examples →
agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystac
GeoMAD is the Digital Earth Africa (DE Africa) surface reflectance geomedian and triple Median Absolute Deviation data service. It is a cloud-free composite of satellite data compiled over specific timeframes.
The geomedian component combines measurements collected over the specified timeframe to produce one representative, multispectral measurement for every pixel unit of the African continent. The end result is a comprehensive dataset that can be used to generate true-colour images for visual inspection of anthropogenic or natural landmarks. The full spectral dataset can be used to develop m...
Details →
Usage examples
See 11 usage examples →
agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystacwater
Water Observations from Space (WOfS) is a service that draws on satellite imagery to provide historical surface water observations of the whole African continent. WOfS allows users to understand the location and movement of inland and coastal water present in the African landscape. It shows where water is usually present; where it is seldom observed; and where inundation of the surface has been observed by satellite.
WOfS products are generated using the WOfS classification algorithm on Landsat satellite data. There are several WOfS products available for the African continent including scene-level dat...
Details →
Usage examples
-
Water Observations from Space: accurate maps of surface water through time for the continent of Africa by Meghan Halabisky, Kenneth Mubea, Fatou Mar, Fang Yuan, Chad Burton, Eloise Birchall, Negin F. Moghaddam, Sena Ghislain Adimou, Bako Mamane, David Ongo, Edward Boamah, Ee-Faye Chong, Nikita Gandhi, Alex Leith, Lisa Hall and Adam Lewis
-
Digital Earth Africa Explorer (Water Observations from Space Annual Summary) by Digital Earth Africa Contributors
-
Digital Earth Africa Notebook Repo by Digital Earth Africa Contributors
-
Digital Earth Africa Explorer (Water Observations from Space) by Digital Earth Africa Contributors
-
Digital Earth Africa Sandbox by Digital Earth Africa Contributors
See 11 usage examples →
Homo sapiensimaginglife sciencesmagnetic resonance imagingneuroimagingneuroscience
This bucket contains multiple neuroimaging datasets that are part of the International Neuroimaging Data-Sharing Initiative. Raw human and non-human primate neuroimaging data include 1) Structural MRI; 2) Functional MRI; 3) Diffusion Tensor Imaging; 4) Electroencephalogram (EEG).
In addition to the raw data, preprocessed data is also included for some datasets.
A complete list of the available datasets can be seen in the documentation link provided below.
Details →
Usage examples
See 11 usage examples →
aerial imagerycoastalcomputer visiondisaster responseearth observationearthquakesgeospatialimage processingimaginginfrastructurelandmachine learningmappingnatural resourceseismologytransportationurbanwater
The Low Altitude Disaster Imagery (LADI) Dataset consists of human and machine annotated airborne images collected by the Civil Air Patrol in support of various disaster responses from 2015-2023. Two key distinctions are the low altitude, oblique perspective of the imagery and disaster-related features, which are rarely featured in computer vision benchmarks and datasets.
Details →
Usage examples
-
Train and Deploy an Image Classifier for Disaster Response by Jianyu Mao, Kiana Harris, Nae-Rong Chang, Caleb Pennell, Yiming Ren
-
TRECVID 2020: A comprehensive campaign for evaluating video retrieval tasks across multiple application domains by George Awad, Asad A. Butt, Keith Curtis, Jonathan Fiscus, Afzal Godil, Yooyoung Lee, Andrew Delgado, Jesse Zhang, Eliot Godard, Baptiste Chocot, Lukas Diduch, Jeffrey Liu, Alan F. Smeaton, Yvette Graham, Gareth J. F. Jones, Wessel Kraaij, Georges Quenot
-
LADI v1 Tutorials by Andrew Weinert, Jianyu Mao, Kiana Harris, Nae-Rong Chang, Caleb Pennell, Yiming Ren, Ryan Earley, Nadia Dimitrova
-
Large Scale Organization and Inference of an Imagery Dataset for Public Safety by Jeffrey Liu, David Strohschein, Siddharth Samsi, Andrew Weinert
-
An overview on the evaluated video retrieval tasks at TRECVID 2022 by George Awad, Keith Curtis, Asad Butt, Jonathan Fiscus, Afzal Godil, Yooyoung Lee, Andrew Delgado, Eliot Godard, Lukas Diduch, Jeffrey Liu, Yvette Graham, Georges Quenot
See 11 usage examples →
cogdisaster responseearth observationgeospatialsatellite imagerystac
Pre- and post-event high-resolution satellite imagery in support of emergency planning, risk assessment,
monitoring of staging areas and emergency response, damage assessment, and recovery. These images are generated
using the Maxar ARD pipeline, tiled on an organized grid in analysis-ready
cloud-optimized formats.
Details →
Usage examples
See 11 usage examples →
climatecoastaldisaster responseenvironmentalmeteorologicaloceanswaterweather
ANNOUNCEMENTS: [NOS OFS Version Updates and Implementation of Upgraded Oceanographic Forecast Modeling Systems for Lakes Superior and Ontario; Effective October 25, 2022](https://www.weather.gov/media/notification/pdf2/scn22-91_nos_loofs_lsofs_v3.pdf)
For decades, mariners in the United States have depended on NOAA's Tide Tables for the best estimate of expected water levels. These tables provide accurate predictions of the astronomical tide (i.e., the change in water level due to the gravitational effects of the moon and sun and the rotation of the Earth); however, they cannot predict water-level changes due to wind, atmospheric pressure, and river flow, which are often significant.
The National Ocean Service (NOS) has the mission and mandate to provide guidance and information to support navigation and coastal needs. To support this mission, NOS has been developing and implementing hydrodynamic model-based Operational Forecast Systems.
This forecast guidance provides oceanographic information that helps mariners safely navigate their local waters. This national network of hydrodynamic models provides users with operational nowcast and forecast guidance (out to 48 – 120 hours) on parameters such as water levels, water temperature, salinity, and currents. These forecast systems are implemented in critical ports, harbors, estuaries, Great Lakes and coastal waters of the United States, and form a national backbone of real-time data, tidal predictions, data management and operational modeling.
Nowcasts and forecasts are scientific predictions about the present and future states of water levels (and possibly currents and other relevant oceanographic variables, such as salinity and temperature) in a coastal area. These predictions rely on either observed data or forecasts from a numerical model. A nowcast incorporates recent (and often near real-time) observed meteorological, oceanographic, and/or river flow rate data. A nowcast covers the period from the recent past (e.g., the past few days) to the present, and it can make predictions for locations where observational data are not available. A forecast incorporates meteorological, oceanographic, and/or river flow rate forecasts and makes predictions for times where observational data will not be available. A forecast is usually initiated by the results of a nowcast.
OFS generally runs four times per day (every 6 hours) on NOAA's Weather and Climate Operational Supercomputing Systems (WCOSS) in a standard Coastal Ocean Modeling Framework (COMF) developed by the Center for Operational Oceanographic Products and Services (CO-OPS). COMF is a set...
Details →
Usage examples
See 11 usage examples →
agriculturecogdisaster responseearth observationgeospatialimagingsatellite imagerystac
Imagery acquired by the China-Brazil Earth Resources Satellite (CBERS) 4 and 4A.
The image files are recorded and processed by Instituto Nacional de Pesquisas Espaciais (INPE) and are converted to Cloud Optimized GeoTIFF format to optimize their use in cloud-based applications.
Contains all CBERS-4 MUX, AWFI, PAN5M and PAN10M scenes acquired since the start of the satellite mission and is updated daily with new scenes.
CBERS-4A MUX Level 4 (Orthorectified) scenes are being ingested starting from 04-13-2021. CBERS-4A WFI Level 4 (Orthorectified) scenes are being ingested starting from ...
Details →
Usage examples
See 10 usage examples →
agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystacsynthetic aperture radar
DE Africa’s Sentinel-1 backscatter product is developed to be compliant with the CEOS Analysis Ready Data for Land (CARD4L) specifications.
The Sentinel-1 mission, composed of a constellation of two C-band Synthetic Aperture Radar (SAR) satellites, is operated by the European Space Agency (ESA) as part of the Copernicus Programme. The mission currently collects data every 12 days over Africa at a spatial resolution of approximately 20 m.
Radar backscatter measures the amount of microwave radiation reflected back to the sensor from the ground surface. This measurement is sensitive to surface rough...
Details →
Usage examples
See 10 usage examples →
agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystac
The Sentinel-2 mission is part of the European Union Copernicus programme for Earth observations. Sentinel-2 consists of twin satellites, Sentinel-2A (launched 23 June 2015) and Sentinel-2B (launched 7 March 2017). The two satellites have the same orbit, but 180° apart for optimal coverage and data delivery. Their combined data is used in the Digital Earth Africa Sentinel-2 product.
Together, they cover all Earth’s land surfaces, large islands, inland and coastal waters every 3-5 days.
Sentinel-2 data is tiered by level of pre-processing. Level-0, Level-1A and Level-1B data contain raw data fr...
Details →
Usage examples
See 10 usage examples →
climateearth observationenvironmentalnatural resourceoceanssatellite imagerywaterweather
A global, gap-free, gridded, daily 1 km Sea Surface Temperature (SST) dataset created by merging multiple Level-2 satellite SST datasets. Those input datasets include the NASA Advanced Microwave Scanning Radiometer-EOS (AMSR-E), the JAXA Advanced Microwave Scanning Radiometer 2 (AMSR-2) on GCOM-W1, the Moderate Resolution Imaging Spectroradiometers (MODIS) on the NASA Aqua and Terra platforms, the US Navy microwave WindSat radiometer, the Advanced Very High Resolution Radiometer (AVHRR) on several NOAA satellites, and in situ SST observations from the NOAA iQuam project. Data are available fro...
Details →
Usage examples
See 10 usage examples →
aerial imagerycogearth observationgeospatialsatellite imagerystac
The New Zealand Imagery dataset consists of New Zealand's publicly owned aerial and satellite imagery, which is freely available to use under an open licence. The dataset ranges from the latest high-resolution aerial imagery down to 5cm in some urban areas to lower resolution satellite imagery that provides full coverage of mainland New Zealand, Chathams and other offshore islands. It also includes historical imagery that has been scanned from film, orthorectified (removing distortions) and georeferenced (correctly positioned) to create a unique and crucial record of changes to the New Zea...
Details →
Usage examples
See 10 usage examples →
astronomyobject detectionplanetarysurvey
Raw data used to discover Near-Earth Objects (NEOs) that could potentially impact Earth.
Details →
Usage examples
See 9 usage examples →
energyenvironmentalgeospatiallidarmodelsolar
Data released under the Department of Energy's (DOE) Open Energy Data Initiative
(OEDI). The Open Energy Data Initiative aims to improve and automate
access of high-value energy data sets across the U.S. Department of Energy’s
programs, offices, and national laboratories. OEDI aims to make data
actionable and discoverable by researchers and industry to accelerate
analysis and advance innovation.
Details →
Usage examples
-
The Distributed Generation Market Demand Model (dGen): Documentation by B. Sigrin, M. Gleason, R. Preus, I. Baring-Gould, R. Margolis
-
Rooftop Solar Photovoltaic Technical Potential in the United States: A Detailed Assessment by Pieter Gagnon, Robert Margolis, Jennifer Melius, Caleb Phillips, and Ryan Elmore
-
Estimating rooftop solar technical potential across the US using a combination of GIS-based methods, lidar data, and statistical modeling by Pieter Gagnon et al 2018 Environ. Res. Lett. 13 024027
-
Tracking the Sun Tool by Lawrence Berkeley National Laboratory (LBNL)
-
On the Use of Coupled Wind, Wave, and Current Fields in the Simulation of Loads on Bottom-Supported Offshore Wind Turbines during Hurricanes by E. Kim, L. Manuel, M. Curcic, S. S. Chen, C. Phillips, P. Veers
See 9 usage examples →
agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystacsynthetic aperture radar
The ALOS/PALSAR annual mosaic is a global 25 m resolution dataset that combines data from many images captured by JAXA’s PALSAR and PALSAR-2 sensors on ALOS-1 and ALOS-2 satellites respectively. This product contains radar measurement in L-band and in HH and HV polarizations. It has a spatial resolution of 25 m and is available annually for 2007 to 2010 (ALOS/PALSAR) and 2015 to 2020 (ALOS-2/PALSAR-2).
The JERS annual mosaic is generated from images acquired by the SAR sensor on the Japanese Earth Resources Satellite-1 (JERS-1) satellite. This product contains radar measurement in L-band and H...
Details →
Usage examples
See 9 usage examples →
agriculturecogdeafricaearth observationfood securitygeospatialsatellite imagerystacsustainability
Digital Earth Africa's cropland extent map (2019) shows the estimated location of croplands in Africa for the period January to December 2019. Cropland is defined as: "a piece of land of minimum 0.01 ha (a single 10m x 10m pixel) that is sowed/planted and harvest-able at least once within the 12 months after the sowing/planting date." This definition will exclude non-planted grazing lands and perennial crops which can be difficult for satellite imagery to differentiate from natural vegetation.
This provisional cropland extent map has a resolution of 10m, and was built using Cope...
Details →
Usage examples
See 9 usage examples →
agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystacsustainability
Fractional cover (FC) describes the landscape in terms of coverage by green vegetation, non-green vegetation (including deciduous trees during autumn, dry grass, etc.) and bare soil. It provides insight into how areas of dry vegetation and/or bare soil and green vegetation are changing over time. The product is derived from Landsat satellite data, using an algorithm developed by the Joint Remote Sensing Research Program.
Digital Earth Africa's FC service has two components. Fractional Cover is estimated from each Landsat scene, providing measurements from individual days. Fractional Cover...
Details →
Usage examples
See 9 usage examples →
agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystac
Digital Earth Africa’s Monthly NDVI Anomaly service provides an estimate of vegetation condition, for each calendar month, against the long-term baseline condition measured for the month from 1984 to 2020 in the NDVI Climatology. A standardised anomaly is calculated by subtracting the long-term mean from an observation of interest and then dividing the result by the long-term standard deviation. Positive NDVI anomaly values indicate vegetation is greener than average conditions, and are usually due to increased rainfall in a region. Negative values indicate additional plant stress relative to t...
Details →
Usage examples
See 9 usage examples →
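The standardised anomaly described for the Monthly NDVI Anomaly service (observation minus long-term mean, divided by long-term standard deviation) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the service's implementation: the function name, the sample NDVI values, and the choice of a population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def standardised_anomaly(observation, baseline):
    """Return (observation - long-term mean) / long-term standard deviation.

    `baseline` is a hypothetical list of NDVI values for one pixel and one
    calendar month across the baseline years. The population standard
    deviation is an assumption; the product may use a different estimator.
    """
    long_term_mean = mean(baseline)
    long_term_std = pstdev(baseline)
    if long_term_std == 0:
        raise ValueError("baseline has zero variance; anomaly is undefined")
    return (observation - long_term_mean) / long_term_std


# Made-up baseline NDVI values for a single pixel in one calendar month.
baseline_ndvi = [0.40, 0.50, 0.60]

greener = standardised_anomaly(0.65, baseline_ndvi)   # above the baseline mean
stressed = standardised_anomaly(0.35, baseline_ndvi)  # below the baseline mean
print(f"greener than average: {greener:.2f}")
print(f"more stressed than average: {stressed:.2f}")
```

A positive result corresponds to vegetation that is greener than the baseline for that month, and a negative result to vegetation below the baseline, matching the interpretation given in the description.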
bioinformaticsgenomiclife scienceslong read sequencing
The dataset contains reference samples that will be useful for benchmarking and comparing bioinformatics tools for genome analysis. Currently, there are two samples, which are NA12878 (HG001) and NA24385 (HG002), sequenced on an Oxford Nanopore Technologies (ONT) PromethION using the latest R10.4.1 flowcells. Raw signal data output by the sequencer is provided for these datasets in BLOW5 format, and can be rebasecalled when basecalling software updates bring accuracy and feature improvements over the years. Raw signal data is not only for rebasecalling, but also can be used for emerging bioinf...
Details →
Usage examples
-
Flexible and efficient handling of nanopore sequencing signal data with slow5tools. by Samarakoon, H., Ferguson, J.M., Jenner, S.P. et al.
-
Directly processing on an s3fs mount by Hasindu Gamaarachchi
-
Accelerated nanopore basecalling with SLOW5 data format. by Samarakoon, H., Ferguson, J.M., Gamaarachchi H. et al.
-
Slow5tools: toolkit for converting (FAST5 <-> SLOW5), compressing, viewing, indexing and manipulating data in SLOW5 format. by Samarakoon, H., Ferguson, J.M., Jenner, S.P. et al.
-
Fetching subsets with slow5curl and samtools by Bonson Wong
See 9 usage examples →
environmentalgeospatialmeteorological
Released to the public as part of the Department of Energy's Open Energy Data Initiative,
the Wind Integration National Dataset (WIND)
is an update and expansion of the Eastern Wind Integration Data Set and
Western Wind Integration Data Set. It supports the next generation of wind
integration studies.
Details →
Usage examples
-
Power from wind: Open data on AWS by Caleb Phillips, Caroline Draxl, John Readey, Jordan Perr-Sauer
-
Overview and Meteorological Validation of the Wind Integration National Dataset Toolkit by Caroline Draxl, Bri-Mathias Hodge, Andrew Clifton, Jim McCaa
-
A Twenty-Year Analysis of Winds in California for Offshore Wind Energy Production Using WRF v4.1.2 by Alex Rybchuk, Mike Optis, Julie K. Lundquist, Michael Rossol, Walt Musial
-
Validation of Power Output for the WIND Toolkit by J. King, Andrew Clifton, Bri-Mathias Hodge
-
WTK-LED: The WIND Toolkit Long-Term Ensemble Dataset by Caroline Draxl, Jiali Wang, Lindsay Sheridan, et al.
See 9 usage examples →
array tomographybiologyelectron microscopyimage processinglife scienceslight-sheet microscopymagnetic resonance imagingneuroimagingneuroscience
This bucket contains multiple neuroimaging datasets (as Neuroglancer Precomputed Volumes) across multiple modalities and scales, ranging from nanoscale (electron microscopy), to microscale (cleared lightsheet microscopy and array tomography), and mesoscale (structural and functional magnetic resonance imaging). Additionally, many of the datasets include segmentations and meshes.
Details →
Usage examples
-
A Community-Developed Open-Source Computational Ecosystem for Big Neuro Data by J. T. Vogelstein, E. Perlman, B. Falk, A. Baden, W. Gray Roncal, V. Chandrashekhar, F. Collman, S. Seshamani, J. L. Patsolic, K. Lillaney, M. Kazhdan, R. Hider, D. Pryor, J. Matelsky, T. Gion, P. Manavalan, B. Wester, M. Chevillet, E. T. Trautman, K. Khairy, E. Bridgeford, D. M. Kleissas, D. J. Tward, A. K. Crow, B. Hsueh, M. A. Wright, M. I. Miller, S. J. Smith, R. J. Vogelstein, K. Deisseroth, and R. Burns
-
Download by Benjamin Falk
-
Visualization using Neuroglancer by Benjamin Falk
-
To the Cloud! A Grassroots Proposal to Accelerate Brain Science Discovery by J. T. Vogelstein, B. Mensh, M. Häusser, N. Spruston, A. C. Evans, K. Kording, K. Amunts, C. Ebell, J. Muller, M. Telefont, S. Hill, S. P. Koushika, C. Calì, P. A. Valdés-Sosa, P. B. Littlewood, C. Koch, S. Saalfeld, A. Kepecs, H. Peng, Y. O. Halchenko, G. Kiar, M. M. Poo, J. B. Poline, M. P. Milham, A. P. Schaffer, R. Gidron, H. Okano, V. D. Calhoun, M. Chun, D. M. Kleissas, R. J. Vogelstein, E. Perlman, R. Burns, R. Huganir, and M. I. Miller
-
Igneous by William Silversmith
See 9 usage examples →
bambioinformaticsbiologycoronavirusCOVID-19fast5fastafastqgeneticgenomichealthjsonlife scienceslong read sequencingmedicineMERSmetadataopen source softwareRDFSARSSARS-CoV-2SPARQL
COVID-19 PubSeq is a free and open online bioinformatics public sequence resource with on-the-fly analysis of sequenced SARS-CoV-2 samples that allows for a quick turnaround in identification of new virus strains. PubSeq allows anyone to upload sequence material in the form of FASTA or FASTQ files with accompanying metadata through the web interface or REST API.
Details →
Usage examples
See 9 usage examples →
earth observationearthquakesseismology
This dataset contains ground motion velocity and acceleration seismic waveforms recorded by the Southern California Seismic Network (SCSN) and archived at the Southern California Earthquake Data Center (SCEDC). A Distributed Acoustic Sensing (DAS) dataset is included.
Details →
Usage examples
See 9 usage examples →
agriculturedisaster responseelevationgeospatiallidarstac
The goal of the USGS 3D Elevation Program (3DEP) is to collect elevation data in the form of light detection and ranging (LiDAR) data over the conterminous United States, Hawaii, and the U.S. territories, with data acquired over an 8-year period. This dataset provides two realizations of the 3DEP point cloud data. The first resource is a public access organization provided in Entwine Point Tiles format, which is a lossless, full-density, streamable octree based on LASzip (LAZ) encoding. The second resource is a Requester Pays bucket of the original, Raw LAZ (Compressed LAS) 1.4 3DEP format, and more co...
Details →
Usage examples
See 9 usage examples →
cogdisaster responseearth observationsatellite imagerystac
Light Every Night - World Bank Nighttime Light Data – provides open access to all nightly imagery and data from the Visible Infrared Imaging Radiometer Suite Day-Night Band (VIIRS DNB) from 2012-2020 and the Defense Meteorological Satellite Program Operational Linescan System (DMSP-OLS) from 1992-2013. The underlying data are sourced from the NOAA National Centers for Environmental Information (NCEI) archive. Additional processing by the University of Michigan enables access in Cloud Optimized GeoTIFF format (COG) and search using the Spatial Temporal Asset Catalog (STAC) standard. The data is...
Details →
Usage examples
See 9 usage examples →
autonomous vehiclescomputer visionlidarroboticstransportationurban
Public large-scale dataset for autonomous driving. It enables researchers to study challenging urban driving situations using the full sensor suite of a real self-driving car.
Details →
Usage examples
-
nuScenes CAN bus tutorial by Motional
-
Panoptic nuScenes: A Large-Scale Benchmark for LiDAR Panoptic Segmentation and Tracking by Whye Kit Fong, Rohit Mohan, Juana Valeria Hurtado, Lubing Zhou, Holger Caesar, Oscar Beijbom, Abhinav Valada
-
nuScenes devkit by Motional
-
nuScenes Map Expansion Tutorial by Motional
-
nuScenes: A multimodal dataset for autonomous driving by Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, Oscar Beijbom
See 9 usage examples →
cogearth observationelevationgeospatialmappingopen source softwaresatellite imagerystac
ArcticDEM - 2m GSD Digital Elevation Models (DEMs) and mosaics from 2007 to the present. The ArcticDEM project seeks to fill the need for high-resolution time-series elevation data in the Arctic. The time-dependent nature of the strip DEM files allows users to perform change detection analysis and to compare observations of topography data acquired in different seasons or years. The mosaic DEM tiles are assembled from multiple strip DEMs with the intention of providing a more consistent and comprehensive product over large areas. ArcticDEM data is constructed from in-track and cross-track high...
Details →
Usage examples
-
PGC Dynamic STAC API Tutorial by Polar Geospatial Center
-
Dynamic ice loss from the Greenland Ice Sheet driven by sustained glacier retreat by Michalea D. King, Ian M. Howat, Salvatore G. Candela, Myoung J. Noh, Seongsu Jeong, Brice P. Y. Noël, Michiel R. van den Broeke, Bert Wouters, Adelaide Negrete
-
Future Evolution of Greenland's Marine-Terminating Outlet Glaciers by Ginny A. Catania, Leigh A. Stearns, Twila A. Moon, Ellen M. Enderlin, R. H. Jackson
-
Automated stereo-photogrammetric DEM generation at high latitudes: Surface Extraction with TIN-based Search-space Minimization (SETSM) validation and demonstration over glaciated regions by Myoung-Jong Noh, Ian M. Howat
-
ArcticDEM Explorer by Polar Geospatial Center & ESRI
See 8 usage examples →
autonomous vehiclescomputer visionlidarrobotics
This autonomous driving dataset includes data from a 128-beam Velodyne Alpha-Prime lidar, a 5MP Blackfly camera, a 360-degree Navtech radar, and post-processed Applanix POS LV GNSS data. This dataset was collected in various weather conditions (sun, rain, snow) over the course of a year. The intended purpose of this dataset is to enable benchmarking of long-term all-weather odometry and metric localization across various sensor types. In the future, we hope to also support an object detection benchmark.
Details →
Usage examples
-
Boreas: A multi-season autonomous driving dataset by K Burnett, D J Yoon, Y Wu, A Z Li, H Zhang, S Lu, J Qian, W Tseng, A Lambert, K YK Leung, A P Schoellig, Timothy D Barfoot
-
Do we need to compensate for motion distortion and doppler effects in spinning radar navigation? by K Burnett, A P Schoellig, T D Barfoot
-
Are We Ready for Radar to Replace Lidar in All-Weather Mapping and Localization? by K Burnett, Y Wu, D J Yoon, A P Schoellig, T D Barfoot
-
Need for Speed: Fast Correspondence-Free Lidar Odometry Using Doppler Velocity by D J Yoon, K Burnett, J Laconte, Y Chen, H Vhavle, S Kammel, J Reuther, T D Barfoot
-
Project Lidar onto Camera Frames (Jupyter notebook) by Keenan Burnett
See 8 usage examples →
agricultureatmosphereclimateearth observationenvironmentalmodeloceanssimulationsweather
High-resolution historical and future climate simulations from 1980-2100
Details →
Usage examples
-
Memo on the Evaluation of Downscaled GCMs Using WRF by Stefan Rahimi
-
Is Bias Correction in Dynamical Downscaling Defensible? by Risser, M. D., Rahimi, S., Goldenson, N., Hall, A., Lebo, Z. J., and Feldman, D. R.
-
Memorandum on Evaluating Global Climate Models for Studying Regional Climate Change in California by Will Krantz, David Pierce, Naomi Goldenson, Daniel Cayan
-
Downscaling file descriptions, directory structure, and data access by Stefan Rahimi & Lei Huang
-
Memo on the Development and Availability of Dynamically Downscaled Projections Using WRF by Stefan Rahimi
See 8 usage examples →
cancergeneticgenomicHomo sapienslife sciencesSTRIDEStranscriptomicswhole genome sequencing
The Cancer Cell Line Encyclopedia (CCLE) project is an effort to conduct a detailed genetic
characterization of a large panel of human cancer cell lines. The CCLE provides public access to
genomic data, visualization and analysis for over 1100 cancer cell lines. This dataset contains
RNA-Seq Aligned Reads, WXS Aligned Reads, and WGS Aligned Reads data.
Details →
Usage examples
See 8 usage examples →
earth observationenergygeospatialmeteorologicalwater
Released to the public as part of the Department of Energy's Open Energy Data Initiative,
this is the highest resolution publicly available long-term wave hindcast
dataset that – when complete – will cover the entire U.S. Exclusive Economic
Zone (EEZ).
Details →
Usage examples
-
Predicting ocean waves along the US East Coast during energetic winter storms: sensitivity to whitecapping parameterizations by Allahdadi, M.N., He, R., and Neary, V.S
-
High-resolution hindcasts for U.S. wave energy resource characterization by Yang, Z. and V.S. Neary
-
High-Resolution Regional Wave Hindcast for the U.S. West Coast by Yang, Zhaoqing; Wu, Wei-Cheng; Wang, Taiping; Castrucci, Luca
-
Development and validation of a regional-scale high-resolution unstructured model for wave energy resource characterization along the US East Coast by Allahdadi, M.N., Gunawan, J. Lai, R. He, V.S. Neary
-
Nearshore wave energy resource characterization along the East Coast of the United States by Ahn, S. V.S. Neary, Allahdadi, N. and R. He
See 8 usage examples →
agricultureagriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystac
Digital Earth Africa’s NDVI climatology product represents the long-term average baseline condition of vegetation for every Landsat pixel over the African continent. Both mean and standard deviation NDVI climatologies are available for each calendar month. Some key features of the product are:
- NDVI climatologies were developed using harmonized Landsat 5, 7, and 8 satellite imagery.
- Mean and standard deviation NDVI climatologies are produced for each calendar month, using a temporal baseline period from 1984-2020 (inclusive)
- Datasets have a spatial...
Details →
Usage examples
See 8 usage examples →
life sciencesMus musculusneurophysiologyneuroscienceopen source software
Electrophysiological recordings of mouse brain activity acquired using Neuropixels probes and accompanying behavioral data.
Details →
Usage examples
See 8 usage examples →
climateCMIP5natural resourcesustainability
A collection of downscaled climate change projections, derived from the
General Circulation Model (GCM) runs conducted under the Coupled Model
Intercomparison Project Phase 5 (CMIP5) [Taylor et al. 2012] and across
the four greenhouse gas emissions scenarios known as Representative
Concentration Pathways (RCPs) [Meinshausen et al. 2011]. The NASA Earth
Exchange group maintains the NEX-DCP30 (CMIP5), NEX-GDDP (CMIP5), and
LOCA (CMIP5).
Details →
Usage examples
-
Statistical downscaling using Localized Constructed Analogs (LOCA). by Pierce, D. W., D. R. Cayan, and B. L. Thrasher (2014)
-
Accessing and plotting NASA-NEX data, from GEOSChem-on-cloud tutorial. by Jiawei Zhuang
-
Sample Python code to analyze NASA-NEX data. by Jiawei Zhuang
-
Climate Downscaling Using YNet: a Deep Convolutional Network with Skip Connections and Fusion by Yumin Liu, Auroop Ganguly, and Jennifer Dy
-
Potential changes in cooling degree day under different global warming levels and shared socioeconomic pathways in West Africa by Oluwarotimi Delano Thierry Odou, Heidi Heinrichs Ursula, Rabani Adamou, Thierry Godjo, and Mounkaila S Moussa
See 8 usage examples →
biodiversityearth observationecosystemsenvironmentalgeospatialmappingoceans
Water-column sonar data archived at the NOAA National Centers for Environmental Information.
Details →
Usage examples
See 8 usage examples →
cogearth observationelevationgeospatialstac
The New Zealand Elevation dataset consists of New Zealand's publicly owned digital elevation models and digital surface models, which are freely available to use under an open licence. The dataset contains 1m resolution grids derived from LiDAR data. Point clouds are not included in the initial release. All of the elevation files are Cloud Optimised GeoTIFFs using LERC compression for the main grid and LERC compression with lower max_z_error for the overviews. These elevation files are accompanied by
Details →
Usage examples
See 8 usage examples →
earth observationearthquakesseismology
This dataset contains various types of digital data relating
to earthquakes in central and northern California.
Time series data come from broadband, short period, and strong motion
seismic sensors, GPS, and other geophysical sensors.
Details →
Usage examples
See 8 usage examples →
cogearth observationenvironmentalgeospatiallabeledmachine learningsatellite imagerystac
Radiant MLHub is an open library for geospatial training data that hosts datasets generated by Radiant Earth Foundation's team as well as other training data catalogs contributed by Radiant Earth’s partners. Radiant MLHub is open to anyone to access, store, register and/or share their training datasets for high-quality Earth observations. All of the training datasets are stored using a SpatioTemporal Asset Catalog (STAC) compliant catalog and exposed through a common API. Training datasets include pairs of imagery and labels for different types of machine learning problems including image ...
Details →
Usage examples
See 8 usage examples →
cogearth observationelevationgeospatialmappingopen source softwaresatellite imagerystac
The Reference Elevation Model of Antarctica - 2m GSD Digital Elevation Models (DEMs) and mosaics from 2009 to the present. The REMA project seeks to fill the need for high-resolution time-series elevation data in the Antarctic. The time-dependent nature of the strip DEM files allows users to perform change detection analysis and to compare observations of topography data acquired in different seasons or years. The mosaic DEM tiles are assembled from multiple strip DEMs with the intention of providing a more consistent and comprehensive product over large areas. REMA data is constructed from in...
Details →
Usage examples
-
The Reference Elevation Model of Antarctica by Ian M. Howat, Claire Porter, Benjamin E. Smith, Myoung-Jong Noh, Paul Morin
-
Deep glacial troughs and stabilizing ridges unveiled beneath the margins of the Antarctic ice sheet by Morlighem, M., Rignot, E., Binder, T. et al.
-
OpenTopography access to REMA by OpenTopography
-
The surface extraction from TIN based search-space minimization (SETSM) algorithm by Myoung-Jong Noh, Ian M. Howat
-
REMA Explorer by Polar Geospatial Center & ESRI
See 8 usage examples →
bioinformaticsbiologyenvironmentalepigenomicsgeneticgenomiclife sciences
The TaRGET (Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription) Program is a research consortium funded by the National Institute of Environmental Health Sciences (NIEHS). The goal of the collaboration is to address the role of environmental exposures in disease pathogenesis as a function of epigenome perturbation, including understanding the environmental control of epigenetic mechanisms and assessing the utility of surrogate tissue analysis in mouse models of disease-relevant environmental exposures.
Details →
Usage examples
-
Environmental Determinants of cardiovascular disease: lessons learned from air pollution by Al-Kindi SG, Brook RD, Biswal S, Rajagopalan S.
-
Epigenetic biomarkers and preterm birth by Park B, Khanam R, Vinayachandran V, et al.
-
Finding and Downloading TaRGET II Data files by TaRGET-DCC
-
Visualize TaRGET II data with WashU Epigenome Browser by WashU Epigenome Browser
-
The role of environmental exposures and the epigenome in health and disease. by Perera BPU, Faulk C, Svoboda LK, Goodrich JM, Dolinoy DC.
See 8 usage examples →
cogearth observationgeospatialminingnatural resourcesatellite imagerysustainability
The Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) Level 1
Precision Terrain Corrected Registered At-Sensor Radiance (AST_L1T) data contains
calibrated at-sensor radiance, which corresponds with the ASTER Level 1B (AST_L1B),
that has been geometrically corrected, and rotated to a north-up UTM projection.
The AST_L1T is created from a single resampling of the corresponding ASTER L1A (AST_L1A) product. The precision terrain correction process incorporates GLS2000 digital elevation data with
derived ground control points (GCPs) to achieve topographic accuracy for all daytim...
Details →
Usage examples
See 7 usage examples →
cancergenomiclife sciencesSTRIDEStranscriptomics
The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the
understanding of the molecular basis of cancer through the application of large-scale proteome and
genome analysis, or proteogenomics. CPTAC-2 is the Phase II of the CPTAC Initiative (2011-2016).
Datasets contain open RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression
Quantification, and miRNA Expression Quantification data.
Details →
Usage examples
-
Cancer Genomics Cloud by Seven Bridges
-
Integrated Proteogenomic Characterization of Human High-Grade Serous Ovarian Cancer by Hui Zhang, Tao Liu, Zhen Zhang, Samuel H. Payne, Bai Zhang, Jason E. McDermott, Jian-Ying
Zhou, Vladislav A. Petyuk, Li Chen, Debjit Ray, Shisheng Sun, Feng Yang, Lijun Chen, Jing
Wang, Punit Shah, Seong Won Cha, Paul Aiyetan, Sunghee Woo, Yuan Tian, Marina A. Gritsenko,
Therese R. Clauss, Caitlin Choi, Matthew E. Monroe, Stefani Thomas, Song Nie, Chaochao Wu,
Ronald J. Moore, Kun-Hsing Yu, David L. Tabb, David Fenyö, Vineet Bafna, Yue Wang, Henry
Rodriguez, Emily S. Boja, Tara Hiltke, Robert C. Rivers, Lori Sokoll, Heng Zhu, Ie-Ming
Shih, Leslie Cope, Akhilesh Pandey, Bing Zhang, Michael P. Snyder, Douglas A. Levine,
Richard D. Smith, Daniel W. Chan, Karin D. Rodland, the CPTAC Investigators
-
Genomic Data Commons by National Cancer Institute
-
Proteogenomic Analysis of Human Colon Cancer Reveals New Therapeutic Opportunities by Suhas Vasaikar, Chen Huang, Xiaojing Wang, Vladislav A. Petyuk, Sara R. Savage, Bo Wen,
Yongchao Dou, Yun Zhang, Zhiao Shi, Osama A. Arshad, Marina A. Gritsenko, Lisa J. Zimmerman,
Jason E. McDermott, Therese R. Clauss, Ronald J. Moore, Rui Zhao, Matthew E. Monroe, Yi-Ting
Wang, Matthew C. Chambers, Robbert J.C. Slebos, Ken S. Lau, Qianxing Mo, Li Ding, Matthew
Ellis, Mathangi Thiagarajan, Christopher R. Kinsinger, Henry Rodriguez, Richard D. Smith,
Karin D. Rodland, Daniel C. Liebler, Tao Liu, Bing Zhang, Clinical Proteomic Tumor Analysis
Consortium
-
Proteomic analysis of colon and rectal carcinoma using standard and customized databases
by Slebos RJ, Wang X, Wang X, Zhang B, Tabb DL, Liebler DC
See 7 usage examples →
agricultureatmosphereclimateearth observationenvironmentalmodeloceanssimulationsweather
The sixth phase of global coupled ocean-atmosphere general circulation model ensemble.
Details →
Usage examples
See 7 usage examples →
cogearth observationgeosciencegeospatialimage processingopen source softwaresatellite imagerystac
Earth observation (EO) data cubes produced from analysis-ready data (ARD) of CBERS-4, Sentinel-2 A/B and Landsat-8 satellite images for Brazil. The datacubes are regular in time and use a hierarchical tiling system. Further details are described in Ferreira et al. (2020).
Details →
Usage examples
See 7 usage examples →
disaster responseevents
This project monitors the world's broadcast, print,
and web news from nearly every corner of every country in
over 100 languages and identifies the people, locations,
organizations, counts, themes, sources, emotions,
quotes, images and events driving our global society every
second of every day.
Details →
Usage examples
See 7 usage examples →
bamcancergeneticgenomiclife sciencesvcf
The International Cancer Genome Consortium (ICGC) coordinates projects with the common aim of accelerating research into the causes and control of cancer. The PanCancer Analysis of Whole Genomes (PCAWG) study is an international collaboration to identify common patterns of mutation in whole genomes from ICGC. More than 2,400 consistently analyzed genomes corresponding to over 1,100 unique ICGC donors are now freely available on Amazon S3 to credentialed researchers subject to ICGC data sharing policies.
Details →
Usage examples
See 7 usage examples →
chemistrycloud computingdata assimilationdigital assetsdigital preservationenergyenvironmentalfree softwaregenomeHPCinformation retrievalinfrastructurejsonmachine learningmaterials sciencemolecular dynamicsmoleculeopen source softwarephysicspost-processingx-ray crystallography
Materials Project is an open database of computed materials properties aiming to accelerate materials science research. The resources in this OpenData dataset contain the raw, parsed, and build data products.
Details →
Usage examples
See 7 usage examples →
agricultureagricultureclimatedisaster responseenvironmentaltransportationweather
The NOAA National Water Model Retrospective dataset contains input and output from multi-decade CONUS retrospective simulations. These simulations used meteorological input fields from meteorological retrospective datasets. The output frequency and fields available in this historical NWM dataset differ from those contained in the real-time operational NWM forecast model. Additionally, note that no streamflow or other data assimilation is performed within any of the NWM retrospective simulations.
One application of this dataset is to provide historical context to current near real-time streamflow, soil moisture and snowpack conditions. The retrospective data can be used to infer flow frequencies and perform temporal analyses with hourly streamflow output and 3-hourly land surface output. This dataset can also be used in the development of end user applications which require a long baseline of data for system training or verification purposes.
...
Details →
Usage examples
See 7 usage examples →
air qualitycitiesenvironmentalgeospatial
Global, aggregated physical air quality data from public data sources provided by government, research-grade and other sources. These awesome groups do the hard work of measuring these data and publicly sharing them, and our community makes them more universally-accessible to both humans and machines.
Details →
Usage examples
See 7 usage examples →
aerial imagerycogdisaster responseearth observationsatellite imagery
OpenAerialMap is a collection of high-resolution openly licensed satellite and aerial imagery.
Details →
Usage examples
See 10 usage examples →
citiescoastalcogelevationenvironmentallidarurban
This dataset is Lidar data that has been collected by the Scottish public sector and made available under the Open Government Licence. The data are available as point cloud (LAS format or in LAZ compressed format), along with the derived Digital Terrain Model (DTM) and Digital Surface Model (DSM) products as Cloud optimized GeoTIFFs (COG) or standard GeoTIFF. The dataset contains multiple subsets of data which were each commissioned and flown in response to different organisational requirements. The details of each can be found at https://remotesensingdata.gov.scot/data#/list
Details →
Usage examples
See 7 usage examples →
autonomous vehicleslidarroboticstransportationurban
nuPlan is the world's first large-scale planning benchmark for autonomous driving.
Details →
Usage examples
See 7 usage examples →
cogearth observationenvironmentalgeospatialland coverland usemachine learningmappingplanetarysatellite imagerystacsustainability
This dataset, produced by Impact Observatory, Microsoft, and Esri, displays a global map of land use and land cover (LULC)
derived from ESA Sentinel-2 imagery at 10 meter resolution for the years 2017 - 2023. Each map is a composite of LULC predictions for 9 classes throughout the year
in order to generate a representative snapshot of each year. This dataset was generated by Impact Observatory, which used billions of human-labeled pixels
(curated by the National Geographic Society) to train a deep learning model for land classification.
Each global map was produced by applying this model to ...
Details →
Usage examples
See 6 usage examples →
autonomous vehiclescomputer visiongeospatiallidarrobotics
Home of the Argoverse datasets. Public datasets supported by detailed maps to test, experiment, and teach self-driving vehicles how to understand the world around them. This bucket includes the following datasets:
- Argoverse 1 (AV1)
- Motion Forecasting
- Tracking
- Argoverse 2 (AV2)
- Motion Forecasting
- Lidar
- Sensor
- Trust, but Verify (TbV)
Details →
Usage examples
-
Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting by Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, James Hays
-
Trust, but Verify: Cross-Modality Fusion for HD Map Change Detection by John Lambert, James Hays
-
Argoverse: 3D Tracking and Forecasting With Rich Maps by Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, James Hays
-
PyPi package for `av2` by Argoverse Authors
-
conda-forge package for `av2` by Argoverse Authors
See 6 usage examples →
calcium imagingelectron microscopyimaginglife scienceslight-sheet microscopymagnetic resonance imagingneuroimagingneurosciencevolumetric imagingx-rayx-ray microtomographyx-ray tomography
This data ecosystem, Brain Observatory Storage Service & Database (BossDB), contains several neuro-imaging datasets across multiple modalities and scales, ranging from nanoscale (electron microscopy), to microscale (cleared lightsheet microscopy and array tomography), and mesoscale (structural and functional magnetic resonance imaging). Additionally, many of the datasets include dense segmentation and meshes.
Details →
Usage examples
-
intern: Integrated Toolkit for Extensible and Reproducible Neuroscience by Jordan K Matelsky, Luis Rodriguez, Daniel Xenes, Timothy Gion, Robert Hider Jr., Brock Wester, William Gray-Roncal
-
A Community-Developed Open-Source Computational Ecosystem for Big Neuro Data by J. T. Vogelstein, E. Perlman, B. Falk, A. Baden, W. Gray Roncal, V. Chandrashekhar, F. Collman, S. Seshamani, J. L. Patsolic, K. Lillaney, M. Kazhdan, R. Hider, D. Pryor, J. Matelsky, T. Gion, P. Manavalan, B. Wester, M. Chevillet, E. T. Trautman, K. Khairy, E. Bridgeford, D. M. Kleissas, D. J. Tward, A. K. Crow, B. Hsueh, M. A. Wright, M. I. Miller, S. J. Smith, R. J. Vogelstein, K. Deisseroth, and R. Burns
-
bossDB by bossDB Team
-
Data access and download by Jordan Matelsky
-
The Block Object Storage Service (bossDB): A Cloud-Native Approach for Petascale Neuroscience Discovery by Robert Hider Jr., Dean M. Kleissas, Derek Pryor, Timothy Gion, Luis Rodriguez, Jordan Matelsky, William Gray-Roncal, Brock Wester
See 6 usage examples →
cogcomputer visionearth observationgeospatialimage processingsatellite imagerystacsynthetic aperture radar
Open Synthetic Aperture Radar (SAR) data from Capella Space. Capella Space is an information services company
that provides on-demand, industry-leading, high-resolution synthetic aperture radar (SAR) Earth observation
imagery. Through a constellation of small satellites, Capella provides easy access to frequent, timely, and
flexible information affecting dozens of industries worldwide. Capella's high-resolution SAR satellites are
matched with unparalleled infrastructure to deliver reliable global insights that sharpen our understanding
of the changing world – improving decisions ...
Details →
Usage examples
See 6 usage examples →
cancergenomiclife sciencesSTRIDEStranscriptomics
The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the
understanding of the molecular basis of cancer through the application of large-scale proteome and
genome analysis, or proteogenomics. CPTAC-3 is the Phase III of the CPTAC Initiative. The dataset
contains open RNA-Seq Gene Expression Quantification data.
Details →
Usage examples
-
Integrated Proteogenomic Characterization of Clear Cell Renal Cell Carcinoma by Clark DJ, Dhanasekaran SM, Petralia F, Pan J, Song X, Hu Y, da Veiga Leprevost F, Reva B,
Lih TM, Chang HY, Ma W, Huang C, Ricketts CJ, Chen L, Krek A, Li Y, Rykunov D, Li QK, Chen
LS, Ozbek U, Vasaikar S, Wu Y, Yoo S, Chowdhury S, Wyczalkowski MA, Ji J, Schnaubelt M, Kong
A, Sethuraman S, Avtonomov DM, Ao M, Colaprico A, Cao S, Cho KC, Kalayci S, Ma S, Liu W,
Ruggles K, Calinawan A, Gümüş ZH, Geizler D, Kawaler E, Teo GC, Wen B, Zhang Y, Keegan S, Li
K, Chen F, Edwards N, Pierorazio PM, Chen XS, Pavlovich CP, Hakimi AA, Brominski G, Hsieh
JJ, Antczak A, Omelchenko T, Lubinski J, Wiznerowicz M, Linehan WM, Kinsinger CR,
Thiagarajan M, Boja ES, Mesri M, Hiltke T, Robles AI, Rodriguez H, Qian J, Fenyö D, Zhang B,
Ding L, Schadt E, Chinnaiyan AM, Zhang Z, Omenn GS, Cieslik M, Chan DW, Nesvizhskii AI, Wang
P, Zhang H; Clinical Proteomic Tumor Analysis Consortium
-
Evaluation of NCI-7 Cell Line Panel as a Reference Material for Clinical Proteomics by Clark DJ, Hu Y, Bocik W, Chen L, Schnaubelt M, Roberts R, Shah P, Whiteley G, Zhang H
-
CPTAC Data Portal by National Cancer Institute
-
Cancer Genomics Cloud by Seven Bridges
-
Proteomic Data Commons by National Cancer Institute
See 6 usage examples →
agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystacwater
The Digital Earth Africa continental Waterbodies Monitoring Service identifies more than 700,000 water bodies from over three decades of satellite observations. This service maps persistent and seasonal water bodies and the change in their water surface area over time. Mapped water bodies may include, but are not limited to, lakes, ponds, man-made reservoirs, wetlands, and segments of some river systems. On a local, regional, and continental scale, this service helps improve our understanding of surface water dynamics and water availability and can be used for monitoring water bodies such as we...
Details →
Usage examples
See 6 usage examples →
cogearth observationelevationgeospatialmappingopen source softwaresatellite imagerystac
EarthDEM - 2m GSD Digital Elevation Models (DEMs) and mosaics from 2002 to the present. The EarthDEM project seeks to fill the need for high-resolution time-series elevation data in non-polar regions. The time-dependent nature of the strip DEM files allows users to perform change detection analysis and to compare observations of topography data acquired in different seasons or years. The mosaic DEM tiles are assembled from multiple strip DEMs with the intention of providing a more consistent and comprehensive product over large areas. EarthDEM data is constructed from in-track and cross-track ...
Details →
Usage examples
-
PGC Dynamic STAC API Tutorial by Polar Geospatial Center
-
Multi-Source EO for Dynamic Wetland Mapping and Monitoring in the Great Lakes Basin by Michael J. Battaglia, Sarah Banks, Amir Behnamian, Laura Bourgeau-Chavez, Brian Brisco, Jennifer Corcoran, Zhaohua Chen, Brian Huberty, James Klassen, Joseph Knight, Paul Morin, Kevin Murnaghan, Keith Pelletier, Lori White
-
The surface extraction from TIN based search-space minimization (SETSM) algorithm by Myoung-Jong Noh, Ian M. Howat
-
GLARS Data Viewer by Great Lakes Alliance for Remote Sensing
-
NASA CSDA SmallSat Data Explorer by NASA Commercial Smallsat Acquisition Program
See 6 usage examples →
life sciencesMus musculusneurophysiologyneuroscienceopen source software
Electrophysiological recordings acquired using Neuropixels probes in different mice and labs, targeting the same brain locations (including posterior parietal cortex, hippocampus, and thalamus).
Details →
Usage examples
See 6 usage examples →
agricultureclimatemeteorologicalweather
The Rapid Refresh Forecast System (RRFS) is the National Oceanic and Atmospheric Administration’s (NOAA) next generation convection-allowing, rapidly-updated ensemble prediction system, currently scheduled for operational implementation in 2026. The operational configuration will feature a 3 km grid covering North America and include deterministic forecasts every hour out to 18 hours, with deterministic and ensemble forecasts to 60 hours four times per day at 00, 06, 12, and 18 UTC. The RRFS will provide guidance to support forecast interests including, but not limited to, aviation, severe convective weather, renewable energy, heavy precipitation, and winter weather on timescales where rapidly-updated guidance is particularly useful.
The RRFS is underpinned by the Unified Forecast System (UFS), a community-based Earth modeling initiative, and benefits from collaborative development efforts across NOAA, academia, and research institutions.
This bucket provides access to real-time, experimental RRFS prototype output, and will provide access to final retrospective output once completed.
The real-time RRFS prototype is experimental and evolving. [Real-time RRFS output will cease to be generated for several months beginning on ~3 December 2024. This step is being taken to allow for final retrospective testing to be completed.] It is not under 24x7 monitoring and is not operational. Output may be delayed or missing. Outputs will change. When significant changes to output take place, this description will be updated.
We currently provide hourly deterministic forecasts at 3 km grid spacing out to 60 hours at 00, 06, 12, and 18 UTC, and out to 18 hours for other cycles. Output is organized by cycle date and cycle hour. For example,
rrfs_a/rrfs_a.20241201/12/control
contains the deterministic forecast initialized at 12 UTC on 01 December 2024. Users will find two types of output in GRIB2 format. The first is:
rrfs.t12z.natlev.f018.grib2
Meaning that this is the RRFS_A initialized at 12 UTC, covers the full North America domain, and is the native level post-processed gridded data at hour 18. This output is on a rotated latitude-longitude grid at 3 km grid spacing.
The second output file in grib2 format is:
rrfs.t12z.prslev.f018.conus.grib2
The “prslev” descriptor indicates that this post-processed gridded data is output on pressure levels. The “conus” descriptor indicates that it covers the contiguous United States. For users interested in other domains, output is provided on the full 3-km North American grid and also subset over Alaska, Hawaii, and Puerto Rico. The files are identified as follows:
North America:
rrfs.t00z.prslev.f002.grib2
Alaska:
rrfs.t00z.prslev.f002.ak.grib2
Hawaii:
rrfs.t00z.prslev.f002.hi.grib2
Puerto Rico:
rrfs.t00z.prslev.f002.pr.grib2
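The naming pattern described above can be sketched as a small helper. This is only an illustration under stated assumptions: the function names and the `na` domain label are hypothetical, joining the cycle directory and the file name into a single object key is assumed rather than confirmed by the description, and the description shows a domain suffix only on "prslev" files, so other combinations may not exist in the bucket.

```python
from datetime import date

# Domain suffixes copied from the file names listed above.
# "na" is a hypothetical label for the full North America grid (no suffix).
DOMAIN_SUFFIX = {"na": "", "conus": "conus", "ak": "ak", "hi": "hi", "pr": "pr"}


def max_forecast_hour(cycle_hour: int) -> int:
    """Forecast length per the description: 60 h at 00/06/12/18 UTC, 18 h otherwise."""
    return 60 if cycle_hour in (0, 6, 12, 18) else 18


def rrfs_control_key(cycle_date: date, cycle_hour: int, forecast_hour: int,
                     level: str = "prslev", domain: str = "na") -> str:
    """Build the assumed object key of one deterministic ("control") GRIB2 file."""
    if not 0 <= cycle_hour <= 23:
        raise ValueError("cycle_hour must be between 0 and 23")
    if level not in ("natlev", "prslev"):
        raise ValueError("level must be 'natlev' or 'prslev'")
    if domain not in DOMAIN_SUFFIX:
        raise ValueError(f"unknown domain: {domain}")
    if not 0 <= forecast_hour <= max_forecast_hour(cycle_hour):
        raise ValueError("forecast_hour is outside the range stated for this cycle")

    # File name: rrfs.t<HH>z.<level>.f<FFF>[.<domain>].grib2
    parts = ["rrfs", f"t{cycle_hour:02d}z", level, f"f{forecast_hour:03d}"]
    if DOMAIN_SUFFIX[domain]:
        parts.append(DOMAIN_SUFFIX[domain])
    parts.append("grib2")

    # Directory: rrfs_a/rrfs_a.<YYYYMMDD>/<HH>/control
    directory = f"rrfs_a/rrfs_a.{cycle_date:%Y%m%d}/{cycle_hour:02d}/control"
    return f"{directory}/{'.'.join(parts)}"


if __name__ == "__main__":
    print(rrfs_control_key(date(2024, 12, 1), 12, 18, level="natlev"))
    print(rrfs_control_key(date(2024, 12, 1), 12, 18, domain="conus"))
```

The helper deliberately stops at the key; the bucket name and access method are left to the dataset's own documentation.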
Beginning on December 8th, 2023 we now provide prototype RRFSv1 ensemble output and products. Output is available for 00, 06, 12, and 18 UTC cycles, and is organized by cycle date and cycle hour. For example,
rrfs_a/rrfs_a.20231214/00/mem0001
contains the forecast from member 1, and
rrfs_a/rrfs_a.20231214/00/enspost_timelag
...
Details →
Usage examples
-
A Limited Area Modeling Capability for the Finite-Volume Cubed-Sphere (FV3) Dynamical Core and Comparison With a Global Two-Way Nest by Black, T. L., J. A. Abeles, B. T. Blake, D. Jovic, E. Rogers, X. Zhang, E. A. Aligo, L. C. Dawson, Y. Lin, E. Strobach, P. C. Shafran, and J. R. Carley
-
Community modeling framework underpinning the RRFS - The UFS Short Range Weather Application by UFS Community
-
Assessment of the data assimilation framework for the Rapid Refresh Forecast System v0.1 and impacts on forecasts of a convective storm case study by Banos, I. H., W. D. Mayfield, G. Ge, L. F. Sapucci, J. R. Carley, and L. Nance
-
Status and Opportunities with the Rapid Refresh Forecast System by Carley J. R. and C. R. Alexander
-
Highlights from a Year of Continued Development of the Rapid Refresh Forecast System (RRFS) by Carley J. R. and C. R. Alexander
See 6 usage examples →
biologyhealthimage processingimaginglife sciencesmagnetic resonance imagingneurobiologyneuroimaging
This dataset contains deidentified raw k-space data and DICOM image files of over 1,500 knees and 6,970 brains.
Details →
Usage examples
See 6 usage examples →
citiestransportationurban
Data of trips taken by taxis and for-hire vehicles in New York City. Note: access to this dataset is free, however direct S3 access does require an AWS account. Anonymous downloads are accessible from the dataset's documentation webpage listed below.
Details →
Usage examples
See 6 usage examples →
bioinformaticsbiologygeneticgenomiclife sciencesreference index
This dataset provides genomic reference data and software packages for use with Galaxy and Bioconductor applications. The reference data is available for hundreds of reference genomes and has been formatted for use with a variety of tools. The available configuration files make this data easily incorporable with a local Galaxy server without additional data preparation. Additionally, Bioconductor's AnnotationHub and ExperimentHub data are provided for use via R packag...
Details →
Usage examples
-
Accessible, curated metagenomic data through ExperimentHub by Edoardo Pasolli, Lucas Schiffer, Paolo Manghi, Audrey Renson, Valerie Obenchain, Duy Tin Truong, Francesco Beghini, Faizan Malik, Marcel Ramos, Jennifer B Dowd, Curtis Huttenhower, Martin Morgan, Nicola Segata, and Levi Waldron
-
Wrangling Galaxy's reference data by Daniel Blankenberg, James E. Johnson, The Galaxy Team, James Taylor, Anton Nekrutenko
-
Using Open Bio Ref Data with Galaxy and Bioconductor by Enis Afgan, Alexandru Mahmoud, Nuwan Goonasekera
-
Galaxy by Galaxy Project
-
TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages by Tiago C. Silva, Antonio Colaprico, Catharina Olsen, Fulvio D'Angelo, Gianluca Bontempi, Michele Ceccarelli, Houtan Noushmehr
See 6 usage examples →
disaster responseearth observationearthquakes
Grillo has developed an IoT-based earthquake early-warning system,
with sensors currently deployed in Mexico, Chile, Puerto Rico and Costa Rica,
and is now opening its entire archive of unprocessed accelerometer
data to the world to encourage the development of new algorithms
capable of rapidly detecting and characterizing earthquakes in
real time.
Details →
Usage examples
See 6 usage examples →
acousticsbiodiversitybiologyclimatecoastaldeep learningecosystemsenvironmentalmachine learningmarine mammalsoceansopen source software
This project offers passive acoustic data (sound recordings) from a deep-ocean environment off central California. Recording began in July 2015, has been nearly continuous, and is ongoing. These resources are intended for applications
in ocean soundscape research, education, and the arts.
Details →
Usage examples
See 6 usage examples →
geospatialgeothermalimage processingseismology
Released to the public as part of the Department of Energy's Open Energy Data
Initiative, these data represent vertical and horizontal distributed acoustic
sensing (DAS) data collected as part of the Poroelastic Tomography (PoroTomo)
project funded in part by the Office of Energy Efficiency and Renewable
Energy (EERE), U.S. Department of Energy.
Details →
Usage examples
-
Ground motion response to an ML 4.3 earthquake using co-located distributed acoustic sensing and seismometer arrays by Herbert F Wang, Xiangfang Zeng, Douglas E Miller, Dante Fratta, Kurt L Feigl, Clifford H Thurber, Robert J Mellors
-
PoroTomo DAS Data Processing Tutorial for hdf5 Files by Nicole Taverna and Michael Rossol
-
DAS and DTS at Brady Hot Springs: Observations about Coupling and Coupled Interpretations by Douglas E. Miller, Thomas Coleman, Xiangfang Zeng, Jeremy R. Patterson, Elena C. Reinnisch, Michael A. Cardiff, Herbert F. Wang, Dante Fratta, Whitney Trainor-Guitton, Clifford H. Thurber, Michelle Robertson, Kurt Feigl, and The PoroTomo Team
-
PoroTomo DAS Data Processing Tutorial for hdf5 Files via HSDS and h5pyd by Michael Rossol and Nicole Taverna
-
PoroTomo DAS Data Processing Tutorial for SEG-Y Files by Nicole Taverna and Ross Ring-Jarvi
See 6 usage examples →
computer visiondeep learningearth observationgeospatiallabeledmachine learningsatellite imagery
RarePlanes is a unique open-source machine learning dataset from CosmiQ Works and AI.Reverie that incorporates both real and synthetically generated satellite imagery. The RarePlanes dataset specifically focuses on the value of AI.Reverie synthetic data to aid computer vision algorithms in their ability to automatically detect aircraft and their attributes in satellite imagery. Although other synthetic/real combination datasets exist, RarePlanes is the largest openly-available very high resolution dataset built to test the value of synthetic data from an overhead perspective. The real portion ...
Details →
Usage examples
See 6 usage examples →
bamCOVID-19geneticgenomiclife sciencesMERSSARSSARS-CoV-2virus
Serratus is a collaborative open science project for ultra-rapid discovery of known and unknown coronaviruses in response to the COVID-19 pandemic through re-analysis of publicly available genomic data. Our resulting vertebrate viral alignment data is explorable via the Serratus Explorer and directly accessible on Amazon S3.
Details →
Usage examples
See 6 usage examples →
machine learningNASA SMD AI
The v1 dataset includes AIA/HMI observations from 2010-2018, and v2 includes AIA/HMI observations from 2010-2020, in all 10 wavebands (94A, 131A, 171A, 193A, 211A, 304A, 335A, 1600A, 1700A, 4500A), with 512x512 resolution and a 6-minute cadence; HMI vector magnetic field observations in the Bx, By, and Bz components, with 512x512 resolution and a 12-minute cadence; and EVE observations in 39 wavelengths from 2010-05-01 to 2014-05-26, with a 10-second cadence.
Details →
Usage examples
-
ML applications based on the SDOMLv2 dataset by Wright, Paul J.
-
A Machine-learning Data Set Prepared from the NASA Solar Dynamics Observatory Mission by Galvez, Richard; Fouhey, David F.; Jin, Meng; Szenicer, Alexandre; et al
-
SDOMLv2 Github by Wright, Paul J.; Jin, Meng; Cheung, Mark C. M.
-
ML applications based on the SDOMLv1 dataset by Salvatelli, Valentina; dos Santos, Luiz F. G.; Bose, Souvik; Neuberg, Brad; Cheung, Mark C. M.; Janvier, Miho; Jin, Meng; Gal, Yarin; Boerner, Paul; Baydin, Atılım Güneş
-
Scripts for generating the SDOMLv2 dataset by Jin, Meng
See 6 usage examples →
cloud computingdatacenterenergyHPCworkload analysis
Collection of parsed datacenter logs and time series data of hardware utilization from the MIT Supercloud system.
Details →
Usage examples
See 6 usage examples →
censusdifferential privacydisclosure avoidanceethnicitygroup quartershispanichousinghousing unitslatinonoisy measurementspopulationraceredistrictingvoting age
The 2010 Census Production Settings Demographic and Housing Characteristics (DHC) Demonstration Noisy Measurement File (2023-06-30) is an intermediate output of the 2020 Census Disclosure Avoidance System (DAS) TopDown Algorithm (TDA) (as described in Abowd, J. et al [2022] https://doi.org/10.1162/99608f92.529e3cb9 , and implemented in https://github.com/uscensusbureau/DAS_2020_Redistricting_Production_Code). The NMF was produced using the official “production settings,” the final set of algorithmic parameters and privacy-loss budget allocations that were used to produce the 2020 Census Redistricting Data (P.L. 94-171) Summary File and the 2020 Census Demographic and Housing Characteristics File. The NMF consists of the full set of privacy-protected statistical queries (counts of individuals or housing units with particular combinations of characteristics) of confidential 2010 Census data relating to the 2010 Demonstration Data Products Suite – Redistricting (P.L. 94-171) and Demographic and Housing Characteristics File – Production Settings (2023-04-03). These statistical queries, called “noisy measurements,” were produced under the zero-Concentrated Differential Privacy framework (Bun, M. and Steinke, T [2016] https://arxiv.org/abs/1605.02065; see also Dwork C. and Roth, A. [2014] https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf) implemented via the discrete Gaussian mechanism (Canonne C., et al., [2023] https://arxiv.org/abs/2004.00010), which added positive or negative integer-valued noise to each of the resulting counts. The noisy measurements are an intermediate stage of the TDA prior to the post-processing the TDA then performs to ensure internal and hierarchical consistency within the resulting tables. The Census Bureau has released these 2010 Census demonstration data to enable data users to evaluate the expected impact of disclosure avoidance variability on 2020 Census data.
The 2010 Census Production Settings Demographic and Housing Characteristics (DHC) Demonstration Noisy Measurement File (2023-04-03) has been cleared for public dissemination by the Census Bureau Disclosure Review Board (CBDRB-FY22-DSEP-004).
The 2010 Census Production Settings Demographic and Housing Characteristics Demonstration Noisy Measurement File includes zero-Concentrated Differentially Private (zCDP) (Bun, M. and Steinke, T [2016]) noisy measurements, implemented via the discrete Gaussian mechanism. These are estimated counts of individuals and housing units included in the 2010 Census Edited File (CEF), which includes confidential data initially collected in the 2010 Census of Population and Housing. The noisy measurements included in this file were subsequently post-processed by the TopDown Algorithm (TDA) to produce the 2010 Census Production Settings Privacy-Protected Microdata File - Redistricting (P.L. 94-171) and Demographic and Housing Characteristics File (2023-04-03) (https://www2.census.gov/programs-surveys/decennial/2020/program-management/data-product-planning/2010-demonstration-data-products/04-Demonstration_Data_Products_Suite/2023-04-03/). As these 2010 Census demonstration data are intended to support study of the design and expected impacts of the 2020 Disclosure Avoidance System, the 2010 CEF records were pre-processed before application of the zCDP framework. This pre-processing converted the 2010 CEF records into the input-file format, response codes, and tabulation categories used for the 2020 Census, which differ in substantive ways from the format, response codes, and tabulation categories originally used for the 2010 Census.
The NMF provides estimates ...
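As a rough illustration of what "integer-valued noise" from a discrete Gaussian looks like, the following Python sketch adds such noise to a few counts. It is not the Census Bureau's production sampler: it draws from a truncated, normalized probability table rather than the exact sampling algorithm, and the counts, the scale `sigma`, and the function names are all hypothetical.

```python
import math
import random


def discrete_gaussian_pmf(sigma, tail=12):
    """Return (support, probabilities) for a discrete Gaussian on the integers.

    Assumption: the support is truncated at +/- tail*sigma, which drops only a
    negligible amount of probability mass for illustration purposes.
    """
    bound = max(1, math.ceil(tail * sigma))
    support = list(range(-bound, bound + 1))
    # Unnormalized weights proportional to exp(-k^2 / (2 sigma^2)).
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in support]
    total = sum(weights)
    return support, [w / total for w in weights]


def add_noise(counts, sigma, rng):
    """Add independent integer-valued noise to each count."""
    support, probs = discrete_gaussian_pmf(sigma)
    noise = rng.choices(support, weights=probs, k=len(counts))
    return [c + n for c, n in zip(counts, noise)]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
true_counts = [120, 0, 37, 5]     # hypothetical confidential counts
noisy_counts = add_noise(true_counts, sigma=3.0, rng=rng)
print(noisy_counts)
```

Because the noise can be negative, a noisy measurement of a small count may itself be negative; restoring non-negativity and consistency is what the TDA post-processing described above is for.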
Details →
Usage examples
-
The 2020 Census Disclosure Avoidance System Topdown Algorithm by Abowd, J., Ashmead, R., Cumings-Menon, R., Garfinkel, S., Heineck, M., Heiss, C., Johns, R., Kifer, D., Leclerc, P., Machanavajjhala, A., Moran, B., Sexton, W., Spence, M., Zhuravlev, P.
-
Computing Confidence Intervals Using the 2010 Census Production Settings Redistricting Data (P.L. 94-171) Demonstration Noisy Measurement File (2023-04-03) by Cumings-Menon, R., Hawes, M., and Spence, M. (2023) (Jupyter notebook explaining how to calculate estimates and confidence intervals from the noisy measurement files)
-
Geographic Spines in the 2020 Census Disclosure Avoidance System by Abowd, J., Ashmead, R., Cumings-Menon, R., Garfinkel, S., Heineck, M., Heiss, C., Johns, R., Kifer, D., Leclerc, P., Machanavajjhala, A., Moran, B., Sexton, W., Spence, M., Zhuravlev, P.
-
2010 Census Summary File 1 Technical Documentation by U.S. Census Bureau
-
DAS 2020 Redistricting Production Code Release by U.S. Census Bureau (Public GitHub repository for the 2020 Census DAS, vintaged as of the commit used to produce the official production run of the Redistricting product. The zCDP framework NMFs were generated in a for-internal-use-only pickled (https://docs.python.org/3/library/pickle.html; https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.pickleFile.html) form as a byproduct of the use of this code. A stand-alone script was developed and used to convert these internal-use NMFs into the Parquet format used in this product; that script is not yet publicly available.)
See 5 usage examples →
censusdifferential privacydisclosure avoidanceethnicitygroup quartershousinghousing unitsnoisy measurementspopulationraceredistrictingvoting age
The 2020 Census Demographic and Housing Characteristics Noisy Measurement File is an intermediate output of the 2020 Census Disclosure Avoidance System (DAS) TopDown Algorithm (TDA) (as described in Abowd, J. et al [2022], and implemented in primitives.py). The 2020 Census Demographic and Housing Characteristics Noisy Measurement File includes zero-Concentrated Differentially Private (zCDP) (Bun, M. and Steinke, T [2016]) noisy measurements, implemented via the discrete Gaussian mechanism (Canonne C., et al., [2023]), which added positive or negative integer-valued noise to each of the resulting counts. These are estimated counts of individuals and housing units included in the 2020 Census Edited File (CEF), which includes confidential data collected in the 2020 Census of Population and Housing.
The noisy measurements included in this file were subsequently post-processed by the TopDown Algorithm (TDA) to produce the Census Demographic and Housing Characteristics Summary File. In addition to the noisy measurements, constraints based on invariant calculations --- counts computed without noise --- are also included (with the exception of the state-level total populations, which can be sourced separately from data.census.gov).
The Noisy Measurement File was produced using the official “production settings,” the final set of algorithmic parameters and privacy-loss budget allocations that were used to produce the 2020 Census Redistricting Data (P.L. 94-171) Summary File and the 2020 Census Demographic and Housing Characteristics File.
The noisy measurements are p...
Details →
Usage examples
-
2020 Census Demographic and Housing Characteristics File Technical Documentation by U.S. Census Bureau. Note that the zCDP framework NMFs were generated in a for-internal-use-only pickled (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.pickleFile.html) form as a byproduct of the use of this code. A stand-alone script was developed and used to convert these internal-use NMFs into the Parquet format used in this product (that script is not yet publicly available).
-
Geographic Spines in the 2020 Census Disclosure Avoidance System by Abowd, J., Ashmead, R., Cumings-Menon, R., Garfinkel, S., Heineck, M., Heiss, C., Johns, R., Kifer, D., Leclerc, P., Machanavajjhala, A., Moran, B., Sexton, W., Spence, M., Zhuravlev, P.
-
The 2020 Census Disclosure Avoidance System Topdown Algorithm by Abowd, J., Ashmead, R., Cumings-Menon, R., Garfinkel, S., Heineck, M., Heiss, C., Johns, R., Kifer, D., Leclerc, P., Machanavajjhala, A., Moran, B., Sexton, W., Spence, M., Zhuravlev, P.
-
Computing Confidence Intervals Using the 2010 Census Production Settings Redistricting Data (P.L. 94-171) Demonstration Noisy Measurement File (2023-04-03) (Jupyter notebook explaining how to calculate estimates and confidence intervals from the noisy measurement files) by Cumings-Menon, R., Hawes, M., and Spence, M. (2023)
-
DAS 2020 DHC Production Code Release by U.S. Census Bureau
See 5 usage examples →
agriculturefood securitygeneticgenomiclife sciences
The 3000 Rice Genome Project is an international effort to sequence the genomes of 3,024 rice varieties from 89 countries.
Details →
Usage examples
See 5 usage examples →
agriculturecogdisaster responseearth observationgeospatialimagingsatellite imagerystacsustainability
Imagery acquired by the Amazonia-1 satellite. The image files are recorded and processed by Instituto Nacional de Pesquisas Espaciais (INPE) and converted to Cloud Optimized GeoTIFF format to optimize their use in cloud-based applications. WFI Level 4 (Orthorectified) scenes are being ingested daily starting from 08-29-2022; the complete Level 4 archive will be ingested by the end of October 2022.
Details →
Usage examples
See 5 usage examples →
chemical biologychemistryclimatedatacenterdigital assetsgeochemistrygeophysicsgeosciencemarinenetcdfoceans
Argo is an international program to observe the interior of the ocean with a fleet of profiling floats drifting in the deep ocean currents (https://argo.ucsd.edu). Argo GDAC is a dataset of 5 billion in situ ocean observations from 18,000 profiling floats (4,000 active), which started 20 years ago. The Argo GDAC dataset is a collection of 18,000 NetCDF files. It is a major asset for ocean and climate science and a contributor to IOCCP reports.
Details →
Usage examples
See 5 usage examples →
biologycell biologycomputer visionelectron microscopyimaginglife sciencesmicroscopysegmentation
The Automated Segmentation of intracellular substructures in Electron Microscopy (ASEM) project provides deep learning models trained to segment structures in 3D images of cells acquired by Focused Ion Beam Scanning Electron Microscopy (FIB-SEM). Each model is trained to detect a single type of structure (mitochondria, endoplasmic reticulum, Golgi apparatus, nuclear pores, clathrin-coated pits) in cells prepared via chemical fixation (CF) or high-pressure freezing and freeze substitution (HPFS). You can use our open source pipeline to load a model and predict a class of sub-cellular structur...
Details →
Usage examples
See 5 usage examples →
atmosphereclimateclimate modeldata assimilationforecastgeosciencegeospatiallandmeteorologicalweatherzarr
This is a cloud-hosted subset of the CAM6+DART (Community Atmosphere Model version 6 Data Assimilation Research Testbed) Reanalysis dataset. These data products are designed to facilitate a broad variety of research using the NCAR CESM 2.1 (National Center for Atmospheric Research's Community Earth System Model version 2.1), including model evaluation, ensemble hindcasting, data assimilation experiments, and sensitivity studies. They come from an 80 member ensemble reanalysis of the global troposphere and stratosphere using DART and CAM6. The data products represent states of the atmospher...
Details →
Usage examples
See 5 usage examples →
cancercomputational pathologycomputer visiondeep learninggrand-challenge.orghistopathologylife sciences
This dataset contains all the data for the CAncer MEtastases in LYmph nOdes challeNge or CAMELYON. CAMELYON was the first challenge using whole-slide images in computational pathology and aimed to help pathologists identify breast cancer metastases in sentinel lymph nodes. Lymph node metastases are extremely important to find, as they indicate that the cancer is no longer localized and systemic treatment might be warranted. Searching for these metastases in H&E-stained tissue is difficult and time-consuming and AI algorithms can play a role in helping make this faster and more accura...
Details →
Usage examples
See 5 usage examples →
climateclimate modelclimate projectionsCMIP6ocean circulationocean currentsocean sea surface heightocean simulationocean velocity
This dataset provides several global fields describing the state of the atmosphere, ocean, land and ice from a high-resolution (0.1° for the ocean/ice models and 0.25° for the land/atmosphere models) numerical earth system model, the Community Earth System Model (CESM, https://www.cesm.ucar.edu/). Texas A&M University (TAMU) and National Center for Atmospheric Research together with international partners collaboratively carried out a large set of high-resolution climate simulations, including a 500-year-long preindustrial control simulation (PI-CTRL) described here. The CESM uses dynamic equation...
Details →
Usage examples
See 5 usage examples →
air qualityclimateenvironmentalgeospatialmeteorological
CMAS Data Warehouse on AWS collects and disseminates meteorology, emissions and air quality model input and output for Community Multiscale Air Quality (CMAQ) Model Applications. This dataset is available as part of the AWS Open Data Program; therefore, egress fees are not charged to either the host or the person downloading the data. This S3 bucket is maintained as a public service by the University of North Carolina's CMAS Center, the US EPA’s Office of Research and Development, and the US EPA’s Office of Air and Radiation. Metadata and DOIs for datasets included in the CMAS Data Wareho...
Details →
Usage examples
See 5 usage examples →
cancergeneticgenomiclife sciencesSTRIDESwhole genome sequencing
The Relating Clinical Outcomes in Multiple Myeloma to Personal Assessment of Genetic Profile (CoMMpass) study is the Multiple Myeloma Research Foundation (MMRF)’s landmark personalized medicine initiative. CoMMpass is a longitudinal observational study of around 1000 newly diagnosed myeloma patients receiving various standard approved treatments. The MMRF’s vision is to track the treatment and results for each CoMMpass patient so that someday the information can be used to guide decisions for newly diagnosed patients. CoMMpass checked on patients every 6 months for 8 years, collecting tissue samples, gene...
Details →
Usage examples
-
Interim Analysis of the MMRF CoMMpass Trial: Identification of Novel Rearrangements Potentially Associated with Disease Initiation and Progression by Sagar Lonial, MD, Venkata D Yellapantula, Winnie Liang, PhD, Ahmet Kurdoglu, BS, Jessica Aldrich, MSc, Christophe M. Legendre, MD, Kristi Stephenson, Jonathan Adkins, Jackie McDonald, Adrienne Helland, Megan Russell, Austin Christofferson, Lori Cuyugan, Dan Rohrer, Alex Blanski, Meghan Hodges, MMRF CoMMpass Network, Mary Derome, Daniel Auclair, PhD, Pamela G. Kidd, MD, Scott Jewell, PhD, David Craig, PhD, John Carpten, PhD, Jonathan J. Keats, PhD
-
Genomic Data Commons by National Cancer Institute
-
Molecular Predictors of Outcome and Drug Response in Multiple Myeloma: An Interim Analysis of the MMRF CoMMpass Study by Jonathan J Keats, PhD, Gil Speyer, Austin Christofferson, Christophe Legendre, PhD, Jessica Aldrich, Megan Russell, Lori Cuyugan, Jonathan Adkins, Alex Blanski, Meghan Hodges, Dan Rohrer, Sundar Jagannath, MD, Ravi Vij, MD, Gregory Orloff, MD, Todd Zimmerman, MD, Ruben Niesvizky, MD, Darla Liles, MD, Joseph W. Fay, Jeffrey L. Wolf, MD, Robert M Rifkin, Norma C Gutierrez, MD PhD, MMRF CoMMpass Network, Jennifer Yesil, MS, Mary Derome, MS, Seungchan Kim, PhD, Winnie Liang, PhD, Pamela G. Kidd, MD, Scott Jewell, PhD, John David Carpten, PhD, Daniel Auclair, PhD, Sagar Lonial, MD FACP
-
Interim Analysis of the MMRF CoMMpass Trial: A Longitudinal Study in Multiple Myeloma Relating Clinical Outcomes to Genomic and Immunophenotypic Profiles by Keats JJ, Craig DW, Liang W, Venkata Y, Kurdoglu A, Aldrich J, Auclair D, Allen K, Harrison B, Jewell S, Kidd PG, Correll M, Jagannath S, Siegel DS, Vij R, Orloff G, Zimmerman TM, MMRF CoMMpass Network, Capone W, Carpten J, Lonial S.
-
Identification of Initiating Trunk Mutations and Distinct Molecular Subtypes: An Interim Analysis of the MMRF CoMMpass Study by Jonathan J Keats, PhD, Gil Speyer, Christophe Legendre, Austin Christofferson, Kristi Stephenson, BS, Ahmet Kurdoglu, Megan Russell, Jessica Aldrich, Lori Cuyugan, Jonathan Adkins, Jackie McDonald, Adrienne Helland, Alex Blanski, Meghan Hodges, Dan Rohrer, Sundar Jagannath, MD, David Siegel, MD PhD, Ravi Vij, MD MBA, Gregory Orloff, MD, Todd Zimmerman, MD, Ruben Niesvizky, MD, Darla Liles, MD, Joseph W. Fay, Jeffrey L. Wolf, MD PhD, Robert M. Rifkin, Norma C Gutierrez, The MMRF CoMMpass Network, Jen Toups, Mary Derome, MS, Winnie Liang, PhD, Seungchan Kim, Daniel Auclair, PhD, Pamela G. Kidd, MD, Scott Jewell, PhD, John David Carpten, PhD, Sagar Lonial, MD
See 5 usage examples →
atmosphereclimateclimate modelgeospatialicelandmodeloceanssustainabilityzarr
The Community Earth System Model (CESM) Large Ensemble Numerical Simulation (LENS) dataset includes a 40-member ensemble of climate simulations for the period 1920-2100 using historical data (1920-2005) or assuming the RCP8.5 greenhouse gas concentration scenario (2006-2100), as well as longer control runs based on pre-industrial conditions. The data comprise both surface (2D) and volumetric (3D) variables in the atmosphere, ocean, land, and ice domains. The total data volume of the original dataset is ~500TB, which has traditionally been stored as ~150,000 individual CF/NetCDF files on disk o...
Details →
Usage examples
See 5 usage examples →
disaster responsegeospatialmappingosm
Daylight is a complete distribution of global, open map data that’s freely available with support from community and professional mapmakers. Meta combines the work of global contributors to projects like OpenStreetMap with quality and consistency checks from Daylight mapping partners to create a free, stable, and easy-to-use street-scale global map.
The Daylight Map Distribution contains a validated subset of the OpenStreetMap database. In addition to the standard OpenStreetMap PBF format, Daylight is available in two parquet formats that are optimized for AWS Athena including geometries (Points, LineStrings, Polygons, or MultiPolygons). First, Daylight OSM Features contains the nearly 1B renderable OSM features. Second, Daylight OSM Elements contains all of OSM, including all 7B nodes without attributes, and relations that do not contain geometries, such as turn restrictions.
Daylight ...
Details →
Usage examples
See 5 usage examples →
agriculturecogdisaster responseearth observationgeospatialland coverland usemachine learningmappingnatural resourcesatellite imagerystacsustainabilitysynthetic aperture radar
The WorldCover 10m Annual Composites were produced, as part of the European Space Agency (ESA) WorldCover project, from the yearly Copernicus Sentinel-1 and Sentinel-2 archives for both years 2020 and 2021. These global mosaics consist of four product composites. A Sentinel-2 RGBNIR yearly median composite for bands B02, B03, B04, B08. A Sentinel-2 SWIR yearly median composite for bands B11 and B12. A Sentinel-2 NDVI yearly percentiles composite (NDVI 90th, NDVI 50th, NDVI 10th percentiles). A Sentinel-1 GAMMA0 yearly median composite for bands VV, VH and VH/VV (power scaled). Each product is...
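To make the NDVI percentile composite concrete, here is a minimal Python sketch for a single pixel. It assumes the standard NDVI definition, (NIR - red) / (NIR + red), with Sentinel-2 band B08 as NIR and B04 as red. The reflectance values, the function names, and the linear-interpolation percentile rule are illustrative assumptions; the actual WorldCover processing (cloud masking, percentile method) is not described in this listing.

```python
import math


def ndvi(nir, red):
    """Normalized difference vegetation index for one observation."""
    denom = nir + red
    return (nir - red) / denom if denom != 0 else float("nan")


def percentile(values, q):
    """Linearly interpolated percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


# Hypothetical one-year series of (B08, B04) reflectances for one pixel.
series = [(0.30, 0.10), (0.45, 0.08), (0.50, 0.05), (0.20, 0.12), (0.35, 0.09)]
ndvi_series = [ndvi(nir, red) for nir, red in series]

p10 = percentile(ndvi_series, 10)
p50 = percentile(ndvi_series, 50)
p90 = percentile(ndvi_series, 90)
print(p10, p50, p90)
```

Applied per pixel across a year of acquisitions, the three values give a low, typical, and high vegetation signal that is less sensitive to single outlier dates than a minimum or maximum would be.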
Details →
Usage examples
-
WorldCover Viewer by VITO
-
ESA WorldCover 10 m 2021 v200 - Product User Manual by VITO
-
Exploring the datasets by VITO
-
Release of the 10 m WorldCover map by Ruben Van De Kerchove
-
ESA WorldCover 10 m 2021 v200 by Zanaga, D., Van De Kerchove, R., Daems, D., De Keersmaecker, W., Brockmann, C., Kirches, G., Wevers, J., Cartus, O., Santoro, M., Fritz, S., Lesiv, M., Herold, M., Tsendbazar, N.E., Xu, P., Ramoino, F., Arino, O.
See 5 usage examples →
citiesclimateenergyenergy modelinggeospatialmetadatamodelopen source softwaresustainabilityutilities
The U.S. Department of Energy (DOE) funded a three-year project, End-Use Load Profiles for the U.S. Building Stock, that culminated in this publicly available dataset of calibrated and validated 15-minute resolution load profiles for all major residential and commercial building types and end uses, across all climate regions in the United States. These EULPs were created by calibrating the ResStock and ComStock physics-based building stock models using many different measured datasets, as described here.
This dataset includes load profiles for both the baseline building stock and the building ...
Details →
Usage examples
See 5 usage examples →
agriculturecogdisaster responseelevationgeospatialhydrologysatellite imagerystac
Height Above Nearest Drainage (HAND) is a terrain model that normalizes topography to the relative heights along the drainage network and is used to describe the relative soil gravitational potentials or the local drainage potentials. Each pixel value represents the vertical distance to the nearest drainage. The HAND data provides near-worldwide land coverage at 30 meters and was produced from the 2021 release of the Copernicus GLO-30 Public DEM as distributed in the Registry of Open Data on AWS.
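The idea that each pixel stores its vertical distance to the nearest drainage can be sketched in a few lines of Python. This toy version assumes a precomputed flow network in which every cell points to one downstream cell and every flow path eventually reaches a drainage cell; the arrays and the function name are hypothetical and are not taken from the HAND production workflow.

```python
def height_above_nearest_drainage(elevation, downstream, is_drainage):
    """Return, for each cell, its elevation minus that of the drainage cell it drains to.

    Assumption: following `downstream` from any cell reaches a drainage cell
    (no cycles, no sinks), so the loop always terminates.
    """
    result = []
    for cell in range(len(elevation)):
        current = cell
        # Walk down the flow path until a drainage cell is reached.
        while not is_drainage[current]:
            current = downstream[current]
        result.append(elevation[cell] - elevation[current])
    return result


# Hypothetical six-cell network; cell 3 is the only drainage (stream) cell.
elevation = [12.0, 10.5, 9.0, 8.0, 11.0, 9.5]      # meters
downstream = [1, 2, 3, 3, 5, 3]                    # index of the next cell downhill
is_drainage = [False, False, False, True, False, False]

hand_values = height_above_nearest_drainage(elevation, downstream, is_drainage)
print(hand_values)  # [4.0, 2.5, 1.0, 0.0, 3.0, 1.5]
```

Drainage cells get a value of zero, and low values elsewhere mark terrain that sits only slightly above the stream it drains to.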
Details →
Usage examples
See 5 usage examples →
agriculturecogearth observationearthquakesecosystemsenvironmentalgeologygeophysicsgeospatialglobalinfrastructuremappingnatural resourcesatellite imagerysynthetic aperture radarurban
This data set is the first-of-its-kind spatial representation of multi-seasonal, global SAR repeat-pass interferometric coherence and backscatter signatures. Global coverage comprises all land masses and ice sheets from 82 degrees northern to 79 degrees southern latitude. The data set is derived from high-resolution multi-temporal repeat-pass interferometric processing of about 205,000 Sentinel-1 Single-Look-Complex data acquired in Interferometric Wide-Swath mode (Sentinel-1 IW mode) from 1-Dec-2019 to 30-Nov-2020. The data set was developed by Earth Big Data LLC and Gamma Remote Sensing AG, under contract for NASA's Jet Propulsion Laboratory. ...
Details →
Usage examples
-
Jupyter Notebook to access and visualize global mosaics of the global data set by Josef Kellndorfer
-
Webinar: The new era of SAR Time Series Analysis and Visualization: Cloud meets Big SAR Data. IEEE GRSS Bay Area Chapter (Dec. 3rd 2021) by Josef Kellndorfer
-
Generating Global Temporal Coherence Maps from one year of Sentinel-1 C-band data, ESA Fringe 2021 Poster (Youtube) by Oliver Cartus, Josef Kellndorfer, Shadi Oveisgharan, Batu Osmanoglu, Paul Rosen, Urs Wegmüller
-
Jupyter Notebook to access and visualize sub regions of the global data set by Josef Kellndorfer
-
Global seasonal Sentinel-1 interferometric coherence and backscatter data set by Josef Kellndorfer, Oliver Cartus, Marco Lavalle, Christophe Magnard, Pietro Milillo, Shadi Oveisgharan, Batu Osmanoglu, Paul A. Rosen, Urs Wegmüller
See 5 usage examples →
agriculturecogdeep learninglabeledland covermachine learningsatellite imagery
High resolution, annual cropland and landcover maps for selected African countries developed by Clark University's Agricultural Impacts Research Group using various machine learning approaches applied to Planet imagery, including field boundary and cultivated frequency maps, as well as multi-class land cover.
Details →
Usage examples
See 5 usage examples →
life sciencesMus musculusneurophysiologyneuroscienceopen source software
Behavioral data of mice performing a decision-making task, associated with 2020 publication of the IBL.
Details →
Usage examples
See 5 usage examples →
agriculturedisaster responseearth observationgeospatialmeteorologicalsatellite imageryweather
Himawari-9, stationed at 140.7E, owned and operated by the Japan Meteorological Agency (JMA), is a geostationary meteorological satellite, with Himawari-8 as on-orbit back-up, that provides constant and uniform coverage of east Asia, and the west and central Pacific regions from around 35,800 km above the equator with an orbit corresponding to the period of the earth’s rotation. This allows JMA weather offices to perform uninterrupted observation of environmental phenomena such as typhoons, volcanoes, and general weather systems. Archive data back to July 2015 is available for Full Disk (AHI-L...
Details →
Usage examples
See 5 usage examples →
aerial imageryearth observationelevationgeospatiallidar
The KyFromAbove initiative is focused on building and maintaining a current basemap for Kentucky that can meet the needs of its users at the state, federal, local, and regional level. A common basemap, including current color leaf-off aerial photography and elevation data (LiDAR), reduces the cost of developing GIS applications, promotes data sharing, and adds efficiencies to many business processes. All basemap data acquired through this effort is being made available in the public domain. KyFromAbove acquires aerial imagery and LiDAR during leaf-off conditions in the Commonwealth. The imagery...
Details →
Usage examples
See 5 usage examples →
fastageneticgenomiclife sciencesmetagenomicsSTRIDEStranscriptomicswhole exome sequencingwhole genome sequencing
This repository is a re-analysis of the NCBI Sequence Read Archive (SRA), December 2023 freeze, to make it more accessible. The SRA is an open access database of biological sequences, containing raw data from high-throughput DNA and RNA sequencing platforms. It is the largest database of public DNA sequences worldwide, containing a wealth of genomic diversity across all living organisms. This repository contains Logan, a set of compressed FASTA files for all individual SRA accessions, in the form of unitigs and contigs. Borrowing methods from the realm of genome assembly, unitigs preserve nearl...
Details →
Usage examples
See 5 usage examples →
cancerclassificationcomputational pathologycomputer visiondeep learningdigital pathologygrand-challenge.orghistopathologyimaginglife sciencesmachine learningmedical image computingmedical imaging
This dataset contains the training data for the Machine learning for Optimal detection of iNflammatory cells in the KidnEY or MONKEY challenge. The MONKEY challenge focuses on the automated detection and classification of inflammatory cells, specifically monocytes and lymphocytes, in kidney transplant biopsies using Periodic acid-Schiff (PAS) stained whole-slide images (WSI). It contains 80 WSI, collected from 4 different pathology institutes, with annotated regions of interest. For each WSI up to 3 different PAS scans and one IHC slide scan are available. This dataset and challenge support th...
Details →
Usage examples
See 5 usage examples →
elevationlidarplanetarystac
The Lunar Orbiter Laser Altimeter (LOLA) has collected and released almost 7 billion individual laser altimeter returns from the lunar surface. This dataset includes individual altimetry returns scraped from the Planetary Data System (PDS) LOLA Reduced Data Record (RDR) Query Tool, V2.0. Data are organized in 15° x 15° (longitude/latitude) sections, compressed and encoded into the Cloud Optimized Point Cloud (COPC) file format, and collected into a Spatio-Temporal Asset Catalog (STAC) collection for query and analysis. The data are in latitude, longitude, and radius (X, Y, Z) format with the p...
Details →
Usage examples
See 5 usage examples →
agricultureclimatedisaster responseenvironmentalmeteorologicalweather
The National Air Quality Forecasting Capability (NAQFC) dataset contains model-generated Air-Quality (AQ) forecast guidance from three different prediction systems. The first system is a coupled weather and atmospheric chemistry numerical forecast model, known as the Air Quality Model (AQM). It is used to produce forecast guidance for ozone (O3) and particulate matter with diameter equal to or less than 2.5 micrometers (PM2.5) using meteorological forecasts based on NCEP’s operational weather forecast models such as North American Mesoscale Models (NAM) and Global Forecast System (GFS), and atmospheric chemistry based on the EPA’s Community Multiscale Air Quality (CMAQ) model. In addition, the modeling system incorporates information related to chemical emissions, including anthropogenic emissions provided by the EPA and fire emissions from NOAA/NESDIS. The NCEP NAQFC AQM output fields in this archive include 72-hr forecast products of raw model and bias-corrected predictions, extending back to 1 January 2020. All of the output was generated by the contemporaneous operational AQM, beginning with AQMv5 in 2020, with upgrades to AQMv6 on 20 July 2021, and AQMv7 on 14 May 2024. The history of AQM upgrades is documented here.
The second prediction system is the Hybrid Single-Particle Lagrangian Integrated Trajectory model (HYSPLIT). It is a widely used atmospheric transport and dispersion model containing an internal dust-generation module. It provides forecast guidance for atmospheric dust concentration and, prior to 28 June 2022, it also provided the NAQFC forecast guidance for smoke. Since that date, the third prediction system, a regional numerical weather prediction (NWP) model known as the Rapid Refresh (RAP) model, has subsumed HYSPLIT for operational smoke guidance, simulating the emission, transport, and deposition of smoke particles that originate from biomass burning (fires) and anthropogenic sources.
The output from each of these modeling systems is generated over three separate domains, one covering CONUS, one Alaska, and the other Hawaii. Currently, for this archive, the ozone, PM2.5, and smoke output is available over all three domains, while dust products are available only over the CONUS domain. The predicted concentrations of all species in the lowest model layer (i.e., the layer in contact with the surface) are available, as are vertically integrated values of smoke and dust. The data is gridded horizontally within each domain, with a grid spacing of approximately 5 km over CONUS, 6 km over Alaska, and 2.5 km over Hawaii. Ozone concentrations are provided in parts per billion (PPB), while the concentrations of all other species are quantified in units of micrograms per cubic meter (ug/m3), except for the column-integrated smoke values, which are expressed in units of mg/m2.
Temporally, O3 and PM2.5 are available as maximum and/or averaged values over various time periods. Specifically, O3 is available in both 1-hour and 8-hour (backward calculated) averages, as well as preceding 1-hour and 8-hour maximum values. Similarly, PM2.5 is available in 1-hour and 24-hour average values and 24-hour maximum values. In addition, all O3 and PM2.5 fields are available with bias-corrected magnitudes, based on derived model biases relative to observations.
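To illustrate what a backward-calculated average means, here is a minimal Python sketch that assumes hourly values held in a plain list. The function names `backward_mean` and `backward_max` and the sample ozone values are hypothetical; they are not part of the NAQFC products or of any NOAA tooling. Each output at hour i summarizes hour i together with the `window - 1` hours before it.

```python
from typing import List, Optional


def backward_mean(hourly: List[float], window: int) -> List[Optional[float]]:
    """Mean of each hour and the (window - 1) hours preceding it.

    Hours that do not yet have a full window behind them yield None.
    """
    out: List[Optional[float]] = []
    for i in range(len(hourly)):
        if i + 1 < window:
            out.append(None)  # not enough preceding hours yet
        else:
            chunk = hourly[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


def backward_max(hourly: List[float], window: int) -> List[Optional[float]]:
    """Maximum of each hour and the (window - 1) hours preceding it."""
    out: List[Optional[float]] = []
    for i in range(len(hourly)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(max(hourly[i + 1 - window : i + 1]))
    return out


# Hypothetical hourly ozone values in PPB (illustrative only).
o3_ppb = [30, 32, 35, 40, 46, 52, 55, 58, 60, 57]
avg_8h = backward_mean(o3_ppb, 8)
max_8h = backward_max(o3_ppb, 8)
```

The same helpers cover the 24-hour PM2.5 case by passing `window=24`. How the operational products treat missing hours or the start of a forecast cycle is not described here, so the `None` padding is only one possible convention.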
The AQM produces hourly forecast guidance for O3 and PM2.5 out to 72 hours twice per day, starting at 0600 and 1...
Details →
Usage examples
-
Using VIIRS fire radiative power data to simulate biomass burning emissions, plume rise and smoke transport in a real-time air quality modeling system (Proc. 2017 IEEE Int. Geoscience and Remote Sensing Symp. (IGARSS), Fort Worth, TX, IEEE, 2806–2808) by Ahmadov, R., and Coauthors
-
An empirically derived emission algorithm for wind-blown dust (J. Geophys. Res., 115, D16212) by Draxler, R. R., P. Ginoux, and A. F. Stein
-
Improving NOAA NAQFC PM2.5 predictions with a bias correction approach (2017, Wea. and Forecasting, 32(2), 407–421) by Huang, J., McQueen, J., Wilczak, J., Djalalova, I., Stajner, I., Shafran, P., Allured, D., Lee, P., Pan, L., Tong, D., Huang, H.-C., DiMego, G., Upadhayay, S., & Delle Monache, L
-
Development and evaluation of an advanced National Air Quality Forecasting Capability using the NOAA Global Forecast System version 16 (2022, Geosci. Model Dev., 15, 3281–3313) by Campbell, P.C., and Coauthors
-
Development of the next-generation air quality prediction system in the UFS framework: Enhancing predictability of wildfire air quality impacts (2024, Bull. Amer. Meteor. Soc., in review) by Huang, J.P., I. Stajner, R. Montuoro, F. Yang, K. Wang, H.-C. Huang, C.-H. Jeon, B. Curtis, J. McQueen, H. Liu, B. Baker, D. Tong, Y. Tang, P. Campbell, G. Grell, G. Frost, R. Schwantes, S. Wang, S. Kondragunta, F. Li, and Y. Jung
See 5 usage examples →
earth observation, geospatial, satellite imagery, urban
NDUI is combined with cloud shadow-free Landsat Normalized Difference Vegetation Index (NDVI) composite and DMSP/OLS Night Time Light (NTL) to characterize global urban areas at a 30 m resolution, and it can greatly enhance urban areas, which can then be easily distinguished from bare lands including fallows and deserts. With the capability to delineate urban boundaries and, at the same time, to present sufficient spatial details within urban areas, the NDUI has the potential for urbanization studies at regional and global scales.
Details →
Usage examples
See 5 usage examples →
agriculture, climate, earth observation, meteorological, weather
Open-Meteo integrates weather models from reputable national weather services, offering a swift and efficient weather API. Real-time weather forecasts are unified into a time-series database that provides historical and future weather data for any location worldwide. Through Open-Meteo on AWS Open Data, you can download the Open-Meteo weather database and analyze weather data locally. Docker images are provided to download data and to expose an HTTP API endpoint. Using Open-Meteo SDKs, you can seamlessly integrate weather data into your Python, Typescript, Swift, Kotlin, or Java applications.T...
Details →
Usage examples
See 5 usage examples →
disaster response, geospatial, mapping, osm
OSM is a free, editable map of the world, created and maintained by volunteers. Regular OSM data archives are made available in Amazon S3.
Details →
Usage examples
See 5 usage examples →
geospatial, global, mapping, osm, parquet, transportation
Overture is a collaboratively built, global, open map data project for developers who build map services or use geospatial data. Overture Open Map Data contains data that are standardized under the themes of Admins, Base, Buildings, Places, and Transportation. Overture also includes a Global Entity Reference System (GERS) which encodes map data to a shared universal reference. Beginning with the Overture 2023-11-14-alpha.0 release, the data is available as cloud-native GeoParquet files.
Details →
Usage examples
See 5 usage examples →
cities, elevation, geospatial, land, lidar, mapping, urban
The objective of the Mapa 3D Digital da Cidade (M3DC) of the São Paulo City Hall is to publish LiDAR point cloud data. The initial data was acquired in 2017 by aerial surveying and future data will be added. This publicly accessible dataset is provided in the Entwine Point Tiles format as a lossless octree, full density, based on LASzip (LAZ) encoding.
Details →
Usage examples
See 5 usage examples →
autonomous racing, autonomous vehicles, computer vision, GNSS, image processing, lidar, localization, object detection, object tracking, perception, radar, robotics
The RACECAR dataset is the first open dataset for full-scale and high-speed autonomous racing. Multi-modal sensor data has been collected from fully autonomous Indy race cars operating at speeds of up to 170 mph (273 kph). Six teams who raced in the Indy Autonomous Challenge during 2021-22 have contributed to this dataset. The dataset spans 11 interesting racing scenarios across two race tracks which include solo laps, multi-agent laps, overtaking situations, high accelerations, banked tracks, obstacle avoidance, and pit entry and exit at different speeds. The data is organized and released in bot...
Details →
Usage examples
-
rosbag2nuscenes conversion library by John Chrosniak, Emory Ducote, John Link, Madhur Behl
-
RACECAR Tutorials - ROS2 Visualization by Amar Kulkarni, Utkarsh Chirimar
-
RACECAR Tutorials - ROS2 Localization by Amar Kulkarni
-
RACECAR Tutorials - nuScenes by John Chrosniak
-
RACECAR--The Dataset for High-Speed Autonomous Racing by Amar Kulkarni, John Chrosniak, Emory Ducote, Florian Sauerbeck, Andrew Saba, Utkarsh Chirimar, John Link, Marcello Cellina, and Madhur Behl
See 5 usage examples →
air quality, environmental
SPARTAN (Surface PARTiculate mAtter Network) measures and provides surface ambient particulate matter (PM2.5 and PM10) concentration and the chemical composition around the world, with the purpose of connecting ground-based PM2.5 and satellite remote sensing.
Details →
Usage examples
See 5 usage examples →
biodiversity, ecosystems, fisheries, marine
The project presents Sea Around Us Global Fisheries Catch Data aggregated at EEZ level. The data are computed from reconstructed catches from various official fisheries statistics and scientific, technical, and policy reports about the fisheries, and include estimates of discards and of unreported and illegal catch from all maritime countries and major territories of the world. This project was the result of a collaboration between Sea Around Us and the CIC programme, a collaborative programme between the University of British Columbia (UBC) and AWS.
Details →
Usage examples
See 5 usage examples →
climate, environmental, meteorological, oceans, sustainability, weather
This dataset includes archival hourly data from the Sofar Spotter buoy global network (https://weather.sofarocean.com/) from 2019 to March 2022.
Details →
Usage examples
-
Exploring Wave Spectra Variables from the Spotter Archive by Isabel A. Houghton
-
Performance Statistics of a Real-Time Pacific Ocean Weather Sensor Network (2021) by I. Houghton, P. Smit, D. Clark, C. Dunning, A. Fisher, N. Nidzieko, P. Chamberlain, T. Janssen
-
Exploring Bulk Variables from the Spotter Archive by Isabel A. Houghton
-
Performance Characteristics of “Spotter,” a Newly Developed Real-Time Wave Measurement Buoy (2019) by K. Raghukumar, G. Chang, F. Spada, C. Jones, T. Janssen, A. Gans
-
Analyzing Spotter data with CloudDrift by Milan Curcic
See 5 usage examples →
climate, environmental, GPS, weather
SondeHub Radiosonde telemetry contains global radiosonde (weather balloon) data captured by SondeHub from our participating radiosonde_auto_rx receiving stations. radiosonde_auto_rx is an open-source project aimed at receiving and decoding telemetry from airborne radiosondes using software-defined-radio techniques, enabling study of the telemetry and sometimes recovery of the radiosonde itself.
Currently 313 receiver stations are providing data for an average of 384 radiosondes a day. The data within this repository contains received telemetry frames, including radiosonde type, GPS position, a...
Details →
Usage examples
See 5 usage examples →
biology, imaging, life sciences, neurobiology, neuroimaging, neuroscience
The Human Connectome Project (HCP Young Adult, HCP-YA) is mapping the healthy human connectome by collecting and freely distributing neuroimaging and behavioral data on 1,200 normal young adults, aged 22-35.
Details →
Usage examples
-
The Human Connectome Project: A retrospective by Elam JS, Glasser MF, Harms MP, Sotiropoulos SN, Andersson JL, Burgess GC, Curtiss SW, et al.
-
Exploring the Human Connectome by The Human Connectome Project
-
The minimal preprocessing pipelines for the Human Connectome Project by Glasser MF, Sotiropoulos SN, Wilson JA, Coalson TS, Fischl B, Andersson JL, Xu J, Jbabdi S, et al.
-
The Human Connectome Workbench by The Human Connectome Project
-
The WU-Minn Human Connectome Project: an overview. by Van Essen DC, Smith SM, Barch DM, Behrens TEJ, Yacoub E, Ugurbil K, and the WU-Minn HCP Consortium.
See 5 usage examples →
aerial imagery, earth observation, elevation, geospatial, land cover, lidar
The State of Vermont has partnered with Amazon's Open Data Initiative to make a wide range of geospatial data available in the public domain. Vermont acquires aerial imagery and LiDAR during leaf-off conditions. The imagery typically ranges from 30-centimeter to 15-centimeter in resolution and is available from Vermont's Amazon S3 bucket in a Cloud Optimized GeoTiff (COG) format. LiDAR data has been acquired and is available as USGS Quality Level-1 (QL1) and Level-2 (QL2) compliant datasets in COG format. Geospatial datasets derived from imagery and/or lidar are also available as COGs, ...
Details →
Usage examples
See 8 usage examples →
bioinformatics, biology, genetic, genomic, health, life sciences, protein, reference index, transcriptomics
A centralized repository of pre-formatted BLAST databases created by the National Center for Biotechnology Information (NCBI).
Details →
Usage examples
-
BLAST+ Docker by NCBI BLAST
-
BLAST on the Cloud with NCBI’s ElasticBLAST by Sixing Huang
-
BLAST+: Architecture and Applications by Christiam Camacho, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, Thomas L Madden
-
Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs by S F Altschul, T L Madden, A A Schäffer, J Zhang, Z Zhang, W Miller, D J Lipman
See 4 usage examples →
atmosphere, climate, deep learning, environmental, exploration, geophysics, geoscience, geospatial, global, ice, planetary, satellite imagery, zarr
The Chalmers Cloud Ice Climatology (CCIC) is a novel, deep-learning-based climate record of ice-particle concentrations in the atmosphere. CCIC results are available at high spatial and temporal resolution (0.07° / 3 h from 1983, 0.036° / 30 min from 2000) and thus ideally suited for evaluating high-resolution weather and climate models or studying individual weather systems.
Details →
Usage examples
See 4 usage examples →
atmosphere, climate, climate model, geospatial, ice, land, model, oceans, sustainability, zarr
The US National Center for Atmospheric Research partnered with the IBS Center for Climate Physics in South Korea to generate the CESM2 Large Ensemble which consists of 100 ensemble members at 1 degree spatial resolution covering the period 1850-2100 under CMIP6 historical and SSP370 future radiative forcing scenarios. Data sets from this ensemble were made downloadable via the Climate Data Gateway on June 14th, 2021.
NCAR has copied a subset (currently ~500 TB) of CESM2 LENS data to Amazon S3 as part of the AWS Public Datasets Program. To optimize for large-scale analytics we have represented ...
Details →
Usage examples
See 4 usage examples →
data assimilation, electricity, energy, energy modeling, industrial, meteorological, solar, transportation
Projects that use the dsgrid toolkit assemble bottom-up descriptions of electricity
demand and related data that are highly resolved geographically, temporally, and sectorally.
Typically modelers describe multiple scenarios of future energy use at hourly resolution,
suitable for inclusion in long-term power system planning models, i.e., capacity expansion and
production cost models.
Details →
Usage examples
See 7 usage examples →
air temperature, atmosphere, meteorological, near-surface air temperature, near-surface relative humidity, near-surface specific humidity, precipitation, weather
These products are a subset of the ECMWF real-time forecast data and are made available to the public free of charge. They are based on the medium-range (high-resolution and ensemble) and seasonal forecast models. Products are available at 0.4 degrees resolution in GRIB2 format unless stated otherwise.
Details →
Usage examples
See 4 usage examples →
atmosphere, climate, earth observation, global, signal processing, weather
This is an updating archive of radio occultation (RO) data using the transmitters of the Global Navigation Satellite Systems (GNSS) as generated and processed by the COSMIC DAAC (ucar), the Jet Propulsion Laboratory (jpl) of the California Institute of Technology, and the Radio Occultation Meteorology Satellite Application Facility (romsaf). The contributions for ucar and romsaf are currently active.
This dataset is funded by the NASA Earth Science Data Systems and the Advancing Collaborative Connections for Earth System Science (ACCESS) 2019 program.
Details →
Usage examples
See 4 usage examples →
bioinformatics, biology, genetic, genomic, life sciences
The Encyclopedia of DNA Elements (ENCODE) Consortium is an international collaboration of
research groups funded by the National Human Genome Research Institute (NHGRI). The goal
of ENCODE is to build a comprehensive parts list of functional elements in the human genome,
including elements that act at the protein and RNA levels, and regulatory elements that
control cells and circumstances in which a gene is active. ENCODE investigators employ a
variety of assays and methods to identify functional elements. The discovery and annotation
of gene elements is accomplished primarily by sequencing a ...
Details →
Usage examples
See 4 usage examples →
geopackage, hydrography, hydrologic model, hydrology, simulations, zarr
GEOGLOWS is the Group on Earth Observations' Global Water Sustainability Program. It coordinates efforts from public
and private entities to make application-ready river data more accessible and sustainably available to underdeveloped
regions. The GEOGLOWS Hydrological Model provides a retrospective and daily forecast of global river discharge at 7
million river sub-basins. The stream network is a hydrologically conditioned subset of the TDX-Hydro streams and
basins data produced by the United States' National Geospatial-Intelligence Agency. The daily forecast provides
3-hourly average discharge in a 51-member ensemble with a 15-day lead time, derived from the ECMWF Integrated Forecast
System (IFS). The retrospective simulation is derived from ERA5 climate reanalysis data and provides daily average
streamflow beginning on 1 January 1940. New forecasts are uploaded daily and the retrospective simulation is updated
weekly on Sundays to keep the lag time between 5 and 12 days.
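As a rough illustration of the forecast shape described above, the following Python sketch works out how many 3-hourly steps fit in a 15-day lead time and summarizes a 51-member ensemble per time step. It runs on synthetic numbers only: the constant names, the `summarize` helper, and the random discharge values are hypothetical, and the sketch does not read the actual GEOGLOWS files or assume anything about their layout.

```python
import random
from statistics import mean
from typing import Dict, List

N_MEMBERS = 51          # ensemble size stated in the description
LEAD_DAYS = 15          # forecast lead time in days
STEP_HOURS = 3          # each value is a 3-hourly average
N_STEPS = LEAD_DAYS * 24 // STEP_HOURS  # steps per member if no step is missing


def summarize(ensemble: List[List[float]]) -> List[Dict[str, float]]:
    """Per-time-step min, mean, and max across members.

    `ensemble[m][t]` is the discharge of member m at time step t.
    """
    summary = []
    for t in range(len(ensemble[0])):
        values = [member[t] for member in ensemble]
        summary.append(
            {"min": min(values), "mean": mean(values), "max": max(values)}
        )
    return summary


# Synthetic discharge values (m^3/s) for one river; illustrative only.
rng = random.Random(0)
synthetic = [
    [100.0 + rng.uniform(-10.0, 10.0) for _ in range(N_STEPS)]
    for _ in range(N_MEMBERS)
]
stats = summarize(synthetic)
```

The spread between the per-step minimum and maximum is one simple way to read forecast uncertainty from an ensemble; percentile bands would be a natural refinement.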
The geoglows-v2 bucket contains: (1) model configuration files used to generate the simulations, (2) the GIS streams
datasets used by the model, (3) the GIS streams datasets optimized for visualizations used by Esri's Living Atlas
layer, (4) several supporting tables of metadata including country names, river names, and hydrological properties used for
modeling.
The geoglows-v2-forecasts bucket contains: (1) daily 15-day forecasts in zarr format optimized for time series queries of
all ensemble members in the prediction, (2) CSV formatted summary files optimized for producing time series animated
web maps for the entire global streams dataset.
The geoglows-v2-retrospective bucket contains: (1) the model retrospective outputs in (1a) zarr format optimized for
time series queries of up to a few hundred rivers on demand as well as (1b) in netCDF format best for bulk download
...
Details →
Usage examples
See 4 usage examples →
genetic, genomic, life sciences, reference index, vcf
Several reference genomes to enable translation of whole human genome sequencing to clinical practice. On 11/12/2020 these data were updated to reflect the most up-to-date GIAB release.
Details →
Usage examples
See 4 usage examples →
aerial imagery, agriculture, climate, cog, earth observation, geospatial, image processing, land cover, machine learning, satellite imagery
Global and regional Canopy Height Maps (CHM). Created using machine learning models on high-resolution worldwide Maxar satellite imagery.
Details →
Usage examples
-
Global Canopy Height on Earth Engine by Meta and WRI
-
Every tree counts: Large-scale mapping of canopy height at the resolution of individual trees by Jamie Tolan, Camille Couprie, and Tracy Johns
-
Sub-meter resolution canopy height maps using self-supervised learning and a vision transformer trained on Aerial and GEDI Lidar by Jamie Tolan, Hung-I Yang, Ben Nosarzewski, Guillaume Couairon, Huy Vo, John Brandt, Justine Spore, Sayantan Majumdar, Daniel Haziza, Janaki Vamaraju, Theo Moutakanni, Piotr Bojanowski, Tracy Johns, Brian White, Tobias Tiecke, Camille Couprie
-
Using Artificial Intelligence to Map the Earth’s Forests by Jamie Tolan, Camille Couprie, John Brandt, Justine Spore, Tobias Tiecke, Tracy Johns and Patrick Nease
See 4 usage examples →
agriculture, earth observation, meteorological, natural resource, weather
Historical and one-day delay data from the IDEAM radar network.
Details →
Usage examples
See 4 usage examples →
cog, elevation, planetary, stac
The Japan Aerospace EXploration Agency (JAXA) SELenological and ENgineering Explorer (SELENE) mission’s Kaguya spacecraft was launched on September 14, 2007 and science operations around the Moon started October 20, 2007. The primary mission in a circular polar orbit 100 km above the surface lasted from October 20, 2007 until October 31, 2008. An extended mission was then conducted in lower orbits (averaging 50 km above the surface) from November 1, 2008 until the SELENE mission ended with Kaguya impacting the Moon on June 10, 2009. These data are digital terrain models derived using the NASA A...
Details →
Usage examples
See 4 usage examples →
cog, planetary, satellite imagery, stac
The Japan Aerospace EXploration Agency (JAXA) SELenological and ENgineering Explorer (SELENE) mission’s Kaguya spacecraft was launched on September 14, 2007 and science operations around the Moon started October 20, 2007. The primary mission in a circular polar orbit 100 km above the surface lasted from October 20, 2007 until October 31, 2008. An extended mission was then conducted in lower orbits (averaging 50 km above the surface) from November 1, 2008 until the SELENE mission ended with Kaguya impacting the Moon on June 10, 2009. These data were collected in monoscopic observing mode. To cre...
Details →
Usage examples
See 4 usage examples →
air temperature, atmosphere, forecast, geoscience, geospatial, global, meteorological, model, near-surface air temperature, near-surface relative humidity, netcdf, weather
The flagship Numerical Weather Prediction (NWP) model developed and used at the Met Office is the Unified Model; the same model is used for both weather and climate prediction. For weather forecasting, the Met Office runs several configurations of the Unified Model as part of its operational Numerical Weather Prediction suite. The global ensemble (MOGREPS-G) produces forecasts for the whole globe up to a week ahead. The projection used is the Equirectangular Latitude-Longitude and the grid resolution is 20 km. The forecast is updated regularly with a 4-hour time delay and provided in NetCDF format. ...
Details →
Usage examples
See 4 usage examples →
cancer, genomic, life sciences, STRIDES, whole genome sequencing
The Molecular Profiling to Predict Response to Treatment (MP2PRT) program is part of the NCI's Cancer Moonshot Initiative. The aim of this program is the retrospective characterization and analysis of biospecimens collected from completed NCI-sponsored trials of the National Clinical Trials Network and the NCI Community Oncology Research Program. This study, titled "Identification of Genetic Changes Associated with Relapse and/or Adaptive Resistance in Patients Registered as Favorable Histology Wilms Tumor on AREN03B2", performs genomic characterization (WGS 30X, Total RNAseq, mi...
Details →
Usage examples
See 4 usage examples →
biology, fluorescence imaging, image processing, imaging, life sciences, microscopy, neurobiology, neuroimaging, neuroscience
This data set, made available by Janelia's MouseLight project, consists of
images and neuron annotations of the Mus musculus brain, stored in formats suitable
for viewing and annotation using the HortaCloud cloud-based annotation system.
Details →
Usage examples
-
MouseLight NeuronBrowser by Tiago A. Ferreira, Jayaram Chandrashekar
-
MouseLight Project Website by Tiago A. Ferreira, Jayaram Chandrashekar
-
Reconstruction of 1,000 Projection Neurons Reveals New Cell Types and Organization of Long-Range Connectivity in the Mouse Brain by Johan Winnubst, Erhan Bas, Tiago A. Ferreira, Zhuhao Wu, Michael N. Economo, Patrick Edson, Ben J. Arthur, Christopher Bruns, Konrad Rokicki, David Schauder, Donald J. Olbris, Sean D. Murphy, David G. Ackerman, Cameron Arshadi, Perry Baldwin, Regina Blake, Ahmad Elsayed, Mashtura Hasan, Daniel Ramirez, Bruno Dos Santos, Monet Weldon, Amina Zafar, Joshua T. Dudman, Charles R. Gerfen, Adam W. Hantman, Wyatt Korff, Scott M. Sternson, Nelson Spruston, Karel Svoboda, Jayaram Chandrashekar
-
HortaCloud by David Schauder, Donald J. Olbris, Jody Clements, Cristian Goina, Robert R. Svirskas, Konrad Rokicki
See 4 usage examples →
atmosphere, climate, climate model, geospatial, land, model, sustainability, zarr
The NA-CORDEX dataset contains regional climate change scenario data and guidance for North America, for use in impacts, decision-making, and climate science. The NA-CORDEX data archive contains output from regional climate models (RCMs) run over a domain covering most of North America using boundary conditions from global climate model (GCM) simulations in the CMIP5 archive. These simulations run from 1950–2100 with a spatial resolution of 0.22°/25km or 0.44°/50km. This AWS S3 version of the data includes selected variables converted to Zarr format from the original NetCDF. Only daily data a...
Details →
Usage examples
See 4 usage examples →
aerial imagery, agriculture, cog, earth observation, geospatial, natural resource, regulatory
The National Agriculture Imagery Program (NAIP) acquires aerial imagery during the agricultural growing seasons in the continental U.S. This "leaf-on" imagery typically ranges from 30 centimeters to 100 centimeters in resolution and is available from the naip-analytic Amazon S3 bucket as 4-band (RGB + NIR) imagery in MRF format, from the naip-source Amazon S3 bucket as 4-band (RGB + NIR) imagery in uncompressed Raw GeoTiff format, and from the naip-visualization bucket as 3-band (RGB) imagery in Cloud Optimized GeoTiff format. More details on NAIP
Details →
Usage examples
-
Individual Tree Detection in Large-Scale Urban Environments using High-Resolution Multispectral Imagery by Jonathan Ventura, Milo Honsberger, Cameron Gonsalves, Julian Rice, Camille Pawlak, Natalie L.R. Love, Skyler Han, Viet Nguyen, Keilana Sugano, Jacqueline Doremus, G. Andrew Fricker, Jenn Yost, Matt Ritter
-
Urban Tree Detection by Jonathan Ventura
-
EOS Land Viewer by Earth Observing System
-
VoyagerSearch showing off Batch + NAIP by Voyager
See 4 usage examples →
cog, planetary, satellite imagery, stac
Knowledge of a planetary surface’s topography is necessary to understand its geology and enable landed mission operations. The Solid State Imager (SSI) on board NASA’s Galileo spacecraft acquired more than 700 images of Jupiter’s moon Europa. Although moderate- and high-resolution coverage is extremely limited, repeat coverage of a small number of sites enables the creation of digital terrain models (DTMs) via stereophotogrammetry. Here we provide stereo-derived DTMs of five sites on Europa. The sites are the bright band Agenor Linea, the crater Cilix, the crater Pwyll, pits and chaos adjacent...
Details →
Usage examples
See 4 usage examples →
cog, planetary, satellite imagery, stac
These data are infrared image mosaics, tiled to the Mars quadrangle, generated using Thermal Emission Imaging System (THEMIS) images from the 2001 Mars Odyssey orbiter mission. The mosaic is generated at the full resolution of the THEMIS infrared dataset, which is approximately 100 meters/pixel. The mosaic was absolutely photogrammetrically controlled to an improved Viking MDIM network that was developed by the USGS Astrogeology processing group using the Integrated Software for Imagers and Spectrometers. Image-to-image alignment precision is subpixel (i.e., <100m). These 8-bit, qualitative d...
Details →
Usage examples
See 4 usage examples →
cog, planetary, satellite imagery, stac
The Solid State Imager (SSI) on NASA's Galileo spacecraft acquired more than 500 images of Jupiter's moon, Europa. These images vary from relatively low-resolution hemispherical imaging, to high-resolution targeted images that cover a small portion of the surface. Here we provide a set of 92 image mosaics generated from minimally processed, projected Galileo images with photogrammetrically improved locations on Europa's surface.
These images provide users with nearly the entire Galileo Europa imaging dataset at its native resolution and with improved relative image locations. The S
...
Details →
Usage examples
See 4 usage examples →
cog, planetary, satellite imagery, stac
The Solid State Imager (SSI) on NASA's Galileo spacecraft acquired more than 500 images of Jupiter's moon, Europa. These images vary from relatively low-resolution hemispherical imaging, to high-resolution targeted images that cover a small portion of the surface. Here we provide a set of 481 minimally processed, projected Galileo images with photogrammetrically improved locations on Europa's surface. These individual images were subsequently used as input into a set of 92 observation mosaics.
These images provide users with nearly the entire Galileo Europa imaging dataset at its native resolution and with improved relative image locations. The Solid State Imager on NASA's Galileo spacecraft provided the only moderate- to high-resolution images of Jupiter's moon, Europa. Unfortunately, uncertainty in the position and pointing of the spacecraft, as well as the position and orientation of Europa, when the images were acquired resulted in significant errors in image locations on the surface. The result of these errors is that images acquired during different Galileo orbits, or even at different times during the same orbit, are significantly misaligned (errors of up to 100 km on the surface).
The dataset provides a set of individual images that can be used for scientific analysis
...
Details →
Usage examples
See 4 usage examples →
cog, elevation, planetary, satellite imagery, stac
As of March 2023, the Mars Reconnaissance Orbiter (MRO) High Resolution Science Experiment (HiRISE) sensor has collected more than 5000 targeted stereopairs. During HiRISE acquisition, the Context Camera (CTX) also collects lower resolution, higher spatial extent context images. These CTX acquisitions are also targeted stereopairs. This data set contains targeted CTX DTMs and orthoimages, created using the NASA Ames Stereo Pipeline. These data have been created using relatively controlled CTX images that have been globally bundle adjusted using the USGS Integrated System for Imagers and Spectro...
Details →
Usage examples
See 4 usage examples →
cog, planetary, satellite imagery, stac
These data are digital terrain models (DTMs) created by multiple different institutions and released to the Planetary Data System (PDS) by the University of Arizona. The data are processed from the Planetary Data System (PDS) stored JP2 files, map projected, and converted to Cloud Optimized GeoTiffs (COGs) for efficient remote data access. These data are controlled to the Mars Orbiter Laser Altimeter (MOLA). Therefore, they are a proxy for the geodetic coordinate reference frame. These data are not guaranteed to co-register with uncontrolled products (e.g., the uncontrolled High Resolution ...
Details →
Usage examples
See 4 usage examples →
cog, planetary, satellite imagery, stac
These data are red and color Reduced Data Record (RDR) observations collected and originally processed by the High Resolution Imaging Science Experiment (HiRISE) team. The data are processed from the Planetary Data System (PDS) stored RDRs, map projected, and converted to Cloud Optimized GeoTiffs (COGs) for efficient remote data access. These data are not photogrammetrically controlled and use a priori NAIF SPICE pointing. Therefore, these data will not co-register with controlled data products. Data are released using simple cylindrical (planetocentric positive East, center longitude 0, -180...
Details →
Usage examples
See 4 usage examples →
agriculture, climate, meteorological, weather
UPDATE TO GHCN PREFIXES - The NODD team is working on improving performance and access to the GHCNd data and will be implementing an updated prefix structure. For more information on the prefix changes, please see the "READ ME on the NODD Github". If you have questions, comments, or feedback, please reach out to nodd@noaa.gov with GHCN in the subject line.
Global Historical Climatology Network - Daily is a dataset from NOAA that contains daily observations over global land areas. It contains station-based measurements...
Details →
Usage examples
See 4 usage examples →
agriculture, climate, disaster response, environmental, weather
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
The HRRR ZARR formatted data was originally generated by the University of Utah under a grant provided by NOAA. They are continuing to publish ZARR versions of HRRR data. For information about data in the s3://hrrrzarr/ please contact
Details →
Usage examples
See 4 usage examples →
earth observation, energy, geospatial, meteorological, solar
Released to the public as part of the Department of Energy's Open Energy Data Initiative,
the National Solar Radiation Database (NSRDB) is
a serially complete collection of hourly and half-hourly values of the three
most common measurements of solar radiation (global horizontal, direct
normal, and diffuse horizontal irradiance) and meteorological data. These
data have been collected at a sufficient number of locations and temporal and
spatial scales to accurately represent regional solar radiation climates.
Details →
Usage examples
-
HSDS Examples by Caleb Phillips, Caroline Draxl, John Readey, Jordan Perr-Sauer, Michael Rossol
-
Physics-guided machine learning for improved accuracy of the National Solar Radiation Database by Grant Buster, Mike Bannister, Aron Habte, Dylan Hettinger, Galen Maclaurin, Michael Rossol, Manajit Sengupta, Yu Xie
-
The National Solar Radiation Data Base (NSRDB) by Manajit Sengupta, Yu Xie, Anthony Lopez, Aron Habte, Galen Maclaurin, James Shelby
-
NSRDB Viewer by Manajit Sengupta, Yu Xie, Anthony Lopez, Aron Habte, Galen Maclaurin, James Shelby, Paul Edwards
See 4 usage examples →
graph, json, metadata, scholarly communication
An open, comprehensive index of scholarly papers, citations, authors, institutions, and journals.
Details →
Usage examples
See 4 usage examples →
biology, cell biology, cell imaging, computer vision, fluorescence imaging, imaging, life sciences, machine learning, microscopy
The OpenCell project is a proteome-scale effort to measure the localization and interactions of human proteins
using high-throughput genome engineering to endogenously tag thousands of proteins in the human proteome.
This dataset consists of the raw confocal fluorescence microscopy images for all tagged cell lines in the OpenCell library.
These images can be interpreted both individually, to determine the localization of particular proteins of interest,
and in aggregate, by training machine learning models to classify or quantify subcellular localization patterns.
Details →
Usage examples
- OpenCell: proteome-scale endogenous tagging enables the cartography of human cellular organization by Nathan H. Cho, Keith C. Cheveralls, Andreas-David Brunner, Kibeom Kim, André C. Michaelis, Preethi Raghavan, et al.
- OpenCell web portal by OpenCell team
- Self-Supervised Deep-Learning Encodes High-Resolution Features of Protein Subcellular Localization by Hirofumi Kobayashi, Keith C. Cheveralls, Manuel D. Leonetti, Loic A. Royer
- cytoself (an unsupervised ML model to quantify localization patterns) by Hirofumi Kobayashi, Keith C. Cheveralls, Manuel D. Leonetti, Loic A. Royer
See 4 usage examples →
agriculture, analysis ready data, ceos, disaster response, earth observation, geospatial, satellite imagery, stac, sustainability, synthetic aperture radar
The RADARSAT Constellation Mission (RCM) is Canada's third generation of Earth observation satellites. Launched on June 12, 2019, the three identical satellites work together to bring solutions to key challenges for Canadians. As part of ongoing Open Government efforts, NRCan produces CEOS analysis ready data (ARD) of Canada's landmass using 30 m Compact-Polarization standard coverage, every 12 days. RCM CEOS-ARD (POL) is the first ever polarimetric dataset approved by the CEOS committee. Previously, users had to order, download and process RCM images (level 1) on their own, often with expensive software. This new dataset aims to remove these burdens with a new STAC catalog for discovery and direct download links.
Details →
Usage examples
See 4 usage examples →
bioinformatics, biology, genetic, genomic, infrastructure, life sciences, single-cell transcriptomics, transcriptomics, whole genome sequencing
Pre-built refgenie reference genome data assets used for aligning and analyzing DNA sequence data.
Details →
Usage examples
See 4 usage examples →
agriculture, climate, earth observation, environmental, meteorological, model, sustainability, water, weather
SILO is a database of Australian climate data from 1889 to the present. It provides continuous, daily time-step data products in ready-to-use formats for research and operational applications.
SIL...
Details →
Usage examples
See 4 usage examples →
climate, earth observation, environmental, geospatial, global, oceans
Global daily-mean sea surface temperatures, presented on a 0.05° latitude-longitude grid, with gaps between available daily observations filled by statistical means, spanning late 1981 to recent times. Suitable for large-scale oceanographic, meteorological, and climatological applications, such as evaluating or constraining environmental models or case studies of marine heat wave events. Includes temperature uncertainty information and auxiliary information about land-sea fraction and sea-ice coverage. For reference and citation see: www.nature.com/articles/s41597-019-0236-x.
Details →
Usage examples
- Adjusting for desert-dust-related biases in a climate data record of sea surface temperature (2020). by Merchant, C.J. and Embury, O.
- Working with surftemp-sst data - Tutorial 1 - Getting started by Niall McCarroll
- Working with surftemp-sst data - Tutorial 2 - Analysing Marine Heatwaves by Niall McCarroll
- Satellite-based time-series of sea-surface temperature since 1981 for climate applications (2019). by Merchant, C.J., Embury, O., Bulgin, C.E., Block, T., Corlett, G.K., Fiedler, E., Good, S.A., Mittaz, J., Rayner, N.A., Berry, D., Eastwood, S., Taylor, M., Tsushima, Y., Waterfall, A., Wilson, R. and Donlon, C.
See 4 usage examples →
agriculture, cog, disaster response, earth observation, geospatial, satellite imagery, synthetic aperture radar
Sentinel-1 is a pair of European radar imaging (SAR) satellites launched in 2014 and 2016. Its 6-day revisit cycle and ability to observe through clouds make it well suited for sea and land monitoring, emergency response to environmental disasters, and economic applications. This dataset represents the global Sentinel-1 GRD archive, from the beginning to the present, converted to cloud-optimized GeoTIFF format.
Details →
Usage examples
See 4 usage examples →
agriculture, cog, earth observation, geospatial, machine learning, natural resource, satellite imagery
Sentinel-2 L2A 120m mosaic is a derived product that contains best pixel values for 10-daily periods, modelled by removing the cloudy pixels and then interpolating among the remaining values. Because some parts of the world have lengthy cloudy periods, clouds may remain in some areas. The actual modelling script is available here.
Details →
Usage examples
See 4 usage examples →
cog, earth observation, environmental, geospatial, land, oceans, satellite imagery, stac
This data set consists of observations from the Sentinel-3 satellite of the European Commission’s Copernicus Earth Observation Programme. Sentinel-3 is a polar orbiting satellite that completes 14 orbits of the Earth a day. It carries the Ocean and Land Colour Instrument (OLCI) for medium resolution marine and terrestrial optical measurements, the Sea and Land Surface Temperature Radiometer (SLSTR), the SAR Radar Altimeter (SRAL), the MicroWave Radiometer (MWR) and the Precise Orbit Determination (POD) instruments. The satellite was launched in 2016 and entered routine operational phase in 201...
Details →
Usage examples
See 4 usage examples →
air quality, atmosphere, cog, earth observation, environmental, geospatial, satellite imagery, stac
This data set consists of observations from the Sentinel-5 Precursor (Sentinel-5P) satellite of the European Commission’s Copernicus Earth Observation Programme. Sentinel-5P is a polar orbiting satellite that completes 14 orbits of the Earth a day. It carries the TROPOspheric Monitoring Instrument (TROPOMI), a spectrometer that senses ultraviolet (UV), visible (VIS), near-infrared (NIR) and short-wave infrared (SWIR) radiation to monitor ozone, methane, formaldehyde, aerosol, carbon monoxide, nitrogen dioxide and sulphur dioxide in the atmosphere. The satellite was launched in October 2017 and entered ro...
Details →
Usage examples
See 4 usage examples →
biodiversity, biology, ecosystems, image processing, multimedia, wildlife
The goal of SiPeCaM is to create a data source for evaluating changes in the state of biodiversity, considering key aspects of how the ecosystem behaves.
Details →
Usage examples
See 4 usage examples →
meteorological, satellite imagery, weather
Collection of spatially and temporally aligned GOES-16 ABI satellite imagery, NEXRAD radar mosaics, and GOES-16 GLM lightning detections.
Details →
Usage examples
See 4 usage examples →
bioinformatics, health, life sciences, natural language processing, us
The Synthea generated data is provided here as 1,000-person (1k), 100,000-person (100k), and 2,800,000-person (2.8m) data sets in the OMOP Common Data Model format. Synthea™ is a synthetic patient generator that models the medical history of synthetic patients. Our mission is to output high-quality synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions. It can be used without restriction for a variety of secondary uses in academia, research, industry, and gov...
Details →
Usage examples
See 4 usage examples →
genetic, genome wide association study, genomic, life sciences, population genetics
Linkage disequilibrium (LD) matrices of UK Biobank participants of British ancestry, based on imputed genotypes.
Details →
Usage examples
See 4 usage examples →
genetic, genome wide association study, genomic, life sciences, population genetics
A multi-ancestry analysis of 7,221 phenotypes using a generalized mixed model association testing framework, spanning 16,119 genome-wide association studies. We provide standard meta-analysis across all populations and with a leave-one-population-out approach for each trait. The data are provided in tsv format (per phenotype) and Hail MatrixTable (all phenotypes and variants). Metadata is provided in phenotype and variant manifests.
Details →
Usage examples
See 4 usage examples →
coastal, floods
The Virginia Coastal Resilience Master Plan builds on the 2020 Virginia Coastal Resilience Master Planning Framework, which outlined the goals and principles of the Commonwealth’s statewide coastal resilience strategy. Recognizing the urgent challenge flooding already poses, the Commonwealth developed Phase One of the Master Plan on an accelerated timeline and focused this first assessment on the impacts of tidal and storm surge coastal flooding on coastal Virginia. The Master Plan leveraged the combined efforts of more than two thousand stakeholders, subject matter experts, and government personnel. We centered the development of this plan around three core components:
- A Technical Study compiled essential data, research, processes, products, and resilience efforts in the Coastal Resilience Database, which forms much of the basis of this plan and the Coastal Resilience Web Explorer;
- A Technical Advisory Committee supported coordination across key stakeholders and ensured the incorporation of the best available subject matter knowledge, data, and methods into this plan; and
- Stakeholder Engagement captured diverse resilience perspectives from residents, local and regional officials, and other stakeholders across Virginia’s coastal communities to drive regionally specific resilience priorities.
Data products used and generated for the Virginia Coastal Resilience.
This dataset represents the data that was developed for the technical study. Appendix F - Data Product List provides a list of available data. Other Appendix documents provide the inpu
...
Details →
Usage examples
See 4 usage examples →
robotics
This project primarily aims to facilitate performance benchmarking in robotics research. The dataset provides mesh models, RGB, RGB-D and point cloud images of over 80 objects. The physical objects are also available via the YCB benchmarking project. The data are collected by two state-of-the-art systems: UC Berkeley's scanning rig and the Google scanner. The UC Berkeley scanning rig data provide meshes generated with Poisson reconstruction, meshes generated with volumetric range image integration, textured versions of both meshes, Kinbody files for using the meshes with OpenRAVE, 600 ...
Details →
Usage examples
- The Closure Signature: A Functional Approach to Model Underactuated Compliant Robotic Hands by Maria Pozzi, Gionata Salvietti, João Bimbo, Monica Malvezzi, Domenico Prattichizzo
- Pre-touch sensing for sequential manipulation by Boling Yang, Patrick Lancaster, Joshua R. Smith
- Benchmarking in Manipulation Research: Using the Yale-CMU-Berkeley Object and Model Set by Berk Calli, Aaron Walsman, Arjun Singh, Siddhartha Srinivasa, Pieter Abbeel, Aaron M Dollar
- Label Fusion: A Pipeline for Generating Ground Truth Labels for Real RGBD Data of Cluttered Scenes by Pat Marion, Peter R. Florence, Lucas Manuelli, Russ Tedrake
See 4 usage examples →
agriculture, analytics, biodiversity, conservation, deep learning, food security, geospatial, machine learning, satellite imagery
iSDAsoil is a resource containing soil property predictions for the entire African continent, generated using machine learning. Maps for over 20 different soil properties have been created at 2 different depths (0-20 and 20-50cm). Soil property predictions were made using machine learning coupled with remote sensing data and a training set of over 100,000 analyzed soil samples. Included in this dataset are images of predicted soil properties, model error and satellite covariates used in the mapping process.
Details →
Usage examples
- iSDAsoil Python tutorial by Matt Miller
- iSDAsoil homepage - view soil property maps online by iSDA
- iSDAsoil liming demo app on Observable by Jamie Collinson
- African soil properties and nutrients mapped at 30 m spatial resolution using two-scale ensemble machine learning by Tomislav Hengl, Matthew A. E. Miller, Josip Križan, Keith D. Shepherd, Andrew Sila, Milan Kilibarda, Ognjen Antonijević, Luka Glušica, Achim Dobermann, Stephan M. Haefele, Steve P. McGrath, Gifty E. Acquah, Jamie Collinson, Leandro Parente, Mohammadreza Sheykhmousa, Kazuki Saito, Jean-Martial Johnson, Jordan Chamberlin, Francis B. T. Silatsa, Martin Yemefack, John Wendt, Robert A. MacMillan, Ichsani Wheeler & Jonathan Crouch
See 4 usage examples →
disaster response, geospatial, mapping, osm
real-changesets is an augmented representation of OpenStreetMap changesets in JSON format. It contains the current and the previous version of each feature in a changeset.
It is primarily used by OSMCha, the main OpenStreetMap validation tool, to visualize a changeset and help the user understand what was changed on the map.
The real-changesets are created by combining the changeset metadata and the augmented diff generated by Overpass.
Details →
Usage examples
See 4 usage examples →
agriculture, lidar, localization, mapping, robotics
The AG-LOAM dataset has been released to facilitate the evaluation of LiDAR-based odometry algorithms in agricultural environments.
- It was collected by a wheeled mobile robot at the Agricultural Experimental Station of the University of California, Riverside, during Winter 2022 and Winter 2023.
- It provides LiDAR point cloud data captured using a Velodyne VLP-16 sensor, along with ground-truth trajectories obtained from an RTK-GPS system.
- It consists of 18 sequences collected over three phases, covering diverse planting environments, terrain conditions, path patterns, and robot motion profiles.
- It ...
Details →
Usage examples
See 3 usage examples →
cog, disaster response, geospatial, satellite imagery, stac
Synthetic Aperture Radar (SAR) data is a powerful tool for monitoring and assessing disaster events and can provide valuable insights for researchers, scientists, and emergency response teams.
The Alaska Satellite Facility (ASF) curates this collection of (primarily) SAR and SAR-derived satellite data products from a variety of data sources for disaster events.
Details →
Usage examples
See 3 usage examples →
biology, cancer, computer vision, gene expression, genetic, glioblastoma, Homo sapiens, image processing, imaging, life sciences, machine learning, neurobiology
This dataset consists of images of glioblastoma human brain tumor tissue sections that have been probed for expression of particular genes believed to play a role in development of the cancer. Each tissue section is adjacent to another section that was stained with a reagent useful for identifying histological features of the tumor. Each of these types of images has been completely annotated for tumor features by a machine learning process trained by expert medical doctors.
Details →
Usage examples
See 3 usage examples →
biology, gene expression, genetic, image processing, imaging, life sciences, Mus musculus, neurobiology, transcriptomics
The Allen Mouse Brain Atlas is a genome-scale collection of cellular resolution gene expression profiles using in situ hybridization (ISH). Highly methodical data production methods and comprehensive anatomical coverage via dense, uniformly spaced sampling facilitate data consistency and comparability across >20,000 genes. The use of an inbred mouse strain with minimal animal-to-animal variance allows one to treat the brain essentially as a complex but highly reproducible three-dimensional tissue array. The entire Allen Mouse Brain Atlas dataset and associated tools are available through an...
Details →
Usage examples
See 3 usage examples →
cancer, genetic, genomic, Homo sapiens, life sciences, STRIDES
Beat AML 1.0 is a collaborative research program involving 11 academic medical centers that worked
collectively to better understand drugs and drug combinations that should be prioritized for
further development within clinical and/or molecular subsets of acute myeloid leukemia (AML)
patients. Beat AML 1.0 provides the largest-to-date dataset on primary acute myeloid leukemia
samples, offering genomic, clinical, and drug response data. This dataset contains open Clinical Supplement and RNA-Seq Gene Expression Quantification data. This dataset also contains controlled Whole Exome Sequencing (WXS) and R...
Details →
Usage examples
See 3 usage examples →
climate, environmental, satellite imagery
A dataset of satellite retrievals of atmospheric methane that extends from 30 April 2018 to present.
Details →
Usage examples
See 3 usage examples →
bioinformatics, biology, cancer, genetic, genomic, Homo sapiens, life sciences, reference index
Broad-maintained human genome reference builds hg19/hg38 and decoy references.
Details →
Usage examples
See 3 usage examples →
cancer, computational pathology, computer vision, deep learning, histopathology, life sciences
This page describes the COBRA (Classification Of Basal cell carcinoma, Risky skin cancers and Abnormalities) skin pathology dataset, which comprises over 7000 histopathology whole-slide images related to the diagnosis of basal cell carcinoma skin cancer, the most commonly diagnosed cancer. The dataset includes biopsies and excisions and is divided into four groups. The first group contains about 2,500 BCC biopsies with subtype labels, while the second group includes 2,500 non-BCC biopsies with different types of skin dysplasia. The third group has 1,000 labelled risky cancer biopsies, includin...
Details →
Usage examples
See 3 usage examples →
coronavirus, COVID-19, life sciences
A harmonized collection of the core data pertaining to COVID-19 reported cases by geography, in a format prepared for analysis.
Details →
Usage examples
See 3 usage examples →
cell biology, computer vision, electron microscopy, imaging, life sciences, organelle
High resolution images of subcellular structures.
Details →
Usage examples
- Correlative three-dimensional super-resolution and block-face electron microscopy of whole vitreously frozen cells. by David P. Hoffman, Gleb Shtengel, C. Shan Xu, Kirby R. Campbell, Melanie Freeman, Lei Wang, Daniel E. Milkie, H. Amalia Pasolli, Nirmala Iyer, John A. Bogovic, Daniel R. Stabley, Abbas Shirinifard, Song Pang, David Peale, Kathy Schaefer, Wim Pomp, Chi-Lun Chang, Jennifer Lippincott-Schwartz, Tom Kirchhausen, David J. Solecki, Eric Betzig, Harald F. Hess
- Whole-cell organelle segmentation in volume electron microscopy by Lisa Heinrich, Davis Bennett, David Ackerman, Woohyun Park, Jon Bogovic, Nils Eckstein, et al.
- Enhanced FIB-SEM systems for large-volume 3D imaging by C. Shan Xu, Kenneth J. Hayworth, Zhiyuan Lu, Patricia Grob, Ahmed M. Hassan, José G. García-Cerdán, Krishna K. Niyogi, Eva Nogales, Richard J. Weinberg, Harald F. Hess.
See 3 usage examples →
agriculture, computer vision, IMU, lidar, localization, mapping, robotics
CitrusFarm is a multimodal agricultural robotics dataset that provides both multispectral images and navigational sensor data for localization, mapping and crop monitoring tasks.
- It was collected by a wheeled mobile robot in the Agricultural Experimental Station at the University of California, Riverside, in the summer of 2023.
- It offers a total of nine sensing modalities, including stereo RGB, depth, monochrome, near-infrared and thermal images, as well as wheel odometry, LiDAR, IMU and GPS-RTK data.
- It comprises seven sequences collected from three citrus tree fields, featuring various tree species at different growth stages, distinctive planting patterns, as well as varying daylight conditions.
- It spans a total operation time of 1.7 hours, covers a total distance of 7.5 km, and consti...
Details →
Usage examples
See 3 usage examples →
cancer, genomic, life sciences, STRIDES, transcriptomics, whole genome sequencing
The goal of the project is to identify recurrent genetic alterations (mutations, deletions,
amplifications, rearrangements) and/or gene expression signatures. The National Cancer Institute (NCI)
utilized whole genome sequencing and/or whole exome sequencing in conjunction with transcriptome
sequencing. The samples were processed and submitted for genomic characterization using pipelines
and procedures established within The Cancer Genome Atlas (TCGA) project.
Details →
Usage examples
- Genetics and Pathogenesis of Diffuse Large B Cell Lymphoma by Roland Schmitz, Ph.D., George W. Wright, Ph.D., Da Wei Huang, M.D., Calvin A. Johnson,
  Ph.D., James D. Phelan, Ph.D., James Q. Wang, Ph.D., Sandrine Roulland, Ph.D., Monica
  Kasbekar, Ph.D., Ryan M. Young, Ph.D., Arthur L. Shaffer, Ph.D., Daniel J. Hodson, M.D.,
  Ph.D., Wenming Xiao, Ph.D., et al.
- Genomic Data Commons by National Cancer Institute
- A multiprotein supercomplex controlling oncogenic signalling in lymphoma by Phelan JD, Young RM, Webster DE, Roulland S, Wright GW, Kasbekar M, Shaffer AL 3rd,
  Ceribelli M, Wang JQ, Schmitz R, Nakagawa M, Bachy E, Huang DW, Ji Y, Chen L, Yang Y, Zhao
  H, Yu X, Xu W, Palisoc MM, Valadez RR, Davies-Hill T, Wilson WH, Chan WC, Jaffe ES, Gascoyne
  RD, Campo E, Rosenwald A, Ott G, Delabie J, Rimsza LM, Rodriguez FJ, Estephan F, Holdhoff M,
  Kruhlak MJ, Hewitt SM, Thomas CJ, Pittaluga S, Oellerich T, Staudt LM
See 3 usage examples →
energy, geothermal
Data released from projects funded by the Department of Energy's Geothermal
Technologies Office (DOE GTO) that are too large or complex to be conveniently
accessed by traditional means. The GDR data lake aims to improve and automate
access of high-value geothermal data sets, making data actionable and discoverable
by researchers and industry to accelerate analysis and advance innovation.
This data lake is a sister data lake to the Department of Energy’s Open Energy Data
Initiative (OEDI) Data Lake.
Details →
Usage examples
See 3 usage examples →
biology, cell imaging, electrophysiology, infrastructure, life sciences, neuroimaging, neurophysiology, neuroscience
DANDI is a public archive of neurophysiology datasets, including raw and processed data, and associated software containers. Datasets are shared under Creative Commons CC0 or CC-BY licenses. The data archive provides a broad range of cellular neurophysiology data. This includes electrode and optical recordings, and associated imaging data using a set of community standards: NWB:N - NWB:Neurophysiology, BIDS - Brain Imaging Data Structure, and Details →
Usage examples
See 3 usage examples →
atmosphere, climate, climate model, geospatial, model, zarr
The EURO-CORDEX dataset contains regional climate model data for Europe, for use in impacts, decision-making, and climate science.
Currently, the bucket contains monthly datasets of 2m air temperature downscaled from CMIP5 global model datasets using different
regional climate models.
Details →
Usage examples
See 3 usage examples →
cancer, epigenomics, genomic, life sciences, STRIDES, transcriptomics, whole exome sequencing, whole genome sequencing
The Exceptional Responders Initiative is a pilot study to investigate the underlying molecular factors driving exceptional treatment responses of cancer patients to drug therapies. Study researchers will examine molecular profiles of tumors from patients either enrolled in a clinical trial for an investigational drug(s) and who achieved an exceptional response relative to other trial participants, or who achieved an exceptional response to a non-investigational chemotherapy. An exceptional response is defined as achievement of either a complete response or a partial response for at least 6 mon...
Details →
Usage examples
See 3 usage examples →
agriculture, earth observation, meteorological, weather
The up-to-date weather radar from the FMI radar network is available as Open Data. The data contain both single radar data and composites over Finland in GeoTIFF and HDF5 formats. Available composite parameters consist of radar reflectivity (DBZ), rainfall intensity (RR), and precipitation accumulation of 1, 12, and 24 hours. Single radar parameters consist of radar reflectivity (DBZ), radial velocity (VRAD), rain classification (HCLASS), and cloud top height (ETOP 20). Raw volume data from single radars are also provided in HDF5 format with ODIM 2.3 conventions. Radar data becomes avail...
Details →
Usage examples
See 3 usage examples →
cancer, genomic, life sciences
The Foundation Medicine Adult Cancer Clinical Dataset (FM-AD) is a study conducted by Foundation
Medicine Inc (FMI). Genomic profiling data for approximately 18,000 adult patients with a diverse
array of cancers was generated using FoundationOne, FMI's commercially available, comprehensive
genomic profiling assay. This dataset contains open Clinical and Biospecimen data.
Details →
Usage examples
- Genomic Data Commons by National Cancer Institute
- Targeted next-generation sequencing of advanced prostate cancer identifies potential
  therapeutic targets and disease heterogeneity.
  by Beltran H, Yelensky R, Frampton GM, Park K, Downing SR, MacDonald TY, Jarosz M, Lipson D,
  Tagawa ST, Nanus DM, Stephens PJ, Mosquera JM, Cronin MT, Rubin MA
- High-Throughput Genomic Profiling of Adult Solid Tumors Reveals Novel Insights into Cancer
  Pathogenesis
  by Ryan J. Hartmaier, Lee A. Albacker, Juliann Chmielecki, Mark Bailey, Jie He, Michael E.
  Goldberg, Shakti Ramkissoon, James Suh, Julia A. Elvin, Samuel Chiacchia, Garrett M.
  Frampton, Jeffrey S. Ross, Vincent Miller, Philip J. Stephens and Doron Lipson
See 3 usage examples →
genome, genotyping, golden retriever lifetime study, life sciences, morris animal foundation
Morris Animal Foundation’s Golden Retriever Lifetime Study is a longitudinal, prospective study following 3044 golden retrievers. The Study’s purpose is to identify the nutritional, environmental, lifestyle and genetic risk factors for cancer and other diseases. The Golden Oldie’s study enrolled an additional cohort of golden retrievers that had reached the age of 12 years or older and had not yet been diagnosed with a malignant cancer. This population can be used as a control group for conditions with high mortality in younger age. This dataset contains the data for ~1.1 million genetic marke...
Details →
Usage examples
- GRLS GWAS Tutorial by Tamer Mansour
- The Golden Retriever Lifetime Study: establishing an observational cohort study with translational relevance for human health by Michael K. Guy, Rodney L. Page, Wayne A. Jensen, Patricia N. Olson, J. David Haworth, Erin E. Searfoss, and Diane E. Brown
- Cohort profile: The Golden Retriever Lifetime Study (GRLS) by Julia Labadie, Brenna Swafford, Mara DePena, Kathy Tietje, Rodney Page, Janet Patterson-Kane
See 3 usage examples →
bioinformatics, deep learning, life sciences, machine learning, medicine, neurophysiology, neuroscience
The International Cardiac Arrest REsearch consortium (I-CARE) Database includes baseline clinical information and continuous electroencephalography (EEG) recordings from 1,020 comatose patients with a diagnosis of cardiac arrest who were admitted to an intensive care unit from seven academic hospitals in the U.S. and Europe. Patients were monitored with 18 bipolar EEG channels over hours to days for the diagnosis of seizures and for neurological prognostication. Long-term neurological function was determined using the Cerebral Performance Category scale.
Details →
Usage examples
- I-CARE: International Cardiac Arrest REsearch consortium Electroencephalography Database by Amorim E, Zheng WL, Ghassemi MM, Aghaeeaval M, Kandhare P, Karukonda V, et al.
- The International Cardiac Arrest Research (I-CARE) Consortium Electroencephalography Database by Amorim E, Zheng WL, Ghassemi MM, Aghaeeaval M, Kandhare P, Karukonda V, et al.
- WFDB Software Package by Moody, G., Pollard, T., & Moody, B.
See 3 usage examples →
aerial imagery, agriculture, cog, earth observation, geospatial, imaging, mapping, natural resource, sustainability
The State of Indiana Geographic Information Office and IOT Office of Technology manage a series of digital orthophotography dating back to 2005. Every year's worth of imagery is available as Cloud Optimized GeoTIFF (COG) files, original GeoTIFF, and other compressed deliverables such as ECW and MrSID. Additionally, each imagery year is organized into a tile grid scheme covering the entire geography of Indiana. All years of imagery are tiled from a 5,000 ft grid or sub tiles depending upon the resolution of the imagery. The naming of the tiles reflects the lower left coordinate from the...
Details →
Usage examples
See 3 usage examples →
agriculture, earth observation, geospatial, imaging, lidar, mapping, natural resource, sustainability
The State of Indiana Geographic Information Office and IOT Office of Technology manage a series of digital LiDAR LAS files stored in AWS, dating back to the 2011-2013 collection and including the NRCS-funded 2016-2020 collection. These LiDAR datasets are available as uncompressed LAS files, for cloud storage and access. Each year's data is organized into a tile grid scheme covering the entire geography of Indiana, ensuring easy access and efficient processing. The tiles' naming reflects each tile's lower left coordinate, facilitating accurate data management and retrieval. The AWS ...
Details →
Usage examples
See 3 usage examples →
benchmark, bioinformatics, life sciences, metagenomics, microbiome
Database for use with Kraken2 (taxonomic annotation of metagenomic sequencing reads), including all NCBI RefSeq genomes available in release V205.
Details →
Usage examples
See 3 usage examples →
bioinformatics, health, life sciences, natural language processing, us
MIMIC-III (‘Medical Information Mart for Intensive Care’) is a large,
single-center database comprising information relating to patients
admitted to critical care units at a large tertiary care hospital.
Data includes vital signs, medications, laboratory measurements,
observations and notes charted by care providers, fluid balance,
procedure codes, diagnostic codes, imaging reports, hospital length
of stay, survival data, and more. The database supports applications
including academic and industrial research, quality improvement initiatives,
and higher education coursework. The MIMIC-I...
Details →
Usage examples
See 3 usage examples →
computed tomography, health, imaging, life sciences, magnetic resonance imaging, medicine, nifti, segmentation
With recent advances in machine learning, semantic segmentation algorithms are becoming increasingly general purpose and translatable to unseen tasks. Many key algorithmic advances in the field of medical imaging are commonly validated on a small number of tasks, limiting our understanding of the generalisability of the proposed contributions. A model which works out-of-the-box on many tasks, in the spirit of AutoML, would have a tremendous impact on healthcare. The field of medical imaging is also missing a fully open source and comprehensive benchmark for general purpose algorithmic validati...
Details →
Usage examples
See 3 usage examples →
air temperatureatmosphereforecastgeosciencegeospatialmodelnear-surface air temperaturenear-surface relative humiditynetcdfweather
The flagship Numerical Weather Prediction model developed and used at the Met Office is the Unified Model; the same model is used for both weather and climate prediction. For weather forecasting, the Met Office runs several configurations of the Unified Model as part of its operational Numerical Weather Prediction suite. This dataset covers 2 years' worth of historical data, updated regularly with a time delay. The Global deterministic model is a global configuration of the Met Office Unified Model providing the most accurate short range deterministic forecast by any national meteorological servic...
Details →
Usage examples
See 3 usage examples →
forecastgeosciencegeospatialglobalmarinemodelnetcdfocean sea surface heightoceansweather
This is the Global Ocean component of the Met Office Global Coupled Atmosphere-Land-Ocean-Ice system, which has been running operationally since May 2022. The system provides a global physical analysis and coupled forecast products providing 3D daily mean fields of temperature and salinity, zonal and meridional velocities; 2D daily mean fields of sea surface height, bottom temperature, mixed layer depth, sea ice fraction, sea ice thickness and sea ice zonal and meridional velocities; and instantaneous hourly fields for sea surface height, sea surface temperature and surface currents. The Met Office Glo...
Details →
Usage examples
See 3 usage examples →
forecastgeosciencegeospatialglobalmarinemodelnetcdfocean sea surface heightoceansweather
The Met Office runs global wave forecast models to support marine safety and operational decision making. Met Office configurations are developed to be run using the community wave model WAVEWATCH III™. The global wave configuration is designed to generate accurate forecasts for open waters of the world’s oceans and larger seas. The Met Office wave models are forced using wind data from the Met Office Global Atmospheric Hi-Res Model. The global wave model is run to provide a five-day outlook for wave characteristics defining height, period and direction of waves within a given sea-state. The ...
Details →
Usage examples
See 3 usage examples →
air temperatureatmosphereforecastgeosciencegeospatialmodelnear-surface air temperaturenear-surface relative humiditynetcdfweather
The flagship Numerical Weather Prediction model developed and used at the Met Office is the Unified Model; the same model is used for both weather and climate prediction. For weather forecasting, the Met Office runs several configurations of the Unified Model as part of its operational Numerical Weather Prediction suite. This dataset covers 2 years' worth of historical data, updated regularly with a time delay. The UK deterministic model is a post-processed regional downscaled configuration of the Unified Model, covering the UK and Ireland, with a resolution of approximately 0.018 degrees. The Unit...
Details →
Usage examples
See 3 usage examples →
computer visionurbanusvideo
The Multiview Extended Video with Activities (MEVA) dataset consists of
video data of human activity, both scripted and unscripted,
collected with roughly 100 actors over several weeks. The data was
collected with 29 cameras with overlapping and non-overlapping
fields of view. The current release consists of about 328 hours
(516GB, 4259 clips) of video data, as well as 4.6 hours (26GB) of
UAV data. Other data includes GPS tracks of actors, camera models,
and a site map. We have also released annotations for roughly 184 hours of
data. Further updates are planned.
Details →
Usage examples
See 3 usage examples →
air temperatureclimateclimate modelclimate projectionsCMIP6cogearth observationenvironmentalglobalmodelNASA Center for Climate Simulation (NCCS)near-surface relative humiditynear-surface specific humiditynetcdfprecipitation
The NEX-GDDP-CMIP6 dataset is comprised of global downscaled climate
scenarios derived from the General Circulation Model (GCM) runs
conducted under the Coupled Model Intercomparison Project Phase 6
(CMIP6) and across two of the four "Tier 1" greenhouse gas emissions
scenarios known as Shared Socioeconomic Pathways (SSPs). The CMIP6
GCM runs were developed in support of the Sixth Assessment Report
of the Intergovernmental Panel on Climate Change (IPCC AR6). This
dataset includes downscaled projections from ScenarioMIP model
runs for which daily scenarios were produced and distributed...
Details →
Usage examples
-
NEX-GDDP-CMIP6 Dashboard by NASA
-
NASA Global Daily Downscaled Projections, CMIP6 by Thrasher, B., Wang, W., Michaelis, A., Melton, F., Lee, T. and Nemani, R.
-
NASA and ASDI announce no-cost access to important climate dataset on the AWS Cloud by Dr. Manil Maskey and Ana Pinheiro Privette
See 3 usage examples →
archivesastronomydatacenterimagingsatellite imageryx-ray
NASA data for high energy astrophysics (generally x-ray and gamma-ray domains) is made available here by the High Energy Astrophysics Science Archive Research Center. The HEASARC hosts the full data archives of over 30 different missions spanning 50 years. The data archive for each mission will contain a range of data types from spacecraft housekeeping and raw photon event list data up to high level science-ready products such as images, light curves (time series), and energy spectra.
This is a relatively modest total data volume but contains significant complexity and heterogeneity among the different missions. Data provided here are stored in the Flexible Image Transport System (FITS) format common in astronomy. Higher level products are further defined to be consistent between missions following data model standards agreed by the community and maintained by the HEASARC. Analysis of these data may require software also provided by HEASARC, the HEASoft package, consisting of tools generic to all FITS data, generic to all HEASARC-compliant data, and/or specific to individual missions as appropriate. Some missions provide standard science-ready data products, while others provide low-level data types and software to generate science-ready products from them. See the links for each mission for more information on how to use the data.
The HEASARC Website also has archive browsing tools where you can query for observations corresponding to temporal and spatial constraints among others. These tools will ultimately point to files located on the archive by giving a URL beginning with the path https://heasarc.gsfc.nasa.gov/FTP/. The data that are provided in the ODR follow the same structure, so when our tools give an https access URL, a user can simply swap in s3://nasa-heasarc/ for the first part of that URL and get a cloud URI. Note also that some selections have been made to what has been copied to the ODR, while the HEASARC archive itself remains the definitive and legacy source for the complete datasets.
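The prefix swap described above can be sketched in a few lines of Python. This is a minimal illustration, not an official HEASARC tool: the function name `https_to_s3` and the example file path are hypothetical, and only the two prefixes come from the description. Because some selections were made in what was copied to the ODR, a converted URI is not guaranteed to exist in the bucket.

```python
# Prefixes taken from the dataset description above.
HTTPS_PREFIX = "https://heasarc.gsfc.nasa.gov/FTP/"
S3_PREFIX = "s3://nasa-heasarc/"


def https_to_s3(url: str) -> str:
    """Convert a HEASARC archive HTTPS URL to the matching cloud URI.

    Hypothetical helper: it only swaps the leading prefix and does not
    check whether the object was actually copied to the bucket.
    """
    if not url.startswith(HTTPS_PREFIX):
        raise ValueError(f"not a HEASARC archive URL: {url!r}")
    return S3_PREFIX + url[len(HTTPS_PREFIX):]


# Hypothetical path, used only to show the shape of the result.
example_uri = https_to_s3(HTTPS_PREFIX + "some_mission/data/example.fits")
print(example_uri)
```

The path below the prefix is left untouched, which relies on the statement that the ODR copy follows the same directory structure as the archive.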
The HEASARC also...
Details →
Usage examples
See 3 usage examples →
archivesastronomydatacenterimagingsatellite imagery
NASA data for cosmic microwave background (CMB) analysis is made available here by the Legacy Archive for Microwave Background Data Analysis (LAMBDA), which is a part of NASA's High Energy Astrophysics Science Archive Research Center (HEASARC). LAMBDA hosts the data archives of over 30 different CMB missions spanning 30+ years. The data archive for each mission may contain a range of data types from low-level time-ordered data to high level science-ready products such as sky maps and angular power spectra. Also provided in consistent formats are a variety of full sky maps in complementary ...
Details →
Usage examples
See 3 usage examples →
astronomymachine learningNASA SMD AI
The SOHO/LASCO data set (prepared for the challenge hosted in Topcoder) provided here comes from the instrument’s C2 telescope and comprises approximately 36,000 images spread across 2,950 comet observations. The human eye is a very sensitive tool and it is the only tool currently used to reliably detect new comets in SOHO data - particularly comets that are very faint and embedded in the instrument background noise. Bright comets can be easily detected in the LASCO data by relatively simple automated algorithms, but the majority of comets observed by the instrument are extremely faint, noise-...
Details →
Usage examples
See 3 usage examples →
bioinformaticsbiologyGeneLabgenomicimaginglife sciencesspace biology
NASA’s Space Biology Open Science Data Repository (OSDR) introduces a one-stop site where users can explore and contribute a variety of NASA open science biological data. This site consolidates data from the Ames Life Sciences Data Archive (ALSDA) and GeneLab and includes information about the broader NASA Open Science and Open Data initiatives, all at one centralized location. Our mission is to maximize the utilization of the valuable biological research resources and enable new discoveries.
OSDR introduces access to data generated from spaceflight and space relevant experiments that explore
...
Details →
Usage examples
-
Advancing the Integration of Biosciences Data Sharing to Further Enable Space Exploration by Ryan T. Scott, Kirill Grigorev, Graham Mackintosh, Samrawit G. Gebre, Christopher E. Mason, Martha E. Del Alto, Sylvain V. Costes
-
GeneLab: Omics database for spaceflight experiments by Shayoni Ray, Samrawit Gebre, Homer Fogle, Daniel C Berrios, Peter B Tran, Jonathan M Galazka, Sylvain V Costes
-
NASA GeneLab: interfaces for the exploration of space omics data by Daniel C Berrios, Jonathan Galazka, Kirill Grigorev, Samrawit Gebre, Sylvain V Costes
See 3 usage examples →
analyticsanomaly detectionarchivescomputed tomographydatacenterdigital assetselectricityenergyfluid dynamicsimage processingphysicspost-processingradiationsignal processingsource codeturbulencevideox-rayx-ray tomography
The Large Helical Device (LHD), owned and operated by the National Institute for Fusion Science (NIFS), is one of the world's largest plasma confinement devices and employs a heliotron magnetic configuration generated by superconducting coils. The objectives are to conduct academic research on the confinement of steady-state, high-temperature, high-density plasmas, core plasma physics, and fusion reactor engineering, which are necessary to develop future fusion reactors. All the archived data of the LHD plasma diagnostics are available since the beginning of the LHD experiment, starte...
Details →
Usage examples
See 3 usage examples →
climateenvironmentalmeteorologicaloceanssustainabilityweather
This dataset includes hourly sea surface temperature and current data collected by satellite-tracked surface drifting buoys ("drifters") of the NOAA Global Drifter Program. The Drifter Data Assembly Center (DAC) at NOAA’s Atlantic Oceanographic and Meteorological Laboratory (AOML) has applied quality control procedures and processing to edit these observational data and obtain estimates at regular hourly intervals. The data include positions (latitude and longitude), sea surface temperatures (total, diurnal, and non-diurnal components) and velocities (eastward, northward) with accompanying uncertainty estimates. Metadata include identification numbers, experiment number, start location and time, end location and time, drogue loss date, death code, manufacturer, and drifter type.
Please note that data from the Global Drifter Program are also available at 6-hourly intervals but derived via alternative methods. The 6-hourly dataset goes back further in time (1979) and may be more appropriate for studies of long-term, low frequency patterns of the oceanic circulation. However, the 6-hourly dataset does not fully resolve high-frequency processes such as tides and inertial oscillations, or the diurnal variability of sea surface temperature.
[CITING NOAA - hourly position, current, and sea surface temperature from drifters data. Citation for this dataset should include the following information below.]
Elipot, Shane; Sykulski, Adam; Lumpkin, Rick; ...
Details →
Usage examples
-
Working with GDP hourly data using python and xarray, a CloudDrift notebook by Shane Elipot
-
A global surface drifter dataset at hourly resolution (2016) by Elipot, S., R. Lumpkin, R. C. Perez, J. M. Lilly, J. J. Early, and A. M. Sykulski
-
A Dataset of Hourly Sea Surface Temperature From Drifting Buoys (2022) by Elipot, S., A. Sykulski, R. Lumpkin, L. Centurioni, and M. Pazos
See 3 usage examples →
aerial imageryclimatecogdisaster responseweather
In order to support NOAA's homeland security and emergency response requirements, the National Geodetic Survey Remote Sensing Division (NGS/RSD) has the capability to acquire and rapidly disseminate a variety of spatially-referenced datasets to federal, state, and local government agencies, as well as the general public. Remote sensing technologies used for these projects have included lidar, high-resolution digital cameras, a film-based RC-30 aerial camera system, and hyperspectral imagers. Examples of rapid response initiatives include acquiring high resolution images with the Emerge/App...
Details →
Usage examples
See 3 usage examples →
agricultureclimatemeteorologicalweather
NOAA has generated a multi-decadal reanalysis and reforecast data set to accompany the next-generation version of its ensemble prediction system, the Global Ensemble Forecast System, version 12 (GEFSv12). Accompanying the real-time forecasts are “reforecasts” of the weather, that is, retrospective forecasts spanning the period 2000-2019. These reforecasts are not as numerous as the real-time data; they were generated only once per day, from 00 UTC initial conditions, and only 5 members were provided, with the following exception. Once weekly, an 11-member reforecast was generated, and these ex...
Details →
Usage examples
See 3 usage examples →
agricultureclimatedisaster responseenvironmentalmeteorologicalweather
NOTE - Upgrade NCEP Global Forecast System to v16.3.0 - Effective November 29, 2022.
The Global Forecast System (GFS) is a weather forecast model produced
by the National Centers for Environmental Prediction (NCEP). Dozens of
atmospheric and land-soil variables are available through this dataset,
from temperatures, winds, and precipitation to soil moisture and
atmospheric ozone concentration. The entire globe is covered by the GFS
at a base horizontal resolution of 18 miles (28 kilometers) between grid
points, which is used by the operational forecasters who predict weather
out to 16 days in the future. Horizontal resolution drops to 44 miles
(70 kilometers) between grid points for forecasts between one week and two
weeks.
The NOAA Global Forecast System (GFS) Warm Start Initial Conditions are
produced by the National Centers for Environmental Prediction (NCEP)
to run operational deterministic medium-range numerical weather predictions.
The GFS is built with the GFDL Finite-Volume Cubed-Sphere Dynamical Core (FV3)
and the Grid-Point Statistical Interpolation (GSI) data assimilation system.
Please visit the links below in the Documentation section to find more details
about the model and the data...
Details →
Usage examples
See 3 usage examples →
computer forensicscomputer securitycyber securitydigital forensicsmalwaremixed file datasetransomware
NapierOne is a modern cybersecurity mixed file data set, primarily aimed at, but not limited to, ransomware detection and forensic analysis. The dataset contains over 500,000 distinct files, representing 44 distinct popular file types. It was designed to address the known deficiency in research reproducibility and improve consistency by facilitating research replication and repeatability. The data set was inspired by the Govdocs1 data set and it is intended that ‘NapierOne’ be used as a complement to this original data set. An investigation was performed with the goal of determining the common...
Details →
Usage examples
See 3 usage examples →
cancerdigital pathologyfluorescence imagingimage processingimaginglife sciencesmachine learningmicroscopyradiology
Imaging Data Commons (IDC) is a repository within the
Cancer Research Data Commons (CRDC) that manages imaging data
and enables its integration with the other components of CRDC. IDC hosts a growing number of imaging collections that are contributed
by either funded US National Cancer Institute (NCI) data collection
activities, or by individual researchers. Image data hosted by IDC is stored in DICOM format.
Details →
Usage examples
See 3 usage examples →
climate projectionsCMIP5CMIP6earth observationenergygeospatialmeteorologicalsolar
The National Climate Database (NCDB) seeks to be the definitive source of climate
data for energy applications. The goal of the NCDB is to provide unbiased high
temporal and spatial resolution climate data needed for renewable energy modeling.
The NCDB seeks to maintain the inherent relationship between the various parameters
that are needed to model solar, wind, hydrology and load and provide data for multiple
important climate scenarios.
Details →
Usage examples
See 3 usage examples →
agriculturebiodiversitybiologyclimatedigital preservationecosystemsenvironmental
The National Herbarium of New South Wales is one of the most significant scientific, cultural and historical botanical resources in the Southern Hemisphere. The 1.43 million preserved plant specimens have been captured as high-resolution images and the biodiversity metadata associated with each of the images captured in digital form. Botanical specimens date from year 1770 to today, and form voucher collections that document the distribution and diversity of the world's flora through time, particularly that of NSW, Australia and the Pacific. The data is used in biodiversity assessment, syste...
Details →
Usage examples
See 3 usage examples →
citieseventsgeospatial
Open City Model is an initiative to provide CityGML data for all the buildings in the United States.
By using other open datasets in conjunction with our own code and algorithms it is our goal to provide 3D geometries for every US building.
Details →
Usage examples
See 3 usage examples →
archivesastronomyatmospheregloballife sciencesopen source softwaresignal processing
This platform is maintained by CRAAM (Mackenzie Radio Astronomy and Astrophysics Center), a research center operated by UPM (Mackenzie Presbyterian University) and INPE (National Institute for Space Research), to provide public and free access for researchers, students, and the interested public to VLF (Very Low Frequency) data from CRAAM's antenna systems. Amazon AWS supports all data stored through the AWS Open Data Program.
Very Low Frequency (VLF) signals can be used for navigation services, communication with submarines, and are a powerful tool to study the low-altitude Earth's io...
Details →
Usage examples
-
Open VLF platform by CRAAM
-
All source code, program utilities, and file layouts. by Kauffmann, DHV; Santiago, LS; Oliveira, R de.
-
Open VLF: Scientific Open Data Initiative for CRAAM's SAVNET and AWESOME VLF Data (2023). by Kauffmann, DHV; Santiago, LS; Raulin, JP; Correia, E; Oliveira, R de.
See 3 usage examples →
alphafoldlife sciencesmsaopen source softwareopenfoldproteinprotein foldingprotein template
Multiple sequence alignments (MSAs) for 140,000 unique Protein Data Bank (PDB) chains and 16,000,000 UniClust30 clusters. Template hits are also provided for the PDB chains and 270,000 UniClust30 clusters chosen for maximal diversity and MSA depth. MSAs were generated with HHBlits (-n3) and JackHMMER against MGnify, BFD, UniRef90, and UniClust30 while templates were identified from PDB70 with HHSearch, all according to procedures outlined in the supplement to the AlphaFold 2 Nature paper, Jumper et al. 2021. We expect the database to be broadly useful to structural biologists training or valid...
Details →
Usage examples
-
Run inference at scale for OpenFold, a PyTorch-based protein folding ML model, using Amazon EKS by Shubha Kumbadakone, Ankur Srivastava, and Sachin Kadyan
-
OpenProteinSet: Training data for structural biology at scale by Ahdritz, Gustaf; Bouatta, Nazim; Kadyan, Sachin; Jarosch, Lukas; Berenberg, Daniel; Fisk, Ian, et al
-
OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization by Ahdritz, Gustaf; Bouatta, Nazim; Kadyan, Sachin; Xia, Qinghui; Gerecke, William; O'Donnell, Timothy J, et al
See 3 usage examples →
artdeep learningimage processinglabeledmachine learningmedia
PD12M is a collection of 12.4 million CC0/PD image-caption pairs for the purpose of training generative image models.
Details →
Usage examples
See 6 usage examples →
autonomous vehiclescomputer visionlidarmarine navigationrobotics
This dataset presents a multi-modal maritime dataset acquired in restricted waters in Pohang, South Korea. The sensor suite is composed of three LiDARs (one 64-channel LiDAR and two 32-channel LiDARs), a marine radar, two visual cameras used as a stereo camera, an infrared camera, an omnidirectional camera with 6 directions, an AHRS, and a GPS with RTK. The dataset includes the sensor calibration parameters and SLAM-based baseline trajectory. It was acquired while navigating a 7.5 km route that includes a narrow canal area, inner and outer port areas, and a near-coastal area. The aim of this d...
Details →
Usage examples
See 3 usage examples →
bioinformaticsbiologyecosystemsenvironmentalgeneticgenomichealthlife sciencesmetagenomicsmicrobiome
QIIME 2 (pronounced “chime two”) is a microbiome multi-omics bioinformatics and data science platform that is trusted, free, open source, extensible, and community developed and supported.
Details →
Usage examples
See 3 usage examples →
bioinformaticsdeep learninglife sciencesmachine learningmedicineneurophysiologyneuroscience
The IIIC dataset includes 50,697 labeled EEG samples from 2,711 patients and 6,095 EEGs that were annotated by physician experts from 18 institutions. These samples were used to train SPaRCNet (Seizures, Periodic and Rhythmic Continuum patterns Deep Neural Network), a computer program that classifies IIIC events with an accuracy matching clinical experts.
Details →
Usage examples
-
Development of Expert-Level Classification of Seizures and Rhythmic and Periodic Patterns During EEG Interpretation by Jing J, Ge W, Hong S, Fernandes MB, Lin Z, Yang C, et al.
-
IIIC-SPaRCNet Github Repository by Brain Data Science Platform (BDSP)
-
SPaRCNet data: Seizures, Rhythmic and Periodic Patterns in ICU Electroencephalography by Jing, J., Ge, W., Struck, A. F., Fernandes, M., Hong, S., An, S., et al.
See 3 usage examples →
computed tomographycomputer visioncoronavirusCOVID-19grand-challenge.orgimaginglife sciencesSARS-CoV-2
The STOIC project collected Computed Tomography (CT) images of 10,735 individuals suspected of being infected with SARS-CoV-2 during the first wave of the pandemic in France, from March to April 2020. For each patient in the training set, the dataset contains binary labels for COVID-19 presence, based on RT-PCR test results, and COVID-19 severity, defined as intubation or death within one month from the acquisition of the CT scan. This S3 bucket contains the training sample of the STOIC dataset as used in the STOIC2021 challenge on grand-challenge.org.
Details →
Usage examples
-
How Well Do Self-Supervised Models Transfer to Medical Imaging? by Anton J, Castelli L, Chan MF, Outthers M, Tang WH, Cheung V, et al.
-
Study of Thoracic CT in COVID-19: The STOIC Project by Revel MP, Boussouar S, de Margerie-Mellon C, Saab I, Lapotre T, Mompoint D, et al.
-
STOIC2021 Challenge by Diagnostic Image Analysis Group, Radboudumc, Nijmegen
See 3 usage examples →
cogearth observationgeospatialnatural resourcesatellite imagerywater
Aquatic reflectance produced with the dark spectrum fitting (DSF) algorithm as implemented in the Atmospheric Correction for OLI “lite” (ACOLITE) software (version 20221114.0). Aquatic reflectance is defined here as unitless water-leaving radiance reflectance and represents the ratio of water-leaving radiance (units of watts per square meter per steradian per nanometer) to downwelling irradiance (units of watts per square meter per nanometer) multiplied by pi.
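The definition above amounts to a one-line formula: aquatic reflectance equals pi times water-leaving radiance divided by downwelling irradiance. The following Python sketch only illustrates that arithmetic; the function and parameter names are hypothetical, and it is not part of ACOLITE or the DSF algorithm.

```python
import math


def aquatic_reflectance(water_leaving_radiance: float,
                        downwelling_irradiance: float) -> float:
    """Unitless water-leaving radiance reflectance, per the definition above.

    water_leaving_radiance: W m^-2 sr^-1 nm^-1
    downwelling_irradiance: W m^-2 nm^-1
    Hypothetical helper for illustration only.
    """
    if downwelling_irradiance <= 0:
        raise ValueError("downwelling irradiance must be positive")
    return math.pi * water_leaving_radiance / downwelling_irradiance


# Made-up input values, chosen only to show the call.
rho_w = aquatic_reflectance(0.005, 1.2)
print(rho_w)
```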
Details →
Usage examples
See 3 usage examples →
cyber securitydeep learninglabeledmachine learning
A dataset intended to support research on machine learning
techniques for detecting malware. It includes metadata and EMBER-v2
features for approximately 10 million benign and 10 million malicious
Portable Executable files, with disarmed but otherwise complete
files for all malware samples. All samples are labeled using Sophos
in-house labeling methods, have features extracted using the
EMBER-v2 feature set, as well as metadata obtained via the pefile
python library, detection counts obtained via ReversingLabs
telemetry, and additional behavioral tags that indicate the rough
behavior of the sam...
Details →
Usage examples
See 3 usage examples →
astronomy
TESS-Gaia Light Curve (TGLC) is a PSF-based TESS full-frame image (FFI) light curve product. Using Gaia DR3 as priors, the team forward models the FFIs with the effective point spread function to remove contamination from nearby stars. The resulting light curves show a photometric precision closely tracking the pre-launch prediction of the noise level: TGLC's photometric precision consistently reaches ≲2% at 16th TESS magnitude even in crowded fields, demonstrating excellent decontamination and deblending power.
Details →
Usage examples
See 3 usage examples →
amino acidfastafastqgeneticgenomiclife sciencesmetagenomicsmicrobiome
The NIH-funded Human Microbiome Project (HMP) is a collaborative effort of over 300 scientists from more than 80 organizations to comprehensively characterize the microbial communities inhabiting the human body and elucidate their role in human health and disease. To accomplish this task, microbial community samples were isolated from a cohort of 300 healthy adult human subjects at 18 specific sites within five regions of the body (oral cavity, airways, urogenital tract, skin, and gut). Targeted sequencing of the 16S bacterial marker gene and/or whole metagenome shotgun sequencing was performe...
Details →
Usage examples
-
Strains, functions and dynamics in the expanded Human Microbiome Project by Jason Lloyd-Price, Anup Mahurkar, Gholamali Rahnavard, Jonathan Crabtree, Joshua Orvis, A. Brantley Hall, et al.
-
New microbe genomic variants in patients fecal community following surgical disruption of the upper human gastrointestinal tract by Ranjit Kumar, Jayleen Grams, Daniel I. Chu, David K. Crossman, Richard Stahl, Peter Eipers, et al
-
The Human Microbiome Project by Peter J. Turnbaugh, Ruth E. Ley, Micah Hamady, Claire M. Fraser-Liggett, Rob Knight & Jeffrey I. Gordon
See 3 usage examples →
genome wide association studygenomiclife scienceslofteevep
VEP determines the effect of genetic variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions. The European Bioinformatics Institute produces the VEP tool/db and releases updates every 1-6 months. The latest release contains 267 genomes from 232 species containing 5,567,663 protein coding genes. This dataset hosts the last 5 releases for human, rat, and zebrafish. Also, it hosts the required reference files for the Loss-Of-Function Transcript Effect Estimator (LOFTEE) plugin as it is commonly used with VEP.
Details →
Usage examples
See 3 usage examples →
bioinformaticslife sciencesmedicinepharmaceuticalstructural biology
VirtualFlow Versions of Ligand Libraries in Ready-To-Dock Format
Details →
Usage examples
-
VirtualFlow 2.0 - The Next Generation Drug Discovery Platform Enabling Adaptive Screens of 69 Billion Molecules by Christoph Gorgulla, AkshatKumar Nigam, Matt Koop, Süleyman Selim Çınaroğlu, Christopher Secker, Mohammad Haddadnia, Abhishek Kumar, Yehor Malets, Alexander Hasson, Roni Levin-Konigsberg, Dmitry Radchenko, Aditya Kumar, Minko Gehev, Pierre-Yves Aquilanti, Henry Gabb, Amr Alhossary, Gerhard Wagner, Al, Yurii S. Moroz, Konstantin Fackeldey, Haribabu Arthanari
-
VirtualFlow tutorial by Christoph Gorgulla
-
VirtualFlow for Virtual Screening (VFVS) Module on GitHub by Christoph Gorgulla
See 3 usage examples →
atmosphereclimateearth observationforecastgeosciencehydrologymeteorologicalmodeloceansweather
Global real-time Earth system data deemed by the World Meteorological Organisation (WMO) as essential for provision of services for the protection of life and property and for the well-being of all nations. Data is sourced from all WMO Member countries / territories and retained for 24 hours. Met Office and NOAA operate this Global Cache service curating and publishing the dataset on behalf of WMO.
Details →
Usage examples
See 3 usage examples →
benchmarkenergymachine learning
This data lake contains multiple datasets related to fundamental problems
in wind energy research. This includes data for wind plant power production
for various layouts/wind flow scenarios, data for two- and three-dimensional
flow around different wind turbine airfoils/blades, wind turbine noise
production, among others. The purpose of these datasets is to establish a
standard benchmark against which new AI/ML methods can be tested, compared,
and deployed. Details regarding the generation and formatting of the data for
each dataset are included in the metadata as well as example noteboo...
Details →
Usage examples
See 3 usage examples →
fastqgeneticgenomiclife scienceswhole genome sequencing
The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. The final phase of the project sequenced more than 2500 individuals from 26 different populations around the world and produced an integrated set of phased haplotypes with more than 80 million variants for these individuals.
Details →
Usage examples
See 2 usage examples →
1940 censusarchivescensusdemographynara
The 1940 Census population schedules were created by the Bureau of the Census in an attempt to enumerate every person living in the United States on April 1, 1940, although some persons were missed. The 1940 census population schedules were digitized by the National Archives and Records Administration (NARA) and released publicly on April 2, 2012.
The 1940 Census enumeration district maps contain maps of counties, cities, and other minor civil divisions that show enumeration districts, census tracts, and related boundaries and numbers used for each census. The coverage is nationwide and inclu...
Details →
Usage examples
See 2 usage examples →
1950 censusarchivescensusdemographynara
The 1950 Census population schedules were created by the Bureau of the Census in an attempt to enumerate every person living in the United States on April 1, 1950, although some persons were missed. The 1950 census population schedules were digitized by the National Archives and Records Administration (NARA) and released publicly on April 1, 2022.
The 1950 Census enumeration district maps contain maps of counties, cities, and other minor civil divisions that show enumeration districts, census tracts, and related boundaries and numbers used for each census. The coverage is nationwide and inclu...
Details →
Usage examples
See 2 usage examples →
censusdifferential privacydisclosure avoidanceethnicitygroup quartershispanichousinghousing unitslatinonoisy measurementspopulationraceredistrictingvoting age
The 2010 Census Production Settings Redistricting Data (P.L. 94-171) Demonstration Noisy Measurement File (2023-04-03) is an intermediate output of the 2020 Census Disclosure Avoidance System (DAS) TopDown Algorithm (TDA) (as described in Abowd, J. et al. [2022] https://doi.org/10.1162/99608f92.529e3cb9, and implemented in https://github.com/uscensusbureau/DAS_2020_Redistricting_Production_Code). The NMF was produced using the official “production settings,” the final set of algorithmic parameters and privacy-loss budget allocations that were used to produce the 2020 Census Redistricting Data (P.L. 94-171) Summary File and the 2020 Census Demographic and Housing Characteristics File.
The NMF consists of the full set of privacy-protected statistical queries (counts of individuals or housing units with particular combinations of characteristics) of confidential 2010 Census data relating to the redistricting data portion of the 2010 Demonstration Data Products Suite – Redistricting and Demographic and Housing Characteristics File – Production Settings (2023-04-03). These statistical queries, called “noisy measurements,” were produced under the zero-Concentrated Differential Privacy framework (Bun, M. and Steinke, T. [2016] https://arxiv.org/abs/1605.02065; see also Dwork, C. and Roth, A. [2014] https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf), implemented via the discrete Gaussian mechanism (Canonne, C., et al. [2023] https://arxiv.org/abs/2004.00010), which added positive or negative integer-valued noise to each of the resulting counts. The noisy measurements are an intermediate stage of the TDA prior to the post-processing the TDA then performs to ensure internal and hierarchical consistency within the resulting tables. The Census Bureau has released these 2010 Census demonstration data to enable data users to evaluate the expected impact of disclosure avoidance variability on 2020 Census data. The 2010 Census Production Settings Redistricting Data (P.L. 94-171) Demonstration Noisy Measurement File (2023-04-03) has been cleared for public dissemination by the Census Bureau Disclosure Review Board (CBDRB-FY22-DSEP-004).
The data includes zero-Concentrated Differentially Private (zCDP) (Bun, M. and Steinke, T [2016]) noisy measurements, implemented via the discrete Gaussian mechanism. These are estimated counts of individuals and housing units included in the 2010 Census Edited File (CEF), which includes confidential data initially collected in the 2010 Census of Population and Housing. The noisy measurements included in this file were subsequently post-processed by the TopDown Algorithm (TDA) to produce the 2010 Census Production Settings Privacy-Protected Microdata File - Redistricting (P.L. 94-171) and Demographic and Housing Characteristics File (2023-04-03) (https://www2.census.gov/programs-surveys/decennial/2020/program-management/data-product-planning/2010-demonstration-data-products/04-Demonstration_Data_Products_Suite/2023-04-03/). As these 2010 Census demonstration data are intended to support study of the design and expected impacts of the 2020 Disclosure Avoidance System, the 2010 CEF records were pre-processed before application of the zCDP framework. This pre-processing converted the 2010 CEF records into the input-file format, response codes, and tabulation categories used for the 2020 Census, which differ in substantive ways from the format, response codes, and tabulation categories originally used for the 2010 Census.
The NMF provides estimates of counts of...
Details →
Usage examples
-
Geographic Spines in the 2020 Census Disclosure Avoidance System by Abowd, J., Ashmead, R., Cumings-Menon, R., Garfinkel, S., Heineck, M., Heiss, C., Johns, R., Kifer, D., Leclerc, P., Machanavajjhala, A., Moran, B., Sexton, W., Spence, M., Zhuravlev, P.
-
The 2020 Census Disclosure Avoidance System Topdown Algorithm by Abowd, J., Ashmead, R., Cumings-Menon, R., Garfinkel, S., Heineck, M., Heiss, C., Johns, R., Kifer, D., Leclerc, P., Machanavajjhala, A., Moran, B., Sexton, W., Spence, M., Zhuravlev, P.
See 2 usage examples →
censusdifferential privacydisclosure avoidanceethnicitygroup quartershousinghousing unitsnoisy measurementspopulationraceredistrictingvoting age
The 2020 Census Redistricting Data (P.L. 94-171) Noisy Measurement File (NMF) is an intermediate output of the 2020 Census Disclosure Avoidance System (DAS) TopDown Algorithm (TDA) (as described in Abowd, J. et al [2022] https://doi.org/10.1162/99608f92.529e3cb9, and implemented in the DAS 2020 Redistricting Production Code). The NMF was generated using the Census Bureau's implementation of the Discrete Gaussian Mechanism, calibrated to satisfy zero-Concentrated Differential Privacy with bounded neighbors.
The NMF values, called noisy measurements, are the output of applying the Discrete Gaussian Mechanism to counts from the 2020 Census Edited File (CEF). They are generally inconsistent with one another (for example, in a county composed of two tracts, the noisy measurement for the county's total population may not equal the sum of the noisy measurements of the two tracts' total population), and frequently negative (especially when the population being measured was small), but are integer-valued. The NMF was later post-processed as part of the DAS code to take the form of microdata and to satisfy various constraints. The NMF documented here contains both the noisy measurements themselves and the data needed to represent the DAS constraints; thus, the NMF could be used to reproduce the steps taken by the DAS code to produce microdata from the noisy measurements by applying the production code base.
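The inconsistency described above can be illustrated with a minimal Python sketch. It is not the Census Bureau's production sampler: the counts, the `sigma` value, the function names, and the truncated support are all assumptions made for illustration. It adds independent integer noise with probability proportional to exp(-x^2 / (2 sigma^2)) to a county count and to its two tract counts, then compares the noisy county value with the sum of the noisy tract values.

```python
import math
import random


def discrete_gaussian_noise(sigma, rng, tail=12):
    """Draw integer noise with P(x) proportional to exp(-x^2 / (2 sigma^2)).

    Illustration only: the support is truncated at +/- tail * sigma,
    which an exact discrete Gaussian sampler would not do.
    """
    bound = max(1, math.ceil(tail * sigma))
    support = range(-bound, bound + 1)
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma)) for x in support]
    return rng.choices(support, weights=weights, k=1)[0]


def noisy_measurements(true_counts, sigma, seed=0):
    """Add independent integer noise to every count (hypothetical helper)."""
    rng = random.Random(seed)
    return {
        name: count + discrete_gaussian_noise(sigma, rng)
        for name, count in true_counts.items()
    }


# Hypothetical true counts: one county made of two tracts (5 = 2 + 3).
true_counts = {"county": 5, "tract_a": 2, "tract_b": 3}
noisy = noisy_measurements(true_counts, sigma=3.0)

# The noisy values stay integers, may be negative, and need not add up.
gap = noisy["county"] - (noisy["tract_a"] + noisy["tract_b"])
print(noisy, "county minus tract sum:", gap)
```

Because each count receives its own noise draw, the gap is usually non-zero, and small true counts can easily become negative. Restoring consistency is the job of the later post-processing step.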
The 2020 Census Redistricting Data (P.L. 94-171) Noisy Measurement File includes zero-Concentrated Differentially Private (zCDP) (Bun, M. and Steinke, T [2016]) noisy measurements, implemented via the discrete Gaussian mechanism. These are estimated counts of individuals and housing units included in the 2020 Census Edited File (CEF), which includes confidential data initially collected in the 2020 Census of Population and Housing. The noisy measurements included in this file were subsequently post-processed by the TopDown Algorithm (TDA) to produce the 2020 Census Redistricting Data (P.L. 94-171) Summary File.
The NMF ...
Details →
Usage examples
-
The 2020 Census Disclosure Avoidance System Topdown Algorithm by Abowd, J., Ashmead, R., Cumings-Menon, R., Garfinkel, S., Heineck, M., Heiss, C., Johns, R., Kifer, D., Leclerc, P., Machanavajjhala, A., Moran, B., Sexton, W., Spence, M., Zhuravlev, P.
-
Geographic Spines in the 2020 Census Disclosure Avoidance System by Abowd, J., Ashmead, R., Cumings-Menon, R., Garfinkel, S., Heineck, M., Heiss, C., Johns, R., Kifer, D., Leclerc, P., Machanavajjhala, A., Moran, B., Sexton, W., Spence, M., Zhuravlev, P.
See 2 usage examples →
bioinformaticsbiologygeneticgenomicimaginglife sciences
The goal of the National Institutes of Health (NIH) Common Fund’s 4D Nucleome (4DN) program
is to study the three-dimensional organization of the nucleus in space and time (the 4th dimension).
The nucleus of a cell contains DNA, the genetic “blueprint” that encodes all of the genes a living
organism uses to produce proteins needed to carry out life-sustaining cellular functions. Understanding
the conformation of the nuclear DNA and how it is maintained or changes in response to environmental
and cellular cues over time will provide insights into basic biology as well as aspects of human
health...
Details →
Usage examples
See 2 usage examples →
floodsglobalnear-surface air temperaturenear-surface specific humiditynetcdfprecipitation
Hydrological extremes, in the form of droughts and floods, have impacts on a wide range of sectors including water availability, food security, and energy production, among others. Given continuing large impacts of droughts and floods and the expectation for significant regional changes projected in the future, there is an urgent need to provide estimates of past events and their future risk, globally. However, current estimates of hydrological extremes are not robust and accurate enough, due to lack of long-term data records, standardized methods for event identification, geographical inconsi...
Details →
Usage examples
See 2 usage examples →
agricultureenvironmentalfood securitylife sciencesmachine learning
This dataset contains soil infrared spectral data and paired soil property
reference measurements for georeferenced soil samples that were collected
through the Africa Soil Information Service (AfSIS) project, which lasted
from 2009 through 2018. In this release, we include data collected during
Phase I (2009-2013). Georeferenced samples were collected from 19 countries
in Sub-Saharan Africa using a statistically sound sampling scheme,
and their soil properties were analyzed using both conventional soil
testing methods and spectral methods (infrared diffuse reflectance
spectroscopy). The two ...
Details →
Usage examples
See 2 usage examples →
aerial imageryagriculturecomputer visiondeep learningmachine learning
Agriculture-Vision aims to be a publicly available large-scale aerial agricultural image dataset that is high-resolution, multi-band, and with multiple types of patterns annotated by agronomy experts. The original dataset affiliated with the 2020 CVPR paper includes 94,986 512x512 images sampled from 3,432 farmlands with nine types of annotations: double plant, drydown, endrow, nutrient deficiency, planter skip, storm damage, water, waterway and weed cluster. All of these patterns have substantial impacts on field conditions and the final yield. These farmland images were captured between 201...
Details →
Usage examples
-
The 2nd International Workshop and Prize Challenge on Agriculture-Vision, Challenges & Opportunities for Computer Vision in Agriculture by Humphrey Shi, Naira Hovakimyan, Jennifer Hobbs, Ed Delp, Melba Crawford, Zhen Li, David Clifford, Jim Yuan, Mang Tik Chiu, Xingqian Xu
-
Agriculture-Vision: A Large Aerial Image Database for Agricultural Pattern Analysis by Mang Tik Chiu, Xingqian Xu, Yunchao Wei, Zilong Huang, Alexander Schwing, Robert Brunner, Hrant Khachatrian, Hovnatan Karapetyan, Ivan Dozier, Greg Rose, David Wilson, Adrian Tudor, Naira Hovakimyan, Thomas S. Huang, Honghui Shi
See 2 usage examples →
electrophysiologyHomo sapienslife sciencesMus musculusneurobiologysignal processing
This is a large-scale survey that describes the physiology (strength, kinetics, and short term plasticity) of thousands of synapses from patch clamp experiments in mouse visual cortex and human middle temporal gyrus.
Details →
Usage examples
See 2 usage examples →
electrophysiologylife sciencesMus musculusneurobiologysignal processing
Extracellular electrophysiology data is growing at a remarkable pace. This data, collected with Neuropixels probes by the Allen Institute and the International Brain Lab, can be used to benchmark throughput rates and storage ratios of various data compression algorithms.
Details →
Usage examples
See 2 usage examples →
astronomymachine learningNASA SMD AIsegmentation
Pan-STARRS imaging data and associated labels for galaxy segmentation into galactic centers, galactic bars, spiral arms and foreground stars, derived from citizen scientist labels from the Galaxy Zoo: 3D project.
Details →
Usage examples
-
Galaxy Zoo: 3D – crowdsourced bar, spiral, and foreground star masks for MaNGA target galaxies by Karen L Masters, Coleman Krawczyk, Shoaib Shamsi, Alexander Todd, Daniel Finnegan, Matthew Bershady, Kevin Bundy, Brian Cherinka, Amelia Fraser-McKelvie, Dhanesh Krishnarao, Sandor Kruk, Richard R Lane, David Law, Chris Lintott, Michael Merrifield, Brooke Simmons, Anne-Marie Weijmans, Renbin Yan
-
Pan-STARRS Pixel Processing: Detrending, Warping, Stacking by C. Z. Waters, E. A. Magnier, P. A. Price, K. C. Chambers, W. S. Burgett, P. W. Draper, H. A. Flewelling, K. W. Hodapp, M. E. Huber, R. Jedicke, N. Kaiser, R.-P. Kudritzki, R. H. Lupton, N. Metcalfe, A. Rest, W. E. Sweeney, J. L. Tonry, R. J. Wainscoat, and W. M. Wood-Vase
See 2 usage examples →
agricultureclimatedisaster responseearth observationenvironmentalmeteorologicalmodelweather
Global and high-resolution regional atmospheric models from Météo-France.
- ARPEGE World covers the entire world at a base horizontal resolution of 0.5° (~55 km) between grid points; it forecasts weather up to 114 hours ahead.
- ARPEGE Europe covers Europe and North Africa at a base horizontal resolution of 0.1° (~11 km) between grid points; it forecasts weather up to 114 hours ahead.
- AROME France covers France at a base horizontal resolution of 0.025° (~2.5 km) between grid points; it forecasts weather up to 42 hours ahead.
- AROME France HD covers France and its surroundings at a base horizontal resolution of 0.01° (~1.5 km) between grid points; it forecasts weather up to 42 hours ahead.
Dozens of atmospheric variables are avail...
Details →
Usage examples
See 2 usage examples →
autonomous vehiclescomputer visiondeep learningimage processinglidarmachine learningmappingroboticstraffictransportationurbanweather
The Aurora Multi-Sensor Dataset is an open, large-scale multi-sensor dataset with highly accurate localization ground truth, captured between January 2017 and February 2018 in the metropolitan area of Pittsburgh, PA, USA by Aurora (via Uber ATG) in collaboration with the University of Toronto. The de-identified dataset contains rich metadata, such as weather and semantic segmentation, and spans all four seasons, rain, snow, overcast and sunny days, different times of day, and a variety of traffic conditions.
The Aurora Multi-Sensor Dataset contains data from a 64-beam Velodyne HDL-64E LiDAR sensor and seven 1920x1200-pixel resolution cameras including a forward-facing stereo pair and five wide-angle lenses covering a 360-degree view around the vehicle.
This data can be used to develop and evaluate large-scale long-term approaches to autonomous vehicle localization. Its size and diversity make it suitable for a wide range of research areas such as 3D reconstruction, virtual tourism, HD map construction, and map compression, among others.
The data was first presented at the International Conference on Intelligent Robots an
...
Details →
Usage examples
-
"Pit30M: A benchmark for global localization in the age of self-driving cars", in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 4477-4484) by Martinez, J., Doubov, S., Fan, J., Bârsan, I. A., Wang, S., Máttyus, G., Urtasun, R.
-
Introduction to Visualizing Sensor Types (Jupyter notebook) by Andrei Bârsan (note: Aurora makes no representations as to the accuracy or functionality of the tutorial)
See 2 usage examples →
biodiversitybioinformaticslife sciences
The Biodiversity Heritage Library (BHL) is the world’s largest open access digital library for biodiversity literature and archives. BHL operates as a worldwide consortium of natural history, botanical, research, and national libraries working together to digitize the natural history literature held in their collections and make it freely available for open access.
Details →
Usage examples
See 5 usage examples →
fluorescence imagingGeneLabgeneticgenetic mapslife sciencesmicroscopyNASA SMD AI
Fluorescence microscopy images of individual nuclei from mouse fibroblast cells, irradiated with Fe particles or X-rays with fluorescent foci indicating 53BP1 positivity, a marker of DNA damage. These are maximum intensity projections of 9-layer microscopy Z-stacks.
Details →
Usage examples
See 2 usage examples →
gene expressionGeneLabgeneticgenetic mapslife sciencesNASA SMD AIspace biology
RNA sequencing data from spaceflown and control mouse liver samples, sourced from NASA GeneLab and augmented with a generative adversarial network.
Details →
Usage examples
See 2 usage examples →
amazon.sciencebioinformaticsbiologycoronavirusCOVID-19healthlife sciencesmedicineMERSSARS
A centralized repository of up-to-date and curated datasets on or related to the spread and characteristics of the novel coronavirus (SARS-CoV-2) and its associated illness, COVID-19. Globally, there are several efforts underway to gather this data, and we are working with partners to make this crucial data freely available and keep it up-to-date. Hosted on the AWS cloud, we have seeded our curated data lake with COVID-19 case tracking data from Johns Hopkins and The New York Times, hospital bed availability from Definitive Healthcare, and over 45,000 research articles about COVID-19 and rela...
Details →
Usage examples
See 5 usage examples →
cancergenomiclife sciencesSTRIDEStranscriptomics
The Cancer Genome Characterization Initiatives (CGCI) program supports cutting-edge genomics research of adult and pediatric cancers. CGCI investigators develop and apply advanced sequencing methods that examine genomes, exomes, and transcriptomes within various types of tumors. The program includes the Burkitt Lymphoma Genome Sequencing Project (BLGSP) and the HIV+ Tumor Molecular Characterization Project - Cervical Cancer (HTMCP-CC).
The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantificati...
Details →
Usage examples
See 2 usage examples →
biologycell imagingcell paintingfluorescence imaginghigh-throughput imagingimaginglife sciencesmicroscopy
The Cell Painting Image Collection is a collection of freely
downloadable microscopy image sets. Cell Painting is an
unbiased high throughput imaging assay used to analyze
perturbations in cell models. In addition to the images
themselves, each set includes a description of the biological
application and some type of "ground truth" (expected results).
Researchers are encouraged to use these image sets as reference
points when developing, testing, and publishing new image
analysis algorithms for the life sciences. We hope that
this data set will lead to a better understanding of w...
Details →
Usage examples
See 2 usage examples →
bioinformaticsbiologygenomiclife sciencesmappingmedicinereference indexwhole genome sequencing
Genomic tools use reference databases as indexes to operate quickly and efficiently, analogous to how web search engines use indexes for fast querying. Here, we aggregate genomic, pan-genomic and metagenomic indexes for analysis of sequencing data.
Details →
Usage examples
See 2 usage examples →
earth observationgeosciencegeospatialland coverlidarmappingsurvey
The goal of this project is to collect all publicly available large scale LiDAR datasets and archive them in a
uniform fashion for easy access and use. Initial efforts to collect the datasets are concentrated on Europe and will
be expanded in the future to the USA and other regions, striving for global coverage. Every dataset includes files in original
data format and translated to COPC format. For faster browsing, we include an overview file that includes a small
subset of data points from every dataset file in a single COPC file.
Details →
Usage examples
See 2 usage examples →
activity detectionactivity recognitioncomputer visionlabeledmachine learningprivacyvideo
The Consented Activities of People (CAP) dataset is a fine grained activity dataset for visual AI research curated using the Visym Collector platform.
Details →
Usage examples
See 2 usage examples →
agriculturecogdisaster responseearth observationelevationgeospatialsatellite imagery
The Copernicus DEM is a Digital Surface Model (DSM) which represents the surface of the Earth including buildings, infrastructure and vegetation. We provide two instances of Copernicus DEM named GLO-30 Public and GLO-90. GLO-90 provides worldwide coverage at 90 meters. GLO-30 Public provides limited worldwide coverage at 30 meters because a small subset of tiles covering specific countries has not yet been released to the public by the Copernicus Programme. Note that in both cases ocean areas do not have tiles; height values there can be assumed to be zero. Data is provided as Cloud Optimized Ge...
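The ocean rule above (no tile means a height of zero) can be sketched in Python as a lookup with a fallback. This is a minimal illustration under stated assumptions: the 1° x 1° tile size, the `tile_key` naming scheme, and the `tiles` mapping of keys to sampler callables are hypothetical and are not taken from the dataset's documentation.

```python
import math


def tile_key(lat, lon):
    """Hypothetical key for the 1 x 1 degree tile containing (lat, lon)."""
    lat0, lon0 = math.floor(lat), math.floor(lon)
    ns = "N" if lat0 >= 0 else "S"
    ew = "E" if lon0 >= 0 else "W"
    return f"{ns}{abs(lat0):02d}_{ew}{abs(lon0):03d}"


def elevation(lat, lon, tiles):
    """Return the height at (lat, lon), or 0.0 where no tile exists.

    `tiles` maps a tile key to a callable that samples that tile
    (an assumption for this sketch; a real reader would open the raster).
    """
    sampler = tiles.get(tile_key(lat, lon))
    if sampler is None:
        # No tile is published for open ocean, so assume sea level.
        return 0.0
    return sampler(lat, lon)


# One made-up land tile with a constant height; everything else is "ocean".
tiles = {"N46_E007": lambda lat, lon: 1500.0}
print(elevation(46.5, 7.5, tiles))     # inside the made-up tile
print(elevation(-30.2, -20.7, tiles))  # no tile, falls back to 0.0
```

The fallback does not separate ocean from the withheld GLO-30 Public tiles, so a caller working with GLO-30 Public would need its own way to tell those two cases apart.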
Details →
Usage examples
See 2 usage examples →
copyright monitoringcover song identificationlive song identificationmusicmusic features datasetmusic information retrievalmusic recognition
CoversBR is the first large audio database with predominantly Brazilian music for the tasks of Cover Song
Identification (CSI) and Live Song Identification (LSI). Due to copyright restrictions, audio of
the songs cannot be made available; however, metadata and feature files are publicly accessible. Audio
streams captured from radio and TV channels for the live song identification task will be made public.
CoversBR is composed of metadata and features extracted from 102298 songs, distributed in 26366
groups of covers/versions, with an average of 3.88 versions per group. The entire collecti...
Details →
Usage examples
See 2 usage examples →
COVID-19economicsfinancial marketshiringmarket data
This dataset provides daily updates on the volume of US job listings filtered by geography, industry, job family, and role, normalized to pre-COVID levels. These data files feed the business intelligence visuals at covidjobimpacts.greenwich.hr, a public-facing site hosted by Greenwich.HR and OneModel Inc.
Data is derived from online job listings tracked continuously, calculated daily and published nightly. On average, data from 70% of all new US jobs are captured,
and the dataset currently contains data from 3.3 million hiring organizations. Data for each filter segment is represented as the 7-day ...
Details →
Usage examples
See 2 usage examples →
cell biologycryo electron tomographyczielectron tomographylife sciencesmachine learningsegmentationstructural biology
Cryo-electron tomography (cryoET) is a powerful technique for visualizing 3D structures of cellular macromolecules at near-atomic resolution in their native environment. Observing the inner workings of cells in context enables better understanding of the function of healthy cells and the changes associated with disease. However, the analysis of cryoET data remains a significant bottleneck, particularly the annotation of macromolecules within a set of tomograms, which often requires a laborious and time-consuming process of manual labelling that can take months to complete. Given the current...
Details →
Usage examples
See 5 usage examples →
bambioinformaticscoronavirusCOVID-19fastafastqgeneticgenomicglobalhealthlife scienceslong read sequencingSARS-CoV-2vcfviruswhole genome sequencing
The Sequence Read Archive (SRA) is the primary archive of high-throughput sequencing data, hosted by the National Institutes of Health (NIH). The SRA represents the largest publicly available repository of SARS-CoV-2 sequencing data. This dataset was created by DNAstack using SARS-CoV-2 sequencing data sourced from the SRA. Where possible, raw sequence data were processed by DNAstack through a unified bioinformatics pipeline to produce genome assemblies and variant calls. The use of a standardized workflow to produce this harmonized dataset allows public data generated using different methodol...
Details →
Usage examples
See 2 usage examples →
computer forensicscomputer securityCSIcyber securitydigital forensicsimage processingimaginginformation retrievalinternetintrusion detectionmachine learningmachine translationtext analysis
Disk images, memory dumps, network packet captures, and files for use in digital forensics research and education. All of this information is accessible through the digitalcorpora.org website, and made available at s3://digitalcorpora/. Some of these datasets implement scenarios that were performed by students, faculty, and others acting in persona. As such, the information is synthetic and may be used without prior authorization or IRB approval. Details of these datasets can be found at ...
Details →
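For readers who want to turn the `s3://digitalcorpora/` location into something a browser or HTTP client can request, the following Python sketch splits an S3 URI into bucket and key and builds a virtual-hosted-style HTTPS URL. The helper names and the example key are hypothetical, and whether a given object can be fetched anonymously is an assumption that should be checked against the dataset's own documentation.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key); key may be empty."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def https_url(uri):
    """Build a virtual-hosted-style HTTPS URL for an S3 URI."""
    bucket, key = split_s3_uri(uri)
    return f"https://{bucket}.s3.amazonaws.com/{key}"


# The bucket name comes from the description; the key is made up.
print(split_s3_uri("s3://digitalcorpora/"))
print(https_url("s3://digitalcorpora/example/file.bin"))
```

Only the standard library is used here; listing or downloading objects in bulk would normally go through an S3 client instead.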
Usage examples
See 2 usage examples →
agricultureclimatecoastalearth observationenvironmentalsustainabilityweather
This dataset contains historical and projected dynamically downscaled climate data for the State of Alaska and surrounding regions at 20 km spatial resolution and hourly temporal resolution. Select variables are also summarized into daily resolutions. This data was produced using the Weather Research and Forecasting (WRF) model (Version 3.5). We downscaled ERA-Interim historical reanalysis data (1979-2015) as well as historical and projected runs from two GCMs from the Coupled Model Intercomparison Project Phase 5 (CMIP5): GFDL-CM3 and NCAR-CCSM4 (historical run: 1970-2005 and RCP 8.5: 2006-2100)....
Details →
Usage examples
See 2 usage examples →
biasbiologycancerhealthimaginglife sciencesmammographyx-ray
EMBED is a racially diverse mammography dataset containing 3.4M screening and diagnostic images from 110,000 patients collected from 2013-2020, with an
equal representation of black and white women. The dataset is comprised of 2D, synthetic 2D (C-view), and 3D (digital breast tomosynthesis, i.e. DBT)
images. It contains 60,000 annotated lesions linked to structured imaging descriptors and ground truth pathologic outcomes grouped into six severity
classes. This release represents 20% of the total 2D and C-view dataset and is available for research use. DBT, US, and MRI exams will be added at a
...
Details →
Usage examples
-
Sample Notebook by Emory-HITI
-
The EMory BrEast imaging Dataset (EMBED): A Racially Diverse, Granular Dataset of 3.4M Screening and Diagnostic Mammograms by Jiwoong J. Jeong, Brianna L. Vey, Ananth Reddy, Thomas Kim, Thiago Santos, Ramon Correa, Raman Dutt, Marina Mosunjac, Gabriela Oprea-Ilies, Geoffrey Smith, Minjae Woo, Christopher R. McAdams, Mary S. Newell, Imon Banerjee, Judy Gichoya, Hari Trivedi
See 2 usage examples →
bioinformaticsbiologycomputer visioncsvhealthimaginglabeledlife sciencesmachine learningmedical image computingmedical imagingradiologyx-ray
The Emory Knee Radiograph (MRKR) dataset is a large, demographically diverse collection of
503,261 knee radiographs from 83,011 patients, 40% of whom are African American. This dataset
provides imaging data in DICOM format along with detailed clinical information, including patient-
reported pain scores, diagnostic codes, and procedural codes, which are not commonly available in
similar datasets. The MRKR dataset also features imaging metadata such as image laterality, view type,
and presence of hardware, enhancing its value for research and model development. MRKR addresses
significant gaps ...
Details →
Usage examples
See 2 usage examples →
autonomous vehiclescomputer visionlidarmappingroboticstransportationurbanweather
This research presents a challenging multi-agent seasonal dataset collected by a fleet of Ford autonomous vehicles on different days and at different times during 2017-18. The vehicles were manually driven on an average route of 66 km in Michigan that included a mix of driving scenarios such as the Detroit Airport, freeways, city centres, a university campus, and suburban neighbourhoods. Each vehicle used in this data collection is a Ford Fusion outfitted with an Applanix POS-LV inertial measurement unit (IMU), four HDL-32E Velodyne 3D-lidar scanners, 6 Point Grey 1.3 MP Cameras arranged on the...
Details →
Usage examples
See 2 usage examples →
cancergenomiclife sciencesSTRIDESwhole genome sequencing
Biopsies of castration resistant prostate cancer metastases were subjected to whole genome sequencing (WGS), along with RNA-sequencing (RNA-Seq). The overarching goal of the study is to illuminate molecular mechanisms of acquired resistance to therapeutic agents, and particularly androgen signaling inhibitors, in the treatment of metastatic castration resistant prostate cancer (mCRPC). This study is made available on AWS via the NIH STRIDES Initiative.
Details →
Usage examples
See 2 usage examples →
bioinformaticsdeep learninglife sciencesmachine learningmedicineneurophysiologyneuroscience
The Harvard EEG Database will encompass data gathered from four hospitals affiliated with Harvard University: Massachusetts General Hospital (MGH), Brigham and Women's Hospital (BWH), Beth Israel Deaconess Medical Center (BIDMC), and Boston Children's Hospital (BCH).
Details →
Usage examples
-
Harvard-EEG-Database-Tools by Brain Data Science Platform (BDSP)
-
Harvard Electroencephalography Database by Zafar, S., Loddenkemper, T., Lee, J. W., Cole, A., Goldenholz, D., Peters, J., et al.
See 2 usage examples →
bioinformaticsdeep learninglife sciencesmachine learningmedicineneurophysiologyneuroscience
The Harvard-Emory ECG database (HEEDB) is a large collection of 12-lead electrocardiography (ECG) recordings, prepared through a collaboration between Harvard University and Emory University investigators.
Details →
Usage examples
-
WFDB Software Package by Moody, G., Pollard, T., & Moody, B.
-
Harvard Electroencephalography Database by Moura Junior, V.; Reyna, M.; Hong, S.; Gupta, A.; Ghanta, M.; Sameni, R., et al.
See 2 usage examples →
bioinformaticsgeneticgenomiclife sciencesmetagenomicsviruswhole genome sequencing
Preprocessed databases for use with the Hecatomb pipeline for viral and phage sequence annotation.
Details →
Usage examples
-
No Evidence Known Viruses Play a Role in the Pathogenesis of Onchocerciasis-Associated Epilepsy. An Explorative Metagenomic Case-Control Study by Michael Roach, Adrian Cantu, Melissa Krizia Vieri, Matthew Cotten, Paul Kellam, My Phan, Lia van der Hoek, Michel Mandro, Floribert Tepage, Germain Mambandu, Gisele Musinya, Anne Laudisoit, Robert Colebunders, Robert Edwards, John L. Mokili
-
The Hecatomb Tutorial by Michael Roach
See 2 usage examples →
bioinformaticsbiologygenomiclife sciencesmetagenomicsmicrobiomereference indexwhole genome sequencing
This dataset comprises pre-built indexes for the bioinformatics software Kaiju, which is used for taxonomic classification of metagenomic sequencing data. Various indexes for different source reference databases are available.
Details →
Usage examples
See 2 usage examples →
astronomy
The James Webb Space Telescope (JWST) is the world's next flagship infrared observatory led by NASA with its partners, ESA (European Space Agency), and CSA (Canadian Space Agency). JWST offers scientists the opportunity to observe galaxy evolution, the formation of stars and planets, exoplanetary systems, and our own solar system, in ways never before possible.
Details →
Usage examples
See 2 usage examples →
cancercomputational pathologycomputer visiondeep learninggrand-challenge.orghistopathologylife sciences
This dataset contains all the data for the LEarning biOchemical Prostate cAncer Recurrence from histopathology sliDes challenge, or LEOPARD. Prostate cancer, impacting 1.4 million men annually, is a prevalent malignancy (H. Sung et al., 2021). A substantial number of these individuals undergo prostatectomy as the primary curative treatment. The efficacy of this surgery is assessed, in part, by monitoring the concentration of prostate-specific antigen (PSA) in the bloodstream. While the role of PSA in prostate cancer screening is debatable (W. F. Clark et al., 2018; E. A. M. Heijnsdijk et al., 2018), it serves as a valuable biomarker for postprostatectomy follow-up in patients. Following successful surgery, PSA concentration is typically undetectable (<0.1 ng/mL) within 4-6 weeks (S. S. Goonewardene et al., 2014). However, approximately 30% of patients experience biochemical recurrence, signifying the resurgence of prostate cancer cells. This recurrence serves as a prognostic indicator for progression to clinical metastases and eventual prostate cancer-related mortality (C. L. Amling, 2014; S. J. Freedland et al., 2005; M. Han et al., 2001; T. Van den Broeck et al., 2001). Current clinical practices gauge the risk of biochemical recurrence by considering the International Society of Urological Pathology (ISUP) grade, PSA value at diagnosis, and TNM staging criteria (J. I. Epstein et al., 2016). A recent European consensus guideline suggests categorizing patients into low-risk, intermediate-risk, and high-risk groups based on these factors (N. Mottet et al., 2021). Notably, a high ISUP grade independently assigns a patient to the intermediate (grade 2/3) or high-risk group (grade 4/5). The Gleason growth patterns, representing morphological patterns of prostate cancer, are used to categorize cancerous tissue into ISUP grade groups (J. I. Epstein, 2010; P. M. Pierorazio et al., 2013; G. J. L. H. van Leenders et al., 2020; J. I. Epstein et al., 2016).
However, the ISUP grade has limitations, such as grading disagreement among pathologists (J. I. Epstein et al., 2016) and coarse descriptors of tissue morphology. Recently, deep learning was shown (H. Pinckaers et al., 2022; O. Eminaga et al., 2024)...
Details →
Usage examples
See 2 usage examples →
chemistryfluid dynamicsmaterials sciencephysicsspace biology
NASA's Physical Sciences Research Program, along with its predecessors, has conducted significant fundamental and applied research in the physical sciences. The International Space Station (ISS) is an orbiting laboratory that provides an ideal facility to conduct long-duration experiments in the near absence of gravity and allows continuous and interactive research similar to Earth-based laboratories. This enables scientists to pursue innovations and discoveries not currently achievable by other means. NASA's Physical Sciences Research Program also benefits from collaborations with several of the ISS international partners—Europe, Russia, Japan, and Canada—and foreign governments with space programs, such as France, Germany and Italy.
In fulfillment of the Open Science model, NASA's Physical Sciences Research Program is pleased to offer the PSI data repository for physical science experiments performed in reduced-gravity environments such as the ISS, Space Shuttle flights, and Free-flyers. PSI also includes data from some related ground-based studies. The PSI system is accessible and open to the public. This provides the opportunity for researchers to data mine results from prior flight investigations, expanding on the research performed. This approach will allow numerous ground-based investigations to be conducted fro
...
Details →
Usage examples
See 2 usage examples →
csvlife sciencesSTRIDEStxtxml
PubMed Central® (PMC) is a free full-text archive of biomedical and life sciences journal articles at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). The PubMed Central (PMC) Article Datasets include full-text articles archived in PMC and made available under license terms that allow for text mining and other types of secondary analysis and reuse. The articles are organized on AWS based on general license type:
The PMC Open Access (OA) Subset, which includes all articles in PMC with a machine-readable Creative Commons license
The Author Manuscript Dataset, which includes all articles collected under a funder policy in PMC and made available in machine-readable formats for text mining
These datasets collectively span
...
Details →
Usage examples
See 2 usage examples →
agricultureclimatedisaster responseenvironmentaltransportationweather
The Analysis Of Record for Calibration (AORC) is a gridded record of near-surface weather conditions covering the continental United States and Alaska and their hydrologically contributing areas. It is defined on a latitude/longitude spatial grid with a mesh length of 30 arc seconds (~800 m), and a temporal resolution of one hour. Elements include hourly total precipitation, temperature, specific humidity, terrain-level pressure, downward longwave and shortwave radiation, and west-east and south-north wind components. It spans the period from 1979 across the Continental U.S. (CONUS) and from 1981 across Alaska, to the near-present (at all locations). This suite of eight variables is sufficient to drive most land-surface and hydrologic models and is used as input to the National Water Model (NWM) retrospective simulation. While the native AORC process generates netCDF output, the data is post-processed to create a cloud optimized Zarr formatted equivalent for dissemination using cloud technology and infrastructure.
AORC Version 1.1 dataset creation
The AORC dataset was created after reviewing, identifying, and processing multiple large-scale, observation, and analysis datasets. There are two versions of The Analysis Of Record for Calibration (AORC) data.
The initial AORC Version 1.0 dataset was completed in November 2019 and consisted of a grid with 8 elements at a resolution of 30 arc seconds. The AORC version 1.1 dataset was created to address issues (see Table 1 in Fall et al., 2023) in the version 1.0 CONUS dataset. Full documentation on version 1.1 of the AORC data and the related journal publication are provided below.
The native AORC version 1.1 process creates a dataset that consists of netCDF files with the following dimensions: 1 hour, 4201 latitude values (ranging from 25.0 to 53.0), and 8401 longitude values (ranging from -125.0 to -67).
The data creation runs with a 10-day lag to ensure the inclusion of any corrections to the input Stage IV and NLDAS data.
Note - The full extent of the AORC grid as defined in its data files exceeds the ranges cited above; the outermost rows and columns of the data grids are filled with missing values and are a remnant of an early set of required AORC extents that have since been adjusted inward.
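The relationship between the grid dimensions and the cited coordinate ranges can be checked with a little arithmetic. The sketch below assumes a uniform spacing of 30 arc seconds (1/120 of a degree) and counts both endpoints of a range as grid points; the function names are hypothetical and are not part of any AORC tooling.

```python
from fractions import Fraction

# 30 arc seconds expressed in degrees, kept as an exact rational (1/120)
MESH_DEG = Fraction(30, 3600)

def points_spanning(start_deg, end_deg, mesh=MESH_DEG):
    """Number of grid points covering [start_deg, end_deg], endpoints included."""
    return int((Fraction(end_deg) - Fraction(start_deg)) / mesh) + 1

def span_of(n_points, mesh=MESH_DEG):
    """Extent in degrees covered by n_points evenly spaced grid points."""
    return float((n_points - 1) * mesh)

# Points needed for the cited ranges (25.0 to 53.0 and -125.0 to -67)
lat_points_cited = points_spanning(25, 53)      # 3361
lon_points_cited = points_spanning(-125, -67)   # 6961

# Extent implied by the dimensions stored in the files
lat_span_in_files = span_of(4201)   # 35.0 degrees
lon_span_in_files = span_of(8401)   # 70.0 degrees
```

Under these assumptions the file dimensions cover 35 by 70 degrees, while the cited ranges cover only 28 by 58 degrees, which is consistent with the note about missing-value rows and columns at the grid edges.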
AORC Version 1.1 Zarr Conversion
The goal of converting the AORC data from netCDF to Zarr was to let users load and use the data quickly and efficiently. For example, one year of data takes 28 minutes to load via netCDF but only 3.2 seconds via Zarr, roughly a 500-fold speedup. For longer periods, the relative speedup of Zarr over netCDF is even higher. Using Zarr also reduces memory and CPU utilization.
It was determined that the optimal conversion for the data was one year's worth of Zarr files with a chunk size of 18 MB. The chunking was applied across all 8 variables. Each chunk has the following dimensions: 144 time, 128 latitude, and 256 longitude. To create the files in the Zarr format, the netCDF files were rechunked using Xarray's chunk() method. After chunking, the files were converted to monthly Zarr files. The monthly Zarr files were then combined using to_zarr to create a Zarr file that represents a full year.
Users wanting more than 1 year of data will be able to utilize Zarr utilities/libraries to combine multiple years up to the span of the full data set.
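The quoted chunk size follows from the chunk dimensions. The sketch below assumes each value is stored as a 4-byte (32-bit) number and that a year has 365 days; neither assumption is stated in the description, and the variable names are illustrative only.

```python
import math

BYTES_PER_VALUE = 4  # assumption: 32-bit values
CHUNK_SHAPE = {"time": 144, "latitude": 128, "longitude": 256}
GRID_SHAPE = {"latitude": 4201, "longitude": 8401}
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year

# Uncompressed size of one chunk, in MiB
values_per_chunk = math.prod(CHUNK_SHAPE.values())
chunk_mib = values_per_chunk * BYTES_PER_VALUE / 2**20   # 18.0

# Chunks needed along each dimension for one variable and one year
chunks_per_dim = {
    "time": math.ceil(HOURS_PER_YEAR / CHUNK_SHAPE["time"]),
    "latitude": math.ceil(GRID_SHAPE["latitude"] / CHUNK_SHAPE["latitude"]),
    "longitude": math.ceil(GRID_SHAPE["longitude"] / CHUNK_SHAPE["longitude"]),
}
chunks_per_variable_year = math.prod(chunks_per_dim.values())

# Ratio of the quoted load times (28 minutes vs 3.2 seconds)
speedup = (28 * 60) / 3.2
```

With 144 hourly steps per chunk, each chunk holds six days of data for a 128 by 256 tile of the grid, so a time-series read at one location touches about 61 chunks per year.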
There are eight variables representing the meteorological conditions:
Total Precipitation (APCP_surface)
- Hourly total precipitation (kg m-2 or mm)
Air Temperature (TMP_2maboveground)
- Temperature (at 2 m above-ground-level (AGL)) (K)
Specific Humidity (SPFH_2maboveground)
- Specific humidity (at 2 m AGL) (g g-1)
Downward Long-Wave Radiation Flux (DLWRF_surface)
- Downward longwave (infrared) radiation flux (at the surface) (W m-2)
Downward Short-Wave Radiation Flux (DSWRF_surface)
- Downward shortwave (solar) radiation flux (at the surface) (W m-2)
Pressure (PRES_surface)
- Air pressure (at the surface) (Pa)
U-Component of Wind (UGRD_10maboveground)
- U (west-east) component of the wind (at 10 m AGL) (m s-1)
V-Component of Wind (VGRD_10maboveground)
- V (south-north) component of the wind (at 10 m AGL) (m s-1)
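Because the wind is stored as U and V components, speed and direction have to be derived. The sketch below is one common way to do this, assuming U is positive toward the east and V is positive toward the north, and reporting direction in the meteorological convention (the direction the wind blows from, in degrees clockwise from north). The function name is hypothetical.

```python
import math

def wind_speed_and_direction(u, v):
    """Return (speed in m/s, direction in degrees the wind blows FROM)."""
    speed = math.hypot(u, v)
    # atan2(-u, -v) points back along the wind vector; % 360 maps to [0, 360)
    direction = math.degrees(math.atan2(-u, -v)) % 360.0
    return speed, direction

# A wind blowing toward the east (u > 0, v = 0) comes from the west
speed, direction = wind_speed_and_direction(5, 0)
```

For calm conditions (both components zero) the direction is not meaningful, so callers may want to mask those cells.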
Precipitation and Temperature
The gridded AORC precipitation dataset contains one-hour Accumulated Surface Precipitation (APCP) ending at the “top” of each hour, in liquid water-equivalent units (kg m-2 to the nearest 0.1 kg m-2), while the gridded AORC temperature dataset is comprised of instantaneous, 2 m above-ground-level (AGL) temperatures at the top of each hour (in Kelvin, to the nearest 0.1).
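The units above convert to more familiar ones with simple arithmetic. The sketch below assumes a liquid water density of 1000 kg m-3, under which 1 kg m-2 corresponds to a depth of 1 mm; the helper names are illustrative only.

```python
WATER_DENSITY_KG_M3 = 1000.0  # assumption: density of liquid water

def precip_kg_m2_to_mm(apcp_kg_m2):
    """Convert liquid water-equivalent mass per area to depth in millimetres."""
    depth_m = apcp_kg_m2 / WATER_DENSITY_KG_M3
    return depth_m * 1000.0

def kelvin_to_celsius(temp_k):
    """Convert an instantaneous temperature from kelvin to degrees Celsius."""
    return temp_k - 273.15

def accumulate(hourly_kg_m2):
    """Sum hourly accumulations, rounded to the dataset's 0.1 kg m-2 precision."""
    return round(sum(hourly_kg_m2), 1)

daily_total = accumulate([0.0, 0.3, 1.2, 0.5])
```

Because each precipitation value is an accumulation ending at the top of the hour, totals over longer periods are sums of the hourly values, whereas the temperatures are instantaneous and should be averaged or sampled rather than summed.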
Specific Humidity, Pressure, Downward Radiation, Wind
...
Details →
Usage examples
-
Explore the AORC 1.1 dataset in Zarr by Michael AuCoin
-
The Office of Water Prediction's Analysis of Record for Calibration, version 1.1: Dataset description and precipitation evaluation (09 July 2023). J. Am. Water Resour. Assoc., 59 (6). 1246-1272. by Greg Fall, David Kitzmiller, Sandra Pavlovic, Ziya Zhang, Nathan Patrick, Michael St. Laurent, Carl Trypaluk, Wanru Wu, and Dennis Miller
See 2 usage examples →
agricultureclimatemeteorologicalweather
The Climate Forecast System (CFS) is a model representing the global interaction between Earth's oceans, land, and atmosphere. Produced by several dozen scientists under guidance from the National Centers for Environmental Prediction (NCEP), this model offers hourly data with a horizontal resolution down to one-half of a degree (approximately 56 km) around Earth for many variables. CFS uses the latest scientific approaches for taking in, or assimilating, observations from data sources including surface observations, upper air balloon observations, aircraft observations, and satellite obser...
Details →
Usage examples
See 2 usage examples →
agricultureclimatedisaster responseenvironmentalmeteorologicalweather
The Global Forecast System (GFS) is a weather forecast model produced by the National Centers for Environmental Prediction (NCEP). Dozens of atmospheric and land-soil variables are available through this dataset, from temperatures, winds, and precipitation to soil moisture and atmospheric ozone concentration. The GFS data files stored here can be used immediately with OAR/ARL’s NOAA-EPA Atmosphere-Chemistry Coupler Cloud (NACC-Cloud) tool, and are in Network Common Data Form (netCDF), a very common format across the scientific community. These particular GFS files contain a comprehensive set of global atmosphere/land variables at a relatively high spatiotemporal resolution (approximately 13x13 km horizontal, 127 vertical levels, and hourly). They are not only necessary for the NACC-Cloud tool to adequately drive community air quality applications (e.g., U.S. EPA’s Community Multiscale Air Quality model; https://www.epa.gov/cmaq), but can also be very useful for a myriad of other applications in the Earth system modeling communities (e.g., atmosphere, hydrosphere, pedosphere). While many other data file and record formats are available for Earth system and climate research (e.g., GRIB, HDF, GeoTIFF), the netCDF files here are advantageous to the larger community because of the comprehensive, high spatiotemporal information they contain, and because they are more scalable, appendable, shareable, self-describing, and community-friendly (i.e., many tools are available to the community of users). Of the four operational GFS forecast cycles per day (at 00Z, 06Z, 12Z and 18Z), this particular netCDF dataset is updated daily (/inputs/yyyymmdd/) for the 12Z cycle and includes 24-hr output for both 2D (gfs.t12z.sfcf$0hh.nc) and 3D variables (gfs.t12z.atmf$0hh.nc).
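The file layout described above can be turned into a list of object keys for a given day. The sketch below rests on two assumptions that the description does not confirm: that "$0hh" stands for a literal 0 followed by a two-digit forecast hour, and that the 24-hr output covers hours 00 through 24 inclusive. The function name is hypothetical, and no bucket name is assumed.

```python
from datetime import date

def gfs_12z_keys(day, max_hour=24):
    """Build keys for the 12Z cycle's 2D (sfc) and 3D (atm) files for one day."""
    prefix = f"inputs/{day:%Y%m%d}/"
    keys = []
    for hh in range(max_hour + 1):
        # Assumed pattern: f0hh, e.g. f000, f001, ..., f024
        keys.append(f"{prefix}gfs.t12z.sfcf0{hh:02d}.nc")
        keys.append(f"{prefix}gfs.t12z.atmf0{hh:02d}.nc")
    return keys

keys = gfs_12z_keys(date(2024, 1, 15))
```

Listing the day's prefix in the bucket before downloading is a cheap way to confirm that the assumed pattern matches the files actually present.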
Also available are netCDF formatted Global Land Surface Datasets (GLSDs) developed by Hung et al. (2024). The GLSDs are based on numerous satellite products, and have been gridded to match the GFS spatial resolution (~13x13 km). These GLSDs contain vegetation canopy data (e.g., land surface type, vegetation clumping index, leaf area index, vegetative canopy height, and green vegetation fraction) that are supplemental to and can be combined with the GFS meteorological netCDF data for various applications, including NOAA-ARL's canopy-app. The canop...
Details →
Usage examples
See 2 usage examples →
agricultureclimatemeteorologicalweather
The MRMS system was developed to produce severe weather, transportation, and precipitation products that support better decision-making and improve hazardous weather forecasts and warnings, along with hydrology, aviation, and numerical weather prediction.
MRMS is a system with fully-automated algorithms that quickly and intelligently integrate data streams from multiple radars, surface and upper air observations, lightning detection systems, satellite observations, and forecast models. Numerous two-dimensional multiple-sensor products offer assistance for hail, wind, tornado, quantitative precipitation estimations, convection, icing, and turbulence diagnosis.
MRMS is being used to develop and test new Federal Aviation Administration (FAA) NextGen products in addition to advancing techniques in quality control, icing detection, and turbulence in collaboration with the National Center for Atmospheric Research, the University Corporation for Atmospheric Research, and Lincoln Laboratories.
MRMS was deployed operationally in 2014 at the National Centers for Environmental Prediction (NCEP). All of the 100+ products it produces are available via NCEP to all of the WFOs, RFCs, CWSUs and NCEP service centers. In addition, the MRMS product suite is publicly available to any other entity that wishes to access and use the data. Other federal agencies that use MRMS include FEMA, DOD, FAA, and USDA.
MRMS is the proposed operational version of the WDSS-II and NMQ research systems.
...
Details →
Usage examples
-
Multi-Radar Multi-Sensor (MRMS) Quantitative Precipitation Estimation: Initial Operating Capabilities by Jian Zhang, Kenneth Howard, Carrie Langston, Brian Kaney, Youcun Qi, Lin Tang, Heather Grams, Yadong Wang, Stephen Cocks, Steven Martinaitis, Ami Arthur, Karen Cooper, Jeff Brogden, David Kitzmiller
-
Multi-Radar Multi-Sensor (MRMS) Severe Weather and Aviation Products: Initial Operating Capabilities by Travis M. Smith, Valliappa Lakshmanan, Gregory J. Stumpf, Kiel L. Ortega, Kurt Hondl, Karen Cooper, Kristin M. Calhoun, Darrel M. Kingfield, Kevin L. Manross, Robert Toomey, Jeff Brogden
See 2 usage examples →
agricultureclimatedisaster responseenvironmentalmeteorologicaloceansweather
The Unified Forecast System Subseasonal to Seasonal prototypes consist of reforecast data from the UFS atmosphere-ocean coupled model experimental prototype versions 5, 6, 7, and 8 produced by the Medium Range and Subseasonal to Seasonal Application team of the UFS-R2O project. The UFS prototypes are the first dataset released to the broader weather community for analysis and feedback as part of the development of the next generation operational numerical weather prediction system from NWS. The dataset includes all the major weather variables for atmosphere, land, ocean, sea ice, and ocean wav...
Details →
Usage examples
-
The impact of tropical SST biases on the S2S precipitation forecast skill over the Contiguous United States in the UFS global coupled model by Hedanqiu Bai, Bin Li, Avichal Mehra, Jessica Meixner, Shrinivas Moorthi, Sulagna Ray, Lydia Stefanova, Jiande Wang, Jun Wang, Denise Worthen, Fanglin Yang, and Cristiana Stan
-
Advances in Seasonal Predictions of Arctic Sea Ice With NOAA UFS by Jieshun Zhu, Wanqiu Wang, Yanyun Liu, Arun Kumar, and David DeWitt
See 2 usage examples →
climateoceans
The World Ocean Database (WOD) is the largest uniformly formatted, quality-controlled, publicly available historical subsurface ocean profile database. From Captain Cook's second voyage in 1772 to today's automated Argo floats, global aggregation of ocean variable information including temperature, salinity, oxygen, nutrients, and others vs. depth allows for study and understanding of the changing physical, chemical, and to some extent biological state of the World's Oceans. Browse the bucket via the AWS S3 explorer: https://noaa-wod-pds.s3.amazonaws.com/index.html
Details →
Usage examples
-
The World Ocean Database User's Manual by Hernan E. Garcia, Tim P. Boyer, Ricardo A. Locarnini, Olga K. Baranova, Melissa M. Zweng
-
The World Ocean Database Introduction by Tim P. Boyer, Olga K. Baranova, Carla Coleman, Hernan E. Garcia, Alexandra Grodsky, Ricardo A. Locarnini, Alexey V. Mishonov, Christopher R. Paver, James R. Reagan, Dan Seidov, Igor V. Smolyar, Katharine W. Weathers, Melissa M. Zweng
See 2 usage examples →
archivesgovernment recordsnaranational archives catalog
The National Archives Catalog dataset contains all of the descriptions; authority records; digitized and electronic records; and tags, transcriptions and comments for NARA’s archival holdings available in the Catalog.
Details →
Usage examples
See 2 usage examples →
cancergenomiclife sciences
The study describes integrative analysis of genetic lesions in 574 diffuse large B cell lymphomas
(DLBCL) involving exome and transcriptome sequencing, array-based DNA copy number analysis and
targeted amplicon resequencing. The dataset contains open RNA-Seq Gene Expression Quantification
data.
Details →
Usage examples
See 2 usage examples →
anomaly detectionclassificationdisaster responseearth observationenvironmentalNASA SMD AIsatellite imagerysocioeconomicurban
Detection of nighttime combustion (fire and gas flaring) from daily top of atmosphere data from NASA's Black Marble VNP46A1 product using VIIRS Day/Night Band and VIIRS thermal bands.
Details →
Usage examples
See 2 usage examples →
geneticgenomiclife sciencessqlitetertiary analysisvariant annotation
OpenCRAVAT is a modular variant annotation tool developed by KarchinLab at Johns Hopkins.
This dataset is a mirror of the OpenCRAVAT store available at https://store.opencravat.org.
You can configure OpenCRAVAT to use this mirror by editing the "cravat-system.yml" file.
The path to this file is in the first output line of the command "oc config system". In that file,
change the value of "store_url" to "https://opencravat-store-aws.s3.amazonaws.com".
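The edit described above can also be scripted. The sketch below works on the file's text only and assumes "store_url" is a top-level key on its own line; the sample configuration content and the function name are hypothetical, and the real file may contain other keys.

```python
import re

MIRROR_URL = "https://opencravat-store-aws.s3.amazonaws.com"

def set_store_url(yaml_text, url=MIRROR_URL):
    """Return yaml_text with the top-level store_url line set to url."""
    new_line = f"store_url: {url}"
    pattern = re.compile(r"^store_url:.*$", flags=re.MULTILINE)
    if pattern.search(yaml_text):
        # Replace the existing line; a callable avoids escape handling
        return pattern.sub(lambda _match: new_line, yaml_text)
    # Key not present: append it on its own line
    separator = "" if (not yaml_text or yaml_text.endswith("\n")) else "\n"
    return f"{yaml_text}{separator}{new_line}\n"

# Hypothetical file content, for illustration only
sample = "example_key: example_value\nstore_url: https://store.opencravat.org\n"
updated = set_store_url(sample)
```

To apply it, read the file whose path is printed by "oc config system", pass its text through the function, and write the result back after making a backup copy.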
Details →
Usage examples
See 2 usage examples →
cancergenomiclife sciences
The OHSU-CNL study offers whole-exome and RNA sequencing of a cohort of 100 cases with rare
hematologic malignancies such as Chronic neutrophilic leukemia (CNL), atypical chronic myeloid
leukemia (aCML), and unclassified myelodysplastic syndrome/myeloproliferative neoplasms
(MDS/MPN-U). This dataset contains open RNA-Seq Gene Expression Quantification data.
Details →
Usage examples
See 2 usage examples →
agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystacsustainabilitysynthetic aperture radar
JAXA has responded to the Earthquake events in Turkey and Syria by conducting emergency disaster observations and providing data as requested by the Disaster and Emergency Management Authority (AFAD), Ministry of Interior in Turkey, through Sentinel Asia and the International Disaster Charter. Additional information on the event and dataset can be found here. The 25 m PALSAR-2 ScanSAR is normalized backscatter data of PALSAR-2 broad area observation mode with observation width of 350 km. Polarization data are stored as 16-bit digital numbers (DN). The DN values can be converted to gamma naught...
Details →
Usage examples
See 2 usage examples →
cancergeneticgenomiclife sciencesSTRIDEStranscriptomicswhole genome sequencing
This study generated a collection of patient-derived pancreatic normal and cancer organoids, which were sequenced using Whole Genome Sequencing (WGS), Whole Exome Sequencing (WXS) and RNA-Seq, along with matched tumor and normal tissue where available. The study provides a valuable resource for pancreatic cancer researchers.
The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification.
Details →
Usage examples
See 2 usage examples →
amino acidarchivesbioinformaticsbiomolecular modelingcell biologychemical biologyCOVID-19electron microscopyelectron tomographyenzymelife sciencesmoleculenuclear magnetic resonancepharmaceuticalproteinprotein templateSARS-CoV-2structural biologyx-ray crystallography
The "Protein Data Bank (PDB) archive" was established in 1971 as the first open-access digital data archive in biology. It is a collection of three-dimensional (3D) atomic-level structures of biological macromolecules (i.e., proteins, DNA, and RNA) and their complexes with one another and various small-molecule ligands (e.g., US FDA approved drugs, enzyme co-factors). For each PDB entry (unique identifier: 1abc or PDB_0000001abc) multiple data files contain information about the 3D atomic coordinates, sequences of biological macromolecules, information about any small molecules/ligan...
Details →
Usage examples
See 2 usage examples →
climateclimate modelelectricityenergyenergy modelingenvironmentalgovernment recordsinfrastructureopen source softwareutilities
The Public Utility Data Liberation Project (PUDL) provides analysis-ready energy system data to climate advocates,
researchers, policymakers, and journalists.
PUDL is an open source data processing pipeline
that makes US energy data easier to access and use programmatically. Hundreds of gigabytes of valuable data
are published by US government agencies, but it's often difficult to work with.
PUDL takes the original spreadsheets, CSV files, and databases and turns them into a unified resource. This allows users to
spend more time on novel analysis and less time on data preparation.
This...
Details →
Usage examples
See 2 usage examples →
agriculturedisaster responseearth observationenvironmentalwater
Near-real-time and archival high-resolution (10 m) flood inundation data over the Contiguous United States, developed from the Sentinel-1 SAR imagery archive (2016-current) using the automated Radar Produced Inundation Diary (RAPID) algorithm.
Details →
Usage examples
See 2 usage examples →
coronavirusCOVID-19information retrievallife sciencesnatural language processingtext analysis
The REaltime DAta Synthesis and Analysis (REDASA) COVID-19 snapshot contains the output of the curation protocol produced by our curator community. A detailed description can be found in our paper. The first S3 bucket listed in Resources contains a large collection of medical documents in text format extracted from the CORD-19 dataset, plus other sources deemed relevant by the REDASA consortium. The second S3 bucket contains a series of documents surfaced by Amazon Kendra that were considered relevant for each medical question asked. The final S3 bucket contains the GroundTruth annotations cr...
Details →
Usage examples
See 2 usage examples →
genetichealthHomo sapienslife scienceslong read sequencingmappingvariant annotationvcfwhole genome sequencing
Reference data bundle for analyzing HiFi human whole genome
sequencing data
Details →
Usage examples
See 2 usage examples →
cogcomputer visionearth observationgeospatialimage processingsatellite imagerystac
The Satellogic EarthView dataset includes high-resolution satellite images captured over all continents. The dataset is organized in Hive partition format and hosted by AWS. It can be accessed via the STAC browser or the AWS CLI. Each item of the dataset corresponds to a specific region and date, with some of the regions revisited for additional data. The dataset provides Top-of-Atmosphere (TOA) reflectance values across four spectral bands (Red, Green, Blue, Near-Infrared) at a Ground Sample Distance (GSD) of 1 meter, accompanied by comprehensive metadata such as off-nadir angles, sun elevation,...
Details →
Usage examples
-
Explore Satellogic EarthView in SageMaker Studio Lab (SMSL) by Javier Marin
-
EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision by Velázquez, Diego and Rodríguez, Pau and Alonso, Sergio and Gonfaus, Josep M. and González, Jordi and Richarte, Gerardo and Marín, Javier and Bengio, Yoshua and Lacoste, Alexandre
See 2 usage examples →
biodiversityclimatecoastalearth observationenvironmentalgeospatialglobalmachine learningmappingnatural resourcesatellite imagerysustainability
A collection of multi-resolution satellite images from both public and commercial satellites. The dataset is specifically curated for training geospatial foundation models.
Details →
Usage examples
See 2 usage examples →
disaster responseearth observationenvironmentalgeospatialsatellite imagerysynthetic aperture radar
The S1 Single Look Complex (SLC) dataset contains Synthetic Aperture Radar (SAR) data in the C-Band wavelength. The SAR sensors are installed on a two-satellite (Sentinel-1A and Sentinel-1B) constellation orbiting the Earth with a combined revisit time of six days, operated by the European Space Agency. The S1 SLC data are a Level-1 product that collects radar amplitude and phase information in all-weather, day or night conditions, which is ideal for studying natural hazards and emergency response, land applications, oil spill monitoring, sea-ice conditions, and associated climate change effec...
Details →
Usage examples
See 2 usage examples →
biodiversitybiologyecosystemsgeospatiallandlife sciencesnatural resourcesurvey
Archival soundscapes recorded in the rainforest landscapes of
Central Africa, with a focus on the vocalizations of African forest
elephants (Loxodonta cyclotis).
Details →
Usage examples
See 2 usage examples →
aerial imagerycogconservationdeep learningearth observationenvironmentalgeospatialimage processingland cover
Canopy Tree Height maps for California in 2020. Created using a deep learning model on very-high-resolution airborne imagery from the National Agriculture Imagery Program (NAIP) by the United States Department of Agriculture (USDA).
Details →
Usage examples
See 2 usage examples →
cancercomputational pathologycomputer visiondeep learninggrand-challenge.orghistopathologylife sciences
"This dataset contains the training data for the Tumor InfiltratinG lymphocytes in breast cancER or TIGER challenge. TIGER is the first challenge on fully automated assessment of tumor-infiltrating lymphocytes (TILs) in breast cancer histopathology slides. TILs are proving to be an important biomarker in cancer patients as they can play a part in killing tumor cells, particularly in some types of breast cancer. Identifying and measuring TILs can help to better target treatments, particularly immunotherapy, and may result in lower levels of other more aggressive treatments, including chemo...
Details →
Usage examples
See 2 usage examples →
geospatialsatellite imagery
The Terra Basic Fusion dataset is a fused dataset of the original Level 1 radiances
from the five Terra instruments. They have been fully validated to contain the original
Terra instrument Level 1 data. Each Level 1 Terra Basic Fusion file contains one full
Terra orbit of data and is typically 15 – 40 GB in size, depending on how much data was
collected for that orbit. It contains instrument radiance in physical units; radiance
quality indicator; geolocation for each IFOV at its native resolution; sun-view geometry;
observation time; and other attributes/metadata. It is stored in HDF5, conforms to CF
conventions, and is accessible by netCDF-4 enhanced models. Its naming convention
follows: TERRA_BF_L1B_OXXXX_YYYYMMDDHHMMSS_F000_V000.h5. A concise description of the
dataset, along with links to complete documentation and available software tools, can
be found on the Terra Fusion project page: https://terrafusion.web.illinois.edu.
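The naming convention above lends itself to a small parser. The sketch below assumes that "OXXXX" is the letter O followed by the orbit number in digits, that the 14-digit field is a UTC timestamp, and that "F000" and "V000" are three-digit fields whose meaning is not described here. The example filename is made up for illustration, and the function name is hypothetical.

```python
import re
from datetime import datetime

NAME_PATTERN = re.compile(
    r"^TERRA_BF_L1B_O(?P<orbit>\d+)_(?P<stamp>\d{14})"
    r"_F(?P<f_field>\d{3})_V(?P<v_field>\d{3})\.h5$"
)

def parse_basic_fusion_name(filename):
    """Split a Terra Basic Fusion file name into its labeled fields."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return {
        "orbit": int(match["orbit"]),
        "timestamp": datetime.strptime(match["stamp"], "%Y%m%d%H%M%S"),
        "f_field": match["f_field"],   # meaning not given in this description
        "v_field": match["v_field"],   # meaning not given in this description
    }

# Made-up file name that follows the stated pattern
info = parse_basic_fusion_name("TERRA_BF_L1B_O12345_20020315103000_F000_V001.h5")
```

Parsing the orbit and timestamp up front makes it possible to select files by date range before downloading objects that are 15 to 40 GB each.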
Terra is the flagship satellite of NASA’s Earth Observing System (EOS). It was launched
into orbit on December 18, 1999 and carries five instruments. These are the
Moderate-resolution Imaging Spectroradiometer (MODIS), the Multi-angle Imaging
SpectroRadiometer (MISR), the Advanced Spaceborne Thermal Emission and Reflection
Radiometer (ASTER), the Clouds and Earth’s Radiant Energy System (CERES), and the
Measurements of Pollution in the Troposphere (MOPITT).
The Terra Basic Fusion dataset is an easy-to-access record of the Level 1 radiances
for instruments on...
Details →
Usage examples
See 2 usage examples →
astronomy
The Transiting Exoplanet Survey Satellite (TESS) is a multi-year survey that has discovered exoplanets in orbit around bright stars across the entire sky using high-precision photometry. The survey also enables a wide variety of stellar astrophysics, solar system science, and extragalactic variability studies. More information about TESS is available at MAST and the TESS Science Support Center.
Details →
Usage examples
See 2 usage examples →
oceans
The COAWST modeling system has been used to simulate ocean, wave and sediment transport processes along the US East Coast and Gulf of Mexico. The grid has a horizontal resolution of approximately 5 km and is resolved with 16 vertical terrain-following levels. The model has been executed on a daily basis since August 2009 with outputs written every hour. This archive contains model output from 2009-08-21 to 2022-06-17.
Details →
Usage examples
-
Coupled-Ocean-Atmosphere-Wave-Sediment Transport (COAWST) Modeling System, U.S. Geological Survey Software Release, 23 April 2019 by Warner, J.C., Ganju, N.K., Sherwood, C.R., Kalra, T.S., Aretxabaleta, A., He, R., Zambon, J., and Kumar, N.
-
COAWST Explorer Notebook by Rich Signell
See 2 usage examples →
bioinformaticsbiologychemistryenzymegraphlife sciencesmoleculeproteinRDFSPARQL
The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). The UniProt consortium and host institutions EMBL-EBI, SIB Swiss Institute of Bioinformatics and PIR are committed to the long-term preservation of the UniProt databases.
Details →
Usage examples
See 2 usage examples →
atmosphereelectricitymeteorologicalmodelsustainabilityturbulenceweatherzarr
Large Eddy Simulation (LES) data of the Winds of the North Sea in 2050 (WINS50) project.
Details →
Usage examples
See 2 usage examples →
astronomyimagingsatellite imagerysurvey
The Wide-field Infrared Survey Explorer (WISE) was a NASA Medium Explorer satellite in low-Earth orbit that conducted an all-sky astronomical imaging survey over four infrared bands from 2010-2011. The 3-Band Cryo Data Release contains 3.4, 4.6 and 12 micron (W1, W2, W3) imaging data that were acquired between 6 Aug and 29 Sept 2010 while the detectors were cooled by the inner cryogen tank following the exhaustion of the outer tank.
Details →
Usage examples
See 1 usage example →
computer visionmachine learning
3D CoMPaT is a richly annotated large-scale dataset of rendered compositions of Materials on Parts of thousands of unique 3D Models.
This dataset primarily focuses on stylizing 3D shapes at part-level with compatible materials.
Each object with the applied part-material compositions is rendered from four equally spaced views as well as four randomized views.
We introduce a new task, called Grounded CoMPaT Recognition (GCR), to collectively recognize and ground compositions of materials on parts of 3D objects.
We present two variations of this task and adapt state-of-the-art 2D/3D deep learning met...
Details →
Usage examples
See 1 usage example →
autonomous vehiclescomputer visiondeep learninglidarmachine learningmappingrobotics
An open multi-sensor dataset for autonomous driving research. This dataset comprises semantically segmented images, semantic point clouds, and 3D bounding boxes. In addition, it contains unlabelled 360 degree camera images, lidar, and bus data for three sequences. We hope this dataset will further facilitate active research and development in AI, computer vision, and robotics for autonomous driving.
Details →
Usage examples
See 1 usage example →
machine learning
4,817 illustrative diagrams for research on diagram understanding and associated question answering.
Details →
Usage examples
See 1 usage example →
astronomyimagingsatellite imagerysurvey
The Wide-field Infrared Survey Explorer (WISE) was a NASA Medium Explorer satellite in low-Earth orbit that conducted an all-sky astronomical imaging survey over four infrared bands from 2010-2011. The All-Sky Release includes all data taken during the WISE full cryogenic mission phase, 7 January 2010 to 6 August 2010, in the 3.4, 4.6, 12, and 22 micron bands (i.e., W1, W2, W3, W4) that were processed with improved calibrations and reduction algorithms.
Details →
Usage examples
See 1 usage example →
astronomyimagingobject detectionparquetsatellite imagerysurvey
The Wide-field Infrared Survey Explorer (WISE) was a NASA Medium Explorer satellite in low-Earth orbit that conducted an all-sky astronomical imaging survey over four infrared bands from 2010-2011. The AllWISE Data Release combines data from all cryogenic and post-cryogenic survey phases and provides a comprehensive view of the mid-infrared sky. The Images Atlas includes 18,240 FITS image sets at 3.4, 4.6, 12 and 22 microns. The Source Catalog contains position, apparent motion, and flux information for over 747 million objects detected on the Atlas Images.
Details →
Usage examples
See 1 usage example →
electrophysiology, image processing, imaging, life sciences, Mus musculus, neurobiology, neuroimaging, signal processing
The Allen Brain Observatory – Visual Coding is a large-scale, standardized survey of physiological activity across the mouse visual cortex, hippocampus, and thalamus. It includes datasets collected with both two-photon imaging and Neuropixels probes, two complementary techniques for measuring the activity of neurons in vivo. The two-photon imaging dataset features visually evoked calcium responses from GCaMP6-expressing neurons in a range of cortical layers, visual areas, and Cre lines. The Neuropixels dataset features spiking activity from distributed cortical and subcortical brain regions, c...
Details →
Usage examples
See 1 usage example →
electrophysiology, image processing, imaging, life sciences, Mus musculus, neurobiology, neuroimaging, signal processing
The Allen Institute for Neural Dynamics (AIND) is committed to FAIR, Open, and Reproducible science. We therefore share all of the raw and derived data we collect publicly with rich metadata, including preliminary data collected during methods development, as near to the time of collection as possible.
Details →
Usage examples
See 1 usage example →
agriculture, cog, disaster response, earth observation, environmental, geospatial, satellite imagery, stac, synthetic aperture radar
The Sentinel-1 mission is a constellation of
C-band Synthetic Aperture Radar (SAR) satellites from the European Space Agency, launched beginning in 2014.
These satellites collect observations of radar backscatter intensity day or night, regardless of the
weather conditions, making them enormously valuable for environmental monitoring.
These radar data have been processed from original Ground Range Detected (GRD) scenes into a Radiometrically
Terrain Corrected, tiled product suitable for analysis. This product is available over the Contiguous United States (CONUS)
since 2017 when Sentinel-1 data becam...
Details →
Usage examples
See 1 usage example →
astronomy, machine learning, NASA SMD AI, satellite imagery
Hubble Space Telescope imaging data and associated identification labels for galaxy morphology derived from citizen scientist labels from the Galaxy Zoo: Hubble project.
Details →
Usage examples
-
Galaxy Zoo: morphological classifications for 120 000 galaxies in HST legacy imaging by Kyle W. Willett, Melanie A. Galloway, Steven P. Bamford, Chris J. Lintott, Karen L. Masters, Claudia Scarlata, B. D. Simmons, Melanie Beck, Carolin N. Cardamone, Edmond Cheung, Edward M. Edmondson, Lucy F. Fortson, Roger L. Griffith, Boris Häußler, Anna Han, Ross Hart, Thomas Melvin, Michael Parrish, Kevin Schawinski, R. J. Smethurst, Arfon M. Smith
See 1 usage example →
genetic, genomic, life sciences, vcf
Precision medicine refers to the use of prevention and treatment strategies that are tailored to the unique features of each individual and their disease. In the context of cancer this might involve the identification of specific mutations shown to predict response to a targeted therapy. The biomedical literature describing these associations is large and growing rapidly. Currently these interpretations exist largely in private or encumbered databases resulting in extensive repetition of effort. Realizing precision medicine will require this information to be centralized, debated and interpret...
Details →
Usage examples
See 1 usage example →
amazon.science, bioinformatics, health, life sciences, natural language processing, us
DE-SynPUF is provided here as 1,000-person (1k), 100,000-person (100k), and 2,300,000-person (2.3m) data sets in the OMOP Common Data Model format. The DE-SynPUF was created with the goal of providing a realistic set of claims data in the public domain while providing the very highest degree of protection to the Medicare beneficiaries’ protected health information. The purposes of the DE-SynPUF are to:
- allow data entrepreneurs to develop and create software and applications that may eventually be applied to actual CMS claims data;
- train researchers on the use and complexity of conducting analyses with CMS claims data prior to initiating the process to obtain access to actual CMS data; and,
- support safe data mining innovations that may reveal unan...
Details →
Usage examples
See 4 usage examples →
bam, bioinformatics, biology, coronavirus, COVID-19, cram, fastq, genetic, genomic, health, life sciences, MERS, SARS, STRIDES, transcriptomics, virus, whole genome sequencing
This repository within the ACTIV TRACE initiative houses a comprehensive collection of datasets related to SARS-CoV-2. The processing of SARS-CoV-2 Sequence Read Archive (SRA) files has been optimized to identify genetic variations in viral samples. This information is then presented in the Variant Call Format (VCF). Each VCF file corresponds to the SRA parent-run's accession ID. Additionally, the data is available in the parquet format, making it easier to search and filter using the Amazon Athena Service. The SARS-CoV-2 Variant Calling Pipeline is designed to handle new data every six ho...
Details →
Usage examples
See 1 usage example →
coronavirus, COVID-19, life sciences, MERS, SARS
Full-text and metadata dataset of COVID-19 and coronavirus-related research articles optimized for machine readability.
Details →
Usage examples
See 1 usage example →
atmosphere, climate, climate model, earth observation, geoscience, geospatial, meteorological, simulations, weather, zarr
Downscaled future and historical climate projections for California and her environs in support of California's Fifth Climate Assessment
Details →
Usage examples
See 1 usage example →
encyclopedic, internet, natural language processing
A corpus of web screenshots and metadata covering over 70 million websites.
Details →
Usage examples
See 1 usage example →
atmosphere, climate, climate model, geospatial, ice, land, model, oceans, sustainability
Data from ARISE-SAI Experiments with CESM2
Details →
Usage examples
See 1 usage example →
agriculture, computer vision, machine learning
Dataset associated with the March 2021 Frontiers in Robotics and AI paper "Broad Dataset and Methods for Counting and Localization of On-Ear Corn Kernels", DOI: 10.3389/frobt.2021.627009
Details →
Usage examples
See 1 usage example →
climate, coastal, disaster response, environmental, meteorological, oceans, sustainability, water, weather
The University of Wisconsin Probabilistic Downscaling (UWPD) is a statistically downscaled dataset based on the Coupled Model Intercomparison Project Phase 5 (CMIP5) climate models. UWPD consists of three variables: daily precipitation, maximum temperature, and minimum temperature. The spatial resolution is 0.1° x 0.1° for the United States and southern Canada east of the Rocky Mountains.
The downscaling methodology is not deterministic. Instead, to properly capture unexplained variability and extreme events, the methodology predicts a spatially and temporally varying Probability Density Function (PDF) for each variable. Statistics such as the mean, mean PDF and annual maximum statistics can be calculated directly from the daily PDF and these statistics are included in the dataset. In addition, “standard”, “raw” data is created by randomly sampling from the PDFs to create a “realization” of the local scale given the large-scale from the climate model. There are 3 realizations for temperature and 14 realizations for precipitation.
...
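The sampling idea described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each daily PDF is discretized into a few candidate values with probabilities; the actual UWPD PDF representation, file layout, and variable names are not given here, so `daily_pdfs`, `pdf_mean`, and `sample_realization` are hypothetical.

```python
import random

def pdf_mean(values, probs):
    """Mean of a discretized PDF: sum of value * probability."""
    return sum(v * p for v, p in zip(values, probs))

def sample_realization(daily_pdfs, rng):
    """Draw one value per day from that day's discretized PDF."""
    return [rng.choices(values, weights=probs, k=1)[0]
            for values, probs in daily_pdfs]

# Hypothetical discretized precipitation PDFs (mm/day) for three days.
# These numbers are invented for the example, not taken from UWPD.
daily_pdfs = [
    ([0.0, 1.0, 5.0, 20.0], [0.70, 0.20, 0.08, 0.02]),
    ([0.0, 2.0, 10.0, 40.0], [0.40, 0.35, 0.20, 0.05]),
    ([0.0, 0.5, 3.0, 15.0], [0.85, 0.10, 0.04, 0.01]),
]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# The description mentions 14 realizations for precipitation.
realizations = [sample_realization(daily_pdfs, rng) for _ in range(14)]
daily_means = [pdf_mean(values, probs) for values, probs in daily_pdfs]
```

Each realization is one plausible daily sequence, while statistics such as the daily mean come straight from the PDF and do not depend on any particular draw.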
Details →
Usage examples
See 1 usage example →
earth observation, oceans
Community provided bathymetry data collected in collaboration with the International Hydrographic Organization.
Details →
Usage examples
See 1 usage example →
air temperature, atmosphere, forecast, meteorological, model, near-surface air temperature, near-surface relative humidity, near-surface specific humidity, ocean circulation, ocean currents, ocean sea surface height, ocean simulation, ocean velocity, oceans, time series forecasting, weather
DMI forecast data consist of output from various models, each containing a different set of parameters relating to a specific domain such as ocean (WAM), storm flooding (DKSS), or weather (HARMONIE).
Details →
Usage examples
See 1 usage example →
earth observation, geospatial, solar, space weather
The United States Air Force (USAF) Defense Meteorological Satellite Program (DMSP) SSJ precipitating particle instrument measures in-situ total flux and energy distribution of electrons and ions at low earth orbit. These precipitating particles are of interest for space weather operations and research, in part because they produce aurora during normal and very strong geomagnetic storms. This dataset contains both sensor-level raw data (as detailed in Redmon et al. 2017) and a high-level machine-learning-ready data product.
Details →
Usage examples
See 1 usage example →
machine learning, natural language processing
The DROP dataset contains 96k Question and Answer pairs (QAs) over 6.7K paragraphs, split between train (77k QAs), development (9.5k QAs) and a hidden test partition (9.5k QAs).
Details →
Usage examples
See 1 usage example →
atmosphere, electricity, meteorological, model, sustainability, weather
ERA5 reanalysis data on AWS, preprocessed for use with the Weather Research and Forecasting (WRF) model.
Details →
Usage examples
See 1 usage example →
archives, internet, natural language processing, web archive
The End of Term Web Archive (EOT) captures and saves U.S. Government websites at the end of presidential administrations. The EOT has thus far preserved websites from administration changes in 2008, 2012, 2016, and 2020. Data from these web crawls have been made openly available in several formats in this dataset.
Details →
Usage examples
See 1 usage example →
atmosphere, meteorological, near-surface air temperature, netcdf, precipitation
EM-Earth provides data for precipitation, mean air temperature, air temperature range, and dew-point temperature at 0.1° spatial resolution over global land areas from 1950 to 2019. EM-Earth provides hourly/daily deterministic estimates, and daily probabilistic estimates (25 ensemble members), to meet the diverse requirements of hydrometeorological applications.
Details →
Usage examples
See 1 usage example →
demographics, geospatial, urban
This bucket contains multiple datasets (as Quilt packages) created by the
Center for Geospatial Sciences (CGS) at the University of California-Riverside.
The data in this bucket contains the following:
- Tabular and geographic data from the US Census
- Land Cover imagery collected from the Multi-Resolution Land Characteristics Consortium
- Road network data processed from OpenStreetMap
Details →
Usage examples
See 1 usage example →
biodiversity, bioinformatics, conservation, earth observation, life sciences
The Global Biodiversity Information Facility (GBIF) is an international network and data infrastructure funded by the world's governments providing global data that document the occurrence of species. GBIF currently integrates datasets documenting over 1.6 billion species occurrences, growing daily. The GBIF occurrence dataset combines data from a wide array of sources including specimen-related data from natural history museums, observations from citizen science networks and environment recording schemes. While these data are constantly changing at GBIF.org, periodic snapshots are taken a...
Details →
Usage examples
See 1 usage example →
aerial imagery, demographics, disaster response, geospatial, image processing, machine learning, population, satellite imagery
Population data for a selection of countries, allocated to 1 arcsecond blocks and provided in a combination of CSV
and Cloud-optimized GeoTIFF files. This refines CIESIN’s Gridded Population of the World
using machine learning models on high-resolution worldwide Maxar satellite imagery. CIESIN population counts aggregated from worldwide census
data are allocated to blocks where imagery appears to contain buildings.
Details →
Usage examples
See 1 usage example →
computational fluid dynamics, green aviation, low-pressure turbine, turbulence
The archive comprises snapshot, point-probe, and time-average data produced via a high-fidelity computational simulation of turbulent air flow over a low pressure turbine blade, which is an important component in a jet engine. The simulation was undertaken using the open source PyFR flow solver on over 5000 Nvidia K20X GPUs of the Titan supercomputer at Oak Ridge National Laboratory under an INCITE award from the US DOE. The data can be used to develop an enhanced understanding of the complex three-dimensional unsteady air flow patterns over turbine blades in jet engines. This could in turn le...
Details →
Usage examples
See 1 usage example →
cancer, genomic, life sciences, STRIDES, whole genome sequencing
The Human Cancer Models Initiative (HCMI) is an international consortium that is generating novel,
next-generation, tumor-derived culture models annotated with genomic and clinical data.
HCMI-developed models and related data are available as a community resource. The NCI is
contributing to the initiative by supporting four Cancer Model Development Centers (CMDCs). CMDCs
are tasked with producing next-generation cancer models from clinical samples. The cancer models
include tumor types that are rare, originate from patients from underrepresented populations, lack
precision therapy, or lack ca...
Details →
Usage examples
See 1 usage example →
cram, fast5, fastq, genetic, genomic, life sciences
This dataset includes sequencing data, assemblies, and analyses for the offspring of ten parent-offspring trios.
Details →
Usage examples
See 1 usage example →
agriculture, cropland partitioning, irrigated cropland, land cover, land use, rainfed cropland
A framework integrating the Budyko model has been developed to distinguish between rainfed and irrigated cropland areas across Africa. This expands on remote sensing land cover products available for agricultural water studies in Africa and thereby helps address the need for deeper insights into cropland patterns. Validation against an independent dataset revealed an overall accuracy of 73% with high precision and specificity scores. These results validate the framework’s effectiveness in identifying irrigated areas while minimizing errors in misclassifying rainfed areas as irrigated.
Details →
Usage examples
-
A framework for disaggregating remote-sensing cropland into rainfed and irrigated classes at continental scale by Owusu, A., Kagone, S., Leh, M., Velpuri, N. M., Gumma, M. K., Ghansah, B., Thilina-Prabhath, P., Akpoti, K., Mekonnen, K., Tinonetsana, P., & Mohammed, I.
-
Rainfed and Irrigated Cropland Areas for Africa by Owusu, A., Kagone, S., Leh, M., and Velpuri, N.M.
-
Cropland percentage by iwmiwaplus
-
Water use in Awash basin by A. Owusu, K. Akpoti, M. Leh, N. Velpuri
See 4 usage examples →
computer vision, deep learning, machine learning
Some of the most important datasets for image classification research, including
CIFAR 10 and 100, Caltech 101, MNIST, Food-101, Oxford-102-Flowers, Oxford-IIIT-Pets,
and Stanford-Cars. This is part of the fast.ai datasets collection hosted by
AWS for convenience of fast.ai students. See documentation link for citation and
license details for each dataset.
Details →
Usage examples
See 1 usage example →
agriculture, disaster response, earth observation, geospatial, meteorological, satellite imagery, weather
The Geo-KOMPSAT-2A (GK2A) is the new-generation geostationary meteorological satellite (located at 128.2°E) of the Korea Meteorological Administration (KMA). The main mission of the GK2A is to observe atmospheric phenomena over the Asia-Pacific region. The Advanced Meteorological Imager (AMI) on GK2A scans the Earth full disk every 10 minutes and the Korean Peninsula area every 2 minutes with a high spatial resolution of 4 visible channels and 12 infrared channels. In addition, the AMI is capable of flexible target-area scanning useful for monitoring severe weather events such as typhoon...
Details →
Usage examples
See 1 usage example →
astronomy, imaging, survey
These data correspond to the International LOFAR Telescope observations of the sky field ELAIS-N1 (16:10:01 +54:30:36) during cycle 2 of observations. There are 11 runs of about 8 hours each plus the corresponding observation of the calibration targets before and after the target field. The data are measurement sets (MS) containing the cross-correlated data and metadata divided into 371 frequency sub-bands per target centred at ~150 MHz.
Details →
Usage examples
See 1 usage example →
analytics, blockchain, climate, commerce, copyright monitoring, csv, financial markets, governance, government spending, json, market data, socioeconomic, statistics, transparency, xml
The Legal Entity Identifier (LEI) is a 20-character, alpha-numeric code based on the ISO 17442 standard developed by the International Organization for Standardization (ISO). It connects to key reference information that enables clear and unique identification of legal entities participating in financial transactions. Each LEI contains information about an entity’s ownership structure and thus answers the questions of ‘who is who’ and ‘who owns whom’. Simply put, the publicly available LEI data pool can be regarded as a global directory, which greatly enhances transparency in the global ma...
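As a rough illustration of the 20-character format, the sketch below checks the shape of a candidate LEI. The description above only states the length and character set; the check-digit step (the last two characters computed with ISO 7064 MOD 97-10) is an assumption drawn from general knowledge of ISO 17442, and the helper names and the sample prefix are hypothetical.

```python
def _to_digits(code: str) -> str:
    """Map 0-9 to themselves and A-Z to 10-35, then concatenate."""
    return "".join(str(int(ch, 36)) for ch in code)

def lei_check_digits(prefix18: str) -> str:
    """Two check digits for an 18-character prefix (assumed MOD 97-10)."""
    remainder = int(_to_digits(prefix18.upper() + "00")) % 97
    return f"{98 - remainder:02d}"

def looks_like_lei(code: str) -> bool:
    """Format check only: 20 ASCII alphanumerics ending in valid check digits."""
    code = code.strip().upper()
    if len(code) != 20 or not code.isascii() or not code.isalnum():
        return False
    if not code[18:].isdigit():
        return False
    return int(_to_digits(code)) % 97 == 1

# Hypothetical 18-character prefix, made up for this example.
prefix = "5493001KJTIIGC8Y1R"
candidate = prefix + lei_check_digits(prefix)
is_well_formed = looks_like_lei(candidate)
```

A passing check says only that the string is well formed; whether the identifier is actually registered has to be looked up in the LEI data itself.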
Details →
Usage examples
See 1 usage example →
aerial imagery, agriculture, computer vision, deep learning, machine learning
Dataset associated with the 2021 AAAI paper "Detection and Prediction of Nutrient Deficiency Stress using Longitudinal Aerial Imagery". The dataset contains 3 image sequences of aerial imagery from 386 farm parcels which have been annotated for nutrient deficiency stress.
Details →
Usage examples
See 1 usage example →
autonomous vehicles, computer vision, deep learning, GPS, IMU, lidar, logistics, machine learning, object detection, object tracking, perception, radar, robotics, transportation
A large scale multimodal dataset for Autonomous Trucking. Sensor data was recorded with a heavy truck from MAN equipped with 6 lidars, 6 radars, 4 cameras and a high-precision GNSS. MAN TruckScenes allows the research community to come into contact with truck-specific challenges, such as trailer occlusions, novel sensor perspectives, and terminal environments for the first time.
It comprises more than 740 scenes of 20s each within a multitude of different environmental conditions. Bounding boxes are available for 27 object classes, 15 attributes, and a range of more than 230m. The scenes are t...
Details →
Usage examples
-
MANTruckScenes: A multimodal dataset for autonomous trucking in diverse conditions by Felix Fent, Fabian Kuttenreich, Florian Ruch, Farija Rizwin, et al
-
TruckScenes devkit tutorial by Felix Fent, Fabian Kuttenreich, Florian Ruch, Farija Rizwin
-
TruckScenes devkit by Felix Fent, Fabian Kuttenreich, Florian Ruch, Farija Rizwin
-
PyPi package by Felix Fent, Fabian Kuttenreich, Florian Ruch, Farija Rizwin
See 4 usage examples →
agriculture, disaster response, geospatial, natural resource, satellite imagery
Data from the Moderate Resolution Imaging Spectroradiometer (MODIS), managed by
the U.S. Geological Survey and NASA. Five products are included:
MCD43A4 (MODIS/Terra and Aqua Nadir BRDF-Adjusted Reflectance Daily L3 Global 500 m SIN Grid),
MOD11A1 (MODIS/Terra Land Surface Temperature/Emissivity Daily L3 Global 1 km SIN Grid),
MYD11A1 (MODIS/Aqua Land Surface Temperature/Emissivity Daily L3 Global 1 km SIN Grid),
MOD13A1 (MODIS/Terra Vegetation Indices 16-Day L3 Global 500 m SIN Grid),
and MYD13A1 (MODIS/Aqua Vegetation Indices 16-Day L3 Global 500 m SIN Grid).
MCD43A4 has global coverage, all...
Details →
Usage examples
See 1 usage example →
analytics, archives, deep learning, machine learning, NASA SMD AI, planetary
NASA missions like the Curiosity and Perseverance rovers carry a rich array of instruments suited to collect data and build evidence towards answering if Mars ever had livable environmental conditions. These rovers can collect rock and soil samples and can take measurements that can be used to determine their chemical makeup.
Because communication between rovers and Earth is severely constrained, with limited transfer rates and short daily communication windows, scientists have a limited time to analyze the data and make difficult inferences about the chemistry in order to prioritize the next operations and send those instructions back to the rover.
This project aimed to build a model that automatically analyzes gas chromatography mass spectrometry (GCMS) data collected for Mars exploration, in order to help scientists analyze and understand the past habitability of Mars.
More information is available at https://mars.nasa.gov/msl/spacecraft/instruments/sam/ and the data from Mars are available and described at https://pds-geosciences.wustl.edu/missions/msl/sam.htm.
We request that you cite th...
Details →
Usage examples
See 1 usage example →
analytics, archives, deep learning, machine learning, NASA SMD AI, planetary
NASA missions like the Curiosity and Perseverance rovers carry a rich array of instruments suited to collect data and build evidence towards answering if Mars ever had livable environmental conditions. These rovers can collect rock and soil samples and can take measurements that can be used to determine their chemical makeup.
Because communication between rovers and Earth is severely constrained, with limited transfer rates and short daily communication windows, scientists have a limited time to analyze the data and make difficult inferences about the chemistry in order to prioritize the next operations and send those instructions back to the rover.
This project aimed to build a model that automatically analyzes evolved gas analysis mass spectrometry (EGA-MS) data collected for Mars exploration, in order to help scientists analyze and understand the past habitability of Mars.
More information is available at https://mars.nasa.gov/msl/spacecraft/instruments/sam/ and the data from Mars are available and described at https://pds-geosciences.wustl.edu/missions/msl/sam.htm.
We request that you ci...
Details →
Usage examples
See 1 usage example →
atmosphere, climate, climate model, CMIP6, geospatial, ice, land, model, oceans, sustainability
Data from the UK Earth System Model (UKESM1) ARISE-SAI experiment. The UKESM1 ARISE-SAI experiment explores the impacts of geoengineering via the injection of sulphur dioxide (SO2) into the stratosphere in order to keep global mean surface air temperature near 1.5 °C above the pre-industrial climate. Data includes a five-member ensemble of simulations with SO2 injection plus a five-member ensemble of SSP2-4.5 simulations from CMIP6 to serve as a reference data set.
Details →
Usage examples
See 1 usage example →
autonomous vehicles, computer vision, deep learning, event camera, global shutter camera, GNSS, GPS, h5, hdf5, IMU, lidar, machine learning, perception, robotics, RTK
M3ED is the first multi-sensor event camera (EC) dataset focused on high-speed dynamic motions in robotics applications. M3ED provides high-quality synchronized data from multiple platforms (car, legged robot, UAV), operating in challenging conditions such as off-road trails, dense forests, and performing aggressive flight maneuvers. M3ED also covers demanding operational scenarios for EC, such as high egomotion and multiple independently moving objects. M3ED includes high-resolution stereo EC (1280×720), grayscale and RGB cameras, a high-quality IMU, a 64-beam LiDAR, and RTK localization.
Details →
Usage examples
See 1 usage example →
natural language processing
MultiCoNER 1 is a large multilingual dataset (11 languages) for Named Entity Recognition. It is designed to represent some of the contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions. MultiCoNER 2 is a large multilingual dataset (12 languages) for fine-grained Named Entity Recognition. Its fine-grained taxonomy contains 36 NE classes, representing real-world challenges for NER where, apart from the surface form of a named entity, context plays a critical role in disti...
Details →
Usage examples
-
Dynamic Gazetteer Integration in Multilingual Models for Cross-Lingual and Cross-Domain Named Entity Recognition by Besnik Fetahu, Anjie Fang, Oleg Rokhlenko and Shervin Malmasi
-
MultiCoNER: A Large-scale Multilingual Dataset for Complex Named Entity Recognition by Shervin Malmasi, Anjie Fang, Besnik Fetahu, Sudipta Kar, Oleg Rokhlenko
-
GEMNET: Effective Gated Gazetteer Representations for Recognizing Complex Entities in Low-context Input by Tao Meng, Anjie Fang, Oleg Rokhlenko and Shervin Malmasi
-
Gazetteer Enhanced Named Entity Recognition for Code-Mixed Web Queries by Besnik Fetahu, Anjie Fang, Oleg Rokhlenko and Shervin Malmasi
See 4 usage examples →
education, geospatial, infrastructure, schools
This database provides estimates of the walking travel time of school-aged populations to schools recorded in OpenStreetMap. Population counts of male and female students are sorted into three travel-time groups: under 30 minutes, 30-60 minutes, and over 60 minutes. It covers the African continent and is aggregated by first-level administrative divisions.
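The three travel-time groups can be sketched as a simple binning step. This is an illustration only: the function names, the record layout, and the treatment of the exact 30- and 60-minute boundaries (both placed in the middle group here) are assumptions, and the sample numbers are invented rather than taken from the database.

```python
GROUPS = ("under 30 minutes", "30-60 minutes", "over 60 minutes")

def travel_time_group(minutes: float) -> str:
    """Assign a walking time in minutes to one of the three groups."""
    if minutes < 0:
        raise ValueError("travel time cannot be negative")
    if minutes < 30:
        return GROUPS[0]
    if minutes <= 60:  # assumption: 30 and 60 both fall in the middle group
        return GROUPS[1]
    return GROUPS[2]

def population_by_group(records):
    """Sum student counts per group from (walking_minutes, count) pairs."""
    totals = {group: 0 for group in GROUPS}
    for minutes, count in records:
        totals[travel_time_group(minutes)] += count
    return totals

# Invented sample records for one administrative division.
sample = [(12.0, 400), (30.0, 150), (45.5, 90), (75.0, 60)]
summary = population_by_group(sample)
```

Running the same aggregation separately for male and female counts would give the per-group totals the description refers to.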
Details →
Usage examples
See 1 usage example →
astronomy, imaging, satellite imagery, survey
The Wide-field Infrared Survey Explorer (WISE) was a NASA Medium Explorer satellite in low-Earth orbit that conducted an all-sky astronomical imaging survey over four infrared bands from 2010 to 2011. The NEOWISE Post-Cryo Data Release contains 3.4 and 4.6 micron (W1 and W2) imaging data that were acquired between 29 September 2010 and 1 February 2011 following the exhaustion of the inner and outer cryogen tanks.
Details →
Usage examples
See 1 usage example →
astronomy, imaging, object detection, parquet, satellite imagery, survey
The Near-Earth Object Wide-field Infrared Survey Explorer (NEOWISE) is a NASA Medium-class Explorer satellite in low-Earth orbit conducting an all-sky astronomical imaging survey over two infrared bands. The NEOWISE Reactivation mission began in 2013 when the original WISE satellite was brought out of hibernation to learn more about the population of near-Earth objects and comets that could pose an impact hazard to the Earth. The data is also used to study a wide range of astrophysical phenomena in the time domain including brown dwarfs, supernovae and active galactic nuclei.
Details →
Usage examples
See 1 usage example →
climate, disaster response, elevation, geospatial, lidar, stac
Lidar (light detection and ranging) is a technology that can measure the 3-dimensional location of objects, including the solid earth surface. The data consists of a point cloud of the positions of solid objects that reflected a laser pulse, typically from an airborne platform. In addition to the position, each point may also be attributed with the type of object it reflected from, the intensity of the reflection, and other system-dependent metadata. The NOAA Coastal Lidar Data is a collection of lidar projects from many different sources and agencies, geographically focused on the coastal areas of the United States of America. The data is provided in Entwine Point Tiles (EPT; https://entwine.io) format, which is a lossless streamable octree of the point cloud, and in LAZ format. Datasets are maintained in their original projects and care should be taken when merging projects. The coordinate reference system for the data is the NAD83(2011) UTM zone appropriate for the center of each data set for EPT and geographic coordinates for LAZ. Vertically they are in the orthometric datum appropriate for that area (for example, NAVD88 in the mainland United States, PRVD02 in Puerto Rico, or GUVD03 in Guam). The geoid model used is reflected in the data set resource name.
The ...
Details →
Usage examples
See 1 usage example →
agriculture, climate, environmental, natural resource, regulatory, weather
Global Surface Summary of the Day is derived from the Integrated Surface Hourly (ISH) dataset. The ISH dataset includes global data obtained from the USAF Climatology Center, located in the Federal Climate Complex with NCDC. The latest daily summary data are normally available 1-2 days after the date-time of the observations used in the daily summaries. The online data files begin with 1929 and are at the time of this writing at the Version 8 software level. Over 9000 stations' data are typically available. The daily elements included in the dataset (as available from each station) are:
Mean temperature (.1 Fahrenheit)
Mean dew point (.1 Fahrenheit)
Mean sea level pressure (.1 mb)
Mean station pressure (.1 mb)
Mean visibility (.1 miles)
Mean wind speed (.1 knots)
Maximum sustained wind speed (.1 knots)
Maximum wind gust (.1 knots)
Maximum temperature (.1 Fahrenheit)
Minimum temperature (.1 Fahrenheit)
Precipitation amount (.01 inches)
Snow depth (.1 inches)
Indicator for occurrence of: Fog, Rain or Drizzle, Snow or Ice Pellets, Hail, Thunder, Tornado/Funnel Cloud.
G
...
Details →
Usage examples
See 1 usage example →
agriculture, climate, meteorological, weather
The Integrated Surface Database (ISD) consists
of global hourly and synoptic observations
compiled from numerous sources into a gzipped
fixed width format. ISD was developed as a joint
activity within Asheville's Federal Climate
Complex. The database includes over 35,000 stations
worldwide, with some having data as far back
as 1901, though the data show a substantial
increase in volume in the 1940s and again in
the early 1970s. Currently, there are over
14,000 "active" stations updated daily in the
database. The total uncompressed data volume is
around 600 gigabytes; however, it ...
Details →
Usage examples
See 1 usage example →
agriculture, climate, meteorological, weather
Please note NWS is Soliciting Comments until April 30, 2024 on Availability of Probabilistic Snow Grids for Select Weather Forecast Offices (WFOs) as an Experimental Element in the National Digital Forecast Database (NDFD) for the Contiguous United States (CONUS). A PDF version of the Public Notice can be found "HERE"
The National Digital Forecast Database (NDFD) is a suite of gridded forecasts of sensible weather elements (e.g., cloud cover, maximum temperature). Forecasts prepared by NWS field offices working in collaboration with the National Centers for Environmental Predictio...
Details →
Usage examples
See 1 usage example →
agriculture, climate, disaster response, environmental, transportation, weather
The National Water Model (NWM) is a water resources model that simulates and forecasts water
budget variables, including snowpack, evapotranspiration, soil moisture and streamflow, over
the entire continental United States (CONUS). The model, launched in August 2016, is designed
to improve the ability of NOAA to meet the needs of its stakeholders (forecasters, emergency
managers, reservoir operators, first responders, recreationists, farmers, barge operators, and
ecosystem and floodplain managers) by providing expanded accuracy, detail, and frequency of water
information. It is operated by NOA...
Details →
Usage examples
See 1 usage example →
agriculture, climate, meteorological, sustainability, weather
The U.S. Climate Normals are a large suite of data products that provide information about typical climate conditions for thousands of locations across the United States. Normals act both as a ruler to compare today’s weather and tomorrow’s forecast, and as a predictor of conditions in the near future. The official normals are calculated for a uniform 30-year period, and consist of annual/seasonal, monthly, daily, and hourly averages and statistics of temperature, precipitation, and other climatological variables from almost 15,000 U.S. weather stations.
NCEI generates the official U.S. norma
...
Details →
Usage examples
See 1 usage example →
agriculture, climate, disaster response, environmental, oceans, transportation, weather
NOAA's Coastal Ocean Reanalysis (CORA) for the Gulf of Mexico and East Coast (GEC) is produced using verified hourly water levels from the Center of Operational Oceanographic Products & Services (CO-OPS), through hydrodynamic modeling from Advanced Circulation "ADCIRC" and Simulating WAves Nearshore "SWAN" models. Data are assimilated, processed, corrected, and processed again before quality assurance and skill assessment with additional verified tide station-based observations.
Details for CORA Dataset
Timeseries - 1979 to 2022
Size - Approx. 20.5TB
Domain - Lat 5.8 to 45.8 ; Long -98.0 to -53.8
Nodes - 1813443 centroids, 3564104 elements
Grid cells - Currently approximately 505
Spatial Resolution ...
Details →
Usage examples
See 1 usage example →
climateenvironmentaloceansweather
The mission of the Ocean Climate Stations (OCS) Project is to make meteorological and
oceanic measurements from autonomous platforms. Calibrated, quality-controlled, and well-documented
climatological measurements are available on the OCS webpage and the OceanSITES Global Data
Assembly Centers (GDACs), with near-realtime data available prior to release of the complete,
downloaded datasets.
OCS measurements served through the Big Data Program come from OCS high-latitude moored buoys located in the Kuroshio
Extension (32°N 145°E) and the Gulf of Alaska (50°N 145°W). Initiated in 2004 and 20
...
Details →
Usage examples
See 1 usage example →
atmosphereclimatedata assimilationforecastgeosciencegeospatiallandmeteorologicalmodelnetcdfweather
NSF NCAR is providing a NetCDF-4 structured version of the 0.25 degree atmospheric ECMWF Reanalysis 5 (ERA5) to the AWS ODSP. ERA5 is produced using high-resolution forecasts (HRES) at 31 kilometer resolution (one fourth the spatial resolution of the operational model) and a 62 kilometer resolution ten member 4D-Var ensemble of data assimilation (EDA) in CY41r2 of ECMWF's Integrated Forecast System (IFS) with 137 hybrid sigma-pressure (model) levels in the vertical, up to a top level of 0.01 hPa. Atmospheric data on these levels are interpolated to 37 pressure levels (the same levels as in...
Details →
Usage examples
See 1 usage example →
biologycancercomputer visionhealthimage processingimaginglife sciencesmachine learningmagnetic resonance imagingmedical imagingmedicineneurobiologyneuroimagingsegmentation
This dataset contains 8,000+ brain MRIs of 2,000+ patients with brain metastases.
Details →
Usage examples
See 1 usage example →
earth observationgeospatialglobalmappingpopulationtiles
Natural Earth is a public domain map dataset available at 1:10m, 1:50m, and 1:110 million scales. Featuring tightly integrated vector and raster data, with Natural Earth you can make a variety of visually pleasing, well-crafted maps with cartography or GIS software.
Details →
Usage examples
See 1 usage example →
aerial imagerycogearth observationgeospatialimagingmapping
The New Jersey Office of GIS, NJ Office of Information Technology manages a series of 11 digital orthophotography and scanned aerial photo maps collected in various years ranging from 1930 to 2017. Each year’s imagery is available as Cloud Optimized GeoTIFF (COG) files, and some years are also available as compressed MrSID and/or JP2 files. Additionally, each year of imagery is organized into a tile grid scheme covering the entire geography of New Jersey. Many years share the same tiling grid, while others have unique grids as defined by the project at the time.
Details →
Usage examples
See 1 usage example →
elevationgeospatiallidarmapping
Elevation datasets in New Jersey have been collected over several years as several
discrete projects. Each project covers a geographic area, which is a subsection of
the entire state, and has differing specifications based on the available technology
at the time and project budget. The geographic extent of one project may overlap that
of a neighboring project. Each of the 18 projects contains deliverable products such
as LAS (Lidar point cloud) files, unclassified/classified, tiled to cover project area;
relevant metadata records or documents, most adhering to the Federal Geographic Data
Com...
Details →
Usage examples
See 1 usage example →
Homo sapiensimage processingimaginglife sciencesmagnetic resonance imagingsignal processing
OCMR is an open-access repository that provides multi-coil k-space data for cardiac cine. The fully sampled MRI datasets are intended for quantitative comparison and evaluation of image reconstruction methods. The free-breathing, prospectively undersampled datasets are intended to evaluate their performance and generalizability qualitatively.
Details →
Usage examples
See 1 usage example →
astronomyimagingobject detectionparquetsatellite imagerysimulationssurvey
This release consists of simulated data products designed to mimic observations of the same region of the sky as seen by two astronomical facilities: the Nancy Grace Roman Telescope and the Vera C. Rubin Observatory.
Details →
Usage examples
See 1 usage example →
biodiversitybiologycoastalconservationdeep learningecosystemsenvironmentalgeospatiallabeledmachine learningmappingoceansopen source softwaresignal processing
Live-streamed and archived audio data (~2018-present) from underwater microphones (hydrophones) containing marine biological signals as well as ambient ocean noise. Hydrophone placement and passive acoustic monitoring effort prioritizes detection of orca sounds (calls, clicks, whistles) and potentially harmful noise. Geographic focus is on the US/Canada critical habitat of Southern Resident killer whales (northern CA to central BC) with initial focus on inland waters of WA. In addition to the raw lossy or lossless compressed data, we provide a growing archive of annotated bioacoustic bouts.
Details →
Usage examples
See 1 usage example →
bioinformaticsbiologyfast5fastqgenomicHomo sapienslife scienceswhole genome sequencing
The ont-open-data registry provides reference sequencing data from Oxford Nanopore Technologies to support: 1) exploration of the characteristics of nanopore sequence data; 2) assessment and reproduction of performance benchmarks; and 3) development of tools and methods. The deposited data showcase DNA sequences from a representative subset of sequencing chemistries. The datasets correspond to publicly available reference samples (e.g. Genome In A Bottle reference cell lines). Raw data are provided with metadata and scripts to describe sample and data provenance.
Details →
Usage examples
See 1 usage example →
agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystacsustainabilitysynthetic aperture radar
The 25 m PALSAR-2 ScanSAR is normalized backscatter data of PALSAR-2 broad area observation mode with observation width of 350 km.
The SAR imagery was ortho-rectified and slope corrected using the ALOS World 3D - 30 m (AW3D30) Digital Surface Model.
Polarization data are stored as 16-bit digital numbers (DN).
The DN values can be converted to gamma naught values in decibel unit (dB) using the following equation:
γ0 = 10*log10(DN^2) - 83.0 dB
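The conversion can be sketched in a few lines of Python, with DN squared inside the logarithm and the -83.0 dB offset taken from the equation. This is a minimal illustration, not the data provider's code: the function name and the choice to return None for non-positive DN values are assumptions made here.

```python
import math

# Offset in dB, as quoted in the dataset description.
CALIBRATION_FACTOR_DB = -83.0


def dn_to_gamma0_db(dn):
    """Convert a digital number (DN) to gamma naught in dB.

    Assumption: a non-positive DN is treated as "no data" and mapped to
    None, because the logarithm is undefined there.
    """
    if dn <= 0:
        return None
    return 10.0 * math.log10(dn ** 2) + CALIBRATION_FACTOR_DB


# Illustrative DN values only, not taken from a real scene.
sample_dns = [0, 1, 1000, 10000]
gamma0_values = [dn_to_gamma0_db(dn) for dn in sample_dns]
```

For whole rasters the same arithmetic would normally be applied array-wise, masking zero-valued pixels before taking the logarithm.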
CARD4L stands for CEOS Analysis Ready Data for Land; the Level 2.2 data are ortho-rectified and radiometrically terrain-corrected.
This datase...
Details →
Usage examples
See 1 usage example →
agriculturecogdeafricadisaster responseearth observationgeospatialnatural resourcesatellite imagerystacsustainabilitysynthetic aperture radar
Torrential rainfall triggered flooding and landslides in many parts of Rwanda. The hardest-hit districts were Ngororero, Rubavu, Nyabihu, Rutsiro and Karongi. According to reports, 14 people have died in Karongi, 26 in Rutsiro, 18 in Rubavu, 19 in Nyabihu and 18 in Ngororero. Rwanda National Police reported that the Mukamira-Ngororero and Rubavu-Rutsiro roads are impassable due to flooding and landslide debris. UNITAR on behalf of United Nations Office for the Coordination of Humanitarian Affairs (OCHA) / Regional Office for Southern & Eastern Africa in cooperation with Rwanda Space Agency ...
Details →
Usage examples
See 1 usage example →
agriculturecogdisaster responseearth observationgeospatialnatural resourcesatellite imagerystacsustainabilitysynthetic aperture radar
Tropical Cyclone Mocha began to form in the Bay of Bengal on 11 May 2023 and continues to intensify as it moves towards Myanmar and Bangladesh. Cyclone Mocha is the first storm to form in the Bay of Bengal this year and is expected to hit several coastal areas in Bangladesh on 14 May with wind speeds of up to 175 km/h. It made landfall on the coast between Cox’s Bazar (Bangladesh) and Kyaukphyu (Myanmar), near Sittwe (Myanmar). At most, catastrophic damage-causing winds were possible, especially in the areas of Rakhine State and Chin State, and severe damage-causing winds are possible in the ...
Details →
Usage examples
See 1 usage example →
machine learningnatural language processing
24K Question/Answer (QA) pairs over 4.7K paragraphs, split between train (19K QAs), development (2.4K QAs) and a hidden test partition (2.5K QAs).
Details →
Usage examples
See 1 usage example →
computed tomographycomputer visioncsvlabeledlife sciencesmachine learningmedical image computingmedical imagingradiologyx-ray tomography
Blunt force abdominal trauma is among the most common types of traumatic injury, with the most frequent cause being motor vehicle accidents. Abdominal trauma may result in damage and internal bleeding of the internal organs, including the liver, spleen, kidneys, and bowel. Detection and classification of injuries are key to effective treatment and favorable outcomes. A large proportion of patients with abdominal trauma require urgent surgery. Abdominal trauma often cannot be diagnosed clinically by physical exam, patient symptoms, or laboratory tests. Prompt diagnosis of abdominal trauma using...
Details →
Usage examples
See 1 usage example →
computed tomographycomputer visioncsvlabeledlife sciencesmachine learningmedical image computingmedical imagingradiologyx-ray tomography
Over 1.5 million spine fractures occur annually in the United States alone resulting in over 17,730 spinal cord injuries annually. The most common site of spine fracture is the cervical spine. There has been a rise in the incidence of spinal fractures in the elderly and in this population, fractures can be more difficult to detect on imaging due to degenerative disease and osteoporosis. Imaging diagnosis of adult spine fractures is now almost exclusively performed with computed tomography (CT). Quickly detecting and determining the location of any vertebral fractures is essential to prevent ne...
Details →
Usage examples
See 1 usage example →
computed tomographycomputer visioncsvlabeledlife sciencesmachine learningmedical image computingmedical imagingradiologyx-ray tomography
RSNA assembled this dataset in 2019 for the RSNA Intracranial Hemorrhage Detection AI Challenge (https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection/). De-identified head CT studies were provided by four research institutions. A group of over 60 volunteer expert radiologists recruited by RSNA and the American Society of Neuroradiology labeled over 25,000 exams for the presence and subtype classification of acute intracranial hemorrhage.
Details →
Usage examples
See 1 usage example →
computed tomographycomputer visioncsvlabeledlife sciencesmachine learningmedical image computingmedical imagingradiologyx-ray tomography
RSNA assembled this dataset in 2020 for the RSNA STR Pulmonary Embolism Detection AI Challenge (https://www.kaggle.com/c/rsna-str-pulmonary-embolism-detection/). With more than 12,000 CT pulmonary angiography (CTPA) studies contributed by five international research centers, it is the largest publicly available annotated PE dataset. RSNA collaborated with the Society of Thoracic Radiology to recruit more than 80 expert thoracic radiologists who labeled the dataset with detailed clinical annotations.
Details →
Usage examples
See 1 usage example →
jsonmachine learningnatural language processing
14k QA pairs over 1.7K paragraphs, split between train (10k QAs), development (1.6k QAs) and a hidden test partition (1.7k QAs).
Details →
Usage examples
See 1 usage example →
air qualityclimateearth observationmeteorologicalweather
Air Quality is a global SILAM atmospheric composition and air quality forecast performed on a daily basis for > 100 species and covering the troposphere and the stratosphere. The output produces 3D concentration fields and aerosol optical thickness. The data are unique: 20km resolution for global AQ models is unseen worldwide.
Details →
Usage examples
See 1 usage example →
air qualityclimateenvironmentalgeospatialradiation
An ongoing collection of radiation and air quality measurements taken by devices involved in the Safecast project.
Details →
Usage examples
See 1 usage example →
biologycell biologycell imagingepigenomicsgene expressionhistopathologyHomo sapiensimaginglife sciencesmedicinemicroscopyneurobiologyneurosciencesingle-cell transcriptomicstranscriptomics
The Seattle Alzheimer's Disease Brain Cell Atlas (SEA-AD) consortium strives to gain a deep molecular and cellular understanding of the early pathogenesis of Alzheimer's disease and is funded by the National Institutes on Aging (NIA U19AG060909). The SEA-AD datasets available here comprise single cell profiling (transcriptomics and epigenomics) and quantitative neuropathology. To explore gene expression and chromatin accessibility information, the single-cell profiling data includes: snRNAseq and snATAC-seq data from the SEA-AD donor cohort (aged brains which span the spectrum of Alzhe...
Details →
Usage examples
See 1 usage example →
disaster responseearth observationenvironmentalgeospatialsatellite imagerysustainabilitysynthetic aperture radar
The Sentinel-1 Single Look Complex (SLC) unzipped dataset contains Synthetic Aperture Radar (SAR) data from the European Space Agency’s Sentinel-1 mission. Unlike the zipped data provided by ESA, this dataset allows direct access to the individual swaths required for a given study area, drastically reducing a project’s storage and download time requirements. Since the data is stored on S3, users can use the boto3 library and the S3 get_object method to read the entire content of an object into memory for processing, without saving it to local storage first. The Sentinel-1 ...
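The in-memory read described above might look roughly like the sketch below. It assumes a client that behaves like `boto3.client("s3")`; the helper name `read_object_bytes` and the bucket and key in the commented usage are hypothetical placeholders, not the dataset's actual location.

```python
def read_object_bytes(s3_client, bucket, key, byte_range=None):
    """Read an S3 object, or an inclusive byte range of it, into memory.

    s3_client is expected to expose get_object() like boto3.client("s3").
    byte_range is an optional (start, end) tuple of byte offsets.
    """
    kwargs = {"Bucket": bucket, "Key": key}
    if byte_range is not None:
        start, end = byte_range
        # Standard HTTP range syntax; both ends are inclusive.
        kwargs["Range"] = f"bytes={start}-{end}"
    response = s3_client.get_object(**kwargs)
    # "Body" is a stream; read() pulls its content into a bytes object.
    return response["Body"].read()


# Usage sketch (requires boto3 and valid access; names are placeholders):
#   import boto3
#   s3 = boto3.client("s3")
#   data = read_object_bytes(s3, "example-bucket", "path/to/swath-file")
```

Passing the client in as an argument keeps the helper easy to exercise with a stub object, and the optional range lets a caller fetch only part of a large file.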
Details →
Usage examples
See 1 usage example →
proteinsingle-cell transcriptomics
Comprehensive, large-scale single-cell profiling of healthy human blood at different ages is one of the critical pending tasks required to establish a framework for systematic understanding of human aging. Here, using single-cell RNA/TCR/BCR-seq with protein feature barcoding (20 antibodies), we profiled 317 samples from 166 healthy individuals aged 25 to 85 years old, drawn over a 3-year period. The dataset, spanning ~2 million cells, describes 50 subpopulations of blood immune cells, with 14 subpopulations changing with age, including a novel NKG2C+ CD8 Tcm population that decreases with age. We desc...
Details →
Usage examples
See 1 usage example →
astronomyimagingsatellite imagerysurvey
Spitzer was an infrared astronomy space telescope with imaging from 3 to 160 microns and spectroscopy from 5 to 37 microns, launched into an Earth-trailing solar orbit as the last of NASA's Great Observatories. The SEIP Super Mosaics include data from the four channels of IRAC (3.6, 4.5, 5.8, 8 microns) and the 24 micron channel of MIPS. Data from multiple programs are combined where appropriate. Cryogenic Release v3.0 includes Spitzer data taken during commissioning and cryogenic operations, including calibration data.
Details →
Usage examples
See 1 usage example →
air temperatureclimate modelenergysolar
Released to the public as part of the Department of Energy's Open Energy Data Initiative,
these data represent a serially complete collection of hourly 4km wind, solar, temperature,
humidity, and pressure fields for the Continental United States under climate change scenarios. Sup3rCC is downscaled Global Climate Model (GCM) data. For example, the initial file set tagged
"sup3rcc_conus_mriesm20_ssp585_r1i1p1f1" is downscaled from MRI ESM 2.0 for climate change
scenario SSP5 8.5 and variant label r1i1p1f1. The downscaling process is performed using a
generative machine learning...
Details →
Usage examples
See 1 usage example →
citiesgeospatialinfrastructuremappingtraffictransportation
The basic geo-data set for public transport stops comprises public transport stops in Switzerland and additional selected geo-referenced public transport locations that are of operational or structural importance (operating points).
Details →
Usage examples
See 1 usage example →
bioinformaticscsvdicomgenomichealthimaginglife sciencesmedicine
This is a synthetic data set that includes FHIR resources, DICOM images, genomic data, physiological data (i.e., ECGs), and simple clinical notes. FHIR links all the data types together.
Details →
Usage examples
-
The “Coherent Data Set”: Combining Patient Data and Imaging in a Comprehensive Synthetic Health Record. by Walonoski J, Hall D, Bates KM, Farris MH, Dagher J, Downs ME, Sivek RT, Wellner B, Gregorowicz A, Hadley M, Campion FX, Levine L, Wacome K, Emmer G, Kemmer A, Malik M, Hughes J, Granger E, Russell S.
See 1 usage example →
biologyencyclopedicgenomichealthlife sciencesmedicine
Tabula Muris is a compendium of single cell transcriptomic data from the model organism Mus musculus comprising more than 100,000 cells from 20 organs and tissues. These data represent a new resource for cell biology, reveal gene expression in poorly characterized cell populations, and allow for direct and controlled comparison of gene expression in cell types shared between tissues, such as T-lymphocytes and endothelial cells from different anatomical locations. Two distinct technical approaches were used for most organs: one approach, microfluidic droplet-based 3’-end counting, enabled the s...
Details →
Usage examples
See 1 usage example →
biologyencyclopedicgenomichealthlife sciencesmedicinesingle-cell transcriptomics
Tabula Muris Senis is a comprehensive compendium of single cell transcriptomic data from the model organism Mus musculus comprising more than 500,000 cells from 18 organs and tissues across the mouse lifespan. We discovered cell-specific changes occurring across multiple cell types and organs, as well as age related changes in the cellular composition of different organs. Using single-cell transcriptomic data we were able to assess cell type specific manifestations of different hallmarks of aging, such as senescence, changes in the activity of metabolic pathways, depletion of stem-cell populat...
Details →
Usage examples
See 1 usage example →
biologyencyclopedicgeneticgenomichealthlife sciencesmedicinesingle-cell transcriptomics
Tabula Sapiens will be a benchmark, first-draft human cell atlas of two million cells from 25 organs of eight normal human subjects.
Taking the organs from the same individual controls for genetic background, age, environment, and epigenetic effects, and allows detailed analysis and comparison of cell types that are shared between tissues.
Our work creates a detailed portrait of cell types as well as their distribution and variation in gene expression across tissues and within the endothelial, epithelial, stromal and immune compartments.
A critical factor in the Tabula projects is our large collaborative network of PIs with deep expertise in the preparation of diverse organs, enabling all organs from a subject to be successfully processed within a single day.
Tabula Sapiens leverages our network of human tissue experts and a close collaboration with Donor Network West, a not-for-profit organ procurement organization.
We use their experience to balance and assign cell types from each tissue compartment and optimally mix high-quality plate-seq data and high-volume droplet-based data to provide a broad and deep benchmark atlas.
Our goal is to make sequence data rapidly and broadly available to the scientific community as a community resource. Before you use our data, please take note of our Data Release Policy below.
Data Release Policy
Our goal is to make sequence data rapidly and broadly available to the scientific community as a community resource. It is our intention to publish the work of this project in a timely fashion, and we welcome collaborative interaction on the project and analyses.
However, considerable investment was made in generating these data and we ask that you respect rights of first publication and acknowledgment as outlined in the Toronto agreement.
By accessing these data, you agree not to publish any articles containing analyses of genes, cell types or transcriptomic data on a who...
Details →
Usage examples
See 1 usage example →
atmosphereearth observationenvironmentalgeophysicsgeoscienceglobalmeteorologicalmodelnetcdfprecipitationsatellite imageryweather
The Tropical Cyclone Precipitation, Infrared, Microwave and Environmental Dataset (TC PRIMED) is a dataset centered around passive microwave observations of global tropical cyclones from low-Earth-orbiting satellites. TC PRIMED is a compilation of tropical cyclone data from various sources, including 1) tropical cyclone information from the National Oceanic and Atmospheric Administration (NOAA) National Weather Service National Hurricane Center (NHC) and Central Pacific Hurricane Center (CPHC) and the U.S. Department of Defense Joint Typhoon Warning Center, 2) low-Earth-orbiting satellite obse...
Details →
Usage examples
See 1 usage example →
censusstatisticssurvey
U.S. Census Bureau American Community Survey (ACS) Public Use Microdata Sample (PUMS) available in a linked data format using the Resource Description Framework (RDF) data model.
Details →
Usage examples
See 1 usage example →
genome wide association studylife sciencespopulation genetics
The UKB-PPP is a collaboration between the UK Biobank (UKB) and thirteen biopharmaceutical companies characterising the plasma proteomic profiles of 54,219 UKB participants. As part of a collaborative analysis across the thirteen UKB-PPP partners, we conducted comprehensive protein quantitative trait loci (pQTL) mapping of 2,923 proteins that identifies 14,287 primary genetic associations, of which 85% are newly discovered, in addition to ancestry-specific pQTL mapping in non-Europeans. We identify independent secondary associations in 87% of cis and 30% of trans loci, expanding the catalogue ...
Details →
Usage examples
See 1 usage example →
astronomyobject detectionparquetsurvey
unWISE is a reprocessing of Wide-field Infrared Survey Explorer (WISE) data which preserves the native angular resolution and is optimized for forced photometry. WISE was a NASA satellite producing all-sky imaging in four infrared bands centered at 3.4, 4.6, 12 and 22 microns (W1, W2, W3, and W4) starting in 2010 until the coolant was exhausted in 2011. It was reactivated in 2013 as NEOWISE and continued imaging in W1 and W2 until 2024.
Details →
Usage examples
See 1 usage example →
bathymetrydisaster responseelevationgeospatialjapaneselandlidarmapping
This dataset comprises high-precision 3D point cloud data that encompasses the entire Shizuoka prefecture in Japan, covering 7,200 out of its 7,777 square kilometers. The data is produced through aerial laser survey, airborne laser bathymetry and mobile mapping systems, the culmination of many years of dedicated effort. This data will be visualized and analyzed for use in infrastructure maintenance, disaster prevention measures and autonomous vehicle driving.
Details →
Usage examples
See 1 usage example →
biologyhealthlife sciencesmedicinesignal processing
VitalDB, a high-fidelity multi-parameter vital signs database in surgical patients.
Details →
Usage examples
See 1 usage example →
automatic speech recognitiondenoisingmachine learningspeaker identificationspeech processing
VOiCES is a speech corpus recorded in acoustically challenging settings,
using distant microphone recording. Speech was recorded in real rooms with various
acoustic features (reverb, echo, HVAC systems, outside noise, etc.). Adversarial noise,
either television, music, or babble, was concurrently played with clean speech.
Data was recorded using multiple microphones strategically placed
throughout the room. The corpus includes audio recordings, orthographic transcriptions,
and speaker labels.
Details →
Usage examples
See 1 usage example →
climateclimate modelclimate projectionsCMIP6earth observationnetcdf
CCKP provides open access to a comprehensive suite of climate and climate change resources derived from the latest generation of climate data archives. Products are based on a consistent and transparent approach with a systematic way of pre-processing the raw observed and model-based projection data to enable inter-comparable use across a broad range of applications. Climate products consist of basic climate variables as well as a large collection (70+) of more specialized, application-orientated variables and indices across different scenarios. Precomputed data can be extracted per specified ...
Details →
Usage examples
See 1 usage example →
biologychemical biologylife sciencesmolecular dockingpharmaceuticalprotein
3D models for molecular docking screens.
Details →
Usage examples
See 1 usage example →
autism spectrum disorderbamgeneticgenomiclife sciencesvcfwhole genome sequencing
iHART is the Hartwell Foundation’s Autism Research and Technology Initiative. This release contains whole genome data from over 1000 families with 2 or more children with autism, whose biomaterials were provided by the Autism Genetic Resource Exchange (AGRE).
Details →
Usage examples
See 1 usage example →
bioinformaticsbiologycancercsvgene expressiongeneticgenomicHomo sapienslife sciencesMus musculusneurosciencetranscriptomics
recount3 is an online resource consisting of RNA-seq gene, exon, and exon-exon junction counts as well as coverage bigWig files for 8,679 and 10,088 different studies for human and mouse respectively. It is the third generation of the ReCount project and part of recount.bio. recount2 is also included for historical purposes. The pipeline used to generate the data in recount3 (but not recount2) is available here.
Details →
Usage examples
See 1 usage example →
agricultureclimatedisaster responseenvironmentalmeteorologicalweather
The FourCastNet Global Forecast System (FourCastNetGFS) is an experimental system set up by the National Centers for Environmental Prediction (NCEP) to produce medium range global forecasts. The model runs on a 0.25 degree latitude-longitude grid (about 28 km) and 13 pressure levels. The model produces forecasts 4 times a day at 00Z, 06Z, 12Z and 18Z cycles. Major atmospheric and surface fields including temperature, wind components, geopotential height, relative humidity and 2 meter temperature and 10 meter winds are available. The products are 6 hourly forecasts up to 10 days. The data format is GRIB2.
The FourCastNetGFS system is an experimental weather forecast model built upon Nvidia’s pre-trained FourCastNet Machine Learning Weather Prediction (MLWP) model version 2. FourCastNet (Bonev et al., 2023) was developed by Nvidia using Adaptive Fourier Neural Operators. It uses a Fourier transform-based token-mixing scheme with the vision transformer architecture. This model is pre-trained with ECMWF’s ERA5 reanalysis data. FourCastNetGFS takes one model state as the initial condition from NCEP 0.25 degree GDAS analysis data and runs FourCastNet with weights from the pretrained FourCas
...
Details →
agricultureclimatedisaster responseenvironmentalmeteorologicalweather
The GraphCast Global Forecast System (GraphCastGFS) is an experimental system set up by the National Centers for Environmental Prediction (NCEP) to produce medium range global forecasts. The horizontal resolution is a 0.25 degree latitude-longitude grid (about 28 km). The model runs 4 times a day at 00Z, 06Z, 12Z and 18Z cycles. Major atmospheric and surface fields including temperature, wind components, geopotential height, specific humidity, and vertical velocity, are available. The products are 6 hourly forecasts up to 10 days. The data format is GRIB2.
The GraphCastGFS system is an experimental weather forecast model built upon Google DeepMind’s pre-trained GraphCast Machine Learning Weather Prediction (MLWP) model. The GraphCast model is implemented as a message-passing graph neural network (GNN) architecture with an “encoder-processor-decoder” configuration. It uses an icosahedron grid with multiscale edges and has around 37 million parameters. This model is pre-trained with ECMWF’s ERA5 reanalysis data. GraphCastGFS takes two model states as initial conditions (the current state and the state 6 hours earlier) from NCEP 0.25 degree GDAS analysis data and runs GraphCast (37 levels) and GraphCast_operational (13 levels) with a pre-trained model provided by GraphCast. Unit conversion of the GDAS data is performed to match the input data required by GraphCast and to generate forecast products consistent with GFS from GraphCastGFS’ native forecast data.
GraphCastGFS version 2 made the following changes from GraphCastGFS version 1:
- The 37-level model was removed due to storage restrictions and limited accuracy.
- The 13-level GraphCast ML model was fine-tuned with NCEP’s GDAS data as inputs and ECMWF ERA5 data as ground truth from 20210323 to 20220901, and validated from 20220901 to 20230101. Evaluation was done with forecasts from 20230101 to 20240101. The new weights created from the training are used to create global forecasts. Note that the GraphCastGFS v1 model weights obtained from Google’s DeepMind were based on training with 12 timesteps of ERA5 data, while the GraphCastGFS v2 model weights result from training with 14 timesteps of GDAS and ERA5 data, which significantly increased forecast accuracy compared with GraphCastGFS v1.
The input data generated from GDAS for GraphCast are provided under the input/ directory. An example file name is shown below:
source-gdas_date-2024022000_res-0.25_levels-13_steps-2.nc
The forecast files are under forecasts_13_levels/. Each directory contains 40 files covering a 10-day forecast. An example file name is listed below:
graphcastgfs.t00z.pgrb2.0p25.f006
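The two naming patterns above can be handled with a short Python sketch. It rests on assumptions drawn only from the examples shown: that input names are `key-value` pairs joined by underscores, and that forecast files run from lead hour 006 in 6-hour steps with a three-digit suffix (which gives 40 files for 10 days). The helper names are hypothetical.

```python
def parse_input_name(filename):
    """Split an input file name into its key-value fields.

    Assumption: fields are separated by "_" and each field is "key-value".
    """
    stem = filename[:-3] if filename.endswith(".nc") else filename
    fields = {}
    for part in stem.split("_"):
        key, _, value = part.partition("-")
        fields[key] = value
    return fields


def forecast_file_names(cycle_hour, forecast_days=10, step_hours=6):
    """List expected forecast file names for one cycle.

    Assumption: lead hours start at step_hours and are zero-padded to
    three digits, following the single example name in the description.
    """
    last_lead = forecast_days * 24
    return [
        f"graphcastgfs.t{cycle_hour:02d}z.pgrb2.0p25.f{lead:03d}"
        for lead in range(step_hours, last_lead + 1, step_hours)
    ]


input_fields = parse_input_name(
    "source-gdas_date-2024022000_res-0.25_levels-13_steps-2.nc"
)
ten_day_names = forecast_file_names(0)  # 00Z cycle, 10-day forecast
```

Under the same assumptions, `forecast_file_names(6, forecast_days=16)` would list 64 names for a 16-day forecast from the 06Z cycle.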
The GraphCastGFS version 2.1 change log:
- Starting from the 06Z cycle on 20240710, the forecast length is increased from 10 days to 16 days.
Please note that th...
Details →
cyber securityinternetintrusion detectionnetwork traffic
This dataset is the result of a collaborative project between the Communications Security Establishment (CSE) and The Canadian Institute for Cybersecurity (CIC) that uses the notion of profiles to generate cybersecurity datasets in a systematic manner. It includes a detailed description of intrusions along with abstract distribution models for applications, protocols, or lower level network entities. The dataset includes seven different attack scenarios, namely Brute-force, Heartbleed, Botnet, DoS, DDoS, Web attacks, and infiltration of the network from inside. The attacking infrastructure incl...
Details →
machine learningnatural language processing
9092 crowd-sourced science questions and 68 tables of curated facts
Details →
machine learningnatural language processing
68 tables of curated facts
Details →
csvjsonmachine learning
1,197,377 science-relevant sentences
Details →
machine learningnatural language processing
294,000 science-relevant tuples
Details →
biodiversitybiologyconservationgeneticgenomiclife sciencestranscriptomicswildlife
Australasian Genomes is the genomic data repository for the Threatened Species Initiative (TSI) and the ARC Centre for Innovations in Peptide and Protein Science (CIPPS). This repository contains reference genomes, transcriptomes, resequenced genomes and reduced representation sequencing data from Australasian species. Australasian Genomes is managed by the Australasian Wildlife Genomics Group (AWGG) at the University of Sydney on behalf of our collaborators within TSI and CIPPS.
Details →
life sciencesmagnetic resonance imagingneuroimagingneuroscienceniftipediatricsegmentation
Manually curated and reviewed infant brain segmentations and accompanying T1w and T2w images for a range of 1-9 month old participants from the Baby Connectome Project (BCP)
Details →
Usage examples
See 3 usage examples →
climatesustainability
The CSIRO Climate retrospective Analysis and Forecast Ensemble system: version 1 (CAFE60v1) provides a large ensemble retrospective analysis of the global climate system from 1960 to present with sufficiently many realizations and at spatio-temporal resolutions suitable to enable probabilistic climate studies. Using a variant of the ensemble Kalman filter, 96 climate state estimates are generated over the most recent six decades. These state estimates are constrained by monthly mean ocean, atmosphere and sea ice observations such that their trajectories track the observed state while enabling ...
Details →
agricultureclimatefood securitysustainability
High resolution climate data to help assess the impacts of climate change primarily on agriculture. These open access datasets of climate projections will help researchers make climate change impact assessments.
Details →
computer visiondeep learningmachine learning
COCO is a large-scale object detection, segmentation, and captioning dataset.
This is part of the fast.ai datasets collection hosted by AWS for convenience
of fast.ai students. If you use this dataset in your research please cite
arXiv:1405.0312 [cs.CV].
Details →
bioinformaticsbiologycoronavirusCOVID-19life sciencesmolecular dockingpharmaceutical
Aggregating critical information to accelerate drug discovery for the molecular modeling and simulation community.
A community-driven data repository and curation service for molecular structures, models, therapeutics, and
simulations related to computational research related to therapeutic opportunities for COVID-19
(caused by the SARS-CoV-2 coronavirus).
Details →
agricultureearth observationforecasthydrologymeteorologicalnatural resourceweather
Within the framework of the Southern South American Drought Information System (SISSA), a database of subseasonal and seasonal forecasts has been developed with both corrected and uncorrected data. It is intended for studying predictability at different scales and for feeding models in sectors such as agriculture and hydrology.
The database contains daily data for 2000-2019 (uncorrected) and 2010-2019 (corrected) for several variables, including mean, maximum and minimum temperature, as well as rainfall, mean wind and other variables intended to feed hydrological and crop models.
The database covers the entire area of the Regional Climate Center for Southern South America (CRC-SAS), from Bolivia and south-central Brazil to Patagonia, including member countries such as Chile, Argentina, Brazil, Paraguay, Uruguay and Bolivia.
The database was generated from GEFSv12 data for the subseasonal scale (GEFS) and CFS2 data for the seasonal scale (CFS2). Data from the ERA5 reanalysis (ERA5) were used to generate the corrected data.
Details →
climateearth observationearthquakessatellite imageryweather
Various kinds of weather raw data and charts from Central Weather Administration.
Details →
climateearth observationearthquakessatellite imageryweather
Various kinds of weather raw data and charts from Central Weather Bureau.
Details →
cogcomputer visiondeep learningearth observationfloodsgeospatialmachine learningsatellite imagerysynthetic aperture radar
This dataset consists of chips of Sentinel-1 and Sentinel-2 satellite data. Each Sentinel-1 chip contains a corresponding label for water and each Sentinel-2 chip contains a corresponding label for water and clouds. Data is stored in folders by a unique event identifier as the folder name. Within each event folder there are subfolders for Sentinel-1 (s1) and Sentinel-2 (s2) data. Each chip is contained in its own sub-folder with the folder name being the source image id, followed by a unique chip identifier consisting of a hyphenated set of 5 numbers. All bands of the satellite data, as well a...
Details →
air qualityatmospheremodel
The data are part of EPA’s Air Quality Time Series (EQUATES) Project. The data consist of hourly gridded pollutant concentration estimates from the Community Multiscale Air Quality (CMAQ) model version 5.3.2 (https://doi.org/10.15139/S3/F2KJSK) for January 1 – December 31, 2019. Model data are provided for two spatial domains: the Northern Hemisphere (108 km x 108 km horizontal grid spacing) and the Contiguous United States including parts of Canada and Mexico (12 km x 12 km horizontal grid spacing). Two types of hourly data are provided: three-dimensional air pollutant concentrations and vert...
Details →
autonomous vehiclesbroadbandcomputer visionlidarmachine learningsegmentationus
"The DARPA Invisible Headlights Dataset is a large-scale multi-sensor dataset annotated for autonomous, off-road navigation in challenging off-road environments. It features simultaneously collected off-road imagery from multispectral, hyperspectral, polarimetric, and broadband sensors spanning wavelengths from the visible spectrum to long-wave infrared and provides aligned LiDAR data for ground-truth shape. Camera calibrations, LiDAR registrations, and traversability annotations for a subset of the data are available."
Details →
energymarinewater
Data released from projects funded by the Department of Energy's Water Power Technologies Office (DOE WPTO)
that are too large or complex to be conveniently accessed by traditional means. The Marine Energy data lake
aims to improve and automate access of high-value MHK data sets, making data actionable and discoverable by
researchers and industry to accelerate analysis and advance innovation.
This data lake is a sister data lake to the Department of Energy’s Open Energy Data Initiative (OEDI) data lake.
Details →
citiesdisaster responsegeospatialus-dc
LiDAR point cloud data for Washington, DC is available for anyone to use on Amazon S3.
This dataset, managed by the Office of the Chief Technology Officer (OCTO), through the
direction of the District of Columbia GIS program, contains tiled point cloud data for
the entire District along with associated metadata.
Details →
agricultureair qualityair temperatureatmosphereclimateclimate modelclimate projectionsCMIP5CMIP6ecosystemselevationenvironmentalEulerianeventsfloodsfluid dynamicsgeosciencegeospatialhdf5healthHPChydrologyinfrastructureland coverland usemeteorologicalmodelnear-surface air temperaturenear-surface relative humiditynear-surface specific humiditynetcdfopen source softwarephysicspost-processingprecipitationradiationsimulationsuswaterweather
The data are a subset of the EPA Dynamically Downscaled Ensemble (EDDE), Version 1. EDDE is a collection of physics-based modeled data that represent 3D atmospheric conditions for historical and future periods under different scenarios. The EDDE Version 1 datasets cover the contiguous United States at a horizontal grid spacing of 36 kilometers at hourly increments. EDDE Version 1 includes simulations that have been dynamically downscaled from multiple global climate models (GCMs) under both mid- and high-emission scenarios from the Fifth Coupled Model Intercomparison Project (CMIP5) using the...
Details →
agricultureair qualityair temperatureatmosphereclimateclimate modelclimate projectionsCMIP5CMIP6ecosystemselevationenvironmentalEulerianeventsfloodsfluid dynamicsgeosciencegeospatialhdf5healthHPChydrologyinfrastructureland coverland usemeteorologicalmodelnear-surface air temperaturenear-surface relative humiditynear-surface specific humiditynetcdfopen source softwarephysicspost-processingprecipitationradiationsimulationsuswaterweather
The data are a subset of the EPA Dynamically Downscaled Ensemble (EDDE), Version 2. EDDE is a collection of physics-based modeled data that represent 3D atmospheric conditions for historical and future periods under different scenarios. The EDDE Version 2 datasets cover the contiguous United States at a horizontal grid spacing of 12 kilometers at hourly increments. EDDE Version 2 will include simulations that have been dynamically downscaled from multiple global climate models (GCMs) under multiple emission scenarios from the Sixth Coupled Model Intercomparison Project (CMIP6) using the Weath...
Details →
environmental
Detailed air model results from EPA’s Risk-Screening Environmental Indicators (RSEI) model.
Details →
astronomy
The data are from observations with the Murchison Widefield Array (MWA) which is a
Square Kilometer Array (SKA) precursor in Western Australia. This particular
dataset is from the Epoch of Reionization project which is a key science driver
of the SKA. Nearly 2PB of such observations have been recorded to date; this is
a small subset of that which has been exported from the MWA data archive in
Perth and made available to the public on AWS. The data were taken to detect
signatures of the first stars and galaxies forming and the effect of these early
stars and galaxies on the evolution of the u...
Details →
assemblybioinformaticsbiologycontaminationfastageneticgenomehealthlife scienceswhole genome sequencing
Sequence database used by FCS-GX (Foreign Contamination Screen - Genome Cross-species aligner) to detect contamination from foreign organisms in genome sequences.
Details →
astronomy
The Galaxy Evolution Explorer Satellite (GALEX) was a NASA mission led by the California Institute of Technology, whose primary goal was to investigate how star formation in galaxies evolved from the early universe up to the present. GALEX used microchannel plate detectors to obtain direct images in the near-UV (NUV) and far-UV (FUV), and a grism to disperse light for low resolution spectroscopy.
Details →
biodiversitybioinformaticsbiologyconservationgeneticgenomiclife sciences
The Genome Ark hosts genomic information for the Vertebrate Genomes Project (VGP) and other related projects. The VGP is an international collaboration that aims to generate complete and near error-free reference genomes for all extant vertebrate species. These genomes will be used to address fundamental questions in biology and disease, to identify species most genetically at risk for extinction, and to preserve genetic information of life.
Details →
ai safetymachine learningnatural language processingsynthetic data
A comprehensive dataset designed for aligning language models with safety and ethical guidelines. Contains 8,361 curated triplets of prompts, responses, and safe responses across various risk categories. Each entry includes safety scores, judge reasoning, and harm probability assessments, making it valuable for model alignment, testing, and benchmarking.
Details →
Usage examples
See 3 usage examples →
energyenvironmentalmodelsustainability
The aim of this project is to create an easy-to-use platform where various types of analytics can be performed on a wide range of electrical grid datasets. The aim is to establish an open-source library of algorithms that universities, national labs and other developers can contribute to which can be used on both open-source and proprietary grid data to improve the analysis of electrical distribution systems for the grid modeling community. OEDI Systems Integration (SI) is a grid algorithms and data analytics API created to standardize how data is sent between different modules that are run as...
Details →
biologyconservationecosystemsenvironmentallabeledobject detection
For this project, The Water Institute (the Institute) and subcontractor Colibri Ecological Consulting, LLC (Colibri) utilized established methods and protocols capable of assessing changes of colonial waterbird populations and their important habitats within individual states and the broader northern Gulf of Mexico region. Data collection activities included:
Aerial Photographic Nest Surveys: Implementation of fixed-wing aircraft surveys intended to assess waterbird colonies and document associated nesting within select portions of the northern Gulf of Mexico. Additional detail is provide...
Details →
biologybreast cancercancercomputational pathologyhistopathologylife sciences
This is a retrospective dataset of 1523 H&E-stained whole slide images (WSI) of lymph nodes from breast cancer patients. The cohort consisted of 177 patients (122 LN-positive, meaning metastasis was reported in at least 1 LN, and 55 LN-negative patients) with invasive breast carcinoma treated between 1984 and 2002 at Guy’s Hospital London, UK. Slides were scanned and digitised at 40x magnification (0.23 µm/pixel) using a NanoZoomer 2.0-HT scanner (Hamamatsu Photonics UK, Ltd, Welwyn Garden City, UK). WSIs are in .ndpi format.
Details →
agricultureclimateearth observationmeteorologicalweather
HIRLAM (High Resolution Limited Area Model) is an operational synoptic and mesoscale weather prediction model managed by the Finnish Meteorological Institute.
Details →
agricultureclimatecoastalearth observationenvironmentalsustainabilityweather
This dataset contains historical and projected dynamically downscaled climate data for the Southeast region of the State of Alaska at 1 km and 4 km spatial resolution and hourly temporal resolution. Select variables are also summarized at daily resolution. This data was produced using the Weather Research and Forecasting (WRF) model (Version 4.0). We downscaled Climate Forecast System Reanalysis (CFSR) historical reanalysis data (1980-2019) as well as historical and projected runs from two GCMs from the Coupled Model Inter-comparison Project 5 (CMIP5): GFDL-CM3 and NCAR-CCSM4 (historical ru...
Details →
disaster responseelevationgeospatiallidar
The U.S. Cities elevation data collection program supported the US Department of Homeland Security Homeland Security and Infrastructure Program (HSIP). As part of the HSIP Program, there were 133+ U.S. cities that had imagery and LiDAR collected to provide the Homeland Security, Homeland Defense, and Emergency Preparedness, Response and Recovery (EPR&R) community with common operational, geospatially enabled baseline data needed to analyze threat, support critical infrastructure protection and expedite readiness, response and recovery in the event of a man-made or natural disaster. As a pa...
Details →
astronomy
The Hubble Space Telescope (HST) is one of the most productive scientific instruments ever created. This dataset contains calibrated and raw data for all currently active instruments on HST: ACS, COS, STIS, WFC3, and FGS.
Details →
climatenetcdfprecipitation
GCMs under CMIP6 have been widely used to investigate climate change impacts and put forward associated adaptation and mitigation strategies. However, the relatively coarse spatial resolutions (usually 100~300km) preclude their direct applications at regional scales, which are exactly where the analysis (e.g., hydrological model simulation) is performed. To bridge this gap, a typical approach is to ‘refine’ the information from GCMs through regional climate downscaling experiments, which can be conducted statistically, dynamically, or a combination thereof. Statistical downscaling establishes ...
Details →
earth observationenvironmentalgeospatialsatellite imagery
ISS SERVIR Environmental Research and Visualization System (ISERV) was a fully-automated prototype camera aboard the International Space Station that was tasked to capture high-resolution Earth imagery of specific locations at 3-7 frames per second. In the course of its regular operations during 2013 and 2014, ISERV's camera acquired images that can be used primarily for environmental and disaster management.
Details →
computer visiondeep learningmachine learning
Some of the most important datasets for image localization research, including
Camvid and PASCAL VOC (2007 and 2012). This is part of the fast.ai datasets
collection hosted by AWS for convenience of fast.ai students. See
documentation link for citation and license details for each dataset.
Details →
bioinformaticscoronavirusCOVID-19healthlife sciencesmedicineSARS
This dataset is a collection of anonymized thoracic radiographs (X-Rays) and computed tomography (CT) scans of patients with suspected COVID-19. Images are accompanied by a positive or negative diagnosis for SARS-CoV-2 infection via RT-PCR. These images were provided by Hospital das Clínicas da Universidade de São Paulo, Hospital Sirio-Libanes, and by Laboratory Fleury.
Details →
astronomy
The K2 mission observed 100 square degrees for 80 days each across 20 different pointings along the ecliptic, collecting high-precision photometry for a selection of targets within each field. The mission began when the original Kepler mission ended due to loss of the second reaction wheel in 2013.
Details →
autonomous vehiclescomputer visiondeep learningmachine learningrobotics
Dataset and benchmarks for computer vision research in the context of autonomous driving. The dataset has been recorded in and around the city of Karlsruhe, Germany using the mobile platform AnnieWay (VW station wagon) which has been equipped with several RGB and monochrome cameras, a Velodyne HDL 64 laser scanner as well as an accurate RTK corrected GPS/IMU localization unit. The dataset has been created for computer vision and machine learning research on stereo, optical flow, visual odometry, semantic segmentation, semantic instance segmentation, road segmentation, single image depth predic...
Details →
astronomy
The Kepler mission observed the brightness of more than 180,000 stars near the Cygnus constellation at a 30 minute cadence for 4 years in order to find transiting exoplanets, study variable stars, and find eclipsing binaries.
Details →
The Medical Information Mart for Intensive Care (MIMIC)-IV database is comprised of deidentified electronic health records for patients admitted to the Beth Israel Deaconess Medical Center. Access to MIMIC-IV is limited to credentialed users. Here, we have provided an openly-available demo of MIMIC-IV containing a subset of 100 patients. The dataset includes similar content to MIMIC-IV, but excludes free-text clinical notes. The demo may be useful for running workshops and for assessing whether the MIMIC-IV is appropriate for a study before making an access r...
Details →
The MIMIC-IV-ECG module contains approximately 800,000 diagnostic electrocardiograms across nearly 160,000 unique patients. These diagnostic ECGs use 12 leads and are 10 seconds in length. They are sampled at 500 Hz. This subset contains all of the ECGs for patients who appear in the MIMIC-IV Clinical Database. When a cardiologist report is available for a given ECG, we provide the needed information to link the waveform to the report. The patients in MIMIC-IV-ECG have been matched against the MIMIC-IV Clinical Database, making it possible to link to information across the MIMIC-IV modules.
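As a quick check on the figures above, the sampling rate, duration, and lead count determine how many values one recording holds. The sketch below only does that arithmetic; the constant names are illustrative, and it assumes nothing about the on-disk file format or array layout.

```python
# Figures taken from the description above.
SAMPLING_RATE_HZ = 500  # samples per second
DURATION_S = 10         # seconds per recording
N_LEADS = 12            # leads per diagnostic ECG

# Samples recorded on each lead, and total values in one ECG.
samples_per_lead = SAMPLING_RATE_HZ * DURATION_S
values_per_ecg = samples_per_lead * N_LEADS

print(f"{samples_per_lead} samples per lead, {values_per_ecg} values per ECG")
```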
Details →
benchmarkcomputer visiondeep learninginternet
The MegaScenes Dataset is an extensive collection of around 430k scenes, featuring over 100k structure-from-motion reconstructions and over 2 million registered images. MegaScenes includes a diverse array of scenes, such as minarets, building interiors, statues, bridges, towers, religious buildings, and natural landscapes. The images of these scenes are captured under varying conditions, including different times of day, various weather and illumination, and from different devices with distinct camera intrinsics.
Details →
Usage examples
-
MegaScenes: Scene-Level View Synthesis at Scale by Tung J., Chou G., Cai R., Yang, G., Zhang K., Wetzstein G., et al.
See 3 usage examples →
deep learningmachine learningnatural language processing
Some of the most important datasets for NLP, with a focus on classification, including
IMDb, AG-News, Amazon Reviews (polarity and full), Yelp Reviews (polarity and
full), Dbpedia, Sogou News (Pinyin), Yahoo Answers, Wikitext 2 and Wikitext
103, and ACL-2010 French-English 10^9 corpus. This is part of the
fast.ai datasets collection hosted by AWS for convenience of fast.ai
students. See documentation link for citation and license details for each
dataset.
Details →
agricultureagriculturebathymetryclimatedisaster responseenvironmentaloceanstransportationweather
One of the National Geospatial-Intelligence Agency’s (NGA) and the National Oceanic and Atmospheric Administration’s (NOAA) missions is to ensure the safety of navigation on the seas by
maintaining the most current information and the highest quality services for U.S. and global transport networks. To achieve this mission, we need accurate coastal bathymetry over diverse
environmental conditions. The SCuBA program focused on providing critical information to improve existing bathymetry resources and techniques with two specific objectives. The first objective
was to validate National Aeronautics and Space Administration’s (NASA) Ice, Cloud and land Elevation SATellite-2 (ICESat-2), an Earth observing, space-based light detection and ranging (LiDAR)
capability, as a useful bathymetry tool for nearshore bathymetry information in differing environmental conditions. Upon validating the ICESat-2 bathymetry retrievals relative to sea floor
type, water clarity, and water surface dynamics, the next objective is to use ICESat-2 as a calibration tool to improve existing Satellite Derived Bathymetry (SDB) coastal bathymetry products
with poor coastal depth information but superior spatial coverage. Current resources that monitor coastal bathymetry can have large vertical depth errors (up to 50 percent) in the nearshore
region; however, results derived from ICESat-2 show promise for improving the accuracy of the bathymetry information in the nearshore region.
Project Overview
One of NGA’s and NOAA’s primary missions is to provide safety of navigation information. However, coastal depth information is still lacking in some regions, specifically remote regions. In fact, it has been reported that 80 percent of the entire seafloor has not been mapped. Traditionally, airborne LiDARs and survey boats are used to map the seafloor, but in remote areas, we have to rely on satellite capabilities, which currently lack the vertical accuracy desired to support safety of navigation in shallow water. In 2018, NASA launched a space-based LiDAR system called ICESat-2 that has global coverage and a polar orbit originally designed to monitor the ice elevation in polar regions. Remarkably, because it has a green laser beam, ICESat-2 also happens to collect bathymetry information. With algorithm development provided by University of Texas (UT) Austin, NGA Research and Development (R&D) leveraged the ICESat-2 platform to generate SCuBA, an automated depth retrieval algorithm for accurate, global, refraction-corrected underwater depths from 0 m to 30 m, detailed in Figure 1 of the documentation. The key benefit of this product is the vertical depth accuracy of depth retrievals, which is ideal for a calibration tool. NGA and the NOAA National Geodetic Survey (NGS) partnered to make this product available to the public for all US territories.
...
Details →
climatecoastaldisaster responseenvironmentalglobalmarine navigationmeteorologicaloceanssustainabilitywaterweather
NOTICE - The Coast Survey Development Laboratory (CSDL) in NOAA/National Ocean Service (NOS)/Office of Coast Survey is upgrading the Surge and Tide Operational Forecast System (STOFS, formerly ESTOFS) to Version 2.1. A Service Change Notice (SCN) has been issued and can be found "HERE"
NOAA's Surge and Tide Operational Forecast System: Three-Dimensional Component for the Atlantic Basin (STOFS-3D-Atlantic). STOFS-3D-Atlantic runs daily (at 12 UTC) to provide users with 24-hour nowcasts (analyses of near present conditions) and up to 96-hour forecast guidance of water level conditions, and 2- and 3-dimensional fields of water temperature, salinity, and currents. The water level outputs represent the combined tidal and subtidal water surface elevations and are referenced to xGEOID20B.
STOFS-3D-Atlantic has been developed to serve the marine navigation, weather forecasting, and disaster mitigation user communities. It is developed in a collaborative effort between the NOAA/National Ocean Service (NOS)/Office of Coast Survey, the NOAA/National Weather Service (NWS)/National Centers for Environmental Prediction (NCEP) Central Operations (NCO), and the Virginia Institute of Marine Science.
STOFS-3D-Atlantic employs the Semi-implicit Cross-scale Hydroscience Integrated System Model (SCHISM) as the hydrodynamic model core. Its unstructured grid consists of 2,926,236 nodes and 5,654,157 triangular or quadrilateral elements. Grid resolution is 1.5-2 km near the shoreline, ~600 m for the floodplain, down to 8 m for watershed rivers (at least 3 nodes across each river cross-section), and around 2-10 m for levees. Along the U.S. coastline, the land boundary of the domain aligns with the 10-m contour above xGEOID20B, encompassing the coastal transitional zone most vulnerable to coastal and inland flooding.
STOFS-3D-Atlantic makes use of outputs from the National Water Model (NWM) to include inland hydrology and extreme precipitation effects on coastal flooding; forecast guidance from the NCEP Global Forecast System (GFS) and High-Resolution Rapid Refresh (HRRR) model as the surface meteorological forci...
Details →
agricultureclimatemeteorologicalsustainabilityweather
NOAA's Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and to what extent the land, oceans, atmosphere and ice sheets are changing. These datasets are thoroughly vetted time series measurements with the longevity, consistency, and continuity to assess and measure climate variability and change. NOAA CDRs are vetted using standards established by the National Research Council (NRC).
Climate Data Records are created by merging data from surface, atmosphere, and space-based systems across decades. NOAA’s Climate Data Records provide authoritative and traceable long-term climate records. NOAA developed CDRs by applying modern data analysis methods to historical global satellite data. This process can clarify the underlying climate trends within the data and allows researchers and other users to identify economic and scientific value in these records. NCEI maintains and extends CDRs by applying the same methods to present-day and future satellite measurements.
Atmospheric Climate Data Records are measurements of several global variables to help characterize the atmosphere
...
Details →
climatecoastaldisaster responseenvironmentalmeteorologicaloceanswaterweather
This repository contains references to datasets published to the NOAA Open Data Dissemination Program. These reference datasets serve as index files to the original data by mapping to the Zarr V2 specification. When multidimensional model output is read through zarr, data can be lazily loaded (i.e. retrieving only the data chunks needed for processing) and data reads can be scaled horizontally to optimize object storage read performance.
The process used to optimize the data is called kerchunk. RPS runs the workflow in their AWS cloud environment every time a new data notification is received from a relevant source data bucket.
These are the current datasets being cloud-optimized. Refer to those pages for file naming conventions and other information regarding the specific model implementations:
NOAA Operational Forecast System (OFS)
NOAA Global Real-Time Ocean Forecast System (Global RTOFS)
NOAA National Water Model Short-Range Forecast
Filenames follow the source dataset’s conventions. For example, if the source file is
nos.dbofs.fields.f024.20240527.t00z.nc
Then the cloud-optimized filename is the same, with “.zarr” appended
nos.dbofs.fields.f024.20240527.t00z.nc.zarr
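The naming rule above can be sketched in a few lines of Python. The function name `cloud_optimized_name` is hypothetical and used only for illustration. The only behavior taken from the description is that the source filename is kept unchanged and ".zarr" is appended.

```python
def cloud_optimized_name(source_filename: str) -> str:
    """Return the cloud-optimized filename for a source file.

    Hypothetical helper: it assumes only the rule stated in the
    description, namely that ".zarr" is appended to the source name.
    """
    return source_filename + ".zarr"


# Example using the source filename quoted in the description.
print(cloud_optimized_name("nos.dbofs.fields.f024.20240527.t00z.nc"))
```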
Data Aggregations
We also produce virtual aggregations to group an entire forecast model run, and the “best” available forecast.
Best Forecast (continuously updated) - nos.dbofs.fields.best.nc.zarr
Full Model Run - nos.dbofs.fields.forecast.[YYYYMMDD].t[CC]z.nc.zarr
- CC is the model run cycle: 00, 06, 12, 18 or 03, 09, 15, 21 for nowcast and forecast runs
- YYYY = year, MM = month, DD = day
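The aggregation name patterns above can be filled in programmatically. The following is a minimal sketch, not part of the published tooling: the helper names are hypothetical, and it assumes that `dbofs` in the patterns is a model identifier that other models replace with their own. Which cycle hours apply to which model is not stated here, so the check simply accepts any of the listed values.

```python
from datetime import date

# Cycle hours listed in the description above.
VALID_CYCLES = {0, 6, 12, 18, 3, 9, 15, 21}


def full_model_run_name(model: str, run_date: date, cycle: int) -> str:
    """Build a "Full Model Run" aggregation name (hypothetical helper).

    Follows the pattern nos.<model>.fields.forecast.[YYYYMMDD].t[CC]z.nc.zarr;
    treating "dbofs" as a replaceable model identifier is an assumption.
    """
    if cycle not in VALID_CYCLES:
        raise ValueError(f"unexpected cycle hour: {cycle}")
    return f"nos.{model}.fields.forecast.{run_date:%Y%m%d}.t{cycle:02d}z.nc.zarr"


def best_forecast_name(model: str) -> str:
    """Build the continuously updated "Best Forecast" name (hypothetical helper)."""
    return f"nos.{model}.fields.best.nc.zarr"


print(full_model_run_name("dbofs", date(2024, 5, 27), 0))
print(best_forecast_name("dbofs"))
```

Validating the cycle hour up front keeps a mistyped value from silently producing a name that matches no aggregation.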
Cloud o...
Details →
broadcast ephemerisContinuously Operating Reference Station (CORS)earth observationgeospatialGNSSGPSmappingNOAA CORS Network (NCN)post-processingRINEXsurvey
The NOAA Continuously Operating Reference Stations (CORS) Network (NCN), managed by NOAA/National Geodetic Survey (NGS), provides Global Navigation Satellite System (GNSS) data, supporting three-dimensional positioning, meteorology, space weather, and geophysical applications throughout the United States. The NCN is a multi-purpose, multi-agency cooperative endeavor, combining the efforts of hundreds of government, academic, and private organizations. The stations are independently owned and operated. Each agency shares their GNSS/GPS carrier phase and code range measurements and station metadata with NGS, which are analyzed and distributed free of charge.
...
Details →
agricultureclimatemeteorologicalsustainabilityweather
NOAA's Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and to what extent the land, oceans, atmosphere and ice sheets are changing. These datasets are thoroughly vetted time series measurements with the longevity, consistency, and continuity to assess and measure climate variability and change. NOAA CDRs are vetted using standards established by the National Research Council (NRC).
Climate Data Records are created by merging data from surface, atmosphere, and space-based systems across decades. NOAA’s Climate Data Records provide authoritative and traceable long-term climate records. NOAA developed CDRs by applying modern data analysis methods to historical global satellite data. This process can clarify the underlying climate trends within the data and allows researchers and other users to identify economic and scientific value in these records. NCEI maintains and extends CDRs by applying the same methods to present-day and future satellite measurements.
Fundamental CDRs are composed of sensor data (e.g. calibrated radiances, brightness temperatures) that have been
...
Details →
agricultureclimatemeteorologicalweather
The Global Ensemble Forecast System (GEFS), previously known as the GFS Global ENSemble (GENS), is a weather forecast model made up of 21 separate forecasts, or ensemble members. The National Centers for Environmental Prediction (NCEP) started the GEFS to address the nature of uncertainty in weather observations, which are used to initialize weather forecast models. The GEFS attempts to quantify the amount of uncertainty in a forecast by generating an ensemble of multiple forecasts, each minutely different, or perturbed, from the original observations. With global coverage, GEFS is produced fo...
Details →
agriculture, meteorological, water, weather
NOTE - The legacy on-premises version of the Global Hydroestimator (GHE) is being retired. It is being replaced by the global Enterprise Rain Rate algorithm. You can find Enterprise Rain Rate products in the new bucket listed under the Resources section.
The Global Hydro-Estimator provides global mosaic imagery of rainfall estimates from multiple geostationary satellites, which currently include GOES-16, GOES-15, Meteosat-8, Meteosat-11 and Himawari-8. The GHE products include instantaneous rain rate and 1-hour, 3-hour, 6-hour, 24-hour, and multi-day rainfall accumulations.
Details →
agriculture, climate, meteorological, weather
NOAA/NESDIS Global Mosaic of Geostationary Satellite Imagery (GMGSI) visible (VIS), shortwave infrared (SIR), longwave infrared (LIR) imagery, and water vapor imagery (WV) are composited from data from several geostationary satellites orbiting the globe, including the GOES-East and GOES-West Satellites operated by U.S. NOAA/NESDIS, the Meteosat-10 and Meteosat-9 satellites from the Meteosat Second Generation (MSG) series of satellites operated by European Organization for the Exploitation of Meteorological Satellites (EUMETSAT), and the Himawari-9 satellite operated by the Japan Meteorological ...
Details →
climate, coastal, disaster response, environmental, global, meteorological, oceans, water, weather
NOAA is soliciting public comment on potential changes to the Real Time Ocean Forecast System (RTOFS) through March 27, 2024. Please see the Public Notice at https://www.weather.gov/media/notification/pdf_2023_24/pns24-12_rtofs_v2.4.0.pdf
NOAA's Global Real-Time Ocean Forecast System (Global RTOFS) provides users with nowcasts (analyses of near present conditions) and forecast guidance up to eight days of ocean temperature and salinity, water velocity, sea surface elevation, sea ice coverage and sea ice thickness.
The Global Operational Real-Time Ocean Forecast System (Global RTOFS) is based on an eddy resolving 1/12° global HYCOM (HYbrid Coordinates Ocean Model) (https://www.hycom.org/), which is coupled to the Community Ice CodE (CICE) Version 4 (https://www.arcus.org/witness-the-arctic/2018/5/highlight/1). The RTOFS grid has a 1/12 degree horizontal resolution and 41 hybrid vertical levels on a global tripolar grid.
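To give a feel for what a 1/12 degree resolution means in distance terms, the sketch below converts it to kilometers. It assumes a regular latitude-longitude grid and a spherical Earth with a mean radius of 6371 km, which the real tripolar grid does not follow exactly, so the numbers are order-of-magnitude only. The function name `nominal_spacing_km` is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)
GRID_DEG = 1.0 / 12.0     # horizontal resolution stated in the description


def nominal_spacing_km(lat_deg, grid_deg=GRID_DEG):
    """Return (meridional, zonal) spacing in km for a regular lat-lon grid.

    Assumption: regular lat-lon spacing. The actual tripolar grid departs
    from this, especially at high northern latitudes.
    """
    km_per_deg = math.pi * EARTH_RADIUS_KM / 180.0
    meridional = grid_deg * km_per_deg
    zonal = meridional * math.cos(math.radians(lat_deg))
    return meridional, zonal


for lat in (0.0, 45.0, 60.0):
    dy, dx = nominal_spacing_km(lat)
    print(f"lat {lat:4.1f}: dy = {dy:.2f} km, dx = {dx:.2f} km")
```

Under these assumptions the spacing is roughly 9 km at the equator, which is fine enough to resolve mesoscale ocean eddies at low and middle latitudes.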
Since 2020, the RTOFS system has implemented a multivariate, multi-scale 3DVar data assimilation algorithm (Cummings and Smedstad, 2014) using a 24-hour update cycle. The data types presently assimilated include:
(1) satellite Sea Surface Temperature (SST) from METOP-B and JPSS-VIIRS, and in situ SST from ships and from fixed and drifting buoys
(2) Sea Surface Salinity (SSS) from SMAP, SMOS, and buoys
(3) profiles of temperature and salinity from animal-borne sensors, Alamo floats, Argo floats, CTD, fixed buoys, gliders, TESAC, and XBT
(4) Absolute Dynamic Topography (ADT) from Altika, Cryosat, Jason-3, and Sentinel 3a, 3b, and 6a
(5) sea ice concentration from SSMI/S and AMSR2
The system is designed to incorporate new observing systems as the data becomes available.
Once the observations go through a fully automated quality control and thinning process, the increments, or corrections, are obtained by executing the 3D variational algorithm. The increments are then added to the 24-hour forecast fields using a 6-hourly incremental analysis update. An earlier version of the system is described in Garraffo et al. (2020).
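The idea behind an incremental analysis update can be sketched as follows: rather than adding the whole correction at once, the increment is spread in equal pieces over the time steps of a window while the model runs. This is a toy illustration with made-up values, a one-hour time step, and a default "model" that leaves the state unchanged; the function `apply_iau` is hypothetical and not the RTOFS implementation.

```python
def apply_iau(state, increment, window_hours=6.0, dt_hours=1.0, step_model=None):
    """Spread `increment` evenly over the window while stepping the model.

    state, increment: lists of floats (toy stand-ins for gridded fields).
    step_model: optional function (state, dt_hours) -> new state; the default
    is persistence, i.e. the state is left unchanged between updates.
    """
    n_steps = round(window_hours / dt_hours)
    if n_steps < 1:
        raise ValueError("window must cover at least one time step")
    # Each step receives an equal share of the total correction.
    per_step = [inc / n_steps for inc in increment]
    for _ in range(n_steps):
        if step_model is not None:
            state = step_model(state, dt_hours)
        state = [s + d for s, d in zip(state, per_step)]
    return state


background = [10.0, 12.0, 14.0]  # made-up forecast values
increment = [0.6, -0.3, 0.0]     # made-up analysis corrections
print(apply_iau(background, increment))
```

Applying the correction gradually avoids the shock that a sudden jump in the model state can cause; with the persistence default, the result simply approaches the background plus the full increment.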
Garraffo, Z.D., J.A. Cummings, S. Paturi, Y. Hao, D. Iredell, T. Spindler, B. Balasubramanian, I. Rivin, H-C. Kim, A. Mehra, 2020. Real Time Ocean-Sea Ice Coupled Three Dimensional Variational Global Data Assimilative Ocean Forecast System. In Research Activities in Earth System Modeling, edited by E. Astakhova, WMO, World Climate Research Program Report No.6, July 2020.
Cummings, J. A. and O. M. Smedstad, 2013. Variational Data Assimilation for the Global Ocean. In Data Assimilation for Atmospheric, Oceanic and Hydrologic Applications (Vol. II), edited by S. Park and L. Xu, Springer, Chapter 13, 303-343.
Global...
Details →
climate, coastal, disaster response, environmental, global, meteorological, oceans, water, weather
NOTICE - The Coast Survey Development Laboratory (CSDL) in NOAA/National Ocean Service (NOS)/Office of Coast Survey has upgraded the Surge and Tide Operational Forecast System (STOFS, formerly ESTOFS) to Version 2.1. A Service Change Notice (SCN) has been issued.
NOAA's Global Surge and Tide Operational Forecast System 2-D (STOFS-2D-Global) provides users with nowcasts (analyses of near present conditions) and forecast guidance of water level conditions for the entire globe. STOFS-2D-Global has been developed to serve the marine navigation, weather forecasting, and disaster mitigation user communities. STOFS-2D-Global was developed in a collaborative effort between the NOAA/National Ocean Service (NOS)/Office of Coast Survey, the NOAA/National Weather Service (NWS)/National Centers for Environmental Prediction (NCEP) Central Operations (NCO), the University of Notre Dame, the University of North Carolina, and The Water Institute of the Gulf. The model generates forecasts out to 180 hours four times per day; forecast output includes water levels caused by the combined effects of storm surge and tides, by astronomical tides alone, and by sub-tidal water levels (isolated storm surge).
The hydrodynamic model employed by STOFS-2D-Global is the ADvanced CIRCulation (ADCIRC) finite element model. The model is forced by GFS winds, mean sea level pressure, and sea ice. The unstructured grid used by STOFS-2D-Global consists of 12,785,004 nodes and 24,875,336 triangular elements. Coastal res...
Details →
coastal, geospatial, history, mapping, survey
Historical Charts are not for Navigation. The collection primarily consists of historic charts and maps produced by NOAA's Coast Survey and its predecessors, especially the U.S. Coast and Geodetic Survey and the U.S. Lake Survey (previously under the Department of War). The collection also includes bathymetric maps, land sketches, Civil War battle maps, aeronautical charting from the 1930s to the 1950s, and other drawings and photographs.
Details →
agriculture, climate, meteorological, weather
The last several hurricane seasons have been active, with records being set for the number of tropical storms and hurricanes in the Atlantic basin. These record-breaking seasons underscore the importance of accurate hurricane forecasting. Central to improving hurricane forecast skill is the development of the Hurricane Analysis and Forecast System (HAFS). To accelerate improvements in hurricane forecasting, this project has the following goals:
- To improve the HAFS. The HAFS is NOAA’s next-generation multi-scale numerical model, with data assimilation package and ocean coupling, which will provide an operational analysis and forecast out to seven days, with reliable and skillful guidance on hurricane track and intensity (including rapid intensification), storm size, genesis, storm surge, rainfall and tornadoes associated with hurricanes.
- To integrate into the Unified Forecast System (UFS). The UFS is a community-based, coupled, comprehensive Earth modeling system whose numerical applications span local to global domains and predictive time scales from sub-hourly analyses to seasonal predictions. It is designed to support the Weather Enterprise and to be the source system for NOAA’s operational numerical weather prediction applications. The HAFS will be a part of UFS geared for hurricane model applications. HAFS comprises five major components: (a) high-resolution moving nest, (b) high-resolution physics, (c) multi-scale data assimilation (DA), (d) 3D ocean coupling, and (e) observations to support the DA.
[Read about how the storm-following model improves intensity forecasts](https://www.aoml.noaa.gov/hurricane-model-that-follows-mult...
Details →
agriculture, climate, meteorological, weather
The NOAA NASA Joint Archive (NNJA) of Observations for Earth System Reanalysis is a curated joint observation archive containing Earth system data from 1979 to present prepared by teams at NOAA's Physical Sciences Laboratory and NASA's Global Modeling and Assimilation Office. The goal is to foster collaboration across organizations and develop the ability for direct comparison of Earth System reanalysis results. Providing a singular dataset for observation input use will allow reanalyses to be compared on their unique development qualities by removing the variation from using different...
Details →
bathymetry, earth observation, marine navigation, model, oceans
The National Bathymetric Source (NBS) project creates and maintains
high-resolution bathymetry composed of the best available data.
This project enables the creation of next-generation nautical charts
while also providing support for modeling, industry, science,
regulation, and public curiosity. Primary sources of bathymetry include
NOAA and U.S. Army Corps of Engineers hydrographic surveys and
topographic bathymetric (topo-bathy) lidar (light detection and ranging)
data. Data submitted through the NOAA Office of Coast Survey’s external
source data process are also included, with gaps...
Details →
agriculture, climate, cog, meteorological, weather
The National Blend of Models (NBM) is a nationally consistent and skillful suite of calibrated forecast guidance based on a blend of both NWS and non-NWS numerical weather prediction model data and post-processed model guidance. The goal of the NBM is to create a highly accurate, skillful and consistent starting point for the gridded forecast.
Details →
agriculture, climate, meteorological, weather
The North American Mesoscale Forecast System (NAM) is one of the National Centers for Environmental Prediction’s (NCEP) major models for producing weather forecasts. NAM generates multiple grids (or domains) of weather forecasts over the North American continent at various horizontal resolutions. Each grid contains data for dozens of weather parameters, including temperature, precipitation, lightning, and turbulent kinetic energy. NAM uses additional numerical weather models to generate high-resolution forecasts over fixed regions, and occasionally to follow significant weather events like hur...
Details →
agriculture, climate, meteorological, oceans, sustainability, weather
NOAA's Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and to what extent the land, oceans, atmosphere and ice sheets are changing. These datasets are thoroughly vetted time series measurements with the longevity, consistency, and continuity to assess and measure climate variability and change. NOAA CDRs are vetted using standards established by the National Research Council (NRC).
Climate Data Records are created by merging data from surface, atmosphere, and space-based systems across decades. NOAA’s Climate Data Records provide authoritative and traceable long-term climate records. NOAA developed CDRs by applying modern data analysis methods to historical global satellite data. This process can clarify the underlying climate trends within the data and allows researchers and other users to identify economic and scientific value in these records. NCEI maintains and extends CDRs by applying the same methods to present-day and future satellite measurements.
Oceanic Climate Data Records are measurements of oceans and seas both surface and subsurface as well as frozen st
...
Details →
bathymetry, earth observation, marine navigation, model, oceans
Founded in 1807, NOAA’s Office of Coast Survey is the nation’s first scientific agency and today is responsible for supporting nearly $5.4 trillion in economic activity through providing advanced marine navigation services. The Office of Coast Survey collects and qualifies hydrographic, bathymetric, and topographic data, from NOAA platforms and many other data providers. These data and associated deliverables are posted here for various users to access, including but not limited to the "National Bathymetric Source Program" for incorporation into compilations of the best available bat...
Details →
agriculture, climate, meteorological, weather
The Rapid Refresh (RAP) is a NOAA/NCEP operational weather prediction system comprised primarily of a numerical forecast model and analysis/assimilation system to initialize that model. It covers North America and is run with a horizontal resolution of 13 km and 50 vertical layers. The RAP was developed to serve users needing frequently updated short-range weather forecasts, including those in the US aviation community and US severe weather forecasting community. The model is run for every hour of the day; it is integrated to 51 hours for the 03/09/15/21 UTC cycles and to 21 hours for every ot...
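The cycle-dependent forecast length described above can be expressed as a small lookup. This sketch assumes the truncated sentence means that every cycle other than 03/09/15/21 UTC runs to 21 hours; the function name `rap_forecast_length_hours` is hypothetical.

```python
EXTENDED_CYCLES = {3, 9, 15, 21}  # UTC cycles integrated to 51 hours
EXTENDED_LENGTH_H = 51
STANDARD_LENGTH_H = 21            # assumed for all remaining hourly cycles


def rap_forecast_length_hours(cycle_hour_utc):
    """Return the forecast length in hours for a given hourly cycle (0-23 UTC)."""
    if not 0 <= cycle_hour_utc <= 23:
        raise ValueError("cycle hour must be between 0 and 23")
    if cycle_hour_utc in EXTENDED_CYCLES:
        return EXTENDED_LENGTH_H
    return STANDARD_LENGTH_H


for hour in (0, 3, 12, 21):
    print(f"{hour:02d} UTC cycle: {rap_forecast_length_hours(hour)} h")
```

A lookup like this is useful when deciding which cycle to request for a given lead time, since only four cycles per day extend beyond 21 hours.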
Details →
agriculture, climate, meteorological, weather
The Real-Time Mesoscale Analysis (RTMA) is a NOAA National Centers for Environmental Prediction (NCEP) high-spatial and temporal resolution analysis/assimilation system for near-surface weather conditions. Its main component is the NCEP/EMC Gridpoint Statistical Interpolation (GSI) system applied in two-dimensional variational mode to assimilate conventional and satellite-derived observations.
The RTMA was developed to support NDFD operations and provide field forecasters with high quality analyses for nowcasting, situational awareness, and forecast verification purposes. The system produces
...
Details →
bathymetry, hydrography, marine navigation, oceans, seafloor, water
S-102 is a data and metadata encoding specification that is part of the S-100 Universal Hydrographic Data Model, an international standard for hydrographic data exchange. This collection of data contains bathymetric surfaces from the NOAA/NOS/OCS National Bathymetric Source for various U.S. coastal and offshore waters and the Great Lakes. These datasets are encoded as HDF5 files conforming to the S-102 specification.
Details →
agriculture, climate, meteorological, weather
The Storm Events Database is an integrated database of severe weather events across the United States from 1950 to this year, with information about a storm event's location, azimuth, distance, impact, and severity, including the cost of damages to property and crops. It contains data documenting: The occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce. Rare, unusual, weather phenomena that generate media attention, such as snow flurries in South Florida or the S...
Details →
satellite imagery, space weather
The National Oceanic and Atmospheric Administration (NOAA) Geostationary Operational Environmental Satellite 19 (GOES-19) is the fourth and final satellite in the
Geostationary Operational Environmental Satellites (GOES) – R Series, the Western Hemisphere’s most sophisticated weather-observing and environmental monitoring
system.
The GOES-R Series provides advanced imagery and atmospheric measurements, real-time mapping of lightning activity, and space weather observations. As a part of the
Space Weather Follow On (SWFO) Mission, the GOES-19 spacecraft contains a Compact Coronagraph-1 (
...
Details →
climate, meteorological, solar, weather
Space weather forecast and observation data is collected and disseminated by NOAA’s Space Weather Prediction Center (SWPC) in Boulder, CO. SWPC produces forecasts for multiple space weather phenomenon types and the resulting impacts to Earth and human activities. A variety of products are available that provide these forecast expectations, and their respective measurements, in formats that range from detailed technical forecast discussions to NOAA Scale values to simple bulletins that give information in layman's terms. Forecasting is the prediction of future events, based on analysis and...
Details →
agriculture, climate, meteorological, sustainability, weather
NOAA's Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and to what extent the land, oceans, atmosphere and ice sheets are changing. These datasets are thoroughly vetted time series measurements with the longevity, consistency, and continuity to assess and measure climate variability and change. NOAA CDRs are vetted using standards established by the National Research Council (NRC).
Climate Data Records are created by merging data from surface, atmosphere, and space-based systems across decades. NOAA’s Climate Data Records provide authoritative and traceable long-term climate records. NOAA developed CDRs by applying modern data analysis methods to historical global satellite data. This process can clarify the underlying climate trends within the data and allows researchers and other users to identify economic and scientific value in these records. NCEI maintains and extends CDRs by applying the same methods to present-day and future satellite measurements.
Terrestrial CDRs are composed of sensor data that have been improved and quality controlled over time, together w
...
Details →
agriculture, climate, meteorological, weather
The NOAA Monthly U.S. Climate Gridded Dataset (NClimGrid) consists of four climate variables derived from the GHCN-D dataset: maximum temperature, minimum temperature, average temperature and precipitation. Each file provides monthly values in a 5x5 lat/lon grid for the Continental United States. Data is available from 1895 to the present. On an annual basis, approximately one year of "final" nClimGrid will be submitted to replace the initially supplied "preliminary" data for the same time period. Users should be sure to ascertain which level of data is required for their research.
EpiNOAA is an analysis ready dataset that consists of a daily time-series of nClimGrid measures (maximum temperature, minimum temperature, average temperature, and precipitation) at the county scale. Each file provides daily values for the Continental United States. Data are available from 1951 to the present. Daily data are updated every 3 days with a preliminary data file and replaced with the scaled (i.e., quality controlled) data file every three months. This derivative data product is an enhancement from the original daily nClimGrid dataset in that all four weather parameters are now p
...
Details →
climate, disaster response, elevation, geospatial, lidar, stac
The Unified Forecast System (UFS) is a community-based, coupled, comprehensive Earth Modeling System. The UFS Coastal application is a project under development by NOAA and NCAR, which supports coastal forecasting requirements based on UFS standards. The coupling infrastructure for UFS Coastal App is currently being developed based on a fork of the ufs-weather-model (UFS-WM), with additional coastal model-components including SCHISM, ADCIRC, ROMS, and FVCOM, as well as additional infrastructure to support coastal coupling of WW3 and CICE. The model-level repository contains the model code and external submodules needed to build the UFS coastal model executable and the associated model components.
The UFS Coastal Regression Test (RT) system is a form of testing built into the software development process that ensures that changes to the model-level code and associated model-components do not break the existing functionality of the code. The number and type of tests currently in the RT system suite are evolving along with current dependencies such as UFS-WM and ESMF libraries. Currently, at least one RT case exists for each coastal model. The status and descriptions of the existing RT cases are available via the UFS Coastal Wiki page. These are currently tested regularly on the NOAA/MSU Hercules platform, and less frequently on TACC/Frontera.
Each of the regression tests requires a set of input data files and configuration files. The configuration files include namelist and model configuration files which can be found within the UFS-Coastal model code repository. The ...
Details →
agriculture, climate, meteorological, weather
The NOAA Unified Forecast System (UFS) / Global Ensemble Forecast System version 13 (GEFSv13) Replay dataset supports the retrospective forecast archive in preparation for GEFSv13 / GFSv17. It includes a range of atmospheric and oceanic variables—such as temperature, humidity, winds, salinity, and currents—covering global conditions at a nominal horizontal resolution of ¼ degree, enabling detailed weather analysis.
The dataset was generated by replaying the coupled UFS model against pre-existing external reanalyses; ERA5 for atmospheric data and ORAS5 for ocean and ice dynamics. Each simulation stream was initialized from these reanalyses, which were pre-processed for the UFS model components, including the GFDL Finite-Volume Cubed-Sphere Dynamical Core (FV3; 25 km, 127 vertical levels) and the Modular Ocean Model (MOM6; ¼ degree tri-polar grid, 72 vertical levels). This replay methodology enforces a predetermined model state while allowing cross-component fluxes and unconstrained processes to be computed.
For the land surface, NOAA’s JEDI-based land data assimilation system incorporated snow depth observations from the NCEI Global Historical Climatology Network (GHCN) and satellite-derived snow cover from the U.S. National Ice Center. The JEDI Sea-ice Ocean and Coupled Analysis system (SOCA) adjusted sea-ice thickness and concentration for consistency with ORAS5.
The original dataset spanned January 1994 to October 2023, with plans for ongoing updates and a 1-degree version covering 1958 to 2023. The dataset is hosted on AWS and GCP clou
...
Details →
agriculture, climate, disaster response, environmental, meteorological, oceans, weather
The Unified Forecast System (UFS) is a community-based, coupled, comprehensive Earth Modeling System. The Hierarchical Testing Framework (HTF) serves as a comprehensive toolkit designed to enhance the testing capabilities within UFS repositories. It aims to standardize and simplify the testing process across various UFS Weather Model (WM) components and associated modules, aligning with the Hierarchical System Development (HSD) approach and NOAA baseline operational metrics.
The HTF provides a structured methodology for test case design and execution, which enhances code management practices, fosters user accessibility, and promotes adherence to established testing protocols. It enables developers to conduct testing efficiently and consistently, ensuring code integrity and reliability through the use of established technologies such as CMake and CTest. When integrated with containerization techniques, the HTF facilitates portability of test cases and promotes reproducibility across different computing environments. This approach reduces the computational overhead and enhances collaboration within the UFS community by providing a unified testing framework.
Acknowledgment - The Unified Forecast System (UFS) atmosphere-ocean coupled model...
Details →
agriculture, climate, meteorological, weather
The Unified Forecast System (UFS) is a community-based, coupled, comprehensive Earth modeling system. It supports multiple applications covering different forecast durations and spatial domains. The Land Data Assimilation (DA) System is an offline version of the Noah Multi-Physics (Noah-MP) land surface model (LSM) used in the UFS Weather Model (WM). Its data assimilation framework uses [Joint Effort for Data assimilation Integration (JEDI)](https://www.jcsda.org/jcsda-project-jedi) software. The offline Noah-MP LSM is a stand-alone, uncoupled model used to execute land surface simulations. In this traditional uncoupled mode, near-surface atmospheric forcing data is required as input. Sample forcing and restart data are provided in this data bucket.
The Noah-MP LSM has evolved through community efforts to pursue and refine a modern-era LSM suitable for use in the National Centers for Environmental Prediction (NCEP) operational weather and climate prediction models. This collaborative effort continues with participation from entities such as NCAR, NCEP, NASA, and university groups.
For details regarding the physical parameterizations used in Noah-MP, see [Niu et al. (2011)](https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2010JD015139). The [Land DA User’s Guide](https://land-da.readthedocs.io/en/latest/) provides information on building and running the Land DA System in offline mode. Users can access additional technical support via the [UFS GitHub Discussions](https://github.com/NOAA-EPIC/land-offline_workflow/discussions) for the L...
Details →
agriculture, climate, meteorological, weather
The NOAA UFS Marine Reanalysis is a global sea ice ocean coupled reanalysis product produced by the marine data assimilation team of the UFS Research-to-Operation (R2O) project. Underlying forecast and data assimilation systems are based on the UFS model prototype version-6 and the Next Generation Global Ocean Data Assimilation System (NG-GODAS) release of the Joint Effort for Data assimilation Integration (JEDI) Sea Ice Ocean Coupled Assimilation (SOCA). Covering the 40 year reanalysis time period from 1979 to 2019, the data atmosphere option of the UFS coupled global atmosphere ocean sea ice (DATM-MOM6-CICE6) model was applied with two atmospheric forcing data sets: CFSR from 1979 to 1999 and GEFS from 2000 to 2019. Assimilated observation data sets include extensive space-based marine observations and conventional direct measurements of in situ profile data sets.
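The split between the two atmospheric forcing data sets can be written as a simple year lookup, which may help when relating a given reanalysis file to its forcing. The function name `atmospheric_forcing` is hypothetical; only the year ranges come from the description above.

```python
FORCING_PERIODS = (
    (1979, 1999, "CFSR"),  # first forcing period stated in the description
    (2000, 2019, "GEFS"),  # second forcing period stated in the description
)


def atmospheric_forcing(year):
    """Return the name of the forcing data set applied for a reanalysis year."""
    for first, last, name in FORCING_PERIODS:
        if first <= year <= last:
            return name
    raise ValueError(f"{year} is outside the 1979-2019 reanalysis period")


for year in (1979, 1999, 2000, 2019):
    print(year, atmospheric_forcing(year))
```

Because the forcing source changes between 1999 and 2000, time series that span that boundary may deserve a check for discontinuities.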
This first UFS-marine interim reanalysis product is released to the broader weather and earth system modeling and analysis communities to obtain scientific feedback and applications for the development of the next generation operational numerical weather prediction system at the National Weather Service (NWS). The released file sets include two parts: (1) 1979-2019 UFS-DATM-MOM6-CICE6 model free runs and (2) 1979-2019 reanalysis cycle outputs (see descriptions embedded in each file set). Analyzed sea ice and ocean variables are ocean temperature, salinity, sea surface height, and sea ice conce
...
Details →
agriculture, climate, meteorological, weather
The Unified Forecast System (UFS) is a community-based, coupled, comprehensive Earth Modeling System. It supports multiple applications with different forecast durations and spatial domains. The UFS Short-Range Weather (SRW) Application is one of these applications. It targets predictions of atmospheric behavior on a limited spatial domain and on time scales from minutes to several days. The SRW Application includes a prognostic atmospheric model, pre-processor, post-processor, and community workflow for running the system end-to-end. The SRW Application User's Guide includes information on these components and provides detailed instructions on how to build and run the SRW Application. Users can access additional technical support via the UFS GitHub Discussions.
This data registry contains the data required to run the “out-of-the-box” SRW Application case. The SRW App requires numerous input files to run, including static datasets (fix files containing climatological information, terrain and land use data), initial condition data files, lateral boundary condition data files, and model configuration files (such as namelists). The SRW App experiment generation system also contains a set of workflow end-to-end (WE2E) tests that exercise various configurations of the system (e.g., different grids, physics suites). Data for running a subset of these WE2E tests are also included within this registry.
Users can generate forecasts for dates not included in this data registry by downloading and manually adding raw model files for the desired dates. Many of these model files are publicly available and can be accessed via links on the "Developmental Testbed Center" webs...
Details →
agriculture, climate, meteorological, weather
The Unified Forecast System (UFS) is a community-based, coupled, comprehensive Earth Modeling System. The ufs-weather-model (UFS-WM) is the model source of the UFS for NOAA’s operational numerical weather prediction applications. The UFS-WM Regression Test (RT) is the testing software to ensure that previously developed and tested capabilities in UFS-WM still work after code changes are integrated into the system. It is required that UFS-WM RTs are performed successfully on the required Tier-1 platforms whenever code changes are made to the UFS-WM. The results of the UFS-WM RTs are summarized in log files and these files will be committed to the UFS-WM repository along with the code changes. Currently, the UFS-WM RTs have been developed to support several applications targeted for operational implementations including the global weather forecast, subseasonal to seasonal forecasts, hurricane forecast, regional rapid refresh forecast, and ocean analysis.
At this time, there are 123 regression tests to support the UFS applications. The tests are evolving along with the development merged to the UFS-WM code repository. The regression test framework has been developed in the UFS-WM to run these tests on Tier-1 supported systems. Each of the regression tests requires a set of input data files and configuration files. The configuration files include namelist and model configuration files residing within the UFS-WM code repository. The input data includes initial conditions, climatology data, and fixed data sets such as orographic data and grid sp
...
Details →
climate, meteorological, solar, weather
The WSA-Enlil heliospheric model provides critical information regarding the propagation of solar Coronal Mass Ejections (CMEs) and transient structures within the heliosphere. Two distinct models comprise the WSA-Enlil modeling system: (1) the Wang-Sheeley-Arge (WSA) semi-empirical solar coronal model, and (2) the Enlil magnetohydrodynamic (MHD) heliospheric model. MHD modeling of the full domain (solar photosphere to Earth) is extremely computationally demanding due to the large parameter space and resulting characteristic speeds within the system. To reduce the computational burden and improve the timeliness (and hence the utility in forecasting space weather disturbances) of model results, the domain of the MHD model (Enlil) is limited to the region from 21.5 solar radii (R_s) to just beyond the orbit of Earth, while the inner portion, spanning from the solar photosphere to 21.5 R_s, is characterized by the WSA model. This coupled modeling system is driven by solar synoptic maps composed of numerous magnetogram observations from the National Solar Observatory’s (NSO) Global Oscillation Network Group (GONG). Such maps provide a full surface description of solar photospheric magnetic flux density, while not accounting for the evolution of surface features for regions outside the view of the observatories.
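To put the 21.5 R_s interface in more familiar units, the sketch below converts solar radii to astronomical units and reports which component covers a given heliocentric distance. The physical constants are the IAU nominal solar radius and the defined astronomical unit; the function names are hypothetical, and assigning the exact boundary to WSA is an arbitrary choice made here for illustration.

```python
SOLAR_RADIUS_M = 6.957e8   # IAU nominal solar radius, in meters
AU_M = 1.495978707e11      # astronomical unit, in meters
BOUNDARY_RS = 21.5         # WSA/Enlil interface stated in the description


def rs_to_au(r_rs):
    """Convert a distance in solar radii to astronomical units."""
    return r_rs * SOLAR_RADIUS_M / AU_M


def model_for_distance(r_rs):
    """Return which component covers heliocentric distance r_rs (solar radii).

    The outer edge of the Enlil domain ("just beyond the orbit of Earth") is
    not specified in the description, so no upper bound is enforced here.
    """
    if r_rs < 1.0:
        raise ValueError("distance is inside the photosphere")
    # The boundary itself is assigned to WSA (arbitrary illustrative choice).
    return "WSA" if r_rs <= BOUNDARY_RS else "Enlil"


print(f"{BOUNDARY_RS} R_s is about {rs_to_au(BOUNDARY_RS):.3f} AU")
print(model_for_distance(10.0), model_for_distance(215.0))
```

The conversion shows that the handoff between the two models sits at roughly one tenth of the Sun-Earth distance, so Enlil covers the great majority of the path a CME travels to Earth.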
In its current configuration (NOAA WSA-Enlil V3.0), the modeling system consists of WSA V5.4 and Enlil V2.9e. The system relies upon the zero point corrected GONG synoptic maps (mrzqs) to define the inner photospheric boundary.
The operational data files provided in this bucket include NetCDF files containing 3-dimensional gridded neutral density from 100 to 1000 km, Total Electron Content (TEC), and Maximum Usable Frequency (MUF).
The full 3D datasets from the operational model are provided here as compressed tar files with naming convention wsa_enlil.mrid########.full3d.tgz. These files consist of the full set of 3D datacubes (tim..nc), all time series results stored at predefined observation points (evo..nc), and supplemented by the operational CME fits (conefiles) and the operationally...
Details →
climate, meteorological, solar, weather
The coupled Whole Atmosphere Model-Ionosphere Plasmasphere Electrodynamics (WAM-IPE) Forecast System (WFS) is developed and maintained by the NOAA Space Weather Prediction Center (SWPC). The WAM-IPE model provides a specification of ionosphere and thermosphere conditions with real-time nowcasts and forecasts up to two days in advance in response to solar, geomagnetic, and lower atmospheric forcing. The WAM is an extension of the Global Forecast System (GFS) with a spectral hydrostatic dynamical core utilizing an enthalpy thermodynamic variable to 150 vertical levels on a hybrid pressure-sigma grid, with a model top of approximately 3 × 10⁻⁷ Pa (typically 400-600 km depending on levels of solar activity). Additional upper atmospheric physics and chemistry, including electrodynamics and plasma processes, are included. The IPE model provides the plasma component of the atmosphere. It is a time-dependent, global 3D model of the ionosphere and plasmasphere from 90 km to approximately 10,000 km. WAM fields of winds, temperature, and molecular and atomic atmospheric composition are coupled to IPE to enable the plasma to respond to changes driven by the neutral atmosphere.
The operational WAM-IPE is currently running in two different Concepts of Operation (CONOPS) to produce results of Nowcast and Forecast. The WAM-IPE real-time nowcast system (WRS) ingests real-time solar wind parameters every 5 minutes from NOAA’s spacecrafts located at Lagrange point 1 (L1) between the Sun and Earth in order to capture rapid changes in the ionosphere and thermosphere due to the sudden onset of geomagnetic storms. The nowcast is reinitialized every six hours to include the latest forcing from the lower atmosphere. The forecast system (WFS) runs four times daily (0, 6, 12, 18 UT), providing two-day forecasts. Observed solar wind parameters are used whenever observational values are available, for the forecast portion, the forecasted 3-hour Kp and daily F10.7 issued by SWFO are ingested into the model to estimate solar wind parameters. Lower atmospheric data assimilation only carries out twice daily at 0 and 12 UT cycles to maintain the stability of the coupled model. Model version v1.2 became available in July 2023, featuring the implementation of the WRS into operations, as well as improvements to the Kp-derived solar wind parameters utilized by WFS forecasts.
The data files within this bucket are provided strictly on a non-operational basis, with no guarantee of timely delivery or availability. There may exist temporal gaps in coverage.
The top-level versioned directories (v1.x) include NetCDF files from operational runs providing 3-dimensional gridded neutral density every 10 minutes with an altitude range from 100 to 1000 km. wfs.YYYYMMDD subdirectories contain two-day forecasts, updated four times daily (cycle initialization 00, 06, 12, 18 UT). wrs.YYYYMMDD subdirectories contain real-time nowcast neutral density outputs, reinitialized ever
...
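The directory layout described above can be sketched in a few lines. This is an illustration only: it assumes the `wfs.YYYYMMDD` and `wrs.YYYYMMDD` subdirectories sit directly under a versioned directory such as `v1.2`, and the function names are hypothetical. No bucket name is assumed.

```python
from datetime import date
from typing import List

# Cycle initialization hours stated in the description (UT).
CYCLE_HOURS = (0, 6, 12, 18)


def run_directories(version: str, day: date) -> List[str]:
    """Build the forecast (wfs) and nowcast (wrs) directory names for one day.

    Assumption: the subdirectories sit directly under the versioned
    directory, e.g. "v1.2/wfs.20230715".
    """
    stamp = day.strftime("%Y%m%d")
    return [f"{version}/{kind}.{stamp}" for kind in ("wfs", "wrs")]


def cycle_labels() -> List[str]:
    """Return the four daily cycle hours as zero-padded strings."""
    return [f"{hour:02d}" for hour in CYCLE_HOURS]
```

Given the stated caveat about temporal gaps, a caller would still need to check that each generated directory actually exists before reading from it.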
Details →
genetic, genomic, life sciences, whole genome sequencing
This dataset includes the sequencing and assembly of a reference standard human genome (GM12878) using the MinION nanopore sequencing instrument with the R9.4 1D chemistry.
Details →
computer vision, image processing, imaging, life sciences, machine learning, magnetic resonance imaging, neuroimaging, neuroscience, nifti
Here, we collected and pre-processed a massive, high-quality 7T fMRI dataset that can be used to advance our understanding of how the brain works. A unique feature of this dataset is the massive amount of data available per individual subject. The data were acquired using ultra-high-field fMRI (7T, whole-brain, 1.8-mm resolution, 1.6-s TR). We measured fMRI responses while each of 8 participants viewed 9,000–10,000 distinct, color natural scenes (22,500–30,000 trials) in 30–40 weekly scan sessions over the course of a year. Additional measures were collected including resting-state data, retin...
Details →
electricity, energy, hydrography
The ONS Open Data Portal, produced by the National Operator of the Electric System (ONS), gathers historical data from the Brazilian electricity sector in an easy and democratic way, with the main objective of facilitating and improving access to and consumption of this type of content by all users and audiences. The Portal gathers consolidated data on energy generation (hydraulic, thermal, solar and wind), areas of national and international energy exchange and energy load; information about equipment such as transmission lines, generating units, converters and others; and hydrological data obta...
Details →
image processing, machine learning
A dataset of all images of Open Food Facts, the biggest open dataset of food products in the world.
Details →
internet
A free software, global observation network for detecting censorship, surveillance and traffic manipulation on the internet.
Details →
biology, imaging, life sciences, neurobiology, neuroimaging
OpenNeuro is a database of openly-available brain imaging data. The data are shared according to a Creative Commons CC0 license, providing a broad range of brain imaging data to researchers and citizen scientists alike. The database primarily focuses on functional magnetic resonance imaging (fMRI) data, but also includes other imaging modalities including structural and diffusion MRI, electroencephalography (EEG), and magnetoencephalography (MEG). OpenfMRI is a project of the Center for Reproducible Neuroscience at Stanford University. Development of the OpenNeuro resource has been funded by th...
Details →
archives, life sciences, pharmaceutical, text analysis, txt
The OIDA Data on AWS contain the metadata, documents, and extracted text for all of the documents in the UCSF-JHU Opioid Industry Documents Archive, a growing corpus of internal corporate records and other documents arising from the opioid industry.
Details →
geospatial, mapping
Horizontal and vertical adjustment datasets for coordinate transformation to be used by PROJ 7 or later. PROJ is a generic coordinate transformation software that transforms geospatial coordinates from one coordinate reference system (CRS) to another. This includes cartographic projections as well as geodetic transformations.
Details →
astronomy
The PS1 surveys used a 1.8 meter telescope and its 1.4 Gigapixel camera to image the sky in five broadband filters. The largest of these surveys provides coverage of the entire sky north of -30 degrees declination, with approximately 10 observation epochs across 3 years in each filter.
Details →
biology, life sciences
PhysioNet offers free web access to large collections of recorded physiologic signals (PhysioBank) and related open-source software (PhysioToolkit).
Details →
machine translation, natural language processing
ParaCrawl is a set of large parallel corpora to/from English for all official EU languages by a broad web crawling effort. State-of-the-art methods are applied for the entire processing chain from identifying web sites with translated text all the way to collecting, cleaning and delivering parallel corpora that are ready as training data for CEF.AT and translation memories for DG Translation.
Details →
breast cancer, cancer, computer vision, csv, labeled, life sciences, machine learning, mammography, medical image computing, medical imaging, radiology
According to the WHO, breast cancer is the most commonly occurring cancer worldwide. In 2020 alone, there were 2.3 million new breast cancer diagnoses and 685,000 deaths. Yet breast cancer mortality in high-income countries has dropped by 40% since the 1980s when health authorities implemented regular mammography screening in age groups considered at risk. Early detection and treatment are critical to reducing cancer fatalities, and your machine learning skills could help streamline the process radiologists use to evaluate screening mammograms. Currently, early detection of breast cancer requi...
Details →
earth observation, meteorological, natural resource, weather
The Servicio Meteorológico Nacional de Argentina (SMN-Arg), the National Meteorological Service of Argentina, shares its deterministic forecasts generated with WRF 4.0 (Weather Research and Forecasting), initialized at 00 and 12 UTC every day.
This forecast includes some key hourly surface variables (2 m temperature, 2 m relative humidity, 10 m wind magnitude and direction, and precipitation), along with daily minimum and maximum temperature.
The forecast covers Argentina, Chile, Uruguay, Paraguay and parts of Bolivia and Brazil in a Lambert conformal projection, with 4 km
...
Details →
cultural preservation, internet, ukraine
The dataset contains web archives of Open Access collections of digitised cultural heritage from more than 3,000 websites of Ukrainian cultural institutions, such as museums, libraries or archives. The web archives have been produced by SUCHO, a volunteer group of more than 1,300 international cultural heritage professionals (librarians, archivists, researchers, programmers) who have joined forces to save as much digitised cultural heritage as possible during the 2022 invasion of Ukraine, before the servers hosting it are destroyed, damaged or go offline for any other reason. The web archives...
Details →
art, culture, encyclopedic, history, museum
The Smithsonian’s mission is the "increase and diffusion of knowledge", and the institution has been collecting since 1846. The Smithsonian, through its efforts to digitize its multidisciplinary collections, has created millions of digital assets and related metadata describing the collection objects. On February 25th, 2020, the Smithsonian released over 2.8 million CC0 interdisciplinary 2-D and 3-D images, related metadata, and additionally, research data from researchers across the Smithsonian. The 2.8 million "open access" collections are a subset of the Smithsonian’s 155 million objects,...
Details →
amino acid, bioinformatics, chemical biology, genomic, graph, metagenomics, microbiome, pharmaceutical, protein
Precomputed SocialGene Neo4j graph databases of various sizes built from RefSeq genomes and MIBiG BGCs.
Details →
Usage examples
See 3 usage examples →
digital preservation, free software, open source software, source code
Software Heritage is the largest
existing public archive of software source code and accompanying
development history. The Software Heritage Graph Dataset is a fully
deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code
directories, Version Control System (VCS) commits tracking evolution over
time, up to the full states of VCS repositories as observed by Software
Heritage during periodic crawls. The dataset’s contents come from major
development forges (including GitHub and GitLab), FOSS distributions (e.g.,
Deb...
Details →
machine learning
1,076 textbook lessons, 26,260 questions, 6,229 images
Details →
genetic, genomic, life sciences
The Genome Institute at Washington University has developed a high-throughput, fault-tolerant analysis information management system called the Genome Modeling System (GMS), capable of executing complex, interdependent, and automated genome analysis pipelines at a massive scale. The GMS framework provides detailed tracking of samples and data coupled with reliable and repeatable analysis pipelines. GMS includes a full system image with software and services, expandable from one workstation to a large compute cluster.
Details →
commerce, computer vision, deep learning, graph, information retrieval, internet, machine learning, natural language processing
A collection of 51,701 product pages from 8,175 e-commerce websites across 8 markets (US, GB, SE, NL, FI, NO, DE, AT) with 5 manually labelled elements: the product price, name and image, and the add-to-cart and go-to-cart buttons.
The dataset was collected between 2018 and 2019 and is made available as MHTML and as WebTraversalLibrary-format snapshots.
Details →
computer vision, machine learning, machine translation, natural language processing
MMID is a large-scale, massively multilingual dataset of images paired with the words they represent collected at the University of Pennsylvania.
The dataset is doubly parallel: for each language, words are stored parallel to images that represent the word, and parallel to the word's translation into English (and corresponding images.)
Details →
cancer, life sciences, magnetic resonance imaging, medical imaging, medicine, radiology
The University of California San Francisco Brain Metastases Stereotactic Radiosurgery (UCSF-BMSR) dataset is a public, clinical, multimodal brain MRI dataset consisting of 560 brain MRIs from 412 patients with expert annotations of 5,136 brain metastases. The data consist of registered and skull-stripped T1 post-contrast, T1 pre-contrast, FLAIR and subtraction (T1 pre-contrast - T1 post-contrast) images and voxelwise segmentations of enhancing brain metastases in NIfTI format.
Details →
bioinformatics, biology, genetic, genomic, life sciences
The UCSC Genome Browser is an online graphical viewer for genomes, a genome browser, hosted by the University of California, Santa Cruz (UCSC). The interactive website offers access to genome sequence data from a variety of vertebrate and invertebrate species and major model organisms, integrated with a large collection of aligned annotations. This dataset is a copy of the MySQL tables in MyISAM binary and tab-separated format and all binary files in custom formats, sometimes referred to as 'gbdb'-files. Data from the UCSC Genome Browser is free and open for use by anyone. However, every genome...
Details →
biology, chemical biology, life sciences, pharmaceutical
Collection of 7 billion small molecules in SMILES notation with 28 billion fingerprints, including MACCS, ECFP4, FCFP4, and PubChem, with pre-constructed USearch indexes over them.
Details →
earth observation, geospatial, image processing, satellite imagery, stac, synthetic aperture radar
Umbra satellites generate the highest resolution Synthetic Aperture Radar (SAR) imagery ever offered from space, up to 16-cm resolution. SAR can capture images at night, through cloud cover, smoke and rain. SAR is unique in its abilities to monitor changes. The Open Data Program (ODP) features over twenty diverse time-series locations that are updated frequently, allowing users to experiment with SAR's capabilities. We offer single-looked spotlight mode in either 16cm, 25cm, 35cm, 50cm, or 1m resolution, and multi-looked spotlight mode. The ODP also features an assorted collection of over ...
Details →
agriculture, biodiversity, bioinformatics, biology, food security, genetic, genomic, life sciences, whole genome sequencing
This dataset captures Sunflower's genetic diversity originating
from thousands of wild, cultivated, and landrace sunflower
individuals distributed across North America. The data consists of raw sequences and associated botanical metadata,
aligned sequences (to three different reference genomes), and sets of
SNPs computed across several cohorts.
Details →
activity detection, agriculture, cog, disaster response, earth observation, environmental, geospatial, image processing, land cover, natural resource, satellite imagery, stac
The Venµs science mission is a joint research mission undertaken by CNES and ISA,
the Israel Space Agency. It aims to demonstrate the effectiveness of high-resolution multi-temporal observation optimised through
Copernicus, the global environmental and security monitoring programme. Venµs was launched from the Centre Spatial Guyanais by a
VEGA rocket during the night of 1-2 August 2017. Thanks to its multispectral camera (12 spectral bands in the visible
and near-infrared ranges, with spectral characteristics provided here), it
acquires imagery every 1-2 days over 100+ areas at...
Details →
The electrocardiogram (ECG) is a non-invasive representation of the electrical activity of the heart. Although the twelve-lead ECG is the standard diagnostic screening system for many cardiological issues, the limited accessibility of twelve-lead ECG devices provides a rationale for smaller, lower-cost, and easier to use devices. While single-lead ECGs are limiting [1], reduced-lead ECG systems hold promise, with evidence that subsets of the standard twelve leads can capture useful information [2], [3], [4] and even be comparable to twelve-lead ECGs in some limited contexts. In 2017 we challen...
Details →
machine learning, natural language processing
ZEST is a benchmark for zero-shot generalization to unseen NLP tasks, with 25K labeled instances across 1,251 different tasks.
Details →
biodiversity, bioinformatics, conservation, earth observation, life sciences
iNaturalist is a community science effort in which participants share observations of living organisms that they encounter and document with photographic evidence, location, and date. The community works together reviewing these images to identify these observations to species. This collection represents the licensed images accompanying iNaturalist observations.
Details →
genetic maps, life sciences, population genetics, recombination maps, simulations
Contains all resources (genome specifications, recombination maps, etc.) required for species-specific simulation with the stdpopsim package. These resources originally come from a variety of consortia and published works but are consolidated here for ease of access and use. If you are interested in adding a new species to the stdpopsim resource, please raise an issue on the stdpopsim GitHub page to have the necessary files added here.
Details →
internet, japanese, natural language processing, web archive
A large Japanese language corpus created through preprocessing Common Crawl data
Details →
Usage examples
See 2 usage examples →
blockchain, web3
The AWS Public Blockchain Data provides free access to blockchain datasets. Data is transformed into multiple tables as compressed Parquet files, partitioned by date, to allow efficient access for most common analytics queries.
Datasets
Blockchain dataset | Maintained by | Path |
Bitcoin | AWS | s3://aws-public-blockchain/v1.0/btc/ |
Ethereum | AWS | s3://aws-public-blockchain/v1.0/eth/ |
Arbitrum | SonarX | s3://aws-public-blockchain/v1.1/sonarx/arbitrum/ |
Aptos | SonarX | s3://aws-public-blockchain/v1.1/sonarx/aptos/ |
Base | SonarX | s3://aws-public-blockchain/v1.1/sonarx/base/ |
Provenance | SonarX | s3://aws-public-blockchain/v1.1/sonarx/provenance/ |
XRP Ledger | SonarX | s3://aws-public-blockchain/v1.1/sonarx/xrp/ |
For full datasets, with support and real-time updates, please visit
SonarX....
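Because the tables are partitioned by date, a reader usually only needs the prefix for one day. The following is a minimal sketch of building such a prefix from the paths in the table above. The Hive-style `date=YYYY-MM-DD` folder name, the table name passed in (for example `blocks`), and the function name `partition_prefix` are assumptions for illustration, not confirmed by this listing.

```python
from datetime import date

# Dataset roots copied from the table above (subset).
DATASET_ROOTS = {
    "btc": "s3://aws-public-blockchain/v1.0/btc/",
    "eth": "s3://aws-public-blockchain/v1.0/eth/",
    "arbitrum": "s3://aws-public-blockchain/v1.1/sonarx/arbitrum/",
}


def partition_prefix(dataset: str, table: str, day: date) -> str:
    """Build the S3 prefix for one table and one day.

    Assumption: each table uses Hive-style 'date=YYYY-MM-DD' partition folders.
    """
    root = DATASET_ROOTS[dataset]
    return f"{root}{table}/date={day.isoformat()}/"
```

Reading a single day's prefix rather than the whole table is what lets a date-partitioned Parquet layout keep common analytics queries cheap.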
Details →
Usage examples
See 2 usage examples →
amazon.science, computer vision, machine learning
The Amazon Bin Image Dataset contains over 500,000 images and metadata from bins of a pod in an operating Amazon Fulfillment Center. The bin images in this dataset are captured as robot units carry pods as part of normal Amazon Fulfillment Center operations.
Details →
Usage examples
See 2 usage examples →
biology, marine mammals, oceans
Since 2007, the Integrated Marine Observing System’s Animal Tracking Facility (formerly known as the Australian Animal Tracking And Monitoring System (AATAMS)) has established a permanent array of acoustic receivers around Australia to detect the movements of tagged marine animals in coastal waters. Simultaneously, the Animal Tracking Facility developed a centralised national database (https://animaltracking.aodn.org.au/) to encourage collaborative research across the Australian research community and provide unprecedented opportunities to monitor broad-scale animal movements. The resulting da...
Details →
Usage examples
See 2 usage examples →
climate, earth observation, environmental, geoscience, geospatial
The Pacific Mangroves beta version product is an extension of
the Global Mangrove Watch (GMW v3, 2020), which shows the
extent of mangrove ecosystems across Pacific Island Countries
and Territories (PICTs). Mangrove extent was
further classified into three categories: closed (high-density),
open (lower density) and non-mangrove. This was used as the
baseline training layer from which mangrove categories between
2016 and 2022 were analysed.
Details →
Usage examples
See 2 usage examples →
earth observation, environmental, geoscience, geospatial, water
The Water Observations from Space (WOfS) beta version product
is an annual summary of the temporal
and spatial extent of surface water over landscapes. In essence, it
highlights where water is usually present and where it is rarely present. The results
are visualised to compare points in time spanning a year, a season
or multiple years. The dataset extends back historically to 2013.
Details →
Usage examples
See 2 usage examples →
age, approximate monte carlo, approximate monte carlo replicates, census, demographic and housing characteristics file, dhc, differential privacy, disclosure avoidance, ethnicity, group quarters, hispanic, household type, housing, housing units, latino, microdata, noisy measurements, population, race, redistricting, relation-to-householder, single year of age, voting age
The 2010 Census Production Settings Demographic and Housing Characteristics (DHC) Approximate Monte Carlo (AMC) method seed Privacy Protected Microdata File (PPMF0) and PPMF replicates (PPMF1, PPMF2, ..., PPMF25) are a set of microdata files intended for use in estimating the magnitude of error(s) introduced by the 2020 Decennial Census Disclosure Avoidance System (DAS) into the Redistricting and DHC products. The PPMF0 was created by executing the 2020 DAS TopDown Algorithm (TDA) using the confidential 2010 Census Edited File (CEF) as the initial input; the replicates were then created by executing the 2020 DAS TDA repeatedly with the PPMF0 as its initial input. Inspired by analogy to the use of bootstrap methods in non-private contexts, U.S. Census Bureau (USCB) researchers explored whether simple calculations based on comparing each PPMFi to the PPMF0 could be used to reliably estimate the scale of errors introduced by the 2020 DAS, and generally found this approach worked well.
The PPMF0 and PPMFi files contained here are provided so that external researchers can estimate properties of DAS-introduced error without privileged access to internal USCB-curated data sets; further information on the estimation methodology can be found in Ashmead et al. 2024.
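One simple comparison of each PPMFi to the PPMF0 is a root-mean-square difference of a tabulated statistic. The sketch below illustrates that idea only; it is not necessarily the estimator used by Ashmead et al., and the function name `amc_rmse` and the example counts are hypothetical.

```python
import math
from typing import Sequence


def amc_rmse(seed_value: float, replicate_values: Sequence[float]) -> float:
    """Root-mean-square difference between replicate statistics and the seed.

    seed_value: a statistic (e.g. a population count) tabulated from PPMF0.
    replicate_values: the same statistic tabulated from each PPMFi.
    """
    if not replicate_values:
        raise ValueError("at least one replicate is required")
    squared = [(value - seed_value) ** 2 for value in replicate_values]
    return math.sqrt(sum(squared) / len(squared))
```

For instance, `amc_rmse(100, [103, 97, 104, 96])` averages the squared differences 9, 9, 16 and 16 and returns the square root of 12.5, treating the spread of the replicates around the seed as a proxy for the scale of DAS-introduced error.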
The 2010 DHC AMC seed PPMF0 and PPMF replicates have been cleared for public dissemination by the USCB Disclosure Review Board (CBDRB-FY24-DSEP-0002). The 2010 PPMF0 included in these files was produced using the same parameters and settings as were used to produce the 2010 Demonstration Data Product Suite (202...
Details →
Usage examples
-
An Approximate Monte Carlo Simulation Method for Estimating Uncertainty and Constructing Confidence Intervals for 2020 Census Statistics by Ashmead, R., Hawes, M. B., Pritts, M., Zhuravlev, P., Keller, S. A.
-
Estimating Confidence Intervals Using Approximate Monte Carlo Simulation Iterations (Jupyter Notebook) by Ashmead, R., Hawes, M. B., Pritts, M., Zhuravlev, P., Keller, S. A.
See 2 usage examples →
2020 census, age, approximate monte carlo, approximate monte carlo replicates, census, decennial census, demographic and housing characteristics file, dhc, differential privacy, disclosure avoidance, ethnicity, group quarters, hispanic, household type, housing, housing units, latino, microdata, noisy measurements, population, race, redistricting, relation-to-householder, single year of age, voting age
The 2020 Census Production Settings Demographic and Housing Characteristics (DHC) Approximate Monte Carlo (AMC) method seed Privacy Protected Microdata File (PPMF0) and PPMF replicates (PPMF1, PPMF2, ..., PPMF50) are a set of microdata files intended for use in estimating the magnitude of error(s) introduced by the 2020 Census Disclosure Avoidance System (DAS) into the 2020 Census Redistricting Data Summary File (P.L. 94-171), the Demographic and Housing Characteristics File, and the Demographic Profile.
The PPMF0 was the source of the publicly released, official 2020 Census data products referenced above, and was created by executing the 2020 DAS TopDown Algorithm (TDA) using the confidential 2020 Census Edited File (CEF) as the initial input; the official location for the PPMF0 is on the United States Census Bureau FTP server, but we also include a copy of it here for convenience. The replicates were then created by executing the 2020 DAS TDA repeatedly with the PPMF0 as its initial input.
Inspired by analogy to the use of bootstrap methods in non-private contexts, U.S. Census Bureau (USCB) researchers explored whether simple calculations based on comparing each PPMFi to the PPMF0 could be used to reliably estimate the scale of errors introduced by the 2020 DAS, and generally found this approach worked well.
The PPMF0 and PPMFi files contained here are provided so that external researchers can estimate properties of DAS-introduced error without privileged access to internal USCB-curated data sets; further information on the estimation methodology can be found in Ashmead et al. 2024.
The 2020 DHC AMC...
Details →
Usage examples
-
An Approximate Monte Carlo Simulation Method for Estimating Uncertainty and Constructing Confidence Intervals for 2020 Census Statistics by Ashmead, R., Hawes, M. B., Pritts, M., Zhuravlev, P., Keller, S. A.
-
Estimating Confidence Intervals Using Approximate Monte Carlo Simulation Iterations (Jupyter Notebook) by Ashmead, R., Hawes, M. B., Pritts, M., Zhuravlev, P., Keller, S. A.
See 2 usage examples →
bioinformatics, genome, genomic, Homo sapiens, life sciences, Mus musculus, non-human primate, open source software, Rattus norvegicus, variant annotation
GenomeKit is Deep Genomics’ Python library for fast and easy access to genomic resources such as sequence, data tracks, and annotations. The goal is to let machine learning researchers build data sets easily, and to be creative about how those data sets are designed. Out of the box, GenomeKit provides access to pre-built optimized genomic data files that are required for its operation.
Details →
Usage examples
-
Quickstart by Deep Genomics
-
An RNA foundation model enables discovery of disease mechanisms and candidate therapeutics by Albi Celaj, Alice Jiexin Gao, Tammy T.Y. Lau, Erle M. Holgersen, Alston Lo, Varun Lodaya, Christopher B. Cole, Robert E. Denroche, Carl Spickett, Omar Wagih, Pedro O. Pinheiro, Parth Vora, Pedrum Mohammadi-Shemirani, Steve Chan, Zach Nussbaum, Xi Zhang, Helen Zhu, Easwaran Ramamurthy, Bhargav Kanuparthi, Michael Iacocca, Diane Ly, Ken Kron, Marta Verby, Kahlin Cheung-Ong, Zvi Shalev, Brandon Vaz, Sakshi Bhargava, Farhan Yusuf, Sharon Samuel, Sabriyeh Alibai, Zahra Baghestani, Xinwen He, Kirsten Krastel, Oladipo Oladapo, Amrudha Mohan, Arathi Shanavas, Magdalena Bugno, Jovanka Bogojeski, Frank Schmitges, Carolyn Kim, Solomon Grant, Rachana Jayaraman, Tehmina Masud, Amit Deshwar, Shreshth Gandhi, Brendan J. Frey
See 2 usage examples →
earth observation, geoscience, geospatial
The GeoMAD is derived from Landsat surface reflectance data.
The data are masked for cloud, shadows and other image artefacts
using the associated pixel quality product to help provide as
clear a set of observations as possible from which to calculate the medians.
Details →
Usage examples
See 2 usage examples →
biology, chemical biology, chemistry, marine mammals, oceans
CTD (Conductivity-Temperature-Depth) Satellite Relay Data Loggers (CTD-SRDLs) are used to explore how marine animal behaviour relates to their oceanic environment. Loggers developed at the University of St Andrews Sea Mammal Research Unit transmit data in near real-time via the Argos satellite system. The data represented here were collected in the Southern Ocean from elephant, fur and Weddell seals. In 2024, data were added from flatback and olive ridley turtles, from a pilot study co-funded by the Royal Australian Navy in collaboration with the Australian Institute of Marine Science and Indigenous ...
Details →
Usage examples
See 2 usage examples →
chemistry, ocean velocity, oceans
The Integrated Marine Observing System (IMOS) has moorings across both its National Mooring Network and Deep Water Moorings facilities. The National Mooring Network facility comprises a series of national reference stations and regional moorings designed to monitor particular oceanographic phenomena in Australian coastal ocean waters. The Deep Water Moorings facility (formerly known as the Australian Bluewater Observing System) provides the coordination of national efforts in the sustained observation of open ocean properties with particular emphasis on observations important to climate and ...
Details →
Usage examples
See 2 usage examples →
life sciences, neuroimaging, transportation, workload analysis
Commercial pilot simulation data during safety-of-flight scenarios.
Details →
Usage examples
See 2 usage examples →
chemistry, oceans
This collection includes conductivity-temperature-depth (CTD) profiles obtained at the National Reference Stations (NRS) as part of the water sampling program. The instruments used also measure dissolved oxygen, fluorescence, and turbidity. The collection also includes practical salinity, water density and artificial chlorophyll concentration, as computed from the measured properties. The data are processed in delayed mode, with automated quality control applied. The National Reference Station network is designed to provide baseline information, at timescales relevant to human response, that i...
Details →
Usage examples
See 2 usage examples →
chemistry, ocean currents, ocean velocity, oceans
The Australian National Facility for Ocean Gliders (ANFOG), with IMOS/NCRIS funding, deploys a fleet of eight gliders around Australia. The data represented by this record are presented in delayed mode. The underwater ocean glider represents a technological revolution for oceanography. Autonomous ocean gliders can be built relatively cheaply, are controlled remotely and are reusable, allowing them to make repeated subsurface ocean observations at a fraction of the cost of conventional methods. The data retrieved from the glider fleet will contribute to the study of the major boundary current s...
Details →
Usage examples
See 2 usage examples →
ocean currents, ocean velocity, oceans
The Bonney Coast (BONC) HF ocean radar system covers an area of the Bonney Coast, South Australia, which has a recurring annual upwelling feature near to the coast that significantly changes the ecosystem from one of warm water originating in Western Australia, to one dominated by cold upwelling water from off the continental shelf. The dynamics of this area and the relationship between ocean circulation, chemistry and sediments control the larval species and the higher marine species and ecosystems in which they forage. The data from this site provide linking observations between the Southe...
Details →
Usage examples
See 2 usage examples →
ocean currents, ocean velocity, oceans
The Capricorn Bunker Group site is in the southern region of the Great Barrier Reef Marine Park World Heritage Area (GBR). The HF ocean radar coverage is from the coast to beyond the edge of the continental shelf. This is an area where the East Australian Current (EAC) meanders as it moves south from the Swain Reefs and loses touch with the western land boundary. The area is dynamic with warm EAC water recirculating and being wind-driven northwards along the coast inside the GBR lagoon. The recirculating warm water contrasts with the upwelling tendency of the parts of the EAC which contin...
Details →
Usage examples
See 2 usage examples →
ocean currents, ocean velocity, oceans
The Coffs Harbour (COF) HF ocean radar site is located near the point at which the East Australian Current (EAC) begins to separate from the coast. Here the EAC is at its narrowest and swiftest: to the north it is forming from the westwards subtropical jet, and to the south it forms eddies and eventually the warm water moves eastwards across the Tasman Sea, forming a front with the cold water of the Southern Ocean. The connection between coastal and continental shelf waters is fundamental to the understanding of the anthropogenic impact on the coastal ocean and the role of the ocean in mitig...
Details →
Usage examples
See 2 usage examples →
ocean currents, ocean velocity, oceans
The Coral Coast (CORL) HF ocean radar system covers an area of the Western Australia Coast, Western Australia, an area subject to the variability of the Leeuwin Current (LC) and its coupling with coastal winds, tides, and waves. In this area the LC generates several eddies which control the larval species and the higher marine species and ecosystems in which they forage. The CORL HF ocean radar system consists of two SeaSonde crossed loop direction finding stations located at Dongara (29.283 S 114.920 E) and Green Head (30.073 S 114.967 E). These radars operate at a frequency of 4.463 MHz, with...
Details →
Usage examples
See 2 usage examples →
ocean currents, ocean velocity, oceans
The Newcastle (NEWC) HF ocean radar system covers an area of the Central Coast, New South Wales, an area subject to the variability of the East Australian Current (EAC) and its coupling with coastal winds, tides, and waves. In this area the EAC separates from the coast and generates several eddies which control the larval species and the higher marine species and ecosystems in which they forage. The NEWC HF ocean radar system consists of two SeaSonde crossed loop direction finding stations located at Sea Rocks (32.441575 S 152.539022 E) and Red Head (33.010245 S 151.727059 E). These radars ope...
Details →
Usage examples
See 2 usage examples →
ocean currentsocean velocityoceans
The Northwest Shelf (NWA) HF ocean radar system covers an area which includes the Ningaloo Peninsula and the Ningaloo Reef to the west. The Ningaloo Reef is one of the longest and most pristine reefs in the world. The reef is rich in marine biodiversity, with whale sharks, turtles and fish aggregations, and high primary and secondary productions which are controlled by the physical oceanographic processes. The NWA HF ocean radar is a WERA phased array system with 12-element receive arrays located at the Jurabi Turtle Centre (21.8068 S, 114.1015 E) and Point Billie (22.5432 S, 113.690 E). Th...
Details →
Usage examples
See 2 usage examples →
ocean currentsocean velocityoceans
The Rottnest Shelf (ROT) HF ocean radar system covers an area which includes Rottnest Island and the Perth Canyon to the north-west. The Perth Canyon has the highest marine biodiversity in the region with whale and fish aggregations, and high primary and secondary productions which are controlled by the physical oceanographic processes. Combined with the dynamics of the Perth Canyon is the dominant Leeuwin Current which produces a wake on the leeward side of Rottnest Island. This is a topographically induced up-welling and associated primary and secondary productivity. The region is influ...
Details →
Usage examples
See 2 usage examples →
ocean currentsocean velocityoceans
The South Australia Gulfs (SAG) HF ocean radar system covers the area of about 40,000 square kilometres bounded by Kangaroo Island to the east and the Eyre Peninsula to the north. This is a dynamic region where warm water from the remnants of the Leeuwin Current is moving from the west, and water with varying density is exchanging with Spencer Gulf and the Gulf of St Vincent. Upwelling events occur from the deep ocean on the south side of the observation area. This is a key ocean area for aquaculture and fishing, and is a major shipping thoroughfare. The data from this HF ocean radar syste...
Details →
Usage examples
See 2 usage examples →
ocean currentsocean velocityoceans
The Turquoise Coast (TURQ) HF ocean radar system covers the area of shelf between Seabird and Jurien Bay and is the logical continuation of major research efforts to understand the role of the Leeuwin Current System (Leeuwin Current, the Leeuwin Undercurrent and Capes Current) in controlling not only the physical system but also its links to both pelagic and benthic ecosystems. In contrast to eastern ocean basins, which are highly productive, Western Australia experiences an oligotrophic environment. The Leeuwin Current is a shallow (<300 m deep), narrow band (< 100 km wide) of warm, lo...
Details →
Usage examples
See 2 usage examples →
ocean sea surface heightocean velocityoceans
Gridded (adjusted) sea level anomaly (GSLA), gridded sea level (GSL) and surface geostrophic velocity (UCUR, VCUR) for the Australasian region. GSLA is mapped using optimal interpolation of detided, de-meaned, inverse-barometer-adjusted altimeter and tide gauge estimates of sea level. GSL is GSLA plus an estimate of the departure of mean sea level from the geoid – mean sea level (over 18 years of model time) of Ocean Forecasting Australia Model version 3 (OFAM3). The geostrophic velocities are derived from GSLA and the mean surface velocity from OFAM3. The altimeter data window for input to the ...
Details →
Usage examples
See 2 usage examples →
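The geostrophic velocities (UCUR, VCUR) in the entry above are derived from the sea level field. As a rough illustration only, the textbook geostrophic balance relates surface velocity to the sea level slope as u = -(g/f) ∂η/∂y and v = (g/f) ∂η/∂x, with f = 2Ω sin(latitude). The sketch below applies that relation to a single hypothetical slope; the function names and the sample numbers are assumptions for illustration, not part of the product's processing chain, and the relation breaks down near the equator where f approaches zero.

```python
import math

G = 9.81            # gravitational acceleration, m/s^2
OMEGA = 7.2921e-5   # Earth's rotation rate, rad/s

def coriolis(lat_deg):
    """Coriolis parameter f = 2 * Omega * sin(latitude), in 1/s."""
    return 2.0 * OMEGA * math.sin(math.radians(lat_deg))

def geostrophic_velocity(d_eta_dx, d_eta_dy, lat_deg):
    """Return (u, v) in m/s from sea level slopes (metres per metre).

    d_eta_dx is the eastward slope, d_eta_dy the northward slope.
    """
    f = coriolis(lat_deg)
    u = -(G / f) * d_eta_dy   # eastward component
    v = (G / f) * d_eta_dx    # northward component
    return u, v

# Hypothetical case: sea level rises 10 cm over 100 km towards the east at 30 S.
u, v = geostrophic_velocity(0.1 / 100e3, 0.0, -30.0)
print(f"u = {u:.3f} m/s, v = {v:.3f} m/s")
```

In this hypothetical case the flow is southward at roughly 0.13 m/s, consistent with higher sea level lying to the left of the flow in the Southern Hemisphere.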
coastalearth observationenvironmentalgeosciencegeospatial
The Pacific Coastlines beta product includes coastline change detection
since the year 2000 for Pacific Island Countries and Territories (PICTs).
This product will provide ongoing monitoring of coastline change,
giving insight into processes including erosion (where landmass
area decreases) and accretion or deposition (where landmass area increases).
Details →
Usage examples
See 2 usage examples →
bioinformaticsgenomicgenotypingHomo sapienslife scienceslong read sequencingwhole genome sequencing
The Platinum Pedigree Consortium (PCC) is a collaborative project to create a comprehensive reference for human genetic variation using a four-generation, 28-member family (CEPH-1463). We employed five different short and long-read sequencing technologies to generate phased assemblies and characterize both inherited and de novo variation, including at some of the most difficult to genotype genomic regions such as tandem repeats, centromeres, and the Y chromosome. This extensive "truth set" is publicly available and can be used to test and benchmark new algorithms and technologies to ...
Details →
Usage examples
See 2 usage examples →
satellite imagery
This dataset combines SSL4EO-S12 and SSL4EO-L to create a multi-view dataset for multi-modal fusion using self-supervised learning for earth observation.
Details →
Usage examples
-
SSL4EO-S12 - A Large-Scale Multi-Modal, Multi-Temporal Dataset for Self-Supervised Learning in Earth Observation by Yi Wang, Nassim Ait Ali Braham, Zhitong Xiong, Chenying Liu, Conrad M Albrecht, Xiao Xiang Zhu
-
SSL4EO-L - Datasets and Foundation Models for Landsat Imagery by Adam J. Stewart, Nils Lehmann, Isaac A. Corley, Yi Wang, Yi-Chia Chang, Nassim Ait Ali Braham, Shradha Sehgal, Caleb Robinson, Arindam Banerjee
See 2 usage examples →
air qualityatmosphereenvironmentalhealthnetcdf
Fine particulate matter (PM2.5) concentrations are estimated using information from satellite-, simulation- and monitor-based sources. Aerosol optical depth from multiple satellites (MODIS, VIIRS, MISR, and SeaWiFS) and their respective retrievals (Dark Target, Deep Blue, MAIAC) is combined with simulation (GEOS-Chem) based upon their relative uncertainties as determined using ground-based sun photometer (AERONET) observations to produce geophysical estimates that explain most of the variance in ground-based PM2.5 measurements. A subsequent statistical fusion incorporates additional inf...
Details →
Usage examples
See 2 usage examples →
chemistryocean currentsoceans
High precision satellite altimeter missions including TOPEX/Poseidon (T/P), Jason-1 and now OSTM/Jason-2, have contributed fundamental advances in our understanding of regional and global ocean circulation and its role in the Earth's climate and regional applications. These altimeter satellites essentially observe the height of the global oceans – as such, they have become the tool of choice for scientists to measure sea level rise – both at regional and global scales as well as giving information about ocean currents and large- and small-scale variability. The determination of changes in ...
Details →
Usage examples
See 2 usage examples →
biologyoceanssatellite imagery
The Aqua satellite platform carries a MODIS sensor that observes sunlight reflected from within the ocean surface layer at multiple wavelengths. These multi-spectral measurements are used to infer the concentration of chlorophyll-a (Chl-a), most typically due to phytoplankton, present in the water. There are multiple retrieval algorithms for estimating Chl-a. These data use the Carder method implemented in the SeaDAS processing software l2gen and described in Carder K. L., Chen F. R., Lee Z. P., Hawes S. K. and Cannizzaro J. P. (2003), MODIS Ocean Science Team Algorithm Theoretical Basis Docume...
Details →
Usage examples
See 2 usage examples →
biologyoceanssatellite imagery
The Aqua satellite platform carries a MODIS sensor that observes sunlight reflected from within the ocean surface layer at multiple wavelengths. These multi-spectral measurements are used to infer the concentration of chlorophyll-a (Chl-a), most typically due to phytoplankton, present in the water. There are multiple retrieval algorithms for estimating Chl-a. These data use the Garver-Siegel-Maritorena (GSM) method implemented in the SeaDAS processing software l2gen and described in “Chapter 11, and references therein, of IOCCG Report 5, 2006, (
Details →
Usage examples
See 2 usage examples →
biologyoceanssatellite imagery
The Aqua satellite platform carries a MODIS sensor that observes sunlight reflected from within the ocean surface layer at multiple wavelengths. These multi-spectral measurements are used to infer the concentration of chlorophyll-a (Chl-a), most typically due to phytoplankton, present in the water. There are multiple retrieval algorithms for estimating Chl-a. These data use the OC3 method recommended by the NASA Ocean Biology Processing Group and implemented in the SeaDAS processing software l2gen. The OC3 algorithm is described at http://oceancolor.gsfc.nasa.gov/cms/atbd/chlor_a (and links th...
Details →
Usage examples
See 2 usage examples →
biologyoceanssatellite imagery
The Aqua satellite platform carries a MODIS sensor that observes sunlight reflected from within the ocean surface layer at multiple wavelengths. These multi-spectral measurements are used to infer the concentration of chlorophyll-a (Chl-a), most typically due to phytoplankton, present in the water. There are multiple retrieval algorithms for estimating Chl-a. These data use the OCI method (Hu et al 2012, doi: 10.1029/2011jc007395) recommended by the NASA Ocean Biology Processing Group and implemented in the SeaDAS processing software l2gen. The OCI algorithm is described at https://oceancolor.gsfc.nasa.gov/atbd/chlor_a/ (and links therei...
Details →
Usage examples
See 2 usage examples →
oceanssatellite imagery
The Aqua satellite platform carries a MODIS sensor that observes sunlight reflected from within the ocean surface layer at multiple wavelengths. These multi-spectral measurements are used to infer the diffuse attenuation coefficient (Kd) at 490nm wavelength which provides information on how light is attenuated in the water column. It is defined as the inverse of the scaling length of the exponential decrease of the downwelling irradiance and has units (m^-1). The MODIS K490 product estimates Kd at 490nm wavelength, using a semi-empirical model based on the ratio of water leaving radiances at 490nm and 555nm....
Details →
Usage examples
See 2 usage examples →
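To make the meaning of Kd in the entry above concrete: an exponential decrease of downwelling irradiance with depth can be written E(z) = E(0) exp(-Kd z), so 1/Kd is the depth over which irradiance falls by a factor of e. The sketch below is only an illustration of that relation under the assumption of a depth-uniform Kd; the function names and the sample value of 0.1 m^-1 are hypothetical and are not taken from the product.

```python
import math

def irradiance_fraction(kd_per_m, depth_m):
    """Fraction of surface downwelling irradiance remaining at a given depth,
    assuming a constant diffuse attenuation coefficient Kd (1/m)."""
    return math.exp(-kd_per_m * depth_m)

def e_folding_depth(kd_per_m):
    """Depth (m) at which irradiance has dropped to 1/e of its surface value."""
    return 1.0 / kd_per_m

kd = 0.1  # hypothetical Kd(490) value in 1/m
print(f"e-folding depth: {e_folding_depth(kd):.1f} m")
print(f"fraction left at 20 m: {irradiance_fraction(kd, 20.0):.3f}")
```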
biologyoceanssatellite imagery
The Aqua satellite platform carries a MODIS sensor that observes sunlight reflected from within the ocean surface layer at multiple wavelengths. These multi-spectral measurements are used to infer the concentration of chlorophyll-a (Chl-a), most typically due to phytoplankton, present in the water. An empirical relationship is then used to compute an estimate of the relative abundance of three phytoplankton size classes (micro, nano and picoplankton). The methods used to decompose chl_oc3 are described by Brewin et al in two papers in 2010 and 2012. The two methods, denoted Brewin2010at and Br...
Details →
Usage examples
See 2 usage examples →
biologyoceanssatellite imagery
The Aqua satellite platform carries a MODIS sensor that observes sunlight reflected from within the ocean surface layer at multiple wavelengths. These multi-spectral measurements are used to infer the concentration of chlorophyll-a (Chl-a), most typically due to phytoplankton, present in the water. Modelling is then used to compute an estimate of the Net Primary Productivity (NPP). The model used is based on the standard vertically generalised production model (VGPM). The VGPM is a "chlorophyll-based" model that estimates net primary production from chlorophyll using a temperature-de...
Details →
Usage examples
See 2 usage examples →
biologyoceanssatellite imagery
The Aqua satellite platform carries a MODIS sensor that observes sunlight reflected from within the ocean surface layer at multiple wavelengths. These multi-spectral measurements are used to infer the concentration of chlorophyll-a (Chl-a), most typically due to phytoplankton, present in the water. Modelling is then used to compute an estimate of the Net Primary Productivity (NPP). The model used is based on the standard vertically generalised production model (VGPM). The VGPM is a "chlorophyll-based" model that estimates net primary production from chlorophyll using a temperature-de...
Details →
Usage examples
See 2 usage examples →
oceanssatellite imagery
The Aqua satellite platform carries a MODIS sensor that observes sunlight reflected from within the ocean surface layer at multiple wavelengths. These measurements at discrete wavelengths represent the spectrum of light leaving the water surface, and the shape of the spectrum is characteristic of the water optical properties. Moore et al. (2009) applied a clustering technique to spectra to identify 8 sets of discrete optical water types. This product "owt_csiro" is produced using a CSIRO implementation of the Moore et al. algorithm, and testing shows that it closely reproduces the res...
Details →
Usage examples
See 2 usage examples →
biologyoceanssatellite imagery
The Aqua satellite platform carries a MODIS sensor that observes sunlight reflected from within the ocean surface layer at multiple wavelengths. These multi-spectral measurements are used to infer the concentration of chlorophyll-a (Chl-a), most typically due to phytoplankton, present in the water. An empirical relationship is then used to compute an estimate of the relative abundance of three phytoplankton size classes (micro, nano and picoplankton). The methods used to decompose chl_oc3 are described by Brewin et al in two papers in 2010 and 2012. The two methods, denoted Brewin2010at and Br...
Details →
Usage examples
See 2 usage examples →
climateearth observationenvironmentalgeosciencegeospatial
Sentinel-1 carries a Synthetic Aperture Radar (SAR) that
operates in the C-band. This platform offers SAR data
day and night and in all weather conditions.
Details →
Usage examples
See 2 usage examples →
auxiliary datadisaster responseearth observationearthquakesfloodsgeophysicssentinel-1synthetic aperture radar
Sentinel-1 Precise Orbit Determination (POD) products contain auxiliary data on satellite position and velocity for the European Space Agency's (ESA) Sentinel-1 mission. Sentinel-1 is a C-band Synthetic Aperture Radar (SAR) satellite constellation first launched in 2014 as part of the European Union's Copernicus Earth Observation programme. POD products are a necessary auxiliary input for nearly all Sentinel-1 data processing workflows.
This dataset is a mirror of the Sentinel-1 Orbits dataset hosted in the Copernicus Data Space Ecosystem (CDSE). New files are added within 20 minutes of their publication to CDSE. This dataset includes two types of POD files: RESORB and POEORB.
- RESORB POD files are restituted orbit files generated within 180 minutes from sensing time and published multiple times per day, with accuracy requirement of 10 cm in 2D RMS, but typically below 5 cm. RESORB products from the last 90 days are included in this dataset.
- POEORB POD files are precise orbit files generated with a timeliness of 20 days from sensing time ...
Details →
Usage examples
See 2 usage examples →
earth observationgeosciencegeospatial
The Geometric Median and Absolute Deviations (GeoMAD) product is
a cloud-free annual mosaic that uses a more robust method of
determining the median observation than a simple median. Along
with the median observation, the GeoMAD produces three measures
of variance, or absolute deviations, which help to show
how the data change over the time period. For example, some areas,
such as desert, will change very little, whereas crop land will change
more. All of these values are useful in understanding what is happening
in the area covered by the GeoMAD.
Details →
Usage examples
See 2 usage examples →
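To show why a geometric median is more robust than a per-band average, the sketch below computes one for a handful of multi-band observations using Weiszfeld's iteration, a standard textbook method. This is an illustrative assumption, not the GeoMAD production code: the function name, tolerances, and sample points are hypothetical, and coincident points are simply skipped, which is a simplification.

```python
import math

def geometric_median(points, tol=1e-9, max_iter=500):
    """Weiszfeld iteration: find the point minimising the summed
    Euclidean distance to all input points (each point is a tuple of bands)."""
    dim = len(points[0])
    # Start from the per-band mean.
    est = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    for _ in range(max_iter):
        num = [0.0] * dim
        denom = 0.0
        for p in points:
            d = math.dist(p, est)
            if d < 1e-12:
                continue  # skip a point that coincides with the estimate
            w = 1.0 / d   # closer points get larger weights
            denom += w
            for i in range(dim):
                num[i] += w * p[i]
        if denom == 0.0:
            return est
        new = [v / denom for v in num]
        if math.dist(new, est) < tol:
            return new
        est = new
    return est

# Four similar two-band observations plus one outlier (e.g. a missed cloud).
obs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (100.0, 100.0)]
mean = [sum(p[i] for p in obs) / len(obs) for i in range(2)]
gm = geometric_median(obs)
print("mean:", mean)
print("geometric median:", [round(v, 3) for v in gm])
```

The mean is dragged towards the outlier, while the geometric median stays among the four similar observations.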
air temperatureatmospheremeteorologicaloceansradiation
Enhancement of Measurements on Ships of Opportunity (SOOP)-Air Sea Flux sub-facility collects underway meteorological and oceanographic observations during scientific and Antarctic resupply voyages in the oceans adjacent to Australia. Data product is quality controlled bulk air-sea fluxes and input observations. Research Vessel Real Time Air-Sea Fluxes, equips the Marine National Facility (MNF) (Research Vessels Southern Surveyor and Investigator), the Australian Antarctic Division (Research and Supply Vessels Aurora Australis and Nuyina), and Research Vessel Tangaroa with "climate qualit...
Details →
Usage examples
See 2 usage examples →
air temperatureatmospheremeteorologicaloceansprecipitationradiation
Enhancement of Measurements on Ships of Opportunity (SOOP)-Air Sea Flux sub-facility collects underway meteorological and oceanographic observations during scientific and Antarctic resupply voyages in the oceans adjacent to Australia. Data product is quality controlled observations. Research Vessel Real Time Air-Sea Fluxes, equips the Marine National Facility (MNF) (Research Vessels Southern Surveyor and Investigator), the Australian Antarctic Division (Research and Supply Vessels Aurora Australis and Nuyina), and Research Vessel Tangaroa with "climate quality" meteorological measure...
Details →
Usage examples
See 2 usage examples →
chemistryoceans
The IMOS Ship of Opportunity Underway CO2 Measurements group is a research and data collection project working within the IMOS Ship of Opportunity Multi-Disciplinary Underway Network sub-facility. The CO2 group samples critical regions of the Southern Ocean and the Australian shelf waters that have a major impact on CO2 uptake by the ocean. These are regions where biogeochemical cycling is predicted to be particularly sensitive to a changing climate. The pCO2 Underway System measures the fugacity of carbon dioxide (fCO2) along with other variables such as sea surface salinity (SSS) and sea surface t...
Details →
Usage examples
See 2 usage examples →
oceans
XBT real-time data is available through the IMOS portal. Data is acquired by technicians who ride the ships of opportunity in order to perform high density sampling along well established transit lines. The data acquisition system used is the Quoll developed by Turo Technology. Data is collected and stored in NetCDF files, with real-time data messages (JJVV bathy messages) created on the ship and sent to shore by Iridium SBD. This is inserted onto the GTS by our colleagues at the Australian Bureau of Meteorology. The full resolution data is collected from the ship and returned for processing t...
Details →
Usage examples
See 2 usage examples →
oceans
Fisheries Vessels as Ships of Opportunities (FishSOOP) is an IMOS Sub-Facility working with fishers to collect real-time temperature and depth data by installing equipment on a network of commercial fishing vessels using a range of common fishing gear. Every day, fishing vessels operate broadly across the productive areas of Australia’s Exclusive Economic Zone where we have few subsurface ocean measurements. The FishSOOP Sub-Facility is utilising this observing opportunity to cost-effectively increase the spatial and temporal resolution of subsurface temperature data in Australia’s inshore, she...
Details →
Usage examples
See 2 usage examples →
air temperatureatmosphereoceans
The Sea Surface Temperature (SST) sub-facility produces 1-minute average data products. Observed data are 1-minute median SST values and are retrieved from the vessel once an hour. High-resolution 1-minute median data are available in delayed mode approximately every 3 months; these were produced using visual (manual) inspection of the quality flags after the data had been run through the automated quality control software at the Bureau of Meteorology. The data products are produced from data from 3 ships: L'Astrolabe, Xutra Buhm and Wana Buhm, within this sub-facility w...
Details →
Usage examples
See 2 usage examples →
chemistryoceans
The research vessels (RV Cape Ferguson and RV Solander) of the Australian Institute of Marine Science (AIMS) routinely record along-track (underway) measurements of near-surface water temperature, salinity, chlorophyll (fluorescence) and turbidity (NTU) during scientific operations in the tropical waters of northern Australia, particularly the Great Barrier Reef (GBR). All data records include sampling time (UTC), position (Latitude, Longitude) and water depth (under keel). Data are recorded at 10 second intervals. Data are measured with a Seabird SBE38 thermometer, Seabird SBE21 thermosalinog...
Details →
Usage examples
See 2 usage examples →
oceans
Buoys provide integral wave parameters. Buoy data from the following organisations contribute to the National Wave Archive: Manly Hydraulics Laboratory (part of the NSW Department of Planning and Environment (DPE), which has assumed function of the former NSW Office of Environment and Heritage (OEH)); Bureau of Meteorology; Western Australia Department of Transport (DOT); the Queensland Department of Environment and Science (DES); the Integrated Marine Observing System (IMOS); Gippsland Ports; the NSW Nearshore Wave Data Program from the NSW Department of Planning and Environment (DPE); the Un...
Details →
Usage examples
See 2 usage examples →
amazon.sciencecomputer visionlabeledmachine learningparquetvideo
This dataset provides both the original .tfrecords and a Parquet representation of the YouTube 8 Million dataset. YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities. It comes with precomputed audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This dataset also includes the YouTube-8M Segments data from June 2019.
This dataset is 'Lakehouse Ready', meaning you can query this data in-place straight out of...
Details →
Usage examples
See 2 usage examples →
agricultureamazon.sciencebiologyCaenorhabditis elegansDanio reriogeneticgenomicHomo sapienslife sciencesMus musculusRattus norvegicusreference index
Common reference genomes hosted on AWS S3. Can be used when aligning and analysing raw DNA sequencing data.
Details →
Usage examples
See 1 usage example →
amazon.sciencemachine learningnatural language processing
Amazon product questions and their answers, along with the public product information.
Details →
Usage examples
See 1 usage example →
amazon.sciencedeep learningmachine learningnatural language processingspeech recognition
Sentence classification datasets with ASR errors.
Details →
Usage examples
See 1 usage example →
amazon.sciencecomputer visiondeep learning
The first large public body measurement dataset including 8978 frontal and lateral
silhouettes for 2505 real subjects, paired with height, weight and 14 body
measurements. The following artifacts are made available for each subject.
- Subject Height
- Subject Weight
- Subject Gender
- Two black-and-white silhouette images of subject standing in frontal and side pose
respectively with full body in view.
- 14 body measurements in cm - {ankle girth, arm-length, bicep girth, calf girth,
chest girth, forearm girth, height, hip girth, leg-length,
shoulder-breadth, shoulder-to-crotch length, thigh girth,
waist girth, wrist girth}
The data is split into 3 sets - Training, Test Set A, Test Set B. For the training and
Test-A sets, subjects are photographed and 3D-scanned in a lab by technicians. For
the Test-B set, subjects ...
Details →
Usage examples
See 1 usage example →
deep learninglife sciencesmolecular dockingopen source softwareprotein folding
This is the data used to train the Boltz-1 model. It contains the following datasets:
- Our pre-processed version of the Protein Data Bank
- Our pre-processed version of the multiple sequence alignment data for each protein chain
- The raw multiple sequence alignment data
- A pre-computed symmetry file for symmetry correction during training
Details →
Usage examples
-
Boltz-1: Democratizing Biomolecular Interaction Modeling by J Wohlwend, G Corso, S Passaro, M Reveiz, K Leidal, W Swiderski, T Portnoi, I Chinn, J Silterra, T Jaakkola, R Barzilay
See 1 usage example →
amazon.scienceconversation datamachine learningnatural language processing
This dataset provides extra annotations on top of the publicly released
Topical-Chat dataset (https://github.com/alexa/Topical-Chat) which will help in reproducing the results in our paper
"Policy-Driven Neural Response Generation for Knowledge-Grounded Dialogue Systems" (https://arxiv.org/abs/2005.12529?context=cs.CL).
The dataset contains 5 files: train.json, valid_freq.json, valid_rare.json, test_freq.json and test_rare.json.
Each of these files will have additional annotations on top of the original Topical-Chat dataset.
These specific annotations are: dialogue act annotations a...
Details →
Usage examples
See 1 usage example →
climatelandoceans
The Global Carbon Budget (GCB) is recognised globally as the most comprehensive report on global carbon emissions and sinks. This dataset, updated every year, includes estimates of land and ocean carbon fluxes from the suite of models used in the report.
Details →
Usage examples
-
Global Carbon Budget 2023 by Pierre Friedlingstein, Michael O’Sullivan, Matthew W. Jones, Robbie M. Andrew, Luke Gregor, Judith Hauck, Corinne Le Quéré, Ingrid T. Luijkx, Are Olsen, Glen P. Peters, Wouter Peters, Julia Pongratz, Clemens Schwingshackl, Stephen Sitch, Josep G. Canadell, Philippe Ciais, Rob B. Jackson, Simone Alin, Ramdane Alkama, Almut Arneth, Vivek K. Arora, Nicholas R. Bates, Meike Becker, Nicolas Bellouin, Henry C. Bittig, Laurent Bopp, Frédéric Chevallier, Louise P. Chini, Margot Cronin, Wiley Evans, Stefanie Falk, Richard A. Feely, Thomas Gasser, Marion Gehlen, Thanos Gkritzalis, Lucas Gloege, Giacomo Grassi, Nicolas Gruber, Özgür Gürses, Ian Harris, Matthew Hefner, Richard A. Houghton, George C. Hurtt, Yosuke Iida, Tatiana Ilyina, Atul K. Jain, Annika Jersild, Koji Kadono, Etsushi Kato, Daniel Kennedy, Kees Klein Goldewijk, Jürgen Knauer, Jan Ivar Korsbakken, Peter Landschützer, Nathalie Lefèvre, Keith Lindsay, Junjie Liu, Zhu Liu, Gregg Marland, Nicolas Mayot, Matthew J. McGrath, Nicolas Metzl, Natalie M. Monacci, David R. Munro, Shin-Ichiro Nakaoka, Yosuke Niwa, Kevin O’Brien, Tsuneo Ono, Paul I. Palmer, Naiqing Pan, Denis Pierrot, Katie Pocock, Benjamin Poulter, Laure Resplandy, Eddy Robertson, Christian Rödenbeck, Carmen Rodriguez, Thais M. Rosan, Jörg Schwinger, Roland Séférian, Jamie D. Shutler, Ingunn Skjelvan, Tobias Steinhoff, Qing Sun, Adrienne J. Sutton, Colm Sweeney, Shintaro Takao, Toste Tanhua, Pieter P. Tans, Xiangjun Tian, Hanqin Tian, Bronte Tilbrook, Hiroyuki Tsujino, Francesco Tubiello, Guido R. van der Werf, Anthony P. Walker, Rik Wanninkhof, Chris Whitehead, Anna Wranne, Rebecca Wright, Wenping Yuan, Chao Yue, Xu Yue, Sönke Zaehle, Jiye Zeng, Bo Zheng
See 1 usage example →
amazon.sciencebioinformaticsfastqgeneticgenomiclife scienceslong read sequencingshort read sequencingwhole exome sequencingwhole genome sequencing
To facilitate benchmarking and development, the Google Brain group has sequenced 9 human samples covering the Genome in a Bottle truth sets on different sequencing instruments, sequencing modalities (Illumina short read and Pacific BioSciences long read), sample preparation protocols, and for whole genome and whole exome capture. The original source of these data is gs://google-brain-genomics-public.
Details →
Usage examples
See 1 usage example →
amazon.scienceinformation retrievaljsonnatural language processingtext analysis
A collection of sentences extracted from customer reviews labeled with their helpfulness score.
Details →
Usage examples
See 1 usage example →
amazon.sciencemachine learningnatural language processing
This dataset provides labeled humor detection from product question answering systems.
The dataset contains 3 csv files: Humorous.csv
containing the humorous product questions,
Non-humorous-unbiased.csv containing
the non-humorous product questions from the same products as the humorous ones, and,
Details →
Usage examples
See 1 usage example →
amazon.sciencedialogmachine learningnatural language processing
Humor patterns used for querying Alexa traffic when creating the taxonomy described in the paper "“Alexa, Do You Want to Build a Snowman?” Characterizing Playful Requests to Conversational Agents" by Shani C., Libov A., Tolmach S., Lewin-Eytan L., Maarek Y., and Shahaf D. (CHI LBW 2022). These patterns correspond to the researchers' hypotheses regarding what humor types are likely to appear in Alexa traffic. These patterns were used for querying Alexa traffic to evaluate these hypotheses.
Details →
Usage examples
See 1 usage example →
air qualityenergyenvironmentalmeteorological
This dataset includes detailed information about coal power plants, their locations, capacities, emissions, and other relevant attributes around the Indian Gangetic Plain.
Details →
Usage examples
See 1 usage example →
amazon.sciencemachine learningnatural language processing
This dataset provides product related questions and answers, including answers' quality labels, as part of the paper 'IR Evaluation and Learning in the Presence of Forbidden Documents'.
Details →
Usage examples
See 1 usage example →
amazon.sciencemachine learningnatural language processing
This dataset provides masked sentences and multi-token phrases that were masked-out of these sentences.
We offer 3 datasets: a general purpose dataset extracted from the Wikipedia and Books corpora, and 2 additional datasets extracted from pubmed abstracts.
As for the pubmed data, please be aware that the dataset does not reflect the most current/accurate data available from NLM (it is not being updated).
For these datasets, the columns provided for each datapoint are as follows:
text- the original sentence
span- the span (phrase) which is masked out
span_lower- the lowercase version of span
r...
Details →
Usage examples
See 1 usage example →
amazon.sciencenatural language processingtext analysis
A collection of product review summaries automatically generated by PASS for 32 Amazon products from the FewSum dataset.
Details →
Usage examples
See 1 usage example →
amazon.sciencecomputer vision
PersonPath22 is a large-scale multi-person tracking dataset containing 236 videos captured mostly from static-mounted cameras, collected from sources where we were given the rights to redistribute the content and participants have given explicit consent. Each video has ground-truth annotations including both bounding boxes and tracklet-ids for all the persons in each frame.
Details →
Usage examples
See 1 usage example →
amazon.sciencejsonnatural language processing
This dataset is part of the paper "McPhraSy: Multi-Context Phrase Similarity and Clustering" by DN Cohen et al (2022). The purpose of PCD is to evaluate the quality of semantic-based clustering of noun phrases. The phrases were collected from the Amazon Review Dataset (https://nijianmo.github.io/amazon/).
Details →
Usage examples
See 1 usage example →
explorationgeophysicsseismology
Near, mid, far, full stack (with AGC) imaged 3D seismic data. We also include the decimated
stacking velocity field. The dataset is used in oil and gas exploration. Survey size is
approximately 2,900 km2. Coordinate system used is: GDA94 / MGA Zone 51 Petrel, 700004. Datasets are converted to open-source MDIO format (v1 specification). Original SEG-Y files are licensed as CC BY 3.0 AU and are downloaded from
Google Drive
accessed via SEG Open Data Wiki. The datasets are available
courtesy of ConocoPhillips and Geoscience Australia. Raw data can be requested from Geoscience
Australia's NOPIMS s...
Details →
Usage examples
See 1 usage example →
amazon.sciencemachine learningnatural language processing
This dataset provides product related questions, including their textual content and gap, in hours, between purchase and posting time.
Each question is also associated with related product details, including its id and title.
Details →
Usage examples
See 1 usage example →
code completionmachine learning
PyEnvs is a collection of 2814 permissively licensed Python packages along with their isolated development environments. Paired with a program analyzer (e.g. Jedi Language Server), it supports querying for project-related information. CallArgs is a dataset built on top of PyEnvs for function call argument completion. It provides function definition, implementation, and usage information for each function call instance.
Details →
Usage examples
See 1 usage example →
amazon.science, anomaly detection, classification, fewshot, industrial, segmentation
Largest visual anomaly detection dataset, containing objects from 12 classes in 3 domains across 10,821 images (9,621 normal and 1,200 anomalous). Both image-level and pixel-level annotations are provided.
Details →
Usage examples
See 1 usage example →
amazon.science, information retrieval, machine learning, natural language processing
Voice-based refinements of product search
Details →
Usage examples
See 1 usage example →
amazon.science, machine learning, natural language processing
This dataset provides how-to articles from wikihow.com and their summaries,
written as a coherent paragraph.
The dataset itself is available at wikisum.zip,
and contains the article, the summary, the wikihow url, and an official fold (train, val, or test).
In addition, human evaluation results are available at
wikisum-human-eval...
Details →
Usage examples
See 1 usage example →
amazon.science, conversation data, dialog, machine learning, natural language processing
Wizard of Tasks (WoT) is a dataset containing conversations for Conversational Task Assistants (CTAs). A CTA is a conversational agent whose goal is to help humans to perform real-world tasks. A CTA can help in exploring available tasks, answering task-specific questions and guiding users through step-by-step instructions. WoT contains about 550 conversations with ~18,000 utterances in two domains, i.e., Cooking and Home Improvement.
Details →
Usage examples
See 1 usage example →
government records
The regulations.gov website allows users to view proposed rules and supporting documents for the federal rule-making process. In addition, users can post and view comments about those proposed rules. The site contains about 27 million pieces of text and binary data, but the API that provides access only allows a user to obtain one thousand items per hour. As a result, it would take approximately 3 years to download all the data.
Mirrulations (MIRRor of regULATIONS.gov) is a system that uses a collection of donated API keys to create a mirror of the data. In addition, for each pdf in the da...
Details →
Usage examples
See 1 usage example →
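The three-year figure in the regulations.gov entry above follows directly from the quoted item count and rate limit. The short sketch below reproduces that arithmetic. The `n_keys` parameter and the assumption that throughput scales linearly with the number of donated API keys are illustrative only; the entry does not say how many keys Mirrulations uses or how they are scheduled.

```python
# Figures quoted in the dataset description.
TOTAL_ITEMS = 27_000_000          # "about 27 million pieces of text and binary data"
ITEMS_PER_HOUR_PER_KEY = 1_000    # "one thousand items per hour"
HOURS_PER_YEAR = 24 * 365


def download_years(total_items, items_per_hour, n_keys=1):
    """Years needed to fetch everything at a fixed hourly rate.

    n_keys is a hypothetical parameter: it assumes each extra API key
    adds the same hourly quota, which is an idealisation.
    """
    hours = total_items / (items_per_hour * n_keys)
    return hours / HOURS_PER_YEAR


single_key_years = download_years(TOTAL_ITEMS, ITEMS_PER_HOUR_PER_KEY)
ten_key_years = download_years(TOTAL_ITEMS, ITEMS_PER_HOUR_PER_KEY, n_keys=10)

print(f"one key:  {single_key_years:.2f} years")   # one key:  3.08 years
print(f"ten keys: {ten_key_years * 365:.1f} days") # ten keys: 112.5 days
```

Under that linear-scaling assumption, pooling keys is what turns a multi-year download into a matter of months.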
environmental, meteorological, weather
This is an archive of pure AI-based weather prediction reforecasts produced collaboratively between the Cooperative Institute for Research in the Atmosphere (CIRA) and the NOAA Global Systems Laboratory (NOAA-GSL).
Currently, FourCastNetv2-small, Pangu-Weather, and GraphCast are included, with more models to come. Each of these models has been initialized with both NOAA GFS (directories with no extension) and ECMWF IFS initial conditions (directories ending in "_IFS"). The datasets are updated with near-real-time data twice per day (00Z and 12Z initializations).
...
Details →
amazon.science, computer vision, deep learning, machine learning
Airborne Object Tracking (AOT) is a collection of 4,943 flight sequences of around 120 seconds each, collected at 10 Hz in diverse conditions. There are 5.9M+ images and 3.3M+ 2D annotations of airborne objects in the sequences. There are 3,306,350 frames without labels as they contain no airborne objects. For images with labels, there are on average 1.3 labels per image. All airborne objects in the dataset are labelled.
Details →
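The counts quoted for Airborne Object Tracking can be cross-checked against each other with a few lines of arithmetic. This is only a rough consistency check: the sequence length is "around 120 seconds" and the annotation count is given as "3.3M+", so both are treated as approximate inputs here.

```python
# Values taken from the dataset description; two of them are approximate.
SEQUENCES = 4_943
SECONDS_PER_SEQUENCE = 120       # "around 120 seconds each"
FRAME_RATE_HZ = 10
UNLABELLED_FRAMES = 3_306_350
ANNOTATIONS = 3_300_000          # "3.3M+", used as a lower bound

# Expected number of images if every sequence were exactly 120 s long.
total_frames = SEQUENCES * SECONDS_PER_SEQUENCE * FRAME_RATE_HZ

# Frames that carry at least one label, and labels per such frame.
labelled_frames = total_frames - UNLABELLED_FRAMES
labels_per_labelled_frame = ANNOTATIONS / labelled_frames

print(f"{total_frames:,}")                 # 5,931,600
print(f"{labelled_frames:,}")              # 2,625,250
print(f"{labels_per_labelled_frame:.2f}")  # 1.26
```

The result lines up with the "5.9M+ images" and "on average 1.3 labels per image" figures once the approximate inputs are taken into account.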
amazon.science, computer vision, deep learning, information retrieval, machine learning, machine translation
Amazon Berkeley Objects (ABO) is a collection of 147,702 product listings with multilingual metadata and 398,212 unique catalog images. 8,222 listings come with turntable photography (also referred to as "spin" or "360°-View" images), as sequences of 24 or 72 images, for a total of 586,584 images in 8,209 unique sequences. For 7,953 products, the collection also provides high-quality 3D models, as glTF 2.0 files.
Details →
brain images, brain models, electrophysiology, ion channels, life sciences, microcircuit modeling and simulation, morphological reconstructions, Mus musculus, neuroscience, simulation neuroscience, single neuron models
The Blue Brain Open Data represents an extensive neuroscience dataset encompassing a diverse range of data types, including experimental, model, and simulation data, along with images and videos depicting reconstructed neurons and brain regions.
Details →
amazon.science, computer vision, machine learning
Fine-grained localized visual similarity and search for fashion.
Details →
amazon.science, natural language processing
N-grams are fixed-size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters. The n-grams in this dataset were produced by passing a sliding window over the text of books and outputting a record for each new token.
Details →
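The sliding-window construction described in the n-gram entry above can be sketched in a few lines. This is a minimal illustration, not the pipeline used to build the dataset: the function name `sliding_ngrams` is hypothetical, and splitting on whitespace is an assumed stand-in for whatever tokenization was applied to the Google Books text.

```python
from collections import deque


def sliding_ngrams(tokens, n):
    """Yield one n-gram (as a tuple) for each new token once the window is full."""
    window = deque(maxlen=n)  # the oldest token drops out automatically
    for token in tokens:
        window.append(token)
        if len(window) == n:
            yield tuple(window)


# Assumed tokenization: plain whitespace splitting.
tokens = "the quick brown fox jumps".split()

for gram in sliding_ngrams(tokens, 3):
    print(gram)
# ('the', 'quick', 'brown')
# ('quick', 'brown', 'fox')
# ('brown', 'fox', 'jumps')
```

A text of k tokens yields k - n + 1 n-grams this way (none if k < n), which is why each new token after the first n - 1 produces exactly one record.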
amazon.science, graph, traffic, transportation
Large-scale node-weighted conflict graphs for maximum weight independent set solvers
Details →
amazon.science, json, metadata
The Registry of Open Data on AWS contains publicly available datasets that are available for access from AWS resources. Note that datasets in this registry are available via AWS resources, but they are not provided by AWS; these datasets are owned and maintained by a variety of government organizations, researchers, businesses, and individuals. This dataset contains derived forms of the data in https://github.com/awslabs/open-data-registry that have been transformed for ease of use with machine interfaces. Curren...
Details →
network traffic, telecommunications
Experiment data for the Spatiam DTN Network Platform Technology demonstration, carried out to test a Delay and Disruption Tolerant Network between the ISS and Earth.
Details →
benchmark, deep learning, machine learning, meta learning, time series forecasting
TSBench comprises thousands of benchmark evaluations for time series forecasting methods. It provides various metrics (e.g. measures of accuracy, latency, number of model parameters, ...) of 13 time series forecasting methods across 44 heterogeneous datasets. Time series forecasting methods include both classical and deep learning methods, while several hyperparameter settings are evaluated for the deep learning methods. In addition to the tabular data providing the metrics, TSBench includes the probabilistic forecasts of all evaluated methods for all 44 datasets. While the tabular data is smal...
Details →
SocialGene RefSeq Databases
amino acid, bioinformatics, chemical biology, genomic, graph, metagenomics, microbiome, pharmaceutical, protein
Precomputed SocialGene Neo4j graph databases of various sizes built from RefSeq genomes and MIBiG BGCs.
Details →
Usage examples
See 3 usage examples →