Usage examples for all datasets in the Registry of Open Data on AWS that are tagged with natural language processing.
Tutorials
Tools & Applications
Publications
- Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures by Pedro Javier Ortiz Suárez, Benoît Sagot, Laurent Romary
- Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl by Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, Chris Biemann
- C4Corpus: Multilingual Web-Size Corpus with Free License by Ivan Habernal, Omnia Zayed, Iryna Gurevych
- CC-News-En: A large English news corpus by Joel Mackenzie, Rodger Benham, Matthias Petri, Johanne R. Trippas, J. Shane Culpepper, Alistair Moffat
- CCAligned: A Massive collection of cross-lingual web-document pairs by Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, Philipp Koehn
- Coyo-700m: Image-text pair dataset by Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, Saehoon Kim
- Defending against neural fake news by Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, et al.
- Index fun by Philippe Suter
- LAION-5B: An open large-scale dataset for training next generation image-text models by Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, et al.
- Language is not all you need: aligning perception with language models by Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, et al.
- Language models are few-shot learners by Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al.
- Large-scale analysis of style injection by relative path overwrite by Sajjad Arshad, Seyed Ali Mirheidari, Tobias Lauinger, Bruno Crispo, Engin Kirda, William Robertson
- LLaMA: open and efficient foundation language models by Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, et al.
- Mapping languages: The Corpus of Global Language Use by Jonathan Dunn
- mT5: A massively multilingual pre-trained text-to-text transformer by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, et al.
- Multimodal C4: an open, billion-scale corpus of images interleaved with text by Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, et al.
- N-gram counts and language models from the Common Crawl by Christian Buck, Kenneth Heafield, Bas van Ooyen
- No Language Left Behind: scaling human-centered machine translation by Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, et al.
- Of using Common Crawl to play Family Feud by Paul Masurel
- On the impact of publicly available news and information transfer to financial markets by Metod Jazbec, Barna Pásztor, Felix Faltings, Nino Antulov-Fantulin, Petter N. Kolm
- Using open data to predict market movements by DELL EMC
- Web Data Commons - RDFa, microdata, and microformat data sets by Christian Bizer, Robert Meusel, Anna Primpeli
Publications
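- chiVe 2.0: Toward Practical Japanese Word Vectors Using Sudachi and NWJC (chiVe 2.0: SudachiとNWJCを用いた実用的な日本語単語ベクトルの実現に向けて) by 河村宗一郎, 久本空海, 真鍋陽俊, 髙岡一馬, 内田佳孝, 岡照晃, 浅原正幸
- chiVe: Toward Japanese Word Vector Resources Usable in Products: Acquiring and Improving Distributed Representations with the Morphological Analyzer Sudachi and the Ultra-Large-Scale Web Corpus NWJC (chiVe: 製品利用可能な日本語単語ベクトル資源の実現へ向けて ~形態素解析器Sudachiと超大規模ウェブコーパスNWJCによる分散表現の獲得と改良~) by 久本空海, 山村崇, 勝田哲弘, 竹林佑斗, 髙岡一馬, 内田佳孝, 岡照晃, 浅原正幸
- Sudachi: a Japanese Tokenizer for Business by Kazuma Takaoka, Sorami Hisamoto, Noriko Kawahara, Miho Sakamoto, Yoshitaka Uchida, Yuji Matsumoto
- Developing a Large-Scale Dictionary for the Morphological Analyzer Sudachi (形態素解析器『Sudachi』のための大規模辞書開発) by 坂本美保, 川原典子, 久本空海, 髙岡一馬, 内田佳孝
- Japanese Word Embeddings Based on Multi-Granularity Segmentation Results (複数粒度の分割結果に基づく日本語単語分散表現) by 真鍋陽俊, 岡照晃, 海川祥毅, 髙岡一馬, 内田佳孝, 浅原正幸
- Building a Synonym Dictionary with Fine-Grained Synonym Relations (詳細化した同義関係をもつ同義語辞書の作成) by 高岡一馬, 岡部裕子, 川原典子, 坂本美保, 内田佳孝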
Publications
- Dynamic Gazetteer Integration in Multilingual Models for Cross-Lingual and Cross-Domain Named Entity Recognition by Besnik Fetahu, Anjie Fang, Oleg Rokhlenko and Shervin Malmasi
- Gazetteer Enhanced Named Entity Recognition for Code-Mixed Web Queries by Besnik Fetahu, Anjie Fang, Oleg Rokhlenko and Shervin Malmasi
- GEMNET: Effective Gated Gazetteer Representations for Recognizing Complex Entities in Low-context Input by Tao Meng, Anjie Fang, Oleg Rokhlenko and Shervin Malmasi
- MultiCoNER: A Large-scale Multilingual Dataset for Complex Named Entity Recognition by Shervin Malmasi, Anjie Fang, Besnik Fetahu, Sudipta Kar, Oleg Rokhlenko
If you want to add a dataset or usage example to this registry, please follow the instructions on the Registry of Open Data on AWS GitHub repository or tell us about your project.