Smithsonian Research Data Lake

The data lake is a service run by the Office of Research Computing that helps Smithsonian researchers process and analyze their data at scale. Using Hadoop and Spark technologies, the data lake enables computation across multiple heterogeneous datasets, autogenerates metadata for description and discovery, and makes connections between datasets in different projects.
The data lake can accept data from many sources, from environmental monitoring stations to legacy relational databases, and then supply the analyzed data for publication in repositories, websites, or dedicated visualization platforms.

If you have used the Smithsonian Research Data Lake to further your research, please consider acknowledging it in your publications using the following DOI:

https://doi.org/10.25572/9kqe-jy70

More information, including technical specifications and how to use the data lake for an SI research project, can be found on Confluence at "Smithsonian Data Lake Home - Smithsonian Research Data Lake - SI Collaboration WIKI" (note: you must be logged in with Smithsonian credentials to view it).