{ "cells": [ { "cell_type": "markdown", "id": "5986c32c-58ce-4e31-9a2d-e0ca018b07bf", "metadata": {}, "source": [ "# Accessing Data from NASA's CMR in R\n", "\n", "Authors: Harshini Girish (UAH), Sheyenne Kirkland (UAH), Alex Mandel (Development Seed), Henry Rodman (Development Seed), Zac Deziel (Development Seed)\n", "\n", "Date: March 26, 2025\n", "\n", "Description: This notebook serves as a follow-up to [\"Searching for Data in NASA's CMR in R\"](https://docs.maap-project.org/en/develop/technical_tutorials/working_with_r/cmr_search_in_r.html). In this guide, users will learn how to:\n", "- Access data from a NASA Distributed Active Archive Center (DAAC) directly.\n", "- Use `paws` to download data from a NASA DAAC locally." ] }, { "cell_type": "markdown", "id": "df352544-7428-421d-827a-510141080010", "metadata": {}, "source": [ "## Additional Resources\n", "- [Working with R in MAAP](https://docs.maap-project.org/en/develop/technical_tutorials/working_with_r.html) \n", " - Current R Documentation within the MAAP Docs.\n", "- [NASA's Operational CMR (MAAP Docs)](https://docs.maap-project.org/en/latest/technical_tutorials/search/catalog.html#nasa-s-operational-cmr) \n", " - A section in the MAAP Docs offering an overview of resources to search and access NASA's CMR.\n", "- [`ncdf4` Reference Manual](https://cran.r-project.org/web/packages/ncdf4/ncdf4.pdf)\n", " - Documentation for reading and writing netCDF files using the `ncdf4` package.\n", "- [GDAL Raster Drivers](https://gdal.org/en/latest/drivers/raster/index.html)\n", " - A list of drivers for raster data.\n", "- [`paws` Reference Manual](https://cran.r-project.org/web/packages/paws/paws.pdf)\n", " - Documentation for using the `paws` package." 
] }, { "cell_type": "markdown", "id": "9f1d15ad-170f-4286-84ec-55a1b45b3d2e", "metadata": {}, "source": [ "## Run This Notebook\n", "To access and run this tutorial within MAAP’s Algorithm Development Environment (ADE), please refer to the [“Getting started with the MAAP”](https://docs.maap-project.org/en/latest/getting_started/getting_started.html) section of our documentation.\n", "\n", "Disclaimer: it is highly recommended to run this tutorial within MAAP’s ADE, which already includes packages specific to MAAP. Running the tutorial outside of the MAAP ADE may lead to errors. Users should work within an \"R/Python\" workspace." ] }, { "cell_type": "markdown", "id": "201805c7-bd3b-42ce-a299-95acac3c7638", "metadata": {}, "source": [ "## Install and Load Required Libraries\n", "Let's load the packages needed for this notebook." ] }, { "cell_type": "code", "execution_count": 75, "id": "dc1b93e1-fb92-45e5-ade0-7b4c19a9c867", "metadata": {}, "outputs": [], "source": [ "library(\"reticulate\") # to use the Python maap-py library from R\n", "library(\"paws\") # to access S3 files\n", "library(\"ncdf4\") # to read HDF4/netCDF files locally\n", "library(\"terra\") # to open raster files" ] }, { "cell_type": "markdown", "id": "64131b26-1a47-4bb3-b105-d2d269afc06d", "metadata": {}, "source": [ "Additionally, we'll invoke the `MAAP` constructor. This will allow us to use the Python-based `maap-py` library from R." ] }, { "cell_type": "code", "execution_count": 76, "id": "02382469-7ded-4f13-a826-7a9728b0a86f", "metadata": {}, "outputs": [], "source": [ "maap_py <- import(\"maap.maap\")\n", "maap <- maap_py$MAAP()" ] }, { "cell_type": "markdown", "id": "7ea49dea-e3f6-4919-bb6e-e39815426cc5", "metadata": {}, "source": [ "## Searching for Data\n", "\n", "In the example below, we'll demonstrate searching and accessing data from ORNL DAAC. 
We'll search for a GEDI L4B dataset, extract the associated links to access the data, and then open a file.\n", "\n", "For more information on searching for NASA CMR collections and granules in R, see [\"Searching for Data in NASA's CMR in R\"](https://docs.maap-project.org/en/develop/technical_tutorials/working_with_r/cmr_search_in_r.html). " ] }, { "cell_type": "code", "execution_count": 77, "id": "c7a649a4-62e0-43a7-9090-c9556cd09d63", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"Collection ID: C2792577683-ORNL_CLOUD\"\n", "Granules:\n", "[1] \"GEDI_L4B_Gridded_Biomass_V2_1.GEDI04_B_MW019MW223_02_002_02_R01000M_SE.tif\"\n", "[2] \"GEDI_L4B_Gridded_Biomass_V2_1.GEDI04_B_MW019MW223_02_002_02_R01000M_V2.tif\"\n", "[3] \"GEDI_L4B_Gridded_Biomass_V2_1.GEDI04_B_MW019MW223_02_002_02_R01000M_MU.tif\"\n", "[4] \"GEDI_L4B_Gridded_Biomass_V2_1.GEDI04_B_MW019MW223_02_002_02_R01000M_QF.tif\"\n", "[5] \"GEDI_L4B_Gridded_Biomass_V2_1.GEDI04_B_MW019MW223_02_002_02_R01000M_NS.tif\"\n" ] } ], "source": [ "# Search for a dataset in NASA's CMR\n", "gedi_collection <- maap$searchCollection(\n", " short_name = \"GEDI_L4B_Gridded_Biomass_V2_1_2299\", \n", " cmr_host = \"cmr.earthdata.nasa.gov\",\n", " cloud_hosted = \"true\"\n", ")\n", "\n", "# Extract the collection’s concept ID\n", "collection_id <- gedi_collection[[1]][\"concept-id\"]\n", "print(paste(\"Collection ID:\", collection_id))\n", "\n", "# Retrieve granules (up to 5 granules)\n", "gedi_granules <- maap$searchGranule(\n", " concept_id = collection_id,\n", " limit = as.integer(5),\n", " cmr_host = \"cmr.earthdata.nasa.gov\"\n", ")\n", "\n", "granule_names <- sapply(gedi_granules, function(names) names[\"Granule\"][\"GranuleUR\"])\n", "cat(\"Granules:\\n\")\n", "print(granule_names)" ] }, { "cell_type": "markdown", "id": "b9746a7a-37d8-46be-8559-c0875dc4d99f", "metadata": {}, "source": [ "Now that we have our granules, let's extract the URLs associated with the first granule. 
There are two links: an S3 link, and an https link." ] }, { "cell_type": "code", "execution_count": 78, "id": "cf10ad4a-2aa9-4126-bc85-6d74bdeb51ed", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"https Link: https://data.ornldaac.earthdata.nasa.gov/protected/gedi/GEDI_L4B_Gridded_Biomass_V2_1/data/GEDI04_B_MW019MW223_02_002_02_R01000M_SE.tif\"\n", "[1] \"S3 Link: s3://ornl-cumulus-prod-protected/gedi/GEDI_L4B_Gridded_Biomass_V2_1/data/GEDI04_B_MW019MW223_02_002_02_R01000M_SE.tif\"\n" ] } ], "source": [ "https_link <- gedi_granules[[1]][\"Granule\"][\"OnlineAccessURLs\"][[1]][0][\"URL\"]\n", "print(paste(\"https Link:\", https_link))\n", "s3_link <- gedi_granules[[1]][\"Granule\"][\"OnlineAccessURLs\"][[1]][2][\"URL\"]\n", "print(paste(\"S3 Link:\", s3_link))" ] }, { "cell_type": "markdown", "id": "f5d7be93-dbdf-4e24-8c79-60cbe625cea1", "metadata": {}, "source": [ "## Data Access" ] }, { "cell_type": "markdown", "id": "ccab6a1d-3404-45c9-b3e3-a02274d37ea9", "metadata": {}, "source": [ "### Direct Access\n", "\n", "Let's use the `sf` package to read the metadata associated with the TIFF file above. To read an item from S3 directly, `/vsis3/` needs to precede the S3 path. To do this, we'll use the `sub` function to replace `s3://` with `/vsis3/`." 
] }, { "cell_type": "code", "execution_count": 79, "id": "d084dd3b-bcb0-4cbc-9982-257270d1156e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Driver: GTiff/GeoTIFF\n", "Files: /vsis3/ornl-cumulus-prod-protected/gedi/GEDI_L4B_Gridded_Biomass_V2_1/data/GEDI04_B_MW019MW223_02_002_02_R01000M_SE.tif\n", "Size is 34704, 14616\n", "Coordinate System is:\n", "PROJCRS[\"WGS 84 / NSIDC EASE-Grid 2.0 Global\",\n", " BASEGEOGCRS[\"WGS 84\",\n", " ENSEMBLE[\"World Geodetic System 1984 ensemble\",\n", " MEMBER[\"World Geodetic System 1984 (Transit)\"],\n", " MEMBER[\"World Geodetic System 1984 (G730)\"],\n", " MEMBER[\"World Geodetic System 1984 (G873)\"],\n", " MEMBER[\"World Geodetic System 1984 (G1150)\"],\n", " MEMBER[\"World Geodetic System 1984 (G1674)\"],\n", " MEMBER[\"World Geodetic System 1984 (G1762)\"],\n", " MEMBER[\"World Geodetic System 1984 (G2139)\"],\n", " ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n", " LENGTHUNIT[\"metre\",1]],\n", " ENSEMBLEACCURACY[2.0]],\n", " PRIMEM[\"Greenwich\",0,\n", " ANGLEUNIT[\"degree\",0.0174532925199433]],\n", " ID[\"EPSG\",4326]],\n", " CONVERSION[\"US NSIDC EASE-Grid 2.0 Global\",\n", " METHOD[\"Lambert Cylindrical Equal Area\",\n", " ID[\"EPSG\",9835]],\n", " PARAMETER[\"Latitude of 1st standard parallel\",30,\n", " ANGLEUNIT[\"degree\",0.0174532925199433],\n", " ID[\"EPSG\",8823]],\n", " PARAMETER[\"Longitude of natural origin\",0,\n", " ANGLEUNIT[\"degree\",0.0174532925199433],\n", " ID[\"EPSG\",8802]],\n", " PARAMETER[\"False easting\",0,\n", " LENGTHUNIT[\"metre\",1],\n", " ID[\"EPSG\",8806]],\n", " PARAMETER[\"False northing\",0,\n", " LENGTHUNIT[\"metre\",1],\n", " ID[\"EPSG\",8807]]],\n", " CS[Cartesian,2],\n", " AXIS[\"easting (X)\",east,\n", " ORDER[1],\n", " LENGTHUNIT[\"metre\",1]],\n", " AXIS[\"northing (Y)\",north,\n", " ORDER[2],\n", " LENGTHUNIT[\"metre\",1]],\n", " USAGE[\n", " SCOPE[\"Environmental science - used as basis for EASE grid.\"],\n", " AREA[\"World between 
86°S and 86°N.\"],\n", " BBOX[-86,-180,86,180]],\n", " ID[\"EPSG\",6933]]\n", "Data axis to CRS axis mapping: 1,2\n", "Origin = (-17367530.445161499083042,7314540.830638599582016)\n", "Pixel Size = (1000.895023349667440,-1000.895023349667440)\n", "Metadata:\n", " AREA_OR_POINT=Area\n", "Image Structure Metadata:\n", " COMPRESSION=LZW\n", " INTERLEAVE=BAND\n", "Corner Coordinates:\n", "Upper Left (-17367530.445, 7314540.831) (180d 0' 0.00\"W, 85d 2'40.44\"N)\n", "Lower Left (-17367530.445,-7314540.831) (180d 0' 0.00\"W, 85d 2'40.44\"S)\n", "Upper Right (17367530.445, 7314540.831) (180d 0' 0.00\"E, 85d 2'40.44\"N)\n", "Lower Right (17367530.445,-7314540.831) (180d 0' 0.00\"E, 85d 2'40.44\"S)\n", "Center ( 0.0000019, -0.0000008) ( 0d 0' 0.00\"E, 0d 0' 0.00\"S)\n", "Band 1 Block=256x256 Type=Float32, ColorInterp=Gray\n", " NoData Value=-9999\n", " Overviews: 17352x7308, 8676x3654, 4338x1827, 2169x914, 1085x457, 543x229\n" ] } ], "source": [ "tiff_path <- sub(\"s3://\", \"/vsis3/\", s3_link)\n", "\n", "tiff_read <- sf::gdal_utils(\"info\", tiff_path)" ] }, { "cell_type": "markdown", "id": "20f506ac-ab39-4502-bde2-3204f4c768bb", "metadata": {}, "source": [ "Since this is a TIFF file, we can use the `terra` package to access the data." ] }, { "cell_type": "code", "execution_count": 80, "id": "429e2868-a0fe-4ffa-b42e-bab80e6a5178", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "class : SpatRaster \n", "dimensions : 14616, 34704, 1 (nrow, ncol, nlyr)\n", "resolution : 1000.895, 1000.895 (x, y)\n", "extent : -17367530, 17367530, -7314541, 7314541 (xmin, xmax, ymin, ymax)\n", "coord. ref. 
: WGS 84 / NSIDC EASE-Grid 2.0 Global (EPSG:6933) \n", "source : GEDI04_B_MW019MW223_02_002_02_R01000M_SE.tif \n", "name : GEDI04_B_MW019MW223_02_002_02_R01000M_SE " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "gedi_data <- terra::rast(tiff_path)\n", "gedi_data" ] }, { "cell_type": "markdown", "id": "e6af389b-5b6b-446b-b2a8-589e9d860c6a", "metadata": {}, "source": [ "### Download a File Locally" ] }, { "cell_type": "markdown", "id": "cc21b20c-5832-41a4-a344-b056687c84b5", "metadata": {}, "source": [ "When data cannot or should not be directly accessed, the file can also be downloaded locally. For more examples on when (or when not) to directly access the data, see [\"MAAP AWS Access in R\"](https://docs.maap-project.org/en/develop/technical_tutorials/working_with_r/access_aws_maap.html). [\"When to 'Cloud'\"](https://nasa-openscapes.github.io/earthdata-cloud-cookbook/when-to-cloud.html) is a more general resource, but also provides good questions to ask yourself when using cloud access.\n", "\n", "For this example, let's search for a MODIS dataset provided by LP DAAC. Similar to above, we'll search for the collection and retrieve the associated granules, then extract the S3 link from the first granule." 
] }, { "cell_type": "code", "execution_count": 81, "id": "9b664986-b1ac-4fad-aef3-c4524ed20a12", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"S3 Link: s3://lp-prod-protected/MOD13A1.061/MOD13A1.A2000049.h02v06.061.2020041151125/MOD13A1.A2000049.h02v06.061.2020041151125.hdf\"\n" ] } ], "source": [ "# Search for a dataset in NASA's CMR\n", "modis_collection <- maap$searchCollection(\n", " short_name = \"MOD13A1\", \n", " cmr_host = \"cmr.earthdata.nasa.gov\",\n", " cloud_hosted = \"true\"\n", ")\n", "\n", "# Extract the collection’s concept ID\n", "collection_id <- modis_collection[[1]][\"concept-id\"]\n", "\n", "# Retrieve granules (up to 5 granules)\n", "modis_granules <- maap$searchGranule(\n", " concept_id = collection_id,\n", " limit = as.integer(5),\n", " cmr_host = \"cmr.earthdata.nasa.gov\"\n", ")\n", "\n", "# Retrieve S3 link\n", "s3_link <- modis_granules[[1]][\"Granule\"][\"OnlineAccessURLs\"][[1]][1][\"URL\"]\n", "print(paste(\"S3 Link:\", s3_link))" ] }, { "cell_type": "markdown", "id": "3b9dc9bb-4eee-4659-9e85-0e562eb377c6", "metadata": {}, "source": [ "To download the data locally, temporary credentials for LP DAAC are needed." ] }, { "cell_type": "code", "execution_count": 82, "id": "87aa49fd-c3a0-47dc-b433-904a4283a9ae", "metadata": {}, "outputs": [], "source": [ "# Get AWS S3 credentials for LP DAAC\n", "credentials <- maap$aws$earthdata_s3_credentials(\n", " \"https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials\"\n", ")\n", "\n", "# Configure AWS S3 client using paws\n", "s3 <- paws::s3(\n", " credentials = list(\n", " creds = list(\n", " access_key_id = credentials$accessKeyId,\n", " secret_access_key = credentials$secretAccessKey,\n", " session_token = credentials$sessionToken\n", " )),\n", " region = \"us-west-2\")" ] }, { "cell_type": "markdown", "id": "ef2830d7-31b6-4dfc-83dc-f9eddd5cc5cc", "metadata": {}, "source": [ "Before downloading, let's do some final prepping. 
First, we'll create a directory to download our file to. Then, from our S3 link, we can get the bucket, key, and a filename." ] }, { "cell_type": "code", "execution_count": 83, "id": "f1920448-d661-4c55-99d3-d4f669b924d6", "metadata": {}, "outputs": [], "source": [ "# Create new directory\n", "dir_name = \"./data\"\n", "if(!dir.exists(dir_name)){dir.create(dir_name)}" ] }, { "cell_type": "code", "execution_count": 84, "id": "ee30ae3c-c1a6-43e1-bcec-8dca6e648307", "metadata": {}, "outputs": [ { "data": { "text/html": [ "'lp-prod-protected'" ], "text/latex": [ "'lp-prod-protected'" ], "text/markdown": [ "'lp-prod-protected'" ], "text/plain": [ "[1] \"lp-prod-protected\"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "'MOD13A1.A2000049.h02v06.061.2020041151125.hdf'" ], "text/latex": [ "'MOD13A1.A2000049.h02v06.061.2020041151125.hdf'" ], "text/markdown": [ "'MOD13A1.A2000049.h02v06.061.2020041151125.hdf'" ], "text/plain": [ "[1] \"MOD13A1.A2000049.h02v06.061.2020041151125.hdf\"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "'MOD13A1.061/MOD13A1.A2000049.h02v06.061.2020041151125/MOD13A1.A2000049.h02v06.061.2020041151125.hdf'" ], "text/latex": [ "'MOD13A1.061/MOD13A1.A2000049.h02v06.061.2020041151125/MOD13A1.A2000049.h02v06.061.2020041151125.hdf'" ], "text/markdown": [ "'MOD13A1.061/MOD13A1.A2000049.h02v06.061.2020041151125/MOD13A1.A2000049.h02v06.061.2020041151125.hdf'" ], "text/plain": [ "[1] \"MOD13A1.061/MOD13A1.A2000049.h02v06.061.2020041151125/MOD13A1.A2000049.h02v06.061.2020041151125.hdf\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Get bucket from file path\n", "s3_parts <- strsplit(sub(\"s3://\",\"\", s3_link), \"/\", fixed = TRUE)[[1]] # drop the s3 prefix\n", "bucket <- s3_parts[1] # grab the 1st item which is the bucket name\n", "bucket\n", "\n", "# Create file name for download\n", "filename <- tail(s3_parts, n=1) # grab the last part of the path\n", 
"filename\n", "\n", "# Get key from file path\n", "key <- paste(tail(s3_parts, n=-1), collapse='/') # grab everything in the path, except the 1st item\n", "key" ] }, { "cell_type": "markdown", "id": "5570ef5d-69fd-4c81-9795-63fdba015532", "metadata": {}, "source": [ "Now we can download our file." ] }, { "cell_type": "code", "execution_count": 85, "id": "7c410f1f-3359-4f27-aef5-b3519cbdd6da", "metadata": {}, "outputs": [], "source": [ "modis_file <- s3$download_file(Bucket = bucket, Key = key, Filename = paste(\"./data/\", filename))" ] }, { "cell_type": "markdown", "id": "8e9e076c-0cd0-489d-92ee-6553987ca154", "metadata": {}, "source": [ "### Access the Downloaded File" ] }, { "cell_type": "markdown", "id": "7f0fd644-dd63-4216-917c-da68e9930c87", "metadata": {}, "source": [ "The data has been downloaded and we can open the file. Since this is an HDF4 file, we can use the `ncdf4` package to open and work with it." ] }, { "cell_type": "code", "execution_count": 86, "id": "c781de3c-e2ab-4e03-b2bb-ba6f200f4417", "metadata": {}, "outputs": [], "source": [ "modis_file <- nc_open(paste(\"./data/\", filename))" ] }, { "cell_type": "markdown", "id": "0a657df0-4520-49b9-b607-45bc13721ace", "metadata": {}, "source": [ "The desired information can now be obtained from the opened file. For example, let's print the variable names." ] }, { "cell_type": "code", "execution_count": 87, "id": "b7cc0e93-81e3-47fe-a15d-d0bd82d5fa6d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "