Direct DAAC S3 Bucket Access (BETA)

Authors: Alex Mandel (Development Seed), Brian Freitag (NASA MSFC), Jamison French (Development Seed)

Date: September 27, 2023

Description: In this tutorial, we demonstrate how to assume the MAAP data reader role to access specific DAAC buckets.

This tutorial demonstrates an experimental feature to allow access to DAACs without using EarthDataLogin.

This method currently works for a select number of DAACs and their EarthDataCloud datasets which are stored in AWS S3: - NSIDC - ORNL - GES DISC - LPDAAC - PODAAC

Run This Notebook

To access and run this tutorial within MAAP’s Algorithm Development Environment (ADE), please refer to the “Getting started with the MAAP” section of our documentation.

Disclaimer: this tutorial must be run within MAAP’s ADE to assume the necessary permissions. This tutorial was tested using the vanilla workspace image. If you encounter issues with the installs, ensure you have the latest version of pip installed.

Additional Resources

Importing Packages

If the packages below are not installed already, uncomment the following cell.

[ ]:
%pip install h5netcdf fsspec s3fs xarray rioxarray --quiet
[1]:
import boto3
import fsspec
import xarray
import rioxarray
import matplotlib.pyplot as plt
import rasterio
from rasterio.session import AWSSession

Access The Data

We’ll create a couple helper functions to setup the assumed role session and view the data.

[2]:
def assume_role_credentials(ssm_parameter_name):
    # Create a session using your current credentials
    session = boto3.Session()

    # Retrieve the SSM parameter
    ssm = session.client('ssm', "us-west-2")
    parameter = ssm.get_parameter(
        Name=ssm_parameter_name,
        WithDecryption=True
    )
    parameter_value = parameter['Parameter']['Value']

    # Assume the DAAC access role
    sts = session.client('sts')
    assumed_role_object = sts.assume_role(
        RoleArn=parameter_value,
        RoleSessionName='TutorialSession'
    )

    # From the response that contains the assumed role, get the temporary
    # credentials that can be used to make subsequent API calls
    credentials = assumed_role_object['Credentials']

    return credentials

# We can pass assumed role credentials into fsspec
def fsspec_access(credentials):
    return fsspec.filesystem(
        "s3",
        key=credentials['AccessKeyId'],
        secret=credentials['SecretAccessKey'],
        token=credentials['SessionToken']
    )

# We can also pass assumed role credentials into rasterio AWSSession
def rasterio_access(credentials):
    aws_session = AWSSession(
        aws_access_key_id=credentials['AccessKeyId'],
        aws_secret_access_key=credentials['SecretAccessKey'],
        aws_session_token=credentials['SessionToken']
    )

    return rasterio.Env(aws_session)

Initialize the assumed role sessions

[3]:
s3_fsspec = fsspec_access(assume_role_credentials("/iam/maap-data-reader"))
s3_rasterio = rasterio_access(assume_role_credentials("/iam/maap-data-reader"))

NSIDC DAAC Access

We can use xarray to open a specific GROUP within the HDF5 such as the “gt1l” track.

[4]:
nsidc_object = "s3://nsidc-cumulus-prod-protected/ATLAS/ATL08/006/2023/06/21/ATL08_20230621235543_00272011_006_01.h5"
with s3_fsspec.open(nsidc_object) as f:
    ds = xarray.open_dataset(f, group='gt1l/land_segments', engine="h5netcdf", phony_dims='sort')
ds
[4]:
<xarray.Dataset>
Dimensions:            (delta_time: 21470, ds_geosegments: 5, ds_surf_type: 5)
Coordinates:
  * delta_time         (delta_time) datetime64[ns] 2023-06-21T23:55:51.147657...
    latitude           (delta_time) float32 ...
    longitude          (delta_time) float32 ...
Dimensions without coordinates: ds_geosegments, ds_surf_type
Data variables: (12/41)
    asr                (delta_time) float32 ...
    atlas_pa           (delta_time) float32 ...
    beam_azimuth       (delta_time) float32 ...
    beam_coelev        (delta_time) float32 ...
    brightness_flag    (delta_time) float32 ...
    cloud_flag_atm     (delta_time) float32 ...
    ...                 ...
    snr                (delta_time) float32 ...
    solar_azimuth      (delta_time) float32 ...
    solar_elevation    (delta_time) float32 ...
    surf_type          (delta_time, ds_surf_type) int8 ...
    terrain_flg        (delta_time) float64 ...
    urban_flag         (delta_time) float64 ...
Attributes:
    Description:  Contains data categorized as land at 100 meter intervals.
    data_rate:    Data are stored as aggregates of 100 meters.

ORNL DAAC Access

We can also use rioxarray to inspect our TIF objects.

[5]:
ornl_object = "s3://ornl-cumulus-prod-protected/gedi/GEDI_L4B_Gridded_Biomass_V2_1/data/GEDI04_B_MW019MW223_02_002_02_R01000M_SE.tif"

with s3_fsspec.open(ornl_object) as obj:
    data_array = rioxarray.open_rasterio(obj)
data_array
[5]:
<xarray.DataArray (band: 1, y: 14616, x: 34704)>
[507233664 values with dtype=float32]
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 -1.737e+07 -1.737e+07 ... 1.737e+07 1.737e+07
  * y            (y) float64 7.314e+06 7.313e+06 ... -7.313e+06 -7.314e+06
    spatial_ref  int64 0
Attributes:
    AREA_OR_POINT:  Area
    _FillValue:     -9999.0
    scale_factor:   1.0
    add_offset:     0.0

PO DAAC Access

[7]:
po_object = "s3://podaac-ops-cumulus-protected/OPERA_L3_DSWX-HLS_PROVISIONAL_V1/OPERA_L3_DSWx-HLS_T55GEM_20230813T235239Z_20230815T154108Z_S2B_30_v1.0_B01_WTR.tif"
with s3_fsspec.open(po_object) as obj:
    data_array = rioxarray.open_rasterio(obj)
data_array
[7]:
<xarray.DataArray (band: 1, y: 3660, x: 3660)>
[13395600 values with dtype=uint8]
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 5e+05 5e+05 5.001e+05 ... 6.097e+05 6.098e+05
  * y            (y) float64 -4.8e+06 -4.8e+06 -4.8e+06 ... -4.91e+06 -4.91e+06
    spatial_ref  int64 0
Attributes: (12/48)
    ACCODE:                                                                  ...
    AEROSOL_CLASS_REMAPPING_ENABLED:                                         ...
    AEROSOL_NOT_WATER_TO_HIGH_CONF_WATER_FMASK_VALUES:                       ...
    AEROSOL_PARTIAL_SURFACE_AGGRESSIVE_TO_HIGH_CONF_WATER_FMASK_VALUES:      ...
    AEROSOL_PARTIAL_SURFACE_WATER_CONSERVATIVE_TO_HIGH_CONF_WATER_FMASK_VALUE...
    AEROSOL_WATER_MODERATE_CONF_TO_HIGH_CONF_WATER_FMASK_VALUES:             ...
    ...                                                                          ...
    WORLDCOVER_COVERAGE:                                                     ...
    WORLDCOVER_SOURCE:                                                       ...
    _FillValue:                                                              ...
    scale_factor:                                                            ...
    add_offset:                                                              ...
    long_name:                                                               ...

GES DISC Access

[8]:
ges_disc_object = "s3://gesdisc-cumulus-prod-protected/Landslide/Global_Landslide_Nowcast.1.1/2020/Global_Landslide_Nowcast_v1.1_20201231.tif"

with s3_fsspec.open(ges_disc_object) as obj:
    data_array = rioxarray.open_rasterio(obj)
data_array
[8]:
<xarray.DataArray (band: 1, y: 14400, x: 43200)>
[622080000 values with dtype=uint8]
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 -180.0 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
  * y            (y) float64 60.0 59.99 59.98 59.97 ... -59.98 -59.99 -60.0
    spatial_ref  int64 0
Attributes:
    AREA_OR_POINT:       Area
    STATISTICS_MAXIMUM:  2
    STATISTICS_MEAN:     nan
    STATISTICS_MINIMUM:  0
    STATISTICS_STDDEV:   nan
    _FillValue:          255
    scale_factor:        1.0
    add_offset:          0.0

LP DAAC Access

We can also use rasterio to directly inspect our TIF objects.

[9]:
lp_object = "s3://lp-prod-protected/HLSL30.020/HLS.L30.T56JMN.2023225T234225.v2.0/HLS.L30.T56JMN.2023225T234225.v2.0.B11.tif"

with s3_rasterio:
    with rasterio.open(lp_object) as src:
        print(f'Width: {src.width}')
        print(f'Height: {src.height}')
        print(f'Bounds: {src.bounds}')
        print(f'CRS: {src.crs}')
        print(f'Count: {src.count}')
        print(f'Data type: {src.dtypes}')

Width: 3660
Height: 3660
Bounds: BoundingBox(left=399960.0, bottom=-3309780.0, right=509760.0, top=-3199980.0)
CRS: EPSG:32656
Count: 1
Data type: ('int16',)
[ ]: