MAAP AWS Access With Python

Authors: Harshini Girish (UAH), Sheyenne Kirkland (UAH), Chuck Daniels (Development Seed), Alex Mandel (Development Seed)

Date: March 26, 2025

Description: In this tutorial, we walk through accessing MAAP data in S3 buckets (maap-ops-workspace and nasa-maap-data-store) in Python. We’ll also demonstrate opening a raster, vector, and text file.

Run This Notebook

To access and run this tutorial within MAAP’s Algorithm Development Environment (ADE), please refer to the “Getting started with the MAAP” section of our documentation.

Disclaimer: it is highly recommended to run a tutorial within MAAP’s ADE, which already includes packages specific to MAAP, such as maap-py. Running the tutorial outside of the MAAP ADE may lead to errors.

Additional Resources

Install/Import Packages

Let’s import the packages necessary for this tutorial.

[5]:
from maap.maap import MAAP
from pystac_client import Client
import geopandas as gpd
from osgeo import gdal
import pandas as pd
import boto3
import rasterio
import os
import re
import subprocess
from rasterio.session import AWSSession
from rasterio.env import Env

Set up Access

We don’t need to manually handle temporary credentials, but we do need to set the default AWS region to us-west-2.

[6]:
# Connect to MAAP API and S3
maap = MAAP()
s3 = boto3.client("s3", region_name="us-west-2")
# Workspace bucket referenced as `bucket` in the cells below
bucket = "maap-ops-workspace"
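
The region passed to boto3 above applies only to that client. GDAL-based tools (rasterio, geopandas reading /vsis3/ paths) look up the region on their own, so it can help to set it once for the whole session. The following is a minimal sketch, assuming the `AWS_DEFAULT_REGION` environment variable is the mechanism you want to use and that all buckets in this tutorial live in us-west-2.

```python
import os

# Assumption: every bucket used in this tutorial is in us-west-2.
# setdefault keeps any region that is already configured in the environment.
os.environ.setdefault("AWS_DEFAULT_REGION", "us-west-2")

print("AWS region:", os.environ["AWS_DEFAULT_REGION"])
```

Using `setdefault` rather than a plain assignment avoids overriding a region that the environment (for example, a DPS job) has already provided.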

Explore Buckets

Mounted paths (like /projects/ or /shared/) are convenient for interactive browsing in the ADE, but they can be slower and are not portable to other environments like the DPS.

For reproducible and scalable workflows — especially those intended to run in the cloud or on DPS — it’s recommended to use direct S3 paths or GDAL-style virtual file paths. Now that we have access to MAAP buckets, we can retrieve data stored in AWS. Typically, users will interact with two main buckets:

  1. maap-ops-workspace – Contains both user-private and user-shared data.

    • Private files are found under s3://maap-ops-workspace/private/<username>/...

    • Shared files are available under s3://maap-ops-workspace/shared/<username>/...

  2. nasa-maap-data-store – Hosts curated datasets that have been ingested into the MAAP STAC catalog.

    • This is the primary location for analysis-ready data used in DPS jobs and shared workflows.

User Shared Buckets

To list objects from a shared bucket, run the code below. Be sure to update the prefix path after “shared/” to match your desired directory.

[21]:
s3_response = s3.list_objects_v2(
    Bucket="maap-ops-workspace",
    Prefix="shared/alexdevseed/cog-tests/"
)

To print the key (the object’s identifier within the bucket) of each .tif file under that prefix, run the following cell.

[8]:
all_objects = [obj["Key"] for obj in s3_response.get("Contents", [])]
tif_objects = [key for key in all_objects if key.endswith(".tif")]
for tif in tif_objects:
    print(tif)

shared/alexdevseed/cog-tests/Landsat8_275_comp_cog_2015-2020_dps.tif
shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog-ovr3.tif
shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog-ovr4.tif
shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog-ovr6.tif
shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog-ovr8.tif
shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog-s3o8.tif
shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif
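
A single `list_objects_v2` call returns at most 1,000 keys, so larger directories need pagination. The sketch below wraps the listing in a helper that walks every page. The function name `list_keys` is hypothetical, and `s3_client` is assumed to behave like the `boto3.client("s3")` object created earlier.

```python
def list_keys(s3_client, bucket, prefix, suffix=""):
    """Return every object key under `prefix` that ends with `suffix`.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages without matching objects have no "Contents" entry
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(suffix):
                keys.append(obj["Key"])
    return keys

# Usage with the client from this tutorial (not executed here):
# tif_keys = list_keys(s3, "maap-ops-workspace",
#                      "shared/alexdevseed/cog-tests/", suffix=".tif")
```

Passing the client in as an argument keeps the helper independent of how credentials are obtained.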

User Private Buckets

To access data in your private bucket, you’ll follow a similar approach as before, but with an updated prefix. First, we’ll retrieve your username to correctly construct the path. The listing cell below then uses the shared/<username>/ prefix for illustration.

[12]:
username = maap.profile.account_info()['username']
print("Username:", username)
Username: harshinigirish
[18]:
prefix = f"shared/{username}/"
s3_response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
s3_object_keys = [obj["Key"] for obj in s3_response.get("Contents", [])]
print("S3 Objects:")
for key in s3_object_keys:
    print(key)
S3 Objects:
shared/harshinigirish/
shared/harshinigirish/GLLIDARPC_FL_20200311_FIA8_l0s12.las

Note: While the following examples don’t explicitly access private buckets, the process is exactly the same as for shared buckets. The only difference is the prefix path—use your username directly instead of shared/username.

nasa-maap-data-store Buckets

To access data from the nasa-maap-data-store bucket, we’ll use a STAC query via the pystac-client library to retrieve item metadata, including file paths. These paths can then be used with tools that support STAC or direct S3 access.

[14]:
stac_url = "https://stac.maap-project.org/"
client = Client.open(stac_url)

In this example, we’ll query the icesat2-boreal collection to explore its available data items.

[15]:
collection_id = "icesat2-boreal"
search = client.search(collections=[collection_id], max_items=10)
items = list(search.items())
print("First 10 STAC Items:")
for item in items:
    print(item.id)

First 10 STAC Items:
boreal_agb_202302151676439579_1326
boreal_agb_202302151676435792_3402
boreal_agb_202302151676435665_3417
boreal_agb_202302151676434536_3215
boreal_agb_202302151676434460_3035
boreal_agb_202302151676432986_2782
boreal_agb_202302151676430990_1278
boreal_agb_202302151676430794_26340
boreal_agb_202302151676430633_40664
boreal_agb_202302151676430594_0611

Now that we’ve specified our collection and retrieved a list of items, we can extract the S3 URL linked to the first item in the collection.

[16]:
first_item = items[0]
asset_href = list(first_item.assets.values())[0].href
print("S3 URL:", asset_href)

S3 URL: s3://nasa-maap-data-store/file-staging/nasa-map/icesat2-boreal/boreal_agb_202302151676439579_1326_train_data.csv
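
The cell above takes whichever asset happens to be listed first. STAC items can carry more than one asset, so selecting by file extension is often safer. This is a small sketch with a hypothetical helper, `find_asset_href`. It assumes only that the item exposes an `assets` mapping whose values have an `href` attribute, as pystac items do.

```python
def find_asset_href(item, suffix):
    """Return the href of the first asset ending with `suffix`, or None."""
    for asset in item.assets.values():
        # Compare case-insensitively so ".CSV" and ".csv" both match
        if asset.href.lower().endswith(suffix.lower()):
            return asset.href
    return None

# Usage with the item retrieved above (not executed here):
# asset_href = find_asset_href(first_item, ".csv")
```

Returning `None` instead of raising lets the caller decide how to handle items that lack the expected file type.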

Accessing an Item

TIFF

In this example, we’ll access a TIFF file from a shared S3 bucket. To read the file directly from S3, the path must begin with /vsis3/. We’ll construct the full path by combining /vsis3/ with the bucket name and the object key.

[22]:
key = "shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif"
tiff_path = f"/vsis3/{bucket}/{key}"
print("TIFF path:", tiff_path)
TIFF path: /vsis3/maap-ops-workspace/shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif

The next cell uses the rio cogeo info command-line tool to inspect a Cloud Optimized GeoTIFF (COG) directly from S3. It prints detailed metadata specific to the COG structure, such as tile layout, overviews, and internal organization. This information is useful for evaluating whether the file is optimized for cloud-based access and helps inform decisions before processing or visualization.

[26]:
cmd = ["rio", "cogeo", "info", tiff_path]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
Driver: GTiff
File: /vsis3/maap-ops-workspace/shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif
COG: True
Compression: None
ColorSpace: None

Profile
    Width:            3000
    Height:           3000
    Bands:            4
    Tiled:            True
    Dtype:            float32
    NoData:           -3.3999999521443642e+38
    Alpha Band:       False
    Internal Mask:    False
    Interleave:       PIXEL
    ColorMap:         False
    ColorInterp:      ('gray', 'undefined', 'undefined', 'undefined')
    Scales:           (1.0, 1.0, 1.0, 1.0)
    Offsets:          (0.0, 0.0, 0.0, 0.0)

Geo
    Crs:              PROJCS["unknown",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101004,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Albers_Conic_Equal_Area"],PARAMETER["latitude_of_center",40],PARAMETER["longitude_of_center",180],PARAMETER["standard_parallel_1",50],PARAMETER["standard_parallel_2",70],PARAMETER["false_easting",0],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH]]
    Origin:           (-1791478.0, 7983304.0)
    Resolution:       (30.0, -30.0)
    BoundingBox:      (-1791478.0, 7893304.0, -1701478.0, 7983304.0)
    MinZoom:          6
    MaxZoom:          11

Image Metadata
    AREA_OR_POINT: Area

Image Structure
    LAYOUT: COG
    INTERLEAVE: PIXEL

Band 1
    ColorInterp: gray
    Metadata:
        STATISTICS_MAXIMUM: 642.62117058029
        STATISTICS_MEAN: 54.162682191363
        STATISTICS_MINIMUM: 4.258820251973
        STATISTICS_STDDEV: 41.175124882695

Band 2
    ColorInterp: undefined
    Metadata:
        STATISTICS_MAXIMUM: 334.88861093713
        STATISTICS_MEAN: 10.524834878905
        STATISTICS_MINIMUM: 0.62954892025643
        STATISTICS_STDDEV: 10.751815070336

Band 3
    ColorInterp: undefined
    Metadata:
        STATISTICS_MAXIMUM: 178.32760325572
        STATISTICS_MEAN: 41.353653830476
        STATISTICS_MINIMUM: 3.408856940596
        STATISTICS_STDDEV: 31.934002621158

Band 4
    ColorInterp: undefined
    Metadata:
        STATISTICS_MAXIMUM: 850.18554567087
        STATISTICS_MEAN: 72.227219607842
        STATISTICS_MINIMUM: 6.5033239743444
        STATISTICS_STDDEV: 52.055061765112

IFD
    Id      Size           BlockSize     Decimation
    0       3000x3000      512x512       0
    1       1500x1500      512x512       2
    2       750x750        512x512       4
    3       375x375        512x512       8
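
The subprocess call above prints only stdout, so a failed command (for example, a wrong path or missing credentials) would produce empty output with no explanation. The sketch below shows one way to surface such errors. The helper name `run_cli` is hypothetical, and the self-check uses the Python interpreter itself so that it runs without any geospatial tools installed.

```python
import subprocess
import sys

def run_cli(cmd):
    """Run a command and return its stdout; raise with stderr on failure."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} failed ({result.returncode}): {result.stderr.strip()}"
        )
    return result.stdout

# Self-check that does not depend on rio or GDAL being installed
print(run_cli([sys.executable, "-c", "print('ok')"]))

# Usage with the COG from this tutorial (not executed here):
# print(run_cli(["rio", "cogeo", "info", tiff_path]))
```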

As a best practice, it’s important to know which GDAL drivers are available, as using the appropriate driver ensures efficient and reliable access to geospatial data. Different drivers support different formats (e.g., GeoTIFF, NetCDF, Shapefile), and selecting the right one can significantly impact performance and compatibility.

Please refer to the “GDAL OGR driver list” for more details.

[24]:
with rasterio.Env() as env:
    # env.drivers() maps each driver's short name to its long (descriptive) name
    drivers = list(env.drivers().items())
    for short_name, long_name in drivers[:5]:
        print(f"{short_name:<10} | {long_name}")

VRT        | Virtual Raster
GTI        | GDAL Raster Tile Index
DERIVED    | Derived datasets using VRT pixel functions
GTiff      | GeoTIFF
COG        | Cloud optimized GeoTIFF generator

The gdalinfo command-line utility can also be run from within Python to read metadata from a TIFF file stored in an AWS S3 bucket. Formatting the file path with /vsis3/ allows GDAL to access cloud-hosted data directly. The command can be executed with Python’s subprocess module, and its output, which contains detailed metadata about the raster file (such as size, projection, and geotransform), can be captured and printed.
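
A minimal sketch of such a gdalinfo call follows. The helper names `build_gdalinfo_cmd` and `gdalinfo` are hypothetical. The sketch assumes GDAL’s command-line tools are on the PATH (as they typically are in an environment with GDAL installed) and uses the `-json` flag so the report can be parsed into a dictionary.

```python
import json
import subprocess

def build_gdalinfo_cmd(path, as_json=True):
    """Build the gdalinfo argument list; -json requests machine-readable output."""
    cmd = ["gdalinfo"]
    if as_json:
        cmd.append("-json")
    cmd.append(path)
    return cmd

def gdalinfo(path):
    """Run gdalinfo on `path` and return the parsed JSON report as a dict."""
    result = subprocess.run(
        build_gdalinfo_cmd(path), capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)

# Usage inside the ADE (requires GDAL tools and S3 access; not executed here):
# info = gdalinfo(tiff_path)
# print(info["size"], info["geoTransform"])
```

With `check=True`, a non-zero exit status raises `subprocess.CalledProcessError` instead of silently returning empty output.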

Vector

In this example, we access a GeoPackage file stored in a shared S3 bucket using the geopandas package. As with raster data, we prepend /vsis3/ to the file path so that GDAL can stream the data directly from S3 without downloading it locally.

[27]:
prefix = "shared/smk0033/CONUSbiohex2020/biohex.gpkg"
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
gpkg_keys = [obj["Key"] for obj in response.get("Contents", []) if obj["Key"].endswith(".gpkg")]

key = gpkg_keys[0]
vector_path = f"/vsis3/{bucket}/{key}"
print("GeoPackage path:", vector_path)

gdf = gpd.read_file(vector_path)

GeoPackage path: /vsis3/maap-ops-workspace/shared/smk0033/CONUSbiohex2020/biohex.gpkg
[28]:
print(gdf.head())
   USHEXES_ID  EMAP_HEX  PROP_FORES  SE_PROP_FO   CRM_LIVE  SE_CRM_LIV  \
0        1680    1680.0    0.966835    3.247659  76.729213   14.810822
1        1681    1681.0    0.983914    1.123591  72.751194   10.498955
2        1568    1568.0    0.854100   12.539034  88.527037   20.416719
3        1456    1456.0    0.543536   22.598699  52.052440   40.713392
4        1345    1345.0    0.520229   23.210199  42.179547   29.260777

   CRM_STND_D  SE_CRM_STN  CRM_LIVE_D  SE_CRM_L_1  ...  SE_JENK_LI  \
0    2.091053   68.108338   78.820266   15.381299  ...   17.287907
1    1.870613   25.186416   74.621807   10.496552  ...    9.390085
2    0.703147   58.649462   89.230184   20.333036  ...   20.126408
3    3.783766   37.665236   55.836206   39.080061  ...   37.532659
4    0.340501   50.498881   42.520048   29.336943  ...   27.366023

   JENK_STND_  SE_JENK_ST  JENK_LIVE_  SE_JENK__1    EST_SAMPLE  SAMPLED_PL  \
0   23.530244   59.050753  127.717229   10.583475  14242.786806         6.0
1    9.422362   18.234947  117.190761    9.232750  47158.889642        19.0
2    2.643056   45.610955  109.839281   19.858437  21226.969702         9.0
3   13.858363   29.379562   81.440374   35.288359  23836.849808        10.0
4    2.744681   41.864178   59.346039   27.787539  37744.619593        16.0

   NON_SAMPLE  AVG_INVYR                                           geometry
0         0.0     2017.3  MULTIPOLYGON (((-69.33725 47.57343, -69.3385 4...
1         0.0     2017.1  MULTIPOLYGON (((-69.1548 47.36372, -69.3385 47...
2         0.0     2016.6  MULTIPOLYGON (((-69.1548 47.36372, -69.15631 4...
3         0.0     2017.4  MULTIPOLYGON (((-68.78615 47.36264, -68.78862 ...
4         0.0     2016.9  MULTIPOLYGON (((-68.41737 47.3604, -68.4208 47...

[5 rows x 27 columns]

CSV (Spatial)

In this example, we’ll load a CSV file containing spatial data directly from the MAAP STAC results. We reuse the asset URL retrieved from the STAC item above and then modify it to stream the data using the /vsis3/ prefix.

[29]:
asset_href = "s3://nasa-maap-data-store/file-staging/nasa-map/icesat2-boreal/boreal_agb_202302151676439579_1326_train_data.csv"

Since we already have a complete S3 path, we convert the "s3://" prefix to "/vsis3/". Additionally, we define the appropriate field names for longitude and latitude so that the file is interpreted as spatial.

To learn more, refer to the GDAL Comma Separated Value (.csv) driver documentation.

[30]:
csv_path = asset_href.replace("s3://", "/vsis3/")
gdf = gpd.read_file(
    f"CSV:{csv_path}",
    engine="fiona",
    X_POSSIBLE_NAMES="lon",
    Y_POSSIBLE_NAMES="lat"
)
print(gdf.head())

         lon        lat               AGB                   SE  \
0 -76.301546  51.089067  13.2031877105918  0.00120325130936702
1 -79.011834  50.972447  3.88344532354623  0.00107527195707417
2 -76.397307  50.458315   4.3007091919769  0.00107527195707417
3 -76.308436  50.442678  43.3027732332638  0.00120325130936702
4 -77.456452  52.031459  2.34135031326733  0.00107527195707417

                     geometry
0  POINT (-76.30155 51.08907)
1  POINT (-79.01183 50.97245)
2  POINT (-76.39731 50.45832)
3  POINT (-76.30844 50.44268)
4  POINT (-77.45645 52.03146)
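
The `s3://` to `/vsis3/` conversion used above can be wrapped in a small helper that also rejects anything that is not an S3 URI. This is only a sketch; the function name `s3_uri_to_vsis3` is hypothetical.

```python
def s3_uri_to_vsis3(uri):
    """Convert 's3://bucket/key' into GDAL's '/vsis3/bucket/key' form."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an s3:// URI: {uri!r}")
    # Only the leading scheme is swapped; the bucket and key stay untouched
    return "/vsis3/" + uri[len(prefix):]

print(s3_uri_to_vsis3(
    "s3://nasa-maap-data-store/file-staging/nasa-map/icesat2-boreal/"
    "boreal_agb_202302151676439579_1326_train_data.csv"
))
```

Unlike a plain `str.replace`, this version changes only the prefix at the start of the string and fails loudly on unexpected input such as an https URL.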

CSV (non-spatial)

For this example, we’ll access a CSV file from our shared bucket.

[31]:
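csv_listing = s3.list_objects_v2(Bucket=bucket, Prefix="shared/smk0033/csv_ex/")
csv_keys = [obj["Key"] for obj in csv_listing.get("Contents", [])]
# Select the file by name rather than by its position in the listing
csv_key = next(k for k in csv_keys if k.endswith("country_estimates_gedi_l4b_v002.csv"))
print(csv_key)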

shared/smk0033/csv_ex/country_estimates_gedi_l4b_v002.csv

Although this CSV file can be accessed directly from shared storage or S3, we’re downloading it locally before reading. This approach helps avoid potential memory issues or latency that can arise when reading files over a network connection—especially for formats like CSV that aren’t inherently cloud-optimized.

Downloading also ensures better compatibility with processing tools like pandas, which expect local file handles for some operations. While cloud-native streaming is preferred for large geospatial formats (e.g., COGs), working with local copies of non-spatial files can improve stability and simplicity in many cases.

Before downloading, let’s create a new directory to hold our file.

[32]:
os.makedirs("./data", exist_ok=True)
[33]:
# Create the file name for the download
filename = os.path.basename(csv_key)
print("Filename:", filename)
Filename: country_estimates_gedi_l4b_v002.csv
[34]:
download_path = os.path.join("./data", filename)
s3.download_file(Bucket=bucket, Key=csv_key, Filename=download_path)
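
When a notebook is re-run, the download step above fetches the file again even if a local copy already exists. The sketch below skips the transfer in that case. The helper name `download_if_missing` is hypothetical, and `s3_client` is assumed to provide the same `download_file` method as the boto3 client used in this tutorial.

```python
import os

def download_if_missing(s3_client, bucket, key, dest_dir="./data"):
    """Download `key` into `dest_dir` unless a local copy exists; return its path."""
    os.makedirs(dest_dir, exist_ok=True)
    local_path = os.path.join(dest_dir, os.path.basename(key))
    if not os.path.exists(local_path):
        s3_client.download_file(Bucket=bucket, Key=key, Filename=local_path)
    return local_path

# Usage with the objects from this tutorial (not executed here):
# download_path = download_if_missing(s3, bucket, csv_key)
```

Note that this check looks only at the file name, so it would not notice if the object in S3 had changed since the last download.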
[35]:
# Read CSV into DataFrame
data = pd.read_csv(download_path)
print(data.head())

       Country ISO3  Percent_Forest  FAO_Forested_AGBD  FAO_Forested_AGBD.1  \
0        Aruba  ABW             2.3            -9999.0              -9999.0
1  Afghanistan  AFG             1.9            -9999.0              -9999.0
2       Angola  AGO            53.4               30.3                 16.2
3     Anguilla  AIA            61.1              210.0                128.3
4      Albania  ALB            28.8            -9999.0              -9999.0

   GEDI_L4B_Total_AGBD  GEDI_L4B_Total_AGBD.1  GEDI_L4B_AGBD_SE_Percent  \
0                  2.1                    0.5                      23.6
1                 24.7                    1.3                       5.4
2                 34.6                    0.6                       1.9
3                  4.4                    1.0                      22.5
4                 56.9                    1.4                       2.5

      FAO_AGB  GEDI_L4B_AGB  GEDI_L4B_AGB_SE
0 -9999.00000      0.000036         0.000008
1 -9999.00000      1.583907         0.084862
2     2.02000      4.312326         0.080284
3     0.00116      0.000035         0.000008
4 -9999.00000      0.161158         0.004026