MAAP AWS Access With Python
Authors: Harshini Girish (UAH), Sheyenne Kirkland (UAH), Chuck Daniels (Development Seed), Alex Mandel (Development Seed)
Date: March 26, 2025
Description: In this tutorial, we walk through accessing MAAP data in S3 buckets (maap-ops-workspace and nasa-maap-data-store) in Python. We’ll also demonstrate opening raster, vector, and text files.
Run This Notebook
To access and run this tutorial within MAAP’s Algorithm Development Environment (ADE), please refer to the “Getting started with the MAAP” section of our documentation.
Disclaimer: it is highly recommended to run a tutorial within MAAP’s ADE, which already includes packages specific to MAAP, such as maap-py. Running the tutorial outside of the MAAP ADE may lead to errors.
Additional Resources
- A notebook in NASA Openscapes that shows users how to access data from S3 links.
- Official MAAP documentation showing how to work with AWS-hosted datasets in R.
Install/Import Packages
Let’s install and load the packages necessary for this tutorial.
[5]:
from maap.maap import MAAP
from pystac_client import Client
import geopandas as gpd
from osgeo import gdal
import pandas as pd
import boto3
import rasterio
import os
import re
import subprocess
from rasterio.session import AWSSession
from rasterio.env import Env
Set up Access
We don’t need to manually handle temporary credentials, but we do need to set the default AWS region to us-west-2.
[6]:
# Connect to MAAP API and S3
maap = MAAP()
s3 = boto3.client("s3", region_name="us-west-2")
Explore Buckets
Mounted paths (like /projects/ or /shared/) are convenient for interactive browsing in the ADE, but they can be slower and are not portable to other environments like the DPS.
For reproducible and scalable workflows, especially those intended to run in the cloud or on DPS, it’s recommended to use direct S3 paths or GDAL-style virtual file paths.
Now that we have access to MAAP buckets, we can retrieve data stored in AWS. Typically, users will interact with two main buckets:
- maap-ops-workspace – Contains both user-private and user-shared data.
  - Private files are found under s3://maap-ops-workspace/private/<username>/...
  - Shared files are available under s3://maap-ops-workspace/shared/<username>/...
- nasa-maap-data-store – Hosts curated datasets that have been ingested into the MAAP STAC catalog. This is the primary location for analysis-ready data used in DPS jobs and shared workflows.
User Private Buckets
To access data in your own area of the maap-ops-workspace bucket, we first retrieve your username so that the prefix can be constructed correctly. The example below lists the objects under your shared prefix.
[12]:
username = maap.profile.account_info()['username']
print("Username:", username)
Username: harshinigirish
[18]:
# Name of the workspace bucket that holds private and shared user data
bucket = "maap-ops-workspace"
prefix = f"shared/{username}/"
s3_response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
s3_object_keys = [obj["Key"] for obj in s3_response.get("Contents", [])]
print("S3 Objects:")
for key in s3_object_keys:
    print(key)
S3 Objects:
shared/harshinigirish/
shared/harshinigirish/GLLIDARPC_FL_20200311_FIA8_l0s12.las
Note: While the following examples don’t explicitly access private buckets, the process is exactly the same as for shared buckets. The only difference is the prefix path—use your username directly instead of shared/username.
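A single list_objects_v2 call returns at most 1,000 keys, so the listing above can be incomplete for large prefixes. The sketch below shows one way to collect every key with a boto3 paginator. The helper name list_all_keys is hypothetical (it is not part of boto3 or maap-py), and it takes the S3 client as an argument so that it can be reused with the s3 client created earlier.

```python
def list_all_keys(s3_client, bucket, prefix):
    """Return every object key under `prefix`, following pagination.

    `list_all_keys` is an illustrative helper name, not a boto3 function.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from pages that hold no objects
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Example call, assuming the `s3` client and `username` defined earlier:
# all_keys = list_all_keys(s3, "maap-ops-workspace", f"shared/{username}/")
```

For small prefixes the result matches the single-call listing; the paginator only matters once a prefix holds more objects than one response can carry.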
nasa-maap-data-store Buckets
To access data from the nasa-maap-data-store bucket, we’ll use a STAC query via the pystac-client library to retrieve item metadata, including file paths. These paths can then be used with tools that support STAC or direct S3 access.
[14]:
stac_url = "https://stac.maap-project.org/"
client = Client.open(stac_url)
In this example, we’ll query the icesat2-boreal collection to explore its available data items.
[15]:
collection_id = "icesat2-boreal"
search = client.search(collections=[collection_id], max_items=10)
items = list(search.get_items())
print("First 10 STAC Items:")
for item in items:
    print(item.id)
First 10 STAC Items:
boreal_agb_202302151676439579_1326
boreal_agb_202302151676435792_3402
boreal_agb_202302151676435665_3417
boreal_agb_202302151676434536_3215
boreal_agb_202302151676434460_3035
boreal_agb_202302151676432986_2782
boreal_agb_202302151676430990_1278
boreal_agb_202302151676430794_26340
boreal_agb_202302151676430633_40664
boreal_agb_202302151676430594_0611
Now that we’ve specified our collection and retrieved a list of items, we can extract the S3 URL linked to the first item in the collection.
[16]:
first_item = items[0]
asset_href = list(first_item.assets.values())[0].href
print("S3 URL:", asset_href)
S3 URL: s3://nasa-maap-data-store/file-staging/nasa-map/icesat2-boreal/boreal_agb_202302151676439579_1326_train_data.csv
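Different tools expect this location in different forms: boto3 wants a separate bucket and key, while GDAL-based tools want a /vsis3/ path. As a minimal sketch, the two helpers below (split_s3_url and to_vsis3, both hypothetical names introduced for illustration) derive each form from the S3 URL printed above using only the standard library.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split an s3://bucket/key URL into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def to_vsis3(url):
    """Convert an s3://bucket/key URL into a GDAL /vsis3/ path."""
    bucket_name, object_key = split_s3_url(url)
    return f"/vsis3/{bucket_name}/{object_key}"


# The asset URL printed by the STAC query above
example_href = (
    "s3://nasa-maap-data-store/file-staging/nasa-map/icesat2-boreal/"
    "boreal_agb_202302151676439579_1326_train_data.csv"
)
bucket_name, object_key = split_s3_url(example_href)
print("Bucket:", bucket_name)
print("Key:", object_key)
print("GDAL path:", to_vsis3(example_href))
```

Validating the scheme up front means an https:// or local path fails loudly instead of producing a malformed /vsis3/ path.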
Accessing an Item
TIFF
In this example, we’ll access a TIFF file from a shared S3 bucket. To read the file directly from S3, the path must begin with /vsis3/. We’ll construct the full path by combining /vsis3/ with the bucket name and the object key.
[22]:
key = "shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif"
tiff_path = f"/vsis3/{bucket}/{key}"
print("TIFF path:", tiff_path)
TIFF path: /vsis3/maap-ops-workspace/shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif
We can use the rio cogeo info command-line tool to inspect a Cloud Optimized GeoTIFF (COG) directly from S3.
[26]:
cmd = ["rio", "cogeo", "info", tiff_path]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
Driver: GTiff
File: /vsis3/maap-ops-workspace/shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif
COG: True
Compression: None
ColorSpace: None
Profile
Width: 3000
Height: 3000
Bands: 4
Tiled: True
Dtype: float32
NoData: -3.3999999521443642e+38
Alpha Band: False
Internal Mask: False
Interleave: PIXEL
ColorMap: False
ColorInterp: ('gray', 'undefined', 'undefined', 'undefined')
Scales: (1.0, 1.0, 1.0, 1.0)
Offsets: (0.0, 0.0, 0.0, 0.0)
Geo
Crs: PROJCS["unknown",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101004,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Albers_Conic_Equal_Area"],PARAMETER["latitude_of_center",40],PARAMETER["longitude_of_center",180],PARAMETER["standard_parallel_1",50],PARAMETER["standard_parallel_2",70],PARAMETER["false_easting",0],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH]]
Origin: (-1791478.0, 7983304.0)
Resolution: (30.0, -30.0)
BoundingBox: (-1791478.0, 7893304.0, -1701478.0, 7983304.0)
MinZoom: 6
MaxZoom: 11
Image Metadata
AREA_OR_POINT: Area
Image Structure
LAYOUT: COG
INTERLEAVE: PIXEL
Band 1
ColorInterp: gray
Metadata:
STATISTICS_MAXIMUM: 642.62117058029
STATISTICS_MEAN: 54.162682191363
STATISTICS_MINIMUM: 4.258820251973
STATISTICS_STDDEV: 41.175124882695
Band 2
ColorInterp: undefined
Metadata:
STATISTICS_MAXIMUM: 334.88861093713
STATISTICS_MEAN: 10.524834878905
STATISTICS_MINIMUM: 0.62954892025643
STATISTICS_STDDEV: 10.751815070336
Band 3
ColorInterp: undefined
Metadata:
STATISTICS_MAXIMUM: 178.32760325572
STATISTICS_MEAN: 41.353653830476
STATISTICS_MINIMUM: 3.408856940596
STATISTICS_STDDEV: 31.934002621158
Band 4
ColorInterp: undefined
Metadata:
STATISTICS_MAXIMUM: 850.18554567087
STATISTICS_MEAN: 72.227219607842
STATISTICS_MINIMUM: 6.5033239743444
STATISTICS_STDDEV: 52.055061765112
IFD
Id Size BlockSize Decimation
0 3000x3000 512x512 0
1 1500x1500 512x512 2
2 750x750 512x512 4
3 375x375 512x512 8
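The Geo and IFD values reported above are related by simple arithmetic, which is a useful sanity check when reading this output. The sketch below recomputes the bounding box from the origin, resolution, and raster size, then estimates how many 512x512 tiles each overview level holds. All numbers are copied from the output above; the function name overview_levels is hypothetical, and the full-resolution level (listed with decimation 0 in the table) is treated here as a decimation factor of 1.

```python
import math

# Values copied from the rio cogeo info output above
width, height = 3000, 3000
origin_x, origin_y = -1791478.0, 7983304.0  # upper-left corner
res_x, res_y = 30.0, -30.0                  # negative y: rows run north to south
block_size = 512

# Bounding box as (min_x, min_y, max_x, max_y)
bounding_box = (
    origin_x,
    origin_y + height * res_y,
    origin_x + width * res_x,
    origin_y,
)
print("BoundingBox:", bounding_box)


def overview_levels(size, decimations, block):
    """Return (decimation, side length in pixels, tile count) per level."""
    levels = []
    for decimation in decimations:
        side = math.ceil(size / decimation)
        tiles_per_side = math.ceil(side / block)
        levels.append((decimation, side, tiles_per_side ** 2))
    return levels


for decimation, side, tiles in overview_levels(width, (1, 2, 4, 8), block_size):
    print(f"decimation {decimation}: {side}x{side} pixels, {tiles} tiles")
```

The tile counts show why overviews help when streaming: a coarse preview can be served from a single tile at the smallest level rather than from the 36 tiles of the full-resolution image.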
As a best practice, it’s important to know which GDAL drivers are available, as using the appropriate driver ensures efficient and reliable access to geospatial data. Different drivers support different formats (e.g., GeoTIFF, NetCDF, Shapefile), and selecting the right one can significantly impact performance and compatibility.
Please refer to the “GDAL OGR driver list” for more details.
[24]:
with rasterio.Env() as env:
    # env.drivers() maps each driver's short name to its long (descriptive) name
    drivers = list(env.drivers().items())
    for short_name, long_name in drivers[:5]:
        print(f"{short_name:<10} | {long_name}")
VRT        | Virtual Raster
GTI        | GDAL Raster Tile Index
DERIVED    | Derived datasets using VRT pixel functions
GTiff      | GeoTIFF
COG        | Cloud optimized GeoTIFF generator
The gdalinfo command-line utility can also be run from within Python to read metadata from a TIFF file stored in an AWS S3 bucket. The file path is formatted with /vsis3/, which allows GDAL to access cloud-hosted data directly. The command is executed using Python’s subprocess module, and the output, which contains detailed metadata about the raster file (such as size, projection, and geotransform), is captured and printed.
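A minimal sketch of the gdalinfo call described above might look like the following. The helper names gdalinfo_cmd and run_gdalinfo are hypothetical, and the sketch assumes the gdalinfo executable is on the PATH and that AWS credentials are already available to GDAL, as they are in the ADE.

```python
import shutil
import subprocess


def gdalinfo_cmd(path, as_json=False):
    """Build the gdalinfo argument list for a local or /vsis3/ path."""
    cmd = ["gdalinfo"]
    if as_json:
        # -json asks gdalinfo for machine-readable output
        cmd.append("-json")
    cmd.append(path)
    return cmd


def run_gdalinfo(path, as_json=False):
    """Run gdalinfo and return its standard output as text."""
    if shutil.which("gdalinfo") is None:
        raise RuntimeError("gdalinfo was not found on the PATH")
    result = subprocess.run(
        gdalinfo_cmd(path, as_json=as_json),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if gdalinfo fails
    )
    return result.stdout


# Example call, assuming the `tiff_path` built earlier:
# print(run_gdalinfo(tiff_path))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths, and check=True surfaces access errors instead of silently printing an empty result.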
Vector
In this example, we access a GeoPackage file stored in a shared S3 bucket using the geopandas package. As with raster data, we prepend /vsis3/ to the file path so that GDAL can stream the data directly from S3 without downloading it locally.
[27]:
prefix = "shared/smk0033/CONUSbiohex2020/biohex.gpkg"
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
gpkg_keys = [obj["Key"] for obj in response.get("Contents", []) if obj["Key"].endswith(".gpkg")]
key = gpkg_keys[0]
vector_path = f"/vsis3/{bucket}/{key}"
print("GeoPackage path:", vector_path)
gdf = gpd.read_file(vector_path)
print(vector_path)
GeoPackage path: /vsis3/maap-ops-workspace/shared/smk0033/CONUSbiohex2020/biohex.gpkg
/vsis3/maap-ops-workspace/shared/smk0033/CONUSbiohex2020/biohex.gpkg
[28]:
print(gdf.head())
USHEXES_ID EMAP_HEX PROP_FORES SE_PROP_FO CRM_LIVE SE_CRM_LIV \
0 1680 1680.0 0.966835 3.247659 76.729213 14.810822
1 1681 1681.0 0.983914 1.123591 72.751194 10.498955
2 1568 1568.0 0.854100 12.539034 88.527037 20.416719
3 1456 1456.0 0.543536 22.598699 52.052440 40.713392
4 1345 1345.0 0.520229 23.210199 42.179547 29.260777
CRM_STND_D SE_CRM_STN CRM_LIVE_D SE_CRM_L_1 ... SE_JENK_LI \
0 2.091053 68.108338 78.820266 15.381299 ... 17.287907
1 1.870613 25.186416 74.621807 10.496552 ... 9.390085
2 0.703147 58.649462 89.230184 20.333036 ... 20.126408
3 3.783766 37.665236 55.836206 39.080061 ... 37.532659
4 0.340501 50.498881 42.520048 29.336943 ... 27.366023
JENK_STND_ SE_JENK_ST JENK_LIVE_ SE_JENK__1 EST_SAMPLE SAMPLED_PL \
0 23.530244 59.050753 127.717229 10.583475 14242.786806 6.0
1 9.422362 18.234947 117.190761 9.232750 47158.889642 19.0
2 2.643056 45.610955 109.839281 19.858437 21226.969702 9.0
3 13.858363 29.379562 81.440374 35.288359 23836.849808 10.0
4 2.744681 41.864178 59.346039 27.787539 37744.619593 16.0
NON_SAMPLE AVG_INVYR geometry
0 0.0 2017.3 MULTIPOLYGON (((-69.33725 47.57343, -69.3385 4...
1 0.0 2017.1 MULTIPOLYGON (((-69.1548 47.36372, -69.3385 47...
2 0.0 2016.6 MULTIPOLYGON (((-69.1548 47.36372, -69.15631 4...
3 0.0 2017.4 MULTIPOLYGON (((-68.78615 47.36264, -68.78862 ...
4 0.0 2016.9 MULTIPOLYGON (((-68.41737 47.3604, -68.4208 47...
[5 rows x 27 columns]
CSV (Spatial)
In this example, we’ll load a CSV file containing spatial data directly from the MAAP STAC results. We start from the asset_href retrieved from the first STAC item above and modify it to stream the data using the /vsis3/ prefix.
[29]:
asset_href = "s3://nasa-maap-data-store/file-staging/nasa-map/icesat2-boreal/boreal_agb_202302151676439579_1326_train_data.csv"
Since we already have a complete S3 path, we convert the "s3://" prefix to "/vsis3/". Additionally, we define the appropriate field names for longitude and latitude so that the file is interpreted as spatial.
To learn more, refer to the GDAL Comma Separated Value (.csv) driver documentation.
[30]:
csv_path = asset_href.replace("s3://", "/vsis3/")
gdf = gpd.read_file(
    f"CSV:{csv_path}",
    engine="fiona",
    X_POSSIBLE_NAMES="lon",
    Y_POSSIBLE_NAMES="lat"
)
print(gdf.head())
lon lat AGB SE \
0 -76.301546 51.089067 13.2031877105918 0.00120325130936702
1 -79.011834 50.972447 3.88344532354623 0.00107527195707417
2 -76.397307 50.458315 4.3007091919769 0.00107527195707417
3 -76.308436 50.442678 43.3027732332638 0.00120325130936702
4 -77.456452 52.031459 2.34135031326733 0.00107527195707417
geometry
0 POINT (-76.30155 51.08907)
1 POINT (-79.01183 50.97245)
2 POINT (-76.39731 50.45832)
3 POINT (-76.30844 50.44268)
4 POINT (-77.45645 52.03146)
CSV (non-spatial)
For this example, we’ll access a CSV file from our shared bucket.
[31]:
csv_listing = s3.list_objects_v2(Bucket=bucket, Prefix="shared/smk0033/csv_ex/")
csv_keys = [obj["Key"] for obj in csv_listing.get("Contents", [])]
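# Select the file by name rather than by position, since listing order can change
csv_key = next(k for k in csv_keys if k.endswith("country_estimates_gedi_l4b_v002.csv"))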
print(csv_key)
shared/smk0033/csv_ex/country_estimates_gedi_l4b_v002.csv
Although this CSV file can be accessed directly from shared storage or S3, we’re downloading it locally before reading. This approach helps avoid potential memory issues or latency that can arise when reading files over a network connection—especially for formats like CSV that aren’t inherently cloud-optimized.
Downloading also ensures better compatibility with processing tools like pandas, which expect local file handles for some operations. While cloud-native streaming is preferred for large geospatial formats (e.g., COGs), working with local copies of non-spatial files can improve stability and simplicity in many cases.
Before downloading, let’s create a new directory to hold the file.
[32]:
os.makedirs("./data", exist_ok=True)
[33]:
# Create the local file name for the download
filename = os.path.basename(csv_key)
print("Filename:", filename)
Filename: country_estimates_gedi_l4b_v002.csv
[34]:
download_path = os.path.join("./data", filename)
s3.download_file(Bucket=bucket, Key=csv_key, Filename=download_path)
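When a notebook is re-run, downloading the same object again is usually unnecessary. The sketch below wraps the steps above (create the directory, derive the file name, download) in one helper that skips the transfer when a local copy already exists. The name download_if_missing is hypothetical, and the existence check is a simplifying assumption: it does not detect whether the object in S3 has changed since it was downloaded.

```python
import os


def download_if_missing(s3_client, bucket, key, dest_dir="./data"):
    """Download s3://bucket/key into dest_dir unless it is already there.

    Returns the local file path. Assumes an existing local copy is current.
    """
    os.makedirs(dest_dir, exist_ok=True)
    local_path = os.path.join(dest_dir, os.path.basename(key))
    if not os.path.exists(local_path):
        s3_client.download_file(Bucket=bucket, Key=key, Filename=local_path)
    return local_path


# Example call, assuming the `s3` client, `bucket`, and `csv_key` defined earlier:
# download_path = download_if_missing(s3, bucket, csv_key)
```

Delete the local file, or add a size or timestamp comparison, if the source object is expected to change.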
[35]:
# Read CSV into DataFrame
data = pd.read_csv(download_path)
print(data.head())
Country ISO3 Percent_Forest FAO_Forested_AGBD FAO_Forested_AGBD.1 \
0 Aruba ABW 2.3 -9999.0 -9999.0
1 Afghanistan AFG 1.9 -9999.0 -9999.0
2 Angola AGO 53.4 30.3 16.2
3 Anguilla AIA 61.1 210.0 128.3
4 Albania ALB 28.8 -9999.0 -9999.0
GEDI_L4B_Total_AGBD GEDI_L4B_Total_AGBD.1 GEDI_L4B_AGBD_SE_Percent \
0 2.1 0.5 23.6
1 24.7 1.3 5.4
2 34.6 0.6 1.9
3 4.4 1.0 22.5
4 56.9 1.4 2.5
FAO_AGB GEDI_L4B_AGB GEDI_L4B_AGB_SE
0 -9999.00000 0.000036 0.000008
1 -9999.00000 1.583907 0.084862
2 2.02000 4.312326 0.080284
3 0.00116 0.000035 0.000008
4 -9999.00000 0.161158 0.004026