MAAP AWS Access in R
Authors: Sheyenne Kirkland (UAH), Harshini Girish (UAH), Alex Mandel (DevSeed), Chuck Daniels (DevSeed), Henry Rodman (DevSeed), Zac Deziel (DevSeed)
Date: February 24, 2025
Description: In this tutorial, we walk through accessing MAAP data in S3 buckets (maap-ops-workspace and nasa-maap-data-store) in R. We’ll also demonstrate opening a raster, vector, and text file.
Run This Notebook
To access and run this tutorial within MAAP’s Algorithm Development Environment (ADE), please refer to the “Getting started with the MAAP” section of our documentation.
Disclaimer: it is highly recommended to run a tutorial within MAAP’s ADE, which already includes packages specific to MAAP, such as maap-py. Running the tutorial outside of the MAAP ADE may lead to errors. Users should work within an “R/Python” workspace.
Additional Resources
- A file in the paws GitHub repository with examples of how to use paws in R.
- A notebook in NASA Openscapes that also shows users how to access data from S3 links.
Install/Load Packages
Let’s install and load packages needed for this tutorial.
[ ]:
install.packages("rstac")
library("rstac")
library("sf")
library("reticulate")
library("paws")
Set up Access
We don’t need to write code to obtain temporary credentials (the paws package handles this for us), but the default region must be set to “us-west-2”.
[2]:
s3 <- paws::s3(region = "us-west-2")
Let’s also set up maap-py, since it will be used to get our username.
[3]:
maap_py <- import("maap.maap")
maap <- maap_py$MAAP()
Explore Buckets
Now that we have access to MAAP buckets, we can get data available in AWS. Users will potentially use two buckets:
maap-ops-workspace
nasa-maap-data-store
“maap-ops-workspace” holds user shared and private buckets, while “nasa-maap-data-store” holds our datasets that are ingested into the MAAP STAC.
User Private Buckets
To access data within your private bucket, we’ll list objects in the maap-ops-workspace bucket with a prefix that starts with our username. Before building the prefix, we’ll get our username.
[7]:
username <- maap$profile$account_info()$username
username
Now we can use the username variable in our prefix. Be sure to update the path as necessary to access the files within your desired bucket.
[8]:
bucket <- "maap-ops-workspace"
s3_response <- s3$list_objects_v2(Bucket = bucket, Prefix = paste(username, "CONUSbiohex2020", sep = "/"))
[9]:
s3_object_keys <- sapply(s3_response$Contents, function(s3_object) s3_object$Key)
s3_object_keys
- 'smk0033/CONUSbiohex2020/'
- 'smk0033/CONUSbiohex2020/.ipynb_checkpoints/'
- 'smk0033/CONUSbiohex2020/CONUSbiohex2020.dbf'
- 'smk0033/CONUSbiohex2020/CONUSbiohex2020.prj'
- 'smk0033/CONUSbiohex2020/CONUSbiohex2020.sbn'
- 'smk0033/CONUSbiohex2020/CONUSbiohex2020.sbx'
- 'smk0033/CONUSbiohex2020/CONUSbiohex2020.shp'
- 'smk0033/CONUSbiohex2020/CONUSbiohex2020.shx'
- 'smk0033/CONUSbiohex2020/biohex.gpkg'
Note: while we do not demonstrate accessing data from a private bucket in the following examples, accessing data in a private bucket works the same as accessing data in a shared bucket - the only difference will be the paths.
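The key-extraction step above can be made slightly more defensive. The following is a minimal sketch, assuming the response is a list with a `Contents` element shaped like the one `list_objects_v2` returns. `extract_keys` and `mock_response` are hypothetical names, and the mock stands in for a real S3 call so the example runs without credentials.

```r
# Hypothetical helper: pull object keys out of a list_objects_v2-style response.
# vapply() always returns a character vector, even when Contents is empty,
# whereas sapply() would return an empty list in that case.
extract_keys <- function(response) {
  vapply(response$Contents, function(s3_object) s3_object$Key, character(1))
}

# Build a private-bucket prefix from a username (illustrative values)
username <- "smk0033"
prefix <- paste(username, "CONUSbiohex2020", sep = "/")

# Mock response standing in for s3$list_objects_v2(Bucket = ..., Prefix = prefix)
mock_response <- list(
  Contents = list(
    list(Key = paste0(prefix, "/CONUSbiohex2020.shp")),
    list(Key = paste0(prefix, "/biohex.gpkg"))
  )
)

keys <- extract_keys(mock_response)
empty_keys <- extract_keys(list(Contents = list()))
print(keys)
```

Keep in mind that S3 returns at most 1,000 objects per `list_objects_v2` call, so a large prefix may need more than one request.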
nasa-maap-data-store Buckets
To access the nasa-maap-data-store bucket, we’ll run an rstac query to get the paths to the items we need. Users can then pass a path to the tool needed to open it.
For this example, we’ll use the “icesat2-boreal” dataset.
[10]:
# Define the MAAP STAC endpoint
stac_endpoint <- stac("https://stac.maap-project.org/")
[11]:
# Define the collection
collection <- "icesat2-boreal"
# Fetch items by piping the endpoint query into a search
stac_items <- stac_endpoint |>
  stac_search(collections = collection) |>
  get_request()
print(stac_items)
###Items
- features (10 item(s)):
- boreal_agb_202302151676439579_1326
- boreal_agb_202302151676435792_3402
- boreal_agb_202302151676435665_3417
- boreal_agb_202302151676434536_3215
- boreal_agb_202302151676434460_3035
- boreal_agb_202302151676432986_2782
- boreal_agb_202302151676430990_1278
- boreal_agb_202302151676430794_26340
- boreal_agb_202302151676430633_40664
- boreal_agb_202302151676430594_0611
- assets: csv, tif
- item's fields:
assets, bbox, collection, geometry, id, links, properties, stac_extensions, stac_version, type
Now that we have defined our collection and retrieved some items, let’s get the S3 URL of the first asset of the first item.
[12]:
item <- stac_items$features[[1]]$assets[[1]]$href
item
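Selecting an asset by position works, but looking it up by name can be easier to read. Below is a small sketch using a hand-built mock of a STAC item. The structure mirrors the `assets` field listed in the output above (`csv`, `tif`), while `mock_item` and its hrefs are placeholders rather than real MAAP paths.

```r
# Mock of a single STAC item, reduced to the fields used here.
# The hrefs are placeholders, not real MAAP paths.
mock_item <- list(
  id = "example-item",
  assets = list(
    csv = list(href = "s3://example-bucket/example/train_data.csv"),
    tif = list(href = "s3://example-bucket/example/agb.tif")
  )
)

# Look the asset up by its key rather than by position
csv_href <- mock_item$assets[["csv"]]$href
tif_href <- mock_item$assets[["tif"]]$href

# Collect every asset href in one named character vector
all_hrefs <- vapply(mock_item$assets, function(asset) asset$href, character(1))
print(all_hrefs)
```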
Accessing an Item
TIFF
For this example, we’ll access a TIFF file from the shared bucket query using the sf package. To read an item from S3 directly, /vsis3/ needs to precede the path. To do this with an object from our shared bucket query, we’ll use paste to combine /vsis3 with our bucket and our key.
[13]:
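# shared_objects: character vector of object keys from a shared-bucket listing (s3$list_objects_v2)
tiff_path <- paste("/vsis3", bucket, shared_objects[7], sep = "/")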
print(tiff_path)
[1] "/vsis3/maap-ops-workspace/shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif"
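The path construction above can be wrapped in a small function if it is needed in several places. This is a sketch only: `vsis3_path` is a hypothetical helper name, and the bucket and key are the ones shown in the printed path above.

```r
# Hypothetical helper: build a GDAL /vsis3/ path from a bucket name and an object key.
vsis3_path <- function(bucket, key) {
  # Drop any leading slash on the key so the result has no doubled separators
  key <- sub("^/+", "", key)
  paste("/vsis3", bucket, key, sep = "/")
}

tiff_path <- vsis3_path(
  "maap-ops-workspace",
  "shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif"
)
print(tiff_path)
```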
GDAL reads each data format through a format-specific driver, and using the appropriate driver is a best practice for speed. To list the drivers for raster data, run the following cell. For vector data, change “raster” to “vector”. For the full list, remove the head() call or see the GDAL documentation site.
[14]:
head(st_drivers(what = "raster"))
| | name | long_name | write | copy | is_raster | is_vector | vsi |
---|---|---|---|---|---|---|---|
| | <chr> | <chr> | <lgl> | <lgl> | <lgl> | <lgl> | <lgl> |
VRT | VRT | Virtual Raster | TRUE | TRUE | TRUE | FALSE | TRUE |
DERIVED | DERIVED | Derived datasets using VRT pixel functions | FALSE | FALSE | TRUE | FALSE | FALSE |
GTiff | GTiff | GeoTIFF | TRUE | TRUE | TRUE | FALSE | TRUE |
COG | COG | Cloud optimized GeoTIFF generator | FALSE | TRUE | TRUE | FALSE | TRUE |
NITF | NITF | National Imagery Transmission Format | TRUE | TRUE | TRUE | FALSE | TRUE |
RPFTOC | RPFTOC | Raster Product Format TOC format | FALSE | FALSE | TRUE | FALSE | TRUE |
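Because `/vsis3/` is one of GDAL's virtual file systems, it can help to check which drivers report support for it. The sketch below filters on the `vsi` column, assuming that column flags virtual I/O support. The `drivers` data frame is a hand-copied subset of the output above so the example runs without sf; with sf loaded, `st_drivers(what = "raster")` could be used in its place.

```r
# Hand-copied subset of the st_drivers(what = "raster") output shown above,
# so the example runs without sf installed.
drivers <- data.frame(
  name = c("VRT", "DERIVED", "GTiff", "COG", "NITF", "RPFTOC"),
  write = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  vsi = c(TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Keep only the drivers whose vsi flag is TRUE
vsi_drivers <- drivers$name[drivers$vsi]
print(vsi_drivers)
```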
Now let’s read our file’s metadata with GDAL’s info utility.
[15]:
tiff_read <- sf::gdal_utils("info", tiff_path)
tiff_read
Driver: GTiff/GeoTIFF
Files: /vsis3/maap-ops-workspace/shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif
Size is 3000, 3000
Coordinate System is:
PROJCRS["unknown",
BASEGEOGCRS["NAD83",
DATUM["North American Datum 1983",
ELLIPSOID["GRS 1980",6378137,298.257222101004,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4269]],
CONVERSION["Albers Equal Area",
METHOD["Albers Equal Area",
ID["EPSG",9822]],
PARAMETER["Latitude of false origin",40,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8821]],
PARAMETER["Longitude of false origin",180,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8822]],
PARAMETER["Latitude of 1st standard parallel",50,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8823]],
PARAMETER["Latitude of 2nd standard parallel",70,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8824]],
PARAMETER["Easting at false origin",0,
LENGTHUNIT["metre",1],
ID["EPSG",8826]],
PARAMETER["Northing at false origin",0,
LENGTHUNIT["metre",1],
ID["EPSG",8827]]],
CS[Cartesian,2],
AXIS["easting",east,
ORDER[1],
LENGTHUNIT["metre",1,
ID["EPSG",9001]]],
AXIS["northing",north,
ORDER[2],
LENGTHUNIT["metre",1,
ID["EPSG",9001]]]]
Data axis to CRS axis mapping: 1,2
Origin = (-1791478.000000000000000,7983304.000000000000000)
Pixel Size = (30.000000000000000,-30.000000000000000)
Metadata:
AREA_OR_POINT=Area
Image Structure Metadata:
INTERLEAVE=PIXEL
LAYOUT=COG
Corner Coordinates:
Upper Left (-1791478.000, 7983304.000) ( 16d51'54.04"E, 68d27' 4.96"N)
Lower Left (-1791478.000, 7893304.000) ( 18d20'45.02"E, 69d 3'10.36"N)
Upper Right (-1701478.000, 7983304.000) ( 15d 9'32.11"E, 68d58' 7.91"N)
Lower Right (-1701478.000, 7893304.000) ( 16d37'42.72"E, 69d35' 5.45"N)
Center (-1746478.000, 7938304.000) ( 16d44'58.48"E, 69d 1' 3.32"N)
Band 1 Block=512x512 Type=Float32, ColorInterp=Gray
Min=4.259 Max=642.621
Minimum=4.259, Maximum=642.621, Mean=54.163, StdDev=41.175
NoData Value=-3.4e+38
Overviews: 1500x1500, 750x750, 375x375
Metadata:
STATISTICS_MAXIMUM=642.62117058029
STATISTICS_MEAN=54.162682191363
STATISTICS_MINIMUM=4.258820251973
STATISTICS_STDDEV=41.175124882695
Band 2 Block=512x512 Type=Float32, ColorInterp=Undefined
Min=0.630 Max=334.889
Minimum=0.630, Maximum=334.889, Mean=10.525, StdDev=10.752
NoData Value=-3.4e+38
Overviews: 1500x1500, 750x750, 375x375
Metadata:
STATISTICS_MAXIMUM=334.88861093713
STATISTICS_MEAN=10.524834878905
STATISTICS_MINIMUM=0.62954892025643
STATISTICS_STDDEV=10.751815070336
Band 3 Block=512x512 Type=Float32, ColorInterp=Undefined
Min=3.409 Max=178.328
Minimum=3.409, Maximum=178.328, Mean=41.354, StdDev=31.934
NoData Value=-3.4e+38
Overviews: 1500x1500, 750x750, 375x375
Metadata:
STATISTICS_MAXIMUM=178.32760325572
STATISTICS_MEAN=41.353653830476
STATISTICS_MINIMUM=3.408856940596
STATISTICS_STDDEV=31.934002621158
Band 4 Block=512x512 Type=Float32, ColorInterp=Undefined
Min=6.503 Max=850.186
Minimum=6.503, Maximum=850.186, Mean=72.227, StdDev=52.055
NoData Value=-3.4e+38
Overviews: 1500x1500, 750x750, 375x375
Metadata:
STATISTICS_MAXIMUM=850.18554567087
STATISTICS_MEAN=72.227219607842
STATISTICS_MINIMUM=6.5033239743444
STATISTICS_STDDEV=52.055061765112
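The corner coordinates in this output follow directly from the origin, pixel size, and raster size. As a quick worked check, with all values copied from the output above and illustrative variable names:

```r
# Values copied from the gdalinfo output above
origin_x <- -1791478
origin_y <- 7983304
pixel_w <- 30    # pixel width in metres
pixel_h <- -30   # negative because rows run from north to south
n_cols <- 3000
n_rows <- 3000

# Lower-right corner = origin + (number of pixels * pixel size)
lower_right_x <- origin_x + n_cols * pixel_w
lower_right_y <- origin_y + n_rows * pixel_h

print(c(lower_right_x, lower_right_y))
```

The result matches the "Lower Right" entry in the Corner Coordinates block.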
Vector
For this example, we’ll access a GeoPackage from a shared bucket query using the sf package. As above, we’ll prepend /vsis3/ to our path.
[16]:
vector_listing <- s3$list_objects_v2(Bucket = bucket, Prefix = "shared/smk0033/CONUSbiohex2020/")
vector_key <- sapply(vector_listing$Contents, function(vector_object) vector_object$Key)
[17]:
vector_path <- paste("/vsis3", bucket, vector_key[9], sep = "/")
print(vector_path)
[1] "/vsis3/maap-ops-workspace/shared/smk0033/CONUSbiohex2020/biohex.gpkg"
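Indexing a key by position depends on the order of the listing. One alternative is to select the key by file extension. This is a minimal sketch in which the `vector_key` values are an illustrative subset, not a real listing:

```r
# Illustrative subset of keys under the shared CONUSbiohex2020 prefix
vector_key <- c(
  "shared/smk0033/CONUSbiohex2020/",
  "shared/smk0033/CONUSbiohex2020/CONUSbiohex2020.shp",
  "shared/smk0033/CONUSbiohex2020/biohex.gpkg"
)

# Select the GeoPackage by extension instead of by position
gpkg_key <- grep("\\.gpkg$", vector_key, value = TRUE)

vector_path <- paste("/vsis3", "maap-ops-workspace", gpkg_key, sep = "/")
print(vector_path)
```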
[18]:
vector <- st_read(vector_path)
# limit the printout to the first few rows
head(vector)
Reading layer `CONUSbiohex2020' from data source
`/vsis3/maap-ops-workspace/shared/smk0033/CONUSbiohex2020/biohex.gpkg'
using driver `GPKG'
Simple feature collection with 12591 features and 26 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -125.0093 ymin: 24.3193 xmax: -66.6917 ymax: 49.50757
Geodetic CRS: Unknown datum based upon the Clarke 1866 ellipsoid
Registered S3 method overwritten by 'geojsonsf':
method from
print.geojson geojson
| | USHEXES_ID | EMAP_HEX | PROP_FORES | SE_PROP_FO | CRM_LIVE | SE_CRM_LIV | CRM_STND_D | SE_CRM_STN | CRM_LIVE_D | SE_CRM_L_1 | Shape | ⋯ | SE_JENK_LI | JENK_STND_ | SE_JENK_ST | JENK_LIVE_ | SE_JENK__1 | EST_SAMPLE | SAMPLED_PL | NON_SAMPLE | AVG_INVYR | Shape |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <MULTIPOLYGON [°]> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <MULTIPOLYGON [°]> |
1 | 1680 | 1680 | 0.9668349 | 3.247659 | 76.72921 | 14.81082 | 2.0910534 | 68.10834 | 78.82027 | 15.38130 | MULTIPOLYGON (((-69.33725 4... | ⋯ | 17.287907 | 23.530244 | 59.05075 | 127.71723 | 10.58347 | 14242.787 | 6 | 0 | 2017.3 | MULTIPOLYGON (((-69.33725 4... |
2 | 1681 | 1681 | 0.9839139 | 1.123591 | 72.75119 | 10.49896 | 1.8706130 | 25.18642 | 74.62181 | 10.49655 | MULTIPOLYGON (((-69.1548 47... | ⋯ | 9.390085 | 9.422362 | 18.23495 | 117.19076 | 9.23275 | 47158.890 | 19 | 0 | 2017.1 | MULTIPOLYGON (((-69.1548 47... |
3 | 1568 | 1568 | 0.8541005 | 12.539034 | 88.52704 | 20.41672 | 0.7031466 | 58.64946 | 89.23018 | 20.33304 | MULTIPOLYGON (((-69.1548 47... | ⋯ | 20.126408 | 2.643056 | 45.61095 | 109.83928 | 19.85844 | 21226.970 | 9 | 0 | 2016.6 | MULTIPOLYGON (((-69.1548 47... |
4 | 1456 | 1456 | 0.5435363 | 22.598699 | 52.05244 | 40.71339 | 3.7837661 | 37.66524 | 55.83621 | 39.08006 | MULTIPOLYGON (((-68.78615 4... | ⋯ | 37.532659 | 13.858363 | 29.37956 | 81.44037 | 35.28836 | 23836.850 | 10 | 0 | 2017.4 | MULTIPOLYGON (((-68.78615 4... |
5 | 1345 | 1345 | 0.5202292 | 23.210199 | 42.17955 | 29.26078 | 0.3405009 | 50.49888 | 42.52005 | 29.33694 | MULTIPOLYGON (((-68.41737 4... | ⋯ | 27.366023 | 2.744681 | 41.86418 | 59.34604 | 27.78754 | 37744.620 | 16 | 0 | 2016.9 | MULTIPOLYGON (((-68.41737 4... |
6 | 1235 | 1235 | 0.2953065 | 73.404294 | 15.99120 | 73.40429 | 0.0000000 | 0.00000 | 15.99120 | 73.40429 | MULTIPOLYGON (((-68.04847 4... | ⋯ | 73.404294 | 1.098071 | 73.40429 | 20.08525 | 73.40429 | 4576.253 | 2 | 0 | 2017.0 | MULTIPOLYGON (((-68.04847 4... |
CSV (Spatial)
For this example, we’ll access a CSV file with spatial data from our MAAP STAC query. We’ll use the “item” variable from the STAC query, and then use sf to open the file.
[19]:
item
Since we have a full S3 URL, let’s replace s3:// with /vsis3/ using the sub function. We’ll also set names for the coordinate fields; see the GDAL Comma Separated Value (.csv) driver page for more information.
[20]:
head(st_read(sub("s3://", "/vsis3/", item), options = c("X_POSSIBLE_NAMES=lon", "Y_POSSIBLE_NAMES=lat")))
options: X_POSSIBLE_NAMES=lon Y_POSSIBLE_NAMES=lat
Reading layer `boreal_agb_202302151676439579_1326_train_data' from data source
`/vsis3/nasa-maap-data-store/file-staging/nasa-map/icesat2-boreal/boreal_agb_202302151676439579_1326_train_data.csv'
using driver `CSV'
Simple feature collection with 8773 features and 4 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -178.5851 ymin: 36.25292 xmax: 178.2856 ymax: 67.56312
CRS: NA
| | lon | lat | AGB | SE | geometry |
---|---|---|---|---|---|
| | <dbl> | <dbl> | <chr> | <chr> | <POINT> |
1 | -76.30155 | 51.08907 | 13.2031877105918 | 0.00120325130936702 | POINT (-76.30155 51.08907) |
2 | -79.01183 | 50.97245 | 3.88344532354623 | 0.00107527195707417 | POINT (-79.01183 50.97245) |
3 | -76.39731 | 50.45832 | 4.3007091919769 | 0.00107527195707417 | POINT (-76.39731 50.45832) |
4 | -76.30844 | 50.44268 | 43.3027732332638 | 0.00120325130936702 | POINT (-76.30844 50.44268) |
5 | -77.45645 | 52.03146 | 2.34135031326733 | 0.00107527195707417 | POINT (-77.45645 52.03146) |
6 | -77.68919 | 50.53604 | 39.7893307310248 | 0.00120325130936702 | POINT (-77.68919 50.53604) |
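The `s3://` to `/vsis3/` substitution used above can be wrapped in a small function that also checks its input. This is a sketch only: `s3_to_vsis3` is a hypothetical helper name, and the URL is the one implied by the data source path in the output above.

```r
# Hypothetical helper: convert an s3:// URL into a GDAL /vsis3/ path.
s3_to_vsis3 <- function(url) {
  # Fail early on anything that is not an S3 URL
  stopifnot(is.character(url), all(startsWith(url, "s3://")))
  # Anchor the pattern so only the leading scheme is replaced
  sub("^s3://", "/vsis3/", url)
}

csv_path <- s3_to_vsis3(
  "s3://nasa-maap-data-store/file-staging/nasa-map/icesat2-boreal/boreal_agb_202302151676439579_1326_train_data.csv"
)
print(csv_path)
```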
CSV (non-spatial)
For this example, we’ll access a CSV file from our shared bucket.
[21]:
csv_listing <- s3$list_objects_v2(Bucket = bucket, Prefix = "shared/smk0033/csv_ex/")
csv_keys <- sapply(csv_listing$Contents, function(csv_object) csv_object$Key)
# choose an arbitrary key
csv_key <- csv_keys[4]
csv_key
This CSV has no spatial data, so we’ll download the file and then read it locally. A non-spatial CSV file can also be read directly from S3, but downloading it first keeps this example simple.
Before downloading, let’s create a new directory to hold our file.
[ ]:
dir.create("./data")
Now we can download our data and open it.
[23]:
# create file name for download
filename <- sub(".*/", "", csv_key)
filename
[24]:
s3$download_file(Bucket = bucket, Key = csv_key, Filename = paste("./data", filename, sep = "/"))
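The directory and file-name steps above can also be written with base R's path helpers. The following sketch assumes a hypothetical key (`example.csv`) and writes under `tempdir()` rather than `./data` so it has no side effects in the working directory.

```r
# Hypothetical object key; substitute a key from your own listing
csv_key <- "shared/smk0033/csv_ex/example.csv"

# Create the target directory; showWarnings = FALSE keeps reruns quiet
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE, recursive = TRUE)

# basename() keeps only the part of the key after the last "/"
local_path <- file.path(data_dir, basename(csv_key))
print(basename(local_path))
```

`local_path` could then be passed as the `Filename` argument of the download call.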
[25]:
# open data
data <- read.csv(paste("./data", filename, sep = "/"))
head(data)
| | Country | ISO3 | Percent_Forest | FAO_Forested_AGBD | FAO_Forested_AGBD.1 | GEDI_L4B_Total_AGBD | GEDI_L4B_Total_AGBD.1 | GEDI_L4B_AGBD_SE_Percent | FAO_AGB | GEDI_L4B_AGB | GEDI_L4B_AGB_SE |
---|---|---|---|---|---|---|---|---|---|---|---|
| | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
1 | Aruba | ABW | 2.3 | -9999.0 | -9999.0 | 2.1 | 0.5 | 23.6 | -9.999e+03 | 3.554221e-05 | 8.388068e-06 |
2 | Afghanistan | AFG | 1.9 | -9999.0 | -9999.0 | 24.7 | 1.3 | 5.4 | -9.999e+03 | 1.583907e+00 | 8.486180e-02 |
3 | Angola | AGO | 53.4 | 30.3 | 16.2 | 34.6 | 0.6 | 1.9 | 2.020e+00 | 4.312326e+00 | 8.028431e-02 |
4 | Anguilla | AIA | 61.1 | 210.0 | 128.3 | 4.4 | 1.0 | 22.5 | 1.160e-03 | 3.543690e-05 | 7.973480e-06 |
5 | Albania | ALB | 28.8 | -9999.0 | -9999.0 | 56.9 | 1.4 | 2.5 | -9.999e+03 | 1.611579e-01 | 4.025675e-03 |
6 | Andorra | AND | 34.0 | 154.0 | 52.4 | 74.3 | 4.7 | 6.3 | 2.460e-03 | 3.360226e-03 | 2.118110e-04 |
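Several numeric columns in this table contain -9999, which looks like a no-data placeholder. That reading is an assumption; the file itself does not state it. If it holds, a sketch like the following converts those values to `NA` before any summary statistics are computed. The small data frame is a hand-copied subset of the rows above, and `na_if_sentinel` is a hypothetical helper name.

```r
# Hand-copied subset of the table above
data <- data.frame(
  Country = c("Aruba", "Angola", "Andorra"),
  ISO3 = c("ABW", "AGO", "AND"),
  Percent_Forest = c(2.3, 53.4, 34.0),
  FAO_Forested_AGBD = c(-9999, 30.3, 154.0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace an assumed sentinel value with NA
na_if_sentinel <- function(x, sentinel = -9999) {
  x[!is.na(x) & x == sentinel] <- NA
  x
}

# Apply the helper to numeric columns only
numeric_cols <- vapply(data, is.numeric, logical(1))
data[numeric_cols] <- lapply(data[numeric_cols], na_if_sentinel)

print(mean(data$FAO_Forested_AGBD, na.rm = TRUE))
```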