{ "cells": [ { "cell_type": "markdown", "id": "4795f2ad-0dee-4a73-9470-abd674b427a9", "metadata": {}, "source": [ "# MAAP AWS Access With Python\n", "\n", "Authors: Harshini Girish (UAH), Sheyenne Kirkland (UAH), Chuck Daniels (Development Seed), Alex Mandel (Development Seed)\n", "\n", "Date: March 26, 2025\n", "\n", "Description: In this tutorial, we walk through accessing MAAP data in S3 buckets (`maap-ops-workspace` and `nasa-maap-data-store`) with Python. We’ll also demonstrate opening raster, vector, and text files.\n", "\n" ] }, { "cell_type": "markdown", "id": "b50fa104-6598-4e9d-a700-81e9c58795c5", "metadata": {}, "source": [ "## Run This Notebook" ] }, { "cell_type": "markdown", "id": "f66b18ad-5109-444d-95aa-60b609a8bc36", "metadata": {}, "source": [ "To access and run this tutorial within MAAP's Algorithm Development Environment (ADE), please refer to the [\"Getting started with the MAAP\"](https://docs.maap-project.org/en/latest/getting_started/getting_started.html) section of our documentation.\n", "\n", "Disclaimer: it is highly recommended to run this tutorial within MAAP’s ADE, which already includes packages specific to MAAP, such as maap-py. Running the tutorial outside of the MAAP ADE may lead to errors."
] }, { "cell_type": "markdown", "id": "c55b4ef3-0945-469e-93fc-0c17f51c6ba1", "metadata": {}, "source": [ "## Additional Resources" ] }, { "cell_type": "markdown", "id": "0ccea117-eabe-4c93-9d8b-47d36269f628", "metadata": {}, "source": [ "- [earthdata: Python–R Handoff](https://github.com/NASA-Openscapes/earthdata-cloud-cookbook/blob/main/earthdata-cloud-r/python-r-handoff.Rmd) \n", "A notebook in NASA Openscapes that shows users how to access data from S3 links.\n", "\n", "- [MAAP AWS Access Tutorial (R)](https://docs.maap-project.org/en/develop/technical_tutorials/working_with_r/access_aws_maap.html) \n", "Official MAAP documentation showing how to work with AWS-hosted datasets in R.\n" ] }, { "cell_type": "markdown", "id": "c080f33f-85c2-482e-9202-73f2ff5cc5c9", "metadata": {}, "source": [ "## Install/Import Packages\n", " \n", "Let's import the packages necessary for this tutorial." ] }, { "cell_type": "code", "execution_count": 5, "id": "ca21520d-dd36-4772-97d1-88f731f7d79e", "metadata": {}, "outputs": [], "source": [ "from maap.maap import MAAP\n", "from pystac_client import Client\n", "import geopandas as gpd\n", "from osgeo import gdal\n", "import pandas as pd\n", "import boto3\n", "import rasterio\n", "import os\n", "import re\n", "import subprocess\n", "from rasterio.session import AWSSession\n", "from rasterio.env import Env\n" ] }, { "cell_type": "markdown", "id": "6512e30f-47e1-4086-acd8-c8fdeb327dab", "metadata": {}, "source": [ "## Set up Access" ] }, { "cell_type": "markdown", "id": "66b2747b-531c-4ac9-bf8b-283c97187f6d", "metadata": {}, "source": [ "We don’t need to manually handle temporary credentials, but we do need to set the default AWS region to `us-west-2`."
] }, { "cell_type": "code", "execution_count": 6, "id": "171699ce-4e3c-4433-aadb-22d34f48626d", "metadata": {}, "outputs": [], "source": [ "# Connect to MAAP API and S3\n", "maap = MAAP()\n", "s3 = boto3.client(\"s3\", region_name=\"us-west-2\")\n", "\n", "# Workspace bucket used by the examples in this notebook\n", "bucket = \"maap-ops-workspace\"\n" ] }, { "cell_type": "markdown", "id": "8dfe813f-47db-4eb0-9e4e-160cbeb9faea", "metadata": {}, "source": [ "## Explore Buckets" ] }, { "cell_type": "markdown", "id": "c9128aed-e736-4efa-8e20-526463f41994", "metadata": {}, "source": [ "Mounted paths (like `/projects/` or `/shared/`) are convenient for interactive browsing in the ADE, but they can be slower and are not portable to other environments like the DPS. \n", "\n", "For reproducible and scalable workflows, especially those intended to run in the cloud or on DPS, it's recommended to use direct S3 paths or GDAL-style virtual file paths.\n", "\n", "Now that we have access to MAAP buckets, we can retrieve data stored in AWS. Typically, users will interact with two main buckets:\n", "\n", "1. **maap-ops-workspace** – Contains both user-private and user-shared data. \n", " - Private files are found under `s3://maap-ops-workspace/private/<username>/...` \n", " - Shared files are available under `s3://maap-ops-workspace/shared/<username>/...`\n", "\n", "2. **nasa-maap-data-store** – Hosts curated datasets that have been ingested into the MAAP STAC catalog. \n", " - This is the primary location for analysis-ready data used in DPS jobs and shared workflows.\n" ] }, { "cell_type": "markdown", "id": "1ce3ad99-e9c7-4ea7-9c46-67c3da8bb764", "metadata": {}, "source": [ "## User Shared Buckets" ] }, { "cell_type": "markdown", "id": "3a1aa39c-4a58-4623-b1f7-66425407e96f", "metadata": {}, "source": [ "To list objects from a shared bucket, run the code below. Be sure to update the prefix path after \"shared/\" to match your desired directory."
] }, { "cell_type": "code", "execution_count": 21, "id": "9d227f6d-e596-4b3c-be36-a4bc23ccbc5c", "metadata": {}, "outputs": [], "source": [ "s3_response = s3.list_objects_v2(\n", "    Bucket=\"maap-ops-workspace\",\n", "    Prefix=\"shared/alexdevseed/cog-tests/\"\n", ")\n" ] }, { "cell_type": "markdown", "id": "56a5bc77-b22f-49fe-a8b6-6948f41fc4c0", "metadata": {}, "source": [ "To print the key (the object's path within the bucket) of each `.tif` file under that prefix, run the following cell." ] }, { "cell_type": "code", "execution_count": 8, "id": "bffec6dd-d1e7-46f7-850f-1dda26a1280f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "shared/alexdevseed/cog-tests/Landsat8_275_comp_cog_2015-2020_dps.tif\n", "shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog-ovr3.tif\n", "shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog-ovr4.tif\n", "shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog-ovr6.tif\n", "shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog-ovr8.tif\n", "shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog-s3o8.tif\n", "shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif\n" ] } ], "source": [ "# Collect every object key, then keep only the GeoTIFFs\n", "all_objects = [obj[\"Key\"] for obj in s3_response.get(\"Contents\", [])]\n", "tif_objects = [key for key in all_objects if key.endswith(\".tif\")]\n", "for tif in tif_objects:\n", "    print(tif)\n" ] }, { "cell_type": "markdown", "id": "44ecd74b-dae5-4e70-bc16-cad0a25bdcbd", "metadata": {}, "source": [ "## User Private Buckets" ] }, { "cell_type": "markdown", "id": "133c4861-b7fd-4618-8da0-f3687eb8493c", "metadata": {}, "source": [ "To access data in your private bucket, you'll follow a similar approach as before, but with an updated prefix. First, we’ll retrieve your username to correctly construct the path."
] }, { "cell_type": "code", "execution_count": 12, "id": "b271484a-ca5f-4c59-beed-b0ae1f7d4b68", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Username: harshinigirish\n" ] } ], "source": [ "username = maap.profile.account_info()['username']\n", "print(\"Username:\", username)" ] }, { "cell_type": "code", "execution_count": 18, "id": "cde43c59-aa24-4d2c-84a0-4f92c71757a2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "S3 Objects:\n", "shared/harshinigirish/\n", "shared/harshinigirish/GLLIDARPC_FL_20200311_FIA8_l0s12.las\n" ] } ], "source": [ "# This cell lists your own folder under shared/; see the note below for the private prefix\n", "prefix = f\"shared/{username}/\"\n", "s3_response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)\n", "s3_object_keys = [obj[\"Key\"] for obj in s3_response.get(\"Contents\", [])]\n", "print(\"S3 Objects:\")\n", "for key in s3_object_keys:\n", "    print(key)" ] }, { "cell_type": "markdown", "id": "c30e94ba-d8c4-4bd2-9441-b5133c5d5188", "metadata": {}, "source": [ "**Note**:\n", "The cell above and the examples that follow read from shared locations rather than private ones, but the process for private data is exactly the same. The only difference is the prefix path: use your username directly instead of `shared/username`." ] }, { "cell_type": "markdown", "id": "76b779e3-5e85-4812-9272-dacff868a918", "metadata": {}, "source": [ "## nasa-maap-data-store Buckets" ] }, { "cell_type": "markdown", "id": "ec348699-8016-4ede-82ef-4898bc2fec66", "metadata": {}, "source": [ "To access data from the `nasa-maap-data-store` bucket, we’ll use a STAC query via the `pystac-client` library to retrieve item metadata, including file paths. These paths can then be used with tools that support STAC or direct S3 access."
] }, { "cell_type": "code", "execution_count": 14, "id": "cac5ac17-db0f-4003-877b-d8cfe15e1c3a", "metadata": {}, "outputs": [], "source": [ "stac_url = \"https://stac.maap-project.org/\"\n", "client = Client.open(stac_url)" ] }, { "cell_type": "markdown", "id": "55e554cb-8020-448a-a8d7-164a523f43a2", "metadata": {}, "source": [ "In this example, we'll query the `icesat2-boreal` collection to explore its available data items." ] }, { "cell_type": "code", "execution_count": 15, "id": "0c412a95-f83b-4aab-9d53-ff6422381257", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First 10 STAC Items:\n", "boreal_agb_202302151676439579_1326\n", "boreal_agb_202302151676435792_3402\n", "boreal_agb_202302151676435665_3417\n", "boreal_agb_202302151676434536_3215\n", "boreal_agb_202302151676434460_3035\n", "boreal_agb_202302151676432986_2782\n", "boreal_agb_202302151676430990_1278\n", "boreal_agb_202302151676430794_26340\n", "boreal_agb_202302151676430633_40664\n", "boreal_agb_202302151676430594_0611\n" ] } ], "source": [ "collection_id = \"icesat2-boreal\"\n", "search = client.search(collections=[collection_id], max_items=10)\n", "# items() replaces the deprecated get_items() in recent pystac-client releases\n", "items = list(search.items())\n", "print(\"First 10 STAC Items:\")\n", "for item in items:\n", "    print(item.id)\n" ] }, { "cell_type": "markdown", "id": "39695ae7-faab-4f42-abdd-5c2bb0f631a7", "metadata": {}, "source": [ "Now that we've specified our collection and retrieved a list of items, we can extract the S3 URL of the first asset attached to the first item."
] }, { "cell_type": "code", "execution_count": 16, "id": "4f853e97-1c01-4cc3-aa4c-a0fde771b364", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "S3 URL: s3://nasa-maap-data-store/file-staging/nasa-map/icesat2-boreal/boreal_agb_202302151676439579_1326_train_data.csv\n" ] } ], "source": [ "first_item = items[0]\n", "asset_href = list(first_item.assets.values())[0].href\n", "print(\"S3 URL:\", asset_href)\n" ] }, { "cell_type": "markdown", "id": "f9a22b3a-182e-4fed-94c5-4da041d45e33", "metadata": {}, "source": [ "## Accessing an Item" ] }, { "cell_type": "markdown", "id": "fd914b5b-c017-4d75-9bcc-d5dcec6521af", "metadata": {}, "source": [ "## TIFF" ] }, { "cell_type": "markdown", "id": "199152ca-1b08-4362-8b9a-32d0fbb8ec28", "metadata": {}, "source": [ "In this example, we’ll access a TIFF file from a shared S3 bucket. To read the file directly from S3, the path must begin with `/vsis3/`. We'll construct the full path by combining `/vsis3/` with the bucket name and the object key." ] }, { "cell_type": "code", "execution_count": 22, "id": "25110ca5-13ce-465c-8292-5e07258d5f03", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TIFF path: /vsis3/maap-ops-workspace/shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif\n" ] } ], "source": [ "key = \"shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif\"\n", "tiff_path = f\"/vsis3/{bucket}/{key}\"\n", "print(\"TIFF path:\", tiff_path)" ] }, { "cell_type": "markdown", "id": "7d362bf3-18fe-400c-9718-f52082994296", "metadata": {}, "source": [ "This code block uses the `rio cogeo info` command-line tool to inspect a Cloud Optimized GeoTIFF (COG) directly from S3. \n", "It prints detailed metadata specific to the COG structure, such as tile layout, overviews, and internal organization.
This information is useful for evaluating whether the file is optimized for cloud-based access and helps inform decisions before processing or visualization.\n" ] }, { "cell_type": "code", "execution_count": 26, "id": "89c5f895-5bad-47ac-b9a9-95c6043bd6ca", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Driver: GTiff\n", "File: /vsis3/maap-ops-workspace/shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif\n", "COG: True\n", "Compression: None\n", "ColorSpace: None\n", "\n", "Profile\n", " Width: 3000\n", " Height: 3000\n", " Bands: 4\n", " Tiled: True\n", " Dtype: float32\n", " NoData: -3.3999999521443642e+38\n", " Alpha Band: False\n", " Internal Mask: False\n", " Interleave: PIXEL\n", " ColorMap: False\n", " ColorInterp: ('gray', 'undefined', 'undefined', 'undefined')\n", " Scales: (1.0, 1.0, 1.0, 1.0)\n", " Offsets: (0.0, 0.0, 0.0, 0.0)\n", "\n", "Geo\n", " Crs: PROJCS[\"unknown\",GEOGCS[\"NAD83\",DATUM[\"North_American_Datum_1983\",SPHEROID[\"GRS 1980\",6378137,298.257222101004,AUTHORITY[\"EPSG\",\"7019\"]],AUTHORITY[\"EPSG\",\"6269\"]],PRIMEM[\"Greenwich\",0],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AUTHORITY[\"EPSG\",\"4269\"]],PROJECTION[\"Albers_Conic_Equal_Area\"],PARAMETER[\"latitude_of_center\",40],PARAMETER[\"longitude_of_center\",180],PARAMETER[\"standard_parallel_1\",50],PARAMETER[\"standard_parallel_2\",70],PARAMETER[\"false_easting\",0],PARAMETER[\"false_northing\",0],UNIT[\"metre\",1,AUTHORITY[\"EPSG\",\"9001\"]],AXIS[\"Easting\",EAST],AXIS[\"Northing\",NORTH]]\n", " Origin: (-1791478.0, 7983304.0)\n", " Resolution: (30.0, -30.0)\n", " BoundingBox: (-1791478.0, 7893304.0, -1701478.0, 7983304.0)\n", " MinZoom: 6\n", " MaxZoom: 11\n", "\n", "Image Metadata\n", " AREA_OR_POINT: Area\n", "\n", "Image Structure\n", " LAYOUT: COG\n", " INTERLEAVE: PIXEL\n", "\n", "Band 1\n", " ColorInterp: gray\n", " Metadata:\n", " STATISTICS_MAXIMUM: 642.62117058029\n", " STATISTICS_MEAN: 
54.162682191363\n", " STATISTICS_MINIMUM: 4.258820251973\n", " STATISTICS_STDDEV: 41.175124882695\n", "\n", "Band 2\n", " ColorInterp: undefined\n", " Metadata:\n", " STATISTICS_MAXIMUM: 334.88861093713\n", " STATISTICS_MEAN: 10.524834878905\n", " STATISTICS_MINIMUM: 0.62954892025643\n", " STATISTICS_STDDEV: 10.751815070336\n", "\n", "Band 3\n", " ColorInterp: undefined\n", " Metadata:\n", " STATISTICS_MAXIMUM: 178.32760325572\n", " STATISTICS_MEAN: 41.353653830476\n", " STATISTICS_MINIMUM: 3.408856940596\n", " STATISTICS_STDDEV: 31.934002621158\n", "\n", "Band 4\n", " ColorInterp: undefined\n", " Metadata:\n", " STATISTICS_MAXIMUM: 850.18554567087\n", " STATISTICS_MEAN: 72.227219607842\n", " STATISTICS_MINIMUM: 6.5033239743444\n", " STATISTICS_STDDEV: 52.055061765112\n", "\n", "IFD\n", " Id Size BlockSize Decimation \n", " 0 3000x3000 512x512 0\n", " 1 1500x1500 512x512 2\n", " 2 750x750 512x512 4\n", " 3 375x375 512x512 8\n", "\n" ] } ], "source": [ "cmd = [\"rio\", \"cogeo\", \"info\", tiff_path]\n", "result = subprocess.run(cmd, capture_output=True, text=True)\n", "print(result.stdout)" ] }, { "cell_type": "markdown", "id": "9df6f20d-ee34-46f9-a406-cd4b4eddf764", "metadata": {}, "source": [ "As a best practice, it's important to know which GDAL drivers are available, as using the appropriate driver ensures efficient and reliable access to geospatial data. Different drivers support different formats (e.g., GeoTIFF, NetCDF, Shapefile), and selecting the right one can significantly impact performance and compatibility." ] }, { "cell_type": "markdown", "id": "021f251f-e641-4131-8a23-c0c0455d8aba", "metadata": {}, "source": [ "Please refer to the [\"GDAL OGR driver list\"](https://gdal.org/en/stable/drivers/vector/index.html) for more details." 
] }, { "cell_type": "code", "execution_count": 24, "id": "02560b0f-cb95-4981-9d74-3549431c21c9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "VRT        | Virtual Raster\n", "GTI        | GDAL Raster Tile Index\n", "DERIVED    | Derived datasets using VRT pixel functions\n", "GTiff      | GeoTIFF\n", "COG        | Cloud optimized GeoTIFF generator\n" ] } ], "source": [ "with rasterio.Env() as env:\n", "    # env.drivers() maps each GDAL driver's short name to its long (descriptive) name\n", "    drivers = list(env.drivers().items())\n", "    for short_name, long_name in drivers[:5]:\n", "        print(f\"{short_name:<10} | {long_name}\")\n" ] }, { "cell_type": "markdown", "id": "13a87322-b070-4430-bdc0-2f1804560d2f", "metadata": {}, "source": [ "The same file can also be inspected with the `gdalinfo` command-line utility. Because the file path is formatted with `/vsis3/`, GDAL can access the cloud-hosted data directly. Running `gdalinfo` on that path from Python, for example through the `subprocess` module as in the `rio cogeo info` cell above, captures and prints detailed metadata about the raster file, such as its size, projection, and geotransform." ] }, { "cell_type": "markdown", "id": "27f764b7-afa5-44cb-a5f3-ef6978518517", "metadata": {}, "source": [ "## Vector" ] }, { "cell_type": "markdown", "id": "db045186-82ca-4844-9255-046ed498c6a6", "metadata": {}, "source": [ "In this example, we access a `GeoPackage` file stored in a shared S3 bucket using the geopandas package. As with raster data, we prepend `/vsis3/` to the file path so that GDAL can stream the data directly from S3 without downloading it locally."
] }, { "cell_type": "code", "execution_count": 27, "id": "5cfbd239-0fb6-4e55-9634-2781423aef5d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GeoPackage path: /vsis3/maap-ops-workspace/shared/smk0033/CONUSbiohex2020/biohex.gpkg\n", "/vsis3/maap-ops-workspace/shared/smk0033/CONUSbiohex2020/biohex.gpkg\n" ] } ], "source": [ "prefix = \"shared/smk0033/CONUSbiohex2020/biohex.gpkg\"\n", "response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)\n", "gpkg_keys = [obj[\"Key\"] for obj in response.get(\"Contents\", []) if obj[\"Key\"].endswith(\".gpkg\")]\n", "\n", "key = gpkg_keys[0]\n", "vector_path = f\"/vsis3/{bucket}/{key}\"\n", "print(\"GeoPackage path:\", vector_path)\n", "\n", "gdf = gpd.read_file(vector_path)\n", "print(vector_path)\n" ] }, { "cell_type": "code", "execution_count": 28, "id": "360211b5-7bf5-4171-83da-9b3a3140438c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " USHEXES_ID EMAP_HEX PROP_FORES SE_PROP_FO CRM_LIVE SE_CRM_LIV \\\n", "0 1680 1680.0 0.966835 3.247659 76.729213 14.810822 \n", "1 1681 1681.0 0.983914 1.123591 72.751194 10.498955 \n", "2 1568 1568.0 0.854100 12.539034 88.527037 20.416719 \n", "3 1456 1456.0 0.543536 22.598699 52.052440 40.713392 \n", "4 1345 1345.0 0.520229 23.210199 42.179547 29.260777 \n", "\n", " CRM_STND_D SE_CRM_STN CRM_LIVE_D SE_CRM_L_1 ... SE_JENK_LI \\\n", "0 2.091053 68.108338 78.820266 15.381299 ... 17.287907 \n", "1 1.870613 25.186416 74.621807 10.496552 ... 9.390085 \n", "2 0.703147 58.649462 89.230184 20.333036 ... 20.126408 \n", "3 3.783766 37.665236 55.836206 39.080061 ... 37.532659 \n", "4 0.340501 50.498881 42.520048 29.336943 ... 
27.366023 \n", "\n", " JENK_STND_ SE_JENK_ST JENK_LIVE_ SE_JENK__1 EST_SAMPLE SAMPLED_PL \\\n", "0 23.530244 59.050753 127.717229 10.583475 14242.786806 6.0 \n", "1 9.422362 18.234947 117.190761 9.232750 47158.889642 19.0 \n", "2 2.643056 45.610955 109.839281 19.858437 21226.969702 9.0 \n", "3 13.858363 29.379562 81.440374 35.288359 23836.849808 10.0 \n", "4 2.744681 41.864178 59.346039 27.787539 37744.619593 16.0 \n", "\n", " NON_SAMPLE AVG_INVYR geometry \n", "0 0.0 2017.3 MULTIPOLYGON (((-69.33725 47.57343, -69.3385 4... \n", "1 0.0 2017.1 MULTIPOLYGON (((-69.1548 47.36372, -69.3385 47... \n", "2 0.0 2016.6 MULTIPOLYGON (((-69.1548 47.36372, -69.15631 4... \n", "3 0.0 2017.4 MULTIPOLYGON (((-68.78615 47.36264, -68.78862 ... \n", "4 0.0 2016.9 MULTIPOLYGON (((-68.41737 47.3604, -68.4208 47... \n", "\n", "[5 rows x 27 columns]\n" ] } ], "source": [ "print(gdf.head())" ] }, { "cell_type": "markdown", "id": "e6467faa-e215-4695-8757-5bc7a2639069", "metadata": {}, "source": [ "## CSV (Spatial)" ] }, { "cell_type": "markdown", "id": "93d82664-0191-40ce-9c86-c19080d95f64", "metadata": {}, "source": [ "In this example, we’ll load a CSV file containing spatial data directly from the MAAP STAC results. We use the item variable, and then modify it to stream data using the /vsis3/ prefix." ] }, { "cell_type": "code", "execution_count": 29, "id": "5c10524a-5549-47a8-ab41-f72cf7a133a2", "metadata": {}, "outputs": [], "source": [ "asset_href = \"s3://nasa-maap-data-store/file-staging/nasa-map/icesat2-boreal/boreal_agb_202302151676439579_1326_train_data.csv\"\n" ] }, { "cell_type": "markdown", "id": "92ae5cf6-53e8-4f36-a10f-6e2b7e568d63", "metadata": {}, "source": [ "Since we already have a complete S3 path, we convert the `\"s3://\"` prefix to `\"/vsis3/\"`. 
Additionally, we define the appropriate field names for longitude and latitude so that the file is interpreted as spatial.\n", "\n", "To learn more, refer to the [GDAL Comma Separated Value (.csv) driver documentation](https://gdal.org/drivers/vector/csv.html).\n" ] }, { "cell_type": "code", "execution_count": 30, "id": "28a84a1b-f653-4603-8327-6a9dfc2e2211", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " lon lat AGB SE \\\n", "0 -76.301546 51.089067 13.2031877105918 0.00120325130936702 \n", "1 -79.011834 50.972447 3.88344532354623 0.00107527195707417 \n", "2 -76.397307 50.458315 4.3007091919769 0.00107527195707417 \n", "3 -76.308436 50.442678 43.3027732332638 0.00120325130936702 \n", "4 -77.456452 52.031459 2.34135031326733 0.00107527195707417 \n", "\n", " geometry \n", "0 POINT (-76.30155 51.08907) \n", "1 POINT (-79.01183 50.97245) \n", "2 POINT (-76.39731 50.45832) \n", "3 POINT (-76.30844 50.44268) \n", "4 POINT (-77.45645 52.03146) \n" ] } ], "source": [ "csv_path = asset_href.replace(\"s3://\", \"/vsis3/\")\n", "gdf = gpd.read_file(\n", " f\"CSV:{csv_path}\",\n", " engine=\"fiona\",\n", " X_POSSIBLE_NAMES=\"lon\",\n", " Y_POSSIBLE_NAMES=\"lat\"\n", ")\n", "print(gdf.head())\n" ] }, { "cell_type": "markdown", "id": "6791af01-bc7f-47fa-8ac6-73ead1bfd46c", "metadata": {}, "source": [ "## CSV (non-spatial)" ] }, { "cell_type": "markdown", "id": "8d47196a-4449-4edf-b9b2-68441682eee3", "metadata": {}, "source": [ "For this example, we’ll access a CSV file from our shared bucket." 
] }, { "cell_type": "code", "execution_count": 31, "id": "45a3521a-111d-4a45-82b9-03943ef9c6b4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "shared/smk0033/csv_ex/country_estimates_gedi_l4b_v002.csv\n" ] } ], "source": [ "csv_listing = s3.list_objects_v2(Bucket=bucket, Prefix=\"shared/smk0033/csv_ex/\")\n", "csv_keys = [obj[\"Key\"] for obj in csv_listing.get(\"Contents\", [])]\n", "# Select the file by name rather than by position, so the choice does not depend on listing order\n", "csv_key = next(key for key in csv_keys if key.endswith(\"country_estimates_gedi_l4b_v002.csv\"))\n", "print(csv_key)\n" ] }, { "cell_type": "markdown", "id": "3a0fb963-6632-42de-a403-03216d41e1ba", "metadata": {}, "source": [ "Although this CSV file can be accessed directly from shared storage or S3, we’re downloading it locally before reading. This approach helps avoid potential memory issues or latency that can arise when reading files over a network connection, especially for formats like CSV that aren’t inherently cloud-optimized. \n", "\n", "Downloading also ensures better compatibility with processing tools like `pandas`, which expect local file handles for some operations. While cloud-native streaming is preferred for large geospatial formats (e.g., COGs), working with local copies of non-spatial files can improve stability and simplicity in many cases.\n" ] }, { "cell_type": "markdown", "id": "77c27d83-9fc2-44e4-8658-6b21678c55ae", "metadata": {}, "source": [ "Before downloading, let’s create a new directory to hold our file."
] }, { "cell_type": "code", "execution_count": 32, "id": "66f78224-6191-420c-8310-5e2dd351a7ad", "metadata": {}, "outputs": [], "source": [ "os.makedirs(\"./data\", exist_ok=True)" ] }, { "cell_type": "code", "execution_count": 33, "id": "b96cd970-d9fc-417f-abfb-ea1c3ff9ab4b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Filename: country_estimates_gedi_l4b_v002.csv\n" ] } ], "source": [ "#create file name for download\n", "filename = os.path.basename(csv_key)\n", "print(\"Filename:\", filename)" ] }, { "cell_type": "code", "execution_count": 34, "id": "ebe8459b-86c1-4d4c-9061-1613f5e39e38", "metadata": {}, "outputs": [], "source": [ "download_path = os.path.join(\"./data\", filename)\n", "s3.download_file(Bucket=bucket, Key=csv_key, Filename=download_path)" ] }, { "cell_type": "code", "execution_count": 35, "id": "467e8b07-7be8-4fce-941d-d4317701e7e8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Country ISO3 Percent_Forest FAO_Forested_AGBD FAO_Forested_AGBD.1 \\\n", "0 Aruba ABW 2.3 -9999.0 -9999.0 \n", "1 Afghanistan AFG 1.9 -9999.0 -9999.0 \n", "2 Angola AGO 53.4 30.3 16.2 \n", "3 Anguilla AIA 61.1 210.0 128.3 \n", "4 Albania ALB 28.8 -9999.0 -9999.0 \n", "\n", " GEDI_L4B_Total_AGBD GEDI_L4B_Total_AGBD.1 GEDI_L4B_AGBD_SE_Percent \\\n", "0 2.1 0.5 23.6 \n", "1 24.7 1.3 5.4 \n", "2 34.6 0.6 1.9 \n", "3 4.4 1.0 22.5 \n", "4 56.9 1.4 2.5 \n", "\n", " FAO_AGB GEDI_L4B_AGB GEDI_L4B_AGB_SE \n", "0 -9999.00000 0.000036 0.000008 \n", "1 -9999.00000 1.583907 0.084862 \n", "2 2.02000 4.312326 0.080284 \n", "3 0.00116 0.000035 0.000008 \n", "4 -9999.00000 0.161158 0.004026 \n" ] } ], "source": [ "# Read CSV into DataFrame\n", "data = pd.read_csv(download_path)\n", "print(data.head())\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, 
"file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 5 }