{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4795f2ad-0dee-4a73-9470-abd674b427a9",
   "metadata": {},
   "source": [
    "# MAAP AWS Access With Python\n",
    "\n",
    "Authors: Harshini Girish (UAH), Sheyenne Kirkland (UAH), Chuck Daniels (Development Seed), Alex Mandel (Development Seed)\n",
    "\n",
    "Date: March 26, 2025\n",
    "\n",
    "Description: In this tutorial, we walk through accessing MAAP data in S3 buckets (`maap-ops-workspace` and `nasa-maap-data-store`) with Python. We’ll also demonstrate opening raster, vector, and text files.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b50fa104-6598-4e9d-a700-81e9c58795c5",
   "metadata": {},
   "source": [
    "## Run This Notebook"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f66b18ad-5109-444d-95aa-60b609a8bc36",
   "metadata": {},
   "source": [
    "To access and run this tutorial within MAAP's Algorithm Development Environment (ADE), please refer to the [\"Getting started with the MAAP\"](https://docs.maap-project.org/en/latest/getting_started/getting_started.html) section of our documentation.\n",
    "\n",
    "Disclaimer: it is highly recommended to run a tutorial within MAAP’s ADE, which already includes packages specific to MAAP, such as maap-py. Running the tutorial outside of the MAAP ADE may lead to errors."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c55b4ef3-0945-469e-93fc-0c17f51c6ba1",
   "metadata": {},
   "source": [
    "## Additional Resources"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ccea117-eabe-4c93-9d8b-47d36269f628",
   "metadata": {},
   "source": [
    "- [earthdata: Python–R Handoff](https://github.com/NASA-Openscapes/earthdata-cloud-cookbook/blob/main/earthdata-cloud-r/python-r-handoff.Rmd)  \n",
    "A notebook in NASA Openscapes that shows users how to access data from S3 links.\n",
    "\n",
    "- [MAAP AWS Access Tutorial (R)](https://docs.maap-project.org/en/develop/technical_tutorials/working_with_r/access_aws_maap.html)  \n",
    "Official MAAP documentation showing how to work with AWS-hosted datasets in R.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c080f33f-85c2-482e-9202-73f2ff5cc5c9",
   "metadata": {},
   "source": [
    "## Install/Import Packages\n",
    "\n",
    "Let's load the packages necessary for this tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "ca21520d-dd36-4772-97d1-88f731f7d79e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from maap.maap import MAAP\n",
    "from pystac_client import Client\n",
    "import geopandas as gpd\n",
    "from osgeo import gdal\n",
    "import pandas as pd\n",
    "import boto3\n",
    "import rasterio\n",
    "import os\n",
    "import re\n",
    "import subprocess\n",
    "from rasterio.session import AWSSession\n",
    "from rasterio.env import Env\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6512e30f-47e1-4086-acd8-c8fdeb327dab",
   "metadata": {},
   "source": [
    "## Set up Access"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66b2747b-531c-4ac9-bf8b-283c97187f6d",
   "metadata": {},
   "source": [
    "We don’t need to manually handle temporary credentials, but we do need to set the default AWS region to `us-west-2`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "171699ce-4e3c-4433-aadb-22d34f48626d",
   "metadata": {},
   "outputs": [],
   "source": [
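    "# Connect to MAAP API and S3\n",
    "maap = MAAP()\n",
    "s3 = boto3.client(\"s3\", region_name=\"us-west-2\")\n",
    "\n",
    "# Workspace bucket used throughout the examples below\n",
    "bucket = \"maap-ops-workspace\"\n"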
   ]
  },
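  {
   "cell_type": "markdown",
   "id": "a7c1e0d2-3b4f-4a6e-9c1d-2f5b8e7a9c01",
   "metadata": {},
   "source": [
    "The `region_name` argument above only configures the boto3 client. Command-line tools and GDAL-based readers that use `/vsis3/` can instead take their region from environment variables such as `AWS_DEFAULT_REGION`. A minimal sketch, assuming no region is configured elsewhere; `ensure_aws_region` is a hypothetical helper, not part of maap-py or boto3:\n",
    "\n",
    "```python\n",
    "import os\n",
    "\n",
    "\n",
    "def ensure_aws_region(default=\"us-west-2\"):\n",
    "    \"\"\"Set AWS_DEFAULT_REGION only if no region is set yet; return the region in effect.\"\"\"\n",
    "    region = os.environ.get(\"AWS_REGION\") or os.environ.get(\"AWS_DEFAULT_REGION\")\n",
    "    if not region:\n",
    "        os.environ[\"AWS_DEFAULT_REGION\"] = default\n",
    "        region = default\n",
    "    return region\n",
    "\n",
    "\n",
    "print(ensure_aws_region())\n",
    "```\n",
    "\n",
    "Because the helper leaves an existing setting untouched, it is safe to call in environments that already define a region."
   ]
  },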
  {
   "cell_type": "markdown",
   "id": "8dfe813f-47db-4eb0-9e4e-160cbeb9faea",
   "metadata": {},
   "source": [
    "## Explore Buckets"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9128aed-e736-4efa-8e20-526463f41994",
   "metadata": {},
   "source": [
    "Mounted paths (like `/projects/` or `/shared/`) are convenient for interactive browsing in the ADE, but they can be slower and are not portable to other environments like the DPS.\n",
    "\n",
    "For reproducible and scalable workflows, especially those intended to run in the cloud or on DPS, it's recommended to use direct S3 paths or GDAL-style virtual file paths.\n",
    "\n",
    "Now that we have access to MAAP buckets, we can retrieve data stored in AWS. Typically, users will interact with two main buckets:\n",
    "\n",
    "1. **maap-ops-workspace** – Contains both user-private and user-shared data.  \n",
    "   - Private files are found under `s3://maap-ops-workspace/<username>/...`  \n",
    "   - Shared files are available under `s3://maap-ops-workspace/shared/<username>/...`\n",
    "\n",
    "2. **nasa-maap-data-store** – Hosts curated datasets that have been ingested into the MAAP STAC catalog.  \n",
    "   - This is the primary location for analysis-ready data used in DPS jobs and shared workflows.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ce3ad99-e9c7-4ea7-9c46-67c3da8bb764",
   "metadata": {},
   "source": [
    "## User Shared Buckets"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a1aa39c-4a58-4623-b1f7-66425407e96f",
   "metadata": {},
   "source": [
    "To list objects from a shared bucket, run the code below. Be sure to update the prefix path after \"shared/\" to match your desired directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "9d227f6d-e596-4b3c-be36-a4bc23ccbc5c",
   "metadata": {},
   "outputs": [],
   "source": [
    "s3_response = s3.list_objects_v2(\n",
    "    Bucket=\"maap-ops-workspace\",\n",
    "    Prefix=\"shared/alexdevseed/cog-tests/\"\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56a5bc77-b22f-49fe-a8b6-6948f41fc4c0",
   "metadata": {},
   "source": [
    "To collect the key (the object's path within the bucket) of each object under that prefix, keeping only the GeoTIFF files, run the following cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "bffec6dd-d1e7-46f7-850f-1dda26a1280f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "shared/alexdevseed/cog-tests/Landsat8_275_comp_cog_2015-2020_dps.tif\n",
      "shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog-ovr3.tif\n",
      "shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog-ovr4.tif\n",
      "shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog-ovr6.tif\n",
      "shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog-ovr8.tif\n",
      "shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog-s3o8.tif\n",
      "shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif\n"
     ]
    }
   ],
   "source": [
    "all_objects = [obj[\"Key\"] for obj in s3_response.get(\"Contents\", [])]\n",
    "tif_objects = [key for key in all_objects if key.endswith(\".tif\")]\n",
    "for tif in tif_objects:\n",
    "    print(tif)\n"
   ]
  },
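  {
   "cell_type": "markdown",
   "id": "b3d9f2a4-6c1e-4f7b-8a2d-5e0c7d1b4f02",
   "metadata": {},
   "source": [
    "A single `list_objects_v2` call returns at most 1,000 keys, so larger prefixes need pagination. A minimal sketch using boto3's built-in paginator; `list_keys` is a hypothetical helper that takes the client as an argument:\n",
    "\n",
    "```python\n",
    "def list_keys(s3_client, bucket, prefix, suffix=\"\"):\n",
    "    \"\"\"Return every key under `prefix` that ends with `suffix`, following pagination.\"\"\"\n",
    "    paginator = s3_client.get_paginator(\"list_objects_v2\")\n",
    "    keys = []\n",
    "    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):\n",
    "        # Pages for an empty prefix carry no \"Contents\" entry\n",
    "        for obj in page.get(\"Contents\", []):\n",
    "            if obj[\"Key\"].endswith(suffix):\n",
    "                keys.append(obj[\"Key\"])\n",
    "    return keys\n",
    "\n",
    "\n",
    "# Example (requires S3 access):\n",
    "# tif_objects = list_keys(s3, \"maap-ops-workspace\", \"shared/alexdevseed/cog-tests/\", \".tif\")\n",
    "```\n",
    "\n",
    "With the default empty `suffix`, every key under the prefix is returned."
   ]
  },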
  {
   "cell_type": "markdown",
   "id": "44ecd74b-dae5-4e70-bc16-cad0a25bdcbd",
   "metadata": {},
   "source": [
    "## User Private Buckets"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "133c4861-b7fd-4618-8da0-f3687eb8493c",
   "metadata": {},
   "source": [
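    "To access data in your private bucket, you'll follow a similar approach as before, but with an updated prefix. First, we’ll retrieve your username to correctly construct the path. The listing below uses your own folder under `shared/`; the note after it explains how to point the same code at your private folder."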
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "b271484a-ca5f-4c59-beed-b0ae1f7d4b68",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Username: harshinigirish\n"
     ]
    }
   ],
   "source": [
    "username = maap.profile.account_info()['username']\n",
    "print(\"Username:\", username)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "cde43c59-aa24-4d2c-84a0-4f92c71757a2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "S3 Objects:\n",
      "shared/harshinigirish/\n",
      "shared/harshinigirish/GLLIDARPC_FL_20200311_FIA8_l0s12.las\n"
     ]
    }
   ],
   "source": [
    "# Your own folder under shared/; see the note below for the private prefix\n",
    "prefix = f\"shared/{username}/\"\n",
    "s3_response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)\n",
    "s3_object_keys = [obj[\"Key\"] for obj in s3_response.get(\"Contents\", [])]\n",
    "print(\"S3 Objects:\")\n",
    "for key in s3_object_keys:\n",
    "    print(key)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c30e94ba-d8c4-4bd2-9441-b5133c5d5188",
   "metadata": {},
   "source": [
    "**Note**:\n",
    "The remaining examples read from shared prefixes, but the process for private data is exactly the same. The only difference is the prefix path: use `<username>/` instead of `shared/<username>/`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76b779e3-5e85-4812-9272-dacff868a918",
   "metadata": {},
   "source": [
    "## nasa-maap-data-store Buckets"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec348699-8016-4ede-82ef-4898bc2fec66",
   "metadata": {},
   "source": [
    "To access data from the `nasa-maap-data-store` bucket, we’ll use a STAC query via the `pystac-client` library to retrieve item metadata, including file paths. These paths can then be used with tools that support STAC or direct S3 access."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "cac5ac17-db0f-4003-877b-d8cfe15e1c3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "stac_url = \"https://stac.maap-project.org/\"\n",
    "client = Client.open(stac_url)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55e554cb-8020-448a-a8d7-164a523f43a2",
   "metadata": {},
   "source": [
    "In this example, we'll query the `icesat2-boreal` collection to explore its available data items."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "0c412a95-f83b-4aab-9d53-ff6422381257",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "First 10 STAC Items:\n",
      "boreal_agb_202302151676439579_1326\n",
      "boreal_agb_202302151676435792_3402\n",
      "boreal_agb_202302151676435665_3417\n",
      "boreal_agb_202302151676434536_3215\n",
      "boreal_agb_202302151676434460_3035\n",
      "boreal_agb_202302151676432986_2782\n",
      "boreal_agb_202302151676430990_1278\n",
      "boreal_agb_202302151676430794_26340\n",
      "boreal_agb_202302151676430633_40664\n",
      "boreal_agb_202302151676430594_0611\n"
     ]
    }
   ],
   "source": [
    "collection_id = \"icesat2-boreal\"\n",
    "search = client.search(collections=[collection_id], max_items=10)\n",
    "items = list(search.items())  # items() replaces the deprecated get_items()\n",
    "print(\"First 10 STAC Items:\")\n",
    "for item in items:\n",
    "    print(item.id)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39695ae7-faab-4f42-abdd-5c2bb0f631a7",
   "metadata": {},
   "source": [
    "Now that we've specified our collection and retrieved a list of items, we can extract the S3 URL linked to the first item in the collection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "4f853e97-1c01-4cc3-aa4c-a0fde771b364",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "S3 URL: s3://nasa-maap-data-store/file-staging/nasa-map/icesat2-boreal/boreal_agb_202302151676439579_1326_train_data.csv\n"
     ]
    }
   ],
   "source": [
    "first_item = items[0]\n",
    "asset_href = list(first_item.assets.values())[0].href\n",
    "print(\"S3 URL:\", asset_href)\n"
   ]
  },
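  {
   "cell_type": "markdown",
   "id": "c5e1a3b6-7d2f-4a8c-9b3e-6f1d8e2c5a03",
   "metadata": {},
   "source": [
    "Taking the first asset works for this item, but the position of an asset says nothing about its type. A minimal sketch that selects an asset by file extension instead. `pick_href` is a hypothetical helper that works on a plain mapping of asset name to URL, which can be built from a pystac item with `{name: asset.href for name, asset in first_item.assets.items()}`. The bucket and asset names in the example mapping are made up:\n",
    "\n",
    "```python\n",
    "def pick_href(hrefs, suffix):\n",
    "    \"\"\"Return the first URL in `hrefs` (asset name -> URL) ending with `suffix`, or None.\"\"\"\n",
    "    # Sort by asset name so the result does not depend on dictionary order\n",
    "    for name in sorted(hrefs):\n",
    "        if hrefs[name].lower().endswith(suffix.lower()):\n",
    "            return hrefs[name]\n",
    "    return None\n",
    "\n",
    "\n",
    "# Illustrative mapping only\n",
    "example = {\n",
    "    \"data\": \"s3://example-bucket/path/tile_0001.tif\",\n",
    "    \"training\": \"s3://example-bucket/path/tile_0001_train_data.csv\",\n",
    "}\n",
    "print(pick_href(example, \".csv\"))\n",
    "```\n",
    "\n",
    "Returning `None` when nothing matches lets the caller decide whether a missing asset is an error."
   ]
  },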
  {
   "cell_type": "markdown",
   "id": "f9a22b3a-182e-4fed-94c5-4da041d45e33",
   "metadata": {},
   "source": [
    "## Accessing an Item"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fd914b5b-c017-4d75-9bcc-d5dcec6521af",
   "metadata": {},
   "source": [
    "## TIFF"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "199152ca-1b08-4362-8b9a-32d0fbb8ec28",
   "metadata": {},
   "source": [
    "In this example, we’ll access a TIFF file from a shared S3 bucket. To read the file directly from S3, the path must begin with `/vsis3/`. We'll construct the full path by combining `/vsis3/` with the bucket name and the object key."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "25110ca5-13ce-465c-8292-5e07258d5f03",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "TIFF path: /vsis3/maap-ops-workspace/shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif\n"
     ]
    }
   ],
   "source": [
    "key = \"shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif\"\n",
    "tiff_path = f\"/vsis3/{bucket}/{key}\"\n",
    "print(\"TIFF path:\", tiff_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d362bf3-18fe-400c-9718-f52082994296",
   "metadata": {},
   "source": [
    "This code block uses the `rio cogeo info` command-line tool to inspect a Cloud Optimized GeoTIFF (COG) directly from S3.  \n",
    "It prints detailed metadata specific to the COG structure—such as tile layout, overviews, and internal organization.  This information is useful for evaluating whether the file is optimized for cloud-based access and helps inform decisions before processing or visualization.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "89c5f895-5bad-47ac-b9a9-95c6043bd6ca",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Driver: GTiff\n",
      "File: /vsis3/maap-ops-workspace/shared/alexdevseed/cog-tests/boreal_agb_20211015_0249_cog.tif\n",
      "COG: True\n",
      "Compression: None\n",
      "ColorSpace: None\n",
      "\n",
      "Profile\n",
      "    Width:            3000\n",
      "    Height:           3000\n",
      "    Bands:            4\n",
      "    Tiled:            True\n",
      "    Dtype:            float32\n",
      "    NoData:           -3.3999999521443642e+38\n",
      "    Alpha Band:       False\n",
      "    Internal Mask:    False\n",
      "    Interleave:       PIXEL\n",
      "    ColorMap:         False\n",
      "    ColorInterp:      ('gray', 'undefined', 'undefined', 'undefined')\n",
      "    Scales:           (1.0, 1.0, 1.0, 1.0)\n",
      "    Offsets:          (0.0, 0.0, 0.0, 0.0)\n",
      "\n",
      "Geo\n",
      "    Crs:              PROJCS[\"unknown\",GEOGCS[\"NAD83\",DATUM[\"North_American_Datum_1983\",SPHEROID[\"GRS 1980\",6378137,298.257222101004,AUTHORITY[\"EPSG\",\"7019\"]],AUTHORITY[\"EPSG\",\"6269\"]],PRIMEM[\"Greenwich\",0],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AUTHORITY[\"EPSG\",\"4269\"]],PROJECTION[\"Albers_Conic_Equal_Area\"],PARAMETER[\"latitude_of_center\",40],PARAMETER[\"longitude_of_center\",180],PARAMETER[\"standard_parallel_1\",50],PARAMETER[\"standard_parallel_2\",70],PARAMETER[\"false_easting\",0],PARAMETER[\"false_northing\",0],UNIT[\"metre\",1,AUTHORITY[\"EPSG\",\"9001\"]],AXIS[\"Easting\",EAST],AXIS[\"Northing\",NORTH]]\n",
      "    Origin:           (-1791478.0, 7983304.0)\n",
      "    Resolution:       (30.0, -30.0)\n",
      "    BoundingBox:      (-1791478.0, 7893304.0, -1701478.0, 7983304.0)\n",
      "    MinZoom:          6\n",
      "    MaxZoom:          11\n",
      "\n",
      "Image Metadata\n",
      "    AREA_OR_POINT: Area\n",
      "\n",
      "Image Structure\n",
      "    LAYOUT: COG\n",
      "    INTERLEAVE: PIXEL\n",
      "\n",
      "Band 1\n",
      "    ColorInterp: gray\n",
      "    Metadata:\n",
      "        STATISTICS_MAXIMUM: 642.62117058029\n",
      "        STATISTICS_MEAN: 54.162682191363\n",
      "        STATISTICS_MINIMUM: 4.258820251973\n",
      "        STATISTICS_STDDEV: 41.175124882695\n",
      "\n",
      "Band 2\n",
      "    ColorInterp: undefined\n",
      "    Metadata:\n",
      "        STATISTICS_MAXIMUM: 334.88861093713\n",
      "        STATISTICS_MEAN: 10.524834878905\n",
      "        STATISTICS_MINIMUM: 0.62954892025643\n",
      "        STATISTICS_STDDEV: 10.751815070336\n",
      "\n",
      "Band 3\n",
      "    ColorInterp: undefined\n",
      "    Metadata:\n",
      "        STATISTICS_MAXIMUM: 178.32760325572\n",
      "        STATISTICS_MEAN: 41.353653830476\n",
      "        STATISTICS_MINIMUM: 3.408856940596\n",
      "        STATISTICS_STDDEV: 31.934002621158\n",
      "\n",
      "Band 4\n",
      "    ColorInterp: undefined\n",
      "    Metadata:\n",
      "        STATISTICS_MAXIMUM: 850.18554567087\n",
      "        STATISTICS_MEAN: 72.227219607842\n",
      "        STATISTICS_MINIMUM: 6.5033239743444\n",
      "        STATISTICS_STDDEV: 52.055061765112\n",
      "\n",
      "IFD\n",
      "    Id      Size           BlockSize     Decimation           \n",
      "    0       3000x3000      512x512       0\n",
      "    1       1500x1500      512x512       2\n",
      "    2       750x750        512x512       4\n",
      "    3       375x375        512x512       8\n",
      "\n"
     ]
    }
   ],
   "source": [
    "cmd = [\"rio\", \"cogeo\", \"info\", tiff_path]\n",
    "result = subprocess.run(cmd, capture_output=True, text=True)\n",
    "print(result.stdout)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9df6f20d-ee34-46f9-a406-cd4b4eddf764",
   "metadata": {},
   "source": [
    "As a best practice, it's important to know which GDAL drivers are available, as using the appropriate driver ensures efficient and reliable access to geospatial data. Different drivers support different formats (e.g., GeoTIFF, NetCDF, Shapefile), and selecting the right one can significantly impact performance and compatibility."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "021f251f-e641-4131-8a23-c0c0455d8aba",
   "metadata": {},
   "source": [
    "Please refer to the GDAL [raster driver list](https://gdal.org/en/stable/drivers/raster/index.html) and [vector (OGR) driver list](https://gdal.org/en/stable/drivers/vector/index.html) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "02560b0f-cb95-4981-9d74-3549431c21c9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "VRT        | Virtual Raster\n",
      "GTI        | GDAL Raster Tile Index\n",
      "DERIVED    | Derived datasets using VRT pixel functions\n",
      "GTiff      | GeoTIFF\n",
      "COG        | Cloud optimized GeoTIFF generator\n"
     ]
    }
   ],
   "source": [
    "with rasterio.Env() as env:\n",
    "    # env.drivers() maps each driver's short name to its long (descriptive) name\n",
    "    drivers = list(env.drivers().items())\n",
    "    for short_name, long_name in drivers[:5]:\n",
    "        print(f\"{short_name:<10} | {long_name}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13a87322-b070-4430-bdc0-2f1804560d2f",
   "metadata": {},
   "source": [
    "This code snippet runs the gdalinfo command-line utility from within Python to read metadata from a TIFF file stored in an AWS S3 bucket. The file path is formatted with `/vsis3/`, which allows GDAL to access cloud-hosted data directly. The command is executed using Python’s subprocess module, and the output—containing detailed metadata about the raster file (such as size, projection, and geotransform)—is captured and printed."
   ]
  },
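  {
   "cell_type": "markdown",
   "id": "d7f3b5c8-8e4a-4b9d-a1c5-7a2e9f3d6b04",
   "metadata": {},
   "source": [
    "A minimal sketch of such a `gdalinfo` call, assuming the `gdalinfo` executable is on the `PATH` (it ships with GDAL) and that `tiff_path` is the `/vsis3/` path built earlier. The helper names are hypothetical:\n",
    "\n",
    "```python\n",
    "import shutil\n",
    "import subprocess\n",
    "\n",
    "\n",
    "def gdalinfo_command(path, as_json=False):\n",
    "    \"\"\"Build the gdalinfo argument list for a local or /vsis3/ path.\"\"\"\n",
    "    cmd = [\"gdalinfo\"]\n",
    "    if as_json:\n",
    "        cmd.append(\"-json\")  # machine-readable output\n",
    "    cmd.append(path)\n",
    "    return cmd\n",
    "\n",
    "\n",
    "def run_gdalinfo(path, as_json=False):\n",
    "    \"\"\"Run gdalinfo and return its stdout; raise if the tool is missing or fails.\"\"\"\n",
    "    if shutil.which(\"gdalinfo\") is None:\n",
    "        raise FileNotFoundError(\"gdalinfo not found on PATH\")\n",
    "    result = subprocess.run(\n",
    "        gdalinfo_command(path, as_json), capture_output=True, text=True, check=True\n",
    "    )\n",
    "    return result.stdout\n",
    "\n",
    "\n",
    "# Example (requires GDAL and S3 access):\n",
    "# print(run_gdalinfo(tiff_path))\n",
    "```\n",
    "\n",
    "Passing `check=True` turns a failed read, such as a missing object or denied access, into an exception instead of empty output."
   ]
  },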
  {
   "cell_type": "markdown",
   "id": "27f764b7-afa5-44cb-a5f3-ef6978518517",
   "metadata": {},
   "source": [
    "## Vector"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db045186-82ca-4844-9255-046ed498c6a6",
   "metadata": {},
   "source": [
    "In this example, we access a GeoPackage file stored in a shared S3 bucket using the `geopandas` package. As with raster data, we prepend `/vsis3/` to the file path so that GDAL can stream the data directly from S3 without downloading it locally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "5cfbd239-0fb6-4e55-9634-2781423aef5d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GeoPackage path: /vsis3/maap-ops-workspace/shared/smk0033/CONUSbiohex2020/biohex.gpkg\n"
     ]
    }
   ],
   "source": [
    "prefix = \"shared/smk0033/CONUSbiohex2020/biohex.gpkg\"\n",
    "response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)\n",
    "gpkg_keys = [obj[\"Key\"] for obj in response.get(\"Contents\", []) if obj[\"Key\"].endswith(\".gpkg\")]\n",
    "\n",
    "key = gpkg_keys[0]\n",
    "vector_path = f\"/vsis3/{bucket}/{key}\"\n",
    "print(\"GeoPackage path:\", vector_path)\n",
    "\n",
    "gdf = gpd.read_file(vector_path)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "360211b5-7bf5-4171-83da-9b3a3140438c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   USHEXES_ID  EMAP_HEX  PROP_FORES  SE_PROP_FO   CRM_LIVE  SE_CRM_LIV  \\\n",
      "0        1680    1680.0    0.966835    3.247659  76.729213   14.810822   \n",
      "1        1681    1681.0    0.983914    1.123591  72.751194   10.498955   \n",
      "2        1568    1568.0    0.854100   12.539034  88.527037   20.416719   \n",
      "3        1456    1456.0    0.543536   22.598699  52.052440   40.713392   \n",
      "4        1345    1345.0    0.520229   23.210199  42.179547   29.260777   \n",
      "\n",
      "   CRM_STND_D  SE_CRM_STN  CRM_LIVE_D  SE_CRM_L_1  ...  SE_JENK_LI  \\\n",
      "0    2.091053   68.108338   78.820266   15.381299  ...   17.287907   \n",
      "1    1.870613   25.186416   74.621807   10.496552  ...    9.390085   \n",
      "2    0.703147   58.649462   89.230184   20.333036  ...   20.126408   \n",
      "3    3.783766   37.665236   55.836206   39.080061  ...   37.532659   \n",
      "4    0.340501   50.498881   42.520048   29.336943  ...   27.366023   \n",
      "\n",
      "   JENK_STND_  SE_JENK_ST  JENK_LIVE_  SE_JENK__1    EST_SAMPLE  SAMPLED_PL  \\\n",
      "0   23.530244   59.050753  127.717229   10.583475  14242.786806         6.0   \n",
      "1    9.422362   18.234947  117.190761    9.232750  47158.889642        19.0   \n",
      "2    2.643056   45.610955  109.839281   19.858437  21226.969702         9.0   \n",
      "3   13.858363   29.379562   81.440374   35.288359  23836.849808        10.0   \n",
      "4    2.744681   41.864178   59.346039   27.787539  37744.619593        16.0   \n",
      "\n",
      "   NON_SAMPLE  AVG_INVYR                                           geometry  \n",
      "0         0.0     2017.3  MULTIPOLYGON (((-69.33725 47.57343, -69.3385 4...  \n",
      "1         0.0     2017.1  MULTIPOLYGON (((-69.1548 47.36372, -69.3385 47...  \n",
      "2         0.0     2016.6  MULTIPOLYGON (((-69.1548 47.36372, -69.15631 4...  \n",
      "3         0.0     2017.4  MULTIPOLYGON (((-68.78615 47.36264, -68.78862 ...  \n",
      "4         0.0     2016.9  MULTIPOLYGON (((-68.41737 47.3604, -68.4208 47...  \n",
      "\n",
      "[5 rows x 27 columns]\n"
     ]
    }
   ],
   "source": [
    "print(gdf.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6467faa-e215-4695-8757-5bc7a2639069",
   "metadata": {},
   "source": [
    "## CSV (Spatial)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93d82664-0191-40ce-9c86-c19080d95f64",
   "metadata": {},
   "source": [
    "In this example, we’ll load a CSV file containing spatial data directly from the MAAP STAC results. We reuse the asset URL retrieved from the STAC item earlier, then rewrite it so that GDAL can stream the data using the `/vsis3/` prefix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "5c10524a-5549-47a8-ab41-f72cf7a133a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "asset_href = \"s3://nasa-maap-data-store/file-staging/nasa-map/icesat2-boreal/boreal_agb_202302151676439579_1326_train_data.csv\"\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92ae5cf6-53e8-4f36-a10f-6e2b7e568d63",
   "metadata": {},
   "source": [
    "Since we already have a complete S3 path, we convert the `\"s3://\"` prefix to `\"/vsis3/\"`. Additionally, we define the appropriate field names for longitude and latitude so that the file is interpreted as spatial.\n",
    "\n",
    "To learn more, refer to the [GDAL Comma Separated Value (.csv) driver documentation](https://gdal.org/drivers/vector/csv.html).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "28a84a1b-f653-4603-8327-6a9dfc2e2211",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "         lon        lat               AGB                   SE  \\\n",
      "0 -76.301546  51.089067  13.2031877105918  0.00120325130936702   \n",
      "1 -79.011834  50.972447  3.88344532354623  0.00107527195707417   \n",
      "2 -76.397307  50.458315   4.3007091919769  0.00107527195707417   \n",
      "3 -76.308436  50.442678  43.3027732332638  0.00120325130936702   \n",
      "4 -77.456452  52.031459  2.34135031326733  0.00107527195707417   \n",
      "\n",
      "                     geometry  \n",
      "0  POINT (-76.30155 51.08907)  \n",
      "1  POINT (-79.01183 50.97245)  \n",
      "2  POINT (-76.39731 50.45832)  \n",
      "3  POINT (-76.30844 50.44268)  \n",
      "4  POINT (-77.45645 52.03146)  \n"
     ]
    }
   ],
   "source": [
    "csv_path = asset_href.replace(\"s3://\", \"/vsis3/\")\n",
    "gdf = gpd.read_file(\n",
    "    f\"CSV:{csv_path}\",\n",
    "    engine=\"fiona\",\n",
    "    X_POSSIBLE_NAMES=\"lon\",\n",
    "    Y_POSSIBLE_NAMES=\"lat\"\n",
    ")\n",
    "print(gdf.head())\n"
   ]
  },
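  {
   "cell_type": "markdown",
   "id": "e9a5c7d0-9f6b-4c0e-b2d7-8b3f0a4e7c05",
   "metadata": {},
   "source": [
    "A plain `str.replace` passes any URL that does not start with `s3://` through unchanged, which can hide a mistake until GDAL fails to open the file. A slightly stricter sketch; `to_vsis3` is a hypothetical helper and the example URL is made up:\n",
    "\n",
    "```python\n",
    "def to_vsis3(href):\n",
    "    \"\"\"Convert an s3://bucket/key URL into a GDAL /vsis3/bucket/key path.\"\"\"\n",
    "    scheme = \"s3://\"\n",
    "    if not href.startswith(scheme):\n",
    "        raise ValueError(f\"not an S3 URL: {href!r}\")\n",
    "    return \"/vsis3/\" + href[len(scheme):]\n",
    "\n",
    "\n",
    "print(to_vsis3(\"s3://example-bucket/folder/points.csv\"))\n",
    "```\n",
    "\n",
    "The same helper applies to any asset URL returned by the STAC search, including raster assets."
   ]
  },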
  {
   "cell_type": "markdown",
   "id": "6791af01-bc7f-47fa-8ac6-73ead1bfd46c",
   "metadata": {},
   "source": [
    "## CSV (non-spatial)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d47196a-4449-4edf-b9b2-68441682eee3",
   "metadata": {},
   "source": [
    "For this example, we’ll access a CSV file from our shared bucket."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "45a3521a-111d-4a45-82b9-03943ef9c6b4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "shared/smk0033/csv_ex/country_estimates_gedi_l4b_v002.csv\n"
     ]
    }
   ],
   "source": [
    "csv_listing = s3.list_objects_v2(Bucket=bucket, Prefix=\"shared/smk0033/csv_ex/\")\n",
    "csv_keys = [obj[\"Key\"] for obj in csv_listing.get(\"Contents\", [])]\n",
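    "# Select the file by name; a positional index depends on the prefix's current contents\n",
    "csv_key = next(k for k in csv_keys if k.endswith(\"country_estimates_gedi_l4b_v002.csv\"))\n",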
    "print(csv_key)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a0fb963-6632-42de-a403-03216d41e1ba",
   "metadata": {},
   "source": [
    "Although this CSV file can be read directly from S3, we download it locally before reading. CSV is not a cloud-optimized format: it has to be read from start to finish, so a local copy avoids repeated network transfers and latency when the file is opened more than once.  \n",
    "\n",
    "A local file also works with any tool that expects a path on disk. While cloud-native streaming is preferred for large geospatial formats (e.g., COGs), working with local copies of small non-spatial files is often simpler and more stable.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "77c27d83-9fc2-44e4-8658-6b21678c55ae",
   "metadata": {},
   "source": [
    "Before downloading, let’s create a new directory to hold our file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "66f78224-6191-420c-8310-5e2dd351a7ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "os.makedirs(\"./data\", exist_ok=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "b96cd970-d9fc-417f-abfb-ea1c3ff9ab4b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Filename: country_estimates_gedi_l4b_v002.csv\n"
     ]
    }
   ],
   "source": [
    "# Derive the local file name from the object key\n",
    "filename = os.path.basename(csv_key)\n",
    "print(\"Filename:\", filename)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "ebe8459b-86c1-4d4c-9061-1613f5e39e38",
   "metadata": {},
   "outputs": [],
   "source": [
    "download_path = os.path.join(\"./data\", filename)\n",
    "s3.download_file(Bucket=bucket, Key=csv_key, Filename=download_path)"
   ]
  },
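  {
   "cell_type": "markdown",
   "id": "f1b7d9e2-0a8c-4d1f-83e9-9c4a1b5f8d06",
   "metadata": {},
   "source": [
    "When the notebook is re-run, the download above is repeated each time. A minimal sketch that skips the transfer when a local copy already exists; `download_if_missing` is a hypothetical helper, and it does not check whether the remote object has changed since the last download:\n",
    "\n",
    "```python\n",
    "import os\n",
    "\n",
    "\n",
    "def download_if_missing(s3_client, bucket, key, dest_dir=\"./data\"):\n",
    "    \"\"\"Download s3://bucket/key into dest_dir unless the file is already there; return the local path.\"\"\"\n",
    "    os.makedirs(dest_dir, exist_ok=True)\n",
    "    local_path = os.path.join(dest_dir, os.path.basename(key))\n",
    "    if not os.path.exists(local_path):\n",
    "        s3_client.download_file(Bucket=bucket, Key=key, Filename=local_path)\n",
    "    return local_path\n",
    "\n",
    "\n",
    "# Example (requires S3 access):\n",
    "# download_path = download_if_missing(s3, bucket, csv_key)\n",
    "```\n",
    "\n",
    "Delete the local file to force a fresh download."
   ]
  },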
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "467e8b07-7be8-4fce-941d-d4317701e7e8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       Country ISO3  Percent_Forest  FAO_Forested_AGBD  FAO_Forested_AGBD.1  \\\n",
      "0        Aruba  ABW             2.3            -9999.0              -9999.0   \n",
      "1  Afghanistan  AFG             1.9            -9999.0              -9999.0   \n",
      "2       Angola  AGO            53.4               30.3                 16.2   \n",
      "3     Anguilla  AIA            61.1              210.0                128.3   \n",
      "4      Albania  ALB            28.8            -9999.0              -9999.0   \n",
      "\n",
      "   GEDI_L4B_Total_AGBD  GEDI_L4B_Total_AGBD.1  GEDI_L4B_AGBD_SE_Percent  \\\n",
      "0                  2.1                    0.5                      23.6   \n",
      "1                 24.7                    1.3                       5.4   \n",
      "2                 34.6                    0.6                       1.9   \n",
      "3                  4.4                    1.0                      22.5   \n",
      "4                 56.9                    1.4                       2.5   \n",
      "\n",
      "      FAO_AGB  GEDI_L4B_AGB  GEDI_L4B_AGB_SE  \n",
      "0 -9999.00000      0.000036         0.000008  \n",
      "1 -9999.00000      1.583907         0.084862  \n",
      "2     2.02000      4.312326         0.080284  \n",
      "3     0.00116      0.000035         0.000008  \n",
      "4 -9999.00000      0.161158         0.004026  \n"
     ]
    }
   ],
   "source": [
    "# Read CSV into DataFrame\n",
    "data = pd.read_csv(download_path)\n",
    "print(data.head())\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}