{ "cells": [ { "cell_type": "markdown", "id": "cd4ecddb-68da-4f55-af40-114124093bad", "metadata": {}, "source": [ "# Disk Storage on MAAP\n", "\n", "There are three types of storage available, each with different performance and access characteristics:" ] }, { "cell_type": "markdown", "id": "c736d86c-885c-4d47-8461-d623d2dbca33", "metadata": { "jp-MarkdownHeadingCollapsed": true }, "source": [ "## Home Directory: \n", "\n", "### Properties\n", "- **Moderate performance** \n", "- Mounted to workspaces for you \n", "- Not available inside DPS jobs\n", "- On MAAP ADE this is `/projects/`, an EFS drive \n", "- On MAAP Hub this is `/home/jovyan/`, an EBS volume on an NFS share. `jovyan` is the default username on JupyterHub deployments and can be ignored. Use `~` path expansion in your code to avoid issues when moving across platforms with different usernames; see the related note below.\n", "\n", "\n", "### Usage Recommendations\n", "\n", "- Use this for code, small sample files, and other documents.\n", "- `~/` is an alias for your home directory, but be careful: not every programming language automatically expands it to the correct path. In Python use `os.path.expanduser(\"~/test.log\")`; in R use `path.expand(\"~/\")`.\n", "- Backed up daily, with backups kept for 30 days. However, we recommend pushing all code to a git repository (such as GitHub) so you have a web-accessible backup and an easy way to share it.\n", "\n", "### Management\n", "\n", "Keep tabs on how much space you're using with `du`, and occasionally empty your trash with `rm` or `find`.\n", "\n", "#### Space Usage\n", "To see how much space your files are using, run this command in a terminal:\n", "```\n", "du -ch --max-depth=1 --exclude='*/triaged-jobs*' --exclude='*/shared-buckets*' --exclude='*/my-private-bucket*' --exclude='*/my-public-bucket*' . 
| sort -h \n", "```\n", "You can modify this command to further explore specific subdirectories.\n", "\n", "#### Removing Files\n", "Two common places can easily build up lots of old files: your trash folder, where files go when you delete them with the File Explorer, and core dump files (e.g. `core.###`), which are written when a kernel crashes.\n", "\n", "**Find** all files in your trash older than 30 days:\n", "```\n", "find .local/share/Trash/files/* -type f -mtime +30\n", "```\n", "Permanently delete those files with the **find** command:\n", "```\n", "find .local/share/Trash/files/* -type f -mtime +30 -delete\n", "```\n" ] }, { "cell_type": "markdown", "id": "25d7f8ff-9638-454f-9b09-ffbae879cf90", "metadata": {}, "source": [ "## Local Disk: \n", "\n", "### Properties\n", "- **Fastest performance**\n", "- Use `/tmp`; it is cleared on every reboot of your workspace\n", "- There is no backup\n", "\n", "### Usage Recommendations\n", "- Put intermediate products here: things you make, or things you copy from elsewhere (Internet, Buckets) as an intermediate hop from one step to the next\n", "- Make sure to copy anything you need to keep to your Buckets before closing your workspace or leaving for more than a few hours \n", "- Also exists on DPS workers, so you can reliably make it part of your scripts\n", "\n", "### Management\n", "The disk space for `/tmp` is not infinite. If you are doing a large number of operations, you may need to manage cleanup manually to ensure enough free space. If you are constantly short on space, please contact support to increase the size of the `/tmp` disk." 
] }, { "cell_type": "markdown", "id": "47298905-b946-435b-8602-703683dd3070", "metadata": { "jp-MarkdownHeadingCollapsed": true }, "source": [ "## Buckets: \n", "\n", "### Properties\n", "- **Slower initial read (higher latency)**\n", "- High throughput (many parallel reads)\n", "- Available in both Workspaces and DPS\n", "- Lots of space\n", "- Cheap for large amounts of files\n", "- Allows sharing data across users\n", "- Mounted automatically in Workspaces for convenience\n", "- Performance depends on the libraries used and the file formats. Cloud-native formats perform best.\n", "\n", "### Usage Recommendations\n", "- Keep your data here \n", "  - For various reasons, code (notebooks) and git repos stored in Buckets can behave unexpectedly; please keep them in your home directory.\n", "- Bucket mounts (local paths) are not available in DPS jobs, but S3 paths remain the same. Reading directly from S3 paths is encouraged for best performance. For some formats, such as NetCDF-3, you might need to copy the file to `/tmp` before opening it for best performance.\n", "- Private: for your user only; your DPS output is written here automatically\n", "- Public/Shared: for sharing files with other users (public is read/write for your user, shared is read-only for everyone). To be clear, \"Public\" means other users on the platform, not the whole internet.\n", "- Triaged: for debugging failed DPS jobs\n", "\n", "### Management\n", "- Storing large amounts of files is easy and comparatively cheap, but that does not mean you should never clean up. 
TODO: Tips on how to find things worth deleting.\n", "- Versioned: we keep old versions for 30 days (this includes deletions), so it's safe to replace files.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:pangeo] *", "language": "python", "name": "conda-env-pangeo-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 5 }