Disk Storage on MAAP
There are 3 types of storage available with different performance and access characteristics:
Home Directory:
Properties
Moderate performance
Mounted to workspaces for you
Is not available inside DPS jobs.
On the MAAP ADE this is /projects/, and is an EFS drive.
On the MAAP Hub this will be /home/jovyan/, and is an EBS volume on an NFS share. jovyan is the default username on JupyterHub deployments and can be ignored. Use the ~ path expansion in your code to avoid issues when moving across platforms with different usernames; see the related note below.
Usage Recommendations
This should be used for code, small sample files and other documents.
~/ is an alias for your home directory, but be careful: not every programming language automatically expands it to the correct path. In Python use os.path.expanduser("~/test.log"); in R use path.expand("~/").
Backed up daily, with backups kept for 30 days. However, we recommend pushing all code to a git repository (like GitHub) so you have a web-accessible backup and an easy way to share.
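As a minimal sketch of the tilde expansion mentioned above, the following uses only the Python standard library. The file name test.log comes from the example above; the variable names are illustrative.

```python
import os
from pathlib import Path

# open() and most libraries treat "~" as a literal character,
# so expand it explicitly before using the path.
log_path = os.path.expanduser("~/test.log")

# pathlib offers the same expansion through Path.expanduser().
same_path = Path("~/test.log").expanduser()

print(log_path)
print(same_path)
```

Both calls resolve against the current user's home directory, so the same code should work whether that is /projects/ or /home/jovyan/.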
Management
You should keep tabs on how much space you're using with du, and occasionally empty your trash with rm or find.
Space Usage
To see how much space your files are using, run this command in a terminal (it reports on the current directory, .):
du -ch --max-depth=1 --exclude='*/triaged-jobs*' --exclude='*/shared-buckets*' --exclude='*/my-private-bucket*' --exclude='*/my-public-bucket*' . | sort -h
You can modify this command to explore specific subdirectories further.
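If you would rather inspect usage from a notebook, a rough Python equivalent might look like the sketch below. It is an approximation, not a replacement for du: it sums apparent file sizes rather than disk blocks, and it only skips the excluded names at the top level. The function name usage_by_subdir is illustrative; the excluded names are taken from the du command above.

```python
import os

# Names copied from the --exclude patterns in the du command above.
EXCLUDED = ("triaged-jobs", "shared-buckets", "my-private-bucket", "my-public-bucket")

def usage_by_subdir(root, excluded=EXCLUDED):
    """Return {top-level entry name: total apparent bytes of the files beneath it}."""
    totals = {}
    for entry in os.scandir(root):
        if entry.name.startswith(tuple(excluded)):
            continue  # skip bucket mounts and triaged jobs
        if entry.is_file(follow_symlinks=False):
            totals[entry.name] = entry.stat(follow_symlinks=False).st_size
        elif entry.is_dir(follow_symlinks=False):
            total = 0
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    try:
                        total += os.lstat(os.path.join(dirpath, name)).st_size
                    except OSError:
                        pass  # file vanished or is unreadable; ignore it
            totals[entry.name] = total
    return totals
```

Calling usage_by_subdir(os.path.expanduser("~")) and sorting the result by value gives a listing similar in spirit to the sorted du output.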
Removing Files
A couple of places can easily build up lots of old files: your trash folder, where files go when you delete them using the File Explorer, and core dump files, e.g. code.###, created when a kernel crashes.
Find all files in your trash older than 30 days:
find ~/.local/share/Trash/files/ -type f -mtime +30
Permanently delete those files by adding -delete to the find command:
find ~/.local/share/Trash/files/ -type f -mtime +30 -delete
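The same list-then-delete pattern can be sketched in Python. This is an approximation of the find commands above, not an exact match: find -mtime +30 counts whole 24-hour periods, while this sketch compares modification times against a simple cutoff. The function names old_files and purge are illustrative.

```python
import time
from pathlib import Path

def old_files(folder, days=30):
    """Yield regular files under `folder` last modified more than `days` days ago."""
    cutoff = time.time() - days * 86400
    for path in Path(folder).rglob("*"):
        if path.is_symlink() or not path.is_file():
            continue  # only consider regular files, like find -type f
        if path.stat().st_mtime < cutoff:
            yield path

def purge(folder, days=30, dry_run=True):
    """Return the matching files; delete them only when dry_run is False."""
    matches = list(old_files(folder, days))  # collect first, then delete
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Defaulting to dry_run=True mirrors running find without -delete first, so you can check the list before anything is removed. A call such as purge(Path("~/.local/share/Trash/files").expanduser()) would list the candidates.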
Local Disk:
Properties
Fastest performance
Use /tmp. This will be cleared on every reboot of your workspace.
There is no backup.
Usage Recommendations
Put intermediate products here: things you make, or things you copy from elsewhere (Internet, Buckets) as an intermediate hop from one step to the next.
Make sure to copy anything you need to keep to your Buckets before closing your workspace or leaving it for more than a few hours.
/tmp also exists on DPS workers, so you can reliably make it part of your scripts.
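One way to follow these recommendations is to do the work in a scratch directory under /tmp and copy only the results you want to keep. The sketch below is a hypothetical helper, not a MAAP API: run_in_scratch and make_outputs are illustrative names, and the destination (for example a bucket mount such as ~/my-private-bucket in a Workspace) is an assumption you should adjust.

```python
import shutil
import tempfile
from pathlib import Path

def run_in_scratch(make_outputs, keep_dir, scratch_root="/tmp"):
    """Run make_outputs(workdir) in a scratch directory, then copy the files it returns to keep_dir."""
    keep_dir = Path(keep_dir).expanduser()
    keep_dir.mkdir(parents=True, exist_ok=True)
    # The scratch directory and everything in it is removed when the block exits.
    with tempfile.TemporaryDirectory(dir=scratch_root) as workdir:
        outputs = make_outputs(Path(workdir))
        # copyfile copies contents only, which avoids metadata calls
        # that some mounted filesystems may not support.
        return [Path(shutil.copyfile(p, keep_dir / Path(p).name)) for p in outputs]
```

Here make_outputs is any function that writes files into the directory it is given and returns the paths worth keeping; intermediate files it leaves behind are cleaned up automatically.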
Management
The disk space for /tmp is not infinite. If you are doing a large number of operations, you may need to manage cleanup manually to ensure there is enough free space. If you are constantly short on space, please contact support to increase the size of the /tmp disk.
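To check free space before a large operation, a small sketch using the standard library's shutil.disk_usage could look like this. The helper names and the 10% safety margin are illustrative assumptions, not platform settings.

```python
import shutil

def free_gib(path="/tmp"):
    """Return the free space on the filesystem holding `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30

def has_room(needed_bytes, path="/tmp", margin=0.10):
    """True if needed_bytes plus a safety margin fits in the free space at `path`."""
    return shutil.disk_usage(path).free >= needed_bytes * (1 + margin)
```

A script could call has_room(expected_size) before downloading or generating a large intermediate file, and clean up older intermediates when it returns False.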
Buckets:
Properties
Slower initial read (higher latency)
High throughput (many parallel reads)
Available in both Workspaces and DPS
Lots of space
Cheap for large amounts of files
Allows sharing data across users
Mounted automatically in Workspaces for convenience
Optimizing performance depends on the libraries and file formats used. Cloud-native formats perform best.
Usage Recommendations
Keep your Data here
For various reasons, code (notebooks) or git repos stored in Buckets can behave unpredictably; please use your home directory for these.
Bucket mounts (local paths) are not available in DPS jobs, but S3 paths remain the same. Direct reads from S3 paths are encouraged for best performance. For some formats, like netCDF3, you might need to copy the file to /tmp before opening it for best performance.
Private: for your user; your DPS output is automatically placed here.
Public/Shared: for sharing files with other users (public is read/write for your user, shared is read-only for everyone). To be clear, “Public” means other users on the platform, not the whole internet.
Triaged: for debugging failed DPS jobs.
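For formats that read poorly straight from S3, the copy-to-/tmp step can be sketched as below. Everything here is an assumption for illustration: the helper names are hypothetical, the bucket and key in the usage note are placeholders, and the download step assumes boto3 is installed with credentials already configured in your environment.

```python
from pathlib import Path

def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"missing bucket or key: {uri!r}")
    return bucket, key

def scratch_path(uri, scratch_root="/tmp"):
    """Local path under the scratch directory that mirrors the bucket and key."""
    bucket, key = split_s3_uri(uri)
    return Path(scratch_root) / bucket / key

def fetch_to_scratch(uri, scratch_root="/tmp"):
    """Download the object to local disk and return the local path."""
    import boto3  # assumption: boto3 is available and credentials are configured
    bucket, key = split_s3_uri(uri)
    local = scratch_path(uri, scratch_root)
    local.parent.mkdir(parents=True, exist_ok=True)
    boto3.client("s3").download_file(bucket, key, str(local))
    return local
```

Because the function takes an S3 path rather than a bucket mount path, the same call, for example fetch_to_scratch("s3://example-bucket/example/file.nc") with a placeholder URI, should behave the same in a Workspace and in a DPS job.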
Management
Just because it’s easy to store large amounts of files and the cost is cheaper does not mean you shouldn’t ever clean up. TODO: Tips on how to find things worth deleting.
Versioned: we keep old versions for 30 days (this includes deletions), so it’s safe to replace files.