Let’s assume:
- You are using Linux
- You have successfully gotten access to an HPC cluster, but it is not connected to the internet. You do, however, have SSH access to it (that's the bare minimum requirement)
- You are using Python for the number crunching, and need to install a lot of dependency packages
- You have your own Python packages to install, as well as your research experiments, config files, etc…
So now you have (like me a few months ago) two problems:
- You can’t run your usual pip install numpy torch (and all the other packages you may need) on the cluster
- You have to copy a lot of code, and it needs to be updated every time you add (or remove) a bug
Let’s tackle this
Create a local pip cache#
The best way to know which packages you need is to reproduce all the installation steps locally (if you don’t already have a similar environment). The clusters I have access to use a module load approach for building an environment, and you typically have to use a specific version of Python. To get the same Python version on your local machine, the easiest way is probably to use https://github.com/pyenv/pyenv and its plugin for virtual environments, https://github.com/pyenv/pyenv-virtualenv
On your local machine#
# On your local machine
# You may wish to change the $PYENV_ROOT env variable as well.
# 3.x.y is the python version that you loaded on the cluster.
pyenv install 3.x.y              # build the matching interpreter if missing
pyenv virtualenv 3.x.y workvenv
pyenv local workvenv             # activate the venv in this directory
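To be safe, check that the local interpreter really matches the one the cluster module provides:

python --version  # should print the same 3.x.y as `module load python/3.x.y` on the cluster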
Then, on your local machine, install everything you will need on the cluster.
pip install numpy torch your_local_packages
# you probably want to install classical build dependencies as well
pip install setuptools wheel
Once you are done, let’s gather all the installed packages:
pip freeze | grep -Ev '^(#|-e)' > requirements.txt
This command lists all the installed packages except the editable ones (your local installs, which won’t come from PyPI anyway). Then, we download them all into a local directory:
mkdir pip-cache
pip download -r requirements.txt -d pip-cache
# copy the pip-cache
rsync -av pip-cache your_cluster:/path/to/destination/pip-cache
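One caveat: by default, pip download fetches wheels matching your local platform and Python. If the cluster’s OS, architecture, or Python version differs from your machine’s, you can request the cluster’s wheels explicitly; the tags below are assumptions for a typical x86_64 Linux cluster running CPython 3.10, so check pip debug --verbose on the cluster for the real ones.

# Download wheels for the cluster's platform instead of the local one.
# The tags here are examples: adjust them to match your cluster.
pip download -r requirements.txt -d pip-cache \
    --only-binary=:all: \
    --platform manylinux2014_x86_64 \
    --python-version 3.10 \
    --implementation cp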
On the cluster#
Put this in your ~/.config/pip/pip.conf (note that pip’s config parser only supports comments on their own line, not inline):
[global]
no-index = yes
# edit this path to wherever you copied the cache:
find-links = /path/to/destination/pip-cache
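You can double-check that pip picked up the configuration; pip config list should print something like:

pip config list
# global.find-links='/path/to/destination/pip-cache'
# global.no-index='yes'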
# On the cluster
module load python/3.x.y
# Create a virtualenv on the cluster.
python -m venv mywork-env
Then copy your code to the cluster (or use the handy function in the next section to keep it in sync 😎) and install all your packages (editable or not) using your usual pip install: it will resolve everything from the cache by default.
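For instance, a minimal sketch (requirements.txt is the file generated earlier; path/to/your_local_package stands for whatever code you copied over):

# On the cluster
source mywork-env/bin/activate
pip install -r requirements.txt            # resolved entirely from pip-cache
pip install -e path/to/your_local_package  # your own code, editable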
Code synchronization#
To keep the code, and only the code, up to date on the cluster (you don’t want to clutter it with everything else lying around in your work directory), I use the following script:
#!/bin/bash
# ~/.local/bin/code_sync
# Synchronization of code files from one place to another.
# Usage:
#   code_sync ~/codes my_cluster:work/code
function all_gittracked (){
    # Get all files tracked by git in a directory, recursively.
    local start_dir
    local git_dirs
    local tracked_files
    # Set the starting directory for the search
    start_dir=$1
    # Find all subdirectories that contain a .git folder
    git_dirs=$(find "$(realpath "$start_dir")" -type d -name ".git" -exec dirname {} \;)
    # Loop through the git directories and list tracked files
    for git_dir in $git_dirs; do
        # Change to the git repository directory
        cd "$git_dir" || exit 1 # If we can't enter the directory, fail
        # Skip repositories without a readable HEAD.
        # Discard stdout here: this is only a test, the real listing comes below.
        if ! git ls-tree --full-tree -r --full-name --name-only HEAD >/dev/null 2>&1; then
            continue
        fi
        # Append all files tracked by git in this repository, as absolute paths
        tracked_files="${tracked_files} $(git ls-tree --full-tree -r --full-name --name-only HEAD 2>/dev/null | xargs -I{} readlink -f {})"
    done
    echo "$tracked_files"
    return 0
}
source_path="$1"
target_path="$2"
abs_source_path=$(realpath "$source_path")
# Build the list of tracked files, relative to the source directory
rm -f /tmp/gittracked
all_gittracked "$abs_source_path" | sed "s|$abs_source_path/||g" | tr " " "\n" > /tmp/gittracked
# Initial sync
rsync -rlptDh "$abs_source_path" "$target_path" \
    --progress \
    --delete --force \
    --files-from=/tmp/gittracked \
    --exclude=".git" \
    --verbose
# Re-sync on every change; regenerate the list so newly tracked files are picked up
while inotifywait -r -e modify,create,delete "$abs_source_path"
do
    all_gittracked "$abs_source_path" | sed "s|$abs_source_path/||g" | tr " " "\n" > /tmp/gittracked
    rsync -rlptDh "$abs_source_path" "$target_path" \
        --progress \
        --delete --force \
        --files-from=/tmp/gittracked \
        --verbose
done
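The script relies on inotifywait from inotify-tools, which is often not installed by default; on a Debian-based machine:

sudo apt install inotify-tools
# then keep the sync running in a spare terminal (or tmux) while you work
chmod +x ~/.local/bin/code_sync
code_sync ~/codes my_cluster:work/code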
Conclusion#
Here you go! I know some coworkers are using sshfs, but I am not a fan of it: it induces a lot of back-and-forth communication between the cluster and your local machine, whereas the transferred data is essentially read-only on the cluster.