Let’s assume:
- You are using Linux
- You have successfully gotten access to an HPC cluster, but it is not connected to the internet. You do, however, have SSH access to it (that's the bare minimum requirement)
- You are using Python for the number crunching, and need to install a lot of dependency packages
- You have your own Python packages to install, as well as your research experiments, config files, etc…
So now you have (like me a few months ago) two problems:
- You can’t run your usual pip install numpy torch (and all the other packages you may need) on the cluster
- You have to copy a lot of code, and it needs to be updated every time you add (or remove) a bug
Let’s tackle this
Create a local pip cache#
The best way to know which packages you need is to reproduce all the installation steps locally (if you don’t already have a similar environment). The clusters I have access to use a module load approach for building an environment, and you typically have to use a specific version of Python. To get the same Python version on your local machine, the easiest way is probably to use https://github.com/pyenv/pyenv and its plugin for virtual environments, https://github.com/pyenv/pyenv-virtualenv
On your local machine#
# On your local machine
# You may wish to change the $PYENV_ROOT env variable as well.
# 3.x.y is the python version that you loaded on the cluster.
pyenv install 3.x.y              # build the matching interpreter if missing
pyenv virtualenv 3.x.y workvenv
pyenv local workvenv             # activate the venv in this directory
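To be safe, check that the local interpreter really matches the one the cluster module provides:

python --version  # should print the same 3.x.y as `module load python/3.x.y` on the cluster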
Then, on your local machine, install everything you will need on the cluster.
pip install numpy torch your_local_packages
# you probably want to install classical build dependencies as well
pip install setuptools wheel
Once you are done, let’s gather all the installed packages:
pip freeze | grep -Ev '^(#|-e)' > requirements.txt
This command lists all the installed packages except the editable ones (your local installs, which won’t come from PyPI anyway). Then, we download them all into a local directory:
mkdir pip-cache
pip download -r requirements.txt -d pip-cache
# copy the pip-cache
rsync -av pip-cache your_cluster:/path/to/destination/pip-cache
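One caveat: by default, pip download fetches wheels matching your local platform and Python. If the cluster’s OS, architecture, or Python version differs from your machine’s, you can request the cluster’s wheels explicitly; the tags below are assumptions for a typical x86_64 Linux cluster running CPython 3.10, so check pip debug --verbose on the cluster for the real ones.

# Download wheels for the cluster's platform instead of the local one.
# The tags here are examples: adjust them to match your cluster.
pip download -r requirements.txt -d pip-cache \
    --only-binary=:all: \
    --platform manylinux2014_x86_64 \
    --python-version 3.10 \
    --implementation cp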
On the cluster#
Put this in your ~/.config/pip/pip.conf (note that pip’s config parser only supports comments on their own line, not inline):
[global]
no-index = yes
# edit this path to wherever you copied the cache:
find-links = /path/to/destination/pip-cache
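You can double-check that pip picked up the configuration; pip config list should print something like:

pip config list
# global.find-links='/path/to/destination/pip-cache'
# global.no-index='yes'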
# On the cluster
module load python/3.x.y
# Create a virtualenv on the cluster.
python -m venv mywork-env
Then copy your code to the cluster (or use the handy function in the next section to keep it in sync 😎) and install all your packages (editable or not) using your usual pip install: it will resolve everything from the cache by default.
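For instance, a minimal sketch (requirements.txt is the file generated earlier; path/to/your_local_package stands for whatever code you copied over):

# On the cluster
source mywork-env/bin/activate
pip install -r requirements.txt            # resolved entirely from pip-cache
pip install -e path/to/your_local_package  # your own code, editable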
Code synchronization#
To keep the code, and only the code, up to date on the cluster (you don’t want to clutter it with everything else lying around in your work directory), I use the following script:
#!/bin/bash
# ~/.local/bin/code_sync
# Synchronization of code files from one place to another.
# Usage:
#   code_sync ~/codes my_cluster:work/code
function all_gittracked (){
    # Get all files tracked by git in a directory, recursively.
    local start_dir
    local git_dirs
    local tracked_files
    # Set the starting directory for the search
    start_dir=$1
    # Find all subdirectories that contain a .git folder
    git_dirs=$(find "$(realpath "$start_dir")" -type d -name ".git" -exec dirname {} \;)
    # Loop through the git directories and list tracked files
    for git_dir in $git_dirs; do
        # Change to the git repository directory
        cd "$git_dir" || exit 1 # If we can't enter the directory, fail
        # Skip repositories without a readable HEAD.
        # Discard stdout here: this is only a test, the real listing comes below.
        if ! git ls-tree --full-tree -r --full-name --name-only HEAD >/dev/null 2>&1; then
            continue
        fi
        # Append all files tracked by git in this repository, as absolute paths
        tracked_files="${tracked_files} $(git ls-tree --full-tree -r --full-name --name-only HEAD 2>/dev/null | xargs -I{} readlink -f {})"
    done
    echo "$tracked_files"
    return 0
}
source_path="$1"
target_path="$2"
abs_source_path=$(realpath "$source_path")
# Build the list of tracked files, relative to the source directory
rm -f /tmp/gittracked
all_gittracked "$abs_source_path" | sed "s|$abs_source_path/||g" | tr " " "\n" > /tmp/gittracked
# Initial sync
rsync -rlptDh "$abs_source_path" "$target_path" \
    --progress \
    --delete --force \
    --files-from=/tmp/gittracked \
    --exclude=".git" \
    --verbose
# Re-sync on every change; regenerate the list so newly tracked files are picked up
while inotifywait -r -e modify,create,delete "$abs_source_path"
do
    all_gittracked "$abs_source_path" | sed "s|$abs_source_path/||g" | tr " " "\n" > /tmp/gittracked
    rsync -rlptDh "$abs_source_path" "$target_path" \
        --progress \
        --delete --force \
        --files-from=/tmp/gittracked \
        --verbose
done
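The script relies on inotifywait from inotify-tools, which is often not installed by default; on a Debian-based machine:

sudo apt install inotify-tools
# then keep the sync running in a spare terminal (or tmux) while you work
chmod +x ~/.local/bin/code_sync
code_sync ~/codes my_cluster:work/code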
Conclusion#
Here you go! I know some coworkers are using sshfs, but I am not a fan of it: it induces a lot of back-and-forth communication between the cluster and your local machine, whereas the transferred data is essentially read-only on the cluster.