Running Workflow (LIGKA) batch jobs on the Gateway

(Written by Jose, modified by Thomas)

Problem description

AFS cannot be accessed from the compute nodes where queued jobs are executed, so all data, code, etc. that you want to run must be stored in the GPFS storage system, which the compute nodes can read.
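A quick way to check where a path actually lives is df -T, which prints the file-system type (standard GNU df; the exact type names may differ on your system):

df -hT $HOME /pfs/work/$USER
# a type of "afs" means the path is NOT visible from the batch nodes;
# the GPFS path should report a type like "gpfs"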

Workaround

First, load the modules directly from Thomas's folder (these now point to GPFS, so they are available at runtime):

module use ~g2thayw/public/easybuild/modules/all
module load EP-Stability-WF/1.0.4-DD-3.35.0
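To double-check that the module really resolved from Thomas's tree (plain module commands; the output details will vary):

module list
module show EP-Stability-WF/1.0.4-DD-3.35.0   # the printed modulefile path should point into g2thayw's easybuild tree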

Then, move your IMAS database to your folder in GPFS and make a link back to the original location:

mkdir -p /pfs/work/$USER/public && cd ~/public && mv -i imasdb /pfs/work/$USER/public && ln -s /pfs/work/$USER/public/imasdb .

Now, if you run ls -g, you should see something like:

<g2jrueda@s51 ~/public>ls -g
total 3
drwxr-xr-x 4 g2itmuse 2048 Jul 18  2022 imas_actors
lrwxr-xr-x 1 g2itmuse   32 Jul 21 11:52 imasdb -> /pfs/work/g2jrueda/public/imasdb

As you can see, we have imasdb → /pfs/…, meaning that the link was created correctly.

<aside> 💡 Nota bene: your IMAS database is now stored in your GPFS folder which, although we copied it into a folder called public, is not public at all. So after this step, you will need to grant access permissions to your folder `/pfs/work/$USER/public` to each colleague who needs it.

This can be done with chmod -R o+rx /pfs/work/$USER/public (colleagues also need execute permission on /pfs/work/$USER itself to traverse into it). </aside>
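If you prefer granting access to individual colleagues instead of to everyone, and if your GPFS setup honours POSIX ACLs (site-dependent, so treat this as an assumption; g2xxxxx is a placeholder username):

setfacl -m u:g2xxxxx:x /pfs/work/$USER             # allow traversal into your top-level folder
setfacl -R -m u:g2xxxxx:rX /pfs/work/$USER/public  # read access to the data itself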

At this point, all data and code are in GPFS, so in principle things should work. However, IMAS tries to save something in your home directory, which points to AFS, so the HELENA execution enters an infinite non-convergence loop. I found a trick: bypass your home environment variable (!!!).

export HOME=/pfs/work/$USER

<aside> 💡 Many programs depend on hidden configuration files or on relative paths under your home directory. Changing this can have multiple unwanted side effects, so only change it right before submitting the batch work and reset it after the work is done. </aside>
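Putting this advice together, a minimal submission pattern could look like this (the EP_batch arguments are placeholders for whatever you normally pass):

OLD_HOME=$HOME
export HOME=/pfs/work/$USER
EP_batch <your usual arguments>   # submit the batch work with the fake home
export HOME=$OLD_HOME             # restore once the work is done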

Also, before running, increase the memory limit for the run, as the default can be a bit too small (open the batch file and add):

#SBATCH --mem=50GB

This will be added to a future version of the EP_batch command. As a workaround, it can be done by setting the environment variable SBATCH_MEMORY=50GB before running the EP_batch script (export SBATCH_MEMORY=50GB). Please note that 50GB might not be sufficient for larger runs using LIGKA models 1/2; please adjust it according to your needs.
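To see how much memory a finished job actually used, and to tune the value accordingly, standard Slurm accounting can help (substitute your real job ID):

sacct -j <jobid> --format=JobID,JobName,MaxRSS,State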

After running the workflow, the files that were previously saved in your home, such as resp_n*, are now saved in your fake home.
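For example:

ls -l /pfs/work/$USER/resp_n*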

Caveat: any input data that you read must also not be stored in AFS. If you are reading someone else's data, please run the HELENA step of the workflow interactively and submit only the remaining parts.