Submitting parallel compiled code to SLURM

Leos Pohl on 22 Jul 2021
Answered: Raymond Norris on 22 Jul 2021
I have a compiled application that I am trying to run with SLURM on several nodes. The application itself is parallel, using a local parpool. When I submit it with sbatch and the SLURM script below, the parallel pool does not get created and I get the errors and output in the attached files. When I use an interactive shell and execute the same commands on each node by hand, everything works as expected. Interestingly, even with only a single node and a single srun command in the SLURM script, I still get an error, so I am not sure which processes are racing to write to those files; I suspect that error is of a different nature.
When I remove the lines that set MCR_CACHE_ROOT, I get a slightly different error:
Error using parpool (line 113)
Invalid default value for property 'ParallelNode' in class 'parallel.internal.settings.ParallelSettingsTree':
No value is set for setting 'PCTVersionNumber' at any level.
Error in run_getIllumination (line 28)
MATLAB:settings:config:UndefinedSettingValueForLevel
srun: error: ec64: task 2: Exited with exit code 255
Error using parpool (line 113)
Parallel pool failed to start with the following error.
Error in run_getIllumination (line 28)
Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line 676)
Failed to locate and destroy old interactive jobs.
Error using parallel.Cluster/findJob (line 74)
The job storage metadata file '/lustre/fs0/home/lpohl/.mcrCache9.5/run_ge0/local_cluster_jobs/R2018b/matlab_metadata.mat' does not exist or is corrupt. For assistance recovering job
data, contact MathWorks Support Team. Otherwise, delete all files in the JobStorageLocation and try again.
parallel:cluster:PoolCreateFailed
Error using parpool (line 113)
Parallel pool failed to start with the following error.
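The last error message itself suggests a recovery step: delete everything in the JobStorageLocation and try again. A small sketch of that cleanup (the path is the one from my error above; `clear_job_storage` is just a helper name I made up):

```shell
#!/bin/bash
# Hypothetical cleanup, following the advice in the error message:
# remove stale files under the local cluster's JobStorageLocation.
clear_job_storage() {
    local storage="$1"
    if [ -d "$storage" ]; then
        # Delete the (possibly corrupt) metadata and old job files.
        rm -rf -- "${storage:?}"/*
        echo "cleared $storage"
    else
        echo "no job storage at $storage"
    fi
}

# JobStorageLocation reported in the error above (user-specific path).
clear_job_storage "$HOME/.mcrCache9.5/run_ge0/local_cluster_jobs/R2018b"
```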
Can someone help out?
#!/bin/bash
#SBATCH -A user
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=30
#SBATCH --time=00:20:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out
#SBATCH --nodelist=ec64
#Select File to run
export file="run_getIllumination"
export args="~/IlluminationModel/matlab_code/constants.txt"
#Select how logs get stored
mkdir -p "$SLURM_JOB_ID"
export debug_logs="$SLURM_JOB_ID/job_$SLURM_JOB_ID.log"
export benchmark_logs="$SLURM_JOB_ID/benchmark_$SLURM_JOB_ID.log"
#Load Modules
module load matlab/matlab-R2018b
# Enter Working Directory
cd $SLURM_SUBMIT_DIR
# Create Log File
echo $SLURM_SUBMIT_DIR
echo "JobID: $SLURM_JOB_ID" >> $debug_logs
echo "Running on $SLURM_NODELIST" >> $debug_logs
echo "Running on $SLURM_NNODES nodes." >> $debug_logs
echo "Running on $SLURM_NPROCS processors." >> $debug_logs
echo "Current working directory is `pwd`" >> $debug_logs
# Module debugging
module list >> $debug_logs
date >> $benchmark_logs
echo "ulimit -l: " >> $benchmark_logs
ulimit -l >> $benchmark_logs
export MCR_CACHE_ROOT="/tmp/mcr_cache_root_$USER"
mkdir -p $MCR_CACHE_ROOT
# Run job
#ls -d ~/IlluminationModel/matlab_code/*.txt | egrep 'tp_grid[0-9]+?\.txt$' | xargs -I {} srun run_getIllumination ~/IlluminationModel/matlab_code/constants.txt flatlon {}
srun run_getIllumination ~/IlluminationModel/matlab_code/constants.txt flatlon ~/IlluminationModel/matlab_code/tp_grid1.txt
#srun run_getIllumination ~/IlluminationModel/matlab_code/constants.txt flatlon ~/IlluminationModel/matlab_code/tp_grid2.txt
#srun run_getIllumination ~/IlluminationModel/matlab_code/constants.txt flatlon ~/IlluminationModel/matlab_code/tp_grid3.txt
echo "Program is finished with exit code $? at: `date`"
sleep 3
date >> $benchmark_logs
echo "ulimit -l" >> $benchmark_logs
ulimit -l >> $benchmark_logs
mv job.$SLURM_JOB_ID.err $SLURM_JOB_ID/
mv job.$SLURM_JOB_ID.out $SLURM_JOB_ID/
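One idea I am considering (an untested sketch, not a confirmed fix): since each srun task is a separate MCR process, give every task its own MCR_CACHE_ROOT through a small wrapper keyed on SLURM_JOB_ID and SLURM_PROCID, so concurrent tasks do not race on one shared cache directory. The wrapper name `mcr_wrapper.sh` is my own invention:

```shell
#!/bin/bash
# mcr_wrapper.sh -- hypothetical wrapper: give each SLURM task a private
# MCR cache so parallel MCR instances do not race on shared cache files.
# The :-0 defaults only let the script run outside a SLURM job for testing.
job_id="${SLURM_JOB_ID:-0}"
task_id="${SLURM_PROCID:-0}"
user_name="${USER:-unknown}"

export MCR_CACHE_ROOT="/tmp/mcr_cache_${user_name}_${job_id}_${task_id}"
mkdir -p "$MCR_CACHE_ROOT"
echo "MCR_CACHE_ROOT=$MCR_CACHE_ROOT"

# Hand off to the compiled application with its original arguments.
exec "$@"
```

The job step would then become `srun ./mcr_wrapper.sh run_getIllumination ~/IlluminationModel/matlab_code/constants.txt flatlon ~/IlluminationModel/matlab_code/tp_grid1.txt`.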
