Submitting parallel compiled code to SLURM

14 views (last 30 days)
Leos Pohl on 22 Jul 2021
Answered: Raymond Norris on 22 Jul 2021
I have a compiled application that I am trying to run with SLURM across several nodes. The application itself is parallel, using a local parpool. When I submit it with sbatch and the SLURM script below, the parallel pool does not get created and I get errors and output (see the attached files). When I use an interactive shell and execute the commands on each node by hand, everything works as expected. Interestingly, even with only a single node and a single srun command in the SLURM script, I still get an error, so I am not sure which processes are racing to write to those files; I suspect that error is of a different nature.
When I remove the lines that set MCR_CACHE_ROOT, I get a slightly different error:
Error using parpool (line 113)
Invalid default value for property 'ParallelNode' in class 'parallel.internal.settings.ParallelSettingsTree':
No value is set for setting 'PCTVersionNumber' at the any level.
Error in run_getIllumination (line 28)
srun: error: ec64: task 2: Exited with exit code 255
Error using parpool (line 113)
Parallel pool failed to start with the following error.
Error in run_getIllumination (line 28)
Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line 676)
Failed to locate and destroy old interactive jobs.
Error using parallel.Cluster/findJob (line 74)
The job storage metadata file '/lustre/fs0/home/lpohl/.mcrCache9.5/run_ge0/local_cluster_jobs/R2018b/matlab_metadata.mat' does not exist or is corrupt. For assistance recovering job
data, contact MathWorks Support Team. Otherwise, delete all files in the JobStorageLocation and try again.
Error using parpool (line 113)
Parallel pool failed to start with the following error.
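Between attempts I have been clearing the stale job storage, following the error's own suggestion to delete all files in the JobStorageLocation (path copied from the error output; the .mcrCache9.5 suffix corresponds to R2018b and will differ for other releases):

```shell
#!/bin/bash
# Clear stale local-cluster job storage left over from failed runs, per the
# error's advice to "delete all files in the JobStorageLocation and try again".
# The path below is the one reported in the error message.
JOB_STORAGE="$HOME/.mcrCache9.5/run_ge0/local_cluster_jobs"
rm -rf "$JOB_STORAGE"
```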
Can someone help out?
#!/bin/bash
#SBATCH -A user
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=30
#SBATCH --time=00:20:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out
#SBATCH --nodelist=ec64
#Select File to run
export file="run_getIllumination"
export args="~/IlluminationModel/matlab_code/constants.txt"
#Select how logs get stored
export debug_logs="$SLURM_JOB_ID/job_$SLURM_JOB_ID.log"
export benchmark_logs="$SLURM_JOB_ID/job_$SLURM_JOB_ID.log"
#Load Modules
module load matlab/matlab-R2018b
# Enter Working Directory
# Create Log File (the log paths below live in a per-job directory)
mkdir -p "$SLURM_JOB_ID"
echo "JobID: $SLURM_JOB_ID" >> $debug_logs
echo "Running on $SLURM_NODELIST" >> $debug_logs
echo "Running on $SLURM_NNODES nodes." >> $debug_logs
echo "Running on $SLURM_NPROCS processors." >> $debug_logs
echo "Current working directory is `pwd`" >> $debug_logs
# Module debugging
module list >> $debug_logs
date >> $benchmark_logs
echo "ulimit -l: " >> $benchmark_logs
ulimit -l >> $benchmark_logs
export MCR_CACHE_ROOT="/tmp/mcr_cache_root_$USER"
mkdir -p $MCR_CACHE_ROOT
# Run job
#ls -d ~/IlluminationModel/matlab_code/*.txt | egrep 'tp_grid[0-9]+?\.txt$' | xargs -I {} srun run_getIllumination ~/IlluminationModel/matlab_code/constants.txt flatlon {}
srun run_getIllumination ~/IlluminationModel/matlab_code/constants.txt flatlon ~/IlluminationModel/matlab_code/tp_grid1.txt
#srun run_getIllumination ~/IlluminationModel/matlab_code/constants.txt flatlon ~/IlluminationModel/matlab_code/tp_grid2.txt
#srun run_getIllumination ~/IlluminationModel/matlab_code/constants.txt flatlon ~/IlluminationModel/matlab_code/tp_grid3.txt
echo "Program is finished with exit code $? at: `date`"
sleep 3
date >> $benchmark_logs
echo "ulimit -l" >> $benchmark_logs
ulimit -l >> $benchmark_logs
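If the tasks really are racing over a shared MCR cache, one workaround I can think of is a small wrapper that gives every srun task its own private MCR_CACHE_ROOT (a sketch only; the wrapper name and directory layout are my own invention):

```shell
#!/bin/bash
# mcr_task_wrapper.sh (hypothetical name): run one compiled-MATLAB task with a
# private MCR cache so concurrent srun tasks cannot collide on shared files.
# SLURM sets SLURM_JOB_ID and SLURM_PROCID per task; the defaults let the
# wrapper be tested outside SLURM.
export MCR_CACHE_ROOT="/tmp/mcr_cache_${USER}_${SLURM_JOB_ID:-local}_${SLURM_PROCID:-0}"
mkdir -p "$MCR_CACHE_ROOT"
# Hand off to the actual command, if one was given.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

It would replace the direct srun line, e.g. `srun ./mcr_task_wrapper.sh run_getIllumination ~/IlluminationModel/matlab_code/constants.txt flatlon ~/IlluminationModel/matlab_code/tp_grid1.txt`.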
