MATLAB Answers


r2011b remote file mirroring issues (non-shared file system with generic scheduler interface)

Asked by Nick on 7 Mar 2012
Hello,
I am using a local MATLAB client (R2011b) connected to our remote cluster, with Torque/Moab as the resource manager/scheduler. I can successfully submit and run jobs using the generic scheduler interface, which is great, especially after reading the helpful example files for PBS on a non-shared file system included in the toolbox.
However, when I submit a lot of jobs [hundreds], all with very small data sets (say, 10k each), the file mirroring seems to bog my MATLAB client down and keep it in a near-perpetual 'busy' state. It is also very, very slow, which I can't understand given the small amount of data and results. Maybe it has trouble mirroring at a high level with lots of files?
Essentially, I have a script that does some preprocessing and then goes through and submits all my jobs. When that's done, I close the MATLAB client (since the jobs will be running for a while!). When I reopen it, I set up the scheduler as usual:
jm = torque_scheduler;
(where "torque_scheduler" is a script I modified from the MATLAB example to return the interface), and then it immediately goes 'busy' and just sits there. I can see it updating files in my working folder by refreshing the explorer window and checking timestamps, but it is INCREDIBLY slow-- it will be stuck doing this for hours.
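For context, my torque_scheduler script is just a thin wrapper around the generic scheduler setup from the shipped non-shared PBS examples; a rough sketch is below (the host names and paths here are placeholders, not my real values, and the exact property set follows the R2011b example scripts):

```matlab
function sched = torque_scheduler()
% Generic scheduler pointing at the remote Torque cluster (non-shared
% file system), modeled on the shipped PBS integration examples.
sched = findResource('scheduler', 'type', 'generic');
set(sched, 'ClusterOsType', 'unix');
set(sched, 'HasSharedFilesystem', false);
set(sched, 'ClusterMatlabRoot', '/usr/local/MATLAB/R2011b');  % MATLAB root on the cluster
set(sched, 'DataLocation', 'C:\Users\me\Documents\MATLAB\Jobs');
% The example submit functions take the cluster host and remote data location:
set(sched, 'SubmitFcn', {@distributedSubmitFcn, 'hpc01.example.org', '/home/me/matlab_data'});
set(sched, 'ParallelSubmitFcn', {@parallelSubmitFcn, 'hpc01.example.org', '/home/me/matlab_data'});
set(sched, 'GetJobStateFcn', @getJobStateFcn);
set(sched, 'DestroyJobFcn', @destroyJobFcn);
end
```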
My question is-- is there some way that I can manually shut off file mirroring so that it doesn't get so slow, and then copy the data myself when the jobs are done, or wait until I do a 'findJob' command? When it hits this crazy busy state, it becomes unresponsive for hours. Also of note: when I submit these jobs, my MATLAB client starts consuming VERY large amounts of memory-- it was up to 8 GB at one point towards the end of my submission routine. What's up with that? Is that all the zipping/sending it's doing?
One thing that makes me wary of the file mirroring is that I keep getting errors like this every so often:
Error using distcomp.genericscheduler/pSubmitJobCommon (line 64)
Job submission did not occur because the user supplied SubmitFcn (parallelSubmitFcn) errored.
Error using parallel.cluster.RemoteClusterAccess/startMirrorForJob (line 364)
Failed to start mirror for job with ID 20.
Error in parallelSubmitFcn (line 132)
remoteConnection.startMirrorForJob(job);
Error in distcomp.genericscheduler/pSubmitJobCommon (line 48)
feval(submitFcn, scheduler, job, setprop, args{:});
Error in distcomp.genericscheduler/pSubmitParallelJob (line 24)
scheduler.pSubmitJobCommon( job, scheduler.ParallelSubmitFcn );
Error in distcomp.simpleparalleljob/submit (line 47)
scheduler.pSubmitParallelJob(job);
Error in getsubnetworks (line 44)
submit(job);
Caused by:
Error using parallel.cluster.RemoteClusterAccess/waitForChoreToFinishOrError (line 919)
which makes me think I should just turn it off. It seems great for small workloads or small amounts of data, but it gets very overwhelmed when you throw a lot of jobs at it. Any suggestions? Is there an easy way to turn this off and do it manually that anybody has had success with?
Thanks-- Nick
PS: Is there any way to add some kind of taskbar or progress bar so that I can see what MATLAB is doing in the background during its massive 'busy' binge?


7 Answers

Answer by Konrad Malkowski on 15 Mar 2012

Hi Nick,
Take a look at the getJobStateFcn.m for your scheduler.
Try commenting out the code inside of the if ... else block. This should disable mirroring for jobs that are currently running but not yet finished.
If that doesn't improve performance, try commenting out the whole if ... else ... end block, and then execute:
remoteConnection.doLastMirrorForJob(job);
% Store the fact that we have done the last mirror so we can shortcut in the future
data.HasDoneLastMirror = true;
scheduler.setJobSchedulerData(job, data);
on a job whose status you are interested in, once it completes.
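Once mirroring is disabled there, collecting a finished job manually might look roughly like this. This is only a sketch: getRemoteConnection is the helper used by the shipped integration scripts (its exact arguments vary by release), and the field names follow the example code.

```matlab
% Manual final mirror for one finished job (sketch).
sched = torque_scheduler;                        % your generic scheduler helper
job = findJob(sched, 'ID', 20);                  % the job you want to collect
data = sched.getJobSchedulerData(job);           % per-job data saved at submit time
remoteConnection = getRemoteConnection(sched);   % helper from the shipped examples
if ~data.HasDoneLastMirror
    remoteConnection.doLastMirrorForJob(job);    % one final copy of results back
    % Store the fact that we have done the last mirror so we can shortcut later
    data.HasDoneLastMirror = true;
    sched.setJobSchedulerData(job, data);
end
```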



Answer by Thomas on 7 Mar 2012

You might be seeing residual effects of the upgrade to R2011b.
If the command window throws out a number of old jobs here, it means something went wrong when you moved up: MATLAB is trying to find the metadata files of old jobs on the cluster and in your working directory and hence cannot complete the validation (this also keeps it from timing out, since it is still working). If so, try the following procedure:
1. Clear the local_scheduler_data folder (you can also just rename it): /Users/USERNAME/.matlab/local_scheduler_data/R2011a
2. Empty all the metadata files and job directories in the DataLocation on your desktop: Parallel > Manage Configurations > select your configuration, find the Data Location, and clear the folder where the job directories are stored.
3. Remove the files that MATLAB writes on the cluster for each job, i.e. Job#.lockstate, Job#.in.mat, Job#.out.mat, Job#.common.mat, Job#.jobout.mat, and Job#.state.mat.
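For step 2, a sketch of clearing the local DataLocation from within MATLAB (dataLoc is a placeholder for your own Data Location; only do this when no jobs are still running):

```matlab
dataLoc = 'C:\Users\me\Documents\MATLAB\Jobs';   % your Data Location
delete(fullfile(dataLoc, 'Job*.mat'));           % Job#.in.mat, Job#.out.mat, etc.
delete(fullfile(dataLoc, 'Job*.lockstate'));
d = dir(fullfile(dataLoc, 'Job*'));
for k = find([d.isdir])                          % each Job# subdirectory
    rmdir(fullfile(dataLoc, d(k).name), 's');
end
```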
Hope this helps.

  1 Comment

Thomas,
Thanks for the reply. However, I don't think this is the issue: there was no upgrade involved, these are all new jobs, and the data folder was started fresh.
I think the job mirroring is just very inefficient for large numbers of jobs, and it would probably be easier if I turned it off, if there is an easy way to do so.
I will try this fix-- hopefully it helps some. Though I don't think I want to delete the local_scheduler_data, because I still have jobs running.
Thanks--
Nick



Answer by Konrad Malkowski on 12 Mar 2012

Hi Nick,
Does the issue occur when you leave your MATLAB on, while the jobs are running on the cluster?
How many jobs do you have running at a time, and how many jobs are on your scheduler in finished state?
What do you mean by 10k data size? Could you provide a bit more detail?
Have you tried running a single job with multiple tasks, instead of running multiple single task jobs? This should reduce the file system load by at least a factor of 2.
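For comparison, the one-job/many-tasks pattern looks roughly like this (myAnalysis and the chunk file names are placeholders for your own code and data):

```matlab
sched = torque_scheduler;                   % your generic scheduler helper
job = createJob(sched);                     % one job ...
for k = 1:500
    % ... with one task per data chunk, instead of one job per chunk
    createTask(job, @myAnalysis, 1, {sprintf('chunk_%04d.mat', k)});
end
submit(job);                                % a single submission and mirror set
waitForState(job, 'finished');
results = getAllOutputArguments(job);
```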

  1 Comment

Konrad,
Thanks for your reply. I am submitting a ton of jobs-- between 500 and 1000 at a time-- to modularize the huge data set into little chunks, so that each data file is roughly 10k as opposed to a few larger 250MB+ data files.
Is it really that ill-advised to disable the mirroring? I can see how it would be nice when running a few jobs here and there, but for massive batch-job environments where we are running simulations on large datasets, I can't get it to work efficiently, and it's really putting a dent in my work.
I will respond with more concrete numbers shortly.
Thanks--
Nick



Answer by Nick on 12 Mar 2012

Konrad:
Does the issue occur when you leave your MATLAB on, while the jobs are running on the cluster?
both
How many jobs do you have running at a time, and how many jobs are on your scheduler in finished state?
Many/many (100s/100s)
What do you mean by 10k data size? Could you provide a bit more detail?
Zip files are about 1 MB.
Have you tried running a single job with multiple tasks, instead of running multiple single task jobs? This should reduce the file system load by at least a factor of 2.
Cannot do it this way without significant changes to my algorithm code.



Answer by Nick on 16 Mar 2012

Konrad--
You're awesome. Thanks for helping with this. I am trying this and I think it will work. I had found all the spots where the mirror status is touched, but I wasn't quite sure exactly where to comment things out; this makes perfect sense, so thank you.
Also, I see all these useful
dctSchedulerMessage()
messages throughout the code that would be really nice to see, either in the main window or in another message window. Is there a way to set my logging, or to open an interactive logging window somewhere, so that I see these as they happen and have a better idea of exactly what is going on? Or are those meant for something else?
My apologies for my lack of MATLAB knowledge. I'm a C/C++ guy. ;-)
Thanks again Konrad. You're a lifesaver.
Nick



Answer by Konrad Malkowski on 19 Mar 2012

No problem :-) That is my background as well :-)
Regarding your question: to force these diagnostic messages to print in your MATLAB command window, use:
setSchedulerMessageHandler(@disp)
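For example, a timestamped variant (this assumes the handler is passed the already-formatted message as a string, as the @disp form implies):

```matlab
% Echo integration-script diagnostics with a timestamp prepended:
setSchedulerMessageHandler(@(msg) fprintf('%s  %s\n', datestr(now), msg))
```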



Answer by Nick on 19 Mar 2012

Konrad,
So, I still consistently get this error after submitting a number of jobs:
Error using distcomp.genericscheduler/pSubmitJobCommon (line 64)
Job submission did not occur because the user supplied SubmitFcn (parallelSubmitFcn) errored.
Error using parallel.cluster.RemoteClusterAccess/startMirrorForJob (line 364)
Failed to start mirror for job with ID 93.
Error in parallelSubmitFcn (line 132)
remoteConnection.startMirrorForJob(job);
Error in distcomp.genericscheduler/pSubmitJobCommon (line 48)
feval(submitFcn, scheduler, job, setprop, args{:});
Error in distcomp.genericscheduler/pSubmitParallelJob (line 24)
scheduler.pSubmitJobCommon( job, scheduler.ParallelSubmitFcn );
Error in distcomp.simpleparalleljob/submit (line 47)
scheduler.pSubmitParallelJob(job);
Error in getsubnetworks (line 45)
submit(job);
Caused by:
Error using parallel.cluster.RemoteClusterAccess/waitForChoreToFinishOrError (line 919)
The following errors occurred in the com.mathworks.toolbox.distcomp.clusteraccess.UploadFilesChore:
Could not send Job93 for job 93: Copy C:\Users\nlindberg\Documents\MATLAB\Jobs\Job93 to
hpc01.mkei.org:/home/nlindberg/matlab_data//Job93 : while copying
C:\Users\nlindberg\Documents\MATLAB\Jobs\Job93\Task25.state.mat to /home/nlindberg/matlab_data//Job93/Task25.state.mat: Failure
com.mathworks.toolbox.distcomp.remote.ProtocolFulfillmentException: Copy C:\Users\nlindberg\Documents\MATLAB\Jobs\Job93 to
hpc01.mkei.org:/home/jbazil/matlab_data//Job93 : while copying C:\Users\nlindberg\Documents\MATLAB\Jobs\Job93\Task25.state.mat to
/home/nlindberg/matlab_data//Job93/Task25.state.mat: Failure
Error in distcomp.genericscheduler/pSubmitParallelJob (line 24)
scheduler.pSubmitJobCommon( job, scheduler.ParallelSubmitFcn );
Error in distcomp.simpleparalleljob/submit (line 47)
scheduler.pSubmitParallelJob(job);
Error in getsubnetworks (line 45)
submit(job);
I can't find any information at all on "waitForChoreToFinishOrError", and though I've looked at the code, I can't figure out what the heck it does-- and it stops me from submitting more jobs. Does anybody have experience with this? Or is it time to lob a call in to MathWorks support?
I'm wondering if maybe it's running into a network file handle limit of some kind while trying to scp over so many files at once...
Thanks-- Nick

  1 Comment

I would recommend contacting support at this point. When you create the ticket, please include your current submission scripts and point the TS to this thread.
