MATLAB Answers

Starting multiple workers from the command prompt

Asked by Daniel on 18 Apr 2013
I'm working on a computer that has ~30 cores, and I'm trying to use them all as part of a parallel job. I'm running MATLAB R2012b on SUSE Linux Enterprise Server 11. The Parallel Computing Toolbox has a limit of twelve workers on the local cluster, so I'm using MATLAB Distributed Computing Server (MDCS). I don't have any experience with it, though, so I may be approaching this the wrong way.
My default approach of "matlabpool open 30" results in the error "You requested a minimum of 30 workers, but only 12 workers are allowed with the Local cluster."
I found a documentation page that describes how to manually start workers from the command line. Using that page as a guide, I'm able to start the mdce service and a local job manager. I can also start one worker without any trouble.
Whenever I try to start a second worker, though, I receive the error:
"The mdce service on the host [myhostname] returned the following error: The MATLAB worker exited unexpectedly while starting. The cause of this problem is: ============================================================================ unable to create new native This is causing: Error in server thread; nested exception is: java.lang.OutOfMemoryError: unable to create new native thread ============================================================================ "
The commands that I'm using are:
mdce start -clean -mdcedef mdce_alt.sh
startjobmanager -v
startworker -name worker1 -jobmanager default_jobmanager -jobmanagerhost [myhostname] -v
startworker -name worker2 -jobmanager default_jobmanager -jobmanagerhost [myhostname] -v
(I'm using an alternative MDCE configuration file because I don't have root access to the directories set as the default values for PIDBASE, LOCKBASE, LOGBASE and CHECKPOINTBASE in mdce_def.sh. In the alternative file, I pointed those variables at a directory I have write access to.)
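For reference, the entries I changed in mdce_alt.sh look roughly like this (the directory shown here is a placeholder, not my actual path):
# base directories from mdce_def.sh, pointed at a location I can write to without root
PIDBASE=/home/someuser/mdce_files/pid
LOCKBASE=/home/someuser/mdce_files/lock
LOGBASE=/home/someuser/mdce_files/log
CHECKPOINTBASE=/home/someuser/mdce_files/checkpoint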
Can someone tell me what I'm doing wrong?

1 Answer

Answer by Jason Ross on 19 Apr 2013 (edited by Jason Ross on 19 Apr 2013)

  1. "open matlabpool 30" will attempt to open 30 workers using the cluster profile you have set as default -- most likely "local". You need to either switch the default cluster profile to use MDCS or specify the profile name specifically, e.g. "matlabpool open mymdcsprofile 30". If you don't have a profile set up yet, you'll need to get that done first. Depending on your release, this is either a fill in the blanks operation in the Parallel menu, or you can use the Discover Clusters tool in the Parallel menu to look for a running cluster.
  2. For starting several workers, your approach is generally OK -- I've used this exact procedure many times. I'm assuming you're on a *NIX machine, since you altered the .sh file. I'd look at the limits defined for the shell you're working in; I suspect "memoryuse" might be limited and that's what's cutting you off. Just type "limit" in csh/tcsh (or "ulimit -a" in bash) to see them. As an example, here's my output:
% limit
cputime unlimited
filesize unlimited
datasize unlimited
stacksize 8192 kbytes
coredumpsize 0 kbytes
memoryuse unlimited
vmemoryuse unlimited
descriptors 1024
memorylocked 64 kbytes
maxproc unlimited
maxlocks unlimited
maxsignal 16382
maxmessage 819200
maxnice 0
maxrtprio 0
maxrttime unlimited
You might be able to raise the limits in the shell you're in, or you might need to look up how to do it for whatever distro you are using (see the bash sketch at the end of this answer for the equivalent commands).
It might also be possible to open 30 shells and start one worker process in each ...
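For point 1, once a profile exists, switching to it looks something like this in MATLAB (the profile name below is just a placeholder for whatever you call yours):
parallel.clusterProfiles()                       % list the profiles MATLAB knows about
parallel.defaultClusterProfile('myMJSProfile');  % make the MDCS/MJS profile the default
matlabpool open 30                               % now targets the MJS profile, not 'local'
% ... or, leaving the default profile alone, name it explicitly:
% matlabpool open myMJSProfile 30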
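And for point 2, if your login shell is bash rather than csh/tcsh, the equivalent checks are roughly:
ulimit -a     # bash counterpart of the csh "limit" output above (current soft limits)
ulimit -Ha    # the hard limits, i.e. how far a non-root user can raise the soft limits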

  3 Comments

My "memoryuse" limit is unlimited. The other fields seem to match your example output fairly closely, except for my "maxproc" (300) and "vmemoryuse" (~8GB).
The Discover Clusters tool doesn't find any clusters, so I'm looking into starting one of my own. I'm having some trouble setting it up, though: the MATLAB Job Scheduler profile type references the Distributed Computing Server, but it requires that I specify the hostname of the machine where the job manager is running.
Right now I won't know that in advance. I'm running on a system that assigns computing nodes via a PBS queue; each job is assigned to a computing node, and each node has the ~30 cores I mentioned. The job I'm running is a serial loop, where the calculations at iteration i depend on the results from iteration i-1, but the most time-consuming parts of each iteration can be run in parallel.
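For context, the structure of the job is roughly this (an illustrative sketch only, not my actual code; the helper functions and counts are placeholders):
matlabpool open 30                      % or a named MDCS profile, once one exists
state = initialState();                 % placeholder for the real starting data
for i = 1:nIterations                   % serial outer loop: iteration i needs i-1
    partial = zeros(1, nTasks);
    parfor k = 1:nTasks                 % the expensive, independent pieces run in parallel
        partial(k) = expensiveKernel(state, k);    % placeholder for the real computation
    end
    state = combineResults(state, partial);        % placeholder for the serial update
end
matlabpool close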
Because cores are assigned in blocks of thirty, when my job is running on a node I'll have thirty cores available on the "local machine". I won't know which node the job will run on until it starts running. I'll have exclusive access to all the cores on that node, though, which is why I'm trying to put them all to use. Is it possible to define a generic MJS profile that does not have an explicit hostname specified in advance?
If you are running PBS and have an MDCS license, it would be easier to set up the PBS integration and run it via the built-in PBS profile. If people don't want it set up on all the machines in the cluster, they could (for example) set it up on one and make a queue that routes jobs there.
To answer your question directly, the MDCS cluster profile needs the machine name in it to work.
Hi Daniel,
'maxproc' is the important limit in this case. MJS potentially uses a large number of threads, and a limit of 300 is too small. I would recommend increasing this limit as far as you can; unlimited would be ideal, but if that's not possible then try at least 1000.
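For example, in bash the session-level change would look something like this (the number is only an illustration, and making it permanent usually needs an admin):
ulimit -u              # show the current max user processes ("maxproc" above)
ulimit -u 4096         # raise the soft limit for this session, if the hard limit allows it
limit maxproc 4096     # csh/tcsh equivalent of the line above
# To make it permanent, an admin would typically raise the "nproc" entries for
# your account in /etc/security/limits.conf.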
Tom
