Starting multiple workers from the command prompt

I'm working on a computer that has ~30 cores, and I'm trying to use them all as part of a parallel job. I'm running Matlab R2012b on SUSE Linux Enterprise Server 11. The Parallel Computing Toolbox has a limit of twelve local workers, so I'm using the Distributed Computing Toolbox. I don't have any experience with this toolbox, though, so I may be approaching this the wrong way.
My default approach of "matlabpool open 30" results in an error that states: "You requested a minimum of 30 workers, but only 12 workers are allowed with the Local cluster."
I found this link that describes how to manually start workers from the command line. Using that page as a guide, I'm able to start the MDCE process and a local job manager. I can also start one worker without any trouble.
Whenever I try to start a second worker, though, I receive the error:
"The mdce service on the host [myhostname] returned the following error:
The MATLAB worker exited unexpectedly while starting. The cause of this problem is:
============================================================================
unable to create new native
This is causing:
Error in server thread; nested exception is:
java.lang.OutOfMemoryError: unable to create new native thread
============================================================================"
The commands that I'm using are:
mdce start -clean -mdcedef mdce_alt.sh
startjobmanager -v
startworker -name worker1 -jobmanager default_jobmanager -jobmanagerhost [myhostname] -v
startworker -name worker2 -jobmanager default_jobmanager -jobmanagerhost [myhostname] -v
(I'm using an alternative MDCE configuration file because I don't have root access to the folders that are set as the default values for PIDBASE, LOCKBASE, LOGBASE and CHECKPOINTBASE in mdce_def.sh. In the alternative config file, I changed the entries for those variables to a directory for which I have write-access.)
Can someone tell me what I'm doing wrong?

Answers (1)

Jason Ross on 19 Apr 2013
Edited: Jason Ross on 19 Apr 2013
  1. "matlabpool open 30" will attempt to open 30 workers using the cluster profile you have set as the default, most likely "local". You need to either switch the default cluster profile to one that uses MDCS or name the profile explicitly, e.g. "matlabpool open mymdcsprofile 30". If you don't have a profile set up yet, you'll need to create one first. Depending on your release, this is either a fill-in-the-blanks operation in the Parallel menu, or you can use the Discover Clusters tool in the Parallel menu to look for a running cluster.
  2. For opening several workers, your approach is generally OK; I've used this exact procedure many times. I'm assuming you are on a *NIX machine since you altered the .sh file. I'd look at the "limits" defined for the shell you are working in. I suspect that "memoryuse" might be limited and that's cutting you off. Just type "limit" (csh/tcsh) or "ulimit -a" (sh/bash) in your shell. As an example, here's my output:
% limit
cputime unlimited
filesize unlimited
datasize unlimited
stacksize 8192 kbytes
coredumpsize 0 kbytes
memoryuse unlimited
vmemoryuse unlimited
descriptors 1024
memorylocked 64 kbytes
maxproc unlimited
maxlocks unlimited
maxsignal 16382
maxmessage 819200
maxnice 0
maxrtprio 0
maxrttime unlimited
You might be able to raise the limits in the shell you are in, or you might need to look up how to do it for whatever distro you are using.
It might also be possible to open 30 shells and start one worker process in each ...
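If opening 30 shells by hand sounds tedious, the startworker line from the question can be wrapped in a loop. This is only a rough sketch: the function name start_workers and the DRY_RUN switch are made up for illustration and are not part of MDCS, while the startworker flags are copied from the commands in the question. By default it just prints the commands so they can be checked before anything is started.

```shell
#!/bin/sh
# Sketch: build (and optionally run) one startworker command per worker.
# start_workers and DRY_RUN are hypothetical helpers, not MDCS features.
# The startworker flags mirror the ones used in the question.

start_workers() {
    num_workers=$1   # how many workers to start, e.g. 30
    host=$2          # host running the job manager, e.g. [myhostname]
    i=1
    while [ "$i" -le "$num_workers" ]; do
        cmd="startworker -name worker$i -jobmanager default_jobmanager -jobmanagerhost $host -v"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Default: only show what would be run.
            echo "$cmd"
        else
            # Run the command and stop at the first failure.
            $cmd || return 1
        fi
        i=$((i + 1))
    done
}

# Example (prints 30 commands without running them):
#   start_workers 30 myhostname
# To actually start the workers, assuming startworker is on the PATH:
#   DRY_RUN=0; start_workers 30 myhostname
```

Stopping at the first failure is deliberate: if the shell limits are still too low, the loop halts at the worker that fails instead of repeating the same error 29 times.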
  3 Comments
Jason Ross on 19 Apr 2013
If you are running PBS and have an MDCS license, it would be easier to set up the PBS integration and run it via the built-in PBS profile. If people don't want it set up on all the machines in the cluster, they could (for example) set it up on one and make a queue that routes jobs there.
To answer your question directly, the MDCS cluster profile needs the machine name in it to work.
Thomas Ibbotson on 22 Apr 2013
Hi Daniel,
'maxproc' is the important limit in this case. MJS potentially uses a large number of threads, and a limit of 300 is too small. I would recommend increasing this limit as far as you can. Unlimited would be ideal, but if that's not possible, try at least 1000.
Tom
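For anyone on sh/bash rather than csh, the same limit is reported by "ulimit -u" (it is what csh calls maxproc). Below is a small sketch of a check against the 1000 floor suggested above. The function name check_maxproc and the MIN_PROCS variable are hypothetical names chosen for this example, and the -u option is assumed to be available (it is in bash; some other shells use a different flag, in which case the check prints "unknown").

```shell
#!/bin/sh
# Sketch: compare the per-user process limit with the suggested minimum.
# check_maxproc and MIN_PROCS are illustrative names, not MDCS settings.
MIN_PROCS=1000

check_maxproc() {
    case "$1" in
        unlimited)
            echo "ok" ;;
        ''|*[!0-9]*)
            # Empty or non-numeric value: the shell could not report the limit.
            echo "unknown" ;;
        *)
            if [ "$1" -ge "$MIN_PROCS" ]; then
                echo "ok"
            else
                echo "too low"
            fi ;;
    esac
}

# Check the current soft limit (bash: ulimit -u).
check_maxproc "$(ulimit -u 2>/dev/null)"

# If this prints "too low", the soft limit can usually be raised as far as
# the hard limit within the current shell, e.g. in bash:
#   ulimit -u "$(ulimit -H -u)"
# Going beyond the hard limit generally needs an administrator.
```

The limit applies to the shell and the processes started from it, so it should be checked in the same shell that runs mdce and startworker.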
