MATLAB Answers


unable to pass "Parallel pool test" on remote Parallel server

Asked by Mike VanHorn on 30 Jul 2019
Latest activity Answered by Mike VanHorn on 31 Jul 2019
I have set up MATLAB Parallel Server on our cluster. The MATLAB Job Scheduler is running on the headnode, and is able to talk to all of the workers on the compute nodes.
If I run MATLAB as a client on the headnode, I can pass all of the cluster profile validation tests. However, if I run the same tests on a different client machine (outside of the cluster), all of the tests pass except for the "Parallel pool test (parpool)". It fails after about 6 minutes with the following error:
-clip-
Error Report: Failed to initialize the interactive session.
Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowIfBadParallelJobStatus (line 789)
The interactive communicating job errored with the following message: Client unable to connect to worker. Check whether a firewall is blocking communication between the worker machine and the MATLAB client machine.
-clip-
I have the headnode set up to NAT the compute-node traffic out of the cluster, so I am not sure why this isn't working. What is different between this test and the others, such that this one fails when the others pass? It seems to me that in the previous tests the client talks only to the MJS, but in this case the workers need to talk directly to the client (according to the error message). That should be working: I can ssh from the worker machine to the client without issue. If the converse is true, and the client has to talk directly to the worker, I don't see how this would ever work in a cluster situation.
On another track, it may be that some ports are being blocked by filtering on our network switches. What ports do the workers need to be able to talk to the client?
Thank you for any help!


Release

R2019a

2 Answers

Answer by Jason Ross
on 30 Jul 2019

The required ports are documented here. Note that they are configurable in the mdce_def or mjs_def (.bat or .sh, depending on platform) files in <matlabroot>/toolbox/distcomp/. There is some more detail in that file, as well.
It may be useful to set the hostname, IP, or ports explicitly on the client host. To do that, use the pctconfig command in a fresh session of MATLAB before you attempt to run any other parallel commands. The client tries to "get this right", but in some cases you need to be explicit about the exact IP and/or hostname of the host to use.
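As a rough illustration of the pctconfig suggestion, the sketch below sets the client hostname and port range explicitly. The hostname, port range, and profile name are placeholders assumed for the example, not values from this thread; substitute an address the workers can actually reach and a range that is open on the client's firewall.

```matlab
% Run this in a fresh MATLAB session on the client, before any other
% parallel command (parpool, parcluster, batch, ...).

% Assumed placeholder values; replace with ones valid for your site.
clientHost  = 'client.example.com';  % hypothetical name or IP the workers can route to
clientPorts = [27370 27470];         % hypothetical [minport maxport] range open on the client

% Tell the client which hostname and ports to advertise for incoming
% connections.
pctconfig('hostname', clientHost, 'portrange', clientPorts);

% Calling pctconfig with no inputs returns the current settings as a
% struct, which is a quick way to confirm what the client will use.
cfg = pctconfig();
disp(cfg)

% Then retry the pool, for example with a hypothetical profile name:
% parpool('MyMJSProfile');
```

Whatever range is chosen here has to match what the client-side firewall and any intermediate network filtering allow; this only changes what the client advertises, not what the network permits.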



Answer by Mike VanHorn on 31 Jul 2019

I have seen the "Troubleshoot Common Problems" page you referenced, but I had used the formula on this page to open far fewer ports on the server. However, based on your suggestion, I have opened BASEPORT through BASEPORT+2000, inclusive. I'm hoping that helps with another problem, which is that nodes are crashing with errors about not being able to "communicate with the client", even when nothing is running.
Unfortunately, opening all of these extra ports did not fix the problem I posted about; the "Parallel pool test (parpool)" still fails from a client machine outside of the cluster.
I was poking around using lsof while running the validation tests on the headnode (where everything passes), and it appears to me that during the "Parallel pool test (parpool)", the client makes direct connections to the workers rather than going through the server. If that is the case, I don't see how this can ever work in a cluster situation, because the compute nodes have private IPs (192.168.*.*), and there's no way for the client to originate a connection to them.
