I am trying to accelerate my code, which involves a lot of FFTs, by putting it inside a parfor loop. All calculations in the loop are completely independent, and I don't see any error messages about variables that cannot be classified. When I use 12 workers on my 12-core machine to run this parfor loop, the first 15-20 runs go very fast: each run of the parfor loop takes about 0.8 s. But then the calculation time begins to vary from run to run, anywhere from 0.8 s to 19 s, although these runs are completely equivalent in terms of computational load.
I am aware that FFT is multi-threaded and runs on many cores, so the communication overhead might interfere with the parallelization in parfor. But then it is unclear to me why the first 20 runs are so fast. I am using sliced arrays for both input and output, and the output arrays are only changed inside the parfor loop and not from run to run, so no data accumulates that would have to be communicated between workers.
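For illustration, here is a minimal sketch of the loop structure I am describing. The array sizes, the variable names (`x`, `y`, `tRun`) and the number of runs are placeholders made up for this post, not my actual code or data.

```matlab
% Placeholder dimensions (assumptions for illustration only).
nSignals = 12;      % independent signals, one per parfor iteration
nPoints  = 2^16;    % samples per signal
nRuns    = 5;       % how many times the whole parfor loop is repeated

x = randn(nPoints, nSignals);           % sliced input: iteration k reads only column k
y = complex(zeros(nPoints, nSignals));  % sliced output: iteration k writes only column k
tRun = zeros(1, nRuns);                 % wall-clock time of each run

for r = 1:nRuns
    tStart = tic;
    parfor k = 1:nSignals
        y(:, k) = fft(x(:, k));         % independent FFT per column
    end
    tRun(r) = toc(tStart);
end

fprintf('parfor time per run: min %.3f s, max %.3f s\n', min(tRun), max(tRun));
```

Recording the time of every run in `tRun` is what shows the spread between fast and slow runs.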
If I use a standard for-loop instead of parfor, the calculation time stays very stable at around 4.5 s, which is much longer than the initial 0.8 s for parfor. Task Manager shows that in this case all cores are pretty busy, with about 97% total CPU load. When I use parfor, the load is only about 50-60%, and it is still faster.
Any hints would be really appreciated! Thanks