Optimize/speed up a big and slow matrix operation with addition and bsxfun?

19 views (last 30 days)
Hi. I have a fairly long line of code in my script that takes about 7 seconds to run. It was originally three separate calculations, but they are now done in one line that looks like this:
X = bsxfun(@times,reshape(bsxfun(@times,A,B),[1440,1,2251]),(1-C)./2)...
+bsxfun(@times,reshape(E,1440,1,2251),D)...
+bsxfun(@times,reshape(F,1440,1,[]),((1+C)/2));
Since I need to run it several tens of thousands of times, it is causing a fair amount of anxiety in my life at the moment. I don't know how to go about speeding it up further, or whether it is even possible. So I would really appreciate it if any of you optimization gurus here could give me some advice on whether and how I can make this faster.
The input variables are of size:
A = 1x2251
B = 1440x2251
C = 1x181
D = 1440x181
E = 1440x2251
F = 1440x2251
Thanks!

Accepted Answer

James Tursa on 20 Apr 2015
Edited: James Tursa on 20 Apr 2015
Assuming you have a C compiler available, here is a simple C mex routine to do the calculation. It could possibly be sped up even more with multi-threading (e.g., OpenMP), but I did not include that code below. Let me know if you have an OpenMP-compliant compiler, since that addition would be fairly simple.
As is (with no C-mex multi-threading) here is an example run on my machine:
Elapsed time is 6.655025 seconds. % M-code with bsxfun
Elapsed time is 1.700077 seconds. % C-mex code (not multi-threaded)
CAUTION: Code below is bare bones with no argument checking.
// X = bsxfun(@times,reshape(bsxfun(@times,A,B),[1440,1,2251]),(1-C)./2)...
// + bsxfun(@times,reshape(E,1440,1,2251),D)...
// + bsxfun(@times,reshape(F,1440,1,[]),((1+C)/2));
//
// The input variables are of size:
//
// A = 1 x 2251
// B = 1440 x 2251
// C = 1 x 181
// D = 1440 x 181
// E = 1440 x 2251
// F = 1440 x 2251
//
// Output:
//
// X = 1440 x 181 x 2251
//
// Calling sequence:
//
// X = this_file_name(A,B,C,D,E,F);
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mwSize i, j, k, m, n, p, km, ikm, jm;
    mwSize dims[3];
    double *A, *B, *C, *D, *E, *F, *X, *Cm, *Cp;
    double a, cp, cm;

    m = mxGetM(prhs[3]);
    n = mxGetN(prhs[3]);
    p = mxGetN(prhs[4]);
    A = mxGetPr(prhs[0]);
    B = mxGetPr(prhs[1]);
    C = mxGetPr(prhs[2]);
    D = mxGetPr(prhs[3]);
    E = mxGetPr(prhs[4]);
    F = mxGetPr(prhs[5]);
    dims[0] = m;
    dims[1] = n;
    dims[2] = p;
    plhs[0] = mxCreateNumericArray(3, dims, mxDOUBLE_CLASS, mxREAL);
    X = mxGetPr(plhs[0]);
    Cm = (double *) mxMalloc(n*sizeof(*Cm));
    Cp = (double *) mxMalloc(n*sizeof(*Cp));
    for( j=0; j<n; j++ ) {
        Cm[j] = (1.0 - C[j]) / 2.0;
        Cp[j] = (1.0 + C[j]) / 2.0;
    }
    for( k=0; k<p; k++ ) {
        a = A[k];
        km = k * m;
        for( j=0; j<n; j++ ) {
            jm = j * m;
            cm = Cm[j] * a;
            cp = Cp[j];
            for( i=0; i<m; i++ ) {
                ikm = i+km;
                *X++ = B[ikm] * cm + E[ikm] * D[i+jm] + F[ikm] * cp;
            }
        }
    }
    mxFree(Cp);
    mxFree(Cm);
}
  6 Comments
PetterS on 21 Apr 2015
Switching to singles brought it down to 1.3 seconds, so not quite half, but since I started at 7 seconds I'm really pleased with 1.3. I will run it continuously for around two weeks now, and after that I don't know if or when I will use it again, so you don't have to put any effort into argument checking.
Thank you so much for your help on this, it really saved me a lot of time!
James Tursa on 23 Apr 2015
Another potential speed-up I should mention is to replace mxCreateNumericArray with mxCreateUninitNumericArray.


More Answers (1)

John D'Errico on 19 Apr 2015
You are trying to do billions of multiplies here, resulting in an array that is of size
1440*181*2251
ans =
586700640
So roughly 600 million elements. Since they are double-precision numbers, the array requires 8 bytes per element, so the final result alone will require nearly 5 gigabytes of RAM to store.
Oh, and by the way, there are three parts to that computation, so your code must generate three such temporary arrays that are then summed together. The internal storage requirement might therefore be as large as 15 gigabytes of RAM, depending on how the expression is evaluated internally.
That it ran in 7 seconds is a testament to the speed of your system, not the slowness of the code.
If you want it to run faster, get more memory. A flash-based (SSD) drive would also let your CPU access virtual memory much faster when it does swap.
  5 Comments
John D'Errico on 20 Apr 2015
The funny thing is, much computation in the old days was done using single precision. Double precision is more robust, though: you can get away with things in double that might hurt you in single.
When you start swapping to disk, things suddenly get much slower. So my argument is that if the time only dropped by roughly 50%, you were not doing any serious amount of swapping with the doubles.
Eps tells you about the potential error in those numbers.
eps('double')
ans =
2.22044604925031e-16
eps('single')
ans =
1.192093e-07
Eps is the smallest number you can add to 1, and get a different result. So think of eps as the granularity for that precision, the least significant bit.
Much will depend on what you are doing with those numbers. Is there noise in them anyway, so that trash down around 1 part in 1e7 won't bother you? What are you doing with this result afterwards? On some computations that noise won't matter a bit.
schrodinbug on 10 Jul 2019
If you're creating a mex function, would using a library like Eigen provide value?

