How to speed up MEX function?

Question

Yifan Lin on 1 Nov 2022

https://au.mathworks.com/matlabcentral/answers/1841218-how-to-speed-up-mex-function

Commented: James Tursa on 7 Nov 2022

The following MEX code runs too slowly, but I don't know why or how to make it faster. Any help is greatly appreciated!

calculate_my_way.cpp

#include "mex.hpp"
#include "mexAdapter.hpp"
#include <cmath>
class MexFunction : public matlab::mex::Function {
public:
    void operator()(matlab::mex::ArgumentList outputs, matlab::mex::ArgumentList inputs) {
		
		matlab::data::TypedArray<double> var0 = inputs[0];
		matlab::data::TypedArray<double> var1 = inputs[1];
		matlab::data::TypedArray<double> var2 = inputs[2];
		matlab::data::TypedArray<double> var3 = inputs[3];
		
		auto var0Iter = var0.begin();
		auto var1Iter = var1.begin();
		auto var2Iter = var2.begin();
		auto var3Iter = var3.begin();
		
		const int numOfElements = var0.getNumberOfElements();
		double buffer = 0;
		
		for (int x = 0; x<numOfElements; x++)
		{
			buffer = std::sin(*var0Iter) + std::sin(*var1Iter) + std::sin(*var2Iter) + std::cos(*var3Iter);
			*var0Iter = buffer;
			buffer = std::sin(*var1Iter + *var2Iter) + std::cos(*var3Iter);
			*var1Iter = buffer;
			var0Iter++;
			var1Iter++;
			var2Iter++;
			var3Iter++;
			
		}
		outputs[0] = std::move(var0);
		outputs[1] = std::move(var1);
    }
};

It's just a simple calculation, but this code runs even slower than the native distance function, which performs a far more complicated calculation than a few sin and cos calls.

I'm using the compiler that came with Visual Studio 2017. Below is how I run mex, along with the compiler setup info.

mex -v calculate_my_way.cpp
...
   Compiler location: C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\
...
   OPTIMFLAGS : /O2 /Oy- /DNDEBUG

And this is how I measure the performance issue:

clear
size_test = 1e7;
var1 = zeros(size_test, 1);
var2 = zeros(size_test, 1);
var3 = zeros(size_test, 1);
var4 = zeros(size_test, 1);
cant_beat_me = @() distance(var1,var2,var3,var4);
elapsed_time = timeit(cant_beat_me);
mex_slow = @() calculate_my_way(var1,var2,var3,var4);
elapsed_time = timeit(mex_slow);

15 Comments

Bruno Luong on 2 Nov 2022

Edited: Bruno Luong on 2 Nov 2022

Obviously, the run time of evaluating cos/sin depends on the data.

Comparison between MATLAB and C++ with zero data:

clear
size_test = 1e7;
var1 = zeros(size_test, 1);
var2 = zeros(size_test, 1);
var3 = zeros(size_test, 1);
var4 = zeros(size_test, 1);
cant_beat_me = @() distance(var1,var2,var3,var4);
mex_slow = @() calculate_my_way(var1,var2,var3,var4);
MATLAB_elapsed_time = timeit(cant_beat_me) %  0.0274
Intel_elapsed_time = timeit(mex_slow) % 0.1803
function [out0,out1] = distance(var0, var1, var2, var3)
out0 = sin(var0) + sin(var1) + sin(var2) + cos(var3);
out1 = sin(var1 + var2) + cos(var3);
end

With random data:

clear
size_test = 1e7;
var1 = 2*pi*rand(size_test, 1);
var2 = 2*pi*rand(size_test, 1);
var3 = 2*pi*rand(size_test, 1);
var4 = 2*pi*rand(size_test, 1);
cant_beat_me = @() distance(var1,var2,var3,var4);
mex_slow = @() calculate_my_way(var1,var2,var3,var4);
MATLAB_elapsed_time = timeit(cant_beat_me) % 0.1560
Intel_elapsed_time = timeit(mex_slow) % 0.5101

The factor of

>> 0.5101/0.156
ans =
3.2699

could well be explained by multithreading.

Bruno Luong on 3 Nov 2022

Out of curiosity, I coded the same calculation in C. The time is 0.24 sec: twice as fast as C++ (0.5 sec) but about 60% slower than MATLAB (0.147 sec).

/* mex -g -R2018a calculate_C_way.c */
#include "mex.h"
#include <math.h>
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int i, n;
    double *var0Iter, *var1Iter, *var2Iter, *var3Iter, *out0Iter, *out1Iter;
    n    = mxGetNumberOfElements(prhs[0]);
    plhs[0] = mxCreateNumericMatrix(1, n, mxDOUBLE_CLASS, mxREAL);
    plhs[1] = mxCreateNumericMatrix(1, n, mxDOUBLE_CLASS, mxREAL);
    var0Iter = mxGetDoubles(prhs[0]);
    var1Iter = mxGetDoubles(prhs[1]);
    var2Iter = mxGetDoubles(prhs[2]);
    var3Iter = mxGetDoubles(prhs[3]);
    
    out0Iter = mxGetDoubles(plhs[0]);
    out1Iter = mxGetDoubles(plhs[1]);
    for (i = 0; i < n; i++) {
        *out0Iter = sin(*var0Iter) + sin(*var1Iter) + sin(*var2Iter) + cos(*var3Iter);
        *out1Iter = sin(*var1Iter + *var2Iter) + cos(*var3Iter);
        out0Iter++;
        out1Iter++;
        var0Iter++;
        var1Iter++;
        var2Iter++;
        var3Iter++;
    }
}

Yifan Lin on 3 Nov 2022

@Bruno Luong, thanks! I was also curious and wanted to give this a try, but you beat me to it! Yes, apparently the C++ API is slower than the C API for MATLAB; see the MATLAB Answers post "Is C++ MEX API significantly slower than the C MEX API?". I also tried OpenMP as you suggested, but since I was using VS2017, I couldn't use #pragma omp simd. I'll wait for my VS2019 install to finish and try again there with the C API.

Answer 1

Bruno Luong on 3 Nov 2022

https://au.mathworks.com/matlabcentral/answers/1841218-how-to-speed-up-mex-function#answer_1090703

Edited: Bruno Luong on 3 Nov 2022

Last experiment: timing with C and OpenMP, compiled with Intel Parallel Studio XE 2022.

CIntel_elapsed_time = 0.0574 [sec]

About 2.5 times faster than MATLAB (finally I beat MATLAB).

To get a fast MEX function: use the C API (not the C++ API), make it multithreaded, and choose a decent compiler.

/* Compile with intel compiler
mex -O COMPFLAGS="$COMPFLAGS /MD /Qopenmp" -R2018a calculate_C_way.c */
#include "mex.h"
#include <math.h>
 /* Set to 1 to Enable OPENMP
	to 0 to disable it */
#define OPENMP_FLAG		1
#if OPENMP_FLAG == 1
#include <omp.h>
#endif
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int i, n;
    double *var0Iter, *var1Iter, *var2Iter, *var3Iter, *out0Iter, *out1Iter;
    n    = mxGetNumberOfElements(prhs[0]);
    plhs[0] = mxCreateNumericMatrix(1, n, mxDOUBLE_CLASS, mxREAL);
    plhs[1] = mxCreateNumericMatrix(1, n, mxDOUBLE_CLASS, mxREAL);
    var0Iter = mxGetDoubles(prhs[0]);
    var1Iter = mxGetDoubles(prhs[1]);
    var2Iter = mxGetDoubles(prhs[2]);
    var3Iter = mxGetDoubles(prhs[3]);
    
    out0Iter = mxGetDoubles(plhs[0]);
    out1Iter = mxGetDoubles(plhs[1]);
#if OPENMP_FLAG==1
#pragma omp parallel for default(none) private(i) \
        schedule(static) \
		shared(n, out0Iter, out1Iter, var0Iter, var1Iter, var2Iter, var3Iter)
#endif
    for (i = 0; i < n; i++) {
        out0Iter[i] = sin(var0Iter[i]) + sin(var1Iter[i]) + sin(var2Iter[i]) + cos(var3Iter[i]);
        out1Iter[i] = sin(var1Iter[i] + var2Iter[i]) + cos(var3Iter[i]);
    }
}

2 Comments

Yifan Lin on 3 Nov 2022

@Bruno Luong Thank you very much!!!! This is exactly what I was looking for!

James Tursa on 7 Nov 2022

Typically, instead of this

#define OPENMP_FLAG		1
#if OPENMP_FLAG == 1
#include <omp.h>
#endif

you can use this:

#ifdef _OPENMP
#include <omp.h>
#endif

The _OPENMP macro is defined by the compiling environment when OpenMP is available.

Answer 2

Bruno Luong on 2 Nov 2022

https://au.mathworks.com/matlabcentral/answers/1841218-how-to-speed-up-mex-function#answer_1089328

Edited: Bruno Luong on 2 Nov 2022

I don't know C++ well, but I have quite a lot of practice with C MEX.

It looks like these statements just move a bunch of data:

outputs[0] = std::move(var0);
outputs[1] = std::move(var1);

Also, I wonder whether your inputs 0 and 1 would be changed by

*var0Iter = buffer;
...
*var1Iter = buffer;

after calling the mex, which is NOT allowed.

2 Comments

Yifan Lin on 2 Nov 2022

@Bruno Luong! Another one of your answers here helped me tremendously a few years back! Thank you!

I've checked the var0 and var1 values, and they did change. They then get moved to the output.

So after [a,b] = calculate_my_way(0,0,0,0); both a and b will be 1.

I suspect that the slowness may have one of these causes:

1. MSVC is not as good as the compiler MathWorks uses (probably Intel Parallel Studio).

2. Calling a C++ MEX function may carry some massive overhead that I don't know about.

3. I am just not doing something right in my C++ code.

Bruno Luong on 2 Nov 2022

"Another one of your answers here helped me tremendously a few years back! Thank you!"

Oh... really glad to read that...
