Speed of the abs function

Question

1 vote

Hello everybody!

I had a question about how matlab computed the absolute value function.

Indeed, i was working on a school project and had to optimize a code that returned the smallest euclidian distance from the origin of a 3-D cloud of points. Simple enough, my first thought was to use the basic :

d = x.^2 + y.^2 + z.^2;
dmin = sqrt(min(d));

But then, thinking that squaring was a expensive operation, especially on a data set that can be bilions of points big, i tried to implement a "pre-check". My idea was the following: the 1-norm will be way cheaper to compute and so i will be able to compute a first "estimate" of the smallest distance very quickly. I will then use this to only compute the 2-norm of the points that are "close enough" to the origin on a way smaller data set.

so i did this:

dEstimate = abs(x) + abs(y) + abs(z);
dMinEstimate = min(dEstimate);
threshold = dMinEstimate*sqrt(3); %the closest point in the 2-norm can be at most sqrt(3) times further than the closest in the 1-norm.
potentialMin = find(dEstimate < threshold);
d = x(potentialMin).^2 + y(potentialMin).^2 + y(potentialMin).^2
dmin = sqrt(min(d));

But to my surprise, when using the profile viewer to compare the codes, this is what i got:

As you can see, the results are averaged over 100 calls in order to reduce processing noise. Keep in mind that the data is one milion points here.

How can the square operation time be close or even faster than the absolute value?

I would assume that the absolute value is computed by setting to 0 the sign bit in the IEEE 754 double convention, which is a very simple operation while the squaring is a much more complex process.

Is matlab this optimised? How can this be?

Thanks in advance for your answer !

Arthur

0 Comments
Show -2 older comments Hide -2 older comments

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

James Tursa on 5 Dec 2023

Edited: James Tursa on 6 Dec 2023

Open in MATLAB Online

2 votes

Although I applaud your thinking about this problem, this level of optimization is hard to achieve with a language such as MATLAB because there is a lot going on in the background that you are not considering beyond just the abs( ) vs squaring comparison. E.g.,

d = x.^2 + y.^2 + z.^2; % passes the x, y, z arrays through the CPU. Allocates d same size as x, y, and z
dmin = sqrt(min(d)); % passes d through the CPU. Then sqrt( ) of a scalar

So, four potentially large arrays get passed through the CPU along with some ops and a potentially large new allocation.

Compared to this:

dEstimate = abs(x) + abs(y) + abs(z); % passes the x, y, z arrays through the CPU. Passes abs( ) of them through CPU also. Allocates dEstimate same size as x, y, and z
dMinEstimate = min(dEstimate); % min( ) passes dEstimate through the CPU
threshold = dMinEstimate*sqrt(3); % Scalar ops
potentialMin = find(dEstimate < threshold); % passes dEstimate through the CPU again, allocates potentialMin
d = x(potentialMin).^2 + y(potentialMin).^2 + y(potentialMin).^2; % New allocations for the subscripting, passing through CPU, etc.
dmin = sqrt(min(d)); % passes d through the CPU, then scalar op

So, lots more going on in the background here with how much data is being fed through the CPU, new memory allocations, etc. I don't know how one would necessarily expect this to be either slower or faster than the original given all that is going on. And depending on how large the arrays are I'm not sure what cache hit rate you might expect either. So results may or may not scale well.

If you were coding in C/C++ then maybe you could look at something like this and try to optimize cache hits and minimize large temporary array allocations etc. For this particular problem, in C/C++ all you need is a simple loop with no large temporary array allocations needed at all. But for a language like MATLAB this type of optimization is not worth it IMO.

1 Comment
Show -1 older comments Hide -1 older comments

Walter Roberson on 5 Dec 2023

If you manage to find tables of the throughput and latency of different instructions for various generations of Intel architectures, then you will see that at various times:

FABS has been the same speed as FMUL
FABS has been the same speed as FMUL but had less latency
FABS has been the same speed and latency as FMUL but can only be started every 2 cycles where FMUL could be started every cycle
FABS has been the same speed and latency as FMUL but can only be started every 4 cycles where FMUL could be started every 2 cycles
FABS has been 1 cycle, latency 1, startable every cycle, but FMUL was 1 cycle, latency 5, 2 startable every cycle
FABS has been 1 cycle, latency 1, but FMUL was 2 cycles

... and other combinations.

So, FABS is anywhere from "always faster than FMUL" to "same speed as FMUL but you can't start as many FABS in a row" to "faster than FMUL but you can do more FMUL at the same time"... depending on exact architecture.

https://www.agner.org/optimize/instruction_tables.pdf

Sign in to comment.

Answer 2

Chunru on 5 Dec 2023

2 votes

" squaring was a expensive operation"

This is ture for VERY OLD cpu or some special purpose CPUs. In modern general purpose CPU, such as Intel one, the difference among square/multiplication/abs/addition is no longer significant (subjective here). For example, a CPU can perform 4 ADD/cycle or 4 MUL/cycle (floating point). The abs operation may not be much simpler than square.

If you consider the FP addition, abs, multiplier (of similar speed), with the overhead such memory access, loop control, the two approaches you used above may have ver similar performance. (It is also noted additional overhead for testing and comparison).

So in modern general purpose CPU, it may not worth to do the "optimization" mannually. (Some compilers may do optimisation better).

0 Comments
Show -2 older comments Hide -2 older comments

Sign in to comment.

Answer 3

Walter Roberson on 5 Dec 2023

Open in MATLAB Online

0 votes

We have to watch out for measurement error due to Execution Engine processing

format long g

N = 10000;

x = randn(N, 1); y = randn(N,1); z = randn(N,1);

M = 10;

t1a = zeros(1,M);

t1b = zeros(1,M);

t2a = zeros(1,M);

t2b = zeros(1,M);

for K = 1 : M; start = tic; d = x.^2 + y.^2 + z.^2; t1a(K) = toc(start); end

for K = 1 : M; start = tic; d = abs(x) + abs(y) + abs(z); t2a(K) = toc(start); end

for K = 1 : M; start = tic; d = x.^2 + y.^2 + z.^2; t1b(K) = toc(start); end

for K = 1 : M; start = tic; d = abs(x) + abs(y) + abs(z); t2b(K) = toc(start); end

t1a

t1a = 1×10

1.0e+00 * 0.000268 0.000168 1.8e-05 1.5e-05 1.8e-05 1.5e-05 6e-06 5e-06 5e-06 5e-06

t1b

t1b = 1×10

1.0e+00 * 0.00019 1.7e-05 1.3e-05 1.3e-05 1.8e-05 1.4e-05 5e-06 5e-06 5e-06 5e-06

t2a

t2a = 1×10

1.0e+00 * 0.000346 4.2e-05 0.000113 2.6e-05 9e-06 7e-06 7e-06 6e-06 6e-06 6e-06

t2b

t2b = 1×10

1.0e+00 * 0.000192 3.1e-05 2.5e-05 4.3e-05 8e-06 7e-06 6e-06 6e-06 6e-06 6e-06

plot([t1a; t1b; t2a; t2b].')

legend({'squares#1', 'squares#2', 'abs#1', 'abs#2'})

plot([t1b; t2b].')

legend({'squares#2', 'abs#2'})

For reasons currently unknown, the abs() version consistently is slower at execution 3 or 4 compared to 2, but by the end of the run the timings for the two versions are virtually identical.

0 Comments
Show -2 older comments Hide -2 older comments

Sign in to comment.

Speed of the abs function

0 Comments
Show -2 older comments Hide -2 older comments

Accepted Answer

1 Comment
Show -1 older comments Hide -1 older comments

More Answers (2)

0 Comments
Show -2 older comments Hide -2 older comments

0 Comments
Show -2 older comments Hide -2 older comments

Categories

Products

Release

Tags

Community Treasure Hunt

Speed of the abs function

0 Comments Show -2 older comments Hide -2 older comments

Accepted Answer

1 Comment Show -1 older comments Hide -1 older comments

More Answers (2)

0 Comments Show -2 older comments Hide -2 older comments

0 Comments Show -2 older comments Hide -2 older comments

Categories

Products

Release

Tags

See Also

Community Treasure Hunt

0 Comments
Show -2 older comments Hide -2 older comments

1 Comment
Show -1 older comments Hide -1 older comments

0 Comments
Show -2 older comments Hide -2 older comments

0 Comments
Show -2 older comments Hide -2 older comments