Why does backslash behave differently when transposition is used in-line?

Hello, in the process of creating a problem for my students in a numerical methods class I found out that the backslash operation produces different results if you solve A^{T}x=b as "x=(A')\b" when compared to first defining "AT=A'" and then solving as "x=AT\b". An example is given below. Order of operations suggests that in the first case, A should be transposed and then the backslash operation performed, which is forced in the second case. One would expect the results to then be the same. Why is there a difference at all?
n = 15;
m = 2*n+1;
t = fliplr(cos(pi/(n)*(0:n))).';
A = [t.^(0:m);
t.^([0,0:m-1])*diag(max(0,0:m))];
b = (1./((0:m)'+1).*((1).^((0:m)'+1)-(-1).^(((0:m)'+1))));
x = (A')\b;
AT = A';
x2 = AT\b;
difference = norm(x-x2)
difference = 1.3350e-05

2 Comments

This is indepedent of the question about the difference between those two solutions, but you might find the cospi function useful. [Or perhaps not; it can give a different answer than cos(pi*...). Would you expect those two calls to return exactly down-to-the-last-bit (DTTLB) identical answers?]
n = 15;
V = (1/n)*(0:n);
t1 = cos(pi*V);
t2 = cospi(V);
norm(t1-t2)
ans = 3.2870e-16
n = 15;
m = 2*n+1;
t = fliplr(cos(pi/(n)*(0:n))).';
A = [t.^(0:m);
t.^([0,0:m-1])*diag(max(0,0:m))];
cond(A)
ans = 1.0901e+12
b = (1./((0:m)'+1).*((1).^((0:m)'+1)-(-1).^(((0:m)'+1))));
norm(b)
ans = 2.2073
Note that the error seen is exactly as would be expected, when a subtly different order of operations is performed. That is, we would expect to see floating point trash on the order of:
cond(A)*eps(norm(b))
ans = 4.8410e-04
which is quite roughly what you got.
Never presume that two such computations are performed in exactly the same sequence, and if they are not, then you can and often will see floating point trash creep in like this:
0.3 - 0.1 - 0.2
ans = -2.7756e-17
-0.1 - 0.2 + 0.3
ans = -5.5511e-17

Sign in to comment.

 Accepted Answer

This is an optimization that's present in both matrix multiplication and backslash: When transposition is applied as part of these operations (that is, on the same line), they are fused into just one operation for performance.
Here's an example for matrix multiplication:
% Make a random matrix and vector of size 10000 x 10000
n = 2e4;
A = randn(n);
x = randn(n, 1);
% Cost of matrix-vector multiplication:
tic;
y1 = A * x;
toc
Elapsed time is 0.084646 seconds.
% Cost of matrix-vector multiplication with transposed A:
tic;
y2 = A' * x;
toc
Elapsed time is 0.062393 seconds.
% Cost of transposing A
tic;
AT = A';
toc
Elapsed time is 0.372679 seconds.
The reason that transposing is so slow is that it requires a new n-by-n matrix to be allocated, and quite some memory movement. On the other hand, transposed matrix-vector multiplication can be achieved by simply changing the order of indexing:
% Compute y1 = A*x
y1 = zeros(n, 1);
for i=1:n
for j=1:n
y1(i) = y1(i) + A(i, j) * x(j);
end
end
% Compute y2 = A'*x
y2 = zeros(n, 1);
for i=1:n
for j=1:n
y2(i) = y2(i) + A(j, i) * x(j);
end
end
The same principle is also applied for backslash. In this case, usually the cost of factorizing A is much larger than the cost of transposing it, but there's still a noticeable advantage for the case of a triangular matrix:
% Make a random triangular matrix and vector of size 10000 x 10000
n = 5e3;
A = triu(randn(n));
x = randn(n, 1);
% Cost of backslash
tic;
y1 = A \ x;
toc
Elapsed time is 0.019095 seconds.
% Cost of backslash with transposed A:
tic;
y2 = A' \ x;
toc
Elapsed time is 0.019226 seconds.
% Cost of transposing A
tic;
AT = A';
toc
Elapsed time is 0.196998 seconds.

19 Comments

Thank you for the response. I understand the point that you are making about performance improvement (relative to computation time) of combining the operations; however, the particular case I am considering suffers in terms of accuracy of the solution. That is, the example I provided comes from a naive implementation of an approach to computing Gaussian quadrature weights by integrating a Hermite interpolating polynomial. We use roots of Legendre polynomials to construct t instead of the values in t in my example code. When using the roots of the Legendre polynomials, the computed quadrature weights (the first n+1 entries of either x or x2 in the code) lead to orders of magnitude more accurate approximations when transposition is done in a separate line (i.e., when the quadrature weights come from x2).
All of this said, I would expect the language to parse my expression with order of operations in mind. In particular, the parentheses around (A') are being ignored in your description. Further, if transposition is to be avoided for the sake of computational efficiency, then the combined operation should not produce a result that is orders of magnitude different (even with poor conditioning).
then the combined operation should not produce a result that is orders of magnitude different (even with poor conditioning).
Why not? The conditioning of your A matrix is pretty bad,
n = 15;
m = 2*n+1;
t = fliplr(cos(pi/(n)*(0:n))).';
A = [t.^(0:m);
t.^([0,0:m-1])*diag(max(0,0:m))];
cond(A)
ans = 1.0901e+12
Regardless, why do you consider the results to be "orders of magnitude different"? The percent difference is still pretty small,
b = (1./((0:m)'+1).*((1).^((0:m)'+1)-(-1).^(((0:m)'+1))));
x = (A')\b;
AT = A';
x2 = AT\b;
percentDifference = norm(x-x2)/norm(x)*100
percentDifference = 0.0022
I am not suggesting that mathematically the results should not be different when using different algorithms two solve a system of linear equations given poor conditioning. Instead, I am saying that when I request the solution of , the result should be more predictably similar whether or not I define AT directly first. That is, MATLAB's parser is detecting the transposition and then doing something here that I, and every one of my colleagues, would have never expected to solve this system of equations. This is not discussed anywhere in the documentation of "mldivide, \" (mldivide - Solve systems of linear equations Ax = B for x - MATLAB) as far as I am aware, which is incredibly unfortunate. Perhaps the flowcharts on the documentation page should be updated to reflect that this is done.
Further, in my response I say
"When using the roots of the Legendre polynomials, the computed quadrature weights (the first n+1 entries of either x or x2 in the code) lead to orders of magnitude more accurate approximations when transposition is done in a separate line (i.e., when the quadrature weights come from x2)."
This is entirely different than considering the absolute or relative difference in the entries of x or x2. I am considering the error after applying these weights to approximate the action of the linear operation they are designed for. Either way, a 0.0022 percent difference is too large for my purposes.
I am not suggesting that mathematically the results should not be different when using different algorithms two solve a system of linear equations given poor conditioning. Instead, I am saying that when I request the solution of , the result should be more predictably similar whether or not I define AT directly first.
It is not clear to me what the distinction here is between "not different" and "predictably similar". An ill-conditioned operation will amplify numerical noise by a factor of the condition number, which in this case is ~1e12. There is no predictable pattern that the amplified noise can be expected to have.
Do you mean the magnitude of the difference you expect is smaller? The precision limit of double floats is ~1e-16. With a condition number of 1e12, those deviations are predicted to get amplified to as much as 1e-4 in x, which is even greater than what is seen for your specific A,b..
Either way, a 0.0022 percent difference is too large for my purposes.
The only principled way to solve that is to improve the conditioning of your problem. Even if A.'\b was implemented in separate steps the way you expect, there are numerous other differences in the way the same operation in Matlab may be executed in different environments that can trigger similarly different results. When the dimensions of A and b get larger, the way the operations are multithreaded across your CPUs will be different, for example.
So far, for any one CPU configuration, I have not observed any cases where different results were produced depending on chance order of returning values from parallel workers.
I have observed different results being returned for the same code between different machines (with different numbers of CPUs with the same instruction set.)
"Perhaps the flowcharts on the documentation page should be updated to reflect that this is done."
You can make suggestions to TMW for documentation improvements here:
And/or by providing a rating at the bottom of the MLDIVIDE page, and writing a detailed comment.
Matt J's final comment is exactly right: the only principled fix is to improve the conditioning of the problem. If your application requires accuracy beyond what your condition number permits, no syntactic trick will reliably save you. The solution is reformulation: for Gaussian quadrature weights specifically, there are well-conditioned algorithms (e.g., the Golub-Welsch algorithm) that bypass this kind of linear system entirely.
A serious issue (IMO) in this thread that was raised by @Jonah A Reeger in this comment is that "the parentheses around (A') are being ignored." According to Operator Precedence - MATLAB & Simulink, parentheses have higher precedence than complex conjugate transpose. If the user explicitly uses parentheses isn't it reasonable for the user to expect that the term inside the parentheses be evaluated first? The linked doc page doesn't say anything about when parentheses may be ignored.
rng(100);
n = 2e4;
A = randn(n);
x = randn(n, 1);
y2 = (A') * x;
AT = A';
y3 = AT * x;
isequal(y2,y3)
ans = logical
0
norm(y2-y3)
ans = 9.5844e-11
If the user explicitly uses parentheses isn't it reasonable for the user to expect that the term inside the parentheses be evaluated first?
I would say so, yes. But there have always been a variety of weird behind the scenes stuff in Matlab. Take the computations below. Is it clear why zeros(1e4) should take essentially zero time? Or why the time to increment A in place should be more than twice the time to compute B?
testfun()
Elapsed time is 0.000074 seconds. Elapsed time is 0.074352 seconds. Elapsed time is 0.228188 seconds.
function testfun
N=1e4;
tic
A=zeros(N);
toc
tic;
B=A;
B=B+1;
toc
tic;
A=A+1;
toc
end
"But there have always been a variety of weird behind the scenes stuff in Matlab."
Be that as it may for the examples you've shown, the user can't honestly object to the end results.
There shouldn't be anything weird when it comes to operator precedence, which is at the core of any scientific, programming language.
the user can't honestly object to the end results.
If only the end results matter, then your point about operator precedence likewise doesn't matter. The fused implementation of (A')\y and AT=A'; AT\y carry out mathematically equivalent operation sequences and give the same end result up to numerical precision (though in the OP's case, the precision is lousy, because of ill-conditioning).
There shouldn't be anything weird when it comes to operator precedence, which is at the core of any scientific, programming language.
Some would say that the written sequence of commands is as much at the core as operator precedence. In my example, the commands aren't being executed in the chronology that my code requests. The execution of zeros(N) has been delayed and combined with later steps, against my wishes.
@Matt J: (A')\y and AT=A'; AT\y carry out mathematically equivalent operation sequences and give the same end result up to numerical precision
This is the point. Both results are numerically valid. If the difference matters, the question is flawed.
The prescribed (and documented) order of operations indicated by (A')*y and AT = A'; AT*y should yield the same results (not just the same results up to numerical precision). That's what order of operations (or operator precedence) means, at least to my understanding.
Not sure where you going with "written sequence of commands." I've never seen that concept discussed in the context of the few programming languages with which I'm familiar. Any links that describe that in the context of a languate implementation (as opposed to just best or recommended practice) would be appreciated.
Not sure where you going with "written sequence of commands."
I'm going the same place you are going with order of operations. The point of order of operations is that the programmer can be assured that certain intermediate computations will complete at prescribed points in the code and in a prescribed sequence. In (A')*y, you think that the result of A' should exist and be held in RAM in full before the matrix-vector multiplication is started. But that same logic applies to,
A=zeros(N); %(1)
B=A+1; %(2)
Surely, and for the same reasons, I should be able to expect that after (1) and before (2), the variable A exists and actually does contain an NxN matrix of zeros. It doesn't though.
The prescribed (and documented) order of operations indicated by (A')*y and AT = A'; AT*y should yield the same results
MathWorks does not document the JIT acceleration. In older releases you find the hint, that the order of calculations can be changed. Therefore the acceleration was disabled, when the profiler was called. This limitation has been removed, but it is reasonable, that a re-ordering is still applied, if it saves processing time.
Obviously, Matlab does not perform the transposition in (A')*y explicitely:
n = 4e3;
A = randn(n);
x = randn(n, 1);
format longe
timeit(@() F1(A, x), 1)
ans =
1.877561900000000e-02
timeit(@() F2(A, x), 1)
ans =
2.865461900000000e-02
function y = F1(A, x)
B = A + 1;
y = (B') * x;
end
function y = F2(A, x)
B = A' + 1;
y = B * x;
end
I understand the point "should yield the same results", but Matlab is a high-level language, which tries to accelerate the code in a smart way. Omitting the transpose operation is smart and if the programmer is aware of the underlying BLAS and LAPACK functions, this is even expected.
If a computation suffers from such optimizations due to numerical instabilities, it is flawed. The results cannot be reproduced on a different CPU or with a different number of available threads.
This is a standard rule in numerical maths and not a problem of Matlab.
A documentation of the JIT acceleration would be useful in my opinion. The MathWorks team explained repeatedly, that they will not publish this to avoid, that users optimize their code to the JIT. It is the goal of MathWorks, to do it the other way aroud: optimize the JIT to the code written by users.
"Omitting the transpose operation is smart and if the programmer is aware of the underlying BLAS and LAPACK functions, this is even expected."
The original question, and this sub-thread of comments (at least mine), has nothing to do with omitting the transpose operation. The issue is that Matlab is ignoring parentheses, which are a directive by the user as to what should be evaluated.
Consider the following code, where the parentheses in the expression for z2 are respected (assuming mtimes in z1 works left to right as stated in the documentation):
rng(100);
a = rand;
b = rand;
c = rand;
z1 = a*b*c; % (1)
z2 = a*(b*c); % (2)
isequal(z1,z2)
ans = logical
0
Based on the documentation, would you be o.k. with Matlab disregarding the parentheses (for whatever reason it deems justified) when evaluating the RHS of (2) so as to return the exact same result as in z1?
@Paul "Based on the documentation" - This is an important point. There is no documentation about the JIT acceleration for calling BLAS functions.
While a*b*c and a*(b*c) has no effect on the speed, the tranpose operation has. Some programming languages do not perform the transposition explicitly, but set a flag for the variable only. It seems like Matlab's JIT performs something similar.
The code optimization of C compilers shows equivalent effects. Example:
t = sum + y; c = (t - sum) - y; % Compensated summation by Kahan
can be optimized to:
c = ((sum + y) - sum) - y;
and in a next step to:
c = 0;
To control such problems, an exhaustive testing is required, which considers the optimization level.
The problem "(A')\b differes from AT=A'; AT\b" is equivalent to: "It is unclear, how the JIT optimization works". But we do see, how it works, even it is not documented. And if we see, what a tool does, is it worth to discuss, that we expect something else?
Paul
Paul on 5 May 2026 at 4:17
Edited: Paul on 5 May 2026 at 11:42
The primary issue that I'm focusing on, again, has nothing to do with how Matlab calls the underlying BLAS routine.
The issue is that Matlab is ignoring the parentheses, which I think is worthy of dicussion insofar as there is documentation that states that parentheses have the highest precedence and there is no documentation to indicate that there is ever an exception to that rule.
More generally, it's always better to have documentation that explains what a software tool does, rather than having to figure it out by experimenation.
In this regard, secondary to my primary concern, the documentation should explain that operations like '* will not explicitly form the transpose and then multiply. I agree that not forming the explicit transpose is the right thing to do, but it should be documented. What would be the harm for the documentation to explain what the software actually does?
If a user really wants the transpose formed expliclity before the multiplication, then the user should be able to use parentheses as in the Question and count on those parentheses to be respected, which is exactly why parentheses should be used, also as documented.
In the example with c, sum, t, and y (assuming those are floating point types) it's my understanding that a standard-compliant C compiler could implement that as
c = ((sum + y) - sum) - y;
but can't reduce that expression to c = 0, absent specific direction from the user via compiler flag (because floating point addition/subtraction is not associative). Is that not true?
Because Matlab (incorrectly, IMO) ignores the parentheses, one way to force the transpose without explicitly creating a new variable is to use the transpose function literally. Some may find it a bit disconcerting that y1 and y4 are not equivalent.
rng(100);
n = 2e4;
A = randn(n);
x = randn(n, 1);
y1 = A'*x;
y2 = (A') * x;
isequal(y1,y2)
ans = logical
1
AT = A';
y3 = AT * x;
isequal(y2,y3)
ans = logical
0
y4 = transpose(A)*x;
isequal(y4,y3)
ans = logical
1
isequal(y1,y4)
ans = logical
0

Sign in to comment.

More Answers (1)

The already given answers explain the effect alreayd. A summary:
Why does backslash behave differently when transposition is used in-line?
To improve performance.
Matlab calls BLAS and LAPACK functions, which consider the state of transposition of the input also.
Instable systems are highly sensitive to tiny variations of the start values and applied methods. This concerns physical systems like a double pendulum and ill-conditioned linear systems.
Either way, a 0.0022 percent difference is too large for my purposes.
The results produced by instable systems are random. If different computers or algorithms produce different results, there is no "better" or "worse" solution. The exact reproduction of a result of an ill-conditioned linear equation, you have to call the same BLAS/LAPACK method on the same CPU. Of course, you can do this with Matlab e.g. by using: https://www.mathworks.com/matlabcentral/fileexchange/16777-lapack . But the general problem remains, that the results will differ on other CPUs, with a different number of threads, with another BLAS/MKL version or used compiler.
You cannot draw meaningful conclusions from results of instable systems. This concerns rolling a dice or computing the sum with limited precision:
[1e16 + 1 - 1e16, 1e16 - 1e16 + 1]
ans = 1×2
0 1
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
The order of arguments matters. If the difference between the results of an instable system does not match the purpose, this is a problem of the purpose.

5 Comments

I will chime in one last time on this question, since I posed it. My main frustration is with the prescribed order of operations being ignored. My points about being "predictably similar" and that this difference is too large are exactly this. I would expect that when I communicate an order of operations, the code would execute the same way if these operations are performed in separate lines or in the same line. I am certainly aware of the magnification of errors due to the conditioning of this problem and, of course, expect the differences in results produced by different algorithms to depend on the precision of the floating point arithmetic as well. I just did not expect different algorithms to be applied to solve the problem in this case. And, since transposition as an operation is well conditioned, a 0.0022 percent difference when I expected the same operations to be performed in either case is too large.
@Jonah A Reeger: we are just volunteers on this forum, we cannot change how MATLAB works. If you want to give feedback or make an improvement request to TMW then you can do so here:
You can include a link to this thread in your improvement request.
Jan
Jan on 5 May 2026 at 0:32
Edited: Jan on 5 May 2026 at 0:34
And, since transposition as an operation is well conditioned, a 0.0022 percent difference when I expected the same operations to be performed in either case is too large.
While the transposition is stable, solving A\b for an ill-conditioned system is not.
I just did not expect different algorithms to be applied to solve the problem in this case.
This means, that your expectations do not match how Matlab computes the solution. With the knowledge, that Matlab applies a JIT-optimization inside code lines and that the computations are forwared to a BLAS function, the behavior is expected.
It is important to know, how a tool works. You do see how it works now. Then it is not useful to insist on expecting a different behavior.
dpb
dpb on 5 May 2026 at 14:13
Edited: dpb on 6 May 2026 at 20:32
Only being able to discover how it works by reverse engineering is in direct contradiction to the documentation is the issue/problem irrespective of the actual numerics. One certainly has justifcation from the documentation to expect that the parentheses will control order of execution.
It's not likely TMW will choose to change the behavior, but certainly the documentation should somehow make note that optimizations may cause intermediate results to not be explicitly computed despite expressed preference for such with parentheses although as noted, they have adamantly refused to reveal anything about JIT optimizations other than they are implemented. As noted by others MATLAB has always been quirky and only continues to become more so with time. I've commented often in the past that since there is no published formal language standard as with C or Fortran, one has no way to find out what is or is not actually a compliant behavior -- "it does what it does".
Matt J
Matt J on 6 May 2026 at 21:08
Edited: Matt J on 6 May 2026 at 21:29
I just did not expect different algorithms to be applied to solve the problem in this case.
They are not different algorithms. The solution algorithm as applied with AT\b or with (A')\b are mathematically equivalent. The operational differences that are giving you 0.0022 percent errors are not from the transposition or from differences in the mathematical recipe being used. They are from low level differences in the way the CPU is multithreading things. If you were to run your code on different CPUs, you would see differences of 0.0022 percent, even if you used a pre-transposed matrix AT\b in both cases.
I agree that disregarding parentheses is controversial, but it is not at the heart of the numerical discrepancies you are seeing.

Sign in to comment.

Asked:

on 17 Apr 2026

Commented:

on 19 May 2026 at 14:57

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!