Vec-trick implementation (multiple times)
Dear all,
the question is related to Tensorproduct. Since that question was not answered as intended, I want to revisit it.
Introduction:
Suppose you have a matrix-vector multiplication where the matrix C, of size (np x mq), is constructed as the Kronecker product of a matrix A of size (n x m) and a matrix B of size (p x q). The vector is denoted v and has size (mq x 1); its matrix form is X of size (m x q), so that v = X(:).
In two dimensions this operation can be performed with O(npq+qnm) operations instead of O(mqnp) operations, see Wikipedia.
Expensive variant (flop count O(mqnp)): w = C*v = kron(B,A)*v
Cheap variant (flop count O(npq+qnm)): W = A*X*B.', with w = W(:), using the vec-trick identity kron(B,A)*X(:) = vec(A*X*B.')
Main question:
I want to perform many of these operations at once, e.g. 2,500,000 of them. Example: n=m=p=q=7 with A of size (7x7), B of size (7x7), and v of size (49x2500000).
In Tensorproduct I have implemented a MEX (C) version of the cheap variant, which is considerably slower than a MATLAB version of the expensive variant provided by Bruno Luong.
Is it possible to implement the cheap version in MATLAB without looping?
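For illustration, here is a minimal single-vector sketch of the two variants with the sizes from the example above (not part of the original question, just spelling out the identity behind the vec-trick):
n = 7; m = 7; p = 7; q = 7;
A = rand(n,m);
B = rand(p,q);
X = rand(m,q);    % matrix form of one input vector
v = X(:);         % vectorized input, size (m*q x 1)
% Expensive variant: build the full Kronecker product and multiply
C  = kron(B,A);   % size (n*p x m*q)
w1 = C*v;         % O(mqnp) flops
% Cheap variant (vec-trick): never form C
W  = A*X*B.';     % O(npq+qnm) flops
w2 = W(:);
max(abs(w1-w2))   % agreement up to rounding error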
5 Comments
Bruno Luong
on 23 Aug 2021
"Is it possible to implement the cheap version in Matlab without looping"
The method
B = matrix_xi.';
A = matrix_eta;
X = reshape(vector, size(A,2), size(B,1), []);
CX = pagemtimes(pagemtimes(A, X), B); % Reshape CX if needed
given in this thread is the cheap version in MATLAB without looping! Granted, it is no faster than the expensive method, but it's what you asked for, no?
ConvexHull
on 23 Aug 2021
Edited: ConvexHull
on 23 Aug 2021
You're completely right. I overlooked the "pagemtimes" version.
I am still surprised that there is no cheap version able to outperform the expensive vectorized MATLAB version (which is actually a BLAS version).
I mean, the cheap variant is of order O(2*7*7*7)=O(686), whereas the expensive variant is of order O(7*7*7*7)=O(2401), in terms of flops.
Thank you!
ConvexHull
on 23 Aug 2021
By the way, I tried to optimize the C code in Tensorproduct, which gave me a 20% speed-up, but that is still not useful.
Bruno Luong
on 23 Aug 2021
Because fewer flops does not necessarily mean faster. Memory access, cache, and thread management are just as important, and which method is fastest probably depends on n=m=p=q.
ConvexHull
on 23 Aug 2021
Edited: ConvexHull
on 23 Aug 2021
Yeah, that's definitely the case here.
The main problem is that, if you want to perform the vec-trick multiple times in a vectorized fashion, you have to reorder the data structure. After applying A*X you cannot perform a matrix-matrix multiplication with B directly (see the sketch below).
Stupid Memory access O.o!
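To illustrate the reordering problem, here is a small sketch with hypothetical sizes (not from the thread): after the first multiplication, each sample still needs B.' applied from the right, which a single 2-D matrix product cannot express without first transposing/reshaping the data (or going through a 3-D array and pagemtimes):
n = 2; m = 2; p = 2; q = 2; N = 3;       % hypothetical small sizes, N samples
A = rand(n,m); B = rand(p,q);
v = rand(m*q, N);                        % column k is vec(X_k)
Y = A*reshape(v, m, []);                 % = [A*X_1, A*X_2, A*X_3], size n x (q*N)
% Each block A*X_k still needs B.' from the right; a plain B*Y would hit the
% wrong dimension, so the data must be reordered first, or handled page-wise:
X  = reshape(v, m, q, N);
R1 = pagemtimes(pagemtimes(A, X), 'none', B, 'transpose');   % n x p x N
% Reference: per-sample loop
R2 = zeros(n, p, N);
for k = 1:N
R2(:,:,k) = A*reshape(v(:,k), m, q)*B.';
end
max(abs(R1(:)-R2(:)))                    % agreement up to rounding error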
Accepted Answer
ConvexHull
on 24 Aug 2021
Edited: ConvexHull
on 25 Aug 2021
Here is a pure intrinsic MATLAB version without loops, however with two transpose operations, and it is quite slow.
n=7;m=7;p=7;q=7;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,500000,5); % 2.5 million vectors of length 49
n = 5; % number of timing repetitions (overwrites the dimension n above)
C = kron(B,A);
tic
for i=1:n
v1 = reshape(C*reshape(v,49,[]),size(v)); % expensive variant
end
toc % Elapsed time is 0.456353 seconds
tic
for i=1:n
v2 = reshape(reshape(B*reshape((A*reshape(v,7,[])).',7,[]),7*2500000,[]).',7,[]); % cheap variant with two transposes
end
toc % Elapsed time is 3.879752 seconds
max(abs(v1(:)-v2(:)))
% 1.4211e-14
22 Comments
Bruno Luong
on 24 Aug 2021
According to my test, your code is slightly slower than the pagemtimes method provided in the other thread.
for i=1:n
X = reshape(v, size(A,2), size(B,1), []);
v3 = pagemtimes(pagemtimes(A, X), 'none', B, 'transpose');
end
ConvexHull
on 24 Aug 2021
That makes sense. I think pagemtimes will not perform much differently. However, it was only introduced recently (R2020b).
ConvexHull
on 24 Aug 2021
Edited: ConvexHull
on 24 Aug 2021
@Bruno: Do you know a faster way of transposing a matrix in MATLAB? :)
Bruno Luong
on 24 Aug 2021
Actually MATLAB is smarter than you think. When you do
C = AA * BB.';
MATLAB does not call transpose; it calls BLAS to do the matrix multiplication without explicit transposition (no memory copying or moving).
So if you wonder whether your code is inefficient because of .', the answer is NO.
ConvexHull
on 24 Aug 2021
Edited: ConvexHull
on 24 Aug 2021
I don't know what you mean.
- The ().' is by far the most expensive operation, no matter what is being done in the background.
- Reshape is essentially free.
- The small 7x7 matrix-matrix multiplication is cheaper than the big 49x49 one.
- By the way, ()' and ().' are nearly equally expensive.
n=7;m=7;p=7;q=7;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,500000,5);
n = 5; % number of timing repetitions
tic
for i=1:n
vv = reshape(v,7,[]); %#ok<*NASGU>
end
toc % Elapsed time is 0.000186 seconds
tic
for i=1:n
vvv = A*vv; % first small matrix-matrix multiplication
end
toc % Elapsed time is 0.350487 seconds
tic
for i=1:n
vvvv = (vvv).'; % first transpose: by far the most expensive step
end
toc % Elapsed time is 1.682334 seconds
tic
for i=1:n
vvvvv = reshape(vvvv,7,[]);
end
toc % Elapsed time is 0.000181 seconds
tic
for i=1:n
vvvvvvv = B*vvvvv; % second small matrix-matrix multiplication
end
toc % Elapsed time is 0.347840 seconds
tic
for i=1:n
vvvvvvvv = reshape(vvvvvvv,7*2500000,[]);
end
toc % Elapsed time is 0.000174 seconds
tic
for i=1:n
vvvvvvvvv = (vvvvvvvv).'; % second transpose: also expensive
end
toc % Elapsed time is 1.470868 seconds
tic
for i=1:n
vvvvvvvvvv = reshape(vvvvvvvvv,7,[]);
end
toc % Elapsed time is 0.000148 seconds
Bruno Luong
on 24 Aug 2021
Edited: Bruno Luong
on 24 Aug 2021
No,
CC = BB.'
alone takes time. However, with
CC = AA*BB.';
no transposition happens in the background.
As shown here (timings obtained by running the code on a TMW server, R2021a):
AA = rand(49);
BB = rand(49,500000*5);
BBt = BB.';
tic
CC = AA*BB;
toc
Elapsed time is 0.631366 seconds.
tic
CC = AA*BBt.'; % MATLAB does not perform transposition then multiplication;
% it does both via a BLAS implementation
toc
Elapsed time is 0.631350 seconds.
tic
BBtt = BBt.';
CC = AA*BBtt;
toc
Elapsed time is 1.105448 seconds.
ConvexHull
on 24 Aug 2021
Edited: ConvexHull
on 24 Aug 2021
I understand your remarks. However, my operations are in the following order:
reshape(B*reshape((A*X).', ...), ...).'
i.e. the transpose is followed by a reshape before the next multiplication. I think the reshape between them is the problem. This makes it expensive.
Bruno Luong
on 24 Aug 2021
Oh I see, I misread your code; the transposition comes before the reshape.
In that case, yes, the transposition must be carried out explicitly by MATLAB. Sorry, but I don't know how one can accelerate the transposition.
ConvexHull
on 25 Aug 2021
Edited: ConvexHull
on 25 Aug 2021
Surprisingly, even two small multiplications by the 7x7 matrices are slower than one multiplication by the 49x49 matrix.
n=7;m=7;p=7;q=7;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,500000,5);
n = 5;
C = kron(B,A);
tic
for i=1:n
v1 = reshape(C*reshape(v,49,[]),size(v));
end
toc % Elapsed time is 0.456353 seconds
tic
for i=1:n
% Note: this times only the two small multiplications, without the reordering/transpose steps
v2 = reshape(A*reshape(v,7,[]),size(v));
v3 = reshape(B*reshape(v,7,[]),size(v));
end
toc % Elapsed time is 0.683820 seconds
ConvexHull
on 25 Aug 2021
Edited: ConvexHull
on 25 Aug 2021
According to https://github.com/MichielStock/Kronecker.jl the vec-trick becomes useful for larger sizes than n=m=p=q=100.
Bruno Luong
on 25 Aug 2021
According to my test, the "cheap" method is faster from size 27 onwards.
stab = 5:5:100;
t1 = zeros(size(stab));
t2 = zeros(size(stab));
t3 = zeros(size(stab));
for i = 1:length(stab)
fprintf('%d/%d\n', i, length(stab));
s = stab(i);
n=s;
m=s;
p=s;
q=s;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,100000);
tic
C = kron(B,A);
v1 = reshape(C*reshape(v,s*s,[]),size(v));
t1(i) = toc;
tic
X = reshape(v, size(A,2), size(B,1), []);
v3 = pagemtimes(pagemtimes(A, X), 'none', B, 'transpose');
t3(i) = toc;
end
close all
semilogy(stab, [t1; t3]');
legend('Expensive method', 'Cheap method using pagemtimes');
xlabel('s');
ylabel('time [sec]');
grid on;
ConvexHull
on 25 Aug 2021
Edited: ConvexHull
on 25 Aug 2021
Unfortunately, my problem is restricted to sizes of at most n=m=p=q=16, i.e. rather small.
ConvexHull
on 25 Aug 2021
Edited: ConvexHull
on 25 Aug 2021
With the intrinsic method it is nearly the same.
stab = 5:5:100;
t1 = zeros(size(stab));
t2 = zeros(size(stab));
t3 = zeros(size(stab));
for i = 1:length(stab)
fprintf('%d/%d\n', i, length(stab));
s = stab(i);
n=s;
m=s;
p=s;
q=s;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,1000);
tic
C = kron(B,A);
v1 = reshape(C*reshape(v,s*s,[]),size(v));
t1(i) = toc;
tic
v2 = reshape(reshape(B*reshape((A*reshape(v,s,[])).',s,[]),s*1000,[]).',s,[]);
t3(i) = toc;
end
close all
semilogy(stab, [t1; t3]');
legend('Expensive method', 'Cheap method using intrinsic operations');
xlabel('s');
ylabel('time [sec]');
grid on;
Bruno Luong
on 25 Aug 2021
Edited: Bruno Luong
on 25 Aug 2021
Your curves do not clearly cross and are not coherent with your finding for size = 7.
ConvexHull
on 25 Aug 2021
Edited: ConvexHull
on 25 Aug 2021
Could you run a test comparing pagemtimes and the intrinsic one? Currently, I have no pagemtimes available.
That would be great!
ConvexHull
on 25 Aug 2021
Edited: ConvexHull
on 25 Aug 2021
Yes, perhaps the vector size (=1000) was too small. Note that today I am using a different computer.
Bruno Luong
on 25 Aug 2021
Edited: Bruno Luong
on 25 Aug 2021
There is an mtimesx on the File Exchange that you can use for older MATLAB releases; it does a similar task as pagemtimes.
AFAIK pagemtimes is slightly faster.
Bruno Luong
on 25 Aug 2021
Results with 3 methods
stab = 5:5:100;
t1 = zeros(size(stab));
t2 = zeros(size(stab));
t3 = zeros(size(stab));
for i = 1:length(stab)
fprintf('%d/%d\n', i, length(stab));
s = stab(i);
n=s;
m=s;
p=s;
q=s;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,100000);
tic
C = kron(B,A);
v1 = reshape(C*reshape(v,s*s,[]),size(v));
t1(i) = toc;
tic
v2 = reshape(reshape(B*reshape((A*reshape(v,s,[])).',s,[]),s*1000,[]).',s,[]);
t2(i) = toc;
tic
X = reshape(v, size(A,2), size(B,1), []);
v3 = pagemtimes(pagemtimes(A, X), 'none', B, 'transpose');
t3(i) = toc;
end
close all
semilogy(stab, [t1; t2; t3]');
legend('Expensive method', 'Cheap method using transposition', 'Cheap method using pagemtimes');
xlabel('s');
ylabel('time [sec]');
grid on;
ConvexHull
on 25 Aug 2021
Thanks for the effort. There is a small typo in the line "v2 = ... s*1000 ...". But I don't think it would influence the results much.
Bruno Luong
on 26 Aug 2021
Added a benchmark with mtimesx.
Conclusion (a small dispatch sketch based on it follows the benchmark code below):
- For releases before R2020b, use the expensive method for s < 44 and mtimesx otherwise;
- For R2020b or later, use the expensive method for s < 27 and pagemtimes otherwise.
stab = 5:5:100;
t1 = zeros(size(stab));
t2 = zeros(size(stab));
t3 = zeros(size(stab));
t4 = zeros(size(stab));
for i = 1:length(stab)
fprintf('%d/%d\n', i, length(stab));
s = stab(i);
n=s;
m=s;
p=s;
q=s;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,100000);
tic
C = kron(B,A);
v1 = reshape(C*reshape(v,s*s,[]),size(v));
t1(i) = toc;
tic
v2 = reshape(reshape(B*reshape((A*reshape(v,s,[])).',s,[]),[],s).',s,[]);
t2(i) = toc;
tic
X = reshape(v, size(A,2), size(B,1), []);
v3 = pagemtimes(pagemtimes(A, X), 'none', B, 'transpose');
t3(i) = toc;
tic
X = reshape(v, size(A,2), size(B,1), []);
v4 = mtimesx(mtimesx(A, X), 'N', B, 'T');
t4(i) = toc;
end
close all
semilogy(stab, [t1; t2; t3; t4]');
legend('Expensive method', ...
'Cheap method using transposition', ...
'Cheap method using pagemtimes', ...
'Cheap method using mtimesx');
xlabel('s');
ylabel('time [sec]');
grid on;
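Based on the conclusion above, a hypothetical dispatch helper could look like the sketch below. The function name kronmult, the availability check, and the hard-coded thresholds are illustrative assumptions taken from this benchmark, not a tested implementation; mtimesx is the File Exchange function used above.
function W = kronmult(A, B, v)
% Hypothetical helper applying the conclusion above: use the expensive
% kron-based product for small sizes, pagemtimes / mtimesx otherwise.
% Assumes square matrices with n = m = p = q = s and v holding the vectors vec(X_k) as columns.
s = size(A, 1);
havePage  = exist('pagemtimes', 'builtin') > 0;   % true from R2020b on
threshold = 27;                                   % crossover observed with pagemtimes
if ~havePage
    threshold = 44;                               % crossover observed with mtimesx
end
if s < threshold
    C = kron(B, A);                               % expensive variant: fastest for small s
    W = reshape(C*reshape(v, size(C,2), []), size(v));
else
    X = reshape(v, size(A,2), size(B,1), []);     % cheap variant: page-wise A*X*B.'
    if havePage
        W = pagemtimes(pagemtimes(A, X), 'none', B, 'transpose');
    else
        W = mtimesx(mtimesx(A, X), 'N', B, 'T');  % File Exchange: mtimesx
    end
    W = reshape(W, size(v));
end
end
For the n=m=p=q=7 case from the question this simply falls back to the kron-based product.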
Stefano Cipolla
on 14 Sep 2023
Edited: Stefano Cipolla
on 14 Sep 2023
Hi there! May I ask if you are aware of an implementation of a function similar to "pagemtimes" that is able to work with at least one sparse input? Alternatively, do you see any easy workaround? More precisely, I need something like
pagemtimes(A, V)
where A is an n x n x n sparse real tensor and V is a dense real n x n matrix...
Bruno Luong
on 14 Sep 2023
Edited: Bruno Luong
on 14 Sep 2023
@Stefano Cipolla "sparse real tensor"
I'm not aware of such a native MATLAB class.
But you can put the pages of A as the diagonal blocks of an n^2 x n^2 sparse matrix:
SA = [A(:,:,1)    0     ...    0
         0     A(:,:,2) ...    0
        ...
         0        0     ... A(:,:,n)]
Do the same expansion for V (with the same block structure), then solve it.
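Since MATLAB has no native sparse 3-D arrays, here is a minimal sketch of that block-diagonal idea under the assumption that the pages of A are stored in a cell array of sparse matrices; V is stacked vertically instead of being expanded into blocks, which yields the same page-wise products (all names and sizes are illustrative):
n  = 4;
Ap = arrayfun(@(k) sprand(n, n, 0.3), 1:n, 'UniformOutput', false);  % pages of A as sparse matrices
V  = rand(n, n);                                  % dense n x n matrix
SA = blkdiag(Ap{:});                              % sparse n^2 x n^2 block-diagonal matrix
W  = SA*repmat(V, n, 1);                          % stacks [Ap{1}*V; Ap{2}*V; ...; Ap{n}*V]
R  = permute(reshape(W, n, n, n), [1 3 2]);       % R(:,:,k) == Ap{k}*V
% Check against a plain loop
R2 = zeros(n, n, n);
for k = 1:n
R2(:,:,k) = Ap{k}*V;
end
max(abs(R(:)-R2(:)))                              % zero up to rounding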
More Answers (0)