Changing the type of the built-in function output variable

Is it possible to force Matlab to change the type of output variable returned from the built-in function (which I cant control) ?
For example if I have x=randi(255,1e7,1,'uint8'); and want to do [ux,~,id]=unique(x); the variable id will be of double type and occupy unnecessarily huge space before I trim it down with id=uint8(id); since I know max(id)<256. I would like to force id to uint8 type such that it does not create a spike in memory requirement during calling Matlab's unique, which might go over RAM and stall execution.
Any ideas if this is possible?

9 Comments

That looks like a request to Mathworks. However, since uints span 256 unique values, id will have to span 1-256, while uint8 doesn't reach 256 - this seems like problem due to 1-indexing arrays I guess...
id will equal x in that code, with probability very very high
Well I found in the meantime this: id=zeros(size(x),'uint8'); [ux,~,id(:)]=unique(x); which seems to work, another option is id=uint8([]); [ux,~,id(1:numel(x))]=unique(x);
Not sure if these are optimal ways, I was trying to minimize the memory requirement when ordinally encoding x. So ideally it would be [ux,~,x]=unique(x); but does not seem possible without auxilary variable.
id(:) as output calculates the entire output in double and then assigns it to the variable. No memory is saved.
Seriously, for a value not to appear at all even once in 1e7 random trials, the probability would have to be down in the range of 1e-12.
Walter this x was just an example, my point is how to ordinally encode large vector using unique's 3rd output, but in such a way that it does not go through the double type. I am dealing with very large vectors coming in 100s of millions, most often they have only few unique values so the first step for me is to reduce its memory foorprint by encoding them to the uint8 and working on them much easier and faster. But when unique returns the 3rd output in double it often fulls RAM and breaks everything - that is why it would be perfect for me if I could control in what type the uniques's 3rd output comes out.
As Bjorn has mentioned already, randi(255, 'uint8') produces values from 1 to 255. A real test with uint8 values must consider 256 elements including the 0. Then the id cannot be an UINT8, because it cannot carry the value 256.
I've fixed this in my answer: id is an UINT16. Still 20 times faster with a lookup table than with Matlab's UNIQUE.
Would there happen to be a prefix size by which you could be sure that you would have encountered all of the values that will be seen in the table? For example if there were four different values, would it always be the case that all four would be encounted by x(1:1000) ? Or is it likely that some of the values will not occur until late in x ?
After converting to a categorical it looks like the end result does not use doubles for the encoding. Does it use doubles under the hood in an interim step?
> whos
Name Size Bytes Class Attributes
cx 10000000x1 20029990 categorical
id1 10000000x1 80000000 double
id2 10000000x1 20000000 uint16
pool 1x256 512 uint16
ux1 256x1 256 uint8
ux2 256x1 256 uint8
x 10000000x1 10000000 uint8
id1, id2, pool, ux1, ux2 from running the first two parts of Jan's answer.

Sign in to comment.

 Accepted Answer

If you cannot edit the built-in function, write your own one:
function YourTest
x = randi([0, 255], 1e7, 1, 'uint8');
% For test without full data:
% x = randi([10, 253], 1e7, 1, 'uint8');
tic;
[ux1, ~, id1] = unique(x);
toc;
tic;
ux2 = unique(x);
pool = zeros(1, 256, 'uint16');
pool(uint16(ux2) + uint16(1)) = uint16(1):uint16(numel(ux2));
id2 = pool(uint16(x) + uint16(1)).';
toc;
tic;
[ux3, id3] = myUniqueUINT8(x);
toc;
isequal(ux1, ux2, ux3) && isequal(id1, id2, id3)
end
function [ux, id] = myUniqueUINT8(x)
% A fast lookup table:
x16 = (uint16(x) + uint16(1));
m = false(1, 256);
m(x16) = true;
if sum(m) == 256 % All elements found:
ux = (uint8(0):uint8(255)).';
id = x16;
else % Some elements are missing:
p = uint8(0):uint8(255);
ux = p(m).';
q = uint16(0):uint16(256);
q(m) = uint16(1):uint16(numel(ux));
id = q(x16).';
end
end
Timings on Matlab R2018b, Core i5 with 2 cores:
  • Elapsed time is 1.191019 seconds. UNIQUE
  • Elapsed time is 0.334867 seconds. UNIQUE with 1 output + get index
  • Elapsed time is 0.131350 seconds. Lookup table (not all elements found)
  • Elapsed time is 0.066704 seconds. Lookup table (all elements found)
This means that for specific input data it can be much more efficient to write an adjusted function than trying to let Matlab convert some output arrays.

9 Comments

Do you need more speed and do you have a C-compiler installed? Then a C-mex function is most likely a further improvement.
This is perfect solution for integer data types, and it sorts out also the problem when data comes in double but is in fact integer and can be quickly cast down to the (u)int types. But in case my data comes as flotaing point dobles but has only few unique values, the problem persist. I presume for such cases its probably best to collect unique values first and then look for more efficient ways of getting the index (3rd output) in the uint8 type. Doing [ux]=unique(x); and then [~,id]=ismember(x,ux); offers fairly good speedup but id comes out still as double. Here is my testing code for such case:
x=rand(10,1); x=x(randi(10,1e7,1));
tic; [ux,~,id]=unique(x); toc % Elapsed time is 1.293983 seconds
tic; ux=unique(x); [~,id]=ismember(x,ux); toc % Elapsed time is 0.295099 seconds
As far as I understand, you mention a new problem. The original question concerned:
x=randi(255, 1e7, 1, 'uint8')
Now you are asking for:
x = rand(10, 1); x = x(randi(10, 1e7, 1))
To find an optimal solution it is required to know exactly, which information are available, and which are not:
  1. Do you know the number of unique elements in advance?
  2. What is the range of the values of elements?
  3. How much RAM do you have and how large are the real data?
  4. Do you have a C compiler installed?
With such information I would not spend time for posting efficient solutions for a problem you do not have. Now it looks like all you need is a fast hashing methods, which projects a small set of floating point values to a set of UINT8 values. This can be solved by very cheap hashing algorithms.
... And, as I asked before, "Would there happen to be a prefix size by which you could be sure that you would have encountered all of the values that will be seen in the table?"
Gents, apologies for lack of sufficient assumptions, but that is because there are none in practise. Overall I am after generic ordinal encoder that gives me sorted uniques ux and ordinally encoded input (id) of the least memory occupant data type (uintx) for any incoming vector variable x without any prior knowledge about it other than whatever you can inspect yourself: i.e. check its size, type/class, test the range etc. You have provided excellent speed-up when the input x is of uint8 type. I can easily extend it into all integer types including double/single in case they actually are integer: all(ceil(x)==x).
For logical inputs this task is not applicable or trivial if enforced for consistency.
The remaining major types of input one could get are floating point doubles/singles or cell arrays of strings and for them I am using simply:
ux=unique(x); [~,id]=ismember(x,ux);
which although quite fast, produces id as a double, which on its own doubles the memory requirement and could stall large double data processing. So all I was asking is a friendly hint if there are any tricks from your extensive experience that you could recommend to speedup the computation time and/or reduce memory requirements for the remaining types of floating point/cellarrays of strings. I will explore the hashing option - many thanks for the suggestion.
Yes I do have the compiler, and mex is always the final speedup opportunity after all algorithmic options are exhausted.
For floating-point arrays you will also have to deal with tolerances. As far as I understand you cannot guarantee that two elements that should have the same value actually have (after lengthy calculations you might have to expect an epsilon-difference in floats/doubles) - unless they are produced by explicitly assigning the identical values to the different elements...
I can easily extend it into all integer types including double/single in case they actually are integer: all(ceil(x)==x).
typecast() the floating point values to uint64 or uint32 and you would then have reduced down to the integer case.
@dymitr ruta: You wrote:
You have provided excellent speed-up when the input x is of uint8 type. I can easily extend it into all integer types including double/single in case they actually are integer: all(ceil(x)==x).
Be careful. The shown lookup table works very efficiently for UINT8, because the number of possible elements is tiny. For UINT32 this will be extremely slow and for UINT64 your computer must crash.
A general method to catch different values and classes of inputs is the built-in unique already. It is easy to improve the speed of such a function, when you omit the general applicability. But then this faster method cannot be generalized without considering all special cases. If you do so, you will end up at the original Matlab function.
My function is not exhaustively tested and does not check the inputs. The Matlab typical behavior of considering the orientation of vectors is missing also. As soon as this is implemented, the speed gain will be smaller.

Sign in to comment.

More Answers (1)

And another version which is 5% faster than my other answer, if not all 256 possible values are found. If all values appear in the input, it can be 10 times faster:
function [ux, idx] = UniqueUINT8(x)
% INPUT: x: UINT8 vector
% OUTPUT: ux: sorted unique values of x
% idx: indices such that ux(idx)==x
% The values are used as indices in a lookup table. Shift values by 1 to let 0
% become a valid index. Then 255 becomes 256, such that UINT16 is required:
xx = (uint16(x) + uint16(1));
% Create a table as logical vector:
T = false(1, 256);
% Fill the table by values of xx:
% Short version: T(xx) = true;
% It is faster to do this in chunks. This can be stopped if tab is filled with
% all possible 256 elements already:
chunk = 1e5; % Chuck length
len = numel(xx); % Number of elements
ini = 1; % Initial index of chunk
fin = min(chunk, len); % final index of chunk
while ini < fin && ~all(T) % Loop in chunks
T(xx(ini:fin)) = true; % Fill table
ini = fin + 1; % Advance chunk limits:
fin = min(ini + chunk, len);
end
ux = (uint8(0):uint8(255)).';
if all(T) % Faster version for complete table:
idx = xx; % Shifted values are the indices already
else
ux = ux(T).'; % UINT8(find(T) - 1), the original unique values
if nargout > 1
LUT = zeros(1, 256, 'uint16');
LUT(T) = uint16(1):uint16(numel(ux)); % Look up table of indices
idx = LUT(xx).';
end
end
end

Categories

Products

Release

R2020b

Asked:

on 4 Feb 2021

Commented:

Jan
on 8 Feb 2021

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!