Finding longest subvector of equal elements in 5000000 length int-vector.

Question

Filip Berntsson on 23 Jan 2021

1
Link

Direct link to this question

https://au.mathworks.com/matlabcentral/answers/724663-finding-longest-subvector-of-equal-elements-in-5000000-length-int-vector

Commented: Image Analyst on 23 Jan 2021

Hi there!

Heads up: Technically this is a homework-problem, but I'm done and just want your thoughts on if the code could be faster/better.

The task is to find how many same-digit-sequences of different lengths that appear in the data. I.E. in data=[1,1,1,2,2,3,3,4,4] sequcens of length 3 appears once, sequences of length 2 appears thrice. So i want appearances=[0,3,1,0,...] (indices indicate length of sequence)

The data is typically a vector of length 5000000 and you don't know how long the longest sequences are (otherwise I'd probably use strfind and decimate the data). Further this is done in a context where the starting-index of the sequences might later become important, even though I don't extract them atm.

The code submitted below should only have to run over the data vector once to withdraw all neccesary information. But still it's at least two times slower than what I've got reason to belive that it could be.

Do you have any ideas on how to solve this task in a quicker/cleaner fashion? Would love to see your take on it. /FB

%find longest similar sequence in data
%data size typically 1x5000000 (first five million digits in pi or e for example)
data=[1,1,1,2,1,1,3,3,2,4,4,4,5,6,7,8,8,9,9,9,8,8,7,7,5,5,4,6,6,2,5,9,9,9,9,7,7,7,7,8,8,4,4,5,5,6,6,4,4,5,5,1,1,2,1,3,3,1];
appearances=zeros(10,1);  %count how many times sequences of specific lengths 1:10 appear.
%assumes that we wont find similar sequences longer than 10 such as 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 in data
i=2;       %counter s.t while approximate a for-loop. Not using for since can't manually change i within for-loop
%starting at 2 since looking at 2 numbers at a time to compare if equal or not
while i<length(data)
    isEqual=(data(i-1)==data(i)); %true if in a sequence of equal digits. 1,1 gives true, 1,2 gives false etc.
    
    switch isEqual
        case true
            %save where sequence started and keep incrementing while checking if equal
            start_i=i-1;
            while isEqual&&(i<length(data))
                i=i+1;
                isEqual=(data(i-1)==data(i));
            end            
            %calculate length of sequence and record as new appearance of that length
           
            end_i=i-1;
            sequenceLength=(end_i-start_i+1);
            appearances(sequenceLength)=appearances(sequenceLength)+1;
            
            
        case false
            %keep incrementing and look for sequence
            while~isEqual&&(i<length(data))
                i=i+1;
                isEqual=(data(i-1)==data(i));
            end
            
    end  
end
%to find out sequences of length one subtract the sum of all other
%sequences from length of data
appearances(1)=length(data)-sum(appearances);

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Sign in to answer this question.

Answer 1

Matt J on 23 Jan 2021

0
Link

Direct link to this answer

https://au.mathworks.com/matlabcentral/answers/724663-finding-longest-subvector-of-equal-elements-in-5000000-length-int-vector#answer_604283

Edited: Matt J on 23 Jan 2021

Open in MATLAB Online

data=[1,1,1,2,2,3,3,4,4];
sequenceLengths = diff( find( diff([inf,data,inf]) ) ) ;
result = histcounts(sequenceLengths, 1:max(sequenceLengths)+1)
result = 1×3
     0     3     1

2 Comments
Show NoneHide None

Filip Berntsson on 23 Jan 2021

Super neat! Deals excellently with the large dataset. Now I just have to sit down and try and understand just why it works...

Thanks!

Filip Berntsson on 23 Jan 2021

Now I get it. Clever!

Sign in to comment.

Answer 2

Image Analyst on 23 Jan 2021

0
Link

Direct link to this answer

https://au.mathworks.com/matlabcentral/answers/724663-finding-longest-subvector-of-equal-elements-in-5000000-length-int-vector#answer_604308

Open in MATLAB Online

Here's how I thought of to do it, using regionprops to count the length of each consequtive run of numbers:

data = [1,1,1,2,1,1,3,3,2,4,4,4,5,6,7,8,8,9,9,9,8,8,7,7,5,5,4,6,6,2,5,9,9,9,9,7,7,7,7,8,8,4,4,5,5,6,6,4,4,5,5,1,1,2,1,3,3,1]
uniqueData = unique(data)
% Preallocate an array for what we think the longest possible run length would be, say a 1000 or so
appearances = zeros(1000, 2);  %count how many times sequences of specific lengths 1:10 appear.
% Put lengths into column 2 for convenience for when we look at our output matrix.
appearances(:, 1) = [1 : size(appearances, 1)]';
for k = 1 : length(uniqueData)
	% Get this one unique number.
	thisValue = uniqueData(k);
	% Find regions with k
	props = regionprops(data == thisValue, 'Area');
	% Increment the count of each length of data.
	for k2 = 1 : length(props)
		appearances(props(k2).Area, 2) = appearances(props(k2).Area, 2) + 1;
	end
end
appearances % Show in command window.
% Crop off bins that has no runs of that length.
lastBin = find(appearances(:, 2) > 0, 1, 'last');
appearances = appearances(1:lastBin, :) % Show in command window.

You'll see the final result in the command window:

appearances =

1 11

2 15

3 3

4 2

So there were 11 sequences of length 1, 15 sequences of length 2, 3 sequences of length 3, and 2 sequences of length 4. Same result as Matt's (if you use the same vector) - it's just a different approach.

2 Comments
Show NoneHide None

Filip Berntsson on 23 Jan 2021

Unfortunately this code seems to take about 20 times longer than my initial submission. Am I missing something or does the toolbox need some sort of setup after installation for this to function properly?

Yours took about 19 seconds when dealing with the 5000000 length data set (whereas mine was about 0.9s, and Matt Js was about 0.13s). It's primarily regionprops that drains time (approx 80% of time).

Image Analyst on 23 Jan 2021

Like you, I also thought Matt's answer was somewhat cryptic so I gave a more intuitive, easy to understand way. But if mine takes longer and speed is what you're after, then use Matt's approach.

Sign in to comment.

Finding longest subvector of equal elements in 5000000 length int-vector.

0 Comments
Show -2 older commentsHide -2 older comments

Accepted Answer

2 Comments
Show NoneHide None

More Answers (1)

2 Comments
Show NoneHide None

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

Finding longest subvector of equal elements in 5000000 length int-vector.

0 Comments Show -2 older commentsHide -2 older comments

Accepted Answer

2 Comments Show NoneHide None

More Answers (1)

2 Comments Show NoneHide None

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

0 Comments
Show -2 older commentsHide -2 older comments

2 Comments
Show NoneHide None

2 Comments
Show NoneHide None