Breaking data from a large text file into groups

I have a text file that has groups of elements formatted like this:
{"1": [1,2,3,5,10,15,25,37], "2": [1,5,10,20], "3": [2000,2170], "4": [35,72,423], .... }
and so on. The data is all in one row in this format. I just need to determine the group with the highest number of elements and output that list only. However, these files can be fairly large (~260 MB), with a couple thousand groups and element values up to the millions. I'm struggling to find the best method to break the scan (probably at the double quotes), save each group (probably to a cell), and then move on to the next one.
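Incidentally, a row in this format is valid JSON, so when the whole file fits in memory it can be parsed directly; a minimal sketch, assuming MATLAB R2016b or later for jsondecode (the file name here is hypothetical):

```matlab
txt = fileread('data.txt');   % hypothetical file name
S = jsondecode(txt);          % keys "1","2",... become fields x1, x2, ...
f = fieldnames(S);
[~, i] = max(cellfun(@(name) numel(S.(name)), f));
biggest = S.(f{i});           % the longest list
```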
  3 Comments
Neil on 13 Jul 2020
Yes, there is one pair of curly brackets around the entire row in the actual file. To your second question, both: the first group is typically the largest and may have most of the elements present.
When the file was smaller, the easiest method was just to manually delete everything after this first group, as long as that was the case. I've been thinking it would probably be easiest for my code to count each group number and its number of elements, then go back to the largest group and rewrite that list in a new file. I'm just not sure of the best way to separate and count them.
Walter Roberson on 15 Jul 2020
data = cat(1, image_patches,labels);
That code is overwriting all of data each iteration.
It looks to me as if data will not be a vector, but I do not seem to be able to locate any hellopatches() function, so I cannot tell what shape it will be. As you are not doing imresize(), I also cannot be sure that all of the images are the same size, so I cannot be sure that data will be the same size for each iteration. Under the circumstances, you should consider saving into a cell array.
Note: please do not post the same query multiple times. I found at least 10 copies of your query :(


Accepted Answer

dpb on 14 Jul 2020
Edited: dpb on 15 Jul 2020
Well, the following is pretty easy as far as counting goes... how it works on a real file, speed-wise, and whether the record needs to be read piecemeal, I've no klew w/o a real file to test on.
s='{"1": [1,2,3,5,10,15,25,37], "2": [1,5,10,20], "3": [2000,2170], "4": [35,72,423]}';
s=erase(s,{'{','}'});
ss=split(s,{':',']'});
ss=ss(2:2:end);
>> ss
ss =
4×1 cell array
{' [1,2,3,5,10,15,25,37'}
{' [1,5,10,20' }
{' [2000,2170' }
{' [35,72,423' }
>>
>> [~,ixss]=max(cellfun(@(s) sum(s==','),ss))
ixss =
1
>>
Undoubtedly regular expressions could come to the rescue here as well, but I'm no guru...
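One possible regexp version of the same comma count (a sketch, not tested on a real file):

```matlab
% Sketch: capture each bracketed list with a regular expression,
% then pick the one containing the most commas.
s = '{"1": [1,2,3,5,10,15,25,37], "2": [1,5,10,20], "3": [2000,2170], "4": [35,72,423]}';
tok = regexp(s, '\[([^\]]*)\]', 'tokens');             % one token per group
ss  = cellfun(@(c) c{1}, tok, 'UniformOutput', false); % unwrap nested cells
[~, ixss] = max(cellfun(@(v) sum(v==','), ss));        % most commas = most elements
```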
The new(ish) string functions version could be:
ss=extractBetween(s,'[',']');
[~,ixss]=max(cellfun(@(s) sum(s==','),ss))
or, if you want the brackets too, then
ss=extractBetween(s,'[',']','Boundaries','inclusive');
[~,ixss]=max(cellfun(@(s) sum(s==','),ss))
  1 Comment
Neil on 15 Jul 2020
Thanks! This worked well for me, and was pretty fast (~30 sec for the 260 MB file, including writing a new file).


More Answers (1)

Stephen23 on 15 Jul 2020
Edited: Stephen23 on 15 Jul 2020
"I just need to determine the group with the highest number of elements and output that list only."
So for your example data this would be the first group?
What happens if multiple groups have the same number of elements?
I doubt that importing the entire file into MATLAB and doing string operations would be particularly efficient. I would do as much processing as possible with as little data as possible, which means operating at the level of file reading. For example, reading only one group at a time would likely be efficient, something like this:
outN = 0;
outV = [];
[fid,msg] = fopen('test.txt','rt');
assert(fid>=3,msg)
fscanf(fid,'{');
while ~feof(fid)
    tmpN = fscanf(fid,'"%f"%*[: ][');
    tmpV = fscanf(fid,'%f,',[1,Inf]);
    fscanf(fid,']%*[, }]');
    assert(~isempty(tmpN),'could not match number')
    assert(~isempty(tmpV),'could not match vector')
    if numel(tmpV)>numel(outV) % or whatever condition.
        outV = tmpV;
        outN = tmpN;
    end
end
fclose(fid);
If you could upload a small sample file (a few thousand characters) by clicking the paperclip button, then I could test this too. Instead I had to create my own test file (attached) to test my code with (I made the third group have the most elements).
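For reference, a comparable test file can be written with a couple of lines (the values here are made up; the third group is the largest):

```matlab
% Write a small test file in the same one-row format (made-up values).
fid = fopen('test.txt','wt');
fprintf(fid, '{"1": [1,2,3], "2": [4,5], "3": [6,7,8,9,10], "4": [11]}');
fclose(fid);
```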
  2 Comments
dpb on 15 Jul 2020
Indeed. Nothing in the above was intended to necessarily be fast.
Your approach is similar to what I figured would be necessary -- read a block of whatever size is feasible given memory constraints, find the last "]" in the block, and count the commas in each group.
If there's another "[" in the block after the last "]", then that's part of the next block to process.
Rinse and repeat...
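That block-wise scan might be sketched like so (hypothetical block size and file name; assumes extractBetween from R2016b+ and that each carried-over remainder fits in memory):

```matlab
% Sketch of the rinse-and-repeat block scan described above.
fid   = fopen('big.txt','r');
carry = '';                        % leftover text after the last complete group
bestN = -1; bestTxt = '';
while ~feof(fid)
    blk = [carry, fread(fid, 1e6, '*char').'];    % read ~1 MB at a time
    k   = find(blk == ']', 1, 'last');            % end of last complete group
    if isempty(k), carry = blk; continue, end
    groups = extractBetween(blk(1:k), '[', ']');  % complete groups only
    [n, i] = max(cellfun(@(g) sum(g==','), groups));
    if n > bestN, bestN = n; bestTxt = groups{i}; end
    carry = blk(k+1:end);                         % "[" after last "]" carries over
end
fclose(fid);
```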
Neil on 15 Jul 2020
Thank you both for the help! As I responded above, the string editing worked out pretty quickly, but I'll try this out if my file size increases any more.

