Breaking data from a large text file into groups

I have a text file that has groups of elements formatted like this:
{"1": [1,2,3,5,10,15,25,37], "2": [1,5,10,20], "3": [2000,2170], "4": [35,72,423], .... }
and so on. The data is all in one row in this format. I just need to determine the group with the highest number of elements and output that list only. However, these files can be fairly large (~260 MB), with a couple thousand groups and element values up to the millions. I'm struggling to find the best method to break the scan (probably at the double quotes), save each group (probably to a cell), and then move on to the next one.
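Incidentally, a row in this format is valid JSON, so when the whole file fits in memory it can be parsed directly; a minimal sketch, assuming MATLAB R2016b or later for jsondecode (the file name here is hypothetical):

```matlab
txt = fileread('data.txt');   % hypothetical file name
S = jsondecode(txt);          % keys "1","2",... become fields x1, x2, ...
f = fieldnames(S);
[~, i] = max(cellfun(@(name) numel(S.(name)), f));
biggest = S.(f{i});           % the longest list
```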
  3 Comments
Neil on 13 Jul 2020
Yes, there is one pair of curly brackets around the entire row in the actual file. To your second question, both: the first group is typically the largest and may have most of the elements present.
When the file was smaller, the easiest method was just to manually delete everything after this first group, as long as that was the case. I've been thinking it would probably be easiest for my code to count each group number and its number of elements, then go back to the largest group and rewrite that list in a new file. I'm just not sure of the best way to separate and count them.
Walter Roberson on 15 Jul 2020
data = cat(1, image_patches,labels);
That code is overwriting all of data each iteration.
It looks to me as if data will not be a vector, but I do not seem to be able to locate any hellopatches() function, so I cannot tell what shape it will be. As you are not doing imresize(), I also cannot be sure that all of the images are the same size, so I cannot be sure that data will be the same size for each iteration. Under the circumstances, you should consider saving into a cell array.
Note: please do not post the same query multiple times. I found at least 10 copies of your query :(


Accepted Answer

dpb on 14 Jul 2020
Edited: dpb on 15 Jul 2020
Well, the following is pretty easy as far as counting goes... how it works on a real file, speed-wise, and whether the record needs to be read piecemeal, I've no klew w/o a real file to test on.
s='{"1": [1,2,3,5,10,15,25,37], "2": [1,5,10,20], "3": [2000,2170], "4": [35,72,423]}';
s=erase(s,{'{','}'});
ss=split(s,{':',']'});
ss=ss(2:2:end);
>> ss
ss =
4×1 cell array
{' [1,2,3,5,10,15,25,37'}
{' [1,5,10,20' }
{' [2000,2170' }
{' [35,72,423' }
>>
>> [~,ixss]=max(cellfun(@(s) sum(s==','),ss))
ixss =
1
>>
Undoubtedly regular expressions could come to the rescue here as well, but I'm no guru...
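One possible regexp version of the same comma count (a sketch, not tested on a real file):

```matlab
% Sketch: capture each bracketed list with a regular expression,
% then pick the one containing the most commas.
s = '{"1": [1,2,3,5,10,15,25,37], "2": [1,5,10,20], "3": [2000,2170], "4": [35,72,423]}';
tok = regexp(s, '\[([^\]]*)\]', 'tokens');             % one token per group
ss  = cellfun(@(c) c{1}, tok, 'UniformOutput', false); % unwrap nested cells
[~, ixss] = max(cellfun(@(v) sum(v==','), ss));        % most commas = most elements
```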
The new(ish) string functions version could be:
ss=extractBetween(s,'[',']');
[~,ixss]=max(cellfun(@(s) sum(s==','),ss))
or, if you want the brackets too, then
ss=extractBetween(s,'[',']','Boundaries','inclusive');
[~,ixss]=max(cellfun(@(s) sum(s==','),ss))
  1 Comment
Neil on 15 Jul 2020
Thanks! This worked well for me, and was pretty fast (~30 sec for the 260 MB file, including writing a new file).


More Answers (1)

Stephen23 on 15 Jul 2020
Edited: Stephen23 on 15 Jul 2020
"I just need to determine the group with the highest number of elements and output that list only."
So for your example data this would be the first group?
What happens if multiple groups have the same number of elements?
I doubt that importing the entire file into MATLAB and doing string operations would be particularly efficient. I would do as much processing as possible with as little data as possible, which means operating at the level of file reading. For example, reading only one group at a time would likely be efficient, something like this:
outN = 0;
outV = [];
[fid,msg] = fopen('test.txt','rt');
assert(fid>=3,msg)
fscanf(fid,'{');
while ~feof(fid)
    tmpN = fscanf(fid,'"%f"%*[: ][');
    tmpV = fscanf(fid,'%f,',[1,Inf]);
    fscanf(fid,']%*[, }]');
    assert(~isempty(tmpN),'could not match number')
    assert(~isempty(tmpV),'could not match vector')
    if numel(tmpV)>numel(outV) % or whatever condition.
        outV = tmpV;
        outN = tmpN;
    end
end
fclose(fid);
If you could upload a small sample file (a few thousand characters) by clicking the paperclip button, then I could test this too. Instead I had to create my own test file (attached) to test my code with (I made the third group have the most elements).
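For reference, a comparable test file can be written with a couple of lines (the values here are made up; the third group is the largest):

```matlab
% Write a small test file in the same one-row format (made-up values).
fid = fopen('test.txt','wt');
fprintf(fid, '{"1": [1,2,3], "2": [4,5], "3": [6,7,8,9,10], "4": [11]}');
fclose(fid);
```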
  2 Comments
dpb on 15 Jul 2020
Indeed. Nothing in the above was intended to necessarily be fast.
Your approach is similar to what I figured would be necessary -- read a block of whatever size is feasible given memory constraints, find the last "]" in the block, and count the commas in each group.
If there's another "[" in the block after the last "]", then that's part of the next block to process.
Rinse and repeat...
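That block-wise scan might be sketched like so (hypothetical block size and file name; assumes extractBetween from R2016b+ and that each carried-over remainder fits in memory):

```matlab
% Sketch of the rinse-and-repeat block scan described above.
fid   = fopen('big.txt','r');
carry = '';                        % leftover text after the last complete group
bestN = -1; bestTxt = '';
while ~feof(fid)
    blk = [carry, fread(fid, 1e6, '*char').'];    % read ~1 MB at a time
    k   = find(blk == ']', 1, 'last');            % end of last complete group
    if isempty(k), carry = blk; continue, end
    groups = extractBetween(blk(1:k), '[', ']');  % complete groups only
    [n, i] = max(cellfun(@(g) sum(g==','), groups));
    if n > bestN, bestN = n; bestTxt = groups{i}; end
    carry = blk(k+1:end);                         % "[" after last "]" carries over
end
fclose(fid);
```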
Neil on 15 Jul 2020
Thank you both for the help! As I responded above, the string editing worked out pretty quickly, but I'll try this out if my file size increases any more.

