Breaking data from a large text file into groups
I have a text file that has groups of elements formatted like this:
{"1": [1,2,3,5,10,15,25,37], "2": [1,5,10,20], "3": [2000,2170], "4": [35,72,423], .... }
and so on. The data is all in one row in this format. I just need to determine the group with the highest number of elements and output that list only. However these files can be fairly large (~260 MB) with a couple thousand groups and element numbers up to the millions. I'm struggling to find the best method to break the scan (probably at the double quotes), save that group (probably to a cell), and then move on to the next one.
3 Comments
Walter Roberson
on 15 Jul 2020
data = cat(1, image_patches,labels);
That code is overwriting all of data each iteration.
It looks to me as if data will not be a vector, but I do not seem to be able to locate any hellopatches() function so I cannot tell what shape it will be. As you are not doing imresize() I also cannot be sure that all of the images are the same size, so I cannot be sure that data will be the same size for each iteration. Under the circumstances you should be considering saving into a cell array.
Note: please do not post the same query multiple times. I found at least 10 copies of your query :(
Accepted Answer
dpb
on 14 Jul 2020
Edited: dpb
on 15 Jul 2020
Well, the following makes the counting pretty easy. How fast it is on a real file, and whether the record would need to be read piecemeal, I've no clue without a real file to test.
s='{"1": [1,2,3,5,10,15,25,37], "2": [1,5,10,20], "3": [2000,2170], "4": [35,72,423]}';
s=erase(s,{'{','}'});
ss=split(s,{':',']'});
ss=ss(2:2:end);
>> ss
ss =
4×1 cell array
{' [1,2,3,5,10,15,25,37'}
{' [1,5,10,20' }
{' [2000,2170' }
{' [35,72,423' }
>>
>> [~,ixss]=max(cellfun(@(s) sum(s==','),ss))
ixss =
1
>>
Undoubtedly regular expressions could come to the rescue here as well but I'm no guru...
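For what it's worth, a regular-expression version might look something like the sketch below. It assumes the keys are plain digits in double quotes and that the lists contain no nested brackets; the names tok, nEl, and key are only illustrative. Note that an empty group "[]" would be counted as one element by the commas-plus-one rule.

```matlab
s = '{"1": [1,2,3,5,10,15,25,37], "2": [1,5,10,20], "3": [2000,2170], "4": [35,72,423]}';

% Capture the key between the double quotes and the text between the brackets.
tok = regexp(s, '"(\d+)":\s*\[([^\]]*)\]', 'tokens');
tok = vertcat(tok{:});                          % N-by-2 cell: {key, list text}

% Number of elements = number of commas + 1 (assumes no empty groups).
nEl = cellfun(@(t) sum(t == ',') + 1, tok(:,2));

[~, ix] = max(nEl);                             % first group with the most elements
key     = tok{ix,1};                            % key of that group, as text
```

Compared with the split approach, this keeps the group key alongside its list, so the winning group can be reported by name as well as by position.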
The new(ish) string functions version could be:
ss=extractBetween(s,'[',']');
[~,ixss]=max(cellfun(@(s) sum(s==','),ss))
or, if you want the brackets too, then
ss=extractBetween(s,'[',']','Boundaries','inclusive');
[~,ixss]=max(cellfun(@(s) sum(s==','),ss))
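The snippets above only return the index of the largest group. Since the question asks for the list itself, one way to turn the winning text into numbers is sketched here. It assumes the non-inclusive extractBetween form (no brackets in the text) and that every element is a plain number; vals is just an illustrative name.

```matlab
s  = '{"1": [1,2,3,5,10,15,25,37], "2": [1,5,10,20], "3": [2000,2170], "4": [35,72,423]}';
ss = extractBetween(s, '[', ']');               % list text without the brackets
[~, ixss] = max(cellfun(@(t) sum(t == ','), ss));

% Split the winning text at the commas and convert each piece to a double.
vals = str2double(split(ss{ixss}, ',')).';      % row vector of the elements
```

Only the one winning group is converted, so the cost of the numeric conversion stays small even when the file holds thousands of groups.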
More Answers (1)
Stephen23
on 15 Jul 2020
Edited: Stephen23
on 15 Jul 2020
"I just need to determine the group with the highest number of elements and output that list only."
So for your example data this would be the first group?
What happens if multiple groups have the same number of elements?
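To illustrate why the tie question matters: max returns only the first index when several groups share the largest count. The counts below are made up for the illustration, and whether the first, the last, or all tied groups are wanted is something only the original poster can decide.

```matlab
nEl = [8 4 2 8];                   % hypothetical element counts, groups 1 and 4 tie

[~, first] = max(nEl);             % max keeps the first occurrence of the maximum
allTied    = find(nEl == max(nEl)); % indices of every group sharing the maximum
```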
I doubt that importing the entire file into MATLAB and doing string operations would be particularly efficient. I would do as much processing as possible with as little data as possible, which means operating at the level of file-reading. For example, reading only one group at a time would likely be efficient, something like this:
outN = 0;  % key of the largest group found so far
outV = []; % elements of that group
[fid,msg] = fopen('test.txt','rt');
assert(fid>=3,msg)
fscanf(fid,'{'); % skip the opening brace
while ~feof(fid)
    tmpN = fscanf(fid,'"%f"%*[: ][');   % group key, up to the opening bracket
    tmpV = fscanf(fid,'%f,',[1,Inf]);   % comma-separated elements
    fscanf(fid,']%*[, }]');             % closing bracket and separator
    assert(~isempty(tmpN),'could not match number')
    assert(~isempty(tmpV),'could not match vector')
    if numel(tmpV)>numel(outV) % or whatever condition.
        outV = tmpV;
        outN = tmpN;
    end
end
fclose(fid);
If you could upload a small sample file (a few thousand characters) by clicking the paperclip button then I could test this too. Instead I had to create my own test file (attached) to test my code with (I made the third group have the most elements).
2 Comments
dpb
on 15 Jul 2020
Indeed. Nothing in the above was intended as anything that would necessarily be fast.
Your approach is similar to what I figured would be necessary: read a block of whatever size is feasible given memory constraints, find the last "]" in the block and count the commas between groups.
If there's another "[" in the block after the last "]", then that's part of next block to process.
Rinse and repeat...
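A rough, untested-on-real-data sketch of that block-wise idea follows. The file name, sample content, and block size are all made up so that the snippet runs on its own; the block size is tiny on purpose so the sample exercises the carry-over between blocks, and something like 2^20 would be more sensible for a real file. It assumes groups are never nested and that ties go to the first group seen.

```matlab
% Write a tiny hypothetical sample file so the sketch is self-contained.
fname = [tempname, '.txt'];
fid = fopen(fname, 'wt');
fprintf(fid, '%s', '{"1": [1,2,3], "2": [4,5,6,7,8], "3": [9]}');
fclose(fid);

blockSize = 16;   % characters per read; deliberately small for this demo

fid = fopen(fname, 'rt');
assert(fid >= 3, 'could not open file')
carry  = '';      % text left over after the last complete group
bestN  = -1;      % comma count of the best group so far
bestTx = '';      % text of that group, without the brackets
while true
    blk = fread(fid, [1, blockSize], '*char');
    if isempty(blk), break, end              % nothing left to read
    buf  = [carry, blk];
    last = find(buf == ']', 1, 'last');      % end of the last complete group
    if isempty(last)
        carry = buf;                         % no complete group yet, keep reading
        continue
    end
    tok = regexp(buf(1:last), '\[([^\]]*)\]', 'tokens');
    grp = [tok{:}];                          % 1-by-N cell of list texts
    [n, ix] = max(cellfun(@(g) sum(g == ','), grp));
    if n > bestN                             % strict >, so the first of any ties wins
        bestN  = n;
        bestTx = grp{ix};
    end
    carry = buf(last+1:end);                 % may hold the start of the next group
end
fclose(fid);
delete(fname);

bestV = sscanf(bestTx, '%f,').';             % numeric elements of the largest group
```

Only the text of the current best group is kept between blocks, so memory use is bounded by the block size plus the longest single group rather than by the file size.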