How can I read a huge json file (0.5GB)
I have a huge json file. I tried using the "fileread" command but I get an "Out of Memory" message. I have 4 GB of RAM on my computer. Can anyone suggest a solution?
Thanks,
Meir
Answers (2)
Guillaume
on 28 Apr 2019
1 vote
The problem is that your computer simply doesn't have enough memory. What does the memory command report right after you've started MATLAB?
Have you tried jsondecode? Despite its flaws (*) it's certainly more suited than fileread but I suspect it would still run out of memory since you don't appear to have enough memory to hold the whole file in memory, let alone its decoded content.
*: I got MathWorks to fix several issues with their JSON decoding when it was first implemented, but one flaw that remains is that it irreversibly mangles object property names that are not valid variable names or that are longer than namelengthmax (63 characters).
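To make the property-name issue above concrete, here is a minimal sketch. The JSON text is made up for illustration; the exact replacement name MATLAB picks may depend on the release, so the example only checks that the original name is gone and cannot be recovered.

```matlab
% Made-up JSON text with a property name that is not a valid MATLAB identifier
txt = '{"my-key": 1}';

s = jsondecode(txt);

% The struct field has been renamed to something MATLAB accepts
names = fieldnames(s);
disp(names);

% The original name is lost: it is neither a field of the struct ...
hasOriginal = isfield(s, 'my-key');

% ... nor does it come back when the struct is encoded again
roundTrip = jsonencode(s);
lostOnRoundTrip = ~contains(roundTrip, 'my-key');
```

If the names matter for your data, you would have to keep a separate record of them, because the decoded struct no longer carries that information.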
1 Comment
Jan
on 29 Apr 2019
While fileread requires a contiguous block of 1 GB (two bytes per character in the file), parsing the JSON string splits the data into several chunks, which need not be stored as one contiguous block. But maybe the JSON file contains one big matrix of numerical data, where each value is stored with 3 characters and a separator. Then parsing creates a matrix with 8 bytes per element (if double is used), so the decoded data needs twice the size of the file, on top of the text it is parsed from.
The only reliable way to import a large JSON file is to provide enough RAM. Therefore: +1
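The arithmetic can be checked on a tiny example. This is only an illustration with made-up values, assuming the file uses a single-byte encoding such as ASCII, so that one character on disk is one byte:

```matlab
% Three values, each written with 3 characters plus a separator
txt = '123,456,789,';

% Decode to doubles (separators replaced by spaces before scanning)
num = sscanf(strrep(txt, ',', ' '), '%f');

bytesOnDisk = numel(txt);      % 1 byte per character in an ASCII file
infoTxt = whos('txt');         % char vector: 2 bytes per character
infoNum = whos('num');         % double: 8 bytes per element

fprintf('on disk: %d bytes, as char: %d bytes, as double: %d bytes\n', ...
    bytesOnDisk, infoTxt.bytes, infoNum.bytes);
```

While the text is being parsed, the char vector and the numeric result exist at the same time, so the peak demand is the sum of both.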
If you use fileread, the 0.5 GB of bytes are converted to a char vector, which occupies 1 GB of RAM, because MATLAB uses 2 bytes per char. You do not have 1 GB of free RAM in a contiguous block. You can import the file into a cell string instead, but this needs more RAM overall because of the overhead of about 100 bytes for each line of text. On the other hand, the memory for a cell string does not need to be free in one contiguous block. The pre-allocation is not trivial, but you can do it in blocks of e.g. 1000 lines:
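function C = readtextfile(FileName)
% Read a text file line by line into a cell string
[fid, msg] = fopen(FileName, 'r');
if fid == -1
    error(msg);
end
nC = 1000;         % Number of currently allocated cells
iC = 0;            % Number of lines stored so far
C  = cell(1, nC);
while ~feof(fid)
    s = fgetl(fid);
    if ischar(s)   % fgetl returns -1 if nothing could be read
        iC = iC + 1;
        if iC > nC
            nC = nC + 1000;
            C{nC} = [];   % Expand the cell by another block
        end
        C{iC} = s;
    end
end
fclose(fid);
C = C(1:iC);       % Crop unused cell strings
end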
This will take a while. The result will need more than 1 GB of RAM, but not in a contiguous block. So maybe the import works. But as soon as you want to process the data, you will need more memory. So the only reliable solution is to install more RAM, or to import only a specific section of the data.
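Importing only a specific section could look like the following sketch. The function name readFileSection and its arguments are hypothetical, and it assumes a single-byte encoding such as ASCII. Note that an arbitrary byte range is usually not valid JSON on its own, so the section boundaries have to be chosen with knowledge of the file's layout.

```matlab
function txt = readFileSection(fileName, offset, count)
% Hypothetical helper: read COUNT bytes starting OFFSET bytes after the
% beginning of the file and return them as a char row vector.
[fid, msg] = fopen(fileName, 'r');
if fid == -1
    error('readFileSection:open', '%s', msg);
end
closer = onCleanup(@() fclose(fid));   % Close the file even on error

if fseek(fid, offset, 'bof') ~= 0
    error('readFileSection:seek', '%s', ferror(fid));
end

raw = fread(fid, count, '*uint8');     % 1 byte per element
txt = char(raw.');                     % 2 bytes per element from here on
end
```

Only count bytes (plus the char copy) are held in memory, independent of the total file size.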
7 Comments
Meir Preiszler
on 28 Apr 2019
Guillaume
on 28 Apr 2019
However, the application that created this file did not add "CR/LF"
Yes, because newlines are absolutely not required in json files and they should be treated exactly as spaces by any json parser.
That is why "fgetl" is unable to read it.
Whether you read all the content in one fgetl call or in several doesn't matter. You still need the same amount of memory.
My idea now is to split the file into several smaller parts using the editor I found.
Doesn't sound like a good idea. Most likely, you'll make the file unreadable by any json parser.
Walter Roberson
on 28 Apr 2019
You can fopen() the file, fread() as uint8, fclose(). The result would occupy the same amount of memory as the file length. The data type will not be char, but there are useful operations you can do on it.
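A minimal sketch of this approach, with a hypothetical function name. The precision has to be given as '*uint8' (or 'uint8=>uint8'), because with plain 'uint8' fread returns doubles, which need 8 bytes per element.

```matlab
function raw = readFileAsUint8(fileName)
% Hypothetical helper: import a file as a uint8 row vector, so that the
% result occupies as many bytes in memory as the file has on disk.
[fid, msg] = fopen(fileName, 'r');
if fid == -1
    error('readFileAsUint8:open', '%s', msg);
end
closer = onCleanup(@() fclose(fid));   % Close the file even on error

raw = fread(fid, Inf, '*uint8').';     % '*uint8' keeps the uint8 class
end
```

Some operations work directly on the bytes without converting to char, for example counting the opening braces with nnz(raw == uint8('{')) or locating the quotes with find(raw == uint8('"')).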
Jan
on 29 Apr 2019
Even if the file has been imported as UINT8 (and needs 0.5 GB of contiguous memory only), it has to be parsed afterwards. As soon as the file contains large arrays, the memory will be exhausted again. The only reliable method is to install a sufficient amount of RAM. No magic trick can override this fact.
Walter Roberson
on 29 Apr 2019
Two passes reading the text file from disk should be enough. In the first pass, count the size and data types of entities, and pre-allocate for them. In the second pass, decode them from text to binary.
Where this could go wrong is the case where a small number of text bytes are used to represent doubles. For example, the text [1.,3.5] occupies only 8 bytes in the file, but the two doubles it decodes to need 16 bytes of storage.
Jan
on 29 Apr 2019
... 16 bytes for the data plus about 100 bytes for the header of the variable. A 0.5 GB file can contain a lot of variables, such that the overhead might matter.
Walter Roberson
on 29 Apr 2019
True. Success would depend upon whether there are big data chunks or a number of small variables -- though perhaps in context it would turn out to make sense to bundle the small variables into arrays.
More RAM wouldn't hurt...
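The two-pass idea discussed above can be sketched for the simplest possible case. This is not a general JSON parser: it assumes the file holds exactly one flat array of numbers such as [1.,3.5,-2e3], with no nesting and no strings, stored in a single-byte encoding. The function name readFlatJsonArray and the blockSize argument are hypothetical.

```matlab
function v = readFlatJsonArray(fileName, blockSize)
% Hypothetical helper: two-pass import of ONE flat JSON array of numbers.
if nargin < 2
    blockSize = 2^20;                  % Bytes read per fread call
end
[fid, msg] = fopen(fileName, 'r');
if fid == -1
    error('readFlatJsonArray:open', '%s', msg);
end
closer = onCleanup(@() fclose(fid));

% Pass 1: count the separators to learn the number of elements
nComma   = 0;
sawDigit = false;
while true
    buf = fread(fid, blockSize, '*uint8');
    if isempty(buf), break; end
    nComma   = nComma + nnz(buf == uint8(','));
    sawDigit = sawDigit || any(buf >= uint8('0') & buf <= uint8('9'));
end
if ~sawDigit                           % Empty array: nothing to decode
    v = zeros(0, 1);
    return
end
v = zeros(nComma + 1, 1);              % Pre-allocate the result once

% Pass 2: decode block by block; an incomplete number at the end of a
% block is carried over in TAIL and completed by the next block
clean = @(t) strrep(strrep(strrep(t, ',', ' '), '[', ' '), ']', ' ');
frewind(fid);
n    = 0;
tail = '';
while true
    buf = char(fread(fid, blockSize, '*uint8').');
    if isempty(buf), break; end
    txt = [tail, buf];
    cut = find(txt == ',', 1, 'last');
    if isempty(cut)                    % No complete number in this block
        tail = txt;
        continue
    end
    tail = txt(cut + 1:end);
    vals = sscanf(clean(txt(1:cut)), '%f');
    v(n + 1:n + numel(vals)) = vals;
    n = n + numel(vals);
end
vals = sscanf(clean(tail), '%f');      % Last element has no trailing comma
v(n + 1:n + numel(vals)) = vals;
v = v(1:n + numel(vals));
end
```

Apart from the result, only one block of text is in memory at a time. For files with many small variables or nested objects the counting pass becomes much more involved, and the per-variable overhead mentioned above still applies.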