How can I read a huge JSON file (0.5 GB)?

I have a huge json file. I tried using the "fileread" command but I get an "Out of Memory" message. I have 4 GB of RAM on my computer. Can anyone suggest a solution?
Thanks,
Meir

Answers (2)

Guillaume on 28 Apr 2019
The problem is that your computer simply doesn't have enough memory. What does the memory command report right after you've started MATLAB?
Have you tried jsondecode? Despite its flaws (*) it's certainly more suited than fileread but I suspect it would still run out of memory since you don't appear to have enough memory to hold the whole file in memory, let alone its decoded content.
*: I got MathWorks to fix several issues with their JSON decoding when it was first implemented, but one flaw still remaining is that it will irreversibly mangle object property names that are not valid variable names or that are longer than 63 characters (namelengthmax).
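
The name mangling mentioned in the footnote can be seen on a tiny input. This is only a minimal sketch: the JSON text and its key names are made up for illustration, and the exact replacement names may depend on the release.

```matlab
% Hypothetical JSON text: one key contains a hyphen, one starts with a digit.
json = '{"my-key": 1, "1st": 2}';
s = jsondecode(json);

% jsondecode turns keys that are not valid MATLAB identifiers into valid ones.
names = fieldnames(s);
disp(names);

% Encoding the struct again writes the altered names, so the original
% keys cannot be recovered from s alone.
roundTrip = jsonencode(s);
disp(roundTrip);
```

If the original key names matter, they have to be kept somewhere else, because the struct no longer contains them.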

1 Comment

While fileread requires a contiguous block of 1 GB (two bytes per character in the file), parsing the JSON string splits the data into several chunks, which need not be stored in one contiguous block. But maybe the JSON file contains one big matrix of numerical data, stored with, say, 3 characters and a separator per value. Then the parsing creates a matrix with 8 bytes per element (if double is used), so the parsed matrix needs 8 bytes for every 4 bytes in the file, on top of the memory used by fileread.
The only reliable way to import a large JSON file is to provide enough RAM. Therefore: +1
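
The arithmetic in this comment can be checked on a small scale. The following is a rough sketch with made-up demo data (1000 three-digit values, each followed by a comma). It compares the same content as a char vector, as raw bytes, and as parsed doubles. In MATLAB the char vector should report 2 bytes per character, as stated above.

```matlab
% Demo data: n three-digit values, each followed by a comma (4 bytes of text).
n    = 1000;
txt  = sprintf('%03d,', mod(0:n-1, 1000));   % what fileread would return
raw  = uint8(txt);                           % the same text as raw bytes
vals = sscanf(txt, '%d,');                   % parsed values, class double

% Print the memory footprint of each representation.
info = whos('txt', 'raw', 'vals');
for k = 1:numel(info)
    fprintf('%-5s %-6s %6d bytes\n', info(k).name, info(k).class, info(k).bytes);
end
```

During parsing, the text and the parsed values exist at the same time, so the peak memory use is the sum of both.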


Jan on 25 Apr 2019
Edited: Jan on 25 Apr 2019
If you use fileread, the 0.5 GB of bytes are converted to a char vector, which occupies 1 GB of RAM, because MATLAB uses 2 bytes per char. You do not have 1 GB of free RAM in one contiguous block. You can import the file into a cell string instead. This needs more RAM in total, due to an overhead of about 100 bytes for each line of text, but the memory does not have to be free in one contiguous block. The pre-allocation is not trivial, but you can do it in blocks of e.g. 1000 lines:
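function C = readtextfile(FileName)
% Read a text file line by line into a cell string.
[fid, msg] = fopen(FileName, 'r');
if fid == -1
    error('readtextfile:cannotOpen', 'Cannot open file %s: %s', FileName, msg);
end
nC = 1000;          % Current capacity of C
iC = 0;             % Number of lines stored so far
C  = cell(1, nC);
while ~feof(fid)
    s = fgetl(fid);
    if ischar(s)    % fgetl returns -1 (not a char) at the end of the file
        iC = iC + 1;
        if iC > nC
            nC = nC + 1000;
            C{nC} = [];   % Expand the cell by another block of 1000
        end
        C{iC} = s;
    end
end
fclose(fid);
C = C(1:iC);        % Crop unused cells
end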
This will take a while. The result will need more than 1 GB of RAM, but not in a contiguous block. So maybe the import works. But as soon as you want to process the data, you will need more memory. So the only reliable solution is to install more RAM, or to import only a specific section of the data.
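
Importing only a specific section could look roughly like the sketch below. It assumes that the byte offset and length of the interesting section are already known, which is not the case for an arbitrary JSON file. The demo file, offset and count are hypothetical values chosen so that the example is self-contained.

```matlab
% Demo file so the sketch is self-contained; replace with the real file name.
fileName = [tempname, '.json'];
fid = fopen(fileName, 'w');
fprintf(fid, '%s', '{"objects":[{"x":1,"y":2},{"x":3,"y":4}]}');
fclose(fid);

offset = 12;   % hypothetical: number of bytes to skip from the start
count  = 13;   % hypothetical: number of bytes to read

[fid, msg] = fopen(fileName, 'r');
if fid == -1
    error('Cannot open %s: %s', fileName, msg);
end
fseek(fid, offset, 'bof');                  % jump to the section
part = fread(fid, count, 'uint8=>char')';   % read only count bytes as char
fclose(fid);
delete(fileName);

disp(part);   % only this piece is held in memory
```

Here the extracted piece happens to be a complete JSON object, so it could be passed to jsondecode on its own. For a real file the section boundaries would have to be found first, for example by scanning the bytes for brackets.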

7 Comments

Hi Jan,
First I much appreciate your very prompt answer. I was busy over the weekend so I apologize for the fact that it took me a while to get back to you.
I implemented your function but I still got the "Out of Memory" error message when calling the "fgetl" function. I managed to edit the file with an editor that is able to edit huge files and now I understand the problem. This is a special type of file called a .json file. It contains information regarding the position of various objects within an image. However, the application that created this file did not add "CR/LF", so all the information is written in one huge line. That is why "fgetl" is unable to read it. If the line is not too long, "fgetl" succeeds. I managed to read a line of 19987840 characters.
My idea now is to split the file into several smaller parts using the editor I found.
Thanks again,
Meir
> However, the application that created this file did not add "CR/LF"
Yes, because newlines are absolutely not required in JSON files and they should be treated exactly as spaces by any JSON parser.
> That is why "fgetl" is unable to read it.
Whether you read all the content in one fgetl call or several doesn't matter. You still need the same amount of memory.
> My idea now is to split the file into several smaller parts using the editor I found.
Doesn't sound like a good idea. Most likely, you'll make the file unreadable by any JSON parser.
You can fopen() the file, fread() as uint8, fclose(). The result would occupy the same amount of memory as the file length. The data type will not be char, but there are useful operations you can do on it.
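
A minimal sketch of this approach, using a small demo file as a stand-in for the real one. The precision string matters here: with plain 'uint8', fread returns a double array, which would take 8 bytes per file byte.

```matlab
% Demo file so the sketch is self-contained; replace with the real file name.
fileName = [tempname, '.json'];
fid = fopen(fileName, 'w');
fprintf(fid, '%s', '{"x":[1,2,3]}');
fclose(fid);

[fid, msg] = fopen(fileName, 'r');
if fid == -1
    error('Cannot open %s: %s', fileName, msg);
end
% '*uint8' keeps the data as uint8, one byte of memory per byte in the file.
data = fread(fid, Inf, '*uint8');
fclose(fid);
delete(fileName);

% Examples of operations that work directly on the bytes:
nOpen  = sum(data == uint8('['));    % count opening brackets
commas = find(data == uint8(','));   % positions of the commas
head   = char(data(1:5)');           % convert only a small piece to char
```

Comparisons, find and indexing work on the uint8 vector without creating a char copy of the whole file.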
Even if the file has been imported as UINT8 (and needs 0.5 GB of contiguous memory only), it has to be parsed afterwards. As soon as the file contains large arrays, the memory will be exhausted again. The only reliable method is to install a sufficient amount of RAM. No magic trick can override this fact.
Two passes reading the text file from disk should be enough. In the first pass, count the size and data types of entities, and pre-allocate for them. In the second pass, decode them from text to binary.
Where this could go wrong is the case where a small number of text bytes are used to represent doubles. For example the text [1.,3.5] needs 16 bytes to store but only 8 on input.
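
The two-pass idea can be sketched for the simplest possible case. The code below assumes the file holds one flat JSON array of numbers with no nesting, no strings and at least one element. Real JSON would need a proper tokenizer, so this only illustrates the counting and pre-allocation steps. The demo file and the chunk size are hypothetical.

```matlab
% Demo file holding one flat JSON array of numbers.
fileName = [tempname, '.json'];
fid = fopen(fileName, 'w');
fprintf(fid, '%s', '[1.5,2,3.25,-4e2]');
fclose(fid);

chunkBytes = 4;   % tiny on purpose for the demo; use e.g. 2^20 for real files

[fid, msg] = fopen(fileName, 'r');
if fid == -1
    error('Cannot open %s: %s', fileName, msg);
end

% Pass 1: count the commas chunk by chunk to learn the number of elements.
nComma = 0;
buf = fread(fid, chunkBytes, '*uint8');
while ~isempty(buf)
    nComma = nComma + sum(buf == uint8(','));
    buf = fread(fid, chunkBytes, '*uint8');
end
v = zeros(nComma + 1, 1);   % pre-allocate the result once

% Pass 2: convert the text to doubles chunk by chunk.
frewind(fid);
k    = 0;    % number of values stored so far
tail = '';   % unfinished number carried over from the previous chunk
buf  = fread(fid, chunkBytes, 'uint8=>char')';
while ~isempty(buf)
    txt = [tail, buf];
    txt(txt == '[' | txt == ']') = ' ';     % brackets carry no data here
    last = find(txt == ',', 1, 'last');     % end of the last complete number
    if isempty(last)
        tail = txt;                         % no complete number yet
    else
        nums = sscanf(strrep(txt(1:last), ',', ' '), '%f');
        v(k+1:k+numel(nums)) = nums;
        k = k + numel(nums);
        tail = txt(last+1:end);
    end
    buf = fread(fid, chunkBytes, 'uint8=>char')';
end
nums = sscanf(tail, '%f');                  % the last number has no comma
v(k+1:k+numel(nums)) = nums;
k = k + numel(nums);
fclose(fid);
delete(fileName);
```

Only one chunk of text is held in memory at a time, but the pre-allocated result still needs 8 bytes per value, which is exactly the case described above where the decoded data is larger than the text.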
... 16 bytes for the data plus about 100 bytes for the header of the variable. A 0.5 GB file can contain a lot of variables, such that the overhead might matter.
True. Success would depend upon whether there are big data chunks or a number of small variables -- though perhaps in context it would turn out to make sense to bundle the small variables into arrays.
More RAM wouldn't hurt...


Release: R2018b
Asked: 25 Apr 2019
Commented: 29 Apr 2019
