Reading Parquet Key/Value Metadata
Is there a recommended method for reading the key/value metadata pairs stored in the footer of a Parquet file?
I'm using MATLAB to extract JSON metadata that was added with Python's pyarrow. The footer layout is documented here: https://github.com/apache/parquet-format#file-format. The following code works, but I would prefer a more robust solution in the future.
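function metadata = parquetmeta(filename)
% Read JSON key/value metadata from the footer of a Parquet file.
fid = fopen(filename, 'r');
% The file ends with a 4-byte little-endian footer length and the magic "PAR1".
fseek(fid, -8, 'eof');
footer_bytes = fread(fid, 1, 'uint32', 'ieee-le');
% The footer sits immediately before those 8 trailing bytes.
fseek(fid, -(footer_bytes + 8), 'eof');
footer = fread(fid, [1, footer_bytes], '*char');
fclose(fid);
% Heuristic: treat everything from the first '{' to the last '}' as JSON.
start_idx = find(footer == '{', 1, 'first');
end_idx = find(footer == '}', 1, 'last');
metadata = struct;
if ~isempty(start_idx) && ~isempty(end_idx)
    try %#ok<TRYNC>
        metadata = jsondecode(footer(start_idx:end_idx));
    end
end
end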
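The function above trusts the trailing bytes without checking them. A minimal sketch of a stricter footer read is shown here, assuming an unencrypted file that ends with the footer, a 4-byte little-endian footer length, and the magic "PAR1", as described in the format document linked above. The name parquetfooter and its error identifiers are hypothetical; this is not a MathWorks function.

```matlab
function [footer, footer_len] = parquetfooter(filename)
% PARQUETFOOTER  Return the raw footer bytes of a Parquet file (sketch).
% Hypothetical helper. Assumes the unencrypted layout:
%   ... <footer> <4-byte little-endian footer length> "PAR1"
fid = fopen(filename, 'r');
if fid == -1
    error('parquetfooter:open', 'Cannot open %s.', filename);
end
closer = onCleanup(@() fclose(fid));  % close the file even if an error occurs

% The last 8 bytes hold the footer length followed by the magic number.
if fseek(fid, -8, 'eof') ~= 0
    error('parquetfooter:short', 'File is too short to be a Parquet file.');
end
footer_len = fread(fid, 1, 'uint32', 0, 'ieee-le');
magic = fread(fid, [1, 4], 'uint8=>char');
if ~strcmp(magic, 'PAR1')
    error('parquetfooter:magic', 'Trailing magic is not "PAR1".');
end

% The footer sits immediately before those 8 trailing bytes.
if fseek(fid, -(footer_len + 8), 'eof') ~= 0
    error('parquetfooter:length', 'Footer length exceeds the file size.');
end
footer = fread(fid, [1, footer_len], 'uint8=>uint8');
end
```

The bytes can then be converted with char(footer) and searched for the JSON text as before. This only validates the file trailer; it does not decode the Thrift-encoded footer, so the brace search remains a heuristic.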
I'm hoping MathWorks will add support for reading Parquet file metadata in a future release. Interestingly, it appears this used to be available before MATLAB 2019 through the following calls, but I cannot verify that on my version of MATLAB.
jobj = com.mathworks.bigdata.parquet.Reader;
md = jobj.getParquetFileReader.getFileMetaData.getKeyValueMetaData;
Thanks!