parquetwrite

Write columnar data to Parquet file

Description

parquetwrite(filename,T) writes a table or timetable T to a Parquet 2.0 file with the filename specified in filename.

parquetwrite(filename,T,Name,Value) specifies additional options with one or more name-value pair arguments. For example, you can specify "VariableCompression" to change the compression algorithm used, or "Version" to write the data to a Parquet 1.0 file.

Examples

Write tabular data into a Parquet file and compare the size of the same tabular data in .csv and .parquet file formats.

Read the tabular data from the file outages.csv into a table.

T = readtable('outages.csv');

Write the data to Parquet file format. By default, the parquetwrite function uses the Snappy compression scheme. To specify other compression schemes, see the 'VariableCompression' name-value argument.

parquetwrite('outagesDefault.parquet',T)

Get the file sizes and compute the ratio of the size of the tabular data in the .csv format to the size of the same data in the .parquet format.

Get size of .csv file.

fcsv = dir(which('outages.csv'));
size_csv = fcsv.bytes
size_csv = 101040

Get size of .parquet file.

fparquet = dir('outagesDefault.parquet');
size_parquet = fparquet.bytes
size_parquet = 44881

Compute the ratio.

sizeRatio = (size_parquet/size_csv)*100;
disp(['Size Ratio = ', num2str(sizeRatio), '% of original size'])
Size Ratio = 44.419% of original size

Create nested data and write it to a Parquet file.

Create a table with one nested layer of data.

FirstName = ["Akane"; "Omar"; "Maria"];
LastName = ["Saito"; "Ali"; "Silva"];
Names = table(FirstName,LastName);
NumCourse = [5; 3; 6];
Courses = {["Calculus I"; "U.S. History"; "English Literature"; "Studio Art"; "Organic Chemistry II"];
            ["U.S. History"; "Art History"; "Philosphy"];
            ["Calculus II"; "Philosphy II"; "Ballet"; "Music Theory"; "Organic Chemistry I"; "English Literature"]};
data = table(Names,NumCourse,Courses)
data=3×3 table
            Names            NumCourse      Courses   
    _____________________    _________    ____________

    FirstName    LastName                             
    _________    ________                             
                                                      
     "Akane"     "Saito"         5        {5x1 string}
     "Omar"      "Ali"           3        {3x1 string}
     "Maria"     "Silva"         6        {6x1 string}

Write your nested data to a Parquet file.

parquetwrite("StudentCourseLoads.parq",data)

Read the nested Parquet data.

t2 = parquetread("StudentCourseLoads.parq")
t2=3×3 table
            Names            NumCourse      Courses   
    _____________________    _________    ____________

    FirstName    LastName                             
    _________    ________                             
                                                      
     "Akane"     "Saito"         5        {5x1 string}
     "Omar"      "Ali"           3        {3x1 string}
     "Maria"     "Silva"         6        {6x1 string}

Input Arguments

filename — Name of output Parquet file
character vector | string scalar

Name of output Parquet file, specified as a character vector or string scalar.

Depending on the location you are writing to, filename can take on one of these forms.

  • Current folder — To write to the current folder, specify the name of the file in filename.

    Example: 'myData.parquet'

  • Other folders — To write to a folder different from the current folder, specify the full or relative path name in filename.

    Example: 'C:\myFolder\myData.parquet'

    Example: 'dataDir\myData.parquet'

  • Remote location — To write to a remote location, filename must contain the full path of the file specified as a uniform resource locator (URL) of the form:

    scheme_name://path_to_file/myData.parquet

    Based on the remote location, scheme_name can be one of the values in this table.

    Remote Location                scheme_name
    ___________________________    ___________
    Amazon S3™                     s3
    Windows Azure® Blob Storage    wasb, wasbs
    HDFS™                          hdfs

For more information, see Work with Remote Data.

Example: 's3://bucketname/path_to_file/myData.parquet'

Data Types: char | string

T — Input data
table | timetable

Input data, specified as a table or timetable.

Use parquetwrite to export structured Parquet data. For more information on Parquet data types supported for writing, see Apache Parquet Data Type Mappings.
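
For example, a timetable is written the same way as a table, with its row times saved alongside the data variables. A minimal sketch, with illustrative file and variable names:

% Write a three-row timetable; the row times are stored in the file
% along with the Temperature variable.
Time = datetime(2023,1,1) + hours(0:2)';
Temperature = [21.5; 22.1; 21.8];
TT = timetable(Time, Temperature);
parquetwrite('sensorReadings.parquet', TT)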

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: parquetwrite(filename,T,'VariableCompression','gzip','Version','1.0')

VariableCompression — Compression scheme names
'snappy' (default) | 'brotli' | 'gzip' | 'uncompressed' | cell array of character vectors | string vector

Compression scheme names, specified as one of these values:

  • 'snappy', 'brotli', 'gzip', or 'uncompressed' — If you specify one compression algorithm, then parquetwrite compresses all variables using the same algorithm.

  • Alternatively, you can specify a cell array of character vectors or a string vector containing the names of the compression algorithms to use for each variable.

In general, 'snappy' has better performance for reading and writing, 'gzip' has a higher compression ratio at the cost of more CPU processing time, and 'brotli' typically produces the smallest file size at the cost of compression speed.

Example: parquetwrite('myData.parquet', T, 'VariableCompression', 'brotli')

Example: parquetwrite('myData.parquet', T, 'VariableCompression', {'brotli' 'snappy' 'gzip'})
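
To compare the schemes on your own data, one approach is to write the same table once per scheme and check the resulting file sizes. A minimal sketch, reusing the outages.csv table from the example above (output file names are illustrative, and sizes depend on your data):

% Write the same table with each compression scheme and report file sizes.
T = readtable('outages.csv');
schemes = ["snappy" "gzip" "brotli" "uncompressed"];
for k = 1:numel(schemes)
    fname = "outages_" + schemes(k) + ".parquet";
    parquetwrite(fname, T, 'VariableCompression', schemes(k))
    info = dir(fname);
    fprintf('%-12s %d bytes\n', schemes(k), info.bytes)
end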

VariableEncoding — Encoding scheme names
'auto' (default) | 'dictionary' | 'plain' | cell array of character vectors | string vector

Encoding scheme names, specified as one of these values:

  • 'auto' — parquetwrite uses 'plain' encoding for logical variables and 'dictionary' encoding for all others.

  • 'dictionary', 'plain' — If you specify one encoding scheme, then parquetwrite encodes all variables with that scheme.

  • Alternatively, you can specify a cell array of character vectors or a string vector containing the names of the encoding schemes to use for each variable.

In general, 'dictionary' encoding results in smaller file sizes, but 'plain' encoding can be faster for variables that do not contain many repeated values. If the size of the dictionary or the number of unique values grows too large, the encoding automatically reverts to plain encoding. For more information on Parquet encodings, see Parquet encoding definitions.

Example: parquetwrite('myData.parquet', T, 'VariableEncoding', 'plain')

Example: parquetwrite('myData.parquet', T, 'VariableEncoding', {'plain' 'dictionary' 'plain'})
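
As an illustration, this minimal sketch (variable and file names are hypothetical) picks an encoding per variable: dictionary encoding for a low-cardinality variable, plain encoding for one whose values are mostly unique.

% Dictionary encoding suits the repetitive Region variable; plain
% encoding suits the mostly unique Reading variable.
Region = repmat(["North"; "South"; "East"; "West"], 250, 1);
Reading = rand(1000,1);
T = table(Region, Reading);
parquetwrite('readings.parquet', T, 'VariableEncoding', ["dictionary" "plain"])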

RowGroupHeights — Number of rows to write per output row group
nonnegative numeric scalar | vector of nonnegative integers

Number of rows to write per output row group, specified as a nonnegative numeric scalar or a vector of nonnegative integers.

  • If you specify a scalar, that value sets the height of every row group in the output Parquet file. The last row group can contain fewer rows if the table height is not an exact multiple of the scalar.

  • If you specify a vector, each value in the vector sets the height of a corresponding row group in the output Parquet file. The sum of all the values in the vector must match the height of the input table.

A row group is the smallest subset of a Parquet file that can be read into memory at once. Reducing the row group height helps the data fit into memory when reading. Row group height also affects the performance of filtering operations on a Parquet data set, because a larger row group height allows larger amounts of data to be filtered in a single read.

If RowGroupHeights is unspecified and the input table exceeds 67108864 rows, the number of row groups in the output file is equal to floor(TotalNumberOfRows/67108864)+1.

Example: RowGroupHeights=100

Example: RowGroupHeights=[300, 400, 500, 0, 268]
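
To confirm the layout a given setting produces, you can inspect the written file. This is a minimal sketch, assuming the parquetinfo object in your release exposes a RowGroupHeights property (file name is illustrative):

% Write 1000 rows in row groups of 400; the last group takes the remainder.
T = array2table(rand(1000,3));
parquetwrite('grouped.parquet', T, 'RowGroupHeights', 400)
info = parquetinfo('grouped.parquet');
disp(info.RowGroupHeights)   % expected: 400 400 200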

Version — Parquet version to use
'2.0' (default) | '1.0'

Parquet version to use, specified as either '1.0' or '2.0'. The default, '2.0', offers the most efficient storage, but you can select '1.0' for the broadest compatibility with external applications that support the Parquet format.

Caution

Parquet version 1.0 has a limitation that it cannot round-trip variables of type uint32 (they are read back into MATLAB® as int64).
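
A minimal sketch of this round-trip behavior (file and variable names are illustrative):

% uint32 data written to a version 1.0 file is read back as int64.
T = table(uint32([1; 2; 3]), 'VariableNames', "Counts");
parquetwrite('legacy.parquet', T, 'Version', '1.0')
T2 = parquetread('legacy.parquet');
class(T2.Counts)   % returns 'int64'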

Limitations

In some cases, parquetwrite creates files that do not represent the original array T exactly. If you use parquetread or datastore to read the files, then the result might not have the same format or contents as the original table. For more information, see Apache Parquet Data Type Mappings.

Version History

Introduced in R2019a
