matlab.io.datastore.Partitionable Class
Namespace: matlab.io.datastore
Add parallelization support to datastore
Description
matlab.io.datastore.Partitionable is an abstract mixin class that adds parallelization support to your custom datastore for use with Parallel Computing Toolbox™ and MATLAB® Parallel Server™.
To use this mixin class, you must inherit from the matlab.io.datastore.Partitionable class, in addition to inheriting from the matlab.io.Datastore base class. Type the following syntax as the first line of your class definition file:
classdef MyDatastore < matlab.io.Datastore & ...
                       matlab.io.datastore.Partitionable
    ...
end
To add support for parallel processing to your custom datastore, you must:
- Inherit from the additional class matlab.io.datastore.Partitionable.
- Define these additional methods: maxpartitions and partition.
For more details and steps to create your custom datastore with parallel processing support, see Develop Custom Datastore.
Methods
maxpartitions | Maximum number of partitions possible |
numpartitions | Default number of partitions |
partition | Partition a datastore |
Examples
Build Datastore with Parallel Processing Support
Build a datastore with parallel processing support and use it to bring your custom or proprietary data into MATLAB®. Then, process the data in a parallel pool.
Create a .m class definition file that contains the code implementing your custom datastore. You must save this file in your working folder or in a folder that is on the MATLAB® path. The name of the .m file must be the same as the name of your object constructor function. For example, if you want your constructor function to have the name MyDatastorePar, then the name of the .m file must be MyDatastorePar.m. The .m class definition file must contain the following steps:
Step 1: Inherit from the datastore classes.
Step 2: Define the constructor and the required methods.
Step 3: Define your custom file reading function.
In addition to these steps, define any other properties or methods that you need to process and analyze your data.
%% STEP 1: INHERIT FROM DATASTORE CLASSES
classdef MyDatastorePar < matlab.io.Datastore & ...
        matlab.io.datastore.Partitionable

    properties(Access = private)
        CurrentFileIndex double
        FileSet matlab.io.datastore.DsFileSet
    end

    % Property to support saving, loading, and processing of
    % datastore on different file system machines or clusters.
    % In addition, define the methods get.AlternateFileSystemRoots()
    % and set.AlternateFileSystemRoots() in the methods section.
    properties(Dependent)
        AlternateFileSystemRoots
    end

    %% STEP 2: DEFINE THE CONSTRUCTOR AND THE REQUIRED METHODS
    methods
        % Define your datastore constructor
        function myds = MyDatastorePar(location,altRoots)
            myds.FileSet = matlab.io.datastore.DsFileSet(location,...
                'FileExtensions','.bin', ...
                'FileSplitSize',8*1024);
            myds.CurrentFileIndex = 1;

            if nargin == 2
                myds.AlternateFileSystemRoots = altRoots;
            end

            reset(myds);
        end

        % Define the hasdata method
        function tf = hasdata(myds)
            % Return true if more data is available
            tf = hasfile(myds.FileSet);
        end

        % Define the read method
        function [data,info] = read(myds)
            % Read data and information about the extracted data
            % See also: MyFileReader()
            if ~hasdata(myds)
                msgII = ['Use the reset method to reset the datastore ',...
                    'to the start of the data.'];
                msgIII = ['Before calling the read method, ',...
                    'check if data is available to read ',...
                    'by using the hasdata method.'];
                error('No more data to read.\n%s\n%s',msgII,msgIII);
            end

            fileInfoTbl = nextfile(myds.FileSet);
            data = MyFileReader(fileInfoTbl);
            info.Size = size(data);
            info.FileName = fileInfoTbl.FileName;
            info.Offset = fileInfoTbl.Offset;

            % Update CurrentFileIndex for tracking progress
            if fileInfoTbl.Offset + fileInfoTbl.SplitSize >= ...
                    fileInfoTbl.FileSize
                myds.CurrentFileIndex = myds.CurrentFileIndex + 1;
            end
        end

        % Define the reset method
        function reset(myds)
            % Reset to the start of the data
            reset(myds.FileSet);
            myds.CurrentFileIndex = 1;
        end

        % Define the partition method
        function subds = partition(myds,n,ii)
            subds = copy(myds);
            subds.FileSet = partition(myds.FileSet,n,ii);
            reset(subds);
        end

        % Getter for AlternateFileSystemRoots property
        function altRoots = get.AlternateFileSystemRoots(myds)
            altRoots = myds.FileSet.AlternateFileSystemRoots;
        end

        % Setter for AlternateFileSystemRoots property
        function set.AlternateFileSystemRoots(myds,altRoots)
            try
                % The DsFileSet object manages AlternateFileSystemRoots
                % for your datastore
                myds.FileSet.AlternateFileSystemRoots = altRoots;

                % Reset the datastore
                reset(myds);
            catch ME
                throw(ME);
            end
        end
    end

    methods (Hidden = true)
        % Define the progress method
        function frac = progress(myds)
            % Determine percentage of data read from datastore
            if hasdata(myds)
                frac = (myds.CurrentFileIndex-1)/...
                    myds.FileSet.NumFiles;
            else
                frac = 1;
            end
        end
    end

    methods(Access = protected)
        % If you use the FileSet property in the datastore,
        % then you must define the copyElement method. The
        % copyElement method allows methods such as readall
        % and preview to remain stateless
        function dscopy = copyElement(ds)
            dscopy = copyElement@matlab.mixin.Copyable(ds);
            dscopy.FileSet = copy(ds.FileSet);
        end

        % Define the maxpartitions method
        function n = maxpartitions(myds)
            n = maxpartitions(myds.FileSet);
        end
    end
end

%% STEP 3: IMPLEMENT YOUR CUSTOM FILE READING FUNCTION
function data = MyFileReader(fileInfoTbl)
    % create a reader object using FileName
    reader = matlab.io.datastore.DsFileReader(fileInfoTbl.FileName);

    % seek to the offset
    seek(reader,fileInfoTbl.Offset,'Origin','start-of-file');

    % read fileInfoTbl.SplitSize amount of data
    data = read(reader,fileInfoTbl.SplitSize);
end
Your custom datastore is now ready. Use your custom datastore to read and process the data in a parallel pool.
Read Data Using Custom Datastore And Process in Parallel Pool
Use a custom datastore to preview and read your proprietary data into MATLAB for parallel processing.
This example uses a simple data set to illustrate a workflow using your custom datastore. The data set is a collection of 15 binary (.bin) files where each file contains a column (1 variable) and 10000 rows (records) of unsigned integers.
dir('*.bin')
binary_data01.bin  binary_data02.bin  binary_data03.bin
binary_data04.bin  binary_data05.bin  binary_data06.bin
binary_data07.bin  binary_data08.bin  binary_data09.bin
binary_data10.bin  binary_data11.bin  binary_data12.bin
binary_data13.bin  binary_data14.bin  binary_data15.bin
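The sample files are not shipped with this page. As a rough sketch, a comparable data set could be generated as follows. The output folder `outDir`, the fixed random seed, and the random contents are assumptions made for illustration; only the file names, the file count, and the 10000 unsigned 8-bit values per file follow the description above.

```matlab
% Hypothetical helper script: write 15 binary files, each holding
% 10000 uint8 values (one byte per record).
outDir = tempname;      % assumed scratch folder, not part of the example
mkdir(outDir);
rng('default');         % fixed seed so the sketch is repeatable

for k = 1:15
    fileName = fullfile(outDir,sprintf('binary_data%02d.bin',k));
    fid = fopen(fileName,'w');
    if fid == -1
        error('Could not open %s for writing.',fileName);
    end
    fwrite(fid,randi([0 255],10000,1,'uint8'),'uint8');
    fclose(fid);
end

% List the generated files
listing = dir(fullfile(outDir,'*.bin'));
```

To follow the rest of the example with these files, change to `outDir` (or pass its path to the datastore constructor) before creating the datastore.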
Create a datastore object using the MyDatastorePar function. For implementation details of MyDatastorePar, see the example Build Datastore with Parallel Processing Support.
folder = fullfile('*.bin');
ds = MyDatastorePar(folder);
Preview the data from the datastore.
preview(ds)
ans = 8x1 uint8 column vector
113
180
251
91
29
66
254
214
Identify the number of partitions for your datastore. If you have Parallel Computing Toolbox (PCT), then you can use n = numpartitions(ds,myPool), where myPool is gcp or parpool.
n = numpartitions(ds);
Partition the datastore into n parts and process the parts on the workers in a parallel pool.
parfor ii = 1:n
    subds = partition(ds,n,ii);
    while hasdata(subds)
        data = read(subds);
        % do something
    end
end
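As one possible way to fill in the "do something" placeholder, the sketch below accumulates a partial sum per partition and combines the partial results after the loop. To stay self-contained it uses the built-in arrayDatastore (available in R2020b and later) as a stand-in for MyDatastorePar; the in-memory data and the variable names `partialSums` and `total` are assumptions made for illustration. Without Parallel Computing Toolbox, the parfor loop runs serially.

```matlab
% Stand-in data: 1000 uint8 values served in blocks of 100 rows.
% arrayDatastore replaces MyDatastorePar here only so the sketch
% runs without the sample files.
values = uint8(randi([0 255],1000,1));
ds = arrayDatastore(values,'OutputType','same','ReadSize',100);

n = numpartitions(ds);
partialSums = zeros(1,n);       % one result slot per partition

parfor ii = 1:n
    subds = partition(ds,n,ii);
    s = 0;
    while hasdata(subds)
        data = read(subds);
        s = s + sum(double(data(:)));   % per-partition work
    end
    partialSums(ii) = s;        % sliced output variable
end

% Combine the per-partition results
total = sum(partialSums);
```

Writing each partition's result into its own element of `partialSums` keeps the loop iterations independent, which parfor requires.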
Process Datastore on Different Platforms
To process your datastore with parallel and distributed computing that involves cloud or cluster machines on different platforms, you must predefine the 'AlternateFileSystemRoots' parameter. For instance, create a datastore on your local machine, and analyze a small portion of the data. Then, scale up your analysis to the entire dataset using Parallel Computing Toolbox and MATLAB Parallel Server.
Create a datastore using MyDatastorePar and assign a value to the 'AlternateFileSystemRoots' property. For implementation details of MyDatastorePar, see the example Build Datastore with Parallel Processing Support.
To set the value for the 'AlternateFileSystemRoots' property, identify the root paths for your data on the different platforms. The root paths differ based on the machine or file system. For instance, if you access your data using these root paths:
- "Z:\DataSet" from the Windows® machine.
- "/nfs-bldg001/DataSet" from the MATLAB Parallel Server Linux® cluster.
Then, associate these root paths using the AlternateFileSystemRoots property.
altRoots = ["Z:\DataSet","/nfs-bldg001/DataSet"];
ds = MyDatastorePar('Z:\DataSet',altRoots);
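To make the idea of equivalent root paths concrete, the sketch below maps one file path from the Windows root to the Linux root by hand. This is only a conceptual illustration and not a description of how the datastore resolves paths internally; the file name and the variables `winPath`, `relPath`, and `linuxPath` are hypothetical.

```matlab
% Conceptual illustration only: swap one root for its equivalent.
altRoots = ["Z:\DataSet","/nfs-bldg001/DataSet"];

% Hypothetical file as seen from the Windows machine
winPath = "Z:\DataSet\binary_data01.bin";

% Part of the path that follows the Windows root
relPath = extractAfter(winPath,strlength(altRoots(1)));

% Same file as seen from the Linux cluster
linuxPath = altRoots(2) + replace(relPath,"\","/");
disp(linuxPath)
```

Both roots in `altRoots` refer to the same data, so a location recorded under one root has a counterpart under the other.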
Analyze a small portion of the data on your local machine. For instance, get a partitioned subset of the data and clean the data by removing any missing entries. Then, examine a plot of the variables.
tt = tall(partition(ds,100,1));
summary(tt);
% analyze your data
tt = rmmissing(tt);
plot(tt.MyVar1,tt.MyVar2)
Scale up your analysis to the entire dataset by using a MATLAB Parallel Server cluster (Linux cluster). For instance, start a worker pool using the cluster profile, and then perform analysis on the entire dataset by using parallel and distributed computing capabilities.
parpool('MyMjsProfile')
tt = tall(ds);
summary(tt);
% analyze your data
tt = rmmissing(tt);
plot(tt.MyVar1,tt.MyVar2)
Tips
For your custom datastore implementation, best practice is not to implement the numpartitions method.
Version History
Introduced in R2017b
See Also
mapreduce | datastore | matlab.io.datastore.HadoopLocationBased | matlab.io.Datastore
Topics
- Develop Custom Datastore
- Tall Arrays for Out-of-Memory Data
- Partition a Datastore in Parallel (Parallel Computing Toolbox)