Finding problem row of data with TabularTextDatastore

16 views (last 30 days)
I'm using a tall array / datastore to process large csv file. Sometimes the input csv files have the header line repeated in the middle of the file. Using the following code:
clear;
ds = datastore('C:\temp\data.csv');
tt = tall(ds);
x = tt.Hours(2);
gather(x)
gives
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: 42% complete
Evaluation 20% complete
Error using matlab.io.datastore.TabularTextDatastore/readData (line 81)
Mismatch between file and format character vector.
Trouble reading 'Numeric' field from file (row number 2530, field number 2) ==> Hours;MID;ID1
Description;ID2;ID3;PN;ID4;Group;Model;Description;Level Number;Type;Occurrence Count
(Incremental);Timestamp;Location;City;State;Country;Software\n
Learn more about errors encountered during GATHER.
Error in matlab.io.datastore.TabularDatastore/read (line 120)
[t, info] = readData(ds);
Error in tall/gather (line 50)
[varargout{:}] = iGather(varargin{:});
Error in tallArray (line 5)
gather(x)
Caused by:
Reading the variable name 'Hours' using format '%f' from file: 'C:\temp\data.csv' starting at offset 363378899.
The errant line was on row (including header) 1568509 of 3734377. I was only able to find out errant row number out by hunting and pecking, i.e. deleting/re-adding rows and trying again.
My question is: How can I find the errant row more easily? Is there a way to use the "row number" given (2530), the given offset (363378899) to figure out where in the large csv the error is (1568509) ?
additional info:
ds =
TabularTextDatastore with properties:
Files: {
'C:\temp\data.csv'
}
FileEncoding: 'UTF-8'
AlternateFileSystemRoots: {}
ReadVariableNames: true
VariableNames: {'SerialNumber', 'Hours', 'MID' ... and 17 more}
Text Format Properties:
NumHeaderLines: 0
Delimiter: ';'
RowDelimiter: '\r\n'
TreatAsMissing: ''
MissingValue: NaN
Advanced Text Format Properties:
TextscanFormats: {'%q', '%f', '%f' ... and 17 more}
TextType: 'char'
ExponentCharacters: 'eEdD'
CommentStyle: ''
Whitespace: ' \b\t'
MultipleDelimitersAsOne: false
Properties that control the table returned by preview, read, readall:
SelectedVariableNames: {'SerialNumber', 'Hours', 'MID' ... and 17 more}
SelectedFormats: {'%q', '%f', '%f' ... and 17 more}
ReadSize: 20000 rows
  1 Comment
Gunes ERTUNC
Gunes ERTUNC on 5 Feb 2023
Catching possible missing entries via 'TreatAsMissing' helped for me. You can try as follows.
ds = datastore('C:\temp\data.csv','TreatAsMissing',{'-','NA','');

Sign in to comment.

Answers (0)

Categories

Find more on Large Files and Big Data in Help Center and File Exchange

Products


Release

R2018a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!