read ASCII data with varying lengths of space as delimiters

18 views (last 30 days)
I am having trouble in reading a structured dataset. The columns are separeted by spaces which have different lengths. Besides the varying lengths of space for delimiters, the column widths are varying as well.
I have tried all kinds of appraoches and functions, and searached online, but couldnot find a solution.
Can someone help me? Please.
A subset from a huge dataset is attached.
  1 Comment
dpb
dpb on 3 Apr 2020
Need clear definition of the format -- does look like fixed-width but not absolutely positive about whether the apparent missing fields are defined or skipped.
One presumes somewhere whoever created these files also built a toolset to read them and published a definitive description.
I'd look for the first first rather than reinvent the wheel and lacking that w/o the second it's going to be a lot of detective work.

Sign in to comment.

Answers (1)

per isakson
per isakson on 3 Apr 2020
Edited: per isakson on 4 Apr 2020
subset.txt is a fixed-width text file. No doubt about that.
Since R2018b (R2017a?) it is straight forward to read fixed-width text file with readtable(). Before that it was a bit tricky to read fixed-width text files. There are a number of threads here in Answers on the topic.
The documentation includes a good description on how to read fixed-width text files. See fixedWidthImportOptions, Import options object for fixed-width text files.
IMO: The error prone part is to determine the widths of the columns. Here is a start.
  • copy&paste the column header line to the command window
  • use regexp() to determine the starting position of each column. The column headers start with a letter (the first one is special).
>> chr = '# DTG LOCATION NAME ...'
>> cac = regexp( chr, '\<\w', 'start' );
>> cac
cac =
3 22 42 90 110 130 150 170 190 210 230 250 270 290 310 330 350
  1 Comment
dpb
dpb on 4 Apr 2020
Edited: dpb on 4 Apr 2020
Is there only one header line? Maybe the browser is folding the long line that confused the issue...ah! Indeed it is. That makes the file look much more regular than with folded lines. Indeed with the comment character to skip header lines presuming don't need them (altho then the column header record will be skipped too, probably; I've never tested that).
Well, now have downloaded, let's see just how smart TMW was... :)
>> opt=detectImportOptions('subset.txt','FileType','fixedwidth','CommentStyle','#',"NumHeaderLines",20,'ReadVariableNames',1)
opt =
FixedWidthImportOptions with properties:
Format Properties:
Whitespace: '\b\t '
LineEnding: {'\n' '\r' '\r\n'}
CommentStyle: {'#'}
EmptyLineRule: 'skip'
Encoding: 'windows-1252'
Replacement Properties:
MissingRule: 'fill'
ImportErrorRule: 'fill'
ExtraColumnsRule: 'addvars'
PartialFieldRule: 'keep'
Variable Import Properties: Set types by name using setvartype
VariableNames: {'x_DTG', 'Var2', 'LOCATIO' ... and 12 more}
VariableTypes: {'datetime', 'duration', 'char' ... and 12 more}
VariableWidths: [10.00 9.00 9.00 21.00 4.00 8.00 2.00 35.00 19.00 16.00 99.00 20.00 20.00 40.00 55.00]
SelectedVariableNames: {'x_DTG', 'Var2', 'LOCATIO' ... and 12 more}
VariableOptions: Show all 15 VariableOptions
Access VariableOptions sub-properties using setvaropts/getvaropts
PreserveVariableNames: false
Location Properties:
DataLines: [22.00 Inf]
VariableNamesLine: 21.00
RowNamesColumn: 0
VariableUnitsLine: 0
VariableDescriptionsLine: 0
To display a preview of the table, use preview
>> opt.VariableNames
ans =
1×15 cell array
Columns 1 through 10
{'x_DTG'} {'Var2'} {'LOCATIO'} {'NNAME'} {'Var5'} {'Var6'} {'Var7'} {'LATITUDE'} {'LONGITUD'} {'EALTI'}
Columns 11 through 15
{'TUDEU_BOOL_10T_…'} {'EWP_SEA_10T_D'} {'RYB_SEA_10TN_'} {'DRYB_10T_WETB_1…'} {'DRYB_10U_10U_SE…'}
>>
Well, kinda' sorta' made a stab at it...will have to physically set the column widths and number variables -- it finds the time string separate from the date so inserts Var2 for it, then thinks the LOCATION column ends at the end of the shorter string data instead of the longer so it truncates the name as LOCATIO and includes the last N in the next column. But one gets an object to play with for starters.
Good trick for automagically guessing column widths, Per, will have to remember that going forward.
OP will still need to either account for the time column independently or to set the width to encompass both and read as string 'cuz can't handle the blank field on input parsing in readtable to convert to datetime.

Sign in to comment.

Categories

Find more on Cell Arrays in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!