detectImportOptions locates the wrong header line

Dave on 12 Jul 2022
Edited: dpb on 12 Jul 2022
I am handling fixed-width text files with optional comment lines at the beginning, so the line number of the column headers is not known in advance. When the last datum on the first data line is missing, detectImportOptions identifies the wrong header line:
detectImportOptions('data4.txt', ...
    'FileType', 'fixedwidth', ...
    'ReadVariableNames', true)
This file works okay:
Name City    Lat   Lon     Elev
den  Denver  39.74 -104.99 5280
chi  Chicago 41.88  -87.71  597
atl  Atlanta 33.75  -84.36  820
Remove the last item on line 2, and detectImportOptions locates the wrong header:
Name City    Lat   Lon     Elev
den  Denver  39.74 -104.99
chi  Chicago 41.88  -87.71  597
atl  Atlanta 33.75  -84.36  820
Selected differences between the results, from diff:
< VariableNames: {'Name', 'City', 'Lat' ... and 2 more}
> VariableNames: {'chi', 'Chicago', 'x41_88' ... and 2 more}
< DataLines: [2 Inf]
> DataLines: [4 Inf]
< VariableNamesLine: 1
> VariableNamesLine: 3
My local workaround will be to always prevent a missing value at the end of the first data line. However, it would be nice if the heuristics in detectImportOptions would find the correct line number in scenarios like this.
Is this a legal scenario? Is this a bug in MATLAB?
  1 Comment
dpb on 12 Jul 2022
Badly formed data files are not indicative of bugs -- the problem is in the data file as you've already acknowledged.
I believe your expectations are unrealistic -- there's really no way programmatically to determine that record is malformed and not something else.
You don't show a file with a comment line -- does it consist of a comment character in the first column? That information MIGHT help.
Probably the most foolproof way would be to create the files as delimited and ensure every real record has the correct number of delimiters when the file is created. You may need a preprocessing step to add that robustness, or maybe the original process can be modified.
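One way the comment-character idea might look in practice is to locate the header line yourself and then override whatever detectImportOptions guessed. This is only a sketch under stated assumptions: the '#' comment character, the demo file name, and the column spacing are all hypothetical (the thread does not show them), the header is taken to be the first non-comment, non-blank line, and readlines requires R2020b or later. It also assumes the detected column widths are still usable, so it is worth checking opts.VariableWidths before trusting the result.

```matlab
% Write a small reproducer to a temporary file.
% ASSUMPTION: '#' marks comment lines; the thread does not name the character.
filename = fullfile(tempdir, 'data4_demo.txt');   % hypothetical demo file
fid = fopen(filename, 'w');
fprintf(fid, '%s\n', ...
    '# station list', ...
    '# units: degrees, feet', ...
    'Name City    Lat   Lon     Elev', ...
    'den  Denver  39.74 -104.99', ...             % last value missing
    'chi  Chicago 41.88  -87.71  597', ...
    'atl  Atlanta 33.75  -84.36  820');
fclose(fid);

% Find the first line that is neither a comment nor blank.
lines = readlines(filename);                      % one string per line
trimmed = strip(lines);
isSkippable = startsWith(trimmed, "#") | strlength(trimmed) == 0;
headerLine = find(~isSkippable, 1, 'first');

% Detect the layout, then override the header and data locations.
opts = detectImportOptions(filename, 'FileType', 'fixedwidth');
opts.VariableNamesLine = headerLine;
opts.DataLines = [headerLine + 1, Inf];

% Ask readtable to take the names from VariableNamesLine rather than
% from the names stored in opts during detection.
T = readtable(filename, opts, 'ReadVariableNames', true);
```

The point of the design is that the header position comes from a rule you control rather than from the detection heuristics, so a short first data line no longer changes which line is treated as the header.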


Answers (1)

Dave on 12 Jul 2022
Edited: Dave on 12 Jul 2022
@dpb, thank you for your insights. I left out comment lines to make the reproducer as small as possible. The header problem occurs the same way, with or without the comment lines.
MATLAB claims some ability to handle messy text files. I agree my scenario is pushing the limit of reasonable expectation. I am asking whether the heuristics inside detectImportOptions are working as intended, or whether they could be made smarter.
  1 Comment
dpb on 12 Jul 2022
Edited: dpb on 12 Jul 2022
The heuristics can always be made "smarter" at the cost of performance; there is a conscious decision(*) to also not be excessively time consuming for the normal case.
And, of course, "smarter" comes at another cost of the likelihood of introducing what I'll call Type II error -- wrongfully classifying non-data as data.
In this case, how is it supposed to tell whether there is simply a missing value or whether the line should be ignored because it doesn't match the bulk of the rest of the file? Only you know that, from your own expectations of the file...
If you stuff that record down somewhere in the middle, then the odds are it'll detect and let you set a missing value or not import the record, your choice.
You can always use the example and submit it as an enhancement request and see what TMW thinks of the possibilities of catching it...
(*) It's not in the formal documentation, but I recall the comment having been made by a TMW employee in a discussion here about how the file probing has evolved. That discussion was mostly centered on the behind-the-scenes probing done by the readXXX routines, which don't do the whole in-depth analysis that detectImportOptions does but which is an overhead in every use of those routines. One deliberately gives up some performance when doing the full probe; that is expected to take some time, but one still doesn't want it to become an inordinate delay.
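For the case where the short record sits in the middle of the file, the choice between filling a missing value and dropping the record can be sketched with the MissingRule property of the import options object. This is a sketch, not a statement of what detection will do on any given file: the demo file name and column spacing are hypothetical, and whether a line that ends before its last field is treated as missing should be confirmed on your own release.

```matlab
% Hypothetical demo file with the short record in the middle.
filename = fullfile(tempdir, 'data4_middle_demo.txt');
fid = fopen(filename, 'w');
fprintf(fid, '%s\n', ...
    'Name City    Lat   Lon     Elev', ...
    'den  Denver  39.74 -104.99 5280', ...
    'chi  Chicago 41.88  -87.71', ...             % last value missing
    'atl  Atlanta 33.75  -84.36  820');
fclose(fid);

opts = detectImportOptions(filename, 'FileType', 'fixedwidth');

% Option 1: keep the record and fill the missing field
% (the fill value depends on the variable type).
opts.MissingRule = 'fill';
Tfill = readtable(filename, opts);

% Option 2: drop any record that contains a missing field.
opts.MissingRule = 'omitrow';
Tomit = readtable(filename, opts);
```

Comparing height(Tfill) with height(Tomit) shows whether the short record was actually recognized as having a missing value.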


Release

R2021a