Reading in ascii files with white space as delimiter.
I am trying to read in a very simple ascii file that looks like the following:
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
hPa m C C % g/kg deg knot K K K
-----------------------------------------------------------------------------
994.0 270 7.0 6.0 93 5.93 40 10 280.6 297.1 281.6
989.0 312 6.2 5.2 93 5.64 42 12 280.2 295.9 281.2
972.0 455 4.8 4.0 95 5.27 48 18 280.2 294.9 281.1
...
There seem to be a dozen functions that I can read this in with but I'm struggling with all of them.
The simplest seems to be dlmread. I'm currently using the command:
M = dlmread('radiosonde.ascii',' ',3,1)
However this seems to register a single space as the delimiter instead of all the white space. If I use:
M = dlmread('radiosonde.ascii')
It registers the white space as the delimiter, but I cannot tell it to skip the header lines. Is there some way to specify white space as the delimiter while also ignoring the headers?
Is there a better way to do this? Why hasn't Mathworks streamlined reading text files to be one universal function?
Accepted Answer
Star Strider
on 9 Nov 2015
Edited: Star Strider
on 9 Nov 2015
The dlmread function digests only numeric data so it will have problems with the strings.
I would use the textscan function:
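fidi = fopen('radiosonde.ascii','rt');
D = textscan(fidi, repmat('%f',1,11), 'Delimiter',' ', 'MultipleDelimsAsOne',true, 'HeaderLines',3, 'CollectOutput',true);
fclose(fidi);   % release the file handle once the data are read
M = D{1};       % 'CollectOutput' gathers the 11 numeric columns into one matrix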
You might need other name-value pair arguments, but this should get you started. The repmat call creates the input format string for the numerical data.
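For anyone who wants to try the call without the original data, here is a minimal self-contained sketch. It assumes a sample file with the same three header lines and 11 numeric columns; the temporary file name and the variables fname, hdr and M are illustrative, not part of the original answer.

```matlab
% Write a small sample file (illustrative rows copied from the question).
fname = [tempname '.txt'];
hdr = { ...
    'PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV', ...
    'hPa    m      C      C      %      g/kg   deg    knot   K      K      K', ...
    repmat('-', 1, 77), ...
    '994.0  270    7.0    6.0    93     5.93   40     10     280.6  297.1  281.6', ...
    '989.0  312    6.2    5.2    93     5.64   42     12     280.2  295.9  281.2'};
fid = fopen(fname, 'wt');
fprintf(fid, '%s\n', hdr{:});    % one line per cell entry
fclose(fid);

% Read it back: skip 3 header lines, treat runs of blanks as one delimiter.
fidi = fopen(fname, 'rt');
D = textscan(fidi, repmat('%f',1,11), 'Delimiter',' ', ...
    'MultipleDelimsAsOne',true, 'HeaderLines',3, 'CollectOutput',true);
fclose(fidi);
delete(fname);

M = D{1};    % numeric matrix, one row per data line
```

With 'CollectOutput' set, textscan returns a single cell holding the whole numeric block, so D{1} is the matrix you would otherwise have expected from dlmread.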
2 Comments
dpb
on 13 Nov 2015
Edited: dpb
on 13 Nov 2015
Actually, as noted in the follow-up in the other thread on the same subject, http://www.mathworks.com/matlabcentral/answers/253939-reading-in-ascii-files-with-white-space-as-delimiter#comment_322672, if one uses an empty format string, textscan apparently counts the fields per record internally and automagically returns the right shape (at least for regular files such as this). Also, neither an explicit 'Delimiter' nor the 'MultipleDelimsAsOne' option is needed for the default white space. From the doc: "White space can be any combination of space (' '), backspace ('\b'), or tab ('\t') characters. If you do not specify a delimiter, textscan interprets repeated white-space characters as a single delimiter."
Consequently, all that's really needed for this specific file is
D = textscan(fid, '', 'HeaderLines',3, 'CollectOutput',true);
if you're satisfied with the cell array returned (which is why I almost always wrap the textscan call inside cell2mat or use textread instead).
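A minimal sketch of the cell2mat wrapping, assuming the empty-format-string behavior described above holds in your release (it is reported here as undocumented, so treat it as an assumption). The sample file, fname and rows are illustrative only.

```matlab
% Write a small sample file with three header lines (illustrative data).
fname = [tempname '.txt'];
rows = { ...
    'PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV', ...
    'hPa    m      C      C      %      g/kg   deg    knot   K      K      K', ...
    repmat('-', 1, 77), ...
    '994.0  270    7.0    6.0    93     5.93   40     10     280.6  297.1  281.6', ...
    '989.0  312    6.2    5.2    93     5.64   42     12     280.2  295.9  281.2'};
fid = fopen(fname, 'wt');
fprintf(fid, '%s\n', rows{:});
fclose(fid);

% Empty format string: textscan works out the number of columns itself.
% cell2mat unwraps the single cell that 'CollectOutput' produces.
fid = fopen(fname, 'rt');
M = cell2mat(textscan(fid, '', 'HeaderLines',3, 'CollectOutput',true));
fclose(fid);
delete(fname);
```

The result is an ordinary double matrix, so no D{1} indexing is needed afterwards.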
I spent some time this morning following up on my observation from yesterday and so far I find the behavior with an empty format string mentioned nowhere in the documentation. It's a key piece of knowledge that can help a bunch but isn't made known.
PS. I followed up from the result of the other thread by submitting an enhancement request for dlmread and friends and provided the suggested patch that doesn't change the interface at all. That it will get accepted I have little hope, but it is at least on TMW radar. I just provided the Tech Support rep who responded to the request with the observation here that the "feature" appears undocumented regarding the behavior with empty formatting string; also whether that'll make it into future doc remains to be seen. It seems to be a supported behavior given TMW relies on it in dlmread (and csvread is simply a wrapper around dlmread) and likely elsewhere.
dpb
on 13 Nov 2015
Edited: dpb
on 13 Nov 2015
"The dlmread function digests only numeric data so it will have problems with the strings."
It's not documented to work, correct, but it's also not guar-on-teed to fail...
>> type jr.txt
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
hPa m C C % g/kg deg knot K K K
-----------------------------------------------------------------------------
994.0 270 7.0 6.0 93 5.93 40 10 280.6 297.1 281.6
989.0 312 6.2 5.2 93 5.64 42 12 280.2 295.9 281.2
972.0 455 4.8 4.0 95 5.27 48 18 280.2 294.9 281.1
>> dlmread('jr.txt',' ',3,1) % explicit blank delimiter
ans =
Columns 1 through 9
0 994.0000 0 0 0 270.0000 0 0 0
0 989.0000 0 0 0 312.0000 0 0 0
0 972.0000 0 0 0 455.0000 0 0 0
...
gets the data but with the problem of the blank fields. That wouldn't be so bad if it filled with NaN instead of zero; then one would at least have a reasonable shot at removing every column that is entirely NaN and getting the desired result.
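To illustrate the NaN idea: the sketch below assumes a hypothetical reader that fills blank fields with NaN (dlmread does not; it fills with zero, as shown above). The matrix A is made up for the example.

```matlab
% Hypothetical result of a reader that infills blank fields with NaN.
A = [NaN 994.0 NaN NaN 270 NaN 7.0; ...
     NaN 989.0 NaN NaN 312 NaN 6.2; ...
     NaN 972.0 NaN NaN 455 NaN 4.8];

emptyCols = all(isnan(A), 1);   % true where a column holds nothing but NaN
M = A(:, ~emptyCols);           % keep only the columns that carry data
```

The same trick is unsafe with zero infill, because a column of genuine zeros would be indistinguishable from a blank one.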
The text header didn't seem to cause any problem; apparently because there are no embedded blanks or other oddities in the first header line so the count of possible delimiters isn't fouled up. Again, it's a bonus it works and shouldn't be relied on in general but worth noting the behavior methinks.
I went ahead here and did the unrecommended thing of making a patch in the TMW-supplied dlmread function as it seems harmless at worst and beneficial in general and I'm willing to accept that if it breaks it's my fault...
>> dlmread('jr.txt',[],3,0) % the modified version
s =
Columns 1 through 9
994.0000 270.0000 7.0000 6.0000 93.0000 5.9300 40.0000 10.0000 280.6000
989.0000 312.0000 6.2000 5.2000 93.0000 5.6400 42.0000 12.0000 280.2000
972.0000 455.0000 4.8000 4.0000 95.0000 5.2700 48.0000 18.0000 280.2000
...
>>
And we see "magic has occurred"...
Looks to me like dlmread needs a redesign/rewrite -- why shouldn't the offset row count be used in the preliminary stage of delimiter auto-detection? That seems only reasonable that if one asks to ignore an area of the file to do so. The change from the traditional interface to named parameters here would be a.good.thing(tm)