Modify then write data in the given format

Hi, I will start with a brief overview of what I am trying to achieve in this code.
I have a sample piece of data below that I want to manipulate. It stores the xyz coordinates and velocities of atoms in solution.
1SOL OW 1 4.309 5.254 4.135 -0.2790 0.3440 0.2064
1SOL HW1 2 4.314 5.169 4.082 -1.5406 0.3918 -0.0293
1SOL HW2 3 4.388 5.312 4.114 -1.3375 0.9272 -2.6151
2SOL OW 4 1.743 1.687 2.366 0.2136 0.2777 0.3181
2SOL HW1 5 1.818 1.750 2.387 0.3115 0.1542 0.3431
4502OCTA H13545 2.108 5.326 1.045 -1.2169 0.4890 -2.6144
4502OCTA H13546 2.068 5.492 1.036 0.7609 0.6650 0.8612
4502OCTA H13547 2.285 5.388 1.207 3.0144 2.5562 1.0920
4502OCTA H13548 2.121 5.425 1.265 -1.2460 -1.3635 1.4829
4502OCTA Oc13549 2.131 5.677 1.238 -0.0183 -0.0221 -1.0402
4502OCTA Oh13550 2.353 5.635 1.208 -0.6036 0.2241 -0.8140
4502OCTA H13551 2.383 5.198 0.399 0.4893 0.7154 -0.9915
4502OCTA Ho13552 2.413 5.565 1.189 -0.4685 -0.0421 -2.1107
What I need to do is add a specific value, for instance add 1 to the numbers in the columns 4 and 5 and rows 3 to 8. Basically, I want to translate the positions of certain atoms. It is important that I keep the original file's format.
There were two challenging aspects to this code. The first is that the second and third columns merge when the numbers in the third column (atom IDs) reach five digits. I've worked around that, albeit not elegantly.
The second issue which I haven't been able to solve is how to write the new coordinates into a file. Matlab ignores the empty spaces before each row begins, and ignores the spaces in between the columns, and ignores the spaces after the rows end. I've tried using horzcat, mat2str, strcat, and maybe some others without success. I will leave my code below for you to examine.
function output = Gro_editor(filename, x1, y1, y2,addval)
%Initialize Values%
filend = 0;
fin = 0;
data = cell(15,22);
Xpos = 1; %X-coordinate
Ypos = 1; %Y-coordinate
%Open and get permission to write target file%
fid = fopen(filename,'r');
if fid == -1
disp('Could not open file');
else
disp('File Open...');
%Index the entire file with meaningful partitions.
%First 15 are indexed individually, following seven are read as blocks.
while filend == 0
if Xpos < 16
data(Ypos,Xpos) = {cellstr(fscanf(fid,'%c',1))};
elseif Xpos > 15
data(Ypos,Xpos) = {fscanf(fid,'%f',1)};
end
if Xpos == 22
Xpos = 1;
Ypos = Ypos + 1;
fscanf(fid,'%c',2);
elseif Xpos < 22
Xpos = Xpos + 1;
end
if Ypos == 20
filend = 1;
end
end
closeresult = fclose(fid);
if closeresult == 0
disp('File closed')
else
disp('File close unsuccessful')
end
%Writing new cell array
while fin == 0
data{y1,x1} = data{y1,x1}+addval
if y1 < y2
y1 = y1+1;
elseif y1 == y2
fin = 1;
end
end
%So far so good. I don't know what I am doing after here.%
----------------------------------------------------------
output=data;
fid = fopen('output.txt','w');
[nrows,ncols] = size(data);
formatSpec = '%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c\n';
for row = 1:nrows
for col = 1:ncols
line = horzcat(data(nrows,1:15))
fprintf(fid,formatSpec,line(col));
end
end
fclose(fid);
type output.txt
end

1 Comment

dpb on 13 Mar 2015
Edited: dpb on 15 Mar 2015
It is a pita to read fixed-width nondelimited data with Matlab owing to its C i/o formatting--the idea of having such was apparently overlooked and it is, simply put, impossible w/o counting columns if the fields are ever actually full.
That said, not sure where, precisely, your output formatting issues are arising but to write fixed column fields use a counted field of the proper width for both the string and the numeric fields and you can control that however is desired.
Provide a precise definition of the first two string columns' content (and spacing within the column if that is actually also significant) and I'm sure we can write an appropriate format string.
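As a rough illustration of the counted-field idea above, the sketch below writes one record with an explicit width on every field. The widths (5, 5, 5, 5, then three 8.3 and three 8.4 fields) are an assumption based on the usual .gro layout, not something confirmed by the OP, so adjust them to the real file.

```matlab
% Assumed field layout (hypothetical, adjust to the real file):
% resnum(5) resname(5, left-justified) atomname(5) atomnum(5) xyz(3 x 8.3) vel(3 x 8.4)
fmt = '%5d%-5s%5s%5d%8.3f%8.3f%8.3f%8.4f%8.4f%8.4f';
rec = sprintf(fmt, 1, 'SOL', 'OW', 1, 4.309, 5.254, 4.135, -0.2790, 0.3440, 0.2064);
disp(['[' rec ']'])      % brackets make leading and trailing blanks visible
disp(numel(rec))         % record length is fixed by the widths in fmt
```

Because every conversion carries a width, the record length depends only on the format string, as long as no value overflows its field.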


 Accepted Answer

per isakson on 15 Mar 2015
Edited: per isakson on 15 Mar 2015
The documentation of textscan doesn't cover fixed-width very well. However, textscan has "undocumented"/hidden capabilities.
Approach
  • Read the first three columns to one string, since they shall only be copied to the output file.
  • Read the following six columns to a double array.
  • Add 1 to the prescribed elements of the array
  • Use the same format string to write the data (don't forget new-line)
Run example code (I use R2013b)
fixed_width_format(13)
where
function fixed_width_format( N )
fid = fopen( 'fixed_width_format.txt' );
format_spec = '%20s%8.3f%8.3f%8.3f%8.4f%8.4f%8.4f';
cac = textscan( fid, format_spec, N ...
, 'Whitespace' , '' ...
, 'Delimiter' , '' ...
, 'CollectOutput' , true );
fclose( fid );
RowHead = cac{1};
Data = cac{2};
% add 1 to the numbers in the columns 4 and 5 and rows 3 to 8.
Data( 3:8, 4:5 ) = Data( 3:8, 4:5 ) + 1;
fid = fopen( 'fixed_width_format_out.txt', 'w' );
for rr = 1 : N
fprintf( fid, [format_spec,'\n'], RowHead{rr}, Data(rr,:) );
end
fclose( fid );
end
and where fixed_width_format.txt contains
1SOL OW 1 4.309 5.254 4.135 -0.2790 0.3440 0.2064
1SOL HW1 2 4.314 5.169 4.082 -1.5406 0.3918 -0.0293
1SOL HW2 3 4.388 5.312 4.114 -1.3375 0.9272 -2.6151
2SOL OW 4 1.743 1.687 2.366 0.2136 0.2777 0.3181
2SOL HW1 5 1.818 1.750 2.387 0.3115 0.1542 0.3431
4502OCTA H13545 2.108 5.326 1.045 -1.2169 0.4890 -2.6144
4502OCTA H13546 2.068 5.492 1.036 0.7609 0.6650 0.8612
4502OCTA H13547 2.285 5.388 1.207 3.0144 2.5562 1.0920
4502OCTA H13548 2.121 5.425 1.265 -1.2460 -1.3635 1.4829
4502OCTA Oc13549 2.131 5.677 1.238 -0.0183 -0.0221 -1.0402
4502OCTA Oh13550 2.353 5.635 1.208 -0.6036 0.2241 -0.8140
4502OCTA H13551 2.383 5.198 0.399 0.4893 0.7154 -0.9915
4502OCTA Ho13552 2.413 5.565 1.189 -0.4685 -0.0421 -2.1107
--- 0---|--- 10---|--- 20---|--- 30---|--- 40---|--- 50---|--- 60---
123456789|123456789|123456789|123456789|123456789|123456789|123456789
'%20s%8.3f%8.3f%8.3f%8.4f%8.4f%8.4f'
and where fixed_width_format_out.txt contains
1SOL OW 1 4.309 5.254 4.135 -0.2790 0.3440 0.2064
1SOL HW1 2 4.314 5.169 4.082 -1.5406 0.3918 -0.0293
1SOL HW2 3 4.388 5.312 4.114 -0.3375 1.9272 -2.6151
2SOL OW 4 1.743 1.687 2.366 1.2136 1.2777 0.3181
2SOL HW1 5 1.818 1.750 2.387 1.3115 1.1542 0.3431
4502OCTA H13545 2.108 5.326 1.045 -0.2169 1.4890 -2.6144
4502OCTA H13546 2.068 5.492 1.036 1.7609 1.6650 0.8612
4502OCTA H13547 2.285 5.388 1.207 4.0144 3.5562 1.0920
4502OCTA H13548 2.121 5.425 1.265 -1.2460 -1.3635 1.4829
4502OCTA Oc13549 2.131 5.677 1.238 -0.0183 -0.0221 -1.0402
4502OCTA Oh13550 2.353 5.635 1.208 -0.6036 0.2241 -0.8140
4502OCTA H13551 2.383 5.198 0.399 0.4893 0.7154 -0.9915
4502OCTA Ho13552 2.413 5.565 1.189 -0.4685 -0.0421 -2.1107
Finally, does this example rely on undocumented features of textscan?
Addendum triggered by comment
>> data = fixed_width_format(7);
>> data(1:5,:)
ans =
4.3090 5.2540 4.1350 -0.2790 0.3440 0.2064
4.3140 5.1690 4.0820 -1.5406 0.3918 -0.0293
4.3880 5.3120 4.1140 -0.3375 1.9272 -2.6151
1.7430 1.6870 2.3660 1.2136 1.2777 0.3181
1.8180 1.7500 2.3870 1.3115 1.1542 0.3431
>> data(6:7,:)
ans =
0 111 111111 1222222 2222333 3333333
123456 789012 345678 9012345 6789012 3456789
>>
where
function data = fixed_width_format(N)
fid = fopen( 'fixed_width_format_dpb.txt' );
format_spec = '%6f%6f%6f%7f%7f%7f';
cac = textscan( fid, format_spec, N ...
, 'Whitespace' , '' ...
, 'Delimiter' , '' ...
, 'CollectOutput' , true );
fclose( fid );
data = cac{1};
end
and where fixed_width_format_dpb.txt contains
4.309 5.254 4.135-0.2790 0.3440 0.2064
4.314 5.169 4.082-1.5406 0.3918-0.0293
4.388 5.312 4.114-0.3375 1.9272-2.6151
1.743 1.687 2.366 1.2136 1.2777 0.3181
1.818 1.750 2.387 1.3115 1.1542 0.3431
000000000111111111122222222223333333333
123456789012345678901234567890123456789

27 Comments

There is a mistake. Replace
Data( 3:8, 4:5 ) = Data( 3:8, 4:5 ) + 1;
by
Data( 3:8, (4:5)-3 ) = Data( 3:8, (4:5)-3 ) + 1;
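The offset is needed because the first three file columns are swallowed by '%20s', so file columns 4 and 5 (x and y) land in columns 1 and 2 of Data. A small sketch with a dummy array shows the mapping; the names n_head_cols and file_cols are made up for illustration.

```matlab
n_head_cols = 3;                 % file columns captured by '%20s'
file_cols   = 4:5;               % columns to shift, counted as in the file
Data        = zeros(8, 6);       % stand-in for the numeric block
data_cols   = file_cols - n_head_cols;
Data(3:8, data_cols) = Data(3:8, data_cols) + 1;
disp(Data)
```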
I don't see any undocumented behavior used; fixed width parsing works in this case because the fields aren't full. If the data array section were to look like
4.309 5.254 4.135-0.2790 0.3440 0.2064
4.314 5.169 4.082-1.5406 0.3918-0.0293
4.388 5.312 4.114-0.3375 1.9272-2.6151
1.743 1.687 2.366 1.2136 1.2777 0.3181
1.818 1.750 2.387 1.3115 1.1542 0.3431
000000000111111111122222222223333333333
123456789012345678901234567890123456789
you'd have a real problem as a format string of
[repmat('%6.3f',1,3) repmat('%7.4f',1,3)]
will fail on the locations where the negative sign of the subsequent value abuts directly against the end of the preceding column.
Given the OP's file structure, though, good thinking to simply treat the character data as a block; unless the OP needs to munge it, too (as I thought was wanted), it should be "good to go"...
Somewhere some months ago there's a thread going into the problems in some depth, where I demonstrate the failures of all the Matlab scanning input functions for such files (which are common in at least older Fortran, are still around, and can be parsed with no issues at all in Fortran, where "a column means a column").
per isakson on 15 Mar 2015
Edited: per isakson on 15 Mar 2015
  • It's possible to read your example with textscan and a similar format specifier - see my addendum. Or are we speaking past each other?
  • Can we be sure that "undocumented" behavior of textscan doesn't change between releases?
  • "Somewhere some months ago" might be this thread: "read ascii non-delimited file".
  • "older Fortran and are still around": that's true, and most large "csv-files" contain fixed-width columns. A long time ago, reading fixed-format text files was much faster than reading free-format. Is that still the case?
  • Rows with leading spaces cause problems. '%s' handles them, but '%f' doesn't.
I would like a mex-file, which reads simple fixed-width "csv-files" the old "fortran way".
OK, I'd forgotten the issue slightly, Per. The problematic file that was the crux of that thread was one that includes one or more blank fields...parsing does work correctly for the above file, which is complete.
I'm presuming this still is so in latest release, R2012b is last I have here? I don't think it's only leading columns but anywhere within the data array the blanks will get "eaten" by the definition C i/o uses for fields irrespective of count.
I have no timing data on list-directed i/o vis-a-vis explicit formatting as a general rule in Fortran. One would guess there's more overhead involved in list-directed input, so explicit formatting would be at least somewhat faster in a direct comparison on the same file, but I suspect it's highly compiler-dependent and, even more so, file-dependent, so a blanket "much faster" is a conclusion I'd definitely be reluctant to draw. I've not done any comparative timings, but it stands to reason that textscan is probably not as quick as fscanf or even textread, given the extra stuff it does/can do in comparison. Probably an even fairer comparison would be to importdata, though, since that has to try to figure out what it's looking at on the fly.
I still wish one of two things had occurred...
1) C authors had adopted the well-established FORTRAN FORMAT rules for field definitions so we had the ease of repeat counts and fixed-width fields were handled automagically, or
2) TMW had stayed with the very earliest roots of Matlab and its earliest heritage with FORTRAN and retained that part of Fortran for its i/o. They kept the internal storage order; why not FORMAT?
dpb on 16 Mar 2015
Edited: dpb on 17 Mar 2015
Since the topics are so disparate I make a separate comment on your note
"Can we be sure that "undocumented" behavior of textscan doesn't change between releases?"
I'm not sure which feature you think is undocumented, Per? I've had an argu^h^h^h^h, uh, "discussion" with Bruno over some aspects of disparities between fscanf and friends and textscan over interpretations of format specifiers wherein different functions returned different results with the same C format string on the same file.
Bruno initially said this was ok, as there was no documentation that specifically said what the return should be; my contention was (and is) that while TMW doesn't provide complete documentation, the docs for all of these functions end up referencing the C Standard; hence, by that reference, the results should be expected to be consistent with the official interpretation that Standards body would rule in cases of interpretation.
Now, since Matlab is a proprietary language owned solely by TMW, there's no recourse to anywhere but TMW in any such case and they can choose to be as compliant or non-compliant as they deem appropriate but there is at least a place to hang your hat in expecting them to have a given result.
As for changes between releases, that's a major issue in my mind with using Matlab for anything except exploratory and non-critical applications. One just can't really trust it for other work, as in my view they change behavior excessively rapidly, with incompatible results that require excessive version testing when new features are introduced so frequently. Compilers and language Standards evolve much more slowly, so one gives up the convenience for stability, and it's got to be an institution-specific evaluation of what is/is not an acceptable balance between the two.
While it's now been quite a long time since I left the nuclear power field where licensing issues with the NRC on codes for safety-related analyses were a direct and daily concern, at that time it would have been extremely difficult to keep any computations done in a product such as Matlab compliant with the NRC requirements for validation to the point that the amount of manpower required to do so would have prevented any attempt to make it a safety-related computation engine. My guess is that is probably still the case altho I have no current knowledge of the vendors.
dpb on 16 Mar 2015
Edited: dpb on 16 Mar 2015
OK, one more; I missed the last sentence in the earlier comment up 'til proofreading the last comments...
"... would like a mex-file, which reads simple fixed-width "csv-files" the old "fortran way""
Indeed, before retiring I had built a moderately complete function that took in a Fortran format string and used Fortran to read the file/character array for internal i/o given that string.
Unfortunately, in the move from the employer to self-employed, it appears the only copy of the source for that was on the employer machine and by the time I discovered it was missing the particular machine had been re-purposed and all the old files were gone. I've not had the motivation to rebuild it after retiring from the consulting gig and coming back to the family farm; the biggest complexity in it was returning the data back to Matlab owing to the need to construct appropriate data structures for the general outputs possible to come back. Very simple things like fixed numeric arrays wouldn't be too bad (similar to the limitations for csvread, say).
At one time I thought I had submitted this routine to TMW as an enhancement but a conversation in a cs-sm thread similar to this w/ Steven Lord indicated he had done some looking but if so it's no longer in their database. Whether the wish is in the database of requested enhancements at all I don't know.
It works perfectly! Fixed width delimiters were a PITA until now.
"It works perfectly! Fixed width delimiters were a PITA until now"
As long as there are no blank fields...
Google helped me understand that "PITA" in this context is not a kind of bread.
As dpb says this "solution" cannot handle all cases. It is not robust and in certain cases it returns erroneous result without any warning. E.g.
>> cac = textscan( '123456 23456 34561234567 4567 67' ...
, '%6f%6f%6f%7f%7f%7f' ...
, 'Delimiter' , '' ...
, 'Whitespace' , '' ...
, 'CollectOutput' , true );
>> cac{:}
ans =
123456 23456 345612 34567 4567 67
I expected
123456 23456 3456 1234567 4567 67
Replacing each '%Nf' with '%Ns' (same widths N) returns the expected result
>> cac{:}
ans =
'123456' ' 23456' ' 3456' '1234567' ' 4567' ' 67'
That explains why my solution is safe with the example of the original question. However, "PITA" isn't gone! TBC
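For what it's worth, here is a sketch of the column-counting alternative on the same string: cut the line by position first, then convert each piece. The spacing of the test string is reconstructed from the field widths 6,6,6,7,7,7 (an assumption, since blanks are easily lost in a post), and nothing here relies on textscan, only on indexing plus str2double.

```matlab
% Test string rebuilt field by field so the blanks are explicit (assumed spacing).
str    = ['123456' ' 23456' '  3456' '1234567' '   4567' '     67'];
widths = [6 6 6 7 7 7];
last   = cumsum(widths);           % last column of each field
first  = last - widths + 1;        % first column of each field
vals   = nan(1, numel(widths));
for k = 1:numel(widths)
    % A field that is blank or not numeric converts to NaN.
    vals(k) = str2double(str(first(k):last(k)));
end
disp(vals)
```

The third field now starts in column 13, blanks included, which is the "a column means a column" reading.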
dpb on 17 Mar 2015
Edited: dpb on 17 Mar 2015
"...understand that "PITA" in this context is not a kind of bread."
chuckles The english-specific (and perhaps American?) acronym is why I occasionally write as pit[proverbial]a[ppendage] on the hope it'll be decipherable even by non-native speakers. :)
I'm curious still, though, Per as to what you think may be "undocumented" behavior or did my rationalization above of referring back to the underlying C formatting cure the question (even if it's not a definitive answer)?
I'll amplify one specific point -- I'm no C whizard, but my reading of the Standard and documentation of the standard functions descriptions leads me to believe that the above behavior is that expected from C -- filled fixed-width fields should parse correctly but a blank field of the same width ends up getting "eaten" by an (imo) ill-conceived definition of white space that is still active even with the explicit width field.
I believe that versions of Matlab earlier than R2012b, and particularly textscan, had some bugs; I've not done sufficient testing to know whether all the edge cases that have been discovered are resolved, nor even whether the aforementioned discrepancies between textread|fscanf|and friends and textscan are yet resolved, but it appears TMW has made conscious efforts to make them more consistent than they were previously.
I tried again, to date I've not been able to get an installed compiler to successfully link against the R2012b mex libraries; this area is another of exceeding frustration I have...I remember in early years I had no problems but since R14 I've never gotten anywhere with mex so have basically given up trying. Trying to decipher the install process is hopeless; it's so convoluted in config files and perl or whatever it is they use that it's impossible to make heads nor tails of... :(
I have never used C and never will. My ultimate justification to that is Unix, a Hoax?
Matlab is advertised as a high level tool for non-programmers like me.
A long time ago there were references to C in the Matlab documentation on some text-reading functions, and I then tried to read some C documentation on reading text files. In the current documentation of textscan I cannot find any references to C. Thus, the Matlab documentation should suffice.
"I'm curious still": My function, fixed_width_format, relies on the fact that '%wids' honors the width specification. Obviously, '%widf' does not. AFAIK, this difference is not documented. Furthermore, the effects of 'Delimiter','' and 'Whitespace','' are not documented. I use them as "placebo".
... and I dislike behaviors like this one
>> sscanf( '1.2.34.2', '%f' )'
ans =
1.2000 0.3400 0.2000
TBC
Ah yes, I remember that one well!!! :) 'Twas well done, indeed...
I see in the doc TMW has removed most of the former references to the C i/o...in a fairly brief look the only thing I see still there is the following at fprintf
"Note:  The low-level file I/O functions are based on functions in the ANSI® Standard C Library. However, MATLAB includes vectorized versions of the functions, to read and write data in an array with minimal control loops."
Of course, all the "high"-level file I/O functions are based on the low-level ones and all file i/o eventually translates into the compiler runtime library so the end result is that all i/o is based on the Standard C library.
So, while it may not be expressly stated and TMW has tried to duplicate the descriptions of the format strings around various places, they all end up back at the underlying behavior of C since the language is written in C++/C and the format strings clearly mimic those of C with the extensions to vectorized behavior.
Older code still has remnants of the former documentation path...
>> help textread
textread Read formatted data from text file.
...
[A,B,C, ...] = textread('FILENAME','FORMAT')
...
FORMAT string. If there are fewer fields in the file than in the
format string, an error is produced. See FORMAT STRINGS below for
more information.
...
FORMAT STRINGS
...
Supported conversion specifications:
%n - read a number - float or integer (returns double array)
%5n reads up to 5 digits or until next delimiter
%d - read a signed integer value (returns double array)
%5d reads up to 5 digits or until next delimiter
...
See the Language Reference Guide or a C manual for complete
details.
...
>>
But, while they can remove the explicit references, they can't remove the behavior of the underlying rtl (runtime library). While they have their own for the vectorized functionality, it still has to eventually translate to the base language rtl or they have to replace that whole functionality, too.
'123456 23456 34561234567 4567 67'
000000000111111111122222222223333333333
123456789012345678901234567890123456789
Your example also illustrates a case I forgot to test that is, by all reasonable meanings of "fixed width" field, broken. But, it is not broken according to the C Standard; it is, in fact, the expected behavior owing to the definition of the field width parameter for numeric scanning as not actually counting columns but convertible characters after a delimiter and that multiple delimiters are counted as one. You'll note that in the field that failed ' 34561234567' that the returned value with the '%6f' format applicable did return the value 345612 which is, in fact, only six characters in length(*) albeit it began counting with the leading '3' instead of with the blank in the 13th column of the overall string where Fortran would have done. So it did "honor" the width, it just has a different definition of what "honoring" means. There's a place where that behavior isn't documented specifically in Matlab by TMW because it is C behavior and they haven't reproduced all of the C documentation in its entirety.
(*) W/o the width specifier '%f' would scan the full substring of numeric characters until the whitespace or non-numeric character.
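A small check of that reading, using sscanf on the offending substring. This is only a sketch; it assumes sscanf follows the same C-style width rule described above (leading blanks skipped and not counted against the width).

```matlab
s = '  34561234567';                      % two leading blanks, then eleven digits
with_width    = sscanf(s, '%6f', 1);      % blanks skipped, then six characters taken
without_width = sscanf(s, '%f', 1);       % whole run of digits taken
fprintf('%d\n', with_width, without_width);
```

With Fortran-style counting, the six-column field would instead cover the two blanks plus '3456'.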
So, my contention is that the (increasingly meager) references to the C Standard are still significant and are, still, the underlying authority of what one should expect. Unfortunately, that isn't always a useful expectation for fixed width fields. It took me a long, long, time to finally understand enough of the C jargon to figure out that it is, according to it, indeed, the proper result, as disappointing as that is.
So, one last time..."why, oh why, couldn't they have kept Fortran FORMAT instead"???
dpb on 18 Mar 2015
Edited: dpb on 18 Mar 2015
One last comment on the documentation vis a vis C. This conundrum illustrates the most fundamental problem with TMW's documentation in that it is all heuristic and example-based, rather than there being a formal definition of required behavior as a base line then a users' manual which explains how to use it.
And, not to misunderstand, in general this approach works well and the intent to make it relatively easy to use the product is both understandable and reasonable; the problem comes in the hard parts and the edge cases that defy simplification and have so many options and then in this case, rely on behavior besides that of Matlab itself but of the C language i/o. It's a big job but not having the complete "Standardese" specifications to which the C Standard Library functions are written or a precise description of every one of those points in the documentation leads to the inevitable holes that some areas simply aren't covered by TMW; you either have to go look up what C says should happen or find out by trial and error, in which case unless you know C you don't know if that's right or not.
It's not clear to me whether behind the scenes TMW does have such a master definition or not; I presume there must be for the core language elements, but for the multitude of functions I have no idea how they actually make the final decisions on what behavior is accepted and whether there's an a priori requirement to which a function is given or whether there's a more general target area being addressed with a function like textscan and it just evolves by development and then at some point before release is considered to be an ok definition that then gets updated periodically as quirks are found and features added, removed, refined based on user experience. That things can be dynamic is obvious with functions like interp1 which has had some fairly significant changes in functionality and behavior with recent releases.
Of course, that level of integration of higher level function tends not to be present in other language standards; closest I would know of would be the standard template library with C++ and I'm just not familiar enough with it to be able to even begin to draw comparisons; certainly Matlab has a much wider range of areas of integration even in the base product and continues to add more at a breathtaking pace...
Anyway, the point is, that the documentation for complicated functions such as textscan attempts to explain by example and as good a job as can of what the various inputs mean, but when it's this complicated and in this case relies on behavior that is borrowed from somewhere else, unless that "somewhere else" is fully documented then the TMW documentation is, inevitably, incomplete.
@Per...you may have gotten bored with the topic but I had one additional thought and comparison on the documentation quandary.
Referring the formatted i/o format strings to the C Standard Library definitions isn't unique, actually. It's the same thing as for string formatting with TeX/LaTeX in the interpreter selection for text and friends. TMW makes no attempt to fully reproduce the pertinent documentation but only a subset. This area is even more frustrating as it appears that the incorporated subset of TeX isn't fully compatible and there's no information as to what is/isn't like the original so one is really left to one's own devices and trial and error there.
I'm an absolute novice when it comes to regular expressions but I know you use them extensively. Isn't it also similar there with the builtin regexp?
@dpb, No I'm not bored, but I don't know what to add. The topics are large and deserve separate threads, which conclude in collective recommendations/wishes to The MathWorks.
  • Regarding the documentation in general, I think that it's okay and is being improved over time. However, I have wishes: more elaborate problem-oriented examples, e.g. "How to read fixed width text file" and "How to read multi-section text files"; an easier way to find a function when I've forgotten the name; and more.
  • textscan: Knowledge of C might help to understand why, but must not be required. (If needed, that should be clearly stated.) Regarding 'Delimiter','' and 'Whitespace','', I still think they don't cause the promised behavior. Recently, I read some documentation on Fortran format specifiers. We only want a small subset.
  • The documentation of the low-level HDF5-functions is another example of referring to third party documentation. In this case I think it is okay.
  • The documentation on Regular expressions is okay, I think. However, I think more elaborate examples would be useful.
dpb on 21 Apr 2015
Edited: dpb on 22 Apr 2015
@Per, I agree these are large subjects, and I've said my piece to TMW on many of the documentation issues through some direct interaction with a person on the documentation team. It was, in fact, that feedback and interaction which ultimately led to the present license I'm using, which lets me stay at least reasonably closely in touch with current releases (although the limited memory capacity of my present machine means going much farther will seemingly have to wait until I can get a more powerful box).
I think I have made enough of an impression on some areas it may eventually show up in a small way but much of what I'm not terribly fond of is just too far gone to ever get back until there's another major paradigm shift by the OS vendor. Hopefully like many fads/trends, we'll see a retrogression back towards what worked well before as they discover that "pretty isn't necessarily useful".
textscan isn't the only instance that relies on C -- all i/o does and it's implicit. I agree the doc should bring that into play but most who are being introduced to Matlab don't know C, either, so it's a conundrum that ideally it should be fully self-documented.
Unfortunately, C i/o scanning is complex and counter-intuitive in many cases when it's out of the routine. I'm sure it gets even more complex than C Standard Library addresses when it is vectorized as TMW has done.
The 'delimiter' and 'whitespace' options are an extension so there we're back to the issue that whether it "does what it promises" or not is indeterminate as we're not privy to the actual design spec. There isn't any such option in the Standard library; that's an introduced behavior of the higher-level wrappers TMW has written around the low-level formatted i/o calls.
I saw a link to a recent comment on the previous answer but couldn't tell for sure what was actually an addition/modification. I don't think there's a problem with the regular fixed-width file as long as there are no missing fields. That is when it is just irrecoverably broken.
I agree a subset of FORMAT would solve much -- over the last couple of weeks I've managed to get a Fortran compiler working with R2012b here so I can now (finally) build mex files again. My goal is to build a small sample of a functional capability I had some years ago that will let one specify a FORMAT expression and then use the mex file to use Fortran to actually do the read and return the results. I used it to solve the problem quite extensively years ago but lost the source/compiled files when left the employer to go out on own as simply forgot to collect them from the machine before it was reformatted and didn't have a copy at the house.
ADDENDUM
As a thought of a specific suggestion, perhaps an additional 'FixedFieldWidth' flag could be introduced. If set, it would tell textscan to actually use the counted fields irrespective of content for scanning instead of the default C behavior of neglecting "extra" white space. This would have the effect desired and place the busy-work of subsetting strings into the lower-level mex'ed code instead of having to deal with it at the user level. On first blush the 'EmptyValue' option could, by default a la Fortran, return NiL (zero) by default instead of NaN if the option were set. I'd think that would solve >95% of the problems I've seen users have.
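To make the suggestion concrete, here is a user-level sketch of what such a flag might do. 'FixedFieldWidth' does not exist in textscan; the code below only emulates the idea with mat2cell and str2double on one made-up record, and maps blank fields to zero the way Fortran would.

```matlab
rec    = ['  1.743' '       ' '  2.366'];    % made-up record; the middle field is blank
widths = [7 7 7];
fields = mat2cell(rec, 1, widths);           % cut strictly by column count
vals   = str2double(fields);                 % blank field converts to NaN
is_blank = cellfun(@(f) all(f == ' '), fields);
vals(is_blank) = 0;                          % Fortran-like: a blank field reads as zero
disp(vals)
```

Only all-blank fields are mapped to zero here; a field holding non-numeric text stays NaN, so it is not silently confused with a blank.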
@Per -- I finally seem to have managed to get a setup that allows me to mex Fortran files under the R2012b release (albeit w/ a custom setup rather than integral to mex but it appears to be functional even if a little more of a bother than should be necessary).
I'm going to try to build a very rudimentary first pass, to refresh my recollection, of a textscan workalike that uses FORMAT statement forms for the format string and returns variables (or an array of numerics if all are the same type, as an enhancement over base textscan functionality), and then we can test it on the various fixed-width files herein.
BTW, in looking at the documentation, I see some notes on whitespace that I had kinda' forgotten about that explain a lot of the behavior that causes the problem in the case with interpreting a blank field. From the textread doc it says
"textread matches and converts groups of characters from the input. Each input field is defined as a string of non-white-space characters that extends to the next white-space or delimiter character, or to the maximum field width. Repeated delimiter characters are significant, while repeated white-space characters are treated as one." (Emphasis added) It's that behavior that I tried to explain earlier that's the key difference between Fortran and C in their interpretation of what "fixed width" really means.
So then I went back to textscan to see if it has the same words -- not quite, but a description that (I think) means the same thing:
"textscan does not include leading white-space characters in the processing of any data fields. When processing numeric data, textscan also ignores trailing white space."
This is what has the effect of "mushing together" fields, leaving missing fields at the end of a record, instead of treating whitespace-only fields within a record (or at the end also, if that's their positional location) as the significant fields they were intended to be.
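A minimal sketch of that difference, assuming a made-up record of three 5-character fields whose middle field is blank (the record and the widths are illustrative only, not taken from any file in this thread). It contrasts a whitespace-delimited scan with cutting the record at fixed column boundaries before converting:

```matlab
% Hypothetical record: three 5-character fields, the middle one blank.
rec    = ['   12', '     ', '   34'];
widths = [5 5 5];

% Whitespace-delimited scan: the run of blanks collapses, so only two
% values come back and the third field appears to be the missing one.
byWhitespace = sscanf(rec, '%f')';

% Positional scan: slice the record at fixed column boundaries first,
% then convert each slice on its own.
stops      = cumsum(widths);
starts     = stops - widths + 1;
byPosition = nan(1, numel(widths));
for k = 1:numel(widths)
    byPosition(k) = str2double(rec(starts(k):stops(k)));
end

disp(byWhitespace)   % two values: 12 and 34
disp(byPosition)     % three values: 12, NaN, 34 (blank field keeps its place)
```

The blank field comes back as NaN here because that is what str2double returns for an all-blank string; a Fortran-style reader would return zero instead.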
@dpb - That's great that you now have a setup to make mex-functions with fortran.
I have made a quick-and-dirty function, expand_format_spec (attached) to make format-strings for textscan. They are more readable and less error prone. It's basically textscan specifier together with fortran multipliers. Here are three variants, which expand to the same textscan specifier.
disp( expand_format_spec( '%*12s,2(%*8f),3(%6.2f),12(%8d)' ) )
disp( expand_format_spec( '%*12s 2(%*8f) 3(%6.2f) 12(%8d)' ) )
disp( expand_format_spec( '%*12s2(%*8f)3(%6.2f)12(%8d)' ) )
%*12s%*8f%*8f%6.2f%6.2f%6.2f%8d%8d%8d%8d%8d%8d%8d%8d%8d%8d%8d%8d
%*12s%*8f%*8f%6.2f%6.2f%6.2f%8d%8d%8d%8d%8d%8d%8d%8d%8d%8d%8d%8d
%*12s%*8f%*8f%6.2f%6.2f%6.2f%8d%8d%8d%8d%8d%8d%8d%8d%8d%8d%8d%8d
>>
I think
  • the first one (with commas) is the best (i.e. most readable) and
  • that something like this would fit better in the Matlab environment than a syntax which mimics Fortran's FORMAT
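For readers without the attachment, the repeat-count idea can be sketched roughly as below. This is a hypothetical helper (the name expand_repeats is made up here), not the attached expand_format_spec; it assumes un-nested groups of the form N(...) and treats commas and blanks between items as optional separators.

```matlab
function out = expand_repeats(spec)
%EXPAND_REPEATS  Expand Fortran-style repeat counts such as 3(%6.2f) in a
%   textscan format string. Hypothetical illustration only; it is not the
%   attached expand_format_spec and does not handle nested groups.
pat = '(\d+)\(([^()]*)\)';                 % a count followed by a (group)
out = spec;
[tok, s, e] = regexp(out, pat, 'tokens', 'start', 'end', 'once');
while ~isempty(tok)
    n   = str2double(tok{1});              % repeat count
    out = [out(1:s-1), repmat(tok{2}, 1, n), out(e+1:end)];
    [tok, s, e] = regexp(out, pat, 'tokens', 'start', 'end', 'once');
end
out = regexprep(out, '[,\s]+', '');        % drop separators between items
end
```

Called on any of the three variants above, for instance `expand_repeats('%*12s,2(%*8f),3(%6.2f),12(%8d)')`, this sketch should produce the same expanded specifier as the output lines shown.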
I look forward to your function.
Yes, 'FixedFieldWidth' is a possibility. However, textscan is already too complicated with all its Name-Value Pair Arguments.
The syntax simplification to write the format string is an enhancement, but unless the underlying read routine changes its behavior it would still have the above problem with fields that are significant but not containing anything but white space. That would be the function of the proposed switch, to change that behavior.
I agree there are a "veritable plethora" of options already, but there is a need for them, unfortunately, to provide sufficient flexibility to handle (almost) all cases.
I figured that would, while probably the proverbial snowball's chance of ever being accepted by TMW, have better shot than another whole new function.
  • Sure, it's just "syntactic sugar".
  • Maybe, a syntax like this would be enough (with a fortran engine)
  • I don't understand TMW
Does anybody? :)
OK, on point 1) above; and it's better than repmat by far, I agree. Kinda' like my iswithin and friends to move the multiple comparisons out of the top level is...
Unfortunately, I've gotten sidetracked on other issues for a while so the FORMAT routine is going to have to wait a while...
Assume the short range goal is a new function, read_fixed_format, in The File Exchange.
I thought of making a prototype in m-code:
  • reading with textscan together with a format-string based entirely on %wc where w is a whole number
  • format-string according to my comment above
  • converting to numerical in a second step - maybe with str2num.
  • do some timing experiments
Move our discussion to a new thread. Ask the following questions here at Answer (in one question):
  • Is there a need for a new function to read fixed-formatted text files?
  • Does my prototype handle the most frequent types of fixed-width files?
  • A good fortran-mex-function, how much faster would that be?
What do you think?
dpb
dpb on 5 May 2015
Edited: dpb on 5 May 2015
Seems eminently reasonable to me... :)
I'd suggest adding the alternative of the 'FixedWidth' flag option to textscan (and I wouldn't ignore textread, red-haired-child status notwithstanding; despite its deprecated status it has some useful features not included w/ its preferred kin as well) instead of, again, a whole new function, to keep the proliferation of specialty routines in lieu of a general routine for formatted i/o to a minimum.
I wonder whether, in the prototype above, it wouldn't be more straightforward, once you have the field widths, to just glom up the full file as a stream into a character array and then use direct indexing on the columns to do the conversions?
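A rough sketch of the character-array idea, under the assumption that every record has the same length. The three records, the field widths (5 and 9) and the variable names are invented for illustration; in practice the cell array of records would come from the file, e.g. by splitting the output of fileread at line breaks.

```matlab
% Hypothetical records: a 5-character integer field followed by a
% 9-character real field; the second record has a blank real field.
recs = {sprintf('%5d%9.2f', 12, 1.50); ...
        [sprintf('%5d', 34), blanks(9)]; ...
        sprintf('%5d%9.2f', 56, 2.25)};

block = char(recs);                          % 3-by-14 character matrix

% Direct column indexing: each field is a fixed range of columns.
id  = str2double(cellstr(block(:, 1:5)));    % 12, 34, 56
val = str2double(cellstr(block(:, 6:14)));   % 1.5, NaN, 2.25

disp([id val])
```

The blank field stays in its column and converts to NaN; replacing NaN by zero afterwards would mimic the Fortran convention.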
I'm not sure now when I'm going to get back to the main task; the pro bono work for the local community college is making some serious demands just now and for the foreseeable future plus farming season has now arrived in earnest and we're actually getting rain so time is going to become precious...
I cannot think of a scenario when giving <precision> when reading would be useful. First surprise:
>> cac = textscan( '3.1416,3.1416', '%6.2f', 'Delimiter', ',' );
>> [cac{:}]'
ans =
3.1400 16.0000 3.1400 16.0000
and for the fun of it
>> cac = textscan( '3.1416,3.1416', '%6.2f%u', 'Delimiter', ',' );
>> [cac{:}]'
ans =
3 3
16 16
I would rather have expected
ans =
3.1400 3.1400
16 16
Conclusion: Never use <precision> when reading!
The problem in the latter is that the [] around the display result forces a cast to the lower precision...try just
cac{:}
to see the actual content of each cell.
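The cast can be reproduced without textscan at all. The sketch below types in, by hand, values shaped like the two cells in question (a double column and a uint32 column, assuming %u returns uint32) and shows what the concatenation does to them:

```matlab
% Values shaped like the two textscan outputs, entered by hand here.
a = [3.14; 3.14];          % double column, as the %f conversion returns
b = uint32([16; 16]);      % uint32 column, as assumed for %u

m = [a b];                 % mixing double and integer yields the integer class
disp(class(m))             % uint32
disp(m)                    % 3.14 has been rounded to 3

% Converting the integer column explicitly keeps the fractional part.
m2 = [a double(b)];
disp(class(m2))            % double
disp(m2)
```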
On the initial query, we're back to the question of what the design document says the TMW-specific version of the C format string is supposed to do (which gets back to my previous complaint that we don't have an actual language specification, only the descriptive documentation and the oblique reference to C). I don't know C well enough to know what the C99 Standard Library actually says on this point. I'll refer back to Fortran FORMAT behavior as the gold standard of what it should do (but as noted have no idea whether that's consistent w/ expected behavior from a conformant C compiler or not; my initial guess is "not").
  1. On input, the F data edit descriptor transfers w characters from an external field and assigns their real value to the corresponding I/O list item. The external field data must be an integer or real constant.
  2. If the input field contains only an exponent letter or decimal point, it is treated as a zero value.
  3. If the input field does not contain a decimal point or an exponent, it is treated as a real number of w digits, with d digits to the right of the decimal point. (Leading zeros are added, if necessary.)
  4. If the input field contains a decimal point, the location of that decimal point overrides the location specified by the F descriptor.
  5. If the field contains an exponent, that exponent is used to establish the magnitude of the value before it is assigned to the list element.
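As a rough illustration of rules 1, 3 and 4, here is a sketch of Fw.d input for a single field that has already been cut out of the record. The function name read_f_field is made up; exponent forms (rule 5 and the exponent-letter part of rule 2) and the BN/BZ blank-handling modes are deliberately left out, and blanks are simply ignored.

```matlab
function val = read_f_field(field, d)
%READ_F_FIELD  Sketch of Fortran Fw.d input for one field of w characters.
%   Hypothetical helper; exponent forms and BN/BZ modes are not handled.
s = strrep(field, ' ', '');          % blanks are ignored in this sketch
if isempty(s) || strcmp(s, '.')      % blank field or lone decimal point
    val = 0;
elseif any(s == '.')                 % explicit decimal point overrides d
    val = str2double(s);
else                                 % no decimal point: last d digits are decimals
    val = str2double(s) / 10^d;
end
end
```

Under these rules the field '3.1416' read with F6.2 gives 3.1416, because the explicit decimal point wins over d; ' 31416' read with d = 4 gives the same value, and an all-blank field gives zero rather than NaN.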
We already know about the problems of interpreting the field width itself in C (and hence Matlab) owing to blanks and the necessity of a delimiter in some instances.
I think the rule that an explicit decimal point overrides the FORMAT [precision] field is the better solution; it leads to far less confusion overall, although one can't interpret the above record without either
  1. An explicit 1X to skip the delimiter, or
  2. A compiler-dependent extension that recognizes the comma as a delimiter(*)
(*) At least one commercial compiler that I'm aware of has such a switch. The prime use of this feature, which is NOT Standard behavior, is the facility to terminate shorter-than-W fields; otherwise Fortran FORMAT will read as many characters as needed to fulfill the READ per the FORMAT statement. This underlying behavior may be what prompted the modification in the original C behavior; it does make for more consistent language-reading behavior, as fields are, as a rough analogy, basically considered to be the equivalent of words in a sentence. That's great for much text processing or visually scannable data files but not so good for data files that may be computer generated and where the position is significant. It's a difference in point of view of the developers at the time, methinks; K&R weren't really that concerned with "serious compute" applications.
per isakson
per isakson on 10 May 2015
Edited: per isakson on 10 May 2015
"the display result forces a cast to the lower precision" Thanks! I was too occupied with textscan to think of that.
Over the years there has been a remarkable number of new and updated Matlab functions to read flat text files. And many toolboxes have their own variants. Obviously, TMW wants to enhance the "user experience". With my comment I just wanted to say that they should try even harder; "constant dropping wears the stone".
Thank you for the detailed explanation. However, IMO, there are way too many subtle details for a "high level language" like Matlab.
I'll be back with a new thread.
dpb
dpb on 10 May 2015
Edited: dpb on 11 May 2015
"...there are way too many subtle details for a "high level language" like Matlab."
I'd concur although I think it's inevitable given the choice of the underlying implementation; it's just inherent with the way the C library operates for these kinds of cases and there is so much generality that one must be able to handle to make a truly universal tool.
IMO it would help, however, if the documentation were written as a definitive normative description with sufficient detail that one could infer the result simply from the TMW-supplied help files. But then they would be so complex that
  1. Nobody would read them, and
  2. It would take a "language lawyer" to parse the result in the exotic cases if they did.
The latter is a frequent topic at comp.lang.fortran, where one of the regulars is a former editor of the Standard and there are regular discussions and disagreements as to whether a given construct is or is not "standard".
ADDENDUM
BTW, it is the combination of #1 and #3 above that is perhaps the most critical difference between Fortran and C on the interpretation of fixed-width fields input. That W characters are read irrespective of a presumed interpretation as "white space" (1) and that a field is "zero-filled" as necessary (3) so that a blank field is thus NOT presumed empty. (Hmmmm....interesting thought--would that mean that for your function you could use the "read character array to memory" idea and do a global substitution of zeros for blanks and then the field width count from existing textscan would work? Not sure if it would be totally general or not otomh but it's an intriguing thought, methinks...)
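A quick sketch of that substitution idea on a made-up record of three 5-character fields with a blank middle field. It uses sscanf with a field width rather than textscan, purely to keep the example small; whether textscan behaves identically on real files is not checked here.

```matlab
% Hypothetical record: three 5-character fields, the middle one blank.
rec    = ['   12', '     ', '   34'];

filled = strrep(rec, ' ', '0');      % '000120000000034'
vals   = sscanf(filled, '%5f')';     % width now counts real characters

disp(vals)                           % 12, 0, 34: the blank field reads as zero
```

It is indeed not fully general: a signed field such as ' -1.5' becomes '0-1.5', and blanks that trail a number inside a field would turn into trailing zeros, so the substitution only suits right-justified, unsigned fields.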

Asked: on 13 Mar 2015
Edited: dpb on 11 May 2015