Modify then write data in the given format

Question

Hi, I will start with a brief overview of what I am trying to achieve in this code.

Below is a sample of the data I want to manipulate. It stores the xyz coordinates and velocities of atoms in solution.

    1SOL     OW    1   4.309   5.254   4.135 -0.2790  0.3440  0.2064
    1SOL    HW1    2   4.314   5.169   4.082 -1.5406  0.3918 -0.0293
    1SOL    HW2    3   4.388   5.312   4.114 -1.3375  0.9272 -2.6151
    2SOL     OW    4   1.743   1.687   2.366  0.2136  0.2777  0.3181
    2SOL    HW1    5   1.818   1.750   2.387  0.3115  0.1542  0.3431
 4502OCTA     H13545   2.108   5.326   1.045 -1.2169  0.4890 -2.6144
 4502OCTA     H13546   2.068   5.492   1.036  0.7609  0.6650  0.8612
 4502OCTA     H13547   2.285   5.388   1.207  3.0144  2.5562  1.0920
 4502OCTA     H13548   2.121   5.425   1.265 -1.2460 -1.3635  1.4829
 4502OCTA    Oc13549   2.131   5.677   1.238 -0.0183 -0.0221 -1.0402
 4502OCTA    Oh13550   2.353   5.635   1.208 -0.6036  0.2241 -0.8140
 4502OCTA     H13551   2.383   5.198   0.399  0.4893  0.7154 -0.9915
 4502OCTA    Ho13552   2.413   5.565   1.189 -0.4685 -0.0421 -2.1107

What I need to do is add a specific value, for instance add 1 to the numbers in the columns 4 and 5 and rows 3 to 8. Basically, I want to translate the positions of certain atoms. It is important that I keep the original file's format.

There were two challenging aspects to this code. The first is that the second and third columns merge when the numbers in the third column (atom IDs) reach five digits. I've worked around that, albeit not elegantly.

The second issue which I haven't been able to solve is how to write the new coordinates into a file. Matlab ignores the empty spaces before each row begins, and ignores the spaces in between the columns, and ignores the spaces after the rows end. I've tried using horzcat, mat2str, strcat, and maybe some others without success. I will leave my code below for you to examine.

 function output = Gro_editor(filename, x1, y1, y2,addval)
 %Initialize Values%
 filend = 0;
 fin = 0;
 data = cell(15,22);
 Xpos = 1; %X-coordinate
 Ypos = 1; %Y-coordinate
 %Open and get permission to write target file%
 fid = fopen(filename,'r');
 if fid == -1
     disp('Could not open file');
 else
     disp('File Open...');
     %Index the entire file with meaningful partitions.
     %First 15 are indexed individually, following seven are read as blocks.
     while filend == 0
         if Xpos < 16
             data(Ypos,Xpos) = {cellstr(fscanf(fid,'%c',1))};
         elseif Xpos > 15
             data(Ypos,Xpos) = {fscanf(fid,'%f',1)};
         end
         if Xpos == 22
             Xpos = 1;
             Ypos = Ypos + 1;
             fscanf(fid,'%c',2);
         elseif Xpos < 22
             Xpos = Xpos + 1;
         end
         if Ypos == 20
             filend = 1;
         end
     end
     closeresult = fclose(fid);
     if closeresult == 0
         disp('File closed')
     else
         disp('File close unsuccessful')
     end
     %Writing new cell array
     while fin == 0
         data{y1,x1} = data{y1,x1}+addval
         if y1 < y2
             y1 = y1+1;
         elseif y1 == y2
             fin = 1;
         end
     end
     %So far so good. I don't know what I am doing after here.%
     ----------------------------------------------------------
     output=data;
     fid = fopen('output.txt','w');
     [nrows,ncols] = size(data);
     formatSpec = '%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c\n';
     for row = 1:nrows
         for col = 1:ncols
             line = horzcat(data(nrows,1:15))
             fprintf(fid,formatSpec,line(col));
         end
     end
 fclose(fid);
     type output.txt
 end

1 Comment

dpb on 13 Mar 2015

Edited: dpb on 15 Mar 2015

It is a pita to read fixed-width nondelimited data with Matlab owing to its C i/o formatting--the idea of having such was apparently overlooked and it is, simply put, impossible w/o counting columns if the fields are ever actually full.

That said, not sure where, precisely, your output formatting issues are arising but to write fixed column fields use a counted field of the proper width for both the string and the numeric fields and you can control that however is desired.
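As a rough sketch of what such a counted-width format could look like: the sample rows appear to follow the GROMACS .gro layout (an assumption, based on the function name Gro_editor and on the column positions in the sample), for which the usual C-style format is `%5d%-5s%5s%5d%8.3f%8.3f%8.3f%8.4f%8.4f%8.4f`. The variable names below are made up for illustration.

```matlab
% Assumed .gro-style layout: residue number (5), residue name (5, left-justified),
% atom name (5), atom number (5), three positions (8.3), three velocities (8.4).
fmt = '%5d%-5s%5s%5d%8.3f%8.3f%8.3f%8.4f%8.4f%8.4f';

resnum  = 1;                        % hypothetical variable names
resname = 'SOL';
atname  = 'OW';
atnum   = 1;
pos     = [4.309 5.254 4.135];
vel     = [-0.2790 0.3440 0.2064];

% Each field is padded to its counted width, so the columns stay aligned
% no matter how many digits a value has.
rec = sprintf(fmt, resnum, resname, atname, atnum, pos, vel);
disp(rec)
```

Writing the record to a file would then be `fprintf(fid, '%s\n', rec)`, with leading and embedded blanks preserved because they are part of the formatted string.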

Provide a precise definition of the first two string columns' content (and spacing within the column if that is actually also significant) and I'm sure we can write an appropriate format string.

Answer 1

per isakson on 15 Mar 2015

Edited: per isakson on 15 Mar 2015

The documentation of textscan doesn't cover fixed-width very well. However, textscan has "undocumented"/hidden capabilities.

Approach

Read the first three columns to one string, since they shall only be copied to the output file.
Read the following six columns to a double array.
Add 1 to the prescribed elements of the array
Use the same format string to write the data (don't forget new-line)

Run example code (I use R2013b)

fixed_width_format(13)

where

    function    fixed_width_format( N )
        fid = fopen( 'fixed_width_format.txt' );
        format_spec = '%20s%8.3f%8.3f%8.3f%8.4f%8.4f%8.4f';
        cac = textscan( fid, format_spec, N     ...
                    ,   'Whitespace'    , ''    ...
                    ,   'Delimiter'     , ''    ...
                    ,   'CollectOutput' , true  );
        fclose( fid );
        RowHead = cac{1};
        Data    = cac{2};
        % add 1 to the numbers in the columns 4 and 5 and rows 3 to 8. 
        Data( 3:8, 4:5 ) = Data( 3:8, 4:5 ) + 1;
        fid = fopen( 'fixed_width_format_out.txt', 'w' );
        for rr = 1 : N
           fprintf( fid, [format_spec,'\n'], RowHead{rr}, Data(rr,:) );
        end
        fclose( fid );   
    end

and where fixed_width_format.txt contains

        1SOL     OW    1   4.309   5.254   4.135 -0.2790  0.3440  0.2064
        1SOL    HW1    2   4.314   5.169   4.082 -1.5406  0.3918 -0.0293
        1SOL    HW2    3   4.388   5.312   4.114 -1.3375  0.9272 -2.6151
        2SOL     OW    4   1.743   1.687   2.366  0.2136  0.2777  0.3181
        2SOL    HW1    5   1.818   1.750   2.387  0.3115  0.1542  0.3431
     4502OCTA     H13545   2.108   5.326   1.045 -1.2169  0.4890 -2.6144
     4502OCTA     H13546   2.068   5.492   1.036  0.7609  0.6650  0.8612
     4502OCTA     H13547   2.285   5.388   1.207  3.0144  2.5562  1.0920
     4502OCTA     H13548   2.121   5.425   1.265 -1.2460 -1.3635  1.4829
     4502OCTA    Oc13549   2.131   5.677   1.238 -0.0183 -0.0221 -1.0402
     4502OCTA    Oh13550   2.353   5.635   1.208 -0.6036  0.2241 -0.8140
     4502OCTA     H13551   2.383   5.198   0.399  0.4893  0.7154 -0.9915
     4502OCTA    Ho13552   2.413   5.565   1.189 -0.4685 -0.0421 -2.1107
    ---  0---|--- 10---|--- 20---|--- 30---|--- 40---|--- 50---|--- 60---
    123456789|123456789|123456789|123456789|123456789|123456789|123456789
    '%20s%8.3f%8.3f%8.3f%8.4f%8.4f%8.4f'

and where fixed_width_format_out.txt contains

        1SOL     OW    1   4.309   5.254   4.135 -0.2790  0.3440  0.2064
        1SOL    HW1    2   4.314   5.169   4.082 -1.5406  0.3918 -0.0293
        1SOL    HW2    3   4.388   5.312   4.114 -0.3375  1.9272 -2.6151
        2SOL     OW    4   1.743   1.687   2.366  1.2136  1.2777  0.3181
        2SOL    HW1    5   1.818   1.750   2.387  1.3115  1.1542  0.3431
     4502OCTA     H13545   2.108   5.326   1.045 -0.2169  1.4890 -2.6144
     4502OCTA     H13546   2.068   5.492   1.036  1.7609  1.6650  0.8612
     4502OCTA     H13547   2.285   5.388   1.207  4.0144  3.5562  1.0920
     4502OCTA     H13548   2.121   5.425   1.265 -1.2460 -1.3635  1.4829
     4502OCTA    Oc13549   2.131   5.677   1.238 -0.0183 -0.0221 -1.0402
     4502OCTA    Oh13550   2.353   5.635   1.208 -0.6036  0.2241 -0.8140
     4502OCTA     H13551   2.383   5.198   0.399  0.4893  0.7154 -0.9915
     4502OCTA    Ho13552   2.413   5.565   1.189 -0.4685 -0.0421 -2.1107
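Note that in Data the first three columns hold the positions and the last three the velocities, so `Data(3:8,4:5)` shifts two velocity components, as the output above shows. If the goal is to translate the x and y positions (columns 4 and 5 of the file), the column index would presumably be `1:2` instead. A minimal sketch, with two rows of the sample typed in directly rather than read from a file:

```matlab
% Two records of the sample: 20-character row heads plus six numeric columns.
RowHead = {'    1SOL     OW    1'; '    1SOL    HW1    2'};
Data    = [4.309 5.254 4.135 -0.2790 0.3440  0.2064
           4.314 5.169 4.082 -1.5406 0.3918 -0.0293];

format_spec = '%20s%8.3f%8.3f%8.3f%8.4f%8.4f%8.4f';

% Translate x and y (numeric columns 1 and 2) of the second atom by 1.
rows = 2;
Data(rows, 1:2) = Data(rows, 1:2) + 1;

% Format each record with the same counted-width specification.
out = cell(size(RowHead));
for rr = 1:numel(RowHead)
    out{rr} = sprintf(format_spec, RowHead{rr}, Data(rr,:));
end
disp(char(out))
```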

Finally, does this example rely on undocumented features of textscan?

Addendum triggered by comment

    >> data = fixed_width_format(7);
    >> data(1:5,:)
    ans =
        4.3090    5.2540    4.1350   -0.2790    0.3440    0.2064
        4.3140    5.1690    4.0820   -1.5406    0.3918   -0.0293
        4.3880    5.3120    4.1140   -0.3375    1.9272   -2.6151
        1.7430    1.6870    2.3660    1.2136    1.2777    0.3181
        1.8180    1.7500    2.3870    1.3115    1.1542    0.3431
    >> data(6:7,:)
    ans =
               0         111      111111     1222222     2222333     3333333
          123456      789012      345678     9012345     6789012     3456789
    >>

where

    function    data = fixed_width_format(N)
        fid = fopen( 'fixed_width_format_dpb.txt' );
        format_spec = '%6f%6f%6f%7f%7f%7f';
        cac = textscan( fid, format_spec, N     ...
                    ,   'Whitespace'    , ''    ...
                    ,   'Delimiter'     , ''    ...
                    ,   'CollectOutput' , true  );
        fclose( fid );
        data = cac{1};
    end

and where fixed_width_format_dpb.txt contains

     4.309 5.254 4.135-0.2790 0.3440 0.2064
     4.314 5.169 4.082-1.5406 0.3918-0.0293
     4.388 5.312 4.114-0.3375 1.9272-2.6151
     1.743 1.687 2.366 1.2136 1.2777 0.3181
     1.818 1.750 2.387 1.3115 1.1542 0.3431
    000000000111111111122222222223333333333
    123456789012345678901234567890123456789
27 Comments

dpb on 15 Mar 2015

I don't see any undocumented behavior used; fixed width parsing works in this case because the fields aren't full. If the data array section were to look like

     4.309 5.254 4.135-0.2790 0.3440 0.2064
     4.314 5.169 4.082-1.5406 0.3918-0.0293
     4.388 5.312 4.114-0.3375 1.9272-2.6151
     1.743 1.687 2.366 1.2136 1.2777 0.3181
     1.818 1.750 2.387 1.3115 1.1542 0.3431
    000000000111111111122222222223333333333
    123456789012345678901234567890123456789

you'd have a real problem as a format string of

[repmat('%6.3f',1,3) repmat('%7.4f',1,3)]

will fail on the locations where the negative sign of the subsequent value abuts directly against the end of the preceding column.

Given the OPs file structure, though, good thinking to simply treat the character data as a block; unless, (as I thought was wanted) needs to mung on it, too, should be "good to go"...

Somewhere some months ago there's a thread going into the problems in some depth where I demonstrate the failures with any of the Matlab scanning input functions for such files (which are common in at least older Fortran, are still around, and can be parsed with no issues at all in Fortran, where "a column means a column").

dpb on 16 Mar 2015

OK, I'd forgotten the issue slightly, Per. The problematic file that was the crux of that thread was one where the file includes one or more blank fields...parsing does work correctly for the above file that is complete.

I'm presuming this still is so in latest release, R2012b is last I have here? I don't think it's only leading columns but anywhere within the data array the blanks will get "eaten" by the definition C i/o uses for fields irrespective of count.

I have no data for timing on list-directed i/o vis a vis explicit formatting as a general rule in Fortran. One would guess there's more overhead involved in list-directed i/o, so explicit formatting would be at least somewhat faster in a direct comparison on the same file, but I suspect it's highly compiler dependent and, even more so, dependent on the specific file, so a blanket "much faster" would be a conclusion I'd definitely be reluctant to draw. I've not done any comparative timings but it stands to reason that textscan is probably not as quick as fscanf or even textread given the extra stuff it does/can do in comparison. Probably an even more fair comparison would be to importdata, though, which has to try to figure out what it's looking at on the fly.

I still wish one of two things had occurred...

1) C authors had adopted the well-established FORTRAN FORMAT rules for field definitions so we had the ease of repeat counts and fixed-width fields were handled automagically, or

2) TMW had stayed with the very earliest roots of Matlab and its earliest heritage with FORTRAN and retained that part of Fortran for its i/o. They kept the internal storage order; why not FORMAT?

dpb on 16 Mar 2015

Edited: dpb on 17 Mar 2015

Since the topics are so disparate I make a separate comment on your note

"Can we be sure that "undocumented" behavior of textscan doesn't change between releases?"

I'm not sure which feature you think is undocumented, Per? I've had an argu^h^h^h^h, uh, "discussion" with Bruno over some aspects of disparities between fscanf and friends and textscan over interpretations of format specifiers wherein different functions returned different results with the same C format string on the same file.

Bruno initially said this was ok, as there was no documentation that specifically said what the return should be; my contention was (and is) that while TMW doesn't provide complete documentation, the docs for all of these functions end up referencing the C Standard, hence by that reference the results should be expected to be consistent with the official interpretation the Standards body would rule in cases of interpretation.

Now, since Matlab is a proprietary language owned solely by TMW, there's no recourse to anywhere but TMW in any such case and they can choose to be as compliant or non-compliant as they deem appropriate but there is at least a place to hang your hat in expecting them to have a given result.

As for changes between releases, that's a major issue in my mind with using Matlab for anything except exploratory and non-critical applications. One just can't really trust it for other work as they do change behavior excessively rapidly in my view, with non-compatible results that require excessive version testing when new features are introduced so frequently. Compilers and language Standards evolve much more slowly and so one gives up the convenience for stability, and it's got to be an institution-specific evaluation of what is/is not an acceptable balance between the two.

While it's now been quite a long time since I left the nuclear power field where licensing issues with the NRC on codes for safety-related analyses were a direct and daily concern, at that time it would have been extremely difficult to keep any computations done in a product such as Matlab compliant with the NRC requirements for validation to the point that the amount of manpower required to do so would have prevented any attempt to make it a safety-related computation engine. My guess is that is probably still the case altho I have no current knowledge of the vendors.

dpb on 16 Mar 2015

Edited: dpb on 16 Mar 2015

OK, one more, I missed the last sentence in earlier comment up 'til proofreading the last comments...

"... would like a mex-file, which reads simple fixed-width "csv-files" the old "fortran way""

Indeed, before retiring I had built a moderately complete function that took in a Fortran format string and used Fortran to read the file/character array for internal i/o given that string.

Unfortunately, in the move from the employer to self-employed, it appears the only copy of the source for that was on the employer machine and by the time I discovered it was missing the particular machine had been re-purposed and all the old files were gone. I've not had the motivation to rebuild it after retiring from the consulting gig and coming back to the family farm; the biggest complexity in it was returning the data back to Matlab owing to the need to construct appropriate data structures for the general outputs possible to come back. Very simple things like fixed numeric arrays wouldn't be too bad (similar to the limitations for csvread, say).

At one time I thought I had submitted this routine to TMW as an enhancement but a conversation in a cs-sm thread similar to this w/ Steven Lord indicated he had done some looking but if so it's no longer in their database. Whether the wish is in the database of requested enhancements at all I don't know.

dpb on 17 Mar 2015

Edited: dpb on 17 Mar 2015

"...understand that "PITA" in this context is not a kind of bread."

chuckles The english-specific (and perhaps American?) acronym is why I occasionally write as pit[proverbial]a[ppendage] on the hope it'll be decipherable even by non-native speakers. :)

I'm curious still, though, Per as to what you think may be "undocumented" behavior or did my rationalization above of referring back to the underlying C formatting cure the question (even if it's not a definitive answer)?

I'll amplify one specific point -- I'm no C whizard, but my reading of the Standard and documentation of the standard functions descriptions leads me to believe that the above behavior is that expected from C -- filled fixed-width fields should parse correctly but a blank field of the same width ends up getting "eaten" by an (imo) ill-conceived definition of white space that is still active even with the explicit width field.

I believe that versions of Matlab earlier than R2012b, and particularly textscan, had some bugs; I've not done sufficient testing to know whether all the edge cases that have been discovered are resolved nor even whether the aforementioned discrepancies between textread|fscanf|and friends and textscan are yet resolved or not, but it appears TMW has made conscious efforts to make them more consistent than they were previously.

I tried again, to date I've not been able to get an installed compiler to successfully link against the R2012b mex libraries; this area is another of exceeding frustration I have...I remember in early years I had no problems but since R14 I've never gotten anywhere with mex so have basically given up trying. Trying to decipher the install process is hopeless; it's so convoluted in config files and perl or whatever it is they use that it's impossible to make heads nor tails of... :(

dpb on 17 Mar 2015

Edited: dpb on 18 Mar 2015

Ah yes, I remember that one well!!! :) 'Twas well done, indeed...

I see in the doc TMW has removed most of the former references to the C i/o...in a fairly brief look the only thing I see still there is the following at fprintf

"Note: The low-level file I/O functions are based on functions in the ANSI® Standard C Library. However, MATLAB includes vectorized versions of the functions, to read and write data in an array with minimal control loops."

Of course, all the "high"-level file I/O functions are based on the low-level ones and all file i/o eventually translates into the compiler runtime library so the end result is that all i/o is based on the Standard C library.

So, while it may not be expressly stated and TMW has tried to duplicate the descriptions of the format strings around various places, they all end up back at the underlying behavior of C since the language is written in C++/C and the format strings clearly mimic those of C with the extensions to vectorized behavior.

Older code still has remnants of the former documentation path...

>> help textread
textread Read formatted data from text file.
   ...
   [A,B,C, ...] = textread('FILENAME','FORMAT')
   ...
   FORMAT string.  If there are fewer fields in the file than in the
   format string, an error is produced.  See FORMAT STRINGS below for
   more information.
   ...
   FORMAT STRINGS
    ...
   Supported conversion specifications:
       %n - read a number - float or integer (returns double array)
            %5n reads up to 5 digits or until next delimiter
       %d - read a signed integer value (returns double array)
            %5d reads up to 5 digits or until next delimiter
        ...
   See the Language Reference Guide or a C manual for complete 
   details.
   ...
>>

But, while they can remove the explicit references, they can't remove the behavior of the underlying rtl (runtime library). While they have their own for the vectorized functionality, it still has to eventually translate to the base language rtl or they have to replace that whole functionality, too.

'123456 23456  34561234567   4567     67'
 000000000111111111122222222223333333333
 123456789012345678901234567890123456789

Your example also illustrates a case I forgot to test that is, by all reasonable meanings of "fixed width" field, broken. But, it is not broken according to the C Standard; it is, in fact, the expected behavior owing to the definition of the field width parameter for numeric scanning as not actually counting columns but convertible characters after a delimiter and that multiple delimiters are counted as one. You'll note that in the field that failed ' 34561234567' that the returned value with the '%6f' format applicable did return the value 345612 which is, in fact, only six characters in length(*) albeit it began counting with the leading '3' instead of with the blank in the 13th column of the overall string where Fortran would have done. So it did "honor" the width, it just has a different definition of what "honoring" means. There's a place where that behavior isn't documented specifically in Matlab by TMW because it is C behavior and they haven't reproduced all of the C documentation in its entirety.

(*) W/o the width specifier '%f' would scan the full substring of numeric characters until the whitespace or non-numeric character.
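For contrast, a minimal sketch of the "a column means a column" reading of the same string, done at the user level by slicing on counted widths and converting each slice separately. This is only an illustration of the idea; the variable names are arbitrary, and the widths are the ones implied by the `%6f%6f%6f%7f%7f%7f` format used earlier in the thread.

```matlab
rec    = '123456 23456  34561234567   4567     67';
widths = [6 6 6 7 7 7];

% Column positions of each field, counted strictly by width.
stops  = cumsum(widths);
starts = stops - widths + 1;

vals = zeros(1, numel(widths));
for k = 1:numel(widths)
    % Blanks inside a slice are part of that field, not separators.
    vals(k) = str2double(rec(starts(k):stops(k)));
end
disp(vals)
```

With this slicing the third field is the six characters in columns 13 to 18, '  3456', so it converts to 3456 rather than 345612.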

So, my contention is that the (increasingly meager) references to the C Standard are still significant and are, still, the underlying authority of what one should expect. Unfortunately, that isn't always a useful expectation for fixed width fields. It took me a long, long, time to finally understand enough of the C jargon to figure out that it is, according to it, indeed, the proper result, as disappointing as that is.

So, one last time..."why, oh why, couldn't they have kept Fortran FORMAT instead"???

dpb on 18 Mar 2015

Edited: dpb on 18 Mar 2015

One last comment on the documentation vis a vis C. This conundrum illustrates the most fundamental problem with TMW's documentation in that it is all heuristic and example-based, rather than there being a formal definition of required behavior as a base line then a users' manual which explains how to use it.

And, not to misunderstand, in general this approach works well and the intent to make it relatively easy to use the product is both understandable and reasonable; the problem comes in the hard parts and the edge cases that defy simplification and have so many options and then in this case, rely on behavior besides that of Matlab itself but of the C language i/o. It's a big job but not having the complete "Standardese" specifications to which the C Standard Library functions are written or a precise description of every one of those points in the documentation leads to the inevitable holes that some areas simply aren't covered by TMW; you either have to go look up what C says should happen or find out by trial and error, in which case unless you know C you don't know if that's right or not.

It's not clear to me whether behind the scenes TMW does have such a master definition or not; I presume there must be for the core language elements, but for the multitude of functions I have no idea how they actually make the final decisions on what behavior is accepted and whether there's an a priori requirement to which a function is given or whether there's a more general target area being addressed with a function like textscan and it just evolves by development and then at some point before release is considered to be an ok definition that then gets updated periodically as quirks are found and features added, removed, refined based on user experience. That things can be dynamic is obvious with functions like interp1 which has had some fairly significant changes in functionality and behavior with recent releases.

Of course, that level of integration of higher level function tends not to be present in other language standards; closest I would know of would be the standard template library with C++ and I'm just not familiar enough with it to be able to even begin to draw comparisons; certainly Matlab has a much wider range of areas of integration even in the base product and continues to add more at a breathtaking pace...

Anyway, the point is, that the documentation for complicated functions such as textscan attempts to explain by example and as good a job as can of what the various inputs mean, but when it's this complicated and in this case relies on behavior that is borrowed from somewhere else, unless that "somewhere else" is fully documented then the TMW documentation is, inevitably, incomplete.

per isakson on 21 Apr 2015

Edited: per isakson on 2 May 2015

@dpb, No I'm not bored, but I don't know what to add. The topics are large and deserve separate threads, which conclude in collective recommendations/wishes to The MathWorks.

Regarding the documentation in general, I think that it's okay and is being improved over time. However, I have some wishes: more elaborate problem-oriented examples, e.g. "How to read fixed width text file" and "How to read multi-section text files"; an easier way to find a function when I've forgotten its name; and more.
textscan: Knowledge of C might help to understand why, but must not be required. (If it is needed, that should be clearly stated.) Regarding 'Delimiter','' and 'Whitespace','' I still think they don't cause the promised behavior. Recently, I read some documentation on Fortran format specifiers. We only want a small subset.
The documentation of the low-level HDF5 functions is another example of referring to third-party documentation. In this case I think it is okay.
The documentation on regular expressions is okay, I think. However, more elaborate examples would be useful.

dpb on 21 Apr 2015

Edited: dpb on 22 Apr 2015

@Per, I agree these are large subjects and I've said my piece with TMW on much of the documentation issues with some direct interaction with a person on the documentation team--it was, in fact, that feedback and interaction which ultimately led to the present license I'm using that lets me be at least reasonably closely in touch with current releases (altho the limitations of memory capacity of my present machine mean going much farther is seemingly going to have to wait until I can get a more powerful box).

I think I have made enough of an impression on some areas it may eventually show up in a small way but much of what I'm not terribly fond of is just too far gone to ever get back until there's another major paradigm shift by the OS vendor. Hopefully like many fads/trends, we'll see a retrogression back towards what worked well before as they discover that "pretty isn't necessarily useful".

textscan isn't the only instance that relies on C -- all i/o does and it's implicit. I agree the doc should bring that into play but most who are being introduced to Matlab don't know C, either, so it's a conundrum that ideally it should be fully self-documented.

Unfortunately, C i/o scanning is complex and counter-intuitive in many cases when it's out of the routine. I'm sure it gets even more complex than C Standard Library addresses when it is vectorized as TMW has done.

The 'delimiter' and 'whitespace' options are an extension so there we're back to the issue that whether it "does what it promises" or not is indeterminate as we're not privy to the actual design spec. There isn't any such option in the Standard library; that's an introduced behavior of the higher-level wrappers TMW has written around the low-level formatted i/o calls.

I saw a link to a recent comment on the previous answer but couldn't tell for sure what was actually an addition/modification. I don't think there's a problem with the regular fixed-width file as long as there are no missing fields. Then is when it is just irrecoverably broken.

I agree a subset of FORMAT would solve much -- over the last couple of weeks I've managed to get a Fortran compiler working with R2012b here so I can now (finally) build mex files again. My goal is to build a small sample of a functional capability I had some years ago that will let one specify a FORMAT expression and then use the mex file to use Fortran to actually do the read and return the results. I used it to solve the problem quite extensively years ago but lost the source/compiled files when left the employer to go out on own as simply forgot to collect them from the machine before it was reformatted and didn't have a copy at the house.

ADDENDUM

As a specific suggestion, perhaps an additional 'FixedFieldWidth' flag could be introduced. If set, it would tell textscan to actually use the counted fields irrespective of content for scanning instead of the default C behavior of neglecting "extra" white space. This would have the desired effect and place the busy-work of subsetting strings into the lower-level mex'ed code instead of having to deal with it at the user level. On first blush, when the flag is set the 'EmptyValue' option could default, a la Fortran, to zero instead of NaN. I'd think that would solve >95% of the problems I've seen users have.
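Until something like that exists, the effect can be approximated at the user level. The sketch below is not a textscan feature; it only imitates the suggested behavior on a made-up record whose middle field is blank, and `emptyValue` is a hypothetical name standing in for the 'EmptyValue' idea.

```matlab
% A record with three 6-character fields; the middle one is blank.
rec        = [' 4.309', blanks(6), ' 4.135'];
widths     = [6 6 6];
emptyValue = 0;                       % hypothetical: Fortran-like default

% Split strictly by counted width, then convert each piece.
fields = mat2cell(rec, 1, widths);    % 1x3 cell of 6-character strings
vals   = str2double(fields);          % a blank field converts to NaN

% Replace the blank-field result with the chosen empty value.
vals(isnan(vals)) = emptyValue;
disp(vals)
```

The blank field keeps its position in the record instead of being skipped, which is the point of counting columns. Note that this simple version also maps a field that fails to convert for any other reason to `emptyValue`.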

dpb on 25 Apr 2015

@Per -- I finally seem to have managed to get a setup that allows me to mex Fortran files under the R2012b release (albeit w/ a custom setup rather than integral to mex but it appears to be functional even if a little more of a bother than should be necessary).

I'm going to try to build a very rudimentary first pass, to refresh my recollection, of a textscan workalike that uses FORMAT statement forms for the format string and returns variables (or an array of numerics if all are the same type, as an enhancement over base textscan functionality) and then we can test it on the various fixed-width files herein.

BTW, in looking at the documentation, I see some notes on whitespace that I had kinda' forgotten about that explain a lot of the behavior that causes the problem in the case with interpreting a blank field. From the textread doc it says

"textread matches and converts groups of characters from the input. Each input field is defined as a string of non-white-space characters that extends to the next white-space or delimiter character, or to the maximum field width. Repeated delimiter characters are significant, while repeated white-space characters are treated as one." (Emphasis added) It's that behavior that I tried to explain earlier that's the key difference between Fortran and C in their interpretation of what "fixed width" really means.

So then I went back to textscan to see if it has the same words -- not quite, but a description that (I think) means the same thing:

"textscan does not include leading white-space characters in the processing of any data fields. When processing numeric data, textscan also ignores trailing white space."

This is what has the effect of "mushing together" fields to leave missing fields at the end of a record instead of intended to be significant fields containing whitespace within a record (or at the end also, if that's their positional location).
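A small illustration of that rule, using sscanf on part of the test string from earlier in the thread. This is a sketch of what I would expect from C-style scanning, where the width counts characters only after leading white space has been skipped; it has not been checked across releases.

```matlab
% First three fields of the earlier test string; the third field is '  3456'
% in columns 13-18, and it abuts the next field, which starts with '1234567'.
rec = '123456 23456  34561234567';

% Read three values, each with a maximum field width of 6.
vals = sscanf(rec, '%6f', 3);
disp(vals.')
```

The third value comes out as 345612 instead of 3456 because the two leading blanks are dropped before the six characters are counted.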

per isakson on 2 May 2015

expand_format_spec.m

@dpb - That's great that you now have a setup to make mex-functions with fortran.

I have made a quick-and-dirty function, expand_format_spec (attached) to make format-strings for textscan. They are more readable and less error prone. It's basically textscan specifier together with fortran multipliers. Here are three variants, which expand to the same textscan specifier.

    disp( expand_format_spec( '%*12s,2(%*8f),3(%6.2f),12(%8d)' ) )
    disp( expand_format_spec( '%*12s 2(%*8f) 3(%6.2f) 12(%8d)' ) )
    disp( expand_format_spec( '%*12s2(%*8f)3(%6.2f)12(%8d)' ) )
    %*12s%*8f%*8f%6.2f%6.2f%6.2f%8d%8d%8d%8d%8d%8d%8d%8d%8d%8d%8d%8d
    %*12s%*8f%*8f%6.2f%6.2f%6.2f%8d%8d%8d%8d%8d%8d%8d%8d%8d%8d%8d%8d
    %*12s%*8f%*8f%6.2f%6.2f%6.2f%8d%8d%8d%8d%8d%8d%8d%8d%8d%8d%8d%8d
    >>

I think that the first one (with commas) is the best (i.e. most readable), and that something like this would fit better in the Matlab environment than a syntax that mimics Fortran's FORMAT.
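For readers without the attachment, the expansion of the repeat counts can be approximated in a few lines. This is a hypothetical sketch, not the attached expand_format_spec.m, which may work differently. It assumes groups are written as n(...) without nesting and are separated by optional commas or blanks.

```matlab
% Hypothetical sketch; the attached expand_format_spec.m is not shown here.
spec = '%*12s,2(%*8f),3(%6.2f),12(%8d)';

% Replace each "n(...)" group by n copies of its contents, using a dynamic
% expression in the regexprep replacement string.
expanded = regexprep(spec, '(\d+)\(([^)]*)\)', ...
                     '${repmat($2, 1, str2double($1))}');

% Drop the optional comma or blank separators between groups.
expanded = regexprep(expanded, '[,\s]', '');

disp(expanded)
```

Because separators are simply removed in a second step, a sketch like this cannot handle a specifier that needs a literal comma or blank in the format string.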

I look forward to your function.

Yes, 'FixedFieldWidth' is a possibility. However, textscan is already too complicated with all its Name-Value Pair Arguments.

dpb on 10 May 2015

Edited: dpb on 10 May 2015

The problem in the latter is the [] around the display result forces a cast to the lower precision...try just

    cac{:}

to see the actual content of each cell.
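A small illustration of the cast, assuming the cell array holds values of different numeric classes. The contents of cac below are made up and are not the result from the earlier post.

```matlab
% Hypothetical stand-in for a textscan result with mixed numeric classes.
cac = {int32(12), 3.14159};

joined = [cac{:}];      % concatenation converts everything to int32
disp(class(joined))     % the integer class wins over double
disp(joined)            % the fractional part of the second value is gone

cac{:}                  % each cell is shown with its own class and value
```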

On the initial query, we're back to the question of what the design document says the TMW-specific version of the C format string is supposed to do (which gets back to my previous complaint that we don't have an actual language specification, only the descriptive documentation and the oblique reference to C). I don't know C well enough to know what the C99 Standard Library actually says on this point. I'll refer back to Fortran FORMAT behavior as the gold standard of what it should do (but as noted have no idea whether that's consistent w/ expected behavior from a conformant C compiler or not; my initial guess is "not").

1. On input, the F data edit descriptor transfers w characters from an external field and assigns their real value to the corresponding I/O list item. The external field data must be an integer or real constant.
2. If the input field contains only an exponent letter or decimal point, it is treated as a zero value.
3. If the input field does not contain a decimal point or an exponent, it is treated as a real number of w digits, with d digits to the right of the decimal point. (Leading zeros are added, if necessary.)
4. If the input field contains a decimal point, the location of that decimal point overrides the location specified by the F descriptor.
5. If the field contains an exponent, that exponent is used to establish the magnitude of the value before it is assigned to the list element.
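The rules quoted above can be sketched roughly as follows. This is only an approximation under stated assumptions, not a full Fortran reader: it assumes an F8.3 descriptor, treats an all-blank field as zero, and the sample fields are made up.

```matlab
% Sketch of the quoted F w.d input rules; sample fields are hypothetical.
d      = 3;                                   % assumed F8.3 descriptor
fields = {'   4.309', '    4309', '        ', '       .', ' 4.309E1'};
vals   = zeros(size(fields));

for k = 1:numel(fields)
    s = strtrim(fields{k});                   % each field is exactly w = 8 characters
    if isempty(s) || any(strcmpi(s, {'.', 'E', 'D'}))
        vals(k) = 0;                          % blank, lone point or lone exponent letter
    elseif any(s == '.')
        vals(k) = str2double(s);              % an explicit point overrides d
    else
        vals(k) = str2double(s) / 10^d;       % implied point: d digits to the right
    end
end

disp(vals)
```

The second sample shows the implied decimal point: the digits 4309 read under F8.3 give the same value as the explicit 4.309 in the first field.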

We already know about the problems of interpreting the field width itself in C (and hence Matlab) owing to blanks and the necessity of a delimiter in some instances.

I think the rule that an explicit decimal point overrides the FORMAT [precision] field is the better solution; it leads to far less confusion overall, although one can't interpret the above record without either

1. An explicit 1X to skip the delimiter, or
2. A compiler-dependent extension that recognizes the comma as a delimiter(*)

(*) At least one commercial compiler that I'm aware of has such a switch. The prime use of this feature, which is NOT Standard behavior, is to terminate shorter-than-W fields; otherwise Fortran FORMAT will read as many characters as needed to fulfill the READ per the FORMAT statement. This underlying behavior may be what prompted the modification in the original C behavior; it does make for more consistent language-reading behavior, as fields are basically considered to be the equivalent of words in a sentence, as a rough analogy. That's great for much text processing or visually scannable data files, but not so good for data files that may be computer generated and where the position is significant. It's a difference in point of view of the developers at the time, methinks; K&R weren't really that concerned with "serious compute" applications.

per isakson on 10 May 2015

Edited: per isakson on 10 May 2015

"the display result forces a cast to the lower precision" Thanks! I was too occupied with textscan to think of that.

Over the years there has been a remarkable number of new and updated Matlab functions to read flat text files, and many toolboxes have their own variants. Obviously, TMW wants to enhance the "user experience". With my comment I just wanted to say that they should try even harder; "constant dropping wears the stone".

Thank you for the detailed explanation. However, IMO, there are way too many subtle details for a "high level language" like Matlab.

I'll be back with a new thread.

dpb on 10 May 2015

Edited: dpb on 11 May 2015

"...there are way too many subtle details for a "high level language" like Matlab."

I'd concur although I think it's inevitable given the choice of the underlying implementation; it's just inherent with the way the C library operates for these kinds of cases and there is so much generality that one must be able to handle to make a truly universal tool.

IMO it would help, however, if the documentation were written as a definitive normative description with sufficient detail that one could infer the result simply from the TMW-supplied help files. But then they would be so complex that

1. Nobody would read them, and
2. It would take a "language lawyer" to parse the result in the exotic cases if they did.

The latter is a frequent topic at comp.lang.fortran, where one of the regulars is a former editor of the Standard and there are regular discussions and disagreements as to whether a given construct is or is not "standard".

ADDENDUM

BTW, it is the combination of #1 and #3 above that is perhaps the most critical difference between Fortran and C in the interpretation of fixed-width field input: W characters are read irrespective of a presumed interpretation as "white space" (1), and a field is "zero-filled" as necessary (3), so that a blank field is thus NOT presumed empty. (Hmmmm....interesting thought--would that mean that for your function you could use the "read character array to memory" idea and do a global substitution of zeros for blanks, and then the field width count from existing textscan would work? Not sure off the top of my head whether it would be totally general or not, but it's an intriguing thought, methinks...)
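A quick sketch of the zero-substitution idea, using sscanf on a single in-memory record rather than textscan to keep it short. The record (three unsigned integer fields of width 5, the middle one blank) is made up for illustration.

```matlab
% Hypothetical record: three integer fields of width 5, the middle one blank.
rec = ['   42', blanks(5), '   17'];

% As-is: whitespace is skipped, so the blank field is lost.
asRead = sscanf(rec, '%5d').';

% Blanks replaced by zeros first: every field now fills its full width.
zeroFill = sscanf(strrep(rec, ' ', '0'), '%5d').';

disp(asRead)
disp(zeroFill)
```

It does not look totally general: a signed field such as ' -0.27' turns into '0-0.27' after the substitution, so the sign no longer leads the field and would need separate handling.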
