Parsing a file with nonstandard format

Let's say I need to get data out of a file, but I can't count on the exact format of the entire file. All I can count on is the text immediately preceding or following the data I need. In particular, let's say the text "<x>45.234</x>" exists somewhere in the file and what I need is that number. It might be on its own line, or not. It might have other text around it, or not. There is an unknown number of these data points in the file. How do I read all such numbers into an array, in the order they appear? An example file is attached.

Accepted Answer

Walter Roberson on 6 Mar 2016
filecontent = fileread('YourFileName.txt');
xval_string = regexp(filecontent, '(?<=<x>)-?[0-9.]+(?=<x>)', 'match');
xval = str2double(xval_string);
This accounts for a possible negative sign immediately before the number, but otherwise restricts numbers to being composed of digits and '.'. It does not validate the numbers well, and would (for example) cheerfully accept '134.8..171.9.'. It can be improved, but before that is done it would be best to have documentation on exactly what forms a "number" is permitted to take. Spaces after a leading minus? Unary plus permitted? For numbers with absolute magnitude less than 1, can the leading 0 be omitted? Exponential format permitted? Which characters can be used as the exponent character? How many digits of exponent? Is '.' permitted immediately before the exponent character, or will there always be a digit before the exponent character? And so on.
  2 Comments
Morpheuskibbe on 6 Mar 2016
Edited: Morpheuskibbe on 6 Mar 2016
I tested this on an example file. The total contents were read into the "filecontent" string. However, the xval_string and xval variables were completely empty.
Let me attach said example. I don't believe there will ever be a space in the number, and I think that numbers with absolute value less than 1 will have a leading 0.
Walter Roberson on 6 Mar 2016
I missed the '/' in the closing tag.
filecontent = fileread('test3.txt');
xval_strings = regexp(filecontent, '(?<=<x>)-?\d+(\.\d+)?(?=</x>)', 'match');
xvals = str2double(xval_strings);
Note that this will get all of the x values without paying attention to which mesh the values are in.
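If it turns out that a unary plus or exponential notation can appear, the pattern could be extended along the following lines. This is only a sketch: the sample text is made up for illustration, and it assumes a digit always precedes any decimal point, so a value like '.5' would be skipped.

```matlab
% Hypothetical sample text standing in for the file contents.
filecontent = sprintf('junk <x>45.234</x> more\n<x>-0.5</x><y>1</y><x>+1.2e-3</x> <x>7</x>');

% Optional sign, digits, optional fraction, optional exponent.
% Assumption: at least one digit always comes before any decimal point.
num_pattern = '[+-]?\d+(\.\d+)?([eE][+-]?\d+)?';
expr = ['(?<=<x>)' num_pattern '(?=</x>)'];

xval_strings = regexp(filecontent, expr, 'match');  % cell array of matched text
xvals = str2double(xval_strings);                   % numeric row vector
```

Here the <y> value is ignored because the lookbehind only accepts text that directly follows '<x>'.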
I recommend that you use xmlread() to parse this.
% All item numbering in the XML DOM is 0-based
doc = xmlread('test3.xml');
doc_objects = doc.getElementsByTagName('object');
num_obj = doc_objects.getLength;
obj_struct(num_obj) = struct('id', [], 'xyz', []);  % pre-allocate
for obj_num = 1 : num_obj
    this_obj = doc_objects.item(obj_num - 1);
    % char() converts the Java string so that str2double can parse it
    obj_id = str2double(char(this_obj.getAttribute('id')));
    obj_struct(obj_num).id = obj_id;
    this_mesh = this_obj.getElementsByTagName('mesh').item(0);
    these_x = this_mesh.getElementsByTagName('x');
    num_x = these_x.getLength;
    xyz = zeros(num_x, 3);
    for xidx = 1 : num_x
        this_x = these_x.item(xidx-1);
        x_str = char(this_x.item(0).getData);
        % Assumes <y> and <z> follow <x> directly, with no whitespace between the tags
        this_y = this_x.getNextSibling;
        y_str = char(this_y.item(0).getData);
        this_z = this_y.getNextSibling;
        z_str = char(this_z.item(0).getData);
        xyz(xidx,:) = str2double({x_str, y_str, z_str});
    end
    obj_struct(obj_num).xyz = xyz;
end
This will return a struct with two fields for every object, 'id' and 'xyz' where xyz is an N x 3 array of the x, y, z coordinates. I make the assumption that there will be exactly 1 mesh per object.
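To get back to a single array of x values in the order they appear, the per-object arrays can be stacked. A minimal sketch, using a hand-built stand-in for obj_struct (the ids and coordinates here are invented for illustration):

```matlab
% Hypothetical stand-in for the obj_struct described above: two objects.
obj_struct = struct('id', {1, 2}, 'xyz', {[1 2 3; 4 5 6], [7 8 9]});

all_xyz = vertcat(obj_struct.xyz);  % stack every object's N x 3 block
all_x = all_xyz(:, 1);              % x column, objects in document order
```

The same indexing with column 2 or 3 gives the y or z values.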
