MATLAB Answers

Extracting Data field of a Series in HTML file

3 views (last 30 days)
b
b on 7 Apr 2020
Commented: b on 10 May 2020
In an HTML file, there is a section like this :
series: [{
name: 'Numbers',
color: '#33CCFF',
lineWidth: 5,
data: [45,78,84,91,111,125,178,231,274,283,303,333] }],
How to extract the 'data' field into an array in a matlab code ?
There are many such series' in that same HTML file with different 'name' fields. For example, name: 'Total Value', 'Log Scale', 'Base Value' etc.

  4 Comments

Show 1 older comment
b
b on 7 Apr 2020
Thanks, but only want that particular series which matches name='Numbers'. Not the other series'.
Also, findElement does not seem to understand regexp.
Mohammad Sami
Mohammad Sami on 7 Apr 2020
are you parsing the html in Matlab as char array ? regexp is for string, cellstr or char data.
you can easily change the pattern to name: \'Numbers\'
b
b on 7 Apr 2020
I am trying to do the following:
url="c:\finCase\case1.html";
code=webread(url);
tree=htmlTree(code);
selector="series";
subtrees=findElement(tree,selector);
The subtrees field is empty whereas it should have all the series' corresponding to various names ('Numbers', 'Total Value', 'Log Scale', 'Base Value' etc).

Sign in to comment.

Accepted Answer

per isakson
per isakson on 7 Apr 2020
Edited: per isakson on 9 May 2020
I misunderstood your question. This is a bit of overkill.
Assumptions
  • the string, series:, always indicates the start of a block of interest
I created a sample file, cssm.txt, which I uploaded. (Matlab Answers doesn't allow the extension .html ).
This script reads all blocks
%%
chr = fileread('cssm.txt');
cac = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match' );
%%
len = length( cac );
series(1,len) = struct( 'name','', 'color','', 'lineWidth',[], 'data',[] );
for jj = 1 : len
txt = regexp( cac{jj}, '(?<=name\:)[^,]+', 'match', 'once' );
txt(txt== '''') = [];
series(jj).name = matlab.lang.makeValidName( txt );
txt = regexp( cac{jj}, '(?<=color\:)[^,]+', 'match', 'once' );
txt(txt== '''') = [];
series(jj).color = txt;
txt = regexp( cac{jj}, '(?<=lineWidth\:)[^},]+', 'match', 'once' );
series(jj).lineWidth = str2double( txt );
txt = regexp( cac{jj}, '(?<=data\:)[^}]+', 'match', 'once' );
series(jj).data = str2num( txt ); %#ok<ST2NM>
end
and extract "series which matches name='Numbers'. Not the other series'."
>> series(strcmp({series.name},'Numbers')).data
ans =
45 78 84 91 111 125 178 231 274 283 303 333
In response to comment below
Assumptions
  • the string, series:, always indicates the start of a block of interest
  • the string, }], indicates the end of a block of interest
  • all html-files of interest are named index.html
  • all files named index.html are of interest
  • all html-files of interest are in subfolders under a root-folder, ...\finCase
  • every html-file, index.html, contains exactly one block that has a specific value of the field name:, e.g. Numbers
The overkill is still there. However, reading and parsing four html-files (copies of cssm.txt ) takes less than 10ms.
Try
>> client_data = read_client_data( 'd:\m\cssm\finCase', 'index.html', 'Numbers' )
client_data =
4×2 cell array
{'anderson' } {1×9 double}
{'kim-j-clijsters'} {1×10 double}
{'paul-judd' } {1×11 double}
{'simmi' } {1×12 double}
>>
where (in one m-file)
function client_data = read_client_data( root, file, name )
sad = dir( fullfile( root, '**', file ) );
len = length( sad );
client_data = cell( len, 2 );
for jj = 1 : len
cac = strsplit( sad(jj).folder, filesep );
client = cac{end};
series = read_one_file_( fullfile( sad(jj).folder, sad(jj).name ) );
client_data(jj,:) = { client, series(strcmp({series.name},name)).data };
end
end
function series = read_one_file_( file )
chr = fileread( fullfile( file ) );
cac = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match' );
len = length( cac );
series(1,len) = struct( 'name','', 'color','', 'lineWidth',[], 'data',[] );
for jj = 1 : len
txt = regexp( cac{jj}, '(?<=name\:)[^,]+', 'match', 'once' );
txt(txt== '''') = [];
series(jj).name = strtrim( txt );
txt = regexp( cac{jj}, '(?<=color\:)[^,]+', 'match', 'once' );
txt(txt== '''') = [];
series(jj).color = txt;
txt = regexp( cac{jj}, '(?<=lineWidth\:)[^},]+', 'match', 'once' );
series(jj).lineWidth = str2double( txt );
txt = regexp( cac{jj}, '(?<=data\:)[^}]+', 'match', 'once' );
series(jj).data = str2num( txt ); %#ok<ST2NM>
end
end
TODO: add error handling and comments

  10 Comments

Show 7 older comments
b
b on 9 May 2020
This code has been a life-saver. But there is just one more issue. The 'Data' field sometimes has null values :
series: [{
name: 'Numbers',
color: '#33CCFF',
lineWidth: 5,
data: [null,null,null,null,45,78,84,91,111,125,178,231,274,283,303,333] }],
in which case,
>> series(strcmp({series.name},'Numbers')).data
ans =
[]
instead of
>> series(strcmp({series.name},'Numbers')).data
ans =
0 0 0 0 45 78 84 91 111 125 178 231 274 283 303 333
Why are the null values giving empty data field ?
per isakson
per isakson on 9 May 2020
A nice thing with standards is that there are so many to chose between. Null (or NULL) is a special marker used in Structured Query Language to indicate that a data value does not exist in the database [Wikipedia]. However, Matlab doesn't honor Null.
Replace the statement
series(jj).data = str2num( txt ); %#ok<ST2NM>
by
out = textscan( txt , '%f' ...
, 'CollectOutput' , true ...
, 'Delimiter' , ',' ...
, 'EmptyValue' , 0 ...
, 'TreatAsEmpty' , 'null' ...
, 'Whitespace' , ' \t[]' );
series(jj).data = reshape( out{:}, 1,[] );
and read about textscan in the documentation.
b
b on 10 May 2020
LOL on the tragedy of being Null.
The code section works nicely with output as needed.
Indebted once again.

Sign in to comment.

More Answers (0)