Extracting Data field of a Series in HTML file

Question

b on 7 Apr 2020

https://au.mathworks.com/matlabcentral/answers/515926-extracting-data-field-of-a-series-in-html-file

Commented: b on 10 May 2020

An HTML file contains a section like this:

        series: [{
            name: 'Numbers',
            color: '#33CCFF',
            lineWidth: 5,
            data: [45,78,84,91,111,125,178,231,274,283,303,333]        }],

How can I extract the 'data' field into an array in MATLAB code?

The same HTML file contains many such series with different 'name' fields, for example name: 'Total Value', 'Log Scale', 'Base Value', etc.

4 Comments

Mohammad Sami on 7 Apr 2020

Are you reading the HTML into MATLAB as a char array? regexp works on string, cellstr, or char data.

You can easily change the pattern to name: \'Numbers\'
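
The pattern referred to in this comment is not shown in the thread. As a minimal sketch of the idea, assuming the file content is already in a char array and the block looks like the snippet in the question, a name-specific pattern could capture the numbers after data: directly. The pattern and the variable names below are illustrative, not the commenter's own.

```matlab
% Stand-in for the file content, copied from the snippet in the question.
chr = strjoin({ ...
    'series: [{', ...
    '    name: ''Numbers'',', ...
    '    color: ''#33CCFF'',', ...
    '    lineWidth: 5,', ...
    '    data: [45,78,84,91,111,125,178,231,274,283,303,333]        }],' }, newline );

% Hypothetical pattern: find name: 'Numbers', stay inside the same {...}
% block ([^}]*?), and capture what sits between the brackets after data:.
pat = 'name:\s*''Numbers''[^}]*?data:\s*\[([^\]]*)\]';
tok = regexp( chr, pat, 'tokens', 'once' );

% tok{1} is the comma-separated list as text; convert it to a numeric row vector.
data = str2double( strsplit( tok{1}, ',' ) );
```

Swapping 'Numbers' for another name such as 'Total Value' would target a different series, provided name: comes before data: within the block.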

b on 7 Apr 2020

I am trying to do the following:

url="c:\finCase\case1.html";
code=webread(url);
tree=htmlTree(code);
selector="series";
subtrees=findElement(tree,selector);

The subtrees variable is empty, whereas it should contain all the series corresponding to the various names ('Numbers', 'Total Value', 'Log Scale', 'Base Value', etc.).
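
One likely reason for the empty result, stated here as an assumption because the full file is not shown: the series: block looks like part of a JavaScript chart configuration inside a script element, not an HTML element named series, so an element selector has nothing to match. The sketch below builds a small stand-in page (hypothetical content), reads it with fileread, which takes a local path, and shows that "series" appears only as text.

```matlab
% Write a small stand-in page to a temporary file (hypothetical content,
% modelled on the snippet in the question).
txt_lines = { ...
    '<html><body><div id="chart"></div><script>', ...
    'var chart = { series: [{', ...
    '    name: ''Numbers'',', ...
    '    data: [45,78,84]        }],', ...
    '};', ...
    '</script></body></html>' };
file = [ tempname, '.html' ];
fid  = fopen( file, 'w' );
fprintf( fid, '%s\n', txt_lines{:} );
fclose( fid );

% Read the local file as plain text, then clean up the temporary file.
code = fileread( file );
delete( file );

% Count <series ...> tags versus "series:" keys in the text.
n_series_tags   = numel( regexp( code, '<series[\s>]', 'match' ) );
n_series_blocks = numel( regexp( code, 'series\s*:', 'match' ) );
```

If the real file behaves like this stand-in, treating the content as text and parsing it with regexp is the more direct route.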

Answer 1

per isakson on 7 Apr 2020

https://au.mathworks.com/matlabcentral/answers/515926-extracting-data-field-of-a-series-in-html-file#answer_424483

Edited: per isakson on 9 May 2020

cssm.txt

I misunderstood your question. This is a bit of overkill.

Assumptions

the string, series:, always indicates the start of a block of interest

I created a sample file, cssm.txt, which I uploaded. (MATLAB Answers doesn't allow the extension .html.)

This script reads all blocks:

%% Read the file and cut out every block that follows "series:" and ends with "}],"
chr = fileread('cssm.txt');
cac = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match' );
%% Parse each block into one element of a struct array
len = length( cac );
series(1,len) = struct( 'name','', 'color','', 'lineWidth',[], 'data',[] ); 
for jj = 1 : len
    
    % name: text up to the next comma, with the single quotes removed
    txt = regexp( cac{jj}, '(?<=name\:)[^,]+', 'match', 'once' );
    txt(txt== '''') = [];
    series(jj).name = matlab.lang.makeValidName( txt );
    % color: same approach as for name
    txt = regexp( cac{jj}, '(?<=color\:)[^,]+', 'match', 'once' );
    txt(txt== '''') = [];
    series(jj).color = txt;
    
    % lineWidth: a single number
    txt = regexp( cac{jj}, '(?<=lineWidth\:)[^},]+', 'match', 'once' );
    series(jj).lineWidth = str2double( txt );
    
    % data: everything up to the closing brace, e.g. " [45,78,...]  "
    txt = regexp( cac{jj}, '(?<=data\:)[^}]+', 'match', 'once' );
    series(jj).data = str2num( txt );  %#ok<ST2NM>
end

and this extracts the series whose name is 'Numbers', and none of the others:

>> series(strcmp({series.name},'Numbers')).data
ans =
    45    78    84    91   111   125   178   231   274   283   303   333
    
    

In response to the comment below

Assumptions

the string, series:, always indicates the start of a block of interest
the string, }], indicates the end of a block of interest
all html-files of interest are named index.html
all files named index.html are of interest
all html-files of interest are in subfolders under a root-folder, ...\finCase
every html-file, index.html, contains exactly one block that has a specific value of the field name:, e.g. Numbers

The overkill is still there. However, reading and parsing four HTML files (copies of cssm.txt) takes less than 10 ms.

Try

>> client_data = read_client_data( 'd:\m\cssm\finCase', 'index.html', 'Numbers' )
client_data =
  4×2 cell array
    {'anderson'       }    {1×9  double}
    {'kim-j-clijsters'}    {1×10 double}
    {'paul-judd'      }    {1×11 double}
    {'simmi'          }    {1×12 double}
>> 

where (in one m-file)

function    client_data = read_client_data( root, file, name )
    % Return an N-by-2 cell array: client folder name and the data of the
    % series called NAME, for every FILE found in or below ROOT.
    
    sad = dir( fullfile( root, '**', file ) ); 
    len = length( sad );
    client_data = cell( len, 2 );
    for jj = 1 : len 
        % The client name is the last part of the folder path
        cac = strsplit( sad(jj).folder, filesep );
        client = cac{end};
        series = read_one_file_( fullfile( sad(jj).folder, sad(jj).name ) );
        % Relies on the assumption that exactly one series matches NAME
        client_data(jj,:) = { client, series(strcmp({series.name},name)).data };
    end
end
function    series = read_one_file_( file )
    % Parse all "series:" blocks of one file into a struct array.
    
    chr = fileread( fullfile( file ) );
    cac = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match' );
    
    len = length( cac );
    series(1,len) = struct( 'name','', 'color','', 'lineWidth',[], 'data',[] );
    
    for jj = 1 : len
        
        % name: text up to the next comma, quotes and outer blanks removed
        txt = regexp( cac{jj}, '(?<=name\:)[^,]+', 'match', 'once' );
        txt(txt== '''') = [];
        series(jj).name = strtrim( txt );
        
        % color: text up to the next comma, quotes removed
        txt = regexp( cac{jj}, '(?<=color\:)[^,]+', 'match', 'once' );
        txt(txt== '''') = [];
        series(jj).color = txt;
        
        % lineWidth: a single number
        txt = regexp( cac{jj}, '(?<=lineWidth\:)[^},]+', 'match', 'once' );
        series(jj).lineWidth = str2double( txt );
        
        % data: everything up to the closing brace
        txt = regexp( cac{jj}, '(?<=data\:)[^}]+', 'match', 'once' );
        series(jj).data = str2num( txt );  %#ok<ST2NM>
        
    end
end

TODO: add error handling and comments
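
As one possible starting point for the error handling, here is a minimal sketch of a lookup that fails with a clear message when the "exactly one matching series" assumption does not hold. The helper name pick_series_data_ and the error identifiers are hypothetical and not part of the answer's code; the struct array stands in for the output of read_one_file_.

```matlab
% Hypothetical struct array standing in for the output of read_one_file_.
series = struct( 'name', {'Numbers','Total Value'}, 'data', {[45 78 84],[1 2 3]} );

data = pick_series_data_( series, 'Numbers' );

function data = pick_series_data_( series, name )
    % Return the data of the one series called NAME; throw an error if
    % there is no such series or more than one.
    is_hit = strcmp( {series.name}, name );
    switch nnz( is_hit )
        case 1
            data = series(is_hit).data;
        case 0
            error( 'cssm:noMatch', 'No series named "%s" was found.', name );
        otherwise
            error( 'cssm:multipleMatches', 'More than one series is named "%s".', name );
    end
end
```

Inside read_client_data, a helper like this could replace the inline strcmp indexing, so that a file without the requested series is reported by name and does not fail on the cell assignment.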

10 Comments

per isakson on 9 May 2020

A nice thing about standards is that there are so many to choose from. Null (or NULL) is a special marker used in Structured Query Language to indicate that a data value does not exist in the database [Wikipedia]. However, MATLAB doesn't honor Null.

Replace the statement

series(jj).data = str2num( txt ); %#ok<ST2NM>

by

out = textscan( txt             , '%f'      ...
            ,   'CollectOutput' , true      ...  
            ,   'Delimiter'     , ','       ...
            ,   'EmptyValue'    , 0         ...      
            ,   'TreatAsEmpty'  , 'null'    ...
            ,   'Whitespace'    , ' \t[]'   );
series(jj).data = reshape( out{:}, 1,[] );

and read about textscan in the documentation.
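
To see the replacement in isolation, the sketch below applies the same textscan options to a made-up data field containing a null entry. The input string is hypothetical; the option values are the ones given above.

```matlab
% Hypothetical data field with one null entry, including the brackets and
% trailing blanks that the regexp in read_one_file_ leaves in place.
txt = ' [45,null,84,91]        ';

% '[' and ']' are declared as whitespace so they are skipped; 'null' is
% treated as an empty numeric field and filled with the EmptyValue, 0.
out = textscan( txt             , '%f'      ...
            ,   'CollectOutput' , true      ...
            ,   'Delimiter'     , ','       ...
            ,   'EmptyValue'    , 0         ...
            ,   'TreatAsEmpty'  , 'null'    ...
            ,   'Whitespace'    , ' \t[]'   );
data = reshape( out{:}, 1,[] );
```

Setting 'EmptyValue' to NaN instead of 0 would keep missing points distinguishable from genuine zeros, which may matter if the data is plotted or averaged later.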

b on 10 May 2020

LOL on the tragedy of being Null.

The code section works nicely with output as needed.

Indebted once again.
