How should XPath be set in TableSelector for htmlImportOptions so readtable( ) can output the first three tables in an html file?

I would like to read the first three tables in an HTML file with a single call to readtable( ), in order to reduce the time spent reading the file. However, the readtable( ) function from the database toolbox seems to read only one table at a time. I have tried setting TableSelector to several different XPath expressions. They either return an error message or only one table. For example, the one below returns the first table, but not table 2 or 3.
opts = detectImportOptions(htmlfile);
opts.TableSelector = "//TABLE[position()<4]";
readtable(htmlfile, opts)
I wonder whether, because the output argument of readtable( ) is a single table, it can read only one table at a time.
Another related question.
% Why does lowercase 'table' not work?
opts.TableSelector = "//table[1]";
readtable(htmlfile, opts)
% ans=
% 0x1 empty table
% When TABLE[1] uses uppercase letters, readtable( ) outputs the first table correctly.
opts.TableSelector = "//TABLE[1]";
readtable(htmlfile, opts)

Accepted Answer

Kevin Gurney on 30 Sep 2022
1. Is there a way to return multiple tables from readtable in a single function call?
Answer:
Currently, readtable does not support returning multiple tables with a single function call. However, your expectations for how readtable should work in this case are very reasonable, so thank you for pointing this out as something to consider for a potential future enhancement to readtable.
Conceptually, the XPath expression //TABLE that you are using is "correct", in that it does semantically mean "select all of the <table> elements in the HTML file". However, readtable currently returns only the first table in the set of tables selected by an XPath expression.
2. Why does "TABLE" need to be uppercase in the TableSelector XPath expression?
Answer:
This is a consequence of the fact that (1) XPath is case-sensitive and (2) readtable normalizes HTML element names to uppercase on import. This is a limitation of how readtable interacts with HTML files.
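Since readtable returns only the first table that a selector matches, one possible workaround is to call it once per table with a selector that matches exactly one table. The sketch below is not an official recipe. It writes a small throwaway HTML file (the file name and data are made up for illustration), assumes detectImportOptions returns HTML import options for a .html file, and assumes TableSelector accepts the parenthesized XPath 1.0 form (//TABLE)[k]. In XPath 1.0, //TABLE[k] counts tables among siblings under the same parent, while (//TABLE)[k] counts them in document order, which is usually what "the k-th table in the file" means.

```matlab
% Build a small throwaway HTML file with three tables (illustrative data only).
htmlfile = fullfile(tempdir, "three_tables_demo.html");   % hypothetical file name
fid = fopen(htmlfile, "w");
fprintf(fid, "<html><body>\n");
for k = 1:3
    fprintf(fid, "<table><tr><th>id</th><th>value</th></tr>" + ...
                 "<tr><td>%d</td><td>%d</td></tr></table>\n", k, 10*k);
end
fprintf(fid, "</body></html>\n");
fclose(fid);

% Read the first three tables, one readtable call per table.
opts = detectImportOptions(htmlfile);   % assumed to return HTML import options
numTables = 3;
tables = cell(1, numTables);
for k = 1:numTables
    % (//TABLE)[k] selects the k-th table in document order.
    % The element name is uppercase because of the normalization described above.
    opts.TableSelector = sprintf("(//TABLE)[%d]", k);
    tables{k} = readtable(htmlfile, opts);
end

delete(htmlfile);   % remove the throwaway file
```

Each entry of tables is a separate table, so a table that is missing or malformed in one file can be handled on its own.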

More Answers (1)

Christopher Creutzig on 30 Sep 2022
readtable currently only returns a single table. There has been talk about a function returning multiple tables, but I don't know of any concrete plans. It may be worth letting support@mathworks.com know you are looking for something like that – given the time things can take from inception to release, it may not always be obvious, but user demand does influence priorities.
As for lowercase table selectors … table selectors are XPath expressions, and XPath is, well, case-sensitive. Most HTML versions/variants (maybe in practice all of them) are case agnostic, although their standards differ in what they regard as the “right” casing to use. htmlTree normalizes to uppercase. But that doesn't mean we could simply treat the XPath expression as case agnostic, as it can contain parts where case matters. I'm not sure if your question is simple curiosity or if this is actually a bump in the road to solving your problems. If the latter, please let us know more.
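To illustrate why the whole XPath expression cannot simply be treated as case agnostic, here is a small string-only sketch. The tag name and the phrase are hypothetical. Uppercasing only the element name keeps the XPath function name and the quoted text intact, whereas uppercasing the entire expression changes both.

```matlab
% Uppercase only the element name; leave functions and string literals alone.
tagName  = "table";
phrase   = "Quarterly totals";   % hypothetical text that identifies a table
selector = "//" + upper(tagName) + "[contains(.,'" + phrase + "')]";
disp(selector)   % //TABLE[contains(.,'Quarterly totals')]

% Uppercasing the whole expression also changes the function name and the literal.
naive = upper("//table[contains(.,'Quarterly totals')]");
disp(naive)      % //TABLE[CONTAINS(.,'QUARTERLY TOTALS')]
```

The second selector no longer calls the XPath function contains, and its literal would no longer match text written as 'Quarterly totals'.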
Nitpick: readtable does not require Database Toolbox, it is in core MATLAB.
  1 Comment
Simon on 1 Oct 2022
Edited: Simon on 1 Oct 2022
At first, I thought this was the main cause of the slowdown in my code. After my code ran into a few more abnormalities in the HTML files, I found that it is actually better for readtable( ) to return only one table. I have a number of HTML files, each of which is supposed to contain the same number of tables I am interested in. In reality, however, some of them are missing one table or another, so it is better for my code to deal with one table at a time.
function bigT = func(htmlfile)
% Read each table of interest separately, so that a table missing
% from one file does not affect the others.
opts = detectImportOptions(htmlfile);
opts.TableSelector = "//TABLE[contains(.,'phrase_to_identify_table01')]";
T = readtable(htmlfile, opts);
if isempty(T)
    T01 = [];
    % output useful information about this bad htmlfile,
    % so I can take a close look at it separately.
else
    % call gunc( ) to extract useful data
    T01 = gunc(T);
end
%% then the code moves on to deal with table02
opts.TableSelector = "//TABLE[contains(.,'phrase_to_identify_table02')]";
% and the same if-else for table02
%% and so on for table03, table04
%% finally, concatenate all good tables into one big table,
% which is the main output of this function
bigT = [T01; T02; T03; T04];
end
function outT = gunc(inT)
% elaborate process to extract useful information from inT
outT = inT;   % placeholder: replace with the real extraction logic
end
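The same one-table-at-a-time idea can be written as a loop over the identifying phrases. This is only a sketch under stated assumptions: the file name, the phrases, and the data are made up, detectImportOptions is assumed to return HTML import options for a .html file, and a selector that matches nothing is assumed to give an empty table, as in the 0x1 result shown in the question. The per-table processing step (gunc above) is left out.

```matlab
% Throwaway HTML file with two tables of identical layout (illustrative data).
htmlfile = fullfile(tempdir, "phrase_tables_demo.html");   % hypothetical file name
tableFmt = "<table><tr><th>label</th><th>value</th></tr>" + ...
           "<tr><td>%s</td><td>%d</td></tr></table>\n";
fid = fopen(htmlfile, "w");
fprintf(fid, "<html><body>\n");
fprintf(fid, tableFmt, "alpha", 1);
fprintf(fid, tableFmt, "beta", 2);
fprintf(fid, "</body></html>\n");
fclose(fid);

phrases  = ["alpha", "beta", "gamma"];   % "gamma" is deliberately absent
found    = {};                           % tables that were located
notFound = strings(1, 0);                % phrases with no matching table

opts = detectImportOptions(htmlfile);    % assumed to return HTML import options
for p = phrases
    opts.TableSelector = "//TABLE[contains(.,'" + p + "')]";
    T = readtable(htmlfile, opts);
    if isempty(T)
        notFound(end+1) = p;             % record it for later inspection
    else
        found{end+1} = T;                % a processing step could be applied here
    end
end

bigT = vertcat(found{:});                % requires matching variable names
disp(notFound)                           % phrases whose table was missing
delete(htmlfile);                        % remove the throwaway file
```

Collecting the missing phrases in notFound makes it easy to list the problem files afterwards, and concatenating only the tables that were found avoids undefined variables when a table is absent.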

Release

R2022a
