readtable(html file) producing extra empty columns

Original question: In another thread, a similar question was asked about readtable() on a CSV file, and the answer was to set {'Delimiter', ','}. Because htmlImportOptions has no 'Delimiter' property, that answer does not work for my problem. I found that {'EmptyColumnRule', 'skip'} is a solution. Unfortunately, it cannot be combined with an htmlImportOptions object, which I use to set DataRows.
Update: the name-value syntax does accept a 'DataRows' option.
Update: setting opt.ExtraColumnsRule = 'ignore' makes readtable return only the first column.
% Either use an import options object
opt = htmlImportOptions;
opt.DataRows = 4;
% opt.EmptyColumnRule = 'skip'; % error: the HTML options object doesn't have this property
% Update
opt.ExtraColumnsRule = 'ignore';
readtable(htmlfile, opt) % reads only the first column; the other columns, including the non-empty ones, are ignored
% or use name-value pairs
% Original post: readtable(htmlfile, 'EmptyColumnRule', 'skip') % adding {'DataRows', 4} is an error
% Update: this works
readtable(htmlfile, 'EmptyColumnRule', 'skip', 'DataRows', 4)
% but not both together
readtable(htmlfile, opt, 'EmptyColumnRule', 'skip') % error
I suppose I can read in the ExtraVar columns first and then delete the empty ones, but I would rather have readtable() handle it.
Thanks for any solutions!
dpb on 11 Sep 2023
Answers (1)

dpb on 10 Sep 2023
Use 'SelectedVariableNames' with the variable(s) desired
I can't tell exactly what you want. There's a comment about reading only the first column; if that is what you're after, then
opt = htmlImportOptions;
opt.DataRows = 4;
opt.ExtraColumnsRule = 'ignore';
opt.SelectedVariableNames = opt.VariableNames(1); % read only the first column
tData=readtable(htmlfile, opt);
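A related sketch, in case more than the first column is wanted: the import options can be detected from the file rather than built empty, and then the columns to keep are chosen by position. This swaps in detectImportOptions, which is not what the snippet above uses, and it assumes a MATLAB release that reads HTML tables (I believe R2021b or later). The temporary file, its contents, and the column positions are made up for illustration.

```matlab
% Write a tiny HTML table to a temporary file (hypothetical sample data).
htmlfile = [tempname '.html'];
fid = fopen(htmlfile, 'w');
fprintf(fid, '<table>\n');
fprintf(fid, '<tr><th>A</th><th>B</th><th>C</th></tr>\n');
fprintf(fid, '<tr><td>1</td><td>x</td><td>2</td></tr>\n');
fprintf(fid, '<tr><td>3</td><td>y</td><td>4</td></tr>\n');
fprintf(fid, '</table>\n');
fclose(fid);

% Let MATLAB detect the variables present in the file.
opt = detectImportOptions(htmlfile);

% Keep only the wanted columns (positions 1 and 3 are an assumption).
opt.SelectedVariableNames = opt.VariableNames([1 3]);

tData = readtable(htmlfile, opt);
delete(htmlfile); % remove the temporary file
```

Whether detection copes with a defective real-world page is a separate question, so treat this as an idea generator rather than a fix.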
dpb on 12 Sep 2023
As I said above, I've never had to do much HTML parsing, but HTML isn't designed as a format for scanning by tools such as readtable, so it's not at all surprising to me that you're having difficulties.
While it won't be directly applicable to your case, I'll see if I can strip the parsing code and the modifications to the import object I described above down to a short piece of example code, just as an idea generator.
If you can figure out a way to post some examples of what your files actually look like, it would still be the best way to see if somebody can build a better mouse trap.
Simon on 17 Sep 2023
@dpb Thanks for offering to help. I couldn't find a similar sample file to upload here. HTML files have all sorts of defects. You were right in your earlier comments not to rely on one function to parse them correctly. In the end I used a simple combination of all and ismissing to remove the extra empty columns after readtable(). I greatly appreciate your feedback.
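For anyone landing here later, a minimal sketch of that all/ismissing cleanup. The table below is made-up data standing in for what readtable(htmlfile) returned, and the names T and emptyCols are only illustrative.

```matlab
% Hypothetical stand-in for the table returned by readtable(htmlfile):
% Var2 (all NaN) and Var4 (all empty text) play the extra empty columns.
T = table([1;2;3], NaN(3,1), {'a';'b';'c'}, {'';'';''});

% ismissing returns a logical array the same size as the table;
% all(..., 1) flags the columns in which every entry is missing.
emptyCols = all(ismissing(T), 1);

% Delete the flagged columns.
T(:, emptyCols) = [];
```

A column with even one real value is kept, so this removes only the columns that are empty from top to bottom.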

