Extract email information from webpages/URLs using Matlab

Hi,
I need your help. When I run the code below, K comes back as 143 empty cells; in other words, K does not contain a single email address. I tried other websites that do show email addresses on some pages, but in vain: K was always empty. Can you please help me find what is wrong in the following code? I want to write code that checks each page in L linked from the main website H and extracts email addresses according to the expression E.
H = webread("https://edition.cnn.com");
L = regexp(H,'https?://[^"]+','match')';
E ='[a-zA-Z0-9._%''+-]+@([a-zA-Z0-9._-])+\.([a-zA-Z]{2,4})';
K = regexpi(L,E,'match')';

 Accepted Answer

Looks like L is a cell array of web sites, none of which is an email address with @ in it. So why do you think it should find an email there? Try searching H instead:
K = regexpi(H,E,'match')';
That will give you email addresses.
% Retrieve a web page.
%url = 'https://www.mathworks.com/matlabcentral/answers/?term=';
url = 'https://edition.cnn.com';
webPageContents = webread(url);
% Harvest web sites listed in the page.
listOfWebSites = regexp(webPageContents,'https?://[^"]+','match')';
% Throw out duplicates:
listOfWebSites = unique(listOfWebSites)
% Harvest email addresses listed in the page.
reForEMailAddresses ='[a-zA-Z0-9._%''+-]+@([a-zA-Z0-9._-])+\.([a-zA-Z]{2,4})';
listOfEMails = regexpi(webPageContents,reForEMailAddresses,'match')';
% Throw out duplicates:
listOfEMails = unique(listOfEMails)
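
To make the difference concrete, here is a small offline sketch. The page text in H is a made-up stand-in (not real CNN content) so that it runs without network access; the two regular expressions are the ones from the question.

```matlab
% Hypothetical page content, hard-coded so the example runs offline.
H = '<a href="https://example.com/about">About</a> Contact: info@example.com';
E = '[a-zA-Z0-9._%''+-]+@([a-zA-Z0-9._-])+\.([a-zA-Z]{2,4})';

% L holds only the link text, so there is nothing in it for E to match.
L = regexp(H, 'https?://[^"]+', 'match')';
K_links = regexpi(L, E, 'match');            % one (empty) cell per link
allEmpty = all(cellfun(@isempty, K_links));  % true when no link contains an email

% Searching the page text itself finds the address.
K_page = regexpi(H, E, 'match');
```

With this sample text, every cell of K_links is empty while K_page holds the one address, which is the same pattern as the 143 empty cells in the question.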

5 Comments

Thank you! But here:
listOfEMails = regexpi(webPageContents,reForEMailAddresses,'match')';
Why do you use "webPageContents" instead of "listOfWebSites" when my aim is to harvest the content of all listed/linked websites, not just the single page in "webPageContents"? Or am I missing something about "webPageContents"? Does it include the content of all webpages linked from it?
You were searching the list of web sites that were listed on the page. This list of web sites does not have any email addresses in it. That's why your K was empty.
I assumed you wanted to find email addresses, and the only place to find them is in the original web page contents, not in the small subset of it that you scraped out (i.e., the list of web sites, which contains no email addresses).
If you're interested in how to collect a list of stock prices and drop them onto an Excel workbook with your stock portfolio on it (with Windows), I can show you that too.
Thank you again! Of course the list of websites does not contain any email addresses, but if you open each listed webpage, one by one, you can find some, I think. In short, as you mentioned, I wanted to find email addresses on both the index page (i.e., webPageContents) and all pages linked directly from it (i.e., listOfWebSites(:,1)).
You could just go through each web site found on the main web site and scan each of those web sites for emails:
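reForEMailAddresses ='[a-zA-Z0-9._%''+-]+@([a-zA-Z0-9._-])+\.([a-zA-Z]{2,4})';
allEMails = cell(0, 1);
for k = 1 : length(listOfWebSites)
    thisWebSite = listOfWebSites{k};
    try
        webPageContents = webread(thisWebSite);
    catch
        continue; % Skip links that cannot be read.
    end
    if ~ischar(webPageContents)
        continue; % Skip non-text content such as images or decoded JSON.
    end
    % Harvest email addresses listed in the page.
    theseEMails = regexpi(webPageContents, reForEMailAddresses, 'match')';
    allEMails = [allEMails; theseEMails]; % Accumulate instead of overwriting.
end
% Throw out duplicates:
listOfEMails = unique(allEMails)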
Is that what you want?
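
One possible refinement, sketched under assumptions: many of the harvested links point at scripts, stylesheets, and images rather than pages, so it may be worth dropping those before the loop. The link list below is a made-up stand-in for listOfWebSites, and the set of extensions in assetPattern is only a guess at what you would want to skip.

```matlab
% Hypothetical link list standing in for listOfWebSites.
listOfWebSites = {'https://example.com/contact'; ...
                  'https://example.com/static/app.js'; ...
                  'https://example.com/logo.png'; ...
                  'https://example.com/about.html'};

% Assumed set of asset extensions to skip (optionally followed by a query string).
assetPattern = '\.(js|css|png|jpe?g|gif|svg|ico|woff2?)(\?.*)?$';

% regexpi with 'once' returns a start index per link, empty where nothing matched.
isAsset = ~cellfun(@isempty, regexpi(listOfWebSites, assetPattern, 'once'));
pagesToScan = listOfWebSites(~isAsset);
```

Looping over pagesToScan instead of listOfWebSites should cut down the number of webread calls, at the cost of missing any address that happens to sit inside a skipped file.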

Release: R2021a
Asked: 13 Jun 2021
Commented: 14 Jun 2021