Using webread() to fetch bibliographic data from Google Scholar

Matt J on 24 Dec 2021
Commented: Ive J on 25 Dec 2021
I would like to use webread() to obtain bibliographic data from my Google Scholar profile. I am trying to build a struct array in which each element contains the title of an article and its authors. The format would be:
S(i).Title='Name of the Article';
I am not sure whether this is possible because the data is not posted there as a downloadable file. I would somehow have to read the text printed on the webpage given the URL. Can webread() do that?
  1 Comment
Rik on 25 Dec 2021
In my experience there is still a lot of parsing to do if you read a web page. I'm currently on mobile, so it is difficult to have a look myself. Do you have an example link for when I get back?
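As a rough illustration of the kind of parsing involved: when the response is HTML, webread() returns the page source as text, and string functions such as extractBetween() can then pull out whatever sits between two known markers. The sketch below works on a made-up HTML fragment, so it needs no network access. The class name "title" and the article names are hypothetical and are not Google Scholar's actual markup.

```matlab
% Hypothetical HTML fragment standing in for what webread(url) would return.
html = ['<table>' ...
        '<tr><td class="title">First Article</td></tr>' ...
        '<tr><td class="title">Second &amp; Third</td></tr>' ...
        '</table>'];

% Pull out every piece of text that sits between the two markers.
titles = string(extractBetween(html, '<td class="title">', '</td>'));

% Undo the one HTML entity used in this sample (base MATLAB only).
titles = replace(titles, "&amp;", "&");

% Store the result in a struct array, one element per article.
S = struct('Title', cellstr(titles));
disp(titles)
```

On a real page the markers have to be read off the page source first, and they may change whenever the site changes its layout.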


Accepted Answer

Ive J on 25 Dec 2021
Edited: Matt J on 25 Dec 2021
Parsing the returned HTML aside, GS doesn't have an API (and, AFAIK, doesn't intend to develop one), which makes fetching data from GS harder. There are several Python/R libraries developed for this purpose, but all suffer from a common issue: GS may block the IP after several attempts (anti-bot measures), so it's important to use a proxy (also, it's unclear to me whether GS scraping is legal or not!). Keeping that in mind, I believe this code may do the job:
profile = "OFVWBK0AAAAJ"; % Google Scholar profile ID
getMore = true;
pgsz = 20; % publications requested per page
start = 0; ct = 1;
maxAttempts = 1; % only for the sake of this example
[finames, values, papers, cites] = deal({});
while getMore
    % the leading URL string is empty as posted; the address of the profile's citations page goes there
    gs = webread('' + profile + '&hl=en&cstart=' + start + '&pagesize=' + pgsz);
    papLinks = string(extractBetween(gs, 'class="gsc_a_t"><a href="', '" class="gsc_a_at"'));
    if isempty(papLinks) || ct > maxAttempts % no more publications for this profile
        break
    end
    fprintf('round %d- fetching %d papers...\n', ct, pgsz)
    papers{ct, 1} = string(extractBetween(gs, 'class="gsc_a_at">', '</a><div class="gs_gray">'));
    cites{ct, 1} = string(extractBetween(gs, 'class="gsc_a_ac gs_ibl">', '</a>'));
    [tmpFields, tmpValues] = deal(cell(numel(papLinks), 1));
    for i = 1:numel(papLinks)
        % hrefs in the page source encode "&" as "&amp;"; the leading host string is empty as posted
        paper = webread("" + replace(papLinks(i), '&amp;', '&'));
        tmpFields{i, 1} = string(extractBetween(paper, '<div class="gsc_oci_field">', '</div>'));
        tmpValues{i, 1} = string(extractBetween(paper, '"gsc_oci_value"' + wildcardPattern + '>', '</div>'));
        % strip leftover opening <div ...> and <h3 ...> tags from the values
        tmpValues{i, 1} = erase(tmpValues{i, 1}, asManyOfPattern("<" + ("div"|"h3") + wildcardPattern + ">"));
    end
    finames{ct, 1} = tmpFields;
    values{ct, 1} = tmpValues;
    ct = ct + 1;
    start = start + pgsz;
end

% flatten the per-page results
papers = decodeHTMLEntities(vertcat(papers{:}));
cites = double(vertcat(cites{:}));
finames = vertcat(finames{:});
values = vertcat(values{:});

% sort out publications and return some desired fields
cols = ["Authors", "Publication date", "Journal", "Volume", "Issue", ...
    "Pages", "Publisher", "Description"];
gs = splitvars(table(strings(numel(papers), numel(cols))));
gs.Properties.VariableNames = cols;
[colIdx, fiIdx] = cellfun(@(x) ismember(cols, x), finames, 'uni', false);
for i = 1:numel(colIdx)
    % copy each paper's values into the columns whose field names were found
    gs{i, colIdx{i}} = values{i}(fiIdx{i}(colIdx{i})).';
end
gs.Title = papers;
gs.Citations = cites;
gs.Description = decodeHTMLEntities(gs.Description);
ans = 8×10 table

Row 1
Authors: "Shaun Purcell, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel AR Ferreira, David Bender, Julian Maller, Pamela Sklar, Paul IW De Bakker, Mark J Daly, Pak C Sham"
Publication date: "2007/9/1"
Journal: "The American journal of human genetics"
Volume: "81"
Issue: "3"
Pages: "559-575"
Publisher: "Cell Press"
Description: "Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association …"
Title: "PLINK: a tool set for whole-genome association and population-based linkage analyses"
Citations: 24495

Row 2
Authors: "Monkol Lek, Konrad J Karczewski, Eric V Minikel, Kaitlin E Samocha, Eric Banks, Timothy Fennell, Anne H O’Donnell-Luria, James S Ware, Andrew J Hill, Beryl B Cummings, Taru Tukiainen, Daniel P Birnbaum, Jack A Kosmicki, Laramie E Duncan, Karol Estrada, Fengmei Zhao, James Zou, Emma Pierce-Hoffman, Joanne Berghout, David N Cooper, Nicole Deflaux, Mark DePristo, Ron Do, Jason Flannick, Menachem Fromer, Laura Gauthier, Jackie Goldstein, Namrata Gupta, Daniel Howrigan, Adam Kiezun, Mitja I Kurki, Ami Levy Moonshine, Pradeep Natarajan, Lorena Orozco, Gina M Peloso, Ryan Poplin, Manuel A Rivas, Valentin Ruano-Rubio, Samuel A Rose, Douglas M Ruderfer, Khalid Shakir, Peter D Stenson, Christine Stevens, Brett P Thomas, Grace Tiao, Maria T Tusie-Luna, Ben Weisburd, Hong-Hee Won, Dongmei Yu, David M Altshuler, Diego Ardissino, Michael Boehnke, John Danesh, Stacey Donnelly, Roberto Elosua, Jose C Florez, Stacey B Gabriel, Gad Getz, Stephen J Glatt, Christina M Hultman, Sekar Kathiresan, Markku Laakso, Steven McCarroll, Mark I McCarthy, Dermot McGovern, Ruth McPherson, Benjamin M Neale, Aarno Palotie, Shaun M Purcell, Danish Saleheen, Jeremiah M Scharf, Pamela Sklar, Patrick F Sullivan, Jaakko Tuomilehto, Ming T Tsuang, Hugh C Watkins, James G Wilson, Mark J Daly, Daniel G MacArthur"
Publication date: "2016/8"
Journal: "Nature"
Volume: "536"
Issue: "7616"
Pages: "285-291"
Publisher: "Nature Publishing Group"
Description: "Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortium (ExAC). This catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of widespread mutational recurrence. We have used this catalogue to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; identifying 3,230 genes with near-complete depletion of predicted protein-truncating variants, with 72% of these genes having no currently established human disease phenotype …"
Title: "Analysis of protein-coding genetic variation in 60,706 humans"
Citations: 8196

Row 3
Authors: "Stephan Ripke, Benjamin M Neale, Aiden Corvin, James TR Walters, Kai-How Farh, Peter A Holmans, Phil Lee, Brendan Bulik-Sullivan, David A Collier, Hailiang Huang, Tune H Pers, Ingrid Agartz, Esben Agerbo, Margot Albus, Madeline Alexander, Farooq Amin, Silviu A Bacanu, Martin Begemann, Richard A Belliveau Jr, Judit Bene, Sarah E Bergen, Elizabeth Bevilacqua, Tim B Bigdeli, Donald W Black, Richard Bruggeman, Nancy G Buccola, Randy L Buckner, William Byerley, Wiepke Cahn, Guiqing Cai, Dominique Campion, Rita M Cantor, Vaughan J Carr, Noa Carrera, Stanley V Catts, Kimberley D Chambert, Raymond CK Chan, Ronald YL Chan, Eric YH Chen, Wei Cheng, Eric FC Cheung, Siow Ann Chong, C Robert Cloninger, David Cohen, Nadine Cohen, Paul Cormican, Nick Craddock, James J Crowley, David Curtis, Michael Davidson, Kenneth L Davis, Franziska Degenhardt, Jurgen Del Favero, Ditte Demontis, Dimitris Dikeos, Timothy Dinan, Srdjan Djurovic, Gary Donohoe, Elodie Drapeau, Jubao Duan, Frank Dudbridge, Naser Durmishi, Peter Eichhammer, Johan Eriksson, Valentina Escott-Price, Laurent Essioux, Ayman H Fanous, Martilias S Farrell, Josef Frank, Lude Franke, Robert Freedman, Nelson B Freimer, Marion Friedl, Joseph I Friedman, Menachem Fromer, Giulio Genovese, Lyudmila Georgieva, Ina Giegling, Paola Giusti-Rodríguez, Stephanie Godard, Jacqueline I Goldstein, Vera Golimbet, Srihari Gopal, Jacob Gratten, Lieuwe De Haan, Christian Hammer, Marian L Hamshere, Mark Hansen, Thomas Hansen, Vahram Haroutunian, Annette M Hartmann, Frans A Henskens, Stefan Herms, Joel N Hirschhorn, Per Hoffmann, Andrea Hofman, Mads V Hollegaard, David M Hougaard, Masashi Ikeda, Inge Joa, Antonio Julia, Rene S Kahn, Luba Kalaydjieva, Sena Karachanak-Yankova, Juha Karjalainen, David Kavanagh, Matthew C Keller, James L Kennedy, Andrey Khrunin, Yunjung Kim, Janis Klovins, James A Knowles, Bettina Konte, Vaidutis Kucinskas, Zita Ausrele Kucinskiene, Hana Kuzelova-Ptackova, Anna K Kähler, Claudine Laurent, Jimmy Lee, S Hong Lee, Sophie E Legge, Bernard Lerer, Miaoxin Li, Tao Li, Kung-Yee Liang, Jeffrey Lieberman, Svetlana Limborska, Carmel M Loughland, Jan Lubinski, Jouko Lönnqvist, Milan Macek, Patrik KE Magnusson, Brion S Maher, Wolfgang Maier, Jacques Mallet, Sara Marsal, Manuel Mattheisen, Morten Mattingsdal, Robert W McCarley, Colm McDonald, Andrew M McIntosh, Sandra Meier, Carin J Meijer, Bela Melegh, Ingrid Melle, Raquelle I Mesholam-Gately, Andres Metspalu, Patricia T Michie, Lili Milani, Vihra Milanova"
Publication date: "2014/7/24"
Journal: "Nature"
Volume: "511"
Issue: "7510"
Pages: "421"
Publisher: "Europe PMC Funders"
Description: "Schizophrenia is a highly heritable disorder. Genetic risk is conferred by a large number of alleles, including common alleles of small effect that might be detected by genome-wide association studies. Here, we report a multi-stage schizophrenia genome-wide association study of up to 36,989 cases and 113,075 controls. We identify 128 independent associations spanning 108 conservatively defined loci that meet genome-wide significance, 83 of which have not been previously reported. Associations were enriched among genes expressed in brain providing biological plausibility for the findings. Many findings have the potential to provide entirely novel insights into aetiology, but associations at DRD2 and multiple genes involved in glutamatergic neurotransmission highlight molecules of known and potential therapeutic relevance to schizophrenia, and are consistent with leading pathophysiological hypotheses …"
Title: "Biological insights from 108 schizophrenia-associated genetic loci"
Citations: 5552

Row 4
Authors: "Brendan K Bulik-Sullivan, Po-Ru Loh, Hilary K Finucane, Stephan Ripke, Jian Yang, Nick Patterson, Mark J Daly, Alkes L Price, Benjamin M Neale"
Publication date: "2015/3"
Journal: "Nature genetics"
Volume: "47"
Issue: "3"
Pages: "291-295"
Publisher: "Nature Publishing Group"
Description: "Both polygenicity (many small genetic effects) and confounding biases, such as cryptic relatedness and population stratification, can yield an inflated distribution of test statistics in genome-wide association studies (GWAS). However, current methods cannot distinguish between inflation from a true polygenic signal and bias. We have developed an approach, LD Score regression, that quantifies the contribution of each by examining the relationship between test statistics and linkage disequilibrium (LD). The LD Score regression intercept can be used to estimate a more powerful and accurate correction factor than genomic control. We find strong evidence that polygenicity accounts for the majority of the inflation in test statistics in many GWAS of large sample size."
Title: "LD Score regression distinguishes confounding from polygenicity in genome-wide association studies"
Citations: 2537

Row 5
Authors: "Cross-Disorder Group of the Psychiatric Genomics Consortium"
Publication date: "2013/4/20"
Journal: "The Lancet"
Volume: "381"
Issue: "9875"
Pages: "1371-1379"
Publisher: "Elsevier"
Description: "Background</h3>Findings from family and twin studies suggest that genetic contributions to psychiatric disorders do not in all cases map to present diagnostic categories. We aimed to identify specific variants underlying genetic effects shared between the five disorders in the Psychiatric Genomics Consortium: autism spectrum disorder, attention deficit-hyperactivity disorder, bipolar disorder, major depressive disorder, and schizophrenia."
Title: "Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis"
Citations: 2439

Row 6
Authors: "Konrad J Karczewski, Laurent C Francioli, Grace Tiao, Beryl B Cummings, Jessica Alföldi, Qingbo Wang, Ryan L Collins, Kristen M Laricchia, Andrea Ganna, Daniel P Birnbaum, Laura D Gauthier, Harrison Brand, Matthew Solomonson, Nicholas A Watts, Daniel Rhodes, Moriel Singer-Berk, Eleina M England, Eleanor G Seaby, Jack A Kosmicki, Raymond K Walters, Katherine Tashman, Yossi Farjoun, Eric Banks, Timothy Poterba, Arcturus Wang, Cotton Seed, Nicola Whiffin, Jessica X Chong, Kaitlin E Samocha, Emma Pierce-Hoffman, Zachary Zappala, Anne H O’Donnell-Luria, Eric Vallabh Minikel, Ben Weisburd, Monkol Lek, James S Ware, Christopher Vittal, Irina M Armean, Louis Bergelson, Kristian Cibulskis, Kristen M Connolly, Miguel Covarrubias, Stacey Donnelly, Steven Ferriera, Stacey Gabriel, Jeff Gentry, Namrata Gupta, Thibault Jeandet, Diane Kaplan, Christopher Llanwarne, Ruchi Munshi, Sam Novod, Nikelle Petrillo, David Roazen, Valentin Ruano-Rubio, Andrea Saltzman, Molly Schleicher, Jose Soto, Kathleen Tibbetts, Charlotte Tolonen, Gordon Wade, Michael E Talkowski, Benjamin M Neale, Mark J Daly, Daniel G MacArthur"
Publication date: "2020/5"
Journal: "Nature"
Volume: "581"
Issue: "7809"
Pages: "434-443"
Publisher: "Nature Publishing Group"
Description: "Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes that are crucial for the function of an organism will be depleted of such variants in natural populations, whereas non-essential genes will tolerate their accumulation. However, predicted loss-of-function variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes 1. Here we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence predicted loss-of-function variants in this cohort after filtering for artefacts caused by sequencing and annotation errors. Using an improved model of human mutation rates, we classify human …"
Title: "The mutational constraint spectrum quantified from variation in 141,456 humans"
Citations: 2321

Row 7
Authors: "Giulio Genovese, Anna K Kähler, Robert E Handsaker, Johan Lindberg, Samuel A Rose, Samuel F Bakhoum, Kimberly Chambert, Eran Mick, Benjamin M Neale, Menachem Fromer, Shaun M Purcell, Oscar Svantesson, Mikael Landén, Martin Höglund, Sören Lehmann, Stacey B Gabriel, Jennifer L Moran, Eric S Lander, Patrick F Sullivan, Pamela Sklar, Henrik Grönberg, Christina M Hultman, Steven A McCarroll"
Publication date: "2014/12/25"
Journal: "New England Journal of Medicine"
Volume: "371"
Issue: "26"
Pages: "2477-2487"
Publisher: "Massachusetts Medical Society"
Description: " Background</h3> Cancers arise from multiple acquired mutations, which presumably occur over many years. Early stages in cancer development might be present years before cancers become clinically apparent."
Title: "Clonal hematopoiesis and blood-cancer risk inferred from blood DNA sequence"
Citations: 2261

Row 8
Authors: "Brendan Bulik-Sullivan, Hilary K Finucane, Verneri Anttila, Alexander Gusev, Felix R Day, Po-Ru Loh, Laramie Duncan, John RB Perry, Nick Patterson, Elise B Robinson, Mark J Daly, Alkes L Price, Benjamin M Neale"
Publication date: "2015/11"
Journal: "Nature genetics"
Volume: "47"
Issue: "11"
Pages: "1236-1241"
Publisher: "Nature Publishing Group"
Description: "Identifying genetic correlations between complex traits and diseases can provide useful etiological insights and help prioritize likely causal relationships. The major challenges preventing estimation of genetic correlation from genome-wide association study (GWAS) data with current methods are the lack of availability of individual-level genotype data and widespread sample overlap among meta-analyses. We circumvent these difficulties by introducing a technique—cross-trait LD Score regression—for estimating genetic correlation that requires only GWAS summary statistics and is not biased by sample overlap. We use this method to estimate 276 genetic correlations among 24 traits. The results include genetic correlations between anorexia nervosa and schizophrenia, anorexia and obesity, and educational attainment and several diseases. These results highlight the power of genome-wide analyses, as there …"
Title: "An atlas of genetic correlations across human diseases and traits"
Citations: 2153
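
To get from a table like this to the struct array asked for in the question, table2struct() should be enough. The sketch below is only an illustration: it uses a small stand-in table named tbl (a hypothetical name) holding two of the rows shown above, so it runs without any web access. Splitting the author string on commas is an assumption about how the authors should be stored.

```matlab
% Stand-in for the Title and Authors columns of the table built above.
Title = ["LD Score regression distinguishes confounding from polygenicity in genome-wide association studies"; ...
         "Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis"];
Authors = ["Brendan K Bulik-Sullivan, Po-Ru Loh, Hilary K Finucane, Stephan Ripke, Jian Yang, Nick Patterson, Mark J Daly, Alkes L Price, Benjamin M Neale"; ...
           "Cross-Disorder Group of the Psychiatric Genomics Consortium"];
tbl = table(Title, Authors);

% One struct element per table row, with fields named after the variables.
S = table2struct(tbl);

% Optionally turn each comma-separated author string into a string array.
for i = 1:numel(S)
    S(i).Authors = strtrim(split(S(i).Authors, ","));
end

disp(S(1).Title)
disp(S(1).Authors)
```

With the real table, the same call would be S = table2struct(gs(:, ["Title", "Authors"])). The fields come out as strings; wrap them in char() if character vectors are preferred.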
