Count number of words in a PDF document.
I want to count the words in a PDF. I have a PDF in Arabic, and for each word I want to know how many times it occurs, like a histogram. For example, if the word WORK is in the PDF, I want to know how many times it occurs. I want this word to process as an image. Please help.
Answers (1)
KSSV
on 15 Feb 2022
You can read your PDF file using:
str = extractFileText("Test.pdf"); % give your PDF file name
The above reads the content of the PDF into a string. After that, you can use functions like strcmp, strcmpi, or strfind to check whether a given word is present in str, and then count the matches.
word = "WORK" ; % give your word
s = strsplit(str) ; % split the string into words at whitespace
idx = strcmpi(s,word) ; % true where a word matches, ignoring case
nnz(idx) % count how many times the word is present
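The snippet above counts one word. Since the question asks for the count of every word, like a histogram, here is a minimal sketch of one way to do that with unique and accumarray. It assumes the PDF has a text layer that extractFileText can read and that erasePunctuation from the same toolbox is available. The file name "Test.pdf" and the variable names are placeholders.

```matlab
% Sketch: frequency of every word in a PDF (assumes a readable text layer).
str = extractFileText("Test.pdf");      % placeholder file name
str = erasePunctuation(str);            % optional: drop punctuation attached to words

words = split(strtrim(str));            % split at whitespace into a string column
words = words(strlength(words) > 0);    % discard any empty entries

[uniqueWords, ~, idx] = unique(words);  % idx maps each word to its unique entry
counts = accumarray(idx, 1);            % occurrences of each unique word

[counts, order] = sort(counts, "descend");
wordTable = table(uniqueWords(order), counts, ...
    'VariableNames', {'Word', 'Count'});

% Show up to the ten most frequent words
disp(wordTable(1:min(10, height(wordTable)), :))
```

Note that unique is case-sensitive, unlike strcmpi above. That should not matter for Arabic text, which has no letter case, but for other scripts you could apply lower to words first.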
2 Comments
Image Analyst
on 15 Feb 2022
Edited: Image Analyst
on 15 Feb 2022
@KSSV I didn't know about extractFileText(). Is it in the Text Analytics Toolbox?
@sajid khan what do you mean by "I want this word to process as an image."? If you can get the words directly from the data, why render the page as an image and then try to do OCR on it?