How to find the most used word in a text?
I have a Notepad file containing a literary text, and I need to find the most frequently repeated word or words and how many times they appear in that text.
1 Comment
  the cyclist on 3 Apr 2023 (Edited: the cyclist on 3 Apr 2023)
			FYI, this question was closed by another editor as a duplicate, but I don't think it was. This question is asking about repeated words, and the other was asking about repeated letters.
Answers (3)
  the cyclist on 3 Apr 2023 (Edited: the cyclist on 3 Apr 2023)
      I'm putting this answer here as possibly the "canonical" MATLAB answer, but I expect you do not have the Text Analytics Toolbox.
myTextFile = "sonnets.txt"; % Put your file name here
str = extractFileText(myTextFile);
T = wordCloudCounts(str);
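To get from that table to "the most repeated word or words", one option is to sort it and keep every row tied for the highest count. The sketch below is only illustrative: it uses a small hand-made table as a stand-in for T so that it runs without the toolbox, and it assumes the table returned by wordCloudCounts has variables named Word and Count. The words and counts shown are made up.

```matlab
% Hypothetical stand-in for the table returned by wordCloudCounts
% (assumed variable names: Word and Count; the values are invented)
T = table(["love"; "thou"; "thy"; "time"], [3; 5; 5; 2], ...
    'VariableNames', {'Word', 'Count'});

% Sort explicitly instead of relying on the order of the returned table
T = sortrows(T, 'Count', 'descend');

% Keep every word tied for the highest count
mostUsed = T(T.Count == max(T.Count), :);
disp(mostUsed)
```

Comparing against max(T.Count) rather than taking only the first row keeps ties, which matters if several words share the top count.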
0 Comments
  DGM on 3 Apr 2023 (Edited: DGM on 3 Apr 2023)
      Define "word".  Once you have defined "word" and have implemented a means to split a block of text into said words, then the rest is basic.  
I'm sure this can be improved a lot, but I was in a hurry.
bunchofwords = fileread('wordpile.txt'); % semicolon keeps the whole file from being printed
% i assume the capitalization doesn't matter
bunchofwords = lower(bunchofwords);
% try to fix words that are hyphenated on linebreaks
% but not all hyphenation is done with U+002D
bunchofwords = regexprep(bunchofwords,'(?<=\w+)-(\r\n|\r|\n)+(?=\w+)','');
% split the file into blobs separated by whitespace
% this causes lots of problems
%words = regexp(bunchofwords,'\S+','match');
% instead, split the file into blobs of "word" type characters
% this still has problems, but it's a bit better
words = regexp(bunchofwords,'\w+','match');
% find unique words
[uwords,~,uwidx] = unique(words);
% get histogram counts and sort them
hc = histcounts(uwidx,'binmethod','integers');
[hc,hcidx] = sort(hc,'descend');
% sort unique word list by frequency
uwordssorted = uwords(hcidx);
% display the results as a table with named columns, as a cursory effort toward readability
table(uwordssorted.',hc.','VariableNames',{'Word','Count'})
Note that this still has plenty of problems with contractions.
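To make the contraction problem concrete, here is a minimal sketch. It assumes one possible definition of "word": a run of \w characters that may contain straight apostrophes (') between such runs. The sample sentence is made up, and typographic apostrophes (U+2019) are not handled.

```matlab
% Made-up sample text containing contractions
txt = lower('Don''t stop. I can''t, and I won''t; don''t ask.');

% \w+ alone splits "don't" into "don" and "t"
naive = regexp(txt, '\w+', 'match');

% Assumed definition: word characters, optionally joined by straight apostrophes
words = regexp(txt, '\w+(''\w+)*', 'match');

% Count each unique word and sort by frequency
[uwords, ~, idx] = unique(words);
counts = accumarray(idx(:), 1);
[counts, order] = sort(counts, 'descend');
result = table(uwords(order).', counts, 'VariableNames', {'Word', 'Count'})
```

With the second pattern, "don't" is counted as one word instead of contributing a stray "t" to the totals. Whether possessives such as "poet's" should count separately from "poet" is still a choice the definition of "word" has to make.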
2 Comments
  Image Analyst on 3 Apr 2023
				Or simpler than 
words = regexp(bunchofwords,'\w+','match');
is to use strsplit
words = strsplit(bunchofwords);
  DGM on 3 Apr 2023 (Edited: DGM on 3 Apr 2023)
			No, that would be similar to the first example, naively splitting on whitespace.  This causes problems with any punctuation.  Note the cases of 'file', 'list', and 'words'.
bunchofwords = fileread('wordpile.txt');
bunchofwords = lower(bunchofwords);
uwords = unique(strsplit(bunchofwords))
uwords = unique(regexp(bunchofwords,'\S+','match'))
uwords = unique(regexp(bunchofwords,'\w+','match'))
I'm sure there are better ways to handle splitting into words, but using \w+ was simple enough.
  Image Analyst on 3 Apr 2023
        If you don't have the Text Analytics Toolbox (which @the cyclist's solution requires), then you can get a histogram like this:
str = 'abcddrd,ee,fghd,**^^###$s t q j'; % Whatever your character array is
% Convert characters to their numeric character codes.
strAscii = double(str);
% Plot the histogram with one bin per integer code.
% histogram returns a Histogram object; the counts are in its Values property.
h = histogram(strAscii, 'BinMethod', 'integers');
counts = h.Values;
% Fancy up the plot.
grid on;
xlabel('ASCII value');
ylabel('Count');
title('Histogram of Characters');
2 Comments
  the cyclist on 3 Apr 2023
				Unless I misunderstand, this solution finds the count of characters. This question (and my solution) is about finding words.
  Image Analyst on 3 Apr 2023
				I think your solution is more like what the OP wants.  But maybe I'll leave mine up in case someone in the future stumbles across it and wants a histogram of characters.
By the way, if he doesn't have that toolbox, is there a solution for a histogram of complete words?
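DGM's answer above already counts words with base MATLAB. For an actual plot of whole-word counts without the toolbox, one possible sketch uses categorical and countcats. The sample sentence and the cutoff of ten bars are arbitrary choices, and "word" is again assumed to mean a \w+ run.

```matlab
% Made-up sample text; replace with lower(fileread('yourfile.txt'))
txt = lower('the cat and the dog and the bird');
words = regexp(txt, '\w+', 'match');

% Count occurrences of each distinct word
c = categorical(words);
names = categories(c);
n = countcats(c);

% Sort by frequency, most common first
[n, order] = sort(n(:), 'descend');
names = names(order);

% Bar chart of the most frequent words (at most 10, an arbitrary cutoff)
topN = min(10, numel(n));
bar(n(1:topN));
set(gca, 'XTick', 1:topN, 'XTickLabel', names(1:topN));
ylabel('Count');
title('Most Frequent Words');
```

A bar chart is used rather than histogram because the counts are already computed per word; names(1) and n(1) then give the most used word and how often it appears.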