Counting syllables in text from a txt file.
Hello, I'm trying to write a script that counts syllables in a .txt file. I can already count the occurrences of vowels, but to count syllables I need to count occurrences of [A,I,O,U,E] while treating a run of consecutive vowels as a single syllable. I also need to disregard an 'E' as a syllable if it occurs at the end of a word.
3 Comments
Walter Roberson
on 27 Apr 2017
We volunteers get pretty disappointed when people remove their question. We are not free private consultants! The "cost" we charge for our advice is that the question and answers stay public so that everyone can learn from them.
Answers (4)
Walter Roberson
on 24 Apr 2017
Counting syllables in English takes a lot of knowledge of the language. In some words, the number of syllables depends upon how the word is being used. For example, "unionized" might be union-ized (3 syllables: un-ion-ized) or it might be un-ionized (4 syllables: un-i-on-ized). The number of syllables in a word can depend upon which part of speech it is acting as.
In English, the location of syllable breaks depends upon whether a syllable is stressed or not. It also depends upon whether vowels are long or not (which can determine whether a consonant run is split into pieces or not.) These two factors are influenced by the suffixes -- adding a suffix to a word can shift how the syllables are to be broken up in earlier parts of the word, which in turn can change how many syllables there are.
If you analyze mechanically looking at characters, then you need to be able to deal with "ghoti" being one syllable.
5 Comments
Walter Roberson
on 26 Apr 2017
"you know if regexp would be capable of finding 3 instances of a "syllable" in a word"
No, I am absolutely certain that it cannot do that.
"The resulting hyphenation algorithm uses about 4500 patterns [...]"
Now if you instead wanted to do the completely different task of counting groups of vowels, then:
regexp(S, '\<[bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ]*([AEIOUaeiou]+[bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ]+){2}[AEIOUaeiou]+[bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ]*\>')
This is not at all the same as the number of syllables. Remember, in English, it is possible for there to be adjacent syllables that contain only vowels, provided that those vowels are long or stressed. (And this pattern once more neglects poor "y" ...)
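To make the difference concrete, here is a minimal sketch that counts the vowel groups in each word instead of matching words with a fixed number of groups. The sample sentence and the variable names (`S`, `words`, `groupsPerWord`) are assumptions made for illustration, and "y" is again ignored.

```matlab
% Count runs of consecutive vowels per word.
% This is a rough proxy only; it is not a syllable count.
S = 'The beautiful ghoti swam quietly';          % illustrative sample text

% Split the text into alphabetic words.
words = regexp(S, '[A-Za-z]+', 'match');

% With a single output, regexp returns the start index of every match,
% so numel() gives the number of vowel runs in the word.
groupsPerWord = cellfun(@(w) numel(regexp(lower(w), '[aeiou]+')), words);

for k = 1:numel(words)
    fprintf('%-10s %d\n', words{k}, groupsPerWord(k));
end
% groupsPerWord is [1 3 2 1 1]
```

Note that "ghoti" comes out as 2 and "quietly" as 1, which shows how far a purely character-based count can drift from the spoken syllable count.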
John D'Errico
on 25 Apr 2017
Edited: John D'Errico
on 27 Apr 2017
The English language is an aggregate mess, compiled from words taken from many languages. So you can virtually never be 100% correct in such a task. Accept that as fact, and then just aim for the lowest failure rate that you can achieve.
If I HAD to try to solve such a task, I'd use a dictionary approach. That is...
1. If possible, I'd find an online dictionary that includes syllabification. Even if the dictionary were limited in size, it would be a great starting point. Otherwise, you need to build it yourself.
2. Next, write a simple rule-based code. The goal is for it to be as good as possible, but I'd not invest a huge amount of time there. My target would be an initial success rate as high as possible, so pick off the low-hanging fruit first.
3. Test the algorithm on your dictionary, looking for errors. Where possible, add new rules if you can see an obvious rule that you might have missed.
4. Next, test the tool on blocks of text pasted from any online sources you can find: books, articles, etc. Skip over words that already exist in the dictionary. To those that are missing from the dictionary, apply your algorithm. Now, check each word so identified. You will need to rely on either your own knowledge or, if your own language skills are limited, on a large dictionary resource like the OED. Add each word to your internal MATLAB dictionary, building/extending it one word at a time.
5. For the words that are syllabically ambiguous, like unionized, now you need to go back and use grammatical rules to identify the correct count for that word.
The quality of your result will depend on how much effort you are willing to invest, regardless of the approach you follow. Perfection will take a great deal of effort.
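As a rough sketch of how the dictionary lookup and the rule-based fallback could fit together, assuming a `containers.Map` as the internal dictionary: the dictionary entries, the word list, and the names `dict`, `ruleCount`, `counts` and `fromDict` are all hypothetical, and the fallback rule is deliberately crude.

```matlab
% Hypothetical hand-built dictionary: word -> syllable count.
% The entries are illustrative, not taken from a real resource.
dict = containers.Map({'the', 'queue', 'people'}, [1 1 2]);

% Crude fallback rule (an assumption): drop a word-final 'e', count the
% remaining vowel runs, and never report fewer than one syllable.
ruleCount = @(w) max(1, numel(regexp(regexprep(w, 'e$', ''), '[aeiou]+')));

words    = {'People', 'table', 'rhythm', 'the'};
counts   = zeros(size(words));
fromDict = false(size(words));

for k = 1:numel(words)
    w = lower(words{k});
    if isKey(dict, w)
        counts(k)   = dict(w);        % trusted value from the dictionary
        fromDict(k) = true;
    else
        counts(k)   = ruleCount(w);   % guess; review it, then add it to dict
    end
end
% counts is [2 1 1 1], fromDict is [true false false true]
```

Here "table" and "rhythm" both get a wrong guess of 1 from the fallback. Those are exactly the words that step 4 would catch on review and add to the dictionary with the corrected count.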
Sergey Kasyanov
on 24 Apr 2017
If I understand you correctly, try this.
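vowels = {'A','I','O','U','E'};
hF = fopen('filename.txt');
% nVowels holds the running count for the whole text
% (named nVowels rather than sum, so the built-in sum() is not shadowed)
nVowels = 0;
% repeat until the end of the file is reached
while ~feof(hF)
    % read one line; fgetl returns -1 (not a char array) if nothing is left
    str = fgetl(hF);
    if ~ischar(str)
        break
    end
    % convert the line to uppercase
    str = upper(str);
    % add 1 for each vowel that occurs at least once in the line
    % (repeated occurrences within the same line are not counted)
    for i = 1:length(vowels)
        b = strfind(str, vowels{i});   % strfind replaces the deprecated findstr
        if ~isempty(b)
            nVowels = nVowels + 1;
        end
    end
    % count the occurrences of 'E' at the end of a word
    cE = strfind(str, 'E');
    dcE = 0;
    for j = 1:length(cE)
        if cE(j) ~= length(str)
            % an 'E' followed by another letter is not at the end of a word
            if isletter(str(cE(j)+1))
                continue
            end
        end
        dcE = dcE + 1;
    end
    % correct the vowel count for each 'E' at the end of a word
    nVowels = nVowels - dcE;
end
fclose(hF);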
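For comparison, here is a hedged sketch of the rule as stated in the question: a run of consecutive vowels counts once, and an 'E' at the end of a word is ignored. It reads the whole file with `fileread` and uses `regexp` instead of a line-by-line loop. The sample file is written to a temporary location only so the sketch runs on its own. The names `fname` and `nSyll`, and the choice to keep a final 'e' when it is the word's only vowel (as in "the"), are assumptions. The result is still a vowel-run count, not a true syllable count.

```matlab
% Write a small sample file (a stand-in for the real .txt file).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'The late train\nmade some noise\n');
fclose(fid);

% Read the whole file at once and normalise the case.
txt = lower(fileread(fname));

% Drop a word-final 'e' when the word has an earlier vowel.
% Token 1 keeps everything from that earlier vowel up to the final 'e'.
txt = regexprep(txt, '([aeiou][a-z]*?)e\>', '$1');

% Each remaining run of consecutive vowels counts once.
nSyll = numel(regexp(txt, '[aeiou]+'));
fprintf('Estimated syllables: %d\n', nSyll);   % prints 6 for the sample text

delete(fname);   % remove the temporary sample file
```

For real input, replace the temporary file with the path of your own text file and skip the lines that write and delete the sample.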
4 Comments
Sergey Kasyanov
on 25 Apr 2017
Thanks for the note about the variable sum. Hmm... I checked this code on a test file. Can you attach your text to the question?
suraj s
on 9 Jan 2020
Hello all, I am working on a MATLAB project: a caption generator based on lip gesture recognition. Please help me with the code and the functions used for this.