Using TF-IDF with Naive Bayes

Sarah Alduayj on 11 Jul 2018
Commented: Oscar Green on 10 May 2019
I'm building a sentiment classification model using TF-IDF and naive Bayes, but the model keeps misclassifying the second class. I have used TF-IDF with other models, such as SVM and random forest, and it worked fine there. Below I describe my data and the steps I used.
I have 2000 comments (1000 positive, 1000 negative). I did the following steps:
1) Data preprocessing
cleanTextData = erasePunctuation(textData);
cleanTextData = lower(cleanTextData);
words = stopWords;
cleanDocuments = tokenizedDocument(cleanTextData);
cleanDocuments = removeWords(cleanDocuments,words);
cleanDocuments = normalizeWords(cleanDocuments);
cleanDocuments(1:10)
%% Bag of Words
cleanBag = bagOfWords(cleanDocuments)
cleanBag = removeInfrequentWords(cleanBag,2) % remove words that appear 2 times or fewer in total
%% Remove empty documents caused by preprocessing
[cleanBag,idx] = removeEmptyDocuments(cleanBag);
% idx lists the removed documents; the matching labels must be dropped as well (e.g. response(idx) = [])
Then I used TF-IDF:
predictors = tfidf(cleanBag,'Normalized',true,'TFWeight','log','IDFWeight','smooth');
Then I passed the results to my naive Bayes model:
t = templateNaiveBayes('DistributionNames','mvmn');
CVMdl = fitcecoc(predictors,response,'KFold',10,'Learners',t,'FitPosterior',true,'Coding','onevsone','ResponseName','response');
But the confusion matrix gives the following results:
  C1    C2
____    __
 990    10
1000     0
It seems to be assigning almost all of the 2000 observations to one class. Please advise; I have tried almost everything I know and everything others have suggested. This is for my master's thesis, and I only have a few weeks left to submit it.
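One thing that may be worth checking here, offered as a sketch and not as a confirmed diagnosis: 'mvmn' treats every distinct value of a predictor as a separate category, which suits categorical data better than continuous TF-IDF weights. The 'mn' (multinomial) distribution of fitcnb models word counts instead. The example below uses synthetic counts as a stand-in for the real data; with the bag-of-words model above, the corresponding matrix would presumably be full(cleanBag.Counts). The toy data, the variable names, and the choice of fitcnb over fitcecoc are all assumptions.

```matlab
% Toy stand-in for the word-count matrix (assumption: rows are documents,
% columns are words, entries are nonnegative integer counts).
rng(0);                                           % reproducible toy data
nPerClass = 100;
nWords = 30;
countsPos = poissrnd(1, nPerClass, nWords);
countsNeg = poissrnd(1, nPerClass, nWords);
countsPos(:, 1:5)  = poissrnd(3, nPerClass, 5);   % words more frequent in class 1
countsNeg(:, 6:10) = poissrnd(3, nPerClass, 5);   % words more frequent in class 2
counts = [countsPos; countsNeg];
response = categorical([repmat("positive", nPerClass, 1); ...
                        repmat("negative", nPerClass, 1)]);

% Multinomial naive Bayes on raw counts, 10-fold cross-validated.
CVMdl = fitcnb(counts, response, 'DistributionNames', 'mn', 'KFold', 10);

predicted = kfoldPredict(CVMdl);                  % out-of-fold predicted labels
C = confusionmat(response, predicted);            % rows: true class, columns: predicted class
cvError = kfoldLoss(CVMdl);                       % cross-validated misclassification rate
disp(C)
```

If the confusion matrix from a run like this is balanced on the real counts, that would point at the distribution choice and not at the preprocessing.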
  4 Comments
Christopher Creutzig on 26 Nov 2018
Edited: Christopher Creutzig on 26 Nov 2018
Do you have to use naïve Bayes, or did you try other models and get even worse results?
With only two classes, I do not see why you use fitcecoc, which is an interface to use multiple binary classifiers to build a multi-class one. You could use fitclinear instead, which in my experience is pretty good at the kind of high-dimensional fitting required in text analytics.
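A minimal sketch of the fitclinear suggestion, assuming a sparse documents-by-words predictor matrix and categorical labels like those in the question. The data below is synthetic and only stands in for the real TF-IDF matrix; the logistic learner is one possible choice, not something the comment prescribes.

```matlab
% Synthetic stand-in for the TF-IDF matrix and labels from the question
% (assumption: rows are documents, columns are words).
rng(1);                                        % reproducible toy data
nPerClass = 100;
nWords = 40;
Xpos = rand(nPerClass, nWords) .* (rand(nPerClass, nWords) < 0.2);
Xneg = rand(nPerClass, nWords) .* (rand(nPerClass, nWords) < 0.2);
Xpos(:, 1:5)  = Xpos(:, 1:5)  + 0.5;           % words weighted higher in class 1
Xneg(:, 6:10) = Xneg(:, 6:10) + 0.5;           % words weighted higher in class 2
predictors = sparse([Xpos; Xneg]);
response = categorical([repmat("positive", nPerClass, 1); ...
                        repmat("negative", nPerClass, 1)]);

% 10-fold cross-validated linear classifier with a logistic regression learner.
CVMdl = fitclinear(predictors, response, 'Learner', 'logistic', 'KFold', 10);

predicted = kfoldPredict(CVMdl);               % out-of-fold predicted labels
C = confusionmat(response, predicted);         % rows: true class, columns: predicted class
cvError = kfoldLoss(CVMdl);                    % cross-validated classification loss
disp(C)
```

With the real data, the synthetic block would be replaced by the predictors and response variables from the question.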
Oscar Green on 10 May 2019
One thing I've done in the past is to aggregate/discretize the counts into log-frequency buckets and treat those as features. It's a bit of a hack, but so is naive Bayes, and it ends up working pretty well.
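One possible reading of the bucketing idea, as a sketch: map each word count to a small integer bucket on a log scale and fit a naive Bayes model with 'mvmn', which then sees a handful of categories per word. The bucket boundaries, the cap at three buckets, and the synthetic counts are all assumptions; the comment does not specify them.

```matlab
% Toy stand-in for the word-count matrix (assumption: rows are documents,
% columns are words, entries are nonnegative integer counts).
rng(2);                                           % reproducible toy data
nPerClass = 100;
nWords = 30;
countsPos = poissrnd(1, nPerClass, nWords);
countsNeg = poissrnd(1, nPerClass, nWords);
countsPos(:, 1:5)  = poissrnd(3, nPerClass, 5);   % words more frequent in class 1
countsNeg(:, 6:10) = poissrnd(3, nPerClass, 5);   % words more frequent in class 2
counts = [countsPos; countsNeg];
response = categorical([repmat("positive", nPerClass, 1); ...
                        repmat("negative", nPerClass, 1)]);

% Log-frequency buckets: count 0 -> 0, counts 1-2 -> 1, counts 3 and above -> 2.
buckets = min(floor(log2(counts + 1)), 2);

% 'mvmn' treats each distinct bucket value of a word as one category.
CVMdl = fitcnb(buckets, response, 'DistributionNames', 'mvmn', 'KFold', 10);

predicted = kfoldPredict(CVMdl);                  % out-of-fold predicted labels
C = confusionmat(response, predicted);            % rows: true class, columns: predicted class
disp(C)
```

Capping the number of buckets keeps every category reasonably populated, which matters because rare categories give naive Bayes very little to estimate from.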

Answers (0)