Why does the LDA model generate a topic that includes stop words although these words don't exist in the data?

Question

Jack on 27 Sep 2021

Direct link to this question

https://au.mathworks.com/matlabcentral/answers/1461879-why-lda-model-generate-a-topic-include-stop-words-although-these-words-don-t-exist-in-the-data

Answered: the cyclist on 27 Sep 2021

Hello and good day to you..

I am doing topic modeling with Latent Dirichlet Allocation (LDA), which requires preprocessing (cleaning) the data first. I therefore applied the preprocessing steps in order as follows:

However, when the LDA model generates the topics (a topic in LDA being a collection of probably related words), one topic contains stop words even though they were removed from the data. I also checked the data and there is not a single stop word in it. Why are these stop words still there, showing up as one of the resulting topics, although these words do not even exist in the Vocabulary of the model?

Please help!


Answer 1

the cyclist on 27 Sep 2021

Direct link to this answer

https://au.mathworks.com/matlabcentral/answers/1461879-why-lda-model-generate-a-topic-include-stop-words-although-these-words-don-t-exist-in-the-data#answer_796359

I don't think it is possible to answer this question well without seeing the data.

I think it is extraordinarily unlikely that a stop word shows up in a topic without appearing in the data. Perhaps you are somehow accidentally incorporating another corpus besides your data? Another possibility is that a stop word (e.g. "run") does not appear in your data, but a related word (e.g. "running") does, and some step in your pipeline is reducing words to their root forms (stemming or lemmatization).
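To make that check concrete, here is a minimal sketch in Python (the idea carries over to whatever environment the model was fitted in). It compares a vocabulary against a stop-word list, both verbatim and after stemming. Everything in it is an assumption for illustration: `STOP_WORDS` is a tiny hypothetical list, `naive_stem` is a toy suffix stripper rather than a real stemmer, and `audit_vocabulary` is a made-up helper name. You would substitute the vocabulary exported from your own model and the stop-word list your pipeline actually uses.

```python
# Hypothetical, deliberately tiny stop-word list; replace with the list your pipeline uses.
STOP_WORDS = {"the", "is", "and", "do", "be"}


def naive_stem(word):
    """Toy suffix stripper standing in for a real stemmer (an assumption, not Porter)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def audit_vocabulary(vocabulary, stop_words=STOP_WORDS):
    """Return (stop words present verbatim, words that turn into stop words once stemmed)."""
    # Lowercase and trim first, so casing or stray whitespace cannot hide a match.
    normalized = [w.strip().lower() for w in vocabulary]
    direct = sorted({w for w in normalized if w in stop_words})
    via_stem = sorted(
        {w for w in normalized if w not in stop_words and naive_stem(w) in stop_words}
    )
    return direct, via_stem


# Example vocabulary (made up for this sketch).
direct, via_stem = audit_vocabulary(["model", "The", "doing", "topic", "words"])
print("verbatim stop words:", direct)
print("stop words after stemming:", via_stem)
```

If the first list is non-empty, the stop words really are in the vocabulary. If only the second list is non-empty, a stemming step applied after stop-word removal would be a plausible culprit.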

One thing you could try, to debug this weirdness, is to run your code on half your data to see if these stop words still show up. If they do, split that half again; if they do not, run it on the other half. Keep slicing up the data, and maybe you can narrow down to see exactly which part of your corpus is causing the "error".
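The halving procedure can be sketched roughly as follows, in Python. The names `bisect_corpus` and `shows_problem` are hypothetical: `shows_problem` stands in for "preprocess this slice, fit the LDA model, and check whether stop words appear in the topics", which in this sketch is replaced by a trivial word check so the example runs on its own. It also assumes the problem reproduces on the full corpus to begin with.

```python
def bisect_corpus(docs, shows_problem):
    """Narrow docs down to a smaller slice for which shows_problem(slice) is still True."""
    current = list(docs)
    while len(current) > 1:
        mid = len(current) // 2
        left, right = current[:mid], current[mid:]
        if shows_problem(left):
            current = left
        elif shows_problem(right):
            current = right
        else:
            # Neither half reproduces the problem alone; stop and inspect this slice by hand.
            break
    return current


# Stand-in predicate (an assumption): in practice this would refit the model on the slice.
def contains_the(docs):
    return any("the" in doc.split() for doc in docs)


corpus = ["topic model data", "latent allocation", "word the count", "clean corpus"]
culprits = bisect_corpus(corpus, contains_the)
print(culprits)
```

Each round refits on a slice half the size, so even a large corpus is narrowed down in a modest number of runs. If the loop stops early with more than one document, the problem probably depends on a combination of documents rather than a single one.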
