w: observed word in document i.----WORD
z: topic for the jth word in document i.----TOPIC
θ (theta): topic distribution for document i.----CONTEXT
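These three variables fit into LDA's generative story. A compact summary in LaTeX (φ for the per-topic word distributions and the priors α, β are standard LDA notation; α and β appear below as a and b):

\theta_i \sim \mathrm{Dirichlet}(\alpha)         % topic mixture for document i
\phi_k \sim \mathrm{Dirichlet}(\beta)            % word distribution for topic k
z_{ij} \sim \mathrm{Multinomial}(\theta_i)       % pick a topic for the jth word slot
w_{ij} \sim \mathrm{Multinomial}(\phi_{z_{ij}})  % draw the observed word from that topic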
Motivation:
e.g. topics t1, t2, t3
t1: list of words w11, w12, .... each word has some probability of belonging to t1.
t2: list of words w21, w22, .... each word has some probability of belonging to t2.
t3: list of words w31, w32, .... each word has some probability of belonging to t3.
The recipe of document di is: t1% + t2% + t3% = 100%. Therefore, take the right number of words from t1, t2, t3 and mix them to form di.
For any document, find its recipe, so that other documents with a similar recipe can be matched to it.
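A toy illustration of the recipe idea (the topic word lists, their probabilities, and the 70/20/10 recipe below are all invented for the example):

import random

# Hypothetical topics: each maps words to their probability within the topic.
t1 = {"goal": 0.5, "team": 0.3, "score": 0.2}
t2 = {"election": 0.6, "vote": 0.25, "party": 0.15}
t3 = {"stock": 0.4, "market": 0.4, "price": 0.2}
topics = [t1, t2, t3]

recipe = [0.7, 0.2, 0.1]  # t1% + t2% + t3% = 100% for document di

# Build a 20-word document by mixing words from the topics per the recipe.
doc = []
for _ in range(20):
    topic = random.choices(topics, weights=recipe)[0]              # pick a topic
    word = random.choices(list(topic), weights=topic.values())[0]  # pick a word from it
    doc.append(word)
print(" ".join(doc))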
-----------------------------------------------------------------------------------------------------------------------
Take documents d1, d2, d3, .... and collect the words from each of them; words that belong to the same context are related to each other and placed in the same topic.
Dirichlet parameters a and b.
a: controls the per-document topic distribution.
[high a means document di is a mixture of many topics, therefore high document similarity; see the sampling sketch below]
b: controls the per-topic word distribution.
[high b means topic ti contains a mixture of many words, therefore high topic similarity]
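A quick way to build intuition for a (alpha): sample topic mixtures from a Dirichlet at two concentrations (numpy only; the 3 topics and the alpha values are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)

# High alpha: mixtures come out close to uniform, i.e. documents blend many topics.
print(rng.dirichlet(alpha=[10.0, 10.0, 10.0], size=3))

# Low alpha: mixtures concentrate on a single topic, i.e. each document is mostly one topic.
print(rng.dirichlet(alpha=[0.1, 0.1, 0.1], size=3))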
---------------------------------------------------------------------------------------------------------------------
How LDA works:
Take a document x and all of its words.
Preprocessing: remove stop words, lemmatize, and then keep only POS-tagged words of interest, like nouns, adjectives, verbs.
LDA now gives the recipe for the document: which topics it is constructed from (say y topics in total, which you can vary) and in what ratio (a probability distribution over topics; the probabilities sum to 1).
Document x is now represented in y dimensions, which acts as a kind of unique identity for the document.
You can find documents similar to x by measuring the distance between two probability distributions, e.g. KL divergence or Jensen-Shannon divergence (see the sketch below).
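A minimal sketch of the similarity step, assuming p and q are the y-dimensional topic distributions of two documents (the numbers are made up, with y = 4):

import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

p = np.array([0.70, 0.20, 0.05, 0.05])
q = np.array([0.60, 0.25, 0.10, 0.05])

print(entropy(p, q))        # KL divergence D(p || q): asymmetric, unbounded
print(jensenshannon(p, q))  # Jensen-Shannon distance (sqrt of the divergence): symmetric, bounded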
--------------------------------------------------------------------------------------------------------------------
Python library that can be used: gensim.
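A minimal end-to-end sketch with gensim (the three-document corpus and num_topics=2 are invented for illustration; a real run needs a much larger corpus, plus the lemmatization/POS filtering described above):

from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.parsing.preprocessing import remove_stopwords
from gensim.utils import simple_preprocess

docs = [
    "The team scored a late goal to win the match.",
    "Voters went to the polls for the national election.",
    "The election results moved the stock market sharply.",
]

# Preprocessing: drop stop words, then lowercase and tokenize.
texts = [simple_preprocess(remove_stopwords(d)) for d in docs]

dictionary = Dictionary(texts)                   # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)

# The "recipe" of the first document: a probability distribution over the 2 topics.
print(lda.get_document_topics(corpus[0], minimum_probability=0.0))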
---------------------------------------------------------------------------------------------------------------------