NLP and ML basic concepts

Thursday, 24 March 2016

Basics of the NumPy library

NumPy is a library that provides a powerful N-dimensional array object and useful linear algebra routines.

An n-dimensional array has rank n.

>>> import numpy as np

 
>>> a = np.array([[1,2,3,4],[5,6,7,8]])

>>> a.dtype  #datatype (platform dependent: int32 here, int64 on many other setups)
dtype('int32') 

>>> a.shape  #2*4 array
(2, 4) 

>>> a.size  #2*4=8
8

>>> a.ndim  #rank
2

>>> a.itemsize  #size (bytes) of each element
4

>>> a[0,0], a[1,1]  #index as [row, column]
(1, 6)

>>> b = np.zeros((2,3))
>>> b
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

>>> b=np.ones((2,1))
>>> b
array([[ 1.],
       [ 1.]])

>>> np.empty((2,3))  #uninitialized memory: values are arbitrary and differ between calls
array([[  0.00000000e+000,   4.70293910e-268,   4.70325177e-268],
       [  0.00000000e+000,  -9.78202667e-042,   2.04432588e-268]])
>>> np.empty( (2,3) )   
array([[  1.03007337e-268,   6.87367407e-316,   0.00000000e+000],
       [  4.30279633e-308,   6.34874355e-321,   4.93121130e-306]])

>>> np.eye(3)
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

>>> np.random.random((1,2))
array([[ 0.22703097,  0.75265074]])
>>> np.random.random((1,2))
array([[ 0.47774304,  0.96365177]])

>>> a
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
>>> a[:2, 1:3]
array([[2, 3],
       [6, 7]])
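>>> a = np.array([[1,2], [3,4], [5,6]])  #a new 3*2 array
>>> a
array([[1, 2],
       [3, 4],
       [5, 6]])
>>> print(a[[0, 1], [0, 1]])  #integer array indexing: picks a[0,0] and a[1,1]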
[1 4]

>>> a = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
>>> b = np.array([0, 2, 1])
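>>> print(a[np.arange(3), b])  #one element from each of the first three rows
[1 6 8]
>>> print(a[1, b])  #elements of row 1, in the column order given by b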
[4 6 5]

>>> a
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
>>> (a>5)
array([[False, False, False],
       [False, False,  True],
       [ True,  True,  True],
       [ True,  True,  True]], dtype=bool)
>>> print(a[a>5])  #boolean indexing: keeps only the elements greater than 5
[ 6  7  8  9 10 11 12]

>>> v = np.array([1, 0, 2])
>>> vv = np.tile(v, (4, 2))  #repeat v 4 times vertically and 2 times horizontally
>>> vv
array([[1, 0, 2, 1, 0, 2],
       [1, 0, 2, 1, 0, 2],
       [1, 0, 2, 1, 0, 2],
       [1, 0, 2, 1, 0, 2]])

>>> a
array([[3, 2, 1],
       [4, 5, 6],
       [0, 2, 1]])
>>> c
[1, 2, 3]
>>> a+c  #same as np.add(a,c); c is broadcast across every row of a
array([[4, 4, 4],
       [5, 7, 9],
       [1, 4, 4]])
>>> b=[[1,2,1],[2,0,1],[2,2,1]]
>>> a+b  #same as np.add(a,b); elementwise sum of two 3*3 arrays
array([[4, 4, 2],
       [6, 5, 7],
       [2, 4, 2]])
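
A minimal sketch of why a+c works even though c has only three elements: NumPy broadcasts c across each row of a, which gives the same result as explicitly tiling c into a 3*3 array. The arrays are the ones from the session above; the names broadcast_sum and tiled_sum are only illustrative, and the snippet is written for Python 3.

```python
import numpy as np

a = np.array([[3, 2, 1], [4, 5, 6], [0, 2, 1]])
c = np.array([1, 2, 3])

# Broadcasting: c (shape (3,)) is added to every row of a (shape (3, 3)).
broadcast_sum = a + c

# The same result with explicit copies of c, one per row.
tiled_sum = a + np.tile(c, (3, 1))

print(np.array_equal(broadcast_sum, tiled_sum))  # True
```

Broadcasting avoids building the tiled copy in memory, which matters for large arrays.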

>>> a
array([[3, 2, 1],
       [4, 5, 6],
       [0, 2, 1]])
>>> a.T
array([[3, 4, 0],
       [2, 5, 2],
       [1, 6, 1]])
#(transpose of a rank 1 array does nothing)
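
A short sketch of the rank 1 remark, written for Python 3: transposing a rank 1 array leaves its shape unchanged, so to get a column vector the array first has to be reshaped to rank 2. The variable names here are only illustrative.

```python
import numpy as np

v = np.array([1, 2, 3])      # rank 1, shape (3,)
same = v.T                   # transpose of a rank 1 array: still shape (3,)

row = v.reshape(1, -1)       # rank 2 row vector, shape (1, 3)
col = row.T                  # rank 2 column vector, shape (3, 1)

print(same.shape)            # (3,)
print(row.shape)             # (1, 3)
print(col.shape)             # (3, 1)
```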

>>> x
array([[1, 1, 1],
       [2, 2, 2]])
>>> c
array([1, 2, 3])
>>> np.dot(x,c)
array([ 6, 12])

Sunday, 20 March 2016

Latent Dirichlet Allocation (LDA) basics

w: the observed jth word in document i.----WORD
z: the topic assigned to the jth word in document i.----TOPIC
O (theta): the topic distribution for document i.----CONTEXT
Motivation:
e.g. topics t1, t2, t3
t1: a list of words w11, w12, ...; each word has some probability of belonging to t1.
t2: a list of words w21, w22, ...; each word has some probability of belonging to t2.
t3: a list of words w31, w32, ...; each word has some probability of belonging to t3.

The recipe of document di is: t1% + t2% + t3% = 100%. Therefore, take the right number of words from t1, t2 and t3 and mix them to form di.

The goal is to find the recipe of any given document, so that other documents with a similar recipe can be matched to it.
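
The recipe idea can be sketched as a toy document generator. This is only an illustration of the generative story, not an LDA implementation: the topics, their words and probabilities, and the 70/20/10 recipe are all made-up assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical toy topics: (words, probability of each word within the topic).
topic_names = ["t1", "t2", "t3"]
topics = {
    "t1": (["goal", "team", "match"], [0.5, 0.3, 0.2]),
    "t2": (["vote", "party", "law"], [0.4, 0.4, 0.2]),
    "t3": (["stock", "bank", "price"], [0.6, 0.2, 0.2]),
}

# Hypothetical recipe of document di: 70% t1, 20% t2, 10% t3.
recipe = [0.7, 0.2, 0.1]

def generate_document(n_words):
    """Draw each word by first picking a topic from the recipe, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = topic_names[rng.choice(len(topic_names), p=recipe)]
        vocab, probs = topics[topic]
        words.append(vocab[rng.choice(len(vocab), p=probs)])
    return words

doc = generate_document(10)
print(doc)
```

LDA works in the opposite direction: it starts from the observed words and estimates both the topics and each document's recipe.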
-----------------------------------------------------------------------------------------------------------------------
LDA takes the documents d1, d2, d3, ... and collects the words from each of them; words that occur in the same context are related to each other and are placed in the same topic.

Dirichlet parameters a and b (usually written alpha and beta).
a: controls the per-document topic distribution
[a high a means each document di is a mixture of many topics, therefore high document similarity]
b: controls the per-topic word distribution
[a high b means each topic ti is a mixture of many words, therefore high topic similarity]
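
A small sketch of the effect of a, using NumPy's Dirichlet sampler directly rather than a full LDA model. The values 0.1 and 10.0, the number of topics, and the sample size are arbitrary assumptions chosen only to make the contrast visible.

```python
import numpy as np

rng = np.random.RandomState(0)
n_topics = 3

# Low a: each sampled topic distribution concentrates on one or two topics.
low = rng.dirichlet([0.1] * n_topics, size=1000)

# High a: each sampled topic distribution is a more even mixture of all topics.
high = rng.dirichlet([10.0] * n_topics, size=1000)

# Average weight of the single largest topic in each sampled distribution.
low_peak = low.max(axis=1).mean()
high_peak = high.max(axis=1).mean()

print(low_peak)   # close to 1: documents dominated by one topic
print(high_peak)  # much lower: documents mix several topics
```

The same reasoning applies to b, with words inside a topic in place of topics inside a document.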
---------------------------------------------------------------------------------------------------------------------
How LDA works:
Take a document x and all its words X.
Preprocessing: remove stop words, lemmatize, then keep only words with selected POS tags, such as nouns, adjectives and verbs.
LDA then gives the recipe of the document: which topics it is built from (out of y topics in total, a number you can vary) and in what ratio. (This is a probability distribution over topics; it sums to 1.)
Document x is now represented as a y-dimensional vector, which acts as a kind of unique identity for the document.
You can find documents similar to x by measuring the distance between two probability distributions, e.g. KL divergence or Jensen-Shannon divergence.
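
A minimal sketch of the last step, comparing document recipes with the Jensen-Shannon divergence, written with plain NumPy. The three topic distributions are made-up examples with y = 3 topics, and the function names are illustrative.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in nats; terms where p is 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and always finite."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # m is positive wherever p or q is positive
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topic distributions (recipes) of three documents.
doc_x = [0.7, 0.2, 0.1]
doc_y = [0.6, 0.3, 0.1]
doc_z = [0.1, 0.1, 0.8]

print(js_divergence(doc_x, doc_y))  # small: similar recipes
print(js_divergence(doc_x, doc_z))  # larger: different recipes
```

KL divergence is not symmetric and becomes infinite when q is 0 where p is not, which is one reason the Jensen-Shannon divergence is often more convenient for ranking similar documents.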
--------------------------------------------------------------------------------------------------------------------
A Python library that can be used for this: gensim
---------------------------------------------------------------------------------------------------------------------