@HaomingJiang 2016-07-30T08:19:41.000000Z Words: 15974 Reads: 2038

Tweets Analysis 4

Tweets Textmining



N-Gram Model

3-Gram with KneserNey Smoothing

According to much of the NLP literature, Kneser-Ney smoothing consistently performs best among n-gram smoothing methods (e.g. http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf). NLTK also provides an implementation of Kneser-Ney smoothing that is easy to apply:

nltk.KneserNeyProbDist
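
To make the smoothing idea concrete, here is a minimal sketch of interpolated Kneser-Ney for bigrams in plain Python. It is a simplified illustration, not the NLTK implementation and not the trigram model used in the experiments; the function name `train_kn_bigram`, the toy sentences and the discount of 0.75 are assumptions chosen for the example.

```python
from collections import Counter, defaultdict


def train_kn_bigram(sentences, discount=0.75):
    """Return (prob, vocab) for an interpolated Kneser-Ney bigram model."""
    bigrams = Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        bigrams.update(zip(padded, padded[1:]))

    context_count = Counter()     # how often `prev` starts a bigram
    followers = defaultdict(set)  # distinct words seen after `prev`
    preceders = defaultdict(set)  # distinct words seen before `word`
    for (prev, word), count in bigrams.items():
        context_count[prev] += count
        followers[prev].add(word)
        preceders[word].add(prev)
    n_bigram_types = len(bigrams)

    def prob(word, prev):
        # Continuation probability: share of bigram types that end in `word`.
        p_cont = len(preceders.get(word, ())) / n_bigram_types
        if context_count[prev] == 0:
            return p_cont  # unseen context: rely on the lower-order estimate
        discounted = max(bigrams[(prev, word)] - discount, 0.0) / context_count[prev]
        backoff_weight = discount * len(followers.get(prev, ())) / context_count[prev]
        return discounted + backoff_weight * p_cont

    return prob, set(preceders)


sentences = [["the", "flight", "was", "late"],
             ["the", "flight", "was", "great"],
             ["late", "again"]]
prob, vocab = train_kn_bigram(sentences)
print(prob("was", "flight"), prob("late", "flight"))
```

The discount removed from every seen bigram is redistributed through the continuation probability, so an unseen pair such as ("flight", "late") still receives a non-zero probability.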

Results of ten runs with random splits into test and training data.
Confusion matrix for one example run:

prediction \ truth   negative   neutral   positive
negative             546        93        71
neutral              41         86        24
positive             27         25        87

For ten runs:
Accuracy: 70.4% (mean)
G-mean: 0.5597 (mean)

The performance is worse than Naive Bayes.
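
As a sketch of how the two metrics can be read off a confusion matrix: the script at the end of this note takes the G-mean as the geometric mean of the per-class recalls, computed from column sums. The example below assumes rows are predictions and columns are true labels, as in the table above; the function name `accuracy_and_gmean` is made up for illustration.

```python
def accuracy_and_gmean(matrix):
    """matrix[i][j] = number of tweets predicted as class i whose true class is j."""
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(n_classes))
    product = 1.0
    for j in range(n_classes):
        # Column sum = number of tweets whose true class is j.
        actual_j = sum(matrix[i][j] for i in range(n_classes))
        product *= matrix[j][j] / actual_j  # recall of class j
    return correct / total, product ** (1.0 / n_classes)


# Confusion matrix of the example run above (negative, neutral, positive).
matrix = [[546, 93, 71],
          [41, 86, 24],
          [27, 25, 87]]
accuracy, gmean = accuracy_and_gmean(matrix)
print("accuracy = %.3f, G-mean = %.4f" % (accuracy, gmean))
```

Because the G-mean multiplies the recalls, a single poorly recognized minority class pulls it down much more than it pulls down the accuracy.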


Deep Learning in NLP

Basic Idea

A basic idea of applying deep learning to NLP is to turn words into vectors (e.g. word2vec). Using the word vectors as features, we can achieve better performance on NLP tasks. As the title of the paper "Natural Language Processing (Almost) from Scratch" suggests, almost every NLP task can be tackled in this brand-new way.
A good introduction in Chinese is at http://licstar.net/archives/328.

After reviewing some materials on NLP and deep learning, I found that deep learning is a popular, powerful and promising methodology in NLP. Much research has been done, including on sentiment analysis.

Deep Learning In Sentiment Analysis

For sentiment analysis, we can average the word vectors of a tweet to obtain a single vector for the whole tweet, and then use it as the input to a classifier. We can also weight the average by tf, idf or tf-idf. However, this method has been shown to work poorly.
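
A minimal sketch of the averaging step, assuming a word-to-vector dictionary is already available (the two-dimensional `embeddings` below are toy values, not trained vectors, and the function names are hypothetical). Each occurrence of a token is added once, so term frequency is implicit; passing idf weights therefore gives a tf-idf weighted average.

```python
import math
from collections import Counter


def idf_weights(tokenized_docs):
    """Smoothed inverse document frequency for every word in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    return {word: math.log(n_docs / df) + 1.0 for word, df in doc_freq.items()}


def tweet_vector(tokens, embeddings, weights=None):
    """Weighted mean of the vectors of known tokens (zero vector if none is known)."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token not in embeddings:
            continue  # skip out-of-vocabulary tokens
        w = 1.0 if weights is None else weights.get(token, 1.0)
        for k, value in enumerate(embeddings[token]):
            total[k] += w * value
        weight_sum += w
    if weight_sum == 0.0:
        return total
    return [value / weight_sum for value in total]


# Toy embeddings, for illustration only.
embeddings = {"good": [1.0, 0.0], "flight": [0.0, 1.0]}
docs = [["good", "flight"], ["flight", "delayed"], ["flight"]]
print(tweet_vector(docs[0], embeddings))                     # plain average
print(tweet_vector(docs[0], embeddings, idf_weights(docs)))  # idf-weighted average
```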

Many experimental results can be found in the following papers:

We can also compute a vector for a whole document with models.doc2vec in gensim (http://radimrehurek.com/gensim/models/doc2vec.html). The original paper on this method is http://arxiv.org/pdf/1405.4053v2.pdf, which also evaluates the document vectors on sentiment analysis.

Two popular datasets are commonly used to evaluate sentiment analysis performance:

In later work, if we want to compare with others' results, we can use these datasets. However, they are balanced; since we focus on class imbalance, we can keep working with the airline dataset.

If we can compute a vector for each document, does that mean we can use the vector as the classifier input for sentiment analysis? If so, training classifiers such as Naive Bayes, SVM, MaxEnt and boosting becomes easier. I came up with this idea on my own, but found that it has already been applied (ref: http://arxiv.org/pdf/1405.4053v2.pdf), and the deep models achieve good results.

Based on the papers and materials above, I think we can start to focus on this type of deep model. As a beginning, this week I ran the following experiments with two state-of-the-art deep models (Recursive Deep Model and Doc2Vec).

Recursive Deep Model For Sentiment Analysis

References: http://nlp.stanford.edu/sentiment/

I use CoreNLP (http://stanfordnlp.github.io/CoreNLP/index.html), which contains an implementation of the algorithm (http://nlp.stanford.edu/sentiment/), one of the state-of-the-art deep language models. However, its classifier is built from movie-review data, which is quite different from our airline data.

The result of applying it to the airline data is presented below.

prediction \ truth   negative   neutral   positive
negative             7327       1901      749
neutral              1308       830       570
positive             447        338       1015

Accuracy: 0.633
G-mean: 0.5593

Clearly, the performance is worse than that of the baseline algorithm. I think the main reason is that the training data is quite different from our airline data. If we retrain the model on the airline data, I believe we can achieve higher performance.

Doc2Vec For Sentiment Analysis

Here I use Doc2Vec and logistic regression to analyze the sentiment of the airline data.

Result for one run:
Accuracy: 0.488
G-mean: 0.2215

(PS: This week I only tested the feasibility of this approach without truly understanding each parameter. I need to refine this next week, and I hope to achieve better performance.)

About Imbalance

In terms of class imbalance, we may use SMOTE to generate better samples for the minority classes, since the new vector representation of a tweet captures the semantic meaning of the whole tweet better than tf, tf-idf, etc.
In terms of datasets, previous work on deep models has little discussion of class imbalance; it mostly handles balanced data such as the IMDB dataset. We can analyze imbalance with the airline dataset, or even with artificially imbalanced data. So first of all, we should run these algorithms on our airline data.
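
The SMOTE idea mentioned above can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. This is an illustrative simplification and not the implementation used for the results in this note; the function name `smote`, the toy points and the parameters `k` and `seed` are assumptions.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority points, nearest first (squared Euclidean distance).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=10, k=2)
print(new_points[:3])
```

With dense document vectors the interpolated points are themselves plausible document vectors, which is the motivation given above for preferring them to sparse tf or tf-idf features.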


Working Plan


Overview of previous work

The following results are averages over ten runs.
1. Naive Bayes with occurrence features and trimming of sparse terms; the feature dimension is about 1,700
Accuracy = 76.24%
G-mean = 0.6868
2. Model 1, but with tf-idf features instead of occurrence
Mean Accuracy = 75.30%
Mean G-mean = 0.6739
3. Model 1, but with tf features instead of occurrence
Mean Accuracy = 73.30%
Mean G-mean = 0.6739
4. Model 1, but with Max-Entropy instead of Naive Bayes
Mean Accuracy = 75.14%
Mean G-mean = 0.6729
5. Model 1 with oversampling (sampling ratio = 1.5, for the two minority classes)
Mean Accuracy = 76.52%
Mean G-mean = 0.6896
6. Model 1 with oversampling (sampling ratio = 2, for the two minority classes)
Mean Accuracy = 76.43%
Mean G-mean = 0.6881
7. Model 1 with oversampling (sampling ratio = 2.5, for the two minority classes)
Mean Accuracy = 76.96%
Mean G-mean = 0.693
8. Model 1 with undersampling (sampling ratio = 0.8)
Mean Accuracy = 75.59%
Mean G-mean = 0.6775
9. Model 1 with undersampling (sampling ratio = 0.6)
Mean Accuracy = 74.37%
Mean G-mean = 0.6643
10. Model 1 with SMOTE (the result is for one run only, because it is very time consuming)
Accuracy = 75.4%
G-mean = 0.6856
11. 3-gram LM with KN smoothing
Mean Accuracy = 70.4%
Mean G-mean = 0.5597
12. A trial of the Recursive Deep Model, trained on movie-review data; the result is the prediction on the whole airline data
Accuracy = 63.3%
G-mean = 0.5593
13. A trial of the Doc2Vec model, which is trained on IMDB data; the result is for one run only
Accuracy = 48.8%
G-mean = 0.2215
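
For reference, random over- and undersampling as in items 5 to 9 can be sketched as below. The meaning of "sampling ratio" is an assumption here: each class is resampled to ratio times its original size, with ratios above 1 adding random duplicates and ratios below 1 dropping random rows. The function name `resample` and the toy labels are hypothetical.

```python
import random
from collections import Counter


def resample(labels, ratios, seed=0):
    """Return row indices after resampling each class to ratio * its original size."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    selected = []
    for label, indices in by_class.items():
        target = int(round(ratios.get(label, 1.0) * len(indices)))
        if target <= len(indices):
            # Undersampling: draw without replacement.
            selected.extend(rng.sample(indices, target))
        else:
            # Oversampling: keep every row and add random duplicates.
            extra = [rng.choice(indices) for _ in range(target - len(indices))]
            selected.extend(indices + extra)
    rng.shuffle(selected)
    return selected


labels = ["negative"] * 10 + ["neutral"] * 4 + ["positive"] * 2
indices = resample(labels, {"neutral": 1.5, "positive": 1.5, "negative": 0.8})
print(Counter(labels[i] for i in indices))
```

Resampling should only be applied to the training split, so that the test set keeps the original class distribution.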


Reference

Smoothing tutorial: http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf

Introduction to word vectors (in Chinese): http://licstar.net/archives/328

gensim, a useful library of language models (including deep learning models): http://radimrehurek.com/gensim/

Y. Bengio, R. Ducharme, P. Vincent and C. Jauvin. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3(Feb), pp. 1137-1155, 2003.

Q. Le and T. Mikolov. Distributed Representations of Sentences and Documents. http://arxiv.org/pdf/1405.4053v2.pdf

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research (JMLR), 2011.


# -*- coding: utf-8 -*-
# Python 2 script (it uses htmlentitydefs, unicode() and the print statement).
import nltk
import pandas as pd
import matplotlib.pyplot as plt
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
import re
import htmlentitydefs
import string

##############################################################
emoticon_string = r"""
    (?:
      [<>]?
      [:;=8]                     # eyes
      [\-o\*\']?                 # optional nose
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
      |
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
      [\-o\*\']?                 # optional nose
      [:;=8]                     # eyes
      [<>]?
    )"""

regex_strings = (
    # Phone numbers:
    r"""
    (?:
      (?:            # (international)
        \+?[01]
        [\-\s.]*
      )?
      (?:            # (area code)
        [\(]?
        \d{3}
        [\-\s.\)]*
      )?
      \d{3}          # exchange
      [\-\s.]*
      \d{4}          # base
    )"""
    ,
    # Emoticons:
    emoticon_string
    ,
    # HTML tags:
    r"""<[^>]+>"""
    ,
    # Twitter username:
    r"""(?:@[\w_]+)"""
    ,
    # Twitter hashtags:
    r"""(?:\#+[\w_]+[\w\'_\-]*[\w_]+)"""
    ,
    # Remaining word types:
    r"""
    (?:[a-z][a-z'\-_]+[a-z])       # Words with apostrophes or dashes.
    |
    (?:[+\-]?\d+[,/.:-]\d+[+\-]?)  # Numbers, including fractions, decimals.
    |
    (?:[\w_]+)                     # Words without apostrophes or dashes.
    |
    (?:\.(?:\s*\.){1,})            # Ellipsis dots.
    |
    (?:\S)                         # Everything else that isn't whitespace.
    """
)

######################################################################
# This is the core tokenizing regex:
word_re = re.compile(r"""(%s)""" % "|".join(regex_strings), re.VERBOSE | re.I | re.UNICODE)
# The emoticon string gets its own regex so that we can preserve case for them as needed:
emoticon_re = re.compile(regex_strings[1], re.VERBOSE | re.I | re.UNICODE)
# These are for regularizing HTML entities to Unicode:
html_entity_digit_re = re.compile(r"&#\d+;")
html_entity_alpha_re = re.compile(r"&\w+;")
amp = "&amp;"

######################################################################
class Tokenizer:
    def __init__(self, preserve_case=False):
        self.preserve_case = preserve_case

    def tokenize(self, s):
        """
        Argument: s -- any string or unicode object
        Value: a tokenized list of strings; concatenating this list returns the original string if preserve_case=False
        """
        # Try to ensure unicode:
        try:
            s = unicode(s)
        except UnicodeDecodeError:
            s = str(s).encode('string_escape')
            s = unicode(s)
        # Fix HTML character entities:
        s = self.__html2unicode(s)
        # Replace @mentions with a placeholder and cap character runs at three repeats:
        s = re.sub(r"""(?:@[\w_]+)""", 'MENTIONSOMEONE', s)
        s = re.sub(r"""(.)\1{2,}""", r"""\1\1\1""", s)
        # Replace URLs with a placeholder:
        s = re.sub(r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-f][0-9a-f]))+', "WEBSITE", s)
        s = re.sub(r"""(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?""", "WEBSITE", s)
        # Replace phone numbers; the pattern is the verbose one in regex_strings[0],
        # so it has to be applied with re.VERBOSE:
        s = re.sub(regex_strings[0], 'PHONENUM', s, flags=re.VERBOSE)
        # Replace prices and other numbers with separators:
        s = re.sub(r"""[+-]?[$]?[ ]?(?:[+\-]?\d+[,/.:-]\d+[+\-]?)""", "APRICE", s)
        # As written, `\\d` matches a literal backslash followed by 'd', not a digit:
        s = re.sub(r"\S*\\d+\S*", " ", s)
        # Tokenize:
        words = word_re.findall(s)
        # Possibly alter the case, but avoid changing emoticons like :D into :d:
        if not self.preserve_case:
            words = map((lambda x: x if emoticon_re.search(x) else x.lower()), words)
            words = map((lambda x: 'PUN' if x in list(string.punctuation) else x), words)
        return words

    def tokenize_random_tweet(self):
        """
        If the twitter library is installed and a twitter connection
        can be established, then tokenize a random tweet.
        """
        try:
            import twitter
        except ImportError:
            print "Apologies. The random tweet functionality requires the Python twitter library: http://code.google.com/p/python-twitter/"
            return
        from random import shuffle
        api = twitter.Api()
        tweets = api.GetPublicTimeline()
        if tweets:
            for tweet in tweets:
                if tweet.user.lang == 'en':
                    return self.tokenize(tweet.text)
        else:
            raise Exception("Apologies. I couldn't get Twitter to give me a public English-language tweet. Perhaps try again")

    def __html2unicode(self, s):
        """
        Internal method that seeks to replace all the HTML entities in
        s with their corresponding unicode characters.
        """
        # First the digits:
        ents = set(html_entity_digit_re.findall(s))
        if len(ents) > 0:
            for ent in ents:
                entnum = ent[2:-1]
                try:
                    entnum = int(entnum)
                    s = s.replace(ent, unichr(entnum))
                except:
                    pass
        # Now the alpha versions:
        ents = set(html_entity_alpha_re.findall(s))
        ents = filter((lambda x: x != amp), ents)
        for ent in ents:
            entname = ent[1:-1]
            try:
                s = s.replace(ent, unichr(htmlentitydefs.name2codepoint[entname]))
            except:
                pass
            s = s.replace(amp, " and ")
        return s

###############################################################################
tok = Tokenizer(preserve_case=True)


def clean_tweet(s):
    '''
    :s : string; a tweet
    :return : list; tokens of the tweet, with URLs, @mentions, phone numbers and
              prices replaced by placeholder tokens (case is kept, since tok
              is created with preserve_case=True)
    '''
    words = tok.tokenize(s)
    return words


# Quick check of the tokenizer on a made-up tweet:
testT = u"he is the biggest 1st person in the world! :) WoooooW!!!! Is it true?!?! The price is $3.00 at 19:00. However, for tomorrow Nov 9 3:00pm it is only 3000. for details http://baidu.com/ please call 13823372000 @JHM #lalala"
clean_tweet(testT)

# Load the airline tweets and tokenize the text column:
df = pd.read_csv('../data/Tweets.csv')
df = df[[u'airline_sentiment', u'text']]
df.loc[:, 'text'] = df.loc[:, 'text'].map(clean_tweet)
### deep model
import multiprocessing
import random
from collections import OrderedDict
from random import shuffle

import sklearn.linear_model as skllm
import statsmodels.api as sm
from gensim.models.doc2vec import TaggedDocument, Doc2Vec
from sklearn.metrics import confusion_matrix

accus = []
Gmeans = []
cores = multiprocessing.cpu_count()
# One TaggedDocument per tweet, tagged with its row index.
documents = [TaggedDocument(list(df.loc[i, 'text']), [i]) for i in range(0, 14640)]
# Parameter names (size, iter) follow the gensim release in use when this was written.
model = Doc2Vec(size=100, window=4, min_count=2, workers=cores, iter=15, negative=3)
model.build_vocab(documents)

# Random split: the last 1000 shuffled indices form the test set.
random.seed(1212)
newindex = random.sample(range(0, 14640), 14640)
testID = newindex[-1000:]
trainID = newindex[:-1000]
trainDoc = [documents[id] for id in trainID]
Labels = df.loc[:, 'airline_sentiment']

# Shuffle a copy, so that documents[id] still refers to tweet id below.
alldoc = list(documents)
shuffle(alldoc)

model.train(trainDoc)

# Logistic regression on the learned document vectors.
train_targets, train_regressors = zip(*[(Labels[id], model.docvecs[id]) for id in trainID])
train_regressors = sm.add_constant(train_regressors)
predictor = skllm.LogisticRegression(multi_class='multinomial', solver='lbfgs')
predictor.fit(train_regressors, train_targets)

# Vectors for the test tweets are inferred rather than looked up.
test_regressors = [model.infer_vector(documents[id].words, steps=5, alpha=0.1) for id in testID]
test_regressors = sm.add_constant(test_regressors)
test_predictions = predictor.predict(test_regressors)

# Accuracy: fraction of the test tweets predicted correctly.
accu = 0
for i in range(0, 1000):
    if test_predictions[i] == df.loc[testID[i], u'airline_sentiment']:
        accu = accu + 1
accus = accus + [1.0 * accu / len(testID)]

# Rows are predictions and columns are true labels, so each factor is a per-class recall.
confusionM = confusion_matrix(test_predictions, df.loc[testID, u'airline_sentiment'])
recalls = [1.0 * confusionM[k, k] / confusionM[:, k].sum() for k in range(3)]
Gmeans = Gmeans + [pow(recalls[0] * recalls[1] * recalls[2], 1.0 / 3)]
### 3-gram language models with Kneser-Ney smoothing
import random

from sklearn.metrics import confusion_matrix
from nltk.util import ngrams

# Example of padded trigram generation:
generated_ngrams = ngrams(['TEXT a', 'TEXT b', 'TEXT c', 'TEXT d'], 3, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>')
n4grams = 3
Probdist = nltk.KneserNeyProbDist


def padded_ngrams(tokens):
    """Padded n-grams of one tokenized tweet."""
    return ngrams(tokens, n4grams, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>')


def class_distribution(ids):
    """Kneser-Ney distribution over the n-grams of the given training tweets."""
    alllist = []
    for i in ids:
        alllist.extend(padded_ngrams(df.loc[i, 'text']))
    return Probdist(nltk.FreqDist(alllist), 1)


def prob_sum(dist, tokens):
    """Sum of the smoothed probabilities of all n-grams in one tweet."""
    return sum(dist.prob(k) for k in padded_ngrams(tokens))


accus = []
Gmeans = []
for run in range(0, 10):
    # Random split: the last 1000 shuffled indices form the test set.
    random.seed(1212 + run)
    newindex = random.sample(range(0, 14485), 14485)
    testID = newindex[-1000:]
    trainID = newindex[:-1000]

    # One language model per sentiment class.
    Dist_p = class_distribution([id for id in trainID if df.loc[id, u'airline_sentiment'] == 'positive'])
    Dist_neg = class_distribution([id for id in trainID if df.loc[id, u'airline_sentiment'] == 'negative'])
    Dist_neu = class_distribution([id for id in trainID if df.loc[id, u'airline_sentiment'] == 'neutral'])

    # Predict the class whose model gives the largest score; ties go to neutral.
    predictLabels = []
    for i in range(0, 1000):
        tokens = df.loc[testID[i], 'text']
        prob_sum_p = prob_sum(Dist_p, tokens)
        prob_sum_neg = prob_sum(Dist_neg, tokens)
        prob_sum_neu = prob_sum(Dist_neu, tokens)
        if prob_sum_p > prob_sum_neu and prob_sum_p > prob_sum_neg:
            predictLabels.append('positive')
        elif prob_sum_neg > prob_sum_neu and prob_sum_neg > prob_sum_p:
            predictLabels.append('negative')
        else:
            predictLabels.append('neutral')

    # Accuracy: fraction of the test tweets predicted correctly.
    accu = 0
    for i in range(0, 1000):
        if predictLabels[i] == df.loc[testID[i], u'airline_sentiment']:
            accu = accu + 1
    accus = accus + [1.0 * accu / len(testID)]

    # Rows are predictions and columns are true labels, so each factor is a per-class recall.
    confusionM = confusion_matrix(predictLabels, df.loc[testID, u'airline_sentiment'])
    recalls = [1.0 * confusionM[k, k] / confusionM[:, k].sum() for k in range(3)]
    Gmeans = Gmeans + [pow(recalls[0] * recalls[1] * recalls[2], 1.0 / 3)]
#
#def cross_validation(clf, X, Y, cv=5, avg=False):
#    '''
#    :clf : classifier with fit() and predict() method
#    :X : pd.DataFrame; features
#    :Y : pd.DataFrame(1 column) or pd.Series; labels
#    :cv : int; number of cross-validation folds
#
#    :return : list of float; cross validation scores
#    '''
#
#    k = [int((len(X))/cv*j) for j in range(cv+1)]
#    score = [0.0]*cv
#    for i in range(cv):
#        train_x, train_y = pd.concat([X[:k[i]],X[k[i+1]:]]), pd.concat([Y[:k[i]],Y[k[i+1]:]])
#        test_x, test_y = X[k[i]:k[i+1]], Y[k[i]:k[i+1]]
#
#        clf.fit(train_x, train_y)
#        pred = clf.predict(test_x)
#
#        score[i] = (pred == test_y).sum()/float(len(test_y))
#    if avg: return sum(score)/float(len(score))
#    return score
#
#
#models = [lm()]*len(dfs)
#avg_score = [cross_validation(model, X, Y, avg=True, cv=10)]
#print(avg_score)