@HaomingJiang
2016-07-30T08:19:41.000000Z
Tweets
Textmining
3-Gram with KneserNey Smoothing
According to many NLP references, Kneser-Ney smoothing consistently performs best among n-gram smoothing methods (e.g. http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf). NLTK also provides an implementation of Kneser-Ney smoothing that is easy to apply:
nltk.KneserNeyProbDist
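The classification scheme used in this experiment (see the script at the end of this report) fits one trigram distribution per sentiment class and assigns a tweet to the class whose distribution scores its trigrams highest. The sketch below illustrates only that scheme. To stay dependency-free it swaps in add-one (Laplace) smoothing for Kneser-Ney and sums log-probabilities, whereas the script sums raw probabilities. The function names and the toy tweets are illustrative assumptions, not the author's code.

```python
import math
from collections import Counter

def trigrams(tokens):
    """Pad a token list on both sides and return its trigrams as tuples."""
    padded = ['<s>', '<s>'] + list(tokens) + ['</s>', '</s>']
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

def train_class_models(labelled_tweets):
    """Count trigrams separately for each class label."""
    counts = {}
    for tokens, label in labelled_tweets:
        counts.setdefault(label, Counter()).update(trigrams(tokens))
    return counts

def classify(tokens, counts):
    """Return the label whose add-one-smoothed trigram model scores highest."""
    vocab = set(g for c in counts.values() for g in c)
    best_label, best_score = None, float('-inf')
    for label in sorted(counts):
        c = counts[label]
        total = sum(c.values())
        # Add-one smoothing (NOT Kneser-Ney); "+ 1" reserves mass for unseen trigrams
        score = sum(math.log((c[g] + 1.0) / (total + len(vocab) + 1))
                    for g in trigrams(tokens))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data
train = [
    (['flight', 'was', 'late', 'again'], 'negative'),
    (['lost', 'my', 'bag', 'again'], 'negative'),
    (['thanks', 'for', 'the', 'great', 'flight'], 'positive'),
    (['great', 'crew', 'thanks'], 'positive'),
]
models = train_class_models(train)
label = classify(['flight', 'was', 'late'], models)
```

In the real experiment each per-class count table would be passed to `nltk.KneserNeyProbDist` instead of the add-one formula.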
The results below come from ten runs, each with a random split into training and test data.
Confusion matrix for one example run:
prediction (rows) / actual (columns) | negative | neutral | positive |
---|---|---|---|
negative | 546 | 93 | 71 |
neutral | 41 | 86 | 24 |
positive | 27 | 25 | 87 |
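For reference, both metrics reported in this report can be derived from such a confusion matrix. The sketch below assumes, as in the script at the end, that rows are predictions and columns are true labels, and that G-mean is the geometric mean of the per-class recalls. The helper names are illustrative.

```python
# Confusion matrix of the example run: rows = predicted, columns = actual
confusion = [
    [546, 93, 71],   # predicted negative
    [41, 86, 24],    # predicted neutral
    [27, 25, 87],    # predicted positive
]

def accuracy(cm):
    """Fraction of test items on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return float(correct) / total

def g_mean(cm):
    """Geometric mean of per-class recall (diagonal / column total)."""
    n = len(cm)
    product = 1.0
    for j in range(n):
        column_total = sum(cm[i][j] for i in range(n))
        product *= float(cm[j][j]) / column_total
    return product ** (1.0 / n)

acc = accuracy(confusion)
gm = g_mean(confusion)
```

For this run the accuracy is about 0.719 and the G-mean about 0.564. The low recall on the neutral and positive classes pulls the G-mean well below the accuracy.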
For ten runs (means, as listed in the summary below):
Accuracy: 70.4%
G-mean: 0.5597
The performance is worse than Naive Bayes with occurrence features (76.24% accuracy, G-mean 0.6868).
A basic idea behind applying deep learning to NLP is to turn words into vectors (word2vec). Using word vectors as features, we can achieve better performance on NLP tasks. As the title of the paper Natural Language Processing (Almost) from Scratch suggests, almost every NLP task can be approached in this brand-new way.
A good introduction in Chinese is available at http://licstar.net/archives/328
After reviewing some materials on NLP and deep learning, I found that deep learning is a popular, powerful, and promising methodology in NLP. Much research has been done, including on sentiment analysis.
For sentiment analysis, we can average the vector representations of the words in a tweet to obtain a single vector for the whole tweet, and then use that vector as the input to a classifier. We can also weight the average by tf, idf, or tf-idf. However, this method has been reported to perform poorly.
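A minimal sketch of the averaging step, assuming word vectors are available as a plain dictionary (in practice they would come from a trained word2vec model). The names `idf_weights` and `tweet_vector`, the smoothed IDF formula, and the toy 2-dimensional vectors are illustrative assumptions.

```python
import math
from collections import Counter

def idf_weights(tokenized_docs):
    """IDF per word: log(N / document frequency) + 1."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(w for doc in tokenized_docs for w in set(doc))
    return {w: math.log(float(n_docs) / df) + 1.0 for w, df in doc_freq.items()}

def tweet_vector(tokens, word_vectors, weights=None):
    """Weighted average of the vectors of the known words in one tweet."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for w in tokens:
        if w not in word_vectors:
            continue  # skip out-of-vocabulary words
        wt = 1.0 if weights is None else weights.get(w, 1.0)
        total = [t + wt * x for t, x in zip(total, word_vectors[w])]
        weight_sum += wt
    if weight_sum == 0.0:
        return total  # no known word: return the zero vector
    return [t / weight_sum for t in total]

# Hypothetical toy vectors and tweets
word_vectors = {'great': [1.0, 0.0], 'flight': [0.0, 1.0], 'late': [-1.0, 0.0]}
docs = [['great', 'flight'], ['late', 'flight'], ['flight']]
plain = tweet_vector(docs[0], word_vectors)
weighted = tweet_vector(docs[0], word_vectors, idf_weights(docs))
```

Because a repeated token contributes once per occurrence, passing IDF weights yields a tf-idf-style average. Here the rarer word `great` dominates the weighted vector, while the plain average treats both words equally.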
Many experimental results can be found in the following papers:
We can also compute a vector for a whole document with models.doc2vec in gensim (http://radimrehurek.com/gensim/models/doc2vec.html). The method was introduced in (http://arxiv.org/pdf/1405.4053v2.pdf), which also evaluates the document vectors on sentiment analysis.
Two popular datasets are commonly used to evaluate sentiment analysis performance:
In later work, if we want to compare against others' results, we can use these datasets. However, they are balanced; since we focus on class imbalance, we can continue to work with the airline dataset.
If we can compute a vector for each document, can we use that vector as the classifier input for sentiment analysis? That would make it easier to train classifiers such as Naive Bayes, SVM, MaxEnt, and boosting. I came up with this idea independently, but found that it has already been applied (ref: http://arxiv.org/pdf/1405.4053v2.pdf), and the deep models achieve good results.
Based on the papers and materials above, I think we can start to focus on this type of deep model. As a beginning, this week I ran the following experiments with two state-of-the-art deep models (Recursive Deep Model and Doc2Vec).
References: http://nlp.stanford.edu/sentiment/
I use CoreNLP (http://stanfordnlp.github.io/CoreNLP/index.html), which contains an implementation of the algorithm (http://nlp.stanford.edu/sentiment/), one of the state-of-the-art deep language models. However, its classifier is trained on movie review data (the Stanford Sentiment Treebank), which is quite different from our airline data.
The result of applying it to the airline data is presented below.
prediction (rows) / actual (columns) | negative | neutral | positive |
---|---|---|---|
negative | 7327 | 1901 | 749 |
neutral | 1308 | 830 | 570 |
positive | 447 | 338 | 1015 |
Accuracy: 0.633
G-mean: 0.5593
The performance is clearly worse than the baseline algorithm. I think the main reason is that the training data is quite different from our airline data. If we retrain the model on the airline data, I believe we could achieve higher performance.
Here I use Doc2Vec and logistic regression to analyze the sentiment of the airline data.
Result for one run:
Accuracy: 0.488
G-mean: 0.2215
(PS: This week I only tested the feasibility of this approach, without truly understanding each parameter. I need to refine this next week, and I hope to achieve better performance.)
In terms of class imbalance, we may use SMOTE to generate better samples for the minority classes, since the new vector representation of a tweet captures the semantic meaning of the whole tweet better than tf, tf-idf, etc.
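The core of SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours, which is why a dense, semantically meaningful vector space should suit it. The following is a simplified sketch of that interpolation step only, not a full SMOTE implementation; the function name, the parameters `n_new` and `k`, and the toy vectors are assumptions.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)

    def dist2(a, b):
        # Squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: dist2(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position on the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

# Hypothetical document vectors of a minority class
minority_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority_vectors, n_new=5)
```

Every synthetic point lies on a segment between two existing minority vectors, so it stays inside the region the class already occupies.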
In terms of datasets, previous work on deep models has little discussion of class imbalance; it handles only balanced data such as the IMDB dataset. We can analyze imbalance with the airline dataset, or even with artificially imbalanced data. So first of all, we should run these algorithms on our airline data.
The following results are averages over ten runs unless noted otherwise.
1. Naive Bayes with occurrence features and trimming of sparse terms; the feature dimension is about 1,700
Accuracy = 76.24%
G-mean = 0.6868
2. Model one, but with tf-idf features instead of occurrence
Mean Accuracy = 75.30%
Mean G-mean = 0.6739
3. Model one, but with tf features instead of occurrence
Mean Accuracy = 73.30%
Mean G-mean = 0.6739
4. Model one, but with Max-Entropy instead of Naive Bayes
Mean Accuracy = 75.14%
Mean G-mean = 0.6729
5. Model one with oversampling (sampling ratio = 1.5, for two minority classes)
Mean Accuracy = 76.52%
Mean G-mean = 0.6896
6. Model one with oversampling (sampling ratio = 2, for two minority classes)
Mean Accuracy = 76.43%
Mean G-mean = 0.6881
7. Model one with oversampling (sampling ratio = 2.5, for two minority classes)
Mean Accuracy = 76.96%
Mean G-mean = 0.693
8. Model one with undersampling (sampling ratio = 0.8)
Mean Accuracy = 75.59%
Mean G-mean = 0.6775
9. Model one with undersampling (sampling ratio = 0.6)
Mean Accuracy = 74.37%
Mean G-mean = 0.6643
10. Model one with SMOTE (the result is for one run only, because it is very time-consuming)
Mean Accuracy = 75.4%
Mean G-mean = 0.6856
11. 3-gram LM with KN smoothing
Mean Accuracy = 70.4%
Mean G-mean = 0.5597
12. A trial of the Recursive Deep Model trained on movie review data; the result is the prediction over the whole airline dataset
Mean Accuracy = 63.3%
Mean G-mean = 0.5593
13. A trial of the Doc2Vec model trained on IMDB data; the result is for one run only
Mean Accuracy = 48.8%
Mean G-mean = 0.2215
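The random oversampling and undersampling used in models 5 to 9 can be sketched as follows. The report does not define "sampling ratio", so this sketch assumes it means the resampled class size divided by the original class size (ratios above 1 duplicate examples, ratios below 1 drop examples). The function name and the toy index lists are illustrative.

```python
import random

def resample(indices_by_class, ratios, seed=0):
    """Resample each class to round(ratio * size) examples.
    ratio > 1 oversamples with replacement; ratio < 1 undersamples
    without replacement; classes without a ratio keep their size."""
    rng = random.Random(seed)
    result = {}
    for label, idx in indices_by_class.items():
        ratio = ratios.get(label, 1.0)
        target = int(round(ratio * len(idx)))
        if target <= len(idx):
            result[label] = rng.sample(idx, target)
        else:
            extra = [rng.choice(idx) for _ in range(target - len(idx))]
            result[label] = list(idx) + extra
    return result

# Hypothetical training indices grouped by class
by_class = {'negative': list(range(0, 60)),
            'neutral': list(range(60, 80)),
            'positive': list(range(80, 100))}
over = resample(by_class, {'neutral': 1.5, 'positive': 1.5})
under = resample(by_class, {'negative': 0.8})
```

Only the training split should be resampled; the test split keeps its original class distribution so that accuracy and G-mean remain comparable across models.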
http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
http://licstar.net/archives/328
gensim, a useful language modeling library (including deep learning models): http://radimrehurek.com/gensim/
Y. Bengio, R. Ducharme, P. Vincent and C. Jauvin. A Neural Probabilistic Language Model, Journal of Machine Learning Research (JMLR), 3(Feb), pp. 1137-1155, 2003.
Q. Le and T. Mikolov. Distributed Representations of Sentences and Documents. http://arxiv.org/pdf/1405.4053v2.pdf
R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa. Natural Language Processing (Almost) from Scratch, Journal of Machine Learning Research (JMLR), 2011.
# -*- coding: utf-8 -*-
import nltk
import pandas as pd
import matplotlib.pyplot as plt
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
import re
import htmlentitydefs
import string
##############################################################
emoticon_string = r"""
(?:
[<>]?
[:;=8] # eyes
[\-o\*\']? # optional nose
[\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
|
[\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
[\-o\*\']? # optional nose
[:;=8] # eyes
[<>]?
)"""
regex_strings = (
# Phone numbers:
r"""
(?:
(?: # (international)
\+?[01]
[\-\s.]*
)?
(?: # (area code)
[\(]?
\d{3}
[\-\s.\)]*
)?
\d{3} # exchange
[\-\s.]*
\d{4} # base
)"""
,
# Emoticons:
emoticon_string
,
# HTML tags:
r"""<[^>]+>"""
,
# Twitter username:
r"""(?:@[\w_]+)"""
,
# Twitter hashtags:
r"""(?:\#+[\w_]+[\w\'_\-]*[\w_]+)"""
,
# Remaining word types:
r"""
(?:[a-z][a-z'\-_]+[a-z]) # Words with apostrophes or dashes.
|
(?:[+\-]?\d+[,/.:-]\d+[+\-]?) # Numbers, including fractions, decimals.
|
(?:[\w_]+) # Words without apostrophes or dashes.
|
(?:\.(?:\s*\.){1,}) # Ellipsis dots.
|
(?:\S) # Everything else that isn't whitespace.
"""
)######################################################################
# This is the core tokenizing regex:
word_re = re.compile(r"""(%s)""" % "|".join(regex_strings), re.VERBOSE | re.I | re.UNICODE)
# The emoticon string gets its own regex so that we can preserve case for them as needed:
emoticon_re = re.compile(regex_strings[1], re.VERBOSE | re.I | re.UNICODE)
# These are for regularizing HTML entities to Unicode:
html_entity_digit_re = re.compile(r"&#\d+;")
html_entity_alpha_re = re.compile(r"&\w+;")
amp = "&amp;"
######################################################################
class Tokenizer:
    def __init__(self, preserve_case=False):
        self.preserve_case = preserve_case

    def tokenize(self, s):
        """
        Argument: s -- any string or unicode object
        Value: a tokenized list of strings; URLs, mentions, phone numbers and
        prices are replaced by placeholder tokens
        """
        # Try to ensure unicode:
        try:
            s = unicode(s)
        except UnicodeDecodeError:
            s = str(s).encode('string_escape')
            s = unicode(s)
        # Fix HTML character entities:
        s = self.__html2unicode(s)
        # Replace @mentions with a placeholder:
        s = re.sub(r"""(?:@[\w_]+)""", 'MENTIONSOMEONE', s)
        # Squeeze runs of three or more identical characters down to three:
        s = re.sub(r"""(.)\1{2,}""", r"""\1\1\1""", s)
        # Replace URLs with a placeholder:
        s = re.sub(r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-f][0-9a-f]))+',"WEBSITE",s)
        s = re.sub(r"""(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?""", "WEBSITE", s)
        # Replace phone numbers with a placeholder. re.VERBOSE is required:
        # without it the whitespace and comments in the pattern are matched literally.
        s = re.sub(r"""
            (?:
              (?:            # (international)
                \+?[01]
                [\-\s.]*
              )?
              (?:            # (area code)
                [\(]?
                \d{3}
                [\-\s.\)]*
              )?
              \d{3}          # exchange
              [\-\s.]*
              \d{4}          # base
            )""", 'PHONENUM', s, flags=re.VERBOSE)
        # Replace prices, times and other number pairs with a placeholder:
        s = re.sub(r"""[+-]?[$]?[ ]?(?:[+\-]?\d+[,/.:-]\d+[+\-]?)""","APRICE",s)
        # Drop any remaining token that contains a digit (the original pattern
        # r"\S*\\d+\S*" matched a literal backslash instead of digits):
        s = re.sub(r"\S*\d+\S*"," ",s)
        words = word_re.findall(s)
        # Possibly alter the case, but avoid changing emoticons like :D into :d:
        if not self.preserve_case:
            words = map((lambda x : x if emoticon_re.search(x) else x.lower()), words)
        # Map single punctuation characters to a common token:
        words = map((lambda x : 'PUN' if x in list(string.punctuation) else x), words)
        return words

    def tokenize_random_tweet(self):
        """
        If the twitter library is installed and a twitter connection
        can be established, then tokenize a random tweet.
        """
        try:
            import twitter
        except ImportError:
            print "Apologies. The random tweet functionality requires the Python twitter library: http://code.google.com/p/python-twitter/"
        from random import shuffle
        api = twitter.Api()
        tweets = api.GetPublicTimeline()
        if tweets:
            for tweet in tweets:
                if tweet.user.lang == 'en':
                    return self.tokenize(tweet.text)
        else:
            raise Exception("Apologies. I couldn't get Twitter to give me a public English-language tweet. Perhaps try again")

    def __html2unicode(self, s):
        """
        Internal method that seeks to replace all the HTML entities in
        s with their corresponding unicode characters.
        """
        # First the digits:
        ents = set(html_entity_digit_re.findall(s))
        if len(ents) > 0:
            for ent in ents:
                entnum = ent[2:-1]
                try:
                    entnum = int(entnum)
                    s = s.replace(ent, unichr(entnum))
                except:
                    pass
        # Now the alpha versions:
        ents = set(html_entity_alpha_re.findall(s))
        ents = filter((lambda x : x != amp), ents)
        for ent in ents:
            entname = ent[1:-1]
            try:
                s = s.replace(ent, unichr(htmlentitydefs.name2codepoint[entname]))
            except:
                pass
        # Finally spell out the ampersand entity:
        s = s.replace(amp, " and ")
        return s
###############################################################################
tok = Tokenizer(preserve_case=True)
def clean_tweet(s):
    '''
    :s : string; a tweet
    :return : list; tokens of the tweet, with URLs, @mentions, phone numbers
              and prices replaced by placeholder tokens (case is preserved
              because tok is built with preserve_case=True)
    '''
    words = tok.tokenize(s)
    return words
testT = u"he is the biggest 1st person in the world! :) WoooooW!!!! Is it true?!?! The price is $3.00 at 19:00. However, for tomorrow Nov 9 3:00pm it is only 3000. for details http://baidu.com/ please call 13823372000 @JHM #lalala"
clean_tweet(testT)
df = pd.read_csv('../data/Tweets.csv')
df = df[[u'airline_sentiment',u'text']]
df.loc[:,'text'] = df.loc[:,'text'].map(clean_tweet)
### deep model (Doc2Vec + logistic regression)
import random
import multiprocessing
from random import shuffle
from collections import OrderedDict
from gensim.models.doc2vec import TaggedDocument,Doc2Vec
import statsmodels.api as sm
import sklearn.linear_model as skllm
from sklearn.metrics import confusion_matrix

accus=[]
Gmeans=[]
cores = multiprocessing.cpu_count()
# One TaggedDocument per tweet, tagged with its row index
documents = [TaggedDocument(list(df.loc[i,'text']),[i]) for i in range(0,14640)]
model = Doc2Vec(size=100, window=4, min_count=2, workers=cores, iter=15, negative = 3 )
model.build_vocab(documents)
# Random split: the last 1000 shuffled indices form the test set
random.seed(1212)
newindex = random.sample(range(0,14640),14640)
testID = newindex[-1000:]
trainID = newindex[:-1000]
trainDoc = [documents[id] for id in trainID]
Labels = df.loc[:,'airline_sentiment']
# Shuffle a copy: shuffling `documents` itself in place would break the
# documents[id] lookup used for the test set below
alldoc = list(documents)
shuffle(alldoc)
model.train(trainDoc)
# Training features are the learned document vectors plus a constant column
train_targets, train_regressors = zip(*[(Labels[id], model.docvecs[id]) for id in trainID])
train_regressors = sm.add_constant(train_regressors)
predictor = skllm.LogisticRegression(multi_class='multinomial',solver='lbfgs')
predictor.fit(train_regressors,train_targets)
# Test features are inferred from the words of each held-out tweet
test_regressors = [model.infer_vector(documents[id].words, steps=5, alpha=0.1) for id in testID]
test_regressors = sm.add_constant(test_regressors)
test_predictions = predictor.predict(test_regressors)
accu=0
for i in range(0,1000):
    if test_predictions[i]==df.loc[testID[i],u'airline_sentiment']:
        accu=accu+1
# Fraction of the test tweets predicted correctly
accus=accus+[1.0*accu/len(testID)]
# Rows of the confusion matrix are predictions, columns are true labels
confusionM = confusion_matrix(test_predictions,(df.loc[testID,u'airline_sentiment']))
# G-mean: geometric mean of the per-class recalls (diagonal / column sum)
recalls = [1.0*confusionM[j,j]/confusionM[:,j].sum() for j in range(3)]
Gmeans=Gmeans+[pow(recalls[0]*recalls[1]*recalls[2], 1.0/3)]
### 3-gram language model with Kneser-Ney smoothing
import random
from sklearn.metrics import confusion_matrix
from nltk.util import ngrams

# Toy example of padded trigram generation
generated_ngrams = ngrams(['TEXT a','TEXT b','TEXT c','TEXT d'], 3, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>')
n4grams=3  # n-gram order
Probdist = nltk.KneserNeyProbDist

def tweet_ngrams(i):
    # Padded n-grams of the tokenized tweet in row i
    return list(ngrams(df.loc[i,'text'], n4grams, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))

def class_dist(ids):
    # Kneser-Ney distribution estimated from all n-grams of the tweets in ids
    alllist = []
    for i in ids:
        alllist.extend(tweet_ngrams(i))
    return Probdist(nltk.FreqDist(alllist),1)

accus=[]
Gmeans=[]
for iter in range(0,10):
    # A different random split for every run
    random.seed(1212+iter)
    newindex = random.sample(range(0,14485),14485)
    testID = newindex[-1000:]
    trainID = newindex[:-1000]
    # One distribution per sentiment class
    Dist_p = class_dist([id for id in trainID if df.loc[id,u'airline_sentiment']=='positive'])
    Dist_neg = class_dist([id for id in trainID if df.loc[id,u'airline_sentiment']=='negative'])
    Dist_neu = class_dist([id for id in trainID if df.loc[id,u'airline_sentiment']=='neutral'])
    predictLabels=[]
    for i in range(0,1000):
        grams = tweet_ngrams(testID[i])
        # Score of a class = sum of the n-gram probabilities under its model
        prob_sum_p = sum(Dist_p.prob(k) for k in grams)
        prob_sum_neg = sum(Dist_neg.prob(k) for k in grams)
        prob_sum_neu = sum(Dist_neu.prob(k) for k in grams)
        if prob_sum_p>prob_sum_neu and prob_sum_p>prob_sum_neg:
            predictLabels.append('positive')
        elif prob_sum_neg>prob_sum_neu and prob_sum_neg>prob_sum_p:
            predictLabels.append('negative')
        else:
            # Ties and all-zero scores fall back to neutral
            predictLabels.append('neutral')
    accu=0
    for i in range(0,1000):
        if predictLabels[i]==df.loc[testID[i],u'airline_sentiment']:
            accu=accu+1
    # Fraction of the test tweets predicted correctly
    accus=accus+[1.0*accu/len(testID)]
    # Rows of the confusion matrix are predictions, columns are true labels
    confusionM = confusion_matrix(predictLabels,(df.loc[testID,u'airline_sentiment']))
    # G-mean: geometric mean of the per-class recalls (diagonal / column sum)
    recalls = [1.0*confusionM[j,j]/confusionM[:,j].sum() for j in range(3)]
    Gmeans=Gmeans+[pow(recalls[0]*recalls[1]*recalls[2], 1.0/3)]
#
#def cross_validation(clf, X, Y, cv=5, avg=False):
# '''
# :clf : classifier with fit() and predict() method
# :X : pd.DataFrame; features
# :Y : pd.DataFrame(1 column) or pd.Series; labels
# :cv : int; cross validation folders
#
# :return : list of float; cross validation scores
# '''
#
# k = [int((len(X))/cv*j) for j in range(cv+1)]
# score = [0.0]*cv
# for i in range(cv):
# train_x, train_y = pd.concat([X[:k[i]],X[k[i+1]:]]), pd.concat([Y[:k[i]],Y[k[i+1]:]])
# test_x, test_y = X[k[i]:k[i+1]], Y[k[i]:k[i+1]]
#
# clf.fit(X,Y)
# pred = clf.predict(test_x)
#
# score[i] = (pred == test_y).sum()/float(len(test_y))
# if avg: return sum(score)/float(len(score))
# return score
#
#
#models = [lm()]*len(dfs)
#avg_score = [cross_validation(model, X, Y, avg=True, cv=10)]
#print(avg_score)