@HaomingJiang
2016-08-12T22:29:41.000000Z
Tweets
Textmining
Doc2Vec
This tutorial walks through using Doc2Vec for sentiment analysis on airline tweets. Although the results are not especially strong, it still shows the full procedure for sentiment analysis with Gensim Doc2Vec.
The main procedure is:
1. Text Preprocessing
2. Building Doc2Vec Model
3. Building Sentiment Classifier
The airline tweets data can be collected from Kaggle. It is a small corpus, and a larger data set is needed to learn good document vectors. A larger data set, which contains about 55,000 airline tweets, can be downloaded here. In addition, I collected about 55,000 unlabeled tweets from the Twitter API; they can be downloaded here.
PS: I renamed the labeled data to "Tweets_NAg.csv" for simplicity.
A good introduction to preprocessing tweet data can be found here. The main purpose of this step is to tokenize each original tweet into meaningful word units. Starting from the provided sentiment tokenizer code, I made a slight modification that normalizes URLs and usernames. The file you need is ReadACleanT.py.
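To make the normalization step concrete, here is a minimal sketch of what such a cleaning function might look like. It is not the contents of ReadACleanT.py: the function name `clean_tweet_sketch`, the regular expressions, and the placeholder tokens `URLTOKEN` and `USERTOKEN` are all assumptions made for illustration.

```python
import re

# Hypothetical patterns and placeholders; the real ReadACleanT.py may differ.
URL_RE = re.compile(r'https?://\S+|www\.\S+')
USER_RE = re.compile(r'@\w+')
TOKEN_RE = re.compile(r"[A-Za-z0-9_']+|[^\w\s]")
PLACEHOLDERS = ('URLTOKEN', 'USERTOKEN')

def clean_tweet_sketch(text):
    """Replace URLs and usernames with placeholders, then split into lowercase tokens."""
    text = URL_RE.sub(' URLTOKEN ', text)
    text = USER_RE.sub(' USERTOKEN ', text)
    tokens = TOKEN_RE.findall(text)
    # Keep the placeholders in upper case so they stay distinct from ordinary words.
    return [tok if tok in PLACEHOLDERS else tok.lower() for tok in tokens]

tokens = clean_tweet_sketch("@united Thanks for the upgrade! http://t.co/abc123")
print(tokens)
```

Mapping every URL and every username to a single token keeps the vocabulary small, since individual links and handles rarely repeat often enough to get a useful embedding.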
Put the files in the same folder, then load and clean the data.
from ReadACleanT import clean_tweet
import pandas as pd
import random

# Labeled tweets: keep only the sentiment label and the text, then tokenize the text.
df = pd.read_csv('Tweets_NAg.csv')
df = df[[u'airline_sentiment',u'text']]
df.loc[:,'text'] = df.loc[:,'text'].map(clean_tweet)

# Unlabeled tweets: used only for training the document vectors.
udf = pd.read_csv('../data/Tweets_Unlabeled.csv')
udf = udf[[u'text']]
udf.loc[:,'text'] = udf.loc[:,'text'].map(clean_tweet)
First we need to turn our data into TaggedDocument form, which is the input form of Doc2Vec.
from gensim.models.doc2vec import TaggedDocument,Doc2Vec

TotalNum = len(df)            # number of labeled tweets
TotalNum_Unlabed = len(udf)   # number of unlabeled tweets
TestNum = 3000
TrainNum = TotalNum-TestNum

# Tag every tweet with a unique integer id; unlabeled tweets continue after the labeled ones.
documents = [TaggedDocument(list(df.loc[i,'text']),[i]) for i in range(0,TotalNum)]
documents_unlabeled = [TaggedDocument(list(udf.loc[i,'text']),[i+TotalNum]) for i in range(0,TotalNum_Unlabed)]
documents_all = documents+documents_unlabeled

# Shuffle the training order; list() is needed wherever range() does not return a list.
Doc2VecTrainID = list(range(0,TotalNum+TotalNum_Unlabed))
random.shuffle(Doc2VecTrainID)
trainDoc = [documents_all[id] for id in Doc2VecTrainID]
Labels = df.loc[:,'airline_sentiment']
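The tagging scheme is easier to see on a toy corpus. The sketch below uses a stand-in namedtuple with the same two fields as gensim's TaggedDocument (`words` and `tags`) so that it runs without gensim installed; the name `TaggedDocumentSketch` and the example tweets are made up for illustration.

```python
from collections import namedtuple

# Stand-in with the same fields as gensim's TaggedDocument (words, tags).
TaggedDocumentSketch = namedtuple('TaggedDocumentSketch', 'words tags')

# Toy tokenized tweets, invented for this example.
labeled = [['great', 'flight'], ['lost', 'my', 'bag']]
unlabeled = [['boarding', 'now']]

n_labeled = len(labeled)
# Labeled tweets get tags 0..n_labeled-1; unlabeled tweets continue from n_labeled.
docs = [TaggedDocumentSketch(words, [i]) for i, words in enumerate(labeled)]
docs += [TaggedDocumentSketch(words, [i + n_labeled]) for i, words in enumerate(unlabeled)]

tags = [doc.tags[0] for doc in docs]
print(tags)
```

Because the labeled tweets keep their row index as the tag, the vector of labeled tweet `i` can later be looked up by the same `i` that indexes its sentiment label.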
Next we construct the Doc2Vec models. Following the paper, I train two separate models (DM and DBOW) and later concatenate their two vectors into one. For the model parameter settings, refer to the documentation.
import multiprocessing
cores = multiprocessing.cpu_count()
# Note: gensim 4.0 and later name the `size` argument `vector_size`.
model_DM = Doc2Vec(size=400, window=8, min_count=1, sample=1e-4, negative=5, workers=cores, dm=1, dm_concat=1)
model_DBOW = Doc2Vec(size=400, window=8, min_count=1, sample=1e-4, negative=5, workers=cores, dm=0)
Next, we build the vocabulary for both models.
model_DM.build_vocab(trainDoc)
model_DBOW.build_vocab(trainDoc)
Now we are ready to train the Doc2Vec models. Here I use both the testing data and the training data to learn the vectors. This may seem to violate the usual separation of training and testing data, but this step is unsupervised and never uses the true labels of the testing set.
Training takes a long time, so this is a good moment for a break.
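for it in range(0,10):
    # Reshuffle the documents before every pass.
    random.shuffle(Doc2VecTrainID)
    trainDoc = [documents_all[id] for id in Doc2VecTrainID]
    # Newer gensim releases also require total_examples and epochs arguments here.
    model_DM.train(trainDoc)
    model_DBOW.train(trainDoc)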
Here I simply use softmax regression (multinomial logistic regression) to build the classifier. A neural network or an SVM may achieve higher performance; try them if you want.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import statsmodels.api as sm

# Fixed seed so the train/test split is reproducible.
random.seed(1212)
newindex = random.sample(range(0,TotalNum),TotalNum)
testID = newindex[-TestNum:]
trainID = newindex[:-TestNum]

# Each feature vector is the DM vector concatenated with the DBOW vector.
train_targets, train_regressors = zip(*[(Labels[id], list(model_DM.docvecs[id])+list(model_DBOW.docvecs[id])) for id in trainID])
train_regressors = sm.add_constant(train_regressors)
predictor = LogisticRegression(multi_class='multinomial',solver='lbfgs')
predictor.fit(train_regressors,train_targets)
Let's use G-mean (Geometric mean of recalls) and accuracy to evaluate the model.
accus=[]
Gmeans=[]

# Testing set
test_regressors = [list(model_DM.docvecs[id])+list(model_DBOW.docvecs[id]) for id in testID]
test_regressors = sm.add_constant(test_regressors)
test_predictions = predictor.predict(test_regressors)
accu=0  # must be initialized before counting correct predictions
for i in range(0,TestNum):
    if test_predictions[i]==df.loc[testID[i],u'airline_sentiment']:
        accu=accu+1
accus=accus+[1.0*accu/TestNum]
# Predictions are passed first, so rows are predicted classes and columns are true classes.
# Each factor below is a diagonal entry divided by its column sum, i.e. the recall of one class.
confusionM = confusion_matrix(test_predictions,(df.loc[testID,u'airline_sentiment']))
Gmeans=Gmeans+[pow(((1.0*confusionM[0,0]/(confusionM[1,0]+confusionM[2,0]+confusionM[0,0]))*(1.0*confusionM[1,1]/(confusionM[1,1]+confusionM[2,1]+confusionM[0,1]))*(1.0*confusionM[2,2]/(confusionM[1,2]+confusionM[2,2]+confusionM[0,2]))), 1.0/3)]

# Training set
train_predictions = predictor.predict(train_regressors)
accu=0
for i in range(0,len(train_targets)):
    if train_predictions[i]==train_targets[i]:
        accu=accu+1
accus=accus+[1.0*accu/len(train_targets)]
confusionM = confusion_matrix(train_predictions,train_targets)
Gmeans=Gmeans+[pow(((1.0*confusionM[0,0]/(confusionM[1,0]+confusionM[2,0]+confusionM[0,0]))*(1.0*confusionM[1,1]/(confusionM[1,1]+confusionM[2,1]+confusionM[0,1]))*(1.0*confusionM[2,2]/(confusionM[1,2]+confusionM[2,2]+confusionM[0,2]))), 1.0/3)]
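The long G-mean expression in the evaluation code is the cube root of the product of the three per-class recalls. As a small self-contained illustration, the sketch below computes the same two metrics directly from label lists. The helper name `gmean_and_accuracy` and the toy labels are hypothetical, and the function assumes every class appears at least once among the true labels.

```python
def gmean_and_accuracy(y_true, y_pred, classes):
    """Return (geometric mean of per-class recalls, accuracy) for two label lists."""
    product = 1.0
    for c in classes:
        n_true = sum(1 for t in y_true if t == c)                        # tweets truly in class c
        n_hit = sum(1 for t, p in zip(y_true, y_pred) if t == c == p)    # of those, predicted as c
        product *= float(n_hit) / n_true                                 # recall of class c
    n_correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return product ** (1.0 / len(classes)), float(n_correct) / len(y_true)

# Toy labels: recalls are 1.0 (negative), 0.5 (neutral), and 0.5 (positive).
y_true = ['negative'] * 4 + ['neutral'] * 2 + ['positive'] * 2
y_pred = ['negative'] * 4 + ['neutral', 'negative'] + ['positive', 'neutral']
gmean, accuracy = gmean_and_accuracy(y_true, y_pred, ['negative', 'neutral', 'positive'])
print(gmean, accuracy)
```

On imbalanced data such as airline tweets, where one sentiment class dominates, the G-mean drops sharply when any single class is recalled poorly, while accuracy can stay high.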
On my machine, the performance is:
Data set | G-Mean | Accuracy |
---|---|---|
testing set | 0.6225 | 0.7687 |
training set | 0.6297 | 0.7755 |
You can set different random seeds for multiple runs, or use cross-validation, to evaluate the model more reliably.
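One way to organize multiple runs is to wrap the split in a function of the seed. The sketch below is only an illustration: the helper `split_ids` is hypothetical, it assumes `test_num` is at least 1, and it leaves out fitting and scoring the classifier, which would be repeated for each split before averaging the scores.

```python
import random

def split_ids(total_num, test_num, seed):
    """Shuffle the ids 0..total_num-1 with a fixed seed and hold out the last test_num."""
    rng = random.Random(seed)       # private generator, so the global random state is untouched
    ids = list(range(total_num))
    rng.shuffle(ids)
    return ids[:-test_num], ids[-test_num:]

# Three different seeds give three different train/test splits of ten toy ids.
splits = [split_ids(10, 3, seed) for seed in (1212, 1213, 1214)]
for train_ids, test_ids in splits:
    print(len(train_ids), len(test_ids))
```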