A Sentiment Analysis Baseline for Douban Movie Short Reviews

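To get more value out of the data, this post shows how to run sentiment analysis on Douban movie review data and shares a fairly simple sentiment analysis baseline. Further optimization results may follow in a later post.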

Dataset processing

Fig 1. Dataset sample

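from langconv import *


def traditional2simplified(text):
    # Convert Traditional Chinese characters to Simplified Chinese
    text = Converter('zh-hans').convert(text)
    return text


# Normalize every review in the CONTENT column to Simplified Chinese
dataset["CONTENT"] = dataset.CONTENT.apply(lambda x: traditional2simplified(x))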
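To make the conversion step concrete, here is a minimal sketch of the idea behind Traditional-to-Simplified normalization: a character-level lookup. The three-entry table and the `to_simplified` helper are hypothetical and only for illustration. A real converter such as the `langconv` module used above ships a far larger table and may also handle multi-character phrases.

```python
# Hypothetical, deliberately tiny Traditional -> Simplified table (illustration only).
T2S_TABLE = str.maketrans({"電": "电", "這": "这", "歡": "欢"})


def to_simplified(text: str) -> str:
    """Replace each mapped Traditional character; leave everything else untouched."""
    return text.translate(T2S_TABLE)


converted = to_simplified("我喜歡這部電影")
print(converted)
```

Normalizing the script first means the same word written in either script ends up as one token in the vocabulary built later.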

Sentiment label definition

Fig 2. Number of reviews per label

Data preprocessing

import re

import jieba
from sklearn.feature_extraction.text import CountVectorizer


def get_stopwords():
    # Load the stopword list, one word per line
    with open('data/stopwords/stopword_normal.txt', encoding='UTF-8') as f:
        stopwords = [line.strip() for line in f]
    return stopwords


stopwords = get_stopwords()


def text_process(text):
    '''
    Process a string as follows:
    1. Remove punctuation
    2. Remove stopwords
    3. Return the remaining words as a list
    '''
    text = re.sub(r"[\s+\.\!\/_,\$%^*(+\"\']+|[+——！，。？、~@#￥%……&*（）]", "", text)
    ltext = jieba.lcut(text)
    res_text = []
    for word in ltext:
        if word not in stopwords:
            res_text.append(word)
    return res_text


X = dataset.CONTENT
y = dataset.label
# Build the bag-of-words vocabulary, then turn each review into a count vector
bow_transformer = CountVectorizer(analyzer=text_process).fit(X)
X = bow_transformer.transform(X)
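To illustrate what the bag-of-words step produces, here is a simplified pure-Python sketch of the idea. It is not how scikit-learn implements `CountVectorizer`. The helper names and the English tokens are hypothetical stand-ins for the token lists that `text_process` would return.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_count_vector(tokens, vocabulary):
    """Count occurrences of each vocabulary token; unknown tokens are ignored."""
    counts = Counter(tok for tok in tokens if tok in vocabulary)
    # dicts keep insertion order, so columns follow the vocabulary indices
    return [counts[tok] for tok in vocabulary]


# Hypothetical pre-tokenized reviews (stand-ins for text_process output)
docs = [["movie", "great"], ["movie", "boring", "boring"]]
vocabulary = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

Each review becomes one row of token counts. With millions of reviews, most entries are zero, which is why the transformer above returns a sparse matrix.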

Model training and evaluation

from sklearn.model_selection import train_test_split

# Hold out 10% of the data for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=99)

from sklearn.naive_bayes import MultinomialNB

# Train a multinomial Naive Bayes classifier on the count vectors
nb = MultinomialNB()
nb.fit(X_train, y_train)
preds = nb.predict(X_test)
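For readers who want to see what a multinomial Naive Bayes classifier computes, here is a minimal pure-Python sketch with Laplace smoothing (`alpha=1.0`, the same default value `MultinomialNB` uses). It illustrates the technique and is not scikit-learn's implementation. The function names and the toy reviews are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class token log likelihoods."""
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            t: math.log((token_counts[c][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_likelihood


def predict(tokens, log_prior, log_likelihood):
    """Return the class with the highest log score; unseen tokens are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in tokens if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy data: token lists with labels in {-1, 1}
docs = [["great", "movie"], ["boring", "movie"], ["great", "acting"]]
labels = [1, -1, 1]
log_prior, log_likelihood = train_multinomial_nb(docs, labels)
print(predict(["great"], log_prior, log_likelihood))
print(predict(["boring"], log_prior, log_likelihood))
```

Smoothing keeps a token that never appeared in a class from forcing that class's probability to zero, which matters for short reviews with rare words.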

Model evaluation

from sklearn.metrics import confusion_matrix, classification_report

# Compute evaluation metrics from the predicted and true labels
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

          -1       0.56      0.56      0.56     70406
           0       0.50      0.44      0.47    121276
           1       0.66      0.72      0.69    166544

    accuracy                           0.59    358226
   macro avg       0.57      0.57      0.57    358226
weighted avg       0.59      0.59      0.59    358226
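As a sanity check on the report, the short sketch below recomputes the macro and weighted averages from the per-class rows and compares the accuracy against a majority-class baseline. The per-class numbers are copied from the report above, so the results are only accurate to the two decimals shown there. The variable and function names are illustrative.

```python
# Per-class rows copied from the classification report above
report = {
    -1: {"precision": 0.56, "recall": 0.56, "f1": 0.56, "support": 70406},
    0: {"precision": 0.50, "recall": 0.44, "f1": 0.47, "support": 121276},
    1: {"precision": 0.66, "recall": 0.72, "f1": 0.69, "support": 166544},
}

total = sum(row["support"] for row in report.values())


def macro_avg(metric):
    """Unweighted mean over classes."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted_avg(metric):
    """Mean over classes, weighted by each class's support."""
    return sum(row[metric] * row["support"] for row in report.values()) / total


# Accuracy of always predicting the most frequent class in the test set
majority_baseline = max(row["support"] for row in report.values()) / total

print(round(macro_avg("f1"), 2), round(weighted_avg("f1"), 2))
print(round(majority_baseline, 2))
```

Always predicting label 1 would score about 0.46 on this test set, so the 0.59 accuracy is a real but modest gain. Label 0 is the weakest class by recall and F1.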

def sentiment_pred(text):
    # Vectorize a single review with the fitted transformer, then predict its label
    text_transformed = bow_transformer.transform([text])
    score = nb.predict(text_transformed)[0]
    return score

Fig 3. Review prediction results

References

1. Douban movie review dataset: http://moviedata.csuldw.com
2. http://www.csuldw.com
3. https://scikit-learn.org