Emotion AI: How to Build Sentiment Prediction Models Like a Pro

This article explores how to build sentiment prediction models with different machine learning techniques. It covers traditional machine learning models such as logistic regression, Naive Bayes, and support vector machines (SVM), as well as advanced models such as BERT, LSTM, RNN, and XGBoost combined with Word2Vec embeddings. It also discusses the pros and cons of each approach and how to interpret their results. If you want to learn more about artificial intelligence, you may also enjoy these articles:
15 Essential Python Libraries for AI Engineers in 2025
Top 10 Data and AI Trends for 2025
The Rise of AI Agents: A New Era of Intelligent Application Development
AI Certification Guide: A Career Path to a Six-Figure Salary

We will start with a small sample dataset of employee exit survey comments:

data = [
    {"text": "I loved working here, but I need to move to a new city.", "sentiment": "positive"},
    {"text": "The work environment was toxic and stressful.", "sentiment": "negative"},
    {"text": "It was an okay experience, nothing special.", "sentiment": "neutral"},
    {"text": "I am unsure about my feelings towards this job.", "sentiment": "ambiguous"}
]

TF-IDF Vectorization

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure of how important a word is to a document relative to the whole collection of documents.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert data to DataFrame
df = pd.DataFrame(data)

# Encode labels
label_mapping = {'positive': 0, 'negative': 1, 'neutral': 2, 'ambiguous': 3}
df['label'] = df['sentiment'].map(label_mapping)

# Split data (note: with only four illustrative samples, the validation split contains a single example)
X_train, X_val, y_train, y_val = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)

# Vectorize text data
vectorizer = TfidfVectorizer(max_features=1000)
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Train the model
log_reg = LogisticRegression()
log_reg.fit(X_train_vec, y_train)

# Predict and evaluate
y_pred = log_reg.predict(X_val_vec)
print("Logistic Regression Classification Report:")
print(classification_report(y_val, y_pred, target_names=['positive', 'negative', 'neutral', 'ambiguous'], labels=[0, 1, 2, 3]))
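
Once the vectorizer and classifier are fitted, scoring a new comment is just a transform followed by a predict. Below is a minimal sketch; the example sentence and the inverse_mapping helper are illustrative additions, not part of the original pipeline:

# Map numeric predictions back to sentiment names (illustrative helper)
inverse_mapping = {v: k for k, v in label_mapping.items()}

new_comments = ["Great team, but the hours were exhausting."]
new_vec = vectorizer.transform(new_comments)   # reuse the fitted TF-IDF vectorizer
pred_ids = log_reg.predict(new_vec)            # predicted label ids
pred_probs = log_reg.predict_proba(new_vec)    # class probabilities

for text, label_id, probs in zip(new_comments, pred_ids, pred_probs):
    print(f"{text!r} -> {inverse_mapping[label_id]} (confidence {probs.max():.2f})")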

Naive Bayes

from sklearn.naive_bayes import MultinomialNB

# Train the model
nb = MultinomialNB()
nb.fit(X_train_vec, y_train)

# Predict and evaluate
y_pred = nb.predict(X_val_vec)
print("Naive Bayes Classification Report:")
print(classification_report(y_val, y_pred, target_names=['positive', 'negative', 'neutral', 'ambiguous'], labels=[0, 1, 2, 3]))

Support Vector Machine (SVM)

from sklearn.svm import SVC

# Train the model
svm = SVC(kernel='linear')
svm.fit(X_train_vec, y_train)

# Predict and evaluate
y_pred = svm.predict(X_val_vec)
print("SVM Classification Report:")
print(classification_report(y_val, y_pred, target_names=['positive', 'negative', 'neutral', 'ambiguous'], labels=[0, 1, 2, 3]))

BERT

BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based model designed to understand the context of words in text.

from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4)

# Split the data
train_texts, val_texts, train_labels, val_labels = train_test_split(
    [d['text'] for d in data], [0, 1, 2, 3], test_size=0.2, random_state=42  # labels follow the order of data: positive, negative, neutral, ambiguous
)

# Tokenize the data
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)

# Convert to torch tensors
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = Dataset(train_encodings, train_labels)
val_dataset = Dataset(val_encodings, val_labels)

# Define training arguments
training_args = TrainingArguments(output_dir='./results', num_train_epochs=3, per_device_train_batch_size=2)

# Create Trainer instance
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=val_dataset)

# Train the model
trainer.train()

# Evaluate the model
predictions = trainer.predict(val_dataset)
pred_labels = predictions.predictions.argmax(axis=1)

# Print classification report
print(classification_report(val_labels, pred_labels, target_names=['positive', 'negative', 'neutral', 'ambiguous'], labels=[0, 1, 2, 3]))
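
Once fine-tuned, the model can score unseen text directly: tokenize the new comment, run a forward pass without gradients, and take the argmax of the logits. A minimal sketch follows; the example sentence and the explicit device handling are assumptions added for illustration:

# Score a new comment with the fine-tuned BERT model (illustrative sketch)
model.eval()
new_inputs = tokenizer(["The managers never listened to our feedback."],
                       truncation=True, padding=True, return_tensors="pt")
new_inputs = {k: v.to(model.device) for k, v in new_inputs.items()}

with torch.no_grad():
    logits = model(**new_inputs).logits

predicted_id = logits.argmax(dim=-1).item()
print("Predicted label id:", predicted_id)  # 0=positive, 1=negative, 2=neutral, 3=ambiguous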

LSTM

LSTM (Long Short-Term Memory) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Tokenize the data
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts([d['text'] for d in data])

# Split the data
train_texts, val_texts, train_labels, val_labels = train_test_split(
    [d['text'] for d in data], [0, 1, 2, 3], test_size=0.2, random_state=42  # labels follow the order of data: positive, negative, neutral, ambiguous
)

# Tokenize and pad the data
train_sequences = tokenizer.texts_to_sequences(train_texts)
val_sequences = tokenizer.texts_to_sequences(val_texts)
train_padded = pad_sequences(train_sequences, maxlen=50)
val_padded = pad_sequences(val_sequences, maxlen=50)

# Define the LSTM model
model = Sequential()
model.add(Embedding(input_dim=1000, output_dim=64, input_length=50))
model.add(LSTM(64))
model.add(Dense(4, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(train_padded, train_labels, epochs=5, batch_size=2, validation_data=(val_padded, val_labels))

# Evaluate the model
pred_labels = model.predict(val_padded).argmax(axis=1)

# Print classification report
print(classification_report(val_labels, pred_labels, target_names=['positive', 'negative', 'neutral', 'ambiguous'], labels=[0, 1, 2, 3]))
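
Inference with the trained LSTM reuses the same preprocessing as training: convert the raw text to integer sequences with the fitted tokenizer, pad to the same length, and take the argmax of the softmax output. A minimal sketch; the example sentence is an assumption:

# Predict the sentiment of a new comment with the trained LSTM (illustrative)
new_texts = ["I enjoyed my colleagues but felt underpaid."]
new_sequences = tokenizer.texts_to_sequences(new_texts)
new_padded_seq = pad_sequences(new_sequences, maxlen=50)   # same maxlen as training

pred_id = model.predict(new_padded_seq).argmax(axis=1)[0]
print("Predicted label id:", pred_id)  # 0=positive, 1=negative, 2=neutral, 3=ambiguous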

RNN

RNNs (Recurrent Neural Networks) are a class of neural networks that are well suited to modeling sequential data.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Tokenize the data
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts([d['text'] for d in data])

# Split the data
train_texts, val_texts, train_labels, val_labels = train_test_split(
    [d['text'] for d in data], [0, 1, 2, 3], test_size=0.2, random_state=42  # labels follow the order of data: positive, negative, neutral, ambiguous
)

# Tokenize and pad the data
train_sequences = tokenizer.texts_to_sequences(train_texts)
val_sequences = tokenizer.texts_to_sequences(val_texts)
train_padded = pad_sequences(train_sequences, maxlen=50)
val_padded = pad_sequences(val_sequences, maxlen=50)

# Define the RNN model
model = Sequential()
model.add(Embedding(input_dim=1000, output_dim=64, input_length=50))
model.add(SimpleRNN(64))
model.add(Dense(4, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(train_padded, train_labels, epochs=5, batch_size=2, validation_data=(val_padded, val_labels))

# Evaluate the model
pred_labels = model.predict(val_padded).argmax(axis=1)

# Print classification report
print(classification_report(val_labels, pred_labels, target_names=['positive', 'negative', 'neutral', 'ambiguous'], labels=[0, 1, 2, 3]))

XGBoost with Word2Vec

XGBoost is a powerful gradient boosting algorithm, and Word2Vec is a technique for creating dense word embeddings that capture semantic relationships between words. Below is how to use Word2Vec embeddings with XGBoost for sentiment prediction.

Preprocessing and Word2Vec

First, we need to preprocess the text data and train a Word2Vec model.

import pandas as pd
from sklearn.model_selection import train_test_split
from gensim.models import Word2Vec
import numpy as np

# Convert data to DataFrame
df = pd.DataFrame(data)

# Encode labels
label_mapping = {'positive': 0, 'negative': 1, 'neutral': 2, 'ambiguous': 3}
df['label'] = df['sentiment'].map(label_mapping)

# Split data
X_train, X_val, y_train, y_val = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)

# Tokenize text data
X_train_tokens = [text.split() for text in X_train]
X_val_tokens = [text.split() for text in X_val]

# Train Word2Vec model
word2vec_model = Word2Vec(sentences=X_train_tokens, vector_size=100, window=5, min_count=1, workers=4)

# Function to average word vectors for a document
def document_vector(doc):
    doc = [word for word in doc if word in word2vec_model.wv.index_to_key]
    if not doc:  # fall back to a zero vector when no word is in the vocabulary
        return np.zeros(word2vec_model.vector_size)
    return np.mean(word2vec_model.wv[doc], axis=0)

# Create feature vectors
X_train_vec = np.array([document_vector(doc) for doc in X_train_tokens])
X_val_vec = np.array([document_vector(doc) for doc in X_val_tokens])

XGBoost Model

import xgboost as xgb
from sklearn.metrics import classification_report

# Convert to DMatrix
dtrain = xgb.DMatrix(X_train_vec, label=y_train)
dval = xgb.DMatrix(X_val_vec, label=y_val)

# Set parameters
params = {
    'objective': 'multi:softmax',
    'num_class': 4,
    'eval_metric': 'mlogloss'
}

# Train the model
bst = xgb.train(params, dtrain, num_boost_round=100)

# Predict and evaluate
y_pred = bst.predict(dval).astype(int)  # multi:softmax returns class ids as floats
print("XGBoost Classification Report:")
print(classification_report(y_val, y_pred, target_names=['positive', 'negative', 'neutral', 'ambiguous'], labels=[0, 1, 2, 3]))
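
Scoring new text with this pipeline means running the comment through the same Word2Vec averaging step before wrapping it in a DMatrix. A minimal sketch; the example sentence is an assumption:

# Predict the sentiment of a new comment with Word2Vec features + XGBoost (illustrative)
new_text = "The onboarding process was confusing and slow."
new_vec = document_vector(new_text.split()).reshape(1, -1)  # average the word vectors
dnew = xgb.DMatrix(new_vec)

pred_id = int(bst.predict(dnew)[0])
print("Predicted label id:", pred_id)  # 0=positive, 1=negative, 2=neutral, 3=ambiguous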

Traditional Machine Learning Models (Logistic Regression, Naive Bayes, SVM)

Pros:

  • Simplicity: easy to implement and interpret, well suited to rapid prototyping.
  • Efficiency: fast training and prediction with low resource usage.
  • Baseline performance: as starting-point models, they often deliver reasonable performance on text classification tasks.

Cons:

  • No context awareness: they cannot capture the contextual relationships or semantics of words.
  • Reliance on feature engineering: features must be designed and extracted manually, which takes significant effort.

BERT

Pros:

  • Contextual understanding: captures the contextual meaning of words within a sentence, which is exactly what natural language processing demands.
  • Pre-trained models: leverages powerful pre-trained models through transfer learning and adapts to many tasks.
  • Strong performance: delivers excellent accuracy and results on many NLP tasks.

Cons:

  • Complexity: requires substantial computational resources, and training and inference are more involved.
  • Training time: takes longer to train than traditional models, especially on large datasets.

LSTM and RNN

Pros:

  • Sequence modeling: effectively captures order and temporal dependencies in text.
  • Flexibility: handles variable-length input sequences such as sentences or documents.

Cons:

  • Longer training time: training slows down noticeably on large datasets.
  • Vanishing gradients: RNNs are prone to vanishing gradients; LSTMs mitigate the problem to a large extent, but performance bottlenecks can remain.

XGBoost and Word2Vec

Pros:

  • High performance: XGBoost is known for excellent classification performance and accuracy.
  • Semantic understanding: Word2Vec captures semantic relationships between words, improving the model's grasp of the text.

Cons:

  • Complexity: requires extra steps to generate and integrate the word embeddings (Word2Vec).
  • Longer training time: training both Word2Vec and XGBoost can consume considerable time and compute.

The classification report provides the following key metrics for evaluating model performance (a short sketch showing how to compute them with scikit-learn follows the list):

  • Precision: of the samples predicted as a given class, how many actually belong to that class.
    Formula: Precision = TP / (TP + FP)
  • Recall: of the samples that actually belong to a class, how many were correctly predicted.
    Formula: Recall = TP / (TP + FN)
  • F1-Score: the harmonic mean of precision and recall, balancing the two.
    Formula: F1 = 2 * (Precision * Recall) / (Precision + Recall)
  • Support: the number of samples of each class actually present in the dataset.
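
These metrics do not have to be computed by hand; scikit-learn exposes them directly. A minimal sketch using made-up binary labels purely for illustration:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]   # actual labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1]   # predicted labels (illustrative)

# For the positive class here: TP = 3, FP = 0, FN = 1
print("Precision:", precision_score(y_true, y_pred))  # 3 / (3 + 0) = 1.00
print("Recall:   ", recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print("F1-score: ", f1_score(y_true, y_pred))         # 2 * 1.00 * 0.75 / 1.75 ≈ 0.86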

Here is an example classification report:

              precision    recall  f1-score   support

    positive       0.80      0.89      0.84       100
    negative       0.75      0.60      0.67        50
     neutral       0.70      0.65      0.67        40
   ambiguous       0.60      0.50      0.55        10

    accuracy                           0.76       200
   macro avg       0.71      0.66      0.68       200
weighted avg       0.75      0.76      0.75       200

Interpreting the Classification Report

  • High precision, low recall: the model is conservative and tends to avoid false positives.
  • Low precision, high recall: the model is permissive and tends to avoid false negatives.
  • Balanced F1-score: reflects a good trade-off between precision and recall.

This article has walked through a range of techniques for building sentiment prediction models, including traditional machine learning models (logistic regression, Naive Bayes, SVM), advanced models (BERT, LSTM, RNN), and XGBoost combined with Word2Vec embeddings. Each approach has its strengths and weaknesses, and the right choice depends on the requirements of your task.

With a clear understanding of each model's characteristics and use cases, you can pick the best option for your sentiment analysis project. Trying several approaches and tuning their parameters will often improve performance further.

Happy coding, and enjoy exploring the possibilities on your modeling journey!

Thanks for reading! You can also subscribe to our YouTube channel for plenty of open courses on the big data industry: https://www.youtube.com/channel/UCa8NLpvi70mHVsW4J_x9OeQ; and follow us on LinkedIn to grow your network! https://www.linkedin.com/company/dataapplab/

Original author: Shanaka C. DeSoysa
Translated by: 过儿
Layout and design: 过儿
Proofreading: Jason
Original article: https://pub.towardsai.net/emotion-ai-how-to-build-sentiment-prediction-models-like-a-pro-2f0a51bdd976