First, a quick introduction to the dataset used in this tutorial. It consists of:
Training set: about 20,000 Chinese movie reviews, roughly 10,000 positive and 10,000 negative.
Validation set: about 6,000 Chinese movie reviews, roughly 3,000 positive and 3,000 negative.
Test set: about 360 Chinese movie reviews, roughly 180 positive and 180 negative.
Import the modules needed for training:
import gensim
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from collections import Counter
from torch.utils.data import TensorDataset, DataLoader
1. Data preprocessing

(1) Pre-trained word vectors. This article uses word2vec vectors trained on Chinese Wikipedia. First, build the vocabulary and store it in the form {word: id}:
def build_word2id(file, save_to_path=None):
    """
    Build a {word: id} vocabulary from the training and validation corpora.
    :param file: path to write the word2id mapping to
    :param save_to_path: if truthy, write the mapping to `file`
    :return: the word2id dict
    """
    word2id = {'_PAD_': 0}
    path = ['./Dataset/train.txt', './Dataset/validation.txt']
    for _path in path:
        with open(_path, encoding='utf-8') as f:
            for line in f.readlines():
                # each line is: label word word word ...
                sp = line.strip().split()
                for word in sp[1:]:
                    if word not in word2id:
                        word2id[word] = len(word2id)
    if save_to_path:
        with open(file, 'w', encoding='utf-8') as f:
            for w in word2id:
                f.write(w + '\t')
                f.write(str(word2id[w]))
                f.write('\n')
    return word2id

The function call and its result look as follows:
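A sketch of the call; the save path ./Dataset/word2id.txt is an assumed name, not from the original post, and the printed size should line up with vocab_size in the CONFIG class below.

word2id = build_word2id('./Dataset/word2id.txt', save_to_path=True)
print(len(word2id))  # vocabulary size; CONFIG.vocab_size below (58954) must match it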
(2) Based on the pre-trained word2vec, build the word vectors for the words contained in the training corpus:
def build_word2vec(fname, word2id, save_to_path=None):
    """
    :param fname: path to the pre-trained word2vec file
    :param word2id: vocabulary of the corpus text
    :param save_to_path: optionally save the corpus word vectors to disk
    :return: word vectors for the corpus vocabulary, indexed by id
    """
    n_words = max(word2id.values()) + 1
    model = gensim.models.KeyedVectors.load_word2vec_format(fname, binary=True)
    # words missing from the pre-trained model keep a random initialization
    word_vecs = np.array(np.random.uniform(-1., 1., [n_words, model.vector_size]))
    for word in word2id.keys():
        try:
            word_vecs[word2id[word]] = model[word]
        except KeyError:
            pass
    if save_to_path:
        with open(save_to_path, 'w', encoding='utf-8') as f:
            for vec in word_vecs:
                vec = [str(w) for w in vec]
                f.write(' '.join(vec))
                f.write('\n')
    return word_vecs

The function call and its result look as follows:
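The word2vec matrix consumed later by CONFIG.pretrained_embed comes from a call of this shape; the .bin filename here is an assumption, and any binary-format Chinese-Wikipedia word2vec file will do.

word2vec = build_word2vec('./Dataset/wiki_word2vec_50.bin', word2id)
print(word2vec.shape)  # (n_words, 50), assuming 50-dimensional pre-trained vectors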
(3) Map the class labels to numeric ids, saved as a dict of the form {pos: 0, neg: 1}:
def cat_to_id(classes=None):
    """
    :param classes: class labels; defaults to 0: pos, 1: neg
    :return: (classes, {label: id})
    """
    if not classes:
        classes = ['0', '1']
    cat2id = {cat: idx for (idx, cat) in enumerate(classes)}
    return classes, cat2id
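With the default labels, the mapping is deterministic:

classes, cat2id = cat_to_id()
print(classes)  # ['0', '1']
print(cat2id)   # {'0': 0, '1': 1}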
(4) Load the corpora (train/dev/test):

def load_corpus(path, word2id, max_sen_len=50):
    """
    :param path: file containing the sample corpus
    :param word2id: vocabulary mapping built above
    :param max_sen_len: sentences are truncated/padded to this length
    :return: contents (word-id sequences) and labels (integer class ids)
    """
    _, cat2id = cat_to_id()
    contents, labels = [], []
    with open(path, encoding='utf-8') as f:
        for line in f.readlines():
            sp = line.strip().split()
            label = sp[0]
            content = [word2id.get(w, 0) for w in sp[1:]]
            content = content[:max_sen_len]
            if len(content) < max_sen_len:
                content += [word2id['_PAD_']] * (max_sen_len - len(content))
            labels.append(label)
            contents.append(content)
    counter = Counter(labels)
    print('Total number of samples: %d' % len(labels))
    print('Sample count per class:')
    for w in counter:
        print(w, counter[w])

    contents = np.asarray(contents)
    labels = np.array([cat2id[l] for l in labels])

    return contents, labels

The concrete calls and their results are shown below:
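A sketch of the calls that produce the train_contents/val_contents/test_contents arrays used in sections 3 and 4; the test-file name follows the ./Dataset/ layout used in build_word2id and is an assumption.

train_contents, train_labels = load_corpus('./Dataset/train.txt', word2id)
val_contents, val_labels = load_corpus('./Dataset/validation.txt', word2id)
test_contents, test_labels = load_corpus('./Dataset/test.txt', word2id)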
After preprocessing, the final data format looks like this:
x: [1434, 5454, 2323, ..., 0, 0, 0]
y: [1]
Here x is the sequence of word ids making up one review, and y is the class: pos (positive) = 0, neg (negative) = 1.
2. Building the model

Next, build the TextCNN model; its structure is shown in the figure below:
The model consists of a word-embedding layer, a convolutional layer, a max-pooling layer, and a fully connected layer. TextCNN is structurally similar to a standard CNN; for a refresher, see 老shi's earlier article on CNNs, 利用Tensorflow2.0實現卷積神經網絡CNN. The shape check after the model code below makes the layer-by-layer data flow concrete.
(1) Configure the model parameters:
class CONFIG():
    update_w2v = True            # whether to fine-tune the word vectors during training
    vocab_size = 58954           # vocabulary size; must match the size of word2id
    n_class = 2                  # number of classes: pos and neg
    embedding_dim = 50           # word-vector dimensionality
    drop_keep_prob = 0.5         # dropout rate; nn.Dropout's argument is the drop probability, not the keep ratio
    num_filters = 256            # number of convolution filters
    kernel_size = 3              # convolution kernel size
    pretrained_embed = word2vec  # pre-trained word embeddings built above

(2) Build the TextCNN model:
class TextCNN(nn.Module):
    def __init__(self, config):
        super(TextCNN, self).__init__()
        update_w2v = config.update_w2v
        vocab_size = config.vocab_size
        n_class = config.n_class
        embedding_dim = config.embedding_dim
        num_filters = config.num_filters
        kernel_size = config.kernel_size
        drop_keep_prob = config.drop_keep_prob
        pretrained_embed = config.pretrained_embed

        # embedding layer initialized from the pre-trained word vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.embedding.weight.data.copy_(torch.from_numpy(pretrained_embed))
        self.embedding.weight.requires_grad = update_w2v
        # convolutional layer; the kernel spans the full embedding width
        self.conv = nn.Conv2d(1, num_filters, (kernel_size, embedding_dim))
        # dropout
        self.dropout = nn.Dropout(drop_keep_prob)
        # fully connected layer
        self.fc = nn.Linear(num_filters, n_class)

    def forward(self, x):
        x = x.to(torch.int64)                      # (batch, seq_len) word ids
        x = self.embedding(x)                      # (batch, seq_len, embedding_dim)
        x = x.unsqueeze(1)                         # add a channel dim for Conv2d
        x = F.relu(self.conv(x)).squeeze(3)        # (batch, num_filters, seq_len - kernel_size + 1)
        x = F.max_pool1d(x, x.size(2)).squeeze(2)  # max over time: (batch, num_filters)
        x = self.dropout(x)
        x = self.fc(x)                             # (batch, n_class)
        return x
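As a quick sanity check of the data flow (a sketch, not part of the original post), a dummy config with random embeddings standing in for the real word2vec matrix confirms the output shape:

# illustrative only: random embeddings replace the pre-trained matrix
class _DummyConfig(CONFIG):
    vocab_size = 1000
    pretrained_embed = np.random.uniform(-1., 1., [1000, 50])

m = TextCNN(_DummyConfig())
x = torch.zeros(8, 50)  # a fake batch: 8 sentences of 50 word ids each
print(m(x).shape)       # torch.Size([8, 2]): one (pos, neg) score pair per sentence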
3. Training the model

(1) Set the hyperparameters:
config = CONFIG()       # model configuration
learning_rate = 0.001   # learning rate
batch_size = 32         # training batch size
epochs = 4              # number of training epochs
model_path = None       # path of a saved model to resume from, if any
verbose = True          # print training progress
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)

(2) Load the training data:
# mix the training and validation sets
contents = np.vstack([train_contents, val_contents])
labels = np.concatenate([train_labels, val_labels])

# wrap the data for training; the word ids are stored as floats here and
# cast back to int64 inside TextCNN.forward
train_dataset = TensorDataset(torch.from_numpy(contents).type(torch.float),
                              torch.from_numpy(labels).type(torch.long))
train_dataloader = DataLoader(dataset=train_dataset, batch_size=batch_size,
                              shuffle=True, num_workers=2)

(3) Train the model:
def train(dataloader):
    # build the model; optionally resume from a previous checkpoint
    model = TextCNN(config)
    if model_path:
        model.load_state_dict(torch.load(model_path))
    model.to(device)

    # optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    # loss function
    criterion = nn.CrossEntropyLoss()

    # training loop
    for epoch in range(epochs):
        for batch_idx, (batch_x, batch_y) in enumerate(dataloader):
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            output = model(batch_x)
            loss = criterion(output, batch_y)

            if verbose and batch_idx % 200 == 0:
                print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                    epoch + 1, batch_idx * len(batch_x), len(dataloader.dataset),
                    100. * batch_idx / len(dataloader), loss.item()))

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # save the trained model
    torch.save(model.state_dict(), 'model.pth')

Training is then launched with a single call, which produces the concrete training output shown below:
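A minimal invocation, reusing the train_dataloader built in step (2):

train(train_dataloader)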
4. Testing the model

Finally, measure the model's accuracy on the test set.
(1) Set the hyperparameters:
model_path = 'model.pth'   # path of the saved model
batch_size = 32            # test batch size

(2) Load the test set:
test_dataset = TensorDataset(torch.from_numpy(test_contents).type(torch.float),
                             torch.from_numpy(test_labels).type(torch.long))
test_dataloader = DataLoader(dataset=test_dataset, batch_size=batch_size,
                             shuffle=False, num_workers=2)

(3) Evaluate the model's accuracy on the test set:
def predict(dataloader):
    # load the trained model and switch to evaluation mode
    model = TextCNN(config)
    model.load_state_dict(torch.load(model_path))
    model.eval()
    model.to(device)

    # evaluation loop
    count, correct = 0, 0
    for _, (batch_x, batch_y) in enumerate(dataloader):
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        output = model(batch_x)
        correct += (output.argmax(1) == batch_y).float().sum().item()
        count += len(batch_x)

    # print the accuracy
    print('test accuracy is {:.2f}%.'.format(100 * correct / count))
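Evaluation is again a single call, using the test_dataloader from step (2); the expected output matches the figure quoted below.

predict(test_dataloader)
# test accuracy is 85.37%.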
As the results show, the TextCNN model reaches an accuracy of 85.37% on the test set, which is quite good for a text classifier and suggests the model handles Chinese sentiment classification very well. That wraps up this lesson. If you are interested in learning more about machine learning, keep following 老shi's official-account articles for more solid content. Thanks for your support!