Can the "slimmed-down" ALBERT replace BERT?

2020-12-26, The Paper (澎湃新聞)


Shisan, reporting from Aofeisi

QbitAI report | WeChat official account QbitAI

80% fewer parameters than BERT, yet better performance.

That is ALBERT, the "slimmed-down BERT" Google proposed last year.

The model drew a great deal of attention as soon as it was released, and comparisons between the two quickly became a hot topic.

Recently, Naman Bansal raised a question:

Should ALBERT be used in place of BERT?

The only way to find out is to put the two side by side.

BERT vs. ALBERT

BERT is the model most people are already familiar with.

Proposed by Google in 2018, it was trained on a very large corpus of 3.3 billion words.

Its innovations are concentrated in the pre-training stage: Masked LM and Next Sentence Prediction are used to capture word-level and sentence-level representations, respectively.
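As a quick illustration (a made-up toy example, not from the original post), a single BERT-style pre-training instance combines both objectives: one token is masked out for the masked LM, and the sentence pair carries a label for next-sentence prediction:

import random

# Toy illustration of the two BERT pre-training objectives (hypothetical example).
sentence_a = ["the", "pasta", "was", "cooked", "perfectly"]
sentence_b = ["i", "would", "order", "it", "again"]

# Masked LM: hide one random token; the model must predict the original word.
masked = list(sentence_a)
pos = random.randrange(len(masked))
mlm_target = masked[pos]
masked[pos] = "[MASK]"

# Next Sentence Prediction: label whether sentence_b really follows sentence_a.
nsp_label = 1  # 1 = "is the next sentence", 0 = "randomly sampled sentence"

print(masked, "-> predict:", mlm_target)
print("NSP label:", nsp_label)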

The arrival of BERT completely changed the relationship between pre-trained word representations and specific downstream NLP tasks.

A year later, Google followed up with ALBERT, also known as "A Lite BERT". Its backbone is similar to BERT's: it still uses a Transformer encoder, and the activation function is still GELU.

Its biggest achievement is cutting the parameter count by 80% relative to BERT while obtaining even better results.

Compared with BERT, its main improvements are factorized embedding parameterization, cross-layer parameter sharing, an SOP (sentence-order prediction) loss for inter-sentence coherence, and the removal of dropout.
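To get a feel for the first of these tricks, here is a small back-of-the-envelope sketch (my own illustration, using typical base-model sizes rather than numbers from the post) of how factorizing the embedding matrix alone removes a large chunk of parameters:

# Hypothetical comparison of embedding-layer parameters (base-sized models).
V, H, E = 30000, 768, 128   # vocab size, hidden size, ALBERT's smaller embedding size

bert_embedding_params = V * H            # one V x H embedding matrix
albert_embedding_params = V * E + E * H  # factorized: V x E lookup followed by E x H projection

print(bert_embedding_params)    # 23,040,000
print(albert_embedding_params)  # 3,938,304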

The figure below shows a performance comparison between BERT and ALBERT on the SQuAD and RACE datasets.

As you can see, ALBERT achieves quite good results.

How do you pre-train ALBERT on a custom corpus?

To get a better understanding of ALBERT, the next step is to implement it on a custom corpus.

The dataset is a collection of restaurant reviews, and the goal is to use the ALBERT model to recognize dish names in them.
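Concretely, this is a token-classification task: every token in a review gets a label indicating whether it belongs to a dish name. A hypothetical example of the kind of target labels the model is trained to produce (the review itself is made up):

# Hypothetical example of token-level labels for dish-name extraction.
review_tokens = ["the", "pad", "thai", "was", "too", "sweet"]
labels        = [  0,     1,     1,     0,     0,     0   ]  # 1 = token is part of a dish name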

Step 1: download the dataset and prepare the files

# Downloading all files and data

!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/data_toy/dish_name_train.csv
!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/data_toy/dish_name_val.csv
!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/data_toy/restaurant_review.txt
!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/data_toy/restaurant_review_nopunct.txt
!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/models_toy/albert_config.json
!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/model_checkpoint/finetune_checkpoint
!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/model_checkpoint/pretrain_checkpoint

# Creating files and setting up ALBERT

!pip install sentencepiece
!git clone https://github.com/google-research/ALBERT
!python ./ALBERT/create_pretraining_data.py --input_file "restaurant_review.txt" --output_file "restaurant_review_train" --vocab_file "vocab.txt" --max_seq_length=64
!pip install transformers
!pip install tfrecord

Step 2: use transformers and define the layers

# Defining layers for ALBERT

from transformers.modeling_albert import AlbertModel, AlbertPreTrainedModel
from transformers.configuration_albert import AlbertConfig
import torch
import torch.nn as nn

class AlbertSequenceOrderHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, 2)
        self.bias = nn.Parameter(torch.zeros(2))

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        prediction_scores = hidden_states + self.bias

        return prediction_scores

from torch.nn import CrossEntropyLoss
from transformers.modeling_bert import ACT2FN

class AlbertForPretrain(AlbertPreTrainedModel):

    def __init__(self, config):
        super().__init__(config)

        self.albert = AlbertModel(config)

        # For Masked LM
        # The original huggingface implementation created new output weights via a dense layer.
        # The original ALBERT instead ties the decoder weights to the input word embeddings (done below).
        self.predictions_dense = nn.Linear(config.hidden_size, config.embedding_size)
        self.predictions_activation = ACT2FN[config.hidden_act]
        self.predictions_LayerNorm = nn.LayerNorm(config.embedding_size)
        self.predictions_bias = nn.Parameter(torch.zeros(config.vocab_size))
        self.predictions_decoder = nn.Linear(config.embedding_size, config.vocab_size)

        self.predictions_decoder.weight = self.albert.embeddings.word_embeddings.weight

        # For sentence order prediction
        self.seq_relationship = AlbertSequenceOrderHead(config)

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        masked_lm_labels=None,
        seq_relationship_labels=None,
    ):

        outputs = self.albert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
        )

        loss_fct = CrossEntropyLoss()

        sequence_output = outputs[0]

        sequence_output = self.predictions_dense(sequence_output)
        sequence_output = self.predictions_activation(sequence_output)
        sequence_output = self.predictions_LayerNorm(sequence_output)
        prediction_scores = self.predictions_decoder(sequence_output)

        if masked_lm_labels is not None:
            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size)
                                      , masked_lm_labels.view(-1))

        pooled_output = outputs[1]
        seq_relationship_scores = self.seq_relationship(pooled_output)
        if seq_relationship_labels is not None:
            seq_relationship_loss = loss_fct(seq_relationship_scores.view(-1, 2), seq_relationship_labels.view(-1))

        loss = masked_lm_loss + seq_relationship_loss

        return loss

Step 3: use the LAMB optimizer, pre-train, and set up fine-tuning of ALBERT

# Using LAMB optimizer
# LAMB - "https://github.com/cybertronai/pytorch-lamb"

import torch
from torch.optim import Optimizer

class Lamb(Optimizer):
    r"""Implements the Lamb algorithm.
    It has been proposed in `Large Batch Optimization for Deep Learning: Training BERT in 76 minutes`_.
    Arguments:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float, optional): learning rate (default: 1e-3)
        betas (Tuple[float, float], optional): coefficients used for computing
            running averages of gradient and its square (default: (0.9, 0.999))
        eps (float, optional): term added to the denominator to improve
            numerical stability (default: 1e-8)
        weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
        adam (bool, optional): always use trust ratio = 1, which turns this into
            Adam. Useful for comparison purposes.
    .. _Large Batch Optimization for Deep Learning: Training BERT in 76 minutes:
        https://arxiv.org/abs/1904.00962
    """

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-6,
                 weight_decay=0, adam=False):
        if not 0.0 <= lr:
            raise ValueError("Invalid learning rate: {}".format(lr))
        if not 0.0 <= eps:
            raise ValueError("Invalid epsilon value: {}".format(eps))
        if not 0.0 <= betas[0] < 1.0:
            raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0]))
        if not 0.0 <= betas[1] < 1.0:
            raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1]))
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        weight_decay=weight_decay)
        self.adam = adam
        super(Lamb, self).__init__(params, defaults)

    def step(self, closure=None):
        """Performs a single optimization step.
        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if grad.is_sparse:
                    raise RuntimeError('Lamb does not support sparse gradients, consider SparseAdam instead.')

                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = torch.zeros_like(p.data)
                    # Exponential moving average of squared gradient values
                    state['exp_avg_sq'] = torch.zeros_like(p.data)

                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                beta1, beta2 = group['betas']

                state['step'] += 1

                # Decay the first and second moment running average coefficient
                # m_t
                exp_avg.mul_(beta1).add_(1 - beta1, grad)
                # v_t
                exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

                # Paper v3 does not use debiasing.
                # bias_correction1 = 1 - beta1 ** state['step']
                # bias_correction2 = 1 - beta2 ** state['step']
                # Apply bias to lr to avoid broadcast.
                step_size = group['lr']  # * math.sqrt(bias_correction2) / bias_correction1

                weight_norm = p.data.pow(2).sum().sqrt().clamp(0, 10)

                adam_step = exp_avg / exp_avg_sq.sqrt().add(group['eps'])
                if group['weight_decay'] != 0:
                    adam_step.add_(group['weight_decay'], p.data)

                adam_norm = adam_step.pow(2).sum().sqrt()
                if weight_norm == 0 or adam_norm == 0:
                    trust_ratio = 1
                else:
                    trust_ratio = weight_norm / adam_norm
                state['weight_norm'] = weight_norm
                state['adam_norm'] = adam_norm
                state['trust_ratio'] = trust_ratio
                if self.adam:
                    trust_ratio = 1

                p.data.add_(-step_size * trust_ratio, adam_step)

        return loss

import time
import torch.nn as nn
import torch
from tfrecord.torch.dataset import TFRecordDataset
import numpy as np
import os

LEARNING_RATE = 0.001
EPOCH = 40
BATCH_SIZE = 2
MAX_GRAD_NORM = 1.0

print(f"--- Resume/Start training ---")
feat_map = {"input_ids": "int",
            "input_mask": "int",
            "segment_ids": "int",
            "next_sentence_labels": "int",
            "masked_lm_positions": "int",
            "masked_lm_ids": "int"}
pretrain_file = 'restaurant_review_train'

# Create albert pretrain model
config = AlbertConfig.from_json_file("albert_config.json")
albert_pretrain = AlbertForPretrain(config)
# Create optimizer
optimizer = Lamb([{"params": [p for n, p in list(albert_pretrain.named_parameters())]}], lr=LEARNING_RATE)
albert_pretrain.train()
dataset = TFRecordDataset(pretrain_file, index_path=None, description=feat_map)
loader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE)

tmp_loss = 0
start_time = time.time()

if os.path.isfile('pretrain_checkpoint'):
    print(f"--- Load from checkpoint ---")
    checkpoint = torch.load("pretrain_checkpoint")
    albert_pretrain.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    epoch = checkpoint['epoch']
    loss = checkpoint['loss']
    losses = checkpoint['losses']

else:
    epoch = -1
    losses = []

for e in range(epoch+1, EPOCH):
    for batch in loader:
        b_input_ids = batch['input_ids'].long()
        b_token_type_ids = batch['segment_ids'].long()
        b_seq_relationship_labels = batch['next_sentence_labels'].long()

        # Convert the data from the decoded TFRecord format (produced by Google's ALBERT
        # create_pretraining_data.py script) into the format required by huggingface's
        # pytorch implementation of ALBERT
        mask_rows = np.nonzero(batch['masked_lm_positions'].numpy())[0]
        mask_cols = batch['masked_lm_positions'].numpy()[batch['masked_lm_positions'].numpy() != 0]
        b_attention_mask = np.zeros((BATCH_SIZE, 64), dtype=np.int64)
        b_attention_mask[mask_rows, mask_cols] = 1
        b_masked_lm_labels = np.zeros((BATCH_SIZE, 64), dtype=np.int64) - 100
        b_masked_lm_labels[mask_rows, mask_cols] = batch['masked_lm_ids'].numpy()[batch['masked_lm_positions'].numpy() != 0]
        b_attention_mask = torch.tensor(b_attention_mask).long()
        b_masked_lm_labels = torch.tensor(b_masked_lm_labels).long()

        loss = albert_pretrain(input_ids=b_input_ids
                               , attention_mask=b_attention_mask
                               , token_type_ids=b_token_type_ids
                               , masked_lm_labels=b_masked_lm_labels
                               , seq_relationship_labels=b_seq_relationship_labels)

        # clears old gradients
        optimizer.zero_grad()
        # backward pass
        loss.backward()
        # gradient clipping
        torch.nn.utils.clip_grad_norm_(parameters=albert_pretrain.parameters(), max_norm=MAX_GRAD_NORM)
        # update parameters
        optimizer.step()

        tmp_loss += loss.detach().item()

    # print metrics and save to checkpoint every epoch
    print(f"Epoch: {e}")
    print(f"Train loss: {(tmp_loss/20)}")
    print(f"Train Time: {(time.time()-start_time)/60} mins")
    losses.append(tmp_loss/20)

    tmp_loss = 0
    start_time = time.time()

    torch.save({'model_state_dict': albert_pretrain.state_dict(), 'optimizer_state_dict': optimizer.state_dict(),
                'epoch': e, 'loss': loss, 'losses': losses}
               , 'pretrain_checkpoint')

from matplotlib import pyplot as plot
plot.plot(losses)

# Fine-tuning ALBERT

# At the time of writing, Hugging Face didn't provide a class for
# AlbertForTokenClassification, hence we write our own definition below
from transformers.modeling_albert import AlbertModel, AlbertPreTrainedModel
from transformers.configuration_albert import AlbertConfig
from transformers.tokenization_bert import BertTokenizer
import torch.nn as nn
from torch.nn import CrossEntropyLoss

class AlbertForTokenClassification(AlbertPreTrainedModel):

    def __init__(self, albert, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.albert = albert
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
    ):

        outputs = self.albert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
        )

        sequence_output = outputs[0]

        sequence_output = self.dropout(sequence_output)
        logits = self.classifier(sequence_output)

        return logits

import numpy as np

def label_sent(name_tokens, sent_tokens):
    # Mark review tokens that match the dish name with 1, all other tokens with 0
    label = []
    i = 0
    if len(name_tokens) > len(sent_tokens):
        label = np.zeros(len(sent_tokens))
    else:
        while i < len(sent_tokens):
            found_match = False
            if name_tokens[0] == sent_tokens[i]:
                found_match = True
                for j in range(len(name_tokens)-1):
                    if ((i+j+1) >= len(sent_tokens)):
                        return label
                    if name_tokens[j+1] != sent_tokens[i+j+1]:
                        found_match = False
                if found_match:
                    label.extend(list(np.ones(len(name_tokens)).astype(int)))
                    i = i + len(name_tokens)
                else:
                    label.extend([0])
                    i = i + 1
            else:
                label.extend([0])
                i = i + 1
    return label

import pandas as pd
import glob
import os

tokenizer = BertTokenizer(vocab_file="vocab.txt")

df_data_train = pd.read_csv("dish_name_train.csv")
df_data_train['name_tokens'] = df_data_train['dish_name'].apply(tokenizer.tokenize)
df_data_train['review_tokens'] = df_data_train.review.apply(tokenizer.tokenize)
df_data_train['review_label'] = df_data_train.apply(lambda row: label_sent(row['name_tokens'], row['review_tokens']), axis=1)

df_data_val = pd.read_csv("dish_name_val.csv")
df_data_val = df_data_val.dropna().reset_index()
df_data_val['name_tokens'] = df_data_val['dish_name'].apply(tokenizer.tokenize)
df_data_val['review_tokens'] = df_data_val.review.apply(tokenizer.tokenize)
df_data_val['review_label'] = df_data_val.apply(lambda row: label_sent(row['name_tokens'], row['review_tokens']), axis=1)

MAX_LEN = 64
BATCH_SIZE = 1
from keras.preprocessing.sequence import pad_sequences
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

tr_inputs = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in df_data_train['review_tokens']], maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
tr_tags = pad_sequences(df_data_train['review_label'], maxlen=MAX_LEN, padding="post", dtype="long", truncating="post")
# create the mask to ignore the padded elements in the sequences.
tr_masks = [[float(i > 0) for i in ii] for ii in tr_inputs]
tr_inputs = torch.tensor(tr_inputs)
tr_tags = torch.tensor(tr_tags)
tr_masks = torch.tensor(tr_masks)
train_data = TensorDataset(tr_inputs, tr_masks, tr_tags)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)

val_inputs = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in df_data_val['review_tokens']], maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
val_tags = pad_sequences(df_data_val['review_label'], maxlen=MAX_LEN, padding="post", dtype="long", truncating="post")
# create the mask to ignore the padded elements in the sequences.
val_masks = [[float(i > 0) for i in ii] for ii in val_inputs]
val_inputs = torch.tensor(val_inputs)
val_tags = torch.tensor(val_tags)
val_masks = torch.tensor(val_masks)
val_data = TensorDataset(val_inputs, val_masks, val_tags)
val_sampler = RandomSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=BATCH_SIZE)

# Reuse the pre-trained ALBERT backbone for token classification
model_tokenclassification = AlbertForTokenClassification(albert_pretrain.albert, config)

from torch.optim import Adam
LEARNING_RATE = 0.0000003
FULL_FINETUNING = True
if FULL_FINETUNING:
    param_optimizer = list(model_tokenclassification.named_parameters())
    no_decay = ['bias', 'gamma', 'beta']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay_rate': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay_rate': 0.0}
    ]
else:
    param_optimizer = list(model_tokenclassification.classifier.named_parameters())
    optimizer_grouped_parameters = [{"params": [p for n, p in param_optimizer]}]
optimizer = Adam(optimizer_grouped_parameters, lr=LEARNING_RATE)

Step 4: train the model on the custom corpus

# Training the model

# from torch.utils.tensorboard import SummaryWriter
import time
import os.path
import torch.nn as nn
import torch

EPOCH = 800
MAX_GRAD_NORM = 1.0

start_time = time.time()
tr_loss, tr_acc, nb_tr_steps = 0, 0, 0
eval_loss, eval_acc, nb_eval_steps = 0, 0, 0

if os.path.isfile('finetune_checkpoint'):
    print(f"--- Load from checkpoint ---")
    checkpoint = torch.load("finetune_checkpoint")
    model_tokenclassification.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    epoch = checkpoint['epoch']
    train_losses = checkpoint['train_losses']
    train_accs = checkpoint['train_accs']
    eval_losses = checkpoint['eval_losses']
    eval_accs = checkpoint['eval_accs']

else:
    epoch = -1
    train_losses, train_accs, eval_losses, eval_accs = [], [], [], []

print(f"--- Resume/Start training ---")
for e in range(epoch+1, EPOCH):

    # TRAIN loop
    model_tokenclassification.train()

    for batch in train_dataloader:
        # add batch to gpu
        batch = tuple(t for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        # forward pass
        b_outputs = model_tokenclassification(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)

        ce_loss_fct = CrossEntropyLoss()
        # Only keep active parts of the loss
        b_active_loss = b_input_mask.view(-1) == 1
        b_active_logits = b_outputs.view(-1, config.num_labels)[b_active_loss]
        b_active_labels = b_labels.view(-1)[b_active_loss]

        loss = ce_loss_fct(b_active_logits, b_active_labels)
        acc = torch.mean((torch.max(b_active_logits.detach(), 1)[1] == b_active_labels.detach()).float())

        model_tokenclassification.zero_grad()
        # backward pass
        loss.backward()
        # track train loss
        tr_loss += loss.item()
        tr_acc += acc
        nb_tr_steps += 1
        # gradient clipping
        torch.nn.utils.clip_grad_norm_(parameters=model_tokenclassification.parameters(), max_norm=MAX_GRAD_NORM)
        # update parameters
        optimizer.step()

    # VALIDATION on validation set
    model_tokenclassification.eval()
    for batch in val_dataloader:
        batch = tuple(t for t in batch)
        b_input_ids, b_input_mask, b_labels = batch

        with torch.no_grad():

            b_outputs = model_tokenclassification(b_input_ids, token_type_ids=None,
                                                  attention_mask=b_input_mask, labels=b_labels)

            loss_fct = CrossEntropyLoss()
            # Only keep active parts of the loss
            b_active_loss = b_input_mask.view(-1) == 1
            b_active_logits = b_outputs.view(-1, config.num_labels)[b_active_loss]
            b_active_labels = b_labels.view(-1)[b_active_loss]
            loss = loss_fct(b_active_logits, b_active_labels)
            acc = np.mean(np.argmax(b_active_logits.detach().cpu().numpy(), axis=1).flatten() == b_active_labels.detach().cpu().numpy().flatten())

            eval_loss += loss.mean().item()
            eval_acc += acc
            nb_eval_steps += 1

    if e % 10 == 0:

        print(f"Epoch: {e}")
        print(f"Train loss: {(tr_loss/nb_tr_steps)}")
        print(f"Train acc: {(tr_acc/nb_tr_steps)}")
        print(f"Train Time: {(time.time()-start_time)/60} mins")

        print(f"Validation loss: {eval_loss/nb_eval_steps}")
        print(f"Validation Accuracy: {(eval_acc/nb_eval_steps)}")

        train_losses.append(tr_loss/nb_tr_steps)
        train_accs.append(tr_acc/nb_tr_steps)
        eval_losses.append(eval_loss/nb_eval_steps)
        eval_accs.append(eval_acc/nb_eval_steps)

        tr_loss, tr_acc, nb_tr_steps = 0, 0, 0
        eval_loss, eval_acc, nb_eval_steps = 0, 0, 0
        start_time = time.time()

        torch.save({'model_state_dict': model_tokenclassification.state_dict(), 'optimizer_state_dict': optimizer.state_dict(),
                    'epoch': e, 'train_losses': train_losses, 'train_accs': train_accs, 'eval_losses': eval_losses, 'eval_accs': eval_accs}
                   , 'finetune_checkpoint')

plot.plot(train_losses)
plot.plot(train_accs)
plot.plot(eval_losses)
plot.plot(eval_accs)
plot.legend(labels=['train_loss', 'train_accuracy', 'validation_loss', 'validation_accuracy'])

Step 5: prediction

# Prediction

def predict(texts):
    tokenized_texts = [tokenizer.tokenize(txt) for txt in texts]
    input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],
                              maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
    attention_mask = [[float(i > 0) for i in ii] for ii in input_ids]

    input_ids = torch.tensor(input_ids)
    attention_mask = torch.tensor(attention_mask)

    dataset = TensorDataset(input_ids, attention_mask)
    datasampler = SequentialSampler(dataset)
    dataloader = DataLoader(dataset, sampler=datasampler, batch_size=BATCH_SIZE)

    predicted_labels = []

    for batch in dataloader:
        batch = tuple(t for t in batch)
        b_input_ids, b_input_mask = batch

        with torch.no_grad():
            logits = model_tokenclassification(b_input_ids, token_type_ids=None,
                                               attention_mask=b_input_mask)

        predicted_labels.append(np.multiply(np.argmax(logits.detach().cpu().numpy(), axis=2), b_input_mask.detach().cpu().numpy()))
    # np.concatenate(predicted_labels) flattens the list of arrays of shape batch_size * max_len into a list of arrays of length max_len
    return np.concatenate(predicted_labels).astype(int), tokenized_texts

def get_dish_candidate_names(predicted_label, tokenized_text):
    name_lists = []
    if len(np.where(predicted_label > 0)[0]) > 0:
        name_idx_combined = np.where(predicted_label > 0)[0]
        name_idxs = np.split(name_idx_combined, np.where(np.diff(name_idx_combined) != 1)[0]+1)
        name_lists.append([" ".join(np.take(tokenized_text, name_idx)) for name_idx in name_idxs])
        # Drop duplicate names from name_lists
        name_lists = np.unique(name_lists)
        return name_lists
    else:
        return None

texts = df_data_val.review.values
predicted_labels, _ = predict(texts)
df_data_val['predicted_review_label'] = list(predicted_labels)
df_data_val['predicted_name'] = df_data_val.apply(lambda row: get_dish_candidate_names(row.predicted_review_label, row.review_tokens)
                                                  , axis=1)

texts = df_data_train.review.values
predicted_labels, _ = predict(texts)
df_data_train['predicted_review_label'] = list(predicted_labels)
df_data_train['predicted_name'] = df_data_train.apply(lambda row: get_dish_candidate_names(row.predicted_review_label, row.review_tokens)
                                                      , axis=1)

df_data_val

Results

As you can see, the model successfully extracts dish names from the restaurant reviews.
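For a new review, the same pipeline can be called directly. Below is a minimal usage sketch with a made-up review, relying on the predict and get_dish_candidate_names functions defined above:

# Minimal usage sketch (the review text is invented for illustration).
sample_reviews = ["the pad thai was amazing but the tom yum soup was too salty"]
pred_labels, tokens = predict(sample_reviews)
print(get_dish_candidate_names(pred_labels[0], tokens[0]))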

Model face-off

The hands-on example above shows that although ALBERT is very "lite", its results are quite good.

So, with fewer parameters and good results, can it simply replace BERT?

Let's look more closely at the experimental comparison between the two. Note that "Speedup" here refers to training time.

Because there is far less data to move around (the model has far fewer parameters), throughput in distributed training goes up, so ALBERT trains faster. At inference time, however, it still has to perform the same Transformer computation as BERT; the parameter-count sketch after the summary below makes this concrete.

So the comparison can be summarized as follows:

For the same training time, ALBERT performs better than BERT.

For the same inference time, neither ALBERT base nor ALBERT large matches BERT.
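A quick way to see the "fewer parameters, same per-query compute" point is to count the parameters of the public checkpoints. The sketch below is my own addition; it assumes the bert-base-uncased and albert-base-v2 checkpoints can be downloaded with the same transformers library used in the tutorial:

from transformers import BertModel, AlbertModel

bert = BertModel.from_pretrained("bert-base-uncased")
albert = AlbertModel.from_pretrained("albert-base-v2")

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"BERT base parameters:   {count(bert)/1e6:.1f}M")    # roughly 110M
print(f"ALBERT base parameters: {count(albert)/1e6:.1f}M")  # roughly 12M

ALBERT base has roughly a tenth of BERT base's parameters, yet at inference it still runs 12 Transformer layers of the same hidden size, which is why per-query latency does not shrink accordingly.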

In addition, Naman Bansal argues that, because of ALBERT's structure, implementing ALBERT is somewhat more computationally expensive than BERT.

So it remains a trade-off: you cannot have it both ways. For ALBERT to fully surpass and replace BERT, further research and refinement are still needed.

Link

Original blog post:

https://medium.com/@namanbansal9909/should-we-shift-from-bert-to-albert-e6fbb7779d3e


— The End —
