Can the "slimmed-down" ALBERT replace BERT?

2020-12-22 澎湃新聞


By 十三, from 凹非寺

Reported by 量子位 | WeChat official account QbitAI

80% fewer parameters than BERT, yet better performance.

That is ALBERT, the "slimmed-down BERT" model Google proposed last year.

The model drew a great deal of attention as soon as it was released, and comparisons between the two quickly became a hot topic.

Recently, Naman Bansal raised a question:

Should ALBERT be used in place of BERT?

Whether it can, a head-to-head comparison will tell.

BERT vs. ALBERT

BERT is the more familiar of the two models.

Proposed by Google in 2018, it was trained on a very large corpus of 3.3 billion words.

Its key innovation lies in the pretraining stage: the Masked LM (MLM) and Next Sentence Prediction (NSP) objectives capture word-level and sentence-level representations, respectively.
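To make the two objectives concrete, here is a minimal, hypothetical sketch of how one MLM/NSP training example might be built; the sentences are invented and this is not BERT's actual preprocessing code, only an illustration of the idea (mask about 15% of tokens, and label whether sentence B really follows sentence A).

# Illustrative sketch of the two BERT pretraining objectives (not real preprocessing code)
import random

random.seed(0)
tokens = ["the", "pad", "thai", "was", "really", "good"]
masked, mlm_labels = [], []
for tok in tokens:
    if random.random() < 0.15:        # BERT masks roughly 15% of tokens
        masked.append("[MASK]")
        mlm_labels.append(tok)        # the model must recover the original token here
    else:
        masked.append(tok)
        mlm_labels.append(None)       # no prediction needed at this position

# NSP: a binary label saying whether sentence B actually follows sentence A
sent_a = "The pad thai was really good."
sent_b = "The mango sticky rice was even better."
nsp_label = 1                          # 1 = B follows A, 0 = B is a random sentence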

The arrival of BERT fundamentally changed the relationship between pretrained word representations and downstream NLP tasks.

A year later, Google introduced ALBERT, also known as "A Lite BERT". Its backbone is similar to BERT's: a Transformer encoder with GELU activations.

Its biggest achievement is using 80% fewer parameters than BERT while obtaining better results.

The main improvements over BERT are factorized embedding parameterization, cross-layer parameter sharing, a sentence-order prediction (SOP) loss for inter-sentence coherence in place of NSP, and the removal of dropout.
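The factorized embedding idea is easy to see with a back-of-the-envelope calculation. The sketch below assumes a BERT-base-like vocabulary of 30,000 and hidden size 768, and an illustrative ALBERT embedding size of 128 (the numbers are only for illustration).

# Factorized embedding parameterization, rough parameter count.
# BERT ties the embedding size E to the hidden size H; ALBERT decouples them
# by factoring the V x H embedding matrix into V x E and E x H.
V, H, E = 30000, 768, 128          # vocab size, hidden size, ALBERT's smaller embedding size

bert_style = V * H                 # about 23.0M embedding parameters
albert_style = V * E + E * H       # about 3.9M embedding parameters after factorization

print(f"BERT-style embeddings:   {bert_style:,}")
print(f"ALBERT-style embeddings: {albert_style:,}")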

The figure below compares the performance of BERT and ALBERT on the SQuAD and RACE datasets.

As the figure shows, ALBERT achieves the stronger results.

How to pretrain ALBERT on a custom corpus

To get a better feel for ALBERT, the next step is to pretrain and fine-tune it on a custom corpus.

The dataset is a restaurant-review dataset, and the goal is to use the ALBERT model to identify the dish names mentioned in the reviews.
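Framed as token classification, every token of a review receives a binary label: 1 if the token belongs to a dish name, 0 otherwise. A hypothetical labeled example (review and dish name invented for illustration):

# Hypothetical labeled example for the dish-name tagging task
review_tokens = ["the", "pad", "thai", "was", "really", "good"]
dish_tokens   = ["pad", "thai"]
token_labels  = [0, 1, 1, 0, 0, 0]   # 1 marks tokens that belong to the dish name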

Step 1: Download the dataset and prepare the files

#Downloading all files and data

!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/data_toy/dish_name_train.csv
!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/data_toy/dish_name_val.csv
!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/data_toy/restaurant_review.txt
!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/data_toy/restaurant_review_nopunct.txt
!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/models_toy/albert_config.json
!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/model_checkpoint/finetune_checkpoint
!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/model_checkpoint/pretrain_checkpoint

#Creating files and setting up ALBERT

!pip install sentencepiece
!git clone https://github.com/google-research/ALBERT
# NOTE: create_pretraining_data.py also needs a vocab.txt that matches albert_config.json;
# it is not fetched by the wget commands above and must be available in the working directory.
!python ./ALBERT/create_pretraining_data.py --input_file "restaurant_review.txt" --output_file "restaurant_review_train" --vocab_file "vocab.txt" --max_seq_length=64
!pip install transformers
!pip install tfrecord
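As an optional sanity check, assuming the command above produced restaurant_review_train in the working directory, the generated TFRecord can be read back with the same tfrecord reader used later during pretraining:

# Optional sanity check: read back one example from the generated TFRecord
from tfrecord.torch.dataset import TFRecordDataset
import torch

feat_map = {"input_ids": "int", "input_mask": "int", "segment_ids": "int",
            "next_sentence_labels": "int", "masked_lm_positions": "int", "masked_lm_ids": "int"}
dataset = TFRecordDataset("restaurant_review_train", index_path=None, description=feat_map)
first = next(iter(torch.utils.data.DataLoader(dataset, batch_size=1)))
print({k: v.shape for k, v in first.items()})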

Step 2: Import from transformers and define the model layers

#Defining Layers for ALBERT

from transformers.modeling_albert import AlbertModel, AlbertPreTrainedModel
from transformers.configuration_albert import AlbertConfig
import torch
import torch.nn as nn

class AlbertSequenceOrderHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, 2)
        self.bias = nn.Parameter(torch.zeros(2))

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        prediction_scores = hidden_states + self.bias
        return prediction_scores

from torch.nn import CrossEntropyLoss
from transformers.modeling_bert import ACT2FN

class AlbertForPretrain(AlbertPreTrainedModel):

    def __init__(self, config):
        super().__init__(config)

        self.albert = AlbertModel(config)

        # For Masked LM
        # The original huggingface implementation created new output weights via a dense layer.
        # However, the original ALBERT ties the MLM decoder weights to the input embeddings,
        # which is what the assignment to predictions_decoder.weight below does.
        self.predictions_dense = nn.Linear(config.hidden_size, config.embedding_size)
        self.predictions_activation = ACT2FN[config.hidden_act]
        self.predictions_LayerNorm = nn.LayerNorm(config.embedding_size)
        self.predictions_bias = nn.Parameter(torch.zeros(config.vocab_size))
        self.predictions_decoder = nn.Linear(config.embedding_size, config.vocab_size)

        self.predictions_decoder.weight = self.albert.embeddings.word_embeddings.weight

        # For sequence order prediction
        self.seq_relationship = AlbertSequenceOrderHead(config)

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        masked_lm_labels=None,
        seq_relationship_labels=None,
    ):

        outputs = self.albert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
        )

        loss_fct = CrossEntropyLoss()

        sequence_output = outputs[0]

        sequence_output = self.predictions_dense(sequence_output)
        sequence_output = self.predictions_activation(sequence_output)
        sequence_output = self.predictions_LayerNorm(sequence_output)
        prediction_scores = self.predictions_decoder(sequence_output)

        if masked_lm_labels is not None:
            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size),
                                      masked_lm_labels.view(-1))

        pooled_output = outputs[1]
        seq_relationship_scores = self.seq_relationship(pooled_output)
        if seq_relationship_labels is not None:
            seq_relationship_loss = loss_fct(seq_relationship_scores.view(-1, 2),
                                             seq_relationship_labels.view(-1))

        loss = masked_lm_loss + seq_relationship_loss

        return loss

Step 3: Pretrain with the LAMB optimizer and set up ALBERT fine-tuning

#Using LAMB optimizer
#LAMB - "https://github.com/cybertronai/pytorch-lamb"

import torch
from torch.optim import Optimizer

class Lamb(Optimizer):
    r"""Implements Lamb algorithm.
    It has been proposed in `Large Batch Optimization for Deep Learning: Training BERT in 76 minutes`_.
    Arguments:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float, optional): learning rate (default: 1e-3)
        betas (Tuple[float, float], optional): coefficients used for computing
            running averages of gradient and its square (default: (0.9, 0.999))
        eps (float, optional): term added to the denominator to improve
            numerical stability (default: 1e-8)
        weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
        adam (bool, optional): always use trust ratio = 1, which turns this into
            Adam. Useful for comparison purposes.
    .. _Large Batch Optimization for Deep Learning: Training BERT in 76 minutes:
        https://arxiv.org/abs/1904.00962
    """

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-6,
                 weight_decay=0, adam=False):
        if not 0.0 <= lr:
            raise ValueError("Invalid learning rate: {}".format(lr))
        if not 0.0 <= eps:
            raise ValueError("Invalid epsilon value: {}".format(eps))
        if not 0.0 <= betas[0] < 1.0:
            raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0]))
        if not 0.0 <= betas[1] < 1.0:
            raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1]))
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        weight_decay=weight_decay)
        self.adam = adam
        super(Lamb, self).__init__(params, defaults)

    def step(self, closure=None):
        """Performs a single optimization step.
        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if grad.is_sparse:
                    raise RuntimeError('Lamb does not support sparse gradients, consider SparseAdam instead.')

                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = torch.zeros_like(p.data)
                    # Exponential moving average of squared gradient values
                    state['exp_avg_sq'] = torch.zeros_like(p.data)

                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                beta1, beta2 = group['betas']

                state['step'] += 1

                # Decay the first and second moment running average coefficient
                # m_t
                exp_avg.mul_(beta1).add_(1 - beta1, grad)
                # v_t
                exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

                # Paper v3 does not use debiasing.
                # bias_correction1 = 1 - beta1 ** state['step']
                # bias_correction2 = 1 - beta2 ** state['step']
                # Apply bias to lr to avoid broadcast.
                step_size = group['lr']  # * math.sqrt(bias_correction2) / bias_correction1

                weight_norm = p.data.pow(2).sum().sqrt().clamp(0, 10)

                adam_step = exp_avg / exp_avg_sq.sqrt().add(group['eps'])
                if group['weight_decay'] != 0:
                    adam_step.add_(group['weight_decay'], p.data)

                adam_norm = adam_step.pow(2).sum().sqrt()
                if weight_norm == 0 or adam_norm == 0:
                    trust_ratio = 1
                else:
                    trust_ratio = weight_norm / adam_norm
                state['weight_norm'] = weight_norm
                state['adam_norm'] = adam_norm
                state['trust_ratio'] = trust_ratio
                if self.adam:
                    trust_ratio = 1

                p.data.add_(-step_size * trust_ratio, adam_step)

        return loss

import time
import torch.nn as nn
import torch
from tfrecord.torch.dataset import TFRecordDataset
import numpy as np
import os

LEARNING_RATE = 0.001
EPOCH = 40
BATCH_SIZE = 2
MAX_GRAD_NORM = 1.0

print(f"--- Resume/Start training ---")
feat_map = {"input_ids": "int",
            "input_mask": "int",
            "segment_ids": "int",
            "next_sentence_labels": "int",
            "masked_lm_positions": "int",
            "masked_lm_ids": "int"}
pretrain_file = 'restaurant_review_train'

# Create albert pretrain model
config = AlbertConfig.from_json_file("albert_config.json")
albert_pretrain = AlbertForPretrain(config)
# Create optimizer
optimizer = Lamb([{"params": [p for n, p in list(albert_pretrain.named_parameters())]}], lr=LEARNING_RATE)
albert_pretrain.train()
dataset = TFRecordDataset(pretrain_file, index_path=None, description=feat_map)
loader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE)

tmp_loss = 0
start_time = time.time()

if os.path.isfile('pretrain_checkpoint'):
    print(f"--- Load from checkpoint ---")
    checkpoint = torch.load("pretrain_checkpoint")
    albert_pretrain.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    epoch = checkpoint['epoch']
    loss = checkpoint['loss']
    losses = checkpoint['losses']
else:
    epoch = -1
    losses = []

for e in range(epoch+1, EPOCH):
    for batch in loader:
        b_input_ids = batch['input_ids'].long()
        b_token_type_ids = batch['segment_ids'].long()
        b_seq_relationship_labels = batch['next_sentence_labels'].long()

        # Convert the data from the decoded format produced by Google's ALBERT
        # create_pretraining_data.py script into the format required by
        # huggingface's PyTorch implementation of ALBERT.
        mask_rows = np.nonzero(batch['masked_lm_positions'].numpy())[0]
        mask_cols = batch['masked_lm_positions'].numpy()[batch['masked_lm_positions'].numpy() != 0]
        b_attention_mask = np.zeros((BATCH_SIZE, 64), dtype=np.int64)
        b_attention_mask[mask_rows, mask_cols] = 1
        b_masked_lm_labels = np.zeros((BATCH_SIZE, 64), dtype=np.int64) - 100
        b_masked_lm_labels[mask_rows, mask_cols] = batch['masked_lm_ids'].numpy()[batch['masked_lm_positions'].numpy() != 0]
        b_attention_mask = torch.tensor(b_attention_mask).long()
        b_masked_lm_labels = torch.tensor(b_masked_lm_labels).long()

        loss = albert_pretrain(input_ids=b_input_ids
                               , attention_mask=b_attention_mask
                               , token_type_ids=b_token_type_ids
                               , masked_lm_labels=b_masked_lm_labels
                               , seq_relationship_labels=b_seq_relationship_labels)

        # clears old gradients
        optimizer.zero_grad()
        # backward pass
        loss.backward()
        # gradient clipping
        torch.nn.utils.clip_grad_norm_(parameters=albert_pretrain.parameters(), max_norm=MAX_GRAD_NORM)
        # update parameters
        optimizer.step()

        tmp_loss += loss.detach().item()

    # print metrics and save to checkpoint every epoch
    print(f"Epoch: {e}")
    print(f"Train loss: {(tmp_loss/20)}")
    print(f"Train Time: {(time.time()-start_time)/60} mins")
    losses.append(tmp_loss/20)

    tmp_loss = 0
    start_time = time.time()

    torch.save({'model_state_dict': albert_pretrain.state_dict(), 'optimizer_state_dict': optimizer.state_dict(),
                'epoch': e, 'loss': loss, 'losses': losses}
               , 'pretrain_checkpoint')

from matplotlib import pyplot as plot
plot.plot(losses)

#Fine tuning ALBERT

# At the time of writing, Hugging Face didn't provide a class for
# AlbertForTokenClassification, hence the definition below
from transformers.modeling_albert import AlbertModel, AlbertPreTrainedModel
from transformers.configuration_albert import AlbertConfig
from transformers.tokenization_bert import BertTokenizer
import torch.nn as nn
from torch.nn import CrossEntropyLoss

class AlbertForTokenClassification(AlbertPreTrainedModel):

    def __init__(self, albert, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.albert = albert
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
    ):

        outputs = self.albert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
        )

        sequence_output = outputs[0]

        sequence_output = self.dropout(sequence_output)
        logits = self.classifier(sequence_output)

        return logits

import numpy as np

def label_sent(name_tokens, sent_tokens):
    # Label each review token: 1 if it is part of the dish name, 0 otherwise
    label = []
    i = 0
    if len(name_tokens) > len(sent_tokens):
        label = np.zeros(len(sent_tokens))
    else:
        while i < len(sent_tokens):
            found_match = False
            if name_tokens[0] == sent_tokens[i]:
                found_match = True
                for j in range(len(name_tokens)-1):
                    if ((i+j+1) >= len(sent_tokens)):
                        return label
                    if name_tokens[j+1] != sent_tokens[i+j+1]:
                        found_match = False
                if found_match:
                    label.extend(list(np.ones(len(name_tokens)).astype(int)))
                    i = i + len(name_tokens)
                else:
                    label.extend([0])
                    i = i + 1
            else:
                label.extend([0])
                i = i + 1
    return label

import pandas as pd
import glob
import os

tokenizer = BertTokenizer(vocab_file="vocab.txt")

df_data_train = pd.read_csv("dish_name_train.csv")
df_data_train['name_tokens'] = df_data_train['dish_name'].apply(tokenizer.tokenize)
df_data_train['review_tokens'] = df_data_train.review.apply(tokenizer.tokenize)
df_data_train['review_label'] = df_data_train.apply(lambda row: label_sent(row['name_tokens'], row['review_tokens']), axis=1)

df_data_val = pd.read_csv("dish_name_val.csv")
df_data_val = df_data_val.dropna().reset_index()
df_data_val['name_tokens'] = df_data_val['dish_name'].apply(tokenizer.tokenize)
df_data_val['review_tokens'] = df_data_val.review.apply(tokenizer.tokenize)
df_data_val['review_label'] = df_data_val.apply(lambda row: label_sent(row['name_tokens'], row['review_tokens']), axis=1)

MAX_LEN = 64
BATCH_SIZE = 1
from keras.preprocessing.sequence import pad_sequences
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

tr_inputs = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in df_data_train['review_tokens']],
                          maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
tr_tags = pad_sequences(df_data_train['review_label'], maxlen=MAX_LEN, padding="post", dtype="long", truncating="post")
# create the mask to ignore the padded elements in the sequences.
tr_masks = [[float(i > 0) for i in ii] for ii in tr_inputs]
tr_inputs = torch.tensor(tr_inputs)
tr_tags = torch.tensor(tr_tags)
tr_masks = torch.tensor(tr_masks)
train_data = TensorDataset(tr_inputs, tr_masks, tr_tags)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)

val_inputs = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in df_data_val['review_tokens']],
                           maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
val_tags = pad_sequences(df_data_val['review_label'], maxlen=MAX_LEN, padding="post", dtype="long", truncating="post")
# create the mask to ignore the padded elements in the sequences.
val_masks = [[float(i > 0) for i in ii] for ii in val_inputs]
val_inputs = torch.tensor(val_inputs)
val_tags = torch.tensor(val_tags)
val_masks = torch.tensor(val_masks)
val_data = TensorDataset(val_inputs, val_masks, val_tags)
val_sampler = RandomSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=BATCH_SIZE)

# Reuse the pretrained ALBERT encoder as the backbone of the token classifier
model_tokenclassification = AlbertForTokenClassification(albert_pretrain.albert, config)
from torch.optim import Adam
LEARNING_RATE = 0.0000003
FULL_FINETUNING = True
if FULL_FINETUNING:
    param_optimizer = list(model_tokenclassification.named_parameters())
    no_decay = ['bias', 'gamma', 'beta']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay_rate': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay_rate': 0.0}
    ]
else:
    param_optimizer = list(model_tokenclassification.classifier.named_parameters())
    optimizer_grouped_parameters = [{"params": [p for n, p in param_optimizer]}]
optimizer = Adam(optimizer_grouped_parameters, lr=LEARNING_RATE)

Step 4: Train the model on the custom corpus

#Training the model

# from torch.utils.tensorboard import SummaryWriter
import time
import os.path
import torch.nn as nn
import torch

EPOCH = 800
MAX_GRAD_NORM = 1.0

start_time = time.time()
tr_loss, tr_acc, nb_tr_steps = 0, 0, 0
eval_loss, eval_acc, nb_eval_steps = 0, 0, 0

if os.path.isfile('finetune_checkpoint'):
    print(f"--- Load from checkpoint ---")
    checkpoint = torch.load("finetune_checkpoint")
    model_tokenclassification.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    epoch = checkpoint['epoch']
    train_losses = checkpoint['train_losses']
    train_accs = checkpoint['train_accs']
    eval_losses = checkpoint['eval_losses']
    eval_accs = checkpoint['eval_accs']
else:
    epoch = -1
    train_losses, train_accs, eval_losses, eval_accs = [], [], [], []

print(f"--- Resume/Start training ---")
for e in range(epoch+1, EPOCH):

    # TRAIN loop
    model_tokenclassification.train()

    for batch in train_dataloader:
        # unpack the batch (kept on CPU in this example)
        batch = tuple(t for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        # forward pass
        b_outputs = model_tokenclassification(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)

        ce_loss_fct = CrossEntropyLoss()
        # Only keep active parts of the loss
        b_active_loss = b_input_mask.view(-1) == 1
        b_active_logits = b_outputs.view(-1, config.num_labels)[b_active_loss]
        b_active_labels = b_labels.view(-1)[b_active_loss]

        loss = ce_loss_fct(b_active_logits, b_active_labels)
        acc = torch.mean((torch.max(b_active_logits.detach(), 1)[1] == b_active_labels.detach()).float())

        model_tokenclassification.zero_grad()
        # backward pass
        loss.backward()
        # track train loss
        tr_loss += loss.item()
        tr_acc += acc
        nb_tr_steps += 1
        # gradient clipping
        torch.nn.utils.clip_grad_norm_(parameters=model_tokenclassification.parameters(), max_norm=MAX_GRAD_NORM)
        # update parameters
        optimizer.step()

    # VALIDATION on validation set
    model_tokenclassification.eval()
    for batch in val_dataloader:
        batch = tuple(t for t in batch)
        b_input_ids, b_input_mask, b_labels = batch

        with torch.no_grad():
            b_outputs = model_tokenclassification(b_input_ids, token_type_ids=None,
                                                  attention_mask=b_input_mask, labels=b_labels)

        loss_fct = CrossEntropyLoss()
        # Only keep active parts of the loss
        b_active_loss = b_input_mask.view(-1) == 1
        b_active_logits = b_outputs.view(-1, config.num_labels)[b_active_loss]
        b_active_labels = b_labels.view(-1)[b_active_loss]
        loss = loss_fct(b_active_logits, b_active_labels)
        acc = np.mean(np.argmax(b_active_logits.detach().cpu().numpy(), axis=1).flatten() == b_active_labels.detach().cpu().numpy().flatten())

        eval_loss += loss.mean().item()
        eval_acc += acc
        nb_eval_steps += 1

    if e % 10 == 0:

        print(f"Epoch: {e}")
        print(f"Train loss: {(tr_loss/nb_tr_steps)}")
        print(f"Train acc: {(tr_acc/nb_tr_steps)}")
        print(f"Train Time: {(time.time()-start_time)/60} mins")

        print(f"Validation loss: {eval_loss/nb_eval_steps}")
        print(f"Validation Accuracy: {(eval_acc/nb_eval_steps)}")

        train_losses.append(tr_loss/nb_tr_steps)
        train_accs.append(tr_acc/nb_tr_steps)
        eval_losses.append(eval_loss/nb_eval_steps)
        eval_accs.append(eval_acc/nb_eval_steps)

        tr_loss, tr_acc, nb_tr_steps = 0, 0, 0
        eval_loss, eval_acc, nb_eval_steps = 0, 0, 0
        start_time = time.time()

        torch.save({'model_state_dict': model_tokenclassification.state_dict(), 'optimizer_state_dict': optimizer.state_dict(),
                    'epoch': e, 'train_losses': train_losses, 'train_accs': train_accs, 'eval_losses': eval_losses, 'eval_accs': eval_accs}
                   , 'finetune_checkpoint')

plot.plot(train_losses)
plot.plot(train_accs)
plot.plot(eval_losses)
plot.plot(eval_accs)
plot.legend(labels=['train_loss', 'train_accuracy', 'validation_loss', 'validation_accuracy'])

Step 5: Prediction

#Prediction

def predict(texts):
    tokenized_texts = [tokenizer.tokenize(txt) for txt in texts]
    input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],
                              maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
    attention_mask = [[float(i > 0) for i in ii] for ii in input_ids]

    input_ids = torch.tensor(input_ids)
    attention_mask = torch.tensor(attention_mask)

    dataset = TensorDataset(input_ids, attention_mask)
    datasampler = SequentialSampler(dataset)
    dataloader = DataLoader(dataset, sampler=datasampler, batch_size=BATCH_SIZE)

    predicted_labels = []

    for batch in dataloader:
        batch = tuple(t for t in batch)
        b_input_ids, b_input_mask = batch

        with torch.no_grad():
            logits = model_tokenclassification(b_input_ids, token_type_ids=None,
                                               attention_mask=b_input_mask)

        predicted_labels.append(np.multiply(np.argmax(logits.detach().cpu().numpy(), axis=2), b_input_mask.detach().cpu().numpy()))
    # np.concatenate(predicted_labels) flattens the list of (batch_size x max_len) arrays into a list of max_len arrays
    return np.concatenate(predicted_labels).astype(int), tokenized_texts

def get_dish_candidate_names(predicted_label, tokenized_text):
    name_lists = []
    if len(np.where(predicted_label > 0)[0]) > 0:
        name_idx_combined = np.where(predicted_label > 0)[0]
        name_idxs = np.split(name_idx_combined, np.where(np.diff(name_idx_combined) != 1)[0]+1)
        name_lists.append([" ".join(np.take(tokenized_text, name_idx)) for name_idx in name_idxs])
        # Drop any duplicate names in name_lists
        name_lists = np.unique(name_lists)
        return name_lists
    else:
        return None

texts = df_data_val.review.values
predicted_labels, _ = predict(texts)
df_data_val['predicted_review_label'] = list(predicted_labels)
df_data_val['predicted_name'] = df_data_val.apply(lambda row: get_dish_candidate_names(row.predicted_review_label, row.review_tokens)
                                                  , axis=1)

texts = df_data_train.review.values
predicted_labels, _ = predict(texts)
df_data_train['predicted_review_label'] = list(predicted_labels)
df_data_train['predicted_name'] = df_data_train.apply(lambda row: get_dish_candidate_names(row.predicted_review_label, row.review_tokens)
                                                      , axis=1)

df_data_val
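For a quick spot check, the same helpers defined above can be run on a single new review (the sentence below is invented; since BATCH_SIZE is 1, a one-element list works):

# Spot check on a single, made-up review
sample_review = ["The pad thai was amazing but the green curry was too salty"]
sample_labels, sample_tokens = predict(sample_review)
print(get_dish_candidate_names(sample_labels[0], sample_tokens[0]))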

Results

As the output shows, the model successfully extracts dish names from the restaurant reviews.

Model comparison

From the hands-on example above, we can see that ALBERT, lite as it is, still produces quite decent results.

So, with fewer parameters and good results, can it simply replace BERT?

Let's take a closer look at the experimental comparison between the two; Speedup here refers to training time.

Because there are far fewer parameters (and hence less data to move around), throughput goes up during distributed training, so ALBERT trains faster. At inference time, however, it still has to run the same Transformer computation as BERT.
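The parameter gap itself is easy to verify with the transformers library; below is a rough sketch using the public bert-base-uncased and albert-base-v2 checkpoints (assuming they can be downloaded in your environment; these checkpoints are not part of the tutorial above).

# Rough parameter-count comparison of the public base checkpoints
from transformers import BertModel, AlbertModel

bert = BertModel.from_pretrained("bert-base-uncased")    # roughly 110M parameters
albert = AlbertModel.from_pretrained("albert-base-v2")   # roughly 12M parameters

bert_params = sum(p.numel() for p in bert.parameters())
albert_params = sum(p.numel() for p in albert.parameters())
print(f"BERT-base:   {bert_params:,}")
print(f"ALBERT-base: {albert_params:,}")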

So the comparison can be summarized as follows:

For the same training time, ALBERT performs better than BERT.

For the same inference time, neither ALBERT base nor ALBERT large matches BERT.

In addition, Naman Bansal argues that, because of ALBERT's structure, running ALBERT is somewhat more computationally expensive than running BERT.

So it remains a case of not being able to have it both ways: for ALBERT to fully surpass and replace BERT, further research and refinement are still needed.

Links

Blog post:

https://medium.com/@namanbansal9909/should-we-shift-from-bert-to-albert-e6fbb7779d3e


(End)


