NLP Primer (8): Named Entity Recognition (NER) with CRF++

2021-02-08 NLP奇幻之旅

Introduction to CRF and NER

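A CRF (conditional random field) is a model of the conditional probability distribution of one set of output random variables given another set of input random variables. Its defining assumption is that the output random variables form a Markov random field.
A simpler special case is the CRF defined on a linear chain, called a linear-chain conditional random field. Linear-chain CRFs can be applied to problems such as sequence labeling, and the named entity recognition (NER) task addressed in this article can be solved as a sequence labeling problem. In the conditional probability model P(Y|X), Y is the output variable, representing the label sequence (or state sequence), and X is the input variable, representing the observation sequence to be labeled. During training, the model P(Y|X) is estimated from the training set by maximum likelihood or regularized maximum likelihood. During prediction, for a given input sequence x, we find the output sequence y* that maximizes the conditional probability P(y|x).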

Figure: a linear-chain conditional random field in which X and Y have the same graph structure
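To make the definition concrete, here is a minimal sketch of a linear-chain CRF over a two-token input. The emission and transition scores are hand-picked, hypothetical numbers rather than learned weights, and the sketch enumerates every label sequence to obtain P(y|x) and the most probable sequence. Real toolkits such as CRF++ use dynamic programming instead of enumeration.

```python
import itertools
import math

LABELS = ["O", "B-ORG"]

# Hypothetical scores for illustration only; a trained CRF learns them.
# emission[t][label]: how well `label` fits the token at position t.
emission = [
    {"O": 0.2, "B-ORG": 1.5},   # position 0, e.g. "Cleveland"
    {"O": 2.0, "B-ORG": 0.1},   # position 1, e.g. "2"
]
# transition[(prev, cur)]: how compatible two adjacent labels are.
transition = {
    ("O", "O"): 0.5, ("O", "B-ORG"): 0.3,
    ("B-ORG", "O"): 0.8, ("B-ORG", "B-ORG"): -1.0,
}

def score(labels):
    """Unnormalized log-score of one label sequence."""
    total = sum(emission[t][y] for t, y in enumerate(labels))
    total += sum(transition[(a, b)] for a, b in zip(labels, labels[1:]))
    return total

def distribution():
    """P(y|x) for every label sequence, by brute-force normalization."""
    seqs = list(itertools.product(LABELS, repeat=len(emission)))
    z = sum(math.exp(score(s)) for s in seqs)  # partition function Z(x)
    return {s: math.exp(score(s)) / z for s in seqs}

probs = distribution()
best = max(probs, key=probs.get)  # the sequence y* with the highest P(y|x)
print(best, round(probs[best], 3))
```

Enumeration grows exponentially with sentence length, so it is only useful for seeing what the normalization and the argmax mean.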

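Named entity recognition (NER) is an important basic tool for application areas such as information extraction, question answering, syntactic parsing, and machine translation, and it plays an important role in making natural language processing practical. In general, the task is to identify the named entities in a text that belong to three major categories (entity, time, and number) and seven subcategories (person names, organization names, place names, times, dates, currencies, and percentages). Common algorithms for implementing NER are as follows: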

Figure: algorithms for implementing NER

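This article does not cover the theory and algorithms of conditional random fields in detail; for those, see the book Statistical Learning Methods (《統計學習方法》). Instead, we use CRF++, a tool that already implements CRFs, to perform named entity recognition. For NER with deep learning, see the article NLP Primer (5): Named Entity Recognition (NER) with Deep Learning.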

Introduction to CRF++

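CRF++ is a well-known open-source CRF toolkit written in C++, and it is considered here to be the CRF tool with the best overall performance at the time of writing. In my view its most important feature is the feature template: templates generate a family of feature functions automatically, so we do not have to write them by hand. All we need to do is choose the features, such as part-of-speech tags. For details on CRF++, see http://taku910.github.io/crfpp/ .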

Installation

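Installation differs between Windows and Linux. For Linux, see the article CRFPP/CRF++編譯安裝與部署 (compiling, installing, and deploying CRF++). On Windows, CRF++ needs no installation: download and unzip the CRF++ 0.58 archive and it is ready to use. The download is described at https://blog.csdn.net/lilong117194/article/details/81160265 .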

Usage

1. Corpus

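Taking the NER corpus used in this article as an example, the training data for CRF++ looks like this (first 20 lines; sentences are separated by a blank line):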

played VBD O
on IN O
Monday NNP O
( ( O
home NN O
team NN O
in IN O
CAPS NNP O
) ) O
: : O

American NNP B-MISC
League NNP I-MISC

Cleveland NNP B-ORG
2 CD O
DETROIT NNP B-ORG
1 CD O

BALTIMORE VB B-ORG

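Note that the token and its labels are separated here by a tab character (\t), and every row must have the same number of columns; otherwise CRF++ fails with the error feature_index.cpp(86) [max_size == size] inconsistent column size.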
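A quick way to catch this error before training is to check the column counts yourself. The following is a small sketch, not part of CRF++; the helper name check_columns is hypothetical, and it assumes tab-separated columns as described above.

```python
def check_columns(lines):
    """Return (line_number, column_count) for every token row whose
    column count differs from that of the first token row."""
    expected = None
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # a blank line is a sentence boundary, not a token row
        columns = len(line.split("\t"))
        if expected is None:
            expected = columns
        elif columns != expected:
            bad.append((number, columns))
    return bad

# The last row uses spaces instead of tabs, so it counts as one column.
sample = ["played\tVBD\tO\n", "on\tIN\tO\n", "\n", "Monday NNP O\n"]
print(check_columns(sample))
```

To check a real file, pass the open file object, for example check_columns(open("train.data", encoding="utf-8")).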

2. Template

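The template is the key to using CRF++. It generates a family of feature functions for us automatically, and feature functions are one of the core concepts of the CRF algorithm. A simple template file looks like this: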

U00:%x[-2,0]
U01:%x[0,1]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-2,0]/%x[-1,0]/%x[0,0]
U06:%x[-1,0]/%x[0,0]/%x[1,0]
U07:%x[0,0]/%x[1,0]/%x[2,0]
U08:%x[-1,0]/%x[0,0]
U09:%x[0,0]/%x[1,0]

B

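We need to understand the rules of the template file. In T**:%x[#,#], T denotes the template type, and the two # values are the row offset relative to the current token and the column index, respectively. There are two types of template: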

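The first type is the unigram template, whose first character is U; it describes unigram features. Each %x[#,#] line generates state (node) feature functions f(s, o) in the CRF, where s is the label (output) at time t and o is the context at time t. Suppose the line home NN O is the CURRENT TOKEN: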

played VBD O
on IN O
Monday NNP O
( ( O
home NN O << CURRENT TOKEN
team NN O
in IN O
CAPS NNP O
) ) O
: : O

Then %x[#,#] expands as follows:

template            expanded feature
%x[0,0]             home
%x[0,1]             NN
%x[-1,0]            (
%x[-2,1]            NNP
%x[0,0]/%x[0,1]     home/NN
ABC%x[0,1]123       ABCNN123
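The expansion rule can be sketched in a few lines of Python. This is an illustration of the rule, not CRF++ code: the function expand is hypothetical, and for positions outside the sentence it returns the placeholder <PAD>, whereas CRF++ substitutes its own boundary strings.

```python
import re

# The example sentence, one (word, POS tag, NER tag) row per token.
ROWS = [
    ("played", "VBD", "O"), ("on", "IN", "O"), ("Monday", "NNP", "O"),
    ("(", "(", "O"), ("home", "NN", "O"), ("team", "NN", "O"),
    ("in", "IN", "O"), ("CAPS", "NNP", "O"), (")", ")", "O"), (":", ":", "O"),
]

MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")

def expand(template, rows, current):
    """Replace every %x[row,col] macro with the cell it refers to."""
    def lookup(match):
        row = current + int(match.group(1))  # row is relative to the current token
        col = int(match.group(2))            # col selects a column of that row
        if 0 <= row < len(rows):
            return rows[row][col]
        return "<PAD>"  # placeholder of this sketch for out-of-sentence positions
    return MACRO.sub(lookup, template)

current = 4  # index of the row "home NN O"
for template in ["%x[0,0]", "%x[0,1]", "%x[-1,0]", "%x[-2,1]",
                 "%x[0,0]/%x[0,1]", "ABC%x[0,1]123"]:
    print(template, "->", expand(template, ROWS, current))
```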

Taking U01:%x[0,1] as an example, it generates feature functions such as the following on this corpus:

func1 = if (output = O and feature="U01:NN") return 1 else return 0
func2 = if (output = O and feature="U01:N") return 1 else return 0
func3 = if (output = O and feature="U01:NNP") return 1 else return 0
….

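The second type is the bigram template, whose first character is B. Each %x[#,#] line generates edge feature functions f(s', s, o) in the CRF, where s' is the label at time t-1. In other words, the bigram type is much like the unigram type, except that it also takes the label at time t-1 into account. If the template contains only a bare B, it generates f(s', s) by default, which means the previous output token and the current output token are combined into bigram features.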
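To get a feel for how many feature functions a template produces: with L output labels and N distinct strings expanded from a template line, a unigram line yields L * N functions and a bigram line with the same macro yields L * L * N. The sketch below counts them for the example sentence; the label set and helper names are chosen for illustration only.

```python
LABELS = ["O", "B-MISC", "I-MISC", "B-ORG"]  # illustrative label set, L = 4
POS_COLUMN = ["VBD", "IN", "NNP", "(", "NN", "NN", "IN", "NNP", ")", ":"]

# N: distinct strings produced by expanding U01:%x[0,1] over the sentence.
expanded = {"U01:" + pos for pos in POS_COLUMN}

def unigram_feature(label, feature):
    """Indicator f(s, o): 1 when the current tag and the expanded string both match."""
    return lambda tag, observed: 1 if (tag == label and observed == feature) else 0

# One indicator function per (label, expanded string) pair.
unigram = {(l, f): unigram_feature(l, f) for l in LABELS for f in expanded}

n_unigram = len(unigram)                               # L * N
n_bigram = len(LABELS) * len(LABELS) * len(expanded)   # L * L * N for a B line with the same macro
print(len(expanded), n_unigram, n_bigram)
```

The L * L factor is why bigram templates with macros become expensive when the label set is large; a bare B only adds the L * L label-pair features.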

3. Training

The general form of the CRF++ training command is:

crf_learn -f 3 -c 4.0 template train.data model -t

Here template is the template file and train.data is the training corpus; -t makes crf_learn write a text file model.txt in addition to the model file. The other options are:

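-f, --freq=INT    use only features that occur at least INT times (default 1)

-m, --maxiter=INT    set INT as the maximum number of LBFGS iterations (default 10k)

-c, --cost=FLOAT    set FLOAT as the cost parameter; too large a value causes overfitting (default 1.0)

-e, --eta=FLOAT    set FLOAT as the termination criterion (default 0.0001)

-C, --convert    convert a text model to a binary model

-t, --textmodel    also build a text model file for debugging

-a, --algorithm=(CRF|MIRA)    select the training algorithm (default CRF-L2)

-p, --thread=INT    number of threads (default 1); uses multiple CPUs to reduce training time

-H, --shrinking-size=INT    set INT as the number of iterations a variable must stay optimal before it is considered for shrinking (default 20)

-v, --version    show the version number and exit

-h, --help    show help and exit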

During training, CRF++ prints the following fields:

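iter: number of iterations. Training stops when it reaches maxiter

terr: token (tag) error rate

serr: sentence error rate

obj: current value of the objective function. Training is complete when this value converges

diff: relative difference from the previous objective value. Training is complete when it falls below eta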
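As a sketch of what terr and serr measure, the following computes a token error rate and a sentence error rate from gold and predicted tag sequences. The function name error_rates and the toy data are hypothetical; CRF++ reports these figures itself during training.

```python
def error_rates(gold_sents, pred_sents):
    """Return (token error rate, sentence error rate)."""
    token_total = token_wrong = sent_wrong = 0
    for gold, pred in zip(gold_sents, pred_sents):
        wrong = sum(1 for g, p in zip(gold, pred) if g != p)
        token_total += len(gold)
        token_wrong += wrong
        sent_wrong += 1 if wrong else 0  # a sentence is wrong if any tag is wrong
    return token_wrong / token_total, sent_wrong / len(gold_sents)

gold = [["B-MISC", "I-MISC"], ["B-ORG", "O", "B-ORG", "O"]]
pred = [["B-MISC", "I-MISC"], ["B-ORG", "O", "O", "O"]]
terr, serr = error_rates(gold, pred)
print(terr, serr)
```

One wrong tag out of six tokens gives a low token error rate, but it makes one of the two sentences wrong, which is why serr is always at least as large as terr.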

4. Prediction

After training, we can use the model to predict labels for new data. The command has the form:

crf_test -m model NER_predict.data > predict.txt

-m model selects the model we just trained, NER_predict.data is the file to be labeled, and > predict.txt writes the predictions to predict.txt.

NER Implementation Example

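Next, we use CRF++ to implement English named entity recognition.
The corpus for this project is shown below (the file is train.txt, with 42,000 lines in total; only the first 15 lines are shown here, and the corpus can be downloaded from the GitHub address at the end of the article):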

played on Monday ( home team in CAPS ) :
VBD IN NNP ( NN NN IN NNP ) :
O O O O O O O O O O
American League
NNP NNP
B-MISC I-MISC
Cleveland 2 DETROIT 1
NNP CD NNP CD
B-ORG O B-ORG O
BALTIMORE 12 Oakland 11 ( 10 innings )
VB CD NNP CD ( CD NN )
B-ORG O B-ORG O O O O O
TORONTO 5 Minnesota 3
TO CD NNP CD
B-ORG O B-ORG O
……

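A brief description of the corpus structure: it has 42,000 lines, grouped in threes. In each group, the first line is an English sentence, the second gives the part-of-speech tag of each word, and the third gives the NER annotation. There are four entity categories: PER (person), LOC (location), ORG (organization), and MISC. B marks the beginning of an entity and I marks its continuation; O marks a word that is not counted as a named entity, and sO marks a special single word.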
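The B/I/O scheme can be decoded into entity spans with a short loop. The sketch below is one way to do it, assuming well-formed tags of the form B-TYPE and I-TYPE; the function name bio_to_spans is hypothetical.

```python
def bio_to_spans(tags):
    """Return (entity_type, start, end) triples; end is exclusive."""
    spans = []
    start = None
    kind = None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing entity
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            spans.append((kind, start, i))  # the open entity ends before position i
            start = kind = None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans

words = "BALTIMORE 12 Oakland 11 ( 10 innings )".split()
tags = "B-ORG O B-ORG O O O O O".split()
for kind, start, end in bio_to_spans(tags):
    print(kind, " ".join(words[start:end]))
```

An I- tag that does not continue an open entity of the same type is ignored in this sketch; other conventions treat it as the start of a new entity.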
First, we split the corpus into a training set and a test set at a ratio of 9:1. The Python code is:

data_dir = "/Users/Shared/CRF_4_NER/CRF_TEST"

# train.txt holds each sentence as three tab-separated lines:
# words, POS tags, NER tags.
with open("%s/train.txt" % data_dir, "r", encoding="utf-8") as f:
    sents = [line.strip() for line in f.readlines()]

RATIO = 0.9
sent_num = len(sents) // 3
train_num = int(sent_num * RATIO)


def write_crf_file(path, start, end):
    """Write sentences start..end-1 in CRF++ format:
    one token per line, a blank line after each sentence."""
    with open(path, "w", encoding="utf-8") as out:
        for i in range(start, end):
            words = sents[3*i].split('\t')
            postags = sents[3*i+1].split('\t')
            tags = sents[3*i+2].split('\t')
            for word, postag, tag in zip(words, postags, tags):
                out.write(word+'\t'+postag+'\t'+tag+'\n')
            out.write('\n')


# The first 90% of the sentences form the training set, the rest the test set.
write_crf_file("%s/NER_train.data" % data_dir, 0, train_num)
write_crf_file("%s/NER_test.data" % data_dir, train_num, sent_num)

print('OK!')

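Running this program produces NER_train.data, the training set, and NER_test.data, the test set. The first 20 lines of NER_train.data (tab-separated) are the lines shown earlier as the example training corpus.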

The template file (named template) that we use is:

U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]

U10:%x[-2,1]
U11:%x[-1,1]
U12:%x[0,1]
U13:%x[1,1]
U14:%x[2,1]
U15:%x[-2,1]/%x[-1,1]
U16:%x[-1,1]/%x[0,1]
U17:%x[0,1]/%x[1,1]
U18:%x[1,1]/%x[2,1]

U20:%x[-2,1]/%x[-1,1]/%x[0,1]
U21:%x[-1,1]/%x[0,1]/%x[1,1]
U22:%x[0,1]/%x[1,1]/%x[2,1]

B

Next we train on this data with the command:

crf_learn -c 3.0 template NER_train.data model -t

The output during the run is as follows:

Figure: CRF++ output

On my computer, training ran for 193 iterations and took 490.32 seconds, ending with a token error rate of 0.00004 and a sentence error rate of 0.00056.

Next, we evaluate the model's predictions on the test set. The prediction command is:

crf_test -m model NER_test.data > result.txt

We use a Python script to compute the prediction accuracy:

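data_dir = "/Users/Shared/CRF_4_NER/CRF_TEST"

# Each non-blank line of result.txt is: word, POS tag, gold tag, predicted tag.
with open("%s/result.txt" % data_dir, "r", encoding="utf-8") as f:
    sents = [line.strip() for line in f.readlines() if line.strip()]

total = len(sents)
print(total)

# Count the tokens whose predicted tag (last column) equals the gold tag.
count = 0
for sent in sents:
    words = sent.split()

    if words[-1] == words[-2]:
        count += 1

print("Accuracy: %.4f" % (count/total))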

The output is:

21487
Accuracy: 0.9706

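So the token-level accuracy on the test set is 0.9706, which is quite good. Note that this figure counts every token, including the tokens tagged O.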
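Because token accuracy also rewards every correctly tagged O, it is common to complement it with entity-level precision, recall, and F1, where a predicted entity counts only if its type and boundaries match exactly. The following is a minimal sketch of that idea, not the evaluation used in this article; the function names and the toy tags are hypothetical.

```python
def spans(tags):
    """Collect (type, start, end) entity spans from a BIO tag sequence."""
    found, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        if start is not None and not (tag.startswith("I-") and tag[2:] == kind):
            found.add((kind, start, i))
            start = kind = None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return found

def entity_prf(gold_tags, pred_tags):
    """Precision, recall and F1 over exactly matching entity spans."""
    gold, pred = spans(gold_tags), spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

gold = ["B-ORG", "O", "B-ORG", "O", "B-PER", "I-PER"]
pred = ["B-ORG", "O", "O",     "O", "B-PER", "O"]
print(entity_prf(gold, pred))
```

In this toy case four of six tokens are tagged correctly, yet only one of the three gold entities is recovered exactly. On result.txt, the gold and predicted tags would come from the last two columns of each line, taken sentence by sentence.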

Finally, we run NER on new data to see how well the model recognizes entities in unseen text. The Python code is:

import os
import nltk

data_dir = "/Users/Shared/CRF_4_NER/CRF_TEST"

sentence = "Venezuelan opposition leader and self-proclaimed interim president Juan Guaidó said Thursday he will return to his country by Monday, and that a dialogue with President Nicolas Maduro won't be possible without discussing elections."

# Tokenize and POS-tag the sentence (requires the NLTK tokenizer and tagger data).
words = nltk.word_tokenize(sentence)
print(words)
postags = nltk.pos_tag(words)
print(postags)

# Write the tokens in CRF++ format; the last column is a placeholder tag.
with open("%s/NER_predict.data" % data_dir, 'w', encoding='utf-8') as f:
    for token, postag in postags:
        f.write(token+'\t'+postag+'\tO\n')

print("write successfully!")

# Run crf_test from the data directory so the relative file names resolve.
os.chdir(data_dir)
os.system("crf_test -m model NER_predict.data > predict.txt")
print("get predict file!")

with open("%s/predict.txt" % data_dir, 'r', encoding='utf-8') as f:
    sents = [line.strip() for line in f.readlines() if line.strip()]

# The first column is the token, the last column is the predicted tag.
tokens = []
predict = []
for sent in sents:
    columns = sent.split()
    tokens.append(columns[0])
    predict.append(columns[-1])

ner_type_dict = {'PER': 'PERSON: ',
                 'LOC': 'LOCATION: ',
                 'ORG': 'ORGANIZATION: ',
                 'MISC': 'MISC: '
                }

print("NER results:")
# An entity starts at a B- tag and extends over the I- tags that directly follow it.
for i, tag in enumerate(predict):
    if tag.startswith('B'):
        end = i+1
        while end < len(predict) and predict[end].startswith('I'):
            end += 1

        ner_type = tag.split('-')[1]
        print(ner_type_dict[ner_type], ' '.join(tokens[i:end]))

The result is:

MISC:  Venezuelan
PERSON:  Juan Guaidó
PERSON:  Nicolas Maduro

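One label looks inaccurate: I expected Venezuelan to be LOC rather than MISC, although nationality adjectives such as British in the examples below are tagged MISC in the same way. Let us test some more new data: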

Input sentence 1:

Real Madrid's season on the brink after 3-0 Barcelona defeat

Result 1:

ORGANIZATION: Real Madrid
LOCATION: Barcelona

Input sentence 2:

British artist David Hockney is known as a voracious smoker, but the habit got him into a scrape in Amsterdam on Wednesday.

Result 2:

MISC: British
PERSON: David Hockney
LOCATION: Amsterdam

Input sentence 3:

India is waiting for the release of an pilot who has been in Pakistani custody since he was shot down over Kashmir on Wednesday, a goodwill gesture which could defuse the gravest crisis in the disputed border region in years.

Result 3:

LOCATION: India
LOCATION: Pakistani
LOCATION: Kashmir

Input sentence 4:

Instead, President Donald Trump's second meeting with North Korean despot Kim Jong Un ended in a most uncharacteristic fashion for a showman commander in chief: fizzle.

Result 4:

PERSON: Donald Trump
PERSON: Kim Jong Un

Input sentence 5:

And in a press conference at the Civic Leadership Academy in Queens, de Blasio said the program is already working.

Result 5:

ORGANIZATION: Civic Leadership Academy
LOCATION: Queens
PERSON: de Blasio

Input sentence 6:

The United States is a founding member of the United Nations, World Bank, International Monetary Fund.

Result 6:

LOCATION: United States
ORGANIZATION: United Nations
PERSON: World Bank
ORGANIZATION: International Monetary Fund

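These examples include some pleasant surprises: the model recognized the persons Donald Trump and Kim Jong Un. There are also shortcomings, such as World Bank being tagged as a person rather than an organization. Overall, the recognition quality is satisfactory.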

Summary

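I have been too busy with work lately to tend to the blog. On reflection, though, sharing technical writing matters and is worth keeping up over the long term.
The GitHub address of this project is https://github.com/percent4/CRF_4_NER .
May Day is approaching; I wish everyone a happy holiday.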
