A Quick Introduction to Natural Language Processing (NLP)

2021-02-13 Zhuanzhi (專知)

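[Overview] Natural language processing has become an important branch of artificial intelligence. It studies the theories and methods that enable effective communication between people and computers in natural language. This article offers a brief introduction to help readers get started with natural language processing quickly.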

Author | George Seif

Translated by | Xiaowen

An easy introduction to Natural Language Processing

Using computers to understand human language

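Computers are very good at working with standardized, structured data such as database tables and financial records. They can process that data much faster than we humans can. But we humans don't communicate in "structured data", nor do we speak binary! We communicate using words, a form of unstructured data.

Unfortunately, computers struggle with unstructured data because there are no standardized techniques for processing it. When we program a computer in a language such as C, Java, or Python, we are essentially giving it a set of rules to operate by. With unstructured data, those rules are quite abstract and hard to define concretely.

There is a lot of unstructured natural language on the internet; sometimes even Google doesn't know what you are searching for!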

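How humans and computers understand language

Humans have been writing things down for thousands of years. Over that time, our brains have gained a tremendous amount of experience in understanding natural language. When we read something on a piece of paper or in a blog post on the internet, we understand what it really means in the real world. We feel the emotions that reading it elicits, and we often visualize how that thing would look in real life.

Natural language processing (NLP) is a subfield of artificial intelligence focused on enabling computers to understand and process human language, bringing computers closer to a human-level understanding of language. Computers do not yet have the intuitive grasp of natural language that humans do; they cannot really understand what the language is trying to say. In short, a computer cannot read between the lines.

That said, recent advances in machine learning (ML) have enabled computers to do many useful things with natural language! Deep learning lets us write programs that perform tasks such as language translation, semantic understanding, and text summarization. All of these add real-world value, making it easy to understand and run computations on large blocks of text without manual effort.

Let's start with a quick primer on how NLP works conceptually. After that, we will dive into some Python code so you can get started with NLP yourself!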

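The real reason why NLP is hard

The process of reading and understanding language is far more complex than it seems at first glance. A lot goes into truly understanding what a piece of text means in the real world. For example, what do you think the following text means?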

"Steph Curry was on fire last night. He totally destroyed the other team"

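To a human, the meaning of this sentence is obvious. We know Steph Curry is a basketball player; even if you don't, you can tell he plays on some kind of team, probably a sports team. When we see "on fire" and "destroyed", we know it means Steph Curry played really well last night and beat the other team.

Computers tend to take things too literally. Read literally, we see "Steph Curry" and, based on the capitalization, assume it is a person, a place, or something otherwise important. But then we see that Steph Curry "was on fire"... A computer might tell you that someone literally lit Steph Curry on fire yesterday! Yikes. After that, the computer might say that Curry has destroyed the other team... they no longer exist... great...

Steph Curry really is on fire!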
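To make the literal-reading problem concrete, here is a minimal sketch of a hand-written rule system. The `LITERAL_RULES` table and the `literal_reading` function are invented for illustration and do not come from the original article; the only point is that fixed word-level rules misread idioms such as "on fire".

```python
# Hypothetical keyword rules (an assumption for illustration):
# each word maps to the literal event it "means".
LITERAL_RULES = {
    "fire": "something is burning",
    "destroyed": "something no longer exists",
}


def literal_reading(sentence):
    """Return the literal interpretations triggered by words in the sentence."""
    # Split on whitespace, then drop surrounding punctuation and lowercase.
    words = [word.strip(".,!?").lower() for word in sentence.split()]
    return [LITERAL_RULES[word] for word in words if word in LITERAL_RULES]


sentence = "Steph Curry was on fire last night. He totally destroyed the other team"
readings = literal_reading(sentence)
print(readings)
```

Both rules fire on the example sentence, which is exactly the kind of overly literal interpretation described above.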

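But not everything the machine does is that crude. Thanks to machine learning, we can actually do some very clever things to quickly extract and understand information from natural language! Let's see how to do that in a few lines of code with a couple of simple Python libraries.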

Solving NLP problems with Python code

To see how NLP works, we will use the following text from Wikipedia as our running example:

Amazon.com, Inc., doing business as Amazon, is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by Jeff Bezos on July 5, 1994. The tech giant is the largest Internet retailer in the world as measured by revenue and market capitalization, and second largest after Alibaba Group in terms of total sales. The amazon.com website started as an online bookstore and later diversified to sell video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry. The company also produces consumer electronics - Kindle e-readers, Fire tablets, Fire TV, and Echo - and is the world's largest provider of cloud infrastructure services (IaaS and PaaS). Amazon also sells certain low-end products under its in-house brand AmazonBasics.

A few required libraries

First, we will install some useful Python NLP libraries that will help us analyze this text.

### Installing spaCy, general Python NLP lib

pip3 install spacy

### Downloading the English dictionary model for spaCy

python3 -m spacy download en_core_web_lg

### Installing textacy, basically a useful add-on to spaCy

pip3 install textacy
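Before moving on, it can help to confirm that the installs above succeeded. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the check assumes the spaCy model is importable under the package name `en_core_web_lg`, which is how spaCy models are normally installed.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The two libraries and the model installed by the commands above.
required = ["spacy", "textacy", "en_core_web_lg"]
missing = missing_packages(required)
print("Missing:", missing if missing else "none")
```

If anything is listed as missing, rerun the corresponding install command in the same Python environment.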

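Entity analysis

Now that everything is installed, we can run a quick entity analysis of our text. Entity analysis goes through the text and identifies all of its important words, or "entities". When we say "important", we really mean words that carry some kind of real-world semantic meaning or significance.

Take a look at the code below, which does all of the entity analysis for us: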

# coding: utf-8

import spacy

### Load spaCy's English NLP model
nlp = spacy.load('en_core_web_lg')

### The text we want to examine
text = "Amazon.com, Inc., doing business as Amazon, is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by Jeff Bezos on July 5, 1994. The tech giant is the largest Internet retailer in the world as measured by revenue and market capitalization, and second largest after Alibaba Group in terms of total sales. The amazon.com website started as an online bookstore and later diversified to sell video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry. The company also produces consumer electronics - Kindle e-readers, Fire tablets, Fire TV, and Echo - and is the world's largest provider of cloud infrastructure services (IaaS and PaaS). Amazon also sells certain low-end products under its in-house brand AmazonBasics."

### Parse the text with spaCy
### Our 'document' variable now contains a parsed version of text.
document = nlp(text)

### Print out all the named entities that were detected
for entity in document.ents:
    print(entity.text, entity.label_)

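We first load spaCy's trained ML model and initialize the text we want to process. We then run the ML model on the text to extract the entities. When you run that code, you get the following output: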

Amazon.com, Inc. ORG
Amazon ORG
American NORP
Seattle GPE
Washington GPE
Jeff Bezos PERSON
July 5, 1994 DATE
second ORDINAL
Alibaba Group ORG
amazon.com ORG
Fire TV ORG
Echo - LOC
PaaS ORG
Amazon ORG
AmazonBasics ORG

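The uppercase codes next to each entity[1] are labels that indicate the type of entity we are looking at. It looks like our model did a pretty good job! Jeff Bezos is indeed a person, the date is correct, Amazon is an organization, and Seattle and Washington are both geopolitical entities (that is, countries, cities, states, and so on). The only tricky cases are that things like Fire TV and Echo are actually products, not organizations or locations. The model also missed the other items Amazon sells, "video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry", probably because they sit in one long list and therefore look relatively unimportant.

Overall, our model has accomplished what we wanted. Imagine we had a huge document of several hundred pages of text; this NLP model could quickly give you a sense of what the document is about and what its key entities are.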
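For a long document, a per-label summary is often easier to scan than the raw entity list. The sketch below tallies the entities by label using only the standard library. The `entities` list is copied by hand from the output shown above; with spaCy it could instead be built as `[(ent.text, ent.label_) for ent in document.ents]`. The variable names are illustrative.

```python
from collections import Counter, defaultdict

# (text, label) pairs copied from the entity output printed above.
entities = [
    ("Amazon.com, Inc.", "ORG"), ("Amazon", "ORG"), ("American", "NORP"),
    ("Seattle", "GPE"), ("Washington", "GPE"), ("Jeff Bezos", "PERSON"),
    ("July 5, 1994", "DATE"), ("second", "ORDINAL"), ("Alibaba Group", "ORG"),
    ("amazon.com", "ORG"), ("Fire TV", "ORG"), ("Echo -", "LOC"),
    ("PaaS", "ORG"), ("Amazon", "ORG"), ("AmazonBasics", "ORG"),
]

# How many times each label occurs.
label_counts = Counter(label for _, label in entities)

# The distinct entity texts seen under each label.
by_label = defaultdict(set)
for text, label in entities:
    by_label[label].add(text)

for label, count in label_counts.most_common():
    print(label, count, sorted(by_label[label]))
```

Grouping into sets removes repeated mentions such as the two occurrences of "Amazon", while the counter still records how often each label appeared.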

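Operating on entities

Let's try something a bit more practical. Suppose you have the same block of text as above, but for privacy reasons you want to automatically remove the names of all people and organizations. Using spaCy's entity annotations, we can write a small scrub function that removes any entity categories we don't want to see. It looks like this: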

# coding: utf-8

import spacy

### Load spaCy's English NLP model
nlp = spacy.load('en_core_web_lg')

### The text we want to examine
text = "Amazon.com, Inc., doing business as Amazon, is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by Jeff Bezos on July 5, 1994. The tech giant is the largest Internet retailer in the world as measured by revenue and market capitalization, and second largest after Alibaba Group in terms of total sales. The amazon.com website started as an online bookstore and later diversified to sell video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry. The company also produces consumer electronics - Kindle e-readers, Fire tablets, Fire TV, and Echo - and is the world's largest provider of cloud infrastructure services (IaaS and PaaS). Amazon also sells certain low-end products under its in-house brand AmazonBasics."

### Replace a PERSON or ORG token with the placeholder "[PRIVATE]"
def replace_entity_with_placeholder(token):
    if token.ent_iob != 0 and (token.ent_type_ == "PERSON" or token.ent_type_ == "ORG"):
        # Keep the token's own trailing whitespace so the original spacing is preserved
        return "[PRIVATE]" + token.whitespace_
    else:
        # token.text_with_ws stands in for token.string, which spaCy v3 no longer provides
        return token.text_with_ws

### Loop through all the entities in a piece of text and apply entity replacement
def scrub(text):
    doc = nlp(text)
    # Merge each multi-token entity into a single token
    # (doc.retokenize() stands in for ent.merge(), which spaCy v3 no longer provides)
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    tokens = map(replace_entity_with_placeholder, doc)
    return "".join(tokens)

print(scrub(text))

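That works great! This is actually a very powerful technique. People use Ctrl+F on their computers all the time to find and replace words in a document. With NLP, however, we can find and replace specific entities based on their semantic meaning, not just their raw text.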
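The same idea can be sketched without token merging by working on character offsets. In the sketch below, `scrub_spans` is a hypothetical helper, and the sample spans are written by hand to stand in for a model's output; with spaCy they could be collected as `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]`.

```python
def scrub_spans(text, spans, private_labels=("PERSON", "ORG")):
    """Replace every span whose label is private with "[PRIVATE]".

    spans: iterable of non-overlapping (start_char, end_char, label) tuples.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if label not in private_labels:
            continue
        # Copy the untouched text before the entity, then the placeholder.
        pieces.append(text[cursor:start])
        pieces.append("[PRIVATE]")
        cursor = end
    # Copy whatever follows the last replaced entity.
    pieces.append(text[cursor:])
    return "".join(pieces)


sample = "Amazon was founded by Jeff Bezos in Seattle."
# Hand-written spans standing in for a model's output (an assumption).
sample_spans = [(0, 6, "ORG"), (22, 32, "PERSON"), (36, 43, "GPE")]
print(scrub_spans(sample, sample_spans))
```

Because the replacement is driven by labels rather than by literal strings, "Seattle" survives here while the person and organization names are removed, which a plain text search could not decide on its own.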

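Extracting information from text

The textacy library we installed earlier implements several common NLP information-extraction algorithms on top of spaCy. It lets us do things that are more advanced than the simple out-of-the-box functionality.

One of the algorithms it implements is called semi-structured statement extraction. Essentially, it takes some of the information that spaCy's NLP model was able to extract and uses it to pull out more specific information about certain entities! In short, we can extract certain "facts" about an entity of our choice.

Let's see what that looks like in code. For this one, we will take the entire summary from the Wikipedia page for Washington, D.C.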

# coding: utf-8

import spacy
import textacy.extract

### Load spaCy's English NLP model
nlp = spacy.load('en_core_web_lg')

### The text we want to examine
text = """Washington, D.C., formally the District of Columbia and commonly referred to as Washington or D.C., is the capital of the United States of America.[4] Founded after the American Revolution as the seat of government of the newly independent country, Washington was named after George Washington, first President of the United States and Founding Father.[5] Washington is the principal city of the Washington metropolitan area, which has a population of 6,131,977.[6] As the seat of the United States federal government and several international organizations, the city is an important world political capital.[7] Washington is one of the most visited cities in the world, with more than 20 million annual tourists.[8][9]
The signing of the Residence Act on July 16, 1790, approved the creation of a capital district located along the Potomac River on the country's East Coast. The U.S. Constitution provided for a federal district under the exclusive jurisdiction of the Congress and the District is therefore not a part of any state. The states of Maryland and Virginia each donated land to form the federal district, which included the pre-existing settlements of Georgetown and Alexandria. Named in honor of President George Washington, the City of Washington was founded in 1791 to serve as the new national capital. In 1846, Congress returned the land originally ceded by Virginia; in 1871, it created a single municipal government for the remaining portion of the District.
Washington had an estimated population of 693,972 as of July 2017, making it the 20th largest American city by population. Commuters from the surrounding Maryland and Virginia suburbs raise the city's daytime population to more than one million during the workweek. The Washington metropolitan area, of which the District is the principal city, has a population of over 6 million, the sixth-largest metropolitan statistical area in the country.
All three branches of the U.S. federal government are centered in the District: U.S. Congress (legislative), President (executive), and the U.S. Supreme Court (judicial). Washington is home to many national monuments and museums, which are primarily situated on or around the National Mall. The city hosts 177 foreign embassies as well as the headquarters of many international organizations, trade unions, non-profit, lobbying groups, and professional associations, including the Organization of American States, AARP, the National Geographic Society, the Human Rights Campaign, the International Finance Corporation, and the American Red Cross.
A locally elected mayor and a 13‑member council have governed the District since 1973. However, Congress maintains supreme authority over the city and may overturn local laws. D.C. residents elect a non-voting, at-large congressional delegate to the House of Representatives, but the District has no representation in the Senate. The District receives three electoral votes in presidential elections as permitted by the Twenty-third Amendment to the United States Constitution, ratified in 1961."""

### Parse the text with spaCy
### Our 'document' variable now contains a parsed version of text.
document = nlp(text)

### Extracting semi-structured statements
### (written for an older textacy release; newer releases may expect keyword arguments such as entity= and cue=)
statements = textacy.extract.semistructured_statements(document, "Washington")

print("**** Information from Washington's Wikipedia page ****")
count = 1
for statement in statements:
    subject, verb, fact = statement
    print(str(count) + " - Statement: ", statement)
    print(str(count) + " - Fact: ", fact)
    count += 1

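Our NLP model found three useful facts about Washington, D.C. in this text:

(1) Washington is the capital of the United States

(2) Washington's population, and the fact that it is a metropolitan area

(3) Its many national monuments and museums

The best part is that these are all among the most important pieces of information in that block of text!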
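To make the shape of a semi-structured statement concrete, here is a toy approximation that uses a regular expression instead of a parsed document. It is not textacy's algorithm, which works on spaCy's linguistic annotations; `toy_statements` is a hypothetical helper, and the sample sentences are abridged from the Washington text above.

```python
import re


def toy_statements(text, entity, cue="is"):
    """Yield (entity, cue, fragment) for sentences shaped like '<entity> <cue> ...'."""
    pattern = re.compile(
        r"\b" + re.escape(entity) + r"\s+" + re.escape(cue) + r"\s+([^.]+)\."
    )
    for match in pattern.finditer(text):
        yield entity, cue, match.group(1).strip()


sample = (
    "Washington is the capital of the United States of America. "
    "Washington is home to many national monuments and museums. "
    "Congress maintains supreme authority over the city."
)
facts = list(toy_statements(sample, "Washington"))
for subject, verb, fact in facts:
    print(subject, verb, "->", fact)
```

A pattern like this breaks on abbreviations such as "D.C." and on anything beyond the simplest sentence structure, which is one reason to build on a parsed document as textacy does.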

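Going deeper with NLP

That concludes our easy introduction to NLP. We learned a lot, but this was only a small taste...

NLP has many more great applications, such as language translation, chatbots, and more specific and complex analyses of text documents. Much of this work today is done with deep learning, in particular recurrent neural networks (RNNs) and long short-term memory (LSTM) networks.

If you would like to play with more NLP yourself, the spaCy documentation[2] and the textacy documentation[3] are great places to start! You will find many examples of ways to work with parsed text and extract very useful information from it. Everything is quick and easy, and you can get a lot of value out of it. Time to do bigger and better things with deep learning!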

References:

[1] https://spacy.io/usage/linguistic-features#entity-types

[2] https://spacy.io/api/doc

[3] http://textacy.readthedocs.io/en/latest/

Original article:

https://towardsdatascience.com/an-easy-introduction-to-natural-language-processing-b1e2801291c1
