Summary: Practical Python Code for Text Preprocessing

2021-01-07 NetEase

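This article covers the basic steps of text preprocessing, which convert text from human language into a machine-readable format for further processing. It also covers the tools used along the way.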

Given a text, the first step is text normalization. Common normalization steps include:

removing punctuation, accent marks, and other diacritics

text canonicalization

These normalization steps are described in detail below.

Converting letters to lowercase

Example 1: Converting letters to lowercase

Python code:

    input_str = "The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil."
    input_str = input_str.lower()  # call lower() to get the lowercased copy
    print(input_str)

Output:

    the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.

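Removing numbers

Remove numbers if they are not relevant to the analysis. Regular expressions are the usual tool for this.

Example 2: Removing numbers

Python code:

    import re

    input_str = "Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls."
    result = re.sub(r"\d+", "", input_str)  # \d+ matches each run of digits
    print(result)

Output (each removed number leaves a doubled space behind):

    Box A contains  red and  white balls, while Box B contains  red and  blue balls.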

Removing punctuation

The following code removes punctuation, that is, the characters [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~].

Example 3: Removing punctuation

Python code:

    import string

    input_str = "This &is [an] example? {of} string. with.? punctuation!!!!"  # sample string
    # Python 3: build a table that deletes every character in string.punctuation
    result = input_str.translate(str.maketrans("", "", string.punctuation))
    print(result)

Output:

    This is an example of string with punctuation
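The list of normalization steps also mentions accent marks and other diacritics, which string.punctuation does not cover. One possible approach, sketched here with the standard library only, is to decompose each character and drop the combining marks. The function name strip_diacritics is a hypothetical helper chosen for this illustration, and the approach is an assumption rather than the method of the original article.

```python
import unicodedata


def strip_diacritics(text):
    """Remove accent marks by decomposing characters and dropping combining marks."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("Café déjà vu, naïve résumé"))
# Cafe deja vu, naive resume
```

Characters that have no decomposition, such as "ø", pass through unchanged, so this is only a partial solution for some languages.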

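Removing whitespace

The strip() function removes leading and trailing whitespace.

Example 4: Removing whitespace

Python code:

    input_str = " \t a string example\t "
    input_str = input_str.strip()  # strips spaces and tabs at both ends
    input_str

Output:

    'a string example'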
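The normalization steps shown so far can be chained into one function. The following is a minimal sketch, assuming the order lowercase, numbers, punctuation, whitespace is acceptable for the task at hand; the name normalize_text is hypothetical and not part of any library.

```python
import re
import string


def normalize_text(text):
    """Apply lowercasing, number removal, punctuation removal, and whitespace cleanup."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)  # drop runs of digits
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments also collapses repeated inner whitespace
    return " ".join(text.split())


print(normalize_text("  Box A contains 3 red and 5 white balls!  "))
# box a contains red and white balls
```

Joining on split() goes a little further than strip(): it also removes the doubled spaces left behind when numbers are deleted.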

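Tokenization

Tokenization is the process of splitting a given text into smaller pieces called tokens. Words, numbers, punctuation marks, and other symbols can all be treated as tokens. The table below (Tokenization sheet) lists some common tools for tokenization.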
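To make the idea concrete, here is a small regex-based tokenizer built on the standard library. It is only a stand-in for the dedicated tools and does not reproduce their handling of contractions, URLs, or abbreviations; the name simple_tokenize is hypothetical.

```python
import re


def simple_tokenize(text):
    """Split text into word/number tokens and single punctuation tokens."""
    # \w+ matches runs of letters, digits, and underscores;
    # [^\w\s] matches any single character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("Box A contains 3 red balls, right?"))
# ['Box', 'A', 'contains', '3', 'red', 'balls', ',', 'right', '?']
```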

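Removing stop words

Stop words are the most common words in a language, such as "a", "on", "is", and "all". These words carry little meaning of their own and are usually removed from the text. A common tool for this is the Natural Language Toolkit (NLTK), a suite of open-source libraries for symbolic and statistical natural language processing.

Example 7: Removing stop words

Code:

    # assumes the NLTK stopwords corpus and tokenizer data are already downloaded
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    input_str = "NLTK is a leading platform for building Python programs to work with human language data."
    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize(input_str)
    result = [i for i in tokens if i not in stop_words]
    print(result)

Output:

    ['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']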

scikit-learn also provides a stop word list:

    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

spaCy offers a similar one:

    from spacy.lang.en.stop_words import STOP_WORDS

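Removing sparse terms and particular words

In some cases it is necessary to remove sparse terms or particular words from the text. Because any set of words can serve as a stop word list, the stop word removal tools can handle this task as well.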
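As an illustration, the sketch below treats rare tokens and a hand-picked word set as one custom stop list. The threshold min_count, the parameter custom_words, and the function name are all assumptions made for this example.

```python
from collections import Counter


def remove_sparse_and_custom(tokens, min_count=2, custom_words=()):
    """Drop tokens seen fewer than min_count times and any token in custom_words."""
    counts = Counter(tokens)
    blocked = set(custom_words)
    return [t for t in tokens if counts[t] >= min_count and t not in blocked]


tokens = "the cat sat on the mat while the cat slept".split()
print(remove_sparse_and_custom(tokens, min_count=2, custom_words={"the"}))
# ['cat', 'cat']
```

On a real corpus the counts would normally be computed over all documents rather than a single sentence.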

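Stemming

Stemming reduces words to their stem, base, or root form (for example, books → book, looked → look). The two main algorithms are the Porter stemming algorithm (which removes common morphological and inflectional endings from words) and the Lancaster stemming algorithm.

Example 8: Stemming using NLTK

Code:

    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    stemmer = PorterStemmer()  # create a stemmer instance
    input_str = "There are several types of stemming algorithms."
    input_str = word_tokenize(input_str)
    for word in input_str:
        print(stemmer.stem(word))

Output (one stem per line, shown here on a single line):

    There are sever type of stem algorithm.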
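To show what "removing common endings" means in practice, here is a toy suffix stripper. It is deliberately crude and is not the Porter or Lancaster algorithm; the suffix list and the minimum stem length are hypothetical choices for this sketch.

```python
# Toy suffix stripper: illustrative only, NOT the Porter or Lancaster algorithm.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, hand-picked suffix list


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_len:
            return lowered[: -len(suffix)]
    return lowered


words = ["books", "looked", "stemming", "algorithms", "is"]
print([toy_stem(w) for w in words])
# ['book', 'look', 'stemm', 'algorithm', 'is']
```

The result "stemm" shows why real stemmers need extra rules, such as undoing doubled consonants, on top of plain suffix removal.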

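Lemmatization

Like stemming, lemmatization aims to reduce the different forms of a word to a common base form. Unlike stemming, it does not simply chop off or alter word endings; it uses a lexical knowledge base to obtain the correct base form.

Commonly used lemmatization tools include NLTK (WordNet Lemmatizer), spaCy, TextBlob, Pattern, gensim, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), Illinois Lemmatizer, and DKPro Core.

Example 9: Lemmatization using NLTK

Code:

    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    lemmatizer = WordNetLemmatizer()  # create a lemmatizer instance
    input_str = "been had done languages cities mice"
    input_str = word_tokenize(input_str)
    for word in input_str:
        print(lemmatizer.lemmatize(word))

Output (one lemma per line, shown here on a single line). lemmatize() treats each word as a noun unless a part of speech is passed, so the verb forms come back unchanged; lemmatizer.lemmatize(word, pos="v") maps them to be, have, and do:

    been had done language city mouse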
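The difference from stemming is that a lemmatizer looks words up instead of cutting them. The sketch below imitates that with a tiny hand-written table; LEMMA_TABLE is a hypothetical stand-in for a real lexical resource such as WordNet, and it ignores part of speech entirely.

```python
# Hypothetical mini lexicon standing in for a real knowledge base such as WordNet.
LEMMA_TABLE = {
    "been": "be",
    "had": "have",
    "done": "do",
    "languages": "language",
    "cities": "city",
    "mice": "mouse",
}


def lookup_lemma(word):
    """Return the dictionary form if the word is in the table, else the lowercased word."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


words = ["been", "had", "done", "languages", "cities", "mice"]
print([lookup_lemma(w) for w in words])
# ['be', 'have', 'do', 'language', 'city', 'mouse']
```

Irregular forms such as "mice" are exactly the cases that suffix stripping cannot handle and a lookup can.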

Part-of-speech tagging (POS)

Part-of-speech tagging assigns a part of speech (such as noun, verb, or adjective) to each word of a given text, based on the word's definition and its context. Many tools include a POS tagger, among them NLTK, spaCy, TextBlob, Pattern, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), FreeLing, Illinois Part of Speech Tagger, and DKPro Core.

Example 10: Part-of-speech tagging using TextBlob

Code:

    from textblob import TextBlob

    input_str = "Parts of speech examples: an article, to write, interesting, easily, and, of"
    result = TextBlob(input_str)
    print(result.tags)  # list of (word, tag) pairs

Output:

    [('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'), ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'), ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]

Chunking (shallow parsing)

Example 11: Chunking using NLTK

The first step is to determine the part of speech of each word.

Code:

    from textblob import TextBlob

    input_str = "A black television and a white stove were bought for the new apartment of John."
    result = TextBlob(input_str)
    print(result.tags)

Output:

    [('A', 'DT'), ('black', 'JJ'), ('television', 'NN'), ('and', 'CC'), ('a', 'DT'), ('white', 'JJ'), ('stove', 'NN'), ('were', 'VBD'), ('bought', 'VBN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'), ('apartment', 'NN'), ('of', 'IN'), ('John', 'NNP')]

The second step is the chunking itself.

Code:

    import nltk

    # noun phrase: optional determiner, any number of adjectives, then a noun
    reg_exp = "NP: {<DT>?<JJ>*<NN>}"
    rp = nltk.RegexpParser(reg_exp)
    result = rp.parse(result.tags)
    print(result)

Output:

    (S (NP A/DT black/JJ television/NN) and/CC (NP a/DT white/JJ stove/NN) were/VBD bought/VBN for/IN (NP the/DT new/JJ apartment/NN) of/IN John/NNP)

The sentence tree can also be drawn with result.draw(), as shown in the figure below.
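To see what a noun-phrase pattern of the form "optional determiner, any adjectives, then a noun" does, the matching can be written out by hand. The sketch below is a plain-Python illustration under that assumption, not how nltk.RegexpParser is implemented, and chunk_noun_phrases is a hypothetical name.

```python
def chunk_noun_phrases(tagged):
    """Group tokens matching the tag pattern DT? JJ* NN into noun-phrase strings."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":  # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":  # zero or more adjectives
            j += 1
        if j < n and tagged[j][1] == "NN":  # the noun that closes the chunk
            chunks.append(" ".join(word for word, _ in tagged[i : j + 1]))
            i = j + 1
        else:
            i += 1  # no chunk starts here; move on one token
    return chunks


tagged = [
    ("A", "DT"), ("black", "JJ"), ("television", "NN"), ("and", "CC"),
    ("a", "DT"), ("white", "JJ"), ("stove", "NN"), ("were", "VBD"),
    ("bought", "VBN"), ("for", "IN"), ("the", "DT"), ("new", "JJ"),
    ("apartment", "NN"), ("of", "IN"), ("John", "NNP"),
]
print(chunk_noun_phrases(tagged))
# ['A black television', 'a white stove', 'the new apartment']
```

"John" is left out because its tag is NNP, not NN, which matches the parse tree shown in the example.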

Example 12: Named-entity recognition using NLTK

Code:

    from nltk import word_tokenize, pos_tag, ne_chunk

    input_str = "Bill works for Apple so he went to Boston for a conference."
    # tokenize, tag parts of speech, then label named entities
    print(ne_chunk(pos_tag(word_tokenize(input_str))))

Output:

    (S (PERSON Bill/NNP) works/VBZ for/IN Apple/NNP so/IN he/PRP went/VBD to/TO (GPE Boston/NNP) for/IN a/DT conference/NN ./.)

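Coreference resolution (anaphora resolution)

Pronouns and other referring expressions should be linked to the right individuals. Coreference resolution finds the mentions in a text that refer to the same real-world entity. For example, in the sentence "Andrew said he would buy a car", the pronoun "he" refers to the same person, namely "Andrew". Common coreference resolution tools are shown in the table below and include Stanford CoreNLP, spaCy, Open Calais, Apache OpenNLP, and others.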

Collocation extraction

Collocations are word combinations that occur together more often than would be expected by chance. Examples include "break the rules", "free time", "draw a conclusion", "keep in mind", and "get ready".

Example 13: Collocation extraction using ICE

Code:

    from ICE import CollocationExtractor

    input = ["he and Chazz duel with all keys on the line."]
    extractor = CollocationExtractor.with_collocation_pipeline("T1", bing_key="Temp", pos_check=False)
    print(extractor.get_collocations_of_length(input, length=3))

Output:

    ["on the line"]
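The notion of words occurring together more often than chance can be made concrete with pointwise mutual information (PMI) over bigrams. This is a swapped-in, standard-library sketch and not the method used by ICE; the function name bigram_pmi and the min_count threshold are assumptions for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (base 2)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total_unigrams = len(tokens)
    total_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue  # PMI is unreliable for pairs seen only once
        p_pair = count / total_bigrams
        p_w1 = unigram_counts[w1] / total_unigrams
        p_w2 = unigram_counts[w2] / total_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


tokens = "free time is rare so use free time well and keep free time".split()
for pair, score in bigram_pmi(tokens):
    print(pair, round(score, 2))
```

A positive score means the pair appears together more often than its individual word frequencies would predict; here only ("free", "time") passes the min_count filter.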

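Relationship extraction

Relationship extraction obtains structured information from unstructured sources such as raw text. Strictly speaking, it identifies relations (for example, spouse or employment) among named entities (for example, people, organizations, or locations). For example, from the sentence "Mark and Emily married yesterday", we can extract that Mark is Emily's husband.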
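A very rough way to picture this is a hand-written pattern that turns one sentence shape into a (subject, relation, object) triple. The sketch below covers only the "X and Y married" shape; the pattern, the relation label spouse_of, and the function name are hypothetical, and real systems rely on named-entity recognition and trained models rather than a single regular expression.

```python
import re

# Matches "<Name> and <Name> married", optionally with "got" or "were" before "married".
MARRIED = re.compile(r"\b([A-Z][a-z]+) and ([A-Z][a-z]+) (?:got |were )?married\b")


def extract_spouse_relations(text):
    """Return (person, 'spouse_of', person) triples found by the MARRIED pattern."""
    return [(a, "spouse_of", b) for a, b in MARRIED.findall(text)]


print(extract_spouse_relations("Mark and Emily married yesterday."))
# [('Mark', 'spouse_of', 'Emily')]
```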

