Chinese Medical NLP: Datasets, Papers, Knowledge Graphs, Corpora, and Toolkits

2021-02-23 深度學習自然語言處理 (Deep Learning and Natural Language Processing)



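Compiled by: 深度學習技術前沿 (Deep Learning Technology Frontier)

GitHub homepage: https://github.com/lrs1353281004/Chinese_medical_NLP

Chinese_medical_NLP

Evaluation datasets, papers, and related resources for medical NLP, with a focus on Chinese.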

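Chinese evaluation datasets

Chinese medical knowledge graphs

English datasets

Related papers

Chinese medical corpora

Medical embeddings

Open-source toolkits

Industrial products and solutions

Blog posts

Related links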

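Chinese evaluation datasets

1. Yidu-S4K: Yidu Cloud structured 4K dataset

Dataset description:

Yidu-S4K comes from CCKS 2019 evaluation Task 1, "Named entity recognition for Chinese electronic medical records", which has two subtasks:

1) Medical named entity recognition: because no publicly available dataset for medical entity recognition in Chinese electronic medical records existed in China, the organizers kept this task for 2019, revised the 2017 dataset, and released it together with the task. The data for this subtask consists of a training set and a test set.

2) Medical entity and attribute extraction (cross-hospital transfer): building on medical entity recognition, extract predefined attributes of the entities. This is a transfer-learning task: only a small amount of labeled data is provided for the target scenario, and recognition in that scenario must draw on labeled and unlabeled data from other scenarios. The data for this subtask consists of a training set (labeled data from the non-target and target scenarios, plus unlabeled data from each scenario) and a test set (labeled data from the target scenario).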
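A common first step with span-annotated NER data of this kind is converting each record into character-level BIO tags. The sketch below shows one way to do that. The field names (`originalText`, `entities`, `start_pos`, `end_pos`, `label_type`), the end-exclusive offsets, the `ANATOMY` label, and the sample record are all assumptions made for illustration; check them against the released files before relying on them.

```python
from typing import Dict, List


def spans_to_bio(text: str, entities: List[Dict]) -> List[str]:
    """Convert character-level entity spans into one BIO tag per character."""
    tags = ["O"] * len(text)
    for ent in entities:
        # Assumed schema: start inclusive, end exclusive.
        start, end, label = ent["start_pos"], ent["end_pos"], ent["label_type"]
        if not (0 <= start < end <= len(text)):
            continue  # skip malformed spans instead of failing
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Hypothetical record, not taken from the dataset.
record = {
    "originalText": "患者胃部疼痛三天",
    "entities": [{"start_pos": 2, "end_pos": 4, "label_type": "ANATOMY"}],
}
tags = spans_to_bio(record["originalText"], record["entities"])
for char, tag in zip(record["originalText"], tags):
    print(char, tag)
```

Tagging per character rather than per word avoids depending on a Chinese word segmenter, which is a frequent choice for Chinese clinical text.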

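Dataset link

Baidu Pan download: https://pan.baidu.com/s/1QqYtqDwhc_S51F3SYMChBQ

Extraction code: flql

2. Ruijin Hospital diabetes dataset

Dataset description:

The dataset comes from a Tianchi competition. It is intended for mining diabetes literature, namely diabetes-related textbooks and research papers, and for building a diabetes knowledge graph. Participants had to design accurate and efficient algorithms for this problem. The season 1 topic was "entity annotation based on diabetes clinical guidelines and research papers"; the season 2 topic was "construction of relations between entities based on diabetes clinical guidelines and research papers".

The organizers released only the training set; the test set used for the final ranking was not released.

Dataset link

Baidu Pan download: https://pan.baidu.com/s/1CWKblBNBqR-vs2h0xiXSdQ

Extraction code: 0c54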

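3. Yidu-N7K: Yidu Cloud normalization 7K dataset

Dataset description:

Yidu-N7K comes from CHIP 2019 evaluation Task 1, the clinical term normalization task. Clinical term normalization is an indispensable task in medical statistics. In clinical practice, the same diagnosis, procedure, drug, examination, lab test, or symptom is often written in hundreds or even thousands of different ways. Normalization solves the problem of finding the standard expression for each of these variants. Only with normalized terms can researchers carry out downstream statistical analysis of electronic medical records. In essence, clinical term normalization is a kind of semantic similarity matching task. However, because the original terms are expressed in so many ways, a single matching model rarely performs well.

Dataset link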

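4. Chinese medical question answering dataset

Dataset description:

A Chinese question answering dataset on medicine, with more than 100,000 entries.

Data files:

questions.csv: all questions and their content.
answers.csv: the answers to all questions.
train_candidates.txt, dev_candidates.txt, test_candidates.txt: train, dev, and test splits of the two files above.

Dataset link

Dataset GitHub link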

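5. Ping An Healthcare Technology disease question answering transfer learning competition

Dataset description:

This competition is evaluation Task 2 of CHIP 2019, hosted by Ping An Healthcare Technology (平安醫療科技). For details of the CHIP 2019 conference, see http://cips-chip.org.cn/evaluation. Transfer learning is an important part of natural language processing; its main goal is to improve learning on a new task by transferring knowledge from related tasks already learned, thereby improving the model's generalization. The goal of this evaluation task is transfer learning across disease types on Chinese disease question answering data. Specifically, given question pairs from 5 different disease types, decide whether the two sentences have the same or similar meaning. All data comes from real patient questions on the Internet, filtered and manually annotated for intent matching.

Dataset link (registration required)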

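6. Tianchi COVID-19 question matching competition

Dataset description:

The competition data consists of de-identified medical question pairs and their labels. The questions cover 10 disease types: pneumonia, mycoplasma pneumonia, bronchitis, upper respiratory tract infection, pulmonary tuberculosis, asthma, pleurisy, emphysema, common cold, and hemoptysis. The data is split into three files, train.csv, dev.csv, and test.csv. Participants receive the training set train.csv and the validation set dev.csv; the test set test.csv is not visible to participants. Each record consists of Category, Query1, Query2, and Label, which are the question category, question 1, question 2, and the label. Label indicates whether the two questions have the same meaning: 1 if they do, 0 if they do not. Labels are given for the training set and withheld for the validation and test sets.

Examples:

Category: 肺炎 (pneumonia); Query1: 肺部發炎是什麼原因引起的? (What causes lung inflammation?); Query2: 肺部發炎是什麼引起的 (What is lung inflammation caused by); Label: 1
Category: 肺炎 (pneumonia); Query1: 肺部發炎是什麼原因引起的? (What causes lung inflammation?); Query2: 肺部炎症有什麼症狀 (What are the symptoms of lung inflammation); Label: 0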
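The record layout described above can be read with Python's standard `csv` module. The sketch below assumes the files are UTF-8, comma-separated, and carry a header row named exactly Category, Query1, Query2, Label; those details are assumptions, since the description only lists the four fields. The two rows are the examples quoted above, and a missing Label (as in dev.csv and test.csv) is mapped to `None`.

```python
import csv
import io
from collections import Counter

# In practice: open("train.csv", encoding="utf-8", newline="") instead of StringIO.
SAMPLE = (
    "Category,Query1,Query2,Label\n"
    "肺炎,肺部發炎是什麼原因引起的?,肺部發炎是什麼引起的,1\n"
    "肺炎,肺部發炎是什麼原因引起的?,肺部炎症有什麼症狀,0\n"
)


def load_pairs(handle):
    """Read question pairs; Label becomes None when the column is empty or absent."""
    pairs = []
    for row in csv.DictReader(handle):
        label = row.get("Label") or ""
        pairs.append({
            "category": row["Category"],
            "query1": row["Query1"],
            "query2": row["Query2"],
            "label": int(label) if label else None,
        })
    return pairs


pairs = load_pairs(io.StringIO(SAMPLE))
label_counts = Counter(p["label"] for p in pairs)
print(label_counts)
```

Checking the label balance per category with a counter like this is a cheap sanity check before training a matching model.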

Dataset link (registration required)

4th-place online leaderboard solution and code

1st-place online leaderboard solution and code

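7. Chinese doctor-patient dialogue data

Data description: Chinese doctor-patient dialogues from an online medical consultation product.

Original description: The MedDialog dataset contains conversations (in Chinese) between doctors and patients. It has 1.1 million dialogues and 4 million utterances. The data is continuously growing and more dialogues will be added. The raw dialogues are from haodf.com. All copyrights of the data belong to haodf.com.

Project link

Baidu Pan download: https://pan.baidu.com/s/1ZwzNgvAAMQk4klerTspsoA

Extraction code: lbo4

8. Chinese medical question answering data

Data description: medical question answering data covering six clinical departments; the source is unknown.

Project link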

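Chinese medical knowledge graphs

CMeKG

Link

Overview: CMeKG (Chinese Medical Knowledge Graph) is a Chinese medical knowledge graph built from large-scale medical text with natural language processing and text mining, using a human-in-the-loop process. Its construction draws on authoritative international medical standards such as ICD, ATC, SNOMED, and MeSH, as well as large, multi-source, heterogeneous medical texts including clinical guidelines, industry standards, diagnosis and treatment protocols, and medical encyclopedias. CMeKG 1.0 contains structured knowledge descriptions of 6310 diseases, 19853 drugs (Western medicines, Chinese patent medicines, and Chinese herbal medicines), and 1237 diagnostic and treatment techniques and devices. It covers more than 30 common relation types: for diseases, clinical symptoms, affected body site, drug treatment, surgical treatment, differential diagnosis, imaging examination, risk factors, transmission route, susceptible population, and treating department; for drugs, ingredients, indications, dosage and administration, shelf life, and contraindications. CMeKG describes more than 1 million concept-relation instances and attribute triples.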

English datasets

PubMedQA: A Dataset for Biomedical Research Question Answering

Dataset description: a medical question answering dataset extracted from PubMed. PubMedQA has 1k expert-annotated, 61.2k unlabeled and 211.3k artificially generated QA instances.

Paper link

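Related papers

1. Pretrained embeddings for the medical domain

Note: no open-source pretrained models for the Chinese medical domain have been collected so far; the English papers below are listed for reference.

BioBERT

Paper title: BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Paper link

Project link

Summary: initialized from general-domain pretrained BERT weights and trained on a large volume of English biomedical papers from PubMed. It outperforms the previous SOTA models on several downstream medical tasks.

Abstract: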

Motivation: Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora. Results: We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts. Availability and implementation: We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and the source code for fine-tuning BioBERT available at https://github.com/dmis-lab/biobert.

SciBERT

Paper title: SCIBERT: A Pretrained Language Model for Scientific Text

Paper link

Project link

Summary: from the AllenAI team. A BERT for the scientific domain, trained on more than 1.1 million papers from Semantic Scholar.

Abstract: Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SCIBERT, a pretrained language model based on BERT (Devlin et al., 2019) to address the lack of high-quality, large-scale labeled scientific data. SCIBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.

ClinicalBERT

Paper title: Publicly Available Clinical BERT Embeddings

Paper link

Project link

Summary: from the NAACL Clinical NLP Workshop 2019. A clinical-domain BERT trained on 2 million clinical notes from the MIMIC-III database.

Abstract: Contextual word embedding models such as ELMo and BERT have dramatically improved performance for many natural language processing (NLP) tasks in recent months. However, these models have been minimally explored on specialty corpora, such as clinical text; moreover, in the clinical domain, no publicly-available pre-trained BERT models yet exist. In this work, we address this need by exploring and releasing BERT models for clinical text: one for generic clinical text and another for discharge summaries specifically. We demonstrate that using a domain-specific model yields performance improvements on 3/5 clinical NLP tasks, establishing a new state-of-the-art on the MedNLI dataset. We find that these domain-specific models are not as performant on 2 clinical de-identification tasks, and argue that this is a natural consequence of the differences between de-identified source text and synthetically non de-identified task text.

ClinicalBERT (a version from a different team)

Paper title: ClinicalBert: Modeling Clinical Notes and Predicting Hospital Readmission

Paper link

Project link

Summary: also based on the MIMIC-III database, but this clinical-domain BERT was trained on only 100,000 randomly selected clinical notes.

Abstract: Clinical notes contain information about patients that goes beyond structured data like lab values and medications. However, clinical notes have been underused relative to structured data, because notes are high-dimensional and sparse. This work develops and evaluates representations of clinical notes using bidirectional transformers (ClinicalBert). ClinicalBert uncovers high-quality relationships between medical concepts as judged by humans. ClinicalBert outperforms baselines on 30-day hospital readmission prediction using both discharge summaries and the first few days of notes in the intensive care unit. Code and model parameters are available.

BEHRT

Paper title: BEHRT: TRANSFORMER FOR ELECTRONIC HEALTH RECORDS

Paper link

Project link: not yet open-sourced

Summary: in this paper the embeddings are trained over medical entities rather than over words.

Abstract: Today, despite decades of developments in medicine and the growing interest in precision healthcare, vast majority of diagnoses happen once patients begin to show noticeable signs of illness. Early indication and detection of diseases, however, can provide patients and carers with the chance of early intervention, better disease management, and efficient allocation of healthcare resources. The latest developments in machine learning (more specifically, deep learning) provides a great opportunity to address this unmet need. In this study, we introduce BEHRT: A deep neural sequence transduction model for EHR (electronic health records), capable of multitask prediction and disease trajectory mapping. When trained and evaluated on the data from nearly 1.6 million individuals, BEHRT shows a striking absolute improvement of 8.0-10.8%, in terms of Average Precision Score, compared to the existing state-of-the-art deep EHR models (in terms of average precision, when predicting for the onset of 301 conditions). In addition to its superior prediction power, BEHRT provides a personalised view of disease trajectories through its attention mechanism; its flexible architecture enables it to incorporate multiple heterogeneous concepts (e.g., diagnosis, medication, measurements, and more) to improve the accuracy of its predictions; and its (pre-)training results in disease and patient representations that can help us get a step closer to interpretable predictions.

2. Survey papers

Survey published in Nature Medicine

Paper title: A guide to deep learning in healthcare

Paper link

Summary: published in Nature Medicine. A survey of applications of computer vision, NLP, reinforcement learning, and related methods in medicine.

Abstract: Here we present deep-learning techniques for healthcare, centering our discussion on deep learning in computer vision, natural language processing, reinforcement learning, and generalized methods. We describe how these computational techniques can impact a few key areas of medicine and explore how to build end-to-end systems. Our discussion of computer vision focuses largely on medical imaging, and we describe the application of natural language processing to domains such as electronic health record data. Similarly, reinforcement learning is discussed in the context of robotic-assisted surgery, and generalized deep-learning methods for genomics are reviewed.

3. Electronic health record papers

Transfer Learning from Medical Literature for Section Prediction in Electronic Health Records

Paper link

Summary: published at EMNLP 2019. Transfer learning for EHR tasks using a small amount of in-domain data and a large amount of out-of-domain data.

Abstract: [...] sections such as Assessment and Plan, Social History, and Medications. These sections help physicians find information easily and can be used by an information retrieval system to return specific information sought by a user. However, it is common that the exact format of sections in a particular EHR does not adhere to known patterns. Therefore, being able to predict sections and headers in EHRs automatically is beneficial to physicians. Prior approaches in EHR section prediction have only used text data from EHRs and have required significant manual annotation. We propose using sections from medical literature (e.g., textbooks, journals, web content) that contain content similar to that found in EHR sections. Our approach uses data from a different kind of source where labels are provided without the need of a time-consuming annotation effort. We use this data to train two models: an RNN and a BERT-based model. We apply the learned models along with source data via transfer learning to predict sections in EHRs. Our results show that medical literature can provide helpful supervision signal for this classification task.

4. Medical relation extraction

Leveraging Dependency Forest for Neural Medical Relation Extraction

Paper link

Summary: published at EMNLP 2019. The paper uses dependency forests to raise the recall of dependency relations in medical sentences, at the cost of introducing some noise, and extracts features with a graph recurrent network. It offers one way to use dependency information in medical relation extraction while reducing error propagation.

Abstract: Medical relation extraction discovers relations between entity mentions in text, such as research articles. For this task, dependency syntax has been recognized as a crucial source of features. Yet in the medical domain, 1-best parse trees suffer from relatively low accuracies, diminishing their usefulness. We investigate a method to alleviate this problem by utilizing dependency forests. Forests contain more than one possible decisions and therefore have higher recall but more noise compared with 1-best outputs. A graph neural network is used to represent the forests, automatically distinguishing the useful syntactic information from parsing noise. Results on two benchmarks show that our method outperforms the standard tree-based methods, giving the state-of-the-art results in the literature.

5. Medical knowledge graphs

Learning a Health Knowledge Graph from Electronic Medical Records

Paper link

Summary: published in Nature's Scientific Reports (2017). A disease-symptom knowledge graph built from more than 270,000 electronic medical records.

Abstract: Demand for clinical decision support systems in medicine and self-diagnostic symptom checkers has substantially increased in recent years. Existing platforms rely on knowledge bases manually compiled through a labor-intensive process or automatically derived using simple pairwise statistics. This study explored an automated process to learn high quality knowledge bases linking diseases and symptoms directly from electronic medical records. Medical concepts were extracted from 273,174 de-identified patient records and maximum likelihood estimation of three probabilistic models was used to automatically construct knowledge graphs: logistic regression, naive Bayes classifier and a Bayesian network using noisy OR gates. A graph of disease-symptom relationships was elicited from the learned parameters and the constructed knowledge graphs were evaluated and validated, with permission, against Google’s manually-constructed knowledge graph and against expert physician opinions. Our study shows that direct and automated construction of high quality health knowledge graphs from medical records using rudimentary concept extraction is feasible. The noisy OR model produces a high quality knowledge graph reaching precision of 0.85 for a recall of 0.6 in the clinical evaluation. Noisy OR significantly outperforms all tested models across evaluation frameworks (p < 0.01).
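For readers unfamiliar with the noisy OR gate mentioned in the abstract, the sketch below shows the textbook form of the gate: a symptom is absent only if every present disease independently fails to trigger it and a background "leak" cause also stays off, so P(symptom) = 1 - (1 - leak) * prod(1 - p_i). This is the generic formulation, not the paper's fitted model; the function name and the probabilities are hypothetical.

```python
from typing import Iterable


def noisy_or(activation_probs: Iterable[float], leak: float = 0.0) -> float:
    """Probability that a symptom is present under a noisy-OR gate.

    activation_probs: for each disease that is present, the probability that
    it alone would trigger the symptom (hypothetical values here).
    leak: probability that the symptom appears with no modeled cause.
    """
    p_absent = 1.0 - leak
    for p in activation_probs:
        p_absent *= 1.0 - p  # each present disease independently fails to fire
    return 1.0 - p_absent


# Two present diseases, each triggering the symptom half of the time.
print(noisy_or([0.5, 0.5]))
```

Because each disease contributes a single parameter per symptom, the gate keeps the number of parameters linear in the number of parents, which is one reason it is popular for disease-symptom graphs.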

6. Computer-aided diagnosis

Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence

Paper link

Summary: a joint work by Guangzhou Women and Children's Medical Center, YITU Healthcare (依圖醫療), and other companies and research institutions. It uses machine-learning-based natural language processing (NLP) to reach diagnostic performance on par with human physicians and can be applied in multiple scenarios. It is reported to be the first study published in a top medical journal on clinical diagnosis from electronic health records (EHRs) using NLP, and a major result in AI-based diagnosis of pediatric diseases.

Abstract: Artificial intelligence (AI)-based methods have emerged as powerful tools to transform medical care. Although machine learning classifiers (MLCs) have already demonstrated strong performance in image-based diagnoses, analysis of diverse and massive electronic health record (EHR) data remains challenging. Here, we show that MLCs can query EHRs in a manner similar to the hypothetico-deductive reasoning used by physicians and unearth associations that previous statistical methods have not found. Our model applies an automated natural language processing system using deep learning techniques to extract clinically relevant information from EHRs. In total, 101.6 million data points from 1,362,559 pediatric patient visits presenting to a major referral center were analyzed to train and validate the framework. Our model demonstrates high diagnostic accuracy across multiple organ systems and is comparable to experienced pediatricians in diagnosing common childhood diseases. Our study provides a proof of concept for implementing an AI-based system as a means to aid physicians in tackling large amounts of data, augmenting diagnostic evaluations, and to provide clinical decision support in cases of diagnostic uncertainty or complexity. Although this impact may be most evident in areas where healthcare providers are in relative shortage, the benefits of such an AI system are likely to be universal.

7. ACL 2020 papers related to the medical domain

A Generate-and-Rank Framework with Semantic Type Regularization for Biomedical Concept Normalization

Paper link

Biomedical Entity Representations with Synonym Marginalization

Paper link

Document Translation vs. Query Translation for Cross-Lingual Information Retrieval in the Medical Domain

Paper link

MIE: A Medical Information Extractor towards Medical Dialogues

Paper link

Rationalizing Medical Relation Prediction from Corpus-level Statistics

Paper link

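8. Medical entity linking (normalization)

Medical Entity Linking using Triplet Network

Paper link

Summary: published at ACL 2019. The paper studies linking of disease entities. It trains on triplets (mention, positive, negative) with the goal that distance(mention, negative) - distance(mention, positive) > alpha, the classic scheme from face recognition; see the paper for the exact loss function. The paper has two main parts: 1) candidate generation: for a given mention, compute cosine similarity and Jaccard overlap scores against the standard disease set (standard terms and their synonyms) and keep the top K as candidates; 2) a network architecture based on the Triplet Network. See the paper for details.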
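The two parts described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: candidates are scored with Jaccard overlap on character bigrams only (the cosine score is omitted for brevity), distances are Euclidean over hand-written toy vectors instead of learned embeddings, and the hinge form `max(0, d_pos - d_neg + alpha)` is the standard triplet loss that is zero exactly when the margin condition holds. All names and the term list are hypothetical.

```python
import math
from typing import List, Sequence, Set


def char_ngrams(text: str, n: int = 2) -> Set[str]:
    """Set of lowercase character n-grams of a term."""
    text = text.lower()
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_k_candidates(mention: str, names: Sequence[str], k: int = 3) -> List[str]:
    """Rank standard terms by n-gram overlap with the mention and keep the top k."""
    grams = char_ngrams(mention)
    ranked = sorted(names, key=lambda name: jaccard(grams, char_ngrams(name)), reverse=True)
    return ranked[:k]


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def triplet_loss(anchor, positive, negative, alpha: float = 0.2) -> float:
    """Zero when d(anchor, negative) - d(anchor, positive) >= alpha."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + alpha)


names = ["diabetes mellitus type 2", "type 1 diabetes mellitus", "hypertension"]
candidates = top_k_candidates("type 2 diabetes", names, k=2)
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])  # positive is closer
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])   # negative is closer
print(candidates, satisfied, violated)
```

Restricting the expensive neural comparison to a short candidate list is what makes the approach tractable when the standard vocabulary is large.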

A Generate-and-Rank Framework with Semantic Type Regularization for Biomedical Concept Normalization

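Paper link

Summary: published at ACL 2020. The method is based on list-wise learning to rank and has two main parts: candidate generation and BERT-based list-wise ranking. Two notable ideas: 1) during sample generation, standard terms are expanded with their synonyms; 2) semantic type regularization is added to the loss. See the paper for details.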
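To make "list-wise" concrete, the sketch below shows one common list-wise objective: a softmax over the scores of all candidates for a mention, with the loss being the negative log-probability of the gold concept. This is a generic formulation chosen for illustration; the paper's exact loss and its semantic type regularizer are not reproduced here, and the scores are made-up numbers standing in for BERT outputs.

```python
import math
from typing import Sequence


def listwise_softmax_loss(scores: Sequence[float], gold_index: int) -> float:
    """Negative log-probability of the gold candidate under a softmax over the list."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[gold_index]


# Hypothetical ranker scores for three candidate concepts; index 0 is the gold one.
confident = listwise_softmax_loss([4.0, 1.0, 0.5], gold_index=0)
uncertain = listwise_softmax_loss([1.0, 1.0, 1.0], gold_index=0)
print(confident, uncertain)
```

Unlike a pairwise loss, this objective looks at the whole candidate list at once, so raising the score of any wrong candidate increases the loss for the mention.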

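Chinese medical corpora

Medical textbooks and training exams

Note: for copyright reasons, a Baidu Pan download link can no longer be provided; please download from the original Douban link instead.

Corpus description: compiled from this Douban link.

HIT Big Cilin (《大詞林》): 750,000 core entity words with related concepts and relation lists released (includes the Chinese medicine, hospital, and biology categories)

Corpus description: Harbin Institute of Technology (HIT) released 750,000 core entity words from Big Cilin, the fine-grained concept words for these entities (18,000 concept words and 3 million entity-concept pairs in total), and the related relation triples (3 million in total). The 750,000 core entities cover common terms such as person names, place names, and object names. The concept word list contains fine-grained concept information for the entities. With its fine-grained hypernym hierarchy and rich relations between entities, the released data can support applications such as human-machine dialogue and intelligent recommendation.

Official corpus download link

Note: according to online searches, these resources appear to have been licensed for paid use by some companies and may have copyright issues, so all download links are now dead. This entry will be updated if the data is released again.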

Medical embeddings

Open-source English medical embeddings

Project description: published at AMIA 2016. Open-source embeddings of medical concepts.

Project link

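Open-source toolkits

Word segmentation tool: PKUSEG

Project link

Project description: a multi-domain Chinese word segmentation tool from Peking University; a medical-domain model can be selected.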

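Industrial products and solutions

靈醫智慧 (Lingyi Zhihui)

左手醫生 (Zuoshou Yisheng)

Blog posts

Lessons learned from building natural language processing systems in the medical domain

Related links

awesome_Chinese_medical_NLP

Chinese NLP dataset search

medical-data (a large collection of medical data)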


