Compiled by: 深度學習技術前沿
GitHub homepage: https://github.com/lrs1353281004/Chinese_medical_NLP
Chinese_medical_NLP: evaluation datasets, papers, and other resources for medical NLP, with a focus on Chinese.
Chinese evaluation datasets
Chinese medical knowledge graphs
English datasets
Related papers
Chinese medical-domain corpora
Medical embeddings
Open-source toolkits
Industrial products/solutions
Blog posts
Related links
Dataset description:
The Yidu-S4K dataset comes from Task 1 of the CCKS 2019 evaluation, "Named Entity Recognition for Chinese Electronic Medical Records", and comprises two subtasks. 1) Medical named entity recognition: because no publicly available dataset for medical entity recognition in Chinese electronic medical records existed in China, this task was retained for the year; the 2017 dataset was revised and released along with the task. This subtask provides a training set and a test set. 2) Medical entity and attribute extraction (cross-hospital transfer): on top of entity recognition, predefined attributes of the entities are extracted. This is a transfer-learning task: given only a small amount of annotated data from the target domain, systems must perform recognition in the target domain using annotated and unannotated data from other domains. This subtask provides a training set (annotated data from non-target and target domains, plus unannotated data from every domain) and a test set (annotated data from the target domain).
Dataset link
Baidu Pan download: https://pan.baidu.com/s/1QqYtqDwhc_S51F3SYMChBQ
Extraction code: flql
Dataset description:
This dataset comes from a Tianchi competition. It is intended for diabetes literature mining and diabetes knowledge graph construction from diabetes-related textbooks and research papers; participants were asked to design accurate and efficient algorithms for this problem. The first-season task was "entity annotation based on diabetes clinical guidelines and research papers", and the second-season task was "inter-entity relation construction based on diabetes clinical guidelines and research papers".
The officially released data contain only the training set; the test set used for the final ranking was not released.
Dataset link
Baidu Pan download: https://pan.baidu.com/s/1CWKblBNBqR-vs2h0xiXSdQ
Extraction code: 0c54
Dataset description:
The Yidu-N4K dataset comes from Task 1 of the CHIP 2019 evaluation, "clinical term normalization". Term normalization is an indispensable task in medical statistics: in clinical practice, the same diagnosis, operation, drug, examination, lab test, or symptom may be written in hundreds or thousands of different ways, and normalization maps each variant to its standard form. Only with normalized terminology can researchers run downstream statistical analyses on electronic medical records. At its core, clinical term normalization is a semantic similarity matching task, but because the surface forms of raw terms are so diverse, a single matching model rarely performs well.
Dataset link
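As the description above notes, normalization is essentially similarity matching. A minimal retrieval sketch, using a tiny hypothetical standard vocabulary (not the official baseline), matches raw mentions to standard terms by character n-gram TF-IDF cosine similarity:

```python
# A minimal lexical-matching sketch for clinical term normalization.
# The standard_terms list below is a made-up toy vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

standard_terms = ["肺炎", "支氣管炎", "上呼吸道感染"]  # hypothetical standard vocabulary
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
term_matrix = vectorizer.fit_transform(standard_terms)

def normalize(mention: str, top_k: int = 3):
    """Return the top-k standard terms most similar to a raw clinical mention."""
    scores = cosine_similarity(vectorizer.transform([mention]), term_matrix)[0]
    ranked = scores.argsort()[::-1][:top_k]
    return [(standard_terms[i], float(scores[i])) for i in ranked]

print(normalize("肺部發炎"))  # expected to rank 肺炎 first
```

A purely lexical matcher like this is exactly the kind of single model the description warns is insufficient, which motivates the learned entity-linking and ranking approaches in the papers listed later.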
Dataset description:
A Chinese medical question answering dataset with more than 100,000 entries.
Data layout:
questions.csv: all questions and their content. answers.csv: answers to all questions. train_candidates.txt, dev_candidates.txt, test_candidates.txt: splits of the two files above.
Dataset link
Dataset GitHub page
Dataset description:
This is Task 2 of the CHIP 2019 evaluation, hosted by Ping An Healthcare Technology (conference details: http://cips-chip.org.cn/evaluation). Transfer learning is an important part of natural language processing: it aims to improve learning on a new task by transferring knowledge from related, already-learned tasks, thereby improving generalization. The goal of this task is cross-disease transfer learning on Chinese disease question-answering data. Concretely, given question pairs from five different diseases, systems must decide whether the two sentences have the same or similar meaning. All text comes from real patient questions on the internet, filtered and manually annotated for intent matching.
Dataset link (registration required)
Dataset description:
The competition data consist of de-identified medical question pairs with labels, covering 10 diseases: pneumonia, mycoplasma pneumonia, bronchitis, upper respiratory tract infection, pulmonary tuberculosis, asthma, pleurisy, emphysema, the common cold, and hemoptysis. The data are split into three files: participants receive the training set train.csv and validation set dev.csv, while the test set test.csv is withheld. Each record consists of Category, Query1, Query2, and Label: the question category, the two questions, and a label indicating whether the two questions have the same meaning (1 if same, 0 if not). Labels are known for the training set and unknown for the validation and test sets. A loading sketch follows the links below.
Example:
Category: pneumonia; Query1: 肺部發炎是什麼原因引起的?Query2: 肺部發炎是什麼引起的; Label: 1
Category: pneumonia; Query1: 肺部發炎是什麼原因引起的?Query2: 肺部炎症有什麼症狀; Label: 0
Dataset link (registration required)
4th-place solution and code
1st-place solution and code
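Given the file layout and the Category/Query1/Query2/Label schema described above, a minimal loading sketch (file names as given in the description; pandas usage is illustrative):

```python
# A minimal sketch of loading the competition files described above.
import pandas as pd

train = pd.read_csv("train.csv")  # Label column is populated
dev = pd.read_csv("dev.csv")      # Label is withheld from participants

pairs = list(zip(train["Query1"], train["Query2"], train["Label"]))
print(train["Category"].value_counts())  # distribution over the 10 diseases
print(len(pairs), len(dev))
```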
Data description: Chinese doctor-patient dialogue data from an online medical consultation product.
原始描述:The MedDialog dataset contains conversations (in Chinese) between doctors and patients. It has 1.1 million dialogues and 4 million utterances. The data is continuously growing and more dialogues will be added. The raw dialogues are from haodf.com. All copyrights of the data belong to haodf.com.
Project page
Baidu Pan download: https://pan.baidu.com/s/1ZwzNgvAAMQk4klerTspsoA
Extraction code: lbo4
Data description: medical question answering data covering six departments; the original source is unknown.
Project page
Link
Overview: CMeKG (Chinese Medical Knowledge Graph) is a Chinese medical knowledge graph built with natural language processing and text mining over large-scale medical text, combining automated processing with human curation. Its construction draws on authoritative international medical standards such as ICD, ATC, SNOMED, and MeSH, as well as large, multi-source, heterogeneous medical texts including clinical guidelines, industry standards, diagnosis and treatment protocols, and medical encyclopedias. CMeKG 1.0 contains structured descriptions of 6,310 diseases, 19,853 drugs (Western medicines, Chinese patent medicines, and Chinese herbal medicines), and 1,237 diagnostic and treatment techniques and devices. It covers more than 30 common relation types, such as a disease's clinical symptoms, affected body site, drug treatments, surgical treatments, differential diagnoses, imaging examinations, risk factors, transmission routes, susceptible populations, and relevant departments, along with a drug's ingredients, indications, dosage and administration, shelf life, and contraindications. In total, CMeKG describes over one million concept-relation instances and attribute triples.
Dataset description: a medical question answering dataset built from PubMed. PubMedQA has 1k expert-annotated, 61.2k unlabeled, and 211.3k artificially generated QA instances.
Paper link
Note: no open-source pre-trained models for the Chinese medical domain have been found so far; the English papers below are listed for reference.
Paper: BioBERT: a pre-trained biomedical language representation model for biomedical text mining
Paper link
Project page
Summary: initialized from general-domain pre-trained BERT weights and further trained on a large corpus of English biomedical papers from PubMed; it surpasses state-of-the-art models on multiple medical downstream tasks.
Abstract:
Motivation: Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora. Results: We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts. Availability and implementation: We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and the source code for fine-tuning BioBERT available at https://github.com/dmis-lab/biobert.
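For readers who want to try BioBERT, here is a hedged usage sketch via Hugging Face transformers; the hub id dmis-lab/biobert-base-cased-v1.1 is an assumption on my part (the abstract itself points to github.com/naver/biobert-pretrained for the official weights):

```python
# A sketch of loading BioBERT with Hugging Face transformers.
# The model id below is assumed, not taken from the paper.
from transformers import AutoModel, AutoTokenizer

model_id = "dmis-lab/biobert-base-cased-v1.1"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("EGFR mutations predict response to gefitinib.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768) for a base-size model
```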
Paper: SCIBERT: A Pretrained Language Model for Scientific Text
Paper link
Project page
Summary: from the AllenAI team; a science-domain BERT trained on 1.1M+ papers from Semantic Scholar.
Abstract: Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SCIBERT, a pretrained language model based on BERT (Devlin et al., 2019) to address the lack of high-quality, large-scale labeled scientific data. SCIBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.
Paper: Publicly Available Clinical BERT Embeddings
Paper link
Project page
Summary: from the NAACL Clinical NLP Workshop 2019; a clinical-domain BERT trained on 2 million medical records from the MIMIC-III database.
論文摘要:Contextual word embedding models such as ELMo and BERT have dramatically improved performance for many natural language processing (NLP) tasks in recent months. However, these models have been minimally explored on specialty corpora, such as clinical text; moreover, in the clinical domain, no publicly-available pre-trained BERT models yet exist. In this work, we address this need by exploring and releasing BERT models for clinical text: one for generic clinical text and another for discharge summaries specifically. We demonstrate that using a domain-specific model yields performance improvements on 3/5 clinical NLP tasks, establishing a new state-of-the-art on the MedNLI dataset. We find that these domain-specific models are not as performant on 2 clinical de-identification tasks, and argue that this is a natural consequence of the differences between de-identified source text and synthetically non de-identified task text.
Paper: ClinicalBert: Modeling Clinical Notes and Predicting Hospital Readmission
Paper link
Project page
Summary: also based on the MIMIC-III database, but trained on a random sample of only 100,000 medical records.
Abstract: Clinical notes contain information about patients that goes beyond structured data like lab values and medications. However, clinical notes have been underused relative to structured data, because notes are high-dimensional and sparse. This work develops and evaluates representations of clinical notes using bidirectional transformers (ClinicalBert). ClinicalBert uncovers high-quality relationships between medical concepts as judged by humans. ClinicalBert outperforms baselines on 30-day hospital readmission prediction using both discharge summaries and the first few days of notes in the intensive care unit. Code and model parameters are available.
Paper: BEHRT: TRANSFORMER FOR ELECTRONIC HEALTH RECORDS
Paper link
Project page: not yet open-sourced
Summary: in this paper, embeddings are trained over medical entities rather than words.
論文摘要:Today, despite decades of developments in medicine and the growing interest in precision healthcare, vast majority of diagnoses happen once patients begin to show noticeable signs of illness. Early indication and detection of diseases, however, can provide patients and carers with the chance of early intervention, better disease management, and efficient allocation of healthcare resources. The latest developments in machine learning (more specifically, deep learning) provides a great opportunity to address this unmet need. In this study, we introduce BEHRT: A deep neural sequence transduction model for EHR (electronic health records), capable of multitask prediction and disease trajectory mapping. When trained and evaluated on the data from nearly 1.6 million individuals, BEHRT shows a striking absolute improvement of 8.0-10.8%, in terms of Average Precision Score, compared to the existing state-of-the-art deep EHR models (in terms of average precision, when predicting for the onset of 301 conditions). In addition to its superior prediction power, BEHRT provides a personalised view of disease trajectories through its attention mechanism; its flexible architecture enables it to incorporate multiple heterogeneous concepts (e.g., diagnosis, medication, measurements, and more) to improve the accuracy of its predictions; and its (pre-)training results in disease and patient representations that can help us get a step closer to interpretable predictions.
Paper: A guide to deep learning in healthcare
Paper link
Summary: published in Nature Medicine; a survey of applications of computer vision, NLP, reinforcement learning, and other methods in medicine.
Abstract: Here we present deep-learning techniques for healthcare, centering our discussion on deep learning in computer vision, natural language processing, reinforcement learning, and generalized methods. We describe how these computational techniques can impact a few key areas of medicine and explore how to build end-to-end systems. Our discussion of computer vision focuses largely on medical imaging, and we describe the application of natural language processing to domains such as electronic health record data. Similarly, reinforcement learning is discussed in the context of robotic-assisted surgery, and generalized deep-learning methods for genomics are reviewed.
Paper link
Summary: published at EMNLP 2019; transfer learning for EHR-related tasks using a small amount of in-domain data and a large amount of out-of-domain data.
Abstract: …sections such as Assessment and Plan, Social History, and Medications. These sections help physicians find information easily and can be used by an information retrieval system to return specific information sought by a user. However, it is common that the exact format of sections in a particular EHR does not adhere to known patterns. Therefore, being able to predict sections and headers in EHRs automatically is beneficial to physicians. Prior approaches in EHR section prediction have only used text data from EHRs and have required significant manual annotation. We propose using sections from medical literature (e.g., textbooks, journals, web content) that contain content similar to that found in EHR sections. Our approach uses data from a different kind of source where labels are provided without the need of a time-consuming annotation effort. We use this data to train two models: an RNN and a BERT-based model. We apply the learned models along with source data via transfer learning to predict sections in EHRs. Our results show that medical literature can provide helpful supervision signal for this classification task.
Paper link
Summary: published at EMNLP 2019; uses dependency forests to improve recall of dependency relations in medical sentences (at the cost of some added noise), with a graph recurrent network for feature extraction. It offers one way to exploit dependency information in medical relation extraction while limiting error propagation.
Abstract: Medical relation extraction discovers relations between entity mentions in text, such as research articles. For this task, dependency syntax has been recognized as a crucial source of features. Yet in the medical domain, 1-best parse trees suffer from relatively low accuracies, diminishing their usefulness. We investigate a method to alleviate this problem by utilizing dependency forests. Forests contain more than one possible decision and therefore have higher recall but more noise compared with 1-best outputs. A graph neural network is used to represent the forests, automatically distinguishing the useful syntactic information from parsing noise. Results on two benchmarks show that our method outperforms the standard tree-based methods, giving the state-of-the-art results in the literature.
Paper link
Summary: published in Nature Scientific Reports (2017); a disease-symptom knowledge graph built from more than 270,000 electronic medical records.
論文摘要:Demand for clinical decision support systems in medicine and self-diagnostic symptom checkers has substantially increased in recent years. Existing platforms rely on knowledge bases manually compiled through a labor-intensive process or automatically derived using simple pairwise statistics. This study explored an automated process to learn high quality knowledge bases linking diseases and symptoms directly from electronic medical records. Medical concepts were extracted from 273,174 de-identified patient records and maximum likelihood estimation of three probabilistic models was used to automatically construct knowledge graphs: logistic regression, naive Bayes classifier and a Bayesian network using noisy OR gates. A graph of disease-symptom relationships was elicited from the learned parameters and the constructed knowledge graphs were evaluated and validated, with permission, against Google’s manually-constructed knowledge graph and against expert physician opinions. Our study shows that direct and automated construction of high quality health knowledge graphs from medical records using rudimentary concept extraction is feasible. The noisy OR model produces a high quality knowledge graph reaching precision of 0.85 for a recall of 0.6 in the clinical evaluation. Noisy OR significantly outperforms all tested models across evaluation frameworks (p < 0.01).
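The noisy-OR gate highlighted in the abstract has a simple closed form: a symptom is absent only if every present disease independently fails to cause it and the leak cause is also absent. A small illustrative sketch (parameter values are made up, not from the paper):

```python
# Noisy-OR: each present disease i causes the symptom with probability p_i,
# independently; a small "leak" term covers causes outside the model.
def noisy_or(present_disease_probs, leak=0.01):
    """P(symptom present) under a noisy-OR gate."""
    q = 1.0 - leak  # probability the leak cause is absent
    for p in present_disease_probs:
        q *= 1.0 - p  # probability disease i fails to cause the symptom
    return 1.0 - q

# e.g. two present diseases with illustrative activation probs 0.7 and 0.2:
print(noisy_or([0.7, 0.2]))  # 0.7624 with leak=0.01
```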
Paper link
Summary: a joint study by the Guangzhou Women and Children's Medical Center, Yitu Healthcare, and other companies and research institutes. Using machine-learning-based NLP, the system achieves diagnostic performance on par with human physicians and supports multiple application scenarios. According to the report, this is the first study published in a top medical journal on NLP-based clinical diagnosis from electronic health records (EHRs), and a landmark result for AI-based diagnosis of pediatric diseases.
論文摘要:Artificial intelligence (AI)-based methods have emerged as powerful tools to transform medical care. Although machine learning classifiers (MLCs) have already demonstrated strong performance in image-based diagnoses, analysis of diverse and massive electronic health record (EHR) data remains challenging. Here, we show that MLCs can query EHRs in a manner similar to the hypothetico-deductive reasoning used by physicians and unearth associations that previous statistical methods have not found. Our model applies an automated natural language processing system using deep learning techniques to extract clinically relevant information from EHRs. In total, 101.6 million data points from 1,362,559 pediatric patient visits presenting to a major referral center were analyzed to train and validate the framework. Our model demonstrates high diagnostic accuracy across multiple organ systems and is comparable to experienced pediatricians in diagnosing common childhood diseases. Our study provides a proof of concept for implementing an AI-based system as a means to aid physicians in tackling large amounts of data, augmenting diagnostic evaluations, and to provide clinical decision support in cases of diagnostic uncertainty or complexity. Although this impact may be most evident in areas where healthcare providers are in relative shortage, the benefits of such an AI system are likely to be universal.
Paper link
Paper link
Paper link
Paper link
Paper link
Paper link
Summary: published at ACL 2019; on disease entity linking. It trains on triples of (mention, positive example, negative example), with the objective that distance(mention, negative) - distance(mention, positive) > alpha, the classic face-recognition formulation (see the paper for the exact loss). The approach has two parts: 1) candidate generation: for a given mention, compute cosine similarity and Jaccard overlap scores against the standard disease vocabulary (standard terms and synonyms) and keep the top-K as candidates; 2) a network based on a Triplet Network. See the paper for details.
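The margin objective described above is the standard triplet hinge loss, max(0, alpha + d(mention, positive) - d(mention, negative)). A minimal PyTorch sketch with a stand-in encoder (the paper's actual encoder, feature sizes, and margin differ):

```python
# A sketch of the triplet margin objective for entity linking.
# The linear "encoder" and all dimensions are illustrative stand-ins.
import torch
import torch.nn as nn

embed_dim, alpha = 128, 0.5
encoder = nn.Linear(300, embed_dim)          # stand-in for the real encoder
criterion = nn.TripletMarginLoss(margin=alpha, p=2)

mention = encoder(torch.randn(8, 300))       # batch of mention features
positive = encoder(torch.randn(8, 300))      # matching standard disease terms
negative = encoder(torch.randn(8, 300))      # non-matching top-K candidates
loss = criterion(mention, positive, negative)  # hinge on alpha + d(a,p) - d(a,n)
loss.backward()
```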
Paper link
Summary: published at ACL 2020; a list-wise learning-to-rank approach with two parts: candidate generation and BERT-based list-wise ranking. Two notable ideas: 1) during sample generation, standard terms are expanded with synonyms; 2) a semantic-type regularization term is added to the loss. See the paper for details.
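For contrast with the pairwise triplet setup above, a list-wise ranker scores all candidates for a mention jointly and normalizes over the list. A minimal sketch with a stand-in scorer (the paper scores with BERT and adds the semantic-type regularizer on top):

```python
# A sketch of list-wise ranking over candidate standard terms:
# softmax over per-candidate scores, cross-entropy against the gold candidate.
# The random scores stand in for a BERT-based (mention, candidate) scorer.
import torch
import torch.nn.functional as F

num_candidates = 16
scores = torch.randn(num_candidates, requires_grad=True)  # scorer output per candidate
gold = torch.tensor(3)                                     # index of the correct term

loss = F.cross_entropy(scores.unsqueeze(0), gold.unsqueeze(0))
loss.backward()
```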
Note: for copyright reasons the Baidu Pan download link can no longer be provided; please download from the original Douban link.
Corpus note: compiled from this Douban link.
Corpus note: HIT (Harbin Institute of Technology) has open-sourced 750,000 core entity terms from BigCilin (《大詞林》), together with their fine-grained concept terms (18,000 concept terms and 3 million entity-concept tuples) and related relation triples (3 million in total). The 750,000 core entities cover common terms such as person names, place names, and object names, and the concept list carries fine-grained entity-concept information. With its fine-grained hypernym hierarchy and rich inter-entity relations, the released data can support applications such as human-machine dialogue and intelligent recommendation.
Official corpus download link
Note: judging from online searches, these resources appear to have been licensed to companies for a fee, and there may be copyright issues, so the download links are now all dead. This will be updated if open-source versions appear.
Project description: published at AMIA 2016; open-source embeddings of medical concepts.
Project page
Project page
Project description: a multi-domain Chinese word segmentation toolkit from Peking University, with support for the medical domain; see the usage sketch below.
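A minimal usage sketch, following the pkuseg project README (the medicine model is downloaded automatically on first use; the sample sentence and segmentation shown are illustrative):

```python
# A sketch of medical-domain segmentation with pkuseg (pip install pkuseg).
import pkuseg

seg = pkuseg.pkuseg(model_name="medicine")  # load the medical-domain model
print(seg.cut("患者出現上呼吸道感染伴發熱"))
# e.g. ['患者', '出現', '上呼吸道感染', '伴', '發熱']
```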
靈醫智慧
左手醫生
Lessons learned from building NLP systems in the medical domain
awesome_Chinese_medical_NLP
Chinese NLP dataset search
medical-data (a large collection of medical-related data)