NLP Tools: deepnlp
July 9, 2011
Introduction to DeepNLP
The deepnlp project is a Python NLP toolkit built on the TensorFlow platform. Its goal is to combine the modules of the TensorFlow deep learning platform with some of the latest algorithms, providing the core building blocks of NLP and supporting extension to more complex tasks such as abstractive summarization.
NLP Toolkit Modules
- Word segmentation / tokenization
- Part-of-speech tagging (POS)
- Named entity recognition (NER)
- Dependency parsing (Parse)
- Abstractive text summarization: Textsum (Seq2Seq-Attention)
- Key sentence extraction: TextRank
- Text classification: TextCNN (WIP)
- Callable via a RESTful web API (a hypothetical call sketch follows this list)
- Planned: syntactic parsing (Parsing)
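The RESTful API lets you call the modules over HTTP without installing the models locally. The endpoint, parameters, and response format in this sketch are assumptions for illustration only; the project documentation is the authority on the real interface:

```python
import requests

# Hypothetical endpoint and parameters -- the actual URL, authentication
# scheme, and response schema must be taken from the deepnlp docs.
resp = requests.get(
    "http://www.deepnlp.org/api/v1.0/pos",           # assumed URL
    params={"lang": "zh", "text": "我爱吃北京烤鸭"},   # assumed parameters
    timeout=10,
)
print(resp.json())  # assumed to return JSON tagging results
```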
Algorithm Implementations
- Word segmentation: linear-chain CRF, built on the CRF++ package
- POS tagging: unidirectional LSTM / bidirectional BI-LSTM, implemented in TensorFlow
- Named entity recognition: unidirectional LSTM / bidirectional BI-LSTM / combined LSTM-CRF network, implemented in TensorFlow (an illustrative sketch follows this list)
- Dependency parsing: a neural-network parser based on the arc-standard system
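As a point of reference for the POS/NER entries above, here is a minimal tf.keras sketch of a BI-LSTM sequence tagger. It is not deepnlp's internal code; vocabulary size, tag count, and layer dimensions are made-up placeholders:

```python
import tensorflow as tf

# Placeholder sizes -- real values depend on the corpus and tag set.
VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, NUM_TAGS = 50000, 100, 128, 30

model = tf.keras.Sequential([
    # Map token ids to dense vectors; id 0 is reserved for padding.
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True),
    # Read each sentence in both directions and concatenate the states.
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True)),
    # Independent per-token softmax over the tag set; the LSTM-CRF
    # variant would replace this with a CRF layer that scores whole
    # tag sequences instead of individual tokens.
    tf.keras.layers.Dense(NUM_TAGS, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

Training such a tagger amounts to feeding padded token-id sequences and per-token tag ids to `model.fit`.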
Pre-trained Models
- Chinese: trained on a mixed corpus of People's Daily and Weibo text: word segmentation, POS tagging, entity recognition
Installing DeepNLP
Installation:

```bash
pip install deepnlp
```

Download the models:
```python
import deepnlp

# Download all the modules
deepnlp.download()

# Download specific module
deepnlp.download('segment')
deepnlp.download('pos')
deepnlp.download('ner')
deepnlp.download('parse')

# Download module and domain-specific model
deepnlp.download(module='pos', name='en')
deepnlp.download(module='ner', name='zh_entertainment')
```
Running the sample code below raises the following error:
```python
from deepnlp import segmenter

tokenizer = segmenter.load_model(name='zh_entertainment')
text = "我刚刚在浙江卫视看了电视剧老九门,觉得陈伟霆很帅"
segList = tokenizer.seg(text)
text_seg = " ".join(segList)
```
```
Traceback (most recent call last):
  File "D:/CodeHub/NLP/test_new.py", line 3, in <module>
    from deepnlp import segmenter
  File "D:\CodeHub\NLP\venv\lib\site-packages\deepnlp\segmenter.py", line 6, in <module>
    import CRFPP
ModuleNotFoundError: No module named 'CRFPP'
```
Solution: install CRFPP (the CRF++ Python bindings), which deepnlp's segmenter imports.
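CRFPP is not on PyPI; it is the Python wrapper shipped inside the CRF++ source distribution. On Linux/macOS a typical build looks roughly like this (the 0.58 version number is an assumption; on Windows the wrapper must be compiled with a suitable toolchain):

```bash
# Build and install the CRF++ library itself (version assumed).
tar xzf CRF++-0.58.tar.gz
cd CRF++-0.58
./configure && make && sudo make install

# Build and install the Python wrapper that provides the CRFPP module.
cd python
python setup.py build
sudo python setup.py install
```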
Using DeepNLP
Usage example:
```python
from deepnlp import segmenter, pos_tagger, ner_tagger, nn_parser
from deepnlp import pipeline

# Word segmentation
tokenizer = segmenter.load_model(name='zh')
text = "我爱吃北京烤鸭"
seg_list = tokenizer.seg(text)
text_seg = " ".join(seg_list)
print(text_seg)

# POS tagging
p_tagger = pos_tagger.load_model(name='zh')
tagging = p_tagger.predict(seg_list)
for (w, t) in tagging:
    pair = w + "/" + t
    print(pair)

# Named entity recognition
n_tagger = ner_tagger.load_model(name='zh')  # Base LSTM Based Model
tagset_entertainment = ['city', 'district', 'area']
tagging = n_tagger.predict(seg_list, tagset=tagset_entertainment)
for (w, t) in tagging:
    pair = w + "/" + t
    print(pair)

# Dependency parsing
parser = nn_parser.load_model(name='zh')
words = ['它', '熟悉', '一个', '民族', '的', '历史']
tags = ['r', 'v', 'm', 'n', 'u', 'n']
dep_tree = parser.predict(words, tags)

num_token = dep_tree.count()
print("id\tword\tpos\thead\tlabel")
for i in range(num_token):
    cur_id = int(dep_tree.tree[i + 1].id)
    cur_form = str(dep_tree.tree[i + 1].form)
    cur_pos = str(dep_tree.tree[i + 1].pos)
    cur_head = str(dep_tree.tree[i + 1].head)
    cur_label = str(dep_tree.tree[i + 1].deprel)
    print("%d\t%s\t%s\t%s\t%s" % (cur_id, cur_form, cur_pos, cur_head, cur_label))

# Pipeline
p = pipeline.load_model('zh')
text = "我爱吃北京烤鸭"
res = p.analyze(text)
print(res[0])
print(res[1])
print(res[2])

words = p.segment(text)
pos_tagging = p.tag_pos(words)
ner_tagging = p.tag_ner(words)
print(list(pos_tagging))
print(ner_tagging)
```
Workflow for training your own models: