NLP Tools: deepnlp

Introduction to DeepNLP

The deepnlp project is a Python NLP toolkit built on the TensorFlow platform. It aims to combine TensorFlow's deep learning modules with recent algorithms to provide the basic building blocks of NLP, and to support extensions to more complex tasks such as abstractive summarization.

  • NLP toolkit modules

    • Word segmentation/tokenization (Segment)
    • Part-of-speech tagging (POS)
    • Named entity recognition (NER)
    • Dependency parsing (Parse)
    • Abstractive summarization Textsum (Seq2Seq-Attention)
    • Key sentence extraction Textrank
    • Text classification Textcnn (WIP)
    • Callable Web RESTful API
    • Planned: syntactic parsing (Parsing)
  • Algorithm implementations

    • Word segmentation: linear-chain CRF, implemented with the CRF++ package
    • POS tagging: unidirectional LSTM / bidirectional LSTM (BI-LSTM), implemented in TensorFlow
    • Named entity recognition: unidirectional LSTM / bidirectional LSTM (BI-LSTM) / combined LSTM-CRF network, implemented in TensorFlow
    • Dependency parsing: neural-network parser based on the arc-standard system
  • Pre-trained models

    • Chinese: segmentation, POS tagging, and entity recognition models trained on a mixed corpus of People's Daily and Weibo text
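CRF-based segmenters of this kind typically work as character-level sequence labelers: each character gets a tag such as B/M/E/S (begin/middle/end of a multi-character word, or a single-character word), and the tag sequence is decoded back into words. The helper below is a standalone illustration of that decoding step, not part of deepnlp's API:

```python
def bmes_to_words(chars, tags):
    """Recover word segments from a BMES character tagging.

    B = begin of a multi-character word, M = middle, E = end,
    S = a single-character word on its own.
    """
    words, current = [], []
    for ch, tag in zip(chars, tags):
        current.append(ch)
        if tag in ("E", "S"):   # a word boundary has been reached
            words.append("".join(current))
            current = []
    if current:                 # tolerate a truncated tag sequence
        words.append("".join(current))
    return words

print(" ".join(bmes_to_words(list("我爱吃北京烤鸭"),
                             ["S", "S", "S", "B", "M", "M", "E"])))
# → 我 爱 吃 北京烤鸭
```

The CRF's job is to predict the tag sequence; this decoding step is deterministic once the tags are known.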

Installing DeepNLP

Installation instructions

pip install deepnlp

Download the models:

import deepnlp
# Download all the modules
deepnlp.download()
 
# Download specific module
deepnlp.download('segment')
deepnlp.download('pos')
deepnlp.download('ner')
deepnlp.download('parse')
 
# Download module and domain-specific model
deepnlp.download(module = 'pos', name = 'en') 
deepnlp.download(module = 'ner', name = 'zh_entertainment')

Running the sample code below raises the following error:

from deepnlp import segmenter
 
tokenizer = segmenter.load_model(name='zh_entertainment')
text = "我刚刚在浙江卫视看了电视剧老九门,觉得陈伟霆很帅"
segList = tokenizer.seg(text)
text_seg = " ".join(segList)
Traceback (most recent call last):
  File "D:/CodeHub/NLP/test_new.py", line 3, in <module>
    from deepnlp import segmenter
  File "D:\CodeHub\NLP\venv\lib\site-packages\deepnlp\segmenter.py", line 6, in <module>
    import CRFPP
ModuleNotFoundError: No module named 'CRFPP'

Solution: install CRF++ together with its Python bindings (the CRFPP module).
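Because the CRF++ bindings are a native extension that is easy to leave out, a script can guard the import and report a clearer message than the bare traceback above. A minimal sketch:

```python
# Guard the CRF++ binding import so a missing native module produces
# an actionable message instead of an abrupt ModuleNotFoundError.
try:
    import CRFPP  # Python bindings shipped with the CRF++ package
    HAS_CRFPP = True
except ImportError:
    HAS_CRFPP = False

if not HAS_CRFPP:
    print("CRFPP not found: build and install CRF++ with its Python "
          "bindings before importing deepnlp.segmenter")
```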

Using DeepNLP

Usage example:

from deepnlp import segmenter, pos_tagger, ner_tagger, nn_parser
from deepnlp import pipeline
 
# Word segmentation module
tokenizer = segmenter.load_model(name='zh')
text = "我爱吃北京烤鸭"
seg_list = tokenizer.seg(text)
text_seg = " ".join(seg_list)
print(text_seg)
 
# POS tagging
p_tagger = pos_tagger.load_model(name='zh')
tagging = p_tagger.predict(seg_list)
for (w, t) in tagging:
    pair = w + "/" + t
    print(pair)
 
# Named entity recognition
n_tagger = ner_tagger.load_model(name='zh')  # base LSTM model
tagset_entertainment = ['city', 'district', 'area']
tagging = n_tagger.predict(seg_list, tagset=tagset_entertainment)
for (w, t) in tagging:
    pair = w + "/" + t
    print(pair)
 
# Dependency parsing
parser = nn_parser.load_model(name='zh')
words = ['它', '熟悉', '一个', '民族', '的', '历史']
tags = ['r', 'v', 'm', 'n', 'u', 'n']
dep_tree = parser.predict(words, tags)
num_token = dep_tree.count()
print("id\tword\tpos\thead\tlabel")
for i in range(num_token):
    cur_id = int(dep_tree.tree[i + 1].id)
    cur_form = str(dep_tree.tree[i + 1].form)
    cur_pos = str(dep_tree.tree[i + 1].pos)
    cur_head = str(dep_tree.tree[i + 1].head)
    cur_label = str(dep_tree.tree[i + 1].deprel)
    print("%d\t%s\t%s\t%s\t%s" % (cur_id, cur_form, cur_pos, cur_head, cur_label))
 
# Pipeline
p = pipeline.load_model('zh')
text = "我爱吃北京烤鸭"
res = p.analyze(text)
print(res[0])
print(res[1])
print(res[2])
words = p.segment(text)
pos_tagging = p.tag_pos(words)
ner_tagging = p.tag_ner(words)
print(list(pos_tagging))
print(ner_tagging)
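The NER tagger returns flat (word, tag) pairs, while downstream code usually wants contiguous words with the same tag merged into one entity span. The sketch below shows one way to do that grouping; the tag names in the sample ('person', 'location') and the 'nt' non-entity tag are illustrative placeholders, so check the tagset of the model you actually load:

```python
def group_entities(tagging, outside_tag="nt"):
    """Merge consecutive words sharing the same (non-outside) tag into
    a single entity span. `tagging` is a list of (word, tag) pairs of
    the shape returned by an NER predict call."""
    entities, buf, buf_tag = [], [], None
    for word, tag in tagging:
        if tag == buf_tag and tag != outside_tag:
            buf.append(word)          # extend the current entity span
        else:
            if buf and buf_tag != outside_tag:
                entities.append(("".join(buf), buf_tag))
            buf, buf_tag = [word], tag
    if buf and buf_tag != outside_tag:
        entities.append(("".join(buf), buf_tag))
    return entities

# Hypothetical NER output; tag names are illustrative only.
sample = [("陈", "person"), ("伟", "person"), ("霆", "person"),
          ("看", "nt"), ("浙江", "location"), ("卫视", "location")]
print(group_entities(sample))
# → [('陈伟霆', 'person'), ('浙江卫视', 'location')]
```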

Workflow for training your own models:

Reference: https://github.com/rockingdingo/deepnlp