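hanlp-ext plugin source code: or
By default, Elasticsearch segments Chinese text character by character, which falls well short of what word-based search requires. There is an official Chinese analysis plugin, SmartCN, and the IK analysis plugin is also widely used. Here, however, we use HanLP, a natural language processing toolkit, for Chinese word segmentation.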
Elasticsearch
Elasticsearch's default segmentation of Chinese text is dreadful.
GET /_analyze?pretty
{
  "text": ["重庆华龙网海数科技有限公司"]
}
Output:
{
  "tokens": [
    { "token": "重", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "庆", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "华", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "龙", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "网", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 },
    { "token": "海", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5 },
    { "token": "数", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6 },
    { "token": "科", "start_offset": 7, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 7 },
    { "token": "技", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 8 },
    { "token": "有", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 9 },
    { "token": "限", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 10 },
    { "token": "公", "start_offset": 11, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 11 },
    { "token": "司", "start_offset": 12, "end_offset": 13, "type": "<IDEOGRAPHIC>", "position": 12 }
  ]
}
As you can see, the default analyzer splits the text into single characters.
elasticsearch-hanlp
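HanLP
HanLP is an excellent natural language processing toolkit implemented in Java, with the following features:
Chinese word segmentation
Part-of-speech tagging
Named entity recognition
Keyword extraction
Automatic summarization
Phrase extraction
Pinyin conversion
Simplified/Traditional Chinese conversion
Text recommendation
Dependency parsing
Corpus tools
After installing the elasticsearch-hanlp plugin (for installation, see: ), let's look at the segmentation results again.
GET /_analyze?pretty
{
  "analyzer": "hanlp",
  "text": ["重庆华龙网海数科技有限公司"]
}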
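The same request can also be issued from a script rather than a console. The following is a minimal sketch in Python using only the standard library. The base URL `http://localhost:9200` and the helper names `build_analyze_request` and `analyze` are assumptions made for illustration, not part of the plugin. It sends the body with POST, which the `_analyze` endpoint accepts alongside GET.

```python
import json
import urllib.request

# Assumption: a local single-node cluster without authentication.
DEFAULT_BASE_URL = "http://localhost:9200"


def build_analyze_request(text, analyzer=None, base_url=DEFAULT_BASE_URL):
    """Build (but do not send) a request to the _analyze endpoint."""
    body = {"text": [text]}
    if analyzer is not None:
        # "hanlp" is the analyzer name used in the console request above.
        body["analyzer"] = analyzer
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, analyzer=None, base_url=DEFAULT_BASE_URL):
    """Send the request and return the parsed JSON response as a dict."""
    request = build_analyze_request(text, analyzer, base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `analyze("重庆华龙网海数科技有限公司", analyzer="hanlp")` sends the same body as the console request, and omitting `analyzer` reproduces the earlier request against the default analyzer.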
Output:
{
  "tokens": [
    { "token": "重庆", "start_offset": 0, "end_offset": 2, "type": "ns", "position": 0 },
    { "token": "华龙网", "start_offset": 2, "end_offset": 5, "type": "nr", "position": 1 },
    { "token": "海数", "start_offset": 5, "end_offset": 7, "type": "nr", "position": 2 },
    { "token": "科技", "start_offset": 7, "end_offset": 9, "type": "n", "position": 3 },
    { "token": "有限公司", "start_offset": 9, "end_offset": 13, "type": "nis", "position": 4 }
  ]
}
HanLP offers far more than basic Chinese word segmentation, and many of its features can be integrated into Elasticsearch.
This article comes from 羊八井's blog.
Reposted from: http://mevwa.baihongyu.com/