Notes on Wrestling with the HanLP ES Plugin

This post is a short record of the pitfalls I ran into while setting up a HanLP plugin for Elasticsearch (ES).

HanLP's ES plugins

Most of the ES plugins for HanLP are listed on this page of the HanLP wiki:

https://github.com/hankcs/HanLP/wiki/%E8%A1%8D%E7%94%9F%E9%A1%B9%E7%9B%AE

The one I picked for this round is https://github.com/shikeio/elasticsearch-analysis-hanlp

Prerequisites

  • Elasticsearch 6.2.4
  • Kibana 6.2.4
  • The plugin's prebuilt release, analysis-hanlp-6.2.4.zip
  • The HanLP data package, data-for-1.6.4.zip

Assume ES is unpacked to {es_home}, and that {hanlp_home} is some other, unrelated directory. Unzip data-for-1.6.4.zip into {hanlp_home} to get the following layout:

.
├── data
│   ├── README.url
│   ├── dictionary
│   └── model
└── data-for-1.6.4.zip
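
Before going further, it can help to confirm that the unpacked layout really contains the two sub-directories shown in the tree. The following is a minimal sketch, not part of HanLP or the plugin; the class name HanlpDataLayout and the method missingDirs are made up for illustration, and {hanlp_home} is expected as the first command-line argument.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of HanLP or the plugin.
class HanlpDataLayout {
    // Sub-directories expected under {hanlp_home}, taken from the tree above.
    static final String[] EXPECTED_DIRS = {"data/dictionary", "data/model"};

    /** Returns the expected directories that are missing under hanlpHome. */
    static List<String> missingDirs(Path hanlpHome) {
        List<String> missing = new ArrayList<>();
        for (String dir : EXPECTED_DIRS) {
            if (!Files.isDirectory(hanlpHome.resolve(dir))) {
                missing.add(dir);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Pass {hanlp_home} as the first argument; defaults to the current directory.
        Path hanlpHome = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> missing = missingDirs(hanlpHome);
        System.out.println(missing.isEmpty()
                ? "HanLP data layout looks complete under " + hanlpHome.toAbsolutePath()
                : "Missing under " + hanlpHome.toAbsolutePath() + ": " + missing);
    }
}
```

An empty result only says the directories exist; it does not verify the dictionary or model files inside them.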

Getting started

Initial startup

  1. Start ES
  2. Start Kibana
  3. Open http://127.0.0.1:5601/ in a browser
  4. Go to Dev Tools
  5. Enter the following in the Console text box and click send
GET _search
{
  "query": {
    "match_all": {}
  }
}

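If none of the steps above reports an error, everything is working in its initial state. If something is broken, you will need to sort that out yourself first.

Deploying the plugin

Unzip analysis-hanlp-6.2.4.zip to get a directory named elasticsearch, rename it to analysis-hanlp, and move it into {es_home}/plugins. Then edit hanlp.properties and {es_home}/config/jvm.options as described in the plugin's documentation, and restart ES (Kibana does not need a restart).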
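
The setting in hanlp.properties that decides where the data is looked up is root. The sketch below mirrors, under stated assumptions, how the HanLP source quoted in the debugging section joins root with a dictionary path. The class HanlpRoot and the method coreDictionaryPath are hypothetical names, /opt/hanlp is only a placeholder for {hanlp_home}, and the default relative path is the one that appears in HanLP's own error message later in this post.

```java
import java.util.Properties;

// Hypothetical helper that mirrors how the quoted HanLP code combines `root` with a data path.
class HanlpRoot {
    // Default relative path, as named in HanLP's dictionary loading error.
    static final String DEFAULT_CORE_DICTIONARY = "data/dictionary/CoreNatureDictionary.txt";

    static String coreDictionaryPath(Properties p) {
        // Normalize Windows separators, then make sure a non-empty root ends with a slash.
        String root = p.getProperty("root", "").replace('\\', '/');
        if (root.length() > 0 && !root.endsWith("/")) {
            root += "/";
        }
        return root + p.getProperty("CoreDictionaryPath", DEFAULT_CORE_DICTIONARY);
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("root", "/opt/hanlp"); // placeholder for {hanlp_home}
        // prints /opt/hanlp/data/dictionary/CoreNatureDictionary.txt
        System.out.println(coreDictionaryPath(p));
    }
}
```

With no root at all, the result is the bare relative path, which is then resolved against whatever directory the process happens to run in.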

Barring surprises, you will get the following error:

Jun 30, 2018 10:38:11 PM com.hankcs.hanlp.HanLP$Config <clinit>
SEVERE: 没有找到hanlp.properties,可能会导致找不到data
========Tips========
请将hanlp.properties放在下列目录:
Web项目则请放到下列目录:
Webapp/WEB-INF/lib
Webapp/WEB-INF/classes
Appserver/lib
JRE/lib
并且编辑root=PARENT/path/to/your/data
现在HanLP将尝试从/home/gordon/Dev/es/6.2.4/elasticsearch-6.2.4读取data……
[2018-06-30T22:38:11,995][INFO ][o.e.p.PluginsService     ] [3dzf4Ix] loaded module [aggs-matrix-stats]
[2018-06-30T22:38:11,995][INFO ][o.e.p.PluginsService     ] [3dzf4Ix] loaded module [analysis-common]
[2018-06-30T22:38:11,996][INFO ][o.e.p.PluginsService     ] [3dzf4Ix] loaded module [ingest-common]
[2018-06-30T22:38:11,996][INFO ][o.e.p.PluginsService     ] [3dzf4Ix] loaded module [lang-expression]
[2018-06-30T22:38:11,996][INFO ][o.e.p.PluginsService     ] [3dzf4Ix] loaded module [lang-mustache]
[2018-06-30T22:38:11,996][INFO ][o.e.p.PluginsService     ] [3dzf4Ix] loaded module [lang-painless]
[2018-06-30T22:38:11,996][INFO ][o.e.p.PluginsService     ] [3dzf4Ix] loaded module [mapper-extras]
[2018-06-30T22:38:11,996][INFO ][o.e.p.PluginsService     ] [3dzf4Ix] loaded module [parent-join]
[2018-06-30T22:38:11,997][INFO ][o.e.p.PluginsService     ] [3dzf4Ix] loaded module [percolator]
[2018-06-30T22:38:11,997][INFO ][o.e.p.PluginsService     ] [3dzf4Ix] loaded module [rank-eval]
[2018-06-30T22:38:11,997][INFO ][o.e.p.PluginsService     ] [3dzf4Ix] loaded module [reindex]
[2018-06-30T22:38:11,997][INFO ][o.e.p.PluginsService     ] [3dzf4Ix] loaded module [repository-url]
[2018-06-30T22:38:11,997][INFO ][o.e.p.PluginsService     ] [3dzf4Ix] loaded module [transport-netty4]
[2018-06-30T22:38:11,997][INFO ][o.e.p.PluginsService     ] [3dzf4Ix] loaded module [tribe]
[2018-06-30T22:38:11,998][INFO ][o.e.p.PluginsService     ] [3dzf4Ix] loaded plugin [analysis-hanlp]
[2018-06-30T22:38:14,236][INFO ][o.e.d.DiscoveryModule    ] [3dzf4Ix] using discovery type [zen]
[2018-06-30T22:38:14,920][INFO ][o.e.n.Node               ] initialized
[2018-06-30T22:38:14,921][INFO ][o.e.n.Node               ] [3dzf4Ix] starting ...
[2018-06-30T22:38:15,093][INFO ][o.e.t.TransportService   ] [3dzf4Ix] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}
[2018-06-30T22:38:18,153][INFO ][o.e.c.s.MasterService    ] [3dzf4Ix] zen-disco-elected-as-master ([0] nodes joined), reason: new_master {3dzf4Ix}{3dzf4Ix_RgWyUEeodaAkDA}{oT5oXRhcTNuDvM8NbIAqNg}{127.0.0.1}{127.0.0.1:9300}

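At first glance it looks as if the defaults kicked in after the config file was not found, and nothing actually failed. But as soon as you send a request that uses hanlp, ES crashes: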

[2018-06-30T22:38:18,158][INFO ][o.e.c.s.ClusterApplierService] [3dzf4Ix] new_master {3dzf4Ix}{3dzf4Ix_RgWyUEeodaAkDA}{oT5oXRhcTNuDvM8NbIAqNg}{127.0.0.1}{127.0.0.1:9300}, reason: apply cluster state (from master [master {3dzf4Ix}{3dzf4Ix_RgWyUEeodaAkDA}{oT5oXRhcTNuDvM8NbIAqNg}{127.0.0.1}{127.0.0.1:9300} committed version [1] source [zen-disco-elected-as-master ([0] nodes joined)]])
[2018-06-30T22:38:18,183][INFO ][o.e.h.n.Netty4HttpServerTransport] [3dzf4Ix] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}
[2018-06-30T22:38:18,183][INFO ][o.e.n.Node               ] [3dzf4Ix] started
[2018-06-30T22:38:18,197][INFO ][o.e.g.GatewayService     ] [3dzf4Ix] recovered [0] indices into cluster_state
[2018-06-30T22:43:31,191][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [] fatal error in thread [elasticsearch[3dzf4Ix][index][T#1]], exiting
java.lang.ExceptionInInitializerError: null
        at com.hankcs.hanlp.seg.common.Vertex.newB(Vertex.java:462) ~[?:?]
        at com.hankcs.hanlp.seg.common.WordNet.<init>(WordNet.java:73) ~[?:?]
        at com.hankcs.hanlp.seg.Viterbi.ViterbiSegment.segSentence(ViterbiSegment.java:40) ~[?:?]
        at com.hankcs.hanlp.seg.Segment.seg(Segment.java:557) ~[?:?]
        at com.hankcs.lucene.SegmentWrapper.next(SegmentWrapper.java:98) ~[?:?]
        at com.hankcs.lucene.HanLPTokenizer.incrementToken(HanLPTokenizer.java:67) ~[?:?]
        at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.simpleAnalyze(TransportAnalyzeAction.java:266) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.analyze(TransportAnalyzeAction.java:243) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:164) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:80) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:293) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:286) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:656) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.4.jar:6.2.4]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: java.lang.IllegalArgumentException: 核心词典data/dictionary/CoreNatureDictionary.txt加载失败
        at com.hankcs.hanlp.dictionary.CoreDictionary.<clinit>(CoreDictionary.java:44) ~[?:?]
        ... 20 more
➜  bin

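So the question is: why can't hanlp.properties be found?

Debugging

Plenty of googling turned up no answer and the documentation doesn't cover it, so the only option left was to read the source. The message comes from the class com.hankcs.hanlp.HanLP, so I went to the HanLP source: https://github.com/hankcs/HanLP

After downloading it, a full-text search for 『没有找到hanlp.properties』 ("hanlp.properties not found") gives exactly one hit, in src/main/java/com/hankcs/hanlp/HanLP.java. The key part is: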

p.load(new InputStreamReader(Predefine.HANLP_PROPERTIES_PATH == null ?
                             loader.getResourceAsStream("hanlp.properties") :
                             new FileInputStream(Predefine.HANLP_PROPERTIES_PATH)
, "UTF-8"));

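In the source, Predefine.HANLP_PROPERTIES_PATH is only declared and never assigned, so at first I assumed the culprit was loader.getResourceAsStream and tried putting hanlp.properties in all sorts of places, none of which worked. Then I decided to add one more loading step after the failure, so I modified the source, rebuilt it, and replaced hanlp-1.6.4.jar inside the plugin. That is how I found out that when the plugin runs, Predefine.HANLP_PROPERTIES_PATH actually does have a value: analysis-hanlp/hanlp.properties. My modified logic did not swallow the exception, so when FileInputStream failed to find the file, the printed exception included the file's absolute path, and that is what exposed the problem. Take a look at the source of HanLP.java; it left me rather speechless...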

Properties p = new Properties();
try
{
    ClassLoader loader = Thread.currentThread().getContextClassLoader();
    if (loader == null)
    {  // IKVM (v.0.44.0.5) doesn't set context classloader
        loader = HanLP.Config.class.getClassLoader();
    }
    try
    {
        logger.info("System.getProperties().get(\"java.class.path\"): {}", System.getProperties().get("java.class.path"));

        p.load(new InputStreamReader(Predefine.HANLP_PROPERTIES_PATH == null ?
                                         loader.getResourceAsStream("hanlp.properties") :
                                         new FileInputStream(Predefine.HANLP_PROPERTIES_PATH)
            , "UTF-8"));
    }
    catch (Exception e)
    {
        e.printStackTrace();
        String HANLP_ROOT = System.getenv("HANLP_ROOT");
        if (HANLP_ROOT != null)
        {
            HANLP_ROOT = HANLP_ROOT.trim();
            p = new Properties();
            p.setProperty("root", HANLP_ROOT);
            logger.info("使用环境变量 HANLP_ROOT=" + HANLP_ROOT);
        }
        else throw e;
    }
    String root = p.getProperty("root", "").replaceAll("\\\\", "/");
    if (root.length() > 0 && !root.endsWith("/")) root += "/";
    CoreDictionaryPath = root + p.getProperty("CoreDictionaryPath", CoreDictionaryPath);
    // blah blah (rest omitted)
}
catch (Exception e)
{
    if (new File("data/dictionary/CoreNatureDictionary.tr.txt").isFile())
    {
        logger.info("使用当前目录下的data");
    }
    else
    {
        StringBuilder sbInfo = new StringBuilder("========Tips========\n请将hanlp.properties放在下列目录:\n"); // print some friendly tips
        if (new File("src/main/java").isDirectory())
        {
            sbInfo.append("src/main/resources");
        }
        else
        {
            String classPath = (String) System.getProperties().get("java.class.path");
            if (classPath != null)
            {
                for (String path : classPath.split(File.pathSeparator))
                {
                    if (new File(path).isDirectory())
                    {
                        sbInfo.append(path).append('\n');
                    }
                }
            }
            sbInfo.append("Web项目则请放到下列目录:\n" +
                              "Webapp/WEB-INF/lib\n" +
                              "Webapp/WEB-INF/classes\n" +
                              "Appserver/lib\n" +
                              "JRE/lib\n");
            sbInfo.append("并且编辑root=PARENT/path/to/your/data\n");
            sbInfo.append("现在HanLP将尝试从").append(System.getProperties().get("user.dir")).append("读取data……");
        }
        logger.error("没有找到hanlp.properties,可能会导致找不到data\n" + sbInfo);
    }
}

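It swallows the exception, and then the hints it prints are of little use... Worse, it doesn't print the exception when it happens. Instead it checks whether a different file exists and infers from that whether hanlp.properties exists... Keep it simple; just say things the simple way...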
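
For contrast, here is a minimal sketch of the "say it plainly" approach: resolve the configured path to an absolute one and, if the file is missing, name that exact location in the exception. This is not HanLP's or the plugin's code; the class LoudPropertiesLoader and its methods are hypothetical. In plain Java a relative path is resolved against the JVM's working directory, and the environment inside ES may differ, so treat the printed location as a diagnostic rather than a statement about what the plugin does.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical diagnostic helper; this is not HanLP's or the plugin's code.
class LoudPropertiesLoader {

    /** Resolves a possibly relative path against the JVM's working directory. */
    static Path resolve(String configuredPath) {
        return Paths.get(configuredPath).toAbsolutePath().normalize();
    }

    /** One-line summary of the form "<configured> -> <absolute> (found|missing)". */
    static String describe(String configuredPath) {
        Path absolute = resolve(configuredPath);
        String state = Files.isRegularFile(absolute) ? "found" : "missing";
        return configuredPath + " -> " + absolute + " (" + state + ")";
    }

    /** Loads the file as UTF-8, or fails with the absolute path in the message. */
    static Properties load(String configuredPath) throws IOException {
        Path absolute = resolve(configuredPath);
        if (!Files.isRegularFile(absolute)) {
            throw new FileNotFoundException("hanlp.properties not found at " + absolute);
        }
        Properties p = new Properties();
        try (Reader reader = Files.newBufferedReader(absolute, StandardCharsets.UTF_8)) {
            p.load(reader);
        }
        return p;
    }

    public static void main(String[] args) {
        String configured = args.length > 0 ? args[0] : "analysis-hanlp/hanlp.properties";
        System.out.println(describe(configured));
    }
}
```

The design choice is simply to fail on the file that is actually being opened, instead of probing an unrelated file and guessing.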

So the fix is to make the config file available at {es_home}/config/analysis-hanlp/hanlp.properties. You can do that with a link or by copying the file over; it's up to you.
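
Either option can be scripted. The following is a rough sketch using java.nio.file; the class InstallHanlpConfig and its methods are hypothetical, and it assumes you pass the path of your already edited hanlp.properties and the {es_home} directory on the command line. Whether ES is happy to follow a symbolic link in your environment is something to verify yourself; copying is the safer default here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; not part of the plugin.
class InstallHanlpConfig {

    /** The location this post ends up using: {es_home}/config/analysis-hanlp/hanlp.properties. */
    static Path targetFor(Path esHome) {
        return esHome.resolve("config").resolve("analysis-hanlp").resolve("hanlp.properties");
    }

    /** Copies (or symlinks) the edited properties file to the target location. */
    static Path install(Path source, Path esHome, boolean symlink) throws IOException {
        Path target = targetFor(esHome);
        Files.createDirectories(target.getParent());
        if (symlink) {
            Files.deleteIfExists(target);
            Files.createSymbolicLink(target, source.toAbsolutePath());
        } else {
            Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: InstallHanlpConfig <hanlp.properties> <es_home> [link]");
            return;
        }
        boolean symlink = args.length > 2 && "link".equals(args[2]);
        Path target = install(Paths.get(args[0]), Paths.get(args[1]), symlink);
        System.out.println("Config available at " + target.toAbsolutePath());
    }
}
```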

Then restart ES and send the following from Kibana:

GET /_analyze?pretty=true
{
"analyzer": "hanlp-index",
"text": "张柏芝士蛋糕店"
}

(The snippet in the plugin's GitHub README does not work.)

If everything is working, you will get:

{
  "tokens": [
    {
      "token": "张柏",
      "start_offset": 0,
      "end_offset": 2,
      "type": "nr",
      "position": 0
    },
    {
      "token": "芝士蛋糕",
      "start_offset": 2,
      "end_offset": 6,
      "type": "nf",
      "position": 1
    },
    {
      "token": "芝士",
      "start_offset": 2,
      "end_offset": 4,
      "type": "nf",
      "position": 2
    },
    {
      "token": "蛋糕",
      "start_offset": 4,
      "end_offset": 6,
      "type": "nf",
      "position": 3
    },
    {
      "token": "店",
      "start_offset": 6,
      "end_offset": 7,
      "type": "n",
      "position": 4
    }
  ]
}
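
The same check can be run without Kibana. Below is a minimal sketch that posts the analyze request straight to ES with HttpURLConnection. It assumes ES listens on http://127.0.0.1:9200, as in the startup log above; the class AnalyzeRequest and its methods are hypothetical, and the JSON escaping only covers quotes and backslashes, which is enough for this example but not for arbitrary text.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical client for illustration; not part of the plugin.
class AnalyzeRequest {

    /** Builds the same JSON body as the Console request above. */
    static String body(String analyzer, String text) {
        return "{\"analyzer\":\"" + escape(analyzer) + "\",\"text\":\"" + escape(text) + "\"}";
    }

    /** Minimal escaping: backslashes and double quotes only. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Posts the body to {baseUrl}/_analyze and returns the raw response text. */
    static String send(String baseUrl, String json) throws IOException {
        URL url = new URL(baseUrl + "/_analyze?pretty=true");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json; charset=UTF-8");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        InputStream in = conn.getResponseCode() < 400 ? conn.getInputStream() : conn.getErrorStream();
        if (in == null) {
            return "";
        }
        try (Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed address, taken from the publish_address in the startup log.
        System.out.println(send("http://127.0.0.1:9200", body("hanlp-index", "张柏芝士蛋糕店")));
    }
}
```

If the config file is still not being picked up, expect this call to fail in the same way the Kibana request did.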

And that's it.

Original article:

http://bungder.github.io/2018/06/30/hanlp-es-plugin/
