15、Analyzer分析器之中文分析器的扩展

本文主要是介绍15、Analyzer分析器之中文分析器的扩展，希望对大家解决编程问题提供一定的参考价值，需要的开发者们随着小编来一起学习吧！

其实在第五章节里已经有介绍过下面的分析器了，只是没有做例子，今天将下面没有做过例子分析器进行一个例子说明 

 paoding： 庖丁解牛最新版在  https://code.google.com/p/paoding/  中最多支持Lucene 3.0，且最新提交的代码在 2008-06-03，在svn中最新也是2010年提交，已经过时，不予考虑。 

 mmseg4j：最新版已从  https://code.google.com/p/mmseg4j/  移至  https://github.com/chenlb/mmseg4j-solr ，支持Lucene 4.10，且在github中最新提交代码是2014年6月，从09年～14年一共有：18个版本，也就是一年几乎有3个大小版本，有较大的活跃度，用了mmseg算法。 

 IK-analyzer： 最新版在https://code.google.com/p/ik-analyzer/上，支持Lucene 4.10从2006年12月推出1.0版开始， IKAnalyzer已经推出了4个大版本。最初，它是以开源项目Luence为应用主体的，结合词典分词和文法分析算法的中文分词组件。从3.0版本开 始，IK发展为面向Java的公用分词组件，独立于Lucene项目，同时提供了对Lucene的默认优化实现。在2012版本中，IK实现了简单的分词 歧义排除算法，标志着IK分词器从单纯的词典分词向模拟语义分词衍化。 但是也就是2012年12月后没有在更新。 这里我们不在说明这个分析器，感兴趣的小伙伴可以看看第5章节的说明哦 

 ansj_seg：最新版本在  https://github.com/NLPchina/ansj_seg  tags仅有1.1版本，从2012年到2014年更新了大小6次，但是作者本人在2014年10月10日说明：“可能我以后没有精力来维护ansj_seg了”，现在由”nlp_china”管理。2014年11月有更新。并未说明是否支持Lucene，是一个由CRF（条件随机场）算法所做的分词算法。 

 imdict-chinese-analyzer：最新版在  https://code.google.com/p/imdict-chinese-analyzer/  ， 最新更新也在2009年5月，下载源码，不支持Lucene 4.10 。是利用HMM（隐马尔科夫链）算法。 

 Jcseg：最新版本在git.oschina.net/lionsoul/jcseg，支持Lucene 4.10，作者有较高的活跃度。利用mmseg算法。 

 MMseg 的分析器的使用 

首先将引用相关的jar包， 

<!--mmseg4j 的分析器的使用  -->
<dependency><groupId>com.chenlb.mmseg4j</groupId><artifactId>mmseg4j-core</artifactId><version>1.10.0</version>
</dependency>

 具体代码的实现 

package mmseg;
import com.chenlb.mmseg4j.*;
import java.io.File;
import java.io.IOException;
import java.io.StringReader;/*** Created by kangz on 2016/12/19.*/
public class MMsegAnalyzerTest {public static void main(String[] args) throws IOException {String txt = "";txt = "那个好看的笑容里面全是悲伤白富美，他在行尸走肉的活着，他的故事悲伤的像一场没有结局的黑白电影，他是她小说里的主角， 她懂他，他爱过她，她不知道自己是爱他的的外表，还是爱他的故事，还是爱他身上的那个自己。";File file = new File("D:\\LucentTest\\luceneIndex2");//词典的目录Dictionary dic = Dictionary.getInstance();//建立词典实例，与比较老的版本中不相同。不能直接new。 默认读取的是jar包中 words.dic（可修改其内容）也可指定词典目录  可以是 File 也可以是String 的形式Seg seg = null;//seg = new SimpleSeg(dic);//简单的seg = new ComplexSeg(dic);//复杂的MMSeg mmSeg = new MMSeg(new StringReader(txt), seg);Word word = null;while((word = mmSeg.next())!=null) {if(word != null) {System.out.print(word + "|");}}}
}

 Jcseg 的分析器的使用 

首先将引用相关的jar包， 

<!--Jcseg 的分析器的使用 -->
<dependency><groupId>org.lionsoul</groupId><artifactId>jcseg-core</artifactId><version>2.0.1</version>
</dependency>
<dependency><groupId>org.lionsoul</groupId><artifactId>jcseg-analyzer</artifactId><version>2.0.1</version>
</dependency>

Lucene集成Jcseg的测试代码 

将jcseg源码包中的 lexicon和 jcseg.properties两个文件复制到src/main/resources下，并修改 jcseg.properties中的lexicon.path = src/main/resources/lexicon 

新建一个类： 

package lexicon;import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;
import org.lionsoul.jcseg.analyzer.v5x.JcsegAnalyzer5X;
import org.lionsoul.jcseg.tokenizer.core.JcsegTaskConfig;import java.io.File;
import java.nio.file.Paths;/*** Created by kangz on 2016/12/19.*/
public class LexiconAnalyzersTest {@Testpublic void test() throws Exception {//如果不知道选择哪个Directory的子类，那么推荐使用FSDirectory.open()方法来打开目录 创建一个分析器对象Analyzer analyzer = new JcsegAnalyzer5X(JcsegTaskConfig.COMPLEX_MODE);//非必须(用于修改默认配置): 获取分词任务配置实例JcsegAnalyzer5X jcseg = (JcsegAnalyzer5X) analyzer;JcsegTaskConfig config = jcseg.getTaskConfig();//追加同义词, 需要在 jcseg.properties中配置jcseg.loadsyn=1config.setAppendCJKSyn(true);//追加拼音, 需要在jcseg.properties中配置jcseg.loadpinyin=1config.setAppendCJKPinyin(true);//更多配置, 请查看 org.lionsoul.jcseg.tokenizer.core.JcsegTaskConfig/** ------------------------------------------------------------------------ **/// 打开索引库// 指定索引库存放的位置Directory directory = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex"));//创建一个IndexwriterConfig对象//第一个参数：lucene的版本，第二个参数：分析器对象IndexWriterConfig indexWriterConfig=new IndexWriterConfig(analyzer);//创建一个Indexwriter对象IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);indexWriter.deleteAll();//清除之前的索引  注: 全部删除索引， 请慎用。//读取文件信息//原始文档存放的目录File path = new File("D:\\LucentTest\\luceneFile");for (File file:path.listFiles()) {if (file.isDirectory()) continue;//读取文件信息//文件名String fileName = file.getName();//文件内容String fileContent = FileUtils.readFileToString(file);//文件的路径String filePath = file.getPath();//文件的大小long fileSize = FileUtils.sizeOf(file);//创建文档对象Document document = new Document();//创建域//三个参数：1、域的名称2、域的值3、是否存储 Store.YES：存储  Store.NO：不存储Field nameField = new TextField("name", fileName, Field.Store.YES);Field contentField = new TextField("content", fileContent, Field.Store.YES);Field sizeField=new LongPoint("size",fileSize);Field pathField  = new StoredField("path", filePath);//把域添加到document对象中document.add(nameField);document.add(contentField);document.add(pathField);document.add(sizeField);//把document写入索引库indexWriter.addDocument(document);}indexWriter.close();}//使用查询@Testpublic void testTermQuery() throws Exception {//以读的方式打开索引库Directory directory = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex"));//创建一个IndexReaderIndexReader indexReader = DirectoryReader.open(directory);//创建一个IndexSearcher对象IndexSearcher indexSearcher = new IndexSearcher(indexReader);//创建一个查询对象Query query = new TermQuery(new Term("content", "全文检索"));//执行查询TopDocs topDocs = indexSearcher.search(query,10);System.out.println("查询结果总数量：" + topDocs.totalHits);for (ScoreDoc scoreDoc : topDocs.scoreDocs) {//取document对象Document document = indexSearcher.doc(scoreDoc.doc);System.out.println("得分：" + scoreDoc.score);//System.out.println(document.get("content"));System.out.println(document.get("path"));}indexReader.close();}
}

 ansj分析器的使用 

首先要引用jar包 

 Maven项目配置Ansj 

 根据官方手册，在 pom.xml 文件中加入依赖，如下所示 

<!--ansj 的分析器的使用-->
<dependency><groupId>org.ansj</groupId><artifactId>ansj_seg</artifactId><version>5.0.2</version>
</dependency>
<dependency><groupId>org.ansj</groupId><artifactId>ansj_lucene5_plug</artifactId><version>5.0.3.0</version>
</dependency>

 Lucene集成Ansj的测试代码 

 Ansj In Lucene 的官方参考文档： http://nlpchina.github.io/ansj_seg/ 

 到 https://github.com/NLPchina/ansj_seg  下载 ZIP 压缩文件，解压，将其中的 library 文件夹和 library.properties 文件拷贝到 maven 项目下的 src/main/resources 中，修改 library.properties 内容如下 

#redress dic file path
ambiguityLibrary=src/main/resources/library/ambiguity.dic
#path of userLibrary this is default library
userLibrary=src/main/resources/library/default.dic
#path of crfModel
crfModel=src/main/resources/library/crf.model
#set real name
isRealName=true

 具体的代码如下 

package ansj;import org.ansj.library.UserDefineLibrary;
import org.ansj.lucene5.AnsjAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.file.Paths;
import java.util.Date;/*** Created by kangz on 2016/12/19.*/
public class AnsjAnalyzerTest {/*** 简单测试 AnsjAnalyzer的性能及基础应用* @throws IOException*/@Testpublic void test() throws IOException {Analyzer ca = new AnsjAnalyzer(AnsjAnalyzer.TYPE.index);Reader sentence = new StringReader("全文检索是将整本书java、整篇文章中的任意内容信息查找出来的检索，java。它可以根据需要获得全文中有关章、节、段、句、词等信息，计算机程序通过扫描文章中的每一个词");TokenStream ts = ca.tokenStream("sentence", sentence);System.out.println("start: " + (new Date()));long before = System.currentTimeMillis();while (ts.incrementToken()) {System.out.println(ts.getAttribute(CharTermAttribute.class));}ts.close();long now = System.currentTimeMillis();System.out.println("time: " + (now - before) / 1000.0 + " s");}@Testpublic void indexTest() throws IOException, ParseException {Analyzer analyzer = new AnsjAnalyzer(AnsjAnalyzer.TYPE.index);Directory directory = FSDirectory.open(Paths.get("D:\\LucentTest\\luceneIndex2"));IndexWriter iwriter;UserDefineLibrary.insertWord("蛇药片", "n", 1000);// 创建一个IndexWriterConfig 对象IndexWriterConfig config = new IndexWriterConfig(analyzer);// 创建indexwriter对象IndexWriter indexWriter = new IndexWriter(directory, config);// 创建一个文档对象Document document = new Document();Field nameField = new TextField("text", "季德胜蛇药片 10片*6板 ", Field.Store.YES);nameField.boost();document.add(nameField);//写入索引库indexWriter.addDocument(document);indexWriter.commit();indexWriter.close();System.out.println("索引建立完毕");search(analyzer, directory, "\"季德胜蛇药片\"");}//封装索引查询private void search(Analyzer queryAnalyzer, Directory directory, String queryStr) throws IOException, ParseException {IndexSearcher isearcher;DirectoryReader directoryReader = DirectoryReader.open(directory);// 查询索引isearcher = new IndexSearcher(directoryReader);QueryParser tq = new QueryParser("text", queryAnalyzer);Query query = tq.parse(queryStr);System.out.println(query);TopDocs hits = isearcher.search(query, 5);System.out.println(queryStr + ":共找到" + hits.totalHits + "条记录!");for (int i = 0; i < hits.scoreDocs.length; i++) {int docId = hits.scoreDocs[i].doc;Document document = isearcher.doc(docId);System.out.println(toHighlighter(queryAnalyzer, query, document));}}//private String toHighlighter(Analyzer analyzer, Query query, Document doc) {String field = "text";try {SimpleHTMLFormatter simpleHtmlFormatter = new SimpleHTMLFormatter("<font color=\"red\">", "</font>");Highlighter highlighter = new Highlighter(simpleHtmlFormatter, new QueryScorer(query));TokenStream tokenStream1 = analyzer.tokenStream("text", new StringReader(doc.get(field)));String highlighterStr = highlighter.getBestFragment(tokenStream1, doc.get(field));return highlighterStr == null ? doc.get(field) : highlighterStr;} catch (IOException | InvalidTokenOffsetsException e) {}return null;}}

主要参考的文档 http://codepub.cn/2016/03/23/Maven-project-integrating-Lucene-Chinese-Segmentation-tools-Jcseg-and-Ansj/ 

下面是小编的微信转帐二维码，小编再次谢谢读者的支持，小编会更努力的

----请看下方↓↓↓↓↓↓↓

百度搜索 Drools从入门到精通：可下载开源全套Drools教程

深度Drools教程不段更新中：

更多Drools实战陆续发布中………

扫描下方二维码关注公众号 ↓↓↓↓↓↓↓↓↓↓

这篇关于15、Analyzer分析器之中文分析器的扩展的文章就介绍到这儿，希望我们推荐的文章对编程师们有所帮助！

15、Analyzer分析器之中文分析器的扩展

相关文章

SQL Server安装时候没有中文选项的解决方法

Python实现中文文本处理与分析程序的示例详解

PowerShell中15个提升运维效率关键命令实战指南

PostgreSQL的扩展dict_int应用案例解析

Redis出现中文乱码的问题及解决

Spring组件实例化扩展点之InstantiationAwareBeanPostProcessor使用场景解析

RedisTemplate默认序列化方式显示中文乱码的解决

Java常用注解扩展对比举例详解

Spring组件初始化扩展点BeanPostProcessor的作用详解

一文教你解决Python不支持中文路径的问题