The analyzer is responsible for tokenizing text and applying language processing to produce terms; it is needed both when building the index and when searching. A Lucene analyzer usually consists of one tokenizer (Tokenizer) and several filters (TokenFilter). The filters post-process the tokens produced by the tokenizer, for example removing sensitive words, normalizing case, or converting plurals to singular.
Contents

1. Execution process
1.1 Analyzer
1.2 TokenStream
1.3 Tokenizer
1.4 TokenFilter
1.5 Stop words
1.6 Example
2. Chinese analyzers
3. Custom tokenizer
1. Execution process

Starting from a Reader character stream, a Tokenizer is created on top of the Reader, and its output then passes through two TokenFilters to produce the minimal lexical units, the Tokens.
An analyzer (Analyzer) is mainly composed of a Tokenizer and TokenFilters; its job is to combine the tokenizer and the filters sensibly so that together they tokenize and filter the text. In that sense an analyzer works like a Linux pipe: once the text has passed through the pipeline, you get a sequence of Tokens. A typical built-in analyzer in Lucene is StandardAnalyzer.
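To make the pipeline concrete, here is a minimal sketch, assuming lucene-core and lucene-analyzers-common from roughly the Lucene 7.x era (the same era as the source excerpts below) are on the classpath. It runs one sentence through StandardAnalyzer and prints the resulting terms; the lowercasing and stop-word removal come from the filters in its chain. The class and field names are only illustrative.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PipelineDemo {
    public static void main(String[] args) throws Exception {
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream ts = analyzer.tokenStream("title", "The Quick Brown Fox")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                              // mandatory before consuming the stream
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // quick, brown, fox ("the" is a stop word)
            }
            ts.end();                                // mandatory after the last token
        }
    }
}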
1.1 Analyzer

Analyzer is an abstract class. It mainly exposes two methods for producing a TokenStream:
public final TokenStream tokenStream(String fieldName, Reader reader)
public final TokenStream tokenStream(String fieldName, String text)

Here fieldName is the name of the field you index into, for example the "title" in:

Field f = new Field("title", "hello", Field.Store.YES, Field.Index.TOKENIZED);
To improve performance, so that the same thread does not have to create a brand-new TokenStream object for every call, old instances are reused; this is where the old reusableTokenStream API came from. In earlier versions Analyzer kept a CloseableThreadLocal<Object> tokenStreams member holding the TokenStream previously created by the current thread, set with setPreviousTokenStream and retrieved with getPreviousTokenStream; in current versions the same idea lives in the storedValue thread-local plus a ReuseStrategy, as the source below shows.
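A small sketch of what that reuse means in practice. The class name is illustrative; the identity check at the end reflects the default reuse strategy visible in the source below, so on a single thread it should typically print true, but treat that as an observation rather than a documented guarantee.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class ReuseDemo {
    static void consume(TokenStream ts) throws Exception {
        ts.reset();
        while (ts.incrementToken()) { /* drain the stream */ }
        ts.end();
        ts.close();   // must close before asking the analyzer for a new stream
    }

    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream first = analyzer.tokenStream("title", "hello world");
        consume(first);
        TokenStream second = analyzer.tokenStream("title", "another text");
        consume(second);
        // Same thread, same field: the cached components are reused, so both
        // calls typically hand back the same underlying TokenStream object.
        System.out.println(first == second);
        analyzer.close();
    }
}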
Take a look at the core source code:
// excerpted from org.apache.lucene.analysis
public abstract class Analyzer implements Closeable {

    private final Analyzer.ReuseStrategy reuseStrategy;
    private Version version;
    CloseableThreadLocal<Object> storedValue;

    // globally reusable TokenStream components (one set shared by all fields)
    public static final Analyzer.ReuseStrategy GLOBAL_REUSE_STRATEGY = new Analyzer.ReuseStrategy() {
        public Analyzer.TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName) {
            return (Analyzer.TokenStreamComponents) this.getStoredValue(analyzer);
        }

        public void setReusableComponents(Analyzer analyzer, String fieldName, Analyzer.TokenStreamComponents components) {
            this.setStoredValue(analyzer, components);
        }
    };

    // per-field reusable TokenStream components
    public static final Analyzer.ReuseStrategy PER_FIELD_REUSE_STRATEGY = new Analyzer.ReuseStrategy() {
        public Analyzer.TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName) {
            Map<String, Analyzer.TokenStreamComponents> componentsPerField =
                    (Map<String, Analyzer.TokenStreamComponents>) this.getStoredValue(analyzer);
            return componentsPerField != null ? componentsPerField.get(fieldName) : null;
        }

        public void setReusableComponents(Analyzer analyzer, String fieldName, Analyzer.TokenStreamComponents components) {
            Map<String, Analyzer.TokenStreamComponents> componentsPerField =
                    (Map<String, Analyzer.TokenStreamComponents>) this.getStoredValue(analyzer);
            if (componentsPerField == null) {
                componentsPerField = new HashMap<>();
                this.setStoredValue(analyzer, componentsPerField);
            }
            componentsPerField.put(fieldName, components);
        }
    };

    public Analyzer() {
        this(GLOBAL_REUSE_STRATEGY);
    }

    public Analyzer(Analyzer.ReuseStrategy reuseStrategy) {
        this.version = Version.LATEST;
        this.storedValue = new CloseableThreadLocal<>();
        this.reuseStrategy = reuseStrategy;
    }

    // this is the key extension point: subclasses assemble the Tokenizer and TokenFilters here
    protected abstract Analyzer.TokenStreamComponents createComponents(String fieldName);

    protected TokenStream normalize(String fieldName, TokenStream in) {
        return in;
    }

    public final TokenStream tokenStream(String fieldName, Reader reader) {
        Analyzer.TokenStreamComponents components = this.reuseStrategy.getReusableComponents(this, fieldName);
        Reader r = this.initReader(fieldName, reader);
        if (components == null) {
            components = this.createComponents(fieldName);
            this.reuseStrategy.setReusableComponents(this, fieldName, components);
        }
        components.setReader(r);
        return components.getTokenStream();
    }

    public final TokenStream tokenStream(String fieldName, String text) {
        Analyzer.TokenStreamComponents components = this.reuseStrategy.getReusableComponents(this, fieldName);
        // the ReusableStringReader that wraps the text is cached and reused as well
        ReusableStringReader strReader =
                components != null && components.reusableStringReader != null
                        ? components.reusableStringReader : new ReusableStringReader();
        strReader.setValue(text);
        Reader r = this.initReader(fieldName, strReader);
        if (components == null) {
            components = this.createComponents(fieldName);
            this.reuseStrategy.setReusableComponents(this, fieldName, components);
        }
        components.setReader(r);
        components.reusableStringReader = strReader;
        return components.getTokenStream();
    }

    public final BytesRef normalize(String fieldName, String text) {
        try {
            // run the raw text through the char filters configured for normalization
            String filteredText;
            try (Reader reader = new StringReader(text)) {
                Reader filterReader = this.initReaderForNormalization(fieldName, reader);
                char[] buffer = new char[64];
                StringBuilder builder = new StringBuilder();
                int read;
                while ((read = filterReader.read(buffer, 0, buffer.length)) != -1) {
                    builder.append(buffer, 0, read);
                }
                filteredText = builder.toString();
            } catch (IOException e) {
                throw new IllegalStateException("Normalization threw an unexpected exception", e);
            }

            AttributeFactory attributeFactory = this.attributeFactory(fieldName);
            try (TokenStream ts = this.normalize(fieldName,
                    new Analyzer.StringTokenStream(attributeFactory, filteredText, text.length()))) {
                TermToBytesRefAttribute termAtt = ts.addAttribute(TermToBytesRefAttribute.class);
                ts.reset();
                if (!ts.incrementToken()) {
                    throw new IllegalStateException("The normalization token stream is expected to produce exactly 1 token, but got 0 for analyzer " + this + " and input \"" + text + "\"");
                }
                BytesRef term = BytesRef.deepCopyOf(termAtt.getBytesRef());
                if (ts.incrementToken()) {
                    throw new IllegalStateException("The normalization token stream is expected to produce exactly 1 token, but got 2+ for analyzer " + this + " and input \"" + text + "\"");
                }
                ts.end();
                return term;
            }
        } catch (IOException e) {
            throw new IllegalStateException("Normalization threw an unexpected exception", e);
        }
    }

    protected Reader initReader(String fieldName, Reader reader) {
        return reader;
    }

    protected Reader initReaderForNormalization(String fieldName, Reader reader) {
        return reader;
    }

    protected AttributeFactory attributeFactory(String fieldName) {
        return TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY;
    }

    public int getPositionIncrementGap(String fieldName) {
        return 0;
    }

    public int getOffsetGap(String fieldName) {
        return 1;
    }

    public final Analyzer.ReuseStrategy getReuseStrategy() {
        return this.reuseStrategy;
    }

    public void setVersion(Version v) {
        this.version = v;
    }

    public Version getVersion() {
        return this.version;
    }

    public void close() {
        if (this.storedValue != null) {
            this.storedValue.close();
            this.storedValue = null;
        }
    }

    private static final class StringTokenStream extends TokenStream {
        private final String value;
        private final int length;
        private boolean used = true;
        private final CharTermAttribute termAttribute = this.addAttribute(CharTermAttribute.class);
        private final OffsetAttribute offsetAttribute = this.addAttribute(OffsetAttribute.class);

        StringTokenStream(AttributeFactory attributeFactory, String value, int length) {
            super(attributeFactory);
            this.value = value;
            this.length = length;
        }

        public void reset() {
            this.used = false;
        }

        public boolean incrementToken() {
            if (this.used) {
                return false;
            }
            this.clearAttributes();
            this.termAttribute.append(this.value);
            this.offsetAttribute.setOffset(0, this.length);
            this.used = true;
            return true;
        }

        public void end() throws IOException {
            super.end();
            this.offsetAttribute.setOffset(this.length, this.length);
        }
    }

    public abstract static class ReuseStrategy {
        public ReuseStrategy() {
        }

        public abstract Analyzer.TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName);

        public abstract void setReusableComponents(Analyzer analyzer, String fieldName, Analyzer.TokenStreamComponents components);

        protected final Object getStoredValue(Analyzer analyzer) {
            if (analyzer.storedValue == null) {
                throw new AlreadyClosedException("this Analyzer is closed");
            }
            return analyzer.storedValue.get();
        }

        protected final void setStoredValue(Analyzer analyzer, Object storedValue) {
            if (analyzer.storedValue == null) {
                throw new AlreadyClosedException("this Analyzer is closed");
            }
            analyzer.storedValue.set(storedValue);
        }
    }

    public static class TokenStreamComponents {
        protected final Tokenizer source;
        protected final TokenStream sink;
        transient ReusableStringReader reusableStringReader;

        public TokenStreamComponents(Tokenizer source, TokenStream result) {
            this.source = source;
            this.sink = result;
        }

        public TokenStreamComponents(Tokenizer source) {
            this.source = source;
            this.sink = source;
        }

        protected void setReader(Reader reader) {
            this.source.setReader(reader);
        }

        public TokenStream getTokenStream() {
            return this.sink;
        }

        public Tokenizer getTokenizer() {
            return this.source;
        }
    }
}

// org.apache.lucene.analysis.standard.StandardAnalyzer
public final class StandardAnalyzer extends StopwordAnalyzerBase {

    public static final CharArraySet ENGLISH_STOP_WORDS_SET;
    public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
    private int maxTokenLength;
    public static final CharArraySet STOP_WORDS_SET;

    public StandardAnalyzer(CharArraySet stopWords) {
        super(stopWords);
        this.maxTokenLength = 255;
    }

    public StandardAnalyzer() {
        this(STOP_WORDS_SET);
    }

    public StandardAnalyzer(Reader stopwords) throws IOException {
        this(loadStopwordSet(stopwords));
    }

    public void setMaxTokenLength(int length) {
        this.maxTokenLength = length;
    }

    public int getMaxTokenLength() {
        return this.maxTokenLength;
    }

    protected TokenStreamComponents createComponents(String fieldName) {
        final StandardTokenizer src = new StandardTokenizer();
        src.setMaxTokenLength(this.maxTokenLength);
        StandardFilter tok = new StandardFilter(src);
        LowerCaseFilter tok1 = new LowerCaseFilter(tok);
        final StopFilter tok2 = new StopFilter(tok1, this.stopwords);
        return new TokenStreamComponents(src, tok2) {
            protected void setReader(Reader reader) {
                src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
                super.setReader(reader);
            }
        };
    }

    protected TokenStream normalize(String fieldName, TokenStream in) {
        StandardFilter result = new StandardFilter(in);
        LowerCaseFilter result1 = new LowerCaseFilter(result);
        return result1;
    }

    static {
        List<String> stopWords = Arrays.asList(
                "a", "an", "and", "are", "as", "at", "be", "but", "by",
                "for", "if", "in", "into", "is", "it",
                "no", "not", "of", "on", "or", "such",
                "that", "the", "their", "then", "there", "these",
                "they", "this", "to", "was", "will", "with");
        CharArraySet stopSet = new CharArraySet(stopWords, false);
        ENGLISH_STOP_WORDS_SET = CharArraySet.unmodifiableSet(stopSet);
        STOP_WORDS_SET = ENGLISH_STOP_WORDS_SET;
    }
}

1.2 TokenStream

A TokenStream is a stream made up of the Tokens produced by tokenization: you keep pulling it to obtain the next Token. Since Lucene 3.0 the next() method has been replaced by incrementToken(), and an end() method was added. TokenStream extends AttributeSource, which keeps a Map from attribute class to attribute instance, so values of different attribute types can be stored alongside the stream.
TokenStream is an abstract class; it is the common base class that tokenizers (Tokenizer) and token filters (TokenFilter) extend.
Its main methods are:
public abstract boolean incrementToken()  // advance to the next Token
public void reset()                       // put the TokenStream back into its initial state; skipping it causes an exception
public void end()                         // called after the last token; invokes endAttributes() to mark that no more tokens follow, in preparation for close()
public void close()                       // release the stream

The attribute objects it most often works with are:
TermAttributeImpl               // holds the token text (CharTermAttributeImpl in current versions)
PositionIncrementAttributeImpl  // holds position-increment information
OffsetAttributeImpl             // holds offset information

Usage: register the attributes you need on the TokenStream, for example tokenStream.addAttribute(CharTermAttribute.class), and then read the values from the returned attribute object after each incrementToken() call.
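A sketch of this usage pattern, under the same Lucene 7.x-era classpath assumption as before; it registers three attributes and prints each term with its offsets and position increment (class and field names are illustrative):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class AttributeDemo {
    public static void main(String[] args) throws Exception {
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream ts = analyzer.tokenStream("body", "Lucene is a Java search library")) {
            // register (or look up) the attributes before consuming the stream
            CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
            OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
            PositionIncrementAttribute posIncAtt = ts.addAttribute(PositionIncrementAttribute.class);

            ts.reset();
            while (ts.incrementToken()) {
                System.out.printf("%s [%d,%d] +%d%n",
                        termAtt.toString(),
                        offsetAtt.startOffset(), offsetAtt.endOffset(),
                        posIncAtt.getPositionIncrement());
            }
            ts.end();   // records the final offset state
        }
    }
}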
1.3 Tokenizer

The tokenizer receives a character stream (Reader) and splits it into tokens. StandardTokenizer is a typical implementation.
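Used on its own, a Tokenizer only needs a Reader set on it before consumption; a minimal sketch (same classpath assumption, names illustrative):

import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerDemo {
    public static void main(String[] args) throws Exception {
        StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader("Hello Lucene, hello search"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            // no filters here, so case is preserved: Hello / Lucene / hello / search
            System.out.println(term.toString());
        }
        tokenizer.end();
        tokenizer.close();
    }
}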
1.4 TokenFilter

A TokenFilter post-processes the minimal units produced by the tokenizer before they enter the index, for example lowercasing, converting plurals to singular, or even correcting misspelled words based on their meaning.
1.5 Stop words

What are stop words? To save storage space and improve search efficiency, a search engine automatically ignores certain characters or words when indexing pages or processing search requests; these are called stop words. Typical examples are modal particles, adverbs, prepositions, and conjunctions, which carry little meaning on their own and only play a role inside a complete sentence, such as the common Chinese words "的", "在", "是", "啊".
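In StandardAnalyzer this is exactly what STOP_WORDS_SET plus the trailing StopFilter in the source above implement. A sketch of supplying your own stop-word set instead of the default English one (the words chosen here are just for illustration; in pre-7.x releases CharArraySet lives in org.apache.lucene.analysis.util):

import java.util.Arrays;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StopWordDemo {
    public static void main(String[] args) throws Exception {
        // a custom stop-word set; 'true' means matching ignores case
        CharArraySet stopWords = new CharArraySet(Arrays.asList("的", "在", "是", "啊", "the", "a"), true);
        try (Analyzer analyzer = new StandardAnalyzer(stopWords);
             TokenStream ts = analyzer.tokenStream("content", "Lucene 是 a search library")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());   // lucene, search, library
            }
            ts.end();
        }
    }
}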
2. Chinese analyzers

There are quite a few open-source options for Chinese analysis, for example StandardAnalyzer, ChineseAnalyzer, CJKAnalyzer, IK_CAnalyzer, MIK_CAnalyzer, MMAnalyzer (the JE tokenizer), and PaodingAnalyzer. Pure Chinese tokenization is generally implemented either per character or per word. Character-based indexing, as the name suggests, builds the index from single characters. Word-based indexing splits the text into words according to a dictionary. Che Dong's overlapping two-character splitting, also called bigram tokenization, can be seen as an improvement on character-based indexing but still belongs to that family. (A short comparison of the character-based and bigram approaches follows the list below.)
Per character

StandardAnalyzer
The standard analyzer that ships with Lucene.

ChineseAnalyzer
Shipped in Lucene contrib; similar to StandardAnalyzer. Note that it is only similar; there are still differences.

CJKAnalyzer
The bigram tokenizer shipped in Lucene contrib.

Per word

IK_CAnalyzer, MIK_CAnalyzer
http://lucene-group.group.iteye.com/group/blog/165287. Version used: 2.0.2.

MMAnalyzer
The latest version that can still be found is 1.5.3. It is no longer downloadable from the original site, and it has reportedly been declared unmaintained and unsupported. It is listed here because it comes up in discussion a lot, but it did not feel very stable in use.

PaodingAnalyzer
"Paoding Jieniu" (the ox-carving cook). http://code.google.com/p/paoding/downloads/list. Version used: 2.0.4beta.
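To see the character-based versus bigram difference concretely, here is a small sketch comparing StandardAnalyzer and CJKAnalyzer on the same text. It assumes lucene-analyzers-common is on the classpath; the output comments describe the typical behaviour of these two analyzers on CJK input.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ChineseCompareDemo {
    static void print(Analyzer analyzer, String text) throws Exception {
        try (TokenStream ts = analyzer.tokenStream("content", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            StringBuilder out = new StringBuilder();
            while (ts.incrementToken()) {
                out.append(term.toString()).append(" | ");
            }
            ts.end();
            System.out.println(out);
        }
    }

    public static void main(String[] args) throws Exception {
        String text = "中华人民共和国";
        print(new StandardAnalyzer(), text);  // per character: 中 | 华 | 人 | 民 | 共 | 和 | 国
        print(new CJKAnalyzer(), text);       // bigrams: 中华 | 华人 | 人民 | 民共 | 共和 | 和国
    }
}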
For typical applications, bigram tokenization is usually good enough. If you really need dictionary-based word segmentation, then weighing tokenization quality, performance, extensibility, and maintainability, the recommendation is the IK tokenizer, for the following reasons:
IK offers two granularities: fine-grained ik_max_word and coarse-grained ik_smart.
Updating the IK dictionary only requires appending keywords at the end of the dictionary file, and both local and remote dictionaries are supported.
The IK plugin is released quickly and stays closely in step with the latest releases.

3. Custom tokenizer

If you have followed how a TokenStream is created in the source code above, you will have noticed that Analyzer.createComponents is the extension entry point.
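Before implementing every component from scratch, note that overriding createComponents alone is already enough to assemble a custom chain out of existing building blocks. A sketch under the same Lucene 7.x assumption (the class name and the particular filters are illustrative; LowerCaseFilter, StopFilter and CharArraySet live in org.apache.lucene.analysis in 7.x but in slightly different packages in older releases):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

// A custom analyzer assembled purely from existing building blocks:
// whitespace tokenization -> lowercasing -> stop-word removal.
public class SimpleCustomAnalyzer extends Analyzer {
    private final CharArraySet stopWords;

    public SimpleCustomAnalyzer(CharArraySet stopWords) {
        this.stopWords = stopWords;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream filter = new LowerCaseFilter(source);
        filter = new StopFilter(filter, stopWords);
        return new TokenStreamComponents(source, filter);
    }
}

For example, new SimpleCustomAnalyzer(StandardAnalyzer.ENGLISH_STOP_WORDS_SET).tokenStream("f", "The QUICK fox") would yield quick and fox. The rest of this section goes one level deeper and implements the Attribute, the Tokenizer and the TokenFilter themselves.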
1. Define your own Attribute interface and its implementation class
import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;
import org.apache.lucene.util.AttributeReflector;

public interface MyCharAttribute extends Attribute {
    void setChars(char[] buffer, int length);
    char[] getChars();
    int getLength();
    String getString();
}

// Lucene's default attribute factory resolves the implementation by naming
// convention: MyCharAttribute -> MyCharAttributeImpl in the same package.
public class MyCharAttributeImpl extends AttributeImpl implements MyCharAttribute {
    private char[] chatTerm = new char[255];
    private int length = 0;

    @Override
    public void setChars(char[] buffer, int length) {
        this.length = length;
        if (length > 0) {
            System.arraycopy(buffer, 0, this.chatTerm, 0, length);
        }
    }

    public char[] getChars() {
        return this.chatTerm;
    }

    public int getLength() {
        return this.length;
    }

    @Override
    public String getString() {
        if (this.length > 0) {
            return new String(this.chatTerm, 0, length);
        }
        return null;
    }

    @Override
    public void clear() {
        this.length = 0;
    }

    @Override
    public void reflectWith(AttributeReflector attributeReflector) {
    }

    @Override
    public void copyTo(AttributeImpl attribute) {
    }
}

2. Build the tokenizer MyWhitespaceTokenizer, which splits English text on whitespace
import java.io.IOException;
import org.apache.lucene.analysis.Tokenizer;

public final class MyWhitespaceTokenizer extends Tokenizer {

    // the attribute in which each token is recorded
    private MyCharAttribute charAttr = this.addAttribute(MyCharAttribute.class);

    // scratch buffer for the current token, plus the token offsets it implies
    char[] buffer = new char[255];
    int length = 0;
    int c;

    @Override
    public boolean incrementToken() throws IOException {
        // clear all term attributes
        clearAttributes();
        length = 0;
        while (true) {
            c = this.input.read();
            if (c == -1) {
                if (length > 0) {
                    // copy into charAttr
                    this.charAttr.setChars(buffer, length);
                    return true;
                } else {
                    return false;
                }
            }
            if (Character.isWhitespace(c)) {
                if (length > 0) {
                    // copy into charAttr
                    this.charAttr.setChars(buffer, length);
                    return true;
                }
                continue;   // skip leading whitespace instead of buffering it
            }
            buffer[length++] = (char) c;
        }
    }
}

3. Build the filter that converts uppercase letters to lowercase
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public final class MyLowerCaseTokenFilter extends TokenFilter {

    private MyCharAttribute charAttr = this.addAttribute(MyCharAttribute.class);

    public MyLowerCaseTokenFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        boolean res = this.input.incrementToken();
        if (res) {
            // lowercase the term produced by the upstream tokenizer in place
            char[] chars = charAttr.getChars();
            int length = charAttr.getLength();
            if (length > 0) {
                for (int i = 0; i < length; i++) {
                    chars[i] = Character.toLowerCase(chars[i]);
                }
            }
        }
        return res;
    }
}

4. Build the analyzer
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;

public final class MyWhitespaceAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String s) {
        Tokenizer source = new MyWhitespaceTokenizer();
        TokenStream filter = new MyLowerCaseTokenFilter(source);
        return new TokenStreamComponents(source, filter);
    }
}

5. Run the analyzer
@Test
public void go_customAnalyzer() throws IOException {
    Analyzer analyzer = new MyWhitespaceAnalyzer();
    TokenStream tokenStream = analyzer.tokenStream("aa", str);   // str: the test input string, defined elsewhere in the test class
    MyCharAttribute myCharAttr = tokenStream.getAttribute(MyCharAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        System.out.println(myCharAttr.getString());
    }
    tokenStream.end();
    tokenStream.close();
}

To sum up, a Lucene analyzer is built from a tokenizer plus filters. Once you know the core APIs of these objects, the analyzer is essentially just a tool, and its internals are not especially complicated. The sketch below closes the loop by plugging the custom analyzer into an IndexWriter.
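Once the custom analyzer behaves as expected, using it for indexing is just a matter of handing it to IndexWriterConfig. A brief sketch; the index path and field name are illustrative:

import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexWithCustomAnalyzer {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new MyWhitespaceAnalyzer();
        try (Directory dir = FSDirectory.open(Paths.get("/tmp/my-index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("aa", "Hello Lucene Custom Analyzer", Field.Store.YES));
            writer.addDocument(doc);   // the "aa" field is tokenized by MyWhitespaceAnalyzer
        }
        analyzer.close();
    }
}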