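6. Synonyms, aliases, and words that mean the same
Synonyms can be handled by supplying a SynonymAnalyzer. It indexes the synonyms of a word at the same position as the word itself, so a search on any of the synonyms finds the document. See the test code in the book, which is commented in detail.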
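To make "indexed at the same position" concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class name SynonymPositionSketch, the analyze method, and the synonym table are all hypothetical and are not the book's SynonymAnalyzer; the sketch only models the idea that a synonym carries a position increment of 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SynonymPositionSketch {

    /**
     * Maps each 1-based position to the terms stored there. The original
     * word comes first; its synonyms share the same position, which is
     * what a position increment of 0 expresses.
     */
    static Map<Integer, List<String>> analyze(String text,
                                              Map<String, List<String>> synonyms) {
        Map<Integer, List<String>> positions = new TreeMap<>();
        int position = 0;
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            position++; // increment 1: the word opens a new position
            List<String> terms = new ArrayList<>();
            terms.add(word);
            // increment 0: synonyms are stacked on the same position
            terms.addAll(synonyms.getOrDefault(word, Collections.<String>emptyList()));
            positions.put(position, terms);
        }
        return positions;
    }

    public static void main(String[] args) {
        // Hypothetical synonym table, for illustration only.
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("quick", Arrays.asList("fast", "speedy"));
        synonyms.put("jumps", Arrays.asList("hops", "leaps"));

        analyze("The quick brown fox jumps", synonyms)
                .forEach((pos, terms) -> System.out.println(pos + ": " + terms));
    }
}
```

Because "fast" and "speedy" sit at position 2 together with "quick", a phrase such as "fast brown" lines up with the indexed positions just as "quick brown" does.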
7. Stemming Analyzer
PositionalPorterStopAnalyzer
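This is an Analyzer implementation that is not built in. It reduces every word to its root form; for example, breathe, breathes, breathing, and breathed are all reduced to breath. It also removes all stop words while preserving their positions. For example, "the quick brown fox jumps over the lazy dog" is analyzed into the following terms: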
2: [quick]
3: [brown]
4: [fox]
5: [jump]
6: [over]
8: [lazi]
9: [dog]
Positions 1 and 7 are empty because the stop word "the" was removed. The analyzer does this with a filter:
package lia.analysis.positional;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import java.util.Set;
import java.io.IOException;

public class PositionalStopFilter extends TokenFilter {
    private Set stopWords;

    public PositionalStopFilter(TokenStream in, Set stopWords) {
        super(in);
        this.stopWords = stopWords;
    }

    public final Token next() throws IOException {
        // Number of stop words skipped since the last token was returned
        int increment = 0;
        for (Token token = input.next();
             token != null; token = input.next()) {

            if (!stopWords.contains(token.termText())) {
                // Keep the positions of the removed stop words
                token.setPositionIncrement(
                    token.getPositionIncrement() + increment);
                return token;
            }

            increment++;
        }

        return null;
    }
}
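The listing above depends on the old Lucene Token API. As a rough, dependency-free illustration of the same bookkeeping, the sketch below tracks absolute positions instead of increments. The class PositionalStopSketch and its analyze method are hypothetical, the tokenizer is ASCII-only, and no stemming is applied, so the terms come out as "jumps" and "lazy" rather than "jump" and "lazi".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class PositionalStopSketch {

    /**
     * Lowercases the text, splits it on non-letters (ASCII only), drops
     * stop words, and records the 1-based position of every surviving
     * term so that the holes left by stop words stay visible.
     */
    static Map<Integer, String> analyze(String text, Set<String> stopWords) {
        Map<Integer, String> positions = new LinkedHashMap<>();
        int position = 0;
        for (String word : text.toLowerCase().split("[^a-z]+")) {
            if (word.isEmpty()) {
                continue;
            }
            position++; // every token advances the position, kept or not
            if (!stopWords.contains(word)) {
                positions.put(position, word);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the"));
        System.out.println(
                analyze("The quick brown fox jumps over the lazy dog", stopWords));
        // prints {2=quick, 3=brown, 4=fox, 5=jumps, 6=over, 8=lazy, 9=dog}
    }
}
```

Counting skipped tokens and adding that count to the next token's increment, as the filter does, yields the same absolute positions as this running counter.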
PositionalPorterStopAnalyzer.java supplies the stop-word list:
package lia.analysis.positional;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

import java.io.Reader;
import java.util.Set;

public class PositionalPorterStopAnalyzer extends Analyzer {
    private Set stopWords;

    public PositionalPorterStopAnalyzer() {
        // Use the default English stop words
        this(StopAnalyzer.ENGLISH_STOP_WORDS);
    }

    public PositionalPorterStopAnalyzer(String[] stopList) {
        // Use a caller-supplied stop-word list
        stopWords = StopFilter.makeStopSet(stopList);
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Tokenize and lowercase, drop stop words (keeping the holes), then stem
        return new PorterStemFilter(
            new PositionalStopFilter(
                new LowerCaseTokenizer(reader), stopWords));
    }
}
The test code follows:
package lia.analysis.positional;

import junit.framework.TestCase;
import lia.analysis.AnalyzerUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

import java.io.IOException;

public class PositionalPorterStopAnalyzerTest extends TestCase {
    private static PositionalPorterStopAnalyzer porterAnalyzer =
        new PositionalPorterStopAnalyzer();

    private RAMDirectory directory;

    public void setUp() throws Exception {
        directory = new RAMDirectory();
        IndexWriter writer =
            new IndexWriter(directory, porterAnalyzer, true);

        Document doc = new Document();
        doc.add(Field.Text("contents",
            "The quick brown fox jumps over the lazy dogs"));
        writer.addDocument(doc);
        writer.close();
    }

    public void testStems() throws Exception { // (3)
        IndexSearcher searcher = new IndexSearcher(directory);
        Query query = QueryParser.parse("laziness",
                                        "contents",
                                        porterAnalyzer);
        Hits hits = searcher.search(query);
        assertEquals("lazi", 1, hits.length());

        query = QueryParser.parse("\"fox jumped\"",
                                  "contents",
                                  porterAnalyzer);

        hits = searcher.search(query);
        assertEquals("jump jumps jumped jumping", 1, hits.length());
    }

    // (1) Shows the trouble caused by the hole a removed stop word leaves behind
    public void testExactPhrase() throws Exception {
        IndexSearcher searcher = new IndexSearcher(directory);
        Query query = QueryParser.parse("\"over the lazy\"",
                                        "contents",
                                        porterAnalyzer);

        Hits hits = searcher.search(query);
        assertEquals("exact match not found!", 0, hits.length());
    }

    public void testWithSlop() throws Exception {
        IndexSearcher searcher = new IndexSearcher(directory);

        QueryParser parser = new QueryParser("contents",
                                             porterAnalyzer);
        parser.setPhraseSlop(1); // (2)

        Query query = parser.parse("\"over the lazy\"");

        Hits hits = searcher.search(query);
        assertEquals("hole accounted for", 1, hits.length());
    }

    public static void main(String[] args) throws IOException {
        AnalyzerUtils.displayTokensWithPositions(porterAnalyzer,
            "The quick brown fox jumps over the lazy dogs");
    }
}
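The three tests above turn on how a phrase query compares term positions. The following is a simplified model in plain Java, not Lucene's actual scorer. It assumes that each term occurs at most once in the document and, as the failing exact-phrase test suggests, that the parsed phrase is treated as consecutive terms ("over lazi") with no hole for the removed stop word. The class PhraseSlopSketch and its matches method are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class PhraseSlopSketch {

    /**
     * Query terms get offsets 0, 1, 2, ... For each term the document
     * position is shifted back by that offset. In an exact match every
     * shifted value is identical; the phrase is accepted when the spread
     * between the smallest and largest shifted value is at most slop.
     */
    static boolean matches(Map<String, Integer> docPositions,
                           String[] queryTerms, int slop) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int offset = 0; offset < queryTerms.length; offset++) {
            Integer position = docPositions.get(queryTerms[offset]);
            if (position == null) {
                return false; // term is not in the document at all
            }
            int shifted = position - offset;
            min = Math.min(min, shifted);
            max = Math.max(max, shifted);
        }
        return queryTerms.length > 0 && max - min <= slop;
    }

    public static void main(String[] args) {
        // Positions as produced by the analyzer for the indexed sentence.
        String[] terms = {"quick", "brown", "fox", "jump", "over", "lazi", "dog"};
        int[] positions = {2, 3, 4, 5, 6, 8, 9};
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            index.put(terms[i], positions[i]);
        }

        System.out.println(matches(index, new String[] {"over", "lazi"}, 0)); // false
        System.out.println(matches(index, new String[] {"over", "lazi"}, 1)); // true
        System.out.println(matches(index, new String[] {"fox", "jump"}, 0));  // true
    }
}
```

With slop 0 the hole at position 7 keeps "over" (6) and "lazi" (8) from lining up; a slop of 1 absorbs exactly one missing position, which mirrors testExactPhrase and testWithSlop.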
Here are the problems that show up, and how to deal with them.
(1) As shown, an exact phrase query didn’t match. This is disturbing, of course. Unlike the synonym analyzer situation, using a different analyzer won’t solve the problem. The difficulty lies deeper, inside PhraseQuery: the terms of a phrase must sit side by side, but the index has a hole where the stop word was removed. PhraseQuery does allow a little looseness, called slop. This is covered in greater detail in section 3.4.5; however, it would be unkind to leave without showing a phrase query working. Setting the slop to 1 allows the query to effectively ignore the gap (see (2) in the code).
The value of the phrase slop factor, in a simplified definition for this case, represents how many stop words could be present in the original text between indexed words. Introducing a slop factor greater than zero, however, allows even more inexact phrases to match. In this example, searching for “over lazy” also matches. With stop-word removal in analysis, doing exact phrase matches is, by definition, not possible: the words removed aren’t there, so you can’t know what they were.
The slop factor addresses the main problem with searching using stop-word removal that leaves holes; you can now see the benefit our analyzer provides, thanks to the stemming (see (3) in the code). Both laziness and the phrase “fox jumped” matched our indexed document, allowing users a bit of flexibility in the words used during searching.
8. Language analysis issues
Handling non-English languages is a bit troublesome, Chinese in particular. Search online for material on this topic.
9. Nutch analysis
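For a sensible way to handle stop words, look at how Nutch does it. Searching for a stop word on its own is pointless, because most text contains these words (the, for example). Indexing "the" would also make the index file very large, which makes it even less worthwhile. When a query mixes stop words with non-stop words, the stop words are dropped. But when the query is quoted, as in "the quick brown", things get more subtle. How does the Nutch analyzer handle this? Here is the technique Nutch uses: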
Nutch combines an index-time analysis bigram (grouping two consecutive words as a single token) technique with a query-time optimization of phrases. This results in a far smaller document space considered during searching; for example, far fewer documents have the quick side by side than contain the. Using the internals of Nutch, we created a simple example to demonstrate the Nutch analysis trickery.
Here is the example:
package lia.analysis.nutch;

import net.nutch.analysis.NutchDocumentAnalyzer;
import net.nutch.searcher.QueryTranslator;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.search.Query;

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;

public class NutchExample {
    public static void main(String[] args) throws IOException {
        // Create a NutchDocumentAnalyzer instance
        NutchDocumentAnalyzer analyzer = new NutchDocumentAnalyzer();
        displayTokensWithDetails(analyzer, "The quick brown fox..."); // (1)

        net.nutch.searcher.Query nutchQuery =
            net.nutch.searcher.Query.parse("\"the quick brown\"");
        // A Nutch Query is translated into a Lucene Query instance
        Query query = QueryTranslator.translate(nutchQuery);
        System.out.println("query = " + query);
    }

    /**
     * Copy of AnalyzerUtils.displayTokensWithPositions, except
     * uses the "content" field instead of "contents". Nutch
     * demands "content".
     */
    private static void displayTokensWithDetails(Analyzer analyzer,
                                                 String text) throws IOException {
        Token[] tokens = tokensFromAnalysis(analyzer, text);

        int position = 0;

        for (int i = 0; i < tokens.length; i++) {
            Token token = tokens[i];

            int increment = token.getPositionIncrement();

            // An increment of 0 keeps the token on the current output line
            if (increment > 0) {
                position = position + increment;
                System.out.println();
                System.out.print(position + ": ");
            }

            System.out.print("[" + token.termText() +
                             ":" + token.type() + "] ");
        }
        System.out.println();
    }

    /**
     * Copy of AnalyzerUtils.tokensFromAnalysis, except
     * uses the "content" field instead of "contents". Nutch
     * demands "content".
     */
    private static Token[] tokensFromAnalysis(Analyzer analyzer,
                                              String text) throws IOException {
        TokenStream stream =
            analyzer.tokenStream("content", new StringReader(text));
        ArrayList tokenList = new ArrayList();
        while (true) {
            Token token = stream.next();
            if (token == null) break;

            tokenList.add(token);
        }

        return (Token[]) tokenList.toArray(new Token[0]);
    }

}
(1) displayTokensWithDetails is similar to our previous AnalyzerUtils methods, except Nutch demands the field name content. So, we create a custom one-off version of this utility to inspect Nutch.
The output:
1: [the:<WORD>] [the-quick:gram]
2: [quick:<WORD>]
3: [brown:<WORD>]
4: [fox:<WORD>]
The output shows that "the quick" became the bigram described above. The bigram takes the position of the word "the", and "the" itself was not removed.
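The shape of that output can be reproduced with a few lines of plain Java. This is a simplified sketch and not Nutch's actual implementation: the class CommonBigramSketch, its analyze method, and the one-word common-word list are hypothetical, and only the case visible in the output above (a common word followed by another word) is modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class CommonBigramSketch {

    /**
     * Maps each 1-based position to its terms. Every word is kept; a
     * common word that is followed by another word also gets a bigram
     * "word-next" stored at the same position.
     */
    static Map<Integer, List<String>> analyze(String text, Set<String> commonWords) {
        List<String> words = new ArrayList<>();
        for (String word : text.toLowerCase().split("[^a-z]+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }

        Map<Integer, List<String>> positions = new TreeMap<>();
        for (int i = 0; i < words.size(); i++) {
            List<String> terms = new ArrayList<>();
            terms.add(words.get(i));
            if (commonWords.contains(words.get(i)) && i + 1 < words.size()) {
                terms.add(words.get(i) + "-" + words.get(i + 1));
            }
            positions.put(i + 1, terms);
        }
        return positions;
    }

    public static void main(String[] args) {
        Set<String> commonWords = new HashSet<>(Arrays.asList("the"));
        analyze("The quick brown fox...", commonWords)
                .forEach((pos, terms) -> System.out.println(pos + ": " + terms));
        // prints:
        // 1: [the, the-quick]
        // 2: [quick]
        // 3: [brown]
        // 4: [fox]
    }
}
```

Far fewer documents contain the term "the-quick" than contain "the", which is why a phrase search that uses the bigram has a much smaller set of candidates to examine.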
Handling stop words this way has both gains and costs. The book explains:
Because additional tokens are created during analysis, the index is larger, but the benefit of this trade-off is that searches for exact-phrase queries are much faster.
And there’s a bonus: No terms were discarded during indexing.
During querying, phrases are also analyzed and optimized. The query output (recall from section 3.5.1 that Query’s toString() is handy) of the Lucene Query instance for the query expression "the quick brown" is
query = (+url:"the quick brown"^4.0) (+anchor:"the quick brown"^2.0) (+content:"the-quick quick brown")
A Nutch query expands to search in the url and anchor fields as well, with higher boosts for those fields, using the exact phrase. The content field clause is optimized to only include the bigram of a position that contains an additional <WORD> type token.
Nutch is discussed in more detail later.
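The rewrite of the content clause in the query output can be modeled in the same spirit. This is a hedged sketch, not Nutch's QueryTranslator: the class PhraseOptimizerSketch, its optimize method, and the common-word list are hypothetical, and the rule is inferred only from the single example in the output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PhraseOptimizerSketch {

    /**
     * Rewrites the terms of a phrase for the content field: a common
     * word that is followed by another term is replaced by the bigram
     * "word-next"; every other term is kept unchanged.
     */
    static List<String> optimize(List<String> phraseTerms, Set<String> commonWords) {
        List<String> optimized = new ArrayList<>();
        for (int i = 0; i < phraseTerms.size(); i++) {
            String term = phraseTerms.get(i);
            boolean hasNext = i + 1 < phraseTerms.size();
            if (commonWords.contains(term) && hasNext) {
                optimized.add(term + "-" + phraseTerms.get(i + 1));
            } else {
                optimized.add(term);
            }
        }
        return optimized;
    }

    public static void main(String[] args) {
        Set<String> commonWords = new HashSet<>(Arrays.asList("the"));
        List<String> terms = optimize(Arrays.asList("the", "quick", "brown"), commonWords);
        System.out.println("content:\"" + String.join(" ", terms) + "\"");
        // prints content:"the-quick quick brown"
    }
}
```

The first term of the phrase becomes the rare bigram, so the search starts from the short posting list of "the-quick" instead of the very long one for "the".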
10. Summary
That wraps up chapter 4. As usual, here is the summary:
Analysis, while only a single facet of using Lucene, is the aspect that deserves the most attention and effort. The words that can be searched are those emitted during indexing analysis. Sure, using StandardAnalyzer may do the trick for your needs, and it suffices for many applications. However, it’s important to understand the analysis process. Users who take analysis for granted often run into confusion later when they try to understand why searching for “to be or not to be” returns no results (perhaps due to stop-word removal).
It takes less than one line of code to incorporate an analyzer during indexing. Many sophisticated processes may occur under the covers, such as stop-word removal and stemming of words. Removing words decreases your index size but can have a negative impact on precision querying. Because one size doesn’t fit all when it comes to analysis, you may need to tune the analysis process for your application domain. Lucene’s elegant analyzer architecture decouples each of the processes internal to textual analysis, letting you reuse fundamental building blocks to construct custom analyzers. When you’re working with analyzers, be sure to use our AnalyzerUtils, or something similar, to see first-hand how your text is tokenized. If you’re changing analyzers, you should rebuild your index using the new analyzer so that all documents are analyzed in the same manner.
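The next chapter covers advanced search techniques.
I will take a break tomorrow to review the first four chapters, then start on chapter 5 the day after.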
