Highlighting in Lucene

Posted by an anonymous technical user, 2021-01-07 13:41

Suppose we search for "一步一步跟我学习lucene" ("learn Lucene with me, step by step"). In the results page, a search engine styles the user's query terms differently from the surrounding text. This visual distinction between normal text and matched input is what we call highlighting.

The benefits:

  • visually, it makes it easy to spot the text blocks that match the search;
  • the interface is friendlier to the user.

Lucene ships a highlighter module that produces this effect.

The highlighter marks up query keywords in the search results.

The highlight package contains everything needed to highlight query matches on a results page. The Highlighter class is the core component of the package; together with the Fragmenter, fragment Scorer, and Formatter classes it lets you customize how highlights are rendered.

Example program

Here I reuse the directory-file index built in an earlier post.

[java]
package com.lucene.search.util;

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.util.BytesRef;

public class HighlighterTest {
    public static void main(String[] args) {
        IndexSearcher searcher;
        TopDocs docs;
        ExecutorService service = Executors.newCachedThreadPool();
        try {
            searcher = SearchUtil.getMultiSearcher("index", service);
            Term term = new Term("content", new BytesRef("lucene"));
            TermQuery termQuery = new TermQuery(term);
            docs = SearchUtil.getScoreDocsByPerPage(1, 30, searcher, termQuery);
            ScoreDoc[] hits = docs.scoreDocs;
            QueryScorer scorer = new QueryScorer(termQuery);
            // Highlight format is <B>keyword</B>; this is also the default
            SimpleHTMLFormatter simpleHtmlFormatter = new SimpleHTMLFormatter("<B>", "</B>");
            Highlighter highlighter = new Highlighter(simpleHtmlFormatter, scorer);
            // Maximum number of characters per returned fragment
            highlighter.setTextFragmenter(new SimpleFragmenter(20));
            Analyzer analyzer = new StandardAnalyzer();
            for (int i = 0; i < hits.length; i++) {
                Document doc = searcher.doc(hits[i].doc);
                String str = highlighter.getBestFragment(analyzer, "content", doc.get("content"));
                System.out.println(str);
            }
        } catch (IOException | InvalidTokenOffsetsException e) {
            e.printStackTrace();
        } finally {
            service.shutdown();
        }
    }
}
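The example above depends on the author's SearchUtil helper and an index that already exists on disk. As a self-contained alternative, the sketch below (the class name, sample text, and use of RAMDirectory are my own additions, assuming a Lucene 5.x-era classpath like the rest of this article) indexes a single document in memory and highlights the match:

```java
package com.lucene.search.util;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.RAMDirectory;

public class InMemoryHighlightDemo {

    /** Index one sample document in memory and return the highlighted best fragment. */
    public static String bestFragment() throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        RAMDirectory dir = new RAMDirectory();
        // Index a single document with a stored, analyzed "content" field
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("content", "learn lucene step by step", Store.YES));
            writer.addDocument(doc);
        }
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        TermQuery query = new TermQuery(new Term("content", "lucene"));
        TopDocs docs = searcher.search(query, 10);
        // Same wiring as the example above: formatter + scorer -> highlighter
        Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter("<B>", "</B>"),
                new QueryScorer(query));
        highlighter.setTextFragmenter(new SimpleFragmenter(50)); // fragment size in characters
        String stored = searcher.doc(docs.scoreDocs[0].doc).get("content");
        return highlighter.getBestFragment(analyzer, "content", stored);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(bestFragment()); // learn <B>lucene</B> step by step
    }
}
```

Note that RAMDirectory was later deprecated in Lucene 8 in favour of ByteBuffersDirectory; on a 5.x classpath it is the simplest in-memory option.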

How Lucene's highlighter works:

  • A Highlighter is built from a Formatter and a Scorer: the formatter defines how highlights are rendered, and the scorer decides how each token is scored.

The scoring algorithm first looks up the weight of each query term for the document, then walks the token stream of the text, counting occurrences of the query terms and recording their positions (needed later for highlighting). The per-token scoring logic is:

[java]
public float getTokenScore() {
    position += posIncAtt.getPositionIncrement(); // track the current token's position
    String termText = termAtt.toString();
    WeightedSpanTerm weightedSpanTerm;
    if ((weightedSpanTerm = fieldWeightedSpanTerms.get(termText)) == null) {
        return 0;
    }
    if (weightedSpanTerm.positionSensitive &&
            !weightedSpanTerm.checkPosition(position)) {
        return 0;
    }
    float score = weightedSpanTerm.getWeight(); // the term's weight
    // found a query term - is it unique in this doc?
    if (!foundTerms.contains(termText)) { // count each term only once
        totalScore += score;
        foundTerms.add(termText);
    }
    return score;
}

The formatter works as follows: for each token group in the text, if the scorer's total score is greater than zero (i.e. a query term occurs in that group), the text is wrapped as preTag + matched text + postTag.

The exact code:

[java]
public String highlightTerm(String originalText, TokenGroup tokenGroup) {
    if (tokenGroup.getTotalScore() <= 0) {
        return originalText;
    }
    // Allocate StringBuilder with the right number of characters from the
    // beginning, to avoid char[] allocations in the middle of appends.
    StringBuilder returnBuffer = new StringBuilder(preTag.length() + originalText.length() + postTag.length());
    returnBuffer.append(preTag);
    returnBuffer.append(originalText);
    returnBuffer.append(postTag);
    return returnBuffer.toString();
}

The default tags are "<B>" and "</B>".

  • Using the scorer and formatter, the Highlighter analyzes the document; highlighting goes through getBestTextFragments(TokenStream tokenStream, String text, boolean mergeContiguousFragments, int maxNumFragments), which proceeds as follows:
  1. the scorer initializes the positions at which the query terms occur, and a PositionIncrementAttribute is attached to the token stream to record each token's position;
  2. the token stream is walked; if the analyzed text exceeds the configured length limit, the overly long remainder is not analyzed further;
  3. when a query term is found, the surrounding text is cut into a fragment (the size is determined by the value passed to setTextFragmenter), and the formatter's highlightTerm method rebuilds the fragment text;
  4. the text between this match and the next token is carried over unchanged into the output.
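The steps above can be exercised directly, because getBestTextFragments only needs a TokenStream and the raw text, not an index. A minimal sketch (the class name and sample text are mine; a Lucene 5.x-era classpath is assumed) that keeps only the fragments in which a query term actually scored:

```java
package com.lucene.search.util;

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.TextFragment;

public class BestFragmentsDemo {

    /** Return every scoring fragment of the text, with matches wrapped in <B> tags. */
    public static List<String> highlight(String text) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        TermQuery query = new TermQuery(new Term("content", "lucene"));
        Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter("<B>", "</B>"),
                new QueryScorer(query));
        highlighter.setTextFragmenter(new SimpleFragmenter(30)); // roughly 30-char fragments
        TokenStream tokenStream = analyzer.tokenStream("content", text);
        // mergeContiguousFragments = true, return at most 3 fragments
        TextFragment[] fragments = highlighter.getBestTextFragments(tokenStream, text, true, 3);
        List<String> result = new ArrayList<>();
        for (TextFragment fragment : fragments) {
            if (fragment != null && fragment.getScore() > 0) { // keep only fragments containing a match
                result.add(fragment.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        for (String s : highlight("lucene is a search library; many teams pick lucene for full-text search")) {
            System.out.println(s);
        }
    }
}
```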

The query utility class

[java]
package com.lucene.search.util;

import java.io.File;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Set;
import java.util.concurrent.ExecutorService;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

/** Lucene index query utilities
 * @author lenovo
 */
public class SearchUtil {

    /** Get an IndexSearcher over every index directory found under a parent path
     * @param parentPath parent directory whose children are index directories
     * @param service executor used for parallel search
     * @return
     * @throws IOException
     */
    public static IndexSearcher getIndexSearcherByParentPath(String parentPath, ExecutorService service) throws IOException {
        File[] files = new File(parentPath).listFiles();
        IndexReader[] readers = new IndexReader[files.length];
        for (int i = 0; i < files.length; i++) {
            readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get(files[i].getPath())));
        }
        return new IndexSearcher(new MultiReader(readers), service);
    }

    /** Multi-directory, multi-threaded search
     * @param parentPath parent index directory
     * @param service executor used for parallel search
     * @return
     * @throws IOException
     */
    public static IndexSearcher getMultiSearcher(String parentPath, ExecutorService service) throws IOException {
        File file = new File(parentPath);
        File[] files = file.listFiles();
        IndexReader[] readers = new IndexReader[files.length];
        for (int i = 0; i < files.length; i++) {
            readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get(files[i].getPath())));
        }
        MultiReader multiReader = new MultiReader(readers);
        return new IndexSearcher(multiReader, service);
    }

    /** Get an IndexReader for an index path
     * @param indexPath
     * @return
     * @throws IOException
     */
    public static DirectoryReader getIndexReader(String indexPath) throws IOException {
        return DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)));
    }

    /** Get an IndexSearcher for an index path
     * @param indexPath
     * @param service
     * @return
     * @throws IOException
     */
    public static IndexSearcher getIndexSearcherByIndexPath(String indexPath, ExecutorService service) throws IOException {
        IndexReader reader = getIndexReader(indexPath);
        return new IndexSearcher(reader, service);
    }

    /** If the index directory may have changed, use this to obtain a fresh IndexSearcher;
     * unchanged segments are reused, so this costs fewer resources than reopening from scratch
     * @param oldSearcher
     * @param service
     * @return
     * @throws IOException
     */
    public static IndexSearcher getIndexSearcherOpenIfChanged(IndexSearcher oldSearcher, ExecutorService service) throws IOException {
        DirectoryReader reader = (DirectoryReader) oldSearcher.getIndexReader();
        DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
        if (newReader == null) { // index unchanged, keep the old searcher
            return oldSearcher;
        }
        return new IndexSearcher(newReader, service);
    }

    /** Combine queries like SQL IN (any clause may match)
     * @param querys
     * @return
     */
    public static Query getMultiQueryLikeSqlIn(Query... querys) {
        BooleanQuery query = new BooleanQuery();
        for (Query subQuery : querys) {
            query.add(subQuery, Occur.SHOULD);
        }
        return query;
    }

    /** Combine queries like SQL AND (every clause must match)
     * @param querys
     * @return
     */
    public static Query getMultiQueryLikeSqlAnd(Query... querys) {
        BooleanQuery query = new BooleanQuery();
        for (Query subQuery : querys) {
            query.add(subQuery, Occur.MUST);
        }
        return query;
    }

    /** Build a query for a single field
     * @param field field name
     * @param fieldType field type ("int", "double", "float", "long", or text)
     * @param queryStr query string; for range queries, formatted as "min|max"
     * @param range whether to build a range query
     * @return
     */
    public static Query getQuery(String field, String fieldType, String queryStr, boolean range) {
        Query q = null;
        try {
            if (queryStr != null && !"".equals(queryStr)) {
                if (range) {
                    String[] strs = queryStr.split("\\|");
                    if ("int".equals(fieldType)) {
                        int min = Integer.parseInt(strs[0]);
                        int max = Integer.parseInt(strs[1]);
                        q = NumericRangeQuery.newIntRange(field, min, max, true, true);
                    } else if ("double".equals(fieldType)) {
                        double min = Double.parseDouble(strs[0]);
                        double max = Double.parseDouble(strs[1]);
                        q = NumericRangeQuery.newDoubleRange(field, min, max, true, true);
                    } else if ("float".equals(fieldType)) {
                        float min = Float.parseFloat(strs[0]);
                        float max = Float.parseFloat(strs[1]);
                        q = NumericRangeQuery.newFloatRange(field, min, max, true, true);
                    } else if ("long".equals(fieldType)) {
                        long min = Long.parseLong(strs[0]);
                        long max = Long.parseLong(strs[1]);
                        q = NumericRangeQuery.newLongRange(field, min, max, true, true);
                    }
                } else {
                    if ("int".equals(fieldType)) {
                        q = NumericRangeQuery.newIntRange(field, Integer.parseInt(queryStr), Integer.parseInt(queryStr), true, true);
                    } else if ("double".equals(fieldType)) {
                        q = NumericRangeQuery.newDoubleRange(field, Double.parseDouble(queryStr), Double.parseDouble(queryStr), true, true);
                    } else if ("float".equals(fieldType)) {
                        q = NumericRangeQuery.newFloatRange(field, Float.parseFloat(queryStr), Float.parseFloat(queryStr), true, true);
                    } else {
                        Analyzer analyzer = new StandardAnalyzer();
                        q = new QueryParser(field, analyzer).parse(queryStr);
                    }
                }
            } else {
                q = new MatchAllDocsQuery();
            }
            System.out.println(q);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return q;
    }

    /** Exact term query for a field and value
     * @param fieldName
     * @param fieldValue
     * @return
     */
    public static Query getQuery(String fieldName, Object fieldValue) {
        Term term = new Term(fieldName, new BytesRef(fieldValue.toString()));
        return new TermQuery(term);
    }

    /** Load the full document for a docID
     * @param searcher
     * @param docID
     * @return
     * @throws IOException
     */
    public static Document getDefaultFullDocument(IndexSearcher searcher, int docID) throws IOException {
        return searcher.doc(docID);
    }

    /** Load only the listed fields of a document
     * @param searcher
     * @param docID
     * @param listField fields to load
     * @return
     * @throws IOException
     */
    public static Document getDocumentByListField(IndexSearcher searcher, int docID, Set<String> listField) throws IOException {
        return searcher.doc(docID, listField);
    }

    /** Paged query
     * @param page current page number (1-based)
     * @param perPage results per page
     * @param searcher searcher to query with
     * @param query query condition
     * @return
     * @throws IOException
     */
    public static TopDocs getScoreDocsByPerPage(int page, int perPage, IndexSearcher searcher, Query query) throws IOException {
        if (query == null) {
            System.out.println(" Query is null return null ");
            return null;
        }
        ScoreDoc before = null;
        if (page != 1) {
            // Fetch up to the end of the previous page and remember the last hit
            TopDocs docsBefore = searcher.search(query, (page - 1) * perPage);
            ScoreDoc[] scoreDocs = docsBefore.scoreDocs;
            if (scoreDocs.length > 0) {
                before = scoreDocs[scoreDocs.length - 1];
            }
        }
        return searcher.searchAfter(before, query, perPage);
    }

    public static TopDocs getScoreDocs(IndexSearcher searcher, Query query) throws IOException {
        return searcher.search(query, getMaxDocId(searcher));
    }

    /** Highlight a field in all matching documents
     * @param searcher
     * @param field field to search and highlight
     * @param keyword keyword to highlight
     * @param preTag opening highlight tag
     * @param postTag closing highlight tag
     * @param fragmentSize maximum characters per fragment
     * @return
     * @throws IOException
     * @throws InvalidTokenOffsetsException
     */
    public static String[] highlighter(IndexSearcher searcher, String field, String keyword, String preTag, String postTag, int fragmentSize) throws IOException, InvalidTokenOffsetsException {
        Term term = new Term(field, new BytesRef(keyword));
        TermQuery termQuery = new TermQuery(term);
        TopDocs docs = getScoreDocs(searcher, termQuery);
        ScoreDoc[] hits = docs.scoreDocs;
        QueryScorer scorer = new QueryScorer(termQuery);
        // preTag + keyword + postTag; <B>keyword</B> is the default format
        SimpleHTMLFormatter simpleHtmlFormatter = new SimpleHTMLFormatter(preTag, postTag);
        Highlighter highlighter = new Highlighter(simpleHtmlFormatter, scorer);
        // Maximum number of characters per returned fragment
        highlighter.setTextFragmenter(new SimpleFragmenter(fragmentSize));
        Analyzer analyzer = new StandardAnalyzer();
        String[] result = new String[hits.length];
        for (int i = 0; i < result.length; i++) {
            Document doc = searcher.doc(hits[i].doc);
            result[i] = highlighter.getBestFragment(analyzer, field, doc.get(field));
        }
        return result;
    }

    /** Count the documents in the index; equivalent to the hit count of a MatchAllDocsQuery
     * @param searcher
     * @return
     */
    public static int getMaxDocId(IndexSearcher searcher) {
        return searcher.getIndexReader().maxDoc();
    }
}