Highlighting in Lucene

Posted by an anonymous technical user, 2021-01-07 13:41

Suppose we search for "一步一步跟我学习lucene" ("learn Lucene with me, step by step"). In the results page, a search engine styles the user's query terms differently from the surrounding text. This visual distinction between normal text and matched input is what we call highlighting.

The benefits:

  • visually, it makes it easy to spot the text blocks that match the search;
  • the interface is friendlier to the user.

Lucene ships a highlighter module that produces this effect.

The highlighter marks up query keywords in the search results.

The highlight package contains everything needed to highlight query matches on a results page. The Highlighter class is the core component of the package; together with the Fragmenter, fragment Scorer, and Formatter classes it lets you customize how highlights are rendered.

Example program

Here I reuse the directory-file index built in an earlier post.

[java]
package com.lucene.search.util;

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.util.BytesRef;

public class HighlighterTest {
    public static void main(String[] args) {
        IndexSearcher searcher;
        TopDocs docs;
        ExecutorService service = Executors.newCachedThreadPool();
        try {
            searcher = SearchUtil.getMultiSearcher("index", service);
            Term term = new Term("content", new BytesRef("lucene"));
            TermQuery termQuery = new TermQuery(term);
            docs = SearchUtil.getScoreDocsByPerPage(1, 30, searcher, termQuery);
            ScoreDoc[] hits = docs.scoreDocs;
            QueryScorer scorer = new QueryScorer(termQuery);
            // Highlight format is <B>keyword</B>; this is also the default
            SimpleHTMLFormatter simpleHtmlFormatter = new SimpleHTMLFormatter("<B>", "</B>");
            Highlighter highlighter = new Highlighter(simpleHtmlFormatter, scorer);
            // Maximum number of characters per returned fragment
            highlighter.setTextFragmenter(new SimpleFragmenter(20));
            Analyzer analyzer = new StandardAnalyzer();
            for (int i = 0; i < hits.length; i++) {
                Document doc = searcher.doc(hits[i].doc);
                String str = highlighter.getBestFragment(analyzer, "content", doc.get("content"));
                System.out.println(str);
            }
        } catch (IOException | InvalidTokenOffsetsException e) {
            e.printStackTrace();
        } finally {
            service.shutdown();
        }
    }
}
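The example above depends on the author's SearchUtil helper and an index that already exists on disk. As a self-contained alternative, the sketch below (the class name, sample text, and use of RAMDirectory are my own additions, assuming a Lucene 5.x-era classpath like the rest of this article) indexes a single document in memory and highlights the match:

```java
package com.lucene.search.util;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.RAMDirectory;

public class InMemoryHighlightDemo {

    /** Index one sample document in memory and return the highlighted best fragment. */
    public static String bestFragment() throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        RAMDirectory dir = new RAMDirectory();
        // Index a single document with a stored, analyzed "content" field
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("content", "learn lucene step by step", Store.YES));
            writer.addDocument(doc);
        }
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        TermQuery query = new TermQuery(new Term("content", "lucene"));
        TopDocs docs = searcher.search(query, 10);
        // Same wiring as the example above: formatter + scorer -> highlighter
        Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter("<B>", "</B>"),
                new QueryScorer(query));
        highlighter.setTextFragmenter(new SimpleFragmenter(50)); // fragment size in characters
        String stored = searcher.doc(docs.scoreDocs[0].doc).get("content");
        return highlighter.getBestFragment(analyzer, "content", stored);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(bestFragment()); // learn <B>lucene</B> step by step
    }
}
```

Note that RAMDirectory was later deprecated in Lucene 8 in favour of ByteBuffersDirectory; on a 5.x classpath it is the simplest in-memory option.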

How Lucene's highlighter works:

  • A Highlighter is built from a Formatter and a Scorer: the formatter defines how highlights are rendered, and the scorer decides how each token is scored.

The scoring algorithm first looks up the weight of each query term for the document, then walks the token stream of the text, counting occurrences of the query terms and recording their positions (needed later for highlighting). The per-token scoring logic is:

[java]
public float getTokenScore() {
    position += posIncAtt.getPositionIncrement(); // track the current token's position
    String termText = termAtt.toString();
    WeightedSpanTerm weightedSpanTerm;
    if ((weightedSpanTerm = fieldWeightedSpanTerms.get(termText)) == null) {
        return 0;
    }
    if (weightedSpanTerm.positionSensitive &&
            !weightedSpanTerm.checkPosition(position)) {
        return 0;
    }
    float score = weightedSpanTerm.getWeight(); // the term's weight
    // found a query term - is it unique in this doc?
    if (!foundTerms.contains(termText)) { // count each term only once
        totalScore += score;
        foundTerms.add(termText);
    }
    return score;
}

The formatter works as follows: for each token group in the text, if the scorer's total score is greater than zero (i.e. a query term occurs in that group), the text is wrapped as preTag + matched text + postTag.

The exact code:

[java]
public String highlightTerm(String originalText, TokenGroup tokenGroup) {
    if (tokenGroup.getTotalScore() <= 0) {
        return originalText;
    }
    // Allocate StringBuilder with the right number of characters from the
    // beginning, to avoid char[] allocations in the middle of appends.
    StringBuilder returnBuffer = new StringBuilder(preTag.length() + originalText.length() + postTag.length());
    returnBuffer.append(preTag);
    returnBuffer.append(originalText);
    returnBuffer.append(postTag);
    return returnBuffer.toString();
}

The default tags are "<B>" and "</B>".

  • Using the scorer and formatter, the Highlighter analyzes the document; highlighting goes through getBestTextFragments(TokenStream tokenStream, String text, boolean mergeContiguousFragments, int maxNumFragments), which proceeds as follows:
  1. the scorer initializes the positions at which the query terms occur, and a PositionIncrementAttribute is attached to the token stream to record each token's position;
  2. the token stream is walked; if the analyzed text exceeds the configured length limit, the overly long remainder is not analyzed further;
  3. when a query term is found, the surrounding text is cut into a fragment (the size is determined by the value passed to setTextFragmenter), and the formatter's highlightTerm method rebuilds the fragment text;
  4. the text between this match and the next token is carried over unchanged into the output.
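The steps above can be exercised directly, because getBestTextFragments only needs a TokenStream and the raw text, not an index. A minimal sketch (the class name and sample text are mine; a Lucene 5.x-era classpath is assumed) that keeps only the fragments in which a query term actually scored:

```java
package com.lucene.search.util;

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.TextFragment;

public class BestFragmentsDemo {

    /** Return every scoring fragment of the text, with matches wrapped in <B> tags. */
    public static List<String> highlight(String text) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        TermQuery query = new TermQuery(new Term("content", "lucene"));
        Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter("<B>", "</B>"),
                new QueryScorer(query));
        highlighter.setTextFragmenter(new SimpleFragmenter(30)); // roughly 30-char fragments
        TokenStream tokenStream = analyzer.tokenStream("content", text);
        // mergeContiguousFragments = true, return at most 3 fragments
        TextFragment[] fragments = highlighter.getBestTextFragments(tokenStream, text, true, 3);
        List<String> result = new ArrayList<>();
        for (TextFragment fragment : fragments) {
            if (fragment != null && fragment.getScore() > 0) { // keep only fragments containing a match
                result.add(fragment.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        for (String s : highlight("lucene is a search library; many teams pick lucene for full-text search")) {
            System.out.println(s);
        }
    }
}
```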

The query utility class

[java]
package com.lucene.search.util;

import java.io.File;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Set;
import java.util.concurrent.ExecutorService;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

/** Lucene index query utilities
 * @author lenovo
 */
public class SearchUtil {

    /** Get an IndexSearcher over every index directory found under a parent path
     * @param parentPath parent directory whose children are index directories
     * @param service executor used for parallel search
     * @return
     * @throws IOException
     */
    public static IndexSearcher getIndexSearcherByParentPath(String parentPath, ExecutorService service) throws IOException {
        File[] files = new File(parentPath).listFiles();
        IndexReader[] readers = new IndexReader[files.length];
        for (int i = 0; i < files.length; i++) {
            readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get(files[i].getPath())));
        }
        return new IndexSearcher(new MultiReader(readers), service);
    }

    /** Multi-directory, multi-threaded search
     * @param parentPath parent index directory
     * @param service executor used for parallel search
     * @return
     * @throws IOException
     */
    public static IndexSearcher getMultiSearcher(String parentPath, ExecutorService service) throws IOException {
        File file = new File(parentPath);
        File[] files = file.listFiles();
        IndexReader[] readers = new IndexReader[files.length];
        for (int i = 0; i < files.length; i++) {
            readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get(files[i].getPath())));
        }
        MultiReader multiReader = new MultiReader(readers);
        return new IndexSearcher(multiReader, service);
    }

    /** Get an IndexReader for an index path
     * @param indexPath
     * @return
     * @throws IOException
     */
    public static DirectoryReader getIndexReader(String indexPath) throws IOException {
        return DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)));
    }

    /** Get an IndexSearcher for an index path
     * @param indexPath
     * @param service
     * @return
     * @throws IOException
     */
    public static IndexSearcher getIndexSearcherByIndexPath(String indexPath, ExecutorService service) throws IOException {
        IndexReader reader = getIndexReader(indexPath);
        return new IndexSearcher(reader, service);
    }

    /** If the index directory may have changed, use this to obtain a fresh IndexSearcher;
     * unchanged segments are reused, so this costs fewer resources than reopening from scratch
     * @param oldSearcher
     * @param service
     * @return
     * @throws IOException
     */
    public static IndexSearcher getIndexSearcherOpenIfChanged(IndexSearcher oldSearcher, ExecutorService service) throws IOException {
        DirectoryReader reader = (DirectoryReader) oldSearcher.getIndexReader();
        DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
        if (newReader == null) { // index unchanged, keep the old searcher
            return oldSearcher;
        }
        return new IndexSearcher(newReader, service);
    }

    /** Combine queries like SQL IN (any clause may match)
     * @param querys
     * @return
     */
    public static Query getMultiQueryLikeSqlIn(Query... querys) {
        BooleanQuery query = new BooleanQuery();
        for (Query subQuery : querys) {
            query.add(subQuery, Occur.SHOULD);
        }
        return query;
    }

    /** Combine queries like SQL AND (every clause must match)
     * @param querys
     * @return
     */
    public static Query getMultiQueryLikeSqlAnd(Query... querys) {
        BooleanQuery query = new BooleanQuery();
        for (Query subQuery : querys) {
            query.add(subQuery, Occur.MUST);
        }
        return query;
    }

    /** Build a query for a single field
     * @param field field name
     * @param fieldType field type ("int", "double", "float", "long", or text)
     * @param queryStr query string; for range queries, formatted as "min|max"
     * @param range whether to build a range query
     * @return
     */
    public static Query getQuery(String field, String fieldType, String queryStr, boolean range) {
        Query q = null;
        try {
            if (queryStr != null && !"".equals(queryStr)) {
                if (range) {
                    String[] strs = queryStr.split("\\|");
                    if ("int".equals(fieldType)) {
                        int min = Integer.parseInt(strs[0]);
                        int max = Integer.parseInt(strs[1]);
                        q = NumericRangeQuery.newIntRange(field, min, max, true, true);
                    } else if ("double".equals(fieldType)) {
                        double min = Double.parseDouble(strs[0]);
                        double max = Double.parseDouble(strs[1]);
                        q = NumericRangeQuery.newDoubleRange(field, min, max, true, true);
                    } else if ("float".equals(fieldType)) {
                        float min = Float.parseFloat(strs[0]);
                        float max = Float.parseFloat(strs[1]);
                        q = NumericRangeQuery.newFloatRange(field, min, max, true, true);
                    } else if ("long".equals(fieldType)) {
                        long min = Long.parseLong(strs[0]);
                        long max = Long.parseLong(strs[1]);
                        q = NumericRangeQuery.newLongRange(field, min, max, true, true);
                    }
                } else {
                    if ("int".equals(fieldType)) {
                        q = NumericRangeQuery.newIntRange(field, Integer.parseInt(queryStr), Integer.parseInt(queryStr), true, true);
                    } else if ("double".equals(fieldType)) {
                        q = NumericRangeQuery.newDoubleRange(field, Double.parseDouble(queryStr), Double.parseDouble(queryStr), true, true);
                    } else if ("float".equals(fieldType)) {
                        q = NumericRangeQuery.newFloatRange(field, Float.parseFloat(queryStr), Float.parseFloat(queryStr), true, true);
                    } else {
                        Analyzer analyzer = new StandardAnalyzer();
                        q = new QueryParser(field, analyzer).parse(queryStr);
                    }
                }
            } else {
                q = new MatchAllDocsQuery();
            }
            System.out.println(q);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return q;
    }

    /** Exact term query for a field and value
     * @param fieldName
     * @param fieldValue
     * @return
     */
    public static Query getQuery(String fieldName, Object fieldValue) {
        Term term = new Term(fieldName, new BytesRef(fieldValue.toString()));
        return new TermQuery(term);
    }

    /** Load the full document for a docID
     * @param searcher
     * @param docID
     * @return
     * @throws IOException
     */
    public static Document getDefaultFullDocument(IndexSearcher searcher, int docID) throws IOException {
        return searcher.doc(docID);
    }

    /** Load only the listed fields of a document
     * @param searcher
     * @param docID
     * @param listField fields to load
     * @return
     * @throws IOException
     */
    public static Document getDocumentByListField(IndexSearcher searcher, int docID, Set<String> listField) throws IOException {
        return searcher.doc(docID, listField);
    }

    /** Paged query
     * @param page current page number (1-based)
     * @param perPage results per page
     * @param searcher searcher to query with
     * @param query query condition
     * @return
     * @throws IOException
     */
    public static TopDocs getScoreDocsByPerPage(int page, int perPage, IndexSearcher searcher, Query query) throws IOException {
        if (query == null) {
            System.out.println(" Query is null return null ");
            return null;
        }
        ScoreDoc before = null;
        if (page != 1) {
            // Fetch up to the end of the previous page and remember the last hit
            TopDocs docsBefore = searcher.search(query, (page - 1) * perPage);
            ScoreDoc[] scoreDocs = docsBefore.scoreDocs;
            if (scoreDocs.length > 0) {
                before = scoreDocs[scoreDocs.length - 1];
            }
        }
        return searcher.searchAfter(before, query, perPage);
    }

    public static TopDocs getScoreDocs(IndexSearcher searcher, Query query) throws IOException {
        return searcher.search(query, getMaxDocId(searcher));
    }

    /** Highlight a field in all matching documents
     * @param searcher
     * @param field field to search and highlight
     * @param keyword keyword to highlight
     * @param preTag opening highlight tag
     * @param postTag closing highlight tag
     * @param fragmentSize maximum characters per fragment
     * @return
     * @throws IOException
     * @throws InvalidTokenOffsetsException
     */
    public static String[] highlighter(IndexSearcher searcher, String field, String keyword, String preTag, String postTag, int fragmentSize) throws IOException, InvalidTokenOffsetsException {
        Term term = new Term(field, new BytesRef(keyword));
        TermQuery termQuery = new TermQuery(term);
        TopDocs docs = getScoreDocs(searcher, termQuery);
        ScoreDoc[] hits = docs.scoreDocs;
        QueryScorer scorer = new QueryScorer(termQuery);
        // preTag + keyword + postTag; <B>keyword</B> is the default format
        SimpleHTMLFormatter simpleHtmlFormatter = new SimpleHTMLFormatter(preTag, postTag);
        Highlighter highlighter = new Highlighter(simpleHtmlFormatter, scorer);
        // Maximum number of characters per returned fragment
        highlighter.setTextFragmenter(new SimpleFragmenter(fragmentSize));
        Analyzer analyzer = new StandardAnalyzer();
        String[] result = new String[hits.length];
        for (int i = 0; i < result.length; i++) {
            Document doc = searcher.doc(hits[i].doc);
            result[i] = highlighter.getBestFragment(analyzer, field, doc.get(field));
        }
        return result;
    }

    /** Count the documents in the index; equivalent to the hit count of a MatchAllDocsQuery
     * @param searcher
     * @return
     */
    public static int getMaxDocId(IndexSearcher searcher) {
        return searcher.getIndexReader().maxDoc();
    }
}