Lucene/Solr Dev 1: Lucene Indexing Time (Date) & Lucene Query Time (Date)
Let's start with a piece of code and its output:
    File indexFile = new File("lucene.index_all");
    QueryService service = new QueryService();
    IndexReader reader = CBEUtil.getIndexReader(indexFile);
    IndexSearcher searcher = new IndexSearcher(reader);
    String start = "2010-07-27T14:30:57.78Z", end = "2010-07-27T14:44:49.187Z";
    BooleanQuery.setMaxClauseCount(999999999);
    service.singleRangeQuery(start, end, searcher);
    service.multiRangeQuery(start, end, searcher);
    service.queryDateService(indexFile, start, end, "creationTimeStr");
Output:
    range = creationTimeStr:[2010-07-27T14:30:57.78Z TO 2010-07-27T14:44:49.187Z]
    hits = 355210
    Single range spent: 1593ms
    booleanQuery = +creationTimeStr:[2010-07-27T14:30:57.78Z TO zzzzzzzzz] +creationTimeStr:[000000000 TO 2010-07-27T14:44:49.187Z]
    hits = 355210
    multi Range spent: 15500ms
     query result: total matching documents 355210 total spent 750 milliseconds
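Comparing the results: all three calls find the same 355210 Documents, but singleRangeQuery() takes 1593 ms, multiRangeQuery() takes 15500 ms, and queryDateService() takes only 750 ms. The gap is large: multiRangeQuery() is roughly 10 times slower than singleRangeQuery() and roughly 20 times slower than queryDateService(). A brief analysis follows.
Here is the code for singleRangeQuery():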
    public void singleRangeQuery(String fromDate, String toDate, IndexSearcher indexSearcher)
            throws IOException {
        long start = System.currentTimeMillis();
        RangeQuery range = new RangeQuery(new Term("creationTimeStr", fromDate),
                new Term("creationTimeStr", toDate), true);
        System.out.println("range = " + range);
        Hits hits = indexSearcher.search(range);
        long end = System.currentTimeMillis();
        System.out.println("hits = " + hits.length());
        System.out.println("Single range spent: " + (end - start) + "ms");
    }
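This method uses RangeQuery to find the Documents whose field value lies between the start time and the end time (both bounds are included, since the inclusive flag is true). RangeQuery is now deprecated.
The code for multiRangeQuery():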
    public void multiRangeQuery(String fromDate, String toDate, IndexSearcher indexSearcher)
            throws IOException {
        long start = System.currentTimeMillis();
        RangeQuery from = new RangeQuery(new Term("creationTimeStr", fromDate),
                new Term("creationTimeStr", DateField.MAX_DATE_STRING()), true);
        RangeQuery to = new RangeQuery(new Term("creationTimeStr", DateField.MIN_DATE_STRING()),
                new Term("creationTimeStr", toDate), true);
        BooleanQuery booleanQuery = new BooleanQuery();
        booleanQuery.add(new BooleanClause(from, BooleanClause.Occur.MUST));
        booleanQuery.add(new BooleanClause(to, BooleanClause.Occur.MUST));
        System.out.println("booleanQuery = " + booleanQuery);
        Hits hits = indexSearcher.search(booleanQuery);
        long end = System.currentTimeMillis();
        System.out.println("hits = " + hits.length());
        System.out.println("multi Range spent: " + (end - start) + "ms");
    }
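This method uses a BooleanQuery to find the Documents whose field value lies between the start time and the end time. BooleanQuery has an add(Query query, BooleanClause.Occur occur) method, so it can hold several queries; here it holds two RangeQuery clauses, each open-ended on one side. As the timings show, this approach is too slow to meet the application's needs, and many of the APIs it uses are likewise deprecated now.
Comparing the two methods above illustrates a known rule for Lucene time ranges: "Date searchers should use a single Range term rather than two".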
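One plausible reason for the gap, assuming the old RangeQuery expands into the indexed terms that fall inside its range (which would also explain why the driver code raises BooleanQuery.setMaxClauseCount): a single bounded range only visits the terms between both bounds, while two half-open ranges each visit everything on one side before the results are intersected. The sketch below models a sorted term dictionary with a TreeSet. It is a toy model, not Lucene code, and every class and method name in it is made up for illustration.

```java
import java.util.Locale;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Toy model of a sorted term dictionary; illustrative only, not Lucene API. */
class RangeEnumerationSketch {

    /** Number of terms visited by one closed range [from, to]. */
    static int termsVisitedSingle(NavigableSet<String> terms, String from, String to) {
        return terms.subSet(from, true, to, true).size();
    }

    /** Number of terms visited by [from, max] plus [min, to], as in the two-clause query. */
    static int termsVisitedDouble(NavigableSet<String> terms, String from, String to) {
        return terms.tailSet(from, true).size() + terms.headSet(to, true).size();
    }

    /** Builds n zero-padded terms ("0000", "0001", ...) so string order equals numeric order. */
    static NavigableSet<String> sampleTerms(int n) {
        NavigableSet<String> terms = new TreeSet<>();
        for (int i = 0; i < n; i++) {
            terms.add(String.format(Locale.ROOT, "%04d", i));
        }
        return terms;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = sampleTerms(1000);
        System.out.println("single range visits: " + termsVisitedSingle(terms, "0400", "0499"));
        System.out.println("two ranges visit:    " + termsVisitedDouble(terms, "0400", "0499"));
    }
}
```

In this model the two-clause form always visits at least as many terms as the single range, and the difference grows with the number of terms outside the window.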
The code for queryDateService():
    public void queryDateService(File indexFile, String start, String end, String dateField) {
        Count.set();
        IndexReader reader = null;
        try {
            reader = CBEUtil.getIndexReader(indexFile);
            IndexSearcher searcher = new IndexSearcher(reader);
            TermRangeQuery query = new TermRangeQuery(dateField, start, end, true, true);
            TopDocs matches = searcher.search(query, null, 10, new Sort(dateField));
            System.out.println(" query result: total matching documents " + matches.totalHits
                    + " total spent " + (Count.result() + " milliseconds"));
            Count.destory();
        } catch (IOException e) {
            errorHanlder("", e);
        }
    }
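Judging by the measurements, this method is the most efficient, and it is the one to use in application development.
Summary: indexing and querying Time (Date) in Lucene
Lucene Study Notes (2) touched on how Lucene indexes and queries time values. Here I summarize Lucene indexing Time(Date) & Lucene Query Time(Date), focusing on query efficiency:
1 Two approaches to indexing:
Method One: a Time(Date) corresponds to a long value, so it can be indexed with a NumericField;
Method Two: convert the Time(Date) to a formatted string and index it with an ordinary Field.
To study this in more detail, Method One is split into two cases (indexing by milliseconds and indexing by seconds).
The indexing code: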
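    public Document getDocument() {
        Document doc = new Document();
        // Method One, seconds: epoch seconds stored as a long
        doc.add(new NumericField("creationTimeSec", Field.Store.YES, true)
                .setLongValue(new Date().getTime() / 1000));
        // Method One, milliseconds: epoch milliseconds stored as a long
        doc.add(new NumericField("creationTimeMill", Field.Store.YES, true)
                .setLongValue(new Date().getTime()));
        // Method Two: formatted string indexed as a single, un-analyzed term.
        // Note: ".S" does not zero-pad milliseconds (78 ms is written as ".78").
        doc.add(new Field("creationTimeStr",
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.S'Z'").format(new Date()),
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }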
As the code shows, each Document gets three Fields: a NumericField in seconds, a NumericField in milliseconds, and a string Field.
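One detail of the string Field is worth checking. TermRangeQuery compares terms as strings, so the format has to sort lexicographically in time order. The pattern above uses a single `S`, which does not zero-pad milliseconds (the sample timestamp `14:30:57.78Z` means 78 ms), and it appends a literal `Z` without putting the formatter into UTC. The following is a minimal sketch of a sortable variant, assuming a fixed-width `SSS` and UTC are acceptable for the index; the class and method names are illustrative.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Compares the unpadded pattern from the indexing code with a fixed-width variant. */
class SortableDateFormatSketch {

    static final String UNPADDED = "yyyy-MM-dd'T'HH:mm:ss.S'Z'";
    static final String PADDED = "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'";

    /** Formats epoch milliseconds with the given pattern, always in UTC. */
    static String format(String pattern, long epochMillis) {
        SimpleDateFormat f = new SimpleDateFormat(pattern, Locale.ROOT);
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        return f.format(new Date(epochMillis));
    }

    /** True if the earlier instant also sorts first as a string under this pattern. */
    static boolean sortsInTimeOrder(String pattern, long earlier, long later) {
        return format(pattern, earlier).compareTo(format(pattern, later)) < 0;
    }

    public static void main(String[] args) {
        long earlier = 78L;  // 78 ms after the epoch
        long later = 187L;   // 187 ms after the epoch
        System.out.println(format(UNPADDED, earlier) + " vs " + format(UNPADDED, later)
                + " in time order: " + sortsInTimeOrder(UNPADDED, earlier, later));
        System.out.println(format(PADDED, earlier) + " vs " + format(PADDED, later)
                + " in time order: " + sortsInTimeOrder(PADDED, earlier, later));
    }
}
```

With the unpadded pattern, a range bound such as `.78Z` can include or exclude terms within the same second incorrectly, so range results at the edges of the window are worth verifying before relying on them.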
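Querying this index likewise takes two methods, one for the numeric fields and one for the string field. Both are listed here: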
    public void queryDateService(File indexFile, long startDate, long endDate, String dateField) {
        Count.set();
        IndexReader reader = null;
        try {
            reader = CBEUtil.getIndexReader(indexFile);
            IndexSearcher searcher = new IndexSearcher(reader);
            NumericRangeQuery query = NumericRangeQuery.newLongRange(dateField,
                    startDate, endDate, true, true);
            TopDocs matches = searcher.search(query, null, 10, new Sort(dateField));
            System.out.println(" query result: total matching documents " + matches.totalHits
                    + " total spent " + (Count.result() + " milliseconds"));
            Count.destory();
        } catch (IOException e) {
            errorHanlder("", e);
        }
    }

    public void queryDateService(File indexFile, String start, String end, String dateField) {
        Count.set();
        IndexReader reader = null;
        try {
            reader = CBEUtil.getIndexReader(indexFile);
            IndexSearcher searcher = new IndexSearcher(reader);
            TermRangeQuery query = new TermRangeQuery(dateField, start, end, true, true);
            TopDocs matches = searcher.search(query, null, 10, new Sort(dateField));
            System.out.println(" query result: total matching documents " + matches.totalHits
                    + " total spent " + (Count.result() + " milliseconds"));
            Count.destory();
        } catch (IOException e) {
            errorHanlder("", e);
        }
    }
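Notes on the code above:
queryDateService(File indexFile, long startDate, long endDate, String dateField) takes the index file to query, the long value of the start Time(Date), the long value of the end Time(Date), and the name of the Time(Date) Field. The long values may be in milliseconds (new Date().getTime()) or in seconds (new Date().getTime() / 1000), as long as they match the unit of the Field being queried.
queryDateService(File indexFile, String start, String end, String dateField) takes the index file to query, the formatted string for the start Time(Date), the formatted string for the end Time(Date), and the name of the Time(Date) Field.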
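To run both overloads over the same time window, the string bounds have to be converted to epoch values. Here is a minimal sketch, assuming the strings were produced by the indexing pattern above and should be read as UTC (the original code does not set a time zone, so this is an assumption). The helper names are hypothetical. Parsing with the same pattern matters: it reads `.78` as 78 ms, whereas an ISO-8601 parser would read it as 780 ms.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

/** Converts formatted bounds into the long bounds expected by the numeric overload. */
class DateBoundsSketch {

    /** Same pattern as the indexing code, so ".78" parses as 78 ms. */
    static final String INDEX_PATTERN = "yyyy-MM-dd'T'HH:mm:ss.S'Z'";

    /** Epoch milliseconds, for a Field like creationTimeMill. */
    static long toEpochMillis(String text) {
        SimpleDateFormat f = new SimpleDateFormat(INDEX_PATTERN, Locale.ROOT);
        f.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: bounds are UTC
        try {
            return f.parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not in index format: " + text, e);
        }
    }

    /** Epoch seconds, for a Field like creationTimeSec (mirrors getTime() / 1000). */
    static long toEpochSeconds(String text) {
        return toEpochMillis(text) / 1000;
    }

    public static void main(String[] args) {
        String start = "2010-07-27T14:30:57.78Z";
        String end = "2010-07-27T14:44:49.187Z";
        System.out.println("millis:  " + toEpochMillis(start) + " .. " + toEpochMillis(end));
        System.out.println("seconds: " + toEpochSeconds(start) + " .. " + toEpochSeconds(end));
    }
}
```

The resulting pairs can be passed as startDate and endDate to the long overload, with "creationTimeMill" or "creationTimeSec" as the Field name.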
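The test results follow:
In the chart, the X axis is the size of the index in MB, which grows from 0 MB at the start of the experiment to 1456 MB at the end. The Y axis is the query time in milliseconds; the slowest query in the experiment took 1922 ms.
The chart has three curves:
query by milliseconds range: indexed as NumericField/milliseconds, queried with a Time Range in milliseconds
query by seconds range: indexed as NumericField/seconds, queried with a Time Range in seconds
query by string range: indexed as Field/string, queried with a String Range
Observations from the chart:
1. When the index is around 200 MB, the three approaches are closest, all at roughly 400 ms (344 to 453 ms)
2. NumericField/milliseconds is the slowest to query, and Field/string is the fastest
3. As the index grows, the query time of Field/string grows the most slowly, which makes it the best Time Range query mode in this test
The data behind the chart: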
Index size (MB) | 207 | 416 | 624 | 837 | 1040 | 1248 | 1456 |
Query time in ms (query by milliseconds range) | 453 | 734 | 953 | 1218 | 1438 | 1687 | 1922 |
Query time in ms (query by seconds range) | 406 | 563 | 781 | 1000 | 1188 | 1360 | 1562 |
Query time in ms (query by string range) | 344 | 484 | 609 | 765 | 875 | 1015 | 1140 |
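The table and the chart correspond one to one. From these results it is clear that converting the Time(Date) to a formatted string, indexing it with an ordinary Field, and querying it with a String range is the best choice.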
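The growth rates can be read straight off the table. The sketch below computes the average increase in query time per MB of index between the first and last measurement for each curve. The numbers are copied from the table; the class and method names are illustrative.

```java
import java.util.Locale;

/** Average query-time growth per MB of index, computed from the table above. */
class QueryTimeGrowthSketch {

    static final int[] SIZE_MB = {207, 416, 624, 837, 1040, 1248, 1456};
    static final int[] MILLIS_RANGE_MS = {453, 734, 953, 1218, 1438, 1687, 1922};
    static final int[] SECONDS_RANGE_MS = {406, 563, 781, 1000, 1188, 1360, 1562};
    static final int[] STRING_RANGE_MS = {344, 484, 609, 765, 875, 1015, 1140};

    /** (last time - first time) / (last size - first size), in ms per MB. */
    static double msPerMb(int[] timesMs) {
        int last = timesMs.length - 1;
        return (double) (timesMs[last] - timesMs[0]) / (SIZE_MB[last] - SIZE_MB[0]);
    }

    public static void main(String[] args) {
        System.out.println(String.format(Locale.ROOT,
                "milliseconds range: %.2f ms/MB", msPerMb(MILLIS_RANGE_MS)));
        System.out.println(String.format(Locale.ROOT,
                "seconds range:      %.2f ms/MB", msPerMb(SECONDS_RANGE_MS)));
        System.out.println(String.format(Locale.ROOT,
                "string range:       %.2f ms/MB", msPerMb(STRING_RANGE_MS)));
    }
}
```

This gives about 1.18 ms/MB for the milliseconds range, 0.93 ms/MB for the seconds range, and 0.64 ms/MB for the string range, so on this data set the string range grows at a little over half the rate of the milliseconds range.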
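Conclusion: the best approach for indexing Time(Date) and querying the resulting index is to convert the Time(Date) to a formatted string, index it with an ordinary Field, and query it with a String range.
End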