Lucene/Solr Dev 1: Indexing Time (Date) Values and Querying Them in Lucene


Let's start with a piece of code and its output:

File indexFile = new File("lucene.index_all");
QueryService service = new QueryService();
IndexReader reader = CBEUtil.getIndexReader(indexFile);
IndexSearcher searcher = new IndexSearcher(reader);
String start = "2010-07-27T14:30:57.78Z", end = "2010-07-27T14:44:49.187Z";
// Raise the clause limit: a range query that expands into one clause per
// matching term can otherwise fail with TooManyClauses on a large index.
BooleanQuery.setMaxClauseCount(999999999);
// Run the same time range through three different query implementations.
service.singleRangeQuery(start, end, searcher);
service.multiRangeQuery(start, end, searcher);
service.queryDateService(indexFile, start, end, "creationTimeStr");

 

Output:

range = creationTimeStr:[2010-07-27T14:30:57.78Z TO 2010-07-27T14:44:49.187Z]
hits = 355210
Single range spent: 1593ms

booleanQuery = +creationTimeStr:[2010-07-27T14:30:57.78Z TO zzzzzzzzz] +creationTimeStr:[000000000 TO 2010-07-27T14:44:49.187Z]
hits = 355210
multi Range spent: 15500ms

query result: total matching documents 355210 total spent 750 milliseconds

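Comparing the results: all three methods find the same 355210 Documents, but singleRangeQuery() takes 1593 ms, multiRangeQuery() takes 15500 ms, and queryDateService() takes only 750 ms. The gap is large: multiRangeQuery() takes about 10 times as long as singleRangeQuery() and about 20 times as long as queryDateService(). A brief analysis follows.

Here is the code for singleRangeQuery():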

public  void singleRangeQuery(String fromDate, String toDate, IndexSearcher indexSearcher) throws IOException {
        long start = System.currentTimeMillis();
        RangeQuery range = new RangeQuery(new Term("creationTimeStr", fromDate), new Term("creationTimeStr", toDate), true);

        System.out.println("range = " + range);
        Hits hits = indexSearcher.search(range);

        long end = System.currentTimeMillis();
        System.out.println("hits = " + hits.length());
        System.out.println("Single range spent: " + (end -start) + "ms");
    }

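This method uses RangeQuery to find the Documents whose Field value lies between the start time and the end time (both bounds inclusive, since the last constructor argument is true). RangeQuery is now deprecated in favor of TermRangeQuery.

Code for multiRangeQuery():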

public void multiRangeQuery(String fromDate, String toDate, IndexSearcher indexSearcher) throws IOException {
        long start = System.currentTimeMillis();
        RangeQuery from = new RangeQuery(new Term("creationTimeStr", fromDate), new Term("creationTimeStr",DateField.MAX_DATE_STRING()), true);
        RangeQuery to = new RangeQuery(new Term("creationTimeStr",DateField.MIN_DATE_STRING()), new Term("creationTimeStr", toDate) , true);
        BooleanQuery booleanQuery = new BooleanQuery();
        booleanQuery.add(new BooleanClause(from, BooleanClause.Occur.MUST));
        booleanQuery.add(new BooleanClause(to,BooleanClause.Occur.MUST));

        System.out.println("booleanQuery = " + booleanQuery);
        Hits hits = indexSearcher.search(booleanQuery);

        long end = System.currentTimeMillis();
        System.out.println("hits = " + hits.length());
        System.out.println("multi Range spent: " + (end -start) + "ms");
    }

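This method uses BooleanQuery to find the Documents whose Field value lies between the start time and the end time. BooleanQuery has an add(Query query, BooleanClause.Occur occur) method, so it can contain several queries; here it contains two RangeQuery instances, one open-ended upward from the start time and one open-ended downward from the end time. As the timings show, this approach is not efficient enough for a real application, and many of the methods it uses are now deprecated as well.

Comparing these two methods explains a common piece of advice about Lucene time ranges: "Date searchers should use a single Range term rather than two".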
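A plausible explanation, which the post itself does not spell out, is that each range has to be expanded over every indexed term it covers, and two open-ended ranges together cover far more terms than one closed range. The following sketch models this with a plain JDK sorted set standing in for the term dictionary. It is a simplified illustration only, not Lucene code; the class `RangeExpansionModel` and its methods are made up for this example.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

class RangeExpansionModel {
    // Builds a sorted stand-in for a term dictionary: n fixed-width terms.
    static NavigableSet<String> terms(int n) {
        NavigableSet<String> dict = new TreeSet<>();
        for (int i = 0; i < n; i++) {
            dict.add(String.format("%09d", i));
        }
        return dict;
    }

    // One closed range [lower, upper]: only the terms inside it are visited.
    static int singleRangeTerms(NavigableSet<String> dict, String lower, String upper) {
        return dict.subSet(lower, true, upper, true).size();
    }

    // Two open-ended ranges combined with MUST: [lower, max] and [min, upper]
    // are each expanded separately before their results are intersected.
    static int twoRangeTerms(NavigableSet<String> dict, String lower, String upper) {
        return dict.tailSet(lower, true).size() + dict.headSet(upper, true).size();
    }

    public static void main(String[] args) {
        NavigableSet<String> dict = terms(1000);
        String lower = String.format("%09d", 400);
        String upper = String.format("%09d", 600);
        System.out.println("single range visits " + singleRangeTerms(dict, lower, upper) + " terms");
        System.out.println("two ranges visit " + twoRangeTerms(dict, lower, upper) + " terms");
    }
}
```

In this model the two-range form always visits the whole dictionary plus the terms of the closed range, however narrow the requested window is, which is consistent with the slowdown measured above.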

Code for queryDateService():

public void queryDateService(File indexFile, String start, String end, String dateField) {
		Count.set();
		IndexReader reader = null;
		try {
			reader = CBEUtil.getIndexReader(indexFile);
			IndexSearcher searcher = new IndexSearcher(reader);
			TermRangeQuery query = new TermRangeQuery(dateField, start, end, true,true);
			TopDocs matches = searcher.search(query, null, 10, new Sort(dateField));
			System.out.println(" query result: total matching documents " + matches.totalHits + " total spent " + (Count.result() + " milliseconds"));
			Count.destory();
		}  catch (IOException e) {
			errorHanlder("",e);
		}
	}

 Judging by the measured results, this method is the most efficient of the three, and it is the one to use in application development.

 

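Summary: indexing and querying Time in Lucene

 Lucene Study Notes (2) already touched on indexing and querying time values in Lucene. Here I summarize Lucene indexing Time (Date) and Lucene Query Time (Date) with a focus on query efficiency:

1. Two approaches to indexing:

Method One: a Time (Date) corresponds to a long value, so it can be indexed with NumericField.

Method Two: convert the Time (Date) to a formatted string and index it with an ordinary Field.

To study this in more detail, we split Method One into two cases (indexing in milliseconds and indexing in seconds).

The indexing code: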

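public Document getDocument() {
		// Capture the time once so that all three fields describe the same instant.
		Date now = new Date();
		// "SSS" zero-pads the milliseconds so that string order matches time order;
		// UTC is set explicitly because the pattern ends in a literal 'Z'.
		SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
		format.setTimeZone(TimeZone.getTimeZone("UTC"));
		Document doc = new Document();
		doc.add(new NumericField("creationTimeSec", Field.Store.YES, true)
				.setLongValue(now.getTime() / 1000));
		doc.add(new NumericField("creationTimeMill", Field.Store.YES, true)
				.setLongValue(now.getTime()));
		doc.add(new Field("creationTimeStr", format.format(now),
				Field.Store.YES, Field.Index.NOT_ANALYZED));
		return doc;
	}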

 As the code shows, three Fields are added to each Document: NumericField/seconds, NumericField/milliseconds, and Field/string.
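One detail of the string approach deserves a check: a string range query compares terms lexicographically, so the formatted timestamps must sort in the same order as the instants they represent. The sketch below uses only the JDK (no Lucene), and its class and method names are made up for illustration. It compares a one-letter millisecond pattern `.S` with a fixed-width `.SSS`, under the assumption that timestamps are formatted in UTC.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class TimestampOrder {
    // Formats an instant with the given pattern, always in UTC.
    static String format(String pattern, long epochMillis) {
        SimpleDateFormat f = new SimpleDateFormat(pattern, Locale.ROOT);
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        return f.format(new Date(epochMillis));
    }

    // True when the string order of the two formatted instants matches their time order.
    static boolean orderPreserved(String pattern, long earlier, long later) {
        return format(pattern, earlier).compareTo(format(pattern, later)) < 0;
    }

    public static void main(String[] args) {
        long earlier = 1280241057078L; // 2010-07-27 14:30:57 UTC plus 78 ms
        long later = 1280241057187L;   // the same second plus 187 ms
        String variable = "yyyy-MM-dd'T'HH:mm:ss.S'Z'";
        String fixed = "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'";
        System.out.println(format(variable, earlier) + " vs " + format(variable, later)
                + " ordered: " + orderPreserved(variable, earlier, later));
        System.out.println(format(fixed, earlier) + " vs " + format(fixed, later)
                + " ordered: " + orderPreserved(fixed, earlier, later));
    }
}
```

With `.S`, 78 ms is written as `.78` and sorts after `.187`, so within the same second a string range can include or exclude the wrong Documents; a zero-padded `.SSS` keeps string order and time order aligned.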

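Querying these indexes likewise takes two methods, one for the numeric fields and one for the string field. Both are shown here: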

 

public void queryDateService(File indexFile, long startDate, long endDate, String dateField) {
		Count.set();
		IndexReader reader = null;
		try {
			reader = CBEUtil.getIndexReader(indexFile);
			IndexSearcher searcher = new IndexSearcher(reader);
			NumericRangeQuery query = NumericRangeQuery.newLongRange(dateField, startDate, endDate, true,true);
			TopDocs matches = searcher.search(query, null, 10, new Sort(dateField));
			System.out.println(" query result: total matching documents " + matches.totalHits + " total spent " + (Count.result() + " milliseconds"));
			Count.destory();
		}  catch (IOException e) {
			errorHanlder("",e);
		}
	} 
	
	public void queryDateService(File indexFile, String start, String end, String dateField) {
		Count.set();
		IndexReader reader = null;
		try {
			reader = CBEUtil.getIndexReader(indexFile);
			IndexSearcher searcher = new IndexSearcher(reader);
			TermRangeQuery query = new TermRangeQuery(dateField, start, end, true,true);
			TopDocs matches = searcher.search(query, null, 10, new Sort(dateField));
			System.out.println(" query result: total matching documents " + matches.totalHits + " total spent " + (Count.result() + " milliseconds"));
			Count.destory();
		}  catch (IOException e) {
			errorHanlder("",e);
		}
	}

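Analysis of the code above:

queryDateService(File indexFile, long startDate, long endDate, String dateField) takes the index file to query, the long value of the start Time (Date), the long value of the end Time (Date), and the name of the Time (Date) Field. The long values may be in milliseconds (new Date().getTime()) or in seconds (new Date().getTime() / 1000), as long as the unit matches the Field being queried.

queryDateService(File indexFile, String start, String end, String dateField) takes the index file to query, the formatted string for the start Time (Date), the formatted string for the end Time (Date), and the name of the Time (Date) Field.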
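To run the same time window through both overloads, the string bounds have to be converted to long values in the right unit. The helper below is a hypothetical sketch (the class `TimeBounds` is not part of the original code). It assumes the timestamps are in UTC, as the trailing `Z` suggests, and that the fraction after the dot is a millisecond count, so `.78` means 78 ms.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

class TimeBounds {
    // Parses a timestamp such as 2010-07-27T14:30:57.78Z into epoch milliseconds.
    static long toMillis(String timestamp) {
        SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.S'Z'", Locale.ROOT);
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return f.parse(timestamp).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable timestamp: " + timestamp, e);
        }
    }

    // Same instant truncated to whole seconds, for a field indexed in seconds.
    static long toSeconds(String timestamp) {
        return toMillis(timestamp) / 1000;
    }

    public static void main(String[] args) {
        String start = "2010-07-27T14:30:57.78Z";
        String end = "2010-07-27T14:44:49.187Z";
        System.out.println("millis:  " + toMillis(start) + " .. " + toMillis(end));
        System.out.println("seconds: " + toSeconds(start) + " .. " + toSeconds(end));
    }
}
```

The millisecond pair would be passed with "creationTimeMill" and the second pair with "creationTimeSec". Note that truncating to seconds widens the window slightly at the start and narrows it at the end, so hit counts can differ a little between the two numeric fields.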

 

The test results are as follows:


[Figure: query time (ms) versus index file size (MB) for the three query modes]
 

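In the figure above, the X axis is the size of the index in MB, which in this experiment grows from 0 MB to 1456 MB; the Y axis is the query time in milliseconds, and the slowest query in this experiment took 1922 ms.

The figure has three curves:

         query by milliseconds range: the index uses NumericField/milliseconds, and the query Time Range is given in milliseconds

         query by seconds range: the index uses NumericField/seconds, and the query Time Range is given in seconds

         query by string range: the index uses Field/string, queried with a String Range

Analysis of the figure:

1.  When the index is around 200 MB, the three modes differ the least, all taking roughly 400 ms

2.  NumericField/milliseconds is the slowest mode and Field/string is the fastest

3.  As the index grows, the query time of the Field/string mode increases the most slowly, which makes it the best Time Range query mode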

 

 

The data behind the figure is given in the following table:

Indexed file size (MB)                   207   416   624   837  1040  1248  1456
Time (ms), query by milliseconds range   453   734   953  1218  1438  1687  1922
Time (ms), query by seconds range        406   563   781  1000  1188  1360  1562
Time (ms), query by string range         344   484   609   765   875  1015  1140

 

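The table and the curves correspond one to one. From these results it is easy to see that converting the Time (Date) to a formatted string, indexing it with an ordinary Field, and querying it with a String range is the best choice.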
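To put numbers on "grows the most slowly", the table can be reduced to an average growth rate per MB of index and a ratio at the largest index size. The following sketch does that arithmetic on the measured values; the class name and helper methods are made up for this illustration.

```java
import java.util.Locale;

class QueryTimeGrowth {
    static final int[] SIZE_MB = {207, 416, 624, 837, 1040, 1248, 1456};
    static final int[] MILLIS_RANGE_MS = {453, 734, 953, 1218, 1438, 1687, 1922};
    static final int[] SECONDS_RANGE_MS = {406, 563, 781, 1000, 1188, 1360, 1562};
    static final int[] STRING_RANGE_MS = {344, 484, 609, 765, 875, 1015, 1140};

    // Average increase in query time per MB of index, first to last measurement.
    static double msPerMb(int[] times) {
        int last = times.length - 1;
        return (double) (times[last] - times[0]) / (SIZE_MB[last] - SIZE_MB[0]);
    }

    // Query time relative to the string range at the largest index size.
    static double ratioToString(int[] times) {
        int last = times.length - 1;
        return (double) times[last] / STRING_RANGE_MS[last];
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "milliseconds range: %.2f ms/MB, %.2fx string%n",
                msPerMb(MILLIS_RANGE_MS), ratioToString(MILLIS_RANGE_MS));
        System.out.printf(Locale.ROOT, "seconds range:      %.2f ms/MB, %.2fx string%n",
                msPerMb(SECONDS_RANGE_MS), ratioToString(SECONDS_RANGE_MS));
        System.out.printf(Locale.ROOT, "string range:       %.2f ms/MB%n",
                msPerMb(STRING_RANGE_MS));
    }
}
```

On these measurements the growth is roughly 1.18 ms/MB for the milliseconds range, 0.93 ms/MB for the seconds range, and 0.64 ms/MB for the string range; at 1456 MB the milliseconds and seconds ranges take about 1.69 and 1.37 times as long as the string range.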

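Conclusion: based on these measurements, the best way to index a Time (Date) and query the resulting index is to convert the Time (Date) to a formatted string, index it with an ordinary Field, and query it with a String range.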