Near Real Time Search ver 3.x

Near Real Time Search With Solr-RA ver 3.6/3.5/3.4/3.3/3.2

By
Nagendra Nagarajayya
http://solr-ra.tgels.com

Summary

Apache Solr is a popular open source search platform that uses Lucene as its underlying search library. Solr-RA is Solr with RankingAlgorithm as the underlying search library. RankingAlgorithm uses Lucene indexing to read and write documents but scores and ranks results on its own. Solr-RA allows documents to be added to the index while searches run concurrently, with new documents becoming searchable in near real time: very close to real time, but without a real-time guarantee. When a document is added with Solr-RA, no commit is needed, the index searchers are not closed, and the cache is not cleared. Because there is no commit, indexing is fast and searches can proceed concurrently. An update rate of 5000 docs in 498 ms with one concurrent searcher has been observed while adding 1 million documents in batches of 5000 on a dual-core Intel Linux system with a 2GB heap.

Steps to enable RT

Add 
	<realtime visible="200">true</realtime> 
	<library>rankingalgorithm</library>

to solrconfig.xml

The visible attribute (in milliseconds) can be as low as 0, meaning an added document becomes visible almost immediately, or as high as you want. With visible set to "200", performance as high as 5000 docs in 498 ms has been seen with Solr-RA 3.5.

To enable real time faceting:
Add 
	<realtime visible="200" facet="true">true</realtime> 
	<library>rankingalgorithm</library>

Note: with real time faceting at high update rates, you may experience performance problems.

Adding documents

Documents are added as before, except that you do not need to call commit after adding a document. A commit is only needed when the index is empty, to create the first document. After that, no commits are needed. See the example below:


Example:

curl "http://localhost:8983/solr/update/csv?stream.file=/tmp/x1.csv&encapsulator=%1f"

Note: 

1. A script generated the 500 mbartists records and called curl as above to load them.
2. The performance measurements were done after 10000 records were added.
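
The curl call above can also be issued from a script. The following is a minimal Python sketch, assuming the same host, port, and CSV path as the example; the helper name build_update_url is hypothetical and not part of Solr-RA. It only builds the request URL, and no commit parameter is added.

```python
from urllib.parse import urlencode


def build_update_url(base, csv_path, encapsulator="\x1f"):
    """Build a CSV update URL like the curl example; no commit is requested."""
    params = urlencode(
        {"stream.file": csv_path, "encapsulator": encapsulator},
        safe="/",  # keep the file path readable instead of %2F
    )
    return f"{base}/update/csv?{params}"


# Host, port, and path are taken from the curl example above.
url = build_update_url("http://localhost:8983/solr", "/tmp/x1.csv")
print(url)
```

Against a running instance, the URL can be fetched with urllib.request.urlopen(url), which mirrors what curl does in the example.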

Search concurrently while the indexing is going on

As before, no changes.

http://localhost:8983/solr/twitter/select/?q=john ab180027&fl=score

Performance

Indexing:

Indexing about 10000 mbartist entries with curl, visible attribute set to 200 (after warmup)

time:

real	0m37.254s
user	0m8.827s
sys	0m24.072s


Time measure of shell script running time without curl [create mbartist entries, etc]:

real	0m36.218s
user	0m9.147s
sys	0m24.816s

So time to load 10000 documents = 37.254 - 36.218 = ~1 sec (about 10000 docs/sec)
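
The subtraction can be reproduced from the two "real" lines reported by time. This is a small Python sketch; the function name parse_real_seconds is hypothetical, and the inputs are the two measurements listed above.

```python
import re


def parse_real_seconds(line):
    """Convert a time(1) line such as 'real 0m37.254s' to seconds."""
    m = re.search(r"(\d+)m([\d.]+)s", line)
    if m is None:
        raise ValueError(f"not a time(1) line: {line!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


with_curl = parse_real_seconds("real\t0m37.254s")     # script including curl
without_curl = parse_real_seconds("real\t0m36.218s")  # script without curl
load_seconds = with_curl - without_curl               # time spent loading
docs_per_second = 10000 / load_seconds

print(f"{load_seconds:.3f} s, {docs_per_second:.0f} docs/sec")
```

The difference is slightly over one second, which is where the rounded figure of about 10000 docs/sec comes from.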

Concurrent search during load:

http://192.168.1.126:8983/solr/twitter/select/?q=john ab180027&fl=score


Eliminating duplicates in an update:

If documents have a unique id, multiple documents with the same unique id are added, and only the most recently added document should be visible in searches, add the following to solrconfig.xml:

Search for <indexDefaults>, and under it look for <maxBufferedDocs>. Below <maxBufferedDocs>, add

<maxBufferedDeleteTerms>1</maxBufferedDeleteTerms>

or

add <deleteDuplicates>true</deleteDuplicates> to solrconfig.xml. See example below:

       <realtime visible="200">true</realtime>
       <library>rankingalgorithm</library>
       <deleteDuplicates>true</deleteDuplicates>

Implementation

Near real time search is implemented by retrieving an IndexReader from the IndexWriter.getReader() method after a document has been added to the index. The addDoc function in DirectUpdateHandler2.java has been modified so that the retrieved IndexReader is stored in a HashMap in SolrCore.java. To avoid locking, non-locking, concurrent, time-managed access is used to make the IndexReader available to the SolrIndexSearchers. The SolrIndexSearchers access this IndexReader instead of the SolrIndexReader and pass it as a parameter to RankingAlgorithm for the search. RankingAlgorithm uses the reader to access the index and returns results that are near real time, because it is using the updated IndexReader.
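
The idea of time-managed, non-locking access can be illustrated with a short sketch. This is not the Solr-RA code: the class TimedReaderHolder and its methods are hypothetical, open_reader stands in for IndexWriter.getReader(), and it is an assumption that the visible attribute acts as the minimum interval, in milliseconds, between reader refreshes. The sketch also assumes a single writer thread.

```python
import time


class TimedReaderHolder:
    """Publishes a fresh reader at most once per `visible_ms` milliseconds.

    Hypothetical illustration; `open_reader` stands in for getReader().
    """

    def __init__(self, open_reader, visible_ms, clock=time.monotonic):
        self._open_reader = open_reader
        self._visible = visible_ms / 1000.0
        self._clock = clock
        self._reader = open_reader()
        self._published_at = clock()

    def after_add(self):
        """Call after a document is added; refresh only if the window elapsed."""
        now = self._clock()
        if now - self._published_at >= self._visible:
            # One reference assignment; searchers never wait on a lock.
            self._reader = self._open_reader()
            self._published_at = now

    def current(self):
        """Searchers read the most recently published reader without locking."""
        return self._reader
```

With visible_ms set to 0, every call to after_add publishes a new reader, which corresponds to documents becoming visible almost immediately. In Java, a volatile field or an AtomicReference would play the role of the plain attribute used here.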

The NRT implementation supports faceting, filter queries, etc. The facet count can be seen changing as documents are added in the screenshots below, Fig 1 and Fig 2. Fig 1 shows a facet query for “john” on the mbartists index (from the book Solr 1.4 Enterprise Search Server). Fig 2 shows the same query (after clearing the browser cache in Firefox 4.0) once a new artist has been added to the index as below:

curl "http://localhost:8990/solr/mbartists/update/csv?stream.file=/tmp/x.csv&encapsulator=%1f"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">163</int></lst>
</response>

cat /tmp/x.csv:
id,type,a_name,a_name_sort,a_alias,a_type,a_begin_date,a_end_date,a_member_name,a_member_id,a_release_date_latest,a_spell,a_spellPhrase,r_name,r_name_sort,r_name_facetLetter,r_a_name,r_a_id,r_attributes,r_type,r_official,r_lang,r_tracks,r_event_country,r_event_date,r_event_date_earliest,l_name,l_name_sort,l_type,l_begin_date,l_end_date,t_name,t_duration,t_a_id,t_a_name,t_num,t_r_id,t_r_name,t_r_attributes,t_r_tracks,t_trm_lookups,word,includes
Artist:3991866,Artist,John Ab Davis,John Ab Davis,,person,1942-12-29T00:00:00Z,1999-12-10T00:00:00Z,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Fig 1 shows numFound as 3256 and the facet count for “john” as 3256. Fig 2, taken after adding a doc with curl, shows numFound as 3257 and the facet count for “john” as 3257. The Solr query is as below:

http://192.168.1.126:8990/solr/mbartists/select/?q=john&facet=on&facet.field=a_name&facet.field=a_type&fl=score
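
The query above repeats the facet.field parameter once per field. A minimal Python sketch of building such a URL follows; the helper name build_facet_url is hypothetical, and the host, core, and field names are the ones used in the query above. Passing a list of pairs to urlencode keeps repeated parameters and also encodes spaces in multi-word queries.

```python
from urllib.parse import urlencode


def build_facet_url(base, core, query, facet_fields):
    """Build a select URL with one facet.field parameter per field."""
    params = [("q", query), ("facet", "on")]
    params += [("facet.field", field) for field in facet_fields]
    params.append(("fl", "score"))
    return f"{base}/{core}/select/?{urlencode(params)}"


facet_url = build_facet_url(
    "http://192.168.1.126:8990/solr", "mbartists", "john", ["a_name", "a_type"]
)
print(facet_url)
```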

Fig 1: facet query for “john” before the add (images: NRT12 Fig11.png, NRT12 Fig12.png)

Fig 2: the same query after the add (images: NRT12 Fig21.png, NRT12 Fig22.png)

Caveat

1. Performance is limited by how fast IndexWriter.getReader() returns. This call seems to take the most time, averaging between 2 ms and 70 ms. The faster it returns, the faster the indexing.

2. Caching currently needs to be disabled to see NRT updates. With the cache enabled, the results of the first execution of a search are cached, and the next matching search is served directly from the cache. A possible solution is to examine the docs being added and invalidate or update cache entries based on the cached query, but at 10000 docs/sec this would become the new bottleneck and may limit scalability.

3. Setting maxBufferedDeleteTerms=1 will slow down update performance.

4. Setting maxBufferedDeleteTerms=1 will remove duplicates with the same unique id, but search results may still show the old document content if the newly added document has changed content, even though the new content is searchable. For example:

if the most recent doc has afield set to 
        <doc><afield>abc</afield></doc> 

and the older doc it replaces was 
        <doc><afield>xyz</afield></doc>

then at query time q=afield:abc matches, but the results may show 
        <doc><afield>xyz</afield></doc>
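
The cache invalidation idea from caveat 2 can be sketched as follows. This is a hypothetical illustration, not Solr's cache and not an implemented Solr-RA feature: the class TermQueryCache, its methods, and the restriction to single-term queries are all assumptions made to keep the example small.

```python
class TermQueryCache:
    """Caches results per single-term query and drops entries hit by new docs.

    Hypothetical sketch of the invalidation idea; not Solr's query cache.
    """

    def __init__(self):
        self._results = {}  # (field, term) -> cached list of doc ids

    def get(self, field, term):
        return self._results.get((field, term))

    def put(self, field, term, doc_ids):
        self._results[(field, term)] = list(doc_ids)

    def on_document_added(self, doc):
        """Invalidate every cached query that the new document could match."""
        for field, text in doc.items():
            for term in str(text).lower().split():
                self._results.pop((field, term), None)
```

Each added document costs work proportional to its number of terms, which is why this step could become the bottleneck at high update rates.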

Download


You can download Solr-RA from here:
http://solr-ra.tgels.com


Conclusion

Near real time search in Solr-RA works well: it allows searches to run concurrently with indexing, without closing the IndexSearchers or clearing the cache. The indexing performance observed on a 2-core Intel system with Fedora Linux 12 is about 10000 tps (new document adds) with visible set to 200ms.


Note:
1. Solr and Lucene are registered trademarks of the Apache Software Foundation.
2. http://wiki.apache.org/solr/NearRealtimeSearch_3.x