realtime-search (a very fast NRT) with Solr-RA ver 4.1
By Nagendra Nagarajayya http://solr-ra.tgels.com
Apache Solr is a very popular open source search platform that uses Lucene as the underlying search library. Solr-RA is Solr with RankingAlgorithm as the underlying search library. RankingAlgorithm uses Lucene indexing to read and write documents but scores and ranks on its own. Solr-RA uses realtime-search to enable very fast adding, updating, and searching of documents. realtime-search is a very fast NRT (near real time) implementation that enables adding close to 70,000 docs/sec to the index. It offers two granularities: request and intrarequest.
Realtime-search is different from realtime-get, which is a simple lookup by id and needs the transaction log to be enabled. realtime-get does not have search capability. Realtime-search allows full search, so you can search by id, text, location, etc. using boolean, dismax, faceting, and range queries, i.e. there is no change to existing functionality. No new request handlers need to be defined in solrconfig.xml, so all of your existing queries work as is with no changes, except that the results returned are in near real time. Realtime-search also does not need the transaction update log required by realtime-get, so you can turn it off for improved performance. The autoCommit interval can also be increased from the default of 15 secs to an hour for improved performance (remember that commits can slow down your update performance).
Realtime-search is different from soft-commit, which is designed to close and reopen the SolrIndexSearcher object. This is a heavyweight object, and reopening it can limit performance if soft-commits happen more than once per second; see blog: http://searchhub.org/dev/2011/09/07/realtime-get/.
Steps to enable RT
Add <realtime visible="200" granularity="request">true</realtime> and <library>rankingalgorithm</library> to solrconfig.xml.
The visible attribute controls the maximum number of milliseconds before a newly added document becomes visible in search results. Setting this to 0 means newly added documents are visible immediately, but this can affect update performance. Setting it to about 150-200 ms seems to offer the best performance. With visible set to "200", rates as high as 70,000 docs/sec have been seen with Solr-RA 4.0.
The granularity attribute controls the NRT behavior. With the default granularity="request", all search components (search, faceting, highlighting, etc.) see a consistent view of the index and all report the same number of documents. With granularity="intrarequest", the components may each report the most recent changes to the index.
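The difference between the two granularities can be pictured with a small toy model. This is only an illustrative sketch in Python, not Solr-RA code: ToyIndex, handle_request and the add_between_components parameter are hypothetical names, and the "components" are reduced to two document counts.

```python
class ToyIndex:
    """Hypothetical stand-in for an index: an immutable tuple swapped on every add."""

    def __init__(self):
        self._docs = ()

    def add(self, doc):
        # Build a new tuple so that earlier snapshots are never mutated.
        self._docs = self._docs + (doc,)

    def snapshot(self):
        return self._docs


def handle_request(index, granularity, add_between_components=None):
    """Run two toy components and return the doc count each one saw."""
    # "request": pin one snapshot for the whole request.
    # "intrarequest": every component asks for the latest snapshot.
    pinned = index.snapshot() if granularity == "request" else None

    def view():
        return pinned if pinned is not None else index.snapshot()

    search_count = len(view())
    if add_between_components is not None:
        # Simulate a document arriving while the request is being served.
        index.add(add_between_components)
    facet_count = len(view())
    return search_count, facet_count
```

With granularity "request" both counts agree even though a document arrives mid-request; with "intrarequest" the second component already sees the new document.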
Adding documents works as before, except that you don't need to call commit after you add a document. A commit is only needed when the index is empty, to create the first document. After that no commits are needed. See the example below:
Example: curl "http://localhost:8983/solr/update/csv?stream.file=/tmp/x1.csv&encapsulator=%1f"
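The same request can be issued from a script. The following is a minimal sketch using only the Python standard library; the function names build_csv_update_url and add_csv are made up for illustration, and the host, port and CSV path are simply the ones from the curl example. Python writes the escape as %1F, which is equivalent to %1f.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_csv_update_url(csv_path, base="http://localhost:8983/solr"):
    """Build the CSV update URL used in the curl example (no commit parameter)."""
    # 0x1f (unit separator) is the encapsulator character from the curl example.
    params = {"stream.file": csv_path, "encapsulator": "\x1f"}
    # safe="/" keeps the file path readable instead of escaping the slashes.
    return f"{base}/update/csv?" + urlencode(params, safe="/")


def add_csv(csv_path, base="http://localhost:8983/solr"):
    """Send the update request and return the raw XML response text."""
    with urlopen(build_csv_update_url(csv_path, base)) as resp:
        return resp.read().decode("utf-8")
```

Calling add_csv("/tmp/x1.csv") against a running server should return the same kind of XML response that curl prints.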
Search concurrently while the indexing is going on
As before, no changes.
Indexing with the SolrJ code on a 2-core Linux desktop with the default hard disk runs at about 26,000 docs/sec.
Eliminating duplicates in an update:
If documents have a unique id, multiple documents with the same unique id are added, and only the last document added should be visible in searches, make the following changes to solrconfig.xml:
Search for <indexDefaults> and then, under <indexDefaults>, look for <maxBufferedDocs>. Below <maxBufferedDocs>, add <maxBufferedDeleteTerms>1</maxBufferedDeleteTerms>.
Add <deleteDuplicates>true</deleteDuplicates> to solrconfig.xml. See example below:
realtime-search has been implemented by retrieving the IndexReader from the IndexWriter.getReader() method after a document has been added to the index. The addDoc function in DirectUpdateHandler2.java has been modified so that the retrieved IndexReader is stored in a HashMap in SolrCore.java. To avoid locking, non-locking, concurrent, time-managed access is used to make the IndexReader available to the SolrIndexSearchers. The SolrIndexSearchers access this IndexReader instead of the SolrIndexReader and pass it as a parameter to RankingAlgorithm for the search. RankingAlgorithm uses the reader to access the index and returns results that are in near real time, since it is using the updated IndexReader.
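The idea of publishing a fresh reader on a time-managed schedule can be sketched as a toy model. This is not the Solr-RA implementation, which lives in Java inside Solr. ReaderHolder, add_doc, current_reader and the clock parameter are hypothetical names, the "reader" is just an immutable tuple of documents, and the refresh policy shown (publish when the snapshot is at least visible_ms old) is an assumption based on the description of the visible attribute.

```python
import time


class ReaderHolder:
    """Toy model: publish an immutable snapshot at most once per visible_ms."""

    def __init__(self, visible_ms, clock=time.monotonic):
        self._visible_s = visible_ms / 1000.0
        self._clock = clock
        self._docs = []        # writer-side state, analogous to the index
        self._published = ()   # snapshot handed to searchers
        self._last_publish = clock()

    def _maybe_publish(self):
        now = self._clock()
        if now - self._last_publish >= self._visible_s:
            # A single reference assignment: searchers holding the old
            # snapshot keep using it, new searches pick up the new one.
            self._published = tuple(self._docs)
            self._last_publish = now

    def add_doc(self, doc):
        self._docs.append(doc)
        self._maybe_publish()

    def current_reader(self):
        # Refresh on the search side too, so a document never stays
        # invisible longer than visible_ms after it was added.
        self._maybe_publish()
        return self._published
```

In this model visible_ms=0 publishes on every add, which mirrors the trade-off described earlier: immediate visibility at the cost of more frequent reader refreshes.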
realtime-search supports all Solr features such as faceting, filter queries, etc. The facet count can be seen changing as documents are added in the screenshots below, Fig 1 and Fig 2. Fig 1 shows a facet query for “john” on the mbartists index (from the book Solr-14-Enterprise-Search-Server). Fig 2 shows the same query (after clearing the browser cache in Firefox 4.0) once a new artist has been added to the index as below:
curl "http://localhost:8990/solr/mbartists/update/csv?stream.file=/tmp/x.csv&encapsulator=%1f"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">163</int></lst>
</response>
cat /tmp/x:
id,type,a_name,a_name_sort,a_alias,a_type,a_begin_date,a_end_date,a_member_name,a_member_id,a_release_date_latest,a_spell,a_spellPhrase,r_name,r_name_sort,r_name_facetLetter,r_a_name,r_a_id,r_attributes,r_type,r_official,r_lang,r_tracks,r_event_country,r_event_date,r_event_date_earliest,l_name,l_name_sort,l_type,l_begin_date,l_end_date,t_name,t_duration,t_a_id,t_a_name,t_num,t_r_id,t_r_name,t_r_attributes,t_r_tracks,t_trm_lookups,word,includes
Artist:3991866,Artist,John Ab Davis,John Ab Davis,,person,1942-12-29T00:00:00Z,1999-12-10T00:00:00Z,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Fig 1 shows numFound as 3256 and the facet count for “john” as 3256. Fig 2, after adding a doc with curl, shows numFound as 3257 and the facet count for “john” as 3257. The Solr query is as below:
1. The performance is limited by how fast IndexWriter.getReader() returns. This call seems to take the most time, between 2 ms and 70 ms on average. The faster it returns, the faster the indexing.
2. queryResultCache needs to be disabled at the moment to see NRT updates. With the cache enabled, the first time a search is executed the results are cached, and for the next matching search the results are retrieved directly from the cache. A solution would be to look at the docs added and invalidate or update the cache as needed based on the cached query, but at 10,000 docs/sec this would become the new bottleneck and may limit scalability.
3. Setting maxBufferedDeleteTerms=1 will slow down update performance.
4. Setting maxBufferedDeleteTerms=1 will remove duplicates with the same unique id, but search results may still show the old document content if the newly added document has changed content, even though the new content is searchable. For example, if the old doc was <doc><afield>xyz</afield></doc> and it is updated to <doc><afield>abc</afield></doc>, then at query time q=afield:abc matches, but the results may show <doc><afield>xyz</afield></doc>.
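Limitation 2 above (cached results hiding new documents) can be illustrated with a small toy model. This is a sketch only: CachedSearcher is a hypothetical class, its dictionary merely stands in for a query result cache, and matching is reduced to a case-insensitive substring test.

```python
class CachedSearcher:
    """Toy searcher over a shared, growing list of document strings."""

    def __init__(self, docs, use_cache):
        self.docs = docs
        self.use_cache = use_cache
        self.cache = {}  # query term -> cached hit count

    def num_found(self, term):
        if self.use_cache and term in self.cache:
            # Cache hit: the index is not consulted, so new docs are missed.
            return self.cache[term]
        count = sum(1 for d in self.docs if term in d.lower())
        if self.use_cache:
            self.cache[term] = count
        return count


docs = ["John Lennon"]
cached = CachedSearcher(docs, use_cache=True)
uncached = CachedSearcher(docs, use_cache=False)

first = cached.num_found("john")     # computed from the index, then cached
docs.append("John Ab Davis")         # a new document arrives
stale = cached.num_found("john")     # served from cache, misses the new doc
fresh = uncached.num_found("john")   # recomputed, sees the new doc
```

Here the cached searcher keeps returning the count from before the add, while the uncached one reflects the new document.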
You can download Solr-RA from here:
The realtime-search in Solr-RA works well and allows concurrent search with indexing in parallel, without closing the IndexSearchers or clearing the cache, providing the ability to offer searches in near real time. The indexing performance observed on an Intel Linux system with 48GB memory is about 70,000 tps (new document adds).
1. Solr and Lucene are registered trademarks of the Apache Software Foundation.