While Apache Solr (built on top of the Lucene library) is a powerful text analysis and search engine, it has also been pitched as a NoSQL solution, notably in a great talk by Yonik.
I was looking for a solution that serves fast reads by ID (in a millisecond or so) and also allows running analytics on the data in real time. I have been using Solr/Lucene as a search engine for low-latency searches for more than 3 years, and had seen that even Lucene 2.9 is really fast at serving get-by-ID requests from memory. When Solr 4.1 came out, various benchmarks showed that it cuts the memory footprint by nearly 30% and that its primary key lookup throughput is very good. Mike McCandless maintains a great benchmark that runs daily on Lucene trunk, and the numbers are impressive.
Encouraged by all this, yesterday I set out to profile Solr 4.1 with 100 million entries. I had access to only a single machine, though one with very good specifications. I ran the Solr node on this machine and ran JMeter on my desktop as the client, going over the network within the same subnet.
Here is my test environment
Solr 4.1 Server -
- Single 16 core, 64GB physical machine
- Running Solr 4.1 (setup as cloud, but running only one node)
- ZooKeeper running on different machine
- Solr index on MMapDirectory
- No writes, just reads
- 100 million entries
- Document Structure – 128 bit random number as unique ID, one multivalued field serving as a category field, one field as zip+4 and 13 other string fields.
- Each document had a category field with 4-5 random categories out of 100 categories
- Each document had a zip+4 drawn from a pool of 45 million zip+4 values
- The rest of the fields were just random string values
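For readers who want to reproduce a similar data set, here is a minimal Python sketch of a generator for documents with the shape described above. It is not the script I used. The hex encoding of the ID, the `cat<N>` category names, the randomly generated zip+4 (instead of a pool of 45 million values) and the 16-character field length are all assumptions made for illustration.

```python
import random

NUM_CATEGORIES = 100     # from the post: 100 possible categories
NUM_STRING_FIELDS = 13   # fields t1..t13 in the schema


def make_document(rng: random.Random) -> dict:
    """Build one synthetic document shaped like the test data described above."""
    doc = {
        # 128-bit random number as the unique ID (hex encoding is an assumption)
        "id": f"{rng.getrandbits(128):032x}",
        # 4 or 5 distinct categories out of 100 ("cat<N>" naming is hypothetical)
        "category": [
            f"cat{n}"
            for n in rng.sample(range(NUM_CATEGORIES), rng.choice([4, 5]))
        ],
        # zip+4 generated at random here; the post drew from 45 million values
        "zip": f"{rng.randrange(10**5):05d}-{rng.randrange(10**4):04d}",
    }
    # 13 random string fields; the length of 16 is an arbitrary choice
    for i in range(1, NUM_STRING_FIELDS + 1):
        doc[f"t{i}"] = "".join(rng.choices("abcdefghijklmnopqrstuvwxyz", k=16))
    return doc
```

Passing a seeded `random.Random` makes the data set repeatable, which helps when the same IDs later have to be written to the file that the client threads cycle through.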
Here is the fields snippet from schema.xml
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="zip" type="string" indexed="true" stored="true"/>
<field name="category" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="t1" type="string" indexed="true" stored="true"/>
<field name="t2" type="string" indexed="true" stored="true"/>
<field name="t3" type="string" indexed="true" stored="true"/>
<field name="t4" type="string" indexed="true" stored="true"/>
<field name="t5" type="string" indexed="true" stored="true"/>
<field name="t6" type="string" indexed="true" stored="true"/>
<field name="t7" type="string" indexed="true" stored="true"/>
<field name="t8" type="string" indexed="true" stored="true"/>
<field name="t9" type="string" indexed="true" stored="true"/>
<field name="t10" type="string" indexed="true" stored="true"/>
<field name="t11" type="string" indexed="true" stored="true"/>
<field name="t12" type="string" indexed="true" stored="true"/>
<field name="t13" type="string" indexed="true" stored="true"/>
And the snippets from solrconfig.xml
<directoryFactory name="DirectoryFactory"
class="${solr.directoryFactory:solr.MMapDirectoryFactory}"/>
Everything else was pretty much left at the defaults.
Client
- JMeter
- Running on a desktop
- Single client with 10, 20 and 30 threads, each cycling through a randomized file
1. 10 threads, sending get by ID as fast as they can
2. 20 threads, sending get by ID as fast as they can
3. 30 threads, sending get by ID as fast as they can
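The three runs above were driven by JMeter, but the idea is easy to sketch in Python for anyone without a JMeter setup. The following is a rough stand-in, not the test plan I ran. The URL, the collection name and the use of a plain `/select` query on the `id` field are assumptions, and `run_load` and `fetch_by_id` are hypothetical helper names.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint: host, port and collection name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def fetch_by_id(doc_id: str) -> bytes:
    """Look up one document by its unique key with a standard Solr query."""
    params = urlencode({"q": f'id:"{doc_id}"', "wt": "json"})
    with urlopen(f"{SOLR_SELECT_URL}?{params}") as resp:
        return resp.read()


def timed_call(fetch, doc_id) -> float:
    """Return the latency of one lookup in milliseconds."""
    start = time.perf_counter()
    fetch(doc_id)
    return (time.perf_counter() - start) * 1000.0


def run_load(ids, threads, fetch=fetch_by_id) -> dict:
    """Issue one lookup per ID from a pool of worker threads; return stats."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        latencies = list(pool.map(lambda i: timed_call(fetch, i), ids))
    elapsed = time.perf_counter() - started
    cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    return {
        "requests": len(latencies),
        "throughput_per_s": len(latencies) / elapsed,
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }
```

Calling `run_load(ids, threads=10)`, then 20, then 30 with the same list of IDs mirrors the three runs. The `fetch` parameter is injectable so the harness can be tried against a stub before pointing it at a real server.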
The results were impressive. As expected, as the number of threads increased, throughput stayed almost flat while the response time percentiles crept up a little.
So, Lucene + Solr is not just a very good open source search solution; it is also turning out to be a very impressive NoSQL solution for medium-sized data sets.
In response to a comment about real tests: while I was at Comcast, I ran some tests with a 3-node SolrCloud cluster, each node running on a VM (4-core machine), with a replication factor of 3 and ZooKeeper running on one of the nodes. The intent at that time was to compare SolrCloud to MongoDB, Redis, Cassandra and Aerospike.
I did not see any performance drop, but toggling nodes (bringing them down and back up) did show some interesting results. I have not checked what was causing this, but my guess is that it comes from how the SolrJ client checks SolrCloud state in ZooKeeper. While the recovering nodes are marked as “recovering” in ZooKeeper, perhaps the SolrJ CloudSolrServer class still sends traffic to them, causing this problem.
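To make that guess concrete, here is a small Python sketch of the routing rule I would expect a client to follow: only send reads to replicas whose state is "active". This is not SolrJ's code, and I have not verified what SolrJ actually does. The dictionary layout is an assumption loosely modeled on the cluster state kept in ZooKeeper, and the node URLs are made up.

```python
def active_replica_urls(cluster_state: dict, collection: str) -> list:
    """Collect base URLs of replicas that are marked active (assumed layout)."""
    urls = []
    for shard in cluster_state[collection]["shards"].values():
        for replica in shard["replicas"].values():
            # Skip replicas that are recovering or down
            if replica.get("state") == "active":
                urls.append(replica["base_url"])
    return urls


# Hypothetical cluster state: one shard, three replicas, one still recovering.
example_state = {
    "collection1": {
        "shards": {
            "shard1": {
                "replicas": {
                    "core_node1": {"state": "active", "base_url": "http://node1:8983/solr"},
                    "core_node2": {"state": "recovering", "base_url": "http://node2:8983/solr"},
                    "core_node3": {"state": "active", "base_url": "http://node3:8983/solr"},
                }
            }
        }
    }
}

read_targets = active_replica_urls(example_state, "collection1")
```

If a client skipped this kind of filter, or worked from a stale copy of the state, reads could land on a node that is still catching up, which would be consistent with the slowdown I saw when a node came back.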
While SolrCloud is certainly NOT as performant a NoSQL solution as MongoDB, Redis or Aerospike, it is not far behind for medium-sized data sets.

You can’t even compare this to NoSQL, since you tested on a single machine with reads only. Set up a SolrCloud cluster and run some real tests.
I ran some tests in a cloud setup as well (a 3-node virtual machine cloud), with 1 million entries and a replication factor of 3. I had 4 client machines running JMeter, and each client had 5 threads.
Bringing down a node had no real lasting impact on reads, but bringing a node back up did cause performance problems. Node recovery (rebalancing of the cluster) has performance implications in some NoSQL solutions (CouchBase, Riak), while others handle it better (MongoDB, Redis).
I would be really interested in your guidance on the real tests that you have run and the results that you achieved, so I can improve on my tests.