Solr (9) Solr Index Replication on Ubuntu and Scala Client

1. Create One More Core
Go to the solr directory under example and copy the core directory
> pwd
/opt/solr/example/solr
> cp -r collection1 jobs
> rm -fr collection1/

Go to the example directory and start the server
> java -jar start.jar

Go to the console web UI
http://ubuntu-master:8983/solr/#/

Go to Core Admin and click Add Core, with these values:
name: jobs
instanceDir: /opt/solr/example/solr/jobs
dataDir: /opt/solr/example/solr/jobs/data
config: /opt/solr/example/solr/jobs/conf/solrconfig.xml
schema: /opt/solr/example/solr/jobs/conf/schema.xml
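
The same core can probably be created without the UI through Solr's CoreAdmin HTTP API (action=CREATE). The sketch below only builds the request URL from the values above and does not send anything. The object and method names are made up for illustration, and passing full paths for config and schema simply mirrors what I typed into the form, so treat that part as an assumption.

```scala
import java.net.URLEncoder

object CoreAdminUrl {
  // Hypothetical helper: builds a CoreAdmin CREATE URL, it does not send the request.
  def create(baseUrl: String, params: Seq[(String, String)]): String = {
    val query = (("action" -> "CREATE") +: params)
      .map { case (k, v) => k + "=" + URLEncoder.encode(v, "UTF-8") }
      .mkString("&")
    baseUrl + "/admin/cores?" + query
  }

  // Same values as entered in the Add Core form.
  val jobsCore: String = create(
    "http://ubuntu-master:8983/solr",
    Seq(
      "name"        -> "jobs",
      "instanceDir" -> "/opt/solr/example/solr/jobs",
      "dataDir"     -> "/opt/solr/example/solr/jobs/data",
      "config"      -> "/opt/solr/example/solr/jobs/conf/solrconfig.xml",
      "schema"      -> "/opt/solr/example/solr/jobs/conf/schema.xml"
    )
  )
}
```

Opening the resulting URL in a browser or with curl should have the same effect as the form, assuming the directories already exist on the server.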

Add one Record to the Solr System
Jobs -> Documents -> Request-Handler -> Document Type (Solr Command Raw XML or JSON)
<add>
<doc>
<field name="id">1</field>
<field name="title">senior software engineer</field>
</doc>
</add>

The XML version did not work for me, possibly because the document was never committed. I tried a similar document as JSON and it worked.
{
  "id":"1",
  "title":"software engineer"
}
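
To rule out the commit question, the same JSON document can also be posted from code with commit=true on the URL. This is a minimal sketch using only the JDK: SolrJsonUpdate and its methods are hypothetical names of mine, it handles flat string fields only, and it assumes the core's /update handler accepts a JSON array of documents.

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

object SolrJsonUpdate {
  // Escape the characters that must not appear raw inside a JSON string.
  def escape(s: String): String = s.map {
    case '"'          => "\\\""
    case '\\'         => "\\\\"
    case '\n'         => "\\n"
    case '\r'         => "\\r"
    case '\t'         => "\\t"
    case c if c < ' ' => "\\" + "u%04x".format(c.toInt)
    case c            => c.toString
  }.mkString

  // Render one document (flat string fields only) as a JSON object.
  def docToJson(fields: Seq[(String, String)]): String =
    fields
      .map { case (k, v) => "\"" + escape(k) + "\":\"" + escape(v) + "\"" }
      .mkString("{", ",", "}")

  // Wrap the documents in a JSON array for the /update handler.
  def body(docs: Seq[Seq[(String, String)]]): String =
    docs.map(docToJson).mkString("[", ",", "]")

  // POST the body with commit=true and return the HTTP status code.
  def post(coreUrl: String, json: String): Int = {
    val conn = new URL(coreUrl + "/update?commit=true")
      .openConnection().asInstanceOf[HttpURLConnection]
    try {
      conn.setRequestMethod("POST")
      conn.setDoOutput(true)
      conn.setRequestProperty("Content-Type", "application/json")
      val out = conn.getOutputStream
      try out.write(json.getBytes(StandardCharsets.UTF_8)) finally out.close()
      conn.getResponseCode
    } finally conn.disconnect()
  }
}
```

For example, SolrJsonUpdate.post("http://ubuntu-master:8983/solr/jobs", SolrJsonUpdate.body(Seq(Seq("id" -> "1", "title" -> "software engineer")))) would send the document above and commit it in one request.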

Click on the “Query” tab and run a query; all the indexed data shows up there.

2. Set Up the Replication Servers
On each slave, copy the jobs core directory from the master
> scp -r ubuntu-master:/opt/solr/example/solr/jobs ./

In the master's solrconfig.xml, search for “/replication” and add this configuration to that request handler
       <lst name="master">
         <str name="enable">${master.enable:false}</str>
         <str name="replicateAfter">commit</str>
         <str name="replicateAfter">startup</str>
         <str name="confFiles">schema.xml,stopwords.txt</str>
       </lst>
       <lst name="slave">
         <str name="enable">${slave.enable:false}</str>
         <str name="masterUrl">${master.url:http://ubuntu-master:8983/solr/jobs}</str>
         <str name="pollInterval">00:00:60</str>
         <str name="httpConnTimeout">5000</str>
         <str name="httpReadTimeout">10000</str>
       </lst>
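
The ${name:default} placeholders are what let one solrconfig.xml serve both roles: the value comes from a system property (for example -Dmaster.enable=true) and falls back to the text after the first colon. The sketch below imitates that lookup to show how the values above resolve. It is not Solr's code, and PropertyDefaults is a name I made up.

```scala
import java.util.regex.Matcher

object PropertyDefaults {
  // ${name} or ${name:default}; the default may itself contain colons (e.g. a URL).
  private val Placeholder = """\$\{([^:}]+)(?::([^}]*))?\}""".r

  // Replace every placeholder with the property value, or with its default.
  def resolve(text: String, props: Map[String, String]): String =
    Placeholder.replaceAllIn(text, m => {
      val name = m.group(1)
      val value = props.get(name)
        .orElse(Option(m.group(2))) // group 2 is null when no default is given
        .getOrElse(throw new IllegalArgumentException("No value for " + name))
      Matcher.quoteReplacement(value)
    })
}
```

With no properties set, both enable flags resolve to false and the master URL resolves to http://ubuntu-master:8983/solr/jobs, so a node started without any -D option neither serves nor pulls replicas.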

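Also change the autoCommit settings in the same file. Replication is triggered after a commit (replicateAfter), so this controls how often new data reaches the slaves.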
     <autoCommit>
          <maxDocs>300000</maxDocs>
          <!-- 5 minutes -->
          <maxTime>300000</maxTime>
          <openSearcher>true</openSearcher>
     </autoCommit>

I made the same changes to the Solr configuration files on the slaves. I have 2 slaves: ubuntu-dev1 and ubuntu-dev2.

Start the master with the master option enabled
> java -Dmaster.enable=true -jar start.jar

Start each slave with the slave option enabled
> java -Dslave.enable=true -jar start.jar

On the master's web UI console, the replication page only shows that replication is enabled.
http://ubuntu-master:8983/solr/#/jobs/replication

To check that the data has arrived, go to the query page on a slave
http://ubuntu-dev1:8983/solr/#/jobs/query

Now I can add one more document on the master and check that it gets indexed on the slaves.
Some console logging on the slaves:
529867 [snapPuller-10-thread-1] INFO  org.apache.solr.handler.SnapPuller  – Slave in sync with master.
589867 [snapPuller-10-thread-1] INFO  org.apache.solr.handler.SnapPuller  – Master's generation: 8
589868 [snapPuller-10-thread-1] INFO  org.apache.solr.handler.SnapPuller  – Slave's generation: 7
589869 [snapPuller-10-thread-1] INFO  org.apache.solr.handler.SnapPuller  – Starting replication process
589883 [snapPuller-10-thread-1] INFO  org.apache.solr.handler.SnapPuller  – Number of files in latest index in master: 52
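
The first two log lines are 60000 ms apart (529867 and 589867), which matches the pollInterval of 00:00:60 configured on the slave. As a small sketch of that arithmetic, here is an HH:mm:ss to milliseconds conversion. PollInterval.toMillis is a hypothetical helper of mine, not Solr's own parser.

```scala
object PollInterval {
  // Convert an "HH:mm:ss" interval into milliseconds.
  def toMillis(hhmmss: String): Long = hhmmss.split(":") match {
    case Array(h, m, s) =>
      ((h.toLong * 60 + m.toLong) * 60 + s.toLong) * 1000L
    case _ =>
      throw new IllegalArgumentException("Expected HH:mm:ss but got: " + hhmmss)
  }

  // Timestamps (ms since startup) of two consecutive polls in the log above.
  val firstPoll  = 529867L
  val secondPoll = 589867L
  val matchesConfig: Boolean = secondPoll - firstPoll == toMillis("00:00:60")
}
```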

Once replication finishes, the latest data can be searched on both the master and the slaves.

3. Set Up Load Balancing
I am running HAProxy on the same machine as the Solr master, so it needs a different port number (8984). The configuration is as follows:
listen solr_cluster 0.0.0.0:8984
       acl master_methods method POST DELETE PUT
       use_backend solr_master_backend if master_methods
       default_backend solr_read_backends

backend solr_master_backend
       server solr-master ubuntu-master:8983 check inter 5000 rise 2 fall 2
   
backend solr_read_backends
       balance roundrobin
       server solr-slave1 ubuntu-dev1:8983 check inter 5000 rise 2 fall 2
       server solr-slave2 ubuntu-dev2:8983 check inter 5000 rise 2 fall 2
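
To make the routing rule explicit: requests using POST, DELETE or PUT go to the master, and everything else is spread round-robin over the two slaves. The sketch below models only that decision. Health checks (check inter 5000 rise 2 fall 2) are left out, and the SolrRouting names are mine, not HAProxy's.

```scala
object SolrRouting {
  val master = "ubuntu-master:8983"
  val slaves = Vector("ubuntu-dev1:8983", "ubuntu-dev2:8983")

  // Mirrors "acl master_methods method POST DELETE PUT".
  private val masterMethods = Set("POST", "DELETE", "PUT")

  final class Router {
    private var next = 0

    // Pick the backend server for a request with the given HTTP method.
    def route(method: String): String =
      if (masterMethods(method.toUpperCase)) master
      else {
        val server = slaves(next)
        next = (next + 1) % slaves.size // round robin over the read backends
        server
      }
  }
}
```

One consequence of routing by method: a query sent with POST ends up on the master, so read clients should use GET if the reads are meant to stay on the slaves.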

It works well. The state of the backends can be checked on the HAProxy status page
http://ubuntu-master/haproxy-status

4. Build a Simple Client

I use the solr-scala-client library
https://github.com/takezoe/solr-scala-client
Its CaseClassMapper class helps a lot, because it lets the client work with case classes directly.

package com.sillycat.jobsconsumer.persistence

import com.sillycat.jobsconsumer.models.Job
import com.sillycat.jobsconsumer.utilities.{IncludeConfig, IncludeLogger}
import jp.sf.amateras.solr.scala.SolrClient


/**
* Created by carl on 8/6/15.
*/
object SolrClientDAO extends IncludeLogger with IncludeConfig{

  private val solrClient = {
    try {
      logger.info("Init the SOLR Client ---------------")
      val solrURL = config.getString(envStr("solr.url.jobs"))
      logger.info("SOLR URL = " + solrURL)
      val client = new SolrClient(solrURL)
      client
    } catch {
      case x: Throwable =>
        logger.error("Couldn't connect to SOLR: " + x)
        null
    }
  }

  def releaseResource = {
    if(solrClient != null){
      solrClient.shutdown()
    }
  }


  def addJob(job:Job): Unit ={
    //logger.debug("Adding job (" + job + ") to solr")
    solrClient.add(job)
  }

  def query(query:String):Seq[Job] = {
    logger.debug("Fetching the job results with query = " + query)
    val result = solrClient.query(query).getResultAs[Job]()
    result.documents
  }

  def commit = {
    solrClient.commit()
  }

}

The sbt dependency is as follows:
//for solr scala driver
resolvers += "amateras-repo" at "http://amateras.sourceforge.jp/mvn/"

  "jp.sf.amateras.solr.scala" %% "solr-scala-client" % "0.0.12",

And the test class is as follows:
package com.sillycat.jobsconsumer.persistence

import com.sillycat.jobsconsumer.models.Job
import com.sillycat.jobsconsumer.utilities.IncludeConfig
import org.scalatest.{BeforeAndAfterAll, Matchers, FunSpec}

/**
* Created by carl on 8/7/15.
*/
class SolrDAOSpec extends FunSpec with Matchers with BeforeAndAfterAll with IncludeConfig{

  override def beforeAll() {
    if(config.getString("build.env").equals("test")){

    }
  }

  override def afterAll() {

  }

  describe("SolrDAO") {
    describe("#add and query"){
      it("Add one single job to Solr") {
        val expect = Job("id1","title1","desc1","industry1")

        val num = 10000
        val start = System.currentTimeMillis()
        for ( i<- 1 to num){
          val job = (Job("id" + i, "title" + i, "desc" + i, "industry" + i))
          SolrClientDAO.addJob(job)
        }

        val end = System.currentTimeMillis()

        println("total time for " + num + " is " + (end-start))
        println("it is " + num / ((end-start)/1000) + " jobs/second")


//        SolrClientDAO.commit
//        val result = SolrClientDAO.query("title:title1")
//        result should not be (null)
//        result.size > 0 should be (true)
//        result.foreach { item =>
//          println(item.toString + "\n")
//        }
      }
    }
  }

}


To clean out all the data during testing, open this URL
http://ubuntu-master:8983/solr/jobs/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E&commit=true
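
The stream.body parameter in that URL is just a URL-encoded XML delete command. As a quick sketch, decoding the query string with the JDK shows what Solr receives. DeleteAllUrl and queryParams are hypothetical names for this example.

```scala
import java.net.URLDecoder

object DeleteAllUrl {
  val url: String =
    "http://ubuntu-master:8983/solr/jobs/update" +
      "?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E&commit=true"

  // Split the query string into key/value pairs and URL-decode each value.
  def queryParams(u: String): Map[String, String] =
    u.substring(u.indexOf('?') + 1).split("&").map { pair =>
      val Array(key, value) = pair.split("=", 2)
      key -> URLDecoder.decode(value, "UTF-8")
    }.toMap
}
```

Here queryParams(url)("stream.body") is the delete-by-query command for *:* (every document), and commit=true makes the deletion visible right away.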

The data schema is defined in conf/schema.xml, so I updated it for the fields of my Job class as follows:
   <field name="title" type="text_general" indexed="true" stored="true" multiValued="false"/>
   <field name="desc" type="text_general" indexed="true" stored="true" multiValued="false"/>
   <field name="industry" type="text_general" indexed="true" stored="true" multiValued="false"/>


Adding a single job at a time:
total time for 10000 is 180096
it is 55 jobs/second

Find the log4j.properties here and change the log level
/opt/solr/example/resources/log4j.properties

I turned off the logging and used 2 threads on the client side. Each thread got roughly this performance:
total time for 10000 is 51688
it is 196 jobs/second

With a single thread, the performance is as follows:
total time for 10000 is 28398
it is 357 jobs/second
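
A note on these numbers: the test computes jobs/second as num / ((end-start)/1000) with integer arithmetic, so the elapsed time is cut down to whole seconds before dividing, and a run shorter than one second would divide by zero. The sketch below reproduces the printed figures and shows a floating-point version next to them. The Throughput object is my own illustration, not part of the test class.

```scala
object Throughput {
  // Same arithmetic as the test: elapsed time is truncated to whole seconds.
  def truncated(num: Int, elapsedMs: Long): Long = num / (elapsedMs / 1000)

  // Floating-point version that keeps the fractional seconds.
  def precise(num: Int, elapsedMs: Long): Double = num * 1000.0 / elapsedMs

  def main(args: Array[String]): Unit = {
    // Elapsed times (ms) taken from the three runs above.
    Seq(180096L, 51688L, 28398L).foreach { ms =>
      println(f"$ms%d ms: ${truncated(10000, ms)}%d vs ${precise(10000, ms)}%.1f jobs/second")
    }
  }
}
```

The floating-point figures come out at about 55.5, 193.5 and 352.1 jobs/second, so the conclusion stays the same; only the last digits move.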

