Solr(9)Solr Index Replication on Ubuntu and Scala Client
1. Create One More Core
Go to the solr home directory under example and copy the default core to create the new core:
> pwd
/opt/solr/example/solr
> cp -r collection1 jobs
> rm -fr collection1/
Go back to the example directory and start the server:
> java -jar start.jar
Go to the console web UI
http://ubuntu-master:8983/solr/#/
Add Core -
name: jobs
instanceDir: /opt/solr/example/solr/jobs
dataDir: /opt/solr/example/solr/jobs/data
config: /opt/solr/example/solr/jobs/conf/solrconfig.xml
schema: /opt/solr/example/solr/jobs/conf/schema.xml
Add one record to the Solr system:
Jobs —> Documents —> Request-Handler —> Document Type (Solr Command Raw XML or JSON)
<add>
<doc>
<field name="id">1</field>
<field name="title">senior software engineer</field>
</doc>
</add>
The XML version did not work for me, possibly because of a commit issue. The same document in JSON worked:
{
"id":"1",
"title":"software engineer"
}
Click on the "Query" tab and you will see your data there.
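The JSON document above can also be posted directly to the core's update handler over HTTP. A minimal sketch, assuming the ubuntu-master host and jobs core from above (the object and helper names here are mine, not part of the project):

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

object SolrJsonAdd {
  // Build the JSON body for a single document; the update handler
  // accepts a JSON array of documents.
  def payload(id: String, title: String): String =
    "[{\"id\":\"" + id + "\",\"title\":\"" + title + "\"}]"

  // commit=true forces an immediate commit, so the document is
  // visible to the very next query.
  def updateUrl(host: String, core: String): String =
    "http://" + host + ":8983/solr/" + core + "/update?commit=true"

  // POST the payload; this part needs a running Solr server.
  def post(url: String, body: String): Int = {
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(body.getBytes(StandardCharsets.UTF_8))
    conn.getResponseCode
  }
}
```

With the server up, `SolrJsonAdd.post(SolrJsonAdd.updateUrl("ubuntu-master", "jobs"), SolrJsonAdd.payload("1", "software engineer"))` should return the HTTP status of the update request.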
2. Set up the Replication Server
On each slave, copy the core directory from the master:
> scp -r ubuntu-master:/opt/solr/example/solr/jobs ./
Edit the core's solrconfig.xml: search for the "/replication" handler and add this configuration:
<lst name="master">
<str name="enable">${master.enable:false}</str>
<str name="replicateAfter">commit</str>
<str name="replicateAfter">startup</str>
<str name="confFiles">schema.xml,stopwords.txt</str>
</lst>
<lst name="slave">
<str name="enable">${slave.enable:false}</str>
<str name="masterUrl">${master.url:http://ubuntu-master:8983/solr/jobs}</str>
<str name="pollInterval">00:00:60</str>
<str name="httpConnTimeout">5000</str>
<str name="httpReadTimeout">10000</str>
</lst>
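The pollInterval above uses an HH:MM:SS format, so 00:00:60 makes the slave poll the master every 60 seconds. A small illustration of how such an interval maps to milliseconds (this parser is mine for illustration, not Solr's internal one):

```scala
object PollInterval {
  // Convert an HH:MM:SS interval string, as used by
  // <str name="pollInterval">, into milliseconds.
  def toMillis(interval: String): Long = {
    val Array(h, m, s) = interval.split(":").map(_.toLong)
    ((h * 3600) + (m * 60) + s) * 1000
  }
}
```

So 00:00:60 and 00:01:00 both mean a 60-second polling interval.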
Also, change the auto commit time in the configuration.
<autoCommit>
<maxDocs>300000</maxDocs>
<!-- 5 minutes -->
<maxTime>300000</maxTime>
<openSearcher>true</openSearcher>
</autoCommit>
Do the same in the Solr configuration files on the slaves. I have 2 slaves: ubuntu-dev1 and ubuntu-dev2.
Start the master with the master option enabled:
> java -Dmaster.enable=true -jar start.jar
Start the slaves on the slave servers:
> java -Dslave.enable=true -jar start.jar
From the server web UI console, I can see that replication is enabled:
http://ubuntu-master:8983/solr/#/jobs/replication
We can go to the slave console to check:
http://ubuntu-dev1:8983/solr/#/jobs/query
Now I can add one more document on the master and check whether it gets indexed on the slaves.
Some console logging on the slaves:
529867 [snapPuller-10-thread-1] INFO org.apache.solr.handler.SnapPuller – Slave in sync with master.
589867 [snapPuller-10-thread-1] INFO org.apache.solr.handler.SnapPuller – Master's generation: 8
589868 [snapPuller-10-thread-1] INFO org.apache.solr.handler.SnapPuller – Slave's generation: 7
589869 [snapPuller-10-thread-1] INFO org.apache.solr.handler.SnapPuller – Starting replication process
589883 [snapPuller-10-thread-1] INFO org.apache.solr.handler.SnapPuller – Number of files in latest index in master: 52
After the replication process, the latest data can be searched on both the master and the slaves.
3. Set up Load Balancing
I am running HAProxy on the same machine as the Solr master, so I need to choose another port number. The configuration is as follows:
listen solr_cluster 0.0.0.0:8984
acl master_methods method POST DELETE PUT
use_backend solr_master_backend if master_methods
default_backend solr_read_backends
backend solr_master_backend
server solr-master ubuntu-master:8983 check inter 5000 rise 2 fall 2
backend solr_read_backends
balance roundrobin
server solr-slave1 ubuntu-dev1:8983 check inter 5000 rise 2 fall 2
server solr-slave2 ubuntu-dev2:8983 check inter 5000 rise 2 fall 2
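The acl above routes the mutating HTTP methods (POST, DELETE, PUT) to the master backend and everything else to the round-robined read backends. The same decision, mirrored in a small sketch purely for illustration (the backend names are the ones from the config above; HAProxy itself does this routing, not the client):

```scala
object SolrRouting {
  // The methods matched by "acl master_methods method POST DELETE PUT".
  val writeMethods = Set("POST", "DELETE", "PUT")

  // Writes go to the single master backend; reads are round-robined
  // across the slave read backends.
  def backendFor(method: String): String =
    if (writeMethods.contains(method.toUpperCase)) "solr_master_backend"
    else "solr_read_backends"
}
```

So an update request (POST) always reaches the master, while queries (GET) are spread over ubuntu-dev1 and ubuntu-dev2.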
It works well; we can check the status here:
http://ubuntu-master/haproxy-status
4. Build a Simple Client
https://github.com/takezoe/solr-scala-client
The CaseClassMapper in this library helps a lot:
package com.sillycat.jobsconsumer.persistence

import com.sillycat.jobsconsumer.models.Job
import com.sillycat.jobsconsumer.utilities.{IncludeConfig, IncludeLogger}
import jp.sf.amateras.solr.scala.SolrClient

/**
 * Created by carl on 8/6/15.
 */
object SolrClientDAO extends IncludeLogger with IncludeConfig {

  private val solrClient = {
    try {
      logger.info("Init the SOLR Client ---------------")
      val solrURL = config.getString(envStr("solr.url.jobs"))
      logger.info("SOLR URL = " + solrURL)
      new SolrClient(solrURL)
    } catch {
      case x: Throwable =>
        logger.error("Couldn't connect to SOLR: " + x)
        null
    }
  }

  def releaseResource(): Unit = {
    if (solrClient != null) {
      solrClient.shutdown()
    }
  }

  def addJob(job: Job): Unit = {
    solrClient.add(job)
  }

  def query(query: String): Seq[Job] = {
    logger.debug("Fetching the job results with query = " + query)
    val result = solrClient.query(query).getResultAs[Job]()
    result.documents
  }

  def commit(): Unit = {
    solrClient.commit()
  }
}
The resolver and dependency are as follows:
//for solr scala driver
resolvers += "amateras-repo" at "http://amateras.sourceforge.jp/mvn/"
"jp.sf.amateras.solr.scala" %% "solr-scala-client" % "0.0.12",
And the test class is as follows:
package com.sillycat.jobsconsumer.persistence

import com.sillycat.jobsconsumer.models.Job
import com.sillycat.jobsconsumer.utilities.IncludeConfig
import org.scalatest.{BeforeAndAfterAll, FunSpec, Matchers}

/**
 * Created by carl on 8/7/15.
 */
class SolrDAOSpec extends FunSpec with Matchers with BeforeAndAfterAll with IncludeConfig {

  override def beforeAll() {
    if (config.getString("build.env").equals("test")) {
    }
  }

  override def afterAll() {
  }

  describe("SolrDAO") {
    describe("#add and query") {
      it("Add one single job to Solr") {
        val expect = Job("id1", "title1", "desc1", "industry1")
        val num = 10000
        val start = System.currentTimeMillis()
        for (i <- 1 to num) {
          val job = Job("id" + i, "title" + i, "desc" + i, "industry" + i)
          SolrClientDAO.addJob(job)
        }
        val end = System.currentTimeMillis()
        println("total time for " + num + " is " + (end - start))
        println("it is " + num / ((end - start) / 1000) + " jobs/second")
        // SolrClientDAO.commit()
        // val result = SolrClientDAO.query("title:title1")
        // result should not be (null)
        // result.size > 0 should be (true)
        // result.foreach { item =>
        //   println(item.toString + "\n")
        // }
      }
    }
  }
}
To clean all the data during testing:
http://ubuntu-master:8983/solr/jobs/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E&commit=true
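The stream.body parameter in that URL is just URL-encoded XML; decoding it shows the underlying delete-by-query command:

```scala
import java.net.URLDecoder

object DeleteAll {
  // The encoded stream.body value from the clean-up URL above.
  val encoded = "%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E"

  // Decoding reveals a plain delete-by-query on *:* (all documents).
  def decoded: String = URLDecoder.decode(encoded, "UTF-8")
}
```

The commit=true at the end of the URL makes the deletion visible immediately.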
The data schema is defined in conf/schema.xml, so I updated it as follows:
<field name="title" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="desc" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="industry" type="text_general" indexed="true" stored="true" multiValued="false"/>
Adding a single job at a time:
total time for 10000 is 180096
it is 55 jobs/second
Find log4j.properties here and change the log level:
/opt/solr/example/resources/log4j.properties
After turning off the logging and running 2 client threads, I get roughly the following performance on each thread:
total time for 10000 is 51688
it is 196 jobs/second
With a single thread, the performance is as follows:
total time for 10000 is 28398
it is 357 jobs/second
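The jobs/second figures in this section are computed with integer division, num / ((end - start) / 1000), which truncates the elapsed seconds and slightly overstates the rate. A floating-point version of the same calculation:

```scala
object Throughput {
  // Jobs per second using floating-point division, avoiding the
  // truncation in the integer expression used by the test above.
  def perSecond(numJobs: Int, elapsedMillis: Long): Double =
    numJobs / (elapsedMillis / 1000.0)
}
```

With the timings above, 10000 jobs in 180096 ms is about 55.5 jobs/second, and 10000 jobs in 28398 ms is about 352 jobs/second (the 357 figure comes from truncating 28.398 s down to 28 s).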
References:
Setup Scaling Servers
http://blog.****.net/thundersssss/article/details/5385699
http://lutaf.com/197.htm
http://blog.warningrc.com/2013/06/10/Solr-data-backup.html
Single mode on Jetty
http://sillycat.iteye.com/blog/2227398
Load balance on the slaves
http://davehall.com.au/blog/dave/2010/03/13/solr-replication-load-balancing-haproxy-and-drupal
https://gist.github.com/feniix/1974460
http://stackoverflow.com/questions/10090386/how-to-check-solr-healthy-using-haproxy
Solr clients
https://github.com/takezoe/solr-scala-client
https://wiki.apache.org/solr/Solrj