Running Spark as a Java web application

Problem description:

I have used Spark ML and was able to get reasonable prediction accuracy for my business problem.

The data is not huge, and I was able to transform the input (basically a CSV file) using Stanford NLP and run Naive Bayes for prediction on my local machine.

I want to run this prediction service as a simple Java main program, or as part of a simple MVC web application.

Currently I run my prediction using the spark-submit command. Instead, can I create the Spark context and data frames from my servlet/controller class?

I could not find any documentation on such scenarios.

Kindly advise on the feasibility of the above.

Spark has a REST API for submitting jobs by invoking the Spark master's hostname.

Submitting the application:

curl -X POST http://spark-cluster-ip:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "action" : "CreateSubmissionRequest",
  "appArgs" : [ "myAppArgument1" ],
  "appResource" : "file:/myfilepath/spark-job-1.0.jar",
  "clientSparkVersion" : "1.5.0",
  "environmentVariables" : {
    "SPARK_ENV_LOADED" : "1"
  },
  "mainClass" : "com.mycompany.MyJob",
  "sparkProperties" : {
    "spark.jars" : "file:/myfilepath/spark-job-1.0.jar",
    "spark.driver.supervise" : "false",
    "spark.app.name" : "MyJob",
    "spark.eventLog.enabled": "true",
    "spark.submit.deployMode" : "cluster",
    "spark.master" : "spark://spark-cluster-ip:6066"
  }
}'
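The same request can be issued from a servlet or controller instead of curl. Below is a minimal sketch using the JDK's built-in HTTP client (Java 11+); the master URL, jar path, and main class are the placeholders from the curl example above, not real values.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SparkSubmitClient {

    // Build the CreateSubmissionRequest body; jar path and main class are placeholders.
    static String submissionJson(String jar, String mainClass) {
        return "{"
            + "\"action\":\"CreateSubmissionRequest\","
            + "\"appArgs\":[\"myAppArgument1\"],"
            + "\"appResource\":\"" + jar + "\","
            + "\"clientSparkVersion\":\"1.5.0\","
            + "\"environmentVariables\":{\"SPARK_ENV_LOADED\":\"1\"},"
            + "\"mainClass\":\"" + mainClass + "\","
            + "\"sparkProperties\":{"
            +   "\"spark.jars\":\"" + jar + "\","
            +   "\"spark.driver.supervise\":\"false\","
            +   "\"spark.app.name\":\"MyJob\","
            +   "\"spark.submit.deployMode\":\"cluster\","
            +   "\"spark.master\":\"spark://spark-cluster-ip:6066\""
            + "}}";
    }

    // POST the request to the master's REST submission endpoint.
    static HttpResponse<String> submit(String masterUrl, String json) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(masterUrl + "/v1/submissions/create"))
            .header("Content-Type", "application/json;charset=UTF-8")
            .POST(HttpRequest.BodyPublishers.ofString(json))
            .build();
        return HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }

    public static void main(String[] args) throws Exception {
        String json = submissionJson("file:/myfilepath/spark-job-1.0.jar", "com.mycompany.MyJob");
        System.out.println(json);
        // submit("http://spark-cluster-ip:6066", json);  // uncomment against a real master
    }
}
```

A real application would build the JSON with a library such as Jackson rather than string concatenation; it is inlined here only to keep the sketch dependency-free.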

Submission response:

{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20151008145126-0000",
  "serverSparkVersion" : "1.5.0",
  "submissionId" : "driver-20151008145126-0000",
  "success" : true
}

Getting the status of a submitted application:

curl http://spark-cluster-ip:6066/v1/submissions/status/driver-20151008145126-0000

Status response:

{
  "action" : "SubmissionStatusResponse",
  "driverState" : "FINISHED",
  "serverSparkVersion" : "1.5.0",
  "submissionId" : "driver-20151008145126-0000",
  "success" : true,
  "workerHostPort" : "192.168.3.153:46894",
  "workerId" : "worker-20151007093409-192.168.3.153-46894"
}
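From the web application you can poll that status endpoint until driverState reaches FINISHED. As a sketch, the field can be pulled out of the response with a plain-JDK regex (a real app would use a JSON library); the sample string is the status response shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubmissionStatus {

    // Extract a string field from the status JSON; regex keeps the sketch dependency-free.
    static String field(String json, String name) {
        Matcher m = Pattern.compile("\"" + name + "\"\\s*:\\s*\"([^\"]+)\"").matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String sample = "{ \"action\" : \"SubmissionStatusResponse\", \"driverState\" : \"FINISHED\", "
            + "\"submissionId\" : \"driver-20151008145126-0000\", \"success\" : true }";
        System.out.println(field(sample, "driverState"));  // FINISHED
        // In a controller you would GET
        // http://spark-cluster-ip:6066/v1/submissions/status/<submissionId>
        // (e.g. with java.net.http.HttpClient) and repeat until driverState is FINISHED.
    }
}
```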

Now, the Spark application you submit should perform all the operations and save its output to a datasource; access that data via the Thrift server, since there isn't much data to transfer (you can consider Sqoop if you want to transfer data between your MVC app database and the Hadoop cluster).

Credit: link1, link2

(As per the question in the comment) Build the Spark application jar with the necessary dependencies and run the job in local mode. Write the jar so it reads the CSV and makes use of MLlib, then store the prediction output in some datasource so the web app can access it.
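For local mode, one simple option is to launch spark-submit from the Java program itself with --master local[*]. The sketch below only assembles the command with the JDK's ProcessBuilder; the jar path, main class, and CSV argument are hypothetical placeholders, and actually starting the process requires spark-submit on the PATH.

```java
import java.util.Arrays;
import java.util.List;

public class LocalSubmit {

    // Assemble a spark-submit command that runs the job in local mode;
    // jar, mainClass, and csv are placeholders for your own application.
    static List<String> command(String jar, String mainClass, String csv) {
        return Arrays.asList("spark-submit",
            "--class", mainClass,
            "--master", "local[*]",   // run inside a local Spark instance, all cores
            jar, csv);
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = command("/myfilepath/spark-job-1.0.jar", "com.mycompany.MyJob", "input.csv");
        System.out.println(String.join(" ", cmd));
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();  // needs spark-submit installed
    }
}
```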