How to evaluate a spark.ml model without DataFrames/SparkContext?
With Spark MLlib, I could build a model (such as a RandomForest), and then evaluate it outside of Spark by loading the model and calling predict on it with a vector of features.
It seems that with Spark ML, predict is now called transform, and it operates only on a DataFrame.
Is there any way to build a DataFrame outside of Spark, since it seems a SparkContext is needed to build one?
Am I missing something?
Here is my solution for using Spark models outside of a Spark context (using PMML):
- You create the model with a pipeline like this:
// Build a SparkSession and read the training data over JDBC
SparkConf sparkConf = new SparkConf();
SparkSession session = SparkSession.builder().enableHiveSupport().config(sparkConf).getOrCreate();
Properties dbProperties = new Properties();
dbProperties.setProperty("user", vKey);
dbProperties.setProperty("password", password);
dbProperties.setProperty("AuthMech", "3");
dbProperties.setProperty("source", "jdbc");
dbProperties.setProperty("driver", "com.cloudera.impala.jdbc41.Driver");
String tableName = "schema.table";
String simpleUrl = "jdbc:impala://host:21050/schema";
Dataset<Row> data = session.read().jdbc(simpleUrl, tableName, dbProperties);

// Index the categorical column, then assemble the indexed value into a feature vector
StringIndexer indexer = new StringIndexer().setInputCol("column1").setOutputCol("indexed_column1");
String[] inputCols = {"indexed_column1"};
VectorAssembler assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features");

// Configure the regressor via its typed setters
GBTRegressor p = new GBTRegressor()
    .setMaxIter(20)
    .setMaxDepth(2)
    .setMaxBins(204)
    .setLabelCol("faktor");

// Fit the whole pipeline on the raw data and export it as PMML
PipelineStage[] stages = {indexer, assembler, p};
Pipeline pipeline = new Pipeline().setStages(stages);
PipelineModel pmodel = pipeline.fit(data);
PMML pmml = ConverterUtil.toPMML(data.schema(), pmodel);
FileOutputStream fos = new FileOutputStream("model.pmml");
JAXBUtil.marshalPMML(pmml, new StreamResult(fos));
fos.close();
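The reason this works without Spark is that the marshalled model.pmml is plain, self-describing XML. A minimal sketch of inspecting such a file with only the JDK's built-in XML parser (the class name PmmlPeek and the tiny hand-written PMML document are illustrative, not from the original answer):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class PmmlPeek {
    // A toy PMML payload standing in for a real exported model.
    static final String TOY_PMML =
        "<PMML xmlns=\"http://www.dmg.org/PMML-4_3\" version=\"4.3\">"
      + "<Header/>"
      + "<DataDictionary numberOfFields=\"1\">"
      + "<DataField name=\"column1\" optype=\"categorical\" dataType=\"string\"/>"
      + "</DataDictionary>"
      + "</PMML>";

    // Parse a PMML payload with the JDK's XML parser and return the
    // root element's local name ("PMML" for a valid document).
    public static String rootName(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getLocalName();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("root element: " + rootName(TOY_PMML));
    }
}
```

Because the model is just XML, any JVM process with a PMML evaluator on the classpath (and no Spark at all) can score against it, which is what the next snippet does.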
Using PMML for predictions (locally, without a Spark context; the evaluator is applied to a Map of arguments rather than a DataFrame):
// Load the PMML file and build a JPMML evaluator for it
PMML pmml = org.jpmml.model.PMMLUtil.unmarshal(new FileInputStream(pmmlFile));
ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
MiningModelEvaluator evaluator = (MiningModelEvaluator) modelEvaluatorFactory.newModelEvaluator(pmml);

// Prepare the input values: one entry per input field of the model
Map<FieldName, FieldValue> args = new HashMap<FieldName, FieldValue>();
InputField curField = evaluator.getInputFields().get(0);
args.put(curField.getName(), curField.prepare("1.0"));

// Evaluate; the result maps target/output field names to their values
Map<FieldName, ?> result = evaluator.evaluate(args);