Reading an HBase table with a where clause using Spark

Problem description:

I am trying to read an HBase table using the Spark Scala API.

Sample code:

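import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
val conf = HBaseConfiguration.create() // sc and tableName are assumed to be defined elsewhere
conf.set("hbase.master", "localhost:60000")
conf.set("hbase.zookeeper.quorum", "localhost")
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
println("Number of Records found : " + hBaseRDD.count())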

How can I add a where clause if I use newAPIHadoopRDD?

Or do we need to use a Spark HBase connector to achieve this?

I found the Spark HBase connector below, but I don't see any example code with a where clause.

https://github.com/nerdammer/spark-hbase-connector

You can use the SHC connector from Hortonworks to achieve this.

https://github.com/hortonworks-spark/shc

Here is a code example with Spark 2.

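    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

    // Catalog mapping the HBase row key and column families to DataFrame columns.
    val catalog =
        s"""{
            |"table":{"namespace":"default", "name":"my_table"},
            |"rowkey":"id",
            |"columns":{
            |"id":{"cf":"rowkey", "col":"id", "type":"string"},
            |"name":{"cf":"info", "col":"name", "type":"string"},
            |"age":{"cf":"info", "col":"age", "type":"string"}
            |}
            |}""".stripMargin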

    val spark = SparkSession
        .builder()
        .appName("hbase spark")
        .getOrCreate()

    val df = spark
        .read
        .options(
            Map(
                HBaseTableCatalog.tableCatalog -> catalog
            )
        )
        .format("org.apache.spark.sql.execution.datasources.hbase")
        .load()

    df.show()

You can then call any DataFrame method on the result. For example:

df.where(df("age") === 20)
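
To illustrate a few other ways of writing the where clause, here is a minimal sketch. It does not touch HBase: the DataFrame is a hypothetical in-memory stand-in with the same three string columns as the catalog above, and the sample rows, the local master setting, and the variable names are assumptions made only for illustration. The same where calls should apply unchanged to the DataFrame loaded through SHC.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session used only for this illustration (assumption: no cluster needed).
val spark = SparkSession
  .builder()
  .appName("where clause sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the HBase-backed DataFrame: same column names,
// all typed as strings, as declared in the catalog.
val df = Seq(
  ("1", "alice", "20"),
  ("2", "bob", "35"),
  ("3", "carol", "20")
).toDF("id", "name", "age")

// Column expression: equality on the string-typed age column.
val twenty = df.where(col("age") === "20")

// SQL-style predicate string combining two conditions.
val twentyStartingWithA = df.where("age = '20' AND name LIKE 'a%'")

// Cast before a numeric comparison, because age is stored as a string.
val over30 = df.where(col("age").cast("int") > 30).select("id", "name")

twenty.show()
twentyStartingWithA.show()
over30.show()
```

Because age is declared as a string in the catalog, comparing it against a string literal or casting it explicitly keeps the intent clear, rather than relying on Spark's implicit type coercion.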