【大数据】面试题0001---使用SparkSQL关联两个表求和取前几行
场景:
有两个表,表可以是文本或Json数据,结构化后分别是Table1(A,B,C)和Table2(C、D、E),两个表通过C关联,要求求出D+E之和,并以(A、B、D+E)三列返回
解答:
思路:SparkSQL支持读取Json创建表,同时创建的表可以做联合查询,类似传统Sql语句进行关联查询和统计分析
代码:
package study import org.apache.spark.SparkContext import org.apache.spark.sql.SparkSession object TestDataFrame2{ def main(args:Array[String]):Unit={ import org.apache.spark.sql.SparkSession val spark = SparkSession .builder() .master("local[*]") .appName("Spark SQL basic example") .config("spark.some.config.option", "some-value") .getOrCreate() spark.sql("""create table table1 using org.apache.spark.sql.json options(path "F://0002_BigData//Soft//comoceanspark//src//resources//Table1.json")""") spark.sql("""create table table2 using org.apache.spark.sql.json options(path "F://0002_BigData//Soft//comoceanspark//src//resources//Table2.json")""") spark.sql("show tables").show() spark.sql("select A,B,(D+E) as DE from table1 inner join table2 on table1.C = table2.C order by DE desc limit 5").show() } }
Table1.json:
{"A":"A1", "B":30, "C":1} {"A":"A2", "B":31, "C":2} {"A":"A3", "B":32, "C":3} {"A":"A4", "B":33, "C":4} {"A":"A5", "B":34, "C":5} {"A":"A6", "B":35, "C":6} {"A":"A7", "B":36, "C":7} {"A":"A8", "B":37, "C":8} {"A":"A9", "B":38, "C":9}
Table2.json:
{"C":1, "D":1, "E":1} {"C":2, "D":2, "E":2} {"C":3, "D":3, "E":3} {"C":4, "D":4, "E":4} {"C":5, "D":5, "E":5} {"C":6, "D":6, "E":6} {"C":7, "D":7, "E":7} {"C":8, "D":8, "E":8} {"C":9, "D":9, "E":9}
结果:
表显示
计算结果显示:
备注:谢谢博士,希望有缘再见!!!