Deduplication with Spark

The idea is simple: first use flatMap to flatten the RDD into individual words, then use map to turn each word into a (word, 1) pair, then use groupByKey to merge the values for each key, which effectively deduplicates the keys, and finally use keys() to extract the keys.
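To make each step concrete, here is a minimal sketch on a small in-memory RDD (the sample strings and the 'delcp-demo' app name are assumptions for illustration only); the full file-based version appears after the test data below:

from pyspark import SparkContext

sc = SparkContext('local', 'delcp-demo')

# two lines of raw text containing duplicate words
lines = sc.parallelize(["hello hello", "world world"])

# flatMap splits every line into individual words
words = lines.flatMap(lambda line: line.split(" "))   # ['hello', 'hello', 'world', 'world']

# map turns each word into a (word, 1) pair
pairs = words.map(lambda w: (w, 1))                   # [('hello', 1), ('hello', 1), ...]

# groupByKey merges all pairs that share a key, so each key survives only once
grouped = pairs.groupByKey()

# keys() drops the grouped values and keeps only the distinct words
print(grouped.keys().collect())                       # ['hello', 'world'] (order may vary)

sc.stop()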
 
Test data: delcp.txt
    hello
    hello
    world
    world
    h
    h
    h
    g
    g
    g


from pyspark import SparkContext

sc = SparkContext('local','delcp')

# read the file as an RDD of lines
rdd = sc.textFile("file:///usr/local/spark/mycode/TestPackage/delcp.txt")

# split each line into words, build (word, 1) pairs, group by key so that
# duplicates collapse into a single entry, then keep only the keys
delp = rdd.flatMap(lambda line: line.split(" ")) \
          .map(lambda word: (word, 1)) \
          .groupByKey() \
          .keys()

delp.foreach(print)
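Running the program should print the four distinct words from delcp.txt (hello, world, h, g), although the order is not guaranteed. As a design note, Spark's built-in RDD.distinct() reaches the same result without the manual (word, 1) pairing; a minimal alternative sketch, assuming the same file path and run as a separate program:

from pyspark import SparkContext

sc = SparkContext('local', 'delcp-distinct')

rdd = sc.textFile("file:///usr/local/spark/mycode/TestPackage/delcp.txt")

# distinct() removes duplicate elements directly
rdd.flatMap(lambda line: line.split(" ")).distinct().foreach(print)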