How to read a specific line from a text file with SparkContext
Hi, I am trying to read specific lines from a text file using Spark.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("data.txt");
String firstLine = lines.first();
I can use the .first() command to fetch the first line of the data.txt file. How can I access the Nth line of the document? I need a Java solution.
Apache Spark RDDs are not meant to be used for lookups. The most "efficient" way to get the nth line would be lines.take(n).get(n - 1). Every time you do this, it will read the first n lines of the file. You could run lines.cache() to avoid re-reading, but it will still move the first n lines over the network in a very inefficient dance.
If the data can fit on one machine, just collect it all once and access it locally: List<String> local = lines.collect(); local.get(n - 1);
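The indexing used by both approaches can be sketched with a plain List standing in for what Spark returns (a live JavaRDD named lines is assumed in the answer above; the nthLine helper here is hypothetical, introduced only to show the 1-based-to-0-based conversion):

```java
import java.util.Arrays;
import java.util.List;

public class NthLineSketch {
    // Hypothetical helper: given the first n or more lines of the file
    // (e.g. the result of lines.take(n) or lines.collect()),
    // return the nth line, counting from 1.
    static String nthLine(List<String> firstLines, int n) {
        return firstLines.get(n - 1); // the nth line sits at index n - 1
    }

    public static void main(String[] args) {
        // Stand-in for lines.take(3) on a text-file RDD.
        List<String> taken = Arrays.asList("alpha", "beta", "gamma");
        System.out.println(nthLine(taken, 3)); // prints "gamma"
    }
}
```

Either way, the cost of the lookup grows with n, which is why a key-value store is the better fit when you need many random accesses.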
If the data does not fit on one machine, you need a distributed system that supports efficient lookups. Popular examples are HBase and Cassandra.
It is also possible that your problem can be solved efficiently with Spark, but not via lookups. If you explain the larger problem in a separate question, you may get a solution like that. (Lookups are very common in single-machine applications, but distributed algorithms have to think differently.)