Exporting from Pig to CSV
I'm having a lot of trouble getting data out of Pig and into a CSV that I can use in Excel or SQL (or R or SPSS, etc.) without a lot of manipulation.
I have tried the following:
STORE pig_object INTO '/Users/Name/Folder/pig_object.csv'
USING CSVExcelStorage(',','NO_MULTILINE','WINDOWS');
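(For reference, CSVExcelStorage lives in piggybank, so the statement only runs once the piggybank jar is registered; a minimal sketch of the full invocation, with a placeholder jar path:)
REGISTER /path/to/piggybank.jar;
STORE pig_object INTO '/Users/Name/Folder/pig_object.csv'
USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'WINDOWS');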
It creates a folder with that name containing lots of part-m-0000# files. I can later join them all up using cat part* > filename.csv, but there's no header, which means I have to add it manually.
I've read that PigStorageSchema is supposed to create another bit with a header, but it doesn't seem to work at all; e.g., I get the same result as if it were just stored, with no header file:
STORE pig_object INTO '/Users/Name/Folder/pig_object' USING org.apache.pig.piggybank.storage.PigStorageSchema();
(I've tried this in both local and mapreduce mode.)
Is there any way of getting the data out of Pig into a simple CSV file without these multiple steps?
Any help would be much appreciated!
I'm afraid there isn't a one-liner that does the job, but you can come up with the following (Pig v0.10.0):
-- load the CSV input, then store it tab-separated;
-- the '-schema' option also writes the schema and header files
A = load '/user/hadoop/csvinput/somedata.txt' using PigStorage(',')
    as (firstname:chararray, lastname:chararray, age:int, location:chararray);
store A into '/user/hadoop/csvoutput' using PigStorage('\t', '-schema');
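A quick way to inspect what the store wrote, using the paths above (exact file names vary with the job):
hadoop fs -ls /user/hadoop/csvoutput
# shows the part-x-xxxxx files plus the hidden .pig_schema and .pig_header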
When PigStorage takes '-schema' it will create a '.pig_schema' and a '.pig_header' in the output directory. Then you have to merge '.pig_header' with the 'part-x-xxxxx' files:
1. If the result needs to be copied to the local disk:
hadoop fs -rm /user/hadoop/csvoutput/.pig_schema
hadoop fs -getmerge /user/hadoop/csvoutput ./output.csv
(Since -getmerge takes an input directory you need to get rid of .pig_schema first.)
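Because '.' sorts before 'p', -getmerge should place the header line ahead of the part files' contents; a quick way to verify (assuming output.csv landed in the current directory):
head -1 ./output.csv
# should print the tab-separated column names from .pig_header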
2. Storing the result on HDFS:
hadoop fs -cat /user/hadoop/csvoutput/.pig_header /user/hadoop/csvoutput/part-x-xxxxx |
hadoop fs -put - /user/hadoop/csvoutput/result/output.csv
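If there are several part files you can also glob them rather than listing each one; the same pipeline, sketched with a quoted glob so that HDFS (not the local shell) expands it:
hadoop fs -cat /user/hadoop/csvoutput/.pig_header '/user/hadoop/csvoutput/part-*' |
hadoop fs -put - /user/hadoop/csvoutput/result/output.csv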
For further reference you might also have a look at these posts:
STORE output to a single CSV?
How can I concatenate two files in hadoop into one using Hadoop FS shell?