pig用上lzo的有关配置

pig用上lzo的相关配置
转载:http://*.com/questions/7277621/how-to-get-pig-to-work-with-lzo-files
还没有试过。

----------------------
I recently got this to work and wrote up a wiki on it for my coworkers. Here's an excerpt detailing how to get PIG to work with lzos. Hope this helps someone!

NOTE: This is written with a Mac in mind. The steps will be almost identical for other OS', and this should definitely give you what you need to know to configure on Windows or Linux, but you will need to extrapolate a bit (obviously, change Mac-centric folders to whatever OS you're using, etc...).
Hooking PIG up to be able to work with LZOs

This was by far the most annoying and time-consuming part for me-- not because it's difficult, but because there are 50 different tutorials online, none of which are all that helpful. Anyway, what I did to get this working is:

    1,Clone hadoop-lzo from github at https://github.com/kevinweil/hadoop-lzo.

    2,Compile it to get a hadoop-lzo*.jar and the native *.o libraries. You'll need to compile this on a 64bit machine.

    3,Copy the native libs to $HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/.

    4,Copy the java jar to $HADOOP_HOME/lib and $PIG_HOME/lib

    5,Then configure hadoop and pig to have the property java.library.path point to the lzo native libraries. You can do this in $HADOOP_HOME/conf/mapred-site.xml with:

    <property>
        <name>mapred.child.env</name>
        <value>JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/</value>
    </property>

    6,Now try out grunt shell by running pig again, and make sure everything still works. If it doesn't, you probably messed up something in mapred-site.xml and you should double check it.

    7,Great! We're almost there. All you need to do now is install elephant-bird. You can get that from https://github.com/kevinweil/elephant-bird (clone it).

    8,Now, in order to get elephant-bird to work, you'll need quite a few pre-reqs. These are listed on the page mentioned above, and might change, so I won't specify them here. What I will mention is that the versions on these are very important. If you get an incorrect version and try running ant, you will get errors. So, don't try grabbing the pre-reqs from brew or macports as you'll likely get a newer version. Instead, just download tarballs and build for each.

    9,command: ant in the elephant-bird folder in order to create a jar.

    10,For simplicity's sake, move all relevant jars (hadoop-lzo-x.x.x.jar and elephant-bird-x.x.x.jar) that you'll need to register frequently somewhere you can easily find them. /usr/local/lib/hadoop/... works nicely.

    11,Try things out! Play around with loading normal files and lzos in grunt shell. Register the relevant jars mentioned above, try loading a file, limiting output to a manageable number, and dumping it. This should all work fine whether you're using a normal text file or an lzo.