Pig - Passing a DataBag to a UDF constructor
I have a script that loads some data about venues:
venues = LOAD 'venues_extended_2.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (Name:chararray, Type:chararray, Latitude:double, Longitude:double, City:chararray, Country:chararray);
Then I want to create a UDF whose constructor accepts the venues relation.
So I tried to define the UDF like this:
DEFINE GenerateVenues org.gla.anton.udf.main.GenerateVenues(venues);
Here is the actual UDF:
public class GenerateVenues extends EvalFunc<Tuple> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();
    private static final String ALLCHARS = "(.*)";
    private ArrayList<String> venues;
    private String regex;

    public GenerateVenues(DataBag venuesBag) {
        Iterator<Tuple> it = venuesBag.iterator();
        venues = new ArrayList<String>((int) (venuesBag.size() + 1)); // possible fails!!!
        String current = "";
        regex = "";
        while (it.hasNext()) {
            Tuple t = it.next();
            try {
                current = "(" + ALLCHARS + t.get(0) + ALLCHARS + ")";
                venues.add((String) t.get(0));
            } catch (ExecException e) {
                throw new IllegalArgumentException("VenuesRegex: requires tuple with at least one value");
            }
            regex += current + (it.hasNext() ? "|" : "");
        }
    }

    @Override
    public Tuple exec(Tuple tuple) throws IOException {
        // expect one string
        if (tuple == null || tuple.size() != 2) {
            throw new IllegalArgumentException(
                "BagTupleExampleUDF: requires two input parameters.");
        }
        try {
            String tweet = (String) tuple.get(0);
            for (String venue : venues) {
                if (tweet.matches(ALLCHARS + venue + ALLCHARS)) {
                    Tuple output = mTupleFactory.newTuple(Collections.singletonList(venue));
                    return output;
                }
            }
            return null;
        } catch (Exception e) {
            throw new IOException(
                "BagTupleExampleUDF: caught exception processing input.", e);
        }
    }
}
When executed, the script fails at the DEFINE statement, just before (venues):
2013-12-19 04:28:06,072 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 6, column 60> mismatched input 'venues' expecting RIGHT_PAREN
Obviously I'm doing something wrong. Can you help me figure out what it is?
Is it that a UDF cannot accept the venues relation as a parameter? Or is the relation not represented by a DataBag, as in public GenerateVenues(DataBag venuesBag)?
Thanks!
PS I'm using Pig version 0.11.1.1.3.0.0-107.
As @WinnieNicklaus already said, you can only pass strings to UDF constructors.
Having said that, the solution to your problem is to use the distributed cache. Override public List<String> getCacheFiles() to return a list of filenames that will be made available via the distributed cache. With that, you can read the file as a local file and build your table.
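To make that concrete, here is a rough sketch of the two pieces involved: a constructor that takes the file path as a string, and a getCacheFiles() that registers that path. It is written as a standalone class so it compiles without Pig on the classpath; in a real UDF the same members would sit in the class that extends EvalFunc<Tuple>, with @Override on getCacheFiles(). The class name VenueCacheSpec, the symlink name venues_lookup, and the HDFS path in the comment are made up for illustration. The path#name form follows the distributed-cache example in the Pig UDF documentation.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the relevant parts of the UDF (assumption: in real code this
// would be the class extending org.apache.pig.EvalFunc<Tuple>).
class VenueCacheSpec {
    // Hypothetical name under which the cached file appears in the task's
    // working directory.
    static final String LOCAL_NAME = "venues_lookup";

    private final String venuesPath;

    // Pig passes constructor arguments as strings taken from DEFINE, e.g.
    //   DEFINE GenerateVenues org.gla.anton.udf.main.GenerateVenues('/user/anton/venues_extended_2.csv');
    // (the HDFS path above is a placeholder)
    VenueCacheSpec(String venuesPath) {
        this.venuesPath = venuesPath;
    }

    // In the real UDF this overrides EvalFunc.getCacheFiles().
    // Each entry has the form "<path on HDFS>#<local symlink name>".
    public List<String> getCacheFiles() {
        List<String> files = new ArrayList<String>(1);
        files.add(venuesPath + "#" + LOCAL_NAME);
        return files;
    }
}
```

With an entry like this, the file should be readable inside the task as ./venues_lookup.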
The downside is that Pig has no initialization function, so you have to implement something like
private boolean initialized = false;

private void init() {
    if (!this.initialized) {
        // read table
        this.initialized = true;
    }
}
and then call it as the first thing in exec.
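For illustration, the lazy-initialization idea might look roughly like the sketch below. It is a standalone class so it runs without Pig; LazyVenueTable and firstVenueIn are hypothetical names, with firstVenueIn playing the role of exec. Two simplifications are assumptions on my part: it takes the first comma-separated column as the venue name with a naive split (so quoted names containing commas are not handled the way CSVLoader would handle them), and it uses a plain substring check instead of the regex match from the question.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class LazyVenueTable {
    private final String localFile;  // e.g. "./venues_lookup" inside a task
    private final List<String> venues = new ArrayList<String>();
    private boolean initialized = false;

    LazyVenueTable(String localFile) {
        this.localFile = localFile;
    }

    // Reads the venue names once, on first use.
    private void init() throws IOException {
        if (initialized) {
            return;
        }
        try (BufferedReader reader = new BufferedReader(new FileReader(localFile))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Naive CSV handling: everything before the first comma is the name.
                String name = line.split(",", 2)[0].trim();
                if (!name.isEmpty()) {
                    venues.add(name);
                }
            }
        }
        initialized = true;
    }

    // Counterpart of exec: returns the first known venue mentioned in the
    // tweet, or null if none matches.
    String firstVenueIn(String tweet) throws IOException {
        init();  // first thing, as described above
        if (tweet == null) {
            return null;
        }
        for (String venue : venues) {
            if (tweet.contains(venue)) {
                return venue;
            }
        }
        return null;
    }
}
```

The flag is set only after the file has been read completely, so a failed read is retried on the next call rather than leaving a half-built table marked as initialized.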