hadoop - Implementing a custom Apache Pig algebraic UDF
Hi everyone,
I implemented a custom aggregate Pig UDF. The UDF implements the Algebraic interface, so there are 3 classes (Initial, Intermed, and Final) that work at different phases. It works correctly, but inefficiently.
The UDF uses an algorithm that is a bit heavy when running on a single value. It would work more efficiently when running on bigger groups of data, say 100 at a time. I observed that the Initial class is invoked with a single value, and the results are later combined by the Intermed and Final classes.
I am aware that there's an Accumulator interface for such cases, but I could not find any documentation on how to use it with an Algebraic UDF.
So my question is: is there a way for me to "force" Pig to pass more values to the initial calculation, either by using the Accumulator interface or in some other way?
An explanation, or a pointer to documentation or a sample, would be appreciated.
Thanks, Amir
It seems that Pig's Algebraic Initial function receives only a single value in its tuple (at least according to this blog post).
To solve the issue, I ended up just returning the single value in Initial without processing it at all. The Intermed and Final functions then perform the algorithm.
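A rough sketch of that arrangement, written in plain Java rather than against the Pig API so it can be run on its own. The class name `DeferredInitial`, the method names, and `expensiveBatch` (a sum of squares standing in for the heavy algorithm) are all made up for illustration. It assumes every input to the Intermed step is a raw value coming from Initial.

```java
import java.util.List;

class DeferredInitial {
    // Counts how often the expensive algorithm runs (for illustration only).
    static int batchCalls = 0;

    // Hypothetical stand-in for the heavy algorithm: one call handles a whole batch.
    static long expensiveBatch(List<Long> values) {
        batchCalls++;
        long sum = 0;
        for (long v : values) {
            sum += v * v;
        }
        return sum;
    }

    // Initial step: invoked once per value, so it just hands the value back.
    static long initial(long value) {
        return value;
    }

    // Intermed step: sees a group of raw values and runs the algorithm once over them.
    static long intermed(List<Long> rawValues) {
        return expensiveBatch(rawValues);
    }

    // Final step: merges the partial results produced by the Intermed step.
    static long finalPhase(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        long a = intermed(List.of(initial(1L), initial(2L)));
        long b = intermed(List.of(initial(3L)));
        System.out.println(finalPhase(List.of(a, b))); // prints 14
    }
}
```

With three input values split into two groups, the expensive step runs twice instead of three times; the saving grows with the group size.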
Since the Intermed function may receive outputs from either the Initial function or the Intermed function (this is according to the docs; I did not see it in practice, since in my tests Intermed only received values from the Initial function), both the Initial and Intermed functions return a tuple of 2 values. The first value in the tuple is a string telling me the source of the value, either "initial" or "intermed". The second value in the tuple is the actual result.
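Here is a minimal sketch of the tagging idea, again in plain Java with no Pig classes involved. `Tagged` stands in for the 2-value tuple, and `TaggedPartials`, `combine`, and `expensiveBatch` (a sum of squares as a placeholder for the real algorithm) are hypothetical names. A real UDF would build Pig tuples instead.

```java
import java.util.ArrayList;
import java.util.List;

class TaggedPartials {
    // Stand-in for the 2-value tuple: (source tag, value).
    static final class Tagged {
        final String source; // "initial" or "intermed"
        final long value;

        Tagged(String source, long value) {
            this.source = source;
            this.value = value;
        }
    }

    // Hypothetical stand-in for the heavy algorithm, run once per batch.
    static long expensiveBatch(List<Long> raw) {
        long sum = 0;
        for (long v : raw) {
            sum += v * v;
        }
        return sum;
    }

    // Initial step: no processing, just mark the value as raw.
    static Tagged initial(long value) {
        return new Tagged("initial", value);
    }

    // Shared logic: batch the raw values, and add earlier partial results unchanged.
    static long combine(List<Tagged> inputs) {
        List<Long> raw = new ArrayList<>();
        long partial = 0;
        for (Tagged t : inputs) {
            if ("initial".equals(t.source)) {
                raw.add(t.value);
            } else {
                partial += t.value;
            }
        }
        if (!raw.isEmpty()) {
            partial += expensiveBatch(raw);
        }
        return partial;
    }

    // Intermed step: output is tagged so a later step knows it is already processed.
    static Tagged intermed(List<Tagged> inputs) {
        return new Tagged("intermed", combine(inputs));
    }

    // Final step: same handling, but returns the plain result.
    static long finalPhase(List<Tagged> inputs) {
        return combine(inputs);
    }

    public static void main(String[] args) {
        Tagged p1 = intermed(List.of(initial(1L), initial(2L)));
        Tagged p2 = intermed(List.of(p1, initial(3L)));
        System.out.println(finalPhase(List.of(p2, initial(4L)))); // prints 30
    }
}
```

The Final step checks the tag as well, so it still gives the right answer if some values reach it straight from Initial.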