hadoop - R Reducer is not working properly in Amazon EMR
I have written map-reduce code in R and run it on Amazon EMR.

My input file format:

url1 word1 word2 word3
url2 word4 word2 word3
url3 word1 word7 word2

I'm expecting output where each word is followed by its urls, concatenated with spaces:

word1 url1 url3
word2 url1 url2 url3
word3 url1 url2
..
..
But EMR is using 3 reducers and creating 3 output files. Each file's output is correct on its own: the values are combined and there are no duplicate keys within a file. But if you look at the 3 files together, there are duplicate keys across them.

Output file 1:

word1 url1 url3
word2 url1
..

Output file 2:

word2 url2 url3
word3 url1
..

As you can see, word2 is distributed across 2 files. I need each key to appear in only 1 file.
I'm using Hadoop streaming in EMR. Please suggest the correct settings to avoid duplicate keys across the different output files.
I assume the mapper is working fine.
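The mapper itself is not shown in the question; purely as a rough sketch, a streaming mapper in R for this transform might look like the following (the stdin handling and the space separator are assumptions chosen to match the reducer's sep = " ", not taken from the actual job):

#!/usr/bin/env Rscript
# Hypothetical streaming mapper (the real one is not shown in the question).
# Reads lines like "url1 word1 word2 word3" from stdin and emits one
# "word url" pair per word, space-separated.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  fields <- strsplit(line, " ", fixed = TRUE)[[1]]
  if (length(fields) < 2) next
  url <- fields[1]
  for (word in fields[-1]) {
    cat(paste(word, url), "\n", sep = "")
  }
}
close(con)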
The reducer code:

process <- function(mat) {
  rows <- nrow(mat)
  cols <- ncol(mat)
  # Merge rows that share the same key (column 1) by concatenating their values.
  # Note: i+1:rows is parsed as i + (1:rows); out-of-range j values are
  # skipped by the j <= rows guard below.
  for (i in 1:rows) {
    for (j in i+1:rows) {
      if (j <= rows) {
        if (toString(mat[i,1]) == toString(mat[j,1])) {
          x <- paste(mat[i,2], mat[j,2], sep=" ")
          mat[i,2] <- x
          mat <- mat[-j,]
          rows <- rows - 1
        }
      }
    }
  }
  write.table(mat, file=stdout(), quote=FALSE, row.names=FALSE, col.names=FALSE)
}

reduce <- function(input) {
  # Create column names to make it easier to work with the data set.
  names <- c("word", "value")
  cols <- as.list(vector(length=2, mode="character"))
  names(cols) <- names
  # Read the mapper output in chunks and merge duplicate keys within each chunk.
  hsTableReader(file=input, cols, ignoreKey=TRUE, chunkSize=100000, fun=process, sep=" ")
}
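The question does not show how this reducer is wired up as an executable streaming script; a minimal wrapper might look like the sketch below (the HadoopStreaming library call and the stdin connection are assumptions, with process() and reduce() defined above in the same file):

#!/usr/bin/env Rscript
# Hypothetical wrapper for the reduce() function above; assumes the
# HadoopStreaming R package (which provides hsTableReader) is installed
# on the cluster nodes.
library(HadoopStreaming)

con <- file("stdin", open = "r")
reduce(con)
close(con)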
Have you tried using a combiner to gather the same keys onto the same reducer? That way you should be able to gather the words with the same key into a single reducer. Check the WordCount examples with a combiner to understand how the combiner class works.
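One thing worth checking, offered as a hedged sketch rather than a verified fix: Hadoop streaming splits each mapper output line at the first tab by default, so if the mapper emits space-separated pairs the whole line becomes the key and the same word can be hashed to different reducers. Telling streaming where the key ends should route every occurrence of a word to one reducer, and the combiner mentioned above can then be added with -combiner to pre-merge values on the map side. The jar path, bucket names and script names below are placeholders, and the exact property names are worth verifying against the streaming docs for your Hadoop version:

hadoop jar /path/to/hadoop-streaming.jar \
    -D stream.map.output.field.separator=' ' \
    -D stream.num.map.output.key.fields=1 \
    -D stream.reduce.input.field.separator=' ' \
    -input s3://your-bucket/input \
    -output s3://your-bucket/output \
    -mapper mapper.R \
    -combiner reducer.R \
    -reducer reducer.R \
    -file mapper.R \
    -file reducer.R

If a single output file is acceptable regardless of partitioning, setting -D mapred.reduce.tasks=1 (mapreduce.job.reduces on newer Hadoop) also guarantees that every key ends up in one file, at the cost of reduce-side parallelism.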