hadoop - R reducer is not working properly in Amazon EMR


I have written MapReduce code in R and I run it in Amazon EMR.

My input file format:

url1 word1 word2 word3
url2 word4 word2 word3
url3 word1 word7 word2

I'm expecting output with the URLs for each word concatenated with spaces:

word1 url1 url3
word2 url1 url2 url3
word3 url1 url2
.. ... ..
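To make the intended transformation concrete, here is a minimal sketch in plain R that builds the word-to-URLs grouping on the sample data. It runs in memory, outside Hadoop; the variable names (`lines`, `pairs`, `index`) are illustrative assumptions and not part of the original job.

```r
# Hypothetical in-memory copy of the sample input: one line per URL.
lines <- c("url1 word1 word2 word3",
           "url2 word4 word2 word3",
           "url3 word1 word7 word2")

# Split each line into fields; the first field is the URL, the rest are words.
fields <- strsplit(lines, " ", fixed = TRUE)

# Build one (word, url) pair per word occurrence.
pairs <- do.call(rbind, lapply(fields, function(f) {
  data.frame(word = f[-1], url = f[1], stringsAsFactors = FALSE)
}))

# Group the URLs by word and join each group with single spaces.
index <- vapply(split(pairs$url, pairs$word), paste, character(1), collapse = " ")

# Print one "word url url ..." line per word.
writeLines(paste(names(index), index))
```

In the real job the pairs would be produced by the mapper and the grouping by the reducer; the sketch only shows what the combined result should look like.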

But EMR is using 3 reducers and creating 3 output files. Within each file the output is correct: the values are combined and there are no duplicate keys. If you look at the 3 files together, though, there are duplicate keys.

Output file 1:

word1 url1 url3
word2 url1
.. ..

Output file 2:

word2 url2 url3
word3 url1
.. ..

As you can see, word2 is distributed across 2 files. I need each key to appear in only 1 file.

I'm using Hadoop Streaming in EMR. Please suggest the correct settings to remove duplicate keys across different files.

I assume the mapper is working fine. Here is the reducer:

process <- function(mat){
  rows = nrow(mat)
  cols = ncol(mat)

  for(i in 1:rows)
  {
    # note: i+1:rows parses as i + (1:rows), hence the bounds check below
    for(j in i+1:rows)
    {
      if(j<=rows)
      {
        # same key in rows i and j: append the value of row j to row i
        if(toString(mat[i,1])==toString(mat[j,1]))
        {
          x<-paste(mat[i,2],mat[j,2],sep=" ")
          mat[i,2]=x
          mat<-mat[-j,]
          rows<-rows-1
        }
      }
    }
  }

  write.table(mat, file=stdout(), quote=FALSE, row.names=FALSE, col.names=FALSE)
}

reduce <- function(input){
  # create column names to make it easier to work with the data set
  names <- c("word", "value")
  cols = as.list(vector(length=2, mode="character"))
  names(cols) <- names

  # read the input
  hsTableReader(file=input, cols, ignoreKey=TRUE, chunkSize=100000, FUN=process, sep=" ")
}
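Separately from the multiple-reducer issue, deleting rows while looping over them can skip the row that shifts into position `j`, so a key that occurs three or more times in a row may not be fully merged. A hedged alternative is to group the values by key in one step. The sketch below assumes the chunk arrives as a data frame with `word` and `value` columns; `collapse_by_key` is a hypothetical name, not a function from the HadoopStreaming package.

```r
# Collapse a (word, value) chunk to one row per word, joining values with spaces.
# `collapse_by_key` is a hypothetical helper, not part of any package.
collapse_by_key <- function(df) {
  # split() keeps the original order of values within each word
  grouped <- split(as.character(df$value), as.character(df$word))
  data.frame(
    word  = names(grouped),
    value = vapply(grouped, paste, character(1), collapse = " "),
    stringsAsFactors = FALSE,
    row.names = NULL
  )
}

# Illustrative chunk in which "word2" appears three times in a row.
chunk <- data.frame(
  word  = c("word1", "word2", "word2", "word2", "word3"),
  value = c("url1", "url1", "url2", "url3", "url1"),
  stringsAsFactors = FALSE
)

result <- collapse_by_key(chunk)

# Same output call as in the original process() function.
write.table(result, file = stdout(), quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

Something like this could be passed as the `FUN` argument in place of `process`, with the `write.table` call moved inside it. It only merges keys within one chunk, so it does not by itself address keys that are split across reducers.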

Have you tried using a combiner to gather the same keys on the same reducer? That way you should be able to gather the words with the same key in a single reducer. Check the WordCount examples that use a combiner to understand how the combiner class works.
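Which reducer a record reaches depends on what Hadoop Streaming treats as the key. By default the key is the text before the first tab on a line, and the whole line is the key when there is no tab. If the mapper emits space-separated lines (an assumption, suggested by `sep=" "` in the reducer), then "word2 url1" and "word2 url2" are different keys and may be sent to different reducers. The sketch below illustrates this with a toy partition function; `streaming_key` and `toy_partition` are hypothetical names, and the toy hash is not the hash Hadoop actually uses.

```r
# Key as Hadoop Streaming sees it by default: the text before the first tab,
# or the whole line when the line contains no tab.
streaming_key <- function(line) {
  sub("\t.*$", "", line)
}

# Toy stand-in for a hash partitioner. This is NOT Hadoop's real hash function;
# it only shows that different keys can map to different reducers.
toy_partition <- function(key, n_reducers) {
  sum(utf8ToInt(key)) %% n_reducers
}

# Mapper output for word2, once space-separated and once tab-separated.
space_lines <- c("word2 url1", "word2 url2", "word2 url3")
tab_lines   <- c("word2\turl1", "word2\turl2", "word2\turl3")

space_parts <- vapply(streaming_key(space_lines), toy_partition,
                      numeric(1), n_reducers = 3)
tab_parts   <- vapply(streaming_key(tab_lines), toy_partition,
                      numeric(1), n_reducers = 3)

# Space-separated: each whole line is its own key, so the partitions differ.
print(unname(space_parts))
# Tab-separated: every line has the key "word2", so the partition is the same.
print(unname(tab_parts))
```

Under this assumption, one option is to have the mapper write the word and the URL separated by a tab, for example with `cat(word, url, sep = "\t")` followed by a newline, and to read the reducer input with `sep = "\t"`. Hadoop Streaming also documents a `stream.map.output.field.separator` setting for changing the separator instead.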

