R - Create a vector of unique values out of several columns with overlapping values (based on groups of word parts)


I asked a similar question before, but this one is different. Here is the original question:

stackoverflow.com/questions/24402893/create-a-vector-of-unique-values-out-of-several-columns-with-overlapping-values/24403752?noredirect=1#comment37762312_24403752

In my data.frame I have three columns on the subject of each row. I want an additional column with a unique subject for each row. First, here is how my data looks:

    date <- c("1","2","3","4","5","6","7","1","2","3","4","5","6","7")
    comp <- c("a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b", "b")
    ret <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
    class <- c("positive", "negative", "aneutral", "positive", "positive", "negative", "aneutral",
               "positive", "negative", "negative", "positive", "aneutral", "aneutral", "aneutral")
    subject.1 <- c("litigation","layoff","pollution","chemical disaster","press release","people","emissions",
                   "energy","waste management","employees","management","press release","hotels","pollution")
    subject.2 <- c("pollution","employees","nuclear","fuels","stock option plan","executives","co2",
                   "solar","pollution","executives","press release","celebrities","celebrities","litigation")
    subject.3 <- c("environment","job reductions","power plants","pollution","employees","fraud","climate change",
                   "sustainability","hazardous waste","bonus pay","litigation","emissions","scandals","scandals")
    controlvar <- c("11","13","13","14","13","14","12","11","13","13","14","13","14","12")

    mydf <- data.frame(date, comp, ret, class, subject.1, subject.2, subject.3, controlvar, stringsAsFactors = FALSE)

    mydf

    #    date comp   ret    class         subject.1         subject.2       subject.3 controlvar
    # 1     1    a -2.00 positive        litigation         pollution     environment         11
    # 2     2    a  1.10 negative            layoff         employees  job reductions         13
    # 3     3    a  3.00 aneutral         pollution           nuclear    power plants         13
    # 4     4    a  1.40 positive chemical disaster             fuels       pollution         14
    # 5     5    a -0.20 positive     press release stock option plan       employees         13
    # 6     6    a  0.60 negative            people        executives           fraud         14
    # 7     7    a  0.10 aneutral         emissions               co2  climate change         12
    # 8     1    b -0.21 positive            energy             solar  sustainability         11
    # 9     2    b -1.20 negative  waste management         pollution hazardous waste         13
    # 10    3    b  0.90 negative         employees        executives       bonus pay         13
    # 11    4    b  0.30 positive        management     press release      litigation         14
    # 12    5    b -0.10 aneutral     press release       celebrities       emissions         13
    # 13    6    b  0.30 aneutral            hotels       celebrities        scandals         14
    # 14    7    b -0.12 aneutral         pollution        litigation        scandals         12

Since I want to include the subject as a dummy variable (which should be exclusive) in a later regression, I want a single column with a unique subject for each row. I'd like to focus on the subjects litigation, pollution, and layoff.

I want to go from left to right and check each subject column against these word parts:

    c("litigat", "claim", "suit", "judg") -> litigation
    c("pollut", "wast", "emission") -> pollution
    c("layoff") -> layoff

If one of the word parts for litigation, pollution, or layoff appears in the first column, that subject is taken. If there is a different subject in the first column, check the second column, and so on. If none of the three subject columns contains a word part for litigation, pollution, or layoff, the subject should be called other.

The output should look like this:

    #    date comp   ret    class         subject.1         subject.2       subject.3    subject controlvar
    # 1     1    a -2.00 positive        litigation         pollution     environment litigation         11
    # 2     2    a  1.10 negative            layoff         employees  job reductions     layoff         13
    # 3     3    a  3.00 aneutral         pollution           nuclear    power plants  pollution         13
    # 4     4    a  1.40 positive chemical disaster             fuels       pollution  pollution         14
    # 5     5    a -0.20 positive     press release stock option plan       employees      other         13
    # 6     6    a  0.60 negative            people        executives           fraud      other         14
    # 7     7    a  0.10 aneutral         emissions               co2  climate change  pollution         12
    # 8     1    b -0.21 positive            energy             solar  sustainability      other         11
    # 9     2    b -1.20 negative  waste management         pollution hazardous waste  pollution         13
    # 10    3    b  0.90 negative         employees        executives       bonus pay      other         13
    # 11    4    b  0.30 positive        management     press release      litigation litigation         14
    # 12    5    b -0.10 aneutral     press release       celebrities       emissions  pollution         13
    # 13    6    b  0.30 aneutral            hotels       celebrities        scandals      other         14
    # 14    7    b -0.12 aneutral         pollution        litigation        scandals  pollution         12

Try:

    dat <- stack(sapply(c("litigation", "pollution", "layoff"),
                   function(x) grep(paste(get(x), collapse="|"),
                           as.character(interaction(mydf[,5:7], sep=" ")))))

    dat2 <- merge(dat, data.frame(values=1:14), all=TRUE)

    dat2n <- dat2[!duplicated(dat2$values),]  ## delete duplicated values
    dat2n$ind <- as.character(dat2n$ind)
    dat2n$ind[is.na(dat2n$ind)] <- "other"    ## change NAs to "other"
    transform(mydf, subject=dat2n$ind)

Explanation

You created three vectors (pasting your code):

    c("litigat", "claim", "suit", "judg") -> litigation
    c("pollut", "wast", "emission") -> pollution
    c("layoff") -> layoff

So, running the code below gives:

    lapply(c("litigation","pollution", "layoff"), function(x) get(x)) # get() searches by name for an object
    #[[1]]
    #[1] "litigat" "claim"   "suit"    "judg"
    #
    #[[2]]
    #[1] "pollut"   "wast"     "emission"
    #
    #[[3]]
    #[1] "layoff"

Then I pasted the components together to make a single string separated by "|" for grep:

    sapply(c("litigation","pollution", "layoff"), function(x) paste(get(x), collapse="|"))
    #               litigation                  pollution                     layoff
    #"litigat|claim|suit|judg"     "pollut|wast|emission"                   "layoff"

    as.character(interaction(mydf[,5:7], sep=" ")) # pastes the relevant columns row-wise
    #[1] "litigation pollution environment"
    #[2] "layoff employees job reductions"
    #[3] "pollution nuclear power plants"
    #[4] "chemical disaster fuels pollution"
    #[5] "press release stock option plan employees"

Then I searched for each pattern in the combined rows using grep.
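To make the grep step concrete, here is a small sketch run on the first five combined rows only. The names `combined`, `hits`, and `long` are made up for this illustration; the pattern vectors are the ones from the question.

```r
# Pattern vectors from the question.
litigation <- c("litigat", "claim", "suit", "judg")
pollution  <- c("pollut", "wast", "emission")
layoff     <- "layoff"

# Illustrative stand-in for the pasted subject columns (first five rows only).
combined <- c("litigation pollution environment",
              "layoff employees job reductions",
              "pollution nuclear power plants",
              "chemical disaster fuels pollution",
              "press release stock option plan employees")

# grep() returns the indices of the elements that match the pattern.
hits <- lapply(list(litigation = litigation, pollution = pollution, layoff = layoff),
               function(parts) grep(paste(parts, collapse = "|"), combined))
hits

# stack() turns the named list into a long data frame:
# values = row index, ind = name of the pattern vector that matched.
long <- stack(hits)
long
```

Row 1 shows up twice in `long` (once for litigation, once for pollution), and row 5 does not show up at all. Those are the two cases the later steps have to handle.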

I used stack to get the row index along with the object names, i.e. litigation, pollution, and so on. Then I merged the dataset with the row numbers; you can also use ?match. Where multiple values are mapped to different objects, the first one is selected and the rest are deleted using duplicated. I changed the ind column from factor to character, and the NAs in ind were changed to "other".
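One detail worth checking: because the three columns are pasted together before searching, the position of a match is no longer known, so a row such as 14 (pollution, litigation, scandals) may not follow the left-to-right rule from the question. Below is a minimal sketch of an alternative that checks the columns one at a time. The names `patterns` and `classify` are made up for this illustration and are not part of the original answer.

```r
# Subject columns from the question.
subject.1 <- c("litigation","layoff","pollution","chemical disaster","press release","people","emissions",
               "energy","waste management","employees","management","press release","hotels","pollution")
subject.2 <- c("pollution","employees","nuclear","fuels","stock option plan","executives","co2",
               "solar","pollution","executives","press release","celebrities","celebrities","litigation")
subject.3 <- c("environment","job reductions","power plants","pollution","employees","fraud","climate change",
               "sustainability","hazardous waste","bonus pay","litigation","emissions","scandals","scandals")

# Word parts per category, collapsed into one regular expression each.
patterns <- c(
  litigation = "litigat|claim|suit|judg",
  pollution  = "pollut|wast|emission",
  layoff     = "layoff"
)

# Hypothetical helper: category for each element of x, or NA if nothing matches.
# Looping in reverse means earlier categories overwrite later ones when a
# single cell matches more than one pattern.
classify <- function(x) {
  out <- rep(NA_character_, length(x))
  for (nm in rev(names(patterns))) {
    out[grepl(patterns[[nm]], x)] <- nm
  }
  out
}

# Go left to right: only rows that are still unclassified look at the next column.
subject <- classify(subject.1)
for (col in list(subject.2, subject.3)) {
  fill <- is.na(subject)
  subject[fill] <- classify(col[fill])
}
subject[is.na(subject)] <- "other"

subject
```

With mydf from the question, the result could be attached with transform(mydf, subject = subject).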

