r - Create a vector of unique values out of several columns with overlapping values (based on groups of word parts)
I asked a similar question before, but this one is still different. Here's the original question:
In a data.frame I have 3 columns describing the subject of each row. I want an additional column with a single, unique subject for each row. First, here is how the data looks:
    date <- c("1","2","3","4","5","6","7","1","2","3","4","5","6","7")
    comp <- c("a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b", "b")
    ret <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
    class <- c("positive", "negative", "aneutral", "positive", "positive", "negative", "aneutral",
               "positive", "negative", "negative", "positive", "aneutral", "aneutral", "aneutral")
    subject.1 <- c("litigation","layoff","pollution","chemical disaster","press release","people","emissions",
                   "energy","waste management","employees","management","press release","hotels","pollution")
    subject.2 <- c("pollution","employees","nuclear","fuels","stock option plan","executives","co2",
                   "solar","pollution","executives","press release","celebrities","celebrities","litigation")
    subject.3 <- c("environment","job reductions","power plants","pollution","employees","fraud","climate change",
                   "sustainability","hazardous waste","bonus pay","litigation","emissions","scandals","scandals")
    controlvar <- c("11","13","13","14","13","14","12","11","13","13","14","13","14","12")

    mydf <- data.frame(date, comp, ret, class, subject.1, subject.2, subject.3, controlvar,
                       stringsAsFactors = FALSE)
    mydf
    #    date comp   ret    class         subject.1         subject.2       subject.3 controlvar
    #  1    1    a -2.00 positive        litigation         pollution     environment         11
    #  2    2    a  1.10 negative            layoff         employees  job reductions         13
    #  3    3    a  3.00 aneutral         pollution           nuclear    power plants         13
    #  4    4    a  1.40 positive chemical disaster             fuels       pollution         14
    #  5    5    a -0.20 positive     press release stock option plan       employees         13
    #  6    6    a  0.60 negative            people        executives           fraud         14
    #  7    7    a  0.10 aneutral         emissions               co2  climate change         12
    #  8    1    b -0.21 positive            energy             solar  sustainability         11
    #  9    2    b -1.20 negative  waste management         pollution hazardous waste         13
    # 10    3    b  0.90 negative         employees        executives       bonus pay         13
    # 11    4    b  0.30 positive        management     press release      litigation         14
    # 12    5    b -0.10 aneutral     press release       celebrities       emissions         13
    # 13    6    b  0.30 aneutral            hotels       celebrities        scandals         14
    # 14    7    b -0.12 aneutral         pollution        litigation        scandals         12
Since I want to include the subject as a dummy variable (which should be mutually exclusive) in a later regression, I need a single column, subject, holding one unique subject for each row. I'd like to focus on the subjects litigation, pollution, and layoff.
I want to go from left to right and check each subject column against these groups of word parts:
    c("litigat", "claim", "suit", "judg") -> litigation
    c("pollut", "wast", "emission") -> pollution
    c("layoff") -> layoff
If one of the word parts for litigation, pollution, or layoff appears in the first column, that subject should be taken. If the first column holds a different subject, check the second column, and so on. If none of the 3 subject columns contains a word part for litigation, pollution, or layoff, the subject should be called other.
The output should look like this:
    #    date comp   ret    class         subject.1         subject.2       subject.3    subject controlvar
    #  1    1    a -2.00 positive        litigation         pollution     environment litigation         11
    #  2    2    a  1.10 negative            layoff         employees  job reductions     layoff         13
    #  3    3    a  3.00 aneutral         pollution           nuclear    power plants  pollution         13
    #  4    4    a  1.40 positive chemical disaster             fuels       pollution  pollution         14
    #  5    5    a -0.20 positive     press release stock option plan       employees      other         13
    #  6    6    a  0.60 negative            people        executives           fraud      other         14
    #  7    7    a  0.10 aneutral         emissions               co2  climate change  pollution         12
    #  8    1    b -0.21 positive            energy             solar  sustainability      other         11
    #  9    2    b -1.20 negative  waste management         pollution hazardous waste  pollution         13
    # 10    3    b  0.90 negative         employees        executives       bonus pay      other         13
    # 11    4    b  0.30 positive        management     press release      litigation litigation         14
    # 12    5    b -0.10 aneutral     press release       celebrities       emissions  pollution         13
    # 13    6    b  0.30 aneutral            hotels       celebrities        scandals      other         14
    # 14    7    b -0.12 aneutral         pollution        litigation        scandals  pollution         12
Try:
    # For each group, find the rows whose pasted subject columns match any word part
    dat <- stack(sapply(c("litigation", "pollution", "layoff"), function(x)
      grep(paste(get(x), collapse = "|"),
           as.character(interaction(mydf[, 5:7], sep = " ")))))
    dat2 <- merge(dat, data.frame(values = 1:14), all = TRUE)  # add the rows with no match
    dat2n <- dat2[!duplicated(dat2$values), ]                  # delete duplicated values
    dat2n$ind <- as.character(dat2n$ind)
    dat2n$ind[is.na(dat2n$ind)] <- "other"                     # change NAs to "other"
    transform(mydf, subject = dat2n$ind)
Explanation
You created 3 vectors (pasting your code):
    c("litigat", "claim", "suit", "judg") -> litigation
    c("pollut", "wast", "emission") -> pollution
    c("layoff") -> layoff
So, running the code below gives:
    lapply(c("litigation","pollution", "layoff"), function(x) get(x)) # get() looks up an object by its name
    #[[1]]
    #[1] "litigat" "claim"   "suit"    "judg"
    #
    #[[2]]
    #[1] "pollut"   "wast"     "emission"
    #
    #[[3]]
    #[1] "layoff"
Then I pasted the components of each vector into a single string separated by "|", to use as the pattern for grep:
    sapply(c("litigation","pollution", "layoff"), function(x) paste(get(x),collapse="|"))
    #               litigation                 pollution                    layoff
    #"litigat|claim|suit|judg"    "pollut|wast|emission"                  "layoff"

    as.character(interaction(mydf[,5:7],sep=" ")) # paste the relevant columns row-wise
    #[1] "litigation pollution environment"
    #[2] "layoff employees job reductions"
    #[3] "pollution nuclear power plants"
    #[4] "chemical disaster fuels pollution"
    #[5] "press release stock option plan employees"
Then I searched for each pattern in the combined rows using grep, which returns the indices of the rows that match.
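To make the grep step concrete, here is a minimal sketch on a made-up three-element vector. The names `rows`, `pattern`, and `hits` are illustrative only; the strings are copies of three pasted rows, not the full `mydf`.

```r
# Illustrative input: three pasted "rows" (hypothetical subset, not the full mydf)
rows <- c("litigation pollution environment",
          "press release stock option plan employees",
          "waste management pollution hazardous waste")

# Alternation pattern built from the pollution word parts
pattern <- paste(c("pollut", "wast", "emission"), collapse = "|")

# grep() returns the positions of the elements that match the pattern
hits <- grep(pattern, rows)
hits
# [1] 1 3
```

The second element contains none of the word parts, so its index is absent from the result; such rows are the ones that later end up as "other".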
I used stack to get the row indices alongside the object names, i.e. litigation, pollution, and layoff. Then I merged that with a dataset of row numbers (you could also use ?match). Where multiple values are mapped to different objects, the first one is kept and the rest are deleted using duplicated. Finally, I changed the `ind` column from `factor` to `character`, and the NAs in `ind` were changed to "other".
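One caveat worth checking: because the three subject columns are pasted together before matching, the label that survives `duplicated()` depends on the row order after `merge()`, not on which column matched first. For a row that matches more than one group (row 14, for example, has pollution in subject.1 and litigation in subject.2), the result may not follow the left-to-right rule from the question. Below is a minimal sketch that enforces column priority explicitly. The helpers `classify_one` and `pick_subject` and the list `patterns` are hypothetical names introduced here, and `demo` copies only the subject columns of rows 1, 5, 11, and 14 of `mydf`.

```r
# Word-part groups from the question, collected in a named list (assumed layout)
patterns <- list(
  litigation = c("litigat", "claim", "suit", "judg"),
  pollution  = c("pollut", "wast", "emission"),
  layoff     = "layoff"
)

# Hypothetical helper: label a single string, or NA if no group matches
classify_one <- function(x) {
  for (nm in names(patterns)) {
    if (grepl(paste(patterns[[nm]], collapse = "|"), x)) return(nm)
  }
  NA_character_
}

# Hypothetical helper: walk the columns left to right and keep the first label found
pick_subject <- function(df, cols) {
  out <- rep(NA_character_, nrow(df))
  for (col in cols) {
    lab <- vapply(df[[col]], classify_one, character(1), USE.NAMES = FALSE)
    out <- ifelse(is.na(out), lab, out)  # only fill rows that are still unlabelled
  }
  out[is.na(out)] <- "other"
  out
}

# Subject columns of rows 1, 5, 11 and 14 of mydf
demo <- data.frame(
  subject.1 = c("litigation", "press release", "management", "pollution"),
  subject.2 = c("pollution", "stock option plan", "press release", "litigation"),
  subject.3 = c("environment", "employees", "litigation", "scandals"),
  stringsAsFactors = FALSE
)

demo$subject <- pick_subject(demo, c("subject.1", "subject.2", "subject.3"))
demo$subject
# [1] "litigation" "other"      "litigation" "pollution"
```

On the full data this would be called as `pick_subject(mydf, c("subject.1", "subject.2", "subject.3"))`. Within a single cell the groups are still tried in the order they appear in `patterns`, which is an arbitrary choice of this sketch.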