bash - awk merge lines based on multiple field matching/unmatching -
we have csv:
targetid , cpd_number , assay_id , alt_assay_id , type_desc , operator , result_value, unit_value , experiment_date , discipline, activity_flag 51, cpd-7788990 ,9999,0, ic50 ,,10, um , 2006-07-01 00:00:00 , biochemical , 51, cpd-7788990 ,4444,5555, ki , > ,5, um , 2006-08-01 00:00:00 , enzyme , 51, cpd-7788990 ,1212,2323, ic50 ,,100, um , 2006-09-01 00:00:00 , cell ,
our final goal : if “cpd_number” ($2) same “discipline” ($10) not “cell”, merge lines “discipline” ($10) “cell” , not “cell” together. (the “discipline” has 3 options: biochemical, cell, enzyme.) following ideal output.
(note) new “result_value” ($7) = “result_value” ($7) of line “discipline” ($10) “cell” divide “result_value” ($7) of line “discipline” ($10) “biochemical” or "enzyme".
targetid , cpd_number , cell_assay_id , cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,operator, result_value, unit_value ,cell_experiment_date,experiment_date, cell_discipline , discipline 51,cpd-7788990,1212,2323, ic50 ,9999,0,ic50,,10,um, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,cell,biochemical 51,cpd-7788990,1212,2323, ic50 ,4444,5555,ki,>,20,um, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,cell,enzyme
it looks quite complicated @ 1 time. thus, trying merge whole line first: if “cpd_number” ($2) same “discipline” ($10) “different”, merge lines “discipline” ($10) “cell” lines “discipline” ($10) not “cell”. after merge, can use awk further clean up/re-name header. guru kindly offer ideas how write one-liner? toy example. actual csv files quite huge starting /^51/ might not ideal. thanks!
targetid , cpd_number , assay_id , alt_assay_id , type_desc , operator , result_value, unit_value , experiment_date , discipline, activity_flag, targetid , cpd_number , assay_id , alt_assay_id , type_desc , operator , result_value, unit_value , experiment_date , discipline, activity_flag 51, cpd-7788990 ,9999,0, ic50 ,,10, um , 2006-07-01 00:00:00 , biochemical , 51, cpd-7788990 ,1212,2323, ic50 ,,100, um , 2006-09-01 00:00:00 , cell , 51, cpd-7788990 ,4444,5555, ki , > ,5, um , 2006-08-01 00:00:00 , enzyme , 51, cpd-7788990 ,1212,2323, ic50 ,,100, um , 2006-09-01 00:00:00 , cell ,
extra example:
targetid , cpd_number , cell_assay_id , cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,operator, result_value, unit_value ,cell_experiment_date,experiment_date, cell_discipline , discipline 51, cpd-7788990 ,9999,0, ic50 ,,10, um , 2006-07-01 00:00:00 , biochemical , 51, cpd-7788990 ,4444,5555, ki , > ,5, um , 2006-08-01 00:00:00 , enzyme , 51, cpd-7788990 ,1212,2323, ic50 ,,100, um , 2006-09-01 00:00:00 , cell , 51, cpd-7788990 ,8888,9999, ic50 ,,200, um , 2006-09-01 00:00:00 , cell ,
output:
targetid , cpd_number , cell_assay_id , cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,operator, result_value, unit_value ,cell_experiment_date,experiment_date, cell_discipline , discipline 51,cpd-7788990,1212,2323, ic50 ,9999,0,ic50,,10,um, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,cell,biochemical 51,cpd-7788990,1212,2323, ic50 ,4444,5555,ki,>,20,um, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,cell,enzyme 51,cpd-7788990,8888,9999, ic50 ,9999,0,ic50,,20,um, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,cell,biochemical 51,cpd-7788990,8888,9999, ic50 ,4444,5555,ki,>,40,um, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,cell,enzyme
this awk script hacked based on sample input , final desired output. feel free tweak around suit needs. should enough started. takes 2 passes csv file. in first pass, builds array based on 2nd column discipline cell , in second pass formats lines together. since have not stated lines not have cell discipline, following solution ignores them.
content of script.awk
begin { fs = " *, *" # set input field sep regex ofs = "," # set output field sep comma } nr==fnr { # in first pass file if ($(nf-1) == "cell") { # if second last field cell flds[$2,$3,$4] = $3 ofs $4 ofs $5; # create array store col 3,4 , 5 separated comma date[$2,$3,$4] = $9 # store date result[$2,$3,$4] = $7 # store col 7 } next # move next record } { # second pass file (fld in flds) { # every entry in our array split (fld, tmp, subsep); # split composite key if ($(nf-1) != "cell" && tmp[1] == $2) { # if last field not cell , first piece of key same col 2 line = $0 # save current line in variable $3 = flds[fld] ofs $3 # modify col3 put value array in front of col3 $7 = result[fld] / $7 # calculate new result value $9 = date[fld] ofs $9 # add date $(nf-1) = "cell" ofs $(nf-1) # place cell text nf-- # remove last field print # print line $0 = line # swap modified line } } } $(nf-1) == "cell" { # if last field cell don't print next }
run like:
$ awk -f script.awk file file 51,cpd-7788990,1212,2323,ic50,9999,0,ic50,,10,um,2006-09-01 00:00:00,2006-07-01 00:00:00,cell,biochemical 51,cpd-7788990,8888,9999,ic50,9999,0,ic50,,20,um,2006-09-01 00:00:00,2006-07-01 00:00:00,cell,biochemical 51,cpd-7788990,1212,2323,ic50,4444,5555,ki,>,20,um,2006-09-01 00:00:00,2006-08-01 00:00:00,cell,enzyme 51,cpd-7788990,8888,9999,ic50,4444,5555,ki,>,40,um,2006-09-01 00:00:00,2006-08-01 00:00:00,cell,enzyme
you can include printing of header statement inside begin
block.
Comments
Post a Comment