bash - awk merge lines based on multiple field matching/unmatching -

we have csv:

targetid , cpd_number , assay_id , alt_assay_id , type_desc , operator ,   result_value, unit_value , experiment_date , discipline, activity_flag  51, cpd-7788990 ,9999,0,  ic50  ,,10, um ,  2006-07-01 00:00:00 , biochemical , 51, cpd-7788990 ,4444,5555, ki , > ,5, um ,  2006-08-01 00:00:00  ,  enzyme , 51, cpd-7788990 ,1212,2323,  ic50  ,,100, um ,  2006-09-01 00:00:00  , cell ,

our final goal : if “cpd_number” ($2) same “discipline” ($10) not “cell”, merge lines “discipline” ($10) “cell” , not “cell” together. (the “discipline” has 3 options: biochemical, cell, enzyme.) following ideal output.
(note) new “result_value” ($7) = “result_value” ($7) of line “discipline” ($10) “cell” divide “result_value” ($7) of line “discipline” ($10) “biochemical” or "enzyme".

targetid , cpd_number , cell_assay_id , cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,operator,   result_value, unit_value ,cell_experiment_date,experiment_date, cell_discipline , discipline  51,cpd-7788990,1212,2323, ic50 ,9999,0,ic50,,10,um, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,cell,biochemical 51,cpd-7788990,1212,2323, ic50 ,4444,5555,ki,>,20,um, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,cell,enzyme

it looks quite complicated @ 1 time. thus, trying merge whole line first: if “cpd_number” ($2) same “discipline” ($10) “different”, merge lines “discipline” ($10) “cell” lines “discipline” ($10) not “cell”. after merge, can use awk further clean up/re-name header. guru kindly offer ideas how write one-liner? toy example. actual csv files quite huge starting /^51/ might not ideal. thanks!

targetid , cpd_number , assay_id , alt_assay_id , type_desc , operator ,   result_value, unit_value , experiment_date , discipline, activity_flag, targetid , cpd_number , assay_id , alt_assay_id , type_desc , operator ,   result_value, unit_value , experiment_date , discipline, activity_flag  51, cpd-7788990 ,9999,0,  ic50  ,,10, um ,  2006-07-01 00:00:00 , biochemical , 51, cpd-7788990 ,1212,2323,  ic50  ,,100, um ,  2006-09-01 00:00:00  , cell , 51, cpd-7788990 ,4444,5555, ki , > ,5, um ,  2006-08-01 00:00:00  ,  enzyme , 51, cpd-7788990 ,1212,2323,  ic50  ,,100, um ,  2006-09-01 00:00:00  , cell ,

extra example:

targetid , cpd_number , cell_assay_id , cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,operator,   result_value, unit_value ,cell_experiment_date,experiment_date, cell_discipline , discipline  51, cpd-7788990 ,9999,0,  ic50  ,,10, um ,  2006-07-01 00:00:00 , biochemical , 51, cpd-7788990 ,4444,5555, ki , > ,5, um ,  2006-08-01 00:00:00  ,  enzyme , 51, cpd-7788990 ,1212,2323,  ic50  ,,100, um ,  2006-09-01 00:00:00  , cell , 51, cpd-7788990 ,8888,9999,  ic50  ,,200, um ,  2006-09-01 00:00:00  , cell ,

output:

targetid , cpd_number , cell_assay_id , cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,operator,   result_value, unit_value ,cell_experiment_date,experiment_date, cell_discipline , discipline  51,cpd-7788990,1212,2323, ic50 ,9999,0,ic50,,10,um, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,cell,biochemical 51,cpd-7788990,1212,2323, ic50 ,4444,5555,ki,>,20,um, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,cell,enzyme 51,cpd-7788990,8888,9999, ic50 ,9999,0,ic50,,20,um, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,cell,biochemical 51,cpd-7788990,8888,9999, ic50 ,4444,5555,ki,>,40,um, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,cell,enzyme

this awk script hacked based on sample input , final desired output. feel free tweak around suit needs. should enough started. takes 2 passes csv file. in first pass, builds array based on 2nd column discipline cell , in second pass formats lines together. since have not stated lines not have cell discipline, following solution ignores them.

content of script.awk

begin {      fs  = " *, *"                             # set input field sep regex     ofs = ","                                 # set output field sep comma }  nr==fnr {                                     # in first pass file     if ($(nf-1) == "cell") {                  # if second last field cell         flds[$2,$3,$4] = $3 ofs $4 ofs $5;    # create array store col 3,4 , 5 separated comma         date[$2,$3,$4] = $9                   # store date         result[$2,$3,$4] = $7                 # store col 7     }     next                                      # move next record }   {                                             # second pass file     (fld in flds) {                       # every entry in our array         split (fld, tmp, subsep);             # split composite key         if ($(nf-1) != "cell" && tmp[1] == $2) {   # if last field not cell , first piece of key same col 2             line = $0                         # save current line in variable             $3 = flds[fld] ofs $3             # modify col3 put value array in front of col3             $7 = result[fld] / $7             # calculate new result value             $9 = date[fld] ofs $9             # add date             $(nf-1) = "cell" ofs $(nf-1)      # place cell text             nf--                              # remove last field             print                             # print line             $0 = line                         # swap modified line         }     } }  $(nf-1) == "cell" {                           # if last field cell don't print      next }

run like:

$ awk -f script.awk file file 51,cpd-7788990,1212,2323,ic50,9999,0,ic50,,10,um,2006-09-01 00:00:00,2006-07-01 00:00:00,cell,biochemical 51,cpd-7788990,8888,9999,ic50,9999,0,ic50,,20,um,2006-09-01 00:00:00,2006-07-01 00:00:00,cell,biochemical 51,cpd-7788990,1212,2323,ic50,4444,5555,ki,>,20,um,2006-09-01 00:00:00,2006-08-01 00:00:00,cell,enzyme 51,cpd-7788990,8888,9999,ic50,4444,5555,ki,>,40,um,2006-09-01 00:00:00,2006-08-01 00:00:00,cell,enzyme

you can include printing of header statement inside begin block.

Search This Blog

CSS

bash - awk merge lines based on multiple field matching/unmatching -

Comments

Post a Comment

Popular posts from this blog

sql server - MSSQL Text and Varchar(MAX) fields shown (MEMO) in DBGrid -

qml - Is it possible to implement SystemTrayIcon functionality in Qt Quick application -

mysql - Flyway migration, Unable to obtain Jdbc connection from DataSource -