python - Search and replace missing lines in the data structure by pandas -


My data files have many missing lines (these are time series data). I want to find the location of these missing lines (the time points at which values are missing) and replace them with lines of NaN.

For this purpose, I used pandas. However, it took a lot of time, so I couldn't check whether the code is working or not.

    import numpy as np
    import pandas as pd

    for input in filelist:
        data = pd.read_csv(input, sep='#', index_col=[1], parse_dates=[1])
        print(data)
        start = data.head(1).index
        end = data.tail(1).index

        # Full minute-by-minute index for the whole period
        timestamp = pd.date_range('2013-07-01 00:00:00', '2013-09-30 23:59:00', freq="T")
        n = timestamp.size

        # Empty frame filled with NaN, to be populated one row at a time
        cr = pd.DataFrame(np.nan, columns=data.columns, index=timestamp)

        for i in range(n):
            s = "%04d%02d%02d%02d%02d" % (timestamp.year[i], timestamp.month[i], timestamp.day[i],
                                          timestamp.hour[i], timestamp.minute[i])
            try:
                cr.loc[timestamp[i]] = data.loc[s]
            except ValueError:
                print(input, s, "ValueError")
            except KeyError:
                print(input, s, "KeyError")

        output = filelist[i] + "result"

If you have any idea how to reduce the running time of this code, I would appreciate it.

Create a sample series by taking 100k rows from a regular index, simulating what you would read with read_csv:

    In [22]: pd.set_option('max_rows', 10)

    In [23]: index = pd.date_range('2013-07-01 00:00:00', '2013-09-30 23:59:00', freq="T")

    In [24]: df = DataFrame(np.random.randn(100000), index=index.take(np.sort(np.random.choice(range(len(index)), 100000, replace=False))))

    In [25]: df
    Out[25]:
                                0
    2013-07-01 00:00:00  0.009764
    2013-07-01 00:01:00  0.900823
    2013-07-01 00:02:00 -1.718848
    2013-07-01 00:04:00  0.197763
    2013-07-01 00:05:00 -0.269920
    ...                       ...
    2013-09-30 23:54:00  0.368947
    2013-09-30 23:55:00  1.488926
    2013-09-30 23:57:00  0.749575
    2013-09-30 23:58:00  0.964609
    2013-09-30 23:59:00  0.112557

    [100000 rows x 1 columns]

Reindex to the full index; the timestamps that were missing become NaN rows:

    In [26]: df.reindex(index)
    Out[26]:
                                0
    2013-07-01 00:00:00  0.009764
    2013-07-01 00:01:00  0.900823
    2013-07-01 00:02:00 -1.718848
    2013-07-01 00:03:00       NaN
    2013-07-01 00:04:00  0.197763
    ...                       ...
    2013-09-30 23:55:00  1.488926
    2013-09-30 23:56:00       NaN
    2013-09-30 23:57:00  0.749575
    2013-09-30 23:58:00  0.964609
    2013-09-30 23:59:00  0.112557

    [132480 rows x 1 columns]
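Applied to the files in the question, the same idea might look like the sketch below. It is only an illustration: the function name `fill_missing_minutes`, the column names, and the in-memory sample standing in for a real file are all made up here, and the timestamp is assumed to sit in the second column as in the question's `read_csv` call. The frequency is spelled `"min"`, which means the same thing as `"T"`.

```python
import io

import pandas as pd


def fill_missing_minutes(source, start, end):
    """Read a '#'-separated file and return it on a complete minute index.

    Hypothetical helper: assumes the timestamp is the second column
    (position 1), as in the question's read_csv call.
    """
    data = pd.read_csv(source, sep="#", index_col=1, parse_dates=[1])
    full_index = pd.date_range(start, end, freq="min")
    # One vectorized call replaces the row-by-row loop; absent minutes get NaN.
    return data.reindex(full_index)


# Made-up sample standing in for one input file (minutes 2, 3 and 5 are missing).
sample = io.StringIO(
    "station#time#value\n"
    "A#2013-07-01 00:00:00#1.0\n"
    "A#2013-07-01 00:01:00#2.0\n"
    "A#2013-07-01 00:04:00#5.0\n"
)

result = fill_missing_minutes(sample, "2013-07-01 00:00:00", "2013-07-01 00:05:00")
print(result)
```

Note that `reindex` raises an error if the file contains duplicate timestamps, so those would need to be dropped or aggregated first.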

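You can also do df.asfreq('T'), though this has different end-point semantics (e.g. the series will start and stop at the first and last timestamps present in the sample, rather than at the ends of the range you specify).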
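The difference in end points can be seen on a small made-up example. This is a sketch with invented data, where the first and last minutes of the requested range are deliberately absent; `"min"` is used as the frequency alias and is equivalent to `"T"`.

```python
import numpy as np
import pandas as pd

# Requested range: ten minutes, 00:00 through 00:09.
full_index = pd.date_range("2013-07-01 00:00", "2013-07-01 00:09", freq="min")

# Keep only minutes 2, 3, 5 and 7, so both ends of the range are missing.
present = full_index[[2, 3, 5, 7]]
df = pd.DataFrame({"value": np.arange(4.0)}, index=present)

by_reindex = df.reindex(full_index)  # covers the whole requested range
by_asfreq = df.asfreq("min")         # covers only 00:02 through 00:07

print(len(by_reindex), len(by_asfreq))  # prints: 10 6
```

So `reindex` is the safer choice when every file must cover exactly the same period, while `asfreq` is convenient when the observed start and end are the ones you want.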

