python - Search and replace missing lines in the data structure by pandas
My data files have many missing lines (they are time series data). I want to find
the locations of these missing lines (missing values at certain time points) and replace them with lines of NaN.
For this purpose I used pandas. However, it took too much time, so I couldn't check whether the code was working or not.
    import numpy as np
    import pandas as pd

    for input in filelist:
        data = pd.read_csv(input, sep='#', index_col=[1], parse_dates=[1])
        print data
        start = data.head(1).index
        end = data.tail(1).index
        # full one-minute index for the whole period
        timestamp = pd.date_range('2013-07-01 00:00:00', '2013-09-30 23:59:00', freq="T")
        n = timestamp.size
        # empty frame of NaN to be filled row by row (this loop is the slow part)
        CR = pd.DataFrame(np.nan, columns=data.columns, index=timestamp)
        for i in range(n):
            s = "%04d%02d%02d%02d%02d" % (timestamp.year[i], timestamp.month[i], timestamp.day[i],
                                          timestamp.hour[i], timestamp.minute[i])
            try:
                CR.ix[timestamp[i]] = data.ix[s]
            except ValueError:
                print input, s, "ValueError"
            except KeyError:
                print input, s, "KeyError"
        output = filelist[i] + "result"
If you have any idea how to reduce the running time of this code, I would appreciate it.
Create a sample series by taking 100k rows from a regular index, simulating what you would read with read_csv:
    In [22]: pd.set_option('max_rows',10)

    In [23]: index = pd.date_range('2013-07-01 00:00:00','2013-09-30 23:59:00',freq="T")

    In [24]: df = DataFrame(np.random.randn(100000),index=index.take(np.sort(np.random.choice(range(len(index)),100000,replace=False))))

    In [25]: df
    Out[25]:
                                0
    2013-07-01 00:00:00  0.009764
    2013-07-01 00:01:00  0.900823
    2013-07-01 00:02:00 -1.718848
    2013-07-01 00:04:00  0.197763
    2013-07-01 00:05:00 -0.269920
    ...                       ...
    2013-09-30 23:54:00  0.368947
    2013-09-30 23:55:00  1.488926
    2013-09-30 23:57:00  0.749575
    2013-09-30 23:58:00  0.964609
    2013-09-30 23:59:00  0.112557

    [100000 rows x 1 columns]
Reindex against the full index; the missing time points come back as NaN rows:

    In [26]: df.reindex(index)
    Out[26]:
                                0
    2013-07-01 00:00:00  0.009764
    2013-07-01 00:01:00  0.900823
    2013-07-01 00:02:00 -1.718848
    2013-07-01 00:03:00       NaN
    2013-07-01 00:04:00  0.197763
    ...                       ...
    2013-09-30 23:55:00  1.488926
    2013-09-30 23:56:00       NaN
    2013-09-30 23:57:00  0.749575
    2013-09-30 23:58:00  0.964609
    2013-09-30 23:59:00  0.112557

    [132480 rows x 1 columns]
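For a self-contained illustration of the same idea, here is a minimal sketch on a tiny made-up frame. The column name `value`, the three sample rows, and the five-minute range are all invented for the example. It uses `"min"` as the minute alias because newer pandas versions deprecate `"T"`. It also shows one way to list which time points were missing, using `Index.difference`.

```python
import pandas as pd

# Full, regular one-minute index covering the period of interest.
full_index = pd.date_range("2013-07-01 00:00:00", "2013-07-01 00:04:00", freq="min")

# Made-up data in which the 00:01 and 00:03 rows are missing.
df = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(
        ["2013-07-01 00:00:00", "2013-07-01 00:02:00", "2013-07-01 00:04:00"]
    ),
)

# Time points that are in the full index but absent from the data.
missing = full_index.difference(df.index)

# Reindexing inserts a NaN row for every missing time point in one step,
# with no Python-level loop over the timestamps.
filled = df.reindex(full_index)

print(missing)
print(filled)
```

Because the alignment happens inside pandas rather than in a per-row loop, this should scale to the three-month, one-minute index from the question without the slowdown described there.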
You can also use df.asfreq('T'), though this has different end-point semantics (e.g. the series will start/stop at the first/last period in the sample).
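The end-point difference is easiest to see on a small example. The sketch below uses made-up data (the `value` column and timestamps are invented for illustration) that starts one minute after, and ends one minute before, the range you actually want. Again `"min"` stands in for `"T"`.

```python
import pandas as pd

# Made-up data that starts at 00:01 and ends at 00:03, with 00:02 missing.
df = pd.DataFrame(
    {"value": [1.0, 2.0]},
    index=pd.to_datetime(["2013-07-01 00:01:00", "2013-07-01 00:03:00"]),
)

# asfreq only spans from the first to the last timestamp present in the data.
by_asfreq = df.asfreq("min")

# reindex spans whatever range you pass in, here 00:00 through 00:04.
full_index = pd.date_range("2013-07-01 00:00:00", "2013-07-01 00:04:00", freq="min")
by_reindex = df.reindex(full_index)

print(len(by_asfreq), len(by_reindex))  # 3 5
```

So if a file happens to be missing its very first or last rows, `asfreq` will not add them back, while `reindex` with an explicit `date_range` will.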