Reason behind speed of fread in data.table package in R


I am amazed at the speed of the fread function in data.table on large data files. How does it manage to read so fast? What are the basic implementation differences between fread and read.csv?

I assume we are comparing against read.csv with all the known advice applied, such as setting colClasses, nrows, etc. read.csv(filename) without any other arguments is slow mainly because it first reads everything into memory as if it were character and then attempts to coerce that to integer or numeric as a second step.

So, comparing fread to read.csv(filename, colClasses=, nrows=, etc) ...

They are both written in C, so it's not that.

There isn't one reason in particular but, essentially, fread memory maps the file into memory and then iterates through the file using pointers, whereas read.csv reads the file into a buffer via a connection.
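As a minimal sketch of what a "tuned" read.csv call looks like: the sample file, its column names and its row count below are made up for illustration; only colClasses and nrows are the arguments under discussion.

```r
# Hypothetical sample file, written to a temp path purely for illustration.
path <- tempfile(fileext = ".csv")
sample_data <- data.frame(
  id    = 1:5,
  score = c(0.5, 1.5, 2.5, 3.5, 4.5),
  label = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)
write.csv(sample_data, path, row.names = FALSE)

# Untuned: read.csv has to work out each column's type after reading.
untuned <- read.csv(path, stringsAsFactors = FALSE)

# Tuned: declare the column types and the number of rows up front.
tuned <- read.csv(
  path,
  colClasses = c("integer", "numeric", "character"),
  nrows = 5
)

# Both calls end up with the same column types; the tuned one skips the guessing.
sapply(tuned, class)
unlink(path)
```

On a file this small the two calls are indistinguishable in speed; the point is only to show which arguments the comparison assumes have been set.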
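Base R does not expose memory mapping, so the fread side cannot be shown without extra packages. The connection side can be illustrated at the R level, though. This is only a rough sketch of pulling a file through a connection chunk by chunk, not the C code inside read.csv; the file contents and the chunk size of 4 are arbitrary choices for the example.

```r
# Hypothetical sample file: one header line followed by 10 data lines.
path <- tempfile(fileext = ".csv")
writeLines(c("x,y", sprintf("%d,%d", 1:10, 11:20)), path)

# Open a connection and read the file through it a few lines at a time.
con <- file(path, open = "r")
header <- readLines(con, n = 1)
rows_seen <- 0L
repeat {
  chunk <- readLines(con, n = 4)   # next chunk of up to 4 lines
  if (length(chunk) == 0L) break   # nothing left to read
  rows_seen <- rows_seen + length(chunk)
}
close(con)
unlink(path)

rows_seen
```

Every line has to be copied out of the connection into R before anything can be done with it, which is the extra step that a memory-mapped file scanned with pointers avoids.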

If you run fread with verbose=TRUE it will tell you how it works and report the time spent in each of the steps. For example, notice that it skips straight to the middle and the end of the file to make a better guess of the column types (although in this case the top 5 were enough).

    > fread("test.csv",verbose=TRUE)
    Input contains no \n. Taking this to be a filename to open
    File opened, filesize is 0.486 GB
    File is opened and mapped ok
    Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
    Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
    Found 6 columns
    First row with 6 fields occurs on line 1 (either column names or first row of data)
    All the fields on line 1 are character fields. Treating as the column names.
    Count of eol after first data row: 10000001
    Subtracted 1 for last eol and any trailing empty lines, leaving 10000000 data rows
    Type codes (   first 5 rows): 113431
    Type codes (+ middle 5 rows): 113431
    Type codes (+   last 5 rows): 113431
    Type codes: 113431 (after applying colClasses and integer64)
    Type codes: 113431 (after applying drop or select (if supplied)
    Allocating 6 column slots (6 - 0 dropped)
    Read 10000000 rows and 6 (of 6) columns from 0.486 GB file in 00:00:44
      13.420s ( 31%) Memory map (rerun may be quicker)
       0.000s (  0%) sep and header detection
       3.210s (  7%) Count rows (wc -l)
       0.000s (  0%) Column type detection (first, middle and last 5 rows)
       1.310s (  3%) Allocation of 10000000x6 result (xMB) in RAM
      25.580s ( 59%) Reading data
       0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
       0.000s (  0%) Coercing data already read in type bumps (if any)
       0.040s (  0%) Changing na.strings to NA
      43.560s        Total
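To get a report of this kind for yourself, something like the following should work. It assumes the data.table package is installed, and the sample file is made up. The wording and the set of steps printed differ between data.table versions, so do not expect the output to match the report above line for line.

```r
# Made-up sample file for illustration: 100 rows, two columns.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:100, b = rnorm(100)), path, row.names = FALSE)

# Only run the fread call if data.table is actually available.
if (requireNamespace("data.table", quietly = TRUE)) {
  # verbose = TRUE prints the detection steps and the per-step timings
  dt <- data.table::fread(path, verbose = TRUE)
  print(dim(dt))
}
unlink(path)
```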

NB: these timings are on a slow netbook with no SSD. Both the absolute and the relative times of each step will vary from machine to machine. For example, if you rerun fread a second time you may notice that the time to mmap is less, because the OS has cached the file from the previous run.

    $ lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                2
    On-line CPU(s) list:   0,1
    Thread(s) per core:    1
    Core(s) per socket:    2
    Socket(s):             1
    NUMA node(s):          1
    Vendor ID:             AuthenticAMD
    CPU family:            20
    Model:                 2
    Stepping:              0
    CPU MHz:               800.000         # i.e. slow netbook
    BogoMIPS:              1995.01
    Virtualisation:        AMD-V
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              512K
    NUMA node0 CPU(s):     0,1
