Reason behind the speed of fread in the data.table package in R
I am amazed at the speed of the fread function in data.table on large data files. How does it manage to read so fast? What are the basic implementation differences between fread and read.csv?
I assume we are comparing to read.csv with all the known advice applied, such as setting colClasses, nrows, etc. read.csv(filename) without any other arguments is slow mainly because it first reads everything into memory as if it were character, and then attempts to coerce that to integer or numeric as a second step.
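As a rough illustration of that advice, here is a minimal sketch in base R. The sample file, its three columns, and the row count are assumptions made up for the example; they are not from the original question.

```r
# Hypothetical sample file; the column layout is an assumption for illustration.
path <- tempfile(fileext = ".csv")
n <- 1000L
df <- data.frame(id = seq_len(n),
                 value = seq_len(n) / 10,
                 label = rep(c("a", "b"), length.out = n),
                 stringsAsFactors = FALSE)
write.csv(df, path, row.names = FALSE)

# Default call: read.csv has to work out the column types itself.
slow <- read.csv(path)

# Tuned call: column types and the number of data rows are declared up front.
fast <- read.csv(path,
                 colClasses = c("integer", "numeric", "character"),
                 nrows = n)

str(fast)
```

On a file this small the difference will not be measurable; the point is only which arguments are being set when people talk about a "tuned" read.csv.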
So, comparing fread to read.csv(filename, colClasses=, nrows=, etc) ... They are both written in C, so it's not that.
There isn't one reason in particular but, essentially, fread memory maps the file into memory and then iterates through the file using pointers, whereas read.csv reads the file into a buffer via a connection.
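To make the "via a connection" side of that contrast concrete, here is a small base R sketch that pulls a file through a connection a few lines at a time. This is not how read.csv is implemented internally; it only shows the sequential, buffer-at-a-time access pattern that a connection gives you. The file contents and the chunk size of 4 are arbitrary assumptions. Memory mapping is not shown because, as far as I know, base R does not expose it directly.

```r
# Hypothetical two-column file with 10 data rows.
path <- tempfile(fileext = ".csv")
writeLines(c("x,y", sprintf("%d,%d", 1:10, 11:20)), path)

con <- file(path, open = "r")        # open a read connection
header <- readLines(con, n = 1L)     # first line: column names

rows_read <- 0L
repeat {
  chunk <- readLines(con, n = 4L)    # pull the next (up to) 4 lines into a buffer
  if (length(chunk) == 0L) break     # nothing left to read
  rows_read <- rows_read + length(chunk)
}
close(con)

rows_read
```

With a connection the reader only ever sees the next chunk, whereas a memory-mapped file can be addressed at any offset, which is what lets fread jump around the file cheaply.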
If you run fread with verbose=TRUE it will tell you how it works and report the time spent in each of the steps. For example, notice that it skips straight to the middle and the end of the file to make a better guess of the column types (although in this case the top 5 rows were enough).
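The value of sampling the middle and end of the file can be sketched in base R. This is a toy illustration of the idea, not fread's actual detection code: the file, the helper names idx and looks_integer, and the regular expression are all assumptions for the example.

```r
# Hypothetical file: column x looks like an integer everywhere except the last row.
path <- tempfile(fileext = ".csv")
n <- 100L
df <- data.frame(id = seq_len(n),
                 x = as.character(seq_len(n)),
                 stringsAsFactors = FALSE)
df$x[n] <- "not_a_number"
write.csv(df, path, row.names = FALSE, quote = FALSE)

# Load the raw text of column x (a real reader would not load everything first).
fields <- read.csv(path, colClasses = "character")$x

idx <- function(start) seq(start, length.out = 5L)   # 5 consecutive positions
top    <- fields[idx(1L)]
middle <- fields[idx(length(fields) %/% 2L)]
last   <- fields[idx(length(fields) - 4L)]

looks_integer <- function(v) all(grepl("^-?[0-9]+$", v))

# Guess from the top 5 rows only, then from top + middle + last.
guess_top <- if (looks_integer(top)) "integer" else "character"
guess_all <- if (looks_integer(c(top, middle, last))) "integer" else "character"

c(top_only = guess_top, top_middle_last = guess_all)

# If data.table is installed, the verbose report can be inspected directly.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::fread(path, verbose = TRUE)
}
```

Looking only at the top rows would guess integer and force a later correction; including the last rows catches the odd value before any data is read.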
    > fread("test.csv",verbose=TRUE)
    Input contains no \n. Taking this to be a filename to open
    File opened, filesize is 0.486 GB
    File is opened and mapped ok
    Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
    Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
    Found 6 columns
    First row with 6 fields occurs on line 1 (either column names or first row of data)
    All the fields on line 1 are character fields. Treating as the column names.
    Count of eol after first data row: 10000001
    Subtracted 1 for last eol and any trailing empty lines, leaving 10000000 data rows
    Type codes (   first 5 rows): 113431
    Type codes (+ middle 5 rows): 113431
    Type codes (+   last 5 rows): 113431
    Type codes: 113431 (after applying colClasses and integer64)
    Type codes: 113431 (after applying drop or select (if supplied)
    Allocating 6 column slots (6 - 0 dropped)
    Read 10000000 rows and 6 (of 6) columns from 0.486 GB file in 00:00:44
      13.420s ( 31%) Memory map (rerun may be quicker)
       0.000s (  0%) sep and header detection
       3.210s (  7%) Count rows (wc -l)
       0.000s (  0%) Column type detection (first, middle and last 5 rows)
       1.310s (  3%) Allocation of 10000000x6 result (xMB) in RAM
      25.580s ( 59%) Reading data
       0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
       0.000s (  0%) Coercing data already read in type bumps (if any)
       0.040s (  0%) Changing na.strings to NA
      43.560s        Total
NB: these timings are on a slow netbook with no SSD. Both the absolute and relative times of each step will vary from machine to machine. For example, if you rerun fread a second time you may notice that the time to mmap is less, because your OS has cached the file from the previous run.
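A simple way to see this on your own machine is to time the same read twice. The sketch below is an assumption-laden example: the file size is arbitrary, read_once is a hypothetical helper, and it falls back to read.csv if data.table is not installed. Because the file is written just before it is read, the OS may already have it cached on the first read, so the gap can be small or absent.

```r
# Hypothetical test file; size chosen only so the read takes a measurable moment.
path <- tempfile(fileext = ".csv")
n <- 200000L
write.csv(data.frame(a = seq_len(n), b = runif(n)), path, row.names = FALSE)

# Hypothetical helper: use fread when available, otherwise read.csv.
read_once <- function(p) {
  if (requireNamespace("data.table", quietly = TRUE)) {
    data.table::fread(p)
  } else {
    read.csv(p)
  }
}

first  <- system.time(read_once(path))[["elapsed"]]
second <- system.time(read_once(path))[["elapsed"]]

c(first = first, second = second)
```

For a fair comparison on a large file, compare a read straight after a reboot (or after the file has fallen out of the cache) with an immediate rerun.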
    $ lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                2
    On-line CPU(s) list:   0,1
    Thread(s) per core:    1
    Core(s) per socket:    2
    Socket(s):             1
    NUMA node(s):          1
    Vendor ID:             AuthenticAMD
    CPU family:            20
    Model:                 2
    Stepping:              0
    CPU MHz:               800.000         # i.e. slow netbook
    BogoMIPS:              1995.01
    Virtualisation:        AMD-V
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              512K
    NUMA node0 CPU(s):     0,1