Batch inserts are a useful technique when inserting large numbers of rows into a database. Rather than inserting each row individually, we can send a batch of rows to the database in a single operation, avoiding many individual network round-trips and other per-statement overheads. The performance improvement from batching can be considerable. Some database drivers support batching natively, in which case one creates a single prepared statement and then adds multiple sets of bind parameters to it before executing the batch. For instance, in JDBC you can do the following:
PreparedStatement stmt = ...;
stmt.setString(1, "...");
stmt.setInt(2, 42);
...
stmt.addBatch();
// Later, after adding multiple sets of bind parameters we can do:
int[] returnCodes = stmt.executeBatch();
I (NEM) am currently doing some work with PostgreSQL that will involve inserting very large numbers of rows (many millions), and I like to knock up prototypes in Tcl as a quick way to get to grips with the data. As far as I can tell, neither TDBC nor Pgtcl natively supports any kind of batch API. However, we can fake it on top of normal prepared statements by making use of SQL multi-row VALUES expressions [L1 ]:
INSERT INTO some_table VALUES (a, b, c), (d, e, f), ...
By preparing a statement with placeholders for a large number of rows, we can build up the full set of bind parameter values and execute them in one go, getting much the same benefit as a native batch API. For example, a two-row batch over columns a and b would be prepared as INSERT INTO some_table(a,b) VALUES (:a0,:b0),(:a1,:b1), with the values then bound under the names a0, b0, a1 and b1. As a test case, we will insert data from the worldcitiespop.txt (41.9MB) [L2 ] CSV file, which contains roughly 3.2 million rows. (NB: if you really just want to load a CSV into Postgres, consider the "COPY ... FROM ..." command, which is much faster than any other method.)
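For reference, a rough sketch of that COPY approach for this file, using the db connection handle and world_cities table created in the script below. The file path here is a placeholder, and server-side COPY ... FROM requires that the Postgres server process can read the file and that your role is permitted to do so; otherwise psql's \copy is the client-side equivalent:

db allrows {
    COPY world_cities(country, city, accentcity, region, population, latitude, longitude)
    FROM '/path/to/worldcitiespop.txt'
    WITH (FORMAT csv, HEADER true, ENCODING 'LATIN1')
}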
We will use TDBC for simplicity, although I would say that Pgtcl is significantly faster in my testing. The test code is as follows:
package require tdbc::postgres

set BATCH_SIZE 1
set COLUMNS {country city accentcity region population latitude longitude}

tdbc::postgres::connection create db ;# specify -user/-host etc if necessary

puts "Creating/truncating table"
db allrows {
    CREATE TABLE IF NOT EXISTS world_cities(
        country     text,
        city        text,
        accentcity  text,
        region      text,
        population  integer,
        latitude    decimal,
        longitude   decimal
    )
}
db allrows { TRUNCATE world_cities }

# Prepare a batch insert statement of the given size, using TDBC named
# placeholders :column0, :column1, ... for each row in the batch.
proc prepareBatchInsert {table columns batchSize} {
    set sql "INSERT INTO ${table}([join $columns ,]) VALUES "
    for {set i 0} {$i < $batchSize} {incr i} {
        if {$i > 0} { append sql , }
        set tuple [list]
        foreach column $columns {
            lappend tuple [format {:%s%d} $column $i]
        }
        append sql ([join $tuple ,])
    }
    db prepare $sql
}

set batchStmt [prepareBatchInsert world_cities $COLUMNS $BATCH_SIZE]

set in [open worldcitiespop.txt]
chan configure $in -encoding iso8859-1
gets $in header ;# skip the header line

set start [clock seconds]
db transaction {
    set i 0
    set batch [dict create]
    while {[gets $in line] >= 0} {
        # Key each value by the row's position within the current batch so
        # that it matches the :columnN placeholders; empty CSV fields are
        # omitted from the dict and so are bound as NULL.
        set pos [expr {$i % $BATCH_SIZE}]
        foreach column $COLUMNS value [split $line ,] {
            if {$value ne ""} {
                dict set batch $column$pos $value
            }
        }
        if {[incr i] % $BATCH_SIZE == 0} {
            $batchStmt execute $batch
            set batch [dict create]
            #puts -nonewline [format "\r%8d" $i]
            #flush stdout
        }
    }
    if {[dict size $batch] > 0} {
        # Prepare a statement for the remaining partial batch
        set remaining [expr {$i % $BATCH_SIZE}]
        set stmt [prepareBatchInsert world_cities $COLUMNS $remaining]
        $stmt execute $batch
    }
}
set end [clock seconds]
set time [expr {$end - $start}]
close $in

db foreach row {select count(1) from world_cities} {
    set rows [dict get $row count]
}
puts "\nInserted $i/$rows rows in $time seconds ([expr {$rows/$time}] rows/second)"
db close
The results for a BATCH_SIZE of 1 (i.e. no batching) vs 1000 are as follows:
Batch size 1   : Inserted 3173958/3173958 rows in 1483 seconds (2140 rows/second)
Batch size 1000: Inserted 3173958/3173958 rows in 108 seconds (29388 rows/second)
As we can see, batching provides an order of magnitude performance improvement over unbatched inserts. With Pgtcl I can do better still (63479 rows/second), but the code is significantly more complex due to the lower-level API.
Harm Olthof 2017-08-16: note that each time your loop executes $batchStmt execute $batch, a result set is created, which allocates memory. This could have an impact on your performance (although you seem to have a very high-performing Postgres server). If you change your code as follows, you should get more inserts per second:
set rs [$batchStmt execute $batch]
incr rows [$rs rowcount]
$rs close
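The same consideration applies to the partial batch at the end of the loop; a quick sketch of that branch, using the names from the script above, with both the result set and the one-off statement closed:

set remaining [expr {$i % $BATCH_SIZE}]
set stmt [prepareBatchInsert world_cities $COLUMNS $remaining]
set rs [$stmt execute $batch]
incr rows [$rs rowcount]
$rs close
$stmt close ;# release the statement prepared only for this partial batch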