Database Batch Inserts

Batch inserts are a useful technique when inserting large numbers of rows into a database at once. Rather than inserting each row individually, we can send a batch to the database at once, avoiding lots of individual network message round-trips and other per-statement inefficiencies. The performance improvements of using batching can be quite considerable. Sometimes database drivers support batching natively, in which case one can create a single prepared statement and then add multiple sets of bind parameters to it before executing the batch. For instance, in JDBC you can do the following:

PreparedStatement stmt = ...;
stmt.setString(1, "...");
stmt.setInt(2, 42);
...
stmt.addBatch();
// Later, after adding multiple sets of bind parameters we can do:
int[] returnCodes = stmt.executeBatch();

I (NEM) am currently doing some work with PostgreSQL that will be inserting very large numbers of rows (many millions), and I like to knock up prototypes in Tcl as a quick way to get to grips with the data. As far as I can tell, neither TDBC nor Pgtcl natively support any kind of batch API. However, we can fake it on top of normal prepared statements by making use of SQL multi-row VALUES expressions [1 ]:

INSERT INTO some_table VALUES (a, b, c), (d, e, f), ...

By preparing a statement with placeholders for a large number of rows we can then build up the set of bind parameter values and execute in one go, getting the same benefits as for a batch API. As a test case, we will insert data from the worldcitiespop.txt (41.9MB) [2 ] CSV file that contains roughly 3.2 million rows. (NB if you really just want to load a CSV into Postgres, consider the "COPY ... FROM ..." command, which is much faster than any other methods).

We will use TDBC for simplicity, although I would say that Pgtcl is significantly faster in my testing. The test code is as follows:

package require tdbc::postgres

set BATCH_SIZE 1
set COLUMNS {country city accentcity region population latitude longitude}

tdbc::postgres::connection create db ;# specify -user/-host etc if necessary

puts "Creating/truncating table"
db allrows {
    CREATE TABLE IF NOT EXISTS world_cities(
        country text,
        city text,
        accentcity text,
        region text,
        population integer,
        latitude decimal,
        longitude decimal
    )
}
db allrows { TRUNCATE world_cities }

# Prepare a batch insert statement of the given size
proc prepareBatchInsert {table columns batchSize} {
    set sql "INSERT INTO ${table}([join $columns ,]) VALUES "
    for {set i 0} {$i < $batchSize} {incr i} {
        if {$i > 0} { append sql , }
        set tuple [list]
        foreach column $columns {
            lappend tuple [format {:%s%d} $column $i]
        }
        append sql ([join $tuple ,])
    }
    db prepare $sql
}

set batchStmt [prepareBatchInsert world_cities $COLUMNS $BATCH_SIZE]

set in [open worldcitiespop.txt]
chan configure $in -encoding iso8859-1
gets $in header

set start [clock seconds]
db transaction {
    set i 0
    set batch [dict create]
    while {[gets $in line] >= 0} {
        foreach column $COLUMNS value [split $line ,] {
            if {$value ne ""} {
                dict set batch $column$i $value
            }
        }

        if {[incr i] % $BATCH_SIZE == 0} {
            $batchStmt execute $batch
            set batch [dict create]
            #puts -nonewline [format "\r%8d" $i]
            #flush stdout
        }
    }

    if {[dict size $batch] > 0} {
        # Prepare a batch for the remaining elements
        set remaining [expr {$i % $BATCH_SIZE}]
        set stmt [prepareBatchInsert world_cities $COLUMNS $remaining]
        $stmt execute $batch
    }
}
set end [clock seconds]
set time [expr {$end - $start}]
close $in

db foreach row {select count(1) from world_cities} { set rows [dict get $row count] }
puts "\nInserted $i/$rows rows in $time seconds ([expr {$rows/$time}] rows/second)"

db close

The results for a BATCH_SIZE of 1 (i.e. no batching) vs 1000 are as follows:

Batch size 1   : Inserted 3173958/3173958 rows in 1483 seconds (2140 rows/second)
Batch size 1000: Inserted 3173958/3173958 rows in 108 seconds (29388 rows/second)

As we can see, batching provides an order of magnitude performance improvement vs unbatched inserts. With Pgtcl I can get better still - 63479 rows/second - but the code is significantly more complex due to the lower-level API.


Harm Olthof 2017-08-16: note that each time your loop executes $stmt execute $batch a record set is created which allocates memory. This could have impact on your perfomance (although you seem to have a very high performing Postgres server). If you change your code as follows, you should get more inserts per second:

set rs [$batchStmt execute $batch]
incr rows [$rs rowcount]
$rs close