One of the tasks of any Object-Relational Mapping (ORM) framework is to persist to the database any changes made to entity instances in memory. Typically this is done by a process that looks roughly like the following:
- Traverse all entity instances in scope and build a dependency graph
- Traverse this dependency graph and determine which entity instances are changed ('dirty'), are new, or will become changed once another entity is persisted. This traversal uses a depth-first search (DFS) algorithm.
- During the DFS traversal, all entities to persist are collected into a list
- At the end of the DFS we reverse the collected list, giving a topologically sorted list, where each entity E which depends on another entity F is persisted after F (see the sketch after this list).
Example: you persist a new Order entity and it references a new Customer entity. This means that the Order entity depends on the Customer entity, and will receive the PK value of the Customer as an FK value once the Customer entity has been persisted. Doing a topological sort of this two-node graph will result in a list [Customer, Order].
- We now persist the entity instances in the order in which they appear in the list we collected, one by one, creating a new INSERT or UPDATE query for each element.
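To make the sort concrete, here's a minimal sketch of a DFS-based topological sort in C#. The Entity class and its Dependencies collection are hypothetical, not LLBLGen Pro API; appending each entity only after its dependencies have been visited (post-order) yields the same dependencies-first list as the collect-and-reverse variant described above.

using System;
using System.Collections.Generic;

// Hypothetical entity with a list of the entities it depends on (e.g. Order -> Customer).
class Entity
{
    public List<Entity> Dependencies { get; } = new List<Entity>();
}

static class PersistOrder
{
    // Returns the entities so that every entity comes after the entities it depends on.
    public static List<Entity> TopologicalSort(IEnumerable<Entity> entities)
    {
        var visited = new HashSet<Entity>();
        var sorted = new List<Entity>();

        void Visit(Entity e)
        {
            if (!visited.Add(e))
            {
                return; // already placed in the list
            }
            foreach (var dependency in e.Dependencies)
            {
                Visit(dependency); // make sure e.g. the Customer is placed before the Order
            }
            sorted.Add(e); // appended only after all its dependencies
        }

        foreach (var entity in entities)
        {
            Visit(entity);
        }
        return sorted; // [Customer, Order] for the example above
    }
}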
This article will focus on the last step: persisting entities one by one.
The normal way to do things is to traverse the list of entities to persist, create an insert or update query for each, execute it, handle the post-query activities and move on to the next entity. This effectively means that for every insert and update a command has to be created, sent to the database, processed there, and the result has to be transported back.
That sending back and forth results in latency: the time it takes to send the commands across the network and have them processed by the database engine adds to the total time an insert or update action on an entity takes.
With a slow, high-latency network (e.g. in a cloud setup like AWS or Azure) and more than a couple of commands to execute, this becomes noticeable. It's that latency we're going to address with query batching.
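To illustrate the per-command round trip, here's a minimal sketch of that naive approach using plain ADO.NET (Microsoft.Data.SqlClient; the table and column names are made up for illustration). Every entity gets its own command and therefore its own network round trip:

using System.Collections.Generic;
using Microsoft.Data.SqlClient;

static class NaivePersister
{
    // One DbCommand per entity: every iteration is a full round trip to the database.
    public static void InsertOneByOne(string connectionString, IEnumerable<(string Name, decimal Price)> products)
    {
        using var connection = new SqlConnection(connectionString);
        connection.Open();
        foreach (var product in products)
        {
            using var command = new SqlCommand(
                "INSERT INTO [Product] ([Name], [Price]) VALUES (@name, @price); SELECT SCOPE_IDENTITY();",
                connection);
            command.Parameters.AddWithValue("@name", product.Name);
            command.Parameters.AddWithValue("@price", product.Price);
            var newId = command.ExecuteScalar(); // network latency paid here, per entity
        }
    }
}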
Let's first get an idea of how big this latency really is. In the OSS benchmark set for ORMs, RawDataAccessBencher, we've published a set of insert benchmarks to test how fast batch inserts are. We'll first run this benchmark with a batch size of 1, so all inserts are done with their own DbCommand object.
Set inserts of 1000 elements in one go (10 runs)
------------------------------------------------------------------------------
LLBLGen Pro v5.5.0.0 (v5.5.18.1019) : 544,99ms (6,93ms)
LLBLGen Pro needs 545ms on average to insert 1000 entities over the network (LAN) into a SQL Server database, using one DbCommand per entity. Now let's do the same benchmark but with a batch size of 100:
Set inserts of 1000 elements in one go (10 runs)
------------------------------------------------------------------------------
LLBLGen Pro v5.5.0.0 (v5.5.18.1019) : 107,93ms (2,04ms)
That's quite the difference! Here it uses 10 DbCommands, each with 100 queries packed together in one SQL statement. For reference, Entity Framework Core 2.1.4.0 gives this result (batch size 100):
Set inserts of 1000 elements in one go (10 runs)
------------------------------------------------------------------------------
Entity Framework Core v2.1.4.0 (v2.1.4.18263) : 133,02ms (7,66ms)
There's an optimum per query/latency setup, beyond which increasing the batch size won't make the overall process faster. For instance, using a batch size of 250 gives:
Set inserts of 1000 elements in one go (10 runs)
------------------------------------------------------------------------------
LLBLGen Pro v5.5.0.0 (v5.5.18.1019) : 160,37ms (6,77ms)
Which batch size is the fastest thus depends on your situation.
So what does such a batch look like?
INSERT INTO [AdventureWorks].[Sales].[CreditCard] ([CardNumber], [CardType], [ExpMonth], [ExpYear], [ModifiedDate]) VALUES (@p601, @p602, @p604, @p605, @p606) ;SELECT @p603 = SCOPE_IDENTITY();
INSERT INTO [AdventureWorks].[Sales].[CreditCard] ([CardNumber], [CardType], [ExpMonth], [ExpYear], [ModifiedDate]) VALUES (@p607, @p608, @p610, @p611, @p612) ;SELECT @p609 = SCOPE_IDENTITY();
INSERT INTO [AdventureWorks].[Sales].[CreditCard] ([CardNumber], [CardType], [ExpMonth], [ExpYear], [ModifiedDate]) VALUES (@p613, @p614, @p616, @p617, @p618) ;SELECT @p615 = SCOPE_IDENTITY();
INSERT INTO [AdventureWorks].[Sales].[CreditCard] ([CardNumber], [CardType], [ExpMonth], [ExpYear], [ModifiedDate]) VALUES (@p619, @p620, @p622, @p623, @p624) ;SELECT @p621 = SCOPE_IDENTITY();
INSERT INTO [AdventureWorks].[Sales].[CreditCard] ([CardNumber], [CardType], [ExpMonth], [ExpYear], [ModifiedDate]) VALUES (@p625, @p626, @p628, @p629, @p630) ;SELECT @p627 = SCOPE_IDENTITY();
-- etc.
It has grouped all inserts together using ';' delimiters and appended the identity value retrieval query to each insert query, so the new identity values can be returned to the runtime and set on the entity instances as their new PK values.
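As a rough sketch (not the actual LLBLGen Pro implementation), building such a batch with plain ADO.NET comes down to concatenating the individual statements and registering an output parameter per identity value, along these lines:

using System.Collections.Generic;
using System.Data;
using System.Text;
using Microsoft.Data.SqlClient;

static class BatchBuilder
{
    // Packs several inserts into one SqlCommand; the table/column names are illustrative only.
    public static SqlCommand BuildInsertBatch(SqlConnection connection, IReadOnlyList<string> cardNumbers)
    {
        var command = connection.CreateCommand();
        var sql = new StringBuilder();
        for (int i = 0; i < cardNumbers.Count; i++)
        {
            // one value parameter and one output parameter per packed insert
            sql.Append($"INSERT INTO [CreditCard] ([CardNumber]) VALUES (@p{i}) ;SELECT @id{i} = SCOPE_IDENTITY();");
            command.Parameters.AddWithValue($"@p{i}", cardNumbers[i]);
            var idParameter = command.Parameters.Add($"@id{i}", SqlDbType.Int);
            idParameter.Direction = ParameterDirection.Output; // filled after execution
        }
        command.CommandText = sql.ToString();
        return command; // a single ExecuteNonQuery() sends all inserts in one round trip
    }
}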
How to enable this magic? With just one line of code:
using(var adapter = new DataAccessAdapter())
{
// set the batch size to 80 queries.
// This will enable insert/update batching
adapter.BatchSize = 80;
// persist the unit of work
uow.Commit(adapter, true);
}
Batching characteristics
Inserts
The PK values created inside the database for an insert pose a problem: what if the very next insert requires that PK value as an FK? This is the case with entity types which have a relationship with themselves (like the textbook 'Employee.Manager m:1 Employee.ManagedEmployees' relationship). Entity instances of such types therefore aren't batchable, and neither are entity types using inheritance.
Another problem arises with our simple Order - Customer example above: if Customer uses an identity PK, we can't sync the value back to the Order's FK field if the Order is part of the same batch command. (Well, technically we can, by generating additional SQL into the batch, but that quickly becomes fairly complex and won't work on some databases.) To overcome this we have to run the Order insert in a separate batch from the Customer insert.
We determine this during the same topological sort action: we collect the order of the types of the elements, which is the same as the order in which they appear in the sorted list. We then collect all elements in batches per type, and they're always in the right order. In our simple example this means that if we have a collection of new Order instances, each referencing a new Customer entity (some might reference the same Customer entity), we end up with two batches: one with Customer instances and one with Order instances, executed after the Customer batch.
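A minimal sketch of that grouping, again with the hypothetical Entity class from the earlier sketch (not LLBLGen Pro API): split the topologically sorted list into per-type batches while keeping the type order of the sorted list.

using System;
using System.Collections.Generic;

static class BatchSplitter
{
    // Splits a topologically sorted list into batches per entity type;
    // the batch order follows the first occurrence of each type in the sorted list.
    public static List<List<Entity>> SplitIntoTypeBatches(IEnumerable<Entity> sortedEntities)
    {
        var batches = new List<List<Entity>>();
        var batchPerType = new Dictionary<Type, List<Entity>>();
        foreach (var entity in sortedEntities)
        {
            if (!batchPerType.TryGetValue(entity.GetType(), out var batch))
            {
                batch = new List<Entity>();
                batchPerType[entity.GetType()] = batch;
                batches.Add(batch); // e.g. the Customer batch is created before the Order batch
            }
            batch.Add(entity);
        }
        return batches; // execute batch 0 first, then batch 1, etc.
    }
}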
Updates
For updates things aren't that limited: there's no reason to group entities per type, as an update doesn't affect another entity instance through PK / FK synchronization. For this reason all updates are batched together in a single batch, regardless of type.
When can you use this magic?
Insert / Update query batching is supported in our 'Adapter' paradigm, on SQL Server, PostgreSQL and Oracle.
For more information, see the documentation on Insert / Update query batching.
Happy batching!