Transient Error Recovery

The LLBLGen Pro Runtime Framework supports basic Transient Error Recovery, also known as Connection Resilience, during database operations. It contains strategy classes which each implement a recovery strategy. A strategy is either used together with the action you'd like to perform on the database (SelfServicing and Adapter) or set once and then used automatically for all queries from then on (Adapter only).

This means for SelfServicing the strategies are not 'set-and-forget' but require you to add code to be able to recover a given database action. You can use the same pattern for Adapter too, but Adapter also offers a set-and-forget approach to strategies which is described later in this section.

This section describes the strategies available to you, how to write your own and how to use the strategies to make database work transient-error robust.

Transient errors

Transient errors are errors which are recoverable, for example a connection that was temporarily unavailable or a timeout. Some strategies check whether an error is transient: if it is, they retry, and otherwise they fail. Other strategies always retry, no matter what the error is.

Currently LLBLGen Pro supports dedicated transient error checking for SQL Server. It's easy to add your own for additional databases.

Strategies

To use transient error recovery, we have implemented a set of strategies. A strategy defines how work is retried. A strategy has three parameters:

  • The maximum number of retries. Default is 5.
  • The maximum delay (in seconds). This is the maximum amount of time recovery can take, starting with the initial failure till the last attempt. Default is 30 seconds.
  • The recovery delay object itself. Default is a delay object using the recovery type Exponential and a delay parameter of 2.

A strategy is asked to execute a piece of code through a lambda (which does or doesn't return a value and which is executed asynchronously or synchronously). If that piece of code fails and the error causing that failure is a transient error, the code is retried till the maximum number of retries is reached or the maximum delay is reached.

Delay between retries calculation

The period of inactivity between retry attempts is calculated by the RecoveryDelay object specified with the strategy. The RecoveryDelay object works with a delayParameter, which is used in the calculations. The following delay calculations are available, where attempt is the attempt number. Each calculation results in a number of seconds.

  • Exponential (default). Returns delayParameter^attempt.
  • Linear. Returns delayParameter.
  • Random. Returns a random value between 1 and delayParameter (delayParameter included).
  • Custom. Calls a specified lambda, passing in attempt and delayParameter. The lambda has to return the delay.

The delay used is the minimum of the newly calculated value and the maximum delay specified on the recovery delay object. So if you specified a TimeSpan of 30 seconds in the RecoveryDelay constructor and the calculation of the next delay results in e.g. 40, the delay will be 30 seconds, not 40.
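The calculation rules above can be sketched as a small standalone function. This is only an illustration of the rules as described in this section, not the RecoveryDelay source code: the names DelayKind, DelaySketch and NextDelay are hypothetical, and the sketch assumes attempt numbering starts at 1.

```csharp
using System;

// Hypothetical names throughout; this only mirrors the rules described in the text.
public enum DelayKind { Exponential, Linear, Random, Custom }

public static class DelaySketch
{
    private static readonly Random Rng = new Random();

    // Returns the wait before the given retry attempt, capped at maximumDelay.
    // Assumption: attempt numbering starts at 1.
    public static TimeSpan NextDelay(DelayKind kind, double delayParameter, int attempt,
                                     TimeSpan maximumDelay,
                                     Func<int, double, double> custom = null)
    {
        double seconds;
        switch (kind)
        {
            case DelayKind.Exponential:
                seconds = Math.Pow(delayParameter, attempt);
                break;
            case DelayKind.Linear:
                seconds = delayParameter;
                break;
            case DelayKind.Random:
                // The upper bound of Next is exclusive, so add 1 to include delayParameter.
                seconds = Rng.Next(1, (int)delayParameter + 1);
                break;
            case DelayKind.Custom:
                if (custom == null) throw new ArgumentNullException(nameof(custom));
                seconds = custom(attempt, delayParameter);
                break;
            default:
                throw new ArgumentOutOfRangeException(nameof(kind));
        }
        // The delay used is the minimum of the calculated value and the maximum delay.
        return TimeSpan.FromSeconds(Math.Min(seconds, maximumDelay.TotalSeconds));
    }
}
```

Under these assumptions, an Exponential delay with a delayParameter of 2 and a maximum delay of 30 seconds gives 2, 4, 8 and 16 seconds for attempts 1 through 4, and 30 seconds for attempt 5, because 2^5 = 32 is capped.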

Defaults for RecoveryDelay parameters

The defaults for RecoveryDelay parameters are:

  • delayParameter: 2
  • delayType: Exponential
  • maximumDelay: 30 seconds

Each strategy uses these defaults if no RecoveryDelay object is specified.

Different strategies provided

The following strategies are provided, which are classes available in the ORMSupportClasses assembly and located in the SD.LLBLGen.Pro.ORMSupportClasses namespace. They all derive from the base class RecoveryStrategyBase.

  • SimpleRetryRecoveryStrategy. This is a strategy which will retry 5 times in all situations using an exponential retry delay. It won't check whether an exception is a transient error.
  • SqlAzureRecoveryStrategy. This is a strategy which is mainly meant for Azure, but it is suitable for SQL Server transient errors in general, so you can also use it on SQL Server in a non-Azure setup. It checks for transient errors using the SQL Azure transient error codes listed on the following page: https://docs.microsoft.com/en-us/azure/sql-database/sql-database-develop-error-messages#transient-fault-error-codes. Additionally, it treats timeout errors as transient errors.

Other strategies are possible: you can implement your own using the classes provided.

Exceptions gathered

Errors which occur during execution are collected by the strategy used, even when they're transient and the work is retried. To obtain them, read the CollectedExceptions property of the strategy after the work has been executed. These exceptions are the wrapping exceptions, and are of type ORMException.

Recovery failure

If the work still fails after the maximum number of attempts or the maximum delay has been reached, the strategy reports this by throwing an ORMTransientRecoveryFailedException. You can catch this exception to recover from the errors in another way. It contains all exceptions collected during the execution and retry phase of the strategy.
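A minimal sketch of handling a recovery failure could look as follows. The helper name ExecuteAndReport is hypothetical, and the sketch assumes that Execute has an overload which accepts a Func<T> (as the query example in this section suggests) and that CollectedExceptions can be enumerated with foreach.

```csharp
using System;
using SD.LLBLGen.Pro.ORMSupportClasses;

public static class RecoveryReporting
{
    // Hypothetical helper: runs the work under the given strategy and, when recovery
    // ultimately fails, writes the collected exceptions to stderr before rethrowing.
    public static T ExecuteAndReport<T>(RecoveryStrategyBase strategy, Func<T> work)
    {
        try
        {
            return strategy.Execute(work);
        }
        catch (ORMTransientRecoveryFailedException ex)
        {
            Console.Error.WriteLine("Recovery gave up: " + ex.Message);
            // Assumption: CollectedExceptions is enumerable.
            foreach (var collected in strategy.CollectedExceptions)
            {
                Console.Error.WriteLine("  collected: " + collected);
            }
            throw; // let the caller decide how to recover further
        }
    }
}
```

Rethrowing keeps the decision on what to do next (show an error, queue the work for later) with the caller.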

Thread safety and strategies

It's tempting to re-use a strategy object across code and across threads. This won't work, as a strategy object is a controller of the recovery process and can handle one batch of work at a time.

Tracing and recovery

Recovery strategies use tracing to signal what they're doing. The tracer used is ORMQueryExecution with two levels:

  • Level 3 (informative). Traces when a query was recovered through transient recovery.
  • Level 4 (verbose). Traces that a retry has been attempted and how long the delay before it was.

Usage: manually specifying strategy with a query

This usage pattern focuses on specifying the strategy together with the query to execute. It's not set-and-forget, and is available for both SelfServicing and Adapter. The example below contains all the overhead needed to use a strategy with a call.

It's recommended to create a generic method which wraps the calls so you don't have to add the overhead with each call: simply call your wrapper method to perform work and apply a strategy in one go.

// metaData is assumed to be a LinqMetaData instance
var q = from c in metaData.Customer
        where c.Country == "USA"
        select c;
var strategy = new SqlAzureRecoveryStrategy(); // use defaults
var l = strategy.Execute(() => q.ToList());

// async usage, inside an async method
var l2 = await strategy.ExecuteAsync(() => q.ToListAsync());

The lambda passed to Execute is the work which is retried if it fails, using the strategy parameters.
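The generic wrapper method recommended earlier could be sketched as below. The class and method names (Resilient, Run, RunAsync) are hypothetical, and the sketch assumes ExecuteAsync accepts a Func<Task<T>> and returns an awaitable which produces T, as the async example suggests.

```csharp
using System;
using System.Threading.Tasks;
using SD.LLBLGen.Pro.ORMSupportClasses;

public static class Resilient
{
    // Runs work that returns a value under a fresh strategy with default settings.
    public static T Run<T>(Func<T> work)
    {
        // A new strategy per call, so no strategy instance is shared across threads.
        return new SqlAzureRecoveryStrategy().Execute(work);
    }

    // Async variant; assumes ExecuteAsync returns an awaitable producing T.
    public static async Task<T> RunAsync<T>(Func<Task<T>> work)
    {
        return await new SqlAzureRecoveryStrategy().ExecuteAsync(work);
    }
}
```

With such a wrapper, a call site shrinks to `var customers = Resilient.Run(() => q.ToList());`.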

It is key to wrap the complete work to retry in a single anonymous method/lambda, so retrying the work can succeed. It's not sufficient to start a transaction outside the Execute call and then retry inserts within that transaction: a batch of work that is meant to be retried has to start its own transaction and commit its own work.

Rule of thumb: the work should be able to succeed by opening its own connection and closing it afterwards.
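As an illustration of this rule for Adapter, the sketch below places the adapter, the transaction and the commit all inside the lambda. The names RetryableBatch, SaveBoth and adapterFactory are hypothetical; adapterFactory is assumed to return a fresh adapter on each call, e.g. `() => new DataAccessAdapter()`.

```csharp
using System;
using System.Data;
using SD.LLBLGen.Pro.ORMSupportClasses;

public static class RetryableBatch
{
    // Saves two entities as one retryable batch of work.
    public static void SaveBoth(Func<IDataAccessAdapter> adapterFactory,
                                IEntity2 first, IEntity2 second)
    {
        var strategy = new SqlAzureRecoveryStrategy();
        strategy.Execute(() =>
        {
            // Everything a retry needs is created inside the lambda:
            // a new adapter (and thus connection) and a new transaction.
            using (var adapter = adapterFactory())
            {
                adapter.StartTransaction(IsolationLevel.ReadCommitted, "SaveBoth");
                try
                {
                    adapter.SaveEntity(first);
                    adapter.SaveEntity(second);
                    adapter.Commit();
                }
                catch
                {
                    adapter.Rollback();
                    throw; // rethrow so the strategy can decide whether to retry
                }
            }
        });
    }
}
```

Each retry then starts from scratch with its own connection and transaction, instead of continuing inside a transaction which may already be broken.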

Usage: automatic usage of strategy with any query (Adapter only)

When the property DataAccessAdapter.ActiveRecoveryStrategy is set to a RecoveryStrategyBase derived class instance, that strategy is used to execute the recoverable methods automatically. This property is meant to be set once per DataAccessAdapter instance, with an instance that's not shared among threads.

The default of ActiveRecoveryStrategy is null, meaning no strategy is used.
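A minimal sketch of setting the property on a single adapter instance is shown below. The namespace MyApp.DatabaseSpecific is hypothetical and stands for the namespace of your generated DataAccessAdapter class; the helper name Fetch is hypothetical as well.

```csharp
using SD.LLBLGen.Pro.ORMSupportClasses;
using MyApp.DatabaseSpecific; // hypothetical namespace of the generated DataAccessAdapter

public static class AutoRecoveryExample
{
    // Fetches the given entity; the fetch is retried by the strategy set on the adapter.
    public static bool Fetch(IEntity2 toFetch)
    {
        using (var adapter = new DataAccessAdapter())
        {
            // One strategy instance per adapter instance, not shared among threads.
            adapter.ActiveRecoveryStrategy = new SqlAzureRecoveryStrategy();
            return adapter.FetchEntity(toFetch);
        }
    }
}
```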

Automatically setting the ActiveRecoveryStrategy property on adapter instantiation

To make sure the property is always set to the right strategy, you can override the method CreateRecoveryStrategyToUse in a partial class of DataAccessAdapter and return the recovery strategy object to use. Return an instance which isn't shared among threads.

The returned object is placed in the ActiveRecoveryStrategy property automatically. If the property ActiveRecoveryStrategy is set to a value, CreateRecoveryStrategyToUse isn't called.
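Such an override could look like the sketch below. The namespace is hypothetical, and the exact signature is an assumption (a protected, parameterless method returning RecoveryStrategyBase); check it against the generated DataAccessAdapter's base class before use.

```csharp
using SD.LLBLGen.Pro.ORMSupportClasses;

namespace MyApp.DatabaseSpecific // hypothetical; use the namespace of your generated adapter
{
    public partial class DataAccessAdapter
    {
        // Assumed signature: protected, parameterless, returns RecoveryStrategyBase.
        protected override RecoveryStrategyBase CreateRecoveryStrategyToUse()
        {
            // Return a new instance each time, so adapters never share a strategy.
            return new SqlAzureRecoveryStrategy();
        }
    }
}
```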

Strategy re-entrance protection

When the Execute/ExecuteAsync methods are called on a RecoveryStrategyBase derived class instance and, indirectly, Execute/ExecuteAsync is called again on that same instance, the nested call isn't wrapped in the retry pipeline: the lambda specified is called directly.

This prevents the nested call from being retried on its own when it fails: the outer call is the one which should be retried as a whole. This is also the reason you shouldn't share strategy objects among adapter instances or threads.

What calls are covered with the strategy set?

All methods which start and complete an action, e.g. FetchEntity and SaveEntity, are covered by the strategy object set and will be retried through the recovery strategy logic. These are the methods you would otherwise call manually through the Execute method. Methods like StartTransaction aren't covered, as they're part of an action which is completed by another method.

Additionally, FetchDataReader overloads aren't covered, as the read action of the datareader can fail but it's outside the scope of the method and thus outside the strategy.

UnitOfWork2.Commit/CommitAsync

Additionally, the UnitOfWork2 Commit methods (and their async variants) will use the strategy available through the DataAccessAdapter instance specified on the unit of work. If a transaction is already in progress, or auto-commit is set to false, the strategy set is ignored: the action performed by Commit is then part of a larger process, and that larger process should be retried in case of an error.
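The sketch below shows a unit of work commit which, per the description above, would run under the adapter's strategy: no transaction is in progress and auto-commit is true. The namespace MyApp.DatabaseSpecific and the helper name SaveAll are hypothetical.

```csharp
using SD.LLBLGen.Pro.ORMSupportClasses;
using MyApp.DatabaseSpecific; // hypothetical namespace of the generated DataAccessAdapter

public static class UnitOfWorkExample
{
    // Saves two entities through a unit of work committed as one action.
    public static void SaveAll(IEntity2 first, IEntity2 second)
    {
        var uow = new UnitOfWork2();
        uow.AddForSave(first);
        uow.AddForSave(second);

        using (var adapter = new DataAccessAdapter())
        {
            adapter.ActiveRecoveryStrategy = new SqlAzureRecoveryStrategy();
            // autoCommit is true and no transaction was started beforehand,
            // so the commit as a whole is the unit that gets retried.
            uow.Commit(adapter, true);
        }
    }
}
```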