April 22, 2014

Reliable inter-system communication

Sometimes a software system needs to communicate with another in order to complete a user request. Having a good framework in place for such inter-system communication is both crucial and tricky, particularly if both systems store a portion of the data (duplication across systems = not advisable, yet sometimes unavoidable). Improper implementations can result in split-brain issues, costly debugging sessions and hair-splitting reconciliation.

There are some key guidelines for building a good framework:

First, determine whether every inter-system user request really needs to be synchronous, i.e. a blocking request with the user on hold for a response. One approach is to start from the assumption that everything can be asynchronous and then let the product owner justify why particular calls should be synchronous instead.
Asynchrony allows the system to be eventually consistent (rather than strongly consistent) and, with appropriate messaging, sets the right expectations with the users of the system. Asynchronous calls can then use queues or an ESB to route calls, manage contention, etc.
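
As a rough illustration of the asynchronous path, here is a minimal in-process sketch using Python's standard `queue` and `threading` modules. The names `submit`, `worker` and `status` are hypothetical, and a real deployment would more likely use a message broker or an ESB rather than an in-memory queue.

```python
import queue
import threading

requests = queue.Queue()
status = {}  # request_id -> "accepted" | "completed"


def submit(request_id, payload):
    """Accept the request and return immediately; the work happens later."""
    status[request_id] = "accepted"
    requests.put((request_id, payload))
    return "accepted"


def worker():
    """Drain the queue; a None item is the signal to stop."""
    while True:
        item = requests.get()
        if item is None:
            requests.task_done()
            break
        request_id, payload = item
        # Placeholder for the actual call to System B.
        status[request_id] = "completed"
        requests.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

immediate = submit("req-1", {"amount": 10})  # the user is not kept on hold
requests.join()      # wait until the queued work has been processed
requests.put(None)   # stop the worker
thread.join()
```

The caller gets "accepted" straight away and the status only becomes "completed" once the worker has processed the item, which is the eventual consistency the user messaging needs to explain.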

Calls to System B should be idempotent from System A’s perspective. This can be done by tagging every request with a GUID and maintaining the status of the request on both sides. A shared log can also be used to track this.
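
The GUID approach might look roughly like the sketch below. The classes `SystemA` and `SystemB` and the `credit` operation are made up for illustration, and the in-memory dictionaries stand in for whatever durable store each side would really use.

```python
import uuid


class SystemB:
    """Hypothetical receiver that remembers every request ID it has handled."""

    def __init__(self):
        self.results = {}  # request_id -> stored result
        self.balance = 0   # state changed by requests

    def credit(self, request_id, amount):
        # A repeated request ID returns the stored result without
        # applying the change a second time.
        if request_id in self.results:
            return self.results[request_id]
        self.balance += amount
        result = {"request_id": request_id, "status": "done", "balance": self.balance}
        self.results[request_id] = result
        return result


class SystemA:
    """Hypothetical caller that tags each request with a GUID and tracks its status."""

    def __init__(self, system_b):
        self.system_b = system_b
        self.status = {}  # request_id -> "pending" | "done"

    def new_request(self):
        request_id = str(uuid.uuid4())
        self.status[request_id] = "pending"
        return request_id

    def send_credit(self, request_id, amount):
        result = self.system_b.credit(request_id, amount)
        self.status[request_id] = result["status"]
        return result


b = SystemB()
a = SystemA(b)
rid = a.new_request()
first = a.send_credit(rid, 50)
second = a.send_credit(rid, 50)  # duplicate delivery of the same request
```

Sending the same request twice leaves System B's balance at 50 and returns the same result both times, so System A can resend freely.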
Distinguish between errors and failures. Treating every error as a failure would be incorrect. Errors are temporary conditions, and typically a retry mechanism kicks in.
Failures are permanent and should result in a rollback as needed. If System B returns a failure for a request that System A has already validated and forwarded, this can indicate that the validation rules on the two sides are out of step. System A must apply a superset of the validation rules employed in System B.
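
One way to encode the distinction is with two exception types, as in this sketch. `TransientError`, `PermanentFailure` and `call_with_retry` are hypothetical names, and the exponential backoff is an assumption rather than something the guideline prescribes.

```python
import time


class TransientError(Exception):
    """An error: temporary, worth retrying."""


class PermanentFailure(Exception):
    """A failure: retrying will not help, so roll back."""


def call_with_retry(operation, rollback, max_attempts=3, base_delay=0.01):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted; leave it for reconciliation
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
        except PermanentFailure:
            rollback()  # undo local changes, then surface the failure
            raise


attempts = []
rolled_back = []


def flaky():
    """Simulated call that hits a temporary error twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("System B busy")
    return "ok"


result = call_with_retry(flaky, rollback=lambda: rolled_back.append(True))
```

Here the simulated call succeeds on the third attempt and the rollback is never invoked; a `PermanentFailure` would trigger the rollback on the first attempt with no retry.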
Handle timeouts. Timeouts are tricky because the timeout may have occurred after System B successfully processed the request, or the call may never have reached System B at all. In either case System A doesn’t know the status of the request and can’t simply assume failure. Both retries and idempotency are required to address this situation.
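
The sketch below simulates the awkward case where System B commits the work but its reply is lost. `FlakyLinkToB` and `send_until_answered` are hypothetical, and the lost reply is modelled by raising Python's built-in `TimeoutError` on the first call.

```python
class FlakyLinkToB:
    """Hypothetical System B whose first response is lost after processing."""

    def __init__(self):
        self.processed = {}  # request_id -> result
        self.orders = []
        self.calls = 0

    def place_order(self, request_id, item):
        self.calls += 1
        if request_id not in self.processed:
            self.orders.append(item)
            self.processed[request_id] = "accepted"
        if self.calls == 1:
            # The work is committed, but the reply never reaches System A.
            raise TimeoutError("no response from System B")
        return self.processed[request_id]


def send_until_answered(link, request_id, item, max_attempts=3):
    for _ in range(max_attempts):
        try:
            return link.place_order(request_id, item)
        except TimeoutError:
            continue  # outcome unknown: retry with the SAME request ID
    return "unknown"  # hand over to reconciliation


link = FlakyLinkToB()
outcome = send_until_answered(link, "req-001", "book")
```

The retry reuses the request ID, so System A eventually learns the order was accepted while System B records it only once.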
In situations where more than two systems are involved, a post-commit rollback may be required if the web service call to a third system fails. Post-commit rollbacks may be addressed through compensating transactions. An audit trail of DB changes helps to roll back updates that have already been committed.
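
A compensating-transaction runner can be sketched as a list of (action, compensation) pairs. Everything here is hypothetical: the `reserve`, `charge` and `notify_third_system` steps, the dictionaries standing in for committed data, and the in-memory `audit_log`.

```python
def run_with_compensation(steps):
    """Run (action, compensation) pairs; undo completed steps if one fails."""
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception:
        # Undo in reverse order, most recent commit first.
        for compensation in reversed(completed):
            compensation()
        return False
    return True


audit_log = []
inventory = {"book": 5}
ledger = {"charged": 0}


def reserve():
    inventory["book"] -= 1
    audit_log.append("reserved book")


def unreserve():
    inventory["book"] += 1
    audit_log.append("released book")


def charge():
    ledger["charged"] += 20
    audit_log.append("charged 20")


def refund():
    ledger["charged"] -= 20
    audit_log.append("refunded 20")


def notify_third_system():
    raise ConnectionError("third system unavailable")


ok = run_with_compensation([
    (reserve, unreserve),
    (charge, refund),
    (notify_third_system, lambda: None),
])
```

When the third call fails, the two earlier commits are undone in reverse order, and the audit log keeps a record of both the original changes and their compensations.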
Finally, to take care of things that have been rolled back and errors that haven’t been successfully retried, reconciliation jobs must be put in place on a suitable schedule (hourly, daily, monthly, etc.).
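
The core of a reconciliation job could be as simple as the comparison below, assuming each system can export its request statuses keyed by request ID. The `reconcile` function and the sample records are illustrative only; scheduling is left to whatever job runner is already in place.

```python
def reconcile(records_a, records_b):
    """Compare request statuses held by two systems, keyed by request ID."""
    report = {"missing_in_b": [], "missing_in_a": [], "status_mismatch": []}
    for request_id, status in records_a.items():
        if request_id not in records_b:
            report["missing_in_b"].append(request_id)
        elif records_b[request_id] != status:
            report["status_mismatch"].append(request_id)
    for request_id in records_b:
        if request_id not in records_a:
            report["missing_in_a"].append(request_id)
    return report


system_a = {"r1": "done", "r2": "pending", "r3": "done"}
system_b = {"r1": "done", "r2": "done", "r4": "done"}
report = reconcile(system_a, system_b)
```

Each bucket in the report points at a different follow-up: resend what System B never received, investigate what System A never recorded, and settle any status disagreements.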
