Building a choreographed microservice architecture with the Decorated Saga pattern

Having recently read the second edition of Sam Newman’s book “Building Microservices”, I felt the need to discuss an approach I have taken a few times over the years to achieve a very purist choreographed microservice architecture. I read the first edition 8 years ago when I first started this journey into the world of microservices, before web resources were so prolific. For several weeks I pored over the book trying to extract for myself a set of principles that defined the key characteristics of such a system. What was important? What was not? What would a gold standard microservice architecture look like? The list looked something like this:

Microservices principles:

1. A microservice should be independently deployable.
2. Services should be loosely coupled and not know anything about each other.
3. The architecture should prefer Choreography over Orchestration.
4. All services should own their own data stores.
5. All services should output comprehensive logs.
6. Microservices should be platform agnostic and avoid vendor lock-in.
7. Microservices should be easily scalable.
8. All services should have metrics that at a minimum include the timing of all incoming and outgoing calls.
9. The architecture should avoid putting business logic into 3rd party software (e.g. the database, message broker etc.).
10. It should be easy for a human to rationalise what is going on.
11. Cross domain reporting must not be performed on the microservice architecture, but offloaded to some other system.

Over the years since I first read this book I have worked on many microservice architectures. Some I have designed myself and some I have inherited. I have also noticed a few trends.

There is an over reliance on new cloud-vendor technologies. This immediately locks you to that vendor. If you are happy with that, then that is fine, but understand your system is in no way agnostic and will be difficult to move.
The same is true of log aggregation systems, message brokers and storage providers.
Event driven architectures are hard to understand and most developers tend towards a request-response and 3-tier mindset.
Choreography is hard and many systems have ignored it in favour of orchestration which naturally tends towards a more coupled system.
Coupled systems have a tendency to become a complex spiders-web of inter-service calls that are no less difficult to understand.
Coupled systems have a tendency to lose the benefits of a microservice architecture and become “distributed monoliths” requiring big bang deployments.
In the second edition of Sam Newman’s book, even he seems to have relaxed some of his more controversial and evangelical positions.

Personally I feel that the principles above still hold true. I believe that when you begin to violate any one of them you begin to compromise the integrity of the ideas that underpin microservices. Over reliance on 3rd party systems, platforms or cloud vendor technologies will lead to lock-in. Orchestration will lead to coupling. Shared data stores or domain objects will lead to coupling. Coupling will lead to lock-step deployments. Lock-step deployments lead to monoliths. So if you are not willing to embrace choreography and a platform agnostic approach in the first place, just be honest with yourself and build a monolith to begin with. There is nothing wrong with that and it will save you a lot of time and hassle.

A Little History of the Decorated Saga pattern

In essence it is a simple idea, though it has many implications, some of which are huge advantages, and some of which are potential pitfalls. In this article I am going to attempt to comprehensively lay these out once and for all, and describe an implementation involving not only the Decorated Saga Pattern, but also the importance of the Service Chassis and how the two concepts interrelate. But let’s start with a short introduction to the history and how these ideas fit in with microservices more generally.

Once upon a time I was working in one of the large UK credit bureau agencies, tech leading their global innovation team to help them explore new markets. It was 2015 and Docker was just gaining traction and fitted well with our philosophy of agility. We were about to embark on a new project and one of our team members had just returned from a talk in Europe by Fred George. He was singing the praises of an idea he had encountered there that he was calling Needs and Solutions.

Together these three ideas (microservices, Needs/Solutions & Docker) bubbled in my mind. The then Architect and I came up with the first iteration of an architecture I am going to share with you here. He quickly stepped away, leaving me to solve the challenges of the initial idea. For years now I have re-implemented these ideas many times, each time refining and improving on what had come before. Up until now I have always referred to a big part of this as the Need(s)/Solutions Pattern, but I feel it is time to refine this further, so from this moment on I am going to call it the “Decorated Saga Pattern”. That name feels a better fit with what this is all about and embraces modern ideas in the industry. The other half of this design I have come to call the Service Chassis. The Service Chassis is all about how a service should work. The pattern is about how the services communicate.

Back in 2015, off the back of the principles above, I began my first microservice prototype. The innovation team I was working on was a mix of Java, C/C++ and Python developers. I myself had come from a .NET background and so to ensure I could participate in this new microservice world, my prototype was written to use Mono (able to run on either Windows or POSIX-compliant systems – principle 6). This was before the advent of .NET Core. The prototype is still on GitHub to this day, though a little dated now:

https://github.com/radicalgeek/SampleMicroservice

The idea was simple. We would use a technology agnostic inter-service message format (JSON – principles 6 & 10) and services would publish a message to a fan-out RabbitMQ message exchange that expressed a “need”. All services would then see that message on their own queue and have the opportunity to participate in providing a partial or full “solution” to that need. If a service was able to provide a solution (or partial solution) it would decorate the existing message and re-publish it. All services would again see this message, now with additional information, and once again have the opportunity to participate in providing a part of the solution. The whole thing was a kind of implicit workflow, something I later came to understand as akin to a saga. There were a number of advantages to this.

Services were loosely coupled. No service needed to know anything about any other (principles 2 & 3).
Scaling a service was easy, all you needed to do was add another instance listening to the same queue and RabbitMQ would distribute the messages between the instances automatically (principle 7).
Services could register the queues themselves based on their name and major version number (principle 9). Now if I needed to make a breaking change I could simply deploy the service again, registering a new queue. Both versions of the service could attempt to provide a solution and decorate the message with it. Any downstream services that did not recognise the new solution could ignore it.

What remained was to solve challenges that largely revolved around service hosting, development practices and CI/CD processes. Ensuring each service was hosted in its own repo that produced only one artefact satisfied principle 1. Solving the challenges around principles 4, 5 & 9, however, required something else.

Introducing The Service Chassis

Although I fully understand the concerns Sam Newman lays out in his section entitled “DRY and the perils of code reuse in a microservice world”, I have come to believe that there is very much a place for the concept of a service chassis. I also feel that this falls well within the acceptable exceptions he himself acknowledges in the 2nd edition of his book.

In any system there are a number of things that a service must do in order to interoperate with its underlying infrastructure and other services. Services also need to do this in a predictable and repeatable way. If every service is entirely self-contained then it is too easy to introduce inconsistencies in things like logging. What if one team starts outputting a completely different log format, or forgets to include metrics? The service chassis should consist of a reusable package/library or collection of packages/libraries that provide the following functionality.

Read environment variables from the host operating system or container
Publish and consume messages
Host HTTP endpoints
Output metrics and logs to an observability platform
Store and retrieve data
Scaling and concurrency
Emitting data changes to a Datasink for reporting tools
Provide abstractions over framework and 3rd party packages/libraries
Centralised configuration of 3rd party packages/libraries
Define common exceptions
Implementation of circuit-breaker pattern to wrap calls to 3rd party services
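
To make the last item concrete, here is a minimal sketch of the kind of circuit breaker a chassis might offer. Everything in it is my own assumption for illustration: the class name, the failure threshold, the cool-down period and the injectable clock are hypothetical and not taken from any particular chassis or library.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical chassis helper that wraps calls to a 3rd party service."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock            # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, call rejected")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single further failure will open the circuit again.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0             # success closes the circuit fully
        return result
```

A service would create one breaker per downstream dependency and route every outgoing call through `breaker.call(...)`, so that a failing dependency is rejected quickly rather than tying up the service.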

For every language you intend to use, you should have a chassis that provides the same functionality consistent with other chassis. In this way managing things like log aggregation becomes predictable, and you can have confidence that the services that are released based on these chassis are mature enough to operate on your platform.

These packages of course have the potential to become forms of coupling. If for example I need to fix a bug with the way data access is implemented then it will be tempting to update all the services that consume this package and release them all again. This form of lock-step release of multiple services is of course a violation of principle 1. The key then is to minimise these kinds of changes and ensure they are not breaking changes, allowing teams to update these dependencies in their own time but with a little encouragement to do so. Following good practices like SOLID can help minimise this issue, specifically the Open/Closed principle. If a public method is open to extension but closed to modification then it is trivial to introduce a new public method to address whatever issue is being resolved. Further, most languages support some form of method deprecation feature. If old methods can be flagged as deprecated and the tooling can report warnings or errors based on this, then deprecated methods can be spotted by the teams consuming the packages and updated accordingly.
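
As a rough illustration of that approach, the sketch below shows a chassis package fixing a bug by adding a new method and deprecating the old one, rather than changing the old one in place. The `DataStore` class, its methods and the bug itself are all invented for the example; only the `warnings` module is real Python standard library.

```python
import warnings


class DataStore:
    """Hypothetical chassis data-access wrapper."""

    def __init__(self) -> None:
        self._rows: dict[str, dict] = {}

    def save(self, key: str, row: dict) -> None:
        """Original method, left untouched so existing services keep working."""
        warnings.warn(
            "DataStore.save is deprecated; use save_copy instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the calling service's code
        )
        # Original (buggy) behaviour: keeps a reference to the caller's dict.
        self._rows[key] = row

    def save_copy(self, key: str, row: dict) -> None:
        """New method that fixes the bug by storing a copy of the row."""
        self._rows[key] = dict(row)

    def load(self, key: str) -> dict:
        return self._rows[key]
```

Teams that run their tests with deprecation warnings turned into errors (for example `python -W error::DeprecationWarning`) would then find the old call sites in their own time, without any lock-step release.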

The chassis should be seen as a mechanism to get new services up and running quickly. If all the cross cutting concerns of how a service should run within the wider context of the system are already addressed, then developers can simply focus on implementing the business logic required by the service. In the context of the Decorated Saga Pattern we are specifically interested in the publishing and consuming of messages, which we will come back to in a little while.

The Choreographed Saga

Now that I have presented the history of this idea and laid some of the groundwork in terms of fundamental concepts, it is time to explore the idea of the Decorated Saga Pattern in more detail. For simplicity and for ease of comparison, I am going to use the same scenario Sam Newman describes in the second edition of his book when discussing the use of choreographed sagas. Let’s first recap that scenario.

The following text is copied from Sam’s book in order to accurately articulate his scenario and diagram. The bold emphasis is mine to highlight key points to keep in mind for later:

A choreographed saga aims to distribute responsibility for the operation of the saga among multiple collaborating services…Choreographed sagas represent a trust-but-verify architecture…Choreographed sagas make heavy use of events…
…First, these microservices are reacting to events being received. Conceptually, events are broadcast in the system, and interested parties are able to receive them…You don’t send events to a microservice; you fire them out, and the microservices that are interested in these are able to receive them and act accordingly. In our example, when the Warehouse service receives that first Order Placed event, it knows its job is to reserve the appropriate stock and fire an event once that is done. If the stock couldn’t be ~~received~~ reserved [typo in the book], the Warehouse would need to raise an appropriate event (an Insufficient Stock event, perhaps), which might lead to the order being aborted.
We also see in this example how events can facilitate parallel processing. When the Payment Taken event is fired by the Payment Gateway, it causes reactions in both the Loyalty and Warehouse microservices. The Warehouse reacts by dispatching the package, while the Loyalty microservice reacts by awarding points.
Sam Newman – Building Microservices 2nd Edition. P 192-193

I very much like what Sam is talking about here, but I feel there are a few problems with the implementation.

Depending on the technology used, it is easy for this to become a form of coupling, specifically if your message broker relies on topics (potential violation of principle 6). In this case your consuming services must have knowledge of the topics they need to subscribe to. Both the Warehouse service and the Loyalty service need to subscribe to the Payment Taken event. Where is this topic created? It would make sense for it to belong to the Payment service, so will the Payment service create the topic when it starts up so that it can emit the events? If so, the Warehouse and Loyalty services now have knowledge of what the Payment service is expected to do (potential violation of principle 2), or are dependent on the Payment service being released first (violation of principle 1). Or is the topic created using the broker itself? Are you now building the “smarts” of your system into a 3rd party product (violation of principle 9)?
In order for this saga to complete, a total of 4 different events need to be emitted. While it is true that these events can be reconciled later via their Correlation ID using log aggregation, it just feels overly complex. Events spawning more events have the potential to lose contextual information. What if we forget to include the Correlation ID in the new event? Each of these events may later have even more subscribers and the complexity of the system grows quickly. For example, what if a fraud detection service becomes interested in the Payment event, or if a stock replenishment service becomes interested in the Order Shipped event? Are the violations of our principles also growing?

While I fully accept this is a completely legitimate and standard way to perform choreography, what if there was a simpler, even more loosely coupled way to achieve a choreographed system?

Enter the Decorated Saga

The key to the Decorated Saga is that the saga is the event, and the event lives for the lifetime of the saga. Rather than services emitting new events for other services to consume, the original event (let’s say message from now on) is simply decorated with further information and republished. Collaborating services see the message each time it is published and only decorate it once the message contains enough information for that service to act upon it.

Let’s assume we are using RabbitMQ (an AMQP-compliant system, easily swapped out – principle 6) as our message broker. At start up a service can register its own queue based on its name and major version number (e.g. queue-warehouse-v1) and bind it to a shared fanout exchange (principle 9). With many services binding their own queues, this results in a fan-out configuration, where any message sent to the exchange is routed to all queues. This means a service can publish a message and all other services will see it.
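
The fan-out behaviour can be modelled in a few lines without a real broker. The sketch below is only an in-memory stand-in that I have made up for illustration: `FanoutExchange`, `bind` and `queue_name` are hypothetical names, not RabbitMQ client calls, and the version strings are invented.

```python
from collections import deque


def queue_name(service: str, version: str) -> str:
    """Build a queue name from the service name and major version only."""
    major = version.split(".")[0]
    return f"queue-{service.lower()}-v{major}"


class FanoutExchange:
    """In-memory stand-in for a fan-out exchange: every bound queue gets a copy."""

    def __init__(self) -> None:
        self.queues: dict[str, deque] = {}

    def bind(self, name: str) -> deque:
        # Binding is idempotent, so a second instance of the same service
        # shares the existing queue instead of creating a new one.
        return self.queues.setdefault(name, deque())

    def publish(self, message: dict) -> None:
        for queue in self.queues.values():
            queue.append(dict(message))  # each queue receives its own copy


exchange = FanoutExchange()
warehouse_v1 = exchange.bind(queue_name("Warehouse", "1.4.2"))
payment_v1 = exchange.bind(queue_name("Payment", "1.0.0"))
warehouse_v2 = exchange.bind(queue_name("Warehouse", "2.0.0"))  # breaking release

exchange.publish({"saga": "OrderPlaced"})
print(sorted(exchange.queues))
# ['queue-payment-v1', 'queue-warehouse-v1', 'queue-warehouse-v2']
```

Because the minor and patch numbers are dropped from the queue name, a non-breaking release of a service reuses the same queue, while a new major version gets a queue of its own and runs side by side with the old one.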

When a service receives a message on its own queue, it can check to see if it can enrich the message further with its own data. If it can’t, it simply throws the message away. But if it can, it decorates the message with anything it can add, and then republishes it for all the other services to see again.
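
That decorate-or-discard loop can be sketched as follows. This is a simplified, single-process simulation under my own assumptions: the handler functions, the `run_saga` helper, the `pointsAwarded` status and the points value are all hypothetical, and a real service would also check the saga name and deserialise the context before acting.

```python
from collections import deque
from typing import Callable, Optional

Message = dict
Handler = Callable[[Message], Optional[dict]]


def has_status(message: Message, status: str) -> bool:
    return any(d.get("status") == status for d in message["decorations"])


def warehouse(message: Message) -> Optional[dict]:
    # Acts on a fresh order; ignores messages that already carry its decoration.
    if has_status(message, "stockReserved"):
        return None
    return {"status": "stockReserved", "quantity": message["context"]["quantity"]}


def payment(message: Message) -> Optional[dict]:
    # Only interested once stock has been reserved.
    if not has_status(message, "stockReserved") or has_status(message, "paymentTaken"):
        return None
    return {"status": "paymentTaken", "amount": message["context"]["pricePaid"]}


def loyalty(message: Message) -> Optional[dict]:
    # Only interested once a payment has been taken.
    if not has_status(message, "paymentTaken") or has_status(message, "pointsAwarded"):
        return None
    return {"status": "pointsAwarded", "points": 10}


def run_saga(initial: Message, services: dict[str, Handler]) -> list[str]:
    """Deliver every published message to every service until nobody republishes."""
    published = deque([initial])
    trail = []
    while published:
        message = published.popleft()
        for name, handler in services.items():
            decoration = handler(message)
            if decoration is None:
                continue  # nothing to add: the service throws the message away
            published.append({**message,
                              "lastServiceDecoration": name,
                              "decorations": message["decorations"] + [decoration]})
            trail.append(f"{name}:{decoration['status']}")
    return trail


order = {"saga": "OrderPlaced",
         "context": {"SKU": "PRODUCT-056", "quantity": "1", "pricePaid": "9.99"},
         "decorations": []}
services = {"WarehouseService": warehouse,
            "PaymentService": payment,
            "LoyaltyService": loyalty}
print(run_saga(order, services))
# ['WarehouseService:stockReserved', 'PaymentService:paymentTaken', 'LoyaltyService:pointsAwarded']
```

Note that no handler refers to any other service. The ordering emerges purely from what each service needs to find in the message before it will act.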

Let’s look at the same example operation again, but this time using the Decorated Saga Pattern.

First of all the Order service publishes the initial event/message that an order has been placed. All of the services receive this message. Both the Payment service and the Loyalty service know that they are interested in this saga, so they pick up the message, but in both instances the message doesn’t include enough information for them to process it, so they throw the message away. The Payment service is only interested in the Order Placed saga when stock has been reserved. The Loyalty service is only interested in the Order Placed saga when a payment has been taken. The Warehouse service is interested in the message though, and so it reserves the stock, decorates the message and republishes it. Let’s look at an example message to get a better idea of what the Warehouse service has published back to the message broker.

{ "correlationId": "12345678-1234-1234-1234-1234567890AB",
  "sourceService": "OrderService",
  "publishTime": "2023-08-03T13:07:51.4005517Z",
  "lastServiceDecoration": "WarehouseService",
  "lastDecorationTime": "2023-08-03T13:07:52.9574687Z",
  "saga": "OrderPlaced",
  "context":
    { "UserId": "12345678-1234-1234-1234-1234567890AB",
      "SKU": "PRODUCT-056",
      "quantity": "1",
      "pricePaid": "9.99",
      "currency": "GBP"
    },
  "decorations":
    [
     { "status": "stockReserved",
       "stockItem": "PRODUCT-056",
       "quantity": "1"
     }
    ]
 }

In this message we can see a few key concepts. First of all the saga is defined as part of the header. This is the first property any service will look at to determine if it is interested in consuming the message to potentially decorate it. Also note the Correlation Id that allows us to later review the messages in our log aggregation software if we need to (principle 10), and find all the instances of this message that were published by our collaborating services. The time stamps also help us to understand the sequence these messages were processed in. We will look at the other header properties in a moment, but for now let’s focus on the context and decorations properties.

The context property is the information this message was originally published with. This is the information that a service will need in order to process the message. After checking the header information to determine if a service is interested in the message, the next thing a service will do is unmarshal/deserialise the JSON into an object entity that it can work with internally. If the context cannot be deserialised then it is assumed that the message was not intended for this service and it is thrown away. This still allows, however, for changes to be made to the context without breaking the Warehouse service. Any new properties that are added to the context will be ignored by the deserialisation. This allows for an amount of backwards compatibility. Changes to, or removal of, existing properties will still represent a breaking change, but new properties can be added that will not impact the current service implementation.
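
A minimal sketch of this tolerant deserialisation is shown below, assuming the service models only the properties it cares about. The `OrderContext` class, the `try_deserialise` helper and the extra `giftWrap` property are hypothetical and exist only for this example.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class OrderContext:
    """The Warehouse service's private view of the context (hypothetical)."""
    SKU: str
    quantity: str


def try_deserialise(cls, payload: dict):
    """Build cls from payload, ignoring unknown properties.

    Returns None when a property the service relies on is missing,
    which the service treats as "this message is not for me".
    """
    wanted = {f.name for f in fields(cls)}
    try:
        return cls(**{k: v for k, v in payload.items() if k in wanted})
    except TypeError:
        return None


raw = '{"context": {"UserId": "u-1", "SKU": "PRODUCT-056", "quantity": "1", "giftWrap": "yes"}}'
context = json.loads(raw)["context"]

print(try_deserialise(OrderContext, context))
# OrderContext(SKU='PRODUCT-056', quantity='1')
print(try_deserialise(OrderContext, {"SKU": "PRODUCT-056"}))
# None
```

The first call succeeds even though the publisher has added a property the Warehouse service has never heard of. The second fails because `quantity` has been removed, which is exactly the kind of change that needs a new major version.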

The decorations property is an array of decorations added by services as the message progresses through the system. In the example above we can see that the Warehouse service has decorated the message to state that the stock has been reserved. A service can either add a new decoration object to the decorations array, or it can append additional information to an existing decoration. Once a message has been decorated, the Correlation Id can be stored so that if this message is received again it doesn’t need to be processed a second time.

Let’s look at the message again once it has been decorated by the Payment service.

{ "correlationId": "12345678-1234-1234-1234-1234567890AB",
  "sourceService": "OrderService",
  "publishTime": "2023-08-03T13:07:51.4005517Z",
  "lastServiceDecoration": "PaymentService",
  "lastDecorationTime": "2023-08-03T13:07:53.6821676Z",
  "saga": "OrderPlaced",
  "context":
    { "UserId": "12345678-1234-1234-1234-1234567890AB",
      "SKU": "PRODUCT-056",
      "quantity": "1",
      "pricePaid": "9.99",
      "currency": "GBP"
    },
  "decorations":
    [
     { "status": "stockReserved",
       "stockItem": "PRODUCT-056",
       "quantity": "1"
     },
     { "status": "paymentTaken",
       "amount": "9.99",
       "paymentMethod": "creditCard"
     }
    ]
 }

In this example you can see that the correlationId, sourceService and publishTime remain the same. The lastServiceDecoration and lastDecorationTime properties, however, have been updated. The important part is that a new decoration object has also been added to the array. When the Payment service saw this message it first checked the saga property and deserialised the context, just like the Warehouse service before it. In this instance though, the Payment service also tried to deserialise all the decorations, looking for one that could be deserialised into a stockStatus object. If the Payment service had been unable to do this, it would assume that although it is interested in OrderPlaced saga messages, this one was not yet ready to be processed. Assuming that the message can be processed by the Payment service, it would again be decorated and republished.
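
The search through the decorations might look something like the sketch below. The `StockStatus` dataclass is my guess at the shape of the stockStatus object mentioned above, and `find_decoration` is a hypothetical helper rather than anything from a real chassis.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class StockStatus:
    """The Payment service's private view of a stock decoration (assumed shape)."""
    status: str
    stockItem: str
    quantity: str


def find_decoration(decorations: list, cls) -> Optional[object]:
    """Return the first decoration that deserialises into cls, else None."""
    wanted = {f.name for f in fields(cls)}
    for decoration in decorations:
        try:
            return cls(**{k: v for k, v in decoration.items() if k in wanted})
        except TypeError:
            continue  # a required property is missing: try the next decoration
    return None


decorations = [
    {"status": "paymentTaken", "amount": "9.99", "paymentMethod": "creditCard"},
    {"status": "stockReserved", "stockItem": "PRODUCT-056", "quantity": "1"},
]
print(find_decoration(decorations, StockStatus))
# StockStatus(status='stockReserved', stockItem='PRODUCT-056', quantity='1')
print(find_decoration([], StockStatus))
# None
```

A result of `None` is the signal that the message is not yet ready for this service, so it can be thrown away until a richer copy arrives.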

Rather than look at the message again as the Loyalty service receives it, let’s look instead at an alternative version of the last message.

{ "correlationId": "12345678-1234-1234-1234-1234567890AB",
  "sourceService": "OrderService",
  "publishTime": "2023-08-03T13:07:51.4005517Z",
  "lastServiceDecoration": "PaymentService",
  "lastDecorationTime": "2023-08-03T13:07:53.6821676Z",
  "saga": "OrderPlaced",
  "context":
    { "UserId": "12345678-1234-1234-1234-1234567890AB",
      "SKU": "PRODUCT-056",
      "quantity": "1",
      "pricePaid": "9.99",
      "currency": "GBP"
    },
  "decorations":
    [
     { "status": "paymentTaken",
       "stockItem": "PRODUCT-056",
       "quantity": "1",
       "amount": "9.99",
       "paymentMethod": "creditCard"
     }
    ]
 }

The difference here is that rather than add a new object to the array of decorations, the existing object has been enriched further. Both versions of the message work exactly the same: a receiving service would try to deserialise the decoration object into a type that it recognises. Services that do not know about the additional properties would still be able to deserialise this into their object as they would ignore the additional properties, but services that are expecting the additional properties would not be able to deserialise an object with the additional properties missing. There are a few reasons why you may want to do this that I will come to in a moment, but first let’s look at some of the advantages of implementing sagas in this way.

Firstly this is extremely loosely coupled. There is no contract shared between services; they simply try to cast the JSON object into a type that only they recognise. Different services can have totally different objects that only include the properties that they care about. Further, no service has any knowledge of the existence of another service or what methods/functionality any other service may be offering. They simply receive messages, determine if they can operate on them, decorate them and republish.
Because each service has its own queue, scaling is easy and is just a matter of adding another instance of any given service listening to the same queue.
Adding or removing services to the saga is simple. Because services know nothing about each other a new service can be added watching for the same saga. All it needs to do is cast the context and/or any decorations it is interested in, into objects it can work with. It can even add new decorations or new properties to existing decorations without impacting any other services. Likewise the service could be removed too with no impact.
Messages are sent in JSON format. This ensures that services written in any language can consume these messages and that they remain human readable. While I do understand the value of binary formats, I feel that the ability to inspect messages in the logs to understand exactly what is happening in a choreographed system is invaluable, and a worthwhile trade-off for any potential savings in message size.
Breaking changes can be introduced easily. If you need to change the format of the decoration (remove or rename a property) you simply increment the major version number and deploy the service alongside the old one. Because the service registers the queue including the version number (e.g. queue-warehouse-v2), the new version of the service will be on a separate new queue. The new version of the service will then simply add a new alternative decoration to any messages it processes. Existing services will not recognise this and thus ignore it, but continue to use the existing decoration. This leaves other services/teams free to update at their own cadence.

Parallel processing

You may have noticed that in the last example I omitted the parallel processing that was included in Sam’s scenario. There was a good reason for that: it can be abused easily and lead to problems with this approach. Before we discuss this though, I would like to note there would be absolutely no issue with implementing Sam’s example with the Decorated Saga Pattern, because there is no suggestion of any further processing after the Loyalty and Payment services. The issue can occur if either of two factors is true: the original service to publish the message is watching as the message gets decorated in order to consume some kind of response, or there is a lot of sequential parallel processing in the saga involving multiple services. Let’s consider the following scenario.

In this example both the Loyalty service and the Payment service consume the message at the same time. They each perform their processing, decorate the message with their data and republish it. At this point all services will see two versions of the message on their queue. They both have the same Correlation Id but they have different decorations.

In many cases this could be fine. If the Order service wants to mark the order as completed in its own database, it may only be interested in the Payment Taken decoration and ignore the Loyalty decoration entirely. Alternatively, if it is interested in the fact that loyalty points have been awarded for this order, it may process both messages and update its database in two operations.

There is a third option to be aware of though. Because all services receive all messages, and because services inspect the message to see if there is anything they need to do, both the Loyalty service and the Payment service will receive a second instance of the message. For example, both services see the message published by the Warehouse service, decorate it with their data and republish. At this point they both see the message again, but it has not been decorated with the data that they can add; it is currently only decorated by the other service. So the Payment service will see a copy of the message decorated by the Loyalty service and vice versa. This could lead to the Payment service charging the customer again, or the Loyalty service awarding the points twice. For this reason it is important that services keep a record of the messages they have already processed using the Correlation Id. In some instances this double processing may not matter and, given the speed of execution on modern hardware, can simply be put down as part of the cost of agility. If that is true then the originating Order service could simply wait for the two decorating services to both process the message twice, resulting in two messages on the queue both having been decorated by both services. In cases like this though, where we are talking about things like payments and loyalty points, I would suggest that this message duplication must be considered and dealt with.
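
A minimal sketch of that record-keeping, assuming an in-memory set stands in for the service’s own data store, might look like this. The `PaymentService` class, its `charges` list and the `pointsAwarded` status are hypothetical and only there to make the duplicate visible.

```python
from typing import Optional


class PaymentService:
    """Hypothetical Payment service that refuses to charge the same saga twice."""

    def __init__(self) -> None:
        self.processed: set[str] = set()  # stand-in for the service's own data store
        self.charges: list[str] = []

    def handle(self, message: dict) -> Optional[dict]:
        correlation_id = message["correlationId"]
        if correlation_id in self.processed:
            return None  # already charged for this saga: throw the copy away
        statuses = {d.get("status") for d in message["decorations"]}
        if "stockReserved" not in statuses:
            return None  # not ready yet; do not record it, a richer copy will follow
        self.charges.append(message["context"]["pricePaid"])
        self.processed.add(correlation_id)
        return {**message,
                "lastServiceDecoration": "PaymentService",
                "decorations": message["decorations"] + [{"status": "paymentTaken"}]}


service = PaymentService()
from_warehouse = {"correlationId": "12345678-1234-1234-1234-1234567890AB",
                  "context": {"pricePaid": "9.99"},
                  "decorations": [{"status": "stockReserved"}]}
# The copy republished by the Loyalty service carries no payment decoration.
from_loyalty = {**from_warehouse,
                "decorations": from_warehouse["decorations"] + [{"status": "pointsAwarded"}]}

print(service.handle(from_warehouse) is not None)  # True: customer charged once
print(service.handle(from_loyalty))                # None: duplicate ignored
print(service.charges)                             # ['9.99']
```

The Correlation Id is only recorded after the charge has been made, so a message that arrives too early is not mistaken for one that has already been handled.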

Another issue with this duplication of messages is the exponential growth of messages you could see if you add further parallel services. With 2 parallel services you may consider that it is not too bad. As a result of these two services decorating the message at the same time you will see a total of 4 messages being sent between them. You could simply consider this part of the cost of agility. If however you have 3 services, we are now talking about 9 messages being consumed and decorated. With 4 services this grows to 16 messages.

For these reasons I would suggest that the Decorated Saga pattern is not well suited to parallel processing where there are further downstream services that will continue the workflow. If you really must have parallel processing, use it sparingly with minimal services and keep in mind the potential pitfalls. Where there are no further downstream services, I would not be averse to introducing this kind of parallel processing.

Error handling

An important consideration of the saga is error handling. Unlike traditional database transactions, the saga has no concept of atomicity which can be used to roll back a transaction if a failure occurs. Because the change of state being applied to the system is distributed across many services, each one of which is responsible for its own data update, if an error occurs during one of them we may end up with some services having updated their data while others have not. This can of course leave the system in an inconsistent state. We therefore need a way to handle this type of failure.

Of course, the appropriate method of recovery really depends upon the type of failure that occurs. In some cases the action can simply be retried (forward recovery) until it succeeds. In his book Sam Newman gives the example of the payment service failing to take a payment because of insufficient funds. In this case we could simply place the message back in the queue for it to be retried later, and repeat this until there are enough funds available for the payment to succeed. It is much more likely though that we need to deal with rolling back all the transactions in each service that participated in the saga (backward recovery).

The nature of the decorated saga makes this fairly simple. A failing state is easily implemented as just another decoration. Consider the following example message.

{ "correlationId": "12345678-1234-1234-1234-1234567890AB",
  "sourceService": "OrderService",
  "publishTime": "2023-08-03T13:07:51.4005517Z",
  "lastServiceDecoration": "PaymentService",
  "lastDecorationTime": "2023-08-03T13:07:53.6821676Z",
  "saga": "OrderPlaced",
  "context":
    { "UserId": "12345678-1234-1234-1234-1234567890AB",
      "SKU": "PRODUCT-056",
      "quantity": "1",
      "pricePaid": "9.99",
      "currency": "GBP"
    },
  "decorations":
    [
     { "status": "paymentFailed",
       "reason": "Payment Gateway unavailable",
       "error": "500: Internal Service Error"
     }
    ]
 }

In this example the Payment service has failed to take a payment because the payment gateway on which it depends has returned an error. From the point of view of the decorated saga pattern, this is just like any other message. The Payment service has received a message, processed it, decorated it and re-published it. It just so happens that in this instance the information with which it has decorated the message describes an error.
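
As a sketch of how a service might produce such a message, the Python below appends a failure decoration and updates the bookkeeping properties. The helper names decorate and decorate_payment_failure are hypothetical; only the property names come from the example message.

```python
import copy
from datetime import datetime, timezone


def decorate(message: dict, service_name: str, decoration: dict) -> dict:
    """Return a copy of the message with one more decoration appended."""
    decorated = copy.deepcopy(message)  # never mutate the message we consumed
    decorated.setdefault("decorations", []).append(decoration)
    decorated["lastServiceDecoration"] = service_name
    decorated["lastDecorationTime"] = datetime.now(timezone.utc).isoformat()
    return decorated


def decorate_payment_failure(message: dict, reason: str, error: str) -> dict:
    """A failure is just another decoration, added by the Payment service."""
    return decorate(
        message,
        "PaymentService",
        {"status": "paymentFailed", "reason": reason, "error": error},
    )
```

The decorated copy is then republished exactly as a successful decoration would be.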

From the point of view of other services this is also not much different from any other message. Just as they do with other messages, each service will receive this message and inspect it in order to determine whether or not they need to process it. Services that are interested in the OrderPlaced saga can attempt to deserialise the message into a type they recognise. So in each service involved with the OrderPlaced saga we must implement a handler that looks for messages with decorations in this format. If services remember what database operations they have performed for the message with this Correlation Id, then when they receive a message with this error format they can simply look up the associated Correlation Id and roll back any transactions they previously performed.
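
Here is one way this rollback bookkeeping could look, as a minimal Python sketch. The CompensationJournal class is hypothetical and keeps its records in memory for brevity, where a real service would persist them. Treating any status ending in "Failed" as an error is also my assumption, based only on the paymentFailed example.

```python
from collections import defaultdict
from typing import Callable


class CompensationJournal:
    """Records how to undo the work done for each Correlation Id."""

    def __init__(self) -> None:
        self._undo_actions = defaultdict(list)

    def record(self, correlation_id: str, undo: Callable[[], None]) -> None:
        self._undo_actions[correlation_id].append(undo)

    def roll_back(self, correlation_id: str) -> int:
        """Run the undo actions newest first and return how many were run."""
        actions = self._undo_actions.pop(correlation_id, [])
        for undo in reversed(actions):
            undo()
        return len(actions)


def has_failure(message: dict) -> bool:
    """Assumed convention: a failure decoration has a status ending in 'Failed'."""
    return any(
        str(decoration.get("status", "")).endswith("Failed")
        for decoration in message.get("decorations", [])
    )


def handle_failure(journal: CompensationJournal, message: dict) -> int:
    """Roll back this service's work if the message carries a failure."""
    if not has_failure(message):
        return 0
    return journal.roll_back(message["correlationId"])
```

Because the journal entry is removed once it has been rolled back, a second copy of the same failure message does nothing.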

Now that we have looked at the Decorated Saga pattern in a little detail, we are ready to revisit the Service Chassis and see how we might implement the message handling.

Using the Service Chassis to create services.

As discussed above, a Service Chassis is the perfect place to handle cross-cutting concerns that all services must deal with. This in essence boils down to interactions with the rest of the system and the technologies that underpin the architecture. Let’s revisit the features listed above in further detail.

1. Dealing with environment variables

The Chassis is the perfect place to implement logic for reading environment variables from the host operating system, or more often the container. A fairly trivial idea really, but key for a number of reasons. Firstly, using environment variables for per-environment configuration ensures that any service released can run in any environment without the need for recompilation. For the purposes of this article though, I want to highlight a variable that contains the service’s version number. The major version number should be used when registering the service’s queue. As mentioned above, this means that when we increment the major version number of the service, we can deploy the old and new versions side by side, which allows for gradual adoption of new functionality by downstream services.
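
A minimal Python sketch of this idea follows. The variable names SERVICE_NAME and SERVICE_VERSION and the queue naming format are assumptions of mine for illustration, not a convention the pattern requires.

```python
import os
from typing import Mapping


def queue_name(service_name: str, version: str) -> str:
    """Build the queue name from the service name and its MAJOR version only."""
    major = version.split(".")[0]
    return f"{service_name}-v{major}"


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read per-environment configuration; raises KeyError if a variable is missing."""
    name = env["SERVICE_NAME"]        # hypothetical variable name
    version = env["SERVICE_VERSION"]  # hypothetical variable name, e.g. "2.4.1"
    return {
        "service_name": name,
        "version": version,
        "queue": queue_name(name, version),
    }
```

With this scheme versions 2.4.1 and 2.5.0 share a queue, while version 3.0.0 registers a new one and can run alongside them.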

2. Publish and consume messages

This is one of the most important features of the Service Chassis in a Decorated Saga environment, and it is the consumption of messages that needs looking at the most. In essence the service needs to consume every message it sees and throw away anything it is not interested in. Anything it is interested in must then be passed to a handler, or group of potential handlers, that will try to match the context and/or any existing decorations to a type that the service knows about. So the logic in the chassis is more about which messages the service is not interested in. Handlers are implemented in the services that use the chassis packages. The chassis can also be responsible for republishing messages that have been decorated, as well as registering the queue that the service will be consuming messages from (principle 9). Here is a potential flow for a service consuming, processing and republishing a message.

When a service takes a message off the queue, the first thing it checks is the saga property. If the service doesn’t participate in this saga, then the message can be discarded.
Upon checking the lastServiceDecoration property, the service may discover that it was the last service to decorate this message. If this is true the message can be discarded, as there is little point processing it again.
The next thing the service does is check whether the message is decorated with an error. If it is, then the Correlation Id can be checked to see if the service has already taken any action for this message that may need to be rolled back.
The service checks to see if it has already decorated the message. This is useful in a parallel processing environment where the message may have already been decorated, but this service was not the last service to do so.
Finally the service can attempt to deserialise the message into a type known to this service. If this succeeds then the message can be processed and further decorated before being re-published. If this check fails then although this service is interested in messages for this saga, the message may not yet contain enough information to be processed and likely must be decorated by another service that will add additional information before this service can process it.
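
The flow above can be sketched in Python as a single routing function. This is a minimal sketch under stated assumptions: the decoratedBy property on each decoration is hypothetical (the example message does not show how a decoration identifies its author), as is the convention that a failure status ends in "Failed". The function names are mine too.

```python
from typing import Callable, Optional


def route(message: dict, service_name: str, sagas: set,
          parse: Callable[[dict], Optional[dict]]) -> str:
    """Decide what the chassis should do with one incoming message."""
    # 1. Not a saga this service participates in.
    if message.get("saga") not in sagas:
        return "discard"
    # 2. This service was the last one to decorate the message.
    if message.get("lastServiceDecoration") == service_name:
        return "discard"
    decorations = message.get("decorations", [])
    # 3. Another service reported a failure, so earlier work may need undoing.
    if any(str(d.get("status", "")).endswith("Failed") for d in decorations):
        return "roll_back"
    # 4. Already decorated by this service, just not most recently.
    if any(d.get("decoratedBy") == service_name for d in decorations):
        return "discard"
    # 5. Try to read the message as a type this service knows about.
    if parse(message) is None:
        return "wait"  # another service must add more information first
    return "process"


def parse_payment_request(message: dict) -> Optional[dict]:
    """Example parser: the Payment service needs a price and a currency."""
    context = message.get("context", {})
    if "pricePaid" in context and "currency" in context:
        return {"amount": context["pricePaid"], "currency": context["currency"]}
    return None
```

Keeping the decision separate from the side effects makes it easy to unit test the chassis without a message broker.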

3. Hosting HTTP endpoint

Most services need some form of public interface to initiate actions. In most cases this is realised as a RESTful API (though not always). The business of creating the HTTP server, handling sessions, authenticating tokens etc. is going to be the same in all services. Therefore this functionality should be provided by the chassis, leaving the service to implement only the required business/domain focused endpoints. It is also useful to add endpoints for things like health checks and liveness/readiness to the chassis, as these are required by all services. The chassis is also the perfect place to put middleware into the request pipeline, which leads me neatly on to the next feature.

4. Metrics and logs to an observability platform

These are also things that all services will need to do, but they are easy to forget or to see as less important than getting the business functionality in place. The more of this that can be done in the chassis the better. Having a handler or controller base class that does most of this for you can go a long way. That way when developers are working on a service, all they need to do is inherit from these base classes when implementing their handlers or endpoints and they get a lot of this functionality for free (principle 5). I would also recommend outputting these logs and metrics to the console. Not only does this make them easy for developers to see while they are working on the services, but there are many tools in the Docker and Kubernetes ecosystems capable of parsing console output and forwarding it into observability platforms (principle 6).
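
As an illustration, a handler base class along these lines might look like the Python sketch below. The class names and the shape of the log record are assumptions of mine; the point is only that timing and logging happen in the base class and are written to the console as one JSON line per call.

```python
import json
import sys
import time


class HandlerBase:
    """Chassis base class: times every call and logs one JSON line for it."""

    service_name = "UnknownService"

    def __init__(self, stream=None) -> None:
        # Default to the console so container tooling can collect the output.
        self._stream = stream if stream is not None else sys.stdout

    def handle(self, message: dict) -> dict:
        raise NotImplementedError("services implement their logic here")

    def __call__(self, message: dict) -> dict:
        started = time.perf_counter()
        outcome = "ok"
        try:
            return self.handle(message)
        except Exception:
            outcome = "error"
            raise
        finally:
            record = {
                "service": self.service_name,
                "handler": type(self).__name__,
                "correlationId": message.get("correlationId"),
                "outcome": outcome,
                "durationMs": round((time.perf_counter() - started) * 1000, 3),
            }
            self._stream.write(json.dumps(record) + "\n")


class TakePayment(HandlerBase):
    """A service developer only writes the business logic."""

    service_name = "PaymentService"

    def handle(self, message: dict) -> dict:
        return {**message, "handled": True}
```

Calling TakePayment()(message) runs the handler and prints the timing record without the developer writing any logging code.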

5. Storing and retrieving data

The obvious part of this is usually a generic repository pattern so that any service that uses the chassis can store and retrieve objects related to its domain. But other features to consider are code-first deployments and migrations. If a service owns its own database (principle 4) it should also be responsible for managing it. So when a deployment of a new version of a service requires an update to its data schema, it should also perform the migration as it is deployed. You may also want to consider adding data seeding functionality, allowing services that are created using the chassis to insert data into their database. This is very useful in testing environments. Another point of consideration is this: should all services use the same database technology? Some service domains may be best suited to storing their data in a relational database, while others might be better suited to a document store. It is worth considering a generic data interface with a number of different implementations, and allowing the developers the freedom to choose the correct type of data storage for the domain. I would recommend four implementations:

Relational data storage
Key/value data storage
Document data storage
Wide column data storage

There are of course other forms of data storage you may wish to consider such as graph storage. It all really depends upon the type of data your services need to store.
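
To show what a generic data interface with interchangeable implementations could look like, here is a minimal Python sketch with two of them: an in-memory key/value store and a SQLite-backed one. The Repository interface and both class names are hypothetical, and SQLite simply stands in for whichever relational technology the chassis supports.

```python
import json
import sqlite3
from typing import Optional, Protocol


class Repository(Protocol):
    """The only data interface that service code depends on."""

    def get(self, key: str) -> Optional[dict]: ...
    def put(self, key: str, item: dict) -> None: ...
    def delete(self, key: str) -> None: ...


class InMemoryKeyValueRepository:
    def __init__(self) -> None:
        self._items = {}

    def get(self, key: str) -> Optional[dict]:
        return self._items.get(key)

    def put(self, key: str, item: dict) -> None:
        self._items[key] = item

    def delete(self, key: str) -> None:
        self._items.pop(key, None)


class SqliteRepository:
    def __init__(self, connection: sqlite3.Connection) -> None:
        self._conn = connection
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS items (key TEXT PRIMARY KEY, body TEXT NOT NULL)"
        )

    def get(self, key: str) -> Optional[dict]:
        row = self._conn.execute(
            "SELECT body FROM items WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, key: str, item: dict) -> None:
        with self._conn:  # commits on success
            self._conn.execute(
                "INSERT OR REPLACE INTO items (key, body) VALUES (?, ?)",
                (key, json.dumps(item)),
            )

    def delete(self, key: str) -> None:
        with self._conn:
            self._conn.execute("DELETE FROM items WHERE key = ?", (key,))
```

Service code written against Repository does not change when the team swaps one implementation for another.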

6. Scaling and concurrency and state

It is worth dealing with some scaling scenarios in the service chassis (principle 7). As I mentioned earlier, simply adding a new instance of a service connected to the same queue will result in messages being delivered in a round-robin fashion to the instances of the service allowing for a simple form of scaling. However, this comes with some challenges. The first example would be if your service offers HTTP endpoints and has an open connection with a client. This can be an issue if your service has requested something from the rest of the system and another instance of the service picks up the response from the queue. It would be worth considering implementing some form of shared state across instances of a service, perhaps using a Redis cache. Another challenge you will face when scaling a service is database concurrency. Make sure your data layer implementations take this into account. You will also experience challenges around data migrations if you are running multiple instances of your service. Scaling down to one instance to perform a migration, or having some form of leader election will alleviate this problem.

7. Emitting data changes to a Datasink for reporting tools

This is something I have not touched on so far, and is directly related to principle 11 (Cross domain reporting must not be performed on the microservice architecture, but offloaded to some other system).

The reason for this is that reporting can be a heavy workload, and you do not want to tie your services up with processing requests for large volumes of data when they should be servicing customer requests. This becomes even more true if you are aggregating the data across multiple services and inter-service communication relies on the Distributed Saga pattern. A simple solution is to build functionality into the chassis that duplicates all data changes (create, update and delete actions) into some form of shared storage. This functionality is known as the Datasink, feeding data into a Datalake (shared storage). It is in the Datalake that you can then create relationships between data from different domains, which you can then plug reporting tools into. This ensures that reporting is not performed directly on your microservices, and gives your reporting tools a rich set of data to query. A simple way to do this is with a wide column data store, allowing you to add columns for the data properties that each service will be sending in. It is important that, however you choose to implement your Datasink and Datalake, it remains resilient to changing data structures owned by the services. After all, the services own their databases (principle 4) and should be free to change structures as they see fit.
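
One way to build this into the chassis is to wrap the service’s repository so that every change is also emitted to the Datasink. The Python below is a minimal sketch; the class names and the shape of the change event are my own assumptions, and emit stands in for whatever actually writes to the Datalake.

```python
from typing import Callable, Optional


class DictRepository:
    """Stand-in for the service's real data store."""

    def __init__(self) -> None:
        self._items = {}

    def get(self, key: str) -> Optional[dict]:
        return self._items.get(key)

    def put(self, key: str, item: dict) -> None:
        self._items[key] = item

    def delete(self, key: str) -> None:
        self._items.pop(key, None)


class DatasinkRepository:
    """Wraps a repository and copies every data change to the Datasink."""

    def __init__(self, inner, service_name: str,
                 emit: Callable[[dict], None]) -> None:
        self._inner = inner
        self._service_name = service_name
        self._emit = emit

    def get(self, key: str) -> Optional[dict]:
        return self._inner.get(key)  # reads are not reported

    def put(self, key: str, item: dict) -> None:
        action = "update" if self._inner.get(key) is not None else "create"
        self._inner.put(key, item)
        self._emit_change(action, key, item)

    def delete(self, key: str) -> None:
        if self._inner.get(key) is None:
            return  # nothing changed, so nothing to report
        self._inner.delete(key)
        self._emit_change("delete", key, None)

    def _emit_change(self, action: str, key: str, data: Optional[dict]) -> None:
        self._emit({"service": self._service_name, "action": action,
                    "key": key, "data": data})
```

Because the event carries the item as an opaque dictionary, the sink does not need to know the service’s schema, which helps it stay resilient to structural changes.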

8. Provide abstractions over framework and 3rd party packages/libraries

Many of the packages you use will have alternatives. As we already discussed when talking about databases, a common interface with alternative implementations can be useful. The same is true of other packages you use too. What if you want to swap out your logging framework, or the message broker? Because these things are used by all services, it is worth ensuring you create abstractions over these packages so that you can easily replace them (principle 6). In addition, it can sometimes be difficult to test code that depends on core framework functionality such as file system access. Creating abstractions in your service chassis can help with testing too.
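
As a small illustration of such an abstraction, the Python sketch below has service code depend on a publisher interface rather than on a broker client. All of the names are hypothetical. A real chassis would add an implementation that wraps its chosen broker’s client library behind the same interface.

```python
import json
from typing import Protocol


class MessagePublisher(Protocol):
    """The only publishing interface that service code is allowed to see."""

    def publish(self, message: dict) -> None: ...


class InMemoryPublisher:
    """Test double: records messages instead of sending them."""

    def __init__(self) -> None:
        self.published = []

    def publish(self, message: dict) -> None:
        self.published.append(message)


class ConsolePublisher:
    """Development implementation: prints each message as JSON."""

    def publish(self, message: dict) -> None:
        print(json.dumps(message))


def place_order(publisher: MessagePublisher, correlation_id: str,
                context: dict) -> dict:
    """Service code knows nothing about the broker behind the publisher."""
    message = {
        "correlationId": correlation_id,
        "sourceService": "OrderService",
        "saga": "OrderPlaced",
        "context": context,
        "decorations": [],
    }
    publisher.publish(message)
    return message
```

Swapping the message broker then means writing one new publisher class, and unit tests can use InMemoryPublisher without any infrastructure.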

9. Centralised configuration of 3rd party packages/libraries

Packages used for things like logging or message publishing will be used in the same way across all services. The chassis provides the opportunity to ensure consistent configuration of these dependencies.

10. Define common exceptions

Common exceptions that your services may throw can be defined in your chassis, again ensuring consistency and conformity across your services. Consider adding ResourceNotFoundException, ConfigurationException and other common exceptions to your chassis packages.
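
A minimal Python sketch of what these shared exceptions might look like follows. The base class ChassisException, the constructor arguments and the require_setting helper are my own assumptions for illustration.

```python
class ChassisException(Exception):
    """Base class so callers can catch every chassis-defined error at once."""


class ResourceNotFoundException(ChassisException):
    def __init__(self, resource_type: str, resource_id: str) -> None:
        super().__init__(f"{resource_type} '{resource_id}' was not found")
        self.resource_type = resource_type
        self.resource_id = resource_id


class ConfigurationException(ChassisException):
    def __init__(self, setting: str) -> None:
        super().__init__(f"Missing or invalid configuration setting: {setting}")
        self.setting = setting


def require_setting(settings: dict, name: str) -> str:
    """Return a setting or fail in the same way in every service."""
    value = settings.get(name)
    if not value:
        raise ConfigurationException(name)
    return value
```

With a shared hierarchy, chassis middleware can translate ResourceNotFoundException into a consistent response for every service.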

11. Implementation of circuit-breaker pattern to wrap calls to 3rd party services

Many services will become proxies to 3rd party systems that are outside of your control. In these cases it is important that your services are resilient to issues with these dependencies. Consider including a Circuit-Breaker Pattern implementation in your chassis to ensure that any time a developer needs to call a 3rd party API, it can be done in a resilient way.
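
For readers unfamiliar with the pattern, here is a deliberately small Python sketch of a circuit breaker. The class names, thresholds and injectable clock are assumptions for illustration, and a production chassis would more likely wrap an established resilience library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit is open; call skipped")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A service would route every call to the 3rd party API through breaker.call and treat CircuitOpenError as a fast failure, which in a Decorated Saga could itself become a failure decoration.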

The Opinionated Chassis

The chassis is where your architecture becomes very opinionated, and a method by which you control the available technologies in your architecture. For example, there are many relational database technologies to choose from. It is likely, however, that you only want to support one at a time. If you allow each team to select their own relational data storage you might very quickly end up supporting MS SQL Server, PostgreSQL and MySQL, which of course increases both the complexity of the system and the cost of supporting and maintaining it. The same is true of message queuing software (RabbitMQ, Kafka, SQS…), observability (Elastic, Splunk, Datadog…) and just about every other 3rd party dependency.

When selecting 3rd party technologies to integrate with the chassis it is usual to select only one for each required feature. In this way the chassis becomes opinionated on which technologies the developers are free to use. This could lead to violations of principles 6 and 9, so it is important a.) to implement all third party dependencies using proxy/repository/facade or similar patterns so that the underlying technology can be swapped out at a later date (principle 6), and b.) not to use too much of the bespoke functionality of any given technology. For example, RabbitMQ attaches a message ID and a number of header fields to each message. It might be tempting to use these properties, but doing so may lock you into using this product and make it difficult to switch to a different messaging system later. If you use too much of the functionality of any 3rd party software you may also be risking violation of principle 9.

Closing Thoughts

I am yet to see another architecture that embraces loosely coupled choreography to quite this degree, with such strict adherence to the 11 principles listed at the beginning of the article. For this reason I describe this as a purist microservice architecture. The ideas are relatively simple and yet very effective. Like any pattern though, it is not well suited to every scenario. You have to be careful not to abuse it and fall into the traps of anti-patterns.

Be careful of parallel processing. As we discussed above, this could lead to rapid growth in the number of messages if not implemented with care.
It is just as tempting with this pattern as it is with any other to fall into the Service Fan-Out anti-pattern. Remember that a Microservice should be able to answer a client query without the need to query another service. Sagas are for backend operations that cross multiple services/domains.

But if you are looking for a way to keep your services decoupled, yet perform implicit workflow-like operations across your system, the Decorated Saga pattern could be for you.

Mark Jones

Mark is an experienced Technical Architect, Consultant and Entrepreneur. From a C# and DevOps background, for the last 8 years he has worked with startups, FTSE 100 and government organisations to build innovative cloud orientated microservice platforms.
