This post provides some thoughts on how you can consider resilience within the context of your solution or application.
In my previous blog post, I kept banging on about the word "Context". Guess what: I am going to bring it up again! It is critical when designing a cloud service.
Q: Why is it that important?
A: Your solution is entirely dependent on another vendor. Therefore, you need to be clear about the purpose and requirements of your system, so that you can design it in a way that functions as you expect, in the domain of the vendor's services.
Context is crucial. Remember your non-functional requirements.
How comfortable are you with your non-functional requirements? Could you explain them to me?
If your answer to the above is yes, good! However, let's slightly change the exam question.
How comfortable is your team with your non-functional requirements? Again, could your team explain them to me?
Again, this is slightly vague. Is "your team" just your direct reports?
Ok, third time lucky. How comfortable is your entire development team (Business Analysts, Developers across function teams, Testers across function teams, Project/Program Managers, Security Managers, etc.) with your non-functional requirements?
Could your whole team articulate the same non-functional requirements to me?
I know it sounds like I am teaching you to suck eggs, but it's a topic that has become evident since my involvement in this space. If I asked you to construct me a building and you did not have some form of plan, where would you start?
You are probably thinking that I am preaching about Waterfall right now, but that is not the case (Agile all the way!). The point I am making is that your entire team should be pointing in the same direction.
What are the SLAs that your solution should have? (Availability, Throughput, Duration of Transactions - These are all contextual!)
What are the data residency requirements of your solution? Do you have any compliance constraints of which the team is unaware?
What is the usage profile of your solution, and what is the forecasted growth of usage?
There are many other areas that we could investigate around the softer requirements of the solution, but that gives you an initial flavour.
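To make an availability target easier to discuss with the whole team, it can help to translate the percentage into a downtime budget. The snippet below is a minimal sketch of that arithmetic; the function name and the 30-day period are my own assumptions for illustration, and your contract may define the measurement period differently.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 30) -> float:
    """Return the downtime budget, in minutes, for an availability target.

    Assumption: availability is measured over a fixed period (30 days by default).
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_percent / 100)


# Compare a few common targets side by side.
for target in (99.0, 99.9, 99.95, 99.99):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows about {budget:.1f} minutes of downtime")
```

Under these assumptions, 99.95% works out at roughly 21.6 minutes per 30 days, which is usually a more tangible number to put in front of stakeholders than the percentage itself.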
Once you have those answers, you can begin considering the components that you use within your cloud solution. If your solution has an SLA of 100% availability, there is likely going to be a problem if you pick a cloud component that has an SLA of 99.95%.
You have a few options:
Identify the risk to your project stakeholders, and highlight the difference in requirements. You could then challenge whether the 100% SLA is required.
Again, identify the risk to your project stakeholders, and highlight the difference in requirements. They may decide to run at risk. If so, this must be captured in some form of sign-off.
Alternatively, this SLA could be required. If so, then you should consider this when architecting the solution and design a Highly Available system that helps align to this requirement. However, it adds extra complexity. We can come back to this point in a separate blog post.
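When weighing up these options, a rough calculation can show how component SLAs combine. The sketch below assumes that failures are independent (which is rarely entirely true in practice) and uses made-up availability figures rather than any vendor's published numbers. Components that are all required multiply together, while redundant copies of one component only fail if every copy fails.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain where every component is required."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability when at least one of several independent copies must be up."""
    return 1 - (1 - availability) ** copies


# Hypothetical figures for a web tier, a database and a queue.
chain = [0.9995, 0.9999, 0.999]
print(f"All three required: {serial_availability(chain):.4%}")

# The same hypothetical web tier deployed as two independent copies.
print(f"Two redundant copies: {redundant_availability(0.9995, 2):.6%}")
```

With these example numbers, the chain comes out at roughly 99.84%, which is lower than any single component. Redundancy pushes the figure up, but it never reaches 100%, which is worth having to hand when you challenge a 100% SLA requirement.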
Once we have our requirements and a candidate architecture, we can head to our second stage.
Analyse the dependencies across your solution; cloud, on-premises and third party.
By this stage, you would have some form of architecture diagram, or a strong understanding of how your solution ties together.
Walk through that solution diagram, and component by component review:
Whether that component hits your non-functional requirements
Whether that component requires some form of geographical redundancy (This is quite complex, and would likely be performed at the solution level, e.g. a blue/green deployment. What would happen if a region of your cloud provider went down, due to a natural disaster, a poor network link, etc.? Are you protected against that?)
What would happen if that particular component degrades or entirely fails? Play through the scenario to see how this would affect your entire solution (i.e. determine how tightly coupled the solution is and whether it can degrade gracefully).
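One way to play through a failure scenario is to write down the hard dependencies from your architecture diagram and see what else stops working when a component goes down. The following is a minimal sketch of that walk-through; the component names and the `requires` field are hypothetical, and a real solution would also have softer dependencies that only degrade a feature rather than break it.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    # Names of components this one cannot work without (hard dependencies).
    requires: list[str] = field(default_factory=list)


def impacted_by(failed: str, components: dict[str, Component]) -> set[str]:
    """Return every component that stops working when `failed` goes down."""
    impacted = {failed}
    changed = True
    # Keep sweeping until no further component is pulled in by a failed dependency.
    while changed:
        changed = False
        for component in components.values():
            if component.name in impacted:
                continue
            if any(dep in impacted for dep in component.requires):
                impacted.add(component.name)
                changed = True
    return impacted


# A hypothetical solution: names are illustrative only.
solution = {
    c.name: c
    for c in [
        Component("identity"),
        Component("payments"),
        Component("catalogue"),
        Component("web", requires=["catalogue"]),
        Component("checkout", requires=["identity", "payments"]),
    ]
}

for failed in ("payments", "catalogue"):
    print(f"{failed} fails -> {sorted(impacted_by(failed, solution))}")
```

The smaller the impacted set for each failure, the more loosely coupled the solution is. If a single failure takes out almost everything, that component is a good candidate for redundancy or a fallback.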
The above points are very cloud-focused. What would happen if there are on-premises components (e.g. databases or identity systems), or third-party services (for example, payment handlers) used as a pivotal part of your solution?
Ask yourself the question again: what would happen if that on-premises component or third-party service degrades or, worse, entirely fails?
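One well-known answer to that question is the circuit breaker pattern: after repeated failures, stop calling the struggling dependency for a while and fall back to a degraded response instead. The sketch below is my own simplified illustration rather than a production implementation; the class, its parameters and the `charge_card` function are all hypothetical, and a real system would typically rely on an established resilience library.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency until a cool-off period has passed."""

    def __init__(self, max_failures=3, reset_after_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock  # injectable so the behaviour can be tested without waiting
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_seconds:
                # Circuit is open: skip the dependency and degrade gracefully.
                return fallback()
            # Cool-off has elapsed: allow one trial call through.
            self.opened_at = None
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        # A success resets the failure count.
        self.failures = 0
        return result


def charge_card():
    # Hypothetical third-party payment call that is currently unavailable.
    raise ConnectionError("payment provider unreachable")


breaker = CircuitBreaker(max_failures=2, reset_after_seconds=30.0)
for attempt in range(3):
    outcome = breaker.call(charge_card, fallback=lambda: "order queued for later payment")
    print(f"attempt {attempt + 1}: {outcome}")
```

Whether a fallback such as queueing the order is acceptable depends, once again, on your context and your non-functional requirements.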
This scenario is the very reason that I mentioned context. Context is incredibly important, as some of these decisions may not be necessary for a system that does not require such high levels of availability. However, for business critical solutions, this extra complexity may be needed.
It is up to you to determine the acceptable risk versus complexity. If this is deemed necessary, then there are many well-documented patterns and practices.
For a broader list of patterns and practices, take a look at the Azure Guidance page maintained by the Patterns & Practices team. You can also contribute to that page by using GitHub, so it is well worth a look!
Resilience is just one area to think about when building a solution in the cloud. In future blog posts, we can explore additional areas of consideration.