Think of the abstraction that the cloud brings:
- The fact that you can't go and physically inspect any of your infrastructure
- The concept of treating your infrastructure as cattle, rather than as pets
Monitoring in the cloud is an interesting topic. What does monitoring mean in this evolving paradigm?
In an on-premises world, we had the mindset of caring for each of our servers. Quite likely, we had made a significant investment in our data centre. As such, we cared about the metrics relating to each server: memory usage, CPU usage, and so on. This approach helped us determine the performance impact on our servers, and plan for additional capacity and future investments.
This approach is still possible in the cloud. In Azure, for example, you can create dashboards based upon infrastructure-related metrics. But what value does this provide you?
Take a step back and think about that for a moment. Presumably, you are bought into the concept of elasticity if you are utilising the cloud. Some of the infrastructure that you have requested may not be deployed 100% of the time; it may only be scaled up during peak periods.
In the cloud, you aren't building the infrastructure. Depending on your choice of IaaS or PaaS, you may have some degree of responsibility from the operating system level upwards. But ultimately, you are building your solution: your application. What value is it to know that 5 of your 7 machines are utilising 95% of the available CPU? In fact, what if you had scaled up to all 7 of those instances, and they continued averaging 95% CPU? Is that a bad thing?
Not necessarily. By the sounds of it, your scalability rules are working. However, the issue in the above scenario is that you have no intelligence around the usage of your application. The infrastructure may be working hard, but your application may not be doing anything. Or, your users could be encountering thousands of errors due to a bug that you introduced in your most recent sprint.
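To make that concrete, here is a minimal sketch of the kind of application-level signal that raw CPU numbers cannot give you: an error rate derived from request telemetry. The request log format and the `application_error_rate` helper are hypothetical stand-ins for data an APM tool would collect for you.

```python
from collections import Counter

def application_error_rate(request_log):
    """Compute the share of failed requests from a simple request log.

    Each entry is a (route, status_code) tuple -- a stand-in for the
    per-request telemetry an APM tool would gather automatically.
    """
    statuses = Counter(status for _, status in request_log)
    total = sum(statuses.values())
    failures = sum(count for status, count in statuses.items() if status >= 500)
    return failures / total if total else 0.0

# The infrastructure may look healthy (all instances busy at 95% CPU),
# yet the application telemetry can tell a very different story:
log = [("/checkout", 500)] * 30 + [("/checkout", 200)] * 70
print(f"Error rate: {application_error_rate(log):.0%}")  # Error rate: 30%
```

A dashboard built on this kind of signal tells you whether users are succeeding, not merely whether servers are busy.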
Consider PaaS scenarios, where you will have limited monitoring information about the infrastructure. In some cases, your PaaS services could be built on shared infrastructure, meaning infrastructure monitoring may be skewed and not representative of your application. Infrastructure-level monitoring is not enough in the cloud; a DevOps mentality is necessary, focused on continuous telemetry to support the operations of your solution.
1. Are your users encountering errors when they try to purchase a product from your custom e-commerce solution?
2. Are your users facing slowness due to a registration system that depends on a slow component in your environment?
3. Are your users suffering slow response times, as you have deployed your infrastructure in only one region, but you serve users globally?
Infrastructure-level monitoring would likely not give you that detail.
Consider taking a look at Application Performance Monitoring (APM) tools such as New Relic, Application Insights or AppDynamics. These tools typically give you a wealth of information out of the box. Their real value-add is that you can instrument your solution to send custom telemetry to these services, based on events that are important to you.
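The instrumentation pattern looks roughly like the following sketch. The `TelemetryClient` class here is a hypothetical stand-in modelled on the `track_event`-style APIs that tools such as Application Insights expose; consult your chosen tool's SDK for the real calls and configuration.

```python
import time

class TelemetryClient:
    """Minimal stand-in for an APM SDK client. Real SDKs batch events
    and ship them to the APM backend; this one just buffers locally."""

    def __init__(self, instrumentation_key):
        self.instrumentation_key = instrumentation_key
        self.buffer = []

    def track_event(self, name, properties=None):
        self.buffer.append({
            "name": name,
            "timestamp": time.time(),
            "properties": properties or {},
        })

telemetry = TelemetryClient(instrumentation_key="<your-key>")

def purchase_product(user_id, product_id):
    # A business-level event that infrastructure metrics can never show you.
    telemetry.track_event("ProductPurchased",
                          {"userId": user_id, "productId": product_id})

purchase_product("user-42", "sku-1001")
print(telemetry.buffer[0]["name"])  # ProductPurchased
```

The point is that *you* decide which events matter, such as a completed purchase or a failed registration, rather than inheriting whatever the platform happens to measure.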
Additionally, a single user action may pass through multiple components of your solution. How do you track down the part that caused your user to receive an error in your application? A standard technique is to employ something called a correlation ID: an identifier that you pass between the different pieces of your code from which you send custom events to your APM service. By using a correlation ID, you can begin tying related events together, and following a story through your system.
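The correlation ID technique can be sketched as follows. The component names, the `track` helper and the in-memory event list are hypothetical; in practice each `track` call would be a custom event sent to your APM service.

```python
import uuid
from collections import defaultdict

events = []  # stand-in for the event stream your APM service receives

def track(component, message, correlation_id):
    events.append({"component": component, "message": message,
                   "correlationId": correlation_id})

def handle_order(order):
    # Mint one correlation ID at the edge, then thread it through every hop.
    cid = str(uuid.uuid4())
    track("web", f"received {order}", cid)
    reserve_stock(order, cid)
    take_payment(order, cid)
    return cid

def reserve_stock(order, cid):
    track("inventory", "stock reserved", cid)

def take_payment(order, cid):
    track("payments", "payment captured", cid)

cid = handle_order("order-7")

# Group events by correlation ID to follow one request's story end to end.
story = defaultdict(list)
for e in events:
    story[e["correlationId"]].append(e["component"])
print(story[cid])  # ['web', 'inventory', 'payments']
```

Querying your APM tool for a single correlation ID then reconstructs the full journey of one user action, even across process and service boundaries.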
Some APM tools automatically detect dependencies in your code, though this capability has traditionally been IaaS-focused. Both Application Insights and AppDynamics have potential solutions to this and are worth exploring in further detail.
The previous two paragraphs address points 1 and 2, though not point 3. Efficient use of APM tools could also resolve that final point; an additional approach, however, is to run remote ping or web tests from numerous locations across the globe.
The benefit of these geographically dispersed latency tests is that you gain a realistic estimate of your users' load times, and with it an insight into your end users' perception of the application. If you are running a geographically redundant solution based in numerous locations across the world, how will you know if your application is down in one region? And once you were aware that it was down, how would you determine the latency impact on the users redirected from Europe to America?
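A sketch of how such probe results might be summarised per region follows. The region names, the sample values and the `None` timeout marker are all invented for illustration; hosted web tests in APM suites produce this kind of data for you.

```python
import statistics

# Hypothetical latency samples (ms) gathered by web tests running in
# several regions; None marks a probe that timed out (region unreachable).
samples = {
    "westeurope": [85, 90, 88],
    "eastus": [40, 42, 38],
    "southeastasia": [210, 230, None],
}

def summarise(samples):
    """Reduce raw probe samples to a per-region median latency and availability."""
    report = {}
    for region, values in samples.items():
        good = [v for v in values if v is not None]
        report[region] = {
            "median_ms": statistics.median(good) if good else None,
            "availability": len(good) / len(values),
        }
    return report

for region, stats in summarise(samples).items():
    print(region, stats)
```

A report like this answers both questions above: a region with falling availability is in trouble, and comparing median latencies shows what redirected users will actually experience.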
These are all important questions, and worth considering when building a cloud solution.