AI Aided Observability

Joelina Wüst
16. October 2023
Reading time: 1 min

Current Development Trends

In today’s development landscape, the use of microservices is ever increasing. This is not just a current phenomenon but a trend that market analysts expect to continue in the near future. The market for microservices is predicted to roughly double in size by 2027, from more than 5 billion USD today to about 11 billion USD (Source).

Where monolithic architectures do not provide the right fit for an application, microservices will take their place thanks to their benefits in scalability, flexibility, and agility.

So far so good! Microservices, due to their inherent properties, are easier to maintain: their components are not as interwoven with the rest of the application, which makes changes easier to implement and errors easier to fix. In short, microservices are more dynamic. Dynamic qualities, in turn, matter in a landscape where the next disruptive technology might be just around the corner.


Not only is the likelihood of using microservices increasing in general, but for those already using them, the complexity of their microservice architectures will grow as their services scale or gain functionality. And complexity comes with its own set of hurdles and problems that need to be managed.

That is why this blog post sheds some light on the topic of complexity in microservices and how it can be managed efficiently using AI aided observability.

Why Complexity Needs to be Observable

First, let us start with a short example that shows why observability is needed in complex environments.

Assume you have an orchestration tool like Kubernetes in place. In your Kubernetes cluster you are running a bunch of interconnected processes that make up your product. So far, this is nothing exotic. Even if you are running a simple product with only a few services, failures can occur, caused for example by high response times between services or poor resource allocation, just to name a few. As long as your application has only a handful of components, your cluster will be easy to monitor and maintain to ensure reliability. This monitoring might even be done with a completely manual approach.

But now assume that your product needs to be expanded and further processes need to be included. Implementing that change in your infrastructure will be the easy part, because microservices and orchestration tools are inherently built to support such changes.

Yet, they are not built (at least by default) to monitor or observe your environment for maximum reliability. As the number of components in your infrastructure increases, so does the chance that the communication between services encounters problems that can, in the worst case, cause an outage of your whole product. In an ideal world, the process of monitoring, problem mitigation, and intervention would be fully automated, and you would not have to do anything. Even if that sounds great at first, the fully automated scenario comes with its own set of problems and pitfalls to consider as well. For that reason, this article takes a closer look at the middle ground between classical monitoring and fully AI backed, automated monitoring solutions.

To be more specific, we will look at AI aided observability. AI aided observability is defined here as a human-in-the-loop process in which the monitoring aspect is largely automated, while intervention and problem mitigation remain in the hands of humans. Nonetheless, the gap between monitoring your system and taking appropriate action still needs to be bridged, especially for non-trivial scenarios. Before we take a closer look at a promising solution to that problem, a quick overview of observability is in order.

Observability as a Process

The first step is monitoring. It describes all the ways you can collect data about your system. This could be the query execution times of a database, the response latency of your web server, or resource allocation in compute-heavy scenarios (e.g., parallel processing). It is also important to consider that some values are inherently tracked by your system’s components, while others need to be exposed before they can be used. The descriptive information collected in this stage alone can already be valuable in an observability setting, but it has limitations that you need to work around, which is why we need to broaden the view a bit.
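To make the monitoring step a little more tangible, here is a minimal sketch of how a service might expose its own telemetry for scraping. It uses the Python prometheus_client library; the metric names, the port, and the simulated workload are illustrative assumptions, not a prescribed setup.

```python
# Minimal sketch: a service exposing custom metrics so a monitoring system
# can scrape them. Metric names and the port are illustrative choices.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Raw monitoring signals: how many requests were served and how long they took.
REQUESTS_TOTAL = Counter("app_requests_total", "Total number of handled requests")
REQUEST_LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

def handle_request() -> None:
    """Simulate a request handler that records its own telemetry."""
    REQUESTS_TOTAL.inc()
    with REQUEST_LATENCY.time():
        time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics on port 8000 for scraping
    while True:
        handle_request()
```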

That is where the next step, metrification, comes into play. Sometimes it makes sense to develop higher-level metrics that combine the previously collected monitoring information into more expressive information, which in turn informs further steps in creating or applying a solution for the problem at hand.
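As a sketch of what metrification can look like, the snippet below derives a higher-level error-rate metric from raw per-service counters. The service names, counts, and the chosen metric are hypothetical; the point is only that raw readings get combined into something more expressive.

```python
# Sketch of "metrification": combining raw readings into a more expressive
# metric. The raw values (request and error counts per service) are assumed
# to come from whatever monitoring step is already in place.
from dataclasses import dataclass

@dataclass
class RawReading:
    service: str
    requests: int   # total requests in the observation window
    errors: int     # failed requests in the same window

def error_rate(reading: RawReading) -> float:
    """Higher-level metric: fraction of failed requests per service."""
    if reading.requests == 0:
        return 0.0
    return reading.errors / reading.requests

readings = [RawReading("checkout", 12_000, 240), RawReading("search", 8_000, 40)]
for r in readings:
    print(f"{r.service}: error rate {error_rate(r):.2%}")
```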

Now that information is monitored and refined (if needed), the next step is to set up alerts. These can range from general to stakeholder specific, depending on their content. An example could be a service-level objective requiring that the response time of a component in your infrastructure must not exceed a certain threshold for longer than X minutes, because reaching that value would trigger a downward spiral in the performance of the overall system. To ensure that problems are handled as quickly as possible, the necessary team members are informed about critical changes of a given metric via different channels (e.g., e-mail, Slack, etc.).
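The following is a minimal sketch of such a threshold alert that notifies a team channel. The Slack webhook URL, the threshold, and the latency value are placeholders; in practice this job would usually be delegated to a dedicated alert manager rather than hand-rolled code.

```python
# Sketch of a simple threshold alert: if a metric exceeds its acceptable
# value, notify the responsible team. URL and threshold are placeholders.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
LATENCY_THRESHOLD_SECONDS = 0.5

def check_latency_and_alert(p95_latency_seconds: float) -> None:
    if p95_latency_seconds <= LATENCY_THRESHOLD_SECONDS:
        return
    message = (
        f":warning: p95 latency is {p95_latency_seconds:.2f}s, "
        f"above the agreed threshold of {LATENCY_THRESHOLD_SECONDS:.2f}s."
    )
    # Slack incoming webhooks accept a JSON body with a "text" field.
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)

check_latency_and_alert(0.82)
```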


As a last step in the observability process, it is useful to ensure that alerts and metrics have a high degree of usability. The easier you understand not only the descriptive part of your metrics but also the causal chain behind them, the faster you can implement correct solutions. If you are interested in a practical use case of observability at this point, you might consider scrolling down to the “Example” section for one of many solutions that can help you reach an appropriate level of observability.

Why AI aided though?

So far, I have described a situation in which a robust set of rules for warnings, critical situations, and errors can help to properly monitor your system, at least on a descriptive level. Yet not all monitoring artifacts that we have access to, or create, are to be treated equally.

Some problems might be solved just by reading an alert, especially if you are proficient in monitoring. Other problems are more niche and researching them might cost a lot of time and effort. 

Then there is also the possibility to automate reactions to system states with runbooks. Runbooks are routines that are executed when a certain problem definition is met. The problem with this approach is that runbooks only help in the extremely easy-to-solve cases, or in scenarios where the problem space is so unambiguous that the solution contained in the runbook actually solves the problem at hand without causing another problem when applied. In general, runbooks are helpful, but they do not tackle the problems that are hard(er) to solve.
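To make the runbook idea concrete, here is a small sketch in which known problem signatures map to scripted remediations. The alert names and remediation functions are made up for illustration; real runbooks would call into your orchestration or deployment tooling, and anything without a prepared routine still lands with a human.

```python
# Sketch of runbook-style automation: well-understood problem signatures map
# to scripted remediations. Alert names and actions are made up.
from typing import Callable, Dict

def restart_worker_pool() -> str:
    return "worker pool restarted"

def clear_stale_cache() -> str:
    return "stale cache entries evicted"

RUNBOOKS: Dict[str, Callable[[], str]] = {
    "WorkerQueueStalled": restart_worker_pool,
    "CacheHitRateLow": clear_stale_cache,
}

def handle_alert(alert_name: str) -> str:
    action = RUNBOOKS.get(alert_name)
    if action is None:
        # The hard(er) cases: no prepared routine exists, a human has to dig in.
        return f"no runbook for '{alert_name}', escalate to the on-call engineer"
    return action()

print(handle_alert("WorkerQueueStalled"))
print(handle_alert("UnexpectedCrossServiceTimeout"))
```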

This is where AI, and especially LLMs, can “aid” us. Integrating an LLM (e.g., ChatGPT) into the monitoring and problem-solving process can cut down research time or even provide you with a correct solution to your problem. We can further increase the probability of generating feasible solutions if we use a model that is specialized for a given domain of knowledge. Such specialization can also pay off in terms of model cost (when self-hosted) and response quality (as shown in: Microsoft Phi-1).

But how would an LLM be integrated into the observability process? We could do so by converting the content of an alert from our system into a prompt that we subsequently feed to the LLM. The prompt can be further contextualized by adding information about the cluster components for even more precise feedback if needed.
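A minimal sketch of this conversion step could look like the following. It uses the openai Python package (the client-style API); the model name, the alert fields, and the cluster context are placeholders for whatever your stack actually emits, and a specialized or self-hosted model could be substituted here.

```python
# Sketch: turning an alert plus some cluster context into an LLM prompt.
# Model name, alert fields, and context are placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def alert_to_prompt(alert: dict, cluster_context: str) -> str:
    return (
        "You are assisting with Kubernetes observability.\n"
        f"Alert: {alert['name']} (severity: {alert['severity']})\n"
        f"Description: {alert['description']}\n"
        f"Cluster context: {cluster_context}\n"
        "Suggest likely root causes and safe first diagnostic steps."
    )

alert = {
    "name": "HighRequestLatency",
    "severity": "warning",
    "description": "p95 latency of the checkout service above 500ms for 10 minutes",
}
context = "3-node cluster; checkout talks to payment and inventory services"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any suitable (or specialized) model works
    messages=[{"role": "user", "content": alert_to_prompt(alert, context)}],
)
print(response.choices[0].message.content)
```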

For a summarized visual overview of the process described so far see the following image:

[Image: Overview of the AI aided observability process]

As LLMs are not flawless when it comes to the artifacts they produce, we still require a human in the loop to evaluate the LLM’s responses, even if we are using specialized models. This might not make AI aided observability the holy grail of maintenance and reliability that we all have been waiting for, but it is still a useful tool to ease the harder parts of observing your system. It can help you pinpoint possible reasons for problems faster, which gives it a clear advantage over classical monitoring, where you would need to tediously dig through documentation, if that documentation even exists for your exact problem. The fact that you might be running a complex and unique cluster infrastructure further decreases the chance of success with the “classical approach”. Moreover, including an LLM in your infrastructure is especially helpful when a system is growing, because growth adds possible points of interaction and vulnerability that you need to manage.

To get a little more practical and see what a real-world implementation of the described observability stack can look like, the following section provides a hands-on example.

Example

Where should we begin our journey to successful AI aided observability solutions? As our first step, we need to choose an appropriate monitoring solution. 

Considering our introductory glance at the development landscape let us go with something that fits the microservice trend well. A feasible solution could be Prometheus, as it has been developed for highly dynamic container environments!

Without a solution like that in place, you do not really have any insight into your Kubernetes cluster. The only thing you would notice would be the after-effects of events (e.g., high latency in the network because one service is sending error messages in an endless loop). Especially in more complex scenarios, where many puzzle pieces make up your product, the chances for problems increase, and the effort to reverse engineer from the observed phenomenon back to its root cause can be complex and tedious.

Solutions like Prometheus can cut down search and reaction time for problems, as they allow you to monitor your cluster and its components continuously. Here is a quick example: should a metric (e.g., average CPU usage) that you have set up fall outside the range you have defined as acceptable, an alert will be triggered and sent to the appropriate person.


Alerts can also have various levels of severity, from simple warnings to errors. The description of Prometheus so far has been rather conceptual, but how does Prometheus achieve all that on a technical level?

Prometheus has a main component, the Prometheus Server, which does the actual monitoring work on defined targets (e.g., an application) based on given units (e.g., number of requests). This data is then stored in a time-series database. But wait! That data has to come from somewhere. This is where the data retrieval worker comes into play. It pulls the data from applications, services, servers, etc. that have exposed their metrics and saves it in the time-series database. The last component is an HTTP server that you can use to query the collected data and subsequently visualize it.
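As a rough illustration of that last component, the sketch below queries Prometheus’ HTTP API for an instant PromQL result. The server URL is a placeholder, and the PromQL expression assumes a metric like the one exposed in the earlier monitoring sketch.

```python
# Sketch: querying Prometheus' HTTP API for an instant PromQL result.
# The server URL and the PromQL expression are illustrative.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # placeholder

def query_instant(promql: str) -> list:
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": promql},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Average per-second request rate over the last 5 minutes, grouped by service.
for series in query_instant("sum by (service) (rate(app_requests_total[5m]))"):
    print(series["metric"].get("service", "unknown"), series["value"][1])
```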

Visualization can be done using Grafana, which makes use of Prometheus’ query language PromQL to create useful dashboards about your system and its current status. But why even use Grafana if you already make use of Prometheus’ alert manager? Here, usability comes back into play. Sometimes you do not need an alert to know that there is something going on in your system that you need to take care of. Having set up some relevant and easy-to-digest dashboards with Grafana can help you monitor your system in real time and even anticipate when and why something is happening. This helps immensely when it comes to investigating the causes of a given problem. See the image below for a visual example of a Grafana dashboard (Source: Wikimedia Commons).

[Image: Example of a Grafana dashboard (Source: Wikimedia Commons)]

Yet, Grafana can not only be used for visualizing our time-series data, even though that is a job it handles especially well! We can also use Grafana to develop plugins that expand the basic monitoring architecture described so far with custom, use-case-specific functionality.

Taken together, if we stick to our Kubernetes example, we could use Grafana in addition to Prometheus to monitor and visualize our cluster for a more holistic overview of the system. Having set up both of those solutions leaves us prepared for many cases. Yet, through plugins we can supercharge our system. I will get back to that latter point in just a moment.

Regardless of the richness of our monitoring solution, we do not necessarily have a good strategy at hand for when an alert or problem occurs, especially of the more niche kind. This is even more true if you are managing a complex infrastructure. Not all types of failures or problems are something that you can anticipate and prepare for!

That is the part where you can supercharge the system by making use of the “AI aided” bit of observability. Why should you try to find documentation for a niche scenario, if that documentation even exists, when the answer to your problem might be “hidden” somewhere in an LLM that you can integrate into your monitoring solution? Of course, this does not guarantee a correct solution, even if you use a specialized LLM. But your journey to fix the problem has to begin somewhere!

Here we come back to the Grafana plugins mentioned earlier. We could create a plugin that integrates our alert messages into a routine that forwards the message and/or alert type as a prompt to an LLM for additional information. We could even make use of contextual information about our cluster to refine the artifacts that the LLM will produce. A sketch of what such glue logic could look like follows below.
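The following is a minimal sketch of that glue: a tiny webhook receiver that an alerting tool such as Grafana could post alerts to, which then forwards them to an LLM helper. The endpoint path, the payload fields, and the ask_llm helper are assumptions; the exact webhook payload shape depends on your Grafana version and alerting configuration.

```python
# Sketch of the glue between alerting and an LLM: a small webhook receiver
# that an alerting tool could post alerts to. Payload fields and the
# ask_llm helper are assumptions, not a fixed Grafana contract.
from flask import Flask, jsonify, request

app = Flask(__name__)

def ask_llm(prompt: str) -> str:
    # Placeholder for the LLM call sketched earlier in this article.
    return f"(LLM suggestion for: {prompt[:60]}...)"

@app.post("/alert-hook")
def alert_hook():
    payload = request.get_json(force=True)
    suggestions = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        prompt = (
            f"Alert {labels.get('alertname', 'unknown')}: "
            f"{annotations.get('summary', 'no summary provided')}. "
            "What are likely causes and first diagnostic steps?"
        )
        suggestions.append(ask_llm(prompt))
    return jsonify({"suggestions": suggestions})

if __name__ == "__main__":
    app.run(port=8080)
```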

The idea of AI assisted maintenance is still quite novel. That is why we here at evoila are working on a Grafana-based variant of AI aided observability, as described in this article. Stay tuned for updates on that!

Conclusion

Taken together, we can see that traditional monitoring is by no means an obsolete artifact of the past when it comes to keeping an eye on a system to ensure reliability and availability. Nonetheless, the traditional way is not a cure-all, especially if the complexity of your system is increasing or the problem you are facing is not a common one. As environments become increasingly complex, the solutions to your problems may become more complex as well. Furthermore, you cannot be prepared for every problem that may arise. This is why it is worth considering AI aided observability as an extension of your monitoring solution!