Using service graphs to reduce MTTR in an HTTP-based architecture


by Marcus Cavalcanti

(source: leica-geosystems.com)

Microservices are a new trend in software development that many organizations around the world are adopting (and counting…). From e-commerce websites to social network platforms, everybody is talking about this new way of designing software architectures. If you have requirements like scalability and high availability, or if you want to reorganize your teams around business capabilities, this approach can help you. A lot.

The growth of the “Microservices” term in Google Search in the last 5 years (source: Google Trends)

At B2W, we adopted this new paradigm in mid-2014 and, since then, we have had enormous gains in our projects, but at a price. Sometimes a huge price. Microservices bring new challenges, and in my opinion, dealing with many dependencies is the biggest one. As your business grows, you'll probably add more and more services over time. Consequently, the problems of monitoring and operating the platform will also grow at the same pace.

If you didn’t architect your platform considering basic aspects such as resiliency, it’s not rare that one component (Microservice) breaks the whole system and the consequence can be thousands, or even millions of dollars in losses.

Monitoring is always a good practice. You have to do it. But how can you monitor a hundred Microservices effectively? How do you know whether a service at the end of your dependency chain is impacting services at the beginning? The longer you take to answer these questions, the longer you take to react.

A situation that we must avoid at B2W is to stop selling. When something goes wrong, we need to know what happened as soon as possible. But since we have a lot of Microservices, finding the problem is sometimes like finding a needle in a haystack… a real nightmare!

We have been using a lot of monitoring and alerting tools to help us in day-to-day operations. We also use logging extensively, recording every relevant piece of information in our ELK stack. But as I said before, in a Microservice architecture the whole matters. Even if you designed your architecture to be resilient, there are moments when a unified view of the whole is what matters most, especially when you have a symptom like alarms firing in many applications/services at the same time.

Last year, I watched a presentation by a Brazilian guy (Eduardo Saito) who works for LinkedIn. In his presentation, he talked about the challenges of operating the LinkedIn platform. They have a complex process to mitigate risk when deploying service updates to production, and to help with that they developed an internal tool called inVisualize.

InVisualize is a visual tool intended to provide insights into LinkedIn's platform health, and, inspired by it, I started to think about how we could implement something similar at B2W.

B2W is an e-commerce platform and our revenue depends a lot on our availability. We must be online 24x7. But reality is tough and as Werner Vogels says:

“Everything fails all the time”

There is a specific metric that is very common when we're talking about maintainability: MTTR (mean time to repair).

It measures how long it takes to solve a problem, or how long we take to identify an unexpected situation and fix it.

Guided by this mindset from Mr. Vogels, and considering that our MTTR must be as low as possible, I went back to the question: how can we monitor a hundred Microservices efficiently?

I'm pretty sure that visual representations are much more human-readable than textual ones, so my idea was to create a visual representation of our architecture based on the key aspects we use to infer a problem, or rather, an unhealthy situation.

When our platform is degrading, we usually consider three aspects:

  • Latency: response time between two services;
  • Error rate: HTTP 5xx family errors;
  • Throughput: the number of requests from one service to another in a given period.

The throughput metric doesn't mean much in isolation, but it can be the cause of a high-latency scenario. In the same way, a high-latency scenario can explain a high error rate.

But how do we know that? By using levels. Each indicator can be at one of three levels (one at a time): normal, warning and critical. They are self-explanatory.

To represent the relationship between two services, we use graphs. A graph is a very appropriate data structure for representing relationships, and in our case what we want to represent is the relationship among services (N:N).

A simple graph relationship (source: Wikipedia)

And what about the visual representation of each indicator combined with its level? That's a good question.

For latency, we use the line type. If the line is solid, the situation is considered normal. If it is dotted with small gaps, we consider it a warning. And if it is dotted with long gaps, we are probably in a critical situation. For error rate, we use the line color: blue represents normal, yellow represents warning and red represents a critical situation. Finally, for the throughput indicator, we use the line thickness: the thicker the line, the higher the throughput (and the closer it is to critical).

InnerSection is our answer for identifying suspicious or strange behaviour in B2W's Microservices ecosystem and reducing MTTR.

We must avoid scenarios like this:

When an e-commerce portal stops selling, that's the feeling (source: ripe)

During one of our Hackathons, I teamed up with 4 brilliant guys to develop InnerSection in 24 hours. The outcome was really incredible.

Our mission was to create a holistic view of our platform, represented as a graph containing all the Microservices running in production. To confirm my theory, I started to think about all the limitations we had without InnerSection, and this is what I got:

  • Alarms work very well, but they need to exist first;
  • APMs are also good, but we must already know where the problem is;
  • Logs are necessary, but all applications must follow a consistent logging pattern, otherwise they will not be useful for correlating problems;
  • The more services we have, the harder it is to detect problems, so a visual representation looks interesting.

With InnerSection we addressed all the limitations above, guided by a few simple decisions:

  • The visual representation must be a service graph;
  • Three indicators used to infer a problem: error rate, latency and throughput;
  • Three levels for each indicator: normal, warning and critical;
  • Well-defined symbology per indicator and level;
  • Isolation: the agent must be isolated at the application/service level;
  • Easy to plug in: minimal configuration;
  • Agnostic: it does not depend on a specific technology or infrastructure design;
  • Not intrusive: it cannot impact the application;
  • No customization: you don't need to change anything in the application.

The main flow of InnerSection is to listen to all TCP packets on a given machine, identify which ones are HTTP and then send them to an API.

This approach allows us to keep using the best technology to solve each specific problem. Technology agnosticism is one of our premises at B2W.

InnerSection’s Stack — Big Picture

Below are the four main components of InnerSection's architecture.

The main component is the agent, which must run in the same VM as the application.

The agent, developed in Go, monitors all TCP packets on a machine and identifies which ones are HTTP. We do this using libpcap, the same library that tcpdump uses.
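
As an illustration only, a minimal capture loop in Go could look like the sketch below. It assumes the gopacket library on top of libpcap, a fixed interface name (eth0) and a very naive HTTP detection heuristic; none of this is the actual InnerSection agent code.

```go
package main

import (
	"bytes"
	"log"

	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
	"github.com/google/gopacket/pcap"
)

func main() {
	// Open the interface via libpcap (the same library tcpdump uses).
	handle, err := pcap.OpenLive("eth0", 65535, true, pcap.BlockForever)
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()

	// Capture only TCP traffic.
	if err := handle.SetBPFFilter("tcp"); err != nil {
		log.Fatal(err)
	}

	packets := gopacket.NewPacketSource(handle, handle.LinkType())
	for packet := range packets.Packets() {
		tcp, ok := packet.Layer(layers.LayerTypeTCP).(*layers.TCP)
		if !ok || len(tcp.Payload) == 0 {
			continue
		}
		// Naive HTTP detection: look at the start of the TCP payload.
		p := tcp.Payload
		if bytes.HasPrefix(p, []byte("GET ")) || bytes.HasPrefix(p, []byte("POST ")) ||
			bytes.HasPrefix(p, []byte("HTTP/1.")) {
			log.Printf("HTTP segment observed (%d bytes)", len(p))
		}
	}
}
```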

The agent can store all packets/events in a local, in-memory buffer, and when a size (in bytes) or time threshold is reached, it flushes them.

A flush sends a set of events, in a specific pre-formatted JSON, to an HTTP API that we call the Ingest API.

Using an agent helped us keep our premises of agnosticism, isolation, ease of plugging in and non-intrusiveness. The agent uses the API's hostname to identify the target, and an OS environment variable to identify which API is the source.
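
A rough sketch of the buffer-and-flush behaviour, again in Go and with hypothetical names (the Event fields and the APP_NAME and INGEST_URL environment variables are illustrative, not the published payload schema), could look like this:

```go
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
	"os"
	"sync"
	"time"
)

// Event is an illustrative shape for one observed HTTP call.
type Event struct {
	Source     string    `json:"source"` // taken from an OS env var, e.g. APP_NAME
	Target     string    `json:"target"` // taken from the HTTP Host header
	StatusCode int       `json:"statusCode"`
	LatencyMS  int64     `json:"latencyMs"`
	Timestamp  time.Time `json:"timestamp"`
}

type Buffer struct {
	mu        sync.Mutex
	events    []Event
	maxEvents int
	ingestURL string
}

func NewBuffer(maxEvents int, ingestURL string, interval time.Duration) *Buffer {
	b := &Buffer{maxEvents: maxEvents, ingestURL: ingestURL}
	// Time-based flush: send whatever is buffered every interval.
	go func() {
		for range time.Tick(interval) {
			b.Flush()
		}
	}()
	return b
}

func (b *Buffer) Add(e Event) {
	b.mu.Lock()
	b.events = append(b.events, e)
	full := len(b.events) >= b.maxEvents
	b.mu.Unlock()
	if full { // size-based flush
		b.Flush()
	}
}

func (b *Buffer) Flush() {
	b.mu.Lock()
	batch := b.events
	b.events = nil
	b.mu.Unlock()
	if len(batch) == 0 {
		return
	}
	body, _ := json.Marshal(batch)
	// Fire-and-forget POST of the batch to the Ingest API.
	resp, err := http.Post(b.ingestURL, "application/json", bytes.NewReader(body))
	if err == nil {
		resp.Body.Close()
	}
}

func main() {
	buf := NewBuffer(500, os.Getenv("INGEST_URL"), 10*time.Second)
	buf.Add(Event{Source: os.Getenv("APP_NAME"), Target: "checkout-api",
		StatusCode: 200, LatencyMS: 32, Timestamp: time.Now()})
}
```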

The Ingest API receives all events sent by the agents and persists them in a MongoDB database. We opted for MongoDB for many reasons: the team already knew how to operate Mongo, it is schemaless, it works natively with JSON, it makes filtering and aggregating data easy, and Mongo has a built-in graph model.

We're using MongoDB's async Java driver. It works with callbacks to ensure that all database operations are non-blocking. In the same way, the Ingest API itself is async: every call returns a 202 HTTP status code, and another thread does the hard work of persisting the data.
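
The real Ingest API is written in Java with the async MongoDB driver, but the "return 202, persist in the background" pattern can be sketched in Go with the official mongo-go-driver; the database, collection and route names below are made up:

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"net/http"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	client, err := mongo.Connect(context.Background(),
		options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	events := client.Database("innersection").Collection("events")

	http.HandleFunc("/events", func(w http.ResponseWriter, r *http.Request) {
		var batch []map[string]interface{}
		if err := json.NewDecoder(r.Body).Decode(&batch); err != nil {
			http.Error(w, "invalid payload", http.StatusBadRequest)
			return
		}
		// Acknowledge immediately; persistence happens off the request path.
		w.WriteHeader(http.StatusAccepted)

		go func(batch []map[string]interface{}) {
			docs := make([]interface{}, len(batch))
			for i, e := range batch {
				docs[i] = e
			}
			ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
			defer cancel()
			if _, err := events.InsertMany(ctx, docs); err != nil {
				log.Printf("persist failed: %v", err)
			}
		}(batch)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```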

We have plans to use a message broker, like Kafka or Kinesis, to decouple the production of data from its consumption and scale them better.

The Graph API is responsible for retrieving all the data persisted in MongoDB for a given period. The data comes back already formatted in a way that SigmaJS can understand. The Graph API also applies some math formulas to generate a score (level) for each indicator.

We decided to calculate each score in this API rather than in the Ingest API because, if we change some business rule associated with the score calculation, we won't need to reprocess the entire dataset.

Our frontend was developed using SigmaJS, a JS library that generates graphs from JSON.
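
For illustration, the payload the Graph API could return might be shaped like the Go structs below, following SigmaJS's basic node/edge attributes (id, label, x, y, size for nodes; id, source, target for edges). The way the levels map onto color, size and type here is an assumption based on the visual encoding described earlier, not the actual InnerSection contract:

```go
package main

import "encoding/json"

type Node struct {
	ID    string  `json:"id"`
	Label string  `json:"label"`
	X     float64 `json:"x"`
	Y     float64 `json:"y"`
	Size  float64 `json:"size"`
}

type Edge struct {
	ID     string  `json:"id"`
	Source string  `json:"source"`
	Target string  `json:"target"`
	Color  string  `json:"color"` // error-rate level: blue / yellow / red
	Size   float64 `json:"size"`  // throughput level: thicker = more traffic
	Type   string  `json:"type"`  // latency level: solid / dashed / dotted
}

type Graph struct {
	Nodes []Node `json:"nodes"`
	Edges []Edge `json:"edges"`
}

func main() {
	g := Graph{
		Nodes: []Node{
			{ID: "checkout", Label: "checkout-api", X: 0, Y: 0, Size: 1},
			{ID: "payments", Label: "payments-api", X: 1, Y: 1, Size: 1},
		},
		Edges: []Edge{
			{ID: "e1", Source: "checkout", Target: "payments",
				Color: "#ffcc00", Size: 2, Type: "dashed"},
		},
	}
	out, _ := json.MarshalIndent(g, "", "  ")
	println(string(out))
}
```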

The frontend constantly polls fresh data from the Graph API; it's like a living organism.

I know that at this point you may be thinking about a better and more efficient way to retrieve data. We thought the same. Something bidirectional, like WebSocket, makes sense for this use case.

Our long-term idea is to use an ML model to learn from the past and predict the future. Currently the formula is very simple, but it meets our needs.

There are two important things here. The first is that all relationships are 1:1, that is, caller (API A) -> provider (API B). All indicators (error rate, latency and throughput) and their levels are also defined per relationship. This is how the data is persisted in MongoDB.
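
A hypothetical shape for one persisted relationship document could look like the struct below; the actual schema isn't published, so every field name here is an assumption:

```go
package main

import (
	"fmt"
	"time"
)

// Relationship is a hypothetical document shape for one caller->provider pair.
type Relationship struct {
	Caller   string    `bson:"caller"`   // e.g. "api-a"
	Provider string    `bson:"provider"` // e.g. "api-b"
	Window   time.Time `bson:"window"`   // start of the aggregation period

	ErrorRate struct {
		Pct   float64 `bson:"pct"`   // percentage of HTTP 5xx responses
		Level string  `bson:"level"` // normal | warning | critical
	} `bson:"errorRate"`

	Latency struct {
		MS    float64 `bson:"ms"`
		Level string  `bson:"level"`
	} `bson:"latency"`

	Throughput struct {
		RPM   float64 `bson:"rpm"`
		Level string  `bson:"level"`
	} `bson:"throughput"`
}

func main() {
	r := Relationship{Caller: "api-a", Provider: "api-b", Window: time.Now()}
	r.ErrorRate.Pct, r.ErrorRate.Level = 1.2, "normal"
	r.Latency.MS, r.Latency.Level = 60, "warning"
	r.Throughput.RPM, r.Throughput.Level = 12000, "normal"
	fmt.Printf("%+v\n", r)
}
```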

The second is that the thresholds for some indicators are fixed, based on our experience and empirical observation.

Considering that, we have:

Error rate: normal ≤ 3%; warning between 3% and 10%; critical > 10%.

Latency: normal ≤ 45 ms; warning between 45 ms and 150 ms; critical > 150 ms.
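
These fixed thresholds translate directly into a small classification function; the sketch below is illustrative, not the Graph API's actual code:

```go
package main

import "fmt"

type Level string

const (
	Normal   Level = "normal"
	Warning  Level = "warning"
	Critical Level = "critical"
)

// ErrorRateLevel classifies the percentage of HTTP 5xx responses.
func ErrorRateLevel(pct float64) Level {
	switch {
	case pct <= 3:
		return Normal
	case pct <= 10:
		return Warning
	default:
		return Critical
	}
}

// LatencyLevel classifies the response time, in milliseconds.
func LatencyLevel(ms float64) Level {
	switch {
	case ms <= 45:
		return Normal
	case ms <= 150:
		return Warning
	default:
		return Critical
	}
}

func main() {
	fmt.Println(ErrorRateLevel(4.2)) // warning
	fmt.Println(LatencyLevel(200))   // critical
}
```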

It's not easy to classify throughput accurately. There are many situations we must be aware of, for example, when a new Microservice is deployed.

We consider the throughput (service calls) of all the services in the graph and, from that, we define the three levels based on the median.

To widen the range of the "warning" group, which starts at the median, we apply a factor of 1.01. We do the same for the "critical" group, where we apply a factor of 1.05.

To clarify, consider this specific set of throughputs, in thousands of RPM: {190, 100, 80, 55, 45, 10, 5}.

The median of this set is 55. For the critical boundary, we take the values above the median (190, 100 and 80) and, to avoid outliers, discard the highest one before averaging; the average is therefore 90.

Considering the rule above and applying the factors, the three levels end up like this: normal < warning (55 K × 1.01 = 55,550 RPM) < critical (90 K × 1.05 = 94,500 RPM).
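
The sketch below reproduces that arithmetic. Note that discarding the highest value before averaging the "above the median" group is my reading of the worked example; it is what yields the published 55,550 and 94,500 RPM boundaries.

```go
package main

import (
	"fmt"
	"sort"
)

// throughputThresholds returns the warning and critical boundaries, in the
// same unit as the input (here, thousands of RPM).
func throughputThresholds(rpm []float64) (warning, critical float64) {
	sorted := append([]float64(nil), rpm...)
	sort.Float64s(sorted)

	median := sorted[len(sorted)/2] // odd-sized set, as in the example

	// Values strictly above the median, with the largest one discarded
	// (assumed outlier handling).
	above := sorted[len(sorted)/2+1 : len(sorted)-1]
	var sum float64
	for _, v := range above {
		sum += v
	}
	avg := sum / float64(len(above))

	return median * 1.01, avg * 1.05
}

func main() {
	set := []float64{190, 100, 80, 55, 45, 10, 5} // thousands of RPM
	w, c := throughputThresholds(set)
	fmt.Printf("warning >= %.2fK RPM, critical >= %.2fK RPM\n", w, c)
	// Output: warning >= 55.55K RPM, critical >= 94.50K RPM
}
```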

This is a screenshot of InnerSection using some test data.

Each node in the graph is a different Microservice.

InnerSection Working! (data is not from production environments)

Normally in a Microservice architecture, databases are a dependency of a single service, so if a database is slow, the impact will probably be reflected in the response time of that Microservice.

Furthermore, it would be insane to understand and monitor the protocol of every dependency (filesystem, message brokers, etc.) of a Microservice.

We don't have plans to adapt InnerSection to work with Lambda functions; it would be hard, as we don't have direct access to the machines where Lambdas run. But if you're looking for a solution for the serverless paradigm, I recommend AWS X-Ray, in case your Lambdas are already deployed on AWS.

As I mentioned before, we have plans to use ML models to predict the score levels of each indicator. I'm pretty sure that will make the model more accurate.

Another important thing is supporting containers running on a host machine. Since a container normally runs only one process, a good feature would be running the agent from the host machine.

We're also working on a better way to deal with applications that don't have a semantic hostname; in that case, it is difficult to identify the application in the service graph. Currently, you can configure this through a properties file, but ideally it should be automatic.

Finally, the main idea is to open InnerSection up to the community, but some fine-tuning is necessary before releasing a public version.

Originally published at medium.com on March 23, 2018.
