An Approach to Building Self-Healing APIs

Dânesh Hussain Zaki
6 min read · Nov 18, 2021

I recently spoke at API World on Self-Healing APIs. I would like to share a summary of that talk along with a few more thoughts.

Enterprise-grade APIs today run mission-critical capabilities and need to be highly available. Any failure in them can result in significant business losses. The operations infrastructure monitors the APIs and keeps them running smoothly by identifying and resolving issues as soon as they occur; it also predicts problems and provides preventive guidance. In this article, an API means a specification, an implementation running on a server, and a proxy configuration on an API gateway.

Common Operational Issues Related to APIs and Current Approach to Resolution

Some of the common operational issues related to APIs are:

· Connectivity and latency issues

· Security issues

· Rate limits

· Data errors

· Versioning

Most connectivity issues are due to the API implementation or a dependent API running slow or going down; latency in API responses can also be caused by network congestion. Common security issues typically involve invalid credentials or an incorrect authentication mechanism, for instance basic authentication being used instead of an API key. Rate-limit errors occur when API consumers inadvertently make more requests than their plan allows, while common data errors relate to data format or to insufficient or incorrect data, such as missing mandatory fields or incorrect date formats. Finally, versioning issues arise from a mismatch between the API specification and the consumer's request, typically because the consumer is on an older version of the API.

The current approach to resolving API issues is based on incident management and proactive monitoring of APIs. The support team is responsible for analyzing the issue and categorizing it into one of the types mentioned above. Once the type of issue is determined, the appropriate steps are taken to resolve it.
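To make the categorization concrete, here is a minimal sketch (in Python, purely illustrative) of the issue types described above; the later sketches in this article build on it:

```python
from enum import Enum, auto

class APIIssueType(Enum):
    """Hypothetical categories mirroring the issue types listed above."""
    CONNECTIVITY = auto()
    SECURITY = auto()
    RATE_LIMIT = auto()
    DATA_ERROR = auto()
    VERSIONING = auto()
    UNKNOWN = auto()
```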

APIs Heal Thyself

When there is a wound or a cut, the cells in the human body use the original DNA as a reference and work to restore themselves to that state. Likewise, in the world of APIs, and software at large, self-healing can be defined as the automatic discovery and fixing of issues by applying the actions needed to bring the system back to normal operation.

Approach to Self-Healing

When we build an approach to self-healing, as with the human cell, we want to start by checking the API state. We could also have an incident created that starts the healing process. The key step is analyzing and categorizing the issue. For an API to determine the issue by itself, it would need to consider the current event, past history, context and external factors, and use a combination of these to identify the issue. For instance, requests exceeding a plan for a short period could be due to a season, an event taking place, or the time of day. Similarly, the load on an API could increase at certain times of day; for example, FOREX rate APIs could see a peak at the opening of the stock exchange.
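As a rough illustration of how such a categorization could work, here is a heuristic sketch that builds on the issue types above. The event fields, thresholds and context keys are assumptions for illustration; a real system would rely on learned patterns rather than hard-coded rules:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class APIEvent:
    api_name: str
    status_code: int
    latency_ms: float
    client_id: str

def classify_issue(event: APIEvent, history: List[APIEvent], context: Dict) -> APIIssueType:
    """Heuristic categorization combining the current event, recent history
    and external context (illustrative thresholds only)."""
    if event.status_code == 429:
        return APIIssueType.RATE_LIMIT
    if event.status_code in (401, 403):
        return APIIssueType.SECURITY
    if event.status_code in (400, 422):
        return APIIssueType.DATA_ERROR
    if event.status_code >= 500:
        return APIIssueType.CONNECTIVITY
    # A latency spike that also shows up in recent history, or known network
    # congestion from external context, points to a connectivity problem
    # rather than a one-off blip.
    recent_slow = sum(1 for e in history[-20:] if e.latency_ms > 1000)
    if event.latency_ms > 1000 and (recent_slow > 5 or context.get("network_congestion")):
        return APIIssueType.CONNECTIVITY
    return APIIssueType.UNKNOWN
```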

Once we analyze and categorize the issue, the corresponding fix is invoked. The fix, having the details of the issue, proceeds with remedial actions. Once the remedial action restores the API to its normal state, the incident, if one was created, is closed in the system. The key component of this approach is observability, where the API state is continuously monitored for changes.
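The overall flow could look something like the sketch below. RESOLVERS, escalate_to_support, api_is_healthy and close_incident are hypothetical hooks into the gateway, support and ticketing systems, named here only to show the shape of the loop:

```python
from typing import Optional

def heal(event: APIEvent, history: list, context: dict,
         incident_id: Optional[str] = None) -> None:
    """Analyze the event, apply the matching fix and close any open incident."""
    issue = classify_issue(event, history, context)
    resolver = RESOLVERS.get(issue)              # hypothetical map: issue type -> fix routine
    if resolver is None:
        escalate_to_support(event, incident_id)  # hypothetical hand-off to the support team
        return
    resolver(event, context)                     # apply the remedial action
    if incident_id and api_is_healthy(event.api_name):  # hypothetical health probe
        close_incident(incident_id)              # hypothetical ticketing-system call
```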

Connectivity errors are among the most common issues related to APIs. Use of the wrong HTTP method (GET instead of POST), not setting the content type correctly and similar mistakes are common causes. As mentioned in the approach, the remediation would involve understanding the patterns, modifying the request and resending it. For errors where the API implementation is unavailable, the API management layer could send a cached response, especially for APIs that serve information that does not change frequently, e.g. customer data.
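For the cached-response fallback, a minimal sketch could look like this; the in-memory cache and the 15-minute TTL are assumptions, and a real gateway would use its own caching policy:

```python
import time
from typing import Callable, Dict, Tuple

RESPONSE_CACHE: Dict[str, Tuple[float, dict]] = {}   # in-memory stand-in for the gateway cache
CACHE_TTL_SECONDS = 15 * 60                          # suited to slowly-changing data

def fetch_with_fallback(api_name: str, call_upstream: Callable[[], dict]) -> dict:
    """Call the backing implementation; if it is unreachable, serve the last
    cached response while it is still reasonably fresh."""
    try:
        response = call_upstream()
        RESPONSE_CACHE[api_name] = (time.time(), response)   # refresh cache on success
        return response
    except (ConnectionError, TimeoutError):
        cached = RESPONSE_CACHE.get(api_name)
        if cached and time.time() - cached[0] < CACHE_TTL_SECONDS:
            return cached[1]     # stale-but-usable data, e.g. customer master records
        raise                    # nothing usable to serve; let the incident flow take over
```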

Most security issues are related to certificates, the authentication mechanism and incorrect protocols. The fixes include renewing certificates before they expire, switching to the correct protocol (for instance, HTTPS where plain HTTP is being used) and, for authentication, rotating API keys or switching to the appropriate authentication mechanism.
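For certificate issues, a small check like the one below could feed the healing flow. The expiry check uses only the standard library, while trigger_certificate_renewal is a hypothetical hook into whatever certificate manager is in place:

```python
import socket
import ssl
import time

def days_until_cert_expiry(host: str, port: int = 443) -> int:
    """Return the number of days before the server certificate expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return int((expires_at - time.time()) // 86400)

def check_and_renew(host: str) -> None:
    if days_until_cert_expiry(host) < 14:
        trigger_certificate_renewal(host)   # hypothetical hook into the certificate manager
```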

When the number of requests exceeds the limit, the HTTP error 429 Too Many Requests is sent instead of a response. There are situations where API limits may have to be increased for a short period based on factors such as time of day, season, external events, location or channel: more APIs being invoked from a tourist location, payroll runs at the end of the month or week, a spike in orders during the holiday season and so on. Information on these factors would be gleaned from external sources using relevant APIs and fed into the rate-limiting policies. If a consumer is frequently going over the rate limit, the limit can be temporarily increased so the requests go through, along with a suggestion to permanently increase the limit in the plan.
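A sketch of how external factors could temporarily raise a limit is shown below; the context keys, the multipliers and the apply_temporary_limit and suggest_plan_upgrade hooks are all assumptions for illustration:

```python
def effective_rate_limit(plan_limit: int, context: dict) -> int:
    """Raise the per-consumer limit temporarily when external factors
    (season, payroll window, location) explain the extra traffic."""
    multiplier = 1.0
    if context.get("holiday_season"):
        multiplier = max(multiplier, 1.5)
    if context.get("payroll_window"):          # end of month or week
        multiplier = max(multiplier, 1.3)
    if context.get("tourist_location"):
        multiplier = max(multiplier, 1.2)
    return int(plan_limit * multiplier)

def on_rate_limit_exceeded(consumer_id: str, plan_limit: int, context: dict) -> None:
    raised_limit = effective_rate_limit(plan_limit, context)
    if raised_limit > plan_limit:
        apply_temporary_limit(consumer_id, raised_limit)   # hypothetical gateway policy update
    else:
        suggest_plan_upgrade(consumer_id)                  # hypothetical notification to the consumer
```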

For data errors, the mechanism would provide meaningful responses along with the error codes to point out data type mismatches and suggest fixes, e.g. a number stored as a string. Fixes for format errors, e.g. malformed JSON/XML requests, can be applied automatically by suggesting the correct syntax along with a snippet.
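A minimal sketch of such a validation step is shown below; the orderId and quantity fields are purely illustrative:

```python
import json

def validate_order_payload(raw_body: str) -> dict:
    """Return the parsed payload, or a descriptive error telling the consumer
    exactly what to fix (field names here are illustrative only)."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError as exc:
        return {"error": "MALFORMED_JSON", "detail": str(exc),
                "hint": 'Check quotes and commas, e.g. {"orderId": "123"}'}
    if "orderId" not in payload:
        return {"error": "MISSING_FIELD", "field": "orderId",
                "hint": "orderId is a mandatory field"}
    quantity = payload.get("quantity")
    if isinstance(quantity, str) and quantity.isdigit():
        # A number arrived as a string; coerce it and warn the consumer.
        payload["quantity"] = int(quantity)
        payload.setdefault("_warnings", []).append(
            "quantity was sent as a string and has been coerced to a number")
    return {"ok": True, "payload": payload}
```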

Versioning issues arise when consumers use an old version of the API and the newer version has breaking changes. Using tools such as Akita, we can compare versions of API definitions to find breaking changes and inform consumers to migrate to the new version. Also, in line with continuously observing the API state, we can track API usage and alert if any breaking changes are pushed as part of the definition, proactively reducing request failures.
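A very simplified sketch of a breaking-change check between two OpenAPI documents could look like this; dedicated tools go much deeper than removed paths and operations:

```python
from typing import Dict, List

def find_breaking_changes(old_spec: Dict, new_spec: Dict) -> List[str]:
    """Naive diff of two OpenAPI documents: report removed paths or operations,
    since they would break existing consumers."""
    breaking = []
    old_paths = old_spec.get("paths", {})
    new_paths = new_spec.get("paths", {})
    for path, old_ops in old_paths.items():
        if path not in new_paths:
            breaking.append(f"Path removed: {path}")
            continue
        for method in old_ops:
            if method not in new_paths[path]:
                breaking.append(f"Operation removed: {method.upper()} {path}")
    return breaking
```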

Architecture for Self-Healing APIs

Let us now look at the architecture for self-healing APIs. The API management layer is the entry point for the architecture. The endpoints exposed are continuously observed and monitored through an API observability component. In addition, all events are logged and transported to a historical store by an event broker, which also sends events in real time to the inference engine. There is another entry point to the architecture: the incident management ticketing system. The rules engine gets the issue details from the ticketing system and calls the inference engine to analyze the event and determine the problem and its possible cause. The inference engine looks at historical data and also invokes external APIs to get the context behind the problem; it identifies events based on patterns for further evaluation, since similar events in the past could have led to larger issues. Based on the inference, the rules engine invokes the respective resolver, an API gateway extension, to resolve the issue.
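Tying the pieces together, a skeletal sketch of the inference engine and rules engine could look like the following, reusing the hypothetical types from the earlier sketches; the resolvers map stands in for the API gateway extensions, and in practice the event broker would call RulesEngine.handle for each real-time event:

```python
from typing import Callable, Dict, List

class InferenceEngine:
    """Combines the live event, historical data and external context to infer
    the likely problem and cause (stubbed here with the earlier heuristic)."""
    def infer(self, event: APIEvent, history: List, context: Dict) -> APIIssueType:
        return classify_issue(event, history, context)

class RulesEngine:
    """Receives issues from the ticketing system or the event stream and
    dispatches the matching resolver, i.e. an API gateway extension."""
    def __init__(self, inference: InferenceEngine, resolvers: Dict[APIIssueType, Callable]):
        self.inference = inference
        self.resolvers = resolvers

    def handle(self, event: APIEvent, history: List, context: Dict) -> None:
        issue = self.inference.infer(event, history, context)
        resolver = self.resolvers.get(issue)
        if resolver:
            resolver(event, context)   # e.g. raise a rate limit or serve a cached response
```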

Way Forward

Self-healing systems are the future of application systems, and APIs will be no different. Just like a human cell, API architectures will have built-in features to bring themselves back to the normal state. Observability and machine learning will be essential to implementing self-healing. The initial focus will be on healing operational issues, as resolving functional errors is domain-specific and would require far more research. We will continue to see self-healing evolve as organizations experiment before planning to adopt it.

Cross-posted at Danesh’s LinkedIn
