
The Problem with Microservices: ‘Deep Systems’ 

The move away from monoliths and toward new tools and processes, such as cloud, microservices and Kubernetes, has changed the IT game. IT processes today are far more agile than they were a few years ago, and that means new opportunities for developers. But it’s also introduced new problems that can drain resources and make life hellish for developers and IT teams. One of these is a phenomenon called “deep systems,” and it’s poorly understood.

What are deep systems? Enterprises have come to realize that their tech stack is their future, so they employ massive teams of skilled developers to build the software applications that put them (and keep them) on the map. The trouble is that there’s a limit to how many developers can work on a single application at once. So the industry divided applications into hundreds of small pieces called “microservices” – separately managed services that can be developed, deployed and operated independently.

In this new world, individual microservices belonged to small, agile teams that were able to operate autonomously while still collaborating with neighboring services and teams as a single business unit. This should have made developers’ lives easier and more exciting. But microservices had unintended consequences. Ordinary systems began to scale deep as independently managed layers (composed of microservices, monoliths and managed cloud services) were added to end-to-end application stacks. The result was the birth of “deep systems” – architectures with four or more layers of independently operated services.

Problems for Developers

By design, microservice architectures limit a developer’s scope of control to the service they’re working on; typically, the only services they are authorized to change or deploy are the small subset they maintain. Unfortunately, developers are still held responsible for the performance and functionality of the entire stack below them, not just the part they work on. A small change made by the team controlling “Service A” could degrade the performance of “Service D,” which leaves the developers of Service D accountable for Service A’s performance and reliability despite having no control over it. And the very structure of deep systems impairs visibility: it’s nigh-on impossible for developers to get the context they need to understand how one microservice affects all the others (not to mention the experience of the app’s end user). As systems grow ever deeper, this disconnect between a developer’s scope of responsibility and scope of control keeps widening.

As a result, it’s not uncommon for developers to spend days, weeks, even months pinpointing the cause of performance issues and unexplained regressions. That means on-call shifts and endless troubleshooting, with little or no time left for innovation. Is it any wonder developers find themselves overworked, burned out and frustrated with their teams?

How to Navigate Deep Systems

The standard answer to this conundrum is “better observability.” For some, that boils down to the cliché of the “Three Pillars of Observability”: logs, metrics and tracing. Typically, this is thought of as three separate dashboards all meant to provide visibility into the same underlying system. Sometimes, IT teams even use three separate solutions! But when you’re looking at deep systems, these pillars can’t support real observability. Don’t get me wrong: logs, metrics, and tracing are all relevant, but they are “the telemetry,” not “the observability.” And beyond that, the industry fundamentally misunderstands the role of tracing, which typically shows the activity for an individual transaction or request within the application being monitored.

The truth is that tracing is not the “third pillar.” With deep systems, the volume of tracing data is so large that individual traces are rarely enough to reveal the most valuable and actionable patterns buried within it. Instead, tracing should form the groundwork of unified observability: only trace aggregates can provide the context needed to understand the complexity of deep systems. What’s more, developer teams should never consider metrics, logs and traces individually. Treating them as separate products and deploying them in three parallel tabs will only lead to context-switching and confusion.
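To make “trace aggregates” concrete, here is a toy sketch in Python – with invented field names and a hand-rolled span list rather than any real tracing backend – of the basic idea: rather than reading one trace at a time, group span durations from many traces by service and operation, then rank the groups by tail latency.

    # Toy sketch of trace aggregation: group span durations by
    # (service, operation) and rank the groups by high-percentile latency.
    # The span records and field names here are illustrative only.
    from collections import defaultdict
    from statistics import quantiles

    spans = [
        {"service": "service-a", "operation": "GET /cart", "duration_ms": 12.0},
        {"service": "service-d", "operation": "db.query", "duration_ms": 480.0},
        {"service": "service-d", "operation": "db.query", "duration_ms": 35.0},
        # in practice, millions of spans drawn from across the whole stack
    ]

    def p95(durations):
        if len(durations) < 2:
            return durations[0]
        return quantiles(durations, n=20)[-1]  # last cut point ~ 95th percentile

    by_group = defaultdict(list)
    for span in spans:
        by_group[(span["service"], span["operation"])].append(span["duration_ms"])

    # Patterns like "db.query in service-d dominates tail latency" surface here,
    # even though no single trace would make that obvious on its own.
    for (service, operation), durations in sorted(
            by_group.items(), key=lambda kv: p95(kv[1]), reverse=True):
        print(f"{service:10s} {operation:12s} p95={p95(durations):7.1f} ms n={len(durations)}")

A real system would of course do this over millions of spans and many more dimensions (customer, region, release version), but the principle is the same: the insight lives in the aggregate, not in any individual trace.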

Instead, developers should use open, portable, high-performance instrumentation, such as OpenTelemetry or OpenTracing, to gather traces, logs and metrics. Additionally, developers should build an observability strategy around specific use cases – the three most important being (1) reliable, high-velocity service deployments, (2) improvements to steady-state performance and reliability, and (3) reducing Mean Time To Repair (MTTR) for production incidents. If you disconnect observability from these fundamental use cases, it’s easy to drown in the details of the telemetry without actually solving the hair-on-fire problems that microservice developers deal with every day.
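As a rough illustration of what that instrumentation can look like, here is a minimal sketch using the OpenTelemetry Python API. The service, span and attribute names are hypothetical, and the SDK/exporter configuration that decides where the telemetry actually goes is omitted.

    # Minimal OpenTelemetry sketch: one request handler emitting a span and a
    # metric. Without an SDK and exporter configured, these calls are no-ops.
    from opentelemetry import trace, metrics

    tracer = trace.get_tracer("checkout-service")   # hypothetical service name
    meter = metrics.get_meter("checkout-service")
    checkout_counter = meter.create_counter(
        "checkouts", description="Checkout requests handled"
    )

    def handle_checkout(order_id: str, cart_total: float) -> None:
        # Each request becomes a span; the attributes attached here are the
        # dimensions later used to slice trace aggregates.
        with tracer.start_as_current_span("handle_checkout") as span:
            span.set_attribute("order.id", order_id)
            span.set_attribute("order.total", cart_total)
            checkout_counter.add(1, {"endpoint": "/checkout"})
            # business logic and downstream calls go here

The point is not this specific API but that traces, metrics and correlated logs can come from a single instrumentation layer, which keeps all three kinds of telemetry tied to the same requests instead of living in three disconnected tools.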

These deep systems aren’t going anywhere. As businesses continue to rely even more heavily on microservices and other emerging technologies, systems will only grow deeper. But we can help developers work through the complexity with simple observability. This will allow them to quickly identify and address application performance issues in deep systems so they spend less time troubleshooting and more time developing quality software. Businesses rely more on their developers every day. Let’s give them the new tools they need to actually do their job.

Ben Sigelman is CEO and co-founder of LightStep.
