Something I really like about living in the city is the fact that it is made for the masses. Despite its many defects (the rain not being one), Seattle is architected to enable hundreds of thousands of people to go through their busy days. It has a transportation system that interconnects different areas, it mandates land-use policies for parks, residences, businesses and schools, and it provides restricted parking zones. It is designed for walking (assuming you like hills), it provides easy access to hospitals, and it is guarded by police and fire departments.
But Seattle wasn’t initially a big city; its growth is an ongoing work in progress. Like many other cities, including the almighty New York, Seattle is constantly being developed and re-planned so it can scale to support even more people. It needs more efficient transportation (think subway), bigger highways, more parking, and more recreational areas and residential zones.
The similarities between city planning and software engineering are fascinating to me; they are well described by Sam Newman in his book “Building Microservices”. Just like cities started as small towns, most services started as simple servers sitting under someone’s desk, processing a few hundred requests per day. Given some time, and if the idea is right, a service may become popular. That’s a great thing to happen, except that the challenge of increasing demand quickly turns into the problem of dealing with higher customer expectations. This is similar to how we expect better transportation and more effective policing when a town evolves into a city.
I personally like the story of Twitter’s Fail Whale. As Yao Yue tells it, the whale is a story of growing up. Twitter started with a monolithic service called Monorail, which can be pictured as a gigantic box of functionality with enormous scope. While that might be the best way to get started, Monorail soon became too complex to reason about. Availability and performance problems surfaced as Twitter’s engineering team grew into a much larger group that was constantly adding features. Fixing this required more resources and a more principled architecture with better failure handling. Twitter nicely covered up its system errors with the image of a failing whale.
If any of that sounds familiar, perhaps one should consider following Twitter’s approach of embracing a microservices-based architecture. Evolving a monolithic architecture into a set of microservices is about splitting a big I-can-do-everything box into more manageable boxes with scoped responsibilities. It is also about splitting teams into smaller teams that can each focus on a subset of those services. The resulting services are autonomous, like the teams who manage them, so they can be deployed independently.
Getting a microservice architecture right is quite an engineering journey; it requires discipline and patience. The Microservices Maestro (i.e. you) must be able to orchestrate separation of concerns, overall system cohesion, graceful degradation under failure, security and privacy.
I have found that establishing a few key architecture principles up front, and ensuring they are upheld, can help alleviate some of these challenges.
From the engineering and operational readiness standpoints:
- The ultimate test environment. There are many strategies for testing in production. One of my favorites is canary testing, a common practice in which less-stable code is exposed to only a small percentage of production traffic and “baked” for a predetermined period of time before its broad roll-out (a minimal sketch appears after this list).
- Continuous deployment. Ideally, the time it takes from check-in to production is hours, not months. Getting there usually requires building a fully automated deployment pipeline that is able to safely and reliably make progress.
- Watch and learn. Any feature that does not include basic monitoring of its performance, its resource utilization or its unexpected failure cases cannot be taken seriously; releasing it is like flying blind, hoping it will just work. In a system that handles millions of requests per minute, something that is supposed to fail only 0.01% of the time will still fail hundreds of times every minute and will require almost immediate remediation. Hope is not a strategy in software engineering, but monitoring services and learning from their data is.
- Incremental improvements. I usually avoid waiting for the perfect system to be implemented. It is better to release something small and improve it over time.
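To make the canary idea concrete, here is a minimal Python sketch of percentage-based traffic splitting, assuming a simple hash-based bucketing scheme; the 1% threshold, function names and request keys are illustrative, not any particular load balancer’s API.

```python
import hashlib

# Hypothetical canary router: the threshold and names are illustrative.
CANARY_PERCENT = 1  # route ~1% of production traffic to the canary build

def pick_backend(request_key: str) -> str:
    """Return which deployment should serve this request."""
    # Hash a stable key (e.g. a user id) so the same caller keeps hitting
    # the same deployment for the whole bake period.
    bucket = int(hashlib.sha256(request_key.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < CANARY_PERCENT else "stable"

if __name__ == "__main__":
    sample = [pick_backend(f"user-{i}") for i in range(10_000)]
    print("canary share:", sample.count("canary") / len(sample))
```

Once the canary’s error rates and latency look healthy for the agreed bake time, the percentage can be raised until the new build serves all traffic.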
From the internal architecture standpoint:
- The force of autonomy. It may sound counterintuitive, but sharing code that is tightly coupled to business logic across services is an antipattern and should be avoided in favor of autonomy and system decentralization.
- Embrace contract changes. Expecting that a contract between two services will never change is not realistic. Versioning support, backwards compatibility and upgradeability are essential for the long-term stability of any feature (a small example follows this list).
- Go async. Services that spend resources waiting for responses can be wasteful. In the async world, communication follows event-based models built on pub/sub services that don’t wait. There is no free lunch, though: diagnosing problems in event-based systems has proven to be challenging (see the sketch after this list).
- Don’t store what can be re-learned. Stateless services are easy to reason about. In many cases they can be rolled back with minimum risk if their most recent version proves to be unstable. Rolling back stateful services, on the other hand, may cause multiple side effects that are hard to understand and increase the chance of data corruption. Furthermore, persisted data typically requires consistency models, high availability and quick retrieval.
- Be resilient. I really like the idea of Netflix’s Chaos Monkey, a service that constantly triggers controlled failures in Netflix’s services in order to ensure that they all have sound failure-isolation guarantees. Michael Nygard’s book Release It! recommends circuit breakers, a protection mechanism against fault propagation that is worth exploring. In short, a circuit breaker cuts off further communication to a service that is unavailable so that queues of requests don’t pile up; communication is restored automatically once the external service is back online (a bare-bones sketch follows this list).
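To illustrate the contract-change point, here is a small Python sketch of a tolerant consumer that accepts two versions of the same message; the field names and the schema_version key are assumptions made for this example, not a real service’s schema.

```python
# Hypothetical "tolerant reader": old v1 payloads keep working while v2
# callers can send the richer shape. Both versions normalize to one format.

def parse_order(payload: dict) -> dict:
    version = payload.get("schema_version", 1)
    if version == 1:
        # v1 shipped a single free-form name field.
        first, _, last = payload["customer_name"].partition(" ")
        return {"first_name": first, "last_name": last,
                "items": payload["items"], "currency": "USD"}
    # v2 split the name and added an optional currency with a safe default.
    return {
        "first_name": payload["first_name"],
        "last_name": payload["last_name"],
        "items": payload["items"],
        "currency": payload.get("currency", "USD"),
    }

print(parse_order({"customer_name": "Ada Lovelace", "items": ["book"]}))
print(parse_order({"schema_version": 2, "first_name": "Ada",
                   "last_name": "Lovelace", "items": ["book"]}))
```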
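For the “go async” item, here is a minimal in-process pub/sub sketch using asyncio; a production system would sit on a broker such as Kafka or a managed queue, and the topic and handler names here are made up, but the shape is the same: the publisher emits an event and moves on instead of blocking on a response.

```python
import asyncio

# topic name -> list of async handlers
subscribers: dict[str, list] = {}

def subscribe(topic: str, handler) -> None:
    subscribers.setdefault(topic, []).append(handler)

async def publish(topic: str, event: dict) -> None:
    # Fire-and-forget: schedule each handler and return immediately.
    for handler in subscribers.get(topic, []):
        asyncio.create_task(handler(event))

async def send_receipt(event: dict) -> None:
    await asyncio.sleep(0.1)  # stand-in for calling a downstream service
    print("receipt sent for order", event["order_id"])

async def main() -> None:
    subscribe("order.created", send_receipt)
    await publish("order.created", {"order_id": 42})
    print("publisher moved on without waiting for the subscriber")
    await asyncio.sleep(0.2)  # give the handler time to finish before exiting

asyncio.run(main())
```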
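And for the circuit breaker, a bare-bones Python sketch in the spirit of what Release It! describes; the thresholds and names are illustrative, and real implementations add richer state handling, metrics and concurrency safety.

```python
import time

class CircuitBreaker:
    """Fail fast when a downstream dependency keeps failing."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures   # consecutive failures before opening
        self.reset_after = reset_after     # seconds to wait before a trial call
        self.failures = 0
        self.opened_at = None              # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Fail fast instead of queueing requests behind a dead service.
                raise RuntimeError("circuit open: downstream unavailable")
            self.opened_at = None          # half-open: allow one trial request
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0                  # a success closes the circuit again
        return result
```

A caller would wrap its outbound requests, for example `breaker.call(fetch_user, user_id)` with a hypothetical `fetch_user` function; once the dependency misbehaves repeatedly, callers start failing fast instead of tying up threads and filling request queues.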
In general, I think scale is hard and often underestimated; it is an area where I expect dramatic innovation over the next few years.