Loosely coupled monoliths and where to find them

Andras Gerlits
Published in ITNEXT · 6 min read · Nov 23, 2022

For the last decade, the tech sector has been on a mission to decrease its dependence on central SQL-monoliths. These efforts have mostly centred on microservices, with "loose coupling" and "separation of concerns" as their guiding principles.

In an earlier series, I discussed the inherent problems in designing such systems; in this one we’ll look at the ideal outcome.

  • What went into designing such a system?
  • What are its implications for the people running and developing it?
  • What will our ongoing concerns be?
  • Is there any way to improve on these?

In centralised setups, schema- and data-changes can impact teams (and their operations) halfway around the world. In decentralised ones, blast radiuses are reduced: we gain resilience at the cost of extra software-development and maintenance complexity.

Monolithic SQL-instances provide robust solutions to exchanging data between different processes, but they are not only a data-flow bottleneck, they are also a single point of failure for the whole system. Since these instances are overwhelmingly single-datacentre solutions, these choke-points often end up defining the stability and availability of such systems.

Decentralised setups, on the other hand, are much harder to develop. A distributed system is defined by its ability to serve the requests of a shared client-API. This means that its constituent nodes must share information with each other. Since each of them must be able to 'make up its own mind', the services are usually either entirely stateless or maintain their state over multiple, independent data-stores. Two particularly hard problems revolve around what the nodes must agree on:

  • How do we describe a record such as a ‘bank-account’?
  • What is the history of these records?

The first problem is called 'data-semantics'; the second is referred to as 'consistency'. Both of these problems have come to haunt microservice and cloud-based projects, especially in hybrid configurations or when they incorporate existing, legacy sub-systems.
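
To make the first problem concrete, here is a purely hypothetical sketch: two services each hold their own idea of what a 'bank-account' is, and nothing in a loosely-coupled system forces those definitions to stay compatible. All of the names below are invented for illustration.

```java
// Two services' independent ideas of the "same" bank account.
// Nothing forces these shapes (or their meanings) to stay in sync.

// The ledger service stores the balance as a whole number of cents,
// keyed by IBAN, with an explicit currency.
record LedgerAccount(String iban, long balanceCents, String currency) {}

// The CRM service stores a floating-point amount keyed by customer ID,
// with no notion of currency at all.
record CrmAccount(long customerId, double balance) {}
```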

Loose-coupling and its "liveness" properties

Decentralised systems are often referred to as “loosely-coupled”. This means that the sub-systems presume very little about each other. Let’s do some blue-sky thinking and say that this laissez-faire approach works perfectly for us. All the features of the system can be perfectly compartmentalised in different services and we’ve solved all the data-ordering problems that can arise.

What are the remaining problems in this case? The system still needs its nodes to communicate through some messaging infrastructure, and sender and receiver must be able to talk to each other. In other words: both parties must agree on the message-structure, and the messaging platform needs to be working for the system to run as designed. Another problem any distributed system has to deal with is "delivery semantics": sending a message, transmitting it and accepting it are separate events rather than one atomic step. There's a period of uncertainty for each participant, during which they can't know for sure what "the other side" thinks the state of the exchange is, and (if not managed carefully) that can lead to all sorts of misunderstandings, which eventually manifest as bugs. Let's also sweep this issue under the rug and say that we've designed and developed our way out of this problem. What are we left with?
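
To see why delivery semantics matter in practice, here is a minimal sketch of an at-least-once Kafka consumer. The topic name and the de-duplication helpers are assumptions for illustration; the point is that because a record can be redelivered after a crash, the consumer has to be written to tolerate seeing the same message twice.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class IdempotentConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "account-service");
        props.put("enable.auto.commit", "false"); // we commit offsets ourselves
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("account-events"));
            while (true) {
                for (ConsumerRecord<String, String> record :
                        consumer.poll(Duration.ofMillis(500))) {
                    // The broker may redeliver a record we already handled,
                    // so we de-duplicate on the message key before applying it.
                    if (!alreadyProcessed(record.key())) {
                        apply(record.value());
                        markProcessed(record.key());
                    }
                }
                // Committing after processing gives at-least-once semantics:
                // a crash between apply() and the commit causes a redelivery.
                consumer.commitSync();
            }
        }
    }

    // Hypothetical helpers standing in for a durable de-duplication store.
    static boolean alreadyProcessed(String key) { return false; }
    static void apply(String value) { }
    static void markProcessed(String key) { }
}
```

Committing offsets only after processing is what makes redelivery possible in the first place; committing before processing would flip the trade-off to at-most-once, where a crash can silently drop a message.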

The Perfect System

Our rug is getting quite lumpy by now, but we finally have the perfect system. We can fulfill client requests, we are resilient to failure and we can scale our services much better than we could before. We can also bring services geographically closer to where they are used and if we’re lucky, we pulled this off just by designing and implementing better software.

To achieve this, we had to design an atomic, consistent and isolated data-exchange mechanism between our services, which (since this is a so-called client-centric solution) requires careful consideration with each code-change. I’ve never seen a distributed system come even close to this ideal state, but even if we lower the bar to “acceptable to everyone”, successes are very rare and come with a lot of extra complexity and maintenance concerns.

To put it differently, projects aiming for this will always need specialists who understand the nuances of distributed systems, and even if they manage to hire such a team, the bills for this flexibility will never stop rolling in. The price we have to pay is extra development time, slower and more uncertain release cycles, and a greater risk of system instability.

Trouble at the end of the rainbow

Our new system is clearly much better than our previous one, but it brought with it a different set of problems, which are much harder to understand. We now have a mesh of services communicating with each other through a redundant, fault-tolerant message-bus (such as Kafka or RedPanda), and the responsibility of each of these services is strictly limited to a small subset of problems within the company.

In other words: we shifted the complexity of dealing with service inter-dependence to the network and our team of (expensive) specialists. To actually put these concerns to rest, we would need some kind of a “distributed authority” to tell us about both data-semantics and record-consistency the way centralised solutions did. In monoliths, we used database schemas to enforce the first, and ACID properties for the second. We don’t want to reintroduce the problem we just solved by putting back the monolith, so we need to look for an alternative.
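
For contrast, this is roughly what the monolith gave us for free. The sketch below (table, columns and connection URL are placeholders) shows both authorities at work: the shared schema defines what an account is, and a single ACID transaction keeps two related updates consistent.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class MonolithTransfer {
    public static void transfer(String from, String to, long cents) throws SQLException {
        // One shared schema defines what a "bank account" is (data-semantics)...
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://db/bank")) {
            con.setAutoCommit(false);
            try (PreparedStatement debit = con.prepareStatement(
                         "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = con.prepareStatement(
                         "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setLong(1, cents);
                debit.setString(2, from);
                debit.executeUpdate();
                credit.setLong(1, cents);
                credit.setString(2, to);
                credit.executeUpdate();
                // ...and one ACID commit makes both updates visible atomically
                // (consistency), or neither of them if anything fails.
                con.commit();
            } catch (SQLException e) {
                con.rollback();
                throw e;
            }
        }
    }
}
```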

Hey, what’s that in the corner?

At the heart of the reliability of our system is the promise of our message-bus to stay resilient, even in the face of different kinds of failures. Kafka (for example) relies on a “consensus-protocol” called “Raft” to achieve this property. Since our system is only as resilient as its ability to communicate internally, our “distributed authority” should also rely on the same guarantees. This is exactly how our system works.

We’ve built an entirely new kind of service to address these concerns. It uses the established message-bus and fixes both “shared semantics” and “consistency” problems by allowing the sharing of SQL-tables between different SQL-servers over the message-bus. Since each microservice already relies on its own datastore, and since they talk to each other over the message-bus, all the essential elements are already present in the project.

Original solution

In the illustration above, there are two microservices: one in London and one in New York. They talk to each other over Kafka and rely on their internal SQL databases (Postgres in LDN, MySQL in NY). They suffer from both the consistency and the data semantics problems, since they are fully independent.

Solution with Omniledger modules

We introduce a number of new elements into the architecture, which maintain our "distributed authority" over the Kafka bus. Microservices write into their own SQL servers. The changes are automatically picked up by the JDBC driver sitting between the app and the database. These are then communicated through Kafka to the Omniledger Platform nodes, where they are synchronised and any potential conflicts are resolved. New information created by other microservices is introduced to the applications through the "Omniledger sync" modules sitting close to the app instance.
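
From the application's point of view, this is meant to be a wiring change rather than a code change. The sketch below is only illustrative: the driver's URL scheme is a made-up placeholder, not Omniledger's actual API. What it shows is that the service keeps issuing ordinary JDBC calls against what looks like its local database, while the interposed driver takes care of publishing committed changes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class NewYorkService {
    public static void updateAccount(String id, long newBalanceCents) throws SQLException {
        // Hypothetical URL: an interposing driver wraps the real MySQL connection,
        // publishes committed changes to Kafka and applies remote changes locally.
        String url = "jdbc:omniledger:mysql://localhost:3306/accounts";

        try (Connection con = DriverManager.getConnection(url);
             PreparedStatement ps = con.prepareStatement(
                     "UPDATE accounts SET balance = ? WHERE id = ?")) {
            ps.setLong(1, newBalanceCents);
            ps.setString(2, id);
            ps.executeUpdate();
            // From the application's point of view this is still plain SQL;
            // replication and conflict resolution happen behind the driver.
        }
    }
}
```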

In short: if the microservice in New York decides to update a record which is used by both London and New York, it only needs to commit this information locally for the changes to be reconciled between both services. If the London microservice successfully updated the same record while New York was working on it, the affected transaction in New York will have to be rolled back, just as it does today when it hits an optimistic lock. In other words: developers don't need to give any thought to data-conflicts, as the Omniledger Platform nodes deal with them. They don't need to introduce new components into their system, since the solution works through the existing infrastructure. They don't need to learn the ins and outs of distributed systems, as those issues are reduced back to basic SQL operations.
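
The one failure mode developers do see is the familiar one: a transaction that lost the race is rolled back, and the standard remedy is a small retry loop, exactly as with optimistic locking today. A generic JDBC sketch (the specific exception a driver raises for a lost conflict will vary):

```java
import java.sql.Connection;
import java.sql.SQLException;

public class ConflictRetry {
    // Retries a unit of work a few times if the transaction is rolled back
    // because another service committed a conflicting change first.
    public static void runWithRetry(Connection con, SqlWork work) throws SQLException {
        for (int attempt = 1; ; attempt++) {
            try {
                con.setAutoCommit(false);
                work.run(con);
                con.commit();
                return;
            } catch (SQLException lostConflict) {
                con.rollback();
                if (attempt == 3) {
                    throw lostConflict; // give up after a few attempts
                }
            }
        }
    }

    @FunctionalInterface
    public interface SqlWork {
        void run(Connection con) throws SQLException;
    }
}
```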

Since the platform only relies on information established by Kafka, each node (both platform- and sync-modules) can be stopped and restarted without impacting the availability of the system. As long as Kafka is working, our platform is also available.

The simplicity of a monolith with the benefits of a best-in-class microservice solution. A mythical beast for sure.

In our next article, we'll take a look at how these concepts work in practice, i.e. what a microservice-developer's workflow looks like. In the meantime, check out the research underlying the project, drop us an email at info@omniledger.io, and follow me on Medium and on Twitter (@AndrasGerlits) for progress updates.

Writing about distributed consistency. Also founded a company called omniledger.io that helps others with distributed consistency.