How to build a distributed system, maintain it, and not go mad?

Sometimes we are faced with the task of building an application that requires significant resources, high availability, and fast response times all at once. This usually means several servers communicating with each other simultaneously, which is quite a challenge for a software engineer. In the article below, I’ve compiled some tips and thoughts from my experience that will help you build and maintain a distributed system.

What is a distributed system?

Simply put, a distributed system is a group of processes that are located on different machines and communicate with each other. The term covers, among others, data warehouses, Big Data systems, and computing clusters.

The key features of any distributed system are:

  1. Scalability – administrative (regulations/organizations), geographical (the speed of light), and physical (server parameters).
  2. Fault tolerance – the whole system should keep working continuously despite failures in parts of it (see the retry sketch after this list).
  3. Openness – the architecture should allow new modules to be added to the existing system.
  4. Portability – the system should be easy to move to other servers or clouds, both for maintenance and to save costs.
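
Fault tolerance in particular is usually assembled from small, unglamorous building blocks. As a minimal sketch (in Python, with a purely hypothetical reporting endpoint), retrying a failed call with exponential backoff is one such block:

```python
import random
import time

import requests  # assumed HTTP client; any transport fails in a similar way


def call_with_retries(func, attempts=5, base_delay=0.5):
    """Call func(), retrying with exponential backoff and jitter on failure."""
    for attempt in range(attempts):
        try:
            return func()
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # out of attempts, let the caller handle the failure
            # exponential backoff plus jitter so retries do not arrive in lockstep
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))


# hypothetical usage: fetch a daily report from another process in the system
report = call_with_retries(lambda: requests.get("http://reports:8080/daily", timeout=2))
```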

Where should the project’s machines be placed?

By far the most convenient place to run your system is one of the big clouds (AWS, GCP, Azure). We don’t have to know in advance exactly how many resources we will ultimately need, which gives us physical scalability. With the cloud, we can also make the system available in several data centers at once, which helps with geographic scalability. Besides, the listed clouds allow you to describe the infrastructure as code (e.g. via Terraform). This in turn forces us to keep order in the servers, the access to them, and the firewall rules, which is already hard with several dozen machines.
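
Terraform definitions are written in HCL; as a rough Python-flavoured illustration of the same infrastructure-as-code idea, here is a hedged sketch using Pulumi’s Python SDK (an alternative tool), with a made-up machine image id and tags:

```python
import pulumi
import pulumi_aws as aws  # pip install pulumi pulumi-aws

# a single server described as code; the AMI id and size are placeholders
web = aws.ec2.Instance(
    "web-server",
    ami="ami-0123456789abcdef0",   # hypothetical machine image id
    instance_type="t3.micro",
    tags={"project": "distributed-system", "env": "staging"},
)

# export the address so it shows up in `pulumi stack output`
pulumi.export("web_public_ip", web.public_ip)
```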

If we are forced to host the system in Poland (due to a customer requirement), we can use a local cloud provider. Unfortunately, we then lose geographical scalability from the start and, as of today, some of the functionality connected with infrastructure as code. In the worst case, we will be forced to run the system on specific machines in a specific server room. That takes away the ability to scale geographically, makes physical scaling harder, and often introduces limitations related to administrative scalability.

How to choose technology for a project?

First we should collect the requirements of the system; as the brief description above suggests, it is usually a complex one. Next, we should determine the tasks the system has to perform, pre-plan the architecture, estimate the traffic between specific parts, and decide how they will communicate. The next step is research: usually a significant portion of the problems we encounter already have a solution, or a framework/library/component that solves them (e.g. Spark, Kafka, Cassandra, CockroachDB, gRPC).
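
To make that research step concrete, here is a hedged sketch of using one such off-the-shelf component from the list, via the kafka-python client; the broker address, topic name, and event fields are assumptions for illustration only:

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# broker address and topic name are assumptions for this sketch
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# publish a measurement event; downstream consumers (e.g. a Spark job) pick it up
producer.send("measurements", {"sensor_id": 42, "value": 17.3})
producer.flush()  # block until the message has actually been delivered
```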

If several technologies look like equally good solutions, we should pick the most popular one. With a sufficiently big community around a technology, we have a better chance of finding somebody who has already hit a problem similar to ours. Only if we still have a tie should we pick the one we would most like to learn. Any technology we use should be painless to replicate, to shard, or to move to another machine. Our own dedicated components should always be built as Docker images. Why? Firstly, we get an isolated environment that runs the same way on every machine. Secondly, there is no problem with deploying the software to other machines. If we have more than one Docker image, it makes sense to use a container orchestrator (e.g. Kubernetes or Docker Swarm). This in turn allows us to easily allocate resources to the containers, as well as to quickly create replicas of them when needed. In this way, we are able to maintain physical scalability.
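
As an illustration of the orchestration point, the sketch below bumps the replica count of a deployment with the official Kubernetes Python client; the deployment name and namespace are invented, and in practice this is often done declaratively or left to an autoscaler:

```python
from kubernetes import client, config  # pip install kubernetes

# load credentials from ~/.kube/config, i.e. the same context kubectl uses
config.load_kube_config()
apps = client.AppsV1Api()

# scale the hypothetical "ingest-worker" deployment to 5 replicas
apps.patch_namespaced_deployment_scale(
    name="ingest-worker",
    namespace="default",
    body={"spec": {"replicas": 5}},
)
```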

How to start building a system?

We have initially selected the technology, the architecture, and the machines on which the system will run. Since the whole will be changing continuously and the processes have lots of dependencies between each other, we should be able to check whether the system still works correctly after every deployment. Every commit added to the repository should automatically build a Docker image and pass all the tests we wrote (unit as well as integration). Continuous integration (CI) is an extremely useful tool to achieve this. Most modern code repositories offer the possibility to define CI in their environment, but sometimes it is more convenient to use an additional system (e.g. Jenkins).
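
CI systems differ in configuration syntax, so the hedged sketch below expresses a single CI job as a plain Python script instead: build the image for the commit and run the tests, failing the pipeline on any non-zero exit code. The image name and test directory are assumptions:

```python
import subprocess
import sys

IMAGE = "registry.example.com/my-service:latest"  # hypothetical image name

# build the Docker image for this commit; check=True aborts the job on failure
subprocess.run(["docker", "build", "-t", IMAGE, "."], check=True)

# run unit and integration tests; propagate their exit code to the CI system
result = subprocess.run(["pytest", "tests/"])
sys.exit(result.returncode)
```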

With larger projects, after some time we will read the code more often than we write it, and in extreme cases we will spend 80% of our time on that. Therefore, it is best to establish coding rules with the team right away so that the code stays clean and readable, and then transfer those rules into linter definitions for the programming languages of your choice. With CI already defined, it is easy to add a linter check. Unfortunately, not all rules can be expressed in a linter, so every change should also go through a code review. Without it, it will soon turn out that we are unproductive, the code is inconsistent, and the system loses the property of being open to extension.

How to deploy new versions of a system that is already live?

At some point we want to run our service in production. This can be done manually, except that manual deployment always costs a lot of time. Instead, it is a good idea to create a continuous deployment (CD) mechanism. At the beginning it is a certain time investment, but it protects us from human errors in the future (especially from forgetting some step of the deployment). Moreover, a well-written CD mechanism allows a quick rollback to a previous version in case a deployed change causes errors.
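
A very small piece of such a CD mechanism might look like the Python sketch below: roll out a new image and roll back if a smoke check fails. The kubectl subcommands are standard, but the deployment name, container name, image tag, and health URL are hypothetical:

```python
import subprocess

import requests  # used only for the smoke check


def deploy(image: str) -> None:
    """Point the hypothetical 'api' deployment at a new image and wait for the rollout."""
    subprocess.run(["kubectl", "set", "image", "deployment/api", f"api={image}"], check=True)
    subprocess.run(["kubectl", "rollout", "status", "deployment/api"], check=True)


def rollback() -> None:
    """Return the deployment to its previous version."""
    subprocess.run(["kubectl", "rollout", "undo", "deployment/api"], check=True)


deploy("registry.example.com/api:1.4.2")  # hypothetical version tag
try:
    # smoke check against a hypothetical internal health endpoint
    requests.get("http://api.internal/healthz", timeout=5).raise_for_status()
except Exception:
    rollback()
```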

With many changes, it is definitely good practice to have a staging environment handling 1% of the traffic and a production environment handling the rest (or a staging environment that duplicates production traffic, if the system only ingests data). Daily deployments should go to staging, while the configuration tested on staging should be deployed to production once a week. If we have a staging environment, it is important to rotate users between production and staging. Furthermore, with the CD mechanism in place, maintaining two (or more) environments should be basically painless.
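
One simple way to route roughly 1% of users to staging, and to rotate which users those are, is to hash a user identifier together with a rotation epoch; a minimal sketch, where the share, epoch format, and user id are assumptions:

```python
import hashlib

STAGING_SHARE = 0.01  # roughly 1% of users end up on staging


def environment_for(user_id: str, rotation_epoch: str) -> str:
    """Deterministically assign a user to staging or production.

    Changing rotation_epoch (e.g. once a month) rotates which users hit staging.
    """
    digest = hashlib.sha256(f"{rotation_epoch}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    return "staging" if bucket < STAGING_SHARE * 10_000 else "production"


# the same user keeps the same environment until the epoch changes
print(environment_for("user-1234", rotation_epoch="2024-06"))
```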

How to know if the system is alive and works correctly?

Once the system is running, something can always go wrong along the way. A machine may run out of disk space, the tasks we wrote may require too much RAM, or the code we deployed may not work correctly or efficiently. The question is: how do we find out? The first step should be to create a centralized log database, so that we can check errors, their duration, and their status in one place. One of the best open-source solutions is the Elasticsearch-Logstash-Kibana (ELK) stack. The second step is collecting the state of the system. One example of such a mechanism is Prometheus, which collects metrics from machines and containers and stores them in its own time-series database; Grafana is typically used to visualize them.
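
On the application side, exposing your own metrics for Prometheus to scrape takes a few lines with the official Python client; the metric names and port below are examples, and Prometheus still has to be configured to scrape this endpoint:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server  # pip install prometheus-client

# metric names are examples; pick names matching your own conventions
REQUESTS = Counter("app_requests_total", "Total number of handled requests")
LATENCY = Histogram("app_request_seconds", "Request handling time in seconds")

start_http_server(8000)  # Prometheus scrapes http://<host>:8000/metrics

while True:
    with LATENCY.time():                   # records how long the block takes
        time.sleep(random.random() / 10)   # stand-in for real work
    REQUESTS.inc()
```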

On the other hand, nobody can spend all their time reading and analysing system logs. We should have a mechanism to define rules for possible threats in all system components and forward information about them. An example of such a mechanism is Prometheus Alertmanager. It is important to limit the number of alerts that fire: if we receive alerts too often, we will quickly start to ignore them, which unfortunately leads to no reaction when the system is actually broken.

Summary

When building a distributed system, it is worth spending a few more hours on research rather than being forced to throw away several weeks of work. If we don’t know how to start, it is good to get familiar with the knowledge shared by more experienced software engineers, e.g. Google’s SRE book: https://landing.google.com/sre/sre-book/toc/index.html

With any bigger system it is worth having condition monitoring, centralized logs, CI/CD, high test coverage, and several production deployments already behind you. Besides, it’s a good idea to make the system as stateless and as portable as possible.

Author: Paweł Kamiński