Before we can talk about scaling platforms and growing your application to handle whatever requirements you throw at it, we first need to understand the different factors that can trigger the need to scale. These factors don’t have to act together in a perfect-storm scenario to become a real headache; just one of them is enough to force your hand, and you either start scaling or say goodbye to the stability of your system.
In this chapter, I’ll go over the most common scaling triggers that might pop up during your application’s development lifecycle, or even after going live.
External Factors
External factors are the ones you can’t control. Yes, they can be expected, and you can and should plan accordingly. You can even predict them, given the right amount of data. But you don’t really have a say as to whether they happen or not.
The two most common external factors that will trigger your need to scale are a considerable change in the traffic your application receives and an increase in the data you need to process.
Let’s quickly go over them individually.
Traffic Increase
This is probably the most obvious and common case, the one where getting exactly what you asked for makes you regret ever asking for it.
Let’s say you made a post on Hacker News about your brand-new app and it suddenly got the attention of way too many people. Or maybe you published a new mobile app to Google’s Play Store and it got featured, and now your APIs are receiving 400% more traffic than you expected for the first five months. Or maybe your massively multiplayer online game is now popular, and even though your game servers are able to handle the load, your tiny log server is crashing every two hours from the amount of internal traffic it’s receiving.
With any of these, you now have a problem. It might directly affect your entire application, or it might be a challenge for only part of it, but you will have to fix it if you want your platform to perform as expected.
An increase in incoming traffic could affect your system in different ways; we can describe these as direct or indirect.
Direct Effects
The most obvious direct effect is overloading your servers’ capacity to handle incoming traffic. No matter how good your server hardware is, if you have only one server (or a limited number of them), you’re limited by it. Even if you were running a web application on a very well-configured Apache server, making the most of your resources, your capacity to handle traffic would still be bounded by the number of processors and the amount of RAM you paid for. There is no way around it.
In particular, Apache httpd (in its traditional prefork configuration) dedicates a separate process to each connection, so a spike in concurrent requests can make this scenario get out of hand quickly. Nginx, by contrast, takes a non-blocking, event-driven I/O approach (much like Node.js), so it can manage high levels of traffic with roughly constant memory consumption. With this in mind, swapping one for the other might seem like a good idea, but eliminating the bottleneck at the web server often just exposes the next one inside your own application.
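To make the contrast concrete, here is a minimal sketch of the event-driven model Nginx and Node.js share: a single process accepts every connection and handles the slow parts asynchronously instead of dedicating a process per request. This is purely illustrative; the port number and the simulated 100 ms of I/O are arbitrary choices, not anything the text above prescribes.

```typescript
import { createServer } from "node:http";

// A single-process, event-driven server: no new process or thread per request.
// Slow work (here, simulated I/O) is scheduled asynchronously, so many
// concurrent connections share one event loop and roughly constant memory.
const server = createServer((_req, res) => {
  // Simulate a 100 ms non-blocking I/O call (e.g., a database query).
  setTimeout(() => {
    res.writeHead(200, { "Content-Type": "text/plain" });
    res.end("handled without spawning a new process\n");
  }, 100);
});

server.listen(3000, () => {
  console.log("Listening on :3000 with a single event loop");
});
```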
In the following chapters I’ll go over different techniques to overcome this, but rest assured, they will at some point imply spending more money on more hardware.
Indirect Effects
An increase in traffic can affect your application indirectly by overloading one of your internal processes. In a microservices-based architecture, the communication between services needs to be carefully planned. The fact that you’re capable of handling the increased traffic on your user-facing service doesn’t mean the rest of your architecture will be able to handle it.

Before and after a traffic increase affecting your platform indirectly through the log server
Another possible indirect effect occurs when an increase in traffic starts affecting the performance of a service you share with other platforms. This is what is normally known as service degradation: the service is still active and working, but it responds more slowly than usual.
In this scenario, your lack of planning and scaling capabilities starts affecting everyone who uses your service. This is why, whenever you depend on third-party services, you always want some explicit assurance (typically an SLA) about how much degradation, if any, you should expect from them.
Increased Processing Power Required
A need for increased processing power could be related to the previous case, and sometimes can even be caused by it; but it can also develop independently, which is why it’s worth discussing as a whole different category.
Specifically, this is purely a resource-related problem; you’re trying to process more information than your current resource utilization technique allows you to (“trying to bite off more than you can chew”). Notice that I didn’t blame the server directly; instead, I’m sharing the blame between you and your server.
In this scenario, you’re trying to do something with one or more sources of data, and suddenly, they start providing considerably more data than you expected. And when this happens, things can go wrong in one of two ways: your service may be degraded, or it may crash completely.
Your Service Is Degraded
In the best-case scenario, even though you weren’t completely prepared for the increase, your architecture and code are capable of coping with it. You’re obviously being negatively affected by it, but your service is still running, although slower than usual. Once again, you’ve got a mob of angry users. That’s right; they’re coming, especially anyone who is paying for your services and suddenly not getting what they’re paying you for.
In an ideal world, you don’t want this scenario to happen, of course; you want your platform to be able to handle any kind of increase in the size of your data sources, and I’ll cover that in future chapters. But trust me—compared to the alternative, you’ve got it easy.
Your Service Is Dead
If you were so naive as to think a complete failure would never happen (it happens to the best of us), then most likely your system will end up in this category. Your service is crashing every time it tries to process the new data; and what’s even worse, if the source of the data has some kind of retry mechanism, or you automatically start reprocessing the data after a restart, your system is going to keep crashing, no matter how many times you auto-restart it.
Is your platform a black monolith of code? Then your whole platform is doomed (of course, it was already doomed if you went with a monolithic approach).
Is your service or its output used directly by your clients? You’re definitely in trouble here. If you’re designing an app/service/platform/whatever that people need to pay to use, you definitely need to think about scaling techniques for your first production version, no matter what.
Is your service used internally by your own platform? Maybe you’re in luck here, especially if your platform is capable of recovering from a failed service.
Is your service logging during this endless crash-and-restart loop? If it is, the faulty service’s own logging traffic might compromise your logging system. Now an increase in processing needs has caused an increase in internal traffic, and you have two problems, maybe more if your centralized logging system can’t scale either. At that point you’ve compromised every component of your platform that needs to save a log (which ideally would be all of them). See where this is going?
These are some of the most common external factors that might trigger the need to scale on your platform. But what about your own requirements for the platform? Let’s call them internal factors.
Internal Factors
Internal factors are closely related to the external ones just discussed, but instead of being imposed on us from the outside, they’re goals we pursue ourselves. That’s because they provide real benefits to the application, even though they require extra work (and sometimes not only from a scaling perspective). They’re also traits you should always aim for in your architectures, unless your application provides a service that no one depends on in any sensitive way.
Of course, I’m talking about fault tolerance (FT) and high availability (HA). At first glance, these two terms might seem to describe the same thing, but they’re slightly different concepts. Let’s go a bit deeper into each one.
High Availability
For an architecture to be highly available, it must ensure that whatever service it provides is always reachable and does not lose performance, despite internal problems (such as the loss of a processing node). The availability of a system is usually expressed as its uptime, and commercially, service providers commit to it in an SLA (Service Level Agreement), measured in “nines” of availability. For instance, Amazon commits to three nines of availability for its S3 service, which in practice means 99.9% monthly availability; put another way, if the service is down for more than 43.2 minutes in a month, Amazon owes its customers service credits. More critical services, such as mobile carriers, aim for five nines of availability, which translates to 99.999% uptime, or roughly 5.26 minutes of allowed downtime per year.
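To put those numbers in perspective, here is a rough back-of-the-envelope helper (my own illustration, not any SLA math published by a provider) that converts an availability target into allowed downtime; the results line up with the 43.2 minutes per month and roughly 5.26 minutes per year mentioned above.

```typescript
// Allowed downtime for a given availability target.
const MINUTES_PER_MONTH = 30 * 24 * 60; // 43,200 (using a 30-day month)
const MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600

function allowedDowntime(availability: number): { perMonthMin: number; perYearMin: number } {
  const downFraction = 1 - availability; // e.g., 0.001 for three nines
  return {
    perMonthMin: downFraction * MINUTES_PER_MONTH,
    perYearMin: downFraction * MINUTES_PER_YEAR,
  };
}

console.log(allowedDowntime(0.999));   // ~43.2 minutes/month ("three nines")
console.log(allowedDowntime(0.99999)); // ~5.26 minutes/year  ("five nines")
```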
There are several techniques that can be used to achieve this, but the most common one is the master-slave pattern.

The failure on “Master M2” does not affect the entire system
The final component of the master-slave model is the monitoring service, which makes sure all master nodes work correctly.
The second half of Figure 1-2 shows what happens when one of the nodes fails (in this case M2). Its slave node is promoted to master (essentially taking the place of M2, connecting M1’s output into its input and its output into M3’s input).
In practice, you normally don’t need to worry about switching the connections yourself, or even about monitoring the nodes manually; load balancers are usually placed between the nodes to do exactly that. They act as “fixed” points of connectivity and decide on their own (based on a set of conditions you configure on them) whether or not to promote a slave.
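To give a feel for what that monitoring piece does, here is a heavily simplified sketch of a health-check loop that promotes a slave when its master stops answering. The node interface, the ping() and promote() methods, and the five-second interval are all invented for the example; a real load balancer or cluster manager does considerably more than this.

```typescript
// A heavily simplified monitor: ping each master, and if one stops answering,
// promote its slave so traffic can keep flowing through the chain.
interface ClusterNode {
  name: string;
  ping(): Promise<boolean>;   // true if the node answers its health check (hypothetical)
  promote(): Promise<void>;   // turn a slave into a master (hypothetical hook)
}

interface MasterSlavePair {
  master: ClusterNode;
  slave: ClusterNode;
}

async function monitor(pairs: MasterSlavePair[]): Promise<void> {
  for (const pair of pairs) {
    const healthy = await pair.master.ping().catch(() => false);
    if (!healthy) {
      console.log(`${pair.master.name} is down, promoting ${pair.slave.name}`);
      await pair.slave.promote();
      // The slave takes the failed master's place in the chain; rewiring the
      // neighbors' connections is exactly what a load balancer hides from you.
      pair.master = pair.slave;
    }
  }
}

// Run the check periodically, e.g. every five seconds:
// setInterval(() => monitor(pairs), 5_000);
```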
That works well if your nodes are simple processing nodes; in other words, if you can simply exchange a master for a slave without any loss of data or any kind of information on your platform. But what happens if your nodes are part of a storage system, like a database? This scenario is slightly different from the previous one, because here, you’re not trying to avoid a lack of processing power, or a processing step in your flow. You’re trying to prevent data loss, without affecting your performance at the same time.

Data replication between master and slaves to avoid data loss
In some cases, like Redis, slaves are not just sitting there waiting to be promoted: they can also serve read-only queries, shedding some load off their masters, which in turn take care of all the write operations (with Redis Sentinel providing the monitoring and automatic promotion on top).
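That read/write split can be sketched roughly as follows. The KeyValueClient interface here is hypothetical and only assumes get/set-style calls; with an actual Redis client you would point one connection at the master and one at each replica in essentially the same way.

```typescript
// Hypothetical minimal client interface; it stands in for whatever driver you use.
interface KeyValueClient {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

// Writes always go to the master; reads are spread across the read-only slaves.
class ReadWriteSplitter {
  private next = 0;

  constructor(
    private master: KeyValueClient,
    private replicas: KeyValueClient[],
  ) {}

  async write(key: string, value: string): Promise<void> {
    await this.master.set(key, value);
  }

  async read(key: string): Promise<string | null> {
    // Simple round-robin over the replicas, shedding load from the master.
    const replica = this.replicas[this.next % this.replicas.length];
    this.next += 1;
    return replica.get(key);
  }
}
```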

Topology change on a replica set once the primary node fails

HDFS high availability setup
As a side-note, all DataNodes need to send heartbeats to all NameNodes to ensure that if a failover is required, it will happen as fast as possible.
As you can see, the master-slave approach can have slightly different implementations, but if you dig deep enough, they all end up being the same idea.
Let’s take a look now at fault tolerance, to understand how it differs from HA.
Fault Tolerance
You can think of fault tolerance as a less strict version of HA. The latter was all about keeping the offline time of your platform to a minimum and always trying to keep performance unaffected. With FT, we will again try to minimize downtime, but performance will not be a concern—in fact, you could say that degraded performance is to be expected.
That being said, the most important difference between these two is that if an error occurs during an action, a highly available system does not ensure the correct end state of that action, while a fault-tolerant one does.
For example, if a web request is being processed by your highly available platform and one of the nodes crashes, the user making that request will probably get a 500 error back from the API, but the system will remain responsive for subsequent requests. In the case of a fault-tolerant platform, the failure will somehow (more on this in a minute) be worked around, and the request will finish correctly, so the user gets a valid response. The second case will most likely take longer, because of the extra steps involved.
This distinction is crucial because it will be the key to understanding which approach you’ll want to implement for your particular use case.
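Here is a minimal sketch of that distinction, using a hypothetical callService() helper that sends a request to a single node: the “highly available” handler lets a failed request surface as an error (the platform stays up for the next request), while the “fault-tolerant” handler works around the failure by retrying against other nodes, at the cost of extra latency.

```typescript
// callService() is a hypothetical helper that sends the request to one node.
type CallService = (node: string, payload: unknown) => Promise<string>;

// HA-style handling: if the node we hit crashes, this request fails (HTTP 500),
// but nothing here prevents the next request from succeeding on a healthy node.
async function handleHA(call: CallService, nodes: string[], payload: unknown): Promise<string> {
  return call(nodes[0], payload); // any error propagates straight to the caller
}

// FT-style handling: work around the failure by trying the remaining nodes,
// so this request still finishes correctly, just more slowly.
async function handleFT(call: CallService, nodes: string[], payload: unknown): Promise<string> {
  let lastError: unknown;
  for (const node of nodes) {
    try {
      return await call(node, payload);
    } catch (err) {
      lastError = err; // this node failed, fall through to the next one
    }
  }
  throw lastError; // every node failed; even the fault-tolerant path gives up
}
```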
Usually, fault-tolerant systems try to catch the error at its source and deal with it before it becomes critical. A classic example is mirroring hard drives: if one of them fails, the system keeps running on its mirror, instead of a single drive failure forcing you to replace the entire server and interrupting whatever actions that server was performing at the time.
Hardware-level fault tolerance is beyond the scope of this book; here, I will cover some of the most common techniques used to ensure FT at a software level.
Redundancy
One way to design fault-tolerant architectures is by incorporating redundancy into your key components. Essentially, this means having two or more components performing the same task, plus some form of checking logic to determine when one of them has failed and its output needs to be ignored.
This is a very common practice for mission-critical components, and it can be applied to many scenarios.
For example, in 2012, SpaceX sent its Dragon capsule to berth with the International Space Station. During the ascent, the Falcon 9 rocket suffered a failure in one of its nine Merlin engines; thanks to the redundancy built into the design, the onboard computer was able to reconfigure the remaining eight engines to ensure the success of the mission.
Because these systems are so complex to code and to test, the cost-benefit ratio is not always something the normal software project can handle. Instead, these types of systems are usually present in critical applications, where human lives might be at risk (such as air traffic controllers, rocket guidance systems, and nuclear power plants).
Let’s go over some techniques to provide software redundancy and fault tolerance.
Triple Modular Redundancy

Generic example of a triple modular redundancy system
This is a particular implementation of N-modular redundancy, where, as you might have guessed, you can add as many parallel modules as you need in order to provide a higher degree of fault tolerance for a given component. A particularly interesting real-world use of this type of solution (in that case, a 5-modular redundancy system) is the FlexRay system.
FlexRay is a network communication protocol used in cars; it was developed by the FlexRay Consortium to govern onboard car computing. The consortium disbanded in 2009, but the protocol became a standard. Cars such as the Audi A4 and BMW 7 series use FlexRay. This protocol uses both data redundancy, sending extra information for problem detection purposes as metadata in the same messages, and structural redundancy in the form of a redundant communication channel.
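Stripped of the hardware details, a triple modular redundancy setup comes down to running the same computation on three independent modules and letting a voter pick the majority result. The sketch below is a toy version of that voter, with the “modules” reduced to plain functions; in a real system they would be physically, or at least logically, isolated from one another.

```typescript
// Run the same computation on three independent modules and vote on the result.
// If one module misbehaves, the other two still outvote it.
function tmr<T>(modules: Array<(input: number) => T>, input: number): T {
  const results = modules.map((m) => m(input));
  const counts = new Map<string, { value: T; votes: number }>();

  for (const value of results) {
    const key = JSON.stringify(value);
    const entry = counts.get(key) ?? { value, votes: 0 };
    entry.votes += 1;
    counts.set(key, entry);
  }

  // Majority wins (2 out of 3); if all three disagree there is no safe answer.
  for (const entry of counts.values()) {
    if (entry.votes >= 2) return entry.value;
  }
  throw new Error("No majority: more than one module is faulty");
}

// Example: the second module is faulty and returns garbage, but the vote still succeeds.
const double = (x: number) => x * 2;
const faulty = (_x: number) => -1;
console.log(tmr([double, faulty, double], 21)); // 42
```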
Forward Error Correction
Yet another way to add a form of redundancy to the system, Forward Error Correction (FEC) adds redundancy into the message itself. That way, the receiver can verify the actual data and correct a limited number of detected errors caused by noisy or unstable channels.
Depending on the algorithm used to encode the data, the degree of redundancy on the channel may vary and with it, the amount of actual data that can be transferred through it.
There are two main types of encoding algorithms: block codes and convolutional codes. The first kind deals with fixed-length blocks of data, and one of the most common algorithms is Reed-Solomon. A classic example of this is two-dimensional bar codes, which are encoded in such a way that the reader can withstand a certain number of missing bits from the code.
Another very interesting real-world example of this type of redundancy can be found in the messages sent by the Voyager space probes and similar deep-space missions. As you can imagine, communication with these devices can’t really afford a retransmission because of a faulty bit, so this type of encoding is used so that the receiving end can correct as many of the errors introduced by a problematic channel as possible.
By contrast, convolutional codes deal with data streams of arbitrary length, and the most common algorithm used to decode them is the Viterbi algorithm. It is used in CDMA (Code Division Multiple Access) and GSM (Global System for Mobile Communications) cellular networks, dial-up modems, and deep-space communications (sometimes even in combination with Reed-Solomon, so that whatever defect can’t be fixed using Viterbi is fixed using R-S).
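Real codes like Reed-Solomon or Viterbi decoding are far beyond a short snippet, but the simplest possible block code, a triple-repetition code, is enough to show the principle: the sender adds redundancy so the receiver can correct a limited number of flipped bits on its own, without asking for a retransmission.

```typescript
// Triple-repetition code: the crudest possible FEC scheme.
// Each bit is sent three times; the receiver takes a majority vote per bit,
// so any single flipped bit inside a triplet is corrected silently.
function encode(bits: number[]): number[] {
  return bits.flatMap((b) => [b, b, b]);
}

function decode(received: number[]): number[] {
  const decoded: number[] = [];
  for (let i = 0; i < received.length; i += 3) {
    const ones = received[i] + received[i + 1] + received[i + 2];
    decoded.push(ones >= 2 ? 1 : 0); // majority vote corrects one error per triplet
  }
  return decoded;
}

const message = [1, 0, 1, 1];
const sent = encode(message); // [1,1,1, 0,0,0, 1,1,1, 1,1,1]
sent[4] = 1;                  // noisy channel flips one bit
console.log(decode(sent));    // [1, 0, 1, 1] -- still the original message
```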
Checkpointing
Checkpointing is yet another way to provide tolerance to failure; in fact, it’s a method commonly used by many programs regular users interact with daily, word processors being one of them.
This technique consists of saving the current state of the system to reliable storage and, whenever there is a problem, restarting the system by preloading that saved state. Ring a bell now? Word processors usually do this while you type: not on every keystroke (that would be too expensive), but at preset intervals the program saves your current work, in case there is some sort of crash.
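Here is a minimal sketch of that idea for a single process: periodically serialize the current state to disk, and on startup reload the last checkpoint if one exists. The file name and the thirty-second interval are arbitrary choices for the example, and a real implementation would also write the checkpoint atomically so a crash mid-save can’t corrupt it.

```typescript
import { existsSync, readFileSync, writeFileSync } from "node:fs";

const CHECKPOINT_FILE = "checkpoint.json"; // arbitrary path for the example

interface AppState {
  documentText: string;
  lastSavedAt: string;
}

// On startup, restore the last stable state if a checkpoint exists.
function loadCheckpoint(): AppState {
  if (existsSync(CHECKPOINT_FILE)) {
    return JSON.parse(readFileSync(CHECKPOINT_FILE, "utf8")) as AppState;
  }
  return { documentText: "", lastSavedAt: "never" };
}

// Periodically persist the current state; anything produced after the last
// checkpoint is lost on a crash, which is the trade-off of this technique.
function saveCheckpoint(state: AppState): void {
  state.lastSavedAt = new Date().toISOString();
  writeFileSync(CHECKPOINT_FILE, JSON.stringify(state));
}

const state = loadCheckpoint();
setInterval(() => saveCheckpoint(state), 30_000); // e.g., every 30 seconds
```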
Now, this sounds great for small systems, such as a word processor which is saving your current document, but what about whole distributed platforms?
Dealing with Distributed Checkpointing
For these cases, the task is a bit more complex, because there is usually a dependency between nodes; when one of them fails and is restored to a previous checkpoint, the others need to ensure that their current state is consistent with it. This can cause a cascade effect, forcing the system to roll back to the only common stable state: its original checkpoint.
There are already some solutions designed to deal with this problem, so you don’t have to. For example, the tool DMTCP (Distributed MultiThreading CheckPointing), provides the ability to checkpoint the status of an arbitrary number of distributed systems.
Another solution, used in RFID tags, is called Mementos. In this particular use case, the tags don’t have a power source of their own; they run off ambient energy harvested from the environment, which can lead to arbitrary power failures. The tool actively monitors the power level, and whenever there is enough energy to perform a checkpoint, it stores the tag’s current state in nonvolatile memory, from which that information can later be reloaded.
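To give a feel for why the distributed case is trickier, here is a heavily simplified coordinated-checkpoint sketch (my own illustration, not DMTCP’s or Mementos’ actual protocol): a coordinator asks every node to pause and snapshot its state, and the checkpoint only counts once all of them have done so, which is what keeps the saved states consistent with one another.

```typescript
// Hypothetical node interface: each node can pause, snapshot its state, and resume.
interface CheckpointableNode {
  name: string;
  pause(): Promise<void>;
  snapshot(): Promise<string>; // returns the serialized local state
  resume(): Promise<void>;
}

// Coordinated checkpoint: every node's snapshot is taken while the whole
// system is quiesced, so no in-flight work straddles the checkpoint.
async function coordinatedCheckpoint(nodes: CheckpointableNode[]): Promise<Map<string, string>> {
  const snapshots = new Map<string, string>();
  try {
    // Phase 1: stop the world so no messages cross the checkpoint boundary.
    await Promise.all(nodes.map((n) => n.pause()));

    // Phase 2: every node records its local state.
    for (const node of nodes) {
      snapshots.set(node.name, await node.snapshot());
    }
    return snapshots; // commit: this set of states is mutually consistent
  } finally {
    // Let the platform continue regardless of whether the checkpoint succeeded.
    await Promise.all(nodes.map((n) => n.resume()));
  }
}
```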
When to Use?
This technique is one that clearly doesn’t work on every system, and you need to carefully analyze your particular needs before starting to plan for it.
Since you’re not checkpointing every time there is new input on your system, you can’t ensure that the action taking place during the error will be able to finish, but what you can ensure is that the system will be able to handle sudden problems and will be restored to the latest stable state. (Whether that meets your needs is a different question.)
In cases such as a server crash during an API request, the request will most likely not be able to complete; and if it’s retried, it could potentially return an unexpected value because of an old state on the server side.
Byzantine Fault-Tolerance
I intentionally left this one for last, because it could be considered the sum of all of the above. What we have here is the “Byzantine Generals Problem”: a distributed system in which some components fail in arbitrary ways, sending conflicting information to different parts of the system, so the monitoring modules can’t reach a consensus on what is actually going on. In other words, you’re in trouble.

Example of a Byzantine problem, where there is a faulty component sending random data
There are different approaches to tackle this type of problem; in fact, there are too many out there to cover in a single chapter, so I’ll just go over the most common ones, to try to give you an idea of where to start.
The simplest approach, which is not so much a solution as a workaround, is to let your status checkers fall back to a default value whenever consensus can’t be reached. That way, the system doesn’t stall, and the current operation can continue.
Another possible solution, especially when the fault is in the data channel rather than in the component generating the message, is to protect the messages with some sort of checksum or CRC, so that corrupted messages can be detected and ignored.
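A rough sketch of that second idea: attach a checksum to every message (here a SHA-256 digest from Node’s built-in crypto module, standing in for whatever CRC or signature scheme you choose) and have the receiver drop anything whose digest doesn’t match. Note that a plain hash only protects against corruption; defending against a node that maliciously forges messages would require a keyed HMAC or a real signature.

```typescript
import { createHash } from "node:crypto";

interface SignedMessage {
  payload: string;
  digest: string; // hex-encoded SHA-256 of the payload
}

function seal(payload: string): SignedMessage {
  return { payload, digest: createHash("sha256").update(payload).digest("hex") };
}

// The receiver recomputes the digest and ignores anything that doesn't match.
function verify(message: SignedMessage): boolean {
  const expected = createHash("sha256").update(message.payload).digest("hex");
  return expected === message.digest;
}

const good = seal("engine-2: status OK");
// Simulate a faulty channel: the payload is corrupted after sealing.
const tampered = { ...seal("engine-3: status OK"), payload: "engine-3: ststus garbage" };

console.log(verify(good));     // true  -- accept
console.log(verify(tampered)); // false -- detected as faulty and ignored
```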
Finally, yet another approach to ensuring the authenticity of the messages sent is to use a blockchain, just as Bitcoin does, with a Proof of Work scheme in which each node that wants to send a message must back it by performing a heavy computation. I’m only mentioning this approach here, since it could be the subject of an entire book, but the idea behind it is that it lets mutually distrusting nodes reach agreement, addressing the Byzantine Generals Problem at the cost of a great deal of extra computation.
Summary
In short, the need to scale on your platform is usually triggered by one of the following:
Your traffic increasing.
An increase in your processing needs.
Some form of side effect from one of the above (such as a faulty log server caused by the increased traffic your whole platform is getting).
Or the fact that you’re pursuing a specific trait for your platform, such as high availability or fault tolerance.
The next chapter will cover some of the most common architectural patterns. We might revisit some of the ideas covered here, but we’ll look at them from a different point of view.