When Designing a Data System

Three Concerns

Reliability
The system should continue to work correctly even in the face of adversity.

Scalability
As the system grows, there should be reasonable ways of dealing with that growth.

Maintainability
Many different people will work on the system, and they should all be able to work on it productively.

Reliability

The system should continue to work correctly even in the face of adversity.

An effort should be made to design the system to perform the correct functionality at the desired level of performance even when there are hardware faults, software faults, or human error. A system designed to cope with faults is called fault-tolerant or resilient. Of course, it isn't reasonable to expect a system to tolerate any and all faults, so it only makes sense to design for certain types of faults.

We should also note the distinction between a fault and a failure. A fault is when a component of the system is not performing according to its specification, whereas a failure is when the system as a whole stops providing the service the user expects. The goal is to design fault-tolerance mechanisms that prevent faults from causing failures.

It isn't enough to design for fault tolerance; we also need to test our system. This includes injecting faults into a system under load just to see how it responds. Netflix has taken an extreme approach by injecting faults into their production system: they use Chaos Monkey to randomly terminate instances so that engineers can learn where they need to improve resilience.

Some faults are impossible to tolerate, and in those cases it may be possible to prevent them from happening. This is the case with security matters, where the goal is to prevent the attacker from getting in to exploit the system; if the attacker gets in, there is no way to undo that. In this course, we will focus on faults that can be cured and the strategies for doing so.
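To make the fault-injection idea concrete, here is a minimal sketch, not Netflix's actual tooling, of wrapping a backend call so that faults are injected randomly during a load test; the `fetch_profile` function and fault rate are hypothetical:

```python
import random

class FaultInjector:
    """Wraps a callable and randomly raises a simulated fault, so we can
    observe whether callers tolerate component failures."""
    def __init__(self, func, fault_rate=0.1):
        self.func = func
        self.fault_rate = fault_rate

    def __call__(self, *args, **kwargs):
        if random.random() < self.fault_rate:
            # Simulate a component fault, e.g. a crashed backend node.
            raise ConnectionError("injected fault")
        return self.func(*args, **kwargs)

def fetch_profile(user_id):
    # Hypothetical backend call, for illustration only.
    return {"user_id": user_id, "name": "example"}

flaky_fetch = FaultInjector(fetch_profile)

def handle_request(user_id):
    # The fault-tolerance mechanism under test: retry, then degrade gracefully
    # rather than letting the fault become a failure.
    for _ in range(3):
        try:
            return flaky_fetch(user_id)
        except ConnectionError:
            continue
    return {"user_id": user_id, "name": None}
```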

Four aspects of reliability are discussed below: Hardware Faults, Software Errors, Human Errors, and the Cost of Reliability.

Scalability

As the system grows, there should be reasonable ways of dealing with that growth.

The data volume, traffic volume, or complexity of behavior are all aspects of growth. We use the term scalability to describe a system's ability to cope with increased load. Keep in mind that scalability is multi-dimensional; it is not enough to say a system is scalable. We need to qualify the type of growth the system is experiencing and what options we have for coping with that growth.

To address scalability concerns, we need a way to describe the load and to measure performance. Only then can we discuss how to cope with an increase in load.

Three aspects of scalability are discussed below: Describing Load, Describing Performance, and Coping with Increased Load.

Maintainability

Over time, many different people will work on the system, and they should all be able to work on it productively.

These activities include engineering and operations, both maintaining current behavior and adapting the system to new use cases. Today's data systems are predominantly software running on commodity hardware, and any system that users are willing to adopt is going to require changes: bug fixes, new features, or major upgrades that essentially replace major software components with newer technology. Therefore, we should design the software in a way that minimizes the difficulty of these maintenance activities.

We should pay particular attention to three aspects of maintainability: Operability, Simplicity, and Evolvability.

Human Errors

Humans are involved in the design, build, and operation of systems, and try as they might, humans make mistakes. More system outages are attributed to human operator error than to hardware failure. There are several approaches to making systems more reliable despite their being designed, built, and operated by unreliable humans.

Tips for Reliability

  • Minimize opportunities for error using well-designed abstractions, APIs, and operator interfaces
  • Provide a fully featured non-production sandbox environment where people can explore and experiment safely without affecting real users
  • Use automated testing at all levels: unit tests and whole-system integration tests in addition to manual testing
  • Minimize the impact of human failures by making it easy to roll back changes or to roll out new code gradually (see the sketch below)
  • Set up detailed and clear monitoring so that operators can detect problems early or diagnose a problem after it occurs
  • Implement good management practices and training
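As an illustration of gradual rollout, here is a minimal sketch, with hypothetical function names and a made-up rollout percentage, of routing a small, deterministic fraction of users to new code so that a bad release affects few users and can be rolled back by lowering the percentage:

```python
import hashlib

ROLLOUT_PERCENT = 5  # raise gradually as confidence grows; set to 0 to roll back

def in_rollout(user_id: str, percent: int = ROLLOUT_PERCENT) -> bool:
    """Deterministically place each user in a bucket from 0 to 99; the same
    user always lands in the same bucket, so raising the percentage only
    ever adds users to the new code path."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100 < percent

def handle_request(user_id: str) -> str:
    if in_rollout(user_id):
        return new_code_path(user_id)   # the change being rolled out
    return old_code_path(user_id)       # known-good behavior

# Hypothetical code paths, for illustration only.
def new_code_path(user_id: str) -> str:
    return f"v2 response for {user_id}"

def old_code_path(user_id: str) -> str:
    return f"v1 response for {user_id}"

print(handle_request("user-123"))
```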

Describing Load

We typically describe load with a few numbers known as load parameters. The choice of parameters depends on the system architecture; for a web service, a natural load parameter is requests per second.

In the reading, Kleppmann presents an example using Twitter from 2012. There are two main operations: 1) post tweet and 2) home timeline. Load parameters for each operation are given in requests per second: requests to post a tweet versus requests to produce the home timeline. While handling the tweet volume seems straightforward, the scaling problem for Twitter is fan-out, not just tweet volume. A user follows other people, and that same user is followed by other people. As users follow more people, the home timeline takes longer to produce. Are there tradeoffs that can help balance the performance issue? Can Twitter precompute information during a post (slowing down the post operation) in order to speed up producing the home timeline?
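To make that precompute-on-write tradeoff concrete, here is a minimal sketch of a fan-out-on-write approach; the in-memory data structures and function names are illustrative, not Twitter's actual implementation. Each new post is pushed into a per-follower timeline cache, so reading the home timeline becomes a cheap lookup:

```python
from collections import defaultdict, deque

# Follower graph: author -> set of followers (an in-memory stand-in for
# what would really live in a database or cache cluster).
followers = defaultdict(set)

# Precomputed home timelines: user -> most recent tweets, newest first.
timelines = defaultdict(lambda: deque(maxlen=800))

def post_tweet(author, text):
    """Fan-out on write: do extra work now so that reads are cheap later."""
    for follower in followers[author]:
        timelines[follower].appendleft((author, text))

def home_timeline(user):
    """Reading the timeline is now a simple lookup, with no joins at read time."""
    return list(timelines[user])

followers["alice"].update({"bob", "carol"})
post_tweet("alice", "hello")
print(home_timeline("bob"))   # [('alice', 'hello')]
```

The cost moves to the write path: a post by a user with millions of followers triggers millions of timeline updates, which is why a hybrid of fan-out on write and fan-out on read can make sense.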

Coping with Increased Load

Depending on the nature of the service, accommodating a ten-fold increase in load can be straightforward or nearly impossible. A stateless web service can easily be distributed over many machines behind a load balancer; stateful systems are much harder to load balance. Consider the popular terms scale up and scale out. For a database server that needs to handle an increased load, the only option may be to scale it up, that is, to add resources (memory, CPU, disk, etc.) to the server. There are obvious limits to scaling up, and the costs are not linear. The stateless web service, on the other hand, can be scaled out by adding nodes that run the service.

With the advancements in distributed databases, we can now have stateful systems that scale out. The ability to scale out is achieved with a shared-nothing architecture: each node in the cluster has its own memory, CPU, and storage. These scalable systems are typically application-specific rather than general-purpose, although they are built from general-purpose components.
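Here is a minimal sketch of the shared-nothing idea, with hypothetical node names and a deliberately simple routing function: each key is hashed to one node, so every node owns its own slice of the data and nothing (memory, disk, CPU) is shared between nodes.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]   # hypothetical cluster members

# Each node keeps purely local storage; nothing is shared between nodes.
local_storage = {node: {} for node in NODES}

def node_for_key(key: str) -> str:
    """Hash partitioning: each key maps deterministically to one node, so the
    keyspace can be spread over more machines by adding nodes."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

def put(key, value):
    local_storage[node_for_key(key)][key] = value

def get(key):
    return local_storage[node_for_key(key)].get(key)

put("user:42", {"name": "Ada"})
print(node_for_key("user:42"), get("user:42"))
```

A real system would use consistent hashing or explicit partition assignment so that adding a node does not reshuffle every key; the modulo here just keeps the sketch short.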

Hardware Faults

Consider an application running on a single server in a data center. A computer is made up of hardware components that can fail: the hard drive can crash, the RAM can become faulty, or the power supply can fail. The computer also relies on a power source, and the application relies on network connectivity; if the network fails, the application cannot serve clients. The power distribution components in the rack could fail, or the power to the building could go out.

A lot can be done to prevent hardware faults from becoming a system failure. We can add redundancy to components in the server: the disk drives can be set up in a RAID configuration to survive a disk failure, the CPU can be hot-swappable, and the server can have redundant power supplies. The rack can have two independent power distribution units on separate circuits, along with a battery backup to survive power bumps, and the data center can have a generator so the facility can survive a power outage.

With all this redundancy, failure of a single system can be rare. For critical applications, a hot standby (another server) can be configured, moving the redundancy to the machine level. With the increased demands of data volume and compute capacity, applications are using increasingly large numbers of machines, and with a large number of machines the probability that one will fail goes up. Since systems today are typically distributed across multiple nodes, there is a move toward designs that can tolerate the loss of an entire node, using software fault-tolerance techniques instead of, or in addition to, hardware redundancy. This approach also has operational advantages over the single-node solution: a single-node application needs planned downtime for maintenance or upgrades, whereas a system designed to tolerate node failure can be upgraded and maintained while it remains operational.
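To see why the chance of a fault grows with cluster size, here is a small worked example; the per-node failure probability is made up purely for illustration. If each node independently fails on a given day with probability p, the probability that at least one of N nodes fails that day is 1 - (1 - p)^N.

```python
def prob_any_failure(p_node: float, n_nodes: int) -> float:
    """Probability that at least one of n independent nodes fails."""
    return 1 - (1 - p_node) ** n_nodes

p = 0.001  # assumed daily failure probability of a single node (illustrative)
for n in (1, 100, 1000):
    print(n, round(prob_any_failure(p, n), 3))
# 1 -> 0.001, 100 -> 0.095, 1000 -> 0.632
```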

Simplicity

Large software projects can quickly become complex and difficult to understand. Anyone who needs to maintain the software will be slowed down by the complexity of the system, which increases the cost and timeline of maintenance. Making a system simpler does not mean reducing its functionality; sometimes we end up with accidental complexity that can be removed by refactoring the code. One of the key ways to simplify a system design is to use meaningful abstractions. A good abstraction hides the complexity of the implementation, which can lead to reuse and higher-quality software.

Throughout this course, we will consider good abstractions for data systems. Such abstractions can be hard to find, especially for distributed systems.

Describing Performance

There are two ways to look at performance when load increases:

  • Under increased load but with fixed resources (e.g. CPU, memory, network bandwidth), how is performance affected?
  • Under increased load, how much do you need to increase the resources in order to maintain the same performance?

Again, the performance metric used will depend on the architecture. For a batch processing system, we are usually concerned with throughput: how long does it take to process a batch of data of a particular size (number of records or number of bytes)? Many systems we consider today are online systems, where a client sends a query and expects a response; here we are concerned with response time: how long does it take for a client to receive a response?

We should be aware of the distinction between the terms latency and response time, which are often used interchangeably in industry. Latency describes the time a request spends waiting to be handled (for example, sitting in a queue), as opposed to the time spent processing it. Response time includes both the queuing delays and the processing time. For data-intensive applications, the queuing delays may be the dominating factor in response time.

While it is common to see an average response time reported for a service, the arithmetic mean is not a very good metric. In the reading, Kleppmann spends a good amount of time explaining why percentiles are a much better way to report response time. The median, for example, is very meaningful: it tells us that half of the time users can expect a response time at or below the median value. Interesting things can happen in networked systems with distributed backend services and a dynamic workload, and some response times may be very high; knowing how often these occur is important. For systems with millions of users making requests, we are also concerned about the 99th percentile. If the 99th percentile response time is 100 times longer than the median, there could easily be thousands of users having a rough time using your service.
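As a small illustration of why percentiles say more than the mean, here is a sketch that computes the mean, median, 95th, and 99th percentiles of a set of response times; the sample values are made up:

```python
import statistics

# Hypothetical response times in milliseconds; note the single slow outlier.
response_times_ms = [12, 14, 15, 15, 16, 18, 20, 21, 25, 900]

def percentile(values, pct):
    """Nearest-rank percentile: the value below which roughly pct% of samples fall."""
    ordered = sorted(values)
    index = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[index]

print("mean  :", statistics.mean(response_times_ms))  # 105.6, dragged up by the outlier
print("median:", percentile(response_times_ms, 50))   # 16, what a typical user sees
print("p95   :", percentile(response_times_ms, 95))   # 900
print("p99   :", percentile(response_times_ms, 99))   # 900
```

The mean suggests every request is slow, while the percentiles show that most requests are fast and a small fraction are very slow, which is exactly the behavior we want to track.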

Operability

We should strive to make life easy for the system operators. Some of the duties that operators perform include:

  • Monitoring the system's health and quickly restoring any services that are degraded
  • Tracking down the cause of failures or degraded performance
  • Updating software and the platforms it relies on (e.g. security patches)
  • Understanding the dependencies between different system components to avoid cascading effects
  • Anticipating future problems (e.g. capacity planning)
  • Defining best practices and training new operators

Some of the things a good data system should do to make these routine tasks easy:

  • Provide visibility into the internals of the system, including runtime behavior
  • Provide support for automating tasks and for integration with external tools
  • Avoid single points of failure so that maintenance can be performed without an outage
  • Provide good documentation and easy-to-understand procedures
  • Provide good default behavior, but allow administrators to override the defaults when needed
  • Provide a self-healing capability, while giving administrators manual control when needed
  • Minimize surprises by exhibiting predictable behavior
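As a small sketch of the visibility point, a service can expose its own runtime state over a simple HTTP endpoint that operators and monitoring tools can poll; the paths and counters below are hypothetical, not from the reading:

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()
METRICS = {"requests_total": 0, "errors_total": 0}  # updated by the application code

class OpsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            body = b"ok"  # liveness check for operators and load balancers
        elif self.path == "/metrics":
            snapshot = dict(METRICS, uptime_seconds=int(time.time() - START_TIME))
            body = json.dumps(snapshot).encode()
        else:
            self.send_response(404)
            self.end_headers()
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), OpsHandler).serve_forever()
```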

Built-In Reliability

While we do want to design and implement our systems to be reliable, we also need to consider the cost. Any system that is intended to be used by customers should have an appropriate level of reliability built in: the customers deserve it, and your reputation can be harmed by reliability problems. That said, there are situations where we may choose to sacrifice reliability in order to reduce development cost, for example in a prototype system. The important thing is that we are conscious of any decision to cut corners.

Software Errors

While hardware faults are usually random and unrelated to each other, a software error is a systematic error. Systematic errors are harder to anticipate and tend to cause many more simultaneous node failures than hardware faults. Software bugs can lie dormant for a long time until the right set of circumstances arises; in such cases, the software designers may have made assumptions about the environment that are no longer true.

Examples

  • A bug that causes every instance of an application server to crash when given bad input
  • A runaway process that uses up a shared resource such as CPU time, memory, disk space, or network bandwidth
  • A critical service that the system depends on slowing down or becoming unresponsive
  • Cascading failures, where a small fault in one component triggers faults in other components, which in turn trigger further failures, and so on

Solutions

There are no easy solutions to the problem of systematic faults in software, but there are many things the design team can do to help:

  • Carefully think through assumptions and interactions
  • Test thoroughly
  • Use process isolation
  • Allow processes to crash and restart (a small sketch follows this list)
  • Measure, monitor, and analyze system behavior in production
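Here is a minimal sketch of the crash-and-restart idea, assuming the worker logic can run in its own process; the worker itself is hypothetical. The supervisor runs the worker in a separate process for isolation and restarts it with backoff if it dies:

```python
import multiprocessing
import random
import time

def worker():
    """Hypothetical worker running in its own process, so a crash cannot
    corrupt the supervisor's state (process isolation)."""
    while True:
        time.sleep(0.5)
        if random.random() < 0.1:
            raise RuntimeError("simulated software bug")  # unhandled, so the process dies

def supervise(target, max_restarts=5):
    """Allow the process to crash, then restart it, backing off between attempts."""
    restarts = 0
    while restarts <= max_restarts:
        proc = multiprocessing.Process(target=target)
        proc.start()
        proc.join()                   # returns when the worker process exits
        if proc.exitcode == 0:
            return                    # clean shutdown, nothing to do
        restarts += 1
        print(f"worker died (exit code {proc.exitcode}); restart #{restarts}")
        time.sleep(2 ** restarts)     # back off so a crash loop does not thrash the machine

if __name__ == "__main__":
    supervise(worker)
```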

Evolvability

Inevitably, the system requirements will change, for multiple reasons. We are always learning new facts and discovering previously unanticipated use cases; users will request new features; business priorities will change; platforms the system depends on will need to be replaced; and regulatory requirements can change. If your system is successful, its growth may force architectural changes.

The ease of change is directly linked to the simplicity of the original design and the abstractions that were used. There are also organizational considerations for the support team. While agile methods are very popular for making small incremental changes, we may need to consider agility at the architectural level by decomposing the system into multiple services that can be composed to create new capabilities. Later in the course, we will consider some of these approaches.