When Designing a Data System

Three Concerns

Reliability
The system should continue to work correctly even in the face of adversity.

Scalability
As the system grows, there should be reasonable ways of dealing with that growth.

Maintainability
Many different people will work on the system, and they should all be able to work on it productively.

Reliability

The system should continue to work correctly even in the face of adversity.

An effort should be made to design the system to perform the correct functionality at the desired level of performance even when there are hardware faults, software faults, or human error. A system designed to cope with faults is called fault-tolerant or resilient. Of course, it isn't reasonable to expect a system to tolerate any and all faults, so it only makes sense to design for certain types of faults.

We should also note the distinction between a fault and a failure. A fault is when a component of the system is not performing according to its specification, whereas a failure is when the system as a whole stops providing the service the user expects. The goal is to design fault-tolerance mechanisms that prevent faults from causing failures.

It isn't enough to design for fault tolerance; we also need to test our system. This includes injecting faults into a system under load just to see how it responds. Netflix has taken an extreme approach by injecting faults into their production system: they use Chaos Monkey to randomly terminate instances so that engineers can learn where they need to improve resilience.

Some faults are impossible to tolerate, and in those cases it may be possible to prevent them from happening. This is the case with security matters, where the goal is to prevent the attacker from getting in to exploit the system; if the attacker gets in, there is no way to undo that. In this course, we will focus on faults that can be cured and the strategies for doing so.
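To make the fault-injection idea concrete, here is a minimal sketch, not Netflix's actual tooling, of wrapping a backend call so that faults are injected randomly during a load test; the `fetch_profile` function and fault rate are hypothetical:

```python
import random

class FaultInjector:
    """Wraps a callable and randomly raises a simulated fault, so we can
    observe whether callers tolerate component failures."""
    def __init__(self, func, fault_rate=0.1):
        self.func = func
        self.fault_rate = fault_rate

    def __call__(self, *args, **kwargs):
        if random.random() < self.fault_rate:
            # Simulate a component fault, e.g. a crashed backend node.
            raise ConnectionError("injected fault")
        return self.func(*args, **kwargs)

def fetch_profile(user_id):
    # Hypothetical backend call, for illustration only.
    return {"user_id": user_id, "name": "example"}

flaky_fetch = FaultInjector(fetch_profile)

def handle_request(user_id):
    # The fault-tolerance mechanism under test: retry, then degrade gracefully
    # rather than letting the fault become a failure.
    for _ in range(3):
        try:
            return flaky_fetch(user_id)
        except ConnectionError:
            continue
    return {"user_id": user_id, "name": None}
```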

Four aspects of reliability are discussed below: Hardware Faults, Software Errors, Human Errors, and the Cost of Reliability.

Scalability

As the system grows, there should be reasonable ways of dealing with that growth.

The data volume, traffic volume, or complexity of behavior are all aspects of growth. We use the term scalability to describe a system's ability to cope with increased load. Keep in mind that scalability is multi-dimensional; it is not enough to say a system is scalable. We need to qualify the type of growth the system is experiencing and what options we have for coping with that growth.

To address scalability concerns, we need a way to describe the load and to measure performance. Only then can we discuss how to cope with an increase in load.

Three aspects of scalability are discussed below: Describing Load, Describing Performance, and Coping with Increased Load.

Maintainability

Over time, many different people will work on the system, and they should all be able to work on it productively.

These activities include engineering and operations, both maintaining current behavior and adapting the system to new use cases. Today's data systems are predominantly software running on commodity hardware, and any system that users are willing to adopt is going to require changes: bug fixes, new features, or major upgrades that essentially replace major software components with newer technology. Therefore, we should design the software in a way that minimizes the difficulty of these maintenance activities.

We should pay particular attention to three aspects of maintainability: Operability, Simplicity, and Evolvability.

Human Errors

Humans are involved in the design, build, and operation of systems, and try as they might, humans make mistakes. More system outages are attributed to human operator error than to hardware failure. There are several approaches to making systems more reliable despite their being designed, built, and operated by unreliable humans.

Tips for Reliability

  • Minimize opportunities for error using well-designed abstractions, APIs, and operator interfaces
  • Provide a fully featured non-production sandbox environment where people can explore and experiment safely without affecting real users
  • Use automated testing at all levels: unit tests and whole-system integration tests in addition to manual testing
  • Minimize the impact of human failures by making it easy to roll back changes or to roll out new code gradually (see the sketch below)
  • Set up detailed and clear monitoring so that operators can detect problems early or diagnose a problem after it occurs
  • Implement good management practices and training
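As an illustration of gradual rollout, here is a minimal sketch, with hypothetical function names and a made-up rollout percentage, of routing a small, deterministic fraction of users to new code so that a bad release affects few users and can be rolled back by lowering the percentage:

```python
import hashlib

ROLLOUT_PERCENT = 5  # raise gradually as confidence grows; set to 0 to roll back

def in_rollout(user_id: str, percent: int = ROLLOUT_PERCENT) -> bool:
    """Deterministically place each user in a bucket from 0 to 99; the same
    user always lands in the same bucket, so raising the percentage only
    ever adds users to the new code path."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100 < percent

def handle_request(user_id: str) -> str:
    if in_rollout(user_id):
        return new_code_path(user_id)   # the change being rolled out
    return old_code_path(user_id)       # known-good behavior

# Hypothetical code paths, for illustration only.
def new_code_path(user_id: str) -> str:
    return f"v2 response for {user_id}"

def old_code_path(user_id: str) -> str:
    return f"v1 response for {user_id}"

print(handle_request("user-123"))
```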

Describing Load

We typically describe load with a few numbers known as load parameters. The choice of parameters depends on the system architecture; for a web service, a natural load parameter is requests per second.

In the reading, Kleppmann presents an example using Twitter from 2012. There are two main operations: 1) post tweet and 2) home timeline. Load parameters for each operation are given in requests per second: requests to post a tweet versus requests to produce the home timeline. While handling the tweet volume seems straightforward, the scaling problem for Twitter is fan-out, not just tweet volume. A user follows other people, and that same user is followed by other people. As users follow more people, the home timeline takes longer to produce. Are there tradeoffs that can help balance the performance issue? Can Twitter precompute information during a post (slowing down the post operation) in order to speed up producing the home timeline?
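To make that precompute-on-write tradeoff concrete, here is a minimal sketch of a fan-out-on-write approach; the in-memory data structures and function names are illustrative, not Twitter's actual implementation. Each new post is pushed into a per-follower timeline cache, so reading the home timeline becomes a cheap lookup:

```python
from collections import defaultdict, deque

# Follower graph: author -> set of followers (an in-memory stand-in for
# what would really live in a database or cache cluster).
followers = defaultdict(set)

# Precomputed home timelines: user -> most recent tweets, newest first.
timelines = defaultdict(lambda: deque(maxlen=800))

def post_tweet(author, text):
    """Fan-out on write: do extra work now so that reads are cheap later."""
    for follower in followers[author]:
        timelines[follower].appendleft((author, text))

def home_timeline(user):
    """Reading the timeline is now a simple lookup, with no joins at read time."""
    return list(timelines[user])

followers["alice"].update({"bob", "carol"})
post_tweet("alice", "hello")
print(home_timeline("bob"))   # [('alice', 'hello')]
```

The cost moves to the write path: a post by a user with millions of followers triggers millions of timeline updates, which is why a hybrid of fan-out on write and fan-out on read can make sense.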

Coping with Increased Load

Depending on the nature of the service, accommodating a ten-fold increase in load can be straightforward or nearly impossible. A stateless web service can easily be distributed over many machines behind a load balancer; stateful systems are much harder to load balance. Consider the popular terms scale up and scale out. For a database server that needs to handle an increased load, the only option may be to scale it up, that is, to add resources (memory, CPU, disk, etc.) to the server. There are obvious limits to scaling up, and the costs are not linear. The stateless web service, on the other hand, can be scaled out by adding nodes that run the service.

With the advancements in distributed databases, we can now have stateful systems that scale out. The ability to scale out is achieved with a shared-nothing architecture: each node in the cluster has its own memory, CPU, and storage. These scalable systems are typically application-specific rather than general-purpose, although they are built from general-purpose components.
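Here is a minimal sketch of the shared-nothing idea, with hypothetical node names and a deliberately simple routing function: each key is hashed to one node, so every node owns its own slice of the data and nothing (memory, disk, CPU) is shared between nodes.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]   # hypothetical cluster members

# Each node keeps purely local storage; nothing is shared between nodes.
local_storage = {node: {} for node in NODES}

def node_for_key(key: str) -> str:
    """Hash partitioning: each key maps deterministically to one node, so the
    keyspace can be spread over more machines by adding nodes."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

def put(key, value):
    local_storage[node_for_key(key)][key] = value

def get(key):
    return local_storage[node_for_key(key)].get(key)

put("user:42", {"name": "Ada"})
print(node_for_key("user:42"), get("user:42"))
```

A real system would use consistent hashing or explicit partition assignment so that adding a node does not reshuffle every key; the modulo here just keeps the sketch short.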

Hardware Faults

Consider an application running on a single server in a data center. A computer is made up of hardware components that can fail: the hard drive can crash, the RAM can become faulty, or the power supply can fail. The computer also relies on a power source, and the application relies on network connectivity; if the network fails, the application cannot serve clients. The power distribution components in the rack could fail, or the power to the building could go out.

A lot can be done to prevent hardware faults from becoming a system failure. We can add redundancy to components in the server: the disk drives can be set up in a RAID configuration to survive a disk failure, the CPU can be hot-swappable, and the server can have redundant power supplies. The rack can have two independent power distribution units on separate circuits, along with a battery backup to survive power bumps, and the data center can have a generator so the facility can survive a power outage.

With all this redundancy, failure of a single system can be rare. For critical applications, a hot standby (another server) can be configured, moving the redundancy to the machine level. With the increased demands of data volume and compute capacity, applications are using increasingly large numbers of machines, and with a large number of machines the probability that one will fail goes up. Since systems today are typically distributed across multiple nodes, there is a move toward designs that can tolerate the loss of an entire node, using software fault-tolerance techniques instead of, or in addition to, hardware redundancy. This approach also has operational advantages over the single-node solution: a single-node application needs planned downtime for maintenance or upgrades, whereas a system designed to tolerate node failure can be upgraded and maintained while it remains operational.
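To see why the chance of a fault grows with cluster size, here is a small worked example; the per-node failure probability is made up purely for illustration. If each node independently fails on a given day with probability p, the probability that at least one of N nodes fails that day is 1 - (1 - p)^N.

```python
def prob_any_failure(p_node: float, n_nodes: int) -> float:
    """Probability that at least one of n independent nodes fails."""
    return 1 - (1 - p_node) ** n_nodes

p = 0.001  # assumed daily failure probability of a single node (illustrative)
for n in (1, 100, 1000):
    print(n, round(prob_any_failure(p, n), 3))
# 1 -> 0.001, 100 -> 0.095, 1000 -> 0.632
```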

Simplicity

Large software projects can quickly become complex and difficult to understand. Anyone who needs to maintain the software will be slowed down by the complexity of the system, which increases the cost and timeline of maintenance. Making a system simpler does not mean reducing its functionality; sometimes we end up with accidental complexity that can be removed by refactoring the code. One of the key ways to simplify a system design is to use meaningful abstractions. A good abstraction hides the complexity of the implementation, which can lead to reuse and higher-quality software.

Throughout this course, we will consider good abstractions for data systems. Such abstractions can be hard to find, especially for distributed systems.

Describing Performance

There are two ways to look at performance when load increases:

  • Under increased load but with fixed resources (e.g. CPU, memory, network bandwidth), how is performance affected?
  • Under increased load, how much do you need to increase the resources in order to maintain the same performance?

Again, the performance metric used will depend on the architecture. For a batch processing system, we are usually concerned with throughput: how long does it take to process a batch of data of a particular size (number of records or number of bytes)? Many systems we consider today are online systems, where a client sends a query and expects a response; here we are concerned with response time: how long does it take for a client to receive a response?

We should be aware of the distinction between the terms latency and response time, which are often used interchangeably in industry. Latency describes the time a request spends waiting to be handled (for example, sitting in a queue), as opposed to the time spent processing it. Response time includes both the queuing delays and the processing time. For data-intensive applications, the queuing delays may be the dominating factor in response time.

While it is common to see an average response time reported for a service, the arithmetic mean is not a very good metric. In the reading, Kleppmann spends a good amount of time explaining why percentiles are a much better way to report response time. The median, for example, is very meaningful: it tells us that half of the time users can expect a response time at or below the median value. Interesting things can happen in networked systems with distributed backend services and a dynamic workload, and some response times may be very high; knowing how often these occur is important. For systems with millions of users making requests, we are also concerned about the 99th percentile. If the 99th percentile response time is 100 times longer than the median, there could easily be thousands of users having a rough time using your service.
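As a small illustration of why percentiles say more than the mean, here is a sketch that computes the mean, median, 95th, and 99th percentiles of a set of response times; the sample values are made up:

```python
import statistics

# Hypothetical response times in milliseconds; note the single slow outlier.
response_times_ms = [12, 14, 15, 15, 16, 18, 20, 21, 25, 900]

def percentile(values, pct):
    """Nearest-rank percentile: the value below which roughly pct% of samples fall."""
    ordered = sorted(values)
    index = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[index]

print("mean  :", statistics.mean(response_times_ms))  # 105.6, dragged up by the outlier
print("median:", percentile(response_times_ms, 50))   # 16, what a typical user sees
print("p95   :", percentile(response_times_ms, 95))   # 900
print("p99   :", percentile(response_times_ms, 99))   # 900
```

The mean suggests every request is slow, while the percentiles show that most requests are fast and a small fraction are very slow, which is exactly the behavior we want to track.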

Operability

We should strive to make life easy for the system operators. Some of the duties that operators perform include:

  • Monitoring the system's health and quickly restoring any services that are degraded
  • Tracking down the cause of failures or degraded performance
  • Updating software and the platforms it relies on (e.g. security patches)
  • Understanding the dependencies between different system components to avoid cascading effects
  • Anticipating future problems (e.g. capacity planning)
  • Defining best practices and training new operators

Some of the things a good data system should do to make these routine tasks easy:

  • Provide visibility into the internals of the system, including runtime behavior
  • Provide support for automating tasks and for integration with external tools
  • Avoid single points of failure so that maintenance can be performed without an outage
  • Provide good documentation and easy-to-understand procedures
  • Provide good default behavior, but allow administrators to override the defaults when needed
  • Provide a self-healing capability, while giving administrators manual control when needed
  • Minimize surprises by exhibiting predictable behavior
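As a small sketch of the visibility point, a service can expose its own runtime state over a simple HTTP endpoint that operators and monitoring tools can poll; the paths and counters below are hypothetical, not from the reading:

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()
METRICS = {"requests_total": 0, "errors_total": 0}  # updated by the application code

class OpsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            body = b"ok"  # liveness check for operators and load balancers
        elif self.path == "/metrics":
            snapshot = dict(METRICS, uptime_seconds=int(time.time() - START_TIME))
            body = json.dumps(snapshot).encode()
        else:
            self.send_response(404)
            self.end_headers()
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), OpsHandler).serve_forever()
```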

Built-In Reliability

While we do want to design and implement our systems to be reliable, we also need to consider the cost. Any system that is intended to be used by customers should have an appropriate level of reliability built in: the customers deserve it, and your reputation can be harmed by reliability problems. That said, there are situations where we may choose to sacrifice reliability in order to reduce development cost, for example in a prototype system. The important thing is that we are conscious of any decision to cut corners.

Software Errors

While hardware faults are usually random and unrelated to each other, a software error is a systematic error. Systematic errors are harder to anticipate and tend to cause many more simultaneous node failures than hardware faults. Software bugs can lie dormant for a long time until the right set of circumstances arises; in such cases, the software designers may have made assumptions about the environment that are no longer true.

Examples

  • A bug that causes every instance of an application server to crash when given bad input
  • A runaway process that uses up a shared resource such as CPU time, memory, disk space, or network bandwidth
  • A critical service that the system depends on slowing down or becoming unresponsive
  • Cascading failures, where a small fault in one component triggers faults in other components, which in turn trigger further failures, and so on

Solutions

There are no easy solutions to the problem of systematic faults in software, but there are many things the design team can do to help:

  • Carefully think through assumptions and interactions
  • Test thoroughly
  • Use process isolation
  • Allow processes to crash and restart (a small sketch follows this list)
  • Measure, monitor, and analyze system behavior in production
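Here is a minimal sketch of the crash-and-restart idea, assuming the worker logic can run in its own process; the worker itself is hypothetical. The supervisor runs the worker in a separate process for isolation and restarts it with backoff if it dies:

```python
import multiprocessing
import random
import time

def worker():
    """Hypothetical worker running in its own process, so a crash cannot
    corrupt the supervisor's state (process isolation)."""
    while True:
        time.sleep(0.5)
        if random.random() < 0.1:
            raise RuntimeError("simulated software bug")  # unhandled, so the process dies

def supervise(target, max_restarts=5):
    """Allow the process to crash, then restart it, backing off between attempts."""
    restarts = 0
    while restarts <= max_restarts:
        proc = multiprocessing.Process(target=target)
        proc.start()
        proc.join()                   # returns when the worker process exits
        if proc.exitcode == 0:
            return                    # clean shutdown, nothing to do
        restarts += 1
        print(f"worker died (exit code {proc.exitcode}); restart #{restarts}")
        time.sleep(2 ** restarts)     # back off so a crash loop does not thrash the machine

if __name__ == "__main__":
    supervise(worker)
```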

Evolvability

Inevitably, the system requirements will change, for multiple reasons. We are always learning new facts and discovering previously unanticipated use cases; users will request new features; business priorities will change; platforms the system depends on will need to be replaced; and regulatory requirements can change. If your system is successful, its growth may force architectural changes.

The ease of change is directly linked to the simplicity of the original design and the abstractions that were used. There are also organizational considerations for the support team. While agile methods are very popular for making small incremental changes, we may need to consider agility at the architectural level by decomposing the system into multiple services that can be composed to create new capabilities. Later in the course, we will consider some of these approaches.