Roberto Lupi

The Math of Reliability: Control Hierarchies

Roberto Lupi — Thu, 07 Mar 2024 18:06:56 GMT

SREs are software engineers who primarily focus on software systems. These systems are adaptable and quick to respond to changes.

Cover Image: Functional levels of a Distributed Control System(DCS) by Daniele Pugliesi, CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0, via Wikimedia Commons

From this post, I'll start to move away from SRE and share personal insights drawn more from my education (SWE+EE, "Ingegneria Informatica e dell'Automazione", CS and Automation Engineering), personal interests and from my first decade of work (2000-2012), which includes experiences as a consultant, tech lead, or CTO in smaller companies.

Every company, organization, technical system, or individual has specific goals they aim to achieve. Their journey towards these goals takes place within a dynamic system. In reality, these paths are filled with variability, making the process generally unpredictable and influenced by many smaller factors. Since most real-world processes involve feedback loops and uncertainty, they can lead to complex behaviors. The field that studies these systems is known as Cybernetics. Over time, it has expanded into various areas across different disciplines. Nowadays, you rarely hear the term mentioned, and with this fragmentation, we've also lost a holistic, comprehensive view of these topics.

Every company, organization, or technical system worth building in these settings is an actor that exercises control over the trajectory leading to its goals. (In a general sense, this is true even for pure dashboards: their goal is to nudge their users' trajectories toward a more informed state.)

When humans create systems with complex behaviors, these systems usually follow certain patterns. We organize control systems in hierarchies.

These patterns may arise from the mathematics of the problems, as we'll soon explore, or from the most efficient methods of accomplishing tasks, and they are ultimately constrained by and reflect our own biology. (If octopuses were the dominant species, we'd see quite a different structure emerge!)

Deterministic vs. Stochastic processes

I have already written about stochastic processes; now let's consider deterministic processes.

If statistical noise leads to such small changes in the output that they don't matter, we're dealing with a deterministic process. These types of problems pop up in many areas of life and in high school or college-level physics. You'd typically solve these problems using difference or differential equations. In their simplest form, they look like this:

$$\begin{align*} y(t+1)-y(t)=f(y(t)) && \text{discrete time}\\ \dot{y}(t)=f(y(t)) && \text{continuous time} \end{align*}$$
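As a minimal sketch of the two forms above, the snippet below iterates the difference equation directly and approximates the differential equation with forward Euler steps. The function names and the relaxation example `relax` (a process pulled toward a target of 10) are assumptions chosen for illustration, not something from the original post.

```python
def iterate_difference(f, y0, steps):
    """Iterate y(t+1) = y(t) + f(y(t)) and return the whole trajectory."""
    ys = [y0]
    for _ in range(steps):
        ys.append(ys[-1] + f(ys[-1]))
    return ys


def euler_ode(f, y0, t_end, dt):
    """Approximate dy/dt = f(y) on [0, t_end] with forward Euler steps of size dt."""
    y = y0
    for _ in range(round(t_end / dt)):
        y += dt * f(y)
    return y


def relax(y):
    """Hypothetical dynamics: move toward a target value of 10."""
    return 0.5 * (10.0 - y)


trajectory = iterate_difference(relax, 0.0, 20)        # discrete time
y_continuous = euler_ode(relax, 0.0, 10.0, 0.001)      # continuous time, approximated
```

In both cases the same `f` drives the process; only the notion of a "step" changes.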

Ordinary deterministic processes are simpler to understand than stochastic processes. However, deterministic processes can only approximate their stochastic counterparts under certain conditions.

An interesting question is:

Under what conditions does statistical noise matter, and under what conditions doesn't it?

I claim there are two scenarios worth considering (for this practical discussion):

When the noise is so small in time or effect, compared to the process step or increment, that it is negligible. This is background noise: when it is completely negligible we get the fully deterministic case, and when it is small but nonzero it leads to a non-equilibrium stationary state.

On Chaos...

The Math of Reliability: stochastic processes

Roberto Lupi — Wed, 06 Mar 2024 07:03:41 GMT

Take a collection of random variables and organize them in some way, and you get a stochastic process or a random field.

We need to explore more than just basic statistics and discuss how sequences of random variables change over time and space. (This is another introductory post, so I can build upon these concepts later)

What does organizing mean? It involves assigning each variable an index from a set. If the index set consists of natural or real numbers (or can be mapped to these), we refer to it as a stochastic process. If the index set is higher-dimensional, it's called a random field.

The state space is the shared space from which all these random variables draw their values.

When we consider the product of index set and state space, there are several typical combinations:

discrete-time, discrete-state
discrete-time, continuous-state
continuous-time, discrete-state
continuous-time, continuous-state
(note: this list isn't complete! there are more options!)

The index set doesn't have to be time.

When we do talk about time, we should be really thinking about times in the plural. There are multiple domains of time, not just one. A lot of confusion and mistakes happen when different concepts of time are mixed without careful consideration.

Random variables aren't limited to single numbers. They can be vectors, matrices, functions, and much more. All they require is a method to measure things, which in mathematical terms is called a σ-algebra. A measurable space is actually defined by a pair (X, Σ), where X is a set and Σ is a σ-algebra on X.

The result of a stochastic process is a function that connects the index set with the state space. This is known by several names: as the realization, sample function, or, when it involves time, as a trajectory or sample path. The difference between two random variables (for example, two steps in the same sample path) is called an increment.

In my previous article, I talked about Bernoulli random variables. A Bernoulli process is a series of independent and identically distributed (i.i.d.) random variables with a state space of {0,1}, where there's a constant probability p over time. A Bernoulli process operates in discrete time and has discrete states.
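A minimal sketch of one realization of a Bernoulli process, assuming Python's standard `random` module as the source of randomness. The function name, the value p = 0.3, and the seed are illustrative choices.

```python
import random


def bernoulli_process(p, n, rng):
    """Return one realization: n i.i.d. draws from {0, 1} with P(1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


rng = random.Random(42)                    # fixed seed so the sketch is reproducible
path = bernoulli_process(0.3, 10_000, rng)

# The index set is {0, ..., n-1}, the state space is {0, 1}.
empirical_p = sum(path) / len(path)        # should land near p
```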

A random walk is a sum of random variables or vectors. Many ideas can be thought of as sums: you just need an associative way to combine things and a starting point (a zero or identity element). This structure is called a Monoid. If you've heard about Haskell, you might know about Monads, which apply similar concepts in programming. Monads can represent many things: from simple optional values and lists (like a series of changes over time), to changes in state, ongoing processes, and even parallel operations. Random walks are processes that operate in discrete time.
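To make the "associative combine plus identity element" idea concrete, here is a small sketch in Python. The helper name `mconcat` is borrowed from Haskell's vocabulary for illustration; the step values are made up.

```python
from functools import reduce


def mconcat(combine, identity, items):
    """Fold items with an associative operation, starting from its identity."""
    return reduce(combine, items, identity)


# Integers under addition: a one-dimensional walk.
steps = [1, -1, 1, 1, -1]
position = mconcat(lambda a, b: a + b, 0, steps)

# 2-D steps as tuples, combined component-wise, with (0, 0) as identity.
moves = [(1, 0), (0, 1), (-1, 0), (0, 1)]
position_2d = mconcat(lambda a, b: (a[0] + b[0], a[1] + b[1]), (0, 0), moves)
```

Because the operation is associative, the steps could be summed in chunks (for example, per shard or per time window) and the partial results combined afterwards.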

Random walks are typically described as the sum of iid random variables or vectors. However, we shouldn't restrict ourselves to fixed parameters. In reality, especially in incident response, transient behavior often arises from parameters or conditions that change over time.

The simplest form, a simple random walk, does have stationary increments and operates in discrete time and discrete space. It is based on Bernoulli trials that are mapped to increments of {-1, 1}, and the resulting state space includes all integers.
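A sketch of a simple random walk built from Bernoulli trials, assuming p = 0.5 and a fixed seed for reproducibility (both illustrative):

```python
import random
from itertools import accumulate


def simple_random_walk(n, p, rng):
    """Map n Bernoulli(p) trials to increments in {-1, +1} and sum them."""
    increments = [1 if rng.random() < p else -1 for _ in range(n)]
    return [0] + list(accumulate(increments))


walk = simple_random_walk(1000, 0.5, random.Random(7))
```

Every consecutive pair of positions differs by exactly one, and the walk only ever visits integers.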

A Wiener process is the continuous equivalent of a simple random walk: it operates in continuous time and has stationary, independent, and identically distributed increments that follow a normal distribution. This concept is widely used in areas like quantitative finance (e.g., Black-Scholes model) and physics (e.g., Brownian motion).
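A Wiener process cannot be simulated exactly on a computer, but it can be approximated on a time grid by summing Normal increments whose variance equals the step size. The sketch below assumes a standard Wiener process on [0, 1]; the grid size and number of paths are arbitrary.

```python
import math
import random


def wiener_path(t_end, n_steps, rng):
    """Approximate a standard Wiener process on a grid of n_steps intervals."""
    dt = t_end / n_steps
    w = [0.0]
    for _ in range(n_steps):
        # Each increment is Normal with mean 0 and variance dt.
        w.append(w[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return w


rng = random.Random(1)
endpoints = [wiener_path(1.0, 50, rng)[-1] for _ in range(2000)]
endpoint_mean = sum(endpoints) / len(endpoints)                      # near 0
endpoint_second_moment = sum(x * x for x in endpoints) / len(endpoints)  # near t_end = 1
```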

A Poisson process counts the random number of events up to some time. If the rate of events remains constant over time, it is called a homogeneous Poisson process. When the rate of events changes over time, the process is called nonhomogeneous.
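One common way to simulate a homogeneous Poisson process is to draw exponential gaps between events and accumulate them. The sketch below does that with `random.expovariate`; the rate of 2 events per unit time and the horizon of 1000 are illustrative values.

```python
import random


def poisson_arrivals(rate, t_end, rng):
    """Event times of a homogeneous Poisson process on [0, t_end]."""
    times = []
    t = rng.expovariate(rate)
    while t <= t_end:
        times.append(t)
        t += rng.expovariate(rate)   # next exponential gap
    return times


rng = random.Random(3)
arrivals = poisson_arrivals(2.0, 1000.0, rng)
count = len(arrivals)                # expected value is rate * t_end = 2000
```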

The Math of SRE Explained: SLI and SLOs

Roberto Lupi — Mon, 04 Mar 2024 12:00:00 GMT

Let me share what I've learned from 12 years of experience, defining and defending SLOs on various software and hardware systems, including TPU pods, HPC/GPU clusters, and general compute clusters. (This is just an introductory post, but I need to cover the basics before diving into more juicy topics)

SRE, by the book, defines SLI as "quantitative measure of some aspects of the level of a service that is provided" and SLO as "a target value or range of values for a service level that is measured by an SLI"; the SRE workbook warns that "It can be tempting to try to math your way out of these problems [...] Unless each of these dependencies and failure patterns is carefully enumerated and accounted for, any such calculations will be deceptive".

I agree with the last statement, but let's not shy away from the journey. Let's embrace math, and we'll uncover a rich and fascinating landscape.

Fantastic SLIs/SLOs and where to find them

Here is a bestiary of SLIs and SLOs commonly found in the wild, to lay the groundwork for what we'll be talking about.

SLIs for software services usually cover latency, which is the time it takes to respond to a request or finish a task, error rate, the ratio of failed requests to total requests received, and some measure of system throughput, like requests per second or GBps.
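As a rough illustration, the error-rate and throughput SLIs can be computed from a list of request records. The `Request` type and the sample data are hypothetical; real systems usually derive these from metrics or logs.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def error_rate(requests):
    """Ratio of failed requests to total requests received."""
    return sum(not r.ok for r in requests) / len(requests)


def throughput_rps(requests, window_seconds):
    """Requests per second over the observation window."""
    return len(requests) / window_seconds


# Hypothetical two-second window with four requests, one of which failed.
requests = [
    Request(12.0, True),
    Request(250.0, False),
    Request(30.0, True),
    Request(18.0, True),
]
window_error_rate = error_rate(requests)
window_throughput = throughput_rps(requests, 2.0)
```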

Not all requests are created equal. It's common to set different SLOs for different types or classes of requests. Various terms such as priority, criticality, or quality of service (QoS) are used in different situations, but they all serve the purpose of treating requests differently and managing load when a service is under stress.

Compute services, such as Kubernetes clusters or Borg cells, operate on a different time scale compared to web services, but they are not entirely different. Users care about how long it takes to schedule a job (or to receive an error), that scheduling failures are rare, and how efficiently compute resources are used compared to the theoretical zero-overhead scenario.

For both software services and compute resources, it's common to define availability.

In systems that show measurable progress, we focus on the portion of throughput that is truly effective. This concept of application-level throughput is known as goodput in networking, and the term has also been embraced by ML (at least at Google).

In storage systems, we also care about durability, which is the likelihood of accurately reading back what we wrote without any errors or silent corruption.

Given enough size, silent data corruption is also a problem that you care about in compute infrastructure.

SLIs are usually defined as a certain quantile over a sample or measurement, and these quantiles often fall at the tail of distributions, not in the center. This is an important point because it changes the mathematical rules of the game!

Perfection is costly. It is common to define an SLI as the latency of 95% or 99% of requests, and you rarely or never see max latency in SLO definitions. When it does appear, there are always escape clauses.
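A small sketch shows why the maximum is a poor SLI compared with a high percentile. It uses the nearest-rank definition of a percentile, which is one of several conventions; the latency data is made up, with one extreme outlier.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = list(range(1, 101))   # 1 ms .. 100 ms
latencies_ms[-1] = 5000              # a single pathological request

p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
worst = max(latencies_ms)
```

Here `p95` and `p99` barely notice the outlier, while `worst` is dominated by it.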

Most hyperscaler services and the business scenarios they support aim for between 3 (99.9%) and 5 (99.999%) nines of availability. However, many users can tolerate even less at the instance or region level, e.g. 99% availability, which allows for up to 7.2 hours of downtime per month.

There are users in the world who can't tolerate more than 30 milliseconds of server downtime per year. They require 99.9999999% availability, and there are systems capable of delivering these impressive numbers. However, this level of reliability is not typical for common computing infrastructure. (For those interested, the number above is the stated reliability of an IBM z16 mainframe.)
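The downtime figures quoted above follow from simple arithmetic: the allowed downtime is the unavailability multiplied by the length of the period. The sketch assumes a 30-day month and a 365-day year.

```python
def allowed_downtime_seconds(nines, period_seconds):
    """Downtime budget for an availability with the given number of nines."""
    return 10 ** -nines * period_seconds


MONTH_SECONDS = 30 * 24 * 3600      # assumed 30-day month
YEAR_SECONDS = 365 * 24 * 3600      # assumed 365-day year

# 99% availability (2 nines) over a month, in hours: about 7.2
monthly_99_hours = allowed_downtime_seconds(2, MONTH_SECONDS) / 3600

# 99.9999999% availability (9 nines) over a year, in milliseconds: about 31.5
yearly_nine_nines_ms = allowed_downtime_seconds(9, YEAR_SECONDS) * 1000
```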

A side trip into probability distribution and basic statistics

If you're already familiar with basic frequentist versus Bayesian statistics, distribution families, and the central limit theorem, feel free to skip this section.

If you're not familiar or just want a refresher, let's start with the basics. We'll discuss counting things, as we aim to measure how often a system fails.

A random variable is a variable with a value that is uncertain and determined by random events. In mathematical terms, it's not actually a variable, but a function that maps possible outcomes in a sample space to a measurable space (known as the support).

Depending on whether the possible outcomes are finitely many, countably infinite, or uncountably infinite (with a piecewise continuous density), we can define the expectation of a random variable X respectively as:

$$\begin{align*} E[X] &= x_1p_1+x_2p_2+\dots+x_kp_k\\ E[X] &= \sum_{i=1}^{\infty}x_ip_i \\ E[X] &= \int_{-\infty}^{\infty}xf(x)dx \end{align*}$$

where p_i is the probability of outcome x_i and f(x) is the probability density function. Its counterpart in the discrete case, pmf(x_i)=p_i, is called the probability mass function.
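The finite and continuous definitions can be sketched numerically. The discrete case is a weighted sum; the continuous case is approximated here with a midpoint rule on a bounded interval, which is an assumption that only works when the density is negligible outside that interval. The fair die and the Uniform(0, 2) density are illustrative examples.

```python
def expectation_discrete(values, probs):
    """E[X] = sum of x_i * p_i over finitely many outcomes."""
    return sum(x * p for x, p in zip(values, probs))


def expectation_continuous(pdf, lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of x * pdf(x) over [lo, hi]."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        total += x * pdf(x)
    return total * dx


die_mean = expectation_discrete(range(1, 7), [1 / 6] * 6)           # fair die: 3.5
uniform_mean = expectation_continuous(lambda x: 0.5, 0.0, 2.0)      # Uniform(0, 2): 1.0
```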

There are two main approaches to statistics.

Frequentist or classical statistics assigns probability to data, focusing on the frequency of events, and its tests yield results that are either true or false.
Bayesian statistics assigns probabilities to hypotheses, producing credible intervals and the probability that a hypothesis is true or false; it also incorporates prior knowledge into the analysis, updating it as more data becomes available.

Probability distributions aren't isolated; they are connected to each other in complex ways. They can be transformations, combinations, approximations (for example, from discrete to continuous), or compositions of each other, and so on...

The binary outcome {0, 1}, like available/broken, of independent events is a Bernoulli random variable (rv) with parameter p, which represents the probability of events. When we sample from a Bernoulli distribution, we get a set of values that can either be 0 or 1.

When counting events in discrete time, such as if we consider events that fall in a given second/minute/hour/month:

The time between positive events follows a Geometric distribution with the same parameter p;
The time between every second positive event follows a Negative Binomial distribution with the same success probability p as before and r=2, meaning we are focusing on every second positive event.
- A notable point is that the Negative Binomial random variable is the sum of Geometric random variables.
The number of events in a given interval or sample of size N follows a Binomial distribution with parameters p and N.
- For instance, the raw number of available TPU slices in a TPUv2 or v3 pod, or the number of DGX systems in a DGX Superpod, can be roughly estimated using this distribution... but what makes the difference in these systems is how you manage disruptions, planned and unplanned maintenance.
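The three discrete-time relationships in the list above can be checked with a quick simulation. This sketch counts trials up to and including each success, which is one of the two common conventions for the Geometric and Negative Binomial distributions; the values of p, N, and the seed are arbitrary.

```python
import random


def waiting_times(trials):
    """Number of trials between consecutive successes (Geometric)."""
    gaps, last = [], -1
    for i, hit in enumerate(trials):
        if hit:
            gaps.append(i - last)
            last = i
    return gaps


rng = random.Random(11)
p, n = 0.2, 200_000
trials = [rng.random() < p for _ in range(n)]      # Bernoulli(p) samples

gaps = waiting_times(trials)
mean_gap = sum(gaps) / len(gaps)                   # Geometric: near 1/p = 5

# Waiting time for every second success is the sum of two Geometric gaps
# (Negative Binomial with r = 2): mean near 2/p = 10.
pair_gaps = [gaps[i] + gaps[i + 1] for i in range(0, len(gaps) - 1, 2)]
mean_pair_gap = sum(pair_gaps) / len(pair_gaps)

# Successes per block of N trials (Binomial(N, p)): mean near N * p = 20.
N = 100
counts = [sum(trials[i:i + N]) for i in range(0, n, N)]
mean_count = sum(counts) / len(counts)
```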

When we start counting events in continuous time, we discover matching distributions with the same mathematical structure and relationships. These are the limit distributions of the previous ones when the interval sizes become infinitely small. For those who are particularly adventurous, let me mention that category theory offers an alternative and much broader explanation, as is often the case.

The time between independent events happening at a constant average rate and with a probability p follows an Exponential random variable with a rate parameter λ=1/p.
The time between every k-th event follows a Gamma random variable with parameters k and λ=1/p.
- The Gamma random variable is the sum of k exponential random variables, the same relationship as the negative binomial and geometric random variables above.
The number of events in a given time span of size N is a Poisson random variable with rate parameter λ=N/p.
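The same structure can be sketched in continuous time. In this simulation `rate` is treated as a free parameter (events per unit time) rather than derived from p, and the values of `rate`, `k`, and `span` are hypothetical. Under that assumption, the mean gap is 1/rate, the mean time to every k-th event is k/rate, and the mean count in a window is rate times the window length.

```python
import random
from itertools import accumulate

rng = random.Random(5)
rate, k, span = 0.5, 3, 10.0        # hypothetical values

# Exponential gaps between consecutive events.
gaps = [rng.expovariate(rate) for _ in range(60_000)]
mean_gap = sum(gaps) / len(gaps)                          # near 1/rate = 2

# Time between every k-th event: sum of k exponential gaps (Gamma).
kth_gaps = [sum(gaps[i:i + k]) for i in range(0, len(gaps), k)]
mean_kth_gap = sum(kth_gaps) / len(kth_gaps)              # near k/rate = 6

# Number of events per window of length `span` (Poisson).
times = list(accumulate(gaps))
n_windows = int(times[-1] // span)
counts = [0] * n_windows
for t in times:
    w = int(t // span)
    if w < n_windows:               # ignore the last, incomplete window
        counts[w] += 1
mean_count = sum(counts) / n_windows                      # near rate * span = 5
```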

We refer to a "sample" as a subset of values taken from a statistical population. A "sample statistic" is a value calculated from the sample. When we use a statistic to estimate a population parameter, we call it an "estimator".

For a sufficiently large number of events in Bernoulli samples, if we count the number of positive events and divide by the total number of events, the result will be close to p. As the number of events grows to infinity, the result converges to p, which is the "mean" of the Bernoulli distribution. This is known as the "law of large numbers": the average obtained from a large number of i.i.d. random samples converges to the true value if it exists.
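A quick simulation illustrates the law of large numbers for Bernoulli samples. The failure probability p = 0.01, the checkpoints, and the seed are illustrative.

```python
import random

rng = random.Random(2024)
p = 0.01
checkpoints = (100, 10_000, 1_000_000)

running_estimate = {}
hits = 0
for n in range(1, checkpoints[-1] + 1):
    hits += rng.random() < p          # one Bernoulli(p) sample
    if n in checkpoints:
        running_estimate[n] = hits / n
```

The estimate at 100 samples can be far off in relative terms, while the estimate at one million samples sits close to p.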

The average of many samples of a random variable, which has a finite mean and variance, becomes a random variable that follows a Normal distribution, under specific conditions. This is known as the "central limit theorem".
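The central limit theorem can be seen at work on a skewed distribution. This sketch averages Exponential(1) samples, which have mean 1 and standard deviation 1, and checks that the standardized averages behave roughly like a standard Normal. The sample sizes are arbitrary.

```python
import random
import statistics

rng = random.Random(8)
n = 100              # samples per average
repeats = 5000       # number of averages


def sample_mean(size):
    """Average of `size` draws from an Exponential distribution with rate 1."""
    return sum(rng.expovariate(1.0) for _ in range(size)) / size


means = [sample_mean(n) for _ in range(repeats)]

# Standardize: the average has mean 1 and standard deviation 1 / sqrt(n).
z = [(m - 1.0) * n ** 0.5 for m in means]
z_mean = statistics.fmean(z)                         # near 0
z_std = statistics.stdev(z)                          # near 1
within_one_sigma = sum(abs(v) < 1 for v in z) / len(z)   # near 0.68 for a Normal
```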

This is what most people recall from their introductory statistics courses.

The central limit theorem, need not apply... Extreme Value Theory

Let's talk about when the central limit theorem does not apply, as this is a common situation with the data that SREs deal with. Not knowing its limits is one reason why relying on what people remember from introductory statistics is discouraged (though it's not the only reason).

A popular way to normalize data is to compute the Z-score, defined as:

$$Z = {{x - \mu} \over {\sigma}}$$

where Z is known as the standard score, and it normalizes samples in terms of standard deviations (σ) and the mean (μ).
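A minimal sketch of the Z-score computation, using the population standard deviation from Python's `statistics` module. The latency values are made up.

```python
import statistics


def z_scores(samples):
    """Standardize samples: subtract the mean, divide by the standard deviation."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)     # population standard deviation
    return [(x - mu) / sigma for x in samples]


latencies_ms = [10.0, 12.0, 11.0, 13.0, 54.0]
scores = z_scores(latencies_ms)
```

After normalization the scores have mean 0 and standard deviation 1, and the 54 ms sample stands out at roughly two standard deviations above the mean.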

But don't attempt to analyze variability in error rates by normalizing error and request counts, and then calculating the ratio. The ratio of two independent Normal random variables with a mean of zero follows a Cauchy distribution, which has an undefined mean!
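A simulation makes the problem visible. The sketch draws ratios of independent standard Normal samples: the median is well behaved, but a handful of enormous values dominate any average, so the running mean does not settle down as more samples arrive. The sample size and seed are arbitrary.

```python
import random
import statistics

rng = random.Random(13)

# Ratio of two independent standard Normal draws: a Cauchy sample.
ratios = [rng.gauss(0, 1) / rng.gauss(0, 1) for _ in range(100_000)]

median_ratio = statistics.median(ratios)       # stable, near 0
largest = max(abs(r) for r in ratios)          # typically in the thousands or more

# Running means at increasing sample sizes; they need not converge.
running_means = [statistics.fmean(ratios[:n]) for n in (1_000, 10_000, 100_000)]
```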

There are two other crucial distributions that frequently appear in SLI and SLO, yet many SREs seem unaware of them. These distributions are connected to quantiles and significant deviations from the median of a probability distribution. Therefore, the field of statistics that examines them is known as Extreme Value Theory.

There are two primary methods for analyzing the tails of distributions.

You can calculate the maximum or minimum values over a specific period or block, and then analyze the distribution of these values (for example, the highest values of each year based on the highest values of each month). This method helps you understand the distribution of the minimum or maximum values from very large sets of independent and identically distributed random variables. The distribution you find is often a form of the Generalized Extreme Value Distribution. Here we consider the maxima or minima and then compute the distribution they follow given an arbitrarily large or infinite amount of time. It's a good way to estimate the worst-case scenario for a steady-state process.
You can count how many times peak values go above or below a certain threshold within any period. There are two distributions to look at: the number of events in a specific period (like a Poisson distribution mentioned earlier), and how much the values exceed the threshold, which usually follows a generalized Pareto distribution. Here we consider the frequency of violations and how severe they can be.
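The data preparation for both methods can be sketched with the standard library alone. The lognormal latency model, the block size, and the 99th-percentile threshold are assumptions for illustration. Fitting a Generalized Extreme Value or generalized Pareto distribution to the results would be a separate step, for which a statistics library such as SciPy is normally used.

```python
import random


def block_maxima(samples, block_size):
    """Maximum of each complete block of consecutive samples."""
    last_start = len(samples) - block_size
    return [max(samples[i:i + block_size]) for i in range(0, last_start + 1, block_size)]


def peaks_over_threshold(samples, threshold):
    """Excess over the threshold for every sample that exceeds it."""
    return [x - threshold for x in samples if x > threshold]


rng = random.Random(21)
latencies_ms = [rng.lognormvariate(3.0, 0.5) for _ in range(10_000)]   # hypothetical data

# Method 1: block maxima (input for a Generalized Extreme Value fit).
maxima = block_maxima(latencies_ms, 1_000)

# Method 2: peaks over threshold (input for a generalized Pareto fit),
# using the empirical 99th percentile as the threshold.
threshold = sorted(latencies_ms)[len(latencies_ms) * 99 // 100]
excesses = peaks_over_threshold(latencies_ms, threshold)
violation_frequency = len(excesses) / len(latencies_ms)
```

The first list answers "how bad is the worst value per block", while the second answers "how often is the threshold violated, and by how much".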

OKR planning as belief revision

Roberto Lupi — Thu, 29 Feb 2024 08:22:33 GMT

Objectives and Key Results (OKR) is a well-known framework used to set and track an organization's goals. Leaders establish top-level OKRs, which then flow through the organization. Along the way, more details are added: these strategic goals drive alignment but also adjust according to tactical needs and constraints. Progress is monitored, and any necessary changes are made throughout the organization.

Large organizations are structured in complex hierarchies because this setup helps with planning and execution, compared to other options.

The concept of belief revision has been explored in various fields for a long time. In 1982, Judea Pearl introduced belief propagation, an algorithm designed for making predictions using graphical models like Bayesian networks or Markov random fields. When applied to a tree-shaped model, this algorithm finishes in just two complete passes. On more complex graphs, an approximate version of belief propagation can be used, though it requires more steps, which is costly when implemented at the speed dictated by human processes.
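To show what "two passes on a tree" means, here is a toy sum-product sketch on a three-node chain of binary variables, which is the simplest tree. The node names, potentials, and helper functions are all hypothetical; the brute-force function is included only to check the message-passing result.

```python
from itertools import product

# Hypothetical chain A - B - C of binary variables.
order = ["A", "B", "C"]
unary = {"A": [0.7, 0.3], "B": [0.5, 0.5], "C": [0.2, 0.8]}   # local evidence
pair = [[0.9, 0.1], [0.1, 0.9]]   # symmetric coupling on both edges, favors agreement


def normalize(v):
    total = sum(v)
    return [x / total for x in v]


def send(incoming, node):
    """Message from `node` to its next neighbor, given the message from its other side."""
    return [sum(unary[node][x] * incoming[x] * pair[x][y] for x in range(2))
            for y in range(2)]


# Pass 1: messages flow left to right (A -> B -> C).
fwd = {"A": [1.0, 1.0]}
for prev, cur in zip(order, order[1:]):
    fwd[cur] = send(fwd[prev], prev)

# Pass 2: messages flow right to left (C -> B -> A).
rev = order[::-1]
bwd = {"C": [1.0, 1.0]}
for nxt, cur in zip(rev, rev[1:]):
    bwd[cur] = send(bwd[nxt], nxt)

# Each marginal combines local evidence with the messages from both sides.
marginals = {n: normalize([unary[n][x] * fwd[n][x] * bwd[n][x] for x in range(2)])
             for n in order}


def brute_force_marginal(node):
    """Reference answer: enumerate all 8 joint assignments."""
    idx = order.index(node)
    totals = [0.0, 0.0]
    for a, b, c in product(range(2), repeat=3):
        weight = (unary["A"][a] * unary["B"][b] * unary["C"][c]
                  * pair[a][b] * pair[b][c])
        totals[(a, b, c)[idx]] += weight
    return normalize(totals)
```

After the two passes every node holds its exact marginal; on graphs with loops the same message updates would have to be iterated and would only be approximate.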

AI and LLMs will make these processes faster and lower coordination costs.

The key question is whether LLMs will lead to a tipping point, and the emergence of new organizational structures and coordination dynamics that adapt to change more efficiently than our current ones.

There are many other functions and factors that stabilize bureaucracies, but they are ultimately bound by what they achieve. The often unstated purpose of large corporations is not to compete in the market and make profits; it is to grow beyond that game and "engulf enough of the world to shield themselves from uncertainty" (paraphrasing John Kenneth Galbraith and Donella Meadows's Leverage Points). When they are successful for a long time, they can lose the ingredients and factors that made them good at the innovation game in their earlier history. Increasing profits is a necessary condition to continue to play. Inefficiency and slow speed are a moral hazard and mortal sin for a corporation that faces market shakeups and loses mindshare even before market share.