<![CDATA[Roberto Lupi]]>https://rlupi.comRSS for NodeThu, 12 Sep 2024 13:28:15 GMT60<![CDATA[The Lupi Program]]>https://rlupi.com/the-lupi-programhttps://rlupi.com/the-lupi-programSun, 14 Jul 2024 09:10:29 GMT<![CDATA[<div data-node-type="callout"><div data-node-type="callout-emoji">🤯</div><div data-node-type="callout-text">I am taking time off work to deal with burnout. I have obsessed over a few ideas in the past months, and I need to get them out of my brain so I can start to heal. I didn't clean up or AI-correct this post on purpose. What I describe in this article requires some understanding of abstract math and how large corporations work; I don't expect it to be accessible or understandable. I wrote it out mostly to swap it out of my mind, and to have a benchmark against which to compare my intuitions now and later (to see how much of it was false beliefs and connections caused by excessive stress). Disclosure: I work for Google. I never discussed my conjectures at work, due to the situation that led to this burnout. My conjectures are not, and are not based on, confidential business information of my employer.</div></div><blockquote><p><strong>Endgame</strong>: I believe we can enormously increase the efficiency of large organizations and reduce the waste of hundreds of thousands of employee-years spent in coordination, to the point that CxOs and VPs can get answers to what-if questions in a matter of hours, not months.</p></blockquote><p>I am not blind to the <a target="_blank" href="https://www.cambridge.org/core/books/social-mechanisms/F54BB7A4A77F7308D5FEA7D9C0EAD086">terrifying social implications</a> of treating organizations as mechanisms, but unfortunately I think we're headed there in the not so distant future.
It's better to think about them in advance and consciously, rather than proceed unaware and consider risks post-hoc.</p><p>I think it's worth pursuing and I'd like to research it, so I use the term <strong>Lupi Program</strong> for my research project in honor of the famous <a target="_blank" href="https://en.wikipedia.org/wiki/Langlands_program">Langlands program</a> in mathematics. I do not have any pretension to compare myself to Langlands, but I want to highlight that what follows is the outline of a multi-year research program with long-range links between disciplines, and not a ready-made solution that can be implemented tomorrow.</p><p>In a few bullet points:</p><ul><li><p>"<a target="_blank" href="https://sre.google/">SRE is what you get when you treat operations as if it's a software problem</a>." ==> My Program is what you get when you treat companies as if they're a software problem.</p></li><li><p>Companies are complex hierarchical systems that achieve goals.</p></li><li><p>They are computational systems. We can "program" and describe their properties. That's what we do, in a very imprecise way, using diagrams and docs and tons of meetings... but these tools are imprecise, and lead to constant misalignments and very expensive errors.</p></li><li><p>But we could do much better! If we bake into a flexible formalism some basic properties that come from abstract algebra (abstraction, structure determines semantics and dynamics) and physics (symmetries and conservation laws), we can solve the misalignment problems at least for the technical parts of orgs.</p></li><li><p>What is different now? AI can turn a step paradigm shift into a smooth slope of complexity, so that it becomes progressively more precise and quantifiable.
Apply it as a recursive process, keep the rest of the company aligned, and it turns a lot of separate islands into a monorepo of up-to-date company behavior, a full organization turned into a differentiable program parametrized by its KPIs or SLOs.</p></li></ul><p><em>At this level of abstraction,</em> with the right formalism and approach, it's possible to connect the dots (think of it as higher-order functional programming for an organization's value streams and dynamics):</p><ul><li><p>define and align the value streams, in a formal way that ensures hierarchical composition and prevents unaccountable waste products;</p></li><li><p>that can be mechanically (algebraically) applied to all business phases;</p></li><li><p>using a method that is flexible enough not to hinder discussion, with tuneable complexity and formalism...</p></li><li><p>...at its simplest, it will require only pen & paper and a napkin (much in the same way a <a target="_blank" href="https://en.wikipedia.org/wiki/Feynman_diagram">Feynman diagram</a> encodes a very complex differential equation in a drawing)...</p></li><li><p>...but can be smoothly (in simple cases, automatically with AI; in more complex ones, with CoPilot-style assistants) transformed into full formal specifications.</p></li></ul><p>It will enable:</p><ul><li><p>formal proofs (e.g.
<a target="_blank" href="https://learntla.com/intro/conceptual-overview.html">TLA+</a>, <a target="_blank" href="https://en.wikipedia.org/wiki/Metric_interval_temporal_logic">MITL</a>, <a target="_blank" href="https://web.stanford.edu/class/cs357/cegar.pdf">CEGAR</a>) that avoid design errors as business processes evolve</p></li><li><p>probabilistic conformance checking</p></li><li><p>coarse simulations, including continuous qualitative or quantitative estimations of connected risks, as the current situation or what-if scenarios are updated</p></li><li><p>traceability, observability, and debugging of the whole objectives graph of a company</p></li></ul><p>High-level business thinking and leaders are hobbled by the same "<a target="_blank" href="https://www.youtube.com/watch?v=BJsEygxdl98&t=1246s">one-mind-barrier</a>" problem that plagues mathematics: the very few at the top only have a vague idea of what is going on, so they have to test their hypotheses for business change and what-if scenarios through a painful and long process of indirection.</p><p>While we can't eliminate that, we can surely short-circuit it.
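The earlier bullet about value streams that compose hierarchically without "unaccountable waste products" can be made concrete with a toy sketch. Everything below is my own illustrative construction (the <code>Outcome</code> type, the <code>lossy</code> steps, and the 0.9/0.8 efficiencies are made up): each step is a typed function, composition threads value through while accumulating waste, and a conservation check catches any step that drops value silently.

```python
from dataclasses import dataclass

# Illustrative only: a value-stream step as a typed function that must
# account for its waste; composition accumulates waste instead of losing it.

@dataclass(frozen=True)
class Outcome:
    value: float  # value passed downstream
    waste: float  # waste that stays on the books

def lossy(efficiency):
    """A step that turns a fraction of its input into value, the rest into waste."""
    return lambda v: Outcome(v * efficiency, v * (1 - efficiency))

def compose(*steps):
    """Sequential composition: value flows through, waste accumulates."""
    def pipeline(v):
        total_waste = 0.0
        for step in steps:
            out = step(v)
            v, total_waste = out.value, total_waste + out.waste
        return Outcome(v, total_waste)
    return pipeline

stream = compose(lossy(0.9), lossy(0.8))  # made-up efficiencies
out = stream(100.0)
# Conservation law: nothing enters or leaves unaccounted.
assert abs((out.value + out.waste) - 100.0) < 1e-9
```

The point is not the arithmetic but the shape: because every step's signature forces a waste term, composition can be checked mechanically, which is the kind of property a richer formalism would prove rather than merely assert at runtime.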
We should learn from AI: solving a simpler and more general problem is often more fruitful than solving many tiny detailed instances.</p><p>It requires a long detour into abstract mathematics and many parts of computer science and physics, but in the end much of the complexity can be removed (very few write programs in <a target="_blank" href="https://lean-lang.org/">Lean 4</a>, but many do write in <a target="_blank" href="https://github.blog/2023-08-30-why-rust-is-the-most-admired-language-among-developers/">Rust</a>):</p><ul><li><p>the dynamics of complex organizations depend on the interactions among their parts, not on the low-level details (<a target="_blank" href="https://ar5iv.labs.arxiv.org/html/2402.09090">Software in the natural world</a>, <a target="_blank" href="https://arxiv.org/abs/cond-mat/9907176">Computational mechanics</a>): in a crude parallel to SRE terms, horizontal coarse-grainings are SLIs and vertical coarse-grainings are SLOs.</p></li><li><p>mathematical proofs, mathematical spaces, and dependent type systems are deeply connected, even equivalent (<a target="_blank" href="https://homotopytypetheory.org/book/">HoTT</a>, <a target="_blank" href="https://1lab.dev/index.html">cubical type theory</a>)</p></li><li><p>The mathematical structure of organizations (space) and their evolution (time) can help us understand which disciplines of math can be used to describe their stochastic behavior, or if this is possible at all (<a target="_blank" href="https://www.deepuncertainty.org/">DMDU</a>, differential geometry, stochastic physics and QFT, deterministic chaos, complexity, actuarial sciences, agent-based dynamical systems are all inspirations and tools we can look at);</p></li><li><p>we should take a <a target="_blank" href="https://castle.princeton.edu/sda/">unified look at how companies and systems make decisions</a>, and study them with the lens of category theory, geometry, and topology, to understand the invariants and the
commonality;</p></li><li><p>these ideas are not new, but we haven't put them together in a coherent way in a business setting, nor could we automate choice and implementation until AI progressed to the current point;</p></li><li><p>Our own language and concepts (written or visual) are of the same nature:</p></li><li><p>my conjecture is that we can find analogous mathematical spaces if we train LLMs (or a new class of them) on these problems;</p></li><li><p>that we can find smooth transformations (categorical sheaves) from fuzzy representation (<a target="_blank" href="https://arxiv.org/pdf/2307.08891">it can be done at the meta-level</a>) on a napkin to dependent typed formal languages (using the properties we want in these processes as desirable "boundary conditions").</p></li></ul>]]><![CDATA[The Math of Reliability: Control Hierarchies]]>https://rlupi.com/the-math-of-sre-explained-control-hierarchieshttps://rlupi.com/the-math-of-sre-explained-control-hierarchiesThu, 07 Mar 2024 18:06:56 GMT<![CDATA[<p>SREs are software engineers who primarily focus on software systems.
These systems are adaptable and quick to respond to changes.</p><p><em>Cover Image:</em> <a target="_blank" href="https://en.wikipedia.org/wiki/Hierarchical_control_system#/media/File:Functional_levels_of_a_Distributed_Control_System.svg"><em>Functional levels of a Distributed Control System</em></a><em> (DCS) by Daniele Pugliesi, CC BY-SA 3.0</em> <a target="_blank" href="https://creativecommons.org/licenses/by-sa/3.0"><em>https://creativecommons.org/licenses/by-sa/3.0</em></a><em>, via Wikimedia Commons</em></p><p>Starting with this post, I'll move away from SRE and share personal insights drawn more from my education (SWE+EE, "Ingegneria Informatica e dell'Automazione", CS and Automation Engineering), personal interests, and my first decade of work (2000-2012), which included experiences as a consultant, tech lead, or CTO in smaller companies.</p><p>Every company, organization, technical system, or individual has specific goals they aim to achieve. Their journey towards these goals takes place within a dynamic system. In reality, these paths are filled with variability, making the process generally unpredictable and influenced by many smaller factors. Since most real-world processes involve feedback loops and uncertainty, they can lead to complex behaviors. The field that studies these systems is known as <a target="_blank" href="https://en.wikipedia.org/wiki/Cybernetics">Cybernetics</a>. Over time, it has expanded into various areas across different disciplines. Nowadays, you rarely hear the term mentioned, and with this fragmentation, we've also lost a holistic, comprehensive view of these topics.</p><p>Every company, organization, or technical system worth building in these settings is an <strong>actor</strong> that exercises control over the trajectory leading to its goals. (In a general sense: this is true even for pure dashboards...
their goal is to nudge their users' trajectories toward a more informed state.)</p><p>When humans create systems with complex behaviors, these systems usually follow certain patterns. We organize <a target="_blank" href="https://en.wikipedia.org/wiki/Hierarchical_control_system">control systems</a> in <strong>hierarchies</strong>.</p><p>These patterns may arise from the mathematics of the problems, as we'll soon explore, or from the most efficient methods of accomplishing tasks, and they are ultimately constrained by and reflect our own biology. (If octopuses were the dominant species, we'd see quite a different structure emerge!)</p><h1 id="heading-deterministic-vs-stochastic-processes">Deterministic vs. Stochastic processes</h1><p>I wrote about <a target="_blank" href="https://rlupi.com/the-math-of-sre-explained-stochastic-processes">stochastic processes</a>; now let's consider deterministic processes.</p><p>If statistical noise leads to such small changes in the output that they don't matter, we're dealing with a <strong>deterministic process</strong>. These types of problems pop up in many areas of life and in high school or college-level physics. You'd typically solve these problems using <strong>difference</strong> or <strong>differential equations</strong>. In their simplest form, they look like this:</p><p>$$\begin{align*} y(t+1)-y(t)=f(y(t)) && \text{discrete time}\\ \dot{y}(t)=f(y(t)) && \text{continuous time} \end{align*}$$</p><p>Ordinary deterministic processes are simpler to understand than stochastic processes.
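To make the discrete-time form above concrete, here is a minimal Python sketch; the choice f(y) = r(K − y), a relaxation toward an equilibrium K, is mine and purely illustrative.

```python
# Iterate the discrete-time update y(t+1) = y(t) + f(y(t)).

def simulate(f, y0, steps):
    """Return the trajectory [y(0), y(1), ..., y(steps)]."""
    ys = [y0]
    for _ in range(steps):
        ys.append(ys[-1] + f(ys[-1]))
    return ys

# f(y) = r * (K - y): each step closes half the gap to the equilibrium K = 10.
traj = simulate(lambda y: 0.5 * (10.0 - y), y0=0.0, steps=20)
```

Running it twice with the same inputs yields the same trajectory, which is exactly what "deterministic" means here: no noise term, so the path is fully fixed by f and y(0).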
However, deterministic processes can only approximate their stochastic counterparts under certain conditions.</p><p>An interesting question is:</p><blockquote><p>Under what conditions does statistical noise matter, and when doesn't it?</p></blockquote><p>I claim there are two scenarios worth considering (for this practical discussion):</p><ul><li><p>When noise is <strong>so small in time or effect compared to the process step or increment</strong> that it is negligible. This is <strong>background noise</strong>: when completely negligible, it gives us a fully deterministic case; when small but nonzero, it leads to a non-equilibrium stationary state.</p> <details><summary>On Chaos...</summary><div data-type="detailsContent">Is this really true? No, but that's what most people think intuitively. I'll expand on the ill effects of this assumption in the near future.</div></details></li><li>Noise can be significant in duration yet stable enough in magnitude that it becomes a constant or predictable factor over the short term. Such a trend can live in various domains, not just time or space: for example, seasonal changes consistent enough to be constant in the frequency domain. The impact of this noise leads to <strong>local linear or bounded effects</strong> and doesn't significantly affect the variability of <strong>trajectories</strong>, but just introduces more parallel modes. It also leads to <strong>regime switching</strong> when these conditions shift.</li></ul><p>A more interesting one is:</p><blockquote><p>When do these conditions stop being true?</p></blockquote><p>I'll explore these two cases in detail in follow-up articles.</p><h1 id="heading-control-hierarchies">Control hierarchies</h1><p>To reach our goals, we need systems that are simple(r) to understand, so we can design and manage them.
As we've seen, noise isn't a fixed idea: <strong>the definition of noise varies depending on the stochastic process we're considering</strong>.</p><p>We organize control for both social and technical systems in hierarchies.</p><p>Evolutionary forces and principled design choices shape control systems so they adapt to the processes and subprocesses they control, to eliminate as much as possible sources of noise.</p><p>If we consider one layer as a reference point:</p><ul><li><p>Higher layers of the control hierarchy operate on longer intervals or larger populations, to minimize background noise.</p></li><li><p>Lower layers operate on smaller space or time scales, where the nearly-constant effect of variations makes it possible to ignore them or parametrize them away.</p></li></ul>]]>https://cdn.hashnode.com/res/hashnode/image/upload/v1709879707046/a38375dc-5491-489a-9e80-6abb0fd7ca28.png<![CDATA[The Math of Reliability: stochastic processes]]>https://rlupi.com/the-math-of-sre-explained-stochastic-processeshttps://rlupi.com/the-math-of-sre-explained-stochastic-processesWed, 06 Mar 2024 07:03:41 GMT<![CDATA[<p>Take a collection of random variables and organize them in some way, and you get a <strong>stochastic process</strong> or a <strong>random field.</strong></p><p>We need to explore more than just <a target="_blank" href="https://rlupi.com/the-math-of-sli-and-slos-explained">basic statistics</a> and discuss how sequences of random variables change over time and space.
(This is another introductory post, so I can build upon these concepts later)</p><p>What does organizing mean? It involves assigning each variable an index from a set. If the index set consists of natural or real numbers (or can be mapped to these), we refer to it as a <strong>stochastic process</strong>. If the index set is higher-dimensional, it's called a <strong>random field.</strong></p><p>The <strong>state space</strong> is the shared space from which all these random variables draw their values.</p><p>When we consider the product of index set and state space, there are several typical combinations:</p><ul><li><p>discrete-time, discrete-state</p></li><li><p>discrete-time, continuous-state</p></li><li><p>continuous-time, discrete-state</p></li><li><p>continuous-time, continuous-state</p></li><li><p><em>(note: this list isn't complete! there are more options!)</em></p></li></ul><p>The index set doesn't have to be time.</p><p>When we do talk about time, we should be really thinking about <em>times</em> in the plural. There are multiple domains of time, not just one. A lot of confusion and mistakes happen when different concepts of time are mixed without careful consideration.</p><p>Random variables aren't limited to single numbers. They can be vectors, matrices, functions, and much more. All they require is a method to measure things, which, in mathematical terms, is called a <a target="_blank" href="https://en.wikipedia.org/wiki/%CE%A3-algebra">σ-algebra</a>. A measurable space is actually defined by a pair (X, Σ), where X is a set and Σ is a σ-algebra.</p><p>The result of a stochastic process is a function that connects the index set with the state space. This is known by several names: as the <strong>realization</strong>, sample function, or, when it involves time, as a <strong>trajectory</strong> or <strong>sample path</strong>.
The difference between two random variables (for example, two steps in the same sample path) is called an <strong>increment</strong>.</p><p>In my previous article, I talked about Bernoulli random variables. A <strong>Bernoulli process</strong> is a series of independent and identically distributed (i.i.d.) random variables with a <em>state space</em> of {0,1}, where there's a constant probability <em>p</em> over time. A Bernoulli process operates in discrete time and has discrete states.</p><p>A <strong>random walk</strong> is a <strong>sum</strong> of random variables or vectors. Many ideas can be thought of as sums: you just need an associative way to combine things and a starting point (a zero or identity element); this is called a <a target="_blank" href="https://en.wikipedia.org/wiki/Monoid">Monoid</a>. If you've heard about Haskell, you might know about <a target="_blank" href="https://en.wikipedia.org/wiki/Monad_(functional_programming)">Monads</a>, which apply similar concepts in programming. Monads can represent many things: from simple optional values and lists (like a series of changes over time), to changes in state, ongoing processes, and even parallel operations. Random walks are processes that operate in discrete time.</p><p>Random walks are typically described as the sum of <strong>i.i.d.</strong> random variables or vectors. However, we shouldn't restrict ourselves to fixed parameters. In reality, especially in incident response, transient behavior often arises from parameters or conditions that change over time.</p><p>The simplest form, a simple random walk, is indeed stationary discrete-time, discrete-space. 
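</p><p>A simple random walk of this kind is easy to simulate; here is a minimal sketch (function and parameter names are my own), where the monoid is just integer addition with identity 0:</p>

```python
import random

def simple_random_walk(n_steps, p=0.5, seed=0):
    """Simulate a simple random walk: partial sums of ±1 increments.

    Each step is a Bernoulli(p) trial mapped to {+1, -1}.
    Illustrative sketch; not from the original article.
    """
    rng = random.Random(seed)
    position = 0  # identity element of the (integers, +) monoid
    path = [position]
    for _ in range(n_steps):
        increment = 1 if rng.random() < p else -1
        position += increment
        path.append(position)
    return path

path = simple_random_walk(1000)
print("first steps:", path[:10], "final position:", path[-1])
```

<p>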
It is based on Bernoulli trials that are mapped to increments of {-1, 1}, and the resulting state space includes all integers.</p><p>A <strong>Wiener process</strong> is the continuous equivalent of a simple random walk: it operates in continuous time and has stationary, independent, and identically distributed increments that follow a normal distribution. This concept is widely used in areas like quantitative finance (e.g., Black-Scholes model) and physics (e.g., Brownian motion).</p><p>A <strong>Poisson process</strong> counts the random number of events up to some time. If the rate of events remains constant over time, it is called a homogeneous Poisson process. When the probability of events changes over time, it is called nonhomogeneous.</p>]]><![CDATA[The Math of SRE Explained: SLI and SLOs]]>https://rlupi.com/the-math-of-sli-and-slos-explainedhttps://rlupi.com/the-math-of-sli-and-slos-explainedMon, 04 Mar 2024 12:00:00 GMT<![CDATA[<p>Let me share what I've learned from 12 years of experience, defining and defending SLOs on various software and hardware systems, including TPU pods, HPC/GPU clusters, and general compute clusters. 
(This is just an introductory post, but I need to cover the basics before diving into more juicy topics)</p><p>SRE, by <a target="_blank" href="https://sre.google/sre-book/service-level-objectives/">the book</a>, defines an SLI as a "quantitative measure of some aspects of the level of a service that is provided" and an SLO as "a target value or range of values for a service level that is measured by an SLI"; the <a target="_blank" href="https://sre.google/workbook/implementing-slos/">SRE workbook</a> warns that "It can be tempting to try to math your way out of these problems [...] Unless each of these dependencies and failure patterns is carefully enumerated and accounted for, any such calculations will be deceptive".</p><p>I agree with the last statement, but let's not shy away from the journey. Let's embrace math, and we'll uncover a rich and fascinating landscape.</p><h1 id="heading-fantastic-slisslos-and-where-to-find-them">Fantastic SLIs/SLOs and where to find them</h1><p>Here is a bestiary of SLIs and SLOs commonly found in the wild, to lay the groundwork for what we'll be talking about.</p><p>SLIs for software services usually cover <em>latency</em>, which is the time it takes to respond to a request or finish a task, <em>error rate</em>, the ratio of failed requests to total requests received, and some measure of <em>system throughput</em>, like requests per second or GBps.</p><p>Not all requests are created equal. It's common to set different SLOs for different types or classes of requests. Various terms such as priority, criticality, or quality of service (QoS) are used in different situations, but they all serve the purpose of treating requests differently and managing load when a service is under stress.</p><p>Compute services, such as Kubernetes clusters or Borg cells, operate on a different time scale compared to web services, but they are not entirely different. 
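</p><p>The latency and error-rate SLIs above can be computed directly from request records; a small sketch (the nearest-rank quantile method and the sample data are my own choices, not from any particular monitoring system):</p>

```python
import math

def latency_quantile(latencies_ms, q):
    """Nearest-rank empirical quantile: smallest value covering a q fraction."""
    s = sorted(latencies_ms)
    idx = max(0, math.ceil(q * len(s)) - 1)
    return s[idx]

def error_rate(total, failed):
    """Failed over total requests; 0 for an empty window."""
    return failed / total if total else 0.0

# made-up request latencies in milliseconds (hypothetical data)
requests = [12, 15, 11, 250, 14, 13, 900, 16, 12, 14]
print("p95 latency (ms):", latency_quantile(requests, 0.95))
print("error rate:", error_rate(total=1000, failed=7))
```

<p>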
Users care about how long it takes to schedule a job (or to receive an error), how rare scheduling failures are, and how efficiently compute resources are used compared to the theoretical zero-overhead scenario.</p><p>For both software services and compute resources, it's common to define <em>availability</em>.</p><p>In systems that show measurable progress, we focus on the portion of throughput that is truly effective. This concept of application-level throughput is known as <em>goodput</em> in networking, and the term has also been embraced by ML (<a target="_blank" href="https://github.com/google/cloud_tpu_goodput">at least at Google</a>).</p><p>In storage systems, we also care about <em>durability</em>, which is the likelihood of accurately reading back what we wrote without any errors or silent corruption.</p><p>At sufficient scale, silent data corruption is also a problem that you care about in compute infrastructure.</p><p><strong>SLIs are usually defined as a certain quantile over a sample or measurement</strong>, and these quantiles often fall at the tail of distributions, not in the center. This is an important point because it changes the mathematical rules of the game!</p><p>Perfection is costly. It is common to define an SLI as the <em>latency</em> of the 95th or 99th percentile of requests, and you rarely or never see max latency in SLO definitions. When it does appear, there are always escape clauses.</p><p>Most hyperscaler services and the business scenarios they support aim for between 3 (99.9%) and 5 (99.999%) nines of availability. However, many users can <a target="_blank" href="https://cloud.google.com/compute/sla">tolerate</a> even <a target="_blank" href="https://aws.amazon.com/compute/sla/?nc1=h_ls">less</a> at the instance- or region-level, e.g. with 99% availability allowing for up to 7.2 hours of downtime per month.</p><p>There are users in the world who can't tolerate more than 30 milliseconds of server downtime per year. 
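</p><p>These downtime allowances are simple arithmetic on the availability target; a quick sketch (the helper and period lengths are my own, using a 30-day month and a Julian year):</p>

```python
def downtime_budget_ms(availability, period_hours):
    """Allowed downtime in milliseconds for an availability target over a period."""
    return (1.0 - availability) * period_hours * 3600 * 1000

hours_per_month = 30 * 24      # a 30-day month
hours_per_year = 365.25 * 24   # a Julian year

print(downtime_budget_ms(0.99, hours_per_month) / 3_600_000,
      "hours/month at 99%")                       # ≈ 7.2 hours
print(downtime_budget_ms(0.999999999, hours_per_year),
      "ms/year at nine nines")                    # ≈ 31.6 ms
```

<p>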
They require 99.9999999% availability, and there are systems capable of delivering these impressive numbers. However, this level of reliability is not typical for common computing infrastructure. (For those interested, the number above is the stated reliability of an IBM z16 mainframe.)</p><h1 id="heading-a-side-trip-into-probability-distribution-and-basic-statistics">A side trip into probability distribution and basic statistics</h1><p>If you're already familiar with basic frequentist versus Bayesian statistics, distribution families, and the central limit theorem, feel free to skip this section.</p><p>If you're not familiar or just want a refresher, let's start with the basics. We'll discuss <strong>counting things</strong>, as we aim to measure how often a system fails.</p><p>A random variable is a variable with a value that is uncertain and determined by random events. In mathematical terms, it's not actually a variable, but a function that maps possible outcomes in a sample space to a measurable space (known as the support).</p><p>So, when the possible outcomes are finite, countably infinite, or uncountably infinite (with a piecewise continuous density), we can define the expectation of a random variable X, respectively, as:</p><p>$$\begin{align*} E[X] &= x_1p_1+x_2p_2+\dots+x_kp_k\\ E[X] &= \sum_{i=1}^{\infty}x_ip_i \\ E[X] &= \int_{-\infty}^{\infty}xf(x)dx \end{align*}$$</p><p>where p_i is the probability of outcome x_i and f(x) is the probability density function (the corresponding name in the discrete case i.e. 
pmf(x_i)=p_i is called the probability mass function).</p><p>There are two main approaches to statistics.</p><ul><li><p>Frequentist (classical) statistics assigns probabilities to data, focusing on the frequency of events; its tests yield results that are either true or false.</p></li><li><p>Bayesian statistics assigns probabilities to hypotheses, producing credible intervals and the probability that a hypothesis is true or false; it also incorporates prior knowledge into the analysis, updating it as more data becomes available.</p></li></ul><p>Probability distributions aren't isolated; they are connected to each other <a target="_blank" href="https://en.wikipedia.org/wiki/Relationships_among_probability_distributions">in complex ways</a>. They can be transformations, combinations, approximations (for example, from discrete to continuous), or compositions of each other, and so on...</p><p>The binary outcome {0, 1}, like available/broken, of independent events is a <em>Bernoulli</em> random variable (rv) with parameter <em>p</em>, which represents the probability of the positive outcome. 
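</p><p>To make this concrete, here is a tiny sketch (my own illustration) that samples a Bernoulli variable and compares the empirical mean with the expectation E[X] = p:</p>

```python
import random

def bernoulli_sample(p, n, seed=42):
    """Draw n i.i.d. Bernoulli(p) outcomes as 0/1 values (illustrative sketch)."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

# e.g. p as per-probe availability; the empirical mean estimates E[X] = p
xs = bernoulli_sample(p=0.999, n=100_000)
print("empirical availability:", sum(xs) / len(xs))  # close to 0.999
```

<p>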
When we sample from a Bernoulli distribution, we get a set of values that can either be 0 or 1.</p><p>When counting events in discrete time, such as when we consider events that fall in a given second/minute/hour/month:</p><ul><li><p>The time between positive events follows a <em>Geometric</em> distribution with the same parameter <em>p</em>;</p></li><li><p>The time between every second positive event follows a <em>Negative Binomial</em> distribution with the same success probability <em>p</em> as before and <em>r=2</em>, meaning we are focusing on every second positive event.</p><ul><li>A notable point is that the Negative Binomial random variable is the sum of Geometric random variables.</li></ul></li><li><p>The number of events in a given interval or sample of size <em>N</em> follows a <em>Binomial</em> distribution with parameters <em>p</em> and <em>N</em>.</p><ul><li>For instance, the raw number of available TPU slices in a TPUv2 or v3 pod, or the number of DGX systems in a DGX Superpod, can be roughly estimated using this distribution... but what makes the difference in these systems is how you manage disruptions, planned and unplanned maintenance.</li></ul></li></ul><p>When we start counting events in continuous time, we discover matching distributions with the same mathematical structure and relationships. These are the limit distributions of the previous ones when the interval sizes become infinitely small. 
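</p><p>The discrete-time relationships above are easy to check numerically; a sketch using only the standard library (names and parameter values are my own):</p>

```python
import random

def bernoulli_stream(p, n, seed=1):
    """A stream of n i.i.d. Bernoulli(p) trials."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

def gaps_between_successes(xs):
    """Trials needed for each success: Geometric(p) on support 1, 2, ..."""
    gaps, count = [], 0
    for x in xs:
        count += 1
        if x == 1:
            gaps.append(count)
            count = 0
    return gaps

p = 0.05
gaps = gaps_between_successes(bernoulli_stream(p, 200_000))
print("mean gap:", sum(gaps) / len(gaps))            # ≈ 1/p = 20 (Geometric)
# summing pairs of Geometric gaps gives Negative Binomial (r=2) waiting times
pairs = [a + b for a, b in zip(gaps[::2], gaps[1::2])]
print("mean r=2 wait:", sum(pairs) / len(pairs))     # ≈ 2/p = 40
```

<p>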
For those who are particularly adventurous, let me mention that <a target="_blank" href="https://ncatlab.org/nlab/show/probability+theory">category theory</a> offers an alternative and much broader explanation, as is often the case.</p><ul><li><p>The time between independent events happening at a constant average rate and with a probability <em>p</em> follows an <em>Exponential</em> random variable with a rate parameter <em>λ=1/p</em>.</p></li><li><p>The time between every <em>k</em>-th event follows a <em>Gamma</em> random variable with parameters <em>k</em> and <em>λ=1/p</em>.</p><ul><li>The Gamma random variable is the sum of k exponential random variables, the same relationship as the negative binomial and geometric random variables above.</li></ul></li><li><p>The number of events in a given time span of size N is a <em>Poisson</em> random variable with rate parameter <em>λ=N/p</em>.</p></li></ul><p>We refer to a "sample" as a subset of values taken from a statistical population. A "sample statistic" is a value calculated from the sample. When we use a statistic to estimate a population parameter, we call it an "estimator".</p><p>For a sufficiently large number of events in Bernoulli samples, if we count the number of positive events and divide by the total number of events, the result will be close to <em>p</em>. As the number of events grows to infinity, the result converges to <em>p</em>, which is the "mean" of the Bernoulli distribution. This is known as the "law of large numbers": the average obtained from a large number of i.i.d. random samples converges to the true value if it exists.</p><p>The average of many samples of a random variable, which has a finite mean and variance, becomes a random variable that follows a <em>Normal distribution,</em> under specific conditions. 
This is known as the "central limit theorem".</p><p>This is what most people recall from their introductory statistics courses.</p><h1 id="heading-the-central-limit-theorem-need-not-apply-extreme-value-theory">The central limit theorem, need not apply... Extreme Value Theory</h1><p>Let's talk about when the central limit theorem does not apply, as this is a common situation with the data that SREs deal with. Not knowing its limits is one reason why relying on what people remember from introductory statistics is discouraged (though it's not the only reason).</p><p>A popular way to normalize data is to compute the Z-score, defined as:</p><p>$$Z = {{x - \mu} \over {\sigma}}$$</p><p>where Z is known as the standard score, and it normalizes samples in terms of standard deviations () and the mean ().</p><p>But don't attempt to analyze variability in <em>error rates</em> by normalizing error and request counts, and then calculating the ratio. The ratio of two independent Normal random variables with a mean of zero (Cauchy random variables) has an undefined mean!</p><p>There are two other crucial distributions that frequently appear in SLI and SLO, yet many SREs seem unaware of them. These distributions are connected to quantiles and significant deviations from the median of a probability distribution. Therefore, the field of statistics that examines them is known as <em>Extreme Value Theory</em>.</p><p>There are two primary methods for analyzing the tails of distributions.</p><ul><li><p><em>You can calculate the maximum or minimum values over a specific period or block, and then analyze the distribution of these values (for example, the highest values of each year based on the highest values of each month).</em> This method helps you understand the distribution of the minimum or maximum values from very large sets of identical, independently distributed random variables from the same distribution. 
The distribution you find is often a form of the <a target="_blank" href="https://en.wikipedia.org/wiki/Generalized_extreme_value_distribution"><em>Generalized Extreme Value Distribution</em></a>. Here we consider the maxima/minima and then compute the distribution that it follows given a arbitrarily large or infinity amount of time. It's good to estimate the worst-case scenario for a steady-state process.</p></li><li><p><em>You can count how many times peak values go above or below a certain threshold within any period.</em> There are two distributions to look at: the number of events in a specific period (like a <em>Poisson distribution</em> mentioned earlier), and how much the values exceed the threshold, which usually follows a <a target="_blank" href="https://en.wikipedia.org/wiki/Generalized_Pareto_distribution"><em>generalized Pareto distribution</em></a>. Here we consider the frequency of violations and how severe they can be.</p></li></ul><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1709569330938/d9a7cbf8-d367-4d60-b155-45a174f33a6c.png" alt class="image--center mx-auto" /></p>]]><![CDATA[<p>Let me share what I've learned from 12 years of experience, defining and defending SLOs on various software and hardware systems, including TPU pods, HPC/GPU clusters, and general compute clusters. (This is just an introductory post, but I need to cover the basics before diving into more juicy topics)</p><p>SRE, by <a target="_blank" href="https://sre.google/sre-book/service-level-objectives/">the book</a>, defines SLI as "quantitative measure of some aspects of the level of a service that is provided" and SLO as " a target value or range of values for a service level that is measured by an SLI"; the <a target="_blank" href="https://sre.google/workbook/implementing-slos/">SRE workbook</a> warns that "It can be tempting to try to math your way out of these problems [...] 
Unless each of these dependencies and failure patterns is carefully enumerated and accounted for, any such calculations will be deceptive".</p><p>I agree with the last statement, but let's not shy away from the journey. Let's embrace math, and we'll uncover a rich and fascinating landscape.</p><h1 id="heading-fantastic-slisslos-and-where-to-find-them">Fantastic SLIs/SLOs and where to find them</h1><p>Here is a bestiary of SLIs and SLOs commonly found in the wild, to lay the ground of what we'll be talking about.</p><p>SLIs for software services usually cover <em>latency</em>, which is the time it takes to respond to a request or finish a task, <em>error rate</em>, the ratio of failed requests to total requests received, and some measure of <em>system throughput</em>, like requests per second or GBps.</p><p>Not all requests are created equal. It's common to set different SLOs for different types or classes of requests. Various terms such as priority, criticality, or quality of service (QoS) are used in different situations, but they all serve the purpose of treating requests differently and managing load when a service is under stress.</p><p>Compute services, such as Kubernetes clusters or Borg cells, operate on a different time scale compared to web services, but they are not entirely different. Users care about how long it takes to schedule a job (or to receive an error), that scheduling failures are rare, and how efficiently compute resources are used compared to the theoretical zero-overhead scenario.</p><p>For both software services and compute resources, it's common to define <em>availability</em>.</p><p>In systems that show measurable progress, we focus on the portion of throughput that is truly effective. 
This concept of application-level throughput is known as <em>goodput</em> in networking, and the term has also been embraced by ML (<a target="_blank" href="https://github.com/google/cloud_tpu_goodput">at least at Google</a>).</p><p>In storage systems, we also care about <em>durability</em>, which is the likelihood of accurately reading back what we wrote without any errors or silent corruption.</p><p>Given enough size, silent data corruption is also a problem that you care about in compute infrastructure.</p><p><strong>SLIs are usually defined as a certain quantile over a sample or measurement</strong>, and these quantiles often fall at the tail of distributions, not in the center. This is an important point because it changes the mathematical rules of the game!</p><p>Perfection is costly. It is common to define SLI as the <em>latency</em> of 95/99% of requests, and you rarely or never see max latency in SLO definitions. When it does appear, there are always escape clauses.</p><p>Most hyperscaler services and the business scenarios they support aim for between 3 (99.9%) to 5 (99.999%) nines of availability. However, many users can <a target="_blank" href="https://cloud.google.com/compute/sla">tolerate</a> even <a target="_blank" href="https://aws.amazon.com/compute/sla/?nc1=h_ls">less</a> at the instance- or region-level, e.g. with 99% availability allowing for up to 7.2 hours of downtime per month.</p><p>There are users in the world who can't tolerate more than 30 milliseconds of server downtime per year. They require 99.9999999% availability, and there are systems capable of delivering these impressive numbers. However, this level of reliability is not typical for common computing infrastructure. 
(For those interested, the number above is the stated reliability of an IBM z16 mainframe.)</p><h1 id="heading-a-side-trip-into-probability-distribution-and-basic-statistics">A side trip into probability distribution and basic statistics</h1><p>If you're already familiar with basic frequentist versus Bayesian statistics, distribution families, and the central limit theorem, feel free to skip this section.</p><p>If you're not familiar or just want a refresher, let's start with the basics. We'll discuss <strong>counting things</strong>, as we aim to measure how often a system fails.</p><p>A random variable is a variable with a value that is uncertain and determined by random events. In mathematical terms, it's not actually a variable, but a function that maps possible outcomes in a sample space to a measurable space (known as the support).</p><p>So, when the possible outcomes are discrete and finite or countably infinitely many outcomes or uncountably infinite (and piecewise continuous), we can define the expectation of a random variable X respectively as:</p><p>$$\begin{align*} E[X] &= x_1p_1+x_2p_2+\dots+x_kp_k\\ E[X] &= \sum_{i=1}^{\infty}x_ip_i \\ E[X] &= \int_{-\infty}^{\infty}xf(x)dx \end{align*}$$</p><p>where p_i is the probability of outcome x_i and f(x) is the probability density function (the corresponding name in the discrete case i.e. 
pmf(x_i)=p_i is called the probability mass function).</p><p>There are two main approaches to statistics.</p><ul><li><p>Frequentist or classical statistics, assigns probability to data, focusing on the frequency of events, and tests yield results that are either true or false.</p></li><li><p>Bayesian statistics, assigns probabilities to hypotheses, producing credible intervals and the probability that a hypothesis is true or false; it also incorporates prior knowledge into the analysis, updating this as more data becomes available.</p></li></ul><p>Probability distributions aren't isolated; they are connected to each other <a target="_blank" href="https://en.wikipedia.org/wiki/Relationships_among_probability_distributions">in complex ways</a>. They can be transformations, combinations, approximations (for example, from discrete to continuous), or compositions of each other, and so on...</p><p>The binary outcome {0, 1}, like available/broken, of independent events is a <em>Bernoulli</em> random variable (rv) with parameter <em>p</em>, which represents the probability of events. 
When we sample from a Bernoulli distribution, we get a set of values that can either be 0 or 1.</p><p>When counting events in discrete time, such as if we consider events that fall in a given second/minute/hour/month:</p><ul><li><p>The time between positive events follows a <em>Geometric</em> distribution with the same parameter <em>p</em>;</p></li><li><p>The time between every second positive event follows a <em>Negative Binomial</em> distribution with the same success probability <em>p</em> as before and <em>r=2</em>, meaning we are focusing on every second positive event.</p><ul><li>An notable point is that the Negative Binomial random variable is the sum of Geometric random variables.</li></ul></li><li><p>The number of events in a given interval or sample of size <em>N</em> follows a <em>Binomial</em> distribution with parameters <em>p</em> and <em>N</em>.</p><ul><li>For instance, the raw number of available TPU slices in a TPUv2 or v3 pod, or the number of DGX systems in a DGX Superpod, can be roughly estimated using this distribution... but what makes the difference in these systems is how you manage disruptions, planned and unplanned maintenance.</li></ul></li></ul><p>When we start counting events in continuous time, we discover matching distributions with the same mathematical structure and relationships. These are the limit distributions of the previous ones when the interval sizes become infinitely small. 
For those who are particularly adventurous, let me mention that <a target="_blank" href="https://ncatlab.org/nlab/show/probability+theory">category theory</a> offers an alternative and much broader explanation, as is often the case.</p><ul><li><p>The time between independent events happening at a constant average rate and with a probability <em>p</em> follows an <em>Exponential</em> random variable with a rate parameter =1/<em>p</em>.</p></li><li><p>The time between every <em>k</em>-th event follows a <em>Gamma</em> random variable with parameters <em>k</em> and <em>=1/p</em>.</p><ul><li>The Gamma random variable is the sum of k exponential random variables, the same relationship as the negative binomial and geometric random variables above.</li></ul></li><li><p>The number of events in a given time span of size N is a <em>Poisson</em> random variable with rate parameter <em>=N/p</em>.</p></li></ul><p>We refer to a "sample" as a subset of values taken from a statistical population. A "sample statistic" is a value calculated from the sample. When we use a statistic to estimate a population parameter, we call it an "estimator".</p><p>For a sufficiently large number of events in Bernoulli samples, if we count the number of positive events and divide by the total number of events, the result will be close to <em>p</em>. As the number of events grows to infinity, the result converges to <em>p</em>, which is the "mean" of the Bernoulli distribution. This is known as the "law of large numbers": the average obtained from a large number of i.i.d. random samples converges to the true value if it exists.</p><p>The average of many samples of a random variable, which has a finite mean and variance, becomes a random variable that follows a <em>Normal distribution,</em> under specific conditions. 
This is known as the "central limit theorem".</p><p>This is what most people recall from their introductory statistics courses.</p><h1 id="heading-the-central-limit-theorem-need-not-apply-extreme-value-theory">The central limit theorem, need not apply... Extreme Value Theory</h1><p>Let's talk about when the central limit theorem does not apply, as this is a common situation with the data that SREs deal with. Not knowing its limits is one reason why relying on what people remember from introductory statistics is discouraged (though it's not the only reason).</p><p>A popular way to normalize data is to compute the Z-score, defined as:</p><p>$$Z = {{x - \mu} \over {\sigma}}$$</p><p>where Z is known as the standard score, and it normalizes samples in terms of standard deviations () and the mean ().</p><p>But don't attempt to analyze variability in <em>error rates</em> by normalizing error and request counts, and then calculating the ratio. The ratio of two independent Normal random variables with a mean of zero (Cauchy random variables) has an undefined mean!</p><p>There are two other crucial distributions that frequently appear in SLI and SLO, yet many SREs seem unaware of them. These distributions are connected to quantiles and significant deviations from the median of a probability distribution. Therefore, the field of statistics that examines them is known as <em>Extreme Value Theory</em>.</p><p>There are two primary methods for analyzing the tails of distributions.</p><ul><li><p><em>You can calculate the maximum or minimum values over a specific period or block, and then analyze the distribution of these values (for example, the highest values of each year based on the highest values of each month).</em> This method helps you understand the distribution of the minimum or maximum values from very large sets of identical, independently distributed random variables from the same distribution. 
The distribution you find is often a form of the <a target="_blank" href="https://en.wikipedia.org/wiki/Generalized_extreme_value_distribution"><em>Generalized Extreme Value Distribution</em></a>. Here we consider the maxima/minima and the distribution they follow as the block size grows arbitrarily large. It is well suited to estimating the worst-case scenario of a steady-state process.</p></li><li><p><em>You can count how many times peak values go above or below a certain threshold within any period.</em> There are two distributions to look at: the number of events in a specific period (like the <em>Poisson distribution</em> mentioned earlier), and how much the values exceed the threshold, which usually follows a <a target="_blank" href="https://en.wikipedia.org/wiki/Generalized_Pareto_distribution"><em>generalized Pareto distribution</em></a>. Here we consider the frequency of violations and how severe they can be.</p></li></ul><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1709569330938/d9a7cbf8-d367-4d60-b155-45a174f33a6c.png" alt class="image--center mx-auto" /></p>]]><![CDATA[OKR planning as belief revision]]>https://rlupi.com/okr-planning-as-belief-revisionhttps://rlupi.com/okr-planning-as-belief-revisionThu, 29 Feb 2024 08:22:33 GMT<![CDATA[<p>Objective and Key Results (OKR) is a well-known framework used to set and track an organization's goals. Leaders establish top-level OKRs, which then flow through the organization. Along the way, more detail is added: these strategic goals drive alignment, but they also adjust to tactical needs and constraints. 
Progress is monitored, and any necessary changes are made throughout the organization.</p><p>Large organizations are structured in complex hierarchies because this setup helps with planning and execution, compared to other options.</p><p>The concept of <a target="_blank" href="https://plato.stanford.edu/entries/logic-belief-revision/">belief revision</a> has been explored in various fields for a long time. In 1982, Judea Pearl introduced <a target="_blank" href="https://en.wikipedia.org/wiki/Belief_propagation">belief propagation</a>, an algorithm designed for making predictions using graphical models like Bayesian networks or Markov random fields. When applied to a tree-shaped model, this algorithm finishes in just two complete passes. On more complex graphs, an approximate version of belief propagation can be used, though it requires more iterations, which is costly when each iteration runs at the speed dictated by human processes.</p><p>AI and LLMs will make these processes faster and lower coordination costs.</p><div data-node-type="callout"><div data-node-type="callout-emoji"></div><div data-node-type="callout-text">The key question is whether LLMs will lead to a <strong>tipping point</strong>, and the <em>emergence of new organizational structures and coordination dynamics that adapt to change more efficiently than our current ones</em>.</div></div><p>There are many other <a target="_blank" href="https://en.wikipedia.org/wiki/Moral_Mazes">functions and factors</a> that stabilize bureaucracies, but they are ultimately bound by what they achieve. The often unstated purpose of large corporations is not to compete in the market and make profits; it is to grow beyond that game and "engulf enough of the world to shield themselves from uncertainty" (paraphrasing John Kenneth Galbraith and Donella Meadows's <a target="_blank" href="https://donellameadows.org/archives/leverage-points-places-to-intervene-in-a-system/">Leverage Points</a>). 
When they are successful for a long time, they can lose the ingredients and factors that made them good at the innovation game earlier in their history. Increasing profits is a necessary condition to continue to play. Inefficiency and slow speed are a moral hazard and a mortal sin for a corporation that faces market shakeups and loses <em>mind</em>share even before market share.</p>]]>
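The two-pass behaviour of belief propagation on trees can be sketched in a few lines of Python. This is a toy illustration, not anything from a real planning system: the four-node tree, the local evidence vectors, and the "neighbours tend to agree" potential are all invented. Memoised recursion computes each message exactly once, which is the two-sweep (leaves-to-root, root-to-leaves) schedule in disguise.

```python
import numpy as np
from functools import lru_cache

# Toy tree of four binary variables (e.g. "is this goal on track?").
#
#        0
#       / \
#      1   2
#          |
#          3
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}

# Local evidence for each node (invented numbers for illustration).
phi = {
    0: np.array([0.5, 0.5]),   # no local evidence at the root
    1: np.array([0.8, 0.2]),
    2: np.array([0.3, 0.7]),
    3: np.array([0.9, 0.1]),
}
# Shared symmetric edge potential: neighbouring nodes tend to agree.
psi = np.array([[0.9, 0.1],
                [0.1, 0.9]])

@lru_cache(maxsize=None)
def message(src, dst):
    """Sum-product message src -> dst: src's evidence times the
    messages from src's other neighbours, with src summed out."""
    m = phi[src]
    for other in neighbors[src]:
        if other != dst:
            m = m * message(other, src)
    return psi @ m

def belief(node):
    """Marginal of `node`: local evidence times all incoming messages."""
    b = phi[node]
    for other in neighbors[node]:
        b = b * message(other, node)
    return b / b.sum()

# On a tree the memoised recursion is exactly the two passes: every
# message is computed once, leaves-inward and then root-outward.
for n in sorted(neighbors):
    print(n, belief(n))
```

Exact marginals on a tree come out in time linear in the number of edges; on graphs with loops the same updates only approximate the marginals ("loopy" belief propagation), which is the approximate, iterative version mentioned in the article.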