Saturday, 9 December 2017

Chaos engineering (workshop)


As longtime readers know, I've been involved in the ACM LIMITS workshop since (before) it started three years ago. Here's the short blurb about what Limits is about:

"The ACM LIMITS workshop aims to foster discussion on the impact of present and future ecological, material, energetic, and societal limits on computing. These topics are seldom discussed in contemporary computing research. A key aim of the workshop is to promote innovative, concrete research, potentially of an interdisciplinary nature, that focuses on technologies, critiques, techniques, and contexts for computing within fundamental economic and ecological limits. [...] We hope to impact society through the design and development of computing systems in the abundant present for use in a future of limits."

Back in early November I came across an invitation to a December 6 workshop on "chaos engineering" at KTH and found that, despite the differences in origins and aims, there were also some very interesting similarities with Limits. So I signed up for the workshop more or less on a lark, and partly also because it was held on a Wednesday - the only day of the week when I don't teach during this hectic period of the year. Here's what I wrote to my Limits colleagues when I asked if they had heard about chaos engineering (none had):


"The chaos engineering principles start like this (I made parts of the text bold):

"Advances in large-scale, distributed software systems are changing the game for software engineering. As an industry, we are quick to adopt practices that increase flexibility of development and velocity of deployment. An urgent question follows on the heels of these benefits: How much confidence we can have in the complex systems that we put into production?

Even when all of the individual services in a distributed system are functioning properly, the interactions between those services can cause unpredictable outcomes. Unpredictable outcomes, compounded by rare but disruptive real-world events that affect production environments, make these distributed systems inherently chaotic.

We need to identify weaknesses before they manifest in system-wide, aberrant behaviors. Systemic weaknesses could take the form of: improper fallback settings when a service is unavailable; retry storms from improperly tuned timeouts; outages when a downstream dependency receives too much traffic; cascading failures when a single point of failure crashes; etc. We must address the most significant weaknesses proactively, before they affect our customers in production. We need a way to manage the chaos inherent in these systems, take advantage of increasing flexibility and velocity, and have confidence in our production deployments despite the complexity that they represent.

An empirical, systems-based approach addresses the chaos in distributed systems at scale and builds confidence in the ability of those systems to withstand realistic conditions. We learn about the behavior of a distributed system by observing it during a controlled experiment. We call this Chaos Engineering.""


I went to the workshop with absolutely no assumptions and was prepared to leave after the introduction session should I find it boring or too technical for me. I instead left at the end of the day feeling uplifted and very enthusiastic. 

Besides the KTH chaos engineering workshop program and the chaos engineering principles, there's also a chaos engineering community and a one-page Google doc presentation of (self-selected) participants at the KTH workshop (all men; very few women attended the workshop).

Some of the questions I brought with me to the workshop were:
- Where does chaos engineering come from?
- Why "chaos" engineering?
- Why here and why now?

Some of these questions were answered already in Martin Monperrus' introduction to the workshop. Before there was chaos engineering there was the Chaos Monkey. Netflix invented it to test their distributed system for delivering streaming video. Since 50% of their servers fail (shut down, restart) on any given day, they wanted to be able to test their systems, and Chaos Monkey did that by randomly and automatically shutting down servers. This simulates various problems that can happen "in the wild": hardware problems, problems with the operating system, software application problems and problems with virtualization. By randomly shutting down servers, it is possible to systematically verify that the system can withstand expected-but-unpredictable failures. Taking it to the next level, there are now planned and monitored fake shutdowns of whole data centers (you don't actually take down the data center, you just cut the connections to and from the data center in question). Unlike taking down random servers, this is not random but planned in advance as a way of testing that the network can withstand a data center going down. The original definition of chaos engineering came out of Netflix' specific needs and it reads:

"Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production."
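To make the Chaos Monkey idea a bit more concrete, here is a minimal sketch in Python of "randomly shut down servers and check that requests still get served". It is a toy simulation, not Netflix' actual tool: the names `Server`, `serve`, `chaos_monkey` and `run_experiment` are all made up for illustration, and the rule that the monkey always leaves one server alive is a simplifying assumption of mine.

```python
import random


class Server:
    """A hypothetical stand-in for one instance in a cluster."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def serve(cluster, request):
    """Try each server in turn, so one dead instance does not fail the request."""
    for server in cluster:
        try:
            return server.handle(request)
        except ConnectionError:
            continue  # fall through to the next instance
    raise RuntimeError("no server available")


def chaos_monkey(cluster, rng, kill_probability=0.5):
    """Randomly shut down servers (assumption: always leave at least one running)."""
    killed = []
    for server in cluster:
        alive_count = sum(s.alive for s in cluster)
        if server.alive and alive_count > 1 and rng.random() < kill_probability:
            server.alive = False
            killed.append(server.name)
    return killed


def run_experiment(n_servers=5, rounds=20, seed=1):
    """Repeat: perturb the cluster, check a request still succeeds, then restart."""
    rng = random.Random(seed)
    cluster = [Server(f"srv-{i}") for i in range(n_servers)]
    survived = 0
    for round_no in range(rounds):
        chaos_monkey(cluster, rng)
        serve(cluster, f"request-{round_no}")  # raises if the system cannot cope
        survived += 1
        for server in cluster:
            server.alive = True  # simulate an automatic restart
    return survived
```

Calling `run_experiment()` returns the number of rounds the toy cluster survived. The point of the sketch is the shape of the experiment (perturb, observe, restore), not the failover logic, which a real system would handle very differently.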

Out of Netflix' work came Chaos engineering day; first in San Francisco (2015), then in Seattle (2016) and then back in San Francisco (2017). In 2017, chaos engineering also started to happen in Europe: Paris (Nov 2017), Stockholm (Dec 2017) and London (Dec 2017). It's hard to say for sure, but it seems to be catching on. It's still a long way from a worldwide movement but interest in chaos engineering seems to be on the rise.

The KTH workshop was organized with support from a new KTH Center for Software Research (CASTOR) and had 43 participants from academia and from industry (primarily SAAB). Besides Sweden, there were also participants from Norway, France and Spain. Organizer Monperrus stated that Europe "should be good at chaos" due to diversity and a plethora of countries, languages etc. I noticed that in the initial one-minute self-presentations, several people stated that they work on minimizing, controlling, preventing or removing chaos in computing and that only one person said he embraces chaos in computing!

Now that chaos engineering has made it out of Netflix' (and Amazon's) server rooms, taken the step across the Atlantic and also moved to a more academic context, the "Netflix definition" (above) has been upgraded. Perhaps the systems that are tested do not have to be distributed. It could for example be a test of the Linux kernel, so a more appropriate definition would then seem to be:

- "Chaos Engineering is the discipline of experimenting on a software system in order to build confidence in the system’s capability to withstand turbulent conditions in production."

In the workshop introduction Monperrus suggested we might want to use a yet more open ended definition for this workshop:

- "Chaos Engineering is the discipline of perturbing a software system in production for fun and profit."

What blew my mind was that there was a discussion of whether this definition was suitable or not, i.e. the discussion about what chaos engineering might mean happened then and there with me as a silent witness!

Monperrus connected chaos engineering to related work: the scientific method itself and philosopher Karl Popper's ideas about falsifiability. Monperrus also mentioned an early system that added a number of "ghost planes" to the real airplanes handled by an air traffic control system, to test both the system itself and the air traffic controllers' capacity to handle a situation of "congestion", i.e. a higher load of incoming airplanes. The basic idea is again to push a system to its limits in an attempt to break it (which seems dangerous in a real-life air traffic control situation - there might of course also be ethical problems with this approach). A brief search for that system didn't really give me any hits, but I did find Martin Monperrus' own homepage about "Antifragile software" and while it doesn't say so, I would assume that there is a connection to Nassim Nicholas Taleb's use of this term in his 2012 book "Antifragile: Things that gain from disorder".

Related (software) fields were suggested to be randomization & software diversity, testing (in-the-field and stress testing) and devops (canary testing/rolling deployment, A/B testing, disaster recovery). Devops ("a software engineering culture and practice that aims at unifying software development (Dev) and software operation (Ops)") seems to be compatible with chaos engineering but it's not up to me to tell as I am now moving out of my comfort zone. It's not a wild guess though, as the full name of the KTH workshop was "Chaos engineering & devops meetup@KTH Stockholm, Dec 6 2017". There were also questions and a protest ("A/B testing is not chaos! It's very planned!").

I took a lot of notes of things that I almost, somehow, hardly or not really understood, but I will not try to render these notes comprehensible as I don't really see an upside; either I have to read up (a lot) to make sense of my notes or I run the very real risk of embarrassing myself. I will just end with a couple of notes from the workshop introduction:
- Netflix' Chaos Monkey does not add a lot of value today as that particular system now handles the problems of randomly dropping a server here or there. Chaos Monkey is now primarily used to verify new versions of the system when it is upgraded.
- A statement that the Netflix learning curve was steeper in the beginning met resistance though. A workshop participant from Spotify suggested that different characteristics/values were tested early vs later in the Chaos Monkey deployment and that it might not be possible to say that the value "decreases" at later stages.
- The first chaos engineering company, Gremlin, was founded by a Netflix Chaos Engineer. Their slogan is "Break things on purpose". "About Gremlin" states that "We’ve employed Chaos Engineering to harden and prepare our services for internet scale".
- Netflix (2011): "Inspired by the success of the Chaos Monkey, we’ve started creating new simians that induce various kinds of failures, or detect abnormal conditions, and test our ability to survive them; a virtual Simian Army to keep our cloud safe, secure, and highly available." This Simian Army (then) consisted of Chaos Monkey, Latency Monkey, Conformity Monkey, Doctor Monkey, Janitor Monkey, Security Monkey, 10-18 Monkey and Chaos Gorilla.

This is already a long blog post and it would be unmanageably long if I wrote as much about the rest of the day as I have written about the 30-minute introduction, so I will keep things shorter. I will make an exception for the keynote, which followed directly after the introduction. The keynote speech by Joe Armstrong was called "Let it crash" (and also "How to write fault-tolerant software"). His slides are available online. Joe was very funny - almost funny enough to have a back-up career as a stand-up comedian! Some points he made that stuck with me:
- Fault-tolerant software is never finished - it's always work-in-progress.
- Armstrong worked at Rymdbolaget on the Viking satellite in the early 1980s. With a satellite you only get one chance because after launch there's not much you can do if you got it wrong the first time.
- In a more forgiving setting, how can we create software that works reasonably well even if there are errors in the software?
- On errors: Q: What is a software error? A: It is an undesirable property of a program; something that crashes a program; a deviation between desired and observed behavior.
   - Q: Who finds the error? A: The compiler, the programmer or the program (runtime) finds the error!
   - Q: What should the program do (runtime) when it finds an error? A: Ignore it (No!). Try to fix it (No!). Crash immediately (Yes!) - because you don't want to make things worse. Assume someone else will fix the error.
   - Q: What should the programmer do when he finds an error? A: Ignore it (No!). Log it (Yes). Try to fix it (possibly, but make sure not to make matters worse). Crash immediately (Yes). You should then write the technical stuff to a log and give a big huge apology to the user. Don't ever send the log to the user (e.g. "Error 148X-YZ; could not connect to database because...").
- "Fault-tolerance is impossible with only one computer". Fault-tolerance implies concurrency and distribution. "Scalability is impossible with only one computer". So fault-tolerance and scalability are two sides of the same coin. If you opt for one you also get the other as a side effect. You might, for example, choose a system for scalability and then get fault-tolerance as a side effect.
- There are noisy errors that go out with a big bang and then there are silent but deadly errors - errors where the program does not crash but where it delivers flawed results. The latter is worse because you don't have the right answer but don't know you don't have the right answer. It could nullify the value of everything that follows without you even knowing there is a problem.
- Many programs have no specifications or specifications that are so imprecise as to be useless. Or the programmer misunderstands the specification. And the specification might be incorrect in the first place. And the tests might be incorrect too. So you should always be prudent and plan for the worst.
- If Apple comes out with a security update for the OS every week, how secure is that system right now? A: Not very.
- Protocols are contracts. Contracts assign blame (when things go wrong). When things go right, contracts are not needed. When things go wrong, they are crucial.
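Armstrong's "crash immediately, log the technical stuff, apologize to the user and assume someone else will fix it" can be sketched in a few lines of Python. This is only my own illustration of the principle, not code from his slides and not how Erlang supervisors are actually implemented; `parse_age` and `supervise` are hypothetical names and the restart limit is an arbitrary assumption.

```python
import logging

logger = logging.getLogger("let_it_crash_sketch")


def parse_age(text):
    """Worker: no defensive repair; invalid input simply raises (crashes)."""
    age = int(text)  # raises ValueError on garbage
    if age < 0:
        raise ValueError(f"negative age: {age}")
    return age


def supervise(worker, argument, max_restarts=2):
    """Run the worker; on a crash, log the details and restart a limited number of times.

    Returns (result, message_for_user). Technical details go only to the log.
    """
    for attempt in range(1, max_restarts + 2):
        try:
            return worker(argument), "OK"
        except Exception:
            # The traceback is written to the log, never shown to the user.
            logger.exception("worker %s crashed on attempt %d", worker.__name__, attempt)
    # Out of restarts: the user gets an apology, not "Error 148X-YZ".
    return None, "We are very sorry, something went wrong on our side."
```

For example, `supervise(parse_age, "42")` returns the parsed value together with "OK", while `supervise(parse_age, "forty-two")` returns `None` and the apology, with the traceback going to the log only. The worker itself contains no error handling at all; the "someone else" who deals with the crash is the supervising function.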

What then happened at the workshop was interesting and very unexpected. I felt that this was a great group to pitch Computing within Limits to. So I wanted organizer Monperrus to mention that I had created an extra slide in the participant presentations exclusively for inviting people to submit papers to the Limits workshop (deadline Feb 9). He thought I should say something about it myself and then encouraged me to talk some about Limits. Yes, the workshop program was that flexible. I was really inspired because I felt that some of the people at the workshop could probably write the "systems papers" we want at Limits (most papers so far, including all of my own papers, have instead been "discussion papers").

So I probably sat for closer to an hour putting together a 20-minute 60-slide whirlwind presentation of Computing within Limits which I held just before the concluding discussion. I was quite nervous because 1) I didn't know anyone at the workshop and 2) I had no idea how my presentation would go down with that particular audience. It turned out to be the right presentation for the right group of people, but it remains to be seen if it actually inspired anyone to write a paper for the upcoming Limits workshop.

Some takeaway notes from the summary and concluding discussion (where I was very active):
- Someone said "my company earns money from (preventing) chaos and I would like to model chaos in society". Can we imagine a chaos monkey for society? As software becomes more important and there are evil people out there who want to destroy important systems, chaos engineering could help us build more resilient/anti-fragile systems.
- What would be the simplest way to explain chaos engineering to the man-on-the-street? What is the simplest Chaos Monkey? Others and I suggested that the societal equivalent would be to randomly turn off hot water, wifi or electricity. This made me think of Don Patterson's paper "Haitian Resiliency: A Case Study in Intermittent Infrastructure" (pdf) about how lots of people in less affluent (or disaster-stricken) parts of the world already now live with rolling brownouts and blackouts and with having to adapt to having resources and infrastructure on an on/off basis. Why not study them already? This also made me think of Alf Hornborg's quip that where there is money there is technology that works dependably - and vice versa.
- There are some very good stories out there, we just have to find them. I suggested we might for example look at "energy poverty" in the UK, Portugal and elsewhere in Europe. It's about not being able to afford energy (heating) or about being cut off from the network. It's not random, but it opens up towards questions about justice and about a basic level of service to everyone. My mother-in-law (in Argentina) can afford to have a backup generator for when the electricity fails her, but most people can't.
- We should model not just the single components but multiple/sets of components. Let's say a bit flip crashes a single autonomous truck. If the problem instead is with the coordination between trucks, the results would be much much worse. We need more resilience, robustness and anti-fragility.

Something very intriguing was a comment about the deep connection between simulation and chaos. Google stores tons of data about autonomous cars. They can then change the algorithm and "replay" old data with the new algorithm to see if outcomes would have been improved "historically". This is really interesting methodologically and I could not refrain from seeing some extremely fascinating parallels between this and our methodology of "rolling back" and rewriting history to establish a new timeline by using counterfactual history (also called alternate/alternative history, virtual history, uchronical history, allohistorical scenarios) in this recent article of ours:

Pargman, D., Eriksson, E., Höök, M., Tanenbaum, J., Pufal, M., & Wangel, J. (2017). What if there had only been half the oil? Rewriting history to envision the consequences of peak oil. Energy Research & Social Science, special issue on Narratives and storytelling in energy and climate change research. Volume 31, pp.170-178.

I find the connections between simulation and chaos (engineering) very interesting and I would like to think further about/discuss the many philosophical aspects that all of a sudden become readier-at-hand. The ghost planes mentioned above (the term "replay scenarios" was mentioned) would be one example, and it opens the door to thinking about not only simulations but also overlays on reality, e.g. augmented reality.
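The "replay old data with a new algorithm" idea can be illustrated with a tiny Python sketch. Everything in it is invented for illustration: the recorded log, the two braking rules and the function names are hypothetical, and I have no idea what Google's real data or algorithms look like.

```python
# Hypothetical recorded drive: (distance to object in m, speed in m/s, was it a real obstacle?)
RECORDED_LOG = [
    (8.0, 5.0, True),
    (25.0, 20.0, True),
    (50.0, 10.0, False),
    (15.0, 10.0, True),
]


def old_algorithm(distance_m, speed_mps):
    """Made-up old rule: brake when the object is closer than 10 m."""
    return distance_m < 10.0


def new_algorithm(distance_m, speed_mps):
    """Made-up new rule: brake when the object would be reached within 2 seconds."""
    return speed_mps > 0 and distance_m / speed_mps < 2.0


def replay(recorded_log, decide):
    """Feed the recorded inputs to a decision function and count missed obstacles."""
    missed = 0
    for distance_m, speed_mps, obstacle_was_real in recorded_log:
        braked = decide(distance_m, speed_mps)
        if obstacle_was_real and not braked:
            missed += 1
    return missed
```

Comparing `replay(RECORDED_LOG, old_algorithm)` with `replay(RECORDED_LOG, new_algorithm)` shows whether the new rule would have done better "historically". One caveat that fits the counterfactual-history parallel: the recorded world does not react to the new decisions, so a replay of this kind only rewrites the first step of the timeline, not everything that would have followed from it.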

Someone mentioned that the Santa Fe Institute (tagline "The world headquarters for complexity science") has a "simulation of civilizations". I couldn't find such a project but did find much else of interest when I checked out their homepage:
- The Santa Fe Institute has an "InterPlanetary Project" on "the global challenge of becoming an InterPlanetary civilization" as well as an upcoming (June 2018) "first annual InterPlanetary Festival".
- The Santa Fe Institute has a "Complex Systems Summer School" which offers an intensive four-week introduction to complex behavior in mathematical, physical, living, and social systems. I'd like to go there but summer 2018 is impossible so it would have to be in 2019 at the earliest.
- In their News Center, there are texts about "The limits of predictive technologies", "The science of prediction", "How complexity science can prevent large scale power outages", "Overcoming the limitations of our brains", "On leading people who are too smart to be led", "What happens when the systems we rely on go haywire?", "On what nature can teach us about managing financial systems", "Order and chaos in presidential primaries" (Feb 2016), "The science of orchestrating social outcomes", "On the need for new economic models" and "On why all cities are really the same".
- Their news center also seems to be a really good starting point for a full-day reading session and I have to say that I'm severely tempted! Two examples are the text about "15 [Santa Fe Institute] SFI-authored papers are 'must-reads' for ecologists" and the "$5,000 prize for best ‘near-future’ speculative fiction" (with a December 31, 2017 deadline - so hurry up!).

As to future plans - the very final discussion - here are my own hopes for the future:
- I personally hope I have managed to interest some people in LIMITS 2018 and that there will be at least one systems paper submission from this group to the upcoming Limits workshop (deadline February 9).
- I hope I have found a topic that I can talk with theoretical computer scientists about. These are literally my next-door neighbors at my job (Theoretical Computer Science - TCS) but I have never had anything to talk to them about before. I could probably have set up a few meetings at the workshop had it not been for my upcoming sabbatical and my total absence from KTH during the next six months. But I can perhaps convince the organizer (Monperrus) that I should give a talk to their group at some point. Perhaps the very same talk I gave but now with some more time (including more time for questions).
- I have in fact set up a March meeting with a professor in Electronic System Design. I don't remember exactly what we were supposed to talk about but time will tell. Things are however coming back as I write this and I realize that several persons in the audience suddenly saw the overlap between their "technical" interests and my interest in sustainability and I believe that this is what our upcoming (March) meeting was supposed to be about. Management guru Peter Drucker has said that "efficiency is doing things right; effectiveness is doing the right things". Working to improve efficiency from a purely technical perspective ("using 10% less energy to do X") represents Business As Usual (BAU), but working on effectiveness to me means promoting sustainability. So what questions open up when we adopt such a perspective and how can we cooperate? Just formulating the question in that particular way is in fact a breakthrough because it might represent a way to "sell" sustainability to other parts of The School of Computer Science and Communication (CSC).
- One such overlapping interest might be how there is a massive over-provisioning of resources today "just to be on the safe side". I have read that Google distributes each search to a dozen data centers and returns the answer from the fastest contender. This is a way to reduce uncertainty and to guarantee a very quick answer but it also means that more than 90% of the energy for a search is "wasted". So how could chaos engineering thinking help reduce redundancies and increase the efficiency of our use of resources? How can "right-sizing" be done in production through chaos engineering (e.g. by utilizing Chaos Monkey and the rest of Netflix' Simian Army)?
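The "more than 90%" figure in the last point follows from simple arithmetic: if a dozen data centers each do the full work and only one answer is used, 11/12 (about 92%) of the work is thrown away. The Python sketch below spells that out and also shows why it is tempting to do it anyway. The assumptions are mine: every replica does the same amount of work, and response times follow an exponential distribution with a made-up mean of 100 ms.

```python
import random


def wasted_fraction(replicas):
    """Fraction of the work discarded when only the fastest of `replicas` answers is used.

    Assumes every replica does the same amount of work and runs to completion.
    """
    if replicas < 1:
        raise ValueError("need at least one replica")
    return (replicas - 1) / replicas


def fastest_of(replicas, rng, mean_latency_ms=100.0):
    """Latency of the fastest replica (assumption: exponentially distributed latencies)."""
    return min(rng.expovariate(1.0 / mean_latency_ms) for _ in range(replicas))


def mean_latency(replicas, trials=2000, seed=0):
    """Average fastest-of-n latency over many simulated requests."""
    rng = random.Random(seed)
    return sum(fastest_of(replicas, rng) for _ in range(trials)) / trials
```

`wasted_fraction(12)` is 11/12, and comparing `mean_latency(1)` with `mean_latency(12)` shows how much faster the "fastest contender" is on average in this toy model. Right-sizing would then be about finding the smallest number of replicas that still gives an acceptable latency.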

To summarize: great workshop, lots of new ideas!
