Creating A Resilient Security Strategy Through Chaos Engineering with Kelly Shortridge

Published: May 30, 2023, 10 a.m.


Kelly Shortridge, Senior Principal Engineer at Fastly, joins Corey on Screaming in the Cloud to discuss their recently released book, Security Chaos Engineering: Sustaining Resilience in Software and Systems. Kelly explains why a resilient strategy is far preferable to a bubble-wrapped approach to cybersecurity, and how developer teams can use evidence to mitigate security threats. Corey and Kelly discuss how the risks of working with complex systems are perfectly illustrated by Jurassic Park, and Kelly also highlights why it’s critical to address both system vulnerabilities and human vulnerabilities in your development environment rather than pointing fingers when something goes wrong.


About Kelly

Kelly Shortridge is a senior principal engineer at Fastly in the office of the CTO and lead author of "Security Chaos Engineering: Sustaining Resilience in Software and Systems" (O'Reilly Media). Shortridge is best known for their work on resilience in complex software systems, the application of behavioral economics to cybersecurity, and bringing security out of the dark ages. Shortridge has been a successful enterprise product leader as well as a startup founder (with an exit to CrowdStrike) and investment banker. Shortridge frequently advises Fortune 500s, investors, startups, and federal agencies and has spoken at major technology conferences internationally, including Black Hat USA, O'Reilly Velocity Conference, and SREcon. Shortridge's research has been featured in ACM, IEEE, and USENIX, spanning behavioral science in cybersecurity, deception strategies, and the ROI of software resilience. They also serve on the editorial board of ACM Queue.



Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.


Corey: Have you listened to the new season of Traceroute yet? Traceroute is a tech podcast that peels back the layers of the stack to tell the real, human stories about how the inner workings of our digital world affect our lives in ways you may have never thought of before. Listen and follow Traceroute on your favorite platform, or learn more about Traceroute at origins.dev. My thanks to them for sponsoring this ridiculous podcast.


Corey: Welcome to Screaming in the Cloud, I’m Corey Quinn. My guest today is Kelly Shortridge, who is a Senior Principal Engineer over at Fastly, as well as the lead author of the recently released Security Chaos Engineering: Sustaining Resilience in Software and Systems. Kelly, welcome to the show.


Kelly: Thank you so much for having me.


Corey: So, I want to start with the honest truth that in that title, I think I know what some of the words mean, but when you put them together in that particular order, I want to make sure we’re talking about the same thing. Can you explain that like I’m five, as far as what your book is about?


Kelly: Yes. I’ll actually start with an analogy I make in the book, which is, imagine you were trying to rollerblade to some destination. Now, one thing you could do is wrap yourself in a bunch of bubble wrap and become the bubble person, and you can waddle down the street trying to make it to your destination on the rollerblades, but if there’s a gust of wind or a dog barks or something, you’re going to flop over, you’re not going to recover. However, if you instead do what everybody does, which is, you know, kneepads and other things that keep you flexible and nimble, then when there’s a gust of wind, you can kind of be agile, navigate around it; if a dog barks, you just roller-skate around it; you can reach your destination. The former, the bubble person, that’s a lot of our cybersecurity today. It’s just keeping us very rigid, right? And then the alternative is resilience, which is the ability to recover from failure and adapt to evolving conditions.


Corey: I feel like I am about to torture your analogy to death because back when I was in school in 2000, there was an annual tradition at the school I was attending before failing out, where a bunch of us would paint ourselves green every year and then bike around the campus naked. It was the green bike ride. So, one year I did this on rollerblades. So, if you wind up looking—there’s the bubble wrap, there’s the safety gear, and then there’s wearing absolutely nothing, which feels—


Kelly: [laugh]. Yes.


Corey: —kind of like the startup approach to InfoSec. It’s like, “It’ll be fine. What’s the worst that happens?” And you’re super nimble, super flexible, until suddenly, oops, now I really wish I’d done things differently.


Kelly: Well, there’s a reason why I don’t say rollerblade naked, other than it being rather visceral: what you described is what I’ve called YOLOSec before, which is not what you want to do. Because the problem when you think about it from a resilience perspective, again, is you want to be able to recover from failure and adapt. Sure, you can oftentimes move quickly, but you’re probably going to erode software quality over time, so at a certain point, there’s going to be some big incident, and suddenly, you aren’t fast anymore, you’re actually pretty slow. So, there’s this, kind of, happy medium where you have enough of, I would say, security by design—we can talk about that a bit if you want—where you have enough of the security by design baked in, and you can think of it as guardrails, that you’re able to withstand and recover from any failure. But yeah, going naked, that’s a recipe for not being able to rollerblade, like, ever again, potentially [laugh].


Corey: I think, on some level, that the correct dialing in of security posture is going to come down to context, in almost every case. “I’m building something in my spare time in the off hours” does not need the same security posture—mostly—as “we are a bank.” It feels like there’s a very wide gulf between those two extremes. Unfortunately, I find that there’s a certain tone-deafness coming from a lot of the security industry around, oh, everyone must have security as their number one thing, ever. I mean, with my clients whose AWS bills I fix, I have to care about security contractually, but the secrets that I hold are boring: how much money certain companies pay another very large company.


Yes, I’ll get sued into oblivion if that leaks, but nobody dies. Nobody is having their money stolen as a result. It’s slightly embarrassing in the tech press for a cycle and then it’s over and done with. That’s not the same thing as a brief stint I did running tech ops at Grindr ten years ago where, leak that database and people will die. There’s a strong difference between those threat models, and on some level, being able to act accordingly has been one of the more eye-opening approaches to increasing velocity in my experience. Does that align with the thesis of your book, since my copy has not yet arrived for this recording?


Kelly: Yes. In the book, I am not afraid to say “it depends,” and you’re right, it depends on context. I actually talk about this resilience potion recipe that you can check out if you want, these ingredients so we can sustain resilience. A key one is defining your critical functions, just what is your system’s reason for existence, and that is what you want to make sure can recover and still operate under adverse conditions, like you said.


Another example I give all the time is most SaaS apps have some sort of reporting functionality. Guess what? That’s not mission-critical. You don’t need the utmost security on that, for the most part. But if it’s processing transactions, yeah, probably you want to invest more security there. So yes, I couldn’t agree more that it’s context-dependent and oh, my God, does the security industry ignore that so much of the time, and it’s been my gripe for, I feel like, as long as I’ve been in the industry.


Corey: I mean, there was a great talk that Netflix gave years ago where they mentioned in passing that all developers have root in production. And that’s awesome, and the person next to me was super excited, and I looked at their badge, and holy hell, they worked at an actual bank. That seems like a bad plan. But talking to the Netflix speaker after the fact, Dave Hahn, something that I found that was extraordinarily insightful was that, yeah, well, we just isolate off the PCI environment and the rest, so sensitive data lives in its own compartmentalized area. So, at that point, yeah, you’re not going to be able to break much in that scenario.


It’s like, that would have been helpful context to put in the talk. Which I’m sure he did, but my attention span had tripped out and I missed that. But that’s, on some level, constraining blast radius, and not having compliance and regulatory issues extending to every corner of your environment really frees you up to do things appropriately. But there are some things where you do need to care about this stuff, regardless of how small the surface area is.


Kelly: Agreed. And I introduced the concept of the effort investment portfolio in the book, which is basically: where does it matter to invest effort, and where can you, kind of, like, maybe save some resources up? I think one thing you touched on, though, is, we’re really talking about isolation, and I actually think people don’t think about isolation in as detailed a way, or maybe as expansively, as they could. Because we want both temporal and logical and spatial isolation. What you talked about is, yeah, there are some cases where you want to isolate data, you want to isolate certain subsystems, and that could be containers, it could also be AWS security groups.


It could take a bunch of different forms, it could be something like RLBox in WebAssembly land. But I think that’s something that I really try to highlight in the book is, there’s actually a huge opportunity for security engineers starting from the design of a system to really think about how can we infuse different forms of isolation to sustain resilience.


Corey: It’s interesting that you use the word investment. When fixing AWS bills for a living, I’ve learned over the last almost seven years now of doing this that cost and architecture in cloud are fundamentally the same thing. And resilience is something that comes with a very real cost, particularly when you start looking at what the architectural choices are. And one of the big reasons that I only ever work on a fixed-fee basis is because if I’m charging for a percentage of savings or something, it inspires me to say really uncomfortable things like, “Backups are for cowards.” And, “When was the last time you saw an entire AWS availability zone go down for so long that it mattered? You don’t need to worry about that.” And it does cut off an awful lot of cost issues, at the price of making the environment more fragile.


That’s where the context thing starts to come in. I mean, in many cases, if AWS is having a bad day in a given region, well, does your business need that workload to be functional? For my newsletter, I have a publication system that’s single-homed out of the Oregon region. If that whole thing goes down for multiple days, I’m writing that week’s issue by hand because I’m going to have something different to talk about anyway. For me, there is no value in making that investment. But for companies, there absolutely is, but there also seems to be a lack of awareness around how much is a reasonable investment in that area, and when do you start making that investment? And most critically, when do you stop?


Kelly: I think that’s a good point, and luckily, what’s on my side is the fact that there’s a lot of just profligate spending in cybersecurity and [laugh] that’s really what I’m focused on is, how can we spend those investments better? And I actually think there’s an opportunity in many cases to ditch a ton of cybersecurity tools and focus more on some of the stuff you talked about. I agree, by the way, that I’ve seen some threat models where it’s like, well, AWS, all regions go down. I’m like, at that point, we have, like, a severe, bigger-than-whatever-you’re-thinking-about problem, right?


Corey: Right. So, does your business continuity plan account for every one of your staff suddenly quitting on the spot because there’s a whole bunch of companies with very expensive consulting, like, problems that I’m going to go work for a week and then buy a house in cash? It’s one of those areas where, yeah, people are not going to care about your environment more than they are about their families and other things that are going on. Plan accordingly. People tend to get so carried away with these things with tabletop planning exercises. And then of course, they forget little things like I overwrote the database by dropping the wrong thing. Turns out that was production. [laugh]. Remembering for [a me 00:10:00] there.


Kelly: Precisely. And a lot of the chaos experiments that I talk about in the book are a lot of those, like, let’s validate some of those basics, right? That’s actually some of the best investments you can make. Like, if you do have backups, I can totally see your argument about backups are for cowards, but if you do have them, like, maybe you conduct experiments to make sure that they’re available when you need them, and the same thing, even on the [unintelligible 00:10:21] side—


Corey: No one cares about backups, but everyone really cares about restores, suddenly, right after—


Kelly: Yeah.


Corey: —they really should have cared about backups.


Kelly: Exactly. So, I think it’s looking at those experiments where it’s like, okay, you have these basic assumptions in place that you assume to be invariants or assume that they’re going to bail you out if something goes wrong. Let’s just verify. That’s a great place to start because I can tell you—I know you’ve been to the RSA hall floor—how many cybersecurity teams are actually assessing the efficacy and actually experimenting to see if those tools really help them during incidents? It’s pretty few.
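
Editor's note: the "verify that the backup actually restores" experiment described above could be sketched roughly as follows. This is a minimal illustration and not something taken from the book: the file names, the helper functions, and the use of a plain file copy as the "restore" step are all assumptions made for the example. A real restore procedure would take the place of the copy.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_restore_experiment(source: Path, backup: Path, scratch_dir: Path) -> dict:
    """Restore `backup` into a scratch location and compare it with `source`.

    Hypothesis under test: "the backup exists, restores, and matches the data
    we think it protects." All names here are hypothetical.
    """
    result = {"backup_present": backup.exists(), "restored": False, "matches": False}
    if not result["backup_present"]:
        return result
    restored = scratch_dir / "restored.db"
    shutil.copy2(backup, restored)  # stand-in for a real restore procedure
    result["restored"] = restored.exists()
    result["matches"] = sha256_of(restored) == sha256_of(source)
    return result


def demo() -> tuple:
    """Run the experiment against a good backup and a silently stale one."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "prod.db"
        source.write_bytes(b"orders: 1, 2, 3")
        good = root / "good.bak"
        shutil.copy2(source, good)
        stale = root / "stale.bak"
        stale.write_bytes(b"orders: 1")  # simulates a backup job that stopped updating
        scratch_good = root / "scratch_good"
        scratch_stale = root / "scratch_stale"
        scratch_good.mkdir()
        scratch_stale.mkdir()
        return (
            run_restore_experiment(source, good, scratch_good),
            run_restore_experiment(source, stale, scratch_stale),
        )


if __name__ == "__main__":
    good_result, stale_result = demo()
    print("good backup:", good_result)
    print("stale backup:", stale_result)
```

The stale backup restores without error yet fails the comparison, which is the kind of gap between "we have backups" and "we can restore" that the conversation is pointing at.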


Corey: Oh, vendors do not want to do those analyses. They don’t want you to do those analyses, either, and if you do, for God’s sakes, shut up about it. They’re trying to sell things here, mostly firewalls.


Kelly: Yeah, cybersecurity vendors aren’t necessarily happy about my book and what I talk about because I have almost this ruthless focus on evidence and [unintelligible 00:11:08] cybersecurity vendors kind of thrive on a lack of evidence. So.


Corey: There’s so much fear, uncertainty, and doubt in that space and I do feel for them. It’s a hard market to sell in without having to talk about, here’s the thing that you’re defending against. In my case, it’s easy to sell “the AWS bill is high” because if I have to explain why more or less setting money on fire is a bad thing, I don’t really know what to tell you. I’m going to go look for a slightly different customer profile. That’s not really how it works in security. I’m sure there are better go-to-market approaches, but they’re hard to find, at least, ones that work holistically.


Kelly: There are. And one of my priorities with the book was to really enumerate how many opportunities there are to take software engineering practices that people already know, let’s say something like type systems even, and how those can actually help sustain resilience. Even things like integration testing or infrastructure as code, there are a lot of opportunities just to extend what we already do for systems reliability to sustain resilience against things that aren’t attacks and just make sure that, you know, we cover a few of those cases as well. A lot of it should be really natural to software engineering teams. Again, security vendors don’t like that because it turns out software engineering teams don’t particularly like security vendors.


Corey: I hadn’t noticed that. I do wonder, though, for those who are unaware, chaos engineering started off as breaking things on purpose, which I feel like one person had a really good story and thought about it super quickly when they were about to get fired. Like, “No, no, it’s called Chaos Engineering.” Good for them. It’s now a well-regarded discipline. But I’ve always heard of it in the context of reliability of, “Oh, you think your site is going to work if the database falls over? Let’s push it over and see what happens.” How does that manifest in a security context?


Kelly: So, I will clarify, I think that’s a slight misconception. It’s really about fixing things in production, and that’s the end goal. I think we should not break things just to break them, right? But I’ll give a simple example, which I know is based on what Aaron Rinehart conducted at UnitedHealth Group, which is, okay, let’s inject a misconfigured port as an experiment and see what happens, end-to-end. In their case, the firewall only detected the misconfigured port 60% of the time, so 60% of the time, it works every time.


But it was actually the very common, like, cloud configuration management tool that caught the change and alerted responders. So, it’s that kind of thing where we’re still trying to verify those assumptions that we have about our systems and how they behave, again, end-to-end. In a lot of cases, again, with security tools, they are not behaving as we expect. But I still argue security is just a subset of software quality, so if we’re experimenting to verify, again, our assumptions and observe system behavior, we’re benefiting software quality, and security is just a subset of that. Think about C code, right? It’s not like there’s, like, a healthy memory corruption, so it’s bad for both quality and security reasons.
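
Editor's note: a toy version of the "inject a misconfigured port and see what notices" experiment might look like the sketch below. It is not the experiment run at UnitedHealth Group. The allowlist, the `find_unexpected_ports` detector, and the use of a loopback socket as the injected change are all hypothetical stand-ins chosen so the example is self-contained.

```python
import socket
from contextlib import closing


def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as probe:
        probe.settimeout(timeout)
        return probe.connect_ex((host, port)) == 0


def find_unexpected_ports(candidates: list, allowed: set) -> list:
    """Hypothetical detector: report open ports that are not on the allowlist."""
    return [p for p in candidates if p not in allowed and is_listening(p)]


def run_port_experiment(allowed: set) -> dict:
    """Inject an unapproved listening port, then check whether the detector sees it."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as injected:
        injected.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
        injected.listen(1)
        port = injected.getsockname()[1]
        detected = find_unexpected_ports([port], allowed)
    # Leaving the `with` block closes the socket, which rolls the change back.
    return {"injected_port": port, "detected": port in detected}


if __name__ == "__main__":
    print(run_port_experiment(allowed={22, 443}))
```

In a real environment the interesting result is which control notices the change and how often, so the same injection would be repeated and each detector's hit rate recorded rather than checked once.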


Corey: One problem that I’ve had in the security space for a while is—let’s [unintelligible 00:14:05] on this to AWS for a second because that is the area in which I spend the most of my time, which probably explains a lot about my personality challenges. But the problem that I keep smacking into is if I go ahead and configure everything the way that I should according to best practices and the rest, I wind up with a firehose torrent of information in terms of CloudTrail logs, et cetera. And it’s expensive in its own right. But then to sort through it or to do a lot of things in security, there are basically two options. I can either buy a vendor’s product, which generally tends to start around $12,000 a year and goes up rapidly from there on my current $6,000 a year bill, so okay, twice as much as the infrastructure for security monitoring. Okay.


Or alternately, find a bunch of different random scripts and tools on GitHub of wildly diverging quality and sort of hope for the best on that. It feels like there’s nothing in between. And the reason I care about this is not because I’m cheap but because when you have an individual learner who is either a student or a career switcher or someone just trying to experiment with this, you want them to begin as you want them to go on, and things that are no money for an enterprise are all the money to them. They’re going to learn to work with the tools that they can afford. That feels like it’s a big security swing and a miss. Do you agree or disagree? What’s the nuance I’m missing here?


Kelly: No, I don’t think there’s nuance you’re missing. I think security observability, for one, isn’t a buzzword that particularly exists. I’ve been trying to make it a thing, but I’m solely one individual screaming into the void. But observability just hasn’t been a thing. We haven’t really focused on, okay, so what, like, we get data and what do we do with it?


And I think, again, from a software engineering perspective, I think there’s a lot we can do. One, we can just avoid duplicating efforts. We can treat observability, again, of any sort of issue as similar, whether that’s an attack or a performance issue. I think this is another place where security, or any sort of chaos experiment, shines though because if you have an idea of here’s an adverse scenario we care about, you can actually see how does it manifest in the logs and you can start to figure out, like, what signals do we actually need to be looking for, what signals matter to be able to narrow it down. Which again, it involves time and effort, but also, I can attest when you’re buying the security vendor tool and, in theory, absolving some of that time and effort, it’s maybe, maybe not, because it can be hard to understand what the outcomes are or what the outputs are from the tool and it can also be very difficult to tune it and to be able to explain some of the outputs. It’s kind of like trading upfront effort versus long-term overall overhead, if that makes sense.


Corey: It does. On that note, the title of your book includes the magic key phrase ‘sustaining resilience.’ I have found that security effort and investment tends to resemble a fire drill in—


Kelly: [laugh].


Corey: —an awful lot of places, where, “We care very much about security,” says the company, right after they very clearly failed to care about security, and I know this because I’m reading an email about a breach that they’ve just sent me. And then there’s a whole bunch of running around and hair-on-fire moments. But then there’s a new shiny that always comes up, a new strategic priority, and it falls to the wayside again. What do you see that drives that sustained effort and focus on resilience in a security context?


Kelly: I think it’s really making sure you have a learning culture, which sounds very [unintelligible 00:17:30], but things again, like, experiments can help just because when you do simulate those adverse scenarios and you see how your system behaves, it’s almost like running an incident and you can use that as very fresh, kind of, like collective memory. And I even strongly recommend starting off with prior incidents and simulating those, just to see like, hey, did the improvements we made actually help? If they didn’t, that can be kind of another fire under the butt, so to speak, to continue investing. So, definitely in practice—and there’s some case studies in the book—it can be really helpful just to kind of like sustain that memory and sustain that learning and keep things feeling a bit fresh. It’s almost like prodding the nervous system a little, just so it doesn’t go back to that complacent and convenient feeling.


Corey: It’s one of the hard problems because—I’m sure I’m going to get castigated for this by some of the listeners—but computers are easy, particularly compared to the people. There are deterministic ways to solve almost any computer problem, but people are always going to be a little bit different, and getting them to perform the same way today that they did yesterday is an exercise in frustration. Changing the culture, changing the approach and the attitude that people take toward a lot of these things feels, from my perspective, like, something of an impossible job. Cultural transformations are things that everyone talks about, but it’s rare to see them succeed.


Kelly: Yes, and that’s actually something that I very strongly weaved throughout the book is that if your security solutions rely on human behavior, they’re going to fail. We want to either reduce hazards or eliminate hazards by design as much as possible. So, my view is very much again, like, can you make processes more repeatable? That’s going to help security. If anyone takes away from my book that they need to have, like, a thousand hours of training to change hearts and minds, then they have completely misunderstood most of the book.


The idea is very much like, what are practices that we want for other outcomes anyway—again, reliability or faster time to market—and how can we harness those to also be improving resilience or security at the same time? It’s very much trying to think about those opportunities rather than, you know, trying to drill into people’s heads, like, “Thou shalt not,” or, “Thou shall.”


Corey: Way back in 2018, you gave a keynote at some conference or another and you built the entire thing on the story of Jurassic Park, specifically Ian Malcolm as one of your favorite fictional heroes, and you tied it into security in a bunch of different ways. You hadn’t written this book then unless the authorship process is way longer than I think it is. So, I’m curious to get your take on what Jurassic Park can teach us about software security.


Kelly: Yes, so I talk about Jurassic Park as a reference throughout the book, frequently. I’ve loved that book since I was a very young child. Jurassic Park is a great example of a complex system gone wrong because you can’t point to any one thing. Like there’s Dennis Nedry, you know, messing up the power system, but then there’s also the software was looking for a very specific count of dinosaurs and they didn’t anticipate there could be more in the count. Like, there are so many different factors that influenced it, you can’t actually blame just, like, human error or point fingers at one thing.


That’s a beautiful example of how things go wrong in our software systems because like you said, there’s this human element and then there’s also how the humans interact and how the software components interact. But with Jurassic Park, too, I think the great thing is dinosaurs are going to do dinosaur things like eating people, and there are also equivalents in software, like C code. C code is going to do C code things, right? It’s not a memory-safe language, so we shouldn’t be surprised when something goes wrong. We need to prepare accordingly.
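
Editor's note: the dinosaur-count failure mentioned above is a compact example of a check that encodes an assumption. The sketch below is only an illustration of that failure mode, not a description of any real system: both function names and the identifier format are made up, and the first function is written to stop once it reaches the number it expects.

```python
def tally_matches_expected(observed_ids: list, expected: int) -> bool:
    """Naive check: stop counting as soon as the expected number is reached.

    It can report a shortfall, but it can never report a surplus.
    """
    count = 0
    for _ in observed_ids:
        count += 1
        if count == expected:
            return True
    return False


def tally_report(observed_ids: list, expected: int) -> dict:
    """Count every distinct animal observed and report the gap in either direction."""
    actual = len(set(observed_ids))
    return {"expected": expected, "actual": actual, "surplus": actual - expected}


if __name__ == "__main__":
    herd = [f"raptor-{i}" for i in range(12)]  # more animals than anyone planned for
    print(tally_matches_expected(herd, expected=8))
    print(tally_report(herd, expected=8))
```

With twelve animals and an expectation of eight, the naive check reports that all is well, while the full tally reports a surplus of four. The second version costs almost nothing extra; the difference is only whether the check was designed to be surprised.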


Corey: “How could this happen? Again?” Yeah.


Kelly: Right. At a certain point, it’s like, there’s probably no way to sufficiently introduce isolation for dinosaurs unless you put them in a bunker where no one can see them, and it’s the same thing sometimes with things like C code. There’s just no amount of effort you can invest, and you’re just kind of investing for a really unclear and generally not fortuitous outcome. So, I like it as kind of this analogy to think about, okay, where do our effort investments make sense and where is it sometimes like, we really just do need to refactor because we’re dealing with dinosaurs here.


Corey: When I was a kid, that was one of my favorite books, too. The problem is, I didn’t realize I was getting a glimpse of my future at a number of crappy startups that I worked at. Because you have John Hammond, who was the owner of the park talking constantly about how, “We spared no expense,” but then you look at what actually happened and he spared every frickin expense. You have one IT person who is so criminally underpaid that smuggling dinosaur embryos off the island becomes a viable strategy for this. He wound up, “Oh, we couldn’t find the right DNA, so we’re just going to, like, splice some other random stuff in there. It’ll be fine.”


Then you have the massive overconfidence because it sounds very much like he had this almost Muskian desire to fire anyone who disagreed with him, and yeah, there was a certain lack of investment that could have been made, despite loud protestations to the contrary. I’d say that he is the root cause, he is the proximate reason for the entire failure of the park. But I’m willing to entertain disagreement on that point.


Kelly: I think there are other individuals, like Dr. Wu, if you recall, like, deciding to do the frog DNA and not thinking that maybe something could go wrong. I think there was a lot of overconfidence, which you’re right, we do see a lot in software. So, I think that’s actually another very important lesson is that incentives matter and incentives are very hard to change, kind of like what you talked about earlier. It doesn’t mean that we shouldn’t include incentives in our threat model.


So like, in the book I talk about how our threat models should include things like maybe, yeah, people are underpaid or there is a ton of pressure to deliver things quickly or, you know, do things as cheaply as possible. That should be just as much a part of our threat models as all of the technical stuff too.


Corey: I think that there’s a lot that was in that movie that was flat-out wrong. For example, one of the kids—I forget her name; it’s been a long time—was logging in and said, “Oh, this is Unix. I know Unix.” And having learned Unix as my first basically professional operating system, “No, you don’t. No one knows Unix. They get very confused at some point, the question is, just how far down what rabbit hole it is.”


I feel so sorry for that kid. I hope she wound up seeking therapy when she was older to realize that, no, you don’t actually know Unix. It’s not that you’re bad at computers, it’s that Unix is user-hostile, actively so. Like, the raptors, like, that’s the better metaphor when everything winds up shaking out.


Kelly: Yeah. I don’t disagree with that. The movie definitely takes many liberties. I think what’s interesting, though, is that Michael Crichton, specifically, when he talks about writing the book—I don’t know how many people know this—dinosaurs were just a mechanism. He knew people would want to read it in airports.


What he cared about was communicating really the danger of complex systems and how if you don’t respect them and respect that interactivity and that it can baffle and surprise us, like, things will go wrong. So, I actually find it kind of beautiful in a way that the dinosaurs were almost like an afterthought. What he really cared about was exactly what we deal with all the time in software, is when things go wrong with complexity.


Corey: Like one of his other books, Airframe, talked about an air disaster. There’s a bunch of contributing factors and the rest, and for some reason, that did not receive the wild acclaim that Jurassic Park did to become a cultural phenomenon that we’re still talking about, what, 30 years later.


Kelly: Right. Dinosaurs are very compelling.


Corey: They really are. I have to ask though\\u2014this is the joy of having a kid who is almost six\\u2014what is your favorite dinosaur? Not a question most people get asked very often, but I am going to trot that one out.


Kelly: No. Oh, that is such a good question. Maybe a Deinonychus.


Corey: Oh, because they get so angry they spit and kill people? That\\u2019s amazing.


Kelly: Yeah. And I like that, kind of like, nimble, smarter one, and also the fact that most of the smaller ones allegedly had feathers, which I just love this idea of, like, feather-ful murder machines. I have the classic, like, nerd kid syndrome, though, where I read all these dinosaur names as a kid and I\\u2019ve never pronounced them out loud. So, I\\u2019m sure there are others\\u2014


Corey: Yep.


Kelly: \\u2014that I would just word salad. But honestly, it\\u2019s hard to go wrong with choosing a favorite dinosaur.


Corey: Oh, yeah. I\\u2019m sure some paleontologist is sitting out there in the field on the dig somewhere listening to this podcast, just getting very angry at our pronunciation and things. But for God\\u2019s sake, I call the database Postgres-squeal. Get in line. There\\u2019s a lot of that out there where looking at complex system failures and different contributing factors and the rest makes stuff\\u2014that\\u2019s what makes things interesting.


I think that the idea of a root cause is almost always incorrect. \\u201cOkay, who tripped over the buried landmine?\\u201d is not the interesting question. It\\u2019s, \\u201cWho buried the thing?\\u201d What were all the things that wound up contributing to this? And you can\\u2019t even frame it that way in the blaming context, just because you start doing that and people clam up, and good luck figuring out what really happened.


Kelly: Exactly. That\\u2019s so much of what the cybersecurity industry is focused on is how do we assign blame? And it\\u2019s, you know, the marketing person clicked on a link. And it\\u2019s like, they do that thousands of times, like a month, and the one time, suddenly, they were stupid for doing it? That doesn\\u2019t sound right.


So, I\\u2019m a big fan of, yes, vanquishing root cause, thinking about contributing factors, and in particular, in any sort of incident review, you have to think about, was there a design or process problem? You can\\u2019t just think about the human behavior; you have to think about where are the opportunities for us to design things better, to make the secure way more of the default way.


Corey: When you talk about resilience and reliability and big, notable outages, most forward-thinking companies are going to go and do a variety of incident reviews and disclosures around everything that happened to it, depending upon levels of trust and whether you\\u2019re NDA\\u2019ed or not, and how much gets public is going to vary from place to place. But from a security perspective, that feels like the sort of thing that companies will clam up about and never say a word.


Kelly: Yes.


Corey: Because I can wind up pouring a couple of drinks into people and get the real story of outages, or the AWS bill, but security stuff, they start to wonder if I\\u2019m a state actor, on some level. When you were building all of this, how did you wind up getting people to talk candidly and forthrightly about issues that, if it became known they were talking about them in public, would almost certainly have negative career impact for them?


Kelly: Yes, so that\\u2019s almost like a trade secret, I feel like. A lot of it is yes, over the years talking with people over, generally at a conference where you know, things are tipsy. I never want to betray confidentiality, to be clear, but certainly pattern-matching across people\\u2019s stories.


Corey: Yeah, we\\u2019re both in positions where if even the hint of \\u201cthey can\\u2019t be trusted\\u201d enters the ecosystem, I think both of our careers explode and never recover. Like it\\u2019s\\u2014


Kelly: Exactly.


Corey: \\u2014yeah. Oh, yeah. \\u201cThey play fast and loose with secrets\\u201d is never the reputation you want as a professional.


Kelly: No. No, definitely not. So, it\\u2019s much more pattern matching and trying to generalize. But again, a lot of what can go wrong is not that different when you think about a developer being really tired and making a bunch of mistakes versus an attacker. A lot of times they\\u2019re very much the same, so luckily there\\u2019s commonality there.


I do wish the security industry was more forthright and less clandestine because frankly, all of the public postmortems that are out there about performance issues are just such, such a boon for everyone else to improve what they\\u2019re doing. So, that\\u2019s a change I wish would happen.


Corey: So, I have to ask, given that you talk about security, chaos engineering, and resilience\\u2014and of course, software and systems\\u2014all in the title of the O\\u2019Reilly book, who is the target audience for this? Is it folks who have the word security featured three times in their job title? Is it folks who are new to the space? Where does your target audience start and stop?


Kelly: Yes, so I have kept it pretty broad and it\\u2019s anyone who works with software, but I\\u2019ll talk about the software engineering audience because that is, honestly, probably out of anyone who I would love to read the book the most because I firmly believe that there\\u2019s so much that software engineering teams can do to sustain resilience and security and they don\\u2019t have to be security experts. So, I\\u2019ve tried to demystify security, make it much less arcane, even down to, like, how attackers, you know, they have their own development lifecycle. I try to demystify that, too. So, it\\u2019s very much for any team, especially, like, platform engineering teams, SREs, to think about, hey, what are some of the things maybe I\\u2019m already doing that I can extend to cover, you know, the security cases as well? So, I would love for every software engineer to check it out to see, like, hey, what are the opportunities for me to just do things slightly differently and have these great security outcomes?


Corey: I really want to thank you for taking the time to talk with me about how you view these things. If people want to learn more, where\\u2019s the best place for them to find you?


Kelly: Yes, I have all of the social media which is increasingly fragmented, [laugh] I feel like, but I also have my personal site, kellyshortridge.com. The official book site is securitychaoseng.com as well. But otherwise, find me on LinkedIn, Twitter, Mastodon, Bluesky. I\\u2019m probably blanking on the others. There\\u2019s probably already a new one while we\\u2019ve spoken.


Corey: Blue-ski is how I insist on pronouncing it as well, while we\\u2019re talking about\\u2014


Kelly: Blue-ski?


Corey: Funhouse pronunciation on things.


Kelly: I like it.


Corey: Excellent. And we will, of course, put links to all of those things in the show notes. Thank you so much for being so generous with your time. I really appreciate it.


Kelly: Thank you for having me and being a fellow dinosaur nerd.


Corey: [laugh]. Kelly Shortridge, Senior Principal Engineer at Fastly. I\\u2019m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you\\u2019ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you\\u2019ve hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment about how our choice of dinosaurs is incorrect, then put the computer away and struggle to figure out how to open a door.



'