Episode 24

Are Big Mistakes That Big Of A Deal? Part 2

Compiler

Transcript

00:02 — Johan Philippine
Angela, Brent. I know you're both terrific at your jobs. But have you ever really, really messed up?

00:11 — Angela Andrews
I've shut down production systems and databases. It would be one VM (virtual machine) and there would be a name that was super close to another one. And I would be rebooting it and it's like, "Oh, sugar foot." That was literally the database for the admissions. Oh God. Oh yeah, I've had my share. Yes.

00:33 — Johan Philippine
What happened after though?

00:36 — Angela Andrews
That particular one, it was like I knew immediately and I was like, "Oh God, I got to bring this system back up." So yes, there was an outage. It wasn't a widespread outage because if you weren't using it at that particular moment, you were unaware that I actually brought down the database server. Oh God, I've done that. I'm still here. I'm still here and I am not defined by my mistakes. Oh wait, I have to tell this story, please, please, please, please.

01:06 — Johan Philippine
Oh yeah, go ahead.

01:07 — Brent Simoneaux
Yeah, please do.

01:08 — Angela Andrews
So there was a power outage at this university that I worked at. We were working with getting generators working. And so anyway, our whole data center went out, like boom! The entire data center in this college campus. So we're bringing things back up and we're all trying to do a postmortem, we're standing there. And this guy walks in and he's an electrician, and there's this big red button on the wall.

01:35 — Brent Simoneaux
Oh no.

01:35 — Angela Andrews
And he says, "What's this?" And he pushes it and the whole room went in slow motion, and we're like, "No." And the whole thing went dzzsh. The whole data center went down again.

01:51 — Brent Simoneaux
Oh my God.

01:52 — Angela Andrews
We call him buttons. And he is still there, so.

01:56 — Brent Simoneaux
He's still there.

01:58 — Angela Andrews
We're not defined by our mistakes.

02:00 — Johan Philippine
But on the other hand, from what I've heard anyways, knowing people in the tech industry. Big mistakes doing something to a production system, it's almost like a rite of passage in this industry, right? Almost everyone has at least one story of doing exactly the same stuff that you were just talking about, right?

02:20 — Brent Simoneaux
Mm-hmm.

02:20 — Johan Philippine
So that led me to really wonder, are big mistakes that big of a deal?

02:30 — Brent Simoneaux
This is Compiler, an original podcast from Red Hat.

02:34 — Angela Andrews
We're your hosts.

02:35 — Brent Simoneaux
I'm Brent Simoneaux.

02:37 — Angela Andrews
And I'm Angela Andrews.

02:38 — Brent Simoneaux
We're here to break down questions from the tech industry, big, small, and sometimes strange.

02:45 — Angela Andrews
Each episode we go out in search of answers from Red Hatters and people they're connected to.

02:51 — Brent Simoneaux
Today's question, are big mistakes that big of a deal?

03:00 — Angela Andrews
Producer Johan Philippine is here to find out.

03:04 — Johan Philippine
So today I've got three stories to share. Act one, I call it flying under the radar.

03:12 — Angela Andrews
Okay. Ira Glass.

03:16 — Brent Simoneaux
Oh God, Johan.

03:17 — Johan Philippine
Look, it works. So I spoke with Ian Walker, he's a technical account manager here at Red Hat and he lives in Japan. Now I spoke to him first because he started an email thread a few months ago in response to a large social media outage that affected a lot of people and a lot of different websites. In his thread, he links to an article that describes effing up as part of the job of software development.

03:47 — Angela Andrews
He's not lying.

03:48 — Ian Walker
As I was just looking at the news and stuff, I happened to cross the article from one of the writers of the Daily WTF, who mentioned that as software developers, screwing up is our job and that you need to screw up in order to get better. And screwing up allows you to get better at recovering from the screw ups and stuff like that. And so I thought, "Well, this is interesting." And there is a lot of stigma and stuff associated with making mistakes and things like that.

04:16 — Johan Philippine
I thought what he was doing was really commendable, which was first of all, sharing the article, but trying to destigmatize the idea of messing up. Because Ian, well, he's got his own story about messing up. Early on in his career, he had an IT job for a big airline and his office was based in Los Angeles. Now this airline had flights across the Pacific Ocean and he was on the IT support team for airports in North America, Central America and South America.

04:47 — Brent Simoneaux
Okay.

04:48 — Johan Philippine
And that includes the airport in Kona, Hawaii. Now at the time of the story, the rest of his team had gone home and he was alone in the office.

04:58 — Ian Walker
So I had just learned about network switches and how you can log into them remotely and you enter some commands, and you can look at the configuration for all the different ports and all the different settings for the switch. And I'm not sure if I had been asked to gather this information or if I just decided to do it myself. So I was in our office in Los Angeles and I was accessing a switch that was a couple thousand miles away in Kona, Hawaii. So it was not something where I could just walk over there and plug it back in. But for some reason, I had decided I was going to log into one of the switches at an airport and I was going to check the settings to see what it had been set to. So at that time, I think either I telnetted or SSH'd into the switch, and I knew just enough to be dangerous. I knew that the command SH was supposed to show you the settings.

05:53 — Johan Philippine
Now Angela, I take it you know where this is going.

05:56 — Angela Andrews
Ooh yeah, I can see where this is going. And I guess I'm laughing about it.

06:03 — Johan Philippine
Care to fill us in?

06:04 — Brent Simoneaux
Wait, so there's a physical switch?

06:07 — Angela Andrews
Somewhere.

06:08 — Johan Philippine
Yeah. So at the airport, they have their own servers. Each airline had their own servers in the airport. And these servers handled things like check-in, and flight assignments and stuff like that. And they would have these physical servers and the network cables would come in and out of them to get their internet connections, right?

06:29 — Brent Simoneaux
So what happened, Johan?

06:30 — Johan Philippine
So he typed in SH thinking it would show him the settings because I assume that's what it does in some other contexts. But when he's logged into a particular port like that, it actually shuts down that port physically.

06:44 — Angela Andrews
I'm sorry. I'm sorry. Oh my gosh, why do we do things like that?

06:55 — Brent Simoneaux
So he's in Los Angeles, but the switch is in Hawaii?

06:58 — Johan Philippine
That's right.

06:59 — Brent Simoneaux
That's a problem.

07:00 — Johan Philippine
That's a problem. His connection died, to this switch. But not only that, he killed that switch's connection outwards as well.

07:08 — Brent Simoneaux
It's not like he could just walk down the hall and...

07:11 — Johan Philippine
Yeah, exactly. So he basically shut down that server's access to the internet at the airport during business hours. So when he disabled the port, the airline operations department, they were unable to access their back-end airline systems, they weren't able to check-in, they weren't able to check the status of the flights that they were handling.

07:30 — Angela Andrews
Wow.

07:31 — Johan Philippine
Now luckily for him, he had just recently been to that airport a couple months before on a business trip to help them set up, I assume. And he had actually taken pictures of their setup.

07:41 — Ian Walker
So I knew what cable was plugged in where and how it was all set up. So I called up the operations department and I said, "Hey, it looks like your internet connection just went down." And they were like, "Yeah, everything just suddenly stopped working. It's weird, I can't access anything." So I was like, "Hmm. I think I might know what's going on."

08:01 — Brent Simoneaux
"Yeah, that's really weird."

08:03 — Angela Andrews
"How did this happen?" "I have no idea."

08:06 — Brent Simoneaux
"Super weird."

08:08 — Johan Philippine
Oh Ian, you definitely knew what was going on.

08:09 — Brent Simoneaux
Yeah.

08:10 — Angela Andrews
You really do have to play dumb for a second. You don't want to put yourself out there too fast too far.

08:17 — Ian Walker
"Can you go over to the switch?" And I explained what the switch was and said, "Can you take the cable out of number 14 and plug it into number 15 port just to see what happens?" So they did that and somehow it came up and I was able to connect to it.

08:33 — Angela Andrews
He said somehow. It magically came back up. Oh my gosh, I love it. I love this. This is probably one of the best stories.

08:45 — Johan Philippine
Oh, it gets better.

08:46 — Ian Walker
So I quickly logged back in and turned on the port that I had just shut down, and then asked them again to put this cable back to the original port. And they did, and everything came up and was working fine.

09:01 — Angela Andrews
Wow.

09:02 — Brent Simoneaux
Mm-hmm.

09:03 — Angela Andrews
This is a good story.

09:04 — Johan Philippine
It's great, I loved it. I loved hearing this from him.

09:07 — Brent Simoneaux
So how long did this whole thing last?

09:09 — Johan Philippine
Well, he was a little hazy on the details, but he estimated that it took about 30 minutes to an hour from start to finish, is what he remembers. I mean, time gets a little funny when you're in panic mode like that.

09:23 — Brent Simoneaux
But I'm sure it felt like hours. Yeah.

09:25 — Angela Andrews
For sure.

09:26 — Johan Philippine
And another lucky break for him: It was early evening for him in Los Angeles, it was like mid afternoon in Kona, Hawaii at the time. So it all happened while the airline actually wasn't all that busy.

09:36 — Brent Simoneaux
Okay.

09:36 — Angela Andrews
Ooh, lady luck.

09:37 — Johan Philippine
There weren't that many consequences. Lot of luck for him. So I asked him, what did he learn from his experience?

09:45 — Ian Walker
Well, I learned not to enter commands that you don't really understand.

09:51 — Johan Philippine
I think that's pretty good advice.

09:53 — Angela Andrews
The most sound advice anyone could ever give you.

09:56 — Johan Philippine
Yeah.

09:58 — Angela Andrews
I want to say it was an honest mistake. It was one of those mistakes like, "Bro, you know you messed up, right?" But it wasn't. He was curious, and curiosity is an amazing thing to have when you work in technology, just not on production systems.

10:16 — Johan Philippine
Mm-hmm. So this is great advice, and it's advice that our next guest could have really used when she had a rough go on her first Linux job.

10:32 — Joanna Delaporte
Oh, you would've just had to start over.

10:37 — Johan Philippine
We're at act two. I call this one, ‘what is going on right now’? And I spoke to Joanna Delaporte.

10:45 — Joanna Delaporte
Mistakes happen, that's what this is all about.

10:48 — Johan Philippine
So that's her. She's been in the tech industry for about 15 years at this point. And about 10 years ago, she took a job as a Linux systems administrator for her local community college.

11:00 — Brent Simoneaux
Okay.

11:01 — Johan Philippine
Now, while she had some Windows administrative experience, she was learning a lot on the job how to handle the Linux system.

11:08 — Angela Andrews
I mean, that's how I learned it.

11:10 — Johan Philippine
Yeah. It forces you to learn it quickly, right? She had taken one course in college on Linux systems. So she had the basics down, but she had a lot more to learn.

11:20 — Joanna Delaporte
Yeah. So I ran all of the Linux systems for my community college, and that was everything involved in the domain for Linux systems. So domain authentication, file sharing, managing the named DNS server, patching and configuring all of the lab systems for all the students. So if this machine went down, all of the other servers would go down as well.

11:49 — Johan Philippine
So, pretty important system for her community college. It was located in a server room which she worked out of as well. It was about eight feet wide by maybe 18 feet long.

11:59 — Brent Simoneaux
Okay.

12:00 — Johan Philippine
Not a really big space. And she shared it with a half rack and then a few individual server towers. Now it was loud and it was cold to keep the servers cool.

12:11 — Brent Simoneaux
Yeah.

12:11 — Angela Andrews
Mm-hmm.

12:12 — Joanna Delaporte
Yep. I was all alone in the closet.

12:14 — Brent Simoneaux
Me too, girl.

12:17 — Angela Andrews
Great.

12:20 — Johan Philippine
So one day, Joanna was trying to figure out how to do a particular thing on a system. She doesn't really remember what it was she was trying to figure out, and she thought she could try and find that command by going through the log history because the previous administrator had surely done it before. Angela, if you had to go through a history of previously run commands, how would you go about that?

12:47 — Angela Andrews
Besides up arrow? No. I always do a control R and maybe start typing in what I think some of the command could have been and it tries to do an auto complete for you. Like when you're in Google, and you start typing and it tries to fill in the spaces, that's one way to do it. That's two ways to do it, actually. That would be my go-to.

13:08 — Johan Philippine
I see. Well, neither of those are what Joanna ended up doing.

13:12 — Angela Andrews
Oh God.

13:14 — Joanna Delaporte
Well, that's the funny thing. So I didn't actually know how to look at the commands. I was not familiar with the less command, or the more command, or the cat command. And what I wanted was one of those. Essentially, I wanted to see the commands. What I actually ended up typing was source of the root bash history, which was not a good move. It's definitely not something I should have done.

13:42 — Johan Philippine
I heard a big sigh there.

13:44 — Angela Andrews
Oh gosh, okay. So the source command is a really powerful, very powerful command and I only use it when I'm trying to do something very particular.

13:59 — Johan Philippine
Mm-hmm.

13:59 — Brent Simoneaux
Yeah.

14:00 — Angela Andrews
Let me think for a second. When do I use the source command? If I'm installing something from maybe binaries or something like that.

14:07 — Johan Philippine
Mm-hmm.

14:08 — Angela Andrews
So it's like a shell command that executes something almost like the gospel. So you're going to source whatever this thing is, you're typing after the word source.

14:19 — Brent Simoneaux
Okay.

14:20 — Angela Andrews
So you just said that she did type source and then root, or?

14:25 — Johan Philippine
Of the root bash history.

14:27 — Angela Andrews
Oh sugar. Oh yeah. Well, so she did all of that, did she?

14:35 — Johan Philippine
Yeah.

14:35 — Angela Andrews
Okay. She did all the things.

14:37 — Johan Philippine
She did all the things.

14:38 — Angela Andrews
Okay.

14:40 — Joanna Delaporte
So instead of just seeing the commands in a harmless way, I was actually executing every command in the bash history file.

14:47 — Angela Andrews
Shut up.

14:49 — Joanna Delaporte
And it fired off pretty rapidly as computers tend to do. It probably ran through at least 20 or 30 before I really understood what it was doing and that it was executing every command.

15:02 — Angela Andrews
Girl, control C.

15:06 — Joanna Delaporte
But even at that point, I wasn't sure yet how to stop it. I didn't even know how to use a PS command to find a process at that point, so it was something I had to figure out during this execution. I would say it probably ran somewhere between 50 and 200 commands before I finally managed to kill it. It's hard to say because so many of them happened so quickly that I wouldn't have seen them all necessarily.

15:32 — Brent Simoneaux
I am sweating right now.

15:34 — Angela Andrews
Me too. Okay, all right.

15:34 — Brent Simoneaux
I am sweating.

15:37 — Angela Andrews
I am so hot and nervous. And I was not the one who did the source command.

15:42 — Johan Philippine
This happened 10 years ago, yeah.

15:46 — Angela Andrews
Ooh. Yes. So just put yourself in this position where you have no idea. So this person, her predecessor may have been doing all types of things, installs, patching, removing software, changing config files, all these things. And she did a cut and paste and said, "Okay, I'm going to just do all the things that you've just done," not knowing what those things were. You can feel your soul leave your body when you watch those commands just run across the screen. And she didn't know how to stop it, oh poor thing.

16:23 — Johan Philippine
Mm-hmm, yeah. So it was doing all those things. It was also SSHing into other machines, right? And as soon as she would see that pop up, she would kill it immediately.

16:35 — Brent Simoneaux
Yeah.

16:36 — Johan Philippine
Until eventually, she realized that the whole thing would pause when that new shell would come up.

16:41 — Angela Andrews
Mm-hmm, that's right.

16:43 — Johan Philippine
Right. Then she realized, "Okay, I'm going to leave it open. I'm not going to touch it because that's going to give me time to think and figure out how to stop this." Once that happened, she finally opened up another terminal to kill that process and the parade of terror was finally over.

17:01 — Brent Simoneaux
The parade of terror.

17:01 — Angela Andrews
It's literally a parade. They're marching across your street.

17:05 — Johan Philippine
Right? Because it's one thing after the other.

17:07 — Brent Simoneaux
Little marching band.

17:08 — Johan Philippine
And you're just like, "Oh no."

17:12 — Joanna Delaporte
Yeah. In the moment of course, daylight is funny when you're terrified and things are going wrong. It was probably somewhere between four and 10 minutes. When I eventually realized I had some slack, basically I got to the point where I was like, "I'm just going to let it get to the next point where it pauses because it has SSH'd into something or opened a file." And at that point, then I started doing the research I needed to figure out how to log in, find the process and kill the process.

17:41 — Brent Simoneaux
So what did Joanna learn from all this?

17:43 — Johan Philippine
I think it's going to sound very familiar.

17:45 — Brent Simoneaux
Yeah.

17:46 — Joanna Delaporte
I should have known what this command does, right? I'd heard of this command once, that's why I used it because I'd heard it once. But in a way, I felt like I should have known better, right? I should know not to use a command that I don't know what it does. I don't really know what it does. And I thought it was way more simple and harmless of a command than it really is.

18:07 — Johan Philippine
Luckily for her, no really lasting and permanent damage was done to the system.

18:12 — Brent Simoneaux
Yeah.

18:12 — Johan Philippine
She looked back. She didn't have to wipe it and rebuild the system because that would've taken a long time, especially since she was still pretty new at this job. But she learned a valuable lesson from that.

18:25 — Brent Simoneaux
I'm starting to pick up on a little theme here.

18:28 — Johan Philippine
Do tell.

18:29 — Angela Andrews
There's a common thread. What are you realizing?

18:31 — Brent Simoneaux
There's a common thread which seems like a little bit of a golden rule here, which is, don't use commands that you don't understand.

18:42 — Angela Andrews
Sometimes they sound like a good idea, I don't know. But you're right, this is literally a cautionary tale to anyone who's listening to this.

18:50 — Johan Philippine
Several cautionary tales.

18:52 — Angela Andrews
Exactly. If you're listening to this podcast, please make sure you know what command you're about to run before you type it and hit enter.

18:59 — Brent Simoneaux
Mm-hmm.

19:00 — Angela Andrews
Know the consequences of what you're about to do.

19:03 — Brent Simoneaux
Mm-hmm.

19:05 — Brent Simoneaux
Which is not to be preachy at all, right? Not to be preachy at all.

19:09 — Angela Andrews
Oh gosh, no, wait a minute.

19:10 — Brent Simoneaux
This is very common, right?

19:14 — Brent Simoneaux
Mm-hmm.

19:14 — Angela Andrews
It's common. It is common. We're humans.

19:17 — Johan Philippine
Yeah.

19:17 — Angela Andrews
And sometimes you could know, or at least you think you know, "Oh, I know what this command is going to do," and it does something, one, because it's really not the command that you think it is. And it does something totally unexpected.

19:30 — Johan Philippine
On that note, we have one more story with a quick caveat that the person telling the story didn't cause the mistake, but she was part of the team that had to fix the mistake as it happened.

19:42 — Angela Andrews
The cleanup crew, okay.

19:44 — Johan Philippine
She was part of the cleanup crew. Act three. I call this one ‘syntax error’.

19:51 — Brent Simoneaux
Okay.

19:52 — Johan Philippine
It actually happened pretty recently. It was in 2018 at a massive tech company that we've all heard about.

19:58 — Angela Andrews
We are not naming names.

20:00 — Johan Philippine
Well, we're not naming company names. But I spoke to Ann Marie Fred, and at this point in her career, she had several years of experience as a developer. She was working in an open floor office with about 75 people in the room. That group was in charge of online sales and product information for this, again, massive tech company. And it's fair to say that the website was fairly well frequented.

20:29 — Ann Marie Fred
I know that one of our bigger web engines would get 4 million hits a month.

20:37 — Angela Andrews
Well frequented, okay.

20:40 — Johan Philippine
Nothing to sneeze at, right?

20:41 — Angela Andrews
Nope.

20:44 — Johan Philippine
So they were running some A/B testing on these pages. There were roughly half a million individual pages when counting all the content, which was also translated into multiple languages. They had a little snippet of JavaScript embedded in each of these pages to run experiments and gather data for analysis, to track conversion rates and things like that. And it worked pretty well until one of the consultants running the experiments, a consultant who was not a developer, made a critical coding mistake.

21:15 — Ann Marie Fred
Yeah. So the experiment itself, the little bit of code that was important, basically said if window.location.href equals the URL for page A, then set window.location.href to the URL for page B. Pretty simple.

21:36 — Angela Andrews
I'm sorry. I had to laugh because I have to wait until I hear exactly what happened, but it's literally pointing to another page.

21:45 — Johan Philippine
Yeah. So yes.

21:46 — Angela Andrews
Okay.

21:48 — Ann Marie Fred
And since this little snippet of code was embedded on all the web pages that our group was generating, between the product pages, and the search pages and the 100+ languages that we were supporting, we're talking about at least a few hundred thousand webpages, maybe a half million webpages that had this little snippet on them.

22:10 — Johan Philippine
So A/B testing. You randomly assign a user, either A or B at that point, that is the A version of a webpage or a B version of the webpage. There's going to be some differences between the two, and the idea is to determine which page out of those two is more effective at getting whatever desired outcome that you're trying to measure, right?

22:33 — Brent Simoneaux
Oh, so you're running a little experiment.

22:35 — Johan Philippine
You're running little experiments, right?

22:37 — Angela Andrews
Yes.

22:38 — Brent Simoneaux
But as a user, you don't really know.

22:40 — Johan Philippine
As a user, you have no idea because you just either see page A or you see page B, you don't see both of them. You don't even know that an experiment's being run most of the time.

22:49 — Brent Simoneaux
Yeah.

22:51 — Johan Philippine
So they were running a particular experiment, or they're about to run a particular experiment and something goes terribly wrong.

23:00 — Brent Simoneaux
Oh no.

23:01 — Ann Marie Fred
Well, in JavaScript, the single equal sign is used for assigning values to a variable. And then of course, a double equal sign lets you compare two variables irrespective of the data type. And then the triple equal sign compares two variables, but it checks the type strictly, right? Unfortunately, the person accidentally used the single equal sign. So instead of checking if the window location was logically equal to A, it was just immediately setting the window location to the new page, or it was actually setting the window location to A immediately. And so what happened is, as soon as that experiment went live, every single one of those pages started redirecting to the target page in an infinite loop.
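
The difference between the single and triple equal sign described above can be sketched in a few lines. This is a minimal illustration, not the actual experiment code: the URLs, the function names and the `makeLocation` stand-in for `window.location` are all assumptions made so the snippet can run outside a browser. In a real browser, every write to `window.location.href` triggers a navigation, which is what would turn the buggy version into a redirect loop.

```javascript
// Hypothetical URLs; the real page addresses are not given in the episode.
const PAGE_A = "https://example.com/a";
const PAGE_B = "https://example.com/b";

// Stand-in for window.location that records every write to href.
// In a browser, each of these writes would be a navigation.
function makeLocation(initialHref) {
  const writes = [];
  let href = initialHref;
  return {
    writes,
    get href() { return href; },
    set href(value) { writes.push(value); href = value; },
  };
}

// Buggy variant: "=" assigns PAGE_A to href. The assigned string is
// non-empty and therefore truthy, so the branch runs on every page.
function buggyExperiment(location) {
  if (location.href = PAGE_A) {
    location.href = PAGE_B;
  }
}

// Intended variant: "===" only compares, so unrelated pages are untouched.
function fixedExperiment(location) {
  if (location.href === PAGE_A) {
    location.href = PAGE_B;
  }
}

// Run both variants on a page that is not part of the experiment.
const unrelatedPage = "https://example.com/other";
const buggy = makeLocation(unrelatedPage);
buggyExperiment(buggy);
const fixed = makeLocation(unrelatedPage);
fixedExperiment(fixed);

console.log("buggy writes:", buggy.writes); // two redirects recorded
console.log("fixed writes:", fixed.writes); // no redirects recorded
```

The buggy variant redirects a page that was never part of the test, while the fixed variant leaves it alone. A linter rule that flags assignments inside conditions (ESLint's `no-cond-assign`, for example) is one common way to catch this class of mistake before it ships.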

23:51 — Angela Andrews
Ooh.

23:54 — Johan Philippine
So, half a million pages, give or take a few thousand. Instead of performing a check, instead they redirected to a single page. It's not too bad, right? That's the worst of it, right?

24:08 — Angela Andrews
Is it though?

24:09 — Ann Marie Fred
So they launched the experiment and then immediately went into a multi-hour customer meeting and turned their phone off.

24:20 — Angela Andrews
No.

24:20 — Ann Marie Fred
Of course, it's like the classic launch something on Friday evening scenario, right? But we noticed in our big, open office room, we had a lot of monitors on those webpages. And so what happened is, all the monitors that were checking for a specific content to render on a page, or for user journeys that could go through successfully started failing at roughly the same time within five to 15 minutes, depending on how sensitive they were. And so immediately, phones started ringing all over the place in our office from different teams that were monitoring their pages. And it very quickly became a... When one pager goes off, people would shake it off. But when 10 pagers go off, everybody in the room stops working and everybody wants to know what's going on.

25:12 — Angela Andrews
And all these heads are popping up over their monitors like groundhogs like, "Wait a minute."

25:16 — Johan Philippine
Yeah, like groundhogs and meerkats. They're like, "What's going on here?"

25:19 — Angela Andrews
Oh wow, that's a good one.

25:23 — Brent Simoneaux
Wait. Paint this picture for us, Johan. What just happened here?

25:28 — Johan Philippine
So a consultant who was running these A/B tests on the webpages. A consultant whose office, by the way, was in another city, not conveniently next door where they could just pop into her office and say like, "Hey, what's going on?" She started running an A/B test experiment and it immediately started to redirect all of the pages for whatever that group was set to monitor to a single page, which would overload their system, I assume, is what happened.

25:58 — Brent Simoneaux
Mm-hmm.

25:59 — Johan Philippine
Everything stops working properly. All the hundreds of thousands of pages were no longer accessible, right, and they were all trying to get to one single page, which triggered all of these alarms. And all the teams who depended on the data from those pages, they'd noticed that something was wrong and they were calling into Ann Marie's office to be like, "Hey, something's up. Is something happening on your end?" And they didn't know what was going on because this was out of the blue for them, right, they didn't know that the A/B test had just been launched. It took them about half an hour to figure out, first of all, what was happening. Then they figured out that the pages were caught in a loop, but they didn't know why.

26:40 — Angela Andrews
Wow. That's so stressful.

26:42 — Johan Philippine
When they realized it was from the A/B testing platform, they went to try and shut it down only to find out that they didn't have the right permissions to do so. Only the person who launched the experiment was able to do that, the consultant. And because she worked in another office and because her phone was off, they weren't able to turn it off right away. So Ann Marie was tasked with contacting this consultant and getting her to shut it down. Eventually she did so by calling other people who worked in that same office to be like, "Hey, we really need to talk to this person right now. Can you get her on the phone and out of whatever meeting that she's in because this is a big deal."

27:22 — Angela Andrews
"She's in a meeting. May I take a message?"

27:24 — Johan Philippine
No.

27:29 — Brent Simoneaux
Drag them out of that office.

27:31 — Johan Philippine
But even though Ann Marie wasn't the cause of this problem, she and her team still learned a pretty valuable lesson.

27:38 — Ann Marie Fred
Well, we learned that the amazing power of an A/B testing framework could bring down a website if it's not configured correctly. So we got much more cautious after that. We worked with the vendor to put in an emergency kill switch so that we, as developers, could shut off any test or experiment with a single command.
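
The kill switch idea can be sketched roughly as follows. The episode does not describe how the vendor actually implemented it, so everything here is an assumption for illustration: the `createExperimentRunner` function, its method names and the experiment name are all hypothetical. The point is only that every experiment checks one shared flag before it runs, so a single call disables all of them.

```javascript
// Hypothetical experiment runner; names and shape are illustrative only.
function createExperimentRunner() {
  let killed = false;             // the emergency kill switch
  const experiments = new Map();  // experiment name -> function to run

  return {
    register(name, run) {
      experiments.set(name, run);
    },
    // One call shuts off every registered experiment.
    killAll() {
      killed = true;
    },
    // Runs each experiment unless the switch has been thrown,
    // and returns the names of the experiments that actually ran.
    runAll(context) {
      if (killed) return [];
      const ran = [];
      for (const [name, run] of experiments) {
        run(context);
        ran.push(name);
      }
      return ran;
    },
  };
}

// Usage: the experiment runs once, then the switch stops it.
const runner = createExperimentRunner();
const pageContext = { visits: 0 };
runner.register("redirect-a-to-b", (ctx) => { ctx.visits += 1; });

const beforeKill = runner.runAll(pageContext);
runner.killAll();
const afterKill = runner.runAll(pageContext);

console.log(beforeKill, afterKill, pageContext.visits);
```

Keeping the switch outside any individual experiment means that whoever holds it can stop a test without needing the permissions of the person who launched it, which was the gap the team ran into.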

28:01 — Johan Philippine
Again, someone who didn't really know what they were doing caused a big problem, but Ann Marie and her team were able to put in a kill switch and a backup system so that they could intervene. They also implemented a code review so that anytime the A/B testers wanted to push something to production, they had a developer actually go in and check it to make sure that they wouldn't cause any more problems.

28:26 — Angela Andrews
That's smart. It had more eyes on it.

28:28 — Johan Philippine
It sure did. And after they implemented that code review, they didn't have the same mistake happen again. On that note, Ann Marie has got some advice about learning from those mistakes.

28:39 — Ann Marie Fred
But it's the same goal, right, that you learn from your mistakes and don't get angry about them. So I think that's really important to have a formal way to learn from those mistakes and also to fight for a culture where these things are treated blamelessly. Because you need people to trust the process and their coworkers enough that they will tell you the truth about what they know as opposed to getting into a defensive mode, right? And just to have a sense of humor about it because really, everybody makes mistakes.

29:15 — Brent Simoneaux
Mm-hmm. Ain't that right?

29:16 — Angela Andrews
She's right.

29:17 — Johan Philippine
She's right.

29:20 — Brent Simoneaux
So Johan, we've just heard a few stories about people making big mistakes by doing things they don't quite understand. What are we to take away from this?

29:33 — Johan Philippine
Well, mistakes happen. Big mistakes happen, especially when people are doing things that they don't fully understand. It is their fault in the end, right? But if you treat it in the right way, instead of pointing blame and try to learn from it, all of these people, they've learned from their mistakes and they're all still working in the tech industry, right? So big mistakes are going to happen, and sure, there are some situations where big mistakes are going to end a career. But from what I've heard from talking to people in the tech industry, that's pretty rare.

30:05 — Brent Simoneaux
Does that line up with your experience, Angela?

30:07 — Angela Andrews
It does. Because again, mistakes are all a part of the job. Because you're curious in your job and you're trying to do a better job, you shouldn't be penalized for your curiosity. Yes, you have to figure out what you're doing and what are their impacts, but none of this stuff was done in malice. None of it was done to bring the company down. No, it was just really people doing their job or just being curious, and mistakes are always going to happen. Sometimes you just have to know how to mitigate them, right, as quickly as possible.

30:40 — Johan Philippine
And in my conversation with Ian Walker, from the top of the show, he was telling me how he really likes to create an environment where it's okay to make mistakes. He really tries to shield his junior developers from the consequences if there are any. Now, over the years, people have developed systems, they've developed ways in which mistakes can get caught or prevented before they have big consequences. As a preview for our next episode, which is part two of this ‘big mistakes’ episode.

31:11 — Brent Simoneaux
Part two.

31:12 — Johan Philippine
Sometimes the systems, they aren't enough. Sometimes they fail.

31:19 — Chris Kelley
I only realized that something had gone horribly wrong when I got a call from the database admin an hour later, and he wasn't happy.

31:25 — Angela Andrews
Ooh.

31:27 — Johan Philippine
That's next time on Compiler.

31:32 — Angela Andrews
This was such a great story, listeners, and I hope you had as much fun listening to it as we had talking about it. We want you to share your thoughts with us. Tweet us @RedHat on Twitter. Use the hashtag #CompilerPodcast. We just want to hear about your F ups too, because we know they're out there. We know you've done them, now you just have to share them with us. We'd love to hear from you. And that does it for the first ‘eff-ups’ episode of Compiler.

32:04 — Brent Simoneaux
Today's episode was produced by Johan Philippine and Caroline Craighead. Victoria Lawton makes sure we know what we're doing.

32:14 — Angela Andrews
Our audio engineer is Kristie Chan. Special thanks to Sean Cole. Our theme song was composed by Mary Ancheta.

32:23 — Brent Simoneaux
A big thank you to our guests: Ian Walker, Joanna Delaporte and Ann Marie Fred.

32:29 — Angela Andrews
Our audio team includes Leigh Day, Laura Barnes, Stephanie Wonderlick, Mike Esser, Claire Allison, Nick Burns, Aaron Williamson, Karen King, Boo Boo Howse, Rachel Ertel, Mike Compton, Ocean Matthews and Laura Walters.

32:46 — Brent Simoneaux
If you like today's episode, please follow the show. Rate us, leave us a review and share it with someone you know, it really does help us out.

32:54 — Angela Andrews
So glad you listened. Thank you. And we'll see you next time.

32:58 — Brent Simoneaux
All right.
