Episode 23

Are Big Mistakes That Big Of A Deal?

Compiler

Show Notes

Oops. We all make mistakes. Most of the time, they’re small enough no one notices. But every now and then, we do something that makes us break into a cold sweat. The "Oops" becomes a curse, desperate pleas—or horrified silence as we process what just happened. In the moment, they’re panic-inducing. But once the dust settles, are those big mistakes that big of a deal?

We hear three stories of people who wish they had an easy undo button. But making those mistakes taught them all something important—and changed how they do their jobs. Because those big mistakes end up being valuable lessons for the rest of their careers.

Transcript

00:02 — Johan Philippine

Angela, Brent. I know you're both terrific at your jobs. But have you ever really, really messed up?

00:11 — Angela Andrews 

I've shut down production systems and databases. It would be one VM (virtual machine) and there would be a name that was super close to another one. And I would be rebooting it and it's like, "Oh, sugar foot." That was literally the database for the admissions. Oh God. Oh yeah, I've had my share. Yes.

00:33 — Johan Philippine 

What happened after though? 

00:36 — Angela Andrews

That particular one, it was like I knew immediately and I was like, "Oh God, I got to bring this system back up." So yes, there was an outage. It wasn't a widespread outage because if you weren't using it at that particular moment, you were unaware that I actually brought down the database server. Oh God, I've done that. I'm still here. I'm still here and I am not defined by my mistakes. Oh wait, I have to tell this story, please, please, please, please.

01:06 — Johan Philippine 

Oh yeah, go ahead.

01:07 — Brent Simoneaux

Yeah, please do.

01:08 — Angela Andrews

So there was a power outage at this university that I worked at. We were working with getting generators working. And so anyway, our whole data center went out, like boom! The entire data center in this college campus. So we're bringing things back up and we're all trying to do a postmortem, we're standing there. And this guy walks in and he's an electrician, and there's this big red button on the wall.

01:35 — Brent Simoneaux

Oh no.

01:35 — Angela Andrews

And he says, "What's this?" And he pushes it and the whole room went in slow motion, and we're like, "No." And the whole thing went dzzsh. The whole data center went down again.

01:51 — Brent Simoneaux

Oh my God.

01:52 — Angela Andrews

We call him buttons. And he is still there, so.

01:56 — Brent Simoneaux

He's still there.

01:58 — Angela Andrews

We're not defined by our mistakes.

02:00 — Johan Philippine 

But on the other hand, from what I've heard anyways, knowing people in the tech industry, making a big mistake on a production system is almost like a rite of passage in this industry, right? Almost everyone has at least one story of doing exactly the same stuff that you were just talking about, right?

02:20 — Brent Simoneaux

Mm-hmm.

02:20 — Johan Philippine

So that led me to really wonder, are big mistakes that big of a deal?

02:30 — Brent Simoneaux

This is Compiler, an original podcast from Red Hat.

02:34 — Angela Andrews

We're your hosts.

02:35 — Brent Simoneaux

I'm Brent Simoneaux.

02:37 — Angela Andrews 

And I'm Angela Andrews.

02:38 — Brent Simoneaux 

We're here to break down questions from the tech industry, big, small, and sometimes strange.

02:45 — Angela Andrews

Each episode we go out in search of answers from Red Hatters and people they're connected to.

02:51 — Brent Simoneaux

Today's question, are big mistakes that big of a deal?

03:00 — Angela Andrews 

Producer Johan Philippine is here to find out.

03:04 — Johan Philippine 

So today I've got three stories to share. Act one, I call it flying under the radar.

03:12 — Angela Andrews

Okay. Ira Glass.

03:16 — Brent Simoneaux 

Oh God, Johan.

03:17 — Johan Philippine

Look, it works. So I spoke with Ian Walker, he's a technical account manager here at Red Hat and he lives in Japan. Now I spoke to him first because he started an email thread a few months ago in response to a large social media outage that affected a lot of people and a lot of different websites. In his thread, he links to an article that describes effing up as part of the job of software development.

03:47 — Angela Andrews 

He's not lying.

03:48 — Ian Walker 

As I was just looking at the news and stuff, I happened to cross the article from one of the writers of the Daily WTF, who mentioned that as software developers, screwing up is our job and that you need to screw up in order to get better. And screwing up allows you to get better at recovering from the screw ups and stuff like that. And so I thought, "Well, this is interesting." And there is a lot of stigma and stuff associated with making mistakes and things like that.

04:16 — Johan Philippine

I thought what he was doing was really commendable, which was first of all, sharing the article, but trying to destigmatize the idea of messing up. Because Ian, well, he's got his own story about messing up. Early on in his career, he had an IT job for a big airline and his office was based in Los Angeles. Now this airline had flights across the Pacific Ocean and he was on the IT support team for airports in North America, Central America and South America.

04:47 — Brent Simoneaux

Okay.

04:48 — Johan Philippine

And that includes the airport in Kona, Hawaii. Now at the time of the story, the rest of his team had gone home and he was alone in the office.

04:58 — Ian Walker 

So I had just learned about network switches and how you can log into them remotely and you enter some commands, and you can look at the configuration for all the different ports and all the different settings for the switch. And I'm not sure if I had been asked to gather this information or if I just decided to do it myself. So I was in our office in Los Angeles and I was accessing a switch that was a couple thousand miles away in Kona, Hawaii. So it was not something where I could just walk over there and plug it back in. But for some reason, I had decided I was going to log into one of the switches at an airport and I was going to check the settings to see what it had been set to. So at that time, I think either I telnetted or SSH'd into the switch, and I knew just enough to be dangerous. I knew that the command SH was supposed to show you the settings.

05:53 — Johan Philippine

Now Angela, I take it you know where this is going.

05:56 — Angela Andrews

Ooh yeah, I can see where this is going. And I guess I'm laughing about it.

06:03 — Johan Philippine

Care to fill us in?

06:04 — Brent Simoneaux

Wait, so there's a physical switch?

06:07 — Angela Andrews 

Somewhere.

06:08 — Johan Philippine 

Yeah. So at the airport, they have their own servers. Each airline had their own servers in the airport. And these servers handled things like check-in, and flight assignments and stuff like that. And they would have these physical servers and the network cables would come in and out of them to get their internet connections, right?

06:29 — Brent Simoneaux

So what happened, Johan?

06:30 — Johan Philippine

So he typed in SH thinking it would show him the settings because I assume that's what it does in some other contexts. But when he's logged into a particular port like that, it actually shuts down that port physically.

06:44 — Angela Andrews

I'm sorry. I'm sorry. Oh my gosh, why do we do things like that?

06:55 — Brent Simoneaux

So he's in Los Angeles, but the switch is in Hawaii?

06:58 — Johan Philippine

That's right.

06:59 — Brent Simoneaux

That's a problem.

07:00 — Johan Philippine

That's a problem. His connection died, to this switch. But not only that, he killed that switch's connection outwards as well.

07:08 — Brent Simoneaux

It's not like he could just walk down the hall and...

07:11 — Johan Philippine 

Yeah, exactly. So he basically shut down that server's access to the internet at the airport during business hours. So when he disabled the port, the airline operations department, they were unable to access their back-end airline systems, they weren't able to check-in, they weren't able to check the status of the flights that they were handling.

07:30 — Angela Andrews

Wow.

07:31 — Johan Philippine

Now luckily for him, he had just recently been to that airport a couple months before on a business trip to help them set up, I assume. And he had actually taken pictures of their setup.

07:41 — Ian Walker

So I knew what cable was plugged in where and how it was all set up. So I called up the operations department and I said, "Hey, it looks like your internet connection just went down." And they were like, "Yeah, everything just suddenly stopped working. It's weird, I can't access anything." So I was like, "Hmm. I think I might know what's going on."

08:01 — Brent Simoneaux

"Yeah, that's really weird."

08:03 — Angela Andrews

"How did this happen?" "I have no idea."

08:06 — Brent Simoneaux 

"Super weird."

08:08 — Johan Philippine

Oh Ian, you definitely knew what was going on.

08:09 — Brent Simoneaux

Yeah.

08:10 — Angela Andrews

You really do have to play dumb for a second. You don't want to put yourself out there too fast too far.

08:17 — Ian Walker

"Can you go over to the switch?" And I explained what the switch was and said, "Can you take the cable out of number 14 and plug it into number 15 port just to see what happens?" So they did that and somehow it came up and I was able to connect to it.

08:33 — Angela Andrews

He said somehow. It magically came back up. Oh my gosh, I love it. I love this. This is probably one of the best stories.

08:45 — Johan Philippine

Oh, it gets better.

08:46 — Ian Walker 

So I quickly logged back in and turned on the port that I had just shut down, and then asked them again to put this cable back to the original port. And they did, and everything came up and was working fine.

09:01 — Angela Andrews

Wow.

09:02 — Brent Simoneaux

Mm-hmm.

09:03 — Angela Andrews 

This is a good story.

09:04 — Johan Philippine

It's great, I loved it. I loved hearing this from him.

09:07 — Brent Simoneaux

So how long did this whole thing last?

09:09 — Johan Philippine

Well, he was a little hazy on the details, but he estimated that it took about 30 minutes to an hour from start to finish, is what he remembers. I mean, time gets a little funny when you're in panic mode like that.

09:23 — Brent Simoneaux

But I'm sure it felt like hours. Yeah.

09:25 — Angela Andrews

For sure.

09:26 — Johan Philippine

And another lucky break for him: It was early evening for him in Los Angeles, it was like mid-afternoon in Kona, Hawaii at the time. So it all happened while the airline actually wasn't all that busy.

09:36 — Brent Simoneaux

Okay.

09:36 — Angela Andrews

Ooh, lady luck.

09:37 — Johan Philippine

There weren't that many consequences. Lot of luck for him. So I asked him, what did he learn from his experience?

09:45 — Ian Walker

Well, I learned not to enter commands that you don't really understand.

09:51 — Johan Philippine

I think that's pretty good advice.

09:53 — Angela Andrews

The most sound advice anyone could ever give you.

09:56 — Johan Philippine

Yeah.

09:58 — Angela Andrews 

I want to say it was an honest mistake. It was one of those mistakes like, "Bro, you know you messed up, right?" But it wasn't. He was curious, and curiosity is an amazing thing to have when you work in technology, just not on production systems.

10:16 — Johan Philippine

Mm-hmm. So this is great advice, and it's advice that our next guest could have really used when she had a rough go on her first Linux job.

10:32 — Joanna Delaporte

Oh, you would've just had to start over.

10:37 — Johan Philippine

We're at act two. I call this one ‘what is going on right now?’ And I spoke to Joanna Delaporte.

10:45 — Joanna Delaporte

Mistakes happen, that's what this is all about.

10:48 — Johan Philippine

So that's her. She's been in the tech industry for about 15 years at this point. And about 10 years ago, she took a job as a Linux systems administrator for her local community college.

11:00 — Brent Simoneaux

Okay.

11:01 — Johan Philippine

Now, while she had some Windows administrative experience, she was learning a lot on the job how to handle the Linux system.

11:08 — Angela Andrews

I mean, that's how I learned it.

11:10 — Johan Philippine

Yeah. It forces you to learn it quickly, right? She had taken one course in college on Linux systems. So she had the basics down, but she had a lot more to learn.

11:20 — Joanna Delaporte

Yeah. So I ran all of the Linux systems for my community college, and that was everything involved in the domain for Linux systems. So domain authentication, file sharing, managing the named DNS server, patching and configuring all of the lab systems for all the students. So if this machine went down, all of the other servers would go down as well.

11:49 — Johan Philippine

So, pretty important system for her community college. It was located in a server room which she worked out of as well. It was about eight feet wide by maybe 18 feet long.

11:59 — Brent Simoneaux

Okay.

12:00 — Johan Philippine 

Not a really big space. And she shared it with a half rack and then a few individual server towers. Now it was loud and it was cold to keep the servers cool.

12:11 — Brent Simoneaux

Yeah.

12:11 — Angela Andrews

Mm-hmm.

12:12 — Joanna Delaporte

Yep. I was all alone in the closet.

12:14 — Brent Simoneaux

Me too, girl.

12:17 — Angela Andrews

Great.

12:20 — Johan Philippine

So one day, Joanna was trying to figure out how to do a particular thing on a system. She doesn't really remember what it was she was trying to figure out, and she thought she could try and find that command by going through the log history because the previous administrator had surely done it before. Angela, if you had to go through a history of previously run commands, how would you go about that?

12:47 — Angela Andrews

Besides up arrow? No. I always do a control R and maybe start typing in what I think some of the command could have been and it tries to do an auto complete for you. Like when you're in Google, and you start typing and it tries to fill in the spaces, that's one way to do it. That's two ways to do it, actually. That would be my go-to.

13:08 — Johan Philippine

I see. Well, neither of those are what Joanna ended up doing.

13:12 — Angela Andrews 

Oh God.

13:14 — Joanna Delaporte

Well, that's the funny thing. So I didn't actually know how to look at the commands. I was not familiar with the less command, or the more command, or the cat command. And what I wanted was one of those. Essentially, I wanted to see the commands. What I actually ended up typing was source of the root bash history, which was not a good move. It's definitely not something I should have done.

13:42 — Johan Philippine

I heard a big sigh there.

13:44 — Angela Andrews

Oh gosh, okay. So the source command is a really powerful, very powerful command and I only use it when I'm trying to do something very particular.

13:59 — Johan Philippine

Mm-hmm.

13:59 — Brent Simoneaux

Yeah.

14:00 — Angela Andrews

Let me think for a second. When do I use the source command? If I'm installing something from maybe binaries or something like that.

14:07 — Johan Philippine

Mm-hmm.

14:08 — Angela Andrews

So it's like a shell command that executes something almost like the gospel. So you're going to source whatever this thing is, you're typing after the word source.

14:19 — Brent Simoneaux

Okay.

14:20 — Angela Andrews

So you just said that she did type source and then root, or?

14:25 — Johan Philippine

Of the root bash history.

14:27 — Angela Andrews

Oh sugar. Oh yeah. Well, so she did all of that, did she?

14:35 — Johan Philippine

Yeah.

14:35 — Angela Andrews

Okay. She did all the things.

14:37 — Johan Philippine

She did all the things.

14:38 — Angela Andrews

Okay.

14:40 — Joanna Delaporte

So instead of just seeing the commands in a harmless way, I was actually executing every command in the bash history file.

14:47 — Angela Andrews

Shut up.

14:49 — Joanna Delaporte

And it fired off pretty rapidly as computers tend to do. It probably ran through at least 20 or 30 before I really understood what it was doing and that it was executing every command.

15:02 — Angela Andrews

Girl, control C.

15:06 — Joanna Delaporte

But even at that point, I wasn't sure yet how to stop it. I didn't even know how to use a PS command to find a process at that point, so it was something I had to figure out during this execution. I would say it probably ran somewhere between 50 and 200 commands before I finally managed to kill it. It's hard to say because so many of them happened so quickly that I wouldn't have seen them all necessarily.

15:32 — Brent Simoneaux

I am sweating right now.

15:34 — Angela Andrews

Me too. Okay, all right.

15:34 — Brent Simoneaux

I am sweating.

15:37 — Angela Andrews

I am so hot and nervous. And I was not the one who did the source command.

15:42 — Johan Philippine

This happened 10 years ago, yeah.

15:46 — Angela Andrews

Ooh. Yes. So just put yourself in this position where you have no idea. So this person, her predecessor may have been doing all types of things, installs, patching, removing software, changing config files, all these things. And she did a cut and paste and said, "Okay, I'm going to just do all the things that you've just done," not knowing what those things were. You can feel your soul leave your body when you watch those commands just run across the screen. And she didn't know how to stop it, oh poor thing.

16:23 — Johan Philippine

Mm-hmm, yeah. So it was doing all those things. It was also SSHing into other machines, right, which as soon as she would see that pop up, she would kill it immediately.

16:35 — Brent Simoneaux

Yeah.

16:36 — Johan Philippine

Until eventually, she realized that the whole thing would pause when that new shell would come up.

16:41 — Angela Andrews

Mm-hmm, that's right.

16:43 — Johan Philippine

Right. Then she realized, "Okay, I'm going to leave it open. I'm not going to touch it because that's going to give me time to think and figure out how to stop this." Once that happened, she finally opened up another terminal to kill that process and the parade of terror was finally over.

17:01 — Brent Simoneaux

The parade of terror.

17:01 — Angela Andrews

It's literally a parade. They're marching across your street.

17:05 — Johan Philippine

Right? Because it's one thing after the other.

17:07 — Brent Simoneaux

Little marching band.

17:08 — Johan Philippine

And you're just like, "Oh no."

17:12 — Joanna Delaporte

Yeah. In the moment of course, time dilates funny when you're terrified and things are going wrong. It was probably somewhere between four and 10 minutes. When I eventually realized I had some slack, basically I got to the point where I was like, "I'm just going to let it get to the next point where it pauses because it has SSH'd into something or opened a file." And at that point, then I started doing the research I needed to figure out how to log in, find the process and kill the process.

17:41 — Brent Simoneaux

So what did Joanna learn from all this?

17:43 — Johan Philippine

I think it's going to sound very familiar.

17:45 — Brent Simoneaux

Yeah.

17:46 — Joanna Delaporte 

I should have known what this command does, right? I'd heard of this command once, that's why I used it because I'd heard it once. But in a way, I felt like I should have known better, right? I should know not to use a command that I don't know what it does. I don't really know what it does. And I thought it was way more simple and harmless of a command than it really is.

18:07 — Johan Philippine

Luckily for her, no really lasting and permanent damage was done to the system.

18:12 — Brent Simoneaux

Yeah.

18:12 — Johan Philippine

She lucked out. She didn't have to wipe it and rebuild the system because that would've taken a long time, especially since she was still pretty new at this job. But she learned a valuable lesson from that.

18:25 — Brent Simoneaux

I'm starting to pick up on a little theme here.

18:28 — Johan Philippine

Do tell.

18:29 — Angela Andrews 

There's a common thread. What are you realizing?

18:31 — Brent Simoneaux 

There's a common thread which seems like a little bit of a golden rule here, which is, don't use commands that you don't understand.

18:42 — Angela Andrews

Sometimes they sound like a good idea, I don't know. But you're right, this is literally a cautionary tale to anyone who's listening to this.

18:50 — Johan Philippine

Several cautionary tales.

18:52 — Angela Andrews

Exactly. If you're listening to this podcast, please make sure you know what command you're about to run before you type it and hit enter.

18:59 — Brent Simoneaux

Mm-hmm.

19:00 — Angela Andrews

Know the consequences of what you're about to do.

19:03 — Brent Simoneaux

Mm-hmm.

19:05 — Brent Simoneaux

Which is not to be preachy at all, right? Not to be preachy at all.

19:09 — Angela Andrews

Oh gosh, no, wait a minute.

19:10 — Brent Simoneaux

This is very common, right?

19:14 — Johan Philippine

Mm-hmm.

19:14 — Angela Andrews

It's common. It is common. We're humans.

19:17 — Johan Philippine

Yeah.

19:17 — Angela Andrews

And sometimes you could know, or at least you think you know, "Oh, I know what this command is going to do," and it does something, one, because it's really not the command that you think it is. And it does something totally unexpected.

19:30 — Johan Philippine

On that note, we have one more story with a quick caveat that the person telling the story didn't cause the mistake, but she was part of the team that had to fix the mistake as it happened.

19:42 — Angela Andrews

The cleanup crew, okay.

19:44 — Johan Philippine

She was part of the cleanup crew. Act three. I call this one ‘syntax error’.

19:51 — Brent Simoneaux

Okay.

19:52 — Johan Philippine

It actually happened pretty recently. It was in 2018 at a massive tech company that we've all heard about.

19:58 — Angela Andrews

We are not naming names.

20:00 — Johan Philippine

Well, we're not naming company names. But I spoke to Ann Marie Fred, and at this point in her career, she had several years of experience as a developer. She was working in an open floor office with about 75 people in the room. That group was in charge of online sales and product information for this, again, massive tech company. And it's fair to say that the website was fairly well frequented.

20:29 — Ann Marie Fred

I know that one of our bigger web engines would get 4 million hits a month.

20:37 — Angela Andrews 

Well frequented, okay.

20:40 — Johan Philippine 

Nothing to sneeze at, right?

20:41 — Angela Andrews

Nope.

20:44 — Johan Philippine

So they were running some A/B testing on these pages. There were roughly half a million individual pages when counting all the content, which was also translated into multiple languages. They had a little snippet of JavaScript embedded in each of these pages to run experiments and gather data for analysis, to track conversion rates and things like that. And it worked pretty well until one of the consultants running the experiments, a consultant who was not a developer, made a critical coding mistake.

21:15 — Ann Marie Fred

Yeah. So the experiment itself, the little bit of code that was important, basically said if window.location.href = the URL for page A, then set window.location.href to the URL for page B. Pretty simple.

21:36 — Angela Andrews

I'm sorry, I had to laugh. I have to wait until I hear exactly what happened, but it's literally pointing to another page.

21:45 — Johan Philippine

Yeah. So yes.

21:46 — Angela Andrews

Okay.

21:48 — Ann Marie Fred

And since this little snippet of code was embedded on all the webpages that our group was generating, between the product pages, and the search pages and the 100+ languages that we were supporting, we're talking about at least a few hundred thousand webpages, maybe a half million webpages that had this little snippet on them.

22:10 — Johan Philippine

So, A/B testing. You randomly assign a user to either A or B, that is, the A version of a webpage or the B version of the webpage. There's going to be some differences between the two, and the idea is to determine which page out of those two is more effective at getting whatever desired outcome that you're trying to measure, right?

22:33 — Brent Simoneaux

Oh, so you're running a little experiment.

22:35 — Johan Philippine

You're running little experiments, right?

22:37 — Angela Andrews

Yes.

22:38 — Brent Simoneaux

But as a user, you don't really know.

22:40 — Johan Philippine

As a user, you have no idea because you just either see page A or you see page B, you don't see both of them. You don't even know that an experiment's being run most of the time.

22:49 — Brent Simoneaux

Yeah.

22:51 — Johan Philippine

So they were running a particular experiment, or they're about to run a particular experiment and something goes terribly wrong.

23:00 — Brent Simoneaux

Oh no.

23:01 — Ann Marie Fred

Well, in JavaScript, the single equal sign is used for assigning values to a variable. And then of course, a double equal sign lets you compare two variables irrespective of the data type. And then the triple equal sign compares two variables, but it checks the type strictly, right? Unfortunately, the person accidentally used the single equal sign. So instead of checking if the window location was logically equal to A, it was just immediately setting the window location to the new page, or it was actually setting the window location to A immediately. And so what happened is, as soon as that experiment went live, every single one of those pages started redirecting to the target page in an infinite loop.
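
The difference Ann Marie describes can be sketched in TypeScript. This is a minimal illustration under stated assumptions, not the code from the incident: `PageLocation` stands in for the browser's `window.location`, and `PAGE_A`, `PAGE_B`, `buggyExperiment` and `fixedExperiment` are hypothetical names. In a real browser, assigning to `window.location.href` starts a navigation and the snippet runs again on the page it lands on, which is presumably how the redirects kept repeating.

```typescript
// Stand-in for window.location so the sketch runs outside a browser (assumption).
interface PageLocation {
  href: string;
}

// Hypothetical URLs for the two variants of the experiment.
const PAGE_A = "https://example.com/a";
const PAGE_B = "https://example.com/b";

// Buggy version: a single "=" assigns PAGE_A to href instead of comparing.
// The assigned string is non-empty, so the condition is always truthy
// and every page ends up redirected to PAGE_B.
function buggyExperiment(loc: PageLocation): void {
  if ((loc.href = PAGE_A)) {
    loc.href = PAGE_B;
  }
}

// Fixed version: "===" compares without assigning, so only visitors
// who are already on PAGE_A get redirected to PAGE_B.
function fixedExperiment(loc: PageLocation): void {
  if (loc.href === PAGE_A) {
    loc.href = PAGE_B;
  }
}
```

On any starting page, `buggyExperiment` leaves `href` pointing at `PAGE_B`, while `fixedExperiment` only changes it when the visitor is already on `PAGE_A`.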

23:51 — Angela Andrews

Ooh.

23:54 — Johan Philippine

So, half a million pages, give or take a few thousand. Instead of performing a check, they redirected to a single page. It's not too bad, right? That's the worst of it, right?

24:08 — Angela Andrews

Is it though?

24:09 — Ann Marie Fred

So they launched the experiment and then immediately went into a multi-hour customer meeting and turned their phone off.

24:20 — Angela Andrews

No.

24:20 — Ann Marie Fred

Of course, it's like the classic launch something on Friday evening scenario, right? But we noticed in our big, open office room, we had a lot of monitors on those webpages. And so what happened is, all the monitors that were checking for specific content to render on a page, or for user journeys that could go through successfully started failing at roughly the same time within five to 15 minutes, depending on how sensitive they were. And so immediately, phones started ringing all over the place in our office from different teams that were monitoring their pages. And it very quickly became a... When one pager goes off, people would shake it off. But when 10 pagers go off, everybody in the room stops working and everybody wants to know what's going on.

25:12 — Angela Andrews

And all these heads are popping up over their monitors like groundhogs like, "Wait a minute."

25:16 — Johan Philippine

Yeah, like groundhogs and meerkats. They're like, "What's going on here?"

25:19 — Angela Andrews

Oh wow, that's a good one.

25:23 — Brent Simoneaux

Wait. Paint this picture for us, Johan. What just happened here?

25:28 — Johan Philippine

So a consultant was running these A/B tests on the webpages, a consultant whose office, by the way, was in another city, not conveniently next door where they could just pop into her office and say like, "Hey, what's going on?" She started running an A/B test experiment and it immediately started to redirect all of the pages that group was set to monitor to a single page, which would overload their system, I assume, is what happened.

25:58 — Brent Simoneaux

Mm-hmm.

25:59 — Johan Philippine

Everything stops working properly. All the hundreds of thousands of pages were no longer accessible, right, and they were all trying to get to one single page, which triggered all of these alarms. And all the teams who depended on the data from those pages, they'd noticed that something was wrong and they were calling into Ann Marie's office to be like, "Hey, something's up. Is something happening on your end?" And they didn't know what was going on because this was out of the blue for them, right, they didn't know that the A/B test had just been launched. It took them about half an hour to figure out, first of all, what was happening. Then they figured out that the pages were caught in a loop, but they didn't know why.

26:40 — Angela Andrews

Wow. That's so stressful.

26:42 — Johan Philippine

When they realized it was from the A/B testing platform, they went to try and shut it down only to find out that they didn't have the right permissions to do so, only the person who launched the experiment was able to do that, the consultant. And because she worked in another office and because her phone was off, they weren't able to turn it off right away. So Ann Marie was tasked with contacting this consultant and getting her to shut it down. Eventually she did so by calling other people who worked in that same office to be like, "Hey, we really need to talk to this person right now. Can you get her on the phone and out of whatever meeting that she's in because this is a big deal."

27:22 — Angela Andrews

"She's in a meeting. May I take a message?"

27:24 — Johan Philippine

No. 

27:29 — Brent Simoneaux

Drag them out of that office.

27:31 — Johan Philippine

But even though Ann Marie wasn't the cause of this problem, she and her team still learned a pretty valuable lesson.

27:38 — Ann Marie Fred

Well, we learned that the amazing power of an A/B testing framework could bring down a website if it's not configured correctly. So we got much more cautious after that. We worked with the vendor to put in an emergency kill switch so that we, as developers, could shut off any test or experiment with a single command.
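
The kill switch Ann Marie mentions could look something like the following TypeScript sketch. The vendor's actual mechanism is not described in the episode, so everything here is an assumption for illustration: `Experiment`, `ExperimentRunner`, `engageKillSwitch` and `runAll` are hypothetical names, and a real framework would likely keep the flag in shared configuration instead of in memory.

```typescript
// Hypothetical experiment shape: a name plus the function that applies the variation.
interface Experiment {
  name: string;
  apply: () => void;
}

class ExperimentRunner {
  private killed = false;
  private readonly experiments: Experiment[] = [];

  // Adds an experiment to the list that runAll will execute.
  register(experiment: Experiment): void {
    this.experiments.push(experiment);
  }

  // The "single command": once engaged, no experiment code runs at all.
  engageKillSwitch(): void {
    this.killed = true;
  }

  // Runs every registered experiment unless the kill switch is engaged.
  // Returns the names of the experiments that actually ran.
  runAll(): string[] {
    if (this.killed) {
      return [];
    }
    const ran: string[] = [];
    for (const experiment of this.experiments) {
      experiment.apply();
      ran.push(experiment.name);
    }
    return ran;
  }
}
```

The point of the design is that the check sits in front of all experiment code, so developers can stop a bad test without needing the permissions of whoever launched it.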

28:01 — Johan Philippine

Again, someone who didn't really know what they were doing caused a big problem, but Ann Marie and her team were able to put in a kill switch and a backup system so that they could intervene. They also implemented a code review so that anytime the A/B testers wanted to push something to production, they had a developer actually go in and check it to make sure that they wouldn't cause any more problems.

28:26 — Angela Andrews

That's smart. It had more eyes on it.

28:28 — Johan Philippine

It sure did. And after they implemented that code review, they didn't have the same mistake happen again. On that note, Ann Marie has got some advice about learning from those mistakes.

28:39 — Ann Marie Fred

But it's the same goal, right, that you learn from your mistakes and don't get angry about them. So I think that's really important to have a formal way to learn from those mistakes and also to fight for a culture where these things are treated blamelessly. Because you need people to trust the process and their coworkers enough that they will tell you the truth about what they know as opposed to getting into a defensive mode, right? And just to have a sense of humor about it because really, everybody makes mistakes.

29:15 — Brent Simoneaux

Mm-hmm. Ain't that right?

29:16 — Angela Andrews

She's right.

29:17 — Johan Philippine 

She's right.

29:20 — Brent Simoneaux

So Johan, we've just heard a few stories about people making big mistakes by doing things they don't quite understand. What are we to take away from this?

29:33 — Johan Philippine

Well, mistakes happen. Big mistakes happen, especially when people are doing things that they don't fully understand. It is their fault in the end, right? But if you treat it in the right way, and try to learn from it instead of pointing blame... All of these people, they've learned from their mistakes and they're all still working in the tech industry, right? So big mistakes are going to happen, and sure, there are some situations where big mistakes are going to end a career. But from what I've heard from talking to people in the tech industry, that's pretty rare.

30:05 — Brent Simoneaux

Does that line up with your experience, Angela?

30:07 — Angela Andrews 

It does. Because again, mistakes are all a part of the job. Because you're curious in your job and you're trying to do a better job, you shouldn't be penalized for your curiosity. Yes, you have to figure out what you're doing and what are their impacts, but none of this stuff was done in malice. None of it was done to bring the company down. No, it was just really people doing their job or just being curious, and mistakes are always going to happen. Sometimes you just have to know how to mitigate them, right, as quickly as possible.

30:40 — Johan Philippine

And in my conversation with Ian Walker, from the top of the show, he was telling me how he really likes to create an environment where it's okay to make mistakes. He really tries to shield his junior developers from the consequences if there are any. Now, over the years, people have developed systems, they've developed ways in which mistakes can get caught or prevented before they have big consequences. As a preview for our next episode, which is part two of this ‘big mistakes’ episode.

31:11 — Brent Simoneaux

Part two.

31:12 — Johan Philippine

Sometimes the systems, they aren't enough. Sometimes they fail.

31:19 — Chris Kelley

I only realized that something had gone horribly wrong when I got a call from the database admin an hour later, and he wasn't happy.

31:25 — Angela Andrews

Ooh.

31:27 — Johan Philippine

That's next time on Compiler.

31:32 — Angela Andrews 

This was such a great story, listeners, and I hope you had as much fun listening to it as we had talking about it. We want you to share your thoughts with us. Tweet us @RedHat on Twitter. Use the hashtag #CompilerPodcast. We just want to hear about your F ups too, because we know they're out there. We know you've done them, now you just have to share them with us. We'd love to hear from you. And that does it for the first ‘eff-ups’ episode of Compiler.

32:04 — Brent Simoneaux

Today's episode was produced by Johan Philippine and Caroline Craighead. Victoria Lawton makes sure we know what we're doing.

32:14 — Angela Andrews

Our audio engineer is Kristie Chan. Special thanks to Sean Cole. Our theme song was composed by Mary Ancheta.

32:23 — Brent Simoneaux 

A big thank you to our guests: Ian Walker, Joanna Delaporte and Ann Marie Fred.

32:29 — Angela Andrews

Our audio team includes Leigh Day, Laura Barnes, Stephanie Wonderlick, Mike Esser, Claire Allison, Nick Burns, Aaron Williamson, Karen King, Boo Boo Howse, Rachel Ertel, Mike Compton, Ocean Matthews and Laura Walters.

32:46 — Brent Simoneaux

If you like today's episode, please follow the show. Rate us, leave us a review and share it with someone you know, it really does help us out.

32:54 — Angela Andrews

So glad you listened. Thank you. And we'll see you next time.

32:58 — Brent Simoneaux

All right.

Featured guests

Ian Walker

Joanna Delaporte

Ann Marie Fred

 
