You can also read or share a shorter version of this that I've posted on LinkedIn.
On Saturday, you might have seen the news about a “false alarm” push alert that was sent to many iPhones and other smartphones in Hawaii. An alert was also sent out over many TV channels through the Emergency Alert System.
If you were in Hawaii and actually received the message, it might have been a bit traumatizing, as people shared photos of the “THIS IS NOT A DRILL” alert online:
Golfer Justin Thomas said he prepared for the end after an alert, warning of an imminent missile strike, was wrongly issued in Hawaii.
This is the message that was sent out. 👇
— BBC Sport (@BBCSport) January 14, 2018
Here's how it looked on some TV stations, as a crawl:
— Daily Express (@Daily_Express) January 13, 2018
After about 40 minutes of panic, an “all clear” message was sent, but the aftermath leaves people asking what happened, how it happened, and why.
Some are, inevitably, asking WHO screwed up, looking for a culprit to blame and punish.
It's important first to find out what happened. It's hard to do problem solving through news reports, but let's do our best. As this Wall St. Journal article explains:
“On Saturday morning, just after 8 a.m. local time, a Hawaii state employee hit the wrong button on a computer during a shift change and accidentally sent an alert to many of the state's cellphones that a ballistic missile was about to strike.”
What Was Your Reaction?
Some people are going to have the reaction of, “What an idiot. They should have been more careful.” (see my companion site BeMoreCareful.com).
Is this reaction helpful from government officials?
“I'm extremely angry right now. People should lose their jobs if this was an error,” Hawaii State Representative Matt Lopresti told CNN.
There are, unfortunately, many news stories that use the word “blame” in this situation. Blaming seems to be part of our human nature (and primate nature)… we can do better, though.
Firing or punishing an individual without making meaningful and effective changes to the systems means it's just a matter of time before a new person makes a mistake. It might be appropriate to fire an individual who intentionally did the wrong thing, but that seems not to be the case here. As I've blogged about, the “just culture methodology” is very helpful in cases like this to determine if punishing an individual is fair and “just.” It's not a matter of being “nice” or “soft.” Unfair and unjust punishment gets in the way of future improvement.
My first reaction was along the lines of, “How could the system be so poorly designed so that it's possible to accidentally hit such an important button??”
Even if this was an intentional act, if a single person can do that, isn't that another case where we have a badly-designed system?
I'm curious what the “shift change” detail has to do with anything in this scenario. It's plausible that a shift simply ended at 8 a.m. and the alert went out “just after 8 a.m.”
Bad Systems, not Bad People?
It made me wonder if the computer system they were using was designed like this (see image):
Suspected UI / UX design on Hawaii state computer system as the employee tried to log out for shift change. Oops. pic.twitter.com/EN6XQjTNR8
— Mark Graban (@MarkGraban) January 14, 2018
This WIRED story says:
“It's a regular PC interface. This person probably had a mouse and a dropdown menu of the kind of alert messages you can send,” and selected the wrong one, Simpson says.
One time, I accidentally booked a flight on the wrong day because, I think, I hit the scroll wheel on my mouse and it changed the date in the dropdown menu. Things like this can happen too easily.
It's a much more trivial situation, but I've blogged before about how the American Airlines website made it far too easy to cancel the wrong flight. To my surprise, American fixed the issue and even contacted me to thank me.
What I suggested to American was the Lean concept of “error proofing” or “mistake proofing.” The focus with mistake proofing is to make it more difficult to do the wrong thing. Or, ideally, we make it impossible to do the wrong thing. It seems this emergency notification system wasn't designed with error proofing in mind. Or, it was designed well and this was an intentional act of sabotage. Who knows.
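To make the error-proofing idea concrete, here's a thought experiment in code. None of these names reflect the actual Hawaii software — it's a hypothetical sketch of how separating drills from live alerts, and requiring a deliberate typed confirmation, makes the wrong click much harder:

```python
# Hypothetical sketch of a mistake-proofed alert sender.
# None of these function or class names come from the real Hawaii system.

class ConfirmationError(Exception):
    """Raised when a live alert is attempted without explicit confirmation."""
    pass

def send_live_alert(message: str, typed_confirmation: str) -> str:
    """Send a real public alert, but only if the operator has deliberately
    re-typed the word LIVE -- an extra step that a stray click or a
    wrong dropdown selection cannot satisfy."""
    if typed_confirmation != "LIVE":
        raise ConfirmationError("Live alert blocked: confirmation text did not match.")
    return f"LIVE ALERT SENT: {message}"

def send_drill_alert(message: str) -> str:
    """Drills go through a separate function entirely, so a routine test
    can never accidentally reach the live public channel."""
    return f"DRILL (internal only): {message}"
```

The design choice here is that the dangerous action and the routine action are different code paths, not adjacent items in one menu — you can't slip from one to the other with a scroll wheel.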
I'll continue with the assumption that this was an honest mistake. A “slip” or an “error.” If so, I'd call this a systems problem instead of labeling it “human error.” Calling something “human error” often leads to blaming the individual, or to throwing up our hands and saying, “What can we do? People aren't perfect.”
Well, what we can do is design better systems that are less prone to human error. That principle applies in healthcare and other settings too. Telling others to be careful is NOT an effective strategy for quality or safety. Errors might occur even WITH people being careful.
See a previous blog post on reacting to “human error”:
Again from the WSJ:
“Officials canceled the alert six minutes later to stop it from being sent to people who hadn't already received it. But the news continued to proliferate as frightened residents called friends and family members.”
It's good that they stopped the flow of erroneous messages going out. We'd call that a “containment” step to stem the flow of poor quality information. Apparently the entire state doesn't receive messages exactly simultaneously, which makes sense. I wonder if somebody in that office (or the person who pushed the “wrong button”) got the alert and realized something had gone wrong. Was the person who “pushed the wrong button” on the way to their car?
“Thirty-eight minutes passed before state officials sent a new message that said the first alert had been a false alarm.”
— Damian Davila, MBA (@daviladamian) January 13, 2018
My next question is why it took so long to “cancel” the erroneous message after it was detected. There wasn't an easy “send all clear” button in that computer system? Were there checks and balances (or some bureaucracy to clear) before the all clear could be sent? It doesn't seem like those checks and balances were there for the original terrifying message.
“Mr. Miyagi didn't offer an explanation for why more than half an hour passed before a notification was sent to cellphones that the alert had been a false alarm. “One thing we have to work on more is the cancellation notice,” he said.”
The WSJ shares a bit more detail about how there was apparently a routine test of systems during shift change:
“At a press conference Saturday afternoon, Vern T. Miyagi, administrator of the Hawaii Emergency Management Agency, said the alert was mistakenly triggered when the wrong computer button was clicked during a morning shift change test of the emergency-alert system.”
I wonder how much thought went into designing the test process, with a mind for “how do we prevent triggering a false alarm?”
How Do You Improve the Process to Prevent Recurrence?
Better late than never, the state has changed the process to require some redundancy:
“The system for sending out emergency alerts has now been changed to a system with two people involved, so the same kind of mistake couldn't happen again, state officials said.”
I'm guessing they couldn't change the actual software that quickly. Does “two people involved” mean somebody standing over the shoulder of the clicker to make sure they're being careful enough?
Does that process change truly prevent a repeat of the problem or just make it a little less likely?
“We've implemented changes already to ensure that it becomes a redundant process, so it won't be a single individual,” Mr. Miyagi said.
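One way to read “two people involved” is a two-person rule built into the software itself, rather than somebody watching over a shoulder: one operator queues the alert, and it can only transmit after a different operator approves it. A hedged sketch of that idea (all names are hypothetical, not the state's actual system):

```python
# Hypothetical two-person rule: an alert is requested by one operator
# and cannot be sent until a *different* operator approves it.

class PendingAlert:
    def __init__(self, message: str, requested_by: str):
        self.message = message
        self.requested_by = requested_by
        self.approved_by = None

    def approve(self, approver: str) -> None:
        """The second approval must come from someone other than the requester."""
        if approver == self.requested_by:
            raise PermissionError("Approver must be a different person than the requester.")
        self.approved_by = approver

    def send(self) -> str:
        """Transmission is blocked until the second-person approval exists."""
        if self.approved_by is None:
            raise PermissionError("Alert not sent: second-person approval missing.")
        return f"SENT: {self.message}"
```

Note that this only makes the error less likely, not impossible — both people can still agree to send the wrong alert, which is exactly the question raised above about prevention versus reduced likelihood.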
I hope they aren't going down the path to blaming that particular (or any) single individual. I'll give the state credit for appearing to take a systems view instead of saying something like “the person who made the error has been put on leave” or something like that.
The Federal Communications Commission has also launched an investigation.
“FCC Commissioner Jessica Rosenworcel said the commission must find out what went wrong.
“Emergency alerts are meant to keep us and our families safe, not to create false panic. We must investigate and we must do better,” she wrote on Twitter.”
Have you seen other news stories with details about what happened?
What are the lessons learned for your workplace? Are there likely errors that could occur, where it's just a matter of time?
Read or share my Twitter thread on this post:
— Mark Graban (@MarkGraban) January 14, 2018
Follow up post:
More on this disturbing theme:
We Still Live Under the Chilling Risk of Accidental Nuclear War
There's a fascinating, if chilling, book that I read recently: The Doomsday Machine: Confessions of a Nuclear War Planner by Daniel Ellsberg, who used to work deep in the military-industrial complex (he was the leaker of the “Pentagon Papers”).
In the same timeframe as the Pentagon Papers, when he was on trial, Ellsberg also had a stash of stolen papers about the U.S. nuclear program that detailed, for example, the extremely high risk of an accidental nuclear war. As described in this book review:
“Intending to release these as a follow-up to his Vietnam leaks, he gave them to his brother for safekeeping, who buried them in his yard in upstate New York. Soon after, a hurricane and flood swept them away into a landfill…”
Even with searches of the landfill, the papers were never found.
Maybe I'll do another blog post on The Doomsday Machine. There were many shocking allegations that will make you question the design of our nuclear systems (and those of the Russians). The risk of “accidental” full-on nuclear war is still very high today, Ellsberg says, and that's not just a commentary on our current president. In the book, it sounds like there are some bad systems and processes that could lead to full-on nuclear war WITHOUT the president sending an ill-considered first strike.
Some of the supposedly “elaborate safeguards” that were supposed to prevent a rogue launch of a nuclear strike without the president's permission were subject to workarounds.
There's the famous story of how, supposedly, the safeguard launch lock code on all Minuteman missiles was set to all zeros after JFK ordered that an extra protection be put in place. The Air Force claims that was never the case. Ellsberg's book lays out many examples where speed was emphasized over safety or caution. It's logical that those responsible for launching missiles would do everything in their power to be able to launch them quickly (before being hit by a Soviet strike), but the focus on speed certainly introduced a number of risks.
Speed was cited as a factor in this erroneous message, per WIRED.
Simpson agrees: “You don't want to be in the middle of an attack on the US and have someone fumbling around with the message.” It's also natural to conduct exercises to ensure the system is functioning. The problem in this case, Simpson says, is any exercise message should begin with the words, “EXERCISE EXERCISE EXERCISE.”
“This was probably a state-run emergency exercise that doesn't have the strong controls that DoD has learned the hard way from 50 years of screwing up,” Simpson says.
Using “EXERCISE EXERCISE EXERCISE” sounds like a form of error proofing that mitigates an error rather than preventing it. People might not have panicked over that form of message.
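That banner is more reliable if the software stamps it on automatically, rather than trusting a rushed operator to remember it. A small illustrative sketch (hypothetical — this is not the real EAS message format):

```python
def format_alert(body: str, is_exercise: bool) -> str:
    """Build the outbound alert text. For drills, the system itself
    prepends and appends the EXERCISE banner, so a distracted or
    hurried operator cannot omit it."""
    if is_exercise:
        return f"EXERCISE EXERCISE EXERCISE -- {body} -- EXERCISE EXERCISE EXERCISE"
    return body
```

This is mitigation rather than prevention: the wrong message may still go out, but recipients can immediately recognize it as a drill instead of a warning of an actual attack.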
Has the DoD learned? Ellsberg says no.
It's also claimed that the launch codes for strategic bombers weren't really as effective a safeguard as we were told. Ellsberg claims:
“Several RAND colleagues who were knowledgeable about SAC procedures supported my guess that the numbers in the code were the same for all planes in the SAC alert force. Only a single radio signal needed to be sent out. And their understanding was that the code was changed very seldom.”
Here's one other scenario to share from the book for now. Conventional wisdom is that a bomber pilot, missile launch officer, or submarine commander can't launch an unauthorized strike. Ellsberg, again with his insider's knowledge, continually questions if these controls are as “failsafe” as we'd like to think:
For example, on the matter of the envelope authentication, when I posed the possibility that a conscientious (or unbalanced) pilot who felt impelled to go to target might try to convince others to go with him in the way my memo had speculated, the typical response was: “Well, he can't do that, because he doesn't know the whole authentication code.” I would pause at this point, waiting to hear a second thought expressed (which never occurred). Then I would say offhandedly: “Unless he opened the envelopes.”
Even this hint often failed to turn a light on. I'd hear: “But that's against his orders. If he hasn't gotten the whole signal, he can't open it.” That answer usually hung in the air only a moment or so. The premise was, after all, that the officer in question had come to feel, on some basis or other (like General Jack D. Ripper in Kubrick's Dr. Strangelove, a few years later), that it was time to commence World War III. He was on his way to drop a thermonuclear bomb on Russia, and he wouldn't expect to come back. Everyone I encountered came to agree by this point in the discussion that there was a real problem here, however unlikely.
The book is chilling… and it describes current-day risks, not just some historical situation that mankind was lucky to survive.
The 1964 movie “Fail Safe” is still chilling today… without the dark comedy moments that we get in “Dr. Strangelove” — a film that Ellsberg says was basically an inadvertent documentary.