The Response to the Hawaii False Alarm Can’t End With Firing Someone

This and other nuclear threats due to bad systems

You can also read or share a shorter version of this that I've posted on LinkedIn.


Saturday, you might have seen the news about a “false alarm” push alert that was sent to many iPhones and other smartphones in Hawaii. An alert was also sent out over many TV channels through the Emergency Alert System.

If you were in Hawaii and actually received the message, it might have been a bit traumatizing, as people shared photos of the “THIS IS NOT A DRILL” alert online:

Here's how it looked on some TV stations, as a crawl:

After about 40 minutes of panic, an “all clear” message was sent, but the aftermath leaves people asking what happened, how it happened, and why.

Some are, inevitably, asking WHO screwed up, looking for a culprit to blame and punish.

It's important first to find out what happened. It's hard to do problem solving through news reports, but let's do our best. As this Wall St. Journal article explains:

“On Saturday morning, just after 8 a.m. local time, a Hawaii state employee hit the wrong button on a computer during a shift change and accidentally sent an alert to many of the state's cellphones that a ballistic missile was about to strike.”

What Was Your Reaction?

Some people are going to have the reaction of, “What an idiot. They should have been more careful.” (see my companion site BeMoreCareful.com).

Is this reaction helpful from government officials?

“I'm extremely angry right now. People should lose their jobs if this was an error,” Hawaii State Representative Matt Lopresti told CNN.

There are, unfortunately, many news stories that use the word “blame” in this situation. Blaming seems to be part of our human nature (and primate nature)… we can do better, though.

Firing or punishing an individual without making meaningful and effective changes to the systems means it's just a matter of time before a new person makes a mistake. It might be appropriate to fire an individual who intentionally did the wrong thing, but that seems not to be the case here. As I've blogged about, the “just culture methodology” is very helpful in cases like this to determine if punishing an individual is fair and “just.” It's not a matter of being “nice” or “soft.” Unfair and unjust punishment gets in the way of future improvement.

My first reaction was along the lines of, “How could the system be so poorly designed that it's possible to accidentally hit such an important button??”

Even if this was an intentional act, if a single person can do that, isn't that another case where we have a badly-designed system?

I'm curious what the “shift change” detail has to do with anything in this scenario. It's reasonable that a shift ended at 8 am and the alert went out “just after 8 am.”

Bad Systems, not Bad People?

It made me wonder if the computer system they were using was designed like this (see image):

This WIRED story says:

“It's a regular PC interface. This person probably had a mouse and a dropdown menu of the kind of alert messages you can send,” and selected the wrong one, Simpson says.

One time, I accidentally booked a flight on the wrong day because, I think, I hit the scroll wheel on my mouse and it changed the date in the dropdown menu. Things like this can happen too easily.

It's a much more trivial situation, but I've blogged before about how the American Airlines website made it far too easy to cancel the wrong flight. To my surprise, American fixed the issue and even contacted me to thank me.

What I suggested to American was the Lean concept of “error proofing” or “mistake proofing.” The focus with mistake proofing is to make it more difficult to do the wrong thing. Or, ideally, we make it impossible to do the wrong thing. It seems this emergency notification system wasn't designed with error proofing in mind. Or, it was designed well and this was an intentional act of sabotage. Who knows.
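To make the error-proofing idea concrete, here is a minimal Python sketch. It is not a description of Hawaii's actual software. It assumes a design where the drill path and the live path are separate, and where a live alert is blocked unless the operator types an exact phrase, which is harder to do by accident than clicking the wrong item in a dropdown. The names `AlertMode`, `send_alert`, and `LIVE_CONFIRMATION_PHRASE`, along with the message text, are all hypothetical.

```python
from enum import Enum


class AlertMode(Enum):
    DRILL = "drill"
    LIVE = "live"


# Hypothetical phrase an operator must type before a live alert goes out.
LIVE_CONFIRMATION_PHRASE = "SEND LIVE MISSILE ALERT"


class ConfirmationError(Exception):
    """Raised when a live alert is requested without the exact typed phrase."""


def send_alert(mode: AlertMode, typed_confirmation: str = "") -> str:
    """Return the message that would be broadcast (nothing is actually sent)."""
    if mode is AlertMode.DRILL:
        # Drill messages are always clearly marked as drills.
        return "DRILL: test of the emergency alert system."
    # A stray click cannot reach this point with the right phrase already typed.
    if typed_confirmation != LIVE_CONFIRMATION_PHRASE:
        raise ConfirmationError("Live alert blocked: confirmation phrase did not match.")
    return "PLACEHOLDER LIVE ALERT TEXT. THIS IS NOT A DRILL."
```

With a design like this, `send_alert(AlertMode.LIVE, "yes")` raises an error instead of broadcasting. That makes the wrong action harder to do, though still not impossible.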

I'll continue with the assumption that this was an honest mistake. A “slip” or an “error.” If so, I'd call this a systems problem instead of labeling it “human error.” Calling something “human error” often leads us to blame the individual or to throw our hands up and say, “What can we do? People aren't perfect.”

Well, what we can do is design better systems that are less prone to human error. That principle applies in healthcare and other settings too. Telling others to be careful is NOT an effective strategy for quality or safety. Errors might occur even WITH people being careful.

See a previous blog post on reacting to “human error:”

3 Ways to React to Human Error

Again from the WSJ:

“Officials canceled the alert six minutes later to stop it from being sent to people who hadn't already received it. But the news continued to proliferate as frightened residents called friends and family members.”

It's good that they stopped the flow of erroneous messages going out. We'd call that a “containment” step to stem the flow of poor quality information. Apparently the entire state doesn't receive messages exactly simultaneously, which makes sense. I wonder if somebody in that office (or the person who pushed the “wrong button”) got the alert and realized something had gone wrong. Was the person who “pushed the wrong button” on the way to their car?

“Thirty-eight minutes passed before state officials sent a new message that said the first alert had been a false alarm.”

My next question is why it took so long to “cancel” the erroneous message after it was detected. There wasn't an easy “send all clear” button in that computer system? Were there checks and balances (or some bureaucracy to clear) before the all clear could be sent? It doesn't seem like those checks and balances were there for the original terrifying message.

“Mr. Miyagi didn't offer an explanation for why more than half an hour passed before a notification was sent to cellphones that the alert had been a false alarm. “One thing we have to work on more is the cancellation notice,” he said.”

The WSJ shares a bit more detail about how there was apparently a routine test of systems during shift change:

“At a press conference Saturday afternoon, Vern T. Miyagi, administrator of the Hawaii Emergency Management Agency, said the alert was mistakenly triggered when the wrong computer button was clicked during a morning shift change test of the emergency-alert system.”

I wonder how much thought went into designing the test process, with a mind for “how do we prevent triggering a false alarm?”

How Do You Improve the Process to Prevent Recurrence?

Better late than never, the state has changed the process to require some redundancy:

“The system for sending out emergency alerts has now been changed to a system with two people involved, so the same kind of mistake couldn't happen again, state officials said.”

I'm guessing they couldn't change the actual software that quickly. Does “two people involved” mean somebody standing over the shoulder of the clicker to make sure they're being careful enough?

Does that process change truly prevent a repeat of the problem or just make it a little less likely?

“We've implemented changes already to ensure that it becomes a redundant process, so it won't be a single individual,” Mr. Miyagi said.
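The news reports don't say how the “two people involved” rule works in practice. As one possible reading, here is a minimal Python sketch of a rule enforced by the software itself rather than by someone watching over a shoulder. It assumes a live alert stays unsent until a second, different operator approves it. The class `PendingAlert` and the operator IDs are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PendingAlert:
    """A live alert that stays unsent until a second person approves it."""
    message: str
    requested_by: str
    approvals: set = field(default_factory=set)

    def approve(self, operator_id: str) -> None:
        # The requester cannot act as their own second check.
        if operator_id == self.requested_by:
            raise PermissionError("Requester cannot approve their own alert.")
        self.approvals.add(operator_id)

    def can_send(self) -> bool:
        # Requester plus at least one other approver means two people involved.
        return len(self.approvals) >= 1
```

Even this only makes a repeat less likely. Two people can still both approve the wrong message, so it reduces the risk rather than eliminating it.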

I hope they aren't going down the path to blaming that particular (or any) single individual. I'll give the state credit for appearing to take a systems view instead of saying something like “the person who made the error has been put on leave” or something like that.

The Federal Communications Commission has also launched an investigation.

“FCC Commissioner Jessica Rosenworcel said the commission must find out what went wrong.

“Emergency alerts are meant to keep us and our families safe, not to create false panic. We must investigate and we must do better,” she wrote on Twitter.”

Have you seen other news stories with details about what happened?

What are the lessons learned for your workplace? Are there likely errors that could occur, where it's just a matter of time?

Follow up post:

[Updated] Somebody *Did* Get Unjustly Fired in Hawaii, But System Problems Should be Blamed

 

More on this disturbing theme:

We Still Live Under the Chilling Risk of Accidental Nuclear War

There's a fascinating, if chilling, book that I read recently: The Doomsday Machine: Confessions of a Nuclear War Planner by Daniel Ellsberg, who used to work deep in the military-industrial complex (he was the leaker of “The Pentagon Papers”).

In the same timeframe as the Pentagon Papers, when he was on trial, Ellsberg also had a stash of stolen papers about the U.S. nuclear program that detailed, for example, the extremely high risk of an accidental nuclear war. As described in this book review:

“Intending to release these as a follow-up to his Vietnam leaks, he gave them to his brother for safekeeping, who buried them in his yard in upstate New York. Soon after, a hurricane and flood swept them away into a landfill…”

Even with searches of the landfill, the papers were never found.

Maybe I'll do another blog post on The Doomsday Machine. There were many shocking allegations that will make you question the design of our nuclear systems (and those of the Russians). The risk of “accidental” full-on nuclear war is still very high today, Ellsberg says, and that's not just a commentary on our current president. In the book, it sounds like there are some bad systems and processes that could lead to full-on nuclear war WITHOUT the president sending an ill-considered first strike.

Some of the supposedly “elaborate safeguards” that were supposed to prevent a rogue launch of a nuclear strike without the president's permission were subject to workarounds.

There's the famous story of how, supposedly, the safeguard launch lock code on all Minuteman missiles was set to all zeros after JFK ordered an extra protection be put in place. The Air Force claims that was never the case. Ellsberg's book lays out many examples where speed was emphasized over safety or caution. It's logical that those responsible for having to launch missiles would do everything in their power to be able to launch them quickly (before being hit by a Soviet strike), but the focus on speed certainly led to a number of risks.

Speed was cited as a factor in this erroneous message, per WIRED.

Simpson agrees: “You don't want to be in the middle of an attack on the US and have someone fumbling around with the message.” It's also natural to conduct exercises to ensure the system is functioning. The problem in this case, Simpson says, is any exercise message should begin with the words, “EXERCISE EXERCISE EXERCISE.”

“This was probably a state-run emergency exercise that doesn't have the strong controls that DoD has learned the hard way from 50 years of screwing up,” Simpson says.

Using “EXERCISE EXERCISE EXERCISE” sounds like a form of error proofing that mitigates an error rather than preventing it. People might not have panicked over that form of message.
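As a rough illustration of that mitigation, here is a minimal Python sketch in which the exercise path always adds the prefix, so an exercise message sent by mistake still reads as an exercise. The function names `format_message` and `send_exercise` are hypothetical, and this is not how any real alerting system is known to be built.

```python
EXERCISE_PREFIX = "EXERCISE EXERCISE EXERCISE"


def format_message(body: str, is_exercise: bool) -> str:
    """Prefix every exercise message so it reads as a drill even if sent by mistake."""
    if is_exercise:
        return f"{EXERCISE_PREFIX}: {body}"
    return body


def send_exercise(body: str) -> str:
    # The exercise path has no way to produce an unmarked message.
    return format_message(body, is_exercise=True)
```

Note the limit of this approach: it only helps if the operator is on the exercise path. If the live message is selected by mistake, the prefix never gets applied, which is why it mitigates rather than prevents.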

Has the DoD learned? Ellsberg says no.

It's also claimed that the launch codes for strategic bombers weren't really as effective a safeguard as we were told. Ellsberg claims:

“Several RAND colleagues who were knowledgeable about SAC procedures supported my guess that the numbers in the code were the same for all planes in the SAC alert force. Only a single radio signal needed to be sent out. And their understanding was that the code was changed very seldom.”

Here's one other scenario to share from the book for now. Conventional wisdom is that a bomber pilot, missile launch officer, or submarine commander can't launch an unauthorized strike. Ellsberg, again with his insider's knowledge, continually questions if these controls are as “failsafe” as we'd like to think:

For example, on the matter of the envelope authentication, when I posed the possibility that a conscientious (or unbalanced) pilot who felt impelled to go to target might try to convince others to go with him in the way my memo had speculated, the typical response was: “Well, he can't do that, because he doesn't know the whole authentication code.” I would pause at this point, waiting to hear a second thought expressed (which never occurred). Then I would say offhandedly: “Unless he opened the envelopes.”

Even this hint often failed to turn a light on. I'd hear: “But that's against his orders. If he hasn't gotten the whole signal, he can't open it.” That answer usually hung in the air only a moment or so. The premise was, after all, that the officer in question had come to feel, on some basis or other (like General Jack D. Ripper in Kubrick's Dr. Strangelove, a few years later), that it was time to commence World War III. He was on his way to drop a thermonuclear bomb on Russia, and he wouldn't expect to come back. Everyone I encountered came to agree by this point in the discussion that there was a real problem here, however unlikely.

The book is chilling… and it describes current-day risks, not just some historical situation that mankind was lucky to survive.

The 1964 movie “Fail Safe” is still chilling today… without the dark comedy moments that we get in “Dr. Strangelove,” a film that Ellsberg says was basically an inadvertent documentary.

Mark Graban is an internationally-recognized consultant, author, and professional speaker who has worked in healthcare, manufacturing, and startups. He is author of the Shingo Award-winning books Lean Hospitals and Healthcare Kaizen, as well as The Executive Guide to Healthcare Kaizen. His most recent book is the anthology Practicing Lean that benefits the Louise H. Batz Patient Safety Foundation, where Mark is a board member. Mark is also a Senior Advisor to the technology company KaiNexus. His latest book has been released as an "in-progress" book, titled Measures of Success.

14 Comments
  1. B. Bryant says

    I am going to point out the obvious, but not obvious solution. Get rid of the nuclear weapons! There are still people out there that think that there is a real survivable chance to use these weapons. They are still making them day in and day out on top of the ones that are already in place.

    There is no chance of surviving the use of these weapons for any human being! Period! End of story and humankind. It’s just a matter of who will die first and last. We would all be gone in short order!

    MAD= Mutually Assured Destruction! Can’t get much clearer than that!!!!!!

    1. Mark Graban says

      What’s your countermeasure? Your next step?

      Sure, it would be ideal if nuclear weapons didn’t exist.

      I think you’d appreciate Ellsberg’s book “The Doomsday Machine.” He points out that the US and Russia have FAR more nuclear weapons than are needed for any deterrence or threatened second strike retaliation.

      The risk of one nuke going off (through terrorism) in DC or Moscow could easily trigger an automated or uncontrolled chain of events that leads to all-out war, which (if you believe in nuclear winter) will kill nearly every human on earth.

      Ellsberg proposes the idea that the US doesn’t need silo-based ICBMs anymore and they could be eliminated, saving a ton of money, without reducing the nation’s security.

  2. Mark Graban says

    Now the blame game begins?

    Worker who sent out Hawaii missile alert reassigned

    It’s not as bad as the headline suggests. Yes, they reassigned the worker, but won’t fire him.

    “Part of the problem was it was too easy — for anyone — to make such a big mistake,” Rapoza said. “We have to make sure that we’re not looking for retribution, but we should be fixing the problems in the system. … I know that it’s a very, very difficult situation for him.”

    1. Mark Graban says

      That same article talks about an additional process change:

      The agency has already put in place new safeguards to prevent such a misfire — including a “cancel” button that will immediately send out corrective alerts when an erroneous warning is issued, officials said Sunday.

      That means they won’t have another one of those 38 minute panic-filled delays.

  3. Mark Graban says

    Good stuff from Twitter:

  4. Mark Graban says

    NPR reported that the state of Hawaii had to contact FEMA to get the corrected message and the all clear sent out. That’s apparently one reason for the 38 minute delay.

    Shaking my head…

  5. Mark Graban says

    Comment from LinkedIn:

    “Or a deliberate act?”

    My reply:

    I’d say a system that allows one person to commit a deliberate act like that isn’t a good system.

    If I were really investigating this in the workplace, if I were a leader there, I would certainly ask questions to find out if it was deliberate. If you’re familiar with the “Just Culture” methodology, deliberate acts are treated differently than accidents or mistakes.

  6. Mark Graban says

    Another article says:

    “The employee then confirmed the choice when prompted to do so by the program.”

    That’s the old “are you sure?” popup. That’s not 100% effective error proofing.

    How often have I accidentally clicked “no” to the “Do you want to save before closing?” question? Too many times to count. Sometimes we just click wrong or our finger gets ahead of our brain.

    If an “are you sure?” was effective, why not have a second “are you really sure?” question? That wouldn’t be 100% effective, either.

    Vern Miyagi, the agency administrator, said at a Saturday press conference that the employee “feels terrible” about what happened.

    “This guy feels bad, right. He’s not doing this on purpose — it was a mistake,” Miyagi said.

    Of course he feels bad. People generally don’t come to work to make mistakes…

  7. Mark Graban says

    Sadly, what Don Norman describes in this article also happens in healthcare too often:

    When some error occurs, it is commonplace to look for the reason. In serious cases, a committee is formed which more or less thoroughly tries to determine the cause. Eventually, it will be discovered that a person did something wrong. “Hah,” says the investigation committee. “Human error. Increase the training. Punish the guilty person.” Everyone feels good. The public is reassured. An innocent person is punished, and the real problem remains unfixed.
