Solving problems to root cause is an important capability to build. The “why” behind it is to prevent problems from recurring, thus eliminating potential waste from your future activities. Along the way you grasp the situation, take steps to contain the problem so you are still able to provide the product or service to your customers, understand potential direct causes, test the connection from cause to the problem, develop and test a countermeasure (your hypothesis for preventing the problem from ever coming back), and put measures in place to sustain the countermeasure and check that it works and is still in place. (Yes, this is simply a description of P-D-C(S)-A with a containment step thrown in!) Whether you’re talking about 4-Step, 8-Step, x-Step or DMAIC problem solving, the fundamental principles are all the same.
Most organizations that begin a lean transformation are already very good at what they think of as problem solving. A problem comes up, I work hard to understand what caused it, I fix what caused it, and we’re up and running again. I add it to our troubleshooting guides, and therefore if it happens again we will be able to get up and running even faster because we know how to fix it! And the veteran problem solvers will be able to tell you war stories of all-nighters where hours of investigation finally yielded something they never thought of: a motor wired backwards and turning in the wrong direction, a bit set wrong in the control logic, or an incorrect part sent and installed that looked just like another part but had different guts. They might call it “stuff that never should have happened if someone else had done their job right in the first place”, because they are still learning what it means to focus on the process and not the people. They might also tell you that if only they’d called the expert in the first place, they could have avoided all those hours of downtime, because he’d have recognized the symptoms, connected them to a problem he solved 5 years ago, and told them exactly what to check and fix.
“Haven’t we worked on this problem before?” and “Didn’t we fix this last year?” are common phrases that should trigger you to wonder if you really understood the root cause of the problem the first time. It feels like problem solving déjà vu! You honestly shouldn’t feel bad about it though, especially if you are still at the outset (read: first several years or perhaps decades depending on point or systemic problems!) of your lean journey. Solving to root cause, so the problem never occurs again anywhere in your organization, is hard. It can be hard to identify the root cause, hard to roll out countermeasures across groups in multiple global locations, and hard to not get distracted by all the other fires you need to fight this week that seem like a much higher priority.
I recently switched cable, internet, and phone service providers from Time Warner to AT&T, for a whole host of reasons that could be turned into another post on “thinking customer”. So far, after a short period of time, I’m actually very happy with what I now have from AT&T Uverse. The switch hasn’t been all ice cream and puppy dogs, however. I have had a couple of internet setup issues that have been frustrating, but they’ve actually been more due to lack of knowledge and added system complexity on my end vs. something that was the service provider’s responsibility.
Yesterday I woke up to a new problem. When I tried to open a web browser, I got an error message from my AT&T wireless router that popped up on the screen, saying “Excessive Sessions Detected.” It explained that one of my computers had a whole lot of internet sessions going at once, and that it was likely the result of some form of a virus, or malware. So, my head began doing problem solving. Target = Able to access the web from all my devices. Actual = Not able to from one PC. Let’s continue to grasp the situation. Check PC #2 – I get the same error. Check iPhone – I get the same error. Actual now equals “Not able to access web from any devices.”
Potential direct causes… 1) the error message tells me it may be a virus 2) the error message tells me I may have gaming software causing it 3) could be a problem with the router 4) a power problem or connection problem somewhere in the system 5) something is broken on AT&T’s end 6) my internet cache is full 7) just something weird that requires a restart.
Ok, so let’s try and work through the most likely causes – 2) I can eliminate this cause, I don’t have any gaming software going on (sad, I know, I’m a long way from my college and single days!). 4) check to see I have power everywhere – all ok, eliminated. 1) the system error is telling me “virus” is the first place I should look. So I run a check, and sure enough, it finds two items and eliminates them. So as I restart my computer my mind is already jumping down the why chain to root causes like an inadequate standard for setting up my antivirus software, or an inadequate process for selecting antivirus software.
Computer is restarted, and… nope, error message still pops up. How about 7) – let’s restart the router and everything else. Nope, error still there, can cross that one off. Now it is about time for work, so I eat some breakfast, watch some TV, and off I go. Midway through the day my wife tells me the TVs no longer work. So now, new information surfaced that tells me that something in the system is degrading – the problem is getting worse! So in my head I mapped out how the system worked and where problems could occur (see setup picture below), and couldn’t figure out what was changing to cause the new problems, because most of the TVs don’t run off the router. Why did they work in the morning and then suddenly not work in the middle of the day? Now I start leaning towards something on AT&T’s end as the direct cause.
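The elimination so far can be written down as a small tracking table. What follows is only a minimal sketch in Python, not part of the original troubleshooting: the names `Cause`, `Status`, `eliminate`, and `still_open` are made up for illustration, the numbering follows the list of potential direct causes above, and it assumes cause 1 counts as ruled out once the virus scan failed to clear the error.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OPEN = "open"
    ELIMINATED = "eliminated"


@dataclass
class Cause:
    number: int          # numbering from the list of potential direct causes
    description: str
    status: Status = Status.OPEN
    evidence: str = ""   # the check that ruled this cause out


def eliminate(causes, number, evidence):
    """Mark one cause as ruled out and record the check that ruled it out."""
    causes[number].status = Status.ELIMINATED
    causes[number].evidence = evidence


def still_open(causes):
    """Return the numbers of the causes that have not been ruled out yet."""
    return [n for n, c in sorted(causes.items()) if c.status is Status.OPEN]


descriptions = [
    "virus or malware",
    "gaming software",
    "problem with the router",
    "power or connection problem",
    "something broken on AT&T's end",
    "internet cache is full",
    "something weird that requires a restart",
]
causes = {n: Cause(n, d) for n, d in enumerate(descriptions, start=1)}

# Checks made up to this point in the story.
eliminate(causes, 2, "no gaming software in use")
eliminate(causes, 4, "power is on everywhere")
# Assumption: the scan found and removed two items but the error stayed,
# so the virus is treated as ruled out as the direct cause.
eliminate(causes, 1, "scan removed two items, error still appears")
eliminate(causes, 7, "restarted router and everything else, error still appears")

print(still_open(causes))  # [3, 5, 6]
```

Recording the evidence next to each eliminated cause keeps the reasoning visible, which helps when someone else (a support agent, for example) picks up the problem later.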
When I got home I hopped on the phone with AT&T, explained the situation and what I’d done so far, and then we went through their troubleshooting guides. We did a reset from their end, restarted computers and routers and DVRs, and the error still came up. We cleared the internet cache and tried again. Still had the error. We’d now eliminated 5) and 6), and AT&T was out of ideas on their end too. Their only solution left was to send out a technician the next day, and maybe they’d swap out a router to try and check 3) – the only direct cause we had left on the list. This disappointed us, because we wanted to watch the new Modern Family!
As AT&T is finalizing the order for the technician, I decide to check one more thing. On the error message there are two buttons – one says “Do Not Show” and the other says “Continue”. I had tried “Continue” early on and didn’t get anywhere. “Do Not Show” was labeled as something you should only click if you think the cause was gaming software (cause 2, which we had already eliminated), and I didn’t want to ignore the error if I really had a problem or a virus. So at this point I said what the heck and clicked “Do Not Show”. It asked me to log in to my gateway, I did, and then it gave me a message – “The problem has been resolved.”
Eureka! We were now connected to the outside world again! My three year old would not be without her Doc McStuffins in the morning! We could watch Modern Family! I could stream YouTube videos through my TV again! I was the hero, I had “resolved the problem”!
My lean thinking was gnawing at me though. I didn’t know what caused the problem in the first place. I can’t recreate it. I can’t develop any countermeasures to prevent it from happening again. And I’m not sure the AT&T person really captured my “solution” in their knowledge database so that they try it with other customers in the future before deciding to send out a technician. Yes, the problem is contained, and we are up and running again. Is that good enough for this situation? Or should I have done more?
Good organizations are very quick to recognize and contain problems, and to get up and running again to avoid customer service issues. Great organizations have the discipline to spend time working towards truly understanding the root cause of the problem, developing adequate countermeasures, and ensuring waste never recurs.