Thursday 29th of July 2010

Problem Solving Principles
The most widely accepted and generally applicable problem solving methodologies basically perform the following two analyses:
  1. Analyze the problem state.
  2. Analyze the solved state.
The problem state is basically "what don't you like about the way this works", with the analysis yielding the root cause of it not working correctly. Anyone troubleshooting well defined systems like machines and computer systems and networks will instantly recognize this as what they do. It consists of an accurate symptom description followed by deductive reasoning.

Analyzing the solved state is basically asking "how would you like things to be", followed by inductive reasoning to formulate a system that can deliver those results. Also included is the design work needed to actually realize the conceived solved state. Because design and inductive reasoning require much more creativity than deduction does, analyzing the solved state demands quite a bit of creativity (and therefore quite a bit of brainpower to supply it). Remember this point.

Sometimes only one of these two steps is required. In fixing a machine, the solved state degenerates into "the as designed state and behavior". No analysis, induction or creativity needed. This point will be made time and again throughout this issue of Troubleshooting Professional.

Likewise, analyzing the problem state is sometimes unnecessary, as pointed out by problem solving expert Fred Nickols (link in the URLs section). Fred points out that sometimes it's impossible to return to the pre-problem state anyway, so analyzing the problem state is a needless exercise. As an example he cites the stock market crash of 1987: although the crash caused many problems, it could not be undone.

Perhaps a more common example of irrelevant problem state occurs in problems created by progress. When the free Linux operating system acquired a power level comparable with UNIX, many UNIX vendors went out of business. The UNIX vendors were still operating as designed, but now they were in an environment hostile to that design. No amount of assigning cause would make the UNIX vendors profitable. Only a new solved state would do that.

Most advocates of general problem solving methodologies make additions to the two steps of analyzing the problem state and analyzing the solved state. Many include a step or substep that could be called "how do we get there from here?". In other words, how do we move from our present condition to the solved state? Oftentimes that requires co-worker support, management "buy in", a budget, and much, much more. It's not simple.

Another frequently added step is "how do we prevent future problems?". That can be either future occurrences of the same problem, problems caused by the solution, or even brand new problems. So a possible generic problem solving process could look like this:

  1. Find the root cause of the problem
  2. Determine the desired solution
  3. Decide how to implement that solution
  4. Head off future problems
Don't let the simplicity of the preceding process fool you. Each of those steps contains many substeps and tools. For instance, five of the ten steps in the Universal Troubleshooting Process apply solely to #1 above:
  2. Get a complete and accurate symptom description
  3. Make damage control plan
  4. Reproduce the symptom
  5. Do the appropriate general maintenance
  6. Narrow it down to the root cause

Most experts have an extremely complex methodology for determining the desired solution. It's tough to jump-start creativity.

So we have an interesting distinction in problem solving. Some systems have a documented as-designed state and behavior -- typically machines, computerized systems and networks. The desired solution in such systems degenerates to the restoration of as-designed state and behavior. No inductive solution finding is necessary. Several problem solving methodologies are optimized for systems with defined and documented state and behavior. Those methodologies produce ultra-fast solutions because the Troubleshooter doesn't waste his time doing steps whose purpose is to creatively determine the solved state, which in this case is already defined.

I choose to call systems with a defined and documented state and behavior "well defined systems". I choose to call systems without a defined and documented state and behavior "fuzzily defined systems". System definition is not a binary absolute, but rather a spectrum. On one end are machines designed by humans, with complete schematic diagrams. Everything about the system is known. Simple to moderately complex machines, and extremely well documented computer programs are great examples.

On the other end of the spectrum is the human mind, which is probably the most erratic, unpredictable and variable system imaginable. Any documentation of the human mind, or of relationships between a small number of people, is a matter of statistics at best, or conjecture and anecdote at worst. When dealing with the human mind or relationships between a few people, solutions are almost completely creative in nature -- it's just too hard to deduce a "root cause". This is especially true in relationship disputes, where each party assigns causation to the other :-)

Human physiology is somewhere in the middle. It's absolutely documented and defined that a human without a liver or a heart will soon die. It's absolutely documented and defined that the organ used to see is the eyes. At the most general level, doctors and physiologists can draw a very accurate block diagram of a human. But when it gets down to details like the effect of hormones on various systems in the body, the definition goes way down, once again supported by statistics and anecdotal evidence. To the extent that human physiology is well defined, the "fix" is to restore the bad "part" to its as-designed (or in this case normal) state. At lower levels where physiology is fuzzily defined, much creativity is necessary to fix symptoms with a minimum of side effects.

Somewhere between human physiology and simple machines are complex mechanisms like computer operating systems. Although the point could be made that an operating system is completely documented by its body of source code, the fact is there's no human capable of knowing that whole body of source code. Therefore, its level of definition depends on block diagrams and other documentation. Some operating systems are more predictable and straightforward than others. UNIX and most of its workalikes (Linux included) are fairly modular, lending themselves well to rather complete documentation of state and behavior. The Windows operating system, on the other hand, is rather non-modular and unpredictable, so that at many levels it is not well defined in spite of the fact that it was built by humans and its entire body of source code exists.

Somewhere between human physiology and the human mind is the human organization. Humans are like gaseous material -- the more of them there are, the more predictable (deducible) they become. The existence of rules and policies makes them even more predictable, in some cases to the point where they can be accurately modeled. Human organizations can be documented, defined and modeled when mixed with machines, technology and policies, such as in a factory. This is one of the reasons for the extraordinary success of the Theory of Constraints in the manufacturing sector.

So in solving problems, the system's level of definition determines whether it's necessary to analyze the problem state, the solved state, or both.

Other Distinctions

Level of definition is perhaps the most evident and important distinction, but not the only one.

Reproducibility/Intermittence

Another important distinction is that of reproducibility, which again is a spectrum. At one end is the reproducible problem, which can "always" be reproduced by a known set of steps (the reproduction sequence). Any problem not reproducible is said to be intermittent. But intermittents come in all occurrence frequencies, from the type that always happens within 2 minutes (which is very near reproducible, at least for troubleshooting purposes), to the event, a single occurrence which requires a very special methodology to solve.

Within the reproducibility distinction, there's a sub-distinction consisting of whether intermittence is caused by a fuzzily defined system (example: the Windows operating system), or by a well defined component whose malfunction happens to vary with time, temperature, stress, distortion, etc. (a thermally intermittent transistor is an example). They are both very difficult to solve. The latter is usually soluble given sufficient time. The former often is not. They're both troubleshot approximately the same way, so this is the last we'll say about this sub-distinction.

The most effective weapon against intermittence is maintenance, both preventative (proactive: before the fact) and repair-consequent (reactive: after the fact, called General Maintenance in the Universal Troubleshooting Process). Intermittence is often caused by worn, dirty or bent connections between components, rather than by the components themselves. Such non-component causes are harder to find by normal deductive reasoning, but they're a primary target of maintenance. Repair-consequent maintenance is forbidden and extremely dangerous in safety critical situations, but preventative maintenance is acceptable and necessary in safety critical situations.

Because maintenance is such an effective weapon against intermittence, those problem solving methodologies emphasizing maintenance are most effective against intermittence. The champion here is the Intelliworxx Era 4 troubleshooting tool, which raises repair-consequent maintenance to an artform. The Universal Troubleshooting Process also places emphasis on repair-consequent maintenance, having it as step 5 of the process (General Maintenance).

The Root Cause Analysis problem solving methodology contains a step called Barrier Analysis, which examines barriers to failure. Certainly preventative maintenance, and the policies and procedures that support it, is a major barrier. Though not specifically named in Root Cause Analysis, preventative maintenance is obviously a major part of that methodology.

As a matter of fact, intermittence seldom survives Root Cause Analysis. Besides its emphasis on preventative maintenance, Root Cause Analysis demands the finding of the true root cause. For instance, a power plant's reactor tripped because of a bad power supply board, but what was wrong with the board? The board had a bad solder joint, but why did that bad solder joint happen? The solder joint went bad because of constantly elevated temperatures in the room, but why were the temperatures high? The temperatures were high because one of the room's air conditioners had conked out, but why wasn't that fact discovered before it caused damage? The fact wasn't discovered because there was no reporting procedure for temperature in the room. So ultimately, lack of a procedure to report the room's temperature tripped the reactor. Once the procedure is put in place, the air conditioner is repaired or replaced, the solder joint is resoldered, and the reactor is put back on line, it will never happen again. Well, except that my simplification forgot to follow the fault tree down the air conditioner branch to find why it failed, but assuming you follow that branch too, you can be pretty sure that failure mechanism will never occur again.

Contrast this with the repair of consumer equipment, where the solder joint itself is considered the root cause. If that stereo is returned to a hot room...
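
The difference between those two stopping points can be sketched in a few lines of Python. This is only an illustration, not a piece of the Root Cause Analysis methodology itself: the CAUSED_BY table simply restates the power plant example, and the function name trace_cause_chain is made up for this sketch. A real fault tree branches (the air conditioner has its own causes), whereas this sketch assumes one cause per effect.

```python
# Each effect maps to the cause named in the power plant example.
# Assumption: one cause per effect; a real fault tree would branch.
CAUSED_BY = {
    "reactor tripped": "bad power supply board",
    "bad power supply board": "bad solder joint",
    "bad solder joint": "constantly elevated room temperature",
    "constantly elevated room temperature": "undetected air conditioner failure",
    "undetected air conditioner failure": "no temperature reporting procedure",
}


def trace_cause_chain(symptom, caused_by):
    """Keep asking "why?" until no further cause is recorded."""
    chain = [symptom]
    seen = {symptom}
    while chain[-1] in caused_by:
        cause = caused_by[chain[-1]]
        if cause in seen:
            # Guard against a table that loops back on itself.
            raise ValueError(f"circular cause chain at {cause!r}")
        chain.append(cause)
        seen.add(cause)
    return chain


chain = trace_cause_chain("reactor tripped", CAUSED_BY)
root_cause = chain[-1]                     # where Root Cause Analysis stops
consumer_stop = chain[:chain.index("bad solder joint") + 1]  # where a consumer repair stops
```

Here root_cause is the missing reporting procedure, while consumer_stop ends at the solder joint, leaving three deeper causes unexamined.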

Safety Sensitivity

Some systems are more safety sensitive than others. For instance, it would be rare indeed for an inadequate repair on a battery powered radio to cause injury to anyone. On the other hand, a repair mishap on a nuclear defense system could kill millions. Repair methodologies for the two are completely different.

Most evidently, the repair-consequent maintenance that speeds repair so much in non safety sensitive systems can kill in safety sensitive systems. I have no idea how nuclear defense systems work, but imagine an armed missile goes into launch sequence and fortunately is manually disarmed. Would you go in and clean all switches and controls?

Not likely. Let's say cleaning those switches and controls fixed the problem because the problem was a dirty switch. You don't know which switch. You don't know what caused it to get dirty. You never will. You just erased the evidence. Some day that switch or another one will get dirty, a missile will go into launch sequence again, and maybe, just maybe, nobody will shut it down in time. No repair-consequent maintenance is acceptable in extreme safety sensitive situations.

Another obvious difference in the treatment of safety sensitive systems is that you don't try to reproduce the problem. It would be just a little too gutsy to try, for instance, to reproduce the missile's spontaneous launch sequence initiation, thus placing the world within a minute of nuclear war. Incidentally, this is sort of what happened at Chernobyl. The technicians wanted to investigate the system's behavior in the absence of various safety mechanisms, so they defeated those safety mechanisms.

And then there's the fact that in extreme safety sensitive systems, the term "root cause" has an entirely different meaning, as illustrated by the discussion of the power plant in the preceding section of this article. Also, in extreme safety sensitive systems, one never dismisses an apparently disappeared intermittent with "it's probably fixed". Consumer testing is wonderful for televisions, but not for jumbo jets.

Between the extremes of battery powered radios and nuclear defense systems is a wide spectrum of systems whose problem solving methodologies represent a tradeoff between cost effectiveness and safety.

Bad car brakes can kill, but nobody would spend the thousands of dollars it would take to trace a lone event of brake failure. Instead, symptom reproduction is attempted, repair-consequent maintenance is done, and finally a non-rigorous analysis is done of likely causes, and all implicated parts are replaced. This is halfway between what would be done with a battery operated radio (sorry, we can't reproduce the symptom), and a nuclear power plant (full Root Cause Analysis).

Total Failures vs. Inadequacies

When a car won't start, it won't start. No ifs, ands or buts -- it doesn't go. That's a total failure. But sometimes a car can go only 30 mph. That's an inadequacy. The first hint of an inadequacy is the inclusion of words like "too", "enough", "sufficient" or "insufficient" in the symptom description:
  • It's too slow
  • There's not enough power
  • The bandwidth is insufficient
  • The water comes out too fast
Note that sometimes an inadequacy is an overabundance, such as the last of the preceding symptom description examples. In cases of hyper-adequacy, some limiter is doing an inadequate job.
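
As a rough illustration of that "first hint", here is a minimal Python sketch that scans a symptom description for the four words listed above. The function name looks_like_inadequacy is made up for this sketch, and a word match is only a hint, not a diagnosis.

```python
import re

# The hint words named in the text. Finding one is only a first hint.
INADEQUACY_HINTS = {"too", "enough", "sufficient", "insufficient"}


def looks_like_inadequacy(symptom):
    """Return True if the symptom description contains one of the hint words."""
    words = re.findall(r"[a-z']+", symptom.lower())
    return any(word in INADEQUACY_HINTS for word in words)


examples = [
    "It's too slow",
    "There's not enough power",
    "The bandwidth is insufficient",
    "The water comes out too fast",
    "The car won't start",
]
flagged = [s for s in examples if looks_like_inadequacy(s)]
```

The first four descriptions are flagged and the total failure ("The car won't start") is not. A symptom description can still describe an inadequacy without using any of these words, so a miss proves nothing.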

Inadequacies don't just happen in machines and technology:

  • We're over budget by 30%
  • We took a loss (insufficient revenue to cover expenses)
  • Our high employee turnover is a major cost
  • All our shipments are late
The Theory of Constraints (TOC) is made especially to fix inadequacies. A subset, Bottleneck Analysis, is used to fix inadequacies in well defined systems.
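
To make the bottleneck idea concrete, here is a minimal Python sketch. It assumes the simplest possible well defined system: a serial chain of stages, each with a known capacity, where the whole chain can go no faster than its slowest stage. The stage names and numbers are invented for illustration, and find_bottleneck is a hypothetical helper, not part of any published TOC tool.

```python
def find_bottleneck(capacities):
    """Return (stage, capacity) for the lowest-capacity stage of a serial chain.

    Assumption: stages run in series, so the slowest stage limits the
    throughput of the whole chain.
    """
    if not capacities:
        raise ValueError("at least one stage is required")
    stage = min(capacities, key=capacities.get)
    return stage, capacities[stage]


# Hypothetical production line, capacities in units per hour.
line = {"cutting": 120, "welding": 45, "painting": 80}
stage, throughput = find_bottleneck(line)
```

In this made-up line the bottleneck is welding, so speeding up cutting or painting would not relieve the inadequacy at all.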

These Distinctions Determine the Problem Solving Methodology

Theoretically, you could use generic problem solving methodologies (Analyze the problem state, analyze the solved state, design the transition, and prevent future problems) to solve all problems. However, in solving well defined problems, competitors using processes optimized for well defined systems would beat your problem solving productivity several fold. With inadequacies, competitors using the Theory of Constraints would run circles around you. And you'd need to build in special considerations for safety-sensitive systems.

Using one problem solving methodology, for all categories of problems, would be like building houses using only a hammer. In this competitive world, for each type of problem solving you do, you need a problem solving methodology optimized for that category of problem.
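
One way to picture how the distinctions steer the choice is the following Python sketch. It is only an illustration built from the examples in this article, not a definitive decision procedure: the Problem class, its field names and the order of the checks are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    """Hypothetical summary of the distinctions discussed in this article."""
    well_defined: bool       # documented as-designed state and behavior?
    safety_critical: bool    # could a repair mishap injure or kill?
    inadequacy: bool         # "too slow", "not enough", and so on?


def suggest_methodology(problem):
    """Map the distinctions to a methodology named in the article."""
    # Assumption: safety sensitivity overrides every other consideration.
    if problem.safety_critical:
        return "Root Cause Analysis"
    if problem.inadequacy:
        # Bottleneck Analysis is the TOC subset for well defined systems.
        if problem.well_defined:
            return "Bottleneck Analysis"
        return "Theory of Constraints"
    if problem.well_defined:
        return "Universal Troubleshooting Process"
    return "generic problem solving methodology"


slow_network = Problem(well_defined=True, safety_critical=False, inadequacy=True)
choice = suggest_methodology(slow_network)
```

For the slow network the sketch suggests Bottleneck Analysis. Real problems sit on spectrums rather than on three yes/no switches, so treat the booleans as a deliberate simplification.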

The good news is that once you've learned one, the rest are easier to learn. They all have commonalities.

Problem Solving Commonalities

  • Every legitimate problem solving methodology demands that you define the problem very early. The reasons are pretty obvious.
  • All methodologies demand that you work toward a goal, although in the case of methodologies optimized for well defined systems that goal degenerates into "return system to as-designed behavior and state".
  • Most methodologies either explicitly or implicitly demand you prevent future occurrence of the problem.
  • All methodologies demand that if it's necessary to find the root cause, it be found by deductive reasoning.
  • Most methodologies demand testing of the "repair" in one form or another.
  • All legitimate methodologies are based on cause and effect, especially when analyzing the problem state.
  • Most methodologies incorporate ethics explicitly or implicitly.
  • All methodologies demand that if it's necessary to define a solved state beyond "as designed", that creative processes be used. Many methodologies offer suggestions on creative processes. Most methodologies encourage considering solutions that deviate significantly (often radically) from "the way we've done it before".
  • Well defined systems can grow into fuzzily defined systems when the environment in which they operate changes significantly. For instance, after thousands of transistor radios shorted out in the shower, somebody invented a shower radio. A well defined computer program becomes fuzzy when the humans forming the work, paper and data flow that the computer program models do fuzzily defined things.

In Summary

There are many distinctions by which you can classify systems, including:
  • Degree of definition
  • Degree of reproducibility
  • Degree of safety sensitivity
  • Is the problem a matter of degree?
These and other distinctions enable the problem solver to pick a methodology optimized for the system at hand. Picking a severely suboptimal methodology can cost significant money or cause significant damage, injury or death.

In spite of the differences between systems and their associated methodologies, there are remarkable similarities between the methodologies. After learning one, it's likely that learning others will be significantly easier. That's a good thing, because it's likely the expert problem solver will need multiple methodologies because he solves multiple problem categories.

 
