Fault trees

Arjen Markus (18 April 2008) I got interested in risk analysis while reading a book about it (*) and one of the techniques used to analyse risks is that of a fault tree. The idea is to identify possible failures in a system and how they lead to other failures and finally to a failure of the entire system. Each individual failure or fault gets assigned a probability and by combining them into a tree you can analyse the chance for the system to fail or the most likely path to failure.

(*) Wang and Roush: What every engineer should know about risk engineering and management

To make this more concrete, consider a very simple example:

I want to go from A to B via the train. I can take two routes: a direct one and one that requires two successive train rides. The drawback of the first is that I am more likely to miss it (tight schedule, I leave home too late ;)), so the alternative is the slightly longer route with two train rides. However, that route may fail if the first train arrives too late at the intermediate station (or I miss that one as well).

So the fault tree looks like this:

                     fail to arrive at B
                         in time (?)
                            |
           +---------------AND----------------+
           |                                  |
    fail to catch the
      direct train (0.2)            +-------- OR -----------+
                                    |                       |
                               first train arrives       fail to catch the
                             too late at intermediate    first train (0.1)
                                    station (0.15)

(Note the AND and OR notations - there is no inherent ordering in the events)

The number in parentheses is the probability that this event occurs.

In the above tree, the event "fail to arrive at B in time" occurs only if both branches fail:

    P(fail to arrive at B in time) = P(fail to catch the direct train) *
                                     P(fail to catch either of the two others)

The failure represented by the right branch occurs if either I fail to catch the first train or the first train arrives too late. Assuming these two events are independent, the chance of me arriving in time via this route is:

   P(arrive via two trains) = P(catch first train) * P(first train arrives in time)
                            = (1-0.1) * (1-0.15) = 0.9 * 0.85 = 0.765

So the chance of me failing to catch either of the two trains in the alternative route is:

   P(fail to catch either of the two) = 1 - P(arrive via two trains)
                                      = 0.235

Combining the two branches gives the overall chance of failure:

   P(fail to arrive at B in time) = 0.2 * 0.235 = 0.047

The most likely failing path is the one on the right-hand side: 0.235 instead of 0.2.
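
The hand calculation can be double-checked with a few lines of Tcl. This is only a sketch of the arithmetic, assuming (as the multiplications already imply) that the individual failures are independent; the variable names are made up for this illustration.

```tcl
# Probabilities taken from the fault tree above
set pMissDirect 0.2   ;# fail to catch the direct train
set pMissFirst  0.1   ;# fail to catch the first train
set pFirstLate  0.15  ;# first train arrives too late

# Right-hand branch (OR): the route works only if both things go well
set pArriveViaTwo [expr {(1.0 - $pMissFirst) * (1.0 - $pFirstLate)}]
set pFailViaTwo   [expr {1.0 - $pArriveViaTwo}]

# Top event (AND): both routes must fail
set pTooLate [expr {$pMissDirect * $pFailViaTwo}]

puts [format "P(arrive via two trains)       = %.3f" $pArriveViaTwo]
puts [format "P(fail via two trains)         = %.3f" $pFailViaTwo]
puts [format "P(fail to arrive at B in time) = %.3f" $pTooLate]
```

The values are printed with three decimals so that floating-point rounding does not obscure the comparison with the numbers worked out by hand.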

Now let us try and implement this technique in Tcl:

# failure --
#     Construct a new failure event
#
# Arguments:
#     probability       Probability the failure occurs
#     description       Descriptive text
#
# Result:
#     List representing the failure event
#
proc failure {probability description} {

    if { ![string is double -strict $probability] ||
         $probability < 0.0 || $probability > 1.0 } {
	error "Probability must be a number between 0 and 1"
    }

    return [list FAILURE $probability $description]
}

# all-failing --
#     Combine two or more failure events into a single event, which
#     occurs if all individual events occur.
#
# Arguments:
#     event	     First failure event
#     args	      All other events
#
# Result:
#     List representing the combined failure event
#
proc all-failing {event args} {

    return [concat ALL-FAILING [list $event] $args]
}

# at-least-one-failing --
#     Combine two or more failure events into a single event, which
#     occurs if at least one individual event occurs.
#
# Arguments:
#     event	     First failure event
#     args	      All other events
#
# Result:
#     List representing the combined failure event
#
proc at-least-one-failing {event args} {

    return [concat ONE-FAILING [list $event] $args]
}

# failure-probability --
#     Compute the chance of failure
#
# Arguments:
#     event	     Event to be examined
#
# Result:
#     Probability that the event occurs
#
proc failure-probability {event} {

    switch -- [lindex $event 0] {
	"FAILURE"  { return [lindex $event 1] }
	"ONE-FAILING"  {
	    set probability 1.0
	    foreach e [lrange $event 1 end] {
		set fail [failure-probability $e]
		set probability [expr {$probability * (1.0-$fail)}]
	    }
	    set probability [expr {1.0-$probability}]
	}
	"ALL-FAILING"  {
	    set probability 1.0
	    foreach e [lrange $event 1 end] {
		set fail [failure-probability $e]
		set probability [expr {$probability * $fail}]
	    }
	}
	default {
	    error "Unknown event type: [lindex $event 0]"
	}
    }
    return $probability
}


# most-likely-path --
#     Compute the most likely path to failure
#
# Arguments:
#     event	     Event to be examined
#
# Result:
#     List of events leading to the failure (type: ALL-FAILING)
#
proc most-likely-path {event} {

    return "Alas, this remains to be done"
 
}

# main --
#     Test the procedures with the simple example above
#
set e1 [failure 0.2 "Miss direct train"]
set e2 [failure 0.1 "Miss first train"]
set e3 [failure 0.15 "First train too late"]

set e4 [at-least-one-failing $e2 $e3]
set e5 [all-failing $e1 $e4]

puts "Probability of being too late: [failure-probability $e5]"
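
The most-likely-path procedure is left open above. What follows is a rough sketch of one possible way to fill it in, not the intended implementation. It assumes that "most likely path" means the most probable set of basic failures that together cause the top event, and that all basic events are independent and occur only once in the tree. The names most-likely-path-sketch and path-probability are hypothetical, and the constructors are repeated in compact form (without the range check) only so that the block runs on its own.

```tcl
# Compact copies of the constructors, so this sketch is self-contained
proc failure {probability description} {
    return [list FAILURE $probability $description]
}
proc all-failing {event args} {
    return [concat ALL-FAILING [list $event] $args]
}
proc at-least-one-failing {event args} {
    return [concat ONE-FAILING [list $event] $args]
}

# path-probability -- (hypothetical helper)
#     Product of the probabilities of the basic events in a path,
#     where a path is a list: ALL-FAILING basic-event ...
proc path-probability {path} {
    set probability 1.0
    foreach e [lrange $path 1 end] {
        set probability [expr {$probability * [lindex $e 1]}]
    }
    return $probability
}

# most-likely-path-sketch -- (hypothetical name)
#     Assumed interpretation: return the most probable set of basic
#     events that together cause the given event
proc most-likely-path-sketch {event} {
    switch -- [lindex $event 0] {
        "FAILURE" {
            return [list ALL-FAILING $event]
        }
        "ALL-FAILING" {
            # Every child must fail: join the best paths of all children
            set path [list ALL-FAILING]
            foreach e [lrange $event 1 end] {
                set path [concat $path [lrange [most-likely-path-sketch $e] 1 end]]
            }
            return $path
        }
        "ONE-FAILING" {
            # One failing child suffices: keep the most probable child path
            set best {}
            set bestProbability -1.0
            foreach e [lrange $event 1 end] {
                set path [most-likely-path-sketch $e]
                set probability [path-probability $path]
                if { $probability > $bestProbability } {
                    set best $path
                    set bestProbability $probability
                }
            }
            return $best
        }
        default {
            error "Unknown event type: [lindex $event 0]"
        }
    }
}

# Try it on the train example
set e1 [failure 0.2 "Miss direct train"]
set e2 [failure 0.1 "Miss first train"]
set e3 [failure 0.15 "First train too late"]
set top [all-failing $e1 [at-least-one-failing $e2 $e3]]

set path [most-likely-path-sketch $top]
foreach e [lrange $path 1 end] {
    puts "[lindex $e 2] ([lindex $e 1])"
}
puts [format "Probability of this path: %.3f" [path-probability $path]]
```

Under this reading the example yields the combination "Miss direct train" plus "First train too late", because at the OR node the late first train (0.15) is more likely than missing it (0.1). Other definitions of the most likely path are possible, for instance following the child with the highest failure probability at every node, and they would need a different traversal.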

(Note: another interesting concept I found in that book: Orthogonal arrays, useful for evaluating and optimising design parameters)