Arjen Markus (18 April 2008) I got interested in risk analysis while reading a book about it (*). One of the techniques used to analyse risks is the fault tree. The idea is to identify possible failures in a system and how they lead to other failures and, finally, to a failure of the entire system. Each individual failure or fault is assigned a probability, and by combining them into a *tree* you can analyse the chance that the system fails, or the most likely path to failure.

(*) Wang and Roush: What every engineer should know about risk engineering and management

To make this more concrete, consider a very simple example:

I want to go from A to B via the train. I can take two routes: a direct one and one that requires two successive train rides. The drawback of the first is that I am more likely to miss it (tight schedule, I leave home too late ;)), so the alternative is the slightly longer route with two train rides. However, that route may fail if the first train arrives too late at the intermediate station (or I miss that one as well).

So the fault tree looks like this:

                   fail to arrive at B in time (?)
                                  |
              +------------------AND------------------+
              |                                       |
      fail to catch the                  +----------- OR -----------+
      direct train (0.2)                 |                          |
                                first train arrives         fail to catch the
                                too late at intermediate    first train (0.1)
                                station (0.15)

(Note the AND and OR notations - there is no inherent ordering in the events)

The number in parentheses is the probability that this event occurs.

In the above tree, the event "fail to arrive at B in time" occurs only if *both* branches fail:

P(fail to arrive at B in time) = P(fail to catch the direct train) * P(fail to catch either of the two others)

The failure represented by the right branch occurs if either I fail to catch the first train or the first train arrives too late. So, the chance of me arriving in time via this route is:

P(arrive via two trains) = P(catch first train) * P(first train arrives in time) = (1-0.1) * (1-0.15) = 0.9 * 0.85 = 0.765

So the chance of me failing to catch either of the two trains in the alternative route is:

P(fail to catch either of the two) = 1 - P(arrive via two trains) = 0.235

Combining the two branches (and treating the individual events as independent) gives the chance of failure:

P(fail to arrive at B in time) = 0.2 * 0.235 = 0.047

Of the two branches, the one on the right-hand side is the more likely to fail: 0.235 instead of 0.2
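As a quick check of the arithmetic above, the same numbers can be reproduced with a few expr calls. This is only a sketch: the variable names are made up for the occasion, and the events are assumed to be independent, as in the hand calculation.

```tcl
# Probabilities taken from the fault tree above
set pMissDirect 0.2    ;# fail to catch the direct train
set pMissFirst  0.1    ;# fail to catch the first train
set pFirstLate  0.15   ;# first train arrives too late

# OR branch: the two-train route fails unless both steps succeed
set pTwoTrainsFail [expr {1.0 - (1.0 - $pMissFirst) * (1.0 - $pFirstLate)}]

# AND node: too late at B only if both routes fail
set pTooLate [expr {$pMissDirect * $pTwoTrainsFail}]

puts [format "Two-train route fails: %.3f" $pTwoTrainsFail]
puts [format "Too late at B:         %.3f" $pTooLate]
```

This prints 0.235 and 0.047, matching the values derived by hand.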

Now let us try and implement this technique in Tcl:

    # failure --
    #     Construct a new failure event
    #
    # Arguments:
    #     probability     Probability the failure occurs
    #     description     Descriptive text
    #
    # Result:
    #     List representing the failure event
    #
    proc failure {probability description} {
        if { $probability < 0.0 || $probability > 1.0 } {
            error "Probability must be a number between 0 and 1"
        }
        return [list FAILURE $probability $description]
    }

    # all-failing --
    #     Combine two or more failure events into a single event, which
    #     occurs if all individual events occur.
    #
    # Arguments:
    #     event           First failure event
    #     args            All other events
    #
    # Result:
    #     List representing the combined failure event
    #
    proc all-failing {event args} {
        return [concat ALL-FAILING [list $event] $args]
    }

    # at-least-one-failing --
    #     Combine two or more failure events into a single event, which
    #     occurs if at least one individual event occurs.
    #
    # Arguments:
    #     event           First failure event
    #     args            All other events
    #
    # Result:
    #     List representing the combined failure event
    #
    proc at-least-one-failing {event args} {
        return [concat ONE-FAILING [list $event] $args]
    }

    # failure-probability --
    #     Compute the chance of failure
    #
    # Arguments:
    #     event           Event to be examined
    #
    # Result:
    #     Probability that the event occurs
    #
    proc failure-probability {event} {
        switch -- [lindex $event 0] {
            "FAILURE" {
                return [lindex $event 1]
            }
            "ONE-FAILING" {
                set probability 1.0
                foreach e [lrange $event 1 end] {
                    set fail [failure-probability $e]
                    set probability [expr {$probability * (1.0-$fail)}]
                }
                set probability [expr {1.0-$probability}]
            }
            "ALL-FAILING" {
                set probability 1.0
                foreach e [lrange $event 1 end] {
                    set fail [failure-probability $e]
                    set probability [expr {$probability * $fail}]
                }
            }
            default {
                error "Unknown event type: [lindex $event 0]"
            }
        }
        return $probability
    }

    # most-likely-path --
    #     Compute the most likely path to failure
    #
    # Arguments:
    #     event           Event to be examined
    #
    # Result:
    #     List of events leading to the failure (type: ALL-FAILING)
    #
    proc most-likely-path {event} {
        return "Alas, this remains to be done"
    }

    # main --
    #     Test the procedures with the simple example above
    #
    set e1 [failure 0.2  "Miss direct train"]
    set e2 [failure 0.1  "Miss first train"]
    set e3 [failure 0.15 "First train too"]

    set e4 [at-least-one-failing $e2 $e3]
    set e5 [all-failing $e1 $e4]

    puts "Probability of being too late: [failure-probability $e5]"
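The most-likely-path procedure is left open above. One possible reading, offered purely as a sketch and not as the method from the book: at an OR node follow the child that is most likely to fail, and at an AND node keep every child, since all of them have to fail. The names prob and likely-path are hypothetical; prob is a compact restatement of failure-probability so that the fragment runs on its own, and the events are built directly as lists in the representation used above.

```tcl
# prob -- (hypothetical helper) failure probability of an event,
#         same rules as failure-probability, independence assumed
proc prob {event} {
    set children [lrange $event 1 end]
    switch -- [lindex $event 0] {
        FAILURE {
            return [lindex $event 1]
        }
        ONE-FAILING {
            set p 1.0
            foreach e $children { set p [expr {$p * (1.0 - [prob $e])}] }
            return [expr {1.0 - $p}]
        }
        ALL-FAILING {
            set p 1.0
            foreach e $children { set p [expr {$p * [prob $e]}] }
            return $p
        }
        default {
            error "Unknown event type: [lindex $event 0]"
        }
    }
}

# likely-path -- (hypothetical name) list the basic FAILURE events
#                on the most probable route to the given event
proc likely-path {event} {
    set children [lrange $event 1 end]
    switch -- [lindex $event 0] {
        FAILURE {
            return [list $event]
        }
        ONE-FAILING {
            # One failing child is enough: follow the most probable one
            set best  {}
            set bestp -1.0
            foreach e $children {
                set p [prob $e]
                if { $p > $bestp } { set bestp $p; set best $e }
            }
            return [likely-path $best]
        }
        ALL-FAILING {
            # Every child has to fail, so all of them are on the path
            set path {}
            foreach e $children { set path [concat $path [likely-path $e]] }
            return $path
        }
        default {
            error "Unknown event type: [lindex $event 0]"
        }
    }
}

# The train example, in the list representation used above
set e1 [list FAILURE 0.2  "Miss direct train"]
set e2 [list FAILURE 0.1  "Miss first train"]
set e3 [list FAILURE 0.15 "First train too"]
set e5 [list ALL-FAILING $e1 [list ONE-FAILING $e2 $e3]]

foreach e [likely-path $e5] {
    puts "[lindex $e 2] ([lindex $e 1])"
}
```

For the train example this lists "Miss direct train (0.2)" and "First train too (0.15)". Other definitions of "most likely path" are possible, for instance ranking complete combinations of basic events by their joint probability.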

(Note: another interesting concept I found in that book: Orthogonal arrays, useful for evaluating and optimising design parameters)