## Fault trees

Arjen Markus (18 April 2008) I got interested in risk analysis while reading a book about it (*). One of the techniques used to analyse risks is the fault tree. The idea is to identify possible failures in a system and how they lead to other failures and finally to a failure of the entire system. Each individual failure or fault is assigned a probability, and by combining them into a tree you can analyse the chance that the system fails, or the most likely path to failure.

(*) Wang and Roush: What every engineer should know about risk engineering and management

To make this more concrete, consider a very simple example:

I want to go from A to B by train. I can take two routes: a direct one and one that requires two successive train rides. The drawback of the first is that I am more likely to miss it (tight schedule, I leave home too late ;)), so the alternative is the slightly longer route with two train rides. However, that route fails if I miss the first train as well, or if that train arrives too late at the intermediate station.

So the fault tree looks like this:

```
                fail to arrive at B
                    in time (?)
                         |
        +---------------AND----------------+
        |                                  |
fail to catch the
direct train (0.2)               +-------- OR -----------+
                                 |                       |
                     first train arrives         fail to catch the
                     too late at intermediate    first train (0.1)
                     station (0.15)
```

(Note the AND and OR notations - there is no inherent ordering in the events)

The number in parentheses is the probability that this event occurs.

In the above tree, the event "fail to arrive at B in time" occurs only if both branches fail:

```
    P(fail to arrive at B in time) = P(fail to catch the direct train) *
                                     P(fail to catch either of the two others)
```

The failure represented by the right branch occurs if either I fail to catch the first train or the first train arrives too late. So, the chance of me arriving in time via this route is:

```
    P(arrive via two trains) = P(catch first train) * P(first train arrives in time)
                             = (1-0.1) * (1-0.15) = 0.9 * 0.85 = 0.765
```

So the chance of me failing to catch either of the two trains in the alternative route is:

```
    P(fail to catch either of the two) = 1 - P(arrive via two trains)
                                       = 0.235
```

Multiplying the probabilities of the two branches then gives the overall chance of failure:

`   P(fail to arrive at B in time) = 0.2 * 0.235 = 0.047`
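As a cross-check of the hand calculation, the eight possible combinations of the three events can simply be enumerated. This is a minimal sketch that assumes the three failures are independent, which is the same assumption the multiplication rules above rely on. The variable names are made up for this illustration.

```tcl
# Probabilities from the example; independence of the events is assumed
set pDirect 0.2    ;# miss the direct train
set pFirst  0.1    ;# miss the first train of the alternative route
set pLate   0.15   ;# first train arrives too late at the intermediate station

set total 0.0
foreach direct {0 1} {
    foreach first {0 1} {
        foreach late {0 1} {
            # Probability of this particular combination of outcomes
            set p [expr {($direct ? $pDirect : 1.0-$pDirect) *
                         ($first  ? $pFirst  : 1.0-$pFirst)  *
                         ($late   ? $pLate   : 1.0-$pLate)}]

            # Top event: direct route fails AND (first missed OR first late)
            if { $direct && ($first || $late) } {
                set total [expr {$total + $p}]
            }
        }
    }
}

puts [format "P(fail to arrive at B in time) = %.3f" $total]
# prints: P(fail to arrive at B in time) = 0.047
```

The enumeration grows as 2^n with the number of basic events, so it is only useful as a sanity check for small trees.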

The most likely failing path is the one on the right-hand side: 0.235 instead of 0.2

Now let us try and implement this technique in Tcl:

```
# failure --
#     Construct a new failure event
#
# Arguments:
#     probability       Probability the failure occurs
#     description       Descriptive text
#
# Result:
#     List representing the failure event
#
proc failure {probability description} {

    if { $probability < 0.0 || $probability > 1.0 } {
        error "Probability must be a number between 0 and 1"
    }

    return [list FAILURE $probability $description]
}

# all-failing --
#     Combine two or more failure events into a single event, which
#     occurs if all individual events occur.
#
# Arguments:
#     event             First failure event
#     args              All other events
#
# Result:
#     List representing the combined failure event
#
proc all-failing {event args} {

    return [concat ALL-FAILING [list $event] $args]
}

# at-least-one-failing --
#     Combine two or more failure events into a single event, which
#     occurs if at least one individual event occurs.
#
# Arguments:
#     event             First failure event
#     args              All other events
#
# Result:
#     List representing the combined failure event
#
proc at-least-one-failing {event args} {

    return [concat ONE-FAILING [list $event] $args]
}

# failure-probability --
#     Compute the chance of failure
#
# Arguments:
#     event             Event to be examined
#
# Result:
#     Probability that the event occurs
#
proc failure-probability {event} {

    switch -- [lindex $event 0] {
        "FAILURE"  { return [lindex $event 1] }
        "ONE-FAILING"  {
            set probability 1.0
            foreach e [lrange $event 1 end] {
                set fail [failure-probability $e]
                set probability [expr {$probability * (1.0-$fail)}]
            }
            set probability [expr {1.0-$probability}]
        }
        "ALL-FAILING"  {
            set probability 1.0
            foreach e [lrange $event 1 end] {
                set fail [failure-probability $e]
                set probability [expr {$probability * $fail}]
            }
        }
        default {
            error "Unknown event type: [lindex $event 0]"
        }
    }
    return $probability
}

# most-likely-path --
#     Compute the most likely path to failure
#
# Arguments:
#     event             Event to be examined
#
# Result:
#     List of events leading to the failure (type: ALL-FAILING)
#
proc most-likely-path {event} {

    return "Alas, this remains to be done"

}

# main --
#     Test the procedures with the simple example above
#
set e1 [failure 0.2 "Miss direct train"]
set e2 [failure 0.1 "Miss first train"]
set e3 [failure 0.15 "First train too late"]

set e4 [at-least-one-failing $e2 $e3]
set e5 [all-failing $e1 $e4]

puts "Probability of being too late: [failure-probability $e5]"
```
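The `most-likely-path` procedure is left as a stub above. One possible way to fill it in, offered only as a sketch, is to look for the most likely combination of basic failures that triggers the top event: at an ALL-FAILING node every child has to fail, so the children's results are merged, and at a ONE-FAILING node only the most likely child is kept. This is my reading of the stub's "type: ALL-FAILING" result, and it differs from simply comparing the two branches under the AND node as the text does. It assumes independent basic events that each appear only once in the tree, and Tcl 8.5 or later (`lassign`, `{*}`). The helper `likeliest-cut-set` is a hypothetical name, and the constructors are repeated in shortened form so that the block runs on its own.

```tcl
# Shortened copies of the constructors from the listing above
proc failure {probability description} {
    return [list FAILURE $probability $description]
}
proc all-failing {event args} {
    return [concat ALL-FAILING [list $event] $args]
}
proc at-least-one-failing {event args} {
    return [concat ONE-FAILING [list $event] $args]
}

# likeliest-cut-set --  (hypothetical helper, not part of the original page)
#     Return a two-element list: the probability of the most likely
#     combination of basic failures that triggers the event, and the
#     list of those basic FAILURE events.
#     Assumes independent basic events, each occurring once in the tree.
#
proc likeliest-cut-set {event} {
    switch -- [lindex $event 0] {
        "FAILURE" {
            return [list [lindex $event 1] [list $event]]
        }
        "ALL-FAILING" {
            # Every child must fail: multiply probabilities, merge the lists
            set probability 1.0
            set failures    {}
            foreach e [lrange $event 1 end] {
                lassign [likeliest-cut-set $e] p basics
                set probability [expr {$probability * $p}]
                lappend failures {*}$basics
            }
            return [list $probability $failures]
        }
        "ONE-FAILING" {
            # One failing child is enough: keep the most likely one
            set best [list -1.0 {}]
            foreach e [lrange $event 1 end] {
                set candidate [likeliest-cut-set $e]
                if { [lindex $candidate 0] > [lindex $best 0] } {
                    set best $candidate
                }
            }
            return $best
        }
        default {
            error "Unknown event type: [lindex $event 0]"
        }
    }
}

# most-likely-path --
#     Wrap the most likely combination in a single ALL-FAILING event
#
proc most-likely-path {event} {
    lassign [likeliest-cut-set $event] probability failures
    return [all-failing {*}$failures]
}

set e1 [failure 0.2  "Miss direct train"]
set e2 [failure 0.1  "Miss first train"]
set e3 [failure 0.15 "First train too late"]
set e5 [all-failing $e1 [at-least-one-failing $e2 $e3]]

lassign [likeliest-cut-set $e5] probability failures
puts "Most likely combination of failures:"
foreach f $failures {
    puts "    [lindex $f 2] ([lindex $f 1])"
}
```

For the train example this selects "Miss direct train" together with "First train too late", with a joint probability of about 0.2 * 0.15 = 0.03, because a late first train (0.15) is more likely than missing it (0.1). If basic events were shared between branches, the greedy choice at each ONE-FAILING node would no longer be guaranteed to give the best combination.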

(Note: another interesting concept I found in that book: Orthogonal arrays, useful for evaluating and optimising design parameters)