Therac-25

Ok, so let’s talk about mistakes. Yes, mistakes. Everybody makes mistakes. To err is human. We know that. Of course, computers do exactly what they are told, so a computer makes mistakes when it is designed incorrectly or given poor code or instructions. That much makes sense. The more important question is this: how can we reduce the harm that our mistakes cause?

Since I love math, here’s an equation. Let n equal the number of mistakes that are made, and let h equal the harm that each mistake causes when it happens. Then n*h = t, where t is the total harm done by a process in which mistakes are made. There are two ways to reduce t: we can reduce n, or we can reduce h. So how do we do that?

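We already know that people make mistakes, so the best way to reduce the number of mistakes that get through is redundancy. If a program is correct 90% of the time, a single run has 90% accuracy. But if the program is run 3 times in a row (ok, I know this assumes a lot, such as the runs failing independently and us being able to tell when a run got it right, but you get the concept), the chance that all three runs are wrong is only 0.1 * 0.1 * 0.1 = 0.1%, so 99.9% accuracy is achieved.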
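To make the arithmetic concrete, here is a minimal Python sketch. The function name `redundant_accuracy` is just something I made up for illustration, and it bakes in the big assumptions: every run fails independently, and a single correct run is enough to get the right answer.

```python
def redundant_accuracy(p_single: float, runs: int) -> float:
    """Chance that at least one of `runs` independent attempts is correct.

    Assumes each attempt succeeds with probability `p_single` and that
    the attempts fail independently of one another.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # All attempts must fail for the overall result to be wrong.
    return 1.0 - (1.0 - p_single) ** runs


if __name__ == "__main__":
    for runs in (1, 2, 3):
        print(f"{runs} run(s): {redundant_accuracy(0.9, runs):.4f}")
```

If the three runs instead had to agree by a 2-out-of-3 majority vote, the figure would be lower (97.2% for a 90% program), so the 99.9% number really does depend on how the redundancy is used.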

On the other hand, reducing the h value of a mistake is more difficult. Mistakes are often unpredictable or difficult to control. One example of lowering the h value of a program mistake is simulating a rocket launch before the launch actually happens. If the simulation exposes a mistake, the cost is a failed simulation, not a failed rocket launch.

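For real world applications with high h values, even 99.9% accuracy is not enough. In the business world, companies try to keep their processes “Six Sigma” accurate. Taken literally for a normally distributed process, that means only about 0.0000002% of outcomes (roughly 2 in a billion) fall more than six standard deviations from the mean, although Six Sigma as usually practiced quotes a looser target of 3.4 defects per million. For important things such as drinking water purity, even this number of mistakes may not be low enough!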
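The tail probability behind a “six sigma” claim can be checked directly. This is a small sketch that assumes the process output follows a normal distribution; the helper names `two_sided_tail` and `one_sided_tail` are my own, and the only library call is `math.erfc` from the Python standard library.

```python
import math


def two_sided_tail(k: float) -> float:
    """P(|Z| > k) for a standard normal Z: outcomes beyond +/- k sigma."""
    return math.erfc(k / math.sqrt(2.0))


def one_sided_tail(k: float) -> float:
    """P(Z > k) for a standard normal Z: outcomes beyond k sigma on one side."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))


if __name__ == "__main__":
    # Literal six sigma: fraction of outcomes outside +/- 6 standard deviations.
    print(f"beyond +/- 6 sigma: {two_sided_tail(6.0):.3e}")
    # The conventional Six Sigma figure allows a 1.5 sigma drift in the mean,
    # which leaves 4.5 sigma of margin on the nearer side.
    print(f"beyond 4.5 sigma (one side): {one_sided_tail(4.5):.3e}")
```

The first number comes out near 2 in a billion and the second near 3.4 in a million, which is where the two commonly quoted figures come from.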

Although computers can be very precise, they still make mistakes when they are programmed incorrectly. We’re learning (or learned) in Operating Systems Principles that many programs can behave differently from one run to the next, for example when the result depends on the order in which concurrent tasks happen to execute. If such a program is used in an important system such as a medical system, the results can be catastrophic.
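Here is a toy illustration of how the same code can give different answers depending on timing. It is not real threading: two Python generators stand in for two tasks, and a hand-written schedule plays the role of the operating system’s scheduler so that the result is repeatable. The names `increment` and `run` are made up for this sketch.

```python
def increment(shared):
    """One simulated task: read the counter, pause, then write back value + 1."""
    seen = shared["count"]       # read the shared value
    yield                        # the scheduler may switch tasks here
    shared["count"] = seen + 1   # write back, possibly using a stale read


def run(schedule: str) -> int:
    """Step tasks 'A' and 'B' in the order given and return the final count."""
    shared = {"count": 0}
    tasks = {"A": increment(shared), "B": increment(shared)}
    for name in schedule:
        # Advance the chosen task by one step; ignore tasks that have finished.
        next(tasks[name], None)
    return shared["count"]


if __name__ == "__main__":
    print("A finishes before B starts:", run("AABB"))
    print("A and B interleave:        ", run("ABAB"))
```

When the two tasks interleave, both read the counter before either writes, so one increment is lost. With real threads the schedule is chosen by the operating system, which is why a bug like this can show up on one run and not on the next.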

This leads us to the topic of discussion, the Therac-25. The Therac-25 was programmed entirely in assembly code. The article we read states that the coder for the Therac-25 likely didn’t have any “experience with real-time systems”. In other words, the programmer did not set up the code to deal with many things happening at the same time. The initial problem happened when an operator tried to input a new command while the first command was still being processed. Since the Therac-25’s h value for mistakes is incredibly high, the n value must be very low. Older versions of the Therac had mechanical failsafes that would protect a patient in the event of a computer malfunction, which lowered the program’s h value (since a mistake would lead to a caught error), but the designers of the Therac-25 decided to forgo some of these failsafes, likely because of confidence in the design.

As a software designer working on systems with such a high h value, I think that there should be lots of error checking and failsafes in the code itself. When you produce an application at a company, that company is liable for any mistakes that the product makes. Not only could mistakes hurt people, but they could also hurt your career if you are not careful.
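As a rough idea of what “error checking and failsafes in the code itself” could look like, here is a small sketch of a software check that refuses to act unless every condition is verified. Everything in it is hypothetical and made up for illustration (the mode names, the `MAX_DOSE_UNITS` limit, the `Request` and `SensorReading` types); it is not how the Therac-25 or any real medical device works, and it is no substitute for hardware failsafes.

```python
from __future__ import annotations

from dataclasses import dataclass

ALLOWED_MODES = {"low_power", "high_power"}  # hypothetical mode names
MAX_DOSE_UNITS = 200.0                       # hypothetical limit, illustration only


class InterlockError(Exception):
    """Raised when a request fails any safety check."""


@dataclass(frozen=True)
class Request:
    mode: str          # mode the operator asked for
    dose_units: float  # dose the operator asked for


@dataclass(frozen=True)
class SensorReading:
    mode: str    # mode the hardware reports it is actually in
    ready: bool  # whether the hardware reports it has finished setting up


def check_request(request: Request, reading: SensorReading) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if request.mode not in ALLOWED_MODES:
        problems.append(f"unknown mode {request.mode!r}")
    if not 0.0 < request.dose_units <= MAX_DOSE_UNITS:
        problems.append(f"dose {request.dose_units} outside allowed range")
    if not reading.ready:
        problems.append("hardware is still processing the previous command")
    if reading.mode != request.mode:
        problems.append("hardware mode does not match requested mode")
    return problems


def deliver(request: Request, reading: SensorReading) -> str:
    """Proceed only if every check passes; otherwise fail closed."""
    problems = check_request(request, reading)
    if problems:
        raise InterlockError("; ".join(problems))
    return f"delivering {request.dose_units} units in {request.mode} mode"
```

The design choice here is to fail closed: the default answer is “no”, and the code has to prove that the request and the measured state agree before anything happens. That lowers h, because a mistake elsewhere turns into a caught error instead of harm.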

 
