Welcome to the grand finale of the method overview – the REAL LIFE Root Cause Analysis example! This is the 5th post in this mini-series (here are the first, second, third and fourth), and my goal here is to summarise everything I’ve written about so far.
So let’s begin our Root Cause Analysis
For some of you, mentioning a failure caused by a clash between imperial and metric units probably rings a bell, and this is the exact situation where it took place. So fasten your seatbelts, and follow me. I based my analysis on these sources, which describe the situation quite comprehensively:
- Mars Climate Orbiter article on Wikipedia (of course 🙂 )
- Simscale.com blog – I found this hilarious picture over there!
Background – and what the hell is MCO?
In December 1998, when I was only 15 Y.O., NASA launched one of its many satellites – Mars Climate Orbiter (I will often refer to it as “MCO” or “satellite”) to study Martian atmosphere, climate changes and surface changes.
The Orbiter itself was quite expensive (125M USD; taking inflation into consideration, it would be around 200M USD today. I asked Google AI, Bard ;)), while the whole mission was worth around 327.5M USD (525M in 2023). A nice amount of money at stake, and a good reason to make sure everything is top-notch.
It was (supposed to be ;)) thoroughly prepared, as all (?) NASA missions are. And yet, almost 10 months later, NASA lost contact with the MCO and never regained it.
What actually happened? The Problem
As you remember, the first part of Root Cause Analysis is creating the Cause and Effect diagram. Doing it properly (effectively and efficiently) requires stating a correct problem description.
In our case it would be:
It is specific, time-based and self-explanatory. And in green, to represent the symptoms (the “leaves” of our Root Cause Analysis tree).
How did it happen? The Analysis
First level branches
So, how did it happen, that NASA lost its Mars Climate Orbiter? First of all – they had to send it over there 😉 so our first branch would be the fact, that MCO was sent to Mars. It will be our DESIRED CONDITION, so we do not have to analyse further. They did it, because they were supposed to do this. Period.
Second branch – the MCO travelled for 9.5 months and covered 669 million kilometers. This is again a desired condition, as it travelled according to the mission plan.
You may ask quite a valid question, why we state such obvious facts, as they are rather transparent for the analysis. There are two reasons:
- To make sure, that no potentially valuable information in our Root Cause Analysis is lost. Especially, that cost of having such branches is minimal
- To maintain sufficient level of detail for further storytelling. Our brains love context, and this is precisely the context – which is crucial for often loosely informed stakeholders
Third branch – in the last phase of the flight, just before orbital insertion, the NASA engineers, based on multiple flight parameters, computed and executed a command called TCM-4. TCM stands for Trajectory Correction Maneuver.
- It implies that there could have been three earlier TCMs – TCM-1, TCM-2 and TCM-3, however without known influence on further events. As I do not have the data, I did not include them here – but if I were doing a real-time RCA, I would definitely ask a few questions 😉
After the execution of TCM-4, it turned out that the MCO was heading towards a much lower orbit than the NASA engineers had planned. As a result, it (most probably) caused the destruction of the Orbiter.
- So as you can see – we have three things to analyse: TCM-4 (why it was executed), why the expected output was different from the actual one, and why the MCO was destroyed on a lower orbit.
And this first level of analysis will look like this:
Second level branches
After reading some documentation, we realise that TCM-4 was a regular maneuver, performed according to the mission plan before initiating the procedure to insert the MCO into orbit. Therefore we do not analyse it further – it is our desired condition.
So what about the remaining two branches?
My sources (Wikipedia 🙂 ) provided the information that, for a successful insertion into the target orbit, the MCO was supposed to reach an altitude of 226 kilometers above the surface of Mars. So it was another desired condition, and we finish this branch here.
However, as a result of TCM-4, the MCO ended up much, much lower – at an altitude of 57 kilometers. And it definitely desired condition was not 🙂
- So we can see the discrepancy between target and actual state – and it is something, that Root Cause Analysis Driver should definitely look into.
The last, fifth main branch considers why the lower altitude (most probably) caused the destruction of the MCO.
First of all, the lower parts of the Martian atmosphere are significantly denser than the higher ones. It is a law of nature that NASA was aware of – therefore it is a DESIRED condition.
Second of all, the MCO was supposed to start orbit insertion from an altitude of 226 kilometers. Then, during a slow descent that would have taken weeks, it was supposed to reach the optimal orbit for the planned scientific research. Due to the density of the Martian atmosphere and the technicalities of the MCO construction, the minimum altitude for safely starting orbital insertion was 80 kilometers. So this is another desired condition.
However, the MCO, as we remember from a few paragraphs above, ended up at only 57 kilometers, so almost 30% lower than the “life” line. That’s a lot.
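To put these altitudes side by side, here is a small sketch of the arithmetic. It uses only the three figures quoted in this article (226 km, 80 km, 57 km); the constant and function names are my own, made up for illustration.

```python
# Altitudes in kilometers, as quoted in this article.
TARGET_KM = 226    # planned altitude for starting orbital insertion
SAFE_MIN_KM = 80   # lowest altitude considered safe (the "life" line)
ACTUAL_KM = 57     # altitude the MCO actually ended up at


def percent_below(reference: float, value: float) -> float:
    """Return how far `value` lies below `reference`, as a percentage of `reference`."""
    return (reference - value) / reference * 100


# How far below the "life" line the Orbiter was.
below_safe_pct = percent_below(SAFE_MIN_KM, ACTUAL_KM)

# How many times lower than planned the Orbiter was.
target_ratio = TARGET_KM / ACTUAL_KM

print(f"Below the safe minimum by {below_safe_pct:.2f}%")
print(f"Planned altitude is {target_ratio:.2f}x the actual one")
```

The first number lands just under 29%, which is where "almost 30% lower" comes from, and the second one is just under 4.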
So here we have a situation where the same element appears in two separate branches. It is normal and sometimes happens – so we perform the analysis just once and connect it for readability 😉
Third level branches
Why did the orbiter end up 4x lower than it was supposed to be? Let’s find out.
The aforementioned TCM-4, in (a bit more) detail, meant applying a certain impulse, calculated from a few variables (the Orbiter position, its current speed, etc.). Again – it is a desired condition.
The expected and actual positions of the MCO after executing TCM-4 were significantly different. And this is something to be analyzed further.
Moreover, as we read on Wikipedia, this abnormal trajectory discrepancy was observed during the process by at least two Navigators. And it is a desired condition. Somebody noticed that something was not going as planned.
However, the concerns were dismissed, due to the fact that they did not fill in the required form to document them (!), and in the end the trajectory was not corrected. The last sentence is so striking that I, as an RCA Driver, might suggest performing another Root Cause Analysis to analyse the procedures. As you might remember – this is one of the possible endings of a branch, more than applicable in this situation. Of course, if there is enough time – it could be included in this Root Cause Analysis.
We see this part of the analysis below.
Fourth level branches
So as you have probably realised – we are getting deeper and deeper, closing all the branches on the way. Our analysis thread now continues around the fact that the expected position of the MCO was much different from the actual one, as a result of TCM-4 (Trajectory Correction Maneuver-4).
Analysis showed that the computation the NASA engineers performed was correct, based on the data they possessed. So no error here. Desired condition.
However, the force applied by the MCO thrusters as a result was incorrect; it was actually a multiple of the intended one. So as you can see here – we have found something strongly suspicious.
Fifth level branches. What are these thrusters all about?
First of all, thrusters is a technical term for small rocket engines used to stabilise the Orbiter and change its trajectory. And to work (properly or not 😉 ) – they need dedicated control software. And again – this is another desired condition, as they are designed for this purpose.
The software controlling the thrusters was combined from two separate parts. Each part was developed by a different organisation: one by Lockheed Martin, the other by NASA. We do not know the details of this situation – maybe it was a result of a specific contract, etc. Although it might seem counter-intuitive, such a split is quite common, e.g. in layered software architecture. So it may work 😉 and for the sake of our example, because it was not a surprise, but purposeful – it is another desired condition.
Yet, there were two problems.
- One of the parts of the software controlling thrusters contained a major fault.
- This fault was not found during mission preparation, nor during the MCO flight itself.
So as you can see – we have something really tangible here, and our objective is to analyze it properly.
Imperial, metric? Who cares? Really, who should care?
Sixth level branches
Here are some facts (that I learned reading about this case). And here is another tip – as an RCA driver/coach I do not have to be technical expert. I “just” need to be able to understand nuances and explanations provided by the RCA team. This is exactly what I did here 🙂
- There are two ways of expressing Impulse (pol. “popęd”), which is a change in momentum of the object (pol. “pęd”), and as this is the law of physics – it is desired condition:
- Metric – Newton-seconds (N * s OR kg * m/s)
- Imperial – pound-force seconds (lbf * s)
- There is a document called SIS – Software Interface Specification. It requires all input and output values to be expressed in METRIC units. It is another desired condition – as there is a specific guideline that is supposed to prevent misunderstandings, faults, etc. This is the PREVENTION to mitigate the risk.
- NASA and Lockheed Martin are American organisations. And probably (RCA Coach assumption here (!) for the understandability purposes of this article) due to that, both used Imperial units in the working version(s) of the software.
- Here we can stop and reflect a little. We could use two approaches and analyse it from two standpoints. First – as an undesirable behaviour that might increase the probability of an error. Or, as an industry standard, to speed up development (as the imperial units are more familiar to Americans).
- If it was transparent to everyone, and there was additionally the SIS document standardising the eventual outcome – we can call it a desired condition.
- However (yes, this is the moment for HOWEVER :)), one of the companies (Lockheed Martin to be precise) DID NOT convert the units from Imperial to Metric in its software.
- So we have found our first Root Cause. Of course, as a NASA employee I would definitely ask for more explanation, but you see where I am heading to.
- Had Lockheed Martin converted the units – we can assume, that MCO would have reached the desired orbit according to the plan.
- The NASA software expected its input in metric units (according to the SIS). But the input it actually received came from the Lockheed Martin software and was expressed in Imperial units. And this is, ladies and gentlemen – the root cause.
- The effect was quite simple. E.g. let’s say the Lockheed Martin software computed that, to reach the desired orbit, an impulse of 100 pound-force seconds should be applied. So the value handed over was simply 100.
- The NASA software read this value as newton-seconds. But one pound-force second is ~4.45x bigger than one newton-second, so 100 pound-force seconds is really ~445 newton-seconds. The impulse assumed in the calculations and the impulse actually acting on the Orbiter differed by a factor of ~4.45. So instead of lightly pressing the accelerator pedal – they stepped on it to the floor.
- Finally – nobody caught this discrepancy through any kind of verification and validation. Which is obviously another Root Cause. OK, it is kind of a shortcut for this article – we should go deeper, and check:
- What kind of testing procedures are there on the NASA and Lockheed Martin sides?
- Are such situations covered by test cases?
- Were the relevant test cases executed?
- What was the result of these test cases execution?
- And so on
- So the real REAL Root Cause(s) would be at least one answer to these questions
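The unit mix-up from the bullets above can be sketched in a few lines. This is only an illustration under my own assumptions: the two functions are hypothetical stand-ins for the supplier and receiver sides, not the real flight or ground software, and the value 100 is the made-up number from this article. The only hard fact in it is the conversion factor (1 lbf·s = 4.4482216152605 N·s).

```python
import math

# 1 lbf = 4.4482216152605 N by definition, so 1 lbf*s = 4.4482216152605 N*s.
LBF_S_TO_N_S = 4.4482216152605


def lbf_s_to_n_s(impulse_lbf_s: float) -> float:
    """Convert an impulse from pound-force seconds to newton-seconds."""
    return impulse_lbf_s * LBF_S_TO_N_S


def supplier_output() -> float:
    """Hypothetical supplier side: returns a bare number, silently in lbf*s."""
    return 100.0


def receiver_reads(value: float) -> float:
    """Hypothetical receiver side: trusts the bare number to be in N*s."""
    return value


raw = supplier_output()
assumed_n_s = receiver_reads(raw)    # what the receiving side believes it got
actual_n_s = lbf_s_to_n_s(raw)       # what the number physically means
error_factor = actual_n_s / assumed_n_s


def interface_is_consistent(rel_tol: float = 1e-6) -> bool:
    """A check a test case could run: both sides must mean the same impulse."""
    return math.isclose(assumed_n_s, actual_n_s, rel_tol=rel_tol)


print(f"Assumed: {assumed_n_s:.1f} N*s, actual: {actual_n_s:.1f} N*s")
print(f"Off by a factor of {error_factor:.2f}")
print(f"Interface consistent? {interface_is_consistent()}")
```

Note that nothing crashes here. Both sides happily exchange the number 100, which is exactly why such a fault needs a dedicated test case comparing the physical meaning on both sides of the interface.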
And as a closing of the last branch remaining, why the fault remained unnoticed/unfixed during preparations and almost whole flight – we see two already mentioned Root Causes.
- The Root Cause regarding the fault going unnoticed during testing
- The Root Cause regarding dismissing the observed abnormalities because the correct form was not filled in
Therefore this branch looks like this:
Final shape of the tree – overview
As you can see – all the branches are consistent, coherent, and finished with the right endings. There were multiple alternative endings: from desired conditions, through an advised separate Root Cause Analysis, to a set of different Root Causes.
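If you like to think in data structures, the "every branch has a proper ending" rule can be checked mechanically. Below is a minimal sketch under my own assumptions: the class names, the three ending types and the heavily shortened tree are illustrative only, and are not part of any official RCA tooling.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Ending(Enum):
    """Allowed ways to close a branch, as used in this article."""
    DESIRED_CONDITION = "desired condition"
    ROOT_CAUSE = "root cause"
    SEPARATE_RCA = "another RCA advised"


@dataclass
class Node:
    statement: str
    ending: Optional[Ending] = None                    # set only on closed leaves
    causes: List["Node"] = field(default_factory=list)  # deeper branches


def unfinished_leaves(node: Node) -> List[str]:
    """Collect statements of leaves that were left without any ending."""
    if not node.causes:
        return [] if node.ending is not None else [node.statement]
    found: List[str] = []
    for cause in node.causes:
        found.extend(unfinished_leaves(cause))
    return found


# A heavily shortened version of the MCO tree.
tree = Node("NASA lost contact with the MCO", causes=[
    Node("MCO was sent to Mars", Ending.DESIRED_CONDITION),
    Node("TCM-4 was executed", Ending.DESIRED_CONDITION),
    Node("MCO ended up at 57 km instead of 226 km", causes=[
        Node("Units were not converted to metric", Ending.ROOT_CAUSE),
        Node("Fault was not found during testing", Ending.ROOT_CAUSE),
        Node("Navigators' concerns were dismissed", Ending.SEPARATE_RCA),
    ]),
])

print(unfinished_leaves(tree))
```

An empty list means every branch is closed. Any statement that shows up in the result is a branch somebody still has to analyse.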
As you see (and I stressed in one of previous posts) – there is always more than one Root Cause.
Are these legitimate Root Causes? Yes, they are. How can I be so sure? When we reframe the situation and imagine what would have happened if a Root Cause had been “defused”, we can see that the MCO crash would not have happened.
- IF the observed incorrect trajectory had been investigated and acted upon, they would have found the discrepancy between the Impulse computed and the Impulse applied, OR
- IF the testing procedures had included comprehensive verification of all vital parameters and their alignment with the SIS document, they would have found the error, OR
- IF the NASA software part had done some kind of validation that the supplier provided input according to the SIS document, they would have found the error, OR
- IF the Lockheed Martin engineers/procedures had checked and converted the units according to the SIS document, they would have found the error.
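Two of the IFs above (validation on the receiving side, conversion on the supplying side) are easy to picture in code. Here is a minimal sketch, assuming that a value travels together with its unit label. The `Impulse` type, the unit strings and both functions are hypothetical; I do not know what the real SIS interface looked like.

```python
from dataclasses import dataclass

LBF_S_TO_N_S = 4.4482216152605  # 1 lbf*s expressed in N*s


class UnitError(ValueError):
    """Raised when a value does not follow the agreed interface units."""


@dataclass(frozen=True)
class Impulse:
    value: float
    unit: str  # "N*s" or "lbf*s" (labels assumed for this sketch)


def to_metric(impulse: Impulse) -> Impulse:
    """Supplying side: convert to N*s before handing the value over."""
    if impulse.unit == "N*s":
        return impulse
    if impulse.unit == "lbf*s":
        return Impulse(impulse.value * LBF_S_TO_N_S, "N*s")
    raise UnitError(f"unknown unit {impulse.unit!r}")


def accept_impulse(impulse: Impulse) -> float:
    """Receiving side: accept metric input only and reject anything else loudly."""
    if impulse.unit != "N*s":
        raise UnitError(f"expected N*s, got {impulse.unit!r}")
    return impulse.value


computed = Impulse(100.0, "lbf*s")

# Without conversion, the receiving side refuses the value.
try:
    accept_impulse(computed)
except UnitError as error:
    print(f"Rejected: {error}")

# With conversion, the value passes and keeps its physical meaning.
print(f"Accepted: {accept_impulse(to_metric(computed)):.1f} N*s")
```

Either of the two functions alone would have been enough here: the conversion prevents the fault, and the validation turns a silent fault into a loud one.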
A lot of IFs, and a lot of potential ways of dodging the bullet, but none of them was used. And as a result of this Root Cause Analysis, we get this tree.
So this is it. I hope you enjoyed the series, and that you will consider using the method in your organisations. As I said – it is a tool, very handy in multiple situations, but it has its, maybe not flaws, but necessary prerequisites.
Thanks for reading, and till the next one 🙂
Best Regards,
Marcin