Poor Engineering Decision-Making

I am a disaster-documentary fanatic, not because I am ghoulish and delight in tragedy, but because I consider myself an engineering sort of person and thus desire to learn all I can from others’ work. I chew up episodes of Mayday/Air Disasters, for example. I am also sucked in by shows such as Seconds from Disaster (no longer in production) and recently came across a series on Discovery, When Big Things Go Wrong. In Season 1, Episode 3 of that series, one of the linked investigations concerned the derailment of a high-speed Deutsche Bahn trainset at Eschede, Germany on June 3, 1998, which led to tremendous loss of life and limb.

Eschede train disaster. This image was taken with a small hand-held camera and shows the severe destruction of the rear passenger cars that were pushed into each other and into a road bridge. (Nils Fretwurst, Wikimedia Commons)

The story of the Eschede derailment (see https://en.wikipedia.org/wiki/Eschede_derailment) holds a number of lessons that engineers need to heed. The one key lesson that I want to extract and apply to us software engineers is that we must be very careful about repurposing seemingly usable components from other use-cases, no matter how similar those use-cases may appear to our own. The entire scenario that unfolded into the worst high-speed rail disaster in the world was triggered by a single fatigue crack in a single wheel which, with perfect 20/20 engineering hindsight, should never have been used on that specific train!

When the ICE 1 high-speed trains were originally designed, they were equipped with standard solid wheels. Unfortunately, at least within the scope of our story, these solid single-cast wheels could develop out-of-round and other conditions that disturbed their perfect balance at speed, leading to noticeable vibrations creeping into the passenger areas. Engineers were tasked with solving this problem.

The solution seemed to lie in replacing the solid wheels with wheels that isolated the rolling surface from the hub using a rubber ring. This was not, in itself, a new design: these “resilient wheels” were already in use by several tram systems worldwide, specifically to reduce noise and vibration. The design is sometimes referred to as a “steel tire” because a steel band is forced over a thick rubber band attached to the hub center, as can be seen in the diagram below:

Sample modern resilient wheel design and components (image by Group Lucchini)

It is important to understand that this design was well tested within the domain in which it was being used, that is, for trams and other low-speed applications.

So, at the risk of oversimplifying the thought process, the engineers ultimately decided that resilient wheels designed for trams and other low-speed rolling stock should fit the need for a quiet wheel on the ICE 1 trains. This is the danger of following similar use-cases across distinct domains: it is extremely important to ensure that such decisions really are valid, because similar is not the same as identical. In our software world, we face these sorts of decisions all the time. There is a plethora of frameworks and a seemingly infinite number of open-source and boxed solutions constantly vying for our attention. These solutions present dozens of use-case scenarios that seem to match our specific design needs and are absolutely seductive choices to help us eliminate tons of research time and work.
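As a concrete, if simplified, software parallel: before adopting a component that merely looks like a fit, it is worth writing a small qualification harness that exercises it under our own domain’s conditions rather than the vendor’s advertised ones. The sketch below is purely hypothetical; CandidateCache, the payload sizes, and the latency budget are stand-ins for whatever your domain actually demands, not a real library or a recommendation:

# Hypothetical qualification harness: exercise a candidate component under OUR
# conditions (data volume, payload size, latency budget), not the vendor's demo.
import random
import time

class CandidateCache:
    """Stand-in for a third-party component we are evaluating."""
    def __init__(self):
        self._store = {}

    def put(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)

def qualify(cache, n_items=100_000, payload_bytes=4096, latency_budget_s=0.000_05):
    """Return True only if the candidate meets OUR use-case, not a similar one."""
    payload = b"x" * payload_bytes
    for i in range(n_items):
        cache.put(i, payload)
    # Measure worst-case read latency across a random sample of keys.
    worst = 0.0
    for key in random.sample(range(n_items), 1_000):
        start = time.perf_counter()
        assert cache.get(key) is not None
        worst = max(worst, time.perf_counter() - start)
    return worst <= latency_budget_s

if __name__ == "__main__":
    print("fit for our domain?", qualify(CandidateCache()))

The point is not the specific numbers; it is that the numbers come from our use-case, not from someone else’s.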

The warning I hope we take from this is that these decisions may have safety, performance, and development consequences that have to be fully vetted, with the risks understood and weighed. This leads us into the rest of the sad Eschede story. Once the engineers had reasoned that these resilient wheels should meet their requirements for quiet operation, they pushed forward with designing them for the ICE 1 application.

One of the great mistakes they made was not consulting other SMEs (Subject Matter Experts) in the industry. Japan had been working with high-speed trains since the early 1960s, and France had kicked off its TGV service in 1981. Other surrounding countries were also working on high-speed options. Thus, there was a considerable pool of talent and prior art to tap for information. Another mistake they made was one that runs rampant in our software industry: a lack of proper and complete testing of the design. There was a significant lack of full high-speed testing of the new design to understand its limits and possible problem areas, largely because no German facility existed that could physically test the design at speeds approaching its operational limits. Thus, testing was mostly conducted at low speeds, and the engineers made qualified assumptions based on theory and their understanding of the materials and stresses involved.
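The software parallel is testing only at “low speed”: exercising a component at comfortable, nominal inputs and extrapolating to its rated limits from theory. A minimal sketch, assuming an entirely hypothetical component with a declared operating envelope, might make the boundary values (and the region just beyond them) first-class test cases:

# Hypothetical example: test at the declared operational limits, not just at
# the nominal values where everything is comfortable.
import pytest

MAX_RATED_SPEED_KMH = 280  # the declared envelope for our imaginary controller

def commanded_brake_force(speed_kmh):
    """Stand-in for the component under test."""
    if speed_kmh < 0 or speed_kmh > MAX_RATED_SPEED_KMH:
        raise ValueError("outside rated envelope")
    return min(1.0, speed_kmh / MAX_RATED_SPEED_KMH)

@pytest.mark.parametrize("speed", [0, 1, 140, 279, MAX_RATED_SPEED_KMH])
def test_within_envelope(speed):
    # Boundary values are deliberate test cases, not an afterthought.
    assert 0.0 <= commanded_brake_force(speed) <= 1.0

@pytest.mark.parametrize("speed", [-1, MAX_RATED_SPEED_KMH + 1])
def test_beyond_envelope_fails_loudly(speed):
    # Beyond the envelope we want a loud, immediate failure, not a quiet guess.
    with pytest.raises(ValueError):
        commanded_brake_force(speed)

Where a full-scale test rig does not exist, we at least owe ourselves an explicit record of which limits were verified and which were merely assumed.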

The new wheels accomplished their immediate goal of eliminating the dreaded vibrations, so, as we all know, if it works, we can ship it! Another notable point is that the new wheels worked for many years without any failures. In our parallel universe, how often are we software architects and developers lulled into a false sense of safety by years of flawless performance? We must always expect that a confluence of issues can combine into a catastrophic, unforeseen failure.
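One way to resist that lull, sketched hypothetically below, is to keep re-asserting the original design assumptions in production rather than treating years of quiet operation as proof. Every name here (check_wear, the thresholds, the metric itself) is an illustrative stand-in:

# Hypothetical sketch: keep re-checking a design assumption in production
# instead of trusting years of quiet operation.
import logging

logger = logging.getLogger("invariant-watch")

FAILURE_LIMIT = 1.0      # the point at which the component actually breaks
WARN_THRESHOLD = 0.6     # deliberately conservative: act long before failure

def check_wear(measured_wear: float) -> None:
    """Raise loudly past the limit; warn loudly well before it."""
    if measured_wear >= FAILURE_LIMIT:
        raise RuntimeError(f"invariant violated: wear={measured_wear:.2f}")
    if measured_wear >= WARN_THRESHOLD:
        # Years without a warning from this line are not proof the limit moved.
        logger.warning("wear %.2f approaching limit %.2f", measured_wear, FAILURE_LIMIT)

# Example: call this from whatever periodic job already samples the metric.
check_wear(0.35)   # silent: comfortably inside the envelope
check_wear(0.72)   # warns: still safe, but the assumption is eroding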

The final mistake worth highlighting here is that of management and engineers not keeping a finger on the pulse of the industry. Üstra, the company that operated the low-speed tram services in Hanover, discovered that its resilient wheels were developing stress fractures while running at the trams’ maximum speed of about 14–15 mph. It reported these findings to the industry at large and to DB directly. Had something been done about this finding immediately, it is probable that the fatal event would never have occurred. The recourse, admittedly an expensive one, would have been to pull the wheels from the ICE 1 trains, remove the steel tires, and properly inspect them for stress and fatigue fractures. We all face this kind of decision based on design choices we may have made years before. Either we own the consequences of those choices, or we sweep them under the rug and blame others. As engineers, we need to be as sure of our original design choices as possible and understand that the day may come when we need to redesign or rebuild some part of that design as new information comes to light.

Well, the rest of the story is documented in reports and documentaries. About a year after Üstra warned DB about the problems with its wheels, one of the steel tires on the first car of ICE 1 trainset 51 failed catastrophically. It penetrated the floor of the car and dangled down until it snagged a guide rail at a switching point and pulled it away from the rails. That guide rail projected upward, also penetrated the floor of the car, and lifted the bogie containing the affected axles, which ended up on the parallel track. The train’s path was thrown into disarray, and the cars were violently flung off the rails at more than 120 mph. Many travelers lost their lives in the ensuing carnage.

We must be extremely deliberate in our design decisions. Software is not unlike hardware such as these wheels: it has tremendous power to set off chain reactions that may lead to loss of life. How do we know that the code we produce today will not one day drive a self-driving vehicle, control safety equipment, support medical applications, keep aircraft in the air, or perform some other critical function? It is easy to point fingers at those who participated in the decisions that led to Eschede, but are we fundamentally different from them in our own engineering capacities? Food for thought.
