The SDLC "knowledge gap" in motion, DevOps to the rescue?

 

Reading my Twitter stream over coffee this morning, I saw Nick Galbreath post the Nanex Research analysis of the Knight Capital trading platform disaster, which you can read about in more detail at the links I've provided in the appendix below.  This is conjecture and hypothesis, but it appears that a piece of code meant to stay in the lab made it out into production.  We can see hints of why in Nanex's analysis of the issue:

 

"When the time comes to deploy the new market making software, which is likely handled by a different group, the Tester is accidentally included in the release package and started on NYSE's live system."

 

That makes sense.  How many organizations have this same set-up for deploying software?  The developers in one group and the packaging/deployment (otherwise known as Operations) people in another?  Thinking back to my days at GE Power Systems this was rigidly enforced.  Developers, who were off-shore resources more often than not, would work on code as it was spec'd out to them, and then FTP (or otherwise transfer) the completed software bits to "our side" where someone would package it up and put it on a server to test.  Testing included, eventually, security testing.

 

I can't tell you the fun things we found in this pre-production environment when we started digging around during security testing.  No, really, I can't tell you, but rest assured it didn't end with misconfigurations or accidentally included bits of code.  Once, we found a few files from another piece of software that was not designated for our environment ... maybe one day I'll be able to tell that story.

 

Anyway ... the moral of the story is that there were these separate groups that designed, built, packaged, tested, deployed, and then monitored these applications - and we often found ourselves in similar situations as Knight except without the multi-billion-dollar holding problem.  As you can guess, security issues were not scarce, and neither were configuration bugs, developer-included 'whoops' comments and pieces of test harnesses that should never make it to production.  But they did.
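To make the "test harness in the release package" failure concrete, here is a minimal sketch of a pre-deployment check that a packaging step could run against a release manifest.  It is illustrative only: the pattern list, the file names, and the function names are all hypothetical, and nothing here reflects how Knight or anyone else actually packages software.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for artifacts that should stay in the lab.
# A real team would maintain its own list.
LAB_ONLY_PATTERNS = ["tests/*", "*_test.py", "*test_harness*", "*.debug.cfg"]


def find_lab_only_files(package_files, patterns=LAB_ONLY_PATTERNS):
    """Return the manifest entries that match any lab-only pattern."""
    return [
        path
        for path in package_files
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


def check_release(package_files):
    """Raise ValueError if the package contains lab-only files, else return True."""
    offenders = find_lab_only_files(package_files)
    if offenders:
        raise ValueError(f"lab-only files in release package: {offenders}")
    return True


if __name__ == "__main__":
    # Hypothetical manifest with one file that should never ship.
    manifest = ["bin/market_maker.py", "bin/test_harness.py", "conf/prod.cfg"]
    print(find_lab_only_files(manifest))
```

A check like this only helps if the people who know which files are lab-only (the developers) and the people who run the packaging step (operations) maintain the pattern list together, which is the continuity argument in a nutshell.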

 

As I think about how a DevOps tribe could function differently, it becomes increasingly clear that disasters like this one could be made less likely if only we had continuity in the SDLC.  There's the key: continuity.

 

Any process that lacks continuity is doomed to stumble at some point.  The costs that we accrue go into technical debt discussions, but ultimately the piper comes calling.  This Knight Capital incident may be an interesting case study in where DevOps could decrease the likelihood of failures like this one - or maybe not, since we would need a similar organization doing DevOps to compare against.  I simply believe, and more so with every incident like this, that rigid processes with a knowledge gap at the handoff between groups are the problem, not a solution.  This is one of the primary reasons I'm so fervently behind DevOps - the knowledge gap has less chance to rear its head.

 

Knight Capital Links:

  1. Chicago Tribune article
  2. NY Times opinion from the perspective of a former developer (good read)

Comments
Robert David Graham(anon) | ‎08-13-2012 10:51 AM

You misunderstand the problem, and I'm not sure you read the NYTimes article properly.

 

No development methodology can fix the problem because it wasn't a software "bug". Instead, the software behaved exactly as developers wanted it to. It's only that in retrospect, the developers realize that what they wanted was the wrong thing.

 

The problem was that during the "requirements" phase, they said certain risks were acceptable, and only later realized that the risks were not acceptable.

| ‎08-14-2012 08:42 AM

Consider that Knight's first statements were that it was an infrastructure problem, and not a code problem.  This may be true when we consider that automated deployment systems could be configured incorrectly - which raises the question, who tested THAT system?

Another question: if we are to "build quality in" to the system (and the code), then why didn't the developers build a secondary or tertiary failsafe into the code?

I'll offer to build one for them for just 1% of their $440M damages.  :-)
