Continuous Delivery – Resilience

I recently blogged about where to start with CD: test automation or deployment automation?  The conclusion I came to was that neither answer is correct; you should work on both incrementally, because one without the other is not very useful.  Working on both should deliver value to customers sooner, even if each moves more slowly individually.

I have continued to contemplate how enterprises with existing (sometimes very old, fragile and/or poorly architected) software should start their CD journey.  Many of the CD principles can take years to fully implement, but teams need to know how to start getting value out of CD quickly.  There are many principles in CD, but which ones should a new team focus on first?

Jez Humble (some say the father of CD) has blogged several times about the principles of CD, in addition to all the great content in his books.

Looking at my team's CD journey, considering the issues we saw along the way, and hearing about struggles from other companies/teams, I am convinced Optimize for Resilience is the key starting point for most teams new to CD.  I believe it is the principle you can start implementing and seeing value from most quickly.  Jez details this principle further in another post where he points out the human element:

But the most important element in creating resilient systems is human, as Richard I Cook’s short and excellent paper “How Complex Systems Fail” points out. This is one of the reasons that the DevOps movement focuses so much on culture.

I completely agree, optimizing for resilience is a necessary culture change. 

Assume You Will Fail


That sounds really negative, but the reality is that failure is inevitable.  We will all fail from time to time in our careers.  For example, we may:

  • Not consider a data condition that only exists in a specific environment
  • Incorrectly predict how users will interact with our website
  • Misunderstand what the business was asking for
  • Break a service contract
  • Deploy components in the wrong order
  • Forget a database script

Too many teams/companies spend all their energy trying to prevent failure.  We should always learn from our failures and try to improve, but I propose it's more important to have a plan for when we fail, first and foremost.  If we embrace the fact that failure is inevitable and always have a plan for recovering from or isolating the failure, we become more resilient and those failures will have far less of an effect, if any.

Plan For Failure

I blogged about Release Readiness back in February, which is how we plan to be resilient.  I didn't really put it in those terms at the time, but that is exactly what it is.  We formalize the process of analyzing what we are changing and how we will "unchange", hide or disable the change if/when needed.  This is a very different way of thinking.  It's not natural.  Most of us want to believe we know what we are doing and everything will work out.  It usually does, but when it doesn't it can negatively affect customers, teammates and other co-workers.

In our case we made Release Readiness Analysis part of the definition of done on our Kanban board so we wouldn't forget to do it.  Even then developers would forget, but over time the teams kept each other accountable and now it's second nature.  Here are some of the plans we have made for different failures:

  • Always deploy changes through multiple environments (local, DEV, QA) to practice deployments that will eventually fail
  • Use feature flags so new functionality can be disabled/hidden when the business doesn’t approve it in UAT or it causes an error in PROD
  • Only break contracts when necessary
  • Never update services in place beyond DEV or QA; instead, deploy changes as a new version so you don't break clients
  • Have a rollback plan and practice it so you can confidently rollback a bad deployment
  • Don’t change existing database schemas; append to them, then clean up later
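To make the feature-flag idea above concrete, here is a minimal sketch.  The flag names, the in-memory `FLAGS` dict and the functions are all hypothetical; in practice a team would read flags from configuration or a flag service, but the shape of the idea is the same: new functionality stays behind a switch that can be turned off without a redeploy.

```python
# Hypothetical in-memory flag store; real teams would back this with
# config files or a feature-flag service.
FLAGS = {"new_checkout": False}  # off until the business approves it in UAT


def is_enabled(name: str) -> bool:
    """A flag is on only if it exists and is explicitly enabled."""
    return FLAGS.get(name, False)


def render_checkout() -> str:
    # The new code path is hidden behind the flag. Flipping the flag off
    # in PROD "unchanges" the feature without rolling back a deployment.
    if is_enabled("new_checkout"):
        return "new checkout page"
    return "legacy checkout page"
```

Flipping `FLAGS["new_checkout"]` to `True` switches users onto the new path; flipping it back is the recovery plan if the new path fails.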

I want to call out rollbacks specifically, as they are the typical failure plan for most teams.  A rollback is only one way of recovering from failure.  Rollbacks can be tricky to handle, but we have a working rollback process.  Even though we have it and continue to test it, we have only used it once over the last three years.  We typically roll forward and push a fix through the pipeline instead.  This is usually faster and less risky when you consider data conditions and all the other factors involved.
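A rollback plan is only usable if the pipeline always knows the previous known-good version.  This toy sketch (hypothetical names, in-memory history; a real pipeline would track this in the CD tool) shows the minimum bookkeeping that makes "roll back" a one-step action you can rehearse:

```python
# Hypothetical deployment history; a real pipeline stores this durably.
history: list = []


def deploy(version: str) -> str:
    """Record and 'deploy' a version, returning what is now live."""
    history.append(version)
    return version


def rollback() -> str:
    """Return to the previous known-good version.

    Rolling forward with a fix is often preferable, but the rollback
    option should exist and be practiced so it works when needed.
    """
    if len(history) < 2:
        raise RuntimeError("no earlier version to roll back to")
    history.pop()       # discard the bad release
    return history[-1]  # the prior version becomes live again
```

Practicing this through every environment (local, DEV, QA) is what gives a team the confidence to actually use it in PROD.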

There is no question this all requires more thought and analysis and will slow down development initially, but in our experience this small up-front effort pays dividends later.

Start Today

It doesn’t matter where you are on the maturity scale for CD; you can start Optimizing for Resilience today.  I think it's the best place for teams just starting CD because it forces us to think differently about what we develop and how.  It's not easy, but it is well worth it.

What do you think?  I would love to hear more thoughts on Optimizing for Resilience and how it has affected, or could affect, your teams.  Start a conversation in the comments below or on Twitter.
