Structural Failure Looks Like Quality and Delivery Failure.

Different ways of working have different failure modes. We tend to pick ways of working that are simple to understand but have incredibly complex and costly failure modes.

We wish for the best, but end up with eldritch terrors

All our processes can be split into 2 approaches:

  1. handoff over multiple teams
  2. cross-functional team

Both approaches have benefits and shortcomings. However, humans are not wired to understand the shortcomings of handoffs. This is because we don’t see the feedback loops; these processes exist in time, making it incredibly hard for us to grasp problems that occur over time. To comprehend such issues, we need to tell stories about each possible scenario and somehow understand how their combination plays out. This isn’t a human capability – we developed mathematics for this purpose.

E.g. a 3-stage process with a 50% failure rate at each stage produces lead times of between 2x and 20x the success lead time, half of the time.
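
One way to check a figure like this is to compute the distribution directly rather than tell each story by hand. The sketch below is a minimal model, not necessarily the one behind the numbers above: it assumes that a failed stage is simply retried until it passes, that every attempt takes the same time, and that there is no waiting between stages. The function names are made up for illustration.

```python
from math import comb


def attempts_pmf(stages: int, p_fail: float, max_failures: int = 200) -> dict:
    """Probability of each total attempt count, assuming retry-in-place.

    The number of failed attempts before the last stage passes follows a
    negative binomial distribution; the tail beyond max_failures is dropped.
    """
    p_ok = 1.0 - p_fail
    pmf = {}
    for failures in range(max_failures + 1):
        # Number of ways to spread the failures across the stages.
        ways = comb(failures + stages - 1, stages - 1)
        pmf[stages + failures] = ways * p_ok**stages * p_fail**failures
    return pmf


def prob_lead_time_at_least(multiple: float, stages: int = 3, p_fail: float = 0.5) -> float:
    """Chance that the lead time is at least `multiple` times the no-failure lead time."""
    pmf = attempts_pmf(stages, p_fail)
    return sum(p for attempts, p in pmf.items() if attempts >= multiple * stages)


if __name__ == "__main__":
    for multiple in (1, 2, 3, 5):
        print(f"P(lead time >= {multiple}x) = {prob_lead_time_at_least(multiple):.3f}")
```

Under these assumptions only 1 run in 8 finishes in the no-failure time, and half of all runs take at least twice as long. Failures that send work back to an earlier stage, rather than retrying in place, would stretch the tail further.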

This isn’t how humans work; we each have a single history and see things as straight lines (sometimes these fork and merge), but we do not really understand the impact of a backward loop that causes repetition.

We see this shortcoming of humans when we look at cross-functional teams. Some people are not always busy, which must be wasteful! This is because we see each team member as a separate person – we cannot really intuit that the team is fully busy.

When we think about our body, we don’t consider it wasteful for our legs to be still while we are at a desk. It would be silly to have them flapping about – it would be very annoying to your teammates! Yet in business, we strive to keep every organ as busy as possible.

Let’s make this concrete!

Example - a real situation

Consider 3 ways of delivering a series of 50 web forms for a submission process - which is better?

  1. Have multiple people build the pages at the same time and then integrate them at the end.
  2. Have 1 team build the pages one at a time and integrate as they go.
  3. Have 1 team build the pages, but start with the last page and work backwards to the first!

Option 1 - certain, systemic failure to hit the deadline

Option 1 seems simple, fast, and independent, but only because we are considering things in isolation. It ignores the temporal reality of multiple people working at different speeds, to different levels of quality, and with other workloads. The reality is that the pages will not be independent (because they are part of one submission), which means changing one page will affect others. With this approach we are not considering how previously done work will impact this piece, how this piece will impact future work, how changes made by different people as we learn more will propagate to other pages, or how later testing and failure will impact other pages.

Building multiple pages at once (option 1) has several problems, starting with the numbers.

Some stats:

  • 50 pages, each with a quality of 95%, will all be right only 7.7% of the time (0.95^50), which means the failure conditions below are likely to actualize.
  • Getting quality up to 99% still only gives a success rate of 60% (0.99^50) – the problem here is not quality of work, but systemic.
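
The arithmetic behind these stats can be sketched in a few lines. This assumes each page succeeds or fails independently of the others, which is the most generous reading; the function names are illustrative only.

```python
def all_pass_probability(per_page_quality: float, pages: int) -> float:
    """Chance that every page is right, assuming pages fail independently."""
    return per_page_quality ** pages


def quality_needed(target: float, pages: int) -> float:
    """Per-page quality needed for the whole set to succeed with probability `target`."""
    return target ** (1.0 / pages)


if __name__ == "__main__":
    for quality in (0.95, 0.99):
        print(f"{quality:.0%} per page -> {all_pass_probability(quality, 50):.1%} for all 50")
    print(f"per-page quality needed for a 95% overall success rate: {quality_needed(0.95, 50):.2%}")
```

Run the other way round, the same formula says that every page would need to be about 99.9% right for the whole set of 50 to have a 95% chance of working, which is why pushing on individual quality does not rescue the structure.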

The problems:

  • All the pages have to work with one another - how can we stabilise the totality of the pages if a change in one is affecting the others? A will change B, which will change C, which could change A. E.g. answers on one page might change the questions on another.
  • The value of the questions and answers will differ between use cases, which means we need many test scenarios. But the pages are being built independently, so does every page have to run every test on itself as it is being built? That would be strange. Worst case, this results in (number of pages × number of use cases) test scenarios. Normally, we would just test the use cases.
  • Once all the pages are done we need to test them as a unit - but the people who finished early (all but the last one) will be on other work, so this invents a need for another team to do the testing.
  • If another team does the testing, then release comes long after the work finished, so the feedback loop is delayed and ‘done’ work rots unused for months, delivering no value. It follows that most ‘done’ work (the last 6 months) will not be in use.
  • This means that the other work started in the interim will get delayed when test failures are found later.
  • It also means that the work on the pages will get delayed because of failures in earlier work!
  • Once a page changes due to a bug, others will also likely have to change - and the dominoes continue.

This is a process that looks simple on paper, but because of the dependencies there are many feedback loops. Because each feedback loop has a probability of success, all the odds multiply together. Even when everyone behaves well within this system, the result will be work that is guaranteed to be massively late. It is a mathematical truth.

  • Everything gets interrupted by design, and verification (i.e., finishing) is delayed by design in favor of starting (and not finishing) other work. This can never work for things that are not truly independent.

The primary reason this fails is that the point at which we see if it’s working is the point at which most work has been done. In the real example, 6 months of work was done before a team hit the client’s API. What’s worse is that the client didn’t want their API to be hit until it was known to be working … 2 days before go-live the API still hadn’t been called. Do you think it worked? Of course not, and it was the engineering teams that had to work late – not the incompetent management teams that set this structure in place.

This approach fails because it’s based on a flawed assumption: that all pages are independent of each other. However, since they’re part of one submission process, changes in one page will inevitably affect others.

The descent into waterfail / wagile

The way we try to design our way out of this is to go waterfall and design-heavy - TDA, signoffs, design committees etc. This reduces cross-page dependency but does nothing about the temporal aspects of team capacity: people finishing at different times and handling other work. These aspects dominate the lead times because of the feedback loops, wait times and failure rates - but we cannot see this because telling each variation of the story and adding them up is really hard. I had to write code to simulate it to understand.

The psychology behind this approach will produce about 6 months (or more) of work that is not in production. People will attempt to speed up by adding more people - because the assumption that leads to this mess is that people can work independently and still be valuable. It is simply not true.

This approach is, in my view, the primary reason why we cannot have nice things - we are hard-wired to make this mistake. This is a software example, but a great many of our business processes introduce steps and handovers when none are really necessary.

The way to resolve this is to stop, pick a thing and focus on getting it done.

Option 2

The problem this team has is that they get to the end, submit the form and it doesn’t work. They suffer next to none of the problems above.

When they are done, they can deploy it and test it. It will fail, but they will be present to address the failure. There could be an external test team involved - but the need for it is no longer there. The QA element can live within the team.

The failing of this approach is that the iterative process of making changes, passing them through all of the pages and then testing again is arduous!

This is at its worst when we consider the number of use cases they need to cover. Is there a way that avoids all this cycling?

It is worth noting that because we have eliminated all of the cross-person feedback loops, this will finish many months faster than the first option. This is in part because the time in option 1 is dominated by wait time, but also because of the thrashing option 1 induces between test and dev, and between the pages as changes ripple through them.

Option 3

Option 3 involves constructing the data object required to make a submission and ensuring that this object works. Once it works, we create the pages and functionality that produce this object.

When we start with a submission we are forced to ask ‘What is the scenario for this submission?’ Then we have 2 options:

  1. We can build only the pages and functionality to make this pass - and then move onto the next scenario
  2. We can build up more examples of submissions that cover all the scenarios we care about - i.e. we produce the submission data that would exist if we had already built the forms. We can then build test cases that pass for this data, and then implement the pages by refactoring back from the submission data into the forms required to populate it.
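
Working backwards from the submission might look something like the sketch below. Everything in it is hypothetical: the `Submission` fields, the `check_submission` stand-in for the real client API, and the two pages are invented for illustration, and a real submission object would be far larger.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Submission:
    """Hypothetical payload that the final 'submit' step has to send."""
    applicant_name: str
    country: str
    tax_id: Optional[str] = None  # only required in some scenarios


def check_submission(sub: Submission) -> List[str]:
    """Stand-in for the real submission endpoint: returns a list of problems."""
    problems = []
    if not sub.applicant_name:
        problems.append("applicant_name is required")
    if sub.country == "GB" and not sub.tax_id:
        problems.append("tax_id is required for GB applicants")
    return problems


# Step 1: write the submissions by hand, one per scenario we care about,
# and get them accepted before any page exists.
SCENARIOS: Dict[str, Submission] = {
    "domestic applicant": Submission("Ada", "GB", tax_id="123"),
    "overseas applicant": Submission("Grace", "US"),
}


# Step 2: work backwards. Each page is a function that fills in part of the payload.
def name_page(answers: dict, draft: dict) -> dict:
    return {**draft, "applicant_name": answers["name"]}


def country_page(answers: dict, draft: dict) -> dict:
    return {**draft, "country": answers["country"], "tax_id": answers.get("tax_id")}


PAGES: List[Callable[[dict, dict], dict]] = [name_page, country_page]


def build_submission(answers: dict) -> Submission:
    """Run the answers through every page and assemble the payload."""
    draft: dict = {}
    for page in PAGES:
        draft = page(answers, draft)
    return Submission(**draft)


if __name__ == "__main__":
    for name, expected in SCENARIOS.items():
        assert check_submission(expected) == [], name
    built = build_submission({"name": "Ada", "country": "GB", "tax_id": "123"})
    assert built == SCENARIOS["domestic applicant"]
```

The scenario submissions double as the test cases: a page is only finished when the answers it collects reproduce a payload that is already known to work.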

Either way we have achieved an economy of motion.

  • Nothing we do will ever fail

Of course it’s going to be very hard to build a submission object - because that’s the part that connects to reality. It is far easier for everyone to be busy on work than to do that.

Conclusion

I haven’t gone into cross-functional teams here, although options 2 and 3 are kind of cross-functional options. I do think the topic deserves separate attention, simply because these teams do have failure modes.

It is important to note that every approach has every way of failing - the reason why we talk about failure modes, though, is to highlight that in some structures it is clearer where the responsibility and accountability for failures should land.

With option 1, the management who determine the structure don’t have the light of responsibility upon them; it looks like the team’s quality is the problem! This structure also makes it very hard to measure the critical factors that would show how it’s failing and so enable people to behave responsibly and fix it.

With option 2, though, the failure is most definitely on the team - it is just that often they do not have the responsibility for defining the criteria to enable success.