When Will It Be Done? Feedback Loops and Determinism - Where Delivery Meets Engineering

In the delivery realm, we have one fundamental question:

  • How long until we get our stuff?

Gross Lead Time Does Not Depend Upon Estimates

The last post summarized how the gross lead time of delivery does not depend upon estimates - the vast majority of the time comes from how the system is structured. It is the system that produces the enormous variance, via feedback loops (and later, as we will see, wait times too - but we have not considered multiple pieces of work yet).

Therefore, any delivery system that bases its delivery estimates on task estimates rather than on end-to-end systemic analysis of itself is selling snake oil. I.e., if engineering is being tracked but you don’t track delivery, ops + product time, and the gross lead time of work flowing through the whole, then you are solving for the wrong things.

How much does it cost to run a program?

One of the fundamental problems of computer science is being able to take a program and say how long it will take to run. This is what the phrase ‘halting problem’ refers to - can we know whether, let alone when, it will finish?

The problem of delivery estimates and understanding how work flows through the system is - in the abstract - exactly the same.

Essentially, a Turing-complete language for writing programs only needs a few things (a toy sketch follows the list):

  • Ability to process in a sequence
  • Ability to branch (i.e., if ‘something’ happens, then do ‘sequence a’, otherwise do ‘sequence b’)
  • Ability to loop and repeat things (i.e., a feedback loop)
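
For concreteness, here is a toy Python fragment (purely illustrative, my own, not from the post) that exercises all three primitives:

```python
def process(items: list[int]) -> int:
    total = 0                  # sequence: statements run one after another
    for item in items:         # loop: repeat things (a feedback loop over the work)
        if item % 2 == 0:      # branch: if 'something' happens, do sequence a...
            total += item
        else:                  # ...otherwise do sequence b
            total -= item
    return total
```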

This is excellent, because it means that all the engineering and scientific progress over in computer science land can be tested over in delivery land.

We know that in the ops + delivery space our variance is caused by loops.

We want to know how to convert looping systems into sequential systems. So, let’s use the body of science that has made it its job to answer these questions.

In computer science land, we want to figure out how to take looping programs, convert them into sequential code for a processor to run efficiently, and have an expectation of run time based upon input.
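
As a toy illustration of that conversion (assuming a trivial sum can stand in for a real program - the functions below are mine, not the post’s):

```python
def sum_with_loop(n: int) -> int:
    # Looping version: run time grows with n, because the body executes n times.
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_closed_form(n: int) -> int:
    # Loop eliminated: a single sequential expression with constant,
    # predictable run time for any n.
    return n * (n + 1) // 2
```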

The world of algorithms, i.e., the games engineers play when practicing katas, maps very well into this space.

Engineering Kata + Elimination of Loops - Performance Optimization AKA Ship It

In engineering, there are various exercises and practices employed to build up skills - and the ability to execute them without thought: the thought processes, the knowledge of algorithms and data structures, familiarity with languages, and proficiency with IDEs (to name just a few).

It takes a lot of practice of a lot of skills to become a strong engineer (writing blog posts, sadly, is not one of them, so I apologize for the meandering).

Repeating the same problem with different constraints to enable deliberate practice is part of what I would expect aspiring engineers to be doing. It shortens the leveling process immensely. The value of this is that when you automate your ability to do a thing, you build yourself the space to think about - and get ahead of - the hard stuff.

Novel Practice - I Haven’t Seen Anybody Attempt to Exapt This Knowledge into Delivery

What if we could start to think about how this practice of practice could apply to delivery and operations? What are the conditions that would be required? Repeatable predictable frameworks that can be abstracted into a game would be a start.

Some katas (Advent of Code, for example) are interesting because if you solve them in a very straightforward, easy-to-explain-and-draw way, they often take a very long time to run, whereas if you design the correct structure, where everything can run in a single pass, they will scale and execute very well.
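
A hedged sketch of that contrast, using a hypothetical pair-sum kata (the kata and the names are illustrative, not a specific Advent of Code puzzle):

```python
def has_pair_sum_naive(values: list[int], target: int) -> bool:
    # Straightforward and easy to draw: compare every pair. O(n^2).
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] + values[j] == target:
                return True
    return False

def has_pair_sum_single_pass(values: list[int], target: int) -> bool:
    # The single-pass structure: remember what has been seen so far. O(n).
    seen: set[int] = set()
    for v in values:
        if target - v in seen:
            return True
        seen.add(v)
    return False
```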

So if we can develop a way to formalize how to model the end-to-end process with a focus on the prediction of gross lead time, then we can start to construct games to find better ways of working that can directly attach themselves to the most valuable project metrics - how much did it cost to do a thing? How long did it take to do it? How do we measure its value?

A Great Example of This Is Planning Work around Finding Prime Numbers

If we take a mathematical problem and look to see how we plan and figure out how to solve it, we can observe some things:

  1. We do not just start work and figure out what to fix later; our first problem is determining the process by which we know we will get to something that works with a definition of success.
  2. Our process has no failure conditions because it constructs the valid space by progressively narrowing it, and then executes against it.

So Let’s Solve Finding a Prime:

To tell whether a number (let’s call it N) is prime, you have to test every combination of factors that could multiply together to reach it. Naively, this is an infinite space: all integers multiplied by all integers.

However, we can quickly identify some constraints:

  1. If exactly one factor is negative, the product is negative, and two negative factors just mirror a positive pair; 0 and 1 only produce trivial products - so for positive N we can reject factors less than 2.
  2. Because multiplying two integers (each at least 2) together produces an integer larger than either of them, we know we never need to check values that are N or higher.
  3. We do not want to perform work multiple times.

This puts a box around the problem and defines a finite space: the number of things to check is roughly N squared.
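
As a sketch (illustrative Python, my rendering rather than anything from the post), the boxed-but-unpruned version literally walks that finite grid:

```python
def is_prime_boxed(n: int) -> bool:
    # Walk the finite box: every pair (a, b) with 2 <= a, b < n.
    # Roughly n squared checks - finite, but not yet pruned.
    for a in range(2, n):
        for b in range(2, n):
            if a * b == n:   # found a factor pair, so n is not prime
                return False
    return n >= 2            # constraint 1: values below 2 are rejected
```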


This matters because any user input is effectively an infinite space - we can simplify our initial implementations by hard-coding (or simply not including in the interface) the majority of values and solving for a single example.

Done Infinite -> Finite. Time to Prune

So from here we have turned the problem into a finite one. We can multiply all the combinations together and see whether any of them produce N.

E.g., if we are testing 5, we can check the following:

  • 1*5 and 5*1 - we rejected these previously
  • 2*4 and 4*2
  • 3*3 and 3*3 - a dupe in here
  • 4*2 and 2*4 - this is a dupe of an earlier pair, so we infer from symmetry that we can stop: because we have a systematic way of iterating the combinations, we can infer that future ones will also be duplicates.

At this branching point there is an important thing delivery-wise. If we were not happy relying on the intuition that the symmetry holds for larger values of N (because we have not formally proved it), we should spin up a team to check for the failure condition and continue here under risk.


So this problem requires 3 calculations (2*4, 4*2, and 3*3). However, because of commutativity, we can reduce this to 2: 2*4 and 3*3. Again, we might need to spin up a team to verify and prove commutativity if we are not happy.

We can take this further, but so far we have gone from (a code sketch follows the list):

  1. infinite
  2. N * N possibilities (the box)
  3. N (a systematic list through the box - one check per candidate first factor - rather than every pair)
  4. N / 2 (deduplication by symmetry)
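
A minimal sketch of where that progression lands (Python, my rendering; deriving the partner factor by division is one faithful way to realize the one-check-per-first-factor idea):

```python
def is_prime(n: int) -> bool:
    # Candidate first factors run from 2 to n // 2: any composite
    # n = a * b with 2 <= a <= b has a <= n // 2, and symmetry means
    # the mirrored pair (b, a) never needs a separate check.
    if n < 2:                          # constraint 1: reject values below 2
        return False
    for a in range(2, n // 2 + 1):     # the pruned, deduplicated list
        if n % a == 0:                 # a partner factor b = n // a exists
            return False               # so some a * b reaches n: not prime
    return True                        # nothing multiplied to n: prime
```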

If we take this and say we have ways to prove that a multiplication or an addition was correct - which means there is no way for someone’s work to be ‘completed’ with defects (which means a strictly linear system) - then:

For a given value of N we know the number of operations, can establish a time for each type, and so can establish an accurate estimate, including the expected variance (which is 0).
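
To make the zero-variance claim concrete, a hypothetical cost model (the names and the per-check cost are made up for illustration; the cost per operation would have to be measured):

```python
def worst_case_checks(n: int) -> int:
    # A prime forces the pruned check to run to completion:
    # one divisibility test per candidate factor in 2 .. n // 2.
    return max(0, n // 2 - 1)

def estimate_seconds(n: int, seconds_per_check: float) -> float:
    # Deterministic operation count * measured cost per operation
    # = an estimate whose expected variance is 0.
    return worst_case_checks(n) * seconds_per_check
```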

Getting to Failure Fast

This is an example of using constraints to make the solution space small enough to solve reliably in a known time. If we stop here, it will be N/2 calculations in the worst case, as we do not expect to have to go through every number when N is not prime. There is an important point here: it is faster to get to failure than to completion. So we suspect that we can further reduce this time.

However, all this is predicated on a reliable way to do the work in a way that doesn’t cause a defect feedback loop.

Additionally, if we look at the constraints from earlier, we can see that they are really tests for the validity of the input data: they scope the input so that the system doesn’t have to compensate for it, and so it stays far simpler.

We have reduced the scope of the problem by defining easy-to-detect - and easy-to-prove - failure cases. We haven’t done some nonsense involving a happy path. We have defined a single scenario (N=5) and worked it end-to-end to find an efficient implementation that does several things:

  1. It defines how to generate input values - and prune these.
  2. It defines that if the multiplication produces N we fail the prime check and abort the process.
  3. It defines that if the multiplication does not produce N then we do not fail the prime check yet.
  4. It defines that once all multiplications have failed to produce N, we know we have a prime.

Note how most of the clever work is in step 1 - the definition of how to eliminate all possible multiplications except the ones we desire:

Step 1 generates the data to loop over. Steps 2 + 3 define the decision conditions plus an early exit case for failure. Step 4 defines the exit case on the loop for success.
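
Read back against the earlier sketch (same assumptions, same illustrative code), the four steps land on specific lines:

```python
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    for a in range(2, n // 2 + 1):   # step 1: generate and prune the input values
        if n % a == 0:               # step 2: a multiplication reaches n - fail and abort
            return False
        # step 3: this multiplication did not produce n - keep going
    return True                      # step 4: every multiplication failed to reach n - prime
```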

Conclusion

This should suggest why the tone of these posts is around us not really understanding where the bottleneck is in our systems. Delivery and product cannot do their jobs while we build systems that obscure the data that would put design pressure back onto their systems.

We think we have an engineering / QA / route-to-live problem - but we really have a problem where the people who should be defining the behavior of the system are way over capacity, because job 1 is very hard and is being done by people who have very little formal training in this stuff. This is because the history of engineering is to pull the human interaction aspect of engineering (i.e., the non-code parts) out of engineering, based on an untested assumption that doing so will increase throughput because more time spent writing code leads to more things of value being produced.

I am strongly asserting that this assumption is completely wrong - and if you measured gross lead time (and throughput - coming!) this would be very obvious, because the demand on systems is often such that we do not invest the time (or have the knowledge) to structure them properly, in ways that can produce reliable, measurable output.

This excess demand produces low-quality requirements that are experienced as engineering failure and as a quality problem in build or test.

But while nobody is tracking gross lead time, how is any team supposed to directly experience this?