Queues and Backlogs Are a Cause of Low Quality in Delivery

A question

Are delivery and product people aware that in engineering we have various types of queues and buffers? There is a significant amount of science and theory behind queues and their management. It is essential because queues are a core part of how large-scale systems perform. How much of this theory, experimentation, and practice exists in the product and delivery space, where the digital delivery production system lives? In my experience, none. Has anyone tried to cross-pollinate here?

For any of this to make sense, we need to agree that our goal is to deliver the most value.


We tend to assume that engineering and releasing code are the bottlenecks. This is because there is a huge backlog of work in front of engineering: this is the queue. I regularly see backlogs with thousands of ‘not done’ items, and these backlogs are handed to teams as if they were an asset.

[Image: a queue breaking things]

The point

In delivery systems, we often separate work stations or job roles with queues—for reasons.

Scenario - the simplest system

Let’s consider a scenario representing the most basic way of working: someone does a task and passes it to the next person:

  • Given there are no queues
  • When we have 2 functions in a sequence, ‘First’ followed by ‘Last’
  • Then First cannot start a new task until Last picks up the task just finished.

With no queue, there is nowhere to put finished work, so First has to hold on to it until Last is ready to pick it up.

A problem with this system

If the production rate of First is higher than that of Last, then Last is constantly busy, whereas First experiences downtime because it finishes each task sooner.

  • Utilization = Time spent working / time paid for

If we measure system efficiency as the percentage of time spent being busy, this system scores less than 100%, because an earlier stage is being slowed down by a later one.
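
As a rough illustration of the utilization formula, here is a minimal Python sketch. It assumes constant task times and that First holds each finished task until Last is free. The function name `blocked_utilization` and the figures (4 hours for First, 5 hours for Last) are invented for the example.

```python
def blocked_utilization(first_task_time: float, last_task_time: float) -> dict:
    """Utilization of each stage in a two-stage line with no queue.

    Assumption (not from the article): task times are constant, and a
    stage that finishes early simply waits for the other stage.
    """
    # Without a queue, the whole line moves at the pace of the slower stage.
    cycle_time = max(first_task_time, last_task_time)
    return {
        "first": first_task_time / cycle_time,  # time working / time paid for
        "last": last_task_time / cycle_time,
    }


utilization = blocked_utilization(first_task_time=4.0, last_task_time=5.0)
print(utilization)  # {'first': 0.8, 'last': 1.0}
```

With these made-up numbers, First is busy 80% of the time, and the remaining 20% is the downtime the next section tries to "fix".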

Let’s fix low utilization with Queues!

How can we address this inefficiency? After all, we are paying someone full time for a part-time job! It’s not fair on Last and it’s obviously not good value for money! We could be getting more out of them!

If only there was a way that First could continue working after a piece of work was finished. What if they could pile up work?

We introduce a queue!

  • Now when First finishes a job, they can start the next one straight away, and Last picks the finished job up from the queue when it is free.
  • Now every part of the system is producing at 100% utilization.

What happens now? Rejoice?

Everybody can now be very busy with no downtime! Each person is producing more stuff, so we expect more to be delivered.

We can see this busyness via proxies: people’s ability to partake in things like socials, business development, practice development, learning, or taking a break to do something home-related drops dramatically, because there is always more work to do! We can also put targets on work to get more of it!

Observations

  • Time invested into work by First remains the minimum (which is efficient!). They do work and put it onto the queue and start the next thing.
  • We also generate a pile of work at a rate that Last can never keep up with - some work can never get done.
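
To make the growth of that pile concrete, here is a minimal sketch in Python. It assumes constant task times (First takes 4 hours, Last takes 5, both invented for illustration), an unlimited queue, and that Last is never faster than First. `backlog_after` is a hypothetical helper, not something from a real tool.

```python
def backlog_after(hours: int, first_task_time: int, last_task_time: int) -> int:
    """Items sitting in the queue between First and Last after `hours`.

    Assumptions (illustrative only): constant task times, unlimited queue,
    and Last is the slower stage, so it always has something to pick up.
    """
    if last_task_time < first_task_time:
        raise ValueError("this sketch assumes Last is not faster than First")
    produced = hours // first_task_time
    if hours < first_task_time:
        picked_up = 0
    else:
        # Last picks up the first item when it arrives, then one more
        # every `last_task_time` hours after that.
        picked_up = (hours - first_task_time) // last_task_time + 1
    return produced - picked_up


for hours in (40, 400, 4000):
    print(hours, backlog_after(hours, first_task_time=4, last_task_time=5))
# 40 2
# 400 20
# 4000 200
```

Under these assumptions the pile never shrinks: it grows in step with the hours worked, which is the "some work can never get done" observation above.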

Knock-on effects

Overproduction leads to expensive prioritization

Previously, Last could just pick up the work being given. Now, however, there is a choice. We now have work to do to decide what work should be picked up next. We need to order it.

Bear in mind this isn’t work that was canceled or deleted; this is work that has already been decided to be done that has not yet been done because we have introduced a new inefficiency into the system.

That of managing overproduction.

  • The time spent ordering work has to be proportional to the size of the list of work in some way. When we get to comparing every item with every other item, the number of comparisons grows as n^2. We often introduce categorizations (which themselves take work to design) to reduce the scope of what needs looking at. Everything becomes complex.

  • Time spent involving Last in this process cannibalizes time from the slower steps, thus lowering their output.

  • Instead of deciding to do a thing once (which is when First started to work on it), we now need to decide to do it again for Last to pick it up!
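
The n^2 claim above can be checked with a few lines of Python. This sketch only counts the case described in the text, where every item is compared with every other item once; the backlog sizes are made up.

```python
from math import comb


def pairwise_comparisons(n_items: int) -> int:
    """Number of distinct pairs in a backlog of `n_items`: n * (n - 1) / 2."""
    return comb(n_items, 2)


for n in (10, 100, 1000):
    print(n, pairwise_comparisons(n))
# 10 45
# 100 4950
# 1000 499500
```

Ten times more items means roughly a hundred times more comparisons, which is why a thousand-item backlog cannot be ordered seriously by hand.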

Poor quality leads to rework

So far, we have assumed that all work, once done, delivers value. This is simply not true for many reasons; it might be defective or nobody may use it.

When something is ‘done’ but needs to be redone, it reduces the capacity of Last.

What we had—in the dumb system—was a handover at the time the work was done and fresh in mind; we also had ‘dead’ time that could have been spent on that piece of work to discover more about the customers, more examples to help reduce defects, and define a larger area that could work.

We have removed that free slack time, which could have been spent on increasing quality and therefore shortening feedback loops, and instead invested it in the production of waste.

Conclusion

These two effects mean that the queue is far more costly than paying someone who is idle 20% of the time (if it were 50% or more, you would add another team). By trying to get that 20% back, we destroy the output of the system.

Why does this happen?

It is because the amount you invest in a system is not related to its throughput, measured as the value the system delivers.

When we control a system by looking at cost (because that is easier), we completely ignore value and end up producing intermediary products at a maximal rate.

We do not realize that work in the middle of a system is not of value—it is toxic waste that needs to be minimized. This upsets people because everyone wants to be doing something valuable, and they have been taught to derive meaning from being busy.

This everyone-is-always-busy approach probably worked in a Protestant field, where people spend a lifetime doing all the jobs and just go around applying themselves where they are needed. It doesn’t work in a highly structured ‘this is your job’ system. The irony is that that way of working is what we are about to suggest.

How should we optimize systems?

The important thing in our system is getting things in front of people quickly because that is a feedback loop we can learn from. If you can release quickly, you can get some value out before the requirement changes—the requirement will always change given sufficient time.

A real example - the typo

If an engineer can fix a typo and push it out, why do we want to introduce a ticket, plan, schedule, prioritize, build, QA, etc.?

Really, a spell checker is the QA tool here, so we should just sneak the fix out. However, in many environments this is not allowed, because people worry it would be viewed as a change that was not on clients' priorities. For example: ‘it would not be in the release notes; therefore, we wouldn’t know which change introduced this, which would break our QA process’. Again, this is a real example. Naturally, it got a 2-word response, and I encourage all teams to just fix this stuff and not tell anybody, because the systems we work in are that stupid. Having a single system in a complex project is naive, and this points at the problem. When we are just doing our jobs, we are being stupid, because our job is to help the system throughput, not to produce pretty pictures, or code, or Gantt charts.

What is the minimum viable delivery structure?

This is a choice! Not making a choice is still a choice; it’s either ignorance or abdication.

Anything in our system that increases the gross lead time of work should be eliminated.

This comes in the form of:

  1. Feedback loops triggered by failures that are only discovered later.
  2. Piles of subpar work queued in front of genuine efforts, where much smaller, higher-quality piles would produce fewer failures.
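
To put a number on how a pile inflates lead time, here is a small arithmetic sketch in Python. It assumes the queue is worked strictly first-in, first-out at a steady rate, which real backlogs rarely are. The helper `weeks_until_started` and the rate of 5 items per week are invented for illustration.

```python
def weeks_until_started(items_ahead: int, items_finished_per_week: float) -> float:
    """Weeks a new item waits before anyone starts it.

    Assumption (illustrative): strict first-in, first-out order and a
    constant completion rate.
    """
    if items_finished_per_week <= 0:
        raise ValueError("the team must finish at least some work per week")
    return items_ahead / items_finished_per_week


for backlog_size in (5, 50, 1000):
    print(backlog_size, weeks_until_started(backlog_size, 5.0))
# 5 1.0
# 50 10.0
# 1000 200.0
```

At that rate, an idea added to the back of a thousand-item backlog waits 200 weeks, by which time the requirement has almost certainly changed.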

There is a massive false assumption in managing these systems that producing ‘a thing’ has value. This is simply not true. Value and throughput (the rate of value) are related to things a customer can use and ultimately pay for.

The misunderstanding

If we do a thing and revenue does not go up, then we have increased the liability of the organization, not its assets.

If this was understood, the initial optimization of introducing the queue would never happen until after the minimum gross lead time was already discovered from not having the queue.

The only reason, here, to introduce a single-item queue would be downtime in Last. We would then increase its size only in response to further downtime in Last, until that downtime happens so infrequently that we don’t care.

The queue’s devastating effect on performance would then be measurable, and so any optimizations to its length or type would also have a measurable effect.
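
The "start with no queue, then grow it one slot at a time" idea can be sketched as a small simulation. Everything here is an assumption made for illustration: the random task times (First averages 4 hours, Last averages 5 but varies more), the fixed seed, the 1% tolerance, and the function names. It is a toy model, not a measurement of any real team.

```python
import random


def last_starved_fraction(buffer_size: int, n_jobs: int = 5000, seed: int = 1) -> float:
    """Fraction of elapsed time Last spends waiting for work.

    Toy model: First holds a finished job until the queue has room for it.
    """
    rng = random.Random(seed)  # same seed, so every buffer size sees the same jobs
    left_first = []  # time each job leaves First
    done_last = []   # time each job is finished by Last
    starved = 0.0
    for i in range(n_jobs):
        first_time = rng.uniform(2.0, 6.0)  # hypothetical: First averages 4 hours
        last_time = rng.uniform(1.0, 9.0)   # hypothetical: Last averages 5 hours
        prev_left = left_first[-1] if left_first else 0.0
        prev_done = done_last[-1] if done_last else 0.0
        finished = prev_left + first_time
        # There is room for job i once job (i - buffer_size - 1) is done.
        j = i - buffer_size - 1
        room_at = done_last[j] if j >= 0 else 0.0
        leaves = max(finished, room_at)
        starved += max(0.0, leaves - prev_done)  # Last idle, waiting for First
        left_first.append(leaves)
        done_last.append(max(leaves, prev_done) + last_time)
    return starved / done_last[-1]


def smallest_adequate_buffer(tolerance: float, max_size: int = 20) -> int:
    """Grow the queue one slot at a time until Last's downtime is tolerable."""
    for size in range(max_size + 1):
        if last_starved_fraction(size) <= tolerance:
            return size
    return max_size


for size in range(4):
    print(size, round(last_starved_fraction(size), 3))
print("smallest buffer within 1% downtime:", smallest_adequate_buffer(0.01))
```

The point of the sketch is the order of operations: measure the system without a queue first, then justify each extra slot by the downtime it removes.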

We need to maximize throughput first and then manage cost later.

We get into cost management conversations usually because our throughput is less than agreed, and so people don’t want to pay (rightly so, you are failing).

Does this mean queues are bad?

Uncontrolled queues are very bad - most queues I see are uncontrolled. For example, when someone has an idea, what do people say? ‘Put it on the backlog.’

People don’t realize that the backlog could be called the waste pile and be 100% accurate.

When are queues good?

The value of a queue is to have a buffer. A buffer is an agreed inefficiency, a deliberate overproduction up to some limit that represents the risk of Last processing a series of jobs quicker than they can be produced by First and thus starving.

So the conditions of a queue being potentially beneficial are:

  1. Sometimes building and releasing all the way to production takes less time than defining the work.
  2. This happens enough, and the current queue size gets exceeded, making us want to increase it.

There are many types of queues and buffers. Some are appropriate in different situations, and they all have different management strategies and optimizations.

Yet I only see the backlog of waste items that are never going to be picked up - but absolutely cannot be deleted for some bizarre reason.

Final Comment

Much like a child who babbles incessantly (but is having a great time) and later grows up to form coherent thoughts and reasons, we need to stop producing ticket diarrhea into backlogs.

It isn’t work; at best, it’s play, and at worst, active destruction of the team’s throughput.