Sabotage: Use Governance to Couple User Behavior to Code Deployments
In engineering we can do so many things that change the way experience is delivered to customers. Sufficiently advanced technology looks like magic (to butcher Arthur C. Clarke).
The next bit is likely old news for many readers (i.e. private companies in industries like ecom) - however, in public sector governance it might fall under the category of 'radical thinking', because we rarely see it happening. It seems like contract negotiation gets lost in the weeds and misses the big picture of what constraints the agreement puts on how delivery can work. When you affect delivery you affect team structures, you affect the product thinking that can take place, and you affect how engineers can design, build and release behavior to customers.
What we do is package 3-6 months of work - often more - into a single release, turning what could be many small things with distributed risk into one big thing with accumulated risk. The perception is that this is done because people are afraid of change, whilst the practice itself brings about the conditions for that fear to actualise.

What do clients want?
What I think is happening is that the client has a desire for some very rational things - governance, oversight, knowledge and signoff on behavior that customers will see, with a view to risk management. They want to know that someone is not about to cause them reputation damage. They also need to know that all these parties are moving in step because - as with every company ever - the output of delivery has to tie into many different pieces of work externally. Fundamentally, it is driven by a duty of care to the - often vulnerable - people the service is serving.
I think consultancies also have a desire for some very rational things - they want something simple and clear that enables them to operate in complex environments in a black and white way and show that obligations have been met. I am sure there are many other highly complex things going on.
However, what I want to throw into dispute is the commonly agreed solution, which harms both parties in unacceptable ways. The people agreeing governance likely have very little understanding of the wide variety of ways we can meet these needs without large milestone releases (maybe they only have experience of this one way). This is an example of why in previous posts I have stated things like 'methodology is no substitute for design'. There is a 'how' process going on, but it needs to be fed relevant 'whats' from experts in other areas in order to serve the needs of the whole system. When this information gets out of date and we do not take action, we cannot expect anything to change. I have very similar complaints about how business cases have to work (which is more likely a product of my lack of experience in that area) - except here the intervention is obvious to me because of decades of working in the industry.
The client wants to reduce risk, understand behavior and line up the rest of the organization. Small, controlled releases of experiments to some users instead of all enable feedback, training, education, involvement in the product, and confidence that it actually works, because the people who are experts can properly kick the tyres prior to 'mass release'. We can achieve this by taking a singular notion of release and making it gradual, and doing so in the production systems instead of what is commonly done - i.e. introducing multiple environments that all fail at different times for different reasons, long after engineers have started something new. Normally we spend six months testing hundreds of thousands of lines of code, hit some level of confidence (it's not high, as it's too complex), shout 'yolo', release it, and then everyone sits gripping their chairs tightly for a few days wondering if it will work. This - whilst mitigated to the extent possible - is a product of the governance, not of what anybody building the thing actually wants.
We use environments that are private and not publicly available to attempt to mitigate the above. This is not sufficient for what I consider to be acceptable design, experience and risk goals of a system. We need to be able to do all these things safely in publicly available systems against tightly controlled groups, because it is the public who are the subjects of the system. I am not arguing against private environments - just their exclusivity - and I would have far fewer of them.
Do we have to be agile if the governance dictates otherwise?
It is my opinion that if what we commonly agree (all or nothing, everyone-gets-everything releases) is actually the best we can do - because there may be legitimate non-technical constraints in play (that I have no experience of) - then we need to drop the whole 'agile' thing, because agreeing that behavior in production is determined by releases of versions of code massively undermines everything.
When we undertake an iterative process within a release it amounts to context-switching, producing waste and going in circles. It is one thing (good) to iterate in UX and design; it is completely different (terrible) to iterate in development. That is deliberately wasting time on the bottleneck. It is my opinion that this one decision (signoff on deployments) - or the interpretation of whatever is agreed into this - is the single largest cause of waste, because of the effect it has on all later parts of the system. It causes iterations to take place within a deployment because we cannot yet get signoff. When we attempt an iterative loop inside a system that is fundamentally linear, we risk spending phenomenal amounts of money between releases finding 'best' (but with no feedback loop to score what we have done). We see this as scope creep, because requests for change cannot be resisted when nothing of value has been released as a foundation.
Really we should do the opposite: get to 'release' quicker by finding smaller things, and put more pressure on the contractual aspect to force it to change. We can take agile principles and force the part of the system that should hurt to hurt, and so change. An iteration in agile is a release - if you are iterating without releases I have no idea what you think you are doing, but it is not agile, nor close to good practice (is it even necessary?).
It turns out that changing this release constraint opens the door to technical possibilities that enable far more granular slices of value to be released - and have their value realized - by members of the public before the whole population is served.
Meaningless scrum ceremonies - iterations that do not iterate anything the public consume
An example of how meaningless things become: in scrum, at the end of a sprint we do a showcase. The showcase is supposed to be of things that will be released into production - if nobody raises an objection - because it is the showcase of what will be released from the current iteration. This, in other sectors, is non-controversial, or at least aspirational - yet in the public sector the deviance is so normalised that readers will be recoiling in cognitive dissonance with 'but you don't understand, we have x, y, z constraints'. I agree that you might - but that also means you are not doing scrum or representing any of the values in the agile manifesto. Whilst pressure is not put upon those constraints there can be no reasonable expectation that they will move. That is ok; we don't have to do scrum or agile - we can just work with what we have and nudge systems towards better places. There is no judgement, but when we fail to properly define language and behavior we also lose the ability to compare and improve. It is self-defeating to use words in fictitious ways - language is how we think. Imprecise language is imprecise thought.
It is reckless to take on code deployments of large batches of behavior, because it depends upon a toxic notion of 'done' - one that greatly increases complexity - without any of the real feedback from the customer that we need to do this responsibly. We can feel that it is reckless because the pressure and intensity of working in these systems is many orders of magnitude higher than in systems that genuinely operate at scale. Black Friday (or tax return) events are scale - most of the rest of the time the problems do not come from any technical issue; they are almost all political. Systems that operate at scale for technical reasons constantly deploy changes, because they have removed and managed the risks of doing so as a fundamental design goal. They know they have to change - but they are also not behind a procurement wall that pretends change isn't continuous for the lifetime of the product. We possibly do not have to build procurement assumptions into governance.
This also enables the 'single delivery methodology' form of sabotage
Essentially, it is almost never a good idea to couple the behavior that our users (both internal and external) experience to deployments of code - rather, we should probably couple the behavior that different types of user experience to the type of the user. When we separate behavior from code version and instead couple behavior to configuration, we can start to offer different experiences to different people and processes - because the objectives of their use cases are quite different.
I.e. different kinds of user performing the same set of actions can have very different value propositions.
However, I think the people making these agreements have no idea what they lose.
What do we get when we decouple deployment of versions of code from user experience?
E.g. given the same journey, once we make this distinction we can use that same journey to achieve any (or all) of the below.
A/B testing
Amazon famously deploys experiments, measures conversion on real users and uses this to validate changes to experience.
Turning behavior on and off without relying on full deployments
The deployment of a config change can be very fast. By keying on config, the system's tests can exercise both next behavior and current behavior.
Testing in production of features
Often we do not run our automated test suites in production because a lack of design means we do not want test data skewing analysis - as if this isn't just a simple design problem.
Running code to monitor performance continuously against known payloads
In a system with variable load, being able to control all variables except load is highly desirable for comparative analysis.
To give different types of users a different experience
Some users want the latest behavior because its value to them is greater than the cost of unexpected failure. This is a form of market segmentation and an opportunity not only for fast feedback from real users, but also for delight.
User research with our products in production
UR very often takes place against pictures rather than software. Being able to put users on the real thing, which can record and track their activity, has to be more valuable.
Automatically healing systems
When a thing is turned on its success can be measured (e.g. conversion). If it behaves unexpectedly we can automatically disable it and revert with no human intervention, and easily notify humans as part of that process (see the sketch after this list).
Automatically scaling systems
Often systems are constrained by external resources. These constraints mean some users experience degradation and others a loss of service. We can detect this fairly easily and change the journey - perhaps via a queueing system. These things do not have to be on or off based on deployment.
Alpha / Beta testing of features in production with live users
Feature toggles are often used at granular levels; however, it is probably best to use them to represent releases. In these cases we can have live behavior, the next iteration (which would be beta) coming next, and the iteration on top of that (alpha). It is more than possible to have a deployed codebase with two future iterations also deployed, configured on and off by type of user - an early adopter pool of users, or even a 'pit of disillusionment' user pool.
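To make the self-healing item above concrete, here is a minimal sketch, assuming a toggle store and a conversion metric. Every name in it (ToggleStore, selfHeal, the thresholds) is an illustrative stand-in, not a real library API.

```typescript
// Illustrative interface only - any real toggle store (a config service,
// a database row, a file) could sit behind this shape.
interface ToggleStore {
  isOn(feature: string): boolean;
  turnOff(feature: string): void;
}

// Invented thresholds for the example: disable the feature automatically
// if measured conversion falls too far below the known baseline.
const BASELINE_CONVERSION = 0.12;
const TOLERANCE = 0.03;

function selfHeal(
  store: ToggleStore,
  feature: string,
  conversionRate: number,
  notifyHumans: (message: string) => void
): void {
  if (!store.isOn(feature)) return;
  if (conversionRate < BASELINE_CONVERSION - TOLERANCE) {
    // Revert with no human intervention - a config change, not a deployment.
    store.turnOff(feature);
    // Humans are told, but are not in the critical path of the revert.
    notifyHumans(`${feature} disabled: conversion ${conversionRate} below threshold`);
  }
}
```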
Essentially, there is no reason why we must control the behavior users experience through code deployments. Doing so would require a very specific set of circumstances to justify.
This is a long list - and one that was hammered out with almost no thought. There are likely many more things that are possible but simply not in mind, as a result of this constraint.
How can we achieve all this simply?
Without different behavior for different categories of user
When doing this I coined the phrase ‘release toggle’ - this has been used very successfully by a team on a public sector engagement. Locally, I was not happy to incur the risks of not doing this and was fortunate enough to have an aligned internal stakeholder.
This is the simplest approach and enables some of the above list, but not all. E.g. A/B testing is not possible, but it is possible to change behavior without a deploy, to self-heal, etc. It achieves the decoupling and as a result enables a great degree of flexibility.
There are many meanings of, and ways of doing, 'feature toggles'. However, to achieve the above we can be very coarse and treat things quite simply. By being coarse we can use any of the many mechanisms (out of scope here) whilst avoiding the complexity of toggling individual features on and off - because when we move from global into narrower scopes, the permutations explode.
Essentially, we just want a way to hit things and say we want to use the service as the next version, instead of the currently-live-by-default version. This can be a query string parameter, a header, maybe different URLs - there are many options, all with tradeoffs. Personally, I quite like query strings as they are very simple.
At the simplest it could be just two:
- CurrentProduction
- NextVersion
This could be done via some config read per request at runtime. The code would contain branches (very simple if statements).
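A minimal sketch of that, assuming a Node-style runtime; the names (Version, resolveVersion, renderJourney, DEFAULT_VERSION) are illustrative, not prescriptive.

```typescript
type Version = "CurrentProduction" | "NextVersion";

// Config read at runtime - here an environment variable, so flipping the
// default requires a config change, not a new deployment of code.
const defaultVersion: Version =
  (process.env.DEFAULT_VERSION as Version) ?? "CurrentProduction";

// Per-request override via a query string parameter,
// e.g. GET /apply?version=NextVersion
function resolveVersion(query: URLSearchParams): Version {
  return query.get("version") === "NextVersion" ? "NextVersion" : defaultVersion;
}

// The codebase branches wherever behavior differs - very simple if statements.
function renderJourney(query: URLSearchParams): string {
  if (resolveVersion(query) === "NextVersion") {
    return "next journey"; // the behavior still being iterated on
  }
  return "current journey"; // the released, static behavior
}
```

Releasing NextVersion is then a change to the default config value, and backing it out is the same change in reverse.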
This process takes complexity that is normally hidden in a source control system and moves it back into the code base, where we have a lot of tools and the ability to manage it better. The complexity is always there - but when we hide it, we experience it as surprise later. I want to experience the surprise as early as possible and make the complexity as visible as possible, because then we can use the development tools to manage it.
This then enables two sets of tests: one for current production, which could be static and never change, and one for the next version, which would iterate. The code base would have branches wherever the behavior differs.
So we can take one code base and validate that the current changes work for both the current version and the next version at the same time. We have taken things that are coupled and moved them closer together by changing the type of coupling - from version-controlled at release time to configuration-controlled at run time - which gives us the ability to validate both versions as we work, rather than in separate environments with separate configurations later.
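Continuing the sketch above (renderJourney is assumed from it), the two suites might look like this; node:assert is used only to keep the illustration framework-free.

```typescript
import assert from "node:assert";

// The CurrentProduction suite is static: released behavior must not change.
function testCurrentProduction(): void {
  const query = new URLSearchParams(""); // no override, the default applies
  assert.strictEqual(renderJourney(query), "current journey");
}

// The NextVersion suite iterates alongside the work in progress.
function testNextVersion(): void {
  const query = new URLSearchParams("version=NextVersion");
  assert.strictEqual(renderJourney(query), "next journey");
}

// Every run exercises both versions, so a change that regresses
// CurrentProduction fails immediately, not in a later environment.
testCurrentProduction();
testNextVersion();
```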
The process for managing this would be:
- set the config for the version you are maintaining, write the test, implement the change and make it pass
- run the tests, which will execute the tests for all versions to ensure no change caused a regression
- push the changes, which will deploy them through the pipelines and run the above tests again
- deploy to production (which will have an additional test to ensure it is configured to be 'CurrentProduction' - see the sketch below)
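That additional production check could be as small as this - a sketch that assumes the default version is exposed via the DEFAULT_VERSION environment variable from the earlier example.

```typescript
import assert from "node:assert";

// Run against the production environment only: whatever else is deployed
// alongside it, the default experience must remain CurrentProduction
// until the release is deliberately flipped in config.
function testProductionDefault(): void {
  assert.strictEqual(
    process.env.DEFAULT_VERSION ?? "CurrentProduction",
    "CurrentProduction"
  );
}

testProductionDefault();
```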
This way we can be constantly deploying code into production for NextVersion, having all of its tests run over and over, whilst being guaranteed no regressions and confident that current production still behaves the same.
When we want to release NextVersion, rather than it being a new deployment, it is a config change. This also means that if there are any problems, instead of needing to back out a deployment - a tremendously difficult operation when multiple systems have state (e.g. database schemas plus seed data) that all needs to stay in sync - we know we can do it with just a config change. We know this because we have already been testing this process throughout the development cycle - in fact, we depended upon it.
Once we are satisfied that NextVersion works, we can go through the codebase and delete the code paths for CurrentProduction, rename NextVersion to CurrentProduction, and start all over again.
What if we want to do this based upon types of user?
Not only do you do the above - it is still required, because we ultimately want to move versioning into the codebase rather than the deployment process - but we also introduce switches based around types of user.
So instead of NextVersion, perhaps we introduce a progression: Development, Alpha, Beta, EarlyAdopter, Production.
Instead of this being a global value, we attach it to a user account.
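A minimal sketch of per-user resolution; the channel names come from the progression above, whilst User, sees and the example values are illustrative assumptions.

```typescript
type Channel = "Development" | "Alpha" | "Beta" | "EarlyAdopter" | "Production";

interface User {
  id: string;
  channel: Channel; // stored on the account, not in global config
}

// Ordered from most experimental to most stable. A user early in the
// progression sees everything later users see, plus the newer behavior.
const progression: Channel[] = [
  "Development",
  "Alpha",
  "Beta",
  "EarlyAdopter",
  "Production",
];

// A piece of behavior declares the earliest channel it is visible from;
// a user sees it if their channel sits at or before that point.
function sees(user: User, visibleFrom: Channel): boolean {
  return progression.indexOf(user.channel) <= progression.indexOf(visibleFrom);
}

// E.g. a beta user sees beta behavior but not alpha behavior.
const betaUser: User = { id: "u1", channel: "Beta" };
console.log(sees(betaUser, "Beta"));  // true
console.log(sees(betaUser, "Alpha")); // false
```

The design choice is that a user's position in the progression - not a pile of per-feature flags - decides what they see, which keeps the permutations from exploding.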
The process above is largely sufficient - there are some details but this is not a design document!
How do we work around this?
The costs of not doing this are mind-boggling - not only from a risk perspective, but also financially.
When we insist on coupling behavior to deployments of a specific code version, we run into all kinds of relatable problems.
We start to introduce new environments, deploying a new thing with a new configuration for each kind of test that we want. We are forced to run these kinds of diagnostics on systems that are not our live production system, because each requires the system to behave differently in some way. This means that all of our information or load in those systems has to be in some way synthetic.
Each new environment represents a form of testing that could not be done earlier - this moves the ability to validate that code works further and further away from the time when the code was written. This turns work that could have been fixed in minutes at the time into a process where:
- QA have to work to establish the requirements (often they have to define the tests - perversely, long after the code was written). Often these tests end up being manual, because the code was not written in a way that enables them (because they were not defined).
- When these tests fail (which we should expect, because the code was not written to make them pass), the QA cannot simply tell the developer to fix it, as the developer is now on the next piece of work. That would cause context-switching and add delays to an already tight delivery process - despite this being work that has never been released, is not working, and should never have been considered done (i.e. a toxic definition of done). So the QA has to create a 'defect' and raise it for triage - because it is never clear which team owns it (because most of the time team structures are in layers rather than organized by value stream). The QA should be testing from the outside; often in this situation we end up with QA testing granular sub-components, which is obviously far more time consuming and no longer represents the user experience. We end up having dialogue about which scenarios are real for a service buried three layers down that is only used for one thing.
- A triage team takes the defect, establishes the responsible team(s), categorises the work (essentially fix / don't fix), elaborates it and puts it into a backlog.
- A team picks it up, allegedly fixes it and moves it to done - and we go back to the first step.
This is true for every single environment on the way to production.
By implication we can see that more environments are a very bad thing, because all of these feedback loops create a highly multiplicative system (the sketch below illustrates the multiplication). A system like this will mathematically produce gross lead times of at least 2-20x the estimates, half the time. This is why you are late. The introduction of an environment between development and production therefore merits very careful consideration, as it is a primary contributor to gross lead time - so if your default is to assume we need dev, qa, int, ext, perf, etc., then I am forced to ask whether your design goal is to spend money or to produce things that work as quickly as possible.
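As a back-of-the-envelope illustration only (the rework probabilities are invented, not measured): if an environment bounces work back with probability p, the expected number of passes through it is 1 / (1 - p), and a chain of environments multiplies.

```typescript
// Invented figures for the sake of the arithmetic.
const environments = [
  { name: "qa", reworkProbability: 0.5 },
  { name: "int", reworkProbability: 0.3 },
  { name: "perf", reworkProbability: 0.2 },
];

// Expected passes through one environment: 1 / (1 - p).
// The chain multiplies: each environment scales the whole lead time.
const multiplier = environments.reduce(
  (product, env) => product * (1 / (1 - env.reworkProbability)),
  1
);

// ~3.6x the single-pass estimate before production is reached - and that is
// before counting triage, backlog and context-switching delays.
console.log(multiplier.toFixed(1));
```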
Introducing an environment is a bad thing that sometimes adds value - it should not be the default. It is the legacy of people who think manual testing is appropriate and not a harmful activity. I can think of very few exceptions to the statement 'all testing should be automated', and those exceptions come from selecting vendors that prohibit it (which is a whole other conversation about fitness-for-purpose matrices).
Conclusion
I have not gone into how I see this affecting delivery, team structures, product mentality or service designs, nor the dramatic effect it has on the perception of responsibility for different roles.
When I go into teams that have 'code quality' problems and a big red button for release, this is what I am often aiming to address. However, it may take a fair few interventions to clear the way and make it clear that this is the real causative problem.
The real issue with establishing causation is that there are often many other reasons why a certain behavior exists - but often there are a few governing constraints that enable that behavior. I will probably never prove causation - I don't think that's very useful anyway.
When we address the constraints we can allow the system to solve itself in a more healthy way.
Coupling the behaviour that a user experiences to the act of signing off on a code deployment is reckless - I understand the goal, but there are far better ways to achieve it, with many other benefits to reap on the way.
The cost of doing this is that complexity which already exists hidden in the system (causing all kinds of other strange behaviors) becomes explicit, and so becomes another thing for engineers to manage.
As an engineer it is far easier for me to manage this complexity explicitly in code, because I have a vast array of tools I can bring to bear on the problem - tools that cannot exist if we handle it in source control.
This is why I favour mono-repos and trunk-based development - I want everything integrated as soon as possible, as simply as possible, because then I can discover as soon as possible when things have gone wrong. Delaying surprise creates an amazing amount of process.
If you feel that everyone is busy and velocity is high but nobody has delivered anything, or that throughput is low, it is very likely you are in a similar family of problems to the ones I care deeply about. Message me - I am always curious and happy to talk. It is far better to learn from others' experiences than just your own, and in this remote world we need to make the extra effort to make these connections.