Source Control Log Hygiene - AKA Should You Squash Commits?
Engineers love things to be neat, simple and clear. It's all we do! When we are implementing something new we realize that if we reorganize everything we can make the change in a single place and Blam! we are done. Working in this way is incredibly safe if we have other good engineering practices - such as outside-in TDD driven by meaningful, customer-interaction-based requirements.
This results in almost every single engineer - ever - at some point in their career deciding that squashing commits is a good idea. For me, that phase lasted several years.
What do we get from squashing commits?
- We get a neat one-line summary: a large amount of work, potentially spread over multiple contributors and a significant amount of time, simplified into a single commit describing exactly what it achieves.
This is wonderful!
It is useful in lots of ways.
Delivery people and people who want nice pretty reports love this. The rest of us don't have to look at the snaking history that shows how hard it was to get something finished.
One thing that I don't like about squashing from a reporting perspective is that we lose sight of the complexity of having to merge other branches in. This data could really help shine a light on a large amount of wasted time spent managing multiple branches and the change across them. It is like we are measuring only the good things and hiding the bad parts - the parts we could use to drive meaningful process improvement and eliminate waste.
What is the purpose of source control?
We work on large, complex code bases. We also work on these code bases out of hours supporting them, we work on other people's code, and we work on them for significant periods of time.
Source control is what allows us to make changes and know that when we mess it up we can go back and take a different route. It is not about having a pretty record of what happened. It is a safety net and one of the very few that we have as engineers.
What is the worst that can happen?
It's 4am, you obviously are not hungover or tired, and you get a phone call. Something is going wrong, and this isn't a system outage that is stopping people from using the service - oh no, that would be too easy.
This is an error where you realize that, for an unknown period of time, customer data has been getting corrupted. It gets worse if you work in a business that isn't selling ice cream but is part of something essential - e.g. electricity, health care, government (tax, education, access to resources or even the country), banking, etc. - because now people cannot do the things they need to do, and it is your problem.
You cannot turn the system off because that would cause harm to everyone - and you don't even know if this is a one-off or something far scarier.
This is the worst day of your career - but don't worry, we have source control, the tool that records every single change!
It is lucky that you are a good engineer: you use lots of small commits, you drive change through tests because you need confidence that what you build does exactly what you intend, and you deploy regularly - potentially after every commit. These are the practices that will now save your bacon.
If, on the other hand, you release code days or weeks after it is written and you squash your commits, then you are - to use a technical term - up shit creek, because nobody can reasonably be expected to know what has been corrupting data over time. If you don't employ these good practices, then your employer - quite rightly - will be frustrated at how much time recovering from this is going to take.
How do we find and fix defects?
I have worked on several code bases that run to millions of lines of code. These projects have many lifetimes of work in them, and as such many of the people who built them are no longer working on them. It is reasonable to expect that nobody understands parts of them; so we develop techniques that do not depend on knowledge of a codebase in order to identify the points that matter and fix it.
The worst way to find a defect is to start looking through the code. This is just hard work. You start in a sensible place, look for strange things, and move about. So we optimize: we look for constraints.
Essentially, because you know some piece of data is wrong, you think you know roughly what part of the system must be causing it - so really it's just about narrowing it down. It must be related to something that updates that field. It is a very logical approach.
It is also one of the worst approaches imaginable, and one that only works on very small code bases. But even then there is a better approach - we can use science!
How do we find and fix defects a bit better?
First, we have to reproduce the defect. We do this by writing a test - but what goes into the test? We look at the report of what went wrong, then we go into the logs to look at that user's behavior up until that point.
Of course, you have used an event-sourced architecture, because you knew this was a situation that was going to occur and so you designed for it. This means that you can take the events behind your system state, build a test that simply replays them, and see the problem. It's very easy. After all, if you are in one of the areas highlighted above, you understand the massive effect we have on people's lives.
For the rest of us, who didn't design a system from the ground up to be changed and operated in a way that solves these problems, it means log diving across multiple systems to piece it all back together. It will hurt - but it is still the fastest way to understand the sequence of behaviors that led to the problematic state.
One approach is to open the box and see; the other involves multiple boxes and analysis, with the nagging doubt of 'did I miss something?'. The difference here is minutes versus hours or potentially days - and that is a critical performance characteristic.
There is a lesson here - we care about events, the behavior someone has gone through, the list of steps, because it is from these that we can derive state at any point. But we cannot look at state and understand how we got there; all that context has been stripped. We will come back to this shortly, and bluntly.
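To make that asymmetry concrete, here is a toy sketch - the `Account` aggregate, the event names and the `apply()` function are all invented for illustration, not taken from any real system - of state being nothing more than a fold over recorded events:

```python
# Toy sketch only - the aggregate, event names and apply() are invented for illustration.
from dataclasses import dataclass

@dataclass
class Account:
    balance: int = 0

def apply(state: Account, event: dict) -> Account:
    """Fold one recorded event into the current state."""
    if event["type"] == "deposited":
        state.balance += event["amount"]
    elif event["type"] == "withdrawn":
        state.balance -= event["amount"]
    return state

def replay(events: list[dict]) -> Account:
    """State at any point in time is just the fold of every event up to that point."""
    state = Account()
    for event in events:
        state = apply(state, event)
    return state

events = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 130},  # the behavior that leads to the bad state
]
print(replay(events).balance)  # -30: the state is rebuilt from the behavior alone,
                               # but the bare -30 can never tell us how we got there
```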
So we take the log of user events, and then we construct a test repeating those events and pray to the old gods that we observe the bug. When we cannot reproduce it, we learn why we need to design systems that are compartmentalized and control their own state. This is low cohesion allowing change from unexpected areas to spread into ours. We have designed a system that fundamentally cannot answer an incredibly urgent question. This is an architectural, product and delivery failure. It is tragic. When this occurs we have to widen the net, looking for other things that interact to affect state - I usually start with things that pass large objects around into update functions. Both large objects being passed about and methods called 'Update' are, in my opinion, highly suspicious code smells for this reason.
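Assuming we can reproduce it, the test itself is tiny. A rough sketch, leaning on the `replay()` helper above - the exported file name and the invariant are placeholders for whatever the incident actually shows is wrong:

```python
# Sketch of the reproduction test - file name and invariant are placeholders.
import json

def test_replayed_customer_state_is_not_corrupted():
    with open("affected_customer_events.json") as f:  # hypothetical export from the logs
        events = json.load(f)
    state = replay(events)                            # replay() as sketched earlier
    assert state.balance >= 0                         # the invariant the corruption violates
```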
So, we build a test, and we can reproduce the bug - this is the same as knowing what went wrong, but we still have no idea what is causing it. We might have an idea - but I can do all of the above with absolutely zero knowledge of a codebase. I actually don't care about the cause at this point; I don't need to, as it will show itself to me.
Now what?
Binary Chop
Did you know that, given a log of events that describes the behavior at every point, I can also identify the exact event that causes our test to fail?
Yes, of course you do! That’s because that is what our test reproduces. You just did that.
So if we run the test on HEAD we will see it fail. If it is a bug, then when we run it against an earlier commit it should pass. This doesn't require thought: we just pick one month ago and run the test, and if it doesn't pass we go back a few more months until it does.
Of course, we carry the assumption that we have competent people and that it behaved correctly at some point.
Once we have a commit where the test passes and a HEAD where it fails, we can do a binary chop. We take a point - it doesn't matter exactly where, but say halfway - and we run the test against it. If the error exists there, we know the change that introduced it is in the earlier portion. If it doesn't exist, it is in the later portion.
We repeat. This is such an important process that it is built into many source control systems.
Eventually we will find the exact commit that has the error in it. This is an incredibly fast process - one that requires zero brain power - and so I can do it at 4am and under immense pressure.
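Git has this built in as `git bisect` (and `git bisect run` will even drive the test for you), but the whole process is small enough to sketch. Assuming the reproduction test from earlier lives at a hypothetical `tests/test_replay.py`, and that we have the candidate commits listed oldest first:

```python
# Sketch of the chop that `git bisect run pytest tests/test_replay.py` automates -
# the test path and the commit list are placeholders.
import subprocess

def test_passes(commit: str) -> bool:
    """Check out a commit and run the reproduction test against it."""
    subprocess.run(["git", "checkout", "--quiet", commit], check=True)
    return subprocess.run(["pytest", "tests/test_replay.py"]).returncode == 0

def first_bad_commit(commits: list[str]) -> str:
    """commits runs oldest -> newest; commits[0] passes the test, commits[-1] fails it."""
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if test_passes(commits[mid]):
            good = mid   # the bug arrived after this point
        else:
            bad = mid    # the bug already exists here, look earlier
    return commits[bad]  # the first commit at which the test fails
```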
Because our commits are small and atomic and represent work done at the level of a single test (with some refactors), we not only find the error but very likely also the exact line and the test that was written for that line of code. That is, of course, if you practice TDD, practice small commits all the way into production and, of course, DON'T SQUASH YOUR COMMITS.
So this is why, despite all your reasons to squash commits, I don't care: I am responsible for the health of the system and for ensuring that people don't make it very difficult to heal.
Yes, all your arguments may be true, but being able to fix defects trumps them all. There is never going to be a reason why losing the ability to do this is worth it - I welcome challenge on this point.
Observations
Note that I still do not need to know anything about the code base; all I need is to be able to interact with it like a customer - or potentially make some REST calls. I can do all of this from the outside without looking at a single line of code.
This is an example of how we use tools in specific ways so that when things get really hard we have very simple processes to get ourselves out of trouble in very safe ways.
The thing is, it is also a process you will probably only need 2-3 times in your life - hopefully never. But much like health devices or smoke alarms, which you maintain despite hopefully never needing them, when the moment comes they will save your life.
Trust me, we need to be architecting and working in ways that are designed to protect us from the organizations we work for, because when everything goes wrong it will be you, at 4am, with the kind of problem that makes your hair go white.
If you combine this situation - which is already stressful - with some terrible management practices, we can genuinely break people.
So, you may have many good reasons for squashing. They are all about pretty, shiny things such as reporting, which can be achieved very effectively by using the right tools in the right places in different ways. As engineers, we have a duty to maintain functional systems that we can change responsibly. Deleting information about how those systems have changed is destroying information that will be essential in worst-case situations.
I write this because I am tired of this debate and just want something to point at.
Lowering the cohesion of commits is not a good thing. We know we want high cohesion and low coupling.