Monorepoly

Apr 14, 2023

Go to update all your repositories. Do not pass GO, do not collect $200.

Any complex software system is structured in interdependent modules. With such a composition, the question often arises: should all modules be in one single repository, or should they each live in their own? Discussions about the topic often get blurred by conflating tools and technologies with repository structure. In this post, I will take a top-down approach from the flow of software development and look at how integration strategies are affected by repository layout. The conclusion is that the more continuous you’d like your testing to be, the more important a single repository becomes.

Software is built out of modules

Software engineering is all about abstraction. As a system grows, it needs to be split into multiple modules that interact with each other through well-defined interfaces if it is to stay maintainable. As diverse as software is, this observation applies to pretty much anything, from kernels to webapps. Depending on the context, a module can be a library, a service, an executable, an interface specification, or anything else that is based on some form of source code. Let’s take as an example a service S1 that calls a service S2 and uses a library L1.

graph TD
  S1 -->|depends on| S2
  S1 -->|depends on| L1

A system composed of three modules.

Modules change

Software is also not a one-off thing; it changes over time. A software project is started to solve some problem. A solution is designed, written (“coded”) and tested. Once that is done, any non-trivial piece of software will also need to be maintained: dependencies may be updated, bugs need to be fixed, and crucially, new requirements or problems may be identified, causing the cycle to start again. Every cycle builds on the last, and you naturally want to adapt the current solution and reuse as much of the existing software as possible in the new cycle.

In terms of modules, this means that they also evolve over time. There will be changes within modules as well as across modules.

graph TD
  S1 --> S2
  S1 --> L1
  S2 --> L2
  L1 --> L2

Example of a change across modules: a part of L1 has been extracted into a new module L2. This new module is also used by S2.

Testing changes

When you make a change to a module, you want to make sure that the system as a whole still works. The correct solution is to never change the interface of a module in a breaking way (i.e. don’t break the API), and only ever introduce backwards compatible changes. This way, you can assume that dependents of your module will be unaffected.

While never-breaking changes are the ideal, they are not always reality, either because of mistakes or because breaking changes are sometimes easier. For example, while it is reasonable to expect widely-used open source projects to never break their interfaces (since you can’t even know who else uses your module), in organizations that control the whole system, agreeing upon and making a breaking change is usually more efficient than maintaining multiple versions of modules.¹

In any case, regardless of whether the change was known to be breaking or not, there comes a time when you want assurance that your whole system still works. This assurance means testing the behavior of the system as a whole across all modules.

The cadence of running these tests can be characterized as either:

Discrete: test the system at some arbitrary point, after multiple changes to modules, when something is considered to be ready for release.
Continuous: test the system whenever any change to any module is made.

The first case is a classic approach that comes from software that is released in discrete intervals, such as traditional “installable” applications. The risk of this approach is that breaking changes slip through while multiple modules change, and then require a lot of work to fix before the release. One of the worst things that can happen is an incompatibility between multiple dependent modules (triangular dependency).

graph TD
  S1 --> S2
  L2_1["L2<sub>1</sub>"]
  L2_2["L2<sub>2</sub>"]
  S1 --> L1
  L1 --> L2_2
  S2 --> L2_1

L2 has been changed to fix something in L1. S2 still uses the old version of L2, so if things are tested in isolation all is well. The problem comes when the system is tested as a whole and a single version of L2 must be chosen.
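To make the conflict concrete, here is a minimal sketch of how it could be detected when the system is assembled. The version pins and the `find_conflicts` helper are made up for illustration; they are not taken from any particular package manager.

```python
# Hypothetical pins: the version of each dependency a module was tested against.
REQUIRES = {
    "S1": {"S2": 1, "L1": 1},
    "S2": {"L2": 1},  # still on the old L2
    "L1": {"L2": 2},  # needs the fixed L2
}


def find_conflicts(requires):
    """Return {dependency: {version: [dependents]}} for every dependency
    that is pinned at more than one version across the system."""
    seen = {}
    for module, deps in requires.items():
        for dep, version in deps.items():
            seen.setdefault(dep, {}).setdefault(version, []).append(module)
    return {dep: versions for dep, versions in seen.items() if len(versions) > 1}


conflicts = find_conflicts(REQUIRES)
print(conflicts)  # {'L2': {1: ['S2'], 2: ['L1']}}
```

Each module passes its own tests against its own pin; the conflict only shows up once all pins are looked at together.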

Hence, in organizations that choose this discrete testing approach, there will often be a “freeze” period after changes, during which everyone in the organization is forbidden to make changes to modules while tests are run and incompatibilities fixed. If the test cadence is low (many changes per test run), there is significant potential for many issues and a very lengthy fixing period.

The second case is known as continuous delivery (CD). The idea behind this approach is to weed out compatibility issues immediately and avoid spending time on larger problems later. In a sense, it’s a bit like cleaning your house: if you do a little bit every day, it’s less work than doing it only once everything is filthy. Overall, it leads to faster delivery times and fewer issues, and is especially effective at preventing hard-to-fix triangular dependency problems like the one illustrated above.

Applying continuous delivery, however, also carries some challenges. Naively, it does not scale. As a system grows, you cannot expect a manual testing process to be run for every change. At some point even an automated test suite will need to be made smarter than simply testing the whole system on any change. But this is a whole story of its own, and isn’t related to repository structure, so let’s get back to that.

How does the testing strategy relate to repository layout?

Let’s first define what is meant by a repository and by the two kinds of layouts:

A repository is a group of source files that all exist in one place, and changes to them can be submitted, reviewed and accepted or rejected in one go. The technology used to manage such a repository is not important for this definition. It could be git, but we could also have no version control system at all and simply deal with patch files for one common root folder!

If all modules of a software system are contained within one repository, that is known as a monorepo layout. If each module is contained in its own repository, that is known as a polyrepo layout.

If you decide to test things discretely at release time, then the choice between a monorepo and a polyrepo doesn’t have much of an effect. You have deferred the testing of the whole system to some later point in time, and individual modules can evolve independently.

If, however, you strive for continuous integration, then by definition you want to test the whole system with all changes to all modules at once. In this situation the choice of repository layout is influenced by the types of changes you want to deal with.

If there’s a purportedly backwards-compatible change to a module, then the choice between monorepo and polyrepo is still not that important, but a monorepo has the advantage. You’ll need to update all modules that use the changed one, either manually or via tooling, and run the tests. Since in a monorepo all modules are defined in one place, it is easier to discover and update dependents.
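As an illustration of what “discovering dependents” could look like when every module’s dependencies are declared in one place, here is a small sketch. The `DEPENDS_ON` table mirrors the running example; in a real monorepo it would be derived from build files, which I’m assuming rather than showing.

```python
from collections import deque

# Dependency edges from the running example: module -> modules it depends on.
DEPENDS_ON = {
    "S1": ["S2", "L1"],
    "S2": ["L2"],
    "L1": ["L2"],
    "L2": [],
}


def dependents_of(changed, depends_on):
    """Return every module that directly or transitively depends on `changed`."""
    # Invert the edges so we can walk from a module to the modules using it.
    reverse = {module: [] for module in depends_on}
    for module, deps in depends_on.items():
        for dep in deps:
            reverse[dep].append(module)

    found, queue = set(), deque([changed])
    while queue:
        for dependent in reverse[queue.popleft()]:
            if dependent not in found:
                found.add(dependent)
                queue.append(dependent)
    return found


print(sorted(dependents_of("L2", DEPENDS_ON)))  # ['L1', 'S1', 'S2']
```

In a polyrepo the same walk is possible, but the edge list first has to be collected from every repository.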

If a change is not backwards compatible, however, and hence requires multiple modules to be updated at the same time, then a monorepo is the only approach. In this situation, you have no choice but to test multiple changes at once, since individually they break the system. Maybe some tooling can help with queueing and testing changes across multiple repositories at once, but then you have just reinvented a version control system and are effectively still working with a single repository! Unless you are in the business of building version control systems, this is usually not a good idea. You may argue that we should never introduce breaking changes to modules, but as mentioned earlier, this is sometimes necessary. Cross-cutting restructuring of modules (e.g. merging two services into one) is also only possible without breaking things in a monorepo.
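A toy model may help show why the two halves of a breaking change cannot be tested separately. Everything here is hypothetical: `system_ok` is a stand-in for a whole-system test, and `parse` / `parse_v2` are invented function names for a rename in L1.

```python
def system_ok(exports, calls):
    """Stand-in for a whole-system test: every symbol a module calls
    must be exported by the module it calls into."""
    return all(symbol in exports[target] for _caller, target, symbol in calls)


# Current state: S1 calls L1.parse, and L1 exports it.
exports = {"L1": {"parse"}}
calls = [("S1", "L1", "parse")]

# The breaking change, split into the two halves that separate
# repositories would have to land one after the other.
new_exports = {"L1": {"parse_v2"}}      # L1 renames its function
new_calls = [("S1", "L1", "parse_v2")]  # S1 adopts the new name

print(system_ok(exports, calls))          # True:  before the change
print(system_ok(new_exports, calls))      # False: only L1 changed
print(system_ok(exports, new_calls))      # False: only S1 changed
print(system_ok(new_exports, new_calls))  # True:  both changed together
```

Either ordering of the two halves passes through a broken state; only the combined change keeps the system green at every step.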

Conclusion

Purely from a software delivery perspective, without considering specific technologies, a monorepo is preferable over a polyrepo layout. In a system with traditional discrete releases, it has no real disadvantages compared to the latter, and in a system with continuous integration, it enables certain workflows that are not possible in a polyrepo layout.

If you want the evolution of your software to be fast and would like to minimize accidental complexity, then you also need your processes to get out of the way, and a monorepo eliminates one of those.

You may also be interested in some other discussions on the topic.

Appendix: frequent arguments against monorepos

Here are some common arguments that I’ve heard against monorepos. These arguments are all about processes and modularity, not about the repository layout itself. I’ve covered some of them in the article, but I’ll list them here with a more specific rebuttal as well.

Build times get out of control

Argument: building and running all tests of the whole system on any change to any module does not scale. As the software system grows, tests on changes to modules would take longer and longer. This is particularly annoying if you have automated tests as part of your review process, since you must now wait for tests of totally unrelated modules to pass before your change can be accepted.

Rebuttal: this is a valid argument, but related to how continuous delivery is done and not the repo layout itself. You would have the same situation in a polyrepo setup, if you tested everything all the time.

An immediate workaround is to build and test only changes to individual modules. It’s as simple as in a polyrepo setup: for example, you can simply limit tests to modules in directories that have changed. Of course you’re no longer doing continuous integration then, but it’s not worse than the polyrepo setup.
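As a rough sketch of that workaround, assuming each module lives in its own top-level directory (an assumption of mine, not a requirement), the changed paths reported by the version control system can be mapped to modules like this. With git, such a list could for example come from `git diff --name-only`.

```python
from pathlib import PurePosixPath


def affected_modules(changed_paths):
    """Map changed file paths to the top-level module directory containing them."""
    modules = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        if len(parts) > 1:  # ignore files at the repository root
            modules.add(parts[0])
    return sorted(modules)


# Hypothetical list of changed files in one change.
changed = ["S1/handler.py", "L1/util/strings.py", "L1/README.md", "README.md"]
print(affected_modules(changed))  # ['L1', 'S1']
```

Only the tests of the listed modules would then be run, which matches what a polyrepo setup gives you by default.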

A finer solution which takes advantage of a monorepo is to split building and testing into multiple layers and run them at different times.

Cache intermediate build results so that your build tool only needs to recompile differences and not the whole system on every change.

Regular build tools, even the venerable make, scale to pretty big numbers (see Linux) with hot caches. You can also use a hermetic build system such as Bazel, which additionally allows you to share build caches with developers’ machines (if that becomes your bottleneck).
Run only a subset of tests on every change.
Run larger integration tests on a regular (e.g. nightly) basis.
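The caching idea in the first point can be sketched as a cache key that depends only on a module’s own sources and on the keys of its dependencies. This is a simplified illustration of content-based caching in general, not a description of how make or Bazel compute their keys; the file names and contents are invented.

```python
import hashlib


def cache_key(module, sources, dep_keys):
    """Hash a module's name and sources together with its dependencies' keys.

    `sources` maps file names to file contents (bytes); `dep_keys` is the
    list of cache keys of the modules this one depends on.
    """
    digest = hashlib.sha256(module.encode())
    for name in sorted(sources):
        digest.update(name.encode())
        digest.update(sources[name])
    for key in sorted(dep_keys):
        digest.update(key.encode())
    return digest.hexdigest()


# L1 depends on L2; S2's sources and dependencies are untouched in this example.
l2 = cache_key("L2", {"l2.py": b"def f(): return 1\n"}, [])
l1 = cache_key("L1", {"l1.py": b"import l2\n"}, [l2])

# After a change to L2, L2's key changes and so does L1's, because L1's key
# includes L2's. A module that does not depend on L2 keeps its key.
l2_changed = cache_key("L2", {"l2.py": b"def f(): return 2\n"}, [])
l1_after = cache_key("L1", {"l1.py": b"import l2\n"}, [l2_changed])

print(l2 != l2_changed, l1 != l1_after)  # True True
```

A build or test result stored under such a key can be reused for as long as the key stays the same, so a change only invalidates the changed module and its dependents.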

Too much information in one place

Argument: there is too much code in one place to keep an overview, or new developers will have a hard time getting oriented.

Rebuttal: this is again not related to repositories, but rather how your modules are structured and documented. You would have exactly the same situation if you had multiple repositories that a developer needed to work on, with the added difficulty of finding them.

Suggestion: a hierarchical approach with team-owned directories for modules.

Teams are no longer independent

Argument: teams must follow the same process and reviews become slow since they cannot operate independently anymore.

Rebuttal: teams can still own modules, regardless of which repository they live in. There are tools such as CODEOWNERS which can help with enforcing this.

One thing that is different, however, is that everyone is immediately made aware of their blast radius: with a monorepo and continuous integration, it becomes immediately clear how a change to a module owned by one team breaks another team’s module. So while teams can remain independent, they do get increased scrutiny. I would argue, however, that this is a good thing for a healthy organization, and that it forces stronger collaboration across teams.

You may also be interested in Conway’s Law.

When is a polyrepo setup recommended?

If you have a truly independent software system which will be used by other modules outside of your organization, it can make sense to develop it completely independently in its own repository. Be aware of breaking changes however!

It’s also possible that you may want different access permissions to some part of a repository. In this case, using different repositories may be the simplest solution, since fine-grained access permissions are not available in all commonly used version control systems.

¹ Note that you always need to be aware of the exact situation, and breaking changes can have unforeseen consequences, even if you update all other dependents of your module after a breaking change. In particular, in a distributed system, API changes between services always need to be rolled out incrementally unless you accept some downtime.
