Retrospective on a Software Rewrite
In 2015, Man Numeric began the process of re-platforming and migrating its business from SAS to Python. This meant an overhaul and re-write of the quantitative modelling, portfolio optimisation, and research platforms, completely replacing an ecosystem that had developed incrementally over the course of 25 years. The project was one of the largest and most ambitious in the firm's history, and on 31 December 2019, after nearly five years of effort, the final vestiges of the legacy system were decommissioned.
This post will reflect on that transition effort, discussing the rationale for the initiative and attempting an unbiased (if admittedly ex-post) assessment of the costs, risks, and benefits. Further, it will highlight some of the strategies used by the team to make an enormous project a bit more tractable, to maintain momentum, and most importantly, to mitigate risk.
The decision to migrate Man Numeric away from the SAS programming language and onto Python was not taken lightly. Completely re-platforming a company is a costly and risky undertaking. Joel Spolsky, co-founder of Stack Overflow, summarises the conventional wisdom about a full, from-scratch re-write in the title of his classic essay, "Things You Should Never Do", and the literature since then tends to agree with him.
Furthermore, the SAS platform had facilitated the firm's emergence as an early leader in Quantitative Equity investment. The platform had decades of flight hours, during which it accrued an invaluable amount of nuanced information about the data and markets it traded. It had weathered the Quant Crisis of 2007 and the GFC shortly afterwards, and by 2015 had yielded several consecutive years of historically positive performance.
However, over the course of more than 20 years, the SAS platform had accrued a great deal of complexity. Changes required significant resources. In an increasingly competitive industry, it was becoming harder and harder to stay nimble and bring innovation quickly to market.
In addition to the desire for greater agility, the 2014 acquisition of Numeric Investors by Man Group also played an important role in the direction and resourcing of the re-platforming effort. Though the decision to begin re-platforming had already been made, the acquisition brought Numeric into collaboration with Man AHL, which had recently accomplished a Python conversion of its own. This new partnership provided crucial support in terms of infrastructure and prior art.
Man Numeric briefly explored a number of options before settling on Python. Joining forces with Man Group of course made the decision easier, but other factors were important to consider as well.
Python was quickly becoming the de facto language for data science, machine learning and natural language processing; it would unlock new sources of innovation. Python would allow us to engage with its sizeable open source community, bringing state-of-the-art technology in-house quickly, while allowing for customisation. In a competitive recruiting market, Python would also be much more attractive to top talent: it was ranked as the most wanted skill by developers and the third most loved language in the Stack Overflow Developer Survey. Furthermore, onboarding could be accelerated, as new hires would come equipped with the skills necessary to be productive on day one. Finally, a Python platform would allow our researchers, data scientists, and technologists to speak the same language, and find support and advice from a vast online community of other Python programmers.
All of these benefits serve the same underlying purpose, however: to reduce the cost of innovation.
One drawback worth noting for those preparing to embark on a re-platforming is that, while it is open source, Python in a production environment is certainly not “free.” Where the SAS ecosystem is tidy, complete, and batteries-included, Python gives you both the freedom and the obligation to build your own ecosystem. Things like package management and data persistence were solved problems in the SAS ecosystem, but required investment to get right in Python.
Risks, and Risk Mitigation
The decision to convert to Python had strong business support and solid rationale, but success was far from inevitable at the outset. In any business, the risk of a project of this nature can be high. When billions of dollars of our clients' money are at stake and things go wrong, the consequences range from the merely devastating (AXA Rosenberg) to the completely catastrophic (Knight Capital).
Aside from the splashy headline failures, there is a subtler brand of failure experienced by countless technology companies: timeline failures. That is, they simply fade into obsolescence, focusing all of their efforts on a code re-write and failing to deliver the new features that their clients demand.
Without a doubt, re-platforming a company is a risky proposition. In fact, it is likely far riskier than a literature review would have you believe, because essays like this one undoubtedly suffer from survivorship bias.
Mitigating the risk of headline failure, loss of momentum, or a slow fade into obsolescence was of utmost importance, and the strategies that Man Numeric used to do so, including lessons learned, are sufficiently general that they are worth sharing.
1. Don't Put the Business On Hold
Determining resource allocation to new business and innovation versus re-platforming is an exercise in deciding between two different types of debt. Focusing 100% of resources on the code re-write may deliver it more quickly, but it may also result in insurmountable business debt. In other words, if the company fails to deliver on existing commitments to clients, or to bring new products to market in a competitive time-frame, then it may find itself running an unwinnable race to catch up with the competition once the re-platforming is complete.
On the other hand, by investing all resources in delivering new products, you will take on technical debt: piling up material to be re-written faster than you can re-write it, leaving legacy software around indefinitely, and never reaping the benefits of re-platforming.
Business debt and technical debt have different risk-reward profiles. Like any form of leverage, neither one is objectively bad (most homeowners didn't pay in cash up front!), but they have to be managed thoughtfully. Do your best to avoid taking on high-interest debt; make small incremental wins; and take advantage of opportunities which benefit both the re-platforming efforts and business growth.
Finally, one decision we made early on was to ring-fence the team responsible for this new platform buildout from other project work. This provided a natural barrier to short term resourcing decisions which may have detracted from our longer-term strategic goals.
2. Identify Your Business-Level Regression Test and Invest in Instrumentation Up Front
Full, 100% parity with the previous system was never the goal of Man Numeric's re-platforming; it would have been impractical, and would have precluded many upgrades that were packaged with the re-write.
In the first few months of the project we built a system to cross-check the outputs of the new platform versus its legacy counterpart, and to flag issues such as a large divergence in correlation or coverage. This system helped keep iteration cycles tight, and it was absolutely necessary, but ultimately not sufficient.
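As an illustration only, a cross-check of this kind might look like the sketch below. It is not Man Numeric's actual tooling: the function names, the `{asset_id: value}` data layout, and the thresholds are all assumptions chosen for the example. It compares one legacy output with its re-platformed counterpart and flags low coverage or low correlation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if undefined)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def cross_check(legacy, candidate, min_corr=0.99, min_coverage=0.98):
    """Compare two {asset_id: value} mappings and report divergences.

    The thresholds are hypothetical defaults, not values from the project.
    """
    common = sorted(set(legacy) & set(candidate))
    # Coverage: share of legacy assets that the new platform also produced.
    coverage = len(common) / len(legacy) if legacy else 0.0
    if len(common) >= 2:
        corr = pearson([legacy[k] for k in common],
                       [candidate[k] for k in common])
    else:
        corr = float("nan")
    issues = []
    if coverage < min_coverage:
        issues.append("coverage")
    if not corr >= min_corr:  # written this way so NaN is also flagged
        issues.append("correlation")
    return {"coverage": coverage, "correlation": corr, "issues": issues}


# Hypothetical example: the new platform is missing one asset.
legacy_signal = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}
python_signal = {"A": 1.0, "B": 2.0, "C": 3.0}
report = cross_check(legacy_signal, python_signal)
print(report["issues"])
```

A check like this measures distance from the legacy output; it says nothing about whether that distance matters to the portfolio.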
Our instrumentation told us how close we were to a full match of the legacy system, but it couldn't tell us the consequences of how far away we were. And while the true acceptance criterion was matching return and risk characteristics (within bounds), our initial efforts to cross-check the new platform resulted in developers focusing on an exact output match, a far more challenging undertaking.
We built the tooling to make the full-scale assessment of historical simulated risk and returns, and the re-platformed signals went through the same due diligence as any new product, with Man Numeric's Investment Committee reviewing and signing off each incremental replacement. However, had we started there, aligning our instrumentation more closely with our acceptance criteria from the outset, we might have saved ourselves a lot of time and effort in the long run.
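To make the distinction between an exact output match and an acceptance test on return and risk concrete, here is a minimal sketch. The summary statistics (mean and sample volatility of periodic returns), the tolerances, and all names are assumptions for illustration; the post does not describe the actual criteria in this detail.

```python
from statistics import mean, stdev


def summarise(returns):
    """Reduce a series of periodic simulated returns to two characteristics."""
    return {"mean": mean(returns), "vol": stdev(returns)}


def within_bounds(legacy_returns, candidate_returns,
                  mean_tol=0.001, vol_tol=0.002):
    """Accept the candidate if its return and risk summary stay within
    hypothetical tolerances of the legacy summary."""
    legacy = summarise(legacy_returns)
    candidate = summarise(candidate_returns)
    diffs = {key: abs(candidate[key] - legacy[key]) for key in legacy}
    passed = diffs["mean"] <= mean_tol and diffs["vol"] <= vol_tol
    return passed, diffs


# Hypothetical simulated returns: the two series differ period by period,
# so an exact-match test fails, yet the characteristics are close.
legacy_returns = [0.010, -0.004, 0.006, 0.002]
python_returns = [0.011, -0.005, 0.006, 0.002]

exact_match = legacy_returns == python_returns
accepted, diffs = within_bounds(legacy_returns, python_returns)
print(exact_match, accepted)
```

In practice the characteristics and bounds would be whatever the business signs off on; the point of the sketch is only that the test is defined on those characteristics rather than on the raw outputs.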
3. Identify the Interfaces of Your System's Components, and Replace Them One by One
One way that we were able to manage risk was by replacing pieces of our system incrementally, holding as many variables constant as possible. It was a deliberate process, and it was slow; almost three years passed between the time when our first Python code began handling client money, and the time that we turned off the last piece of the legacy system. Through that time period, we bore the cost of maintaining two systems side-by-side, but had the compensation of increased confidence in the scope of each change that we made.
Another benefit of this approach was that it helped to make the project manageable. Incremental delivery meant that each release was small, but it was also real and meaningful, which was important for the morale of a team engaged in a multi-year effort. It also created lots of opportunities for parallelisation.
However, when migrating a system component by component, the path of least resistance leads to a system architecture that looks a lot like your legacy system. It was important for us to define boundaries between system components where we thought they ought to go, not where they happened to be already. In some cases this was practical, in other cases it took additional work, but ultimately we were left with a well-architected system that was free from legacy technical decisions.
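The general pattern of defining an interface and swapping implementations behind it one at a time can be sketched as follows. Every class and component name here is hypothetical and greatly simplified; it is meant only to show the shape of the approach, not the firm's architecture.

```python
from typing import Dict, List, Protocol


class SignalSource(Protocol):
    """The agreed interface: anything that maps asset ids to signal values."""

    def compute(self, asset_ids: List[str]) -> Dict[str, float]:
        ...


class LegacySignal:
    """Stand-in for a component still served by the legacy system."""

    def compute(self, asset_ids: List[str]) -> Dict[str, float]:
        return {asset: 1.0 for asset in asset_ids}


class PythonSignal:
    """Stand-in for the re-written component behind the same interface."""

    def compute(self, asset_ids: List[str]) -> Dict[str, float]:
        return {asset: 1.0 for asset in asset_ids}


class Pipeline:
    """Holds named components and lets exactly one be swapped at a time."""

    def __init__(self, components: Dict[str, SignalSource]):
        self.components = dict(components)

    def replace(self, name: str, implementation: SignalSource) -> None:
        if name not in self.components:
            raise KeyError(f"unknown component: {name}")
        self.components[name] = implementation

    def run(self, asset_ids: List[str]) -> Dict[str, Dict[str, float]]:
        return {name: component.compute(asset_ids)
                for name, component in self.components.items()}


pipeline = Pipeline({"value": LegacySignal(), "momentum": LegacySignal()})
before = pipeline.run(["A", "B"])

# Replace a single component; everything else is held constant.
pipeline.replace("value", PythonSignal())
after = pipeline.run(["A", "B"])
print(before == after)
```

Because callers depend only on the interface, each release changes one variable, and the boundary can be drawn where it ought to be rather than where the legacy code happened to put it.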
[Infographic. Source: Man Group.]
Man Numeric's migration to Python was a success. Having completed it, we are measurably reaping the benefits of increased agility and reduced time-to-market. But it was not an easy project; a company with fewer available resources, a less tenacious team, and without such fierce support and participation from the investment teams might not have succeeded. The strategies suggested above were not obvious from the outset, but learned by experience.
Deciding to completely re-platform a mature company is a difficult decision for a business to make. The costs and risks are concrete and immediate, while the benefits are theoretical and have a real probability of never materialising. But if they do materialise, they can be completely transformative.
Readings that informed this post:
"Things You Should Never Do" by Joel Spolsky
"The Myth of the Software Rewrite" by Erik Dietrich
"The best decision we've made was to abandon a complete code rewrite" by Tyler King
"We Decided to Rewrite Our Software. You Won’t Believe What Happened Next!" by Aaron Hardy
"Technical Debt and Tacking Into the Wind" by Ted Unangst
"Lessons from 6 software rewrite stories" by Herb Caudill (which reviews and links to a number of other excellent articles)
"Why Software Rewrites Often Fail – and How 'User Goals' Can Fix Them" by Jonah Bailey
"Stack Overflow Developer Survey" by Stack Overflow