Accelerating code migrations with AI

As Google’s codebase and its products evolve, assumptions made in the past (sometimes over a decade ago) no longer hold. For example, Google Ads has dozens of numerical unique “ID” types used as handles (for users, merchants, campaigns, etc.), and these IDs were originally defined as 32-bit integers. But with the current growth in the number of IDs, they are on track to overflow the 32-bit capacity much sooner than originally anticipated.
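To make the overflow concrete, here is a minimal Python sketch that emulates storing an ID in a signed 32-bit field versus a 64-bit one. The helper names (`to_int32`, `to_int64`) are illustrative assumptions and do not come from the Ads codebase. The wraparound shown matches what Java's `int` does silently; in C++, signed overflow is undefined behavior, which is arguably worse.

```python
import ctypes

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def to_int32(value: int) -> int:
    """Emulate storing `value` in a signed 32-bit field (wraps on overflow)."""
    return ctypes.c_int32(value).value


def to_int64(value: int) -> int:
    """Emulate storing `value` in a signed 64-bit field."""
    return ctypes.c_int64(value).value


next_id = INT32_MAX + 1  # the first ID that no longer fits in 32 bits

print(to_int32(next_id))  # wraps around to a negative number
print(to_int64(next_id))  # preserved unchanged
```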

This realization led to a significant effort to port these IDs to 64-bit integers. The project is difficult for multiple reasons:

There are tens of thousands of locations across thousands of files where these IDs are used.
Tracking the changes across all the involved teams would be very difficult if each team were to handle the migration of its own data.
The IDs are often defined as generic numbers (int32_t in C++ or Integer in Java) and are not of a unique, easily searchable type, which makes the process of finding them through static tooling non-trivial.
Changes in the class interfaces need to be taken into account across multiple files.
Tests need to be updated to verify that the 64-bit IDs are handled correctly.
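The difficulty with generically typed IDs can be illustrated with a small sketch. The C++ snippet, the variable names, and the naming heuristic below are all hypothetical; the point is only that a textual search for `int32_t` over-matches, while a name-based filter can miss real IDs.

```python
import re

# Hypothetical C++ snippet; every identifier here is made up for illustration.
SOURCE = """\
int32_t campaign_id = request.campaign_id();
int32_t retry_count = 3;
int32_t owner = LookupOwnerId(campaign_id);
std::vector<int32_t> merchant_ids;
"""

# Matches `int32_t name` and `<int32_t> name`, capturing the declared name.
DECL = re.compile(r"\bint32_t\b[>\s]+(\w+)")


def int32_names(source: str) -> list:
    """Return every name declared with a raw int32_t type."""
    return DECL.findall(source)


def looks_like_id(name: str) -> bool:
    """Naive heuristic: the name ends in `id` or `ids` as a whole word part."""
    return re.search(r"(^|_)ids?$", name) is not None


all_names = int32_names(SOURCE)
by_name = [n for n in all_names if looks_like_id(n)]
print(all_names)  # includes retry_count, which is not an ID
print(by_name)    # misses owner, which does hold an ID
```

Neither list is the right answer on its own: the type search includes a plain counter, and the name filter drops `owner`. Deciding which locations really carry an ID requires following how values flow through the code, which is why simple static tooling falls short here.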

The full effort, if done manually, was expected to require many software engineering years.

To accelerate the work, we employed our AI migration tooling and devised the following workflow:

An expert engineer identifies the ID they want to migrate and, using a combination of Code Search, Kythe, and custom scripts, identifies a (relatively tight) superset of files and locations to migrate.
The migration toolkit runs autonomously and produces verified changes that only contain code that passes unit tests. Some tests are themselves updated to reflect the new reality.
The engineer quickly checks the change and potentially updates files where the model failed or made a mistake. The changes are then sharded and sent to multiple reviewers who own the part of the codebase affected by the change.
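The last two steps of this workflow (keeping only changes that pass tests, then sharding them by owner for review) can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual migration toolkit: `ProposedEdit`, `keep_verified`, `shard_by_owner`, the file paths, and the team names are all hypothetical, and the `tests_pass` callback stands in for running real unit tests.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class ProposedEdit:
    path: str      # file the model wants to change
    owner: str     # team that reviews this part of the codebase
    new_text: str  # model-generated replacement content


def keep_verified(
    edits: Iterable[ProposedEdit],
    tests_pass: Callable[[ProposedEdit], bool],
) -> List[ProposedEdit]:
    """Keep only the edits for which the test check succeeds."""
    return [edit for edit in edits if tests_pass(edit)]


def shard_by_owner(edits: Iterable[ProposedEdit]) -> Dict[str, List[ProposedEdit]]:
    """Group edits so each owning team reviews only its own files."""
    shards = defaultdict(list)
    for edit in edits:
        shards[edit.owner].append(edit)
    return dict(shards)


edits = [
    ProposedEdit("ads/campaign.cc", "campaigns-team", "int64_t campaign_id;"),
    ProposedEdit("ads/merchant.cc", "merchants-team", "int64_t merchant_id;"),
    ProposedEdit("ads/report.cc", "campaigns-team", "int64_t id = ;"),  # broken edit
]

# Stand-in for a real test run: reject the syntactically broken edit.
verified = keep_verified(edits, lambda edit: "= ;" not in edit.new_text)
shards = shard_by_owner(verified)

for owner, owned in shards.items():
    print(owner, [edit.path for edit in owned])
```

Filtering before sharding means reviewers only ever see changes that already passed verification, which keeps the human review step focused on correctness of intent rather than on broken builds.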

Note that the IDs used in the internal codebase have appropriate privacy protections already applied. While the model migrates them to a new type, it does not alter or surface them, so all privacy protections remain intact.

For this workstream, we found that 80% of the code modifications in the landed CLs were AI-authored; the rest were human-authored. The total time spent on the migration was reduced by an estimated 50%, as reported by the engineers doing the migration. There was a significant reduction in communication overhead, as a single engineer could generate all necessary changes. Engineers still needed to spend time analyzing the files that needed changes and reviewing the results. We found that in Java files our model predicted the need to edit a file with 91% accuracy.

The toolkit has already been used to create hundreds of change lists in this and other migrations. On average, more than 75% of the AI-generated character changes land successfully in the monorepo.
