Interrogating Hidden Assumptions in Code Review with AI

Predictable behaviour is one of the things that makes production software engineering possible. There's a reason static type checkers for dynamic languages have skyrocketed in popularity: with type checking, changing old code feels much safer.

LLMs have dramatically increased the amount of code that can be produced by a single developer. If validating that code is correct becomes the bottleneck of software development, then there's a clear opportunity to develop new tools that go beyond looking at git diffs.

When you change a schema's type from an integer to a string, it's immediately clear that you're making it more flexible. Type systems can help catch violations of structural invariants, but a lot of issues in production come from other broken assumptions, such as uniqueness, timeliness, or ordering. Formal methods in software engineering can help with that, but good luck trying to model a pandas data transformation in TLA+.

Suppose your stakeholder wants data that is currently long-shaped to be made wide. Your AI agent has submitted the following code as its merge request. Can you catch the edge case that will cause this to raise an exception?

--- a/reshape.py
+++ b/reshape.py
@@ -1,18 +1,25 @@
import pandas as pd
import numpy as np
 
def reshape_metrics(df):
    def parse(v):
        if pd.isna(v) or v == "N/A":
            return np.nan
        v = str(v).strip()
        if v.endswith("%"):
            return float(v[:-1]) / 100
        return float(v.replace(",", ""))
 
    melted = df.melt(id_vars="id", var_name="key", value_name="value")
    enriched = melted.assign(
        metric=melted["key"].str.split("_").str[0],
        year=melted["key"].str.split("_").str[1].astype(int),
    )
    typed = enriched.assign(
        value=enriched["value"].astype("string").map(parse).astype("Float64"),
    )
-    return typed.drop(columns="key")
+    return (
+        typed
+        .drop(columns="key")
+        .set_index(["id", "year", "metric"])
+        .unstack("metric")
+        .rename_axis(None, axis=1)
+        .reset_index()
+    )

There's a small trick: the transformation assumes that each (id, year, metric) combination is unique before the unstack. If two rows share the same combination, unstack raises a ValueError because it cannot decide which value belongs in the cell. This is the shape of data the program assumes, but depending on the source of the data it's not always true. A more defensive transformation would have made the assumption explicit, either by loudly raising an error beforehand or by deduplicating the data.
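To make the failure concrete, here is a minimal sketch of what a guarded reshape might look like. The sample frame, the KEYS constant, and the unstack_metrics helper are all hypothetical names made up for illustration; they are not part of the merge request above, and keeping the last row per key is only one of several reasonable deduplication policies.

```python
import pandas as pd

# Hypothetical long-format frame, standing in for the data after the
# melt/parse steps. The last two rows share the same (id, year, metric) key.
typed = pd.DataFrame(
    {
        "id": [1, 1, 1],
        "year": [2023, 2023, 2023],
        "metric": ["revenue", "margin", "margin"],
        "value": [100.0, 0.25, 0.30],
    }
)

KEYS = ["id", "year", "metric"]


def unstack_metrics(long_df):
    """Pivot to wide, raising a descriptive error if the key is not unique."""
    dupes = long_df[long_df.duplicated(subset=KEYS, keep=False)]
    if not dupes.empty:
        raise ValueError(f"{len(dupes)} rows share an (id, year, metric) key")
    return long_df.set_index(KEYS)["value"].unstack("metric").reset_index()


# The unguarded reshape fails inside pandas with a generic reshape error.
try:
    typed.set_index(KEYS).unstack("metric")
    unguarded_error = None
except ValueError as exc:
    unguarded_error = str(exc)

# The guarded version fails before reshaping and names the broken assumption.
try:
    unstack_metrics(typed)
    guarded_error = None
except ValueError as exc:
    guarded_error = str(exc)

# Once duplicates are resolved (assumed policy: keep the last row per key),
# the same helper succeeds.
wide = unstack_metrics(typed.drop_duplicates(subset=KEYS, keep="last"))
print(guarded_error)
print(wide)
```

Whether to raise or to deduplicate depends on what a duplicate means in the source data. Raising keeps the decision with a human, while deduplicating silently picks a winner.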

My experiment, Runtime Diff, will help you interrogate this exact edge case. The tool tries to guess which behaviours will change and which will hold after your code change, and it generates relevant examples. It currently works only with Python functions.
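As a rough illustration of the idea only, and not Runtime Diff's actual implementation, one could run the old and new versions of a function on the same example inputs and record what each one does. Every name below (observe, compare, old_version, new_version, and the example frames) is hypothetical and hand-written; in the tool, the examples are generated rather than typed in.

```python
import pandas as pd


def observe(fn, example):
    """Run fn on a copy of the example and summarise what happened."""
    try:
        out = fn(example.copy())
        return {"raised": None, "columns": list(out.columns), "rows": len(out)}
    except Exception as exc:
        return {"raised": type(exc).__name__, "columns": None, "rows": None}


def compare(old_fn, new_fn, examples):
    """Return one record per example describing both versions' behaviour."""
    return [
        {"name": name, "old": observe(old_fn, ex), "new": observe(new_fn, ex)}
        for name, ex in examples.items()
    ]


# Hypothetical stand-ins for a function before and after a change.
def old_version(df):
    return df


def new_version(df):
    return df.set_index(["id", "metric"])["value"].unstack("metric").reset_index()


# Hand-written examples: one that satisfies the uniqueness assumption
# and one that breaks it.
examples = {
    "unique keys": pd.DataFrame(
        {"id": [1, 1], "metric": ["a", "b"], "value": [1.0, 2.0]}
    ),
    "duplicate keys": pd.DataFrame(
        {"id": [1, 1], "metric": ["a", "a"], "value": [1.0, 2.0]}
    ),
}

report = compare(old_version, new_version, examples)
for record in report:
    print(record)
```

The record for the duplicate-keys example is the interesting one: the old version returns two rows, while the new version raises. That is the kind of shifted assumption a reviewer would want surfaced.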

LLMs can do more than leave comments on merge requests: they can help us explore the hidden assumptions in our code and enhance human reasoning. When you're just starting to program and learning about algorithms, you may use a visualizer like pythontutor.com. We give these tools up once we can model program behaviour in our heads, but AI may make that modelling harder, whether through the sheer number of lines of code or the complexity of the programs. Runtime Diff is the "grown-up" version of that visualizer. In my evals, the tool is excellent at generating examples but still struggles to determine which conditions differ between the two sides of a change.

I lead a team that builds data pipelines, which means I spend a lot of my time day to day reviewing code that transforms data. When processing data, you need to ensure your code works correctly forever. That means being defensive and making assumptions about your data, and it also means that changing existing logic can be incredibly dangerous.

Declaring invariants or writing a formal specification can be time-consuming, difficult, or even impossible to do completely. Rather than asking users for that, the next level of abstraction in code review could let AI propose the invariants that have shifted in a program and let reviewers explore the impact of the change. As we think about ways to automate validation, we need new ways to interrogate the hidden assumptions our code makes.