AI will make formal verification go mainstream (kleppmann.com)
6 points by mau 14 days ago | hide | past | favorite | 4 comments


I would like this to be true, but am a bit skeptical.

I am what the article calls an "industrial software engineer" and I work on "low- to medium-assurance" projects, but I have used various formal methods (Alloy and TLA+) in my work to prevent and discover bugs.

I've experimented with using LLMs to generate both Alloy and TLA+ a couple of times over the past few years, and the problems I see are:

- They have gotten better over the last few years, but still only produce useful results in the hands of someone who is moderately competent. Becoming moderately competent requires many hours of investment in these tools, and you will lose much of that competence if you don't keep it up. For example, I can still read TLA+ and PlusCal, but I can't write them without constantly referring to the docs, because I only write them once or twice a year.

- They suffer even more from GIGO than other aspects of software development. If you can't rigorously define your problem, you will get a bad model/output that only gives you false confidence. A large part of the value of doing formal methods is building the muscle for thinking rigorously. Hillel Wayne makes this point in several places: doing enough TLA+ (for example) gives you a much better innate sense for where race conditions will appear.

- There will still be cultural and technical problems with integrating formal methods, and their artifacts, into the rest of your codebase and team. For example, how do you prevent drift? Will you have a CI automation that uses an LLM to detect when the spec has diverged from the code?
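To make the drift problem concrete, here is a toy TLA+ sketch (mine, not from the article): the bound in the invariant mirrors a hypothetical constant in the implementation, and nothing forces the two to stay in sync.

```tla
---- MODULE Counter ----
EXTENDS Naturals
VARIABLE count

\* Mirrors a hypothetical `MAX_COUNT = 100` constant in the code
Limit == 100

Init == count = 0
Incr == count < Limit /\ count' = count + 1
Next == Incr
Spec == Init /\ [][Next]_count

\* TLC happily verifies this; if the code's constant later changes
\* to 200, the spec still passes but no longer describes the system.
CountBounded == count <= Limit
====
```

Whether the drift check is an LLM diffing spec against code or a human in review, someone has to notice that `Limit` went stale.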

I'm not saying it is impossible that this will happen, and I would love to be wrong, but the general tendency I see with LLM use is to make software developers less intimately familiar with their tools, and less invested in deeply understanding their code. That bodes ill for formal methods even more than regular programming.


What would you suggest as a reference problem (a benchmark of sorts) for playing with formal methods, for someone with a bit of formal-verification background but not in the field of software verification? Can you suggest some helpful materials?

I've come across TLA+ multiple times, but it seems to be targeted more towards distributed systems (Lamport being the creator, that makes sense). Is it correct that it would be of little use in other domains?


This is going to happen, and it is real stuff we could be working towards with the tools we already have. No need for AGI vaporware, no waiting around for the perfect agentic playground, not even necessarily a big requirement for excellent reasoning. Just use LLMs for what they are actually good at, i.e. fuzzy translators, stylistic filters, and compilers.

Gradual typing worked through practice and hints. Like gradual typing, gradual spec'ing could be an iterative, roughly parallel annotated representation: ignored by the main runtime unless called for, and ignored by developers who aren't interested in it. But once there's enough of it to hit some kind of critical mass, it suddenly becomes very powerful, and a lot of very interesting things become possible.
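PlusCal already has a flavor of this "annotation ignored unless called for" pattern: the algorithm lives inside a TLA+ comment and is inert until you run the translator. A rough sketch:

```tla
---- MODULE Transfer ----
EXTENDS Naturals

(* --algorithm transfer
variables alice = 10, bob = 10;
begin
  Move:
    if alice >= 3 then
      alice := alice - 3;
      bob := bob + 3;
    end if;
end algorithm; *)

\* To ordinary TLA+ tooling the block above is just a comment until
\* the PlusCal translator expands it into TLA+ -- much like type
\* annotations a runtime ignores unless a checker asks for them.
====
```

Gradual spec'ing at scale would presumably look similar: spec fragments riding along with the code, dormant until some tool cares.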


I certainly hope so.

I wonder, what is the actual blocker right now? I'd assume that LLMs are still not very good with specification and verification languages? Has anyone tried Datalog, TLA+, etc. with LLMs? I suppose Gemini was specifically trained on Lean, or at least some IMO-finetuned fork of it was. Anyhow, there's probably a large Lean dataset collected somewhere on DeepMind servers, but that's not necessarily applicable to certification, I think?

> AI also creates a need to formally verify more software: rather than having humans review AI-generated code, I’d much rather have the AI prove to me that the code it has generated is correct.

At the RL stage, LLMs could game the training*, proving easier invariants than actually intended (the proof is correct and possibly short, which means positive reward). It would take additional care to set it up right.

* I mean, if you set it up to generate code AND a proof for it.
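A concrete (hypothetical) illustration of this gaming risk, as a toy TLA+ mutex: both invariants below hold for the spec, but only one is the property you actually wanted. A reward signal that only checks "proof/model check accepted" can't tell them apart.

```tla
---- MODULE Mutex ----
VARIABLE pc
Procs == {"p1", "p2"}

Init == pc = [p \in Procs |-> "idle"]

Enter(p) == /\ pc[p] = "idle"
            /\ \A q \in Procs \ {p} : pc[q] # "critical"
            /\ pc' = [pc EXCEPT ![p] = "critical"]

Exit(p) == /\ pc[p] = "critical"
           /\ pc' = [pc EXCEPT ![p] = "idle"]

Next == \E p \in Procs : Enter(p) \/ Exit(p)

\* Weak invariant: nearly vacuous, trivially checkable -- cheap reward
TypeOK == \A p \in Procs : pc[p] \in {"idle", "critical"}

\* The invariant we actually care about
MutualExclusion ==
    \A p, q \in Procs :
        (p # q) => ~(pc[p] = "critical" /\ pc[q] = "critical")
====
```

If the model chooses which invariants to verify, it is rewarded just as well for `TypeOK` as for `MutualExclusion`, so the human (or the harness) has to pin down the obligations, which is the hard part to begin with.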



