Can LLMs model real-world systems in TLA+?


(www.sigops.org)

89 points

by mad18 hours ago

by simplegeek4 hours ago

I feel LLMs are indeed getting better at writing models. But, in my experience, they struggle to come up with correct safety and liveness properties unless you work closely with them. And of these two, they struggle the most with correct liveness properties.

Also, for some problems, I observe that models produced by LLMs often cause state-space explosion. For simpler models, though, they can fix this when you guide them.

I’m sure LLMs will get even better.

That said, I take a slightly different approach. Lamport said “If you're thinking without writing, you only think you're thinking.” So, taking that advice, I always try to write the first draft by hand, and once I have the final shape in place I turn to an LLM for further exploration and experimentation if I have to.
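To put a number on the state-space explosion mentioned above, here is a minimal Python sketch (not TLA+, and not taken from any spec in this thread). It counts reachable states by breadth-first search, which is roughly what an explicit-state checker does. The toy system (n independent processes, each stepping through k local states) and all function names are assumptions made up for illustration.

```python
from collections import deque


def successors(state, k):
    """Yield every state reachable in one step: any one process advances its local counter."""
    for i, pc in enumerate(state):
        if pc < k - 1:
            yield state[:i] + (pc + 1,) + state[i + 1:]


def count_reachable(n, k):
    """Breadth-first search over all interleavings of n processes with k local states each."""
    init = (0,) * n
    seen = {init}
    queue = deque([init])
    while queue:
        state = queue.popleft()
        for nxt in successors(state, k):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)


if __name__ == "__main__":
    # Hypothetical sizes; each extra process multiplies the number of states by k.
    for n in range(1, 7):
        print(n, count_reachable(n, k=4))
```

In this toy the count is k**n, so one more process multiplies the work by k. That is why shrinking constants or collapsing equivalent states is usually the first thing to try when a generated model will not finish checking.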

by tmaly9 hours ago

I remember NVIDIA sponsored a TLA+ challenge last year: https://foundation.tlapl.us/challenge/index.html

by uptodatenews9 hours ago

Whoa didn't even know cool

by tombert7 hours ago

Claude has certainly been getting better with TLA+. It's not perfect yet but for laughs I got it to model the rules of Monopoly last night [1]. I haven't done any exhaustive checking on it yet, but it certainly looks passable.

It is pretty impressive how good it's gotten at this, in a relatively short amount of time no less. I still usually write my specs by hand, but who knows how much longer I'll be doing that.

[1] https://pdfhost.io/v/KU2j37YKrP_Monopoly

by randusername1 hours ago

What's the advantage of provable correctness if it's apparently not easy to prove even for people who understand TLA+? I'm not trying to be a party pooper, just curious.

Isn't logical incorrectness less of a problem in software than failures of imagination or conscientiousness in modeling the domain?

by ofrzeta7 hours ago

It looks quite complicated and I have no idea what it is doing. Obviously, since I don't know about TLA+. But what about someone who knows TLA+? It still seems hard to make sure it is valid. And it's just for a relatively simple game.

by comex4 hours ago

Well, for one thing:

> Decline to buy: property stays with bank (auction abstracted out)

Ignoring an entire game mechanic is really stretching the definition of “abstracted out”…

Also, at the bottom it defines a “Liveness: someone eventually wins” property which I believe cannot be proven. Monopoly doesn’t have any rules forcing the game to end eventually. There is only a probabilistic guarantee, and even that only applies if the players are trying to win; if the players are conspiring to prevent the game from ending then they’re unlikely to fail.
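To illustrate why a property like "someone eventually wins" fails without extra assumptions, here is a small Python sketch. The two-player coin game is a hypothetical stand-in, not the Monopoly spec linked above, and the function names are made up. The search looks for a reachable cycle that never passes through a winning state. Such a cycle is an infinite behavior in which nobody ever wins, which is roughly the shape of counterexample a model checker reports for a liveness violation.

```python
def successors(state):
    """Toy two-player game (hypothetical rules): each turn one coin moves between players."""
    a, b = state
    if a == 0 or b == 0:  # someone is bankrupt, so the game is over
        return []
    return [(a - 1, b + 1), (a + 1, b - 1)]


def is_win(state):
    """A state is winning once either player has no coins left."""
    return 0 in state


def find_lasso(init):
    """Depth-first search for a reachable cycle that avoids every winning state.

    Returns (prefix, cycle) as two lists of states, or None if no such cycle exists.
    """
    path, on_path, done = [], set(), set()

    def dfs(state):
        if is_win(state) or state in done:
            return None
        if state in on_path:  # closed a loop back onto the current path
            i = path.index(state)
            return path[:i], path[i:]
        path.append(state)
        on_path.add(state)
        for nxt in successors(state):
            found = dfs(nxt)
            if found:
                return found
        path.pop()
        on_path.remove(state)
        done.add(state)  # fully explored, no non-winning cycle through here
        return None

    return dfs(init)


if __name__ == "__main__":
    print(find_lasso((2, 2)))
```

Starting from (2, 2) the players can pass a coin back and forth forever, so the search finds a loop. Ruling that out takes a fairness or progress assumption in the spec, and the rules themselves do not provide one.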

by _doctor_love4 hours ago

There is a nice guide to TLA+ from Hillel Wayne here: https://learntla.com/

PlusCal is recommended as the gentler on-ramp to TLA+ for first learning.

by NooneAtAll32 hours ago

> I haven't done any exhaustive checking on it yet, but it certainly looks passable.

Isn't that exactly the kind of failure LLMs produce the most? First-glance-passable nonsense?

by iFire7 hours ago

I don't use TLA+ to model real-world systems anymore. Claude is able to model systems in Lean 4, and the binary executable can handle real input, or I can directly generate C / Rust on proofs with numeric types that have ring structure (integers, rationals, bits).

https://github.com/lambdaclass/truth_research_zk

by thomasahle4 hours ago

I'm currently choosing the right formalization for a big hardware project.

I'm considering SVA, TLA+ and Lean, with the former being more domain-specific and the latter more general.

Do you think we'll move towards "Lean for everything" or do domain specific formalisms still make sense?

by NooneAtAll32 hours ago

what's SVA?

by IshKebab2 hours ago

SystemVerilog Assertions. Hardware (silicon ASICs, and often FPGAs too) is written in a language called SystemVerilog. It has a feature called "concurrent assertions" which is usually just called SVA.

These are sort of temporal regexes, e.g. you can write

  assert property($fell(rst) |-> foo == 1 ##[1:20] foo == 0)

Which means if the rst signal fell (changed to 0) then foo must be 1 and 1-20 cycles later it must be 0.

The nice thing about them is that there are a few commercial tools that can formally verify them. They're super expensive (~$100k/year for one license), but fairly widely used because they work really well.

It's probably the most successful application of formal verification because it doesn't require much expertise to use, unlike software formal verification, which pretty much immediately requires you to become an expert on loop invariants, termination measures, Hoare triples, etc. At least that has been my experience.
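For readers who don't know SystemVerilog, the example assertion above can be read as a check over a sampled trace. Below is a rough Python sketch of that reading, under simplifying assumptions: two-state signals only, one sample per clock cycle, and a hand-rolled approximation of $fell. It illustrates the semantics described in the comment and is not how any commercial tool works.

```python
def fell(sig, t):
    """Approximation of $fell: the signal was 1 on the previous sample and is 0 now."""
    return t > 0 and sig[t - 1] == 1 and sig[t] == 0


def check(rst, foo, lo=1, hi=20):
    """Return the cycles at which an attempt of the property fails on a finite trace.

    Attempts whose window runs past the end of the trace without seeing foo == 0
    are treated as still pending rather than failed (an assumption of this sketch).
    """
    failures = []
    for t in range(len(rst)):
        if not fell(rst, t):
            continue  # antecedent false, nothing to check at this cycle
        if foo[t] != 1:
            failures.append(t)  # foo must be 1 in the cycle where rst fell
            continue
        window = foo[t + lo : t + hi + 1]
        if 0 in window:
            continue  # foo dropped to 0 within lo..hi cycles
        if len(window) == hi - lo + 1:
            failures.append(t)  # full window observed and foo never became 0
    return failures


if __name__ == "__main__":
    rst = [1, 1, 0, 0, 0, 0]
    foo = [0, 0, 1, 1, 0, 0]
    print(check(rst, foo))
```

The difference with a formal tool is that this only checks one trace you hand it, while the tool tries to show that no trace of the design can fail.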

by dmos623 hours ago

Do you find Lean 4 sufficient for highly async systems?

by iFire3 hours ago

I haven't made money on it yet, but I'm trying to model a WebTransport (HTTP/3, QUIC) system for massively multiplayer VR games.

See https://aws.amazon.com/builders-library/challenges-with-dist... for how async relates to distributed systems.

by dgacmu9 hours ago

This post reads like an accidental advertisement for approaches like Verus [1], which couple the implementation and verification so you can't end up with a model that diverges from the actual implementation. I'm personally much more optimistic about the Verus approach, but I freely admit that's my builder bias speaking.

[1] https://github.com/verus-lang/verus

by atomicnature6 hours ago

Just a question to people who may know better than me about this.

I thought the whole point of trying to write out TLA+ is so that you get a better idea of what you want and put it into formal language?

I get that an LLM can assist/help with expressing what we want in formal language a bit, but if one automates all this there is no human intent/design anymore.

If the LLM generates both the design (TLA+) and writes an arbitrary program that satisfies said design -- what exactly have we proved?

What assurance do humans get if the human doesn't know or cannot specify what they want?

by majormajor6 hours ago

An LLM-generated TLA+ model can be verified for certain things in a way that LLM-generated code can't. It's infamously hard to exhaustively unit-test concurrency.

Whether or not you're modeling the right things or verifying the right things, of course... that's always left as an exercise for the user. ;)

(How to prove the implementation code is guaranteed to match the spec is a trick I haven't seen generalized yet, either.)
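The concurrency point can be seen on a toy example. The Python sketch below is not TLA+, and the scenario is an assumption chosen for illustration: two threads each do a non-atomic increment (read, then write) on a shared counter. It enumerates every interleaving and reports the schedules that lose an update. A unit test runs whichever schedule the scheduler happens to pick, while a model checker walks all of them.

```python
from itertools import combinations


def interleavings(n_a, n_b):
    """Yield every schedule as a tuple of thread ids, preserving each thread's own step order."""
    total = n_a + n_b
    for slots in combinations(range(total), n_a):
        yield tuple("A" if i in slots else "B" for i in range(total))


def run(schedule):
    """Execute one non-atomic increment (read, then write) per thread under one schedule."""
    counter = 0
    local = {}  # the value each thread read
    pc = {"A": 0, "B": 0}  # per-thread program counter
    for tid in schedule:
        if pc[tid] == 0:
            local[tid] = counter  # step 1: read the shared counter
        else:
            counter = local[tid] + 1  # step 2: write back the incremented value
        pc[tid] += 1
    return counter


def violations():
    """Return every schedule whose final counter is not 2, i.e. a lost update."""
    return [s for s in interleavings(2, 2) if run(s) != 2]


if __name__ == "__main__":
    for schedule in violations():
        print("lost update under schedule", "".join(schedule))
```

With two steps per thread there are only six schedules, so brute force is trivial here. The number of schedules grows very quickly with more threads and steps, which is where a real model checker earns its keep.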

by kiwicopple21 minutes ago

> It's infamously hard to exhaustively unit-test concurrency.

a useful example from last week where TLA+ found a bug in pg_rewind:

https://multigres.com/blog/2026/05/04/tla-pg-rewind

by pzoln6 hours ago

Sorry, this must be a very naive question, but what if you give an LLM just the source code (maybe even obfuscating names like Raft and Etcd) and ask it to create a TLA+ spec of that?

by _doctor_love4 hours ago

This is already being done by some folks, reverse-engineering existing source into a TLA+ spec. Like other commenters have mentioned, the challenge is in ensuring that the spec and code match each other.
