I have a small-medium following on Threads in the AI/tech space (~7K) and regularly post there, and have started posting on Twitter a little more recently, so I feel like I can provide some extra insight that might be missing in this thread.
The exposure to "what's what" in the tech space is clearly better on Twitter and it isn't even close. Nearly all tech news breaks on Twitter first, then flows downstream to Threads. For everything else it's kind of hard to say because I aggressively curate my social media feeds, so I don't get much content outside of my bubble.
The tech information I tend to get on Threads is more personal updates on mutuals' projects and niche eureka moments they have. There's maybe a dozen of these that I regularly see and interact with and maybe a couple dozen more that pop up occasionally. But again, this is after aggressively curating my feed and maintaining it for ~3 years.
I have a feeling that my efforts could have yielded better results on Twitter had I spent all that time posting and interacting there instead of Threads (or in addition to), hence me increasing my posting there.
I've noticed that as I use agents more and more, my PRs are getting more ambitious (read: bigger diffs). When I reviewed them with agents, the first review often wouldn't catch anything but the second would. This decreased my confidence in their capabilities, so I decided to make a tool to let me run 10 review agents at once, then aggregate their findings into a single agent to assess and address.
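The core of it is just a fan-out/fan-in loop. A minimal sketch of the pattern (not Tenex's actual code; the agent CLI invocation and prompts here are placeholders):

```python
# Minimal sketch of the fan-out/fan-in review pattern, NOT Tenex's actual
# implementation. AGENT_CMD is a placeholder for whatever agent CLI you use
# (e.g. a non-interactive Codex or Claude Code invocation).
import asyncio

AGENT_CMD = ["codex", "exec"]  # placeholder; swap in your agent CLI
REVIEW_PROMPT = "Review the diff in this repo and list concrete problems."
N_REVIEWERS = 10

async def run_reviewer(i: int) -> str:
    proc = await asyncio.create_subprocess_exec(
        *AGENT_CMD, REVIEW_PROMPT,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.DEVNULL,
    )
    out, _ = await proc.communicate()
    return f"--- Reviewer {i} ---\n{out.decode()}"

async def main() -> None:
    # Fan out: run all reviewers concurrently.
    reviews = await asyncio.gather(*(run_reviewer(i) for i in range(N_REVIEWERS)))
    # Fan in: hand every finding to one aggregator agent to dedupe and rank.
    aggregate_prompt = (
        "Here are reviews from independent agents. Merge duplicates, discard "
        "false positives, and produce one prioritized list:\n\n" + "\n".join(reviews)
    )
    proc = await asyncio.create_subprocess_exec(
        *AGENT_CMD, aggregate_prompt, stdout=asyncio.subprocess.PIPE
    )
    out, _ = await proc.communicate()
    print(out.decode())

asyncio.run(main())
```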
I was using Codex at the time, so Tenex is kind of a play on "10 Codex agents" and the "10x engineer" meme.
I've since added a lot of features, and just today I got to use it for the first time in a production system. Some rough edges for sure, but as I use it, any time something feels "off" or unintuitive I take notes so I can improve it.
Fun fact: on my machine, launching 50x Claude Code instances very nearly crashes it, but I was able to launch 100x Codex instances no problem. I tried 500x but ran into rate limits before they could all spawn :(
Spoiler: there won't be a part 2, or if there is it will be with a different approach. I wrote a followup that summarizes my experiences trying this out in the real world on larger codebases: https://thelisowe.substack.com/p/reflections-on-relentless-v...
tl;dr I use a version of it in my codebases now, but the combination of LLM reward hacking and the long tail of verifiers in a language (some of which don't even exist, like accurately detecting dead code in Python, which vulture et al. can't reliably do, or validating signatures for property-based tests) makes this problem more complicated than it seems on the surface. It's not intractable, but you'd be writing many different language-specific libraries. And even then, with all of those verifiers in place, there's no guarantee that it will produce a consistent quality of code across repos of different sizes.
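As a toy illustration of why the dead-code case is genuinely hard (my own contrived example, not from the followup): static analyzers can't see dynamic references, so "unused" isn't reliably decidable.

```python
# Toy illustration of why static dead-code detection in Python is unreliable.
# A static analyzer sees no direct call to `handle_refund`, so it looks dead,
# but it is reached at runtime through a dynamic lookup.
def handle_payment(event: dict) -> str:
    return f"charged {event['amount']}"

def handle_refund(event: dict) -> str:   # flagged as "unused" by most tools
    return f"refunded {event['amount']}"

def dispatch(event: dict) -> str:
    handler = globals()[f"handle_{event['type']}"]  # dynamic reference
    return handler(event)

print(dispatch({"type": "refund", "amount": 42}))  # handle_refund is live after all
```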
A couple weeks ago I was curious what the strictest programming language was. ChatGPT listed a couple, and it kicked off a short discussion where I began asking it about the capabilities of stricter programming languages at low levels. Funny enough at the end it mentioned that SPARK/Ada was the strictest you could get at the lowest levels, same as Ironclad.
At one point while asking it about drivers, it said "ACL2’s logic is [...] side‑effect‑free definitions with termination proofs when admitted to the logic. That is misaligned with effectful, interrupt‑driven kernel code."
I'm not an OS or kernel dev, most of my work has been in Web Dev, ML, and a little bit of embedded. How accurate is the information that was presented to me? Here is the link to the discussion: https://chatgpt.com/share/691012a7-a06c-800f-9cc9-54a7c2c8b6...
I don't know SPARK or Ada, but it just bothers me to think that we can't...I guess...prove everything about our software before we run it (yes yes, I'm familiar with halting problem shenanigans, but other than that).
There's a lot to unpack here. You can always make a stricter programming language by having your compiler error on everything.
Lisps are perfectly usable for system level code as well. There was an entire lineage of Lisp Machines where virtually all OS code was written in Lisp. Those probably could have used ACL2 had it existed.
There's an X-Y component to your questions though. The strictness of the programming language isn't the main goal for OS formal verification; it just makes certain things easier. What's important is having executable semantics of the programming language, having a machine model, and having a behavioral specification. All of these are typically written in proof languages, and the kernel code is proven to implement the behavioral spec according to the machine and language semantics.
SPARK gives you executable semantics, but so do C (specifically the Clight subset), most of the Lisps, Rust, and many others. You're benefiting from certain errors being impossible in SPARK, but it's not a fundamentally different process.
This sounds very similar to my workflow. Do you have pre-commits or CI beyond testing? I’ve started thinking about my codebase as an RL environment with the pre-commits as hyperparameters. It’s fascinating seeing what coding patterns emerge as a result.
I think pre-commit is essential. I enforce conventional commits (+ a hook which limits commit message length to 50 chars) and, for Python, ruff with many options enabled. Perhaps the most important one is to enforce complexity limits. That will catch a lot of basic mistakes. Any sanity checks that you can make deterministic are a good idea. You could even add unit tests to pre-commit, but I think it's fine to have the model run pytest separately.
The models tend to be very good about syntax, but this sort of linting will often catch dead code like unused variables or arguments.
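For reference, the commit-message hook mentioned above can be as simple as a commit-msg script like this (a minimal sketch; the type list and regex are the usual conventional-commit defaults, not necessarily the exact hook):

```python
#!/usr/bin/env python3
# Minimal sketch of a commit-msg hook enforcing a conventional-commit subject
# capped at 50 chars. Git (or pre-commit) passes the path to the commit
# message file as the first argument.
import re
import sys

PATTERN = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(\([\w\-]+\))?!?: .+"
)

def main() -> int:
    subject = open(sys.argv[1], encoding="utf-8").readline().strip()
    if len(subject) > 50:
        print(f"Subject is {len(subject)} chars; keep it to 50 or fewer.")
        return 1
    if not PATTERN.match(subject):
        print("Subject must follow conventional commits, e.g. 'fix(api): handle nulls'.")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```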
You do need to rule-prompt that the agent may need to run pre-commit multiple times to verify the changes worked, or to re-add files to the commit. Also, frustratingly, you need to be explicit that pre-commit might fail and that it should fix the errors (otherwise it'll sometimes run and just say "I ran pre-commit!"). For commits there are some other guardrails, like blanket-denying git add <wildcard>.
Claude will sometimes complain via its internal monologue when it fails a ton of linter checks and is forced to write complete docstrings for everything. Sometimes you need to nudge it to not give up, and then it will act excited when the number of errors goes down.
Very solid advice. I need to experiment more with the pre-commit stuff, I am a bit tired of reminding the model to actually run tests / checks. They seem to be as lazy about testing as your average junior dev ;)
Yes, I do have automated linting (a bit of a PITA at this scale).
On the CI side I am using GitHub Actions - it does the job, but I haven't put much work into it yet.
Generally I have observed that using a statically typed language like TypeScript helps catch issues early on. I had much worse results with Ruby.
> Why not have static analysis tools on the other side of those generations that constrain how the LLM can write the code?
We do have those; we call them programmers. Without such tools you don't get much useful output at all. But beyond that, static analysis tools aren't powerful enough to detect the kinds of problems and issues these language models create.
I'd be interested to know the answer to this as well. Considering the wealth of AI IDE integrations, it's very eyebrow-raising that there are zero instances of this. Seems like somewhat low hanging fruit to rule out tokens that are clearly syntactically or semantically invalid.
I’d like to constrain the output of the LLM by accessing the probabilities for the next token, pick the next token that has the highest probability and also is valid in the type system, and use that. Originally OpenAI did give you the probabilities for the next token, but apparently that made it easy to steal the weights, so they turned that feature off.
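With open-weight models the logits are still exposed, so a rough version of this is doable locally. A minimal sketch, assuming any Hugging Face causal LM and a placeholder validity predicate standing in for the type checker (a real version would also have to cope with tokens that split identifiers mid-way):

```python
# Minimal sketch of greedy decoding constrained by an external validity check.
# The model name and `is_valid_continuation` are placeholders; the predicate
# would call a parser / type checker on the partially generated program.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # any local causal LM
model = AutoModelForCausalLM.from_pretrained("gpt2")

def is_valid_continuation(text: str) -> bool:
    return True  # placeholder: reject prefixes the type system rules out

prompt = "def add(a: int, b: int) -> int:\n    return"
ids = tok(prompt, return_tensors="pt").input_ids

for _ in range(20):
    with torch.no_grad():
        logits = model(ids).logits[0, -1]            # scores for the next token
    # Walk candidates from most to least probable; keep the first valid one.
    for tok_id in torch.topk(logits, k=50).indices.tolist():
        candidate = tok.decode(ids[0].tolist() + [tok_id])
        if is_valid_continuation(candidate):
            ids = torch.cat([ids, torch.tensor([[tok_id]])], dim=1)
            break

print(tok.decode(ids[0]))
```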
This can be done: I gave mine a justfile and early in the project very attentively steered it towards building out quality checks. CLAUDE.md also contains instructions to run those after each iteration.
What I'd like to see is the CLI's interaction with VSCode etc extending to understand things which the IDE has given us for free for years.
Previously at Sprout Social where I built their ML inference platform - reduced deployment time from 6 months to 6 hours and cut AWS costs by $500K/yr.
Looking for interesting problems in AI infrastructure, performance optimization, or building products from scratch.
Location: Wisconsin
Remote: Yes
Willing to relocate: Yes
Technologies: Python, PyTorch, Kubernetes, Docker, AWS, FastAPI, ONNX, MLOps
Resume: https://drive.google.com/file/d/1qO8XdisNTFq_wmrQGDKnu6eWDi2g33Me/view
GitHub: https://github.com/Mockapapella
Email: In profile or on resume
AI/ML Engineer specializing in high-performance deployments. Built distributed systems handling 30K QPS, developed a neural network for Rocket League gameplay, and created platforms that cut model deployment time from 6 months to 6 hours. Saved $500K/yr in infrastructure costs through optimization at previous role. Former technical founder with experience in humanoid robotics and AI writing assistance. I write about my projects and musings on my blog: https://thelisowe.substack.com/
Seeking roles focusing on ML infrastructure, model optimization, post-training, or full-stack AI engineering.