I've been thinking about something like this from a UI perspective. I'm a UX designer working on a product with a fairly legacy codebase. We're vibe coding prototypes and moving towards making it easier for devs to bring in new components. We have a hard enough time verifying UI quality as it is, and having more devs vibing on frontend code is probably going to make it a lot worse. I'm thinking about something like having agents regularly traverse the code to identify non-approved components (and either fix or flag them). Maybe with this we won't fall even further behind on verification debt than we already are.
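To make that idea a bit more concrete, the kind of scan I have in mind is roughly the sketch below. It assumes components come from a single design-system package and that the approved list lives in a JSON file; the `@acme/design-system` name and `approved-components.json` are placeholders, not anything real.

```typescript
// Rough sketch of the "flag non-approved components" idea.
// Assumptions (all hypothetical): components are imported from an
// "@acme/design-system" package and the approved list lives in
// approved-components.json as a string array.
import * as fs from "fs";
import * as path from "path";

const APPROVED: Set<string> = new Set(
  JSON.parse(fs.readFileSync("approved-components.json", "utf8"))
);

// Matches e.g. `import { Button, Card } from "@acme/design-system"`.
const IMPORT_RE = /import\s*\{([^}]+)\}\s*from\s*["']@acme\/design-system["']/g;

function* walk(dir: string): Generator<string> {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) yield* walk(full);
    else if (/\.(tsx|jsx)$/.test(entry.name)) yield full;
  }
}

for (const file of walk("src")) {
  const source = fs.readFileSync(file, "utf8");
  for (const match of source.matchAll(IMPORT_RE)) {
    const names = match[1]
      .split(",")
      .map((n) => n.trim().split(/\s+as\s+/)[0])
      .filter(Boolean);
    for (const name of names) {
      if (!APPROVED.has(name)) {
        // An agent could open a ticket or PR here; this just flags it.
        console.warn(`${file}: non-approved component "${name}"`);
      }
    }
  }
}
```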
For context, I'm a UX Designer at a low-code company. LLMs are great at cranking out prototypes using well-known React component libraries, but lesser-known low-code syntax takes more work. We made an MCP server that helps a lot, but what I'm working on now is a set of steering docs to generate components and prototypes that are "backwards compatible" with our bespoke front-end language. This way our vibe prototyping has our default look out of the box and translates more directly to production code. https://github.com/pglevy/sail-zero
Our low-code expression language is not well-represented in the pre-training data, so as a baseline we get lots of syntax errors and really bad-looking UIs. But we're getting much better results by setting up our design system documentation as an MCP server. Our docs include curated guidance and code samples, so when the LLM uses the server, it's able to more competently search for things and call the relevant tools. With this small but high-quality dataset, the output also looks better than some of our experiments with fine-tuning. I imagine this could work for other docs use cases that are more dynamic (i.e., we're actively updating the docs, so having the LLM call APIs for what it needs seems more appropriate than a static RAG setup).
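To give a rough idea of the shape (not our actual server), a docs-search tool on an MCP server could look something like this with the TypeScript MCP SDK; the tool name and the `searchDocs` helper are placeholders for whatever index sits over the curated guidance:

```typescript
// Minimal sketch of a design-system docs MCP server (illustrative only).
// Assumes the TypeScript MCP SDK; searchDocs() is a placeholder for a
// small search index over the curated guidance and code samples.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

async function searchDocs(query: string): Promise<string> {
  // Placeholder: in practice this would return guidance plus code samples.
  return `No real index wired up; you searched for: ${query}`;
}

const server = new McpServer({ name: "design-system-docs", version: "0.1.0" });

server.tool(
  "search_design_docs",
  "Search curated design-system guidance and code samples",
  { query: z.string() },
  async ({ query }) => ({
    content: [{ type: "text", text: await searchDocs(query) }],
  })
);

await server.connect(new StdioServerTransport());
```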
Not an engineer, but I think this is where my mind was going after reading the post. Seems like what will be useful is continuously generated "decision documentation," so the system has access to what has come before in a dynamic way. (Like some mix of RAG with a knowledge graph + MCP?) Maybe even pre-outlining "decisions to be made," so if an agent is checking in, it could see there is something that needs to be figured out but hasn't been done yet.
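Something like a lightweight decision record that agents can both read and append to; the fields and statuses below are just a guess at the shape, not an existing format:

```typescript
// Sketch of what a "decision record" an agent could read/write might
// look like; fields and statuses are illustrative, not a real schema.
type DecisionStatus = "open" | "decided" | "superseded";

interface DecisionRecord {
  id: string;
  title: string;
  status: DecisionStatus; // "open" = flagged but not yet figured out
  context: string;        // why the decision exists
  decision?: string;      // filled in once it's made
  supersedes?: string[];  // links to earlier records (knowledge-graph edges)
  relatedCode?: string[]; // files or modules the decision touches
}

// An agent checking in could query for status === "open" and see that
// something still needs to be figured out.
const pendingDecision: DecisionRecord = {
  id: "DR-042",
  title: "Pick a pattern for cross-component state",
  status: "open",
  context: "Two prototypes currently do this differently.",
  relatedCode: ["src/flows/checkout/", "src/flows/onboarding/"],
};
```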
I actually have an "LLM as a judge" loop on all my codebases. I have an architecture panel that debates improvements given an optimization metric and convergence criteria, and I feed their findings into a deterministic spec generator (CUE w/ validation) that can emit unit/e2e tests and scaffold Terraform. It's pretty magical.
This CUE spec gets decomposed into individual tasks by an orchestrator that does research per ticket and bundles it.
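In the abstract, the debate/convergence part is roughly the loop below; `askPanel` is a stand-in for however the panel of models is actually invoked, and the scoring here is faked just to show the shape:

```typescript
// Very rough sketch of a panel/convergence loop, not the actual setup.
interface PanelVerdict {
  proposal: string; // proposed architecture change
  score: number;    // judged value of the optimization metric
}

// Placeholder: in practice this would call the model panel; here it just
// echoes the proposal with a made-up score so the loop shape is visible.
async function askPanel(proposal: string, metric: string): Promise<PanelVerdict> {
  return { proposal: `${proposal} (refined for ${metric})`, score: Math.random() };
}

async function debate(
  initialProposal: string,
  metric: string,
  epsilon = 0.01, // convergence criterion: stop when improvement is tiny
  maxRounds = 5
): Promise<PanelVerdict> {
  let current = await askPanel(initialProposal, metric);
  for (let round = 1; round < maxRounds; round++) {
    const next = await askPanel(current.proposal, metric);
    if (next.score - current.score < epsilon) return next; // converged
    current = next;
  }
  return current; // findings then feed the deterministic spec generator
}
```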
Mine is a much simpler use case but sharing in case it's useful. I wanted to be able to quickly generate and iterate on user flows during design collaboration. So I use some boilerplate HTML/CSS and have the LLM generate an "outline" (basically a config file) and then generate the HTML from that. This way I can make quick adjustments in the outline and just have it refresh the code when needed, to avoid too much back and forth with the chat.
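For a sense of what the "outline" is, it's roughly a small config like this; the field names are illustrative rather than the exact format, and the real thing could just as easily be YAML or JSON:

```typescript
// Illustrative shape of the flow "outline" the LLM generates first.
interface FlowStep {
  id: string;
  title: string;
  elements: string[]; // which boilerplate HTML/CSS blocks to drop in
  next?: string;      // id of the step this one links to
}

interface FlowOutline {
  name: string;
  steps: FlowStep[];
}

const signupFlow: FlowOutline = {
  name: "Sign-up flow",
  steps: [
    { id: "email", title: "Enter email", elements: ["text-input", "primary-button"], next: "verify" },
    { id: "verify", title: "Verify code", elements: ["code-input", "primary-button"], next: "done" },
    { id: "done", title: "Welcome", elements: ["hero", "secondary-button"] },
  ],
};
// "/refresh" would regenerate the HTML pages from this outline.
```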
Overall, it has been working pretty well. I did make a tweak I haven't pushed yet so it always writes the outline to a file first (instead of just to the terminal). And I've also started adding slash commands to the instructions so I can type things like "/create some flow" and then just "/refresh" (instead of "pardon me, would you mind refreshing that flow now?").
My use case is a little different (mostly prototyping and building design ops tools) but +1 to this flow.
At this point, I typically do an LLM-readme at the branch level to document both planning and progress. At the project level I've started having it dump (and organize) everything in a work-focused Obsidian vault. This way I end up with cross-project resources in one place, it doesn't bloat my repos, and it can be used by other agents from where it is.
From my understanding of Simon's project, it only supports OpenAI and OpenAI-compatible models, plus local models. For example, if I wanted to use a model on Amazon Bedrock, I'd first have to deploy (and manage) a gateway/proxy layer[1] to make it OpenAI-compatible.
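To make the gateway point concrete: with something like a LiteLLM proxy in front of Bedrock, the client side is just an OpenAI-compatible client pointed at the proxy's URL. The base URL, key, and model alias below are placeholders, not real config:

```typescript
// Example of the proxy pattern: an OpenAI-compatible client pointed at a
// locally running gateway (e.g. a LiteLLM proxy) that fronts Bedrock.
// The base URL, API key, and model alias are placeholders.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:4000/v1", // the gateway, not api.openai.com
  apiKey: "anything-the-proxy-accepts",
});

const response = await client.chat.completions.create({
  model: "bedrock-claude", // whatever alias the proxy maps to the Bedrock model
  messages: [{ role: "user", content: "Hello from behind the gateway" }],
});

console.log(response.choices[0].message.content);
```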
Mozilla's project already boasts a lot of existing provider interfaces, much like LiteLLM, which has the benefit of being able to use a wider range of supported models directly.
> No Proxy or Gateway server required so you don't need to deal with setting up any other service to talk to whichever LLM provider you need.
As for how it compares to LiteLLM, I don't have enough experience with either to tell.