Realistically, a large code base LLM generation tool is going to look something like old-school C code.
An initial pass will generate an architecture and a series of independent code unit definitions (.h files), and then a 'detail pass' will generate the code (.c files) for each header file.
The header files will be 'relatively independent' and small, so they fit inside the context for the LLM, and because the function definition and comments 'define' what a function is, the LLM will generate consistent multi-file code.
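To make that concrete, the first pass might spit out something like this (a toy sketch; the file and function names are invented):

    /* user_store.h -- hypothetical output of the architecture pass */
    #ifndef USER_STORE_H
    #define USER_STORE_H

    #include <stddef.h>

    /* Look up a user by id and copy their name into name_out.
       Returns 0 on success, -1 if the user does not exist. */
    int user_store_find(int user_id, char *name_out, size_t name_len);

    #endif

and the detail pass then only needs that header (plus whatever it #includes) in context to produce the matching .c file:

    /* user_store.c -- hypothetical output of the detail pass, given only user_store.h */
    #include "user_store.h"
    #include <string.h>

    int user_store_find(int user_id, char *name_out, size_t name_len) {
        /* body generated from the signature and comment alone */
        if (user_id != 1 || name_len == 0)
            return -1;
        strncpy(name_out, "alice", name_len);
        name_out[name_len - 1] = '\0';   /* strncpy may not terminate */
        return 0;
    }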
The anti-patterns we see at the moment in this type of project are:
1) The entire code base is passed to the LLM as context, burning a huge number of tokens meaninglessly (you only need the function signatures).
2) 1-page spaghetti definition files are stupid and unmaintainable (just read prompt.md if you don't believe me).
3) No way of verifying the 'plan' before you go and generate the code that implements it (expensive and a waste of time; you should generate function signatures first and then verify they are correct before generating the code for them).
4) Generating unit tests for full functions instead of just the function signatures (leaks implementation details).
It's interesting to me that modern languages try to move all of these domains (header, implementation, tests) into a single place (or even a single file, look at Rust), but passing all of that to an LLM is wrong.
It's architecturally wrong in a way that won't scale.
Even an LLM with a 1,000k-token context window that could process a code base like this would be prohibitively slow and expensive.
I suspect we'll see a class of 'generative languages' emerge in the future that walk in the other direction, so they are easier to use with LLMs and code-gen.
(author here) i suspect this may be one of those “feels like a bad idea until you try it” things. Multiple times when i wanted to add a feature or be more specific, throwing it into prompt.md knowing the LLM would put it in the right file, or coordinate it across files, was for me the most delightful part of the “engineering with prompts” workflow.
that said if you see my future directions notes i do think theres room for file specific .md instructions.
the shared dependencies file is essentially a plan. i didnt realize it at the time but now looking at it with fresh eyes i can do a no-op `smol plan` command pretty trivially.
Lots of different approaches for sure. I certainly see `smol plan` as being at least one step in the right direction.
My experience has been that as the amount of code you need to generate scales beyond trivial levels, the prompt-to-code approach starts to fail for cross-file code.
i.e.
> the shared dependencies (like filenames and variable names) we have decided on are: {shared_dependencies}
Is too trivial a prompt for large-scale code generation.
What if you generate a service in one file and expect a controller to call it from another file?
How does the model know what the functions in the other file are? Did you put the signature of every public function in shared_dependencies?
If you don't have a planning phase, you need a priori knowledge of the order in which files need to be generated, what the functions are called...
I just... don't believe what you're trying to do is actually possible. The models will just generate an isolated 'best guess implementation' for each file; and that might hang together for small code blocks... but as the number of inter-dependencies between blocks of generated code increases...
There's just no way right?
This only works if the code you're generating doesn't call itself and only calls known library code.
shared_dependencies only solves this for the trivial case of like, 'oh, I have these 3 shared functions / types between all these files'; as the shared types and functions increase, your shared_dependencies becomes an unmaintainable monster... you end up having to generate the shared dependencies... split it up so that you have a separate shared_dependencies per file to generate... and suddenly it starts looking a heck of a lot like a .h header file....
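To make the cross-file case concrete (every name here is invented): if the controller is generated in a separate run from the service, the controller run only really needs the service's declarations, i.e. a header:

    /* billing_service.h -- hypothetical; the only thing the controller run needs to see */
    #ifndef BILLING_SERVICE_H
    #define BILLING_SERVICE_H

    /* Charge the customer. Returns 0 on success, a nonzero error code otherwise. */
    int billing_charge(int customer_id, long amount_cents);

    #endif

    /* controller.c -- can be generated without ever seeing billing_service.c */
    #include "billing_service.h"
    #include <stdio.h>

    int handle_checkout(int customer_id, long amount_cents) {
        int err = billing_charge(customer_id, amount_cents);
        if (err != 0)
            fprintf(stderr, "charge failed: %d\n", err);
        return err;
    }

A per-file shared_dependencies that lists every signature another file is allowed to call is exactly that header, just under a different name.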
i think the underlying assumption you have here is that you think im saying this is meant to generate a program in one shot. i have from the start pitched this as a “human centric” scaffolding tool. what i meant by that is that it doesnt try to replace the human programmer, but merely augment them, and you can just stop using it once you subjectively feel that its no longer adding value
lets say right now a single run gets you 1% of the way there. you do 10 runs of prompt iteration before you feel its run out of use, and take over, but at least by then youve gotten 20% of the way.
as intelligence goes up, the single run % goes up. as latency goes down, more runs become feasible. this is the basic equation for how much of the start phase of any project smol developer can consume, especially for smol/oneoff/internal apps.
this also perhaps opens up a fun definition of agi - when you can one-shot 100% of an app.
I'm curious how much of this is speculative vs from use?
We are doing code generation for our users (analysts & data scientists), and I found going by method names & type signatures failed pretty hard, while contextual code snippets (example docs, usages, etc. you get from a vector DB) worked quite well.
An old program synthesis labmate did quite well pre-genAI by using type signatures, but my takeaway is to go neurosymbolic for more context + option pruning, not to be terse in support of denser summarization & composition. That'd be better... But maybe training would need to change for that to work, or something deeper?
Separately, I do agree it's interesting to think about what IDEs and langs and dev will look like 5 years from now... smarter LLMs, bigger context windows, and supporting ecosystems change a lot...
I am speculating; I’ve only done this at a small scale and had anecdotally good results. I just put examples in the comment for the function signature.
…but I can say with complete confidence that generating code from a coherent set of c header files works better than generating “one shot” full code files just from a high level goal and a file name.
You’re basically reducing the problem space from literally anything to a subset that has a defined structure.
(It helps that #include literally maps to a flat single header in c and having comments in c headers is typical, perhaps?)
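Concretely, the "examples in the comment" trick just means the declaration carries a tiny usage snippet; a made-up example:

    /* slug.h -- hypothetical */
    #ifndef SLUG_H
    #define SLUG_H

    #include <stddef.h>

    /* Convert a title into a URL slug: lowercase, words joined by '-'.
     *
     * Example:
     *   char buf[32];
     *   slugify("Hello World", buf, sizeof buf);   // buf == "hello-world"
     */
    void slugify(const char *title, char *out, size_t out_len);

    #endif

The example pins down behaviour the bare signature leaves open (casing, separator), which is exactly the ambiguity the detail pass would otherwise have to guess at.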
yes, this is the essence of prompt engineering: you provide more context and it makes the job easier. The more specific you are, the better the results, up to a point. Sometimes the model will have trouble breaking out of the context or examples to create the novelty we seek.
You might like what we are working on: generate an intermediate representation with LLMs, then use deterministic code gen to write the real code. This setup also allows for modifications and updates, so you can use it the whole time rather than as a one-time bootstrapping step.
While current LLMs aren't really strong enough, I wouldn't completely write off the "LLM goes brrrr" approach vs domain-specific optimisations.
How much effort is it worth investing in domain-specific tooling / languages for codegen? On some level it's a bet against LLMs getting better (unless the work has intrinsic value that extends beyond being a workaround for LLM limitations).
I could see a world where, if you create the right architecture, complex tasks can be broken into smaller individual tasks where your only concern is the outcome and not the underlying code. Very deterministic.
Essentially all the things we developers care about might not matter. Who cares if the LLM repeats itself? DRY won’t apply anymore because there might not be a reason to share code!
LLM go brrrr until it gets the right output, and the code turns into more “machine learning black box” stuff.
having to define the function signature in one place and its implementation in another adds to the cognitive burden of human developers, which is why we don't see it much in modern languages. instead, what if libraries emitted the equivalent of .h files that work better with LLMs? is there currently a spec for this?
.d.ts files are produced by the TypeScript compiler automatically from .ts files (and can be written manually for .js files). ML signature files are much like .h files, and I think for the same reason - to make the compiler's work easier.
Edit: as this is an LLM thread - ML is Meta Language, as in OCaml and SML.
I’m not sure what you’re imagining, but in this case I would not imagine you would generate js or write .d.ts files?
An LLM pass with a high level goal would generate a file list, then a series of .d.ts files from it.
Then (after perhaps a review of the type definition files, possibly also LLM assisted) a second pass taking the .d.ts files as input would generate a typescript file for every .d.ts file.
You would then discard the .d.ts files and just have a scaffolded .ts code base?
My point was doing the same trick with say, Java, seems like a harder problem to solve, but you could do the above right now with existing LLMs.
It's all a kludge, really. You shouldn't have to make up-front decisions about where to put type signatures; you should be able to query for them and see them where and when you need them. We're still stuck in the local minimum of programming directly in the code serialization/storage format. It's a lot like using an SQLite database by only ever viewing and editing it in a hex editor.
How do human programmers develop code? They also don't keep all the code in their head and then dump thousands of lines of code linearly in one go. Instead of feeding whole files, or even whole functions, as context, we could give the AI a sandboxed shell with an editor (maybe ed?) and a language toolchain with a test framework. Let it do trial and error, and get feedback from the terminal when it makes mistakes.
I’m building an AI-powered coding tool and the approach I’m using now is based on embeddings.
We do pass the whole files, not just headers, although that’s a possibility we considered and may try in the future. Looking at other code helps the LLM a lot in maintaining a similar style and making sure the code is being used correctly. The sad reality is that interfaces are rarely descriptive and robust enough to code against them without looking at the details.
We don’t pass the entire codebase because even small projects don’t fit; we use embeddings and some GPT assistance to decide which files are more likely to be relevant to the task and pass those. It doesn’t get it right 100% of the time (we’re working on it), but it does most of the time.
Our approach allows us to write and edit several files from a single user-provided prompt. It can build entire features in existing codebases as long as they’re not huge. A lot of the time it honestly looks like magic.
The link is https://kamaraapp.com if someone is interested in trying it and providing feedback I’m happy to send over some free credits :)
> It's interesting to me that modern languages try to move all of these domains (header, implementation, tests) into a single place (or even a single file, look at Rust)
That's because language designers are hitting against fundamental limitations of programming directly in the serialization format. That is, plaintext code. IMHO, this is a dead-end path, already hitting diminishing returns, because it's the equivalent of designing binary formats for the convenience of a person writing files directly in a hex editor, or using a magnetized needle to flip bits on the hard drive.
Once you step above machine instructions, computer code is an abstract construct. Developers interacting with it have, at any given moment, different goals and different areas of interest. Those are often mutually exclusive - e.g. you can't have the same text express both a high-level "horizontal" overview of the code base and a vertical slice through specific functionality. This breeds endless conflicts and talk about tradeoffs when it comes to e.g. "lots of small functions" vs. "few large functions", or how to organize code into files, and then how to organize files, etc. There is no good answer to those, because the problem is caused by us programming with needles on magnetic plates - writing to the serialization format directly.
Smalltalk et al. got it right all that time ago, with putting the serialization format behind a database-like abstraction, and letting you read and write at the level most fitting to your task. This is the way past the endless "clean code" holy wars and syntax churn in PL. For example, the solution to "lots of small functions" vs. "few large functions" readability is... use whichever you need for the context, but have your IDE support you in this.
Need a high-level overview? Query for function signatures, and expand those you need to look into. Need a dense vertical slice through many modules, to understand how specific feature works? Start at some "entry point", and have the tool inline all those small functions for you.
Find yourself distracted by Result<T, E> / Either<T, E> / Expect<T, E> monadic error handling bullshit, as you want to see/edit only the "golden path"? Don't do magic ?-syntax in the language - have your IDE hide those bits for you! Or say you are interested in the error case, but code is using exceptions. Stop arguing for rewriting everything to Result<T, E> - flip a switch, and have your IDE display exception flow as if it was Result<T, E> flow (they're effectively equivalent anyway).
Or, back to your original point - want your code to be both optimized for your convenience, and for convenience of LLMs? Stop trying to do both in the same plaintext serialization format. Have your environment feed the LLMs with function signatures (or better yet, teach it to issue queries against your code), while you work with whatever format you find most convenient at any given moment.
We'll get there eventually. Hopefully before we retire.