
> But if you look carefully, you will notice that it doesn’t struggle with undefined behavior in C. Or with making sure that all memory is properly freed. Or with off-by-one errors.

Doubt. These things have been trained to emulate humans, so why wouldn't they make the same mistakes that humans do? (Yes, they don't make spelling errors, but most published essays etc. don't have spelling errors, whereas most published C codebases do have undefined behaviour).





There are some misconceptions here.

It's incorrect to think that because it is trained on buggy human code it will make these mistakes. It predicts the most likely token. Let's say 100 programmers write a function: most (unless it's something very tricky) won't forget to free that particular function. So the most likely tokens are those which do not leak.

In addition, this is not GPT-3. There's a massive amount of reinforcement learning at play, which reinforces good code, particularly verifiably good (which includes no leaks). There's also a massive amount of synthetic data, which can be generated in a way that is provably correct.
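
To make "verifiably good (no leaks)" concrete, here's a made-up sketch of the kind of leak that tooling can flag mechanically - the function and the 256-byte buffer are my own invention, not anything from a real training pipeline:

    /* leak_demo.c - hypothetical example of the "forgot to free" pattern.
     * Build with a leak checker, e.g.:
     *   cc -g -fsanitize=address leak_demo.c && ./a.out </dev/null
     * LeakSanitizer reports the allocation leaked on the early-return path. */
    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Reads one line and upper-cases it; caller must free() the result. */
    static char *read_line_upper(FILE *f)
    {
        char *buf = malloc(256);
        if (buf == NULL)
            return NULL;

        if (fgets(buf, 256, f) == NULL)
            return NULL;               /* BUG: leaks buf on EOF or read error */

        for (size_t i = 0; buf[i] != '\0'; i++)
            buf[i] = (char)toupper((unsigned char)buf[i]);
        return buf;
    }

    int main(void)
    {
        char *line = read_line_upper(stdin);
        if (line != NULL) {
            printf("%s", line);
            free(line);                /* the common, "most likely" path does free */
        }
        return 0;
    }

A sanitizer pass like that is cheap to run over generated code, which is the sort of automatic check a training pipeline could plausibly use as a reward signal.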


> Let's say 100 programmers write a function: most (unless it's something very tricky) won't forget to free that particular function. So the most likely tokens are those which do not leak.

You don't free a function.

And this would only be true if the function is essentially the same content with minor variations, which is why LLMs are better suited to very small examples: bigger examples are less likely to be semantically similar to anything in the training data, so there is less data to determine the "correct" next token.

> There's a massive amount of reinforcement learning at play, which reinforces good code, particularly verifiably good (which includes no leaks)

This is a really dubious claim. Where are you getting this? Do you have some information on how these models are trained on C code specifically? How do you know whether the code they train on has no leaks?

There are huge projects that everyone depends on that have memory bugs in them right now. And these are actual experts missing these bugs; what makes you think the people at OpenAI are creating safer data than the people whose livelihoods actually depend on it?

This thread is full of people sharing how easy it is to make memory bugs with an LLM, and that has been my experience as well.


I'm not very experienced with C++ at all, but Sonnet in Copilot/Copilot Chat was able to create entire files with no memory errors on the first try for me, and it was very adept at hunting down memory errors (they were always my own fault) from even just vague descriptions of crashes.

> entire files with no memory errors

How do you know? I can believe that they didn't show memory errors in a quick test run on a common architecture with a common compiler, much like most human-written code in the training corpus.
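
For example (contrived, not something a model actually wrote for me), this builds, usually prints "hello" and exits 0 in a quick run on a common compiler and architecture, and is still undefined behaviour that AddressSanitizer flags immediately:

    /* ub_demo.c - hypothetical illustration: code that "passes" a quick run
     * but is undefined.
     *   cc ub_demo.c && ./a.out               -> often prints "hello", exit 0
     *   cc -g -fsanitize=address ub_demo.c    -> reports stack-buffer-overflow */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char buf[5];              /* room for "hello" but not the '\0' */
        strcpy(buf, "hello");     /* BUG: writes 6 bytes into a 5-byte buffer */
        printf("%s\n", buf);      /* frequently appears to work anyway */
        return 0;
    }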


It wasn't code worth formally verifying, but even your description beats almost any programmer's first pass. With how good it is at finding bugs if you ask it, I have little reason to doubt its output.

"I don't understand the code it's writing, but I know it's smarter than you devs" is not very persuasive.

> even your description beats almost any programmer's first pass

Sure, but having access to merely mildly superhuman programming ability still doesn't make using C a good idea.


In the real world, I'd say 90% of the C code written is somewhere between "worthwhile to spend extra effort to detect and avoid memory errors" and "worth formally verifying".

Sure, for prototype-sized codebases it might be able to find the mistakes a fresh grad might easily make, or memory bugs might not even be a big problem there - but in my experience it happily adds memory bugs to large codebases and multithreaded code (bugs that I think an experienced human could easily spot, tbh).


Don't train them to emulate humans. Train them to emulate compilers instead.

Exactly why Rust is a better candidate for LLM-generated code than C, IMO.

I've had issues with Claude and memory-related bugs in C. Maybe for small programs or prototypes it's fine, if you can verify the output or all the expected inputs, but the moment the context is >50k lines, or you're doing something with pthreads, you run into the exact same problems as humans do.
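
A contrived sketch of the pthread-shaped trouble I mean (not actual Claude output, just the pattern): an unsynchronized counter that usually looks fine in a small run but is a data race.

    /* race_demo.c - hypothetical example of a classic pthread data race.
     *   cc -g -fsanitize=thread race_demo.c -lpthread && ./a.out
     * ThreadSanitizer reports the race; without it the total is just wrong
     * some of the time. */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;          /* shared, but not protected by a mutex */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++)
            counter++;                /* BUG: racy read-modify-write */
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("%ld (expected 200000)\n", counter);
        return 0;
    }

Small test runs often print the right number, which is exactly why this kind of bug slips through a quick "it works" check.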

I think Claude would do much better with the tools provided by modern C++ or Zig than with C, frankly. Or, even better, as the Rust people have helpfully mentioned, Rust.



