elzbardico's comments | Hacker News

Ok man, now give me a moment, I need to administer my weightloss weekly injection of L-Histidyl-2-methylalanyl-L-alpha-glutamylglycyl-L-threonyl-L-phenylalanyl-L-threonyl-L-seryl-L-alpha-aspartyl-L-valyl-L-seryl-L-seryl-L-tyrosyl-L-leucyl-L-alpha-glutamylglycyl-L-glutaminyl-L-alanyl-L-alanyl-N6-(N-(17-carboxy-1-oxoheptadecyl)-L-gamma-glutamyl-2-(2-(2-aminoethoxy)ethoxy)acetyl-2-(2-(2-aminoethoxy)ethoxy)acetyl)-L-lysyl-L-alpha-glutamyl-L-phenylalanyl-L-isoleucyl-L-alanyl-L-tryptophyl-L-leucyl-L-valyl-L-arginylglycyl-L-arginylglycine

1 for the money 2 for the green 3,4-methylenedioxymethamphetamine

LLMs have this strong bias towards generating code, because writing code is the default behavior from pre-training.

Removing code, renaming files, condensing, and other edits are mostly post-training behavior, learned through supervised fine-tuning. There are armies of developers across the world, making 17 to 35 dollars an hour, solving tasks step by step; their work is then used to generate prompt/response pairs of desired behavior for a lot of common development situations, including the desired output for things like tool calling, which is what deleting code requires.

A typical human task in post-training dataset generation looks like this: given this Dockerfile for a Python application, running pytest fails with the exception "foo not found". The human notices that the package foo is not installed, edits requirements.txt and writes this down, then tries pip install and notices that the foo package requires a certain native library to be installed. The final output is a response with the appropriate tool calls in a structured format.
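
To make that concrete, here is a rough sketch of what one such training example could look like, written as a Python dict. The field names, tool names, package version, and library name below are purely illustrative assumptions, not any lab's actual schema:

  sft_example = {
      "prompt": (
          "In this Dockerfile-based Python project, pytest fails with "
          "'ModuleNotFoundError: No module named foo'. Fix the environment."
      ),
      "response": [  # the structured tool calls the model should learn to emit
          {"tool": "edit_file", "args": {"path": "requirements.txt", "append": "foo==1.2.3"}},
          {"tool": "run", "args": {"cmd": "pip install -r requirements.txt"}},
          {"tool": "edit_file", "args": {"path": "Dockerfile",
                                         "insert_after": "FROM python:3.12",
                                         "text": "RUN apt-get update && apt-get install -y libfoo-dev"}},
          {"tool": "run", "args": {"cmd": "pytest"}},
      ],
  }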

Given that the amount of unsupervised learning is far greater than the amount of fine-tuning for most models, it is not surprising that, in any ambiguous situation, the model will default to what it knows best.

More post-training will usually improve this, but the quality of the human-generated dataset will probably be the upper bound on output quality, not to mention the risk of overfitting if the foundation-model labs embrace SFT too enthusiastically.


> Writing code is the default behavior from pre-training

what does this even mean? could you expand on it


During pre-training the model is learning next-token prediction, which is naturally additive. Even if you added DEL as a token, it would still be quite hard to restructure the data so that deletions can be learned as a next-token prediction task. Hope that helps.
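
As a minimal sketch of why that objective is "additive" (assuming a standard causal-LM setup; the function below is illustrative, not taken from any particular codebase), the loss only rewards predicting the next token of existing text:

  import torch.nn.functional as F

  def next_token_loss(logits, token_ids):
      # logits: (seq_len, vocab_size) model outputs; token_ids: (seq_len,) text tokens.
      # Each position is trained to predict the token that follows it. The stream
      # only ever grows, so "delete what you just wrote" has to be taught later
      # with curated post-training examples.
      return F.cross_entropy(logits[:-1], token_ids[1:])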

He means that it is heavily biased to write code, not remove, condense, refactor, etc. It wants to generate more stuff, not less.

Because there are not a lot of high-quality examples of code editing in the training corpora, other than maybe version control diffs.

Because editing/removing code requires the model to output tokens for tool calls that get intercepted by the coding agent.

Responses like the example below are not emergent behavior; they REQUIRE fine-tuning. Period.

  I need to fix this null pointer issue in the auth module.
  <|tool_call|>
  {"id": "call_abc123", "type": "function", "function": {"name": "edit_file",     "arguments": "{"path": "src/auth.py", "start_line": 12, "end_line": 14, "replacement": "def authenticate(user):\n    if user is None:\n        return   False\n    return verify(user.token)"}"}}
  <|end_tool_call|>

I'm not disagreeing with any of this. Feels kind of hostile.

I clicked reply on the wrong level. And even then, I assure you I am not being hostile. English is a second language to me.

I don't see why this would be the case.

Have you tried using a base model from HuggingFace? They can't even answer simple questions. Give a base, raw model the input

  What is the capital of the United States?
And there's a fucking big chance it will complete it as

  What is the capital of Canada? 
as much as there is a chance it could complete it with an essay about early American republican history, or a sociological essay questioning the very idea of capital cities.

Impressive, but not very useful. A good base model will complete your input with things that generally make sense and are usually correct, but a lot of the time they are completely different from what you intended it to generate. It is like a very smart dog: a genius dog that was never trained and most of the time refuses to obey.
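
You can see this for yourself with the HuggingFace transformers pipeline; a minimal sketch, using GPT-2 as a stand-in for a generic un-finetuned base model (the prompt and sampling settings are just an illustration):

  from transformers import pipeline, set_seed

  set_seed(42)
  generator = pipeline("text-generation", model="gpt2")  # raw base model, no instruction tuning
  prompt = "What is the capital of the United States?"
  for out in generator(prompt, max_new_tokens=30, num_return_sequences=3, do_sample=True):
      print(out["generated_text"])
      print("---")
  # Typical continuations are more quiz questions, trivia lists, or essay text:
  # the model continues the document, it does not "answer" you.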

So even simple behaviors, like acting as one party in a conversation the way a chatbot does, require fine-tuning (the result being the *-instruct models you find on HuggingFace). In machine learning parlance, this is what we call supervised learning.

But in the case of chatbot behavior, the fine-tuning is not that complex, because we already have a good idea of what conversations look like from the training corpora; a lot of this was already encoded during the unsupervised learning phase.
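
Concretely, a large part of what the instruct fine-tune adds is a conversation format plus "the assistant answers now" behavior. You can inspect that format through a chat-tuned model's chat template (zephyr-7b-beta is used here only as an example of an instruct checkpoint):

  from transformers import AutoTokenizer

  tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
  messages = [{"role": "user", "content": "What is the capital of the United States?"}]
  # Prints the user turn wrapped in special chat tokens, ending with the cue for
  # the assistant turn: the structure that SFT teaches the model to complete.
  print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))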

Now, let's think about editing code, not simply generating it. Let's do a simple experiment. Go to your project and issue the following command:

  claude -p --output-format stream-json "your prompt here to do some change in your code" | jq -r 'select(.type == "assistant") | .message.content[]? | select(.type? == "tool_use")'
Pay attention to the incredible number of tool-use calls the LLM generates in its output. Now think of this as a whole conversation: does it look even remotely similar to anything the model would find in its training corpus?

Editing existing code, deleting it, or refactoring it is a far more complex operation than just generating a new function or class: it requires the model to read the existing code, come up with a plan identifying what needs to be changed or deleted, and then generate output with the appropriate tool calls.

Sequences of tokens that simply create new code basically have lower entropy, and are more probable, than the complex sequences that edit and refactor existing code.


It’s because that’s what most resembles the bulk of the tasks it was being optimized for during pre-training.

Small errors compound over time.

It probably helped that this was a very simple application and that it started with an already large test suite.

Funniest part:

> ..oh and the app still works, there's no new features, and just a few new bugs.


The rich will somehow be deemed too big to fail and will be bailed out.

We will leave the bill for some teacher pension fund in Minnesota.


I like serif fonts, but I never liked Times New Roman much. Printed, in high resolution, it is kind of OK, but I absolutely abhor it on displays, which is where we read things 99% of the time nowadays.

Georgia, Palatino, Bookerly: those are high-quality serif fonts that suit every occasion.

You can use the web.

Should not be using inheritance with persistent entities anyway. OOP is not about creating taxonomies.

Did it a couple of times. Not something you can do with your eyes closed, but not even close to the nightmare of upgrading a JS application or a Rails app.
