
Forgive my ignorance, but can a model be refined/fine-tuned on a specific codebase, after, say, training on all the standard docs for the language, third-party libs, and so on?

I have no formal idea how this is done, but my assumption is that "something like that" should work.

Please disabuse me of any silly ideas.



Hi Jason! I have a few thoughts on this!

Refined training (fine-tuning) usually means updating the weights of what's called a foundation model with well-structured and plentiful data. It's very expensive and can disrupt the usefulness of all the generalizations baked in from the original training data [1].

While LLMs can generate text based on a wide range of inputs, they're not designed to retrieve specific pieces of information in the same way that a database or a search engine would. But I do think they hold a lot of promise in reasoning.

Two small corollaries: LLMs do not know ahead of time what they are going to generate, and they use your input plus their own prior output to drive the next message.

This sets us up for a strategy called in-context learning [1]. We take advantage of the corollaries above and prime the model with context to drive the next message. In your case, that context would be knowledge from the standard docs, the third-party libs, and the specific codebase your query is about.
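
To make the priming idea concrete, here's a minimal sketch; the codebase name, the snippet, and the prompt layout are made up for illustration, not any standard format:

    # Minimal sketch of in-context priming: put the relevant material in
    # front of the question and let the model work from it. Names here
    # are hypothetical.
    def build_prompt(context_snippets: list[str], question: str) -> str:
        """Context first, then the actual question."""
        context = "\n\n".join(context_snippets)
        return (
            "You are an assistant for the `acme` codebase.\n"
            "Answer using only the context below.\n\n"
            f"--- CONTEXT ---\n{context}\n\n"
            f"--- QUESTION ---\n{question}\n"
        )

    prompt = build_prompt(
        ["def connect(host, port): ...  # opens a pooled DB connection"],
        "How do I open a database connection?",
    )
    # The naive version pastes the whole codebase into CONTEXT, which is
    # exactly where the context-size problem below bites.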

Only there is a big problem: context size. Damn. 4k tokens?

We can be clever about this, but there is still a lot of work and research needed. We can take all that code and the standard docs and create embeddings of them [2]. Embeddings are mathematical representations of words or phrases that capture some of their semantic meaning; basically, the internal state of a trained neural network given some input.
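
Concretely, something along these lines; the sentence-transformers library, the model name, and the example chunks are just one convenient choice for a sketch, not a recommendation:

    # Sketch: embed doc/code chunks as vectors (assumes the
    # sentence-transformers package; any embedding API works similarly).
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose model

    chunks = [
        "requests.get(url, timeout=...) performs an HTTP GET.",
        "pathlib.Path.glob() yields matching files lazily.",
        "FooClient wraps retries around requests.Session.",  # made-up project code
    ]
    # One dense vector per chunk; keep these alongside the original text.
    chunk_vectors = model.encode(chunks, normalize_embeddings=True)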

This lets us group similar words and concepts close together in what is called a vector space. We then do the same for our query and compare it against each chunk, keeping the top-k (or whatever) most similar pairs. There are many ways to find the most similar pairs, but a nice one is cosine similarity: basically a fancy dot product, where a higher score indicates greater similarity. This lets us prime the model with only the most "relevant" information and stay within the context limit. Then we hope the LLM reasons about that information just right, and voila.
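
Continuing the sketch above (reusing model, chunks, and chunk_vectors), the similarity search itself is only a few lines; with unit-length vectors, cosine similarity reduces to a dot product:

    import numpy as np

    # Embed the user's query the same way as the chunks.
    query_vec = model.encode(["how do I make an HTTP request?"],
                             normalize_embeddings=True)[0]

    scores = chunk_vectors @ query_vec        # cosine similarity per chunk
    top_k = np.argsort(scores)[::-1][:2]      # indices of the 2 best matches
    relevant = [chunks[i] for i in top_k]     # text to prime the model with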

So yeah, basically: build a fancy information retrieval system that picks the most relevant information for your model to reason about (basically this [3]), while skirting the context limitations and without overfitting or narrowing the training information that lets these models reason in the first place (controversial).
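
Putting the pieces from the sketches above together (reusing build_prompt and relevant):

    # Prime the model with only the retrieved chunks, then ask.
    prompt = build_prompt(relevant, "how do I make an HTTP request?")
    # send `prompt` to whatever chat/completion endpoint you use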

1: "Language Models are Few-Shot Learners" Brown et al. https://arxiv.org/pdf/2005.14165.pdf

2: "Text and Code Embeddings by Contrastive Pre-Training" Neelakantan et al. https://arxiv.org/pdf/2201.10005.pdf

3: https://twitter.com/marktenenholtz/status/165156810719298355...


Much appreciated, Sun fearing dude, much appreciated.


You can continue training the model on more data after it has been released.



