
Not a physicist, but I see it this way too. My understanding of Boltzmann brains is that they are a theoretical consequence of infinite time and space in a universe with random quantum fluctuations. And that those random fluctuations would still be present in an otherwise empty universe. So then this article has no bearing on the Boltzmann brain thought experiment or its ramifications.


If they had experimented using a newer model (Gemma 3, DeepSeek-R1 7B, etc.) and reported better results, would that be because their newer baseline model was better than the Llama 2 model used in the previous methods' experiments? A more comprehensive study would include results for as many baseline models as possible. But there are likely other researchers in the lab all waiting to use those expensive GPUs for their experiments as well.


Sure. But papers take a really long time to write and go through peer review. I think my paper on collaborative editing took about 4 months from the point where we were done writing to the point at which it appeared on arxiv.

This research was almost certainly done well before Gemma 3 and DeepSeek were released.


It’s the fundamentals that underlie Stable Diffusion, DALL-E, and various other SOTA image, video, and audio generation models. They’ve also started taking off in the field of robotics control [1]. These models are trained to incrementally nudge samples of pure noise onto the distribution of their training data. Because they’re trained on noised versions of the training set, the models are able to better explore, navigate, and make use of the regions near the true data distribution in the denoising process. One of the biggest issues with GANs, by contrast, is a thing called “mode collapse” [2]. A toy sketch of the noise-then-denoise training objective follows the links below.

[1] https://www.physicalintelligence.company/blog/pi0

[2] https://en.wikipedia.org/wiki/Mode_collapse
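
To make the training idea concrete, here is a stripped-down caricature in PyTorch: noise up clean samples, then train a small network to predict the noise that was added. The architecture, shapes, and linear noising schedule are all made up for illustration; real diffusion models use a specific noise schedule and parameterization, so treat this as a sketch of the concept, not any of those models.

    import torch
    import torch.nn as nn

    # Toy "denoiser": given a noised 2-d sample and a noise level t, predict the noise.
    denoiser = nn.Sequential(nn.Linear(2 + 1, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

    def training_step(clean_batch):
        # clean_batch: (N, 2) points drawn from the "true" data distribution
        t = torch.rand(clean_batch.shape[0], 1)       # random noise level in [0, 1)
        noise = torch.randn_like(clean_batch)         # pure Gaussian noise
        noised = (1 - t) * clean_batch + t * noise    # interpolate toward noise
        pred = denoiser(torch.cat([noised, t], dim=1))
        loss = ((pred - noise) ** 2).mean()           # learn to predict the added noise
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

    # Toy dataset: points on the unit circle
    data = torch.randn(256, 2)
    data = data / data.norm(dim=1, keepdim=True)
    for _ in range(200):
        training_step(data)

At sampling time you'd start from pure noise and repeatedly apply the learned denoiser to nudge the sample back onto the data distribution, which is the incremental nudging described above.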


Thank you


I ran into exactly the same pain point, which was enough to nullify the benefits of using zod at all.


Could you help explain how we would achieve an attention score of exactly 0, in practice? Here’s my take:

If we’re subtracting one attention matrix from another, we’d end up with attention scores between -1 and 1, with a probability of effectively 0 for any single entry to exactly equal 0.

What’s more, the learnable parameter \lambda allows for negative values. This would allow the model to learn to actually add the attention scores, making a score of exactly 0 impossible.
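
For concreteness, here's a quick numpy sketch of the subtraction in question, with toy shapes and random projections (this reflects my reading of the differential attention setup, so the details are assumptions on my part):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 8, 16      # sequence length and head dimension (toy sizes)
    lam = 0.5         # the learnable lambda, held fixed here for illustration

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    # Two independent query/key projections, giving two softmax attention maps
    Q1, K1, Q2, K2 = (rng.standard_normal((n, d)) for _ in range(4))
    A1 = softmax(Q1 @ K1.T / np.sqrt(d))
    A2 = softmax(Q2 @ K2.T / np.sqrt(d))

    scores = A1 - lam * A2                        # entries lie in (-lam, 1)
    print(scores.min(), scores.max())
    print("exact zeros:", np.sum(scores == 0.0))  # almost surely none

Which is exactly the point: the difference can be small or negative, but an entry landing on exactly 0.0 has measure zero.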


Your comment brings up two variants that could be interesting if your goal is to increase the sparsity of the attention:

- Rectify the difference of the softmaxes. (max(0, s(A1) - lambda s(A2)))

- Replace the second softmax term with a Heaviside step of the difference. (s(A1) - lambda H(s(A1) - lambda s(A2)))

The second one is a bit more drastic and probably harder to train, since the Heaviside step has zero gradient almost everywhere.
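
Here's a self-contained sketch of the first (rectified) variant, with random stand-ins for the two attention maps, just to show it does produce exact zeros. In practice you'd presumably want to re-normalize the rows and think about the dead-gradient problem, so this is only the idea:

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    n, lam = 8, 0.5
    A1 = softmax(rng.standard_normal((n, n)))   # stand-in for s(A1)
    A2 = softmax(rng.standard_normal((n, n)))   # stand-in for s(A2)

    rectified = np.maximum(0.0, A1 - lam * A2)  # variant 1: ReLU of the difference
    print("fraction of exact zeros:", np.mean(rectified == 0.0))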


My first exposure to computer architecture was through a Minecraft video[1] (which I likely stumbled upon on Digg). In Linear Algebra lecture the next day, I overheard my classmates discussing the video. I purchased the game later that week.

Seeing the circuitry of a computer in this way helped me to understand that computers operated by means other than pure magic. And, the video I saw was much less descriptive of how a computer works than the one the OP linked. So, although neither video amounts to a full college course on the topic, there’s still a lot of value in their ability to expose people to the topic. It’s inspiring to see how computers are mostly a composition of NAND gates, and to compare the massive structures in the videos with the microprocessors of the real world.

[1] https://youtu.be/LGkkyKZVzug?si=hZRdmablPt15gGqn


There’s a video[1] of Karpathy recounting an email correspondence he had with Bahdanau. The email explains that the word “Attention” comes from Bengio who, in one of his final reviews of the paper, determined it to be preferable to Bahdanau’s original idea of calling it “RNNSearch”.

[1] https://youtu.be/XfpMkf4rD6E?t=18m23s


"RNNSearch is all you need" probably wouldn't catch on and we'd still be ChatGPT-less.


Worked with PageRank and "map reduce", tho.


Nerds pay attention nevertheless.


Imagine you are an LLM and all you see are tokens. Your job is not only to predict the next token in a sequence, but also to create a nice embedding for each token (where two similar words sit next to each other). Given a small enough latent space, you're probably not concerning yourself too much with the "structure inside" the tokens. But given a large enough latent space, and a large enough training corpus, you will encounter certain tokens frequently enough that you will begin to see a pattern. At some point during training, you are fed:

1) An English dictionary as input.

2) A wiki page listing words that start with "app" as input.

3) Other alphabetically sorted pieces of text.

4) Elementary school spelling homework.

5) Papers on glyphs, diphthongs, and other phonetic concepts.

You begin to recognize that the tokens in these lists appear near each other in this strange context. You have hardly ever seen token 11346 ("apple") and token 99015 ("appli") this close to each other before. But you see them together frequently enough that you decide to nudge these two tokens' embeddings closer to one another.

Your ability to predict the next token in a sequence has improved. You have no idea why these two tokens are close every ten millionth training example. Your word embeddings start to encode spelling information. Your word embeddings start to encode handwriting information. Your word embeddings start to encode phonic information. You've never seen or heard the actual word, "apple". But, after enough training, your embeddings contain enough information so that if you're asked, ["How do", "you", "spell", "apple"], you are confident as you proclaim ["a", "p", "p", "l", "e", "."] as the obvious answer.
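
Here's a toy version of that nudging step, with made-up tokens and a tiny embedding table. Real models never apply an explicit rule like this; the pull comes implicitly from gradient descent on the next-token loss. It's only meant to illustrate the mechanism:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["apple", "appli", "banana", "the"]
    emb = {w: rng.standard_normal(8) for w in vocab}   # tiny 8-d embedding table

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print("before:", cosine(emb["apple"], emb["appli"]))

    # Each time "apple" and "appli" co-occur in one of those strange contexts
    # (dictionaries, sorted lists, spelling homework), pull their vectors together a bit.
    for _ in range(200):
        a, b = emb["apple"], emb["appli"]
        emb["apple"] = a + 0.05 * (b - a)
        emb["appli"] = b + 0.05 * (a - b)

    print("after:", cosine(emb["apple"], emb["appli"]))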


Can you explain what people mean by an "embedding" or "embedding space"? It seems like something really abstract and...supernatural?


An object in the real world can be located in 3d space. You can say that one representation of that object is as a point in that space; it is embedded in a 3d embedding space.

Of course, those coordinates are not the only way in which the object can be represented, but for a certain problem context, these location coordinates are useful.

Given objects A, B, C, or rather, given their coordinates, one can tell which two are closest to each other, or you can find the point D that completes the parallelogram formed by A, B, and C. In fact, this allows you to do analogy tests like "A:B :: C:D". This is all standard vector algebra.

Now, imagine each word associated with a 100-dimensional vector. You can do the same thing. Amazingly, one can do things like "man : woman :: king : ?" and get the answer "queen", just by treating each word as a vector and looking up the inverse mapping from vector to word. It almost feels ... intelligent!

This embedding -- each word associated with an n-D vector -- is obtained while training neural nets. In fact, now you have readymade, pre-trained embedding approaches like Word2Vec.

https://www.tensorflow.org/tutorials/text/word2vec
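
Mechanically it looks like this, with hand-made 2-d vectors instead of real Word2Vec vectors (the numbers are invented purely for illustration):

    import numpy as np

    # Invented 2-d "embeddings": dimension 0 ~ royalty, dimension 1 ~ gender
    words = {
        "king":  np.array([0.9,  0.8]),
        "queen": np.array([0.9, -0.8]),
        "man":   np.array([0.1,  0.8]),
        "woman": np.array([0.1, -0.8]),
    }

    target = words["king"] - words["man"] + words["woman"]   # "man:woman :: king:?"

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Inverse mapping: find the word whose vector is closest to the target
    best = max((w for w in words if w != "king"), key=lambda w: cosine(words[w], target))
    print(best)   # queen

With real 100- or 300-dimensional vectors learned from a large corpus, the same nearest-vector lookup is what produces the "queen" answer.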


An embedding is an n-dimensional vector (think of it as a sequence of n numbers).

During training, each token (or word) gets an Embedding assigned.

Critically, _similar words will get similar embeddings_. And "similar words" could mean similar either semantically or (as in the example above) syntactically ("apple" and "appli").

And being vectors, you can do operations on them. To give the classic example, you could do: Embedding(`king`) + Embedding(`female`) = Embedding(`queen`).


It's completely mind blowing that you can add the vectors like that.


Imagine you think of 2 numbers to describe a basketball. You give a number for weight (1), and redness (0.7). Now, a basketball can be described by those 2 numbers, (1, 0.7). That is an embedding of a basketball in 2d space. In that coordinate system a baseball would be less heavy and less red, so maybe you would embed it as (0.2, 0.2).

basketball ==> (1.0, 0.7)  # heavier, redder

baseball ==> (0.2, 0.2)  # less heavy, less red

When an LLM (large language model) is fed a word, it transforms that word into a vector in n-dimensional space. For example:

basketball -> [0.5, 0.3, 0.6, ... , 0.9] # Here the embedding is many, many numbers

It does this because computers process numbers, not words. These numbers all represent some property of the word/concept basketball in a way that makes sense to the model. It learns to do this during its training, and the humans that train these models can only guess what the embedding mappings it's learning actually represent. This is the first step of what an LLM does when it processes text.
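
In code, that first step really is just a lookup into a big learned table. A minimal PyTorch sketch (the vocabulary size, dimension, and token ids are made up):

    import torch
    import torch.nn as nn

    vocab_size, dim = 50_000, 768                 # made-up sizes
    embedding = nn.Embedding(vocab_size, dim)     # one learnable vector per token id

    token_ids = torch.tensor([[1034, 87, 9021]])  # hypothetical ids for a short input
    vectors = embedding(token_ids)                # shape: (1, 3, 768)
    print(vectors.shape)

During training, the rows of that table get adjusted along with the rest of the model, which is how they end up encoding the kinds of properties described above.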


I have no idea if these concepts are similar, but as a machine learning beginner, I found the concept of a "perceptron" [1] to be useful in understanding how networks get trained. IIRC a perceptron is either activated or not activated by a particular input, depending on the weights the network has learned so far in training. What it means to be activated or not depends on that perceptron's overall function. That perceptron is like a single "cell" of the larger network, maybe like the cells in your brain.

When I read the GP description referring to "embedding" above I thought of the perceptron.

Definitely not supernatural at all. The act of making an automaton that "can perceive" feels to me like it's closer to the opposite. Taking that which might seem mystical and breaking it down into something predictable and reproducible.

[1] https://en.wikipedia.org/wiki/Perceptron
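
For what it's worth, a single perceptron really is that small. A toy sketch with hand-picked weights (not trained, just to show the activate-or-not behavior):

    import numpy as np

    def perceptron(x, w, b):
        # Fires (1) or stays silent (0) depending on the weighted sum of its inputs
        return 1 if np.dot(w, x) + b > 0 else 0

    # Hand-picked weights that make this perceptron compute a logical AND
    w = np.array([1.0, 1.0])
    b = -1.5
    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, perceptron(np.array(x), w, b))

Training a perceptron just means adjusting w and b until it fires on the inputs you want it to fire on.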


> you are confident as you proclaim ["a", "p", "p", "l", "e", "."] as the obvious answer.

Is it possible for the current generation of LLMs to assign confidence intervals to their responses?

That's my main qualm with ChatGPT so far: sometimes it will give you an answer, but it will be confidently incorrect.


Yes, but it has some issues in the latest models.

> GPT-4 can also be confidently wrong in its predictions, not taking care to double-check work when it’s likely to make a mistake. Interestingly, the pre-trained model is highly calibrated (its predicted confidence in an answer generally matches the probability of being correct). However, after the post-training process, the calibration is reduced (Figure 8).

pages 10-11: https://cdn.openai.com/papers/gpt-4.pdf


I don't know exactly how it works, but using GPT-3 via https://platform.openai.com/playground/, you can have it assign a likelihood score to each word, given all the previous text. That could act as a good confidence score.

Take this with a grain of salt though, I'm far from an expert, and it's been a while since I've played around with that feature.
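
I don't know the playground's internals either, but the underlying idea is just a softmax over the model's output logits at each position. A sketch with GPT-2 via Hugging Face transformers as an open stand-in (GPT-3 itself isn't downloadable, and the prompt is mine):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("The capital of France is Paris", return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits, dim=-1)

    # Probability the model assigned to each token, given everything before it
    for pos in range(ids.shape[1] - 1):
        next_id = ids[0, pos + 1]
        print(repr(tok.decode(next_id)), f"{logprobs[0, pos, next_id].exp().item():.3f}")

Whether those per-token probabilities translate into a trustworthy confidence for a whole answer is a separate question, as the calibration discussion above suggests.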


Not an expert myself, but I imagine that generating output that expresses confidence would be distinct from any measure of confidence within the inner workings of GPT itself.


If it's learning from human behavior, this is nothing new. Our society of late has been rewarding confidence over questioning and that's likely reflected in the ChatGPT training corpus.


Supremely interesting stuff


An interesting aspect of these cryptocurrencies is that consensus arises not just through the intended mechanisms like PoW, but through societal acceptance. Look at BTC and BTG (Bitcoin and Bitcoin Gold). One has the suffix "gold" while the other maintains the (arguably?) superior lack of any such embellishments. Was it the miners' decision to call it that? Was it the users'?

Look at Ethereum vs. Ethereum Classic. Same deal. We have two chains that share a common history, yet at some point the users of both decided to split, and then society had to come to a consensus on what each chain would be called. Again, did the miners sit around and decide which chain would be called "Ethereum"? I don't think so. I think the decision was decentralized and emergent.

My point is, even if there was a nefarious actor who attempted a 51% attack, it seems like there would be enough of a societal pressure to ignore their empty blocks. There would exist a chain that would still be valued by the perpetrators, but not so much by the individuals being harmed by such an attack. The attacked chain would be maintained and acquire a new name "Bitcoin Hacked" or something similar, and the chain where society ignores the empty blocks would go on its merry way still being called "bitcoin."


I take the point you're making (in principle, there is some off-chain consensus at work here), but for these particular examples I believe society didn't decide these names as much as the people deciding to do the forks did.


Reminds me of Asimov's Second Law of Robotics[1].

[1] https://en.wikipedia.org/wiki/Three_Laws_of_Robotics

