Hacker News | cheeko1234's comments

I have two. One is a simple one that only deepseek R1 has passed (in my opinion):

I have a 12 liter jug and a 6 liter jug. How do I get exactly 6 liters of water?

Answer (Deepseek): Fill the 6-liter jug completely to obtain exactly 6 liters of water.

Every other LLM I've tried, including o3-mini-high: Fill the 12-liter jug completely. Pour it into the 6 liter jug.

Although o3 did get it right in the reasoning: It seems like the user has a 12-liter jug and a 6-liter jug. The simplest answer is to just fill the 6-liter jug directly with water—done! But maybe there's a catch, like needing to use both jugs somehow.

So it knows that the 12 liter jug is mentioned uselessly, but most LLMs HAVE to use the 12 liter jug since it's mentioned in the prompt.
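For anyone who wants a ground truth to grade answers against, here is a minimal sketch of a breadth-first search over jug states. The function name `shortest_plan` and the move labels are my own, and it assumes an unlimited water supply and only three kinds of move (fill, empty, pour until source is empty or destination is full).

```python
from collections import deque

def shortest_plan(capacities, target):
    """Breadth-first search over jug states; returns the shortest list of moves."""
    start = tuple(0 for _ in capacities)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, plan = queue.popleft()
        if target in state:
            return plan
        for i, cap in enumerate(capacities):
            # Fill or empty jug i.
            moves = [
                (f"fill {cap}L", state[:i] + (cap,) + state[i + 1:]),
                (f"empty {cap}L", state[:i] + (0,) + state[i + 1:]),
            ]
            # Pour jug i into every other jug j.
            for j, other in enumerate(capacities):
                if i == j:
                    continue
                amount = min(state[i], other - state[j])
                nxt = list(state)
                nxt[i] -= amount
                nxt[j] += amount
                moves.append((f"pour {cap}L into {other}L", tuple(nxt)))
            for label, nxt_state in moves:
                if nxt_state not in seen:
                    seen.add(nxt_state)
                    queue.append((nxt_state, plan + [label]))
    return None  # target is unreachable with these jugs

print(shortest_plan((12, 6), 6))  # → ['fill 6L']
```

Because BFS explores plans in order of length, the one-move plan is found before any plan that touches the 12 liter jug.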

Another Prompt I recently found via perplexity (that deepseek got correct on the first try, but other LLMs can only get it right if I add: "Think step by step before answering."

A 67-year-old woman with congenital bicuspid aortic valve is admitted to the hospital because of a 2-day history of fever and chills. Her current medication is lisinopril. Physical examination reveals a temperature of 38.5°C (101.3°F), blood pressure of 110/70 mmHg, and a heart rate of 100/min. Cardiac auscultation demonstrates a grade 3/6 systolic ejection murmur at the right upper sternal border. Blood cultures are obtained, and empiric antibiotic therapy is initiated. Which of the following is the most likely causative organism? A) Staphylococcus aureus B) Streptococcus viridans C) Escherichia coli D) Pseudomonas aeruginosa E) Candida albicans


Wow, Gemini 2 has a 9 step process for the first question https://sugaku.net/qna/1d922e55-fd28-4c70-b3e9-d70c3a05ce1f/

> Here's how you can get exactly 6 liters of water using a 12-liter jug and a 6-liter jug:

1. Fill the 6-liter jug completely.

2. Pour the water from the 6-liter jug into the 12-liter jug. You now have 6 liters in the 12-liter jug.

3. Fill the 6-liter jug completely again.

4. Pour water from the 6-liter jug into the 12-liter jug until the 12-liter jug is full. Since the 12-liter jug already had 6 liters, you will pour 6 liters. This leaves you with 0 liters in the 6 liter jug.

5. Now empty the 12 liter jug.

6. Pour the water from the 6-liter jug into the 12-liter jug.

7. Fill the 6-liter jug completely again.

8. Pour the water from the 6-liter jug into the 12-liter jug until the 12-liter jug is full. Since the 12-liter jug already had 0 liters, you will pour 6 liters. This leaves you with 0 liters in the 6 liter jug.

9. There are 6 liters left in the 6 liter jug.


It really loves the sound of its own voice!


And things get dramatically worse on 4o if you add a second superfluous jug:

    To measure exactly 6 liters of water using a 16-liter jug, a 12-liter jug, and a 6-liter jug, follow these steps:
    Step-by-Step Solution:

        Fill the 12-liter jug completely (it now holds 12 liters).
        Pour water from the 12-liter jug into the 16-liter jug. The 16-liter jug now contains 12 liters, and the 12-liter jug is empty.
        Fill the 12-liter jug completely again (it now holds 12 liters).
        Pour water from the 12-liter jug into the 16-liter jug until the 16-liter jug is full. Since the 16-liter jug already has 12 liters, it can only take 4 more liters. This leaves 8 liters in the 12-liter jug.
        Empty the 16-liter jug completely.
        Pour the remaining 8 liters from the 12-liter jug into the 16-liter jug.
        Fill the 12-liter jug completely again (it now holds 12 liters).
        Pour water from the 12-liter jug into the 16-liter jug until the 16-liter jug contains exactly 16 liters. Since the 16-liter jug already has 8 liters, it can only take 8 more liters, leaving 4 liters in the 12-liter jug.
        Pour the 4 liters from the 12-liter jug into the empty 6-liter jug. The 6-liter jug now contains 4 liters.
        Fill the 12-liter jug completely again (it now holds 12 liters).
        Pour water from the 12-liter jug into the 6-liter jug until the 6-liter jug is full. Since the 6-liter jug already has 4 liters, it can only take 2 more liters. This leaves 10 liters in the 12-liter jug.
        Empty the 6-liter jug completely.
        Pour the remaining 10 liters from the 12-liter jug into the 6-liter jug.

    Now, the 6-liter jug contains exactly 6 liters of water.
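To check the arithmetic in that answer, here is a rough sketch that replays the thirteen steps. The `replay` helper and the tuple encoding of each step are my own reading of the quoted text, and a pour is assumed to stop when the destination is full.

```python
def replay(capacities, steps):
    """Apply fill/empty/pour steps to named jugs and return every intermediate state."""
    level = {name: 0 for name in capacities}
    history = []
    for op, *jugs in steps:
        if op == "fill":
            level[jugs[0]] = capacities[jugs[0]]
        elif op == "empty":
            level[jugs[0]] = 0
        elif op == "pour":
            src, dst = jugs
            # A pour stops when the source is empty or the destination is full.
            amount = min(level[src], capacities[dst] - level[dst])
            level[src] -= amount
            level[dst] += amount
        else:
            raise ValueError(f"unknown step: {op}")
        history.append(dict(level))
    return history

capacities = {"16L": 16, "12L": 12, "6L": 6}
steps = [
    ("fill", "12L"), ("pour", "12L", "16L"), ("fill", "12L"),
    ("pour", "12L", "16L"), ("empty", "16L"), ("pour", "12L", "16L"),
    ("fill", "12L"), ("pour", "12L", "16L"), ("pour", "12L", "6L"),
    ("fill", "12L"), ("pour", "12L", "6L"), ("empty", "6L"),
    ("pour", "12L", "6L"),
]
history = replay(capacities, steps)
for number, state in enumerate(history, start=1):
    print(number, state)
```

Under those assumptions the 6-liter jug is already full after step 11, gets emptied in step 12, and is refilled in step 13, where only 6 of the "remaining 10 liters" fit and 4 stay in the 12-liter jug.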


Interestingly, o3-mini-high was correct when first thinking about it:

> Okay, we're asked how to get exactly 6 liters of water using a 12-liter and a 6-liter jug. The immediate thought is to just fill the 6-liter jug, but that seems too simple, doesn’t it? So maybe there’s a trick here. Perhaps this is a puzzle where the challenge is to measure 6 liters with some pouring involved. I’ll stick with the simple solution for now—fill the 6-liter jug and stop there.


I have to take all these comparisons with a heap of salt because no one bothers to run the test 20 times on each model to smooth out the probabilistic nature of the LLM landing on the right answer. There must be a name for this fallacy, where you sample once from each and declare a definitive winner. I see it all the time.
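To put a number on how little one sample tells you, here is a small sketch using the Wilson score interval for a pass rate. The function name is mine, and z=1.96 is the usual approximation for 95% confidence.

```python
import math

def wilson_interval(passes, trials, z=1.96):
    """Wilson score interval for a pass rate; z=1.96 is roughly 95% confidence."""
    if trials == 0:
        raise ValueError("need at least one trial")
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)

for passes, trials in [(1, 1), (5, 5), (20, 20)]:
    low, high = wilson_interval(passes, trials)
    print(f"{passes}/{trials} correct: true pass rate plausibly {low:.2f} to {high:.2f}")
```

A single correct answer is consistent with a true pass rate as low as roughly 0.21, while 20 out of 20 pushes the lower bound above 0.8.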


>Every other LLM I've tried, including o3-mini-high: Fill the 12-liter jug completely. Pour it into the 6 liter jug.

Try it with a 12L jug and a 4L jug and ask for 4L. See if it tells you to just fill the 4L or to fill the 12L and pour into the 4L twice discarding both times and there will be 4L remaining in the 12L jug.

Even though it's still technically correct, it demonstrates that there's no real "reasoning" happening, just regurgitation of training data.


(Shrug) R1 has no problem with that. To the extent it's confused, it is only because it is wondering if it's a trick question.

CoT reasoning: https://i.imgur.com/rjNmTGZ.png

Answer: https://i.imgur.com/WfAVeZQ.png

There's 'mindless regurgitation' going on here, but not by the AI model.


Deepseek just told me this:

To measure exactly 6 liters using a 12-liter and a 6-liter jug:

1. *Fill the 12L jug completely.*

2. *Pour water from the 12L jug into the 6L jug until it's full.*

After this process, the 12L jug will have exactly 6 liters left.


Tried it in Llama v3.3 70b and it got it right in all 5 samples. I guess the others are interpreting it as a trick question or something.


Is the answer strep viridans?


I just asked the first question to O1 Pro, it thought for 48 seconds and messed up.

It told me to fill the 12 liter jug then pour that into the 6 liter jug.

I then scolded it and told it I was going to use Deepseek R1 because it did a better job, and that we were shutting down OpenAI forever. THEN it thought for 4 minutes and got a bit snippy with me, haha. It got the answer but then glossed over it, and it also said “but that’s too simple isn’t it?”.

Output: Here’s a more “classic puzzle” style approach, spelled out step-by-step, to end up with exactly 6 liters in the 12-liter jug. This method uses multiple fills and transfers (instead of the trivial “fill the 12, pour off 6” approach) so you can see a more puzzle-like sequence:

1. Fill the 6-liter jug from the water supply.
• State: (12-liter jug = 0 L, 6-liter jug = 6 L)

2. Pour from the 6-liter jug into the 12-liter jug until the 6-liter jug is empty (the 12-liter jug still has plenty of room).
• State: (12-liter jug = 6 L, 6-liter jug = 0 L)

3. Fill the 6-liter jug again from the water supply.
• State: (12-liter jug = 6 L, 6-liter jug = 6 L)

4. Pour from the 6-liter jug into the 12-liter jug until the 12-liter jug is completely full.
• The 12-liter jug currently has 6 liters and can take 6 more liters before it’s full, so you’ll pour the entire 6 liters from the 6-liter jug.
• State: (12-liter jug = 12 L, 6-liter jug = 0 L)

5. Empty the 12-liter jug (discard the water or set it aside).
• State: (12-liter jug = 0 L, 6-liter jug = 0 L)

6. Transfer the remaining water in the 6-liter jug to the 12-liter jug—but right now the 6-liter jug is empty, so that doesn’t help. The trick is: we want to exploit the difference in capacities to measure out new amounts. So let’s continue. Since both are empty, let’s do something else:

a. Fill the 12-liter jug from the supply.
• State: (12-liter jug = 12 L, 6-liter jug = 0 L)

b. Pour from the 12-liter jug into the 6-liter jug until the 6-liter jug is full, which removes 6 liters from the 12-liter jug.
• Now the 12-liter jug has 6 liters left.
• State: (12-liter jug = 6 L, 6-liter jug = 6 L)

At this point, the 12-liter jug contains exactly 6 liters. That’s what we wanted.

If You Want the 6 Liters in the Smaller Jug Instead

If, for some reason, you need exactly 6 liters in the 6-liter jug (though that jug obviously holds 6 liters when full, so it’s usually trivial), here’s a more puzzle-like sequence:

1. Fill the 12-liter jug (12 L in the big jug, 0 L in the small jug).

2. Pour from the 12-liter jug into the 6-liter jug until the small jug is full.
• State: (12-liter jug = 6 L, 6-liter jug = 6 L)

3. Empty the 6-liter jug.
• State: (12-liter jug = 6 L, 6-liter jug = 0 L)

4. Transfer the 6 liters from the 12-liter jug into the 6-liter jug.
• State: (12-liter jug = 0 L, 6-liter jug = 6 L)

Again, you end up with exactly 6 liters in the 6-liter jug.

Both sequences go beyond the simple “fill the 12, pour 6 off” short-cut. Hopefully, that’s less “terrible” and shows a more classic multi-step approach often seen in water-jug riddles.



Me visiting a listing for a toaster means that I'm in the market for a toaster, doesn't necessarily imply I already bought one.

Also, with this data, they can potentially build a better model than Amazon's recommender model. Most consumers have a pattern when buying and there's a lot of money in figuring it out.


It failed my basic benchmark that only GPT4 passes:

Q: I have a 12 liter jug and a 6 liter jug. I need 6 liters of water. How do I do it?

Falcon180B: First, fill the 6-liter jug completely. Then, pour it into the 12-liter jug. Fill the 6-liter jug again, and pour it into the 12-liter jug until the 12-liter jug is full. At this point, there will be exactly 6 liters of water in the 6-liter jug, which you can then use.

Q: Using only a 6-liter jug, how can I get exactly 6 liters of water?

Falcon180B: You cannot get exactly 6 liters of water in a 6-liter jug because it would already be full. However, if you had another container that could hold at least 6 liters of water, you could follow these steps:

1. Fill the 6-liter jug with water.

2. Pour the water from the 6-liter jug into the other container.

3. Refill the 6-liter jug with water and pour it into the other container.

4. Repeat step 3 until you have poured 6 liters of water into the other container.

This way, you can get exactly 6 liters of water by combining the capacity of the 6-liter jug and the other container.


Probably every single thread comparing gpt (3.5 or especially 4) needs a copy-paste caveat that it's not really fair to compare a "bare" model like falcon (not even its instruct version) with the chatgpt _service_. The service part is crucial, because it includes advanced features such as "secret sauce" output sampling (while on huggingface all models by default use basic greedy sampling). No preprompt, depending on subject detection and so on. For a fair comparison we would need to compare with the exact same prompt and exact same output sampling. And that's not going to happen with (not at all) OpenAI.


> "secret sauce" output sampling

are you referring to beam search? something else?


Beam search is well known. I mean strategies like beam search, but ones we don't know about.

I can imagine some, for example like beam search but you score every option with a smaller model. Of course one can say "but we see every token as it streams" to which I might say, are you sure? Perhaps they generate a hundred entire responses in the time it takes for one token to be shown. They just "stream" those tokens so slow to make it more "human pace" oriented.
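A rough sketch of that kind of idea, using best-of-n reranking rather than beam search proper: draw several complete responses, score each with a second model, and only show the winner. Everything here is hypothetical. `best_of_n`, `toy_generate`, and `toy_score` are stand-ins I made up, not anything OpenAI is known to run.

```python
import random

def best_of_n(prompt, generate, score, n=8, seed=0):
    """Sample n full responses and return the one the scorer rates highest."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda text: score(prompt, text))

# Hypothetical stand-ins for a large generator and a small scoring model.
CANNED = [
    "Fill the 6-liter jug.",
    "Fill the 12-liter jug, then pour into the 6-liter jug until it is full.",
]

def toy_generate(prompt, rng):
    return rng.choice(CANNED)

def toy_score(prompt, text):
    return -len(text)  # this toy scorer simply prefers shorter answers

print(best_of_n("I have a 12 liter jug and a 6 liter jug...", toy_generate, toy_score))
```

The user would only ever see the selected response streamed out, which is why watching tokens arrive one at a time does not rule this sort of thing out.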


interesting. but there should be physical limits to that that we can handicap to put bounds on speculation. so for example, FLOPS has an upper bound and you can make latency estimates for 1/10/100B models. this would put reasonable bounds for statements like "a hundred entire responses in the time it takes for one token to be shown"
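A back-of-envelope version of that bound, as a sketch. It assumes the common rule of thumb of about 2 FLOPs per parameter per generated token for a dense model, and a hypothetical accelerator budget of 1e15 FLOP/s. Both numbers are assumptions, and memory bandwidth and batching are ignored, so real limits can differ a lot.

```python
def max_tokens_per_second(params, peak_flops, flops_per_param=2.0):
    """Compute-only upper bound on decoding speed for a dense model."""
    return peak_flops / (flops_per_param * params)

def seconds_per_response(params, peak_flops, response_tokens):
    """Lower bound on the time to generate one response of the given length."""
    return response_tokens / max_tokens_per_second(params, peak_flops)

PEAK_FLOPS = 1e15  # hypothetical compute budget, not a measured figure

for params in (1e9, 10e9, 100e9):
    rate = max_tokens_per_second(params, PEAK_FLOPS)
    latency = seconds_per_response(params, PEAK_FLOPS, response_tokens=500)
    print(f"{params / 1e9:.0f}B params: at most {rate:,.0f} tokens/s, "
          f"at least {latency:.4f} s per 500-token response")
```

With these made-up inputs, a 100B model tops out at 5,000 tokens/s, so a hundred sequential 500-token responses would take about 10 seconds on that budget.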


I just asked GPT-4 your first question and got this in response. It's still a correct result, but with unnecessary steps:

> Fill the 12-liter jug and pour water into the 6-liter jug until it's full. Empty the 6-liter jug and pour the remaining water from the 12-liter jug into it. You'll have 6 liters in the 6-liter jug.


My exchange:

Me:

  I have a 12 liter jug and a 6 liter jug. I need 6 liters of water. How do I do it?
ChatGPT:

  Using the 12-liter jug and the 6-liter jug, you can get exactly 6 liters of water in the following way:

  1. Fill the 12-liter jug to its maximum capacity.

  2. Pour the water from the 12-liter jug into the 6-liter jug until the 6-liter jug is full. This will leave you with 6 liters of water in the 12-liter jug.

  Now you have 6 liters of water in the 12-liter jug.
Me:

  Can you think of a simpler way to achieve this?
ChatGPT:

  Yes, a simpler way to achieve this is:

  1. Fill the 6-liter jug to its maximum capacity.

  That's it. You now have 6 liters of water in the 6-liter jug!

I wish I understood why prompting it a second time gets a better answer.


I bet when you said a 12 litre jug and a 6 litre jug it wrongly assumed that you required it to actually make use of both the jugs in some way (not merely that they were available for possible use), leading to the pointless step.


Seems right! If you make it more of an inventory list of tools, it answers correctly.

> I have two jugs: a 12 liter jug and a 6 liter jug. I need 6 liters of water. How do I do it?

> GPT-4: If you just need 6 liters of water and you have a 6-liter jug, you simply fill the 6-liter jug to the top with water. You'll have exactly 6 liters! No need to use the 12-liter jug in this case.


This video covers the concept pretty well: https://www.youtube.com/watch?v=IJEaMtNN_dM

It is pretty normal to try to incorporate the extraneous details into the reply.


I would bet a high percentage of humans would do the same thing if prompted as such.


I've noticed that the LLMs are all tuned to emit corporate speak.

Everyone I've encountered that adds lots of obfuscating and tangential details to their day-to-day speech (and tries to establish that particular tone of faux-inclusivity and faux-authority) has turned out to be a sociopath and/or compulsive liar. I find it interesting that LLMs have the same symptom and underlying problem.


Isn't the right answer just fill the 6-liter jug? I don't get it. Is it supposed to be a trick question?


What about the ketchup test? Ask it to tell you how many times the letter e appears in the word ketchup. Llama always tells me it's two.


Spelling challenges are always going to be inherently difficult for a token-based LM. It doesn't actually "see" letters. It's not a good test for performance (unless this is actually the kind of question you're going to ask it regularly).


I've found it's more reliable to ask it to write some javascript that returns how many letters are in a word. Works even with Llama 7b with some nudging.
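For reference, this is roughly the kind of snippet you'd hope the model writes, shown here in Python rather than JavaScript. The function name is just illustrative.

```python
def count_letter(word, letter):
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())

print(count_letter("ketchup", "e"))  # → 1
```

The code operates on characters, so it sidesteps whatever the tokenizer did to the word.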


Falcon fails. GPT-3.5 also fails this test. GPT-4 gets it right. I suspect that GPT-4 is just large enough to have developed a concept of counting, whereas the others are not. Alternatively, it's possible that GPT-4 has memorized the answer from its more extensive training set.


It's not possible for an LLM to count letters; it only "sees" tokens.


Bard can also give the correct result.


Is this supposed to be a trick question? How can it be a good question for testing an AI if a human cannot understand it either?

I think if you ask this question on different websites (to humans) you will get many different and confused answers. So why bother asking an AI? I don't even know what the right answer is.


I don’t think this is a particularly useful benchmark.

It’s well known that LLMs are bad at math. The token based weighting can’t properly account for numbers that can vary wildly. Numbers are effectively wildcards in the LLM world.


Surely this is a "didn't read the question properly" problem rather than a "didn't maths right" problem?

And that (understanding a natural language question) is the USP for LLMs.


I don't buy it. In any common usage "6 liter jug" means a jug capable of holding 6 liters, not with a volume of 6 liters including the walls.


I don't understand your comment. Falcon said that it's impossible to measure 6 liters of water with a 6 liter jug.


Surely the reason LLMs fail here is because this is an adaptation of a common word problem, except your version has been tweaked so that there is a trivial answer.


Yes, that's the point of the question. We want to know if it's actually doing some reasoning, or if it has just memorized an answer.


It's the latter. For every LLM out there. They are trained to memorize, not reason. It will take radically different training techniques to make these networks reason in a human-like way.


Memorising is so trivial we've been doing it by default since forever, regardless of if that means magnetic core memory, the Jacquard Loom, the Gutenberg press, the ceramic movable type China had for a few centuries before Gutenberg, or using a stick to smudge words into soft clay tablets that were accidentally made permanent by a house fire.

AI like this aren't just memorisation.

They almost certainly don't think like us — even if they did at a low level, the training regime would take the equivalent of hundreds of human lifetimes, and the number of parameters in the larger models is a thousandth of the number in a human brain.


Then how do you explain zero-shot performance?


This does not look like a good benchmark test for an LLM capability.


I, a human, have no idea how to answer this weird question, why do you suppose an AI would do better?

I can’t work out if it’s a joke question or a serious question?


This day got a lot better! I have thousands of hours but had taken a break after I reached a rocket launch every minute with 100s of trains running around.

I guess this break will be over soon. I look forward to dropping a few more lifetimes playing this amazing game.


Thanks for the link although for some reason I keep getting:

Incorrect API key provided error on jupyter for chatgpt. I'm on a paid account so not sure why...


I use the following test to ensure I'm on GPT4 and not 3.5. (I noticed that it did fail at this test temporarily and then got it. Not sure why. Maybe it reverts back to 3.5 when under load?)

I have a 12 liter jug and a 6 liter jug. I want to measure 6 liters. How do I do it?

GPT4: You actually don't need to do anything because one of your jugs is already a 6-liter jug. If you fill it up to the top, you'll have exactly 6 liters of water.

GPT-3.5: To measure exactly 6 liters using a 12-liter jug and a 6-liter jug, you can follow the steps below:

Start with both jugs empty. Fill the 12-liter jug completely with water. Pour the water from the 12-liter jug into the 6-liter jug. This will leave you with 6 liters of water in the 12-liter jug. Empty the 6-liter jug. Pour the 6 liters of water from the 12-liter jug back into the empty 6-liter jug. Now, you have 6 liters of water in the 6-liter jug. At this point, you have successfully measured 6 liters using the 12-liter jug and the 6-liter jug.


You can't evaluate them with a single prompt, single execution! Any given output is just a sample from a range of possible outputs, but all of them (ALL) are considered plausible returns. You have to think of it as returning a distribution as a random variable (in high dimensional space) but unfortunately there's only 1 sample from it at a time. Maybe some day they'll let us control a random seed.
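A toy illustration of the point, assuming a made-up two-option distribution. The logits below are invented and `sample` is only a sketch of temperature sampling, not how any particular service decodes.

```python
import math
import random
from collections import Counter

def sample(logits, temperature, rng):
    """Draw one option from softmax(logits / temperature); temperature 0 means greedy."""
    if temperature == 0:
        return max(logits, key=logits.get)
    scaled = {tok: val / temperature for tok, val in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(val - top) for tok, val in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]

# Invented scores: the "wrong" answer is only slightly more likely.
logits = {"fill the 6L jug": 1.0, "fill the 12L jug, pour into the 6L": 1.2}

rng = random.Random(42)  # a fixed seed makes the whole run reproducible
draws = Counter(sample(logits, temperature=1.0, rng=rng) for _ in range(1000))
print(draws)
print(sample(logits, temperature=0, rng=rng))
```

With a fixed seed the same sequence of draws comes back every time, but any single draw still tells you very little about the underlying split.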


It figures it out once you let it reflect on its answer: Consider the following situation: You have a 12 liter jug and a 6 liter jug, and you want to measure out exactly 6 liters of water. First, generate an initial solution for this problem. Then, think about the solution you've generated, considering if there might be a simpler or more straightforward way to achieve the goal. If there is, please provide the more accurate or simpler solution.


Interesting approach. 3.5 gave me this as its second answer:

1. Fill the 6 liter jug to its maximum capacity. 2. Pour the contents of the 6 liter jug into the 12 liter jug, which will leave 6 liters in the 6 liter jug. 3. You have now measured out exactly 6 liters of water using the 6 liter jug.


Yeah 3.5 still can't figure it out, 4 does though. Hilariously 3.5 gave me this:

Fill the 6-liter jug completely with water. Pour the water from the 6-liter jug into the 12-liter jug. Fill the 6-liter jug again with water. Pour water from the 6-liter jug into the 12-liter jug until the 12-liter jug is full. This leaves 4 liters of water in the 6-liter jug. Empty the 12-liter jug. Pour the 4 liters of water from the 6-liter jug into the empty 12-liter jug. Fill the 6-liter jug with water again. Pour water from the 6-liter jug into the 12-liter jug until the 12-liter jug is full. This leaves exactly 6 liters of water in the 12-liter jug.


It’s funny you say this as I just asked ChatGPT 4 and got this response.

Here is a simple solution to your problem:

1. Fill the 12-liter jug completely. 2. Use the water in the 12-liter jug to fill the 6-liter jug. Now you have 6 liters remaining in the 12-liter jug, which is exactly what you need.

So, you have successfully measured 6 liters.


Is GPT 3.5 nerfed too? ..hehe XD

To measure exactly 6 liters using a 12-liter jug and a 6-liter jug, you can follow these steps:

Start with both jugs empty. Fill the 12-liter jug completely. Pour the contents of the 12-liter jug into the 6-liter jug. This leaves you with 6 liters in the 12-liter jug and a full 6-liter jug. Empty the 6-liter jug. Pour the 6 liters from the 12-liter jug into the empty 6-liter jug. Refill the 12-liter jug. Fill the 6-liter jug using the 12-liter jug until the 6-liter jug is full. This requires pouring 4 liters from the 12-liter jug into the 6-liter jug, which leaves 8 liters in the 12-liter jug. At this point, you have 6 liters in the 6-liter jug and 8 liters in the 12-liter jug. Empty the 6-liter jug. Pour the remaining 2 liters from the 12-liter jug into the empty 6-liter jug. Finally, refill the 12-liter jug and pour 6 liters from the 12-liter jug into the 6-liter jug. The 6-liter jug will now be full, and you will have successfully measured 6 liters using the given jugs.

By following these steps, you can accurately measure 6 liters using a 12-liter jug and a 6-liter jug.


Ummm... lmfao, this thing is on drugs.

(ChatGPT 4)

You can measure exactly 6 liters using a 12 liter jug and a 6 liter jug by following these steps:

Fill the 6-liter jug completely.

Pour the water from the 6-liter jug into the 12-liter jug. Now you have 6 liters of water in the 12-liter jug.

Fill the 6-liter jug again.

Carefully pour water from the 6-liter jug into the 12-liter jug until the 12-liter jug is full. Since the 12-liter jug already contains 6 liters, you will be able to pour in only another 6 liters to make it full. This leaves you with exactly 6 liters of water in the 6-liter jug.

Congratulations, you now have measured exactly 6 liters of water using a 12-liter jug and a 6-liter jug!

> https://chat.openai.com/share/929e68a3-9c67-44c8-8fbc-b555c1...


ChatGPT4:

>> I have a 12 liter jug and a 6 liter jug. I want to measure 6 liters. Please give me the simplest possible solution.

> You already have a 6 liter jug, so you don't need to do anything additional to measure 6 liters. Simply fill the 6 liter jug to its full capacity, and you will have your 6 liters of water.

Am I providing a hint, or am I being more specific in my query? idk.


I have home assistant running with quite a few esp32's running esphome with random sensors/bulbs scattered around the house.

My question is, what do you use to manage the automations? Do you use the native home assistant automation, or node-red, etc? I've even looked a bit into Room assistant[0]

[0] https://www.room-assistant.io/guide/#how-it-works


It's all the built-in automation UI in Home Assistant, but with fairly heavy use of scripts and scenes to encapsulate behaviour. Most automations are just a series of conditionals that then call out to a script for the smart bits.


I would also recommend Moon Machines[0] for an amazing in-depth experience of the Apollo program:

Part 1: The Saturn V Rocket

Part 2: The Command Module

Part 3: The Navigation Computer

Part 4: The Lunar Module

Part 5: The Space Suit

Part 6: The Lunar Rover

Available on vimeo[1] and youtube.

[0] https://en.wikipedia.org/wiki/Moon_Machines

[1] https://vimeo.com/673970849


Please revive Posterous!


He did in the form of Posthaven [1] but I'd say please keep it updated. The last post on the Posthaven blog is five years old.[2]

[1] https://posthaven.com/ [2] https://blog.posthaven.com/read-about-how-fly-has-helped-wit...


Incredible, a public + private UGC hosting service with no service terms or code of conduct (that I could find, even in onboarding). That's so brave lol


could you elaborate what Posterous did well, for those of us who never saw it?


I liked the ability to cross post to Twitter, Facebook, Plurk etc. It was like the one place blog control panel, make a post there and spread it out.


have you tried Buffer for that? any pain points there?

(asking because i have minor pain points but not sure if enough)


I'm uninterested in doing it these days

