
> I stopped playing after the challenge of creating a line-based canvas drawing of the word "Hello".

Similarly, I stopped playing when the question was "Write a html page with javascript/canvas to draw the US flag that changes color when I click on it. Make the flag perfectly accurate." and it generated a flag that wasn't "perfectly accurate" (https://i.imgur.com/WhyRsYa.png, notice the position of the top/left stars), but then told me "Wrong! You guessed that GPT-4 could not solve the question, but it can! 71% of people got this question correct."
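
For what it's worth, the task itself is small: one canvas, thirteen stripes, a canton, fifty stars in nine alternating rows of six and five, and a click handler. Here's a minimal sketch of the kind of page the prompt asks for; the proportions, colors, and the particular color swap on click are my own choices, not the quiz's reference answer:

  <!DOCTYPE html>
  <html>
  <body>
  <canvas id="flag" width="760" height="400"></canvas>
  <script>
  // Minimal sketch: 13 stripes, a canton, 50 stars in 9 alternating
  // rows of 6 and 5, and a click handler that swaps the colors.
  // Proportions are approximate, not the official spec.
  const canvas = document.getElementById('flag');
  const ctx = canvas.getContext('2d');
  let clicked = false;

  function drawStar(cx, cy, r) {
    // Five-pointed star centered at (cx, cy) with outer radius r.
    ctx.beginPath();
    for (let i = 0; i < 5; i++) {
      const outer = -Math.PI / 2 + i * 2 * Math.PI / 5;
      const inner = outer + Math.PI / 5;
      ctx.lineTo(cx + r * Math.cos(outer), cy + r * Math.sin(outer));
      ctx.lineTo(cx + 0.4 * r * Math.cos(inner), cy + 0.4 * r * Math.sin(inner));
    }
    ctx.closePath();
    ctx.fill();
  }

  function drawFlag() {
    const w = canvas.width, h = canvas.height;
    const stripe = h / 13;
    const red = clicked ? '#008000' : '#B22234';   // arbitrary "changed" colors
    const blue = clicked ? '#800080' : '#3C3B6E';
    // 13 stripes, alternating red and white, starting with red.
    for (let i = 0; i < 13; i++) {
      ctx.fillStyle = i % 2 === 0 ? red : '#FFFFFF';
      ctx.fillRect(0, i * stripe, w, stripe);
    }
    // Blue canton covering the top 7 stripes and 40% of the width.
    const cw = w * 0.4, ch = stripe * 7;
    ctx.fillStyle = blue;
    ctx.fillRect(0, 0, cw, ch);
    // 50 stars: rows of 6 and 5 alternate, none touching the edges.
    ctx.fillStyle = '#FFFFFF';
    for (let row = 0; row < 9; row++) {
      const cols = row % 2 === 0 ? 6 : 5;
      for (let col = 0; col < cols; col++) {
        const x = cw * (col + (row % 2 === 0 ? 0.5 : 1)) / 6;
        const y = ch * (row + 1) / 10;
        drawStar(x, y, stripe * 0.3);
      }
    }
  }

  canvas.addEventListener('click', () => { clicked = !clicked; drawFlag(); });
  drawFlag();
  </script>
  </body>
  </html>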

I'm not sure how the validation is done; it seems to be manually hardcoded or something, but it doesn't appear to be very reliable.



"Resolution Criteria: On questions that may have ambiguous answers, I will clarify the acceptance criteria. Here, for example, I will copy/paste the code to a file called index.html, open the page, and I expect to see a US flag. If I click on the flag it should change color. The flag must be immediately recognizable: red/white stripes, blue background, 50 stars. It does not have to be dimensionally accurate, but it should be better than a flag I would draw by hand."


I don't think "questions that may have ambiguous answers" applies when you use a term like "perfectly accurate", which has a very specific meaning.


On top of the requirement to make it perfectly drawn, I take issue with "but it should be better than a flag I would draw by hand". That's a useless metric because we don't know how well the author draws by hand.

I assumed they would have enough brain cells to draw the flag without cutting all the stars in the margins in half. But they must have failed kindergarten, because I assumed wrong.


I started again, picking a number at random without looking at the questions. This is what it had to say:

"You answered 39.29% of questions correctly with an average log-loss of 3.880. This means you would have scored better by just leaving your guesses at 50% for every single question.. On average there were 0.00% of people who did better at this game than you. If this was an exam and I was grading it on a curve, I would give you an A+. If you had been better calibrated, you could have scored in the top 23.41%, and I would have given you a B+."

So I did worse than random, but 0% of people did better than me and I got an A+. Nice.
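
For anyone wondering about the scoring: log-loss on a yes/no question is -log(p), where p is the probability you assigned to the outcome that actually happened. A flat 50% guess costs -log(0.5) ≈ 0.693 per question (assuming natural logs; I don't know which base the site uses), so an average of 3.880 corresponds to answers that were confidently wrong, roughly:

  // Log-loss of a single yes/no prediction: -log(probability assigned
  // to the outcome that actually happened). Natural logs assumed here;
  // the quiz site's exact formula is a guess on my part.
  function logLoss(predicted, actual) {
    const p = actual ? predicted : 1 - predicted;
    return -Math.log(p);
  }
  console.log(logLoss(0.5, true).toFixed(3));   // 0.693 -- the "always 50%" baseline
  console.log(logLoss(0.02, true).toFixed(3));  // 3.912 -- confidently wrong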


Note that the prompt is only the input to the LLM; it does not specify the task in enough detail to evaluate the result. That's what the resolution criteria are for: additional information about how the question resolves that you are given but the LLM is not.



