Hacker News | mrg3_2013's comments

Had a chuckle at the mention of Stalin. Made me think: the evil ones would be the ones who would badly want to live forever, if the option were presented.


Cue the hot-mic moment between Putin and Xi where they discussed living longer through modern medical miracles.


That makes total sense now, indeed!


This resonates with me. Too much of anything loses value, and that includes life. If there were no death, it would take special individuals to make sense of it.


Not specific to this, but if you can solve the issues with progressive lenses, you have a big market. Progressive lenses suck. They simply do not work, cause too much eye strain, and are practically unusable day to day. What would be great is a digital lens that changes depending on your activity: driving, switch to distance vision; in front of a laptop, switch to reading glasses; and so on. I will be your customer if you solve this.


This.


OpenAI continues to muddy the benchmarks, while Anthropic continues to improve Claude's intelligence. Claude will win long term. It'd be wise not to rely on OpenAI at all. They are the first movers who will just burn cash and crash out, I suspect.


This is just Amazon's 'me too' play. I doubt anyone serious in the LLM space would consider this.


DOA

When marketing talks about the price delta and not the quality of the output, it is DOA. For LLMs, quality is the more important metric, and Nova would be playing catch-up with the leaderboard forever.


Maybe. The major models seem to be about tied in terms of quality right now, so cost and ease of use (e.g. you already have an AWS account set up for billing) could be a differentiator.


Using LLMs via Bedrock is 10x more painful than using the direct APIs. I could see cost consolidation via a cloud marketplace as a play, but I don't see Amazon's own LLM initiatives ever taking off. They should just shut those shops down and buy one of the frontier labs (while they are still cheap).


The major models are not tied in terms of quality. GPT-4 and GPT-o1 still beat everyone else by a significant margin on tasks that require in-depth reasoning. There's a reason people don't just go for the cheapest option, whatever the benchmarks say.


Exactly. Citing cost has been an AWS play that worked during the early days of cloud, so they are trying to stick to it. It doesn't work in the AI world. No one wants a faster/cheaper model that gives poor results (besides, the cost of frontier models keeps coming down, so these are just dead initiatives IMO).

On LLMs, my experience with Claude has been much better than with OpenAI's models (though my use case is mostly code generation).


> GPT-4 and GPT-o1 still beat everyone else by a significant margin on tasks that require in-depth reasoning

I haven't seen examples of this. Do you know where I could find some?


Here's a fairly simple test that I throw at any model that claims to be "GPT-4 level": https://news.ycombinator.com/item?id=42262661

For more complicated stuff, I did some experiments using LLMs to drive high-level AI decisions in video games. Basically, the model gets a data schema and a question like "what do you do next?", and can query the schema to retrieve the info it thinks it needs to give the best answer. GPT-4 and GPT-o1 especially are consistently the best performers there, both in the richness of the queries they produce and in how they make use of the results.
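To make the setup concrete, here's a minimal sketch of that query loop. This is my own illustrative reconstruction, not the actual harness: the schema, the game state, the `QUERY`/`ACTION` protocol, and every function name are assumptions. The model repeatedly asks for pieces of state until it commits to an action.

```python
import json

# Hypothetical schema the model is shown up front (illustrative only).
SCHEMA = {
    "units": ["id", "hp", "position"],
    "resources": ["gold", "wood"],
}

# Hypothetical game state backing the schema.
GAME_STATE = {
    "units": [{"id": 1, "hp": 40, "position": [3, 5]}],
    "resources": {"gold": 120, "wood": 30},
}

def run_query(query: str):
    """Resolve a dotted query like 'resources.gold' against the game state."""
    node = GAME_STATE
    for part in query.split("."):
        node = node[part]
    return node

def decide_next_action(ask_llm):
    """Let the model request data via QUERY lines until it emits an action."""
    transcript = [
        f"Schema: {json.dumps(SCHEMA)}",
        "Question: what do you do next?",
    ]
    while True:
        reply = ask_llm("\n".join(transcript))
        if reply.startswith("QUERY "):
            q = reply[len("QUERY "):]
            transcript.append(f"RESULT {q} = {json.dumps(run_query(q))}")
        else:
            return reply  # final decision, e.g. "ACTION gather wood"

# Stub standing in for a real LLM call, so the loop can be exercised offline:
def fake_llm(prompt: str) -> str:
    if "RESULT resources" not in prompt:
        return "QUERY resources"
    return "ACTION gather wood"
```

The interesting comparison between models is then how rich and well-targeted their `QUERY` lines are, and whether the final action actually uses the returned results.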

There's also a bunch of interesting examples along the same lines here: https://github.com/cpldcpu/MisguidedAttention. Although I should note that even top OpenAI models have trouble with much of this stuff.

https://github.com/fairydreaming/farel-bench is another interesting benchmark because it's so simple, and yet look at the number disparity in that last column! It's easy to scale, too.

Unfortunately, we're still at the point in this game where even seemingly trivial and unrelated changes to the prompt (slightly rewording it, and in some cases even capitalization) can have a large effect on output quality, which IMO is a tell-tale sign that the model is operating in "stochastic parrot" mode more than doing any kind of actual reasoning. Thus benchmarks can be used to screen out poorly performing models, but they cannot reliably predict how well a model will actually do what you need it to do.


Looks interesting! I think showing the cost of running resources would be more helpful.


Thanks, that's generally more readily available in the billing explorer for most cloud providers.

There is a definite case for pulling usage data from the cloud account to make right-sizing suggestions, though; that's on the roadmap.


Interesting point! In my case, it is mostly from China, so I could geo-block.


ok thanks.

