More

smeeth · 2025-09-10T15:14:39 1757517279

If your "impossible" designs are manufactured by non-exclusive suppliers it isn't much of a moat.

smeeth · 2025-09-09T15:40:41 1757432441

Hate to break it to you, but many kids actually do better away from their parents than with them.

It's extremely sad, but a consistent finding in early childhood education is that the children who thrive most in daycares tend to come from the least advantaged backgrounds.

So a policy of paying parents to stay home would mostly benefit kids who are already well off.

yardstick · 2025-09-09T15:43:57 1757432637

Kids are social and like playing and learning from other kids. Daycare lets them do just that. It’s a great thing and every toddler I’ve met who wasn’t in daycare was behind in something. Especially verbal skills.

Plus daycare allows women to continue their career progression. It’s soo important. Not every woman wants to end their career as a mother to a young kid. Daycare enables successful women to thrive and still have families.

declan_roberts · 2025-09-09T16:26:11 1757435171

"Why do you want a thriving career?"

"So I can provide for my family"

"Why do you want to provide for your family?"

"So my children can have happy and fulfilling lives"

"What makes your young children feel happy?"

"Spending time with me"

A strong parent-child relationship is the biggest determination of life-long child happiness even into old age.

https://pmc.ncbi.nlm.nih.gov/articles/PMC4784487/

yardstick · 2025-09-11T01:14:16 1757553256

You can have a strong parent-child relationship while still using daycare.

Also people work due to other reasons unrelated to providing for their family. Individuals are allowed to have lives outside their kids.

garciasn · 2025-09-09T15:59:23 1757433563

Your anecdote is just that. All of it is highly dependent on the child, their environment, and the 'educator'. Please don't make assumptions based on your limited exposure; it's not helpful.

Spivak · 2025-09-09T16:09:50 1757434190

Your "it depends" argument is that some kids aren't social, don't like playing with other kids, are better off not having exposure to social interaction with peers and practice talking.

If this is the criticism then it's a glowing endorsement of daycare and school.

garciasn · 2025-09-09T16:13:06 1757434386

No; it depends on the 'educator'. A daycare that doesn't have kids interacting in a positive way could be just as detrimental as a parent that doesn't socialize their children externally to the home.

xp84 · 2025-09-09T15:50:44 1757433044

I'm just gonna throw this out here: Well-off kids who barely know their workaholic parents have different but equally bad issues for society, than the poor kids do.

Those poor kids have learning deficits. The "well-off" kids often have morality deficits.

A mom or dad raising them properly might help them more than being Student #642 in a government childcare facility.

This isn't an argument against childcare. My children attended preschool for 3 years before Kindergarten. But I'd rather that people got equal support to have a stay-at-home parent so that people can choose.

smeeth · 2025-09-09T16:16:46 1757434606

Do you have any evidence for that?

From what I’ve seen, the research leans the other way. For example:

Children from more advantaged families were actually more likely to view unfair distribution as unfair, while poorer children were more likely to accept it. [0]

Mother’s work hours show no link to childhood behavioral problems, it’s schedule flexibility that matters. [1]

For working-class families, more father work hours correlated with fewer behavioral problems.[2]

The idea that “well-off kids” end up with morality deficits because their parents work a lot doesn’t seem to hold up.

[0] https://onlinelibrary.wiley.com/doi/10.1111/desc.13230

[1] https://pmc.ncbi.nlm.nih.gov/articles/PMC9119633/

[2] https://pmc.ncbi.nlm.nih.gov/articles/PMC7021583/

nevir · 2025-09-09T15:55:09 1757433309

Like all things: the extremes are never good, and it's all about getting a healthy balance.

- Kids need lots of time with their parents

- Kids need lots of time around other kids

You can do that by sending them to daycare, and ALSO spending lots of time with them when they're home.

You can also do that by taking time off work, and then taking your kid(s) to places with other kids.

Both work; and it depends on your context which works for you.

ujkhsjkdhf234 · 2025-09-09T15:55:10 1757433310

You aren't wrong but calling it being "Student #642 in a government childcare facility" the wrong way of looking at it. Children grow up best when they are allowed to play with other children. Modern society robs kids of that and helicopter parents are bad for society.

xp84 · 2025-09-09T16:23:30 1757435010

I agree with you vigorously on both those points. I am skeptical however that NM will be able to create a lot of healthy, play-based environments for so many kids.

The market already has incentives to create them -- a ton of good places have waiting lists nationwide, showing unmet demand even at the current price. This suggests the price will need to go higher to attract enough people to do this job. It seems their "$12,000 value" estimate is based on an optimistic belief that they will be buying childcare for their citizens at current prices. When they realize there aren't that many slots available at current rates of pay, will they be okay significantly increasing the costs of the program?

So, my expectations for these facilities are very low and that's a big part of my concern.

ujkhsjkdhf234 · 2025-09-12T15:17:02 1757690222

I don't know where you live but where I live, the cost of daycare is extremely high and there is a waitlist on most places.

HatchedLake721 · 2025-09-09T16:19:02 1757434742

> Hate to break it to you, but many kids actually do better away from their parents than with them.

Is this based on something?

There's research left and right shows that children under 36 months at group nurseries are linked to increased aggression, anxiety, lower emotional skills, elevated cortisol (stress hormone), which is associated with long-term health and developmental risks.

Infants and children do better with one-to-one care at home by their parents and familiar faces, rather than strangers in a group setting.

cvoss · 2025-09-09T16:22:28 1757434948

Perhaps there is something about the environment of an economically disadvantaged household that could be improved by a stipend which allows at least one parent the breathing room to dedicate full time attention to the child instead of a job (or multiple jobs). I don't think the findings you mentioned cut against that idea at all.

I hear you saying the benefit of dedicated caregiving for children mostly helps families with less economic advantage. I'd agree with that, and suggest that OP's proposal capitalizes on exactly that. I'm not convinced of what may be implied in your argument that low-earners make for bad parents and that children should be separated more from their parents for their own good. Let the internal dynamics of a family be solved first, before saying we need to separate parents from children more.

Moreover, those with more economic advantage are unlikely to take a stipend in exchange for staying home. That's not a good deal when keeping the job pays so much that they can afford to pay for childcare.

It is precisely those with less advantage who will take the deal.

So I don't agree with your prediction that such a stipend mostly benefits those who are already well off.

bko · 2025-09-09T16:41:48 1757436108

> that the children who thrive most in daycares tend to come from the least advantaged backgrounds.

So the children that do well in daycare comes from poor homes? So kids from rich home don't do well in daycare?

Every interaction I've ever had says the opposite. The disruptive bully at school usually comes from a broken home.

MisterTea · 2025-09-09T15:58:19 1757433499

My daycare was called preschool. It allowed my mother to focus on my infant brother during the day while I was literally two blocks away running around, coloring and learning shapes. Show and tell was my favorite.

Minor49er · 2025-09-09T15:48:54 1757432934

> Hate to break it to you, but many kids actually do better away from their parents than with them.

How so?

smeeth · 2025-09-09T15:51:10 1757433070

The most obvious example is the children of addicts. It’s hard to imagine a kid is better off stuck at home with druggie parents than spending the day in daycare.

declan_roberts · 2025-09-09T16:31:41 1757435501

A good example of bottom quintile policy. Because the bottom quintile has a better outcome with a certain approach, it becomes standard care for everyone else.

Once you see it, you'll see it everywhere.

smeeth · 2025-09-09T17:35:17 1757439317

…so?

A realistic stay-at-home subsidy would max out around $30k. Your proposal only meaningfully shifts incentives for the bottom income quintile. For everyone else:

- Upper-income families can already afford to choose whatever setup they want.

- Middle-income families couldn’t take it because it’d mean too steep a drop in income.

So the alternative you proposed economically benefits the bottom quintile while leaving their kids worse off. For everyone else, it probably either doesn't matter or gives them cash they don't need as much.

Minor49er · 2025-09-09T16:43:33 1757436213

Anyone would be better off being away from addicts though

smeeth · 2025-09-05T17:39:47 1757093987

Another day, another person not getting discounted cash flow.

Models trained in 2025 don’t ship until 2026/7. That means the $3bn in 2025 training costs show up as expense now, while the revenue comes later. Treating that as a straight loss is just confused.

OAI’s projected $5bn 2025 loss is mostly training spend. If you don’t separate that out with future revenues, you’re misreading the business.

And yes, inference gross margins are positive. No idea why the author pretends they aren’t.

smeeth · 2025-09-02T10:11:25 1756807885

As far as analogies go I prefer approximate database

smeeth · 2025-08-14T14:19:52 1755181192

I've been annoyed for a while people don't use a common parameter weight/compute budget for benchmarking papers.

That said, it does make it easier to claim progress...

pizza · 2025-08-14T17:23:23 1755192203

https://github.com/KellerJordan/modded-nanogpt is pretty great in that respect

godelski · 2025-08-14T23:48:31 1755215311

As a researcher, I can totally agree, but at the same time this isn't super straight forward. Things get weird because you can't just translate from one GPU to another. There isn't a clean calculation for that. There's also other issues like parallelism. Sure, your model is stable with a batch size of 8192 but that's across 1 node, it might not be stable with that batch across 2 nodes. This is a real frustrating part and honestly I don't think most people even are aware such issues exist.

Right now I'm just happy when people are including parameter, GMACs (or FLOPs), and throughput. I always include those and the GPUs I used. I also frequently include more information in the appendix but frankly when I include it in the front matter the paper is more likely to be rejected.

I can tell you why this isn't happening though. There's a common belief that scale is all you need. Which turns into "fuck the GPU poor". I've published works where my model is 100x smaller (with higher throughput, and far lower training costs), and the responses from reviewers tend to be along the lines "why isn't it better?" or "why not just distill or prune a large model?" There's this weird behavior that makes the black box stay a black box. I mean Yi Tay famously said "Fuck theorists" on twitter

smeeth · 2025-08-05T20:04:40 1754424280

Their FOSS local inference service didn't go anywhere.

This isn't Anaconda, they didn't do a bait and switch to screw their core users. It isn't sinful for devs to try and earn a living.

kermatt · 2025-08-05T20:30:02 1754425802

Another perspective:

If you earn a living using something someone else built, and expect them not to earn a living, your paycheck has a limited lifetime.

“Someone” in this context could be a person, a team, or a corporate entity. Free may be temporary.

blitzar · 2025-08-05T20:09:59 1754424599

Yet. Their FOSS local inference service hasn't go anywhere ... yet.

dcreater · 2025-08-05T22:29:16 1754432956

You can build this and go build something else as well. You don't need to morph the thing you built. That's underhanded

smeeth · 2025-07-02T17:03:28 1751475808

It's worth noting this is the exact argument people used against adopting electric calculators.

thoughtpeddler · 2025-07-02T17:47:22 1751478442

Calculators are a very narrow form of intelligence as compared to the general-purpose intelligence that LLMs are. The muscle/steroid analogy from this same discussion thread is apt here. Calculators enhanced and replaced just one 'muscle', so the argument against them would be like "ya but do we really need this one muscle anymore?", whereas with LLMs the argument is "do we really even need a body at all anymore?" (if extrapolated out several more years into the future).

smeeth · 2025-07-02T21:17:34 1751491054

You don't need the analogy. If you have a tool that does a job for you your capacity to do the job degrades alongside other associated skills.

Tools that do many things and tools that do a small number of things are still tools.

> "do we really even need a body at all anymore?"

It's a legitimate question. What's so special about the body and why do we need to have one? Would life be better or worse without bodies?

Deep down I think everyone's answer has more to do with spirituality than anything else. There isn't a single objectively correct response.

conception · 2025-07-03T01:48:09 1751507289

They weren’t wrong. There are lots of cognitive and conceptual benefits to slide rules.

smeeth · 2025-06-24T17:15:30 1750785330

The main limitation of tokenization is actually logical operations, including arithmetic. IIRC most of the poor performance of LLMs for math problems can be attributed to some very strange things that happen when you do math with tokens.

I'd like to see a math/logic bench appear for tokenization schemes that captures this. BPB/perplexity is fine, but its not everything.

cschmidt · 2025-06-24T18:45:00 1750790700

This paper has a good solution:

https://arxiv.org/abs/2402.14903

You right to left tokenize in groups of 3, so 1234567 becomes 1 234 567 rather than the default 123 456 7. And if you ensure all 1-3 digits groups are in the vocab, it does much better.

Both https://arxiv.org/abs/2503.13423 and https://arxiv.org/abs/2504.00178 (co-author) both independently noted that you can do this with just by modifying the pre-tokenization regex, without having to explicitly add commas.

nielsole · 2025-06-25T07:58:26 1750838306

Isn't that the opposite of the bitter lesson - adding more cleverness to the architecture?

cschmidt · 2025-06-25T11:37:47 1750851467

I suppose it is. There is a lot to tokenization - pre-tokenization, how to handle digits, the tokenization training approach - that is about adding cleverness. In the long run, the bitter lesson would be to just get rid of it all and learn from more data. Many people would love to do it. But I think for the case of BLT, digits will still be an issue. There is no way an autoregressive entropy model will be able to split numbers sensibly, since it has no idea how many digits are coming. It seems like it will struggle more with arithmetic. Perhaps you could reverse all the digits in a number, then it has a chance. So 12334 becomes 43321, and it gets to start from the ones digit. This has been suggested as an approach for LLM's.

infogulch · 2025-06-25T13:47:42 1750859262

Little endian wins in the end.

pas · 2025-06-25T16:19:26 1750868366

... why does reversing the all the digits help? could you please explain it? many thanks!

cschmidt · 2025-06-26T12:02:55 1750939375

Math operations go right to left in the text, while we write them left to right. So if you see the digits 123... in an autoreressive manner, you don't know really anything, since it could be 12345 or 1234567. If you flipped 12345 as 543..., you know the place value of each. You know that the 5 you encounter first is in the ones place, the 4 is the tens place, etc. It gives the LLM a better chance of learning arithmetic.

pas · 2025-06-28T02:14:25 1751076865

ah, okay, thanks!

so basically reverse notation has the advantage of keeping magnitude of numbers (digits!) relative to each other constant (or at least anchored to the beginning of the number)

doesn't attention help with this? (or, it does help, but not much? or it falls out of autoregressive methods?)

cschmidt · 2025-06-28T15:17:54 1751123874

Attention does help, which is why it can learn arithmetic, even with arbitrary tokenization. However, if you put it in a standard form, such as right-to-left groups of 3, you make it an easier problem for the LLM to learn. All the examples it sees are in the same format. Here, the issue is that BLT operates in an autoregressive manner (strictly left to right), which makes it harder to tokenize the digits in a way that is easier for the LLM to learn. Each digit is its own token (Llama style), or flipping the digits might be the best.

RaftPeople · 2025-06-25T19:34:38 1750880078

> Isn't that the opposite of the bitter lesson - adding more cleverness to the architecture?

The bitter lesson is that general methods and a system that learns trumps trying to manually embed/program human knowledge into the system, so clever architecture is ok and expected.

fennecbutt · 2025-06-25T17:21:11 1750872071

I guess it's just working with the brain model (so to speak) than against it.

Inthesamewaythatweusepunctuation. Or even that we usually order words a certain way, oranges and apples, Ted and Bill, roundabouts and swings.

jvanderbot · 2025-06-24T23:58:58 1750809538

Ok great! This is precisely how I chunk numbers for comparison. And not to diminish a solid result or the usefulness of it or the baseline tech: its clear that it we keep having to create situation - specific inputs or processes, we're not at AGI with this baseline tech

chmod775 · 2025-06-25T03:37:44 1750822664

> [..] we're not at AGI with this baseline tech

DAG architectures fundamentally cannot be AGI and you cannot even use them as a building block for a hypothetical AGI if they're immutable at runtime.

Any time I hear the goal being "AGI" in the context of these LLMs, I feel like listening to a bunch of 18th-century aristocrats trying to get to the moon by growing trees.

Try to create useful approximations using what you have or look for new approaches, but don't waste time on the impossible. There's no iterative improvements here that will get you to AGI.

kristjansson · 2025-06-25T04:46:13 1750826773

> "So... what does the thinking?"

> "You're not understanding, are you? The brain does the thinking. The meat."

> "Thinking meat! You're asking me to believe in thinking meat!"

https://www.mit.edu/people/dpolicar/writing/prose/text/think...

munksbeer · 2025-06-26T09:01:38 1750928498

It doesn't feel particularly interesting to keep dismissing "these LLMs" as incapable of reaching AGI.

It feels more interesting to note that this time, it is different. I've been watching the field since the 90s when I first dabbled in crude neural nets. I am informed there was hype before, but in my time I've never seen progress like we've made in the last five years. If you showed it to people from the 90s, it would be mind blowing. And it keeps improving incrementally, and I do not think that is going to stop. The state of AI today is the worst it will ever be (trivially obvious but still capable of shocking me).

What I'm trying to say is that the shocking success of LLMs has become a powerful engine of progress, creating a positive feedback loop that is dramatically increasing investment, attracting top talent, and sharpening the focus of research into the next frontiers of artificial intelligence.

dTal · 2025-06-26T14:07:15 1750946835

>If you showed it to people from the 90s, it would be mind blowing

90's? It's mind blowing to me now.

My daily driver laptop is (internally) a Thinkpad T480, a very middle of the road business class laptop from 2018.

It now talks to me. Usually knowledgeably, in a variety of common languages, using software I can download and run for free. It understands human relationships and motivations. It can offer reasonably advice and write simple programs from a description. It notices my tone and tries to adapt its manner.

All of this was inconceivable when I bought the laptop - I would have called it very unrealistic sci-fi. I am trying not to forget that.

AllegedAlec · 2025-06-25T12:17:42 1750853862

Thank you. It's maddening how people keep making this fundamental mistake.

mgraczyk · 2025-06-25T08:57:58 1750841878

This is meant to be some kind of Chinese room argument? Surely a 1e18 context window model running at 1e6 tokens per second could be AGI.

chmod775 · 2025-06-25T10:04:34 1750845874

Personally I'm hoping for advancements that will eventually allow us to build vehicles capable of reaching the moon, but do keep me posted on those tree growing endeavors.

mgraczyk · 2025-06-25T12:02:51 1750852971

Tree growing?

And I don't follow, we've had vehicles capable of reaching the moon for over 55 years

anonymoushn · 2025-06-25T16:15:15 1750868115

It's about the immutability of the network at runtime. But I really don't think this is a big deal. General-purpose computers are immutable after they are manufactured, but can exhibit a variety of useful behaviors when supplied with different data. Human intelligence also doesn't rely on designing and manufacturing revised layouts for the nervous system (within a single human's lifetime, for use by that single human) to adapt to different settings. Is the level of mutability used by humans substantially more expressive than the limits of in-context learning? what about the limits of more unusual in-context learning techniques that are register-like, or that perform steps of gradient descent during inference? I don't know of a good argument that all of these techniques used in ML are fundamentally not expressive enough.

mgraczyk · 2025-06-25T17:21:07 1750872067

LLMs, considered as a function of input and output, are not immutable at runtime. They create tokens that change the function when it is called again. That breaks most theoretical arguments

anonymoushn · 2025-06-25T18:20:31 1750875631

Sure. Another view is that an LLM is an immutable function from document-prefixes to next-token distributions.

mgraczyk · 2025-06-25T18:24:56 1750875896

But that view is wrong, the model outputs multiple tokens.

The right alternative view is that it's an immutable function from prefixes to a distribution over all possible sequences of tokens less than (context_len - prefix_len).

There are no mutable functions that cannot be viewed as immutable in a similar way. Human brains are an immutable function from input sense-data to the combination (brain adaptation, output actions). Here "brain adaptation" doing a lot of work, but so would be "1e18 output tokens". There is much more information contained within the latter

VonGallifrey · 2025-06-25T13:08:48 1750856928

Excuse me for the bad joke, but it seems like your context window was too small.

The Tree growing comment was a reference to another comment earlier in the comment chain.

mgraczyk · 2025-06-25T16:51:15 1750870275

It's not a tree though

lukan · 2025-06-25T09:23:05 1750843385

"Surely a 1e18 context window model running at 1e6 tokens per second could be AGI."

And why?

mgraczyk · 2025-06-25T16:52:50 1750870370

Because that's quite a bit more information processing than any human brain

lukan · 2025-06-25T17:14:37 1750871677

I don't think it is quantity that matters. Otherwise supercomputers are smart by definition.

mgraczyk · 2025-06-25T17:19:25 1750871965

Well no, that's not what anyone is saying.

The claim was that it isn't possible in principle for "DAGs" or "immutable architectures" to be intelligent. That statement is confusing some theoretical results that aren't applicable to how LLMs work (output context is mutation).

I'm not claiming that compute makes the m intelligent. I'm pointing out that it is certainly possible, and at that level of compute it should be plausible. Feel free to share any theoretical results you think demonstrate the impossibility of "DAG" intelligence and are applicable

lukan · 2025-06-26T08:03:26 1750925006

I am not saying it is impossible, I am saying it might be possible, but far from plausible with the current approach of LLMs in my experience with them.

rar00 · 2025-06-25T12:33:26 1750854806

This argument works better for state space models. A transformer would still steps context one token at a time, not maintain an internal 1e18 state.

mgraczyk · 2025-06-25T16:52:24 1750870344

That doesn't matter, are you familiar with any theoretical results in which the computation is somehow limited in ways that practically matter when the context length is very long? I am not

Y_Y · 2025-06-25T08:23:30 1750839810

What do the vector space embeddings for digit strings even look like? Can you do arithmetic on them? If that's even desirable that it seems like you could just skip "embedding" altogether and intern all the numbers along one dimension.

williamdclt · 2025-06-25T09:57:40 1750845460

Even if LLMs get better at arithmetic, they don't seem like the right tool for the job.

LLMs might never be able to crunch numbers reliably, however I expect they should be very good at identifying the right formula and the inputs for a problem ("i need the answer to x*y, where x=12938762.3 and y=902832.2332"). Then they can call a math engine (calculator or wolfram alpha or whatever) to do the actual computation. That's what humans do anyway!

calibas · 2025-06-24T17:52:54 1750787574

It's a non-deterministic language model, shouldn't we expect mediocre performance in math? It seems like the wrong tool for the job...

rictic · 2025-06-24T18:19:26 1750789166

Models are deterministic, they're a mathematical function from sequences of tokens to probability distributions over the next token.

Then a system samples from that distribution, typically with randomness, and there are some optimizations in running them that introduce randomness, but it's important to understand that the models themselves are not random.

geysersam · 2025-06-24T20:29:06 1750796946

The LLMs are deterministic but they only return a probability distribution over following tokens. The tokens the user sees in the response are selected by some typically stochastic sampling procedure.

danielmarkbruce · 2025-06-24T22:28:48 1750804128

Assuming decent data, it won't be stochastic sampling for many math operations/input combinations. When people suggest LLMs with tokenization could learn math, they aren't suggesting a small undertrained model trained on crappy data.

anonymoushn · 2025-06-24T23:41:10 1750808470

I mean, this depends on your sampler. With temp=1 and sampling from the raw output distribution, setting aside numerics issues, these models output nonzero probability of every token at each position

danielmarkbruce · 2025-06-25T17:58:51 1750874331

A large model well trained on good data will have logits so negative for something like "1+1=" -> 3 that they won't come up in practice unless you sample in a way to deliberately misuse the model.

mgraczyk · 2025-06-24T19:14:20 1750792460

This is only ideally true. From the perspective of the user of a large closed LLM, this isn't quite right because of non-associativity, experiments, unversioned changes, etc.

It's best to assume that the relationship between input and output of an LLM is not deterministic, similar to something like using a Google search API.

ijk · 2025-06-24T19:28:51 1750793331

And even on open LLMs, GPU instability can cause non-determinism. For performance reasons, determinism is seldom guaranteed in LLMs in general.

rar00 · 2025-06-25T12:36:14 1750854974

yep, even with greedy sampling and fixed system state, numerical instability is sufficient to make output sequences diverge when processing the same exact input

CamperBob2 · 2025-06-24T18:09:19 1750788559

We passed 'mediocre' a long time ago, but yes, it would be surprising if the same vocabulary representation is optimal for both verbal language and mathematical reasoning and computing.

To the extent we've already found that to be the case, it's perhaps the weirdest part of this whole "paradigm shift."

currymj · 2025-06-25T00:18:45 1750810725

thanks to training data + this being a popular benchmark, they're pretty good at grinding through symbolic mathematical derivations, which is often useful if you want an explanation of a mathematical concept. there's not really a better tool for this job, except for "a textbook which answers the exact question you have".

but from time to time, doing this does require doing arithmetic correctly (to correctly add two exponents or whatever). so it would be nice to be able to trust that.

i imagine there are other uses for basic arithmetic too, QA applications over data that quotes statistics and such.

agarren · 2025-06-25T00:33:27 1750811607

> but from time to time, doing this does require doing arithmetic correctly (to correctly add two exponents or whatever). so it would be nice to be able to trust that.

It sounds weird, but try writing your problem in LaTeX - I don’t know why, I’ve found a couple models to be incredibly capable at solving mathematical problems if you write them in LaTeX.

drdeca · 2025-06-24T18:03:34 1750788214

Deterministic is a special case of not-necessarily-deterministic.

search_facility · 2025-06-24T22:37:02 1750804622

regarding “math with tokens”: There was paper with tokenization that has specific tokens for int numbers, where token value = number. model learned to work with numbers as numbers and with tokens for everything else... it was good at math. can’t find a link, was on hugginface papers

samus · 2025-06-25T00:16:45 1750810605

Shouldn't production models already do this? They already tend to use tokenizers with complex rules to deal with a lot of input that would otherwise be tokenized in a suboptimal way. I recall a bug in an inference engine (maybe llama.cpp?) because of an implementation difference in their regex engine compared to the model trainer. Which means that the tokenizer used regex-based rules to chop up the input.

search_facility · 2025-06-25T21:15:10 1750886110

turns out - no, by intuition they should do this for sure - but no.

UPD: Found the paper: - https://huggingface.co/papers/2502.09741 - https://fouriernumber.github.io/

in paper mentioned “number” is a single sort-of “token” with numeric value, so network dealing with numbers like real numbers, separately from char representation. All the math happens directly on “number value”. In majority of current models numbers are handled like sequences of chars

vendiddy · 2025-06-25T06:18:07 1750832287

Do LLMs need to be good at math with the same approach?

To draw an a analogy, we've got our human brain specialized.

Why not implement a part of the AI brain that's not neural nets, but instead circuitry specialized to math?

Maybe a dumb question since I'm a layperson!

js8 · 2025-06-25T03:48:28 1750823308

It's not strange at all. I am playing with lambda calculus and combinatory logic now, as a base for mathematics (my interest is to understand rigorous thinking). You can express any computation using just S and K combinators, however, there is a price to that - the computations will be rather slow. So to make the computation faster, we can use additional combinators and rules to speed things up (good example is clapp() function in https://github.com/tromp/AIT/blob/master/uni.c).

Of course, the extra rules have to be logically consistent with the base S and K combinators, otherwise you will get wrong result. But if the inconsistent rule is complicated enough to be used only infrequently, you will still get correct result most of the time.

Which brings me to LLMs and transformers. I posit that transformers are essentially learned systems of rules that are applied to somewhat fuzzily known set of combinators (programs), each represented by a token (the term being represented by the embedding vector). However, the rules learned are not necessarily consistent (as it happens in the source data), so you get an occasional logical error (I don't want to call it hallucination because it's a different phenomenon from nondeterminism and extrapolation of LLMs).

This explains the collapse from the famous paper: https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin... One infrequent but inconsistent rule is enough to poison the well due to logical principle of explosion. It also clearly cannot be completely fixed with more training data.

(There is also analogy to Terry Tao's stages of mathematical thinking: https://terrytao.wordpress.com/career-advice/theres-more-to-... Pre-rigorous corresponds to soomewhat random set of likely inconsistent logical rules, rigorous to small set of obviously consistent rules, like only S and K, and post-rigorous to a large set of rules that have been vetted for consistency.)

What is the "solution" to this? Well, I think during training you somehow need to make sure that the transformer rules learned by the LLM are logically consistent for the strictly logical fragment of the human language that is relevant to logical and programming problems. Which is admittedly not an easy task (I doubt it's even possible within NN framework).

smeeth · 2025-06-24T12:15:41 1750767341

+1, this is the exact reason I started using uv. Extremely convenient.

For some reason uv pip has been very slow, however. Unsure why, might be my org doing weird network stuff.

greenavocado · 2025-06-24T17:03:43 1750784623

Or very difficult package spec

smeeth · 2025-06-18T22:00:18 1750284018

I suspect its something to do with the following:

When humans get stuck solving problems they often go out to acquire new information so they can better address the barrier they encountered. This is hard to replicate in a training environment, I bet its hard to let an agent search google without contaminating your training sample.