More

yomritoyj · 2025-03-22T10:52:10 1742640730

ML practice has for the moment far outstripped ML theory. But even if ML theory catches up, the answers to your question will get likely be still dependent on the nature of the process generating the data and hence they would still have to be answered empirically. I see the value of theory more in providing a general conceptual framework. Just as the asymptotic theory of algorithms today cannot tell you which algorithm to use, but gives you some broad guidance.

xg15 · 2025-03-22T12:38:31 1742647111

> the answers to your question will get likely be still dependent on the nature of the process generating the data and hence they would still have to be answered empirically.

And I think that would be perfectly fine, or rather weird if otherwise. Part(*) of the unpredictability of ML models stems from the fact that the training data is unpredictable.

What is missing for me so far are more detailed explanations how the training data and task would influence specific decisions in model architecture. So I wouldn't expect a hard answer in the sense of "always use this architecture or that amount of neurons" but rather more insight what effects a specific architecture would have on the model.

E.g. every ML 101 course teaches the difference between single-layer and "multi"-layer (usually 2-layer) perceptrons: Linear separability, XOR problem etc.

But I haven't seen a lot of resources about e.g. the differences between 2- and 3-layer perceptrons, or 3- and 32-layer, etc. Similarly, how are your model capabilities influenced by the number of neurons inside a layer, or for convolutional layers, by parameters such as kernel dimensions, stride dimensions, etc? Same for transformers: What effects do embedding size, number of attention heads and number of consecutive transformer layers have on the model's abilities? How do I determine good values?

I don't want absolute numbers here, but rather any kind of understanding at all how to choose those numbers.

(There are some great answers in this thread already)

(* part of it, not all. I'm starting to get annoyed by the "culture" of ML algorithm design that seems to love throwing in additional sources of randomness and nondeterminism whenever they don't have a good idea what to do otherwise: Randomly shuffling/splitting the training data, random initialization of weights, random neuron/layer dropouts, random jumps during gradient descent, etc etc. All fine if you only care about statistics and probability distributions, but horrible if you want to debug a specific training setup or understand why your model learned some specific behavior).

lucasoshiro · 2025-03-22T16:03:38 1742659418

> every ML 101 course teaches the difference between single-layer and "multi"-layer (usually 2-layer) perceptrons: Linear separability, XOR problem etc.

Yeah, that's the point! ML related stuff seems to be starting with simpler problems like linear separation and XOR, then diving into some math, and soon it shows a magical python code out of nowhere that solves a problem (e.g. MNIST) and only that problem

yomritoyj · 2025-03-18T07:57:43 1742284663

You may want to look at this. Neural network models with enough capacity to memorize random labels are still capable of generalizing well when fed actual data

Zhang et al (2021) 'Understanding deep learning (still) requires rethinking generalization'

https://dl.acm.org/doi/10.1145/3446776

yomritoyj · on Nov 28, 2023

They are great books about some combinatorial mathematics inspired by programming problems. Not the best investment of time if you want to learn programming itself, because:

1. Most of the time you are not implementing foundational algorithms like sorting or SAT solving. You use mature implementation in libraries.

2. If you are in fact implementing foundational algorithms, then the existing volumes of Knuth cover only a very limited set of problem areas.

3. If you are implementing something in an area covered by Knuth, the books are worth looking into as a reference, but often writing performant and robust code requires considerations not in Knuth. This is because Knuth works with an idealised machine ignoring things like parallelism, memory hierarchy etc. and understandably does not get into software design issues.

jimbomins · on Nov 28, 2023

I'm not sure your point "3." is wholly correct.

They do also go much wider than just combinatorial math.

yomritoyj · on Nov 26, 2023

If you are going to do it at this level, it already exists as an example (section 8.7) in K & R.

yomritoyj · on Sept 24, 2023

awk, sed etc. belong to the museum, now that we have so many tools and libraries that can handle structured data.

The whole early Unix obsession with plain text files was a step in the wrong direction. One grating holdover of that is the /proc filesystem. Instead of a typed, structured api you get the stuff as text to be parsed, file system trees and data embedded in naming conventions.

blq10 · on Sept 24, 2023

I unironically disagree with this, structured data is incredible and powerful.

But an important part of the early internet was "its Just Text".

And in fact, the reason why JSON is so great is that if you want to use it as Just Text it works just the same!

It's a translation layer between systems that really demand highly structured data and flexible systems where as long as you can thunk about it, you can get from anywhere to anywhere else with a few simple programs that are on every machine in tbe known universe.

zzo38computer · on Sept 25, 2023

I disagree. I think awk and sed are useful for some uses, even though it is not very good for JSON. (I also think JSON is not the best format for structured data anyways.)

In the /proc file system, they could have made the data format better even without using JSON though. (For example, null-terminated text strings might be better than using parentheses around it; since, then, what if the data includes parentheses? You could use the PostScript format for escaping (string literals in PostScript are written with parentheses around it, and can be nested), but, it would be better and simpler to not require any escaping, isn't it?

(Anyways, my own ideas of operating system design, one is that it does have structured data in a much better way, so avoids this and other problems. There is a common structure for most things, and would be designed to improve the efficiency compared with JSON, XML, etc, as well as other advantages. This can also be used inherently with the command shell, so unlike Nushell, the system is more designed for this.)

yomritoyj · on Jan 20, 2023

I procrastinated on my thesis. I got the usual advice: 'perfect is the enemy of the good', 'just get it written' etc. but it made no difference. With hindsight I now realize that I was working on an unfeasible project with inadequate preparation but did not want to accept that. How do other experienced people, not just your advisor, react to your project? If they are skeptical it may be a worth taking a cold hard look at the entire plan.

yomritoyj · on Dec 25, 2022

Shameful conduct. And as a counterpoint to "nobody cares about your blog", a few days back I got a referee report asking me to cite a Twitter thread. This one: https://twitter.com/NilsEnevoldsen/status/152096733265462476...

Lot of actual academic work happening on Twitter, blogs and sites like MathOverflow.

BillEllson · on Dec 26, 2022

That Twitter thread is above my paygrade, but NE's final sentence (addressed to three other people working in his field) is rather relevant to this discussion "If this cipher has already been discovered, let me know where so I can give credit."

yomritoyj · on Nov 20, 2022

I think this is an increasing problem as 'liberalism' becomes the official ideology of the elite, so that everyone pays lip-service to 'non-hierarchical' values and overt displays of power are looked down on. Power struggles and self-aggrandizement then go underground and are covered in extra layers of hypocrisy and doublespeak. Those who don't catch on, lose.

yomritoyj · on Nov 6, 2022

For me Twitter is mainly a substitute to RSS: a central location to consume interesting content from diverse sources. In that role having an algorithmically curated as opposed to a strictly chronological feed is essential. For most people/entities I follow I'm interested in only a fraction of their tweets and I can rely on Twitter to do a good enough job of surfacing them for me. By following about a thousand accounts I can reliably hear about the latest trends in the areas that interest me by spending about half an hour each day.

On the other hand, right now I follow only a few dozen accounts on Mastodon and I'm already drowning in irrelevant posts. It can at best be a glorified group chat.

alkonaut · on Nov 6, 2022

You mean you see posts by other people than those you follow?

On Twitter I just follow say 100 people and my client shows me their posts in chronological order and nothing else, with no ads.

If a feed tried to algorithmically insert a post I’d stop using that service in a hurry.

What I have seen from mastodon is that there is such a feed available so by the look of it, so far so good.

yomritoyj · on Aug 21, 2022

We live in extraordinary times, standing on the shoulders of the scientific revolution, with gigahertz-gigabyte machines as commonplace and hundreds of years of mathematics ready to be built upon. It is criminal not to be busy.

I agree that most of us can do at most a few hours of intellectually demanding work every day. But to win those few hours of freedom, to gather the preconditions of doing work, to maintain the social organizations necessary for collaboration, and to make the results presentable, can easily eat up a whole day year after year.