
It would be nice if common utils were redeveloped with optimisations that weren’t thought of or available back in the day.


That's basically what's happening. AFAIK, ack kinda started it all by observing, "hey, we frequently have very large directories/files that we don't actually want to search, so let's not by default." Depending on what you're searching, that can be a huge optimization! Tools like `git grep` and the silver searcher took it a step further and actually used your specific configuration for which files were relevant or not. (Remember, we're in best guess territory. Just because something is in your .gitignore doesn't mean you never want to search it. But it's a fine approximation.)
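
To make that concrete, the core of the trick is just "walk the tree, but consult the ignore rules before descending." Here's a minimal sketch in Rust using the `ignore` crate (the library ripgrep is built on); the options shown are that crate's defaults, and this is an illustration of the idea rather than a claim about how ack or git grep implement it internally:

    // Cargo.toml: ignore = "0.4"
    use ignore::WalkBuilder;

    fn main() {
        // Walk the current directory, skipping hidden files and anything
        // matched by .gitignore / .ignore rules, which is the ack-style
        // default described above.
        let walker = WalkBuilder::new("./")
            .hidden(true)       // skip dotfiles (the crate's default)
            .git_ignore(true)   // honor .gitignore (also the default)
            .build();

        for entry in walker.flatten() {
            // Only plain files would actually be handed to the searcher.
            if entry.file_type().map_or(false, |t| t.is_file()) {
                println!("{}", entry.path().display());
            }
        }
    }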

Was this a thing back when the BSD and GNU greps were being developed? Was it common for people to have huge directory trees (e.g., `node_modules` or `.git`) lying around that were causing significant search slowdowns? Not sure, but it seems to have taken a while for it to become a popular default!

There are of course other tricks, and most of those are inspired by changes in hardware or instruction sets. For example, back in the day, a correctly implemented Boyer-Moore was critical to avoid quadratic behavior by skipping over parts of the input. But that's lessish important today because SIMD routines are so fast that it makes sense to optimize your search loop to spend as much time in SIMD routines as possible. This might change how you approach your substring search algorithm!
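
To make the contrast concrete, here's a toy sketch in Rust of the "spend as much time in SIMD routines as possible" idea: instead of driving the search with a Boyer-Moore skip table, scan for a single byte of the needle with `memchr` (which is vectorized internally) and only do a full comparison at candidate positions. The `find` function is a hypothetical helper for illustration; real implementations pick a rare byte, handle empty needles, and so on:

    // Cargo.toml: memchr = "2"
    use memchr::memchr;

    /// Find the first occurrence of `needle` in `haystack`. The hot loop
    /// is memchr, which uses SIMD under the hood, instead of a
    /// byte-at-a-time skip loop.
    fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
        let first = *needle.first()?; // empty needles not handled here
        let mut pos = 0;
        while pos + needle.len() <= haystack.len() {
            // Vectorized scan for the next candidate position.
            let i = pos + memchr(first, &haystack[pos..])?;
            if haystack.len() - i >= needle.len()
                && &haystack[i..i + needle.len()] == needle
            {
                return Some(i);
            }
            pos = i + 1;
        }
        None
    }

    fn main() {
        let hay = b"the quick brown fox";
        assert_eq!(find(hay, b"brown"), Some(10));
        assert_eq!(find(hay, b"cat"), None);
    }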

... so what's my point? My point is that while sometimes optimizations are classic in the sense that you just need to change your heuristics as hardware updates, other times the optimization is in the way the tool is used. Maybe you can squeeze the former into existing tools, but you're probably not going to make much progress with the latter.

I remember when I first put out ripgrep. Someone asked me why I didn't "just" contribute the changes back into GNU grep. There are a lot of reasons why I didn't, but numero uno is that the question itself is a complete non-starter because it would imply breaking the way grep works.


>"hey, we frequently have very large directories/files that we don't actually want to search, so let's not by default"

>Was this a thing back when the BSD and GNU greps were being developed? Was it common for people to have huge directory trees (e.g., `node_modules` or `.git`) lying around that were causing significant search slowdowns? Not sure,

Sure, actually. It was called “build” and it was moved out of the source code hierarchy so it wouldn’t interfere with tools. Pretty clever, and you don’t have to patch every tool each time a new build path appears. This should lead us to some conclusion, but I cannot figure out which. Do you?


Those who don't understand history are destined to repeat it?

Also, that approach depends on every project you might want to grep (in this example) building cleanly into an external directory, which is never going to be 100% true: some people don't know why other software supports those options, some don't care, some think that's the wrong solution to the problem, etc. Ultimately someone comes along and builds up a big enough body of experience that they can account for and fix some fraction of the brain-dead behavior out in the wild, and the rest of us get a useful tool.



