Ask HN: I want to dive into how to make search engines
139 points by oikawa_tooru on Aug 25, 2022 | 77 comments
But I don't know where to start or what to study. If I were going for masters in this subject, what would my courses be?


>search engines

You can decompose a "search engine" into multiple big components and figure out what you want to look at first:

(1) web crawler/spiders

(2) database cache of web content -- aka building the "search index"

(3) algorithm of scoring/weighing/ranking of pages -- e.g. "PageRank"

(4) query engine -- translating user inputs into returning the most "relevant" pages

Each technical topic is a sub-specialty and can be staffed by dedicated engineers. There are also more topics such as lexical analysis, distributed computing (for all 4 areas), etc.
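
To make the decomposition concrete, here's a bare skeleton of how the four pieces hand data to each other (the names and signatures are my own invention, just to show the flow):

    # Hypothetical skeleton of the four components; each body is left as an exercise.
    from typing import Dict, List, Tuple

    def crawl(seed_urls: List[str]) -> Dict[str, str]:
        """(1) Fetch pages, follow links, return {url: html}."""
        raise NotImplementedError

    def build_index(pages: Dict[str, str]) -> Dict[str, Dict[str, int]]:
        """(2) Inverted index: {term: {url: term_frequency}}."""
        raise NotImplementedError

    def score_pages(link_graph: Dict[str, List[str]]) -> Dict[str, float]:
        """(3) Query-independent scores for each url, e.g. PageRank over the link graph."""
        raise NotImplementedError

    def search(query: str, index, scores) -> List[Tuple[str, float]]:
        """(4) Match query terms against the index, blend scores, return ranked (url, score)."""
        raise NotImplementedError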

If you're mainly focused on experimenting with programming another ranking algorithm, you can skip part (1) by leveraging the dataset from Common Crawl: https://index.commoncrawl.org/

Here are some videos about PageRank:

https://www.youtube.com/watch?v=JGQe4kiPnrU , https://www.youtube.com/watch?v=qxEkY8OScYY

... but keep in mind that the scope of those videos omits all of (1), (2), and (4).


I would also recommend reading about the major OSS search engines, e.g. Lucene, Solr, and then Elasticsearch.

https://en.wikipedia.org/wiki/Apache_Lucene

https://en.wikipedia.org/wiki/Apache_Solr

https://en.wikipedia.org/wiki/Elasticsearch

those who don't know history will repeat it, etc.

then maybe the newer stuff like https://typesense.org/

To be clear, I don't know these things myself, but that's what I would do, and I'd happily read fly.io-style blog posts drip-feeding out this knowledge over time.


As someone who works in this space, ^ this. I would say don't overthink any component: use Common Crawl (https://commoncrawl.org/) to build your initial index, use a PageRank implementation that's been thoroughly researched and published, and use off-the-shelf components from the Apache Foundation when you can.


abadger9, nice username, do you have any cool portfolios ref the same kinda work?


You are implying web. If you eliminate web from your major components, the concepts are the same for building a search engine for any data. Like a product search on a company website might need to crawl through a bunch of internal data sources to build an index of all the products, metadata, etc. and you may want some products ranked higher in results for whatever reason.

I think a major effort, if it's not the web, is defining a common data structure to represent the varying source data sets.

Edit to add: OP if you want to study this, start smaller than "search the web" and get the major component concepts down. Expanding to the web will involve distributed systems stuff to scale but there's only like a handful of companies doing that yet there are many more companies that need a search engine on in-house data.


Hi, thank you for your answer. Actually, my original idea was to build it for something other than the web in particular. Can you please advise me on which master's I should pick between distributed systems, machine learning, and theoretical computer science/algorithms if my end goal is this?


I don't think any specific MS is going to cover everything for building a search engine. You may want to look at the people who work on Lucene [0], for example, and connect with them to see what their backgrounds are. My guess is there is much more algorithm work involved (types of searches and their big-O tradeoffs, sorts, etc.) than distributed systems/ML, but I don't work on search engines. It's open source, so you could just download it and tinker. You may find that a degree isn't necessary to build one.

[0] https://lucene.apache.org/


I would include (5) the presentation of search results - document snippets or surrogates and affordances for query refinement like faceting. One salient acronym is SERP, for search engine results page.

A different "information retrieval" perspective overall would be to take a deep dive into the career of Susan Dumais, http://susandumais.com/


Hi, between my options to go for a master's in machine learning, distributed systems, or theoretical computer science/algorithms, which would you pick if your goal was to build a search engine?


distributed systems or cs / algo


I'm working on building an AWS for anyone who wants to make their own search engine. The idea is to have a single open web-index database, continuously updated, that you can apply ranking and embedding algorithms to. This would reduce the cost of entry and enable developers to build competitors to Google on top of it, or create new products in the search space, like a search engine for clothes. I don't know if this is interesting to anyone, but if it is, hit me up.


That sounds very cool, and I hope you (and your customers!) are successful. Out of curiosity, did you find an existing market need for that, or it's a "build it and they will come" model?

Also, have you thought about partnering with commoncrawl.org? I could see that relationship benefiting both sides: they get fresher indices, you get access to the historical web snapshots.


I faced the problem myself. I think one of the main issues with Google is the modality of the results. Google is forced to create a list of links because that's the main vehicle through which they drive profit. If you were to send a question like "Who is Barack Obama?" you will still get a list of links, although Google knows there is a canonical answer.

The problem is that if you were to build a new search engine from the ground up, it would take millions in infrastructure and a lot of time to test one idea. And there are multiple attack vectors on Google's business model (privacy, subscription model, modality, etc.); however, you might only get the chance to test one of them, and if that fails, starting again is super expensive, so you might not be able to get funds to do it.

My approach then became to build something that others can build on top of.

I'm currently using Common Crawl, but my main problem is that I need to build a small toy to test it, and even processing Common Crawl is crazy expensive. A single snapshot is ~150 TB, so this needs to be processed on metal, or you're going to pay a hefty AWS bill.


> If you were to send a question like "Who is Barack Obama?" you will still get a list of links, although Google knows there is a canonical answer.

For that specific search I would start at Wikipedia, but for more general "data search" I lean towards Wolfram Alpha, which has some usability issues but an interesting maths engine for queries. https://www.wolframalpha.com/input?i=Barack+Obama+vs+Donald+...


It sounds like just what we need to break free from Google.

I’ve been dreaming of an open web index and social graph for more than a decade.

Any company having the data + the algorithm + the presentation layer is way too much power. We can and should split that problem into its separate domains.

I hope you succeed, keep us posted.


Have you seen Common Crawl? https://commoncrawl.org/. If so, what differences do you imagine for yours?


> continuously updated

is what I saw as the primary difference. Whether that's going to pan out in reality as well as it does in HN comments is "the devil's in the details" though


Wouldn't it be prohibitively expensive for you to crawl and index the web?


I don't know that anybody offers a Masters degree specifically in "search engines", but loosely speaking the main academic field backing most search engines is "Information Retrieval"[1].

You can get a taste of the kinds of things covered in this field by looking at this class site (among many others)

https://web.stanford.edu/class/cs276/

That said, to build a modern competitive search engine, you'll need to look into software engineering, distributed systems, artificial intelligence, machine learning, linguistics, computer vision, signal processing, graph theory, databases, scheduling, machine translation, and FSM-only-knows what else.

[1]: https://en.wikipedia.org/wiki/Information_retrieval


Hi, thank you for your answer. Actually, my original idea was to build it for something other than the web in particular. Can you please advise me on which master's I should pick between distributed systems, machine learning, and theoretical computer science/algorithms if my end goal is this?


Basic techniques: https://nlp.stanford.edu/IR-book/information-retrieval-book....

Smartness (NLP): https://web.stanford.edu/~jurafsky/slp3/

There are also newer techniques, like "deep learning" for search. Not sure what's a good resource for that. (Using ML to learn the scoring functions, i.e. learning to rank, is a more-than-decade-old field; it just resurfaced because of the new ML techniques.)
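
To give a flavor of the simplest (pointwise) form of learning to rank, here's a toy sketch with made-up features, assuming scikit-learn is installed; real systems use pairwise/listwise objectives (e.g. LambdaMART) and far richer features:

    # Toy pointwise learning-to-rank: each row is one (query, document) pair,
    # described by hand-made features; the label is a human relevance judgment.
    from sklearn.linear_model import LogisticRegression

    # features: [bm25_score, pagerank, query_terms_in_title]  (made-up numbers)
    X = [[12.1, 0.8, 2],
         [ 3.4, 0.9, 0],
         [ 9.7, 0.1, 1],
         [ 0.5, 0.2, 0]]
    y = [1, 0, 1, 0]  # 1 = judged relevant, 0 = not

    model = LogisticRegression().fit(X, y)

    # At query time, score candidate documents and sort by predicted relevance.
    candidates = [[8.0, 0.5, 1], [2.0, 0.7, 0]]
    scores = model.predict_proba(candidates)[:, 1]
    print(sorted(zip(scores, ["doc_a", "doc_b"]), reverse=True))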

Check the table of contents of those books. Courses which mention chapters or sections from them are probably a good choice.

You can also check Apache Lucene and do a deep dive to see state-of-the-art implementation (Lucene in Action is a good introduction).


I built a pretty large (by non-Google standards) search engine a little over a year ago, with a few hundred million pages. Ultimately my cofounder and I decided not to continue, but the tech itself is solid. We should open-source it as a case study for people to learn from.


Sounds interesting. Can you share more details?


Sure. On the technical side, we wrote a custom crawler in Python that we were able to scale horizontally based on simple config changes, so we could cap costs instead of scaling dynamically. Everything with the crawler was stateless and depended on SQS, plus a separate bloom filter to check already-crawled content; we did a few other things as well to avoid needlessly crawling existing content, standard rate limiting and dynamic back-off, depth limits to avoid over-crawling a single site, etc. At the height I think we were crawling ~10M pages a day, which could have scaled endlessly, but we were balancing cost. We were working on adding headless JS support as well to handle dynamic content pages, but we were able to grab a great deal of good content without that.
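
If anyone wants to see what the "already crawled?" bloom-filter check boils down to, here is a from-scratch toy version in Python; this is just an illustration of the idea, not the implementation described above:

    # Minimal bloom filter for "have we crawled this URL already?" checks.
    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1 << 20, num_hashes=5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    seen = BloomFilter()
    seen.add("https://example.com/")
    print("https://example.com/" in seen)       # True
    print("https://example.com/other" in seen)  # almost certainly False

False positives are possible but false negatives are not, which is exactly the tradeoff you want for skipping URLs you have probably already fetched.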

Edge and link data were stored in a graph database called Dgraph, over which I would run a Spark job to recompute a weighted PageRank once a day to use as a signal in the search process. At some point I was working on moving from statically computed PageRank to dynamic PageRank that would update on the fly as indexing occurred.

All the text content was stored in a special format in S3 to limit data growth as much as we could while keeping the content available for indexing. An identifier was then inserted into another SQS queue, which was processed by our vectorizer pipeline for semantic search. That was another stateless Python project using a fine-tuned version of Sentence-BERT and some very neat lower-level execution-graph optimizations to get it to run ~very~ fast (which could be several blog posts of its own); it wound up being able to pull off ~2M vectorizations an hour for 128-token chunks. The vectors were stored in a vector database called Milvus, which uses some very nice algorithms for approximate nearest-neighbor matches.

Then there was another queue that would process the text content and bulk-insert it in batches to a very heavily tuned Elasticsearch cluster. (Long term the idea was to replace Elasticsearch; in fact I'm writing a Rust-based search engine, but I've since fallen off that onto other things. I'd like to get back to it at some point, though.) ES is very good, but it requires very good tuning and some clever pre- and post-processing, combined with its built-in ranking methods, to get compelling results over a very diverse set of data like we had assembled.

We used the semantic search as a prefilter for the actual text search over that Elasticsearch cluster: say, return 100k ids from the semantic search and then run full-text search over only those 100k to reduce the time. In the end, with those hundreds of millions of pages, we were keeping mostly <1s response times. One of the tricky things with ES is cold starts for non-cached queries, so you could see periodic spikes for certain classes of queries; I was working on smoothing that out.
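
In pseudocode the query path was roughly this shape (the function names here are placeholders, not real Milvus or Elasticsearch calls):

    # Rough shape of the two-stage retrieval described above; every function
    # below is a placeholder, not an actual Milvus/Elasticsearch API.
    def semantic_prefilter(query_text, k=100_000):
        """Embed the query, ANN-search the vector index, return the k nearest doc ids."""
        ...

    def fulltext_search(query_text, candidate_ids, size=10):
        """Run keyword/BM25 search restricted to candidate_ids, return the top hits."""
        ...

    def search(query_text):
        candidate_ids = semantic_prefilter(query_text)     # stage 1: vectors
        return fulltext_search(query_text, candidate_ids)  # stage 2: full text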

We were lucky in that we wrangled up something like 35k in AWS credits, so while this was running us about 3k a month, we had about 10 months of lead time and shut down with about 3 months to go. Both my cofounder and I are very technical, and on top of all this we were talking to users, getting feedback, and looking for avenues to monetize that weren't just based on ads. We wound up pivoting to strictly technical search for businesses and were talking with those businesses about providing dev tools with specialized search and documentation capabilities based around our search engine and their codebases. Alas, neat tech doesn't pay the bills, so after a few near misses on funding and some early contracts that fell through, we shuttered it.

I'm super sure I'm missing details here since this was like 1 year of part time work and 6 months of full time work but hopefully that satisfies the curiosity!


What kinda thing do you want to search? Text I guess? But there are search engines for images, gifs, video, all kinds of stuff.

I'm working at an open-source project that builds an AI-powered search framework [0], and I've built some examples in very few lines of code (for searching fashion products via image or text [1], PDF text/images/tables search [2]) and one of our community members built a protein search engine [3].

A good place to start might be with a no-code solution like (shameless self-plug time) Jina NOW [4], which lets you build a search engine and GUI with just one CLI command.

[0] https://github.com/jina-ai/jina/

[1] https://examples.jina.ai/fashion/

[2] https://colab.research.google.com/github/jina-ai/workshops/b...

[3] https://github.com/georgeamccarthy/protein_search

[4] https://now.jina.ai


Hi, thank you for your answer. Actually, my original idea was to build it for something other than the web in particular. Can you please advise me on which master's I should pick between distributed systems, machine learning, and theoretical computer science/algorithms if my end goal is this?


To be clear, Jina AI stuff helps with the search engine itself. Getting the data is another matter entirely, and pretty much outside of our scope (although we do provide some example datasets with Jina NOW, like artworks, music, etc)


David Evans' CS101 was about building a search engine with Python; however, I think it's no longer hosted on Udacity.

https://www.cs.virginia.edu/~evans/courses/

I highly recommend digging deeper and trying to find the course materials - I'm pretty sure they should be available somewhere on the Internet. Perhaps the author could point you in the right direction.


I'm also very interested in search engines. These are two books I would recommend:

Introduction to Information Retrieval: https://www.amazon.com/Introduction-Information-Retrieval-Ch...

Search User Interfaces: https://www.amazon.com/Search-User-Interfaces-Marti-Hearst-e...


I'm on a similar bandwagon. I just started collecting search engines and analyzing them. I've listed some of them at https://github.com/Tintedfireglass/search-engines, and what I feel is that it is easy to look at search algorithms, queries, and user interfaces, but I still don't understand how to create one. I'll probably start learning some JavaScript first, then I'll try. I just host a Searx instance at this point.



Here is another one that demonstrates how absurd this idea of a "search engine" can become.

https://www.entireweb.com

Using a text-only browser I get only one search result for each query.


Thanks a lot! I do sometimes feel that it's just a waste of time, but a lot of search engines are actually really cool.


Please see my answer about David Evans' course elsewhere in this thread.


Thanks a lot! That's really interesting


The book Artificial Intelligence Through Prolog [0] was really interesting to me several years ago, and I remember the author [1] had a number of papers on search engines / information retrieval on his research page.

[0] https://faculty.nps.edu/ncrowe/book/book.html

[1] https://faculty.nps.edu/ncrowe/index.html


You could do like DuckDuckGo: "pay" Microsoft to use Bing and pretend you made a search engine.


Graph theory and page rank are a good place to start.

https://blogs.cornell.edu/info2040/2011/09/20/pagerank-backb...

Google's algorithm is indecipherably complex today, but in the early days the way search engines more or less worked was that they crawled the web and ranked pages by how many other pages had a URL reference pointing to them.

You can apply this idea today to private (or I suppose public) search engines in the same way, with interesting results.

For example, a search engine for scientific papers might use PageRank to rank papers by how many other papers cite them.

Or if you were going to make a search engine for open-source projects, you could create a PageRank-style algorithm based on which projects have dependencies on other projects.
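
In its most naive form, that kind of ranking is just counting incoming edges. A toy example with made-up data, papers ranked by how often they are cited:

    # Naive "rank by how many things point at you" (the data is made up).
    from collections import Counter

    cites = {                      # paper -> papers it cites
        "paper_a": ["paper_c"],
        "paper_b": ["paper_a", "paper_c"],
        "paper_c": [],
    }
    in_links = Counter(target for targets in cites.values() for target in targets)
    print(in_links.most_common())  # [('paper_c', 2), ('paper_a', 1)]

Real PageRank goes further by weighting a link from a highly ranked page more than a link from an obscure one.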

Part of why Google's algorithm today is more complex than this is that people try to game whatever algorithm search engines commonly use. You may remember back in the 90s and 2000s people would do stuff like put back links to other websites in the source to try to game page rank. Today that kind of behavior has expanded into a whole cottage industry (unfortunately).

What's interesting, though, is that for a lot of more limited data sets you have less of that SEO-type problem.

Whichever kind you're building, good luck! Search engines usually wind up being pretty cool (and profitable).


Start with a basic intro to algorithms. Dig into the field of information retrieval a bit. Read up on things like inverted indices, bloom filters, ranking (tf-idf, BM25, etc.), state machines, etc. Then maybe look at vector search, NLP, and a few other fields. That should about cover the basics and give you some level of understanding of how different features in search engines work. The bar is pretty high if you want to do a good job.
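
To make the ranking side concrete, here's a minimal BM25 scorer over a tiny in-memory corpus; a sketch of the formula, not production code:

    # Minimal BM25 over three toy documents.
    import math
    from collections import Counter

    docs = {
        "d1": "how to build a search engine".split(),
        "d2": "search engines rank documents by relevance".split(),
        "d3": "how to cook pasta".split(),
    }
    N = len(docs)
    avgdl = sum(len(d) for d in docs.values()) / N
    df = Counter(t for d in docs.values() for t in set(d))  # document frequency

    def bm25(query, doc, k1=1.5, b=0.75):
        tf = Counter(doc)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        return score

    for name, doc in docs.items():
        print(name, round(bm25("build search engine", doc), 3))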


Assuming you have a background in Math, System Administration (wget), Automation (Python), Software (HTML, JS, REST), and Linux (there are good resources for all of these)...

Truth is, on this track you'll set up a search engine in about 20 minutes:

Document Store/NoSQL: https://elastic.co https://solr.apache.org

Classical AI: https://youtube.com/playlist?list=PLUl4u3cNGP63gFHB6xb-kVBiQ...

Stat ML: https://www.tensorflow.org/tutorials https://youtube.com/playlist?list=PLoROMvodv4rMiGQp3WXShtMGg...

Frontiers: https://www.ycombinator.com/companies/?query=Search


I wrote my own little search engine years ago.

It's pretty simple: You crawl a website and search for all links, then you crawl all the links from these linked websites and so on.
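
In Python that loop can be sketched with just the standard library (this is only an illustration, not the code linked below):

    # Stdlib-only breadth-first crawl: fetch a page, collect its links, repeat.
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=10):
        queue, seen = [start_url], set()
        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
            except Exception:
                continue
            parser = LinkParser()
            parser.feed(html)
            queue.extend(urljoin(url, link) for link in parser.links)
        return seen

    print(crawl("https://example.com/"))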

You can still see some of my code here: https://github.com/Wronnay/search-lib


Interesting intro/overview in "What every software engineer should know about search" https://scribe.rip/p/what-every-software-engineer-should-kno...


Tim Bray wrote a blog post series on building a full text search engine [0]

[0] https://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchT...


There is a book called “Finding Out About” that I read a long time ago. It describes all aspects of what you would need to do to build a search engine from scratch. It provides details on storage and retrieval algorithms and so on. It’s dated and would need to be revised but the fundamentals are there.


It kind of depends how in depth you want to go.

I wrote a small but somewhat complete search engine some time ago.

The steps are basically:

Have a queue with urls.

Download a page from the queue

Use an HTML parser to remove markup and get the links on the page. Add the links to the queue, skipping the ones already visited.

Use a stemmer to clean up the text (porter stemmer or whatever).

Calculate an inverted index: https://en.wikipedia.org/wiki/Inverted_index

Save the stuff in an appropriate data structure (Hash or Tree).

Write a query engine for AND, OR and (all the words).

Calculate a simple page rank by counting the links to a page.

For just learning purposes it is not that hard but if you want to get all the crazy corner cases of the "real" web you will go insane.
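
A toy version of the indexing, querying, and ranking steps (crawling omitted, and lowercasing stands in for a real stemmer):

    # Toy inverted index with AND/OR queries and link-count "page rank".
    pages = {
        "a.html": {"text": "Search engines index the web", "links": ["b.html"]},
        "b.html": {"text": "The web is full of pages to index", "links": []},
        "c.html": {"text": "Cooking pasta at home", "links": ["b.html"]},
    }

    # Inverted index: term -> set of page ids containing it.
    index = {}
    for page_id, page in pages.items():
        for term in page["text"].lower().split():
            index.setdefault(term, set()).add(page_id)

    # Simple "page rank": count incoming links.
    rank = {p: 0 for p in pages}
    for page in pages.values():
        for target in page["links"]:
            rank[target] += 1

    def query(terms, mode="AND"):
        sets = [index.get(t.lower(), set()) for t in terms]
        hits = set.intersection(*sets) if mode == "AND" else set.union(*sets)
        return sorted(hits, key=lambda p: rank[p], reverse=True)

    print(query(["web", "index"]))        # AND: pages containing both terms
    print(query(["web", "pasta"], "OR"))  # OR: pages containing either term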

The easy alternative:

Use a lib to crawl a page.

Plonk all the documents' text into Postgres with full-text search, or into Elasticsearch.


I would try to grasp the 'random surfer' idea, which is modeled by a Markov chain. A nice free book is Markov Chain for Programmers [1]. A discrete-time Markov chain boils down to a conditional probability, which boils down to a matrix, and its steady-state distribution boils down to an eigenvector with eigenvalue 1, which determines PageRank. Then one can jump to 'The $25,000,000,000 Eigenvector: The Linear Algebra behind Google' [2].

[1] https://github.com/czekster/markov

[2] https://doi.org/10.1137/050623280

EDIT: admittedly there's much more to it.
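
A small numpy sketch of that whole chain: build the column-stochastic 'random surfer' matrix for a made-up three-page graph and power-iterate towards its eigenvalue-1 eigenvector:

    # Tiny PageRank via power iteration (assumes numpy; the 3-page graph is made up).
    import numpy as np

    # links[i][j] = 1 if page j links to page i
    links = np.array([[0, 0, 1],
                      [1, 0, 1],
                      [1, 1, 0]], dtype=float)

    M = links / links.sum(axis=0)              # column-stochastic transition matrix
    d = 0.85                                   # damping factor
    n = M.shape[0]
    G = d * M + (1 - d) / n * np.ones((n, n))  # "Google matrix"

    r = np.full(n, 1 / n)
    for _ in range(100):                       # power iteration
        r = G @ r
    print(r)                                   # PageRank scores, sums to 1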


This depends mostly on what kind of search engine you're trying to build. I unfortunately won't be able to point you towards courses, but there are tons of great resources online to help you get started.

Search engines are a fairly broad topic, and a lot of it depends on the _type of data_ that you want to build a search engine for. If you're looking towards more traditional, Google/Yahoo-like search, Elasticsearch's learning center (https://www.elastic.co/learn) has quite a few good resources that can point you in the right direction. Many enterprise search solutions are built on top of Apache Lucene (including Elasticsearch), and videos/blogs discussing Lucene's architecture (https://www.endava.com/en/blog/Engineering/2021/Elasticsearc...) are a great starting point as well.

On the other side of text/web search is _unstructured data_ search, i.e. searching across images, video, audio, etc. based on their semantics (https://www.deepset.ai/blog/semantic-search-with-milvus-know...). Work in this space has been ongoing for decades, but an emerging way of doing it is via a _vector database_ (https://frankzliu.com/blog/a-gentle-introduction-to-vector-d...) such as Zilliz Cloud (https://zilliz.com/cloud) or Milvus (https://milvus.io/). The idea here is to turn the type of data that you're searching for into a high-dimensional vector called an embedding, and to perform nearest-neighbor search on the embeddings themselves.
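
Stripped down to its core, that nearest-neighbor step is just comparing vectors. A brute-force toy with made-up embeddings (a real system would get the vectors from an embedding model and use an approximate nearest-neighbor index rather than a loop):

    # Brute-force cosine-similarity search over made-up document embeddings.
    import numpy as np

    doc_vectors = np.array([[0.9, 0.1, 0.0],     # "dog photo"
                            [0.8, 0.2, 0.1],     # "puppy picture"
                            [0.0, 0.1, 0.9]])    # "tax form"
    query_vector = np.array([0.85, 0.15, 0.05])  # "cute dog"

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    scores = [cosine(query_vector, v) for v in doc_vectors]
    labels = ["dog photo", "puppy picture", "tax form"]
    print(sorted(zip(scores, labels), reverse=True))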

Disclaimer: I'm a part of the Zilliz team and Milvus community.


Does anyone have a solution for when the spider or backend content aggregator gets its IP address blacklisted?

I think that Facebook got around this back in the day by having the user's device do the initial scraping. A side effect of this is that sometimes I'd post an article I found, but the preview image was blank because I submitted it too quickly or something.

Until we have a free, publicly downloadable cache of all websites, similar to CoralCDN (is this defunct?), building our own search engines is probably a nonstarter.


Not finished, but the Selenium based crawler works pretty well to combat most blocks: https://github.com/kordless/grub-2.0. I have a slightly modified version running in production for Mitta.

For IP blocks, try this: https://github.com/kordless/mitta-screenshot


All the replies here seem to be about building yet another search engine of the current generation. But they all have usability drawbacks.

I asked a question about my ideal search engine a few days ago: https://news.ycombinator.com/item?id=32452318

Basically, a conversational semantic search engine like the ones in Star Trek. I also listed some technologies that can possibly help implement it. You might be interested in that list.



Compared to Google, Andi's search experience is definitely better and results are more relevant. But for complex queries, I found its search results and filtering behavior are still not as smart as what I'd like (though, again, much better than Google's).


I just started building one (with no prior experience), and solved problems as I went along. There are a lot of problems in search, but none of them are spectacularly difficult. Requires breadth more than depth.

I've converged on a design that is actually almost eerily similar to Google's original design. ( http://infolab.stanford.edu/~backrub/google.html )


In most CS master's programs you'll get to take a course called Information Retrieval, and typically the main project for such a course is building a search engine.


If by a "search engine" you mean a tool to index and retrieve documents (essentially what the terms mean in the traditional Information Retrieval, e.g Lucene, SOLR, Elastic) then this is a pretty good on the subject that taught me a lot:

https://www.amazon.com/Managing-Gigabytes-Compressing-Multim...


I'm no expert in search engines, but I found Victor Lavrenko's videos (52 of them) very well made, though they're probably not nearly as rigorous as what you might be looking for: https://www.youtube.com/c/VictorLavrenko/playlists?view=50&s...


When I switched to working on search, my boss had written "Search Engines: Information Retrieval in Practice" [1], and I found it very helpful for wrapping my head around the different subsystems that make up search. It took me about a month to work through, and it's only $16 on Amazon.

[1] https://amzn.to/3dW5YBQ


Please don't use affiliate links: https://www.amazon.com/gp/product/0136072240


The term you want to Google is information retrieval.


Maybe rethink the core idea; why are web pages the valued part?

Maybe instead, we need people to answer questions as real people... and then score/weigh/rank those answers/people.

I just think Google is doing an amazing job for the web, and where it is failing is not with web search, but with high-quality answers from humans (and not the marketing trash Google is trying to fix this week).


I don't think Sergey Brin or Larry Page studied how to make search engines.

Let's start right away!

    urls_old = []
    urls_new = [ 'news.ycombinator.com' ]

    while len(urls_new) > 0:
        spider(urls_new[0])
... to be continued ... who writes the next 5 lines?


C'mon dude, why the elitism? Each person approaches problems differently. Don't push negativity into OP's brain before it even begins to flourish.


It is supposed to be the opposite of negativity.

I actually did kickstart projects that became pretty big lifestyle businesses by saying "No, you will not figure out how to raise venture capital. Let's open a text editor and start coding".

And it would be funny, if the massive, collective coding power of HN could collaborate and write a search engine right here on the spot. By everyone contributing 5 lines :)


Q: How do I do X? A: What have you tried?

Is perfectly reasonable conversation.


I think you're kinda overreacting

The simplest and best way to get XP is to get your hands dirty.


I think yeah, I overreacted...


Wiby has some implementation details in their install guide: https://wiby.me/about/guide.html


I've never worked on a project that encompasses as many computer science algorithms as a search engine. There are a lot of topics you can look up in "Information Storage and Retrieval":

- Tries (patricia, radix, etc...)

- Trees (b-trees, b+trees, merkle trees, log-structured merge-tree, etc..)

- Consensus (raft, paxos, etc..)

- Block storage (disk block size optimizations, mmap files, delta storage, etc..)

- Probabilistic filters (HyperLogLog, bloom filters, etc...)

- Binary Search (sstables, sorted inverted indexes, roaring bitmaps)

- Ranking (pagerank, tf/idf, bm25, etc...)

- NLP (stemming, POS tagging, subject identification, sentiment analysis etc...)

- HTML (document parsing/lexing)

- Images (exif extraction, removal, resizing / proxying, etc...)

- Queues (SQS, NATS, Apollo, etc...)

- Clustering (k-means, density, hierarchical, gaussian distributions, etc...)

- Rate limiting (leaky bucket, windowed, etc...)

- Compression

- Applied linear algebra

- Text processing (unicode-normalization, slugify, sanitation, lossless and lossy hashing like metaphone and document fingerprinting)

- etc...

I'm sure there is plenty more I've missed. There are lots of generic structures involved like hashes, linked lists, skip lists, heaps, and priority queues, and this is just to get to 2000s-level basic tech.

If you are comfortable with Go or Rust you should look at the latest projects in this space:

- https://github.com/quickwit-oss/tantivy

- https://github.com/valeriansaliou/sonic

- https://github.com/mosuka/phalanx

- https://github.com/meilisearch/MeiliSearch

- https://github.com/blevesearch/bleve

- https://github.com/thomasjungblut/go-sstables

A lot of people new to this space mistakenly think you can just throw Elasticsearch or Postgres full-text search in front of terabytes of records and have something decent. The problem is that search with good rankings often requires custom storage so calculations can be sharded among multiple nodes and you can do layered ranking without passing huge blobs of results between systems.

Source: I'm currently working on the third version of my search engine and I've been studying this for 10 years.


Do you write any content related to your journey developing search engines? YouTube channel? Twitter? Blog? I'm working at a startup and we are creating a search engine with neural networks, but the basics of information retrieval are proving more important than just some NLP models by themselves. And thank you for sharing those details!


No, not yet. I haven't felt qualified to write about any of these individual things but I'm starting to realize not many people have the breadth of knowledge I have at this point and perhaps I should start writing for others (I have a large collection of my own notes).


Have you seen YaCy Search Engine? https://yacy.net/

This might be something to build on or explore.


Anything to do with vector search is probably a good choice.


Getting and storing the content is one of the main challenges, and it's getting harder by the day with more and more sites using anti-bot stuff from companies like Cloudflare. With the SauceNAO.com image search engine I tried to tailor it to my own needs, taking a slow and steady semi-curated approach. To keep things sane and costs low I went after specific sites (and other resources) which have high signal to noise and highly desirable content. I add a couple at a time, finding and fixing bottlenecks as they come up.

Nothing is perfect from the start, so I mainly focus on environment simplicity and getting the minimum viable setup working as quickly as possible. This has caused some problems to be sure, and led to the site looking and feeling less than awesome in many ways, but at least it (mostly) works... Over time I have had to rewrite everything - the crawling software, search algorithms, back-end database, and front-end - when it became apparent things could be done more efficiently to deal with the ever-increasing usage and scale. Having the content stored to enable re-generating indexes quickly has been very important long term!

It has taken many years (started in 2008), but in its art/entertainment niche, it has really started to take off usage-wise. My advice would be to start semi-small, throwing things at the wall and seeing if anything works. Try to keep the initial setup as simple and affordable as possible unless you have serious funding available. Building even a small search engine can take a lot of resources and time, but it can also be an amazingly fun hobby.


I'm affiliated with a company that makes one. We make a vector search engine called Weaviate [1]; we also publish content on how it's done, and the search engine itself is open source, which might be helpful for you too.

[1] https://weaviate.io


Hi, thank you for your answer. Actually, my original idea was to build it for something other than the web in particular. Can you please advise me on which master's I should pick between distributed systems, machine learning, and theoretical computer science/algorithms if my end goal is this?


This is a good opportunity to update my search reading list.

Books

* Lucene in Action - A bit out of date, but it has a lot of the core concepts of Lucene. 10 years ago, this was the go-to book even if you’re working with Solr or Elasticsearch, to understand the core data structures

* Relevant Search (my book) - Introduction to optimizing the relevance of full text search engines. Getting a bit old now, but still relevant for classic search engines.

* AI Powered Search (disclaimer - I contributed) - Author Trey Grainger is brilliant, and has been a long-time colleague of mine. He managed the search team at Careerbuilder (where we did some work together). This is in some ways his perspective on how machine learning and search work together.

* Elasticsearch The Definitive Guide - online free, very comprehensive, book from Elastic

---

Blogs

* OpenSource Connections (my old company) Blog (http://o19s.com/blog) - lots of meaty search and relevance info

* Query Understanding (https://queryunderstanding.com/) - Daniel Tunkelang is a long-time, very smart search person. Has worked at Google, etc.

* James Rubinstein's Blog (https://jamesrubinstein.medium.com/) - I worked closely with James at LexisNexis. He has helped work on search and search evaluation at eBay, Pinterest, LexisNexis, and Apple.

* Sematext (https://sematext.com/blog/) - Sematext are probably best known for being really in the weeds search engine engineers, a fair amount of scaling, etc. But some relevance.

* Sease Query Blog (https://sease.io/blog-2/our-blog) - Sease are London based Information Retrieval Experts

* My Blog (http://softwaredoug.com)

---

Paid Trainings

* OpenSource Connections Training (https://opensourceconnections.com/training)

* CoRise (https://corise.com/course/search-with-machine-learning)

* ML Powered Search (My Sphere Class ) - https://www.getsphere.com/ml-engineering/ml-powered-search?s...

* Sease Training - https://sease.io/training

* Sematext Training - https://sematext.com/training/

---

Conferences

* Haystack - http://haystackconf.com

* MICES - http://mices.co (e-commerce search)

* Berlin Buzzwords - search, scale, stream, etc - https://2022.berlinbuzzwords.de/


Friendly note: "trainings" is not a word, training is already plural.



