
Sure. On the technical side we wrote a custom crawler in Python that we could scale horizontally with simple config changes, so we could cap costs instead of scaling dynamically. Everything in the crawler was stateless and depended on SQS, plus a separate Bloom filter to check for already-crawled content, and we did a few other things as well to avoid needlessly crawling existing content: standard rate limiting with dynamic backoff, depth limits to avoid over-crawling a single site, etc. At the height I think we were crawling ~10M pages a day, which could have scaled much further, but we were balancing cost. We were also working on adding headless JS support to handle dynamic content pages, but we were able to grab a great deal of good content without that.
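To make the dedup and depth-limit logic concrete, here's a minimal sketch of the kind of check each stateless worker ran before fetching a URL. The Bloom filter, function names, and the max depth of 5 are all illustrative, not our actual code:

```python
import hashlib

class BloomFilter:
    """Tiny in-memory Bloom filter (ours was a separate shared service).
    May report false positives, never false negatives."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _hashes(self, item):
        # Derive k independent bit positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for idx in self._hashes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._hashes(item))

def should_crawl(url, depth, seen, max_depth=5):
    """Skip URLs we've (probably) already crawled or that are too deep
    into a single site; otherwise record and approve them."""
    if depth > max_depth or url in seen:
        return False
    seen.add(url)
    return True
```

A worker would pull a (url, depth) message off SQS, run a check like this, fetch on approval, and enqueue discovered links with depth + 1, so no per-worker state survives a restart.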

Edge and link data were stored in a graph database called Dgraph, over which I ran a Spark job once a day to recompute a weighted PageRank used as a signal in the search process. At some point I was working on moving from statically computed PageRank to a dynamic PageRank that would update on the fly as indexing occurred.
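For anyone unfamiliar with the weighted variant: instead of splitting a page's rank evenly across its out-links, each link gets a share proportional to its weight. A single-machine power-iteration sketch (the real thing was a distributed Spark job over Dgraph exports; this is just the math):

```python
def weighted_pagerank(edges, damping=0.85, iters=50):
    """edges: dict mapping node -> list of (neighbor, weight) out-links.
    Returns a dict of PageRank scores summing to 1.0."""
    nodes = set(edges)
    for outs in edges.values():
        nodes.update(n for n, _ in outs)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Base teleport mass for every node.
        new = {v: (1.0 - damping) / n for v in nodes}
        for src, outs in edges.items():
            total = sum(w for _, w in outs)
            if total == 0:
                continue
            # Each out-link receives rank proportional to its weight.
            for dst, w in outs:
                new[dst] += damping * rank[src] * (w / total)
        # Redistribute mass from dangling nodes uniformly.
        dangling = sum(rank[v] for v in nodes if not edges.get(v))
        for v in nodes:
            new[v] += damping * dangling / n
        rank = new
    return rank
```

The weights let you favor, say, in-content links over boilerplate footer links when crediting a target page.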

All the text content was stored in a compact format in S3 to limit data growth as much as we could while preserving the content for indexing. An identifier was then inserted into another SQS queue, which was processed by our vectorizer pipeline for semantic search: another stateless Python project using a fine-tuned version of Sentence-BERT and some very neat lower-level execution-graph optimizations to get it to run ~very~ fast (which could be several blog posts of its own). We wound up being able to pull off ~2M vectorizations an hour for 128-token chunks, which were then stored in a vector database called Milvus, which uses some very nice algorithms for approximate nearest-neighbor matching.
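The 128-token chunking step is simple but worth showing, since chunk boundaries matter a lot for embedding quality. A rough sketch, using whitespace tokens as a stand-in for the model's real tokenizer, with an overlap parameter I'm adding purely for illustration:

```python
def chunk_tokens(text, chunk_size=128, overlap=16):
    """Split text into <=chunk_size-token chunks with a small overlap
    so sentences straddling a boundary appear in both neighbors."""
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last window already covered the tail
    return chunks
```

Each chunk would then be embedded (e.g. with a Sentence-BERT `encode` call) and the vector written to Milvus keyed by the document identifier plus chunk index, so an ANN hit can be traced back to its source page.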

Then there was another queue that would process the text content and bulk-insert it into a very heavily tuned Elasticsearch cluster. (Long term the idea was to replace Elasticsearch; in fact I'm writing a Rust-based search engine, but I've since fallen off that onto other things. I'd like to get back to it at some point, though.) ES is very good, but it requires careful tuning and some clever pre- and post-processing combined with its built-in ranking methods to get compelling results over a dataset as diverse as the one we had assembled.
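For the bulk inserts, ES's `_bulk` API wants alternating action and document lines, and batching those is the main trick for indexing throughput. A sketch of the batching step only (index name, field names, and batch size are made up; the real pipeline had far more mapping/tuning around this):

```python
def bulk_actions(docs, index="pages", batch_size=500):
    """Group docs (dicts with 'id' and 'text') into batches of
    Elasticsearch bulk-API lines: one action line + one source line
    per document, batch_size docs per batch."""
    batch = []
    for doc in docs:
        batch.append({"index": {"_index": index, "_id": doc["id"]}})
        batch.append({"text": doc["text"]})
        if len(batch) >= 2 * batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the partial final batch
```

Each yielded batch would go out as a single `_bulk` request; batch size is a lever you tune against cluster refresh interval and heap pressure.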

We used the semantic search as a prefilter for the actual text search over that Elasticsearch cluster: say, return 100k ids from the semantic search, then run full-text search over only those 100k to reduce query time. At the end, with those hundreds of millions of pages, we were keeping response times mostly under 1s. One of the tricky things with ES is cold starts for non-cached queries, so you could see periodic latency spikes for certain classes of queries; I was working on smoothing that out.
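The prefilter composes naturally in ES's query DSL: the ANN stage's ids go into a `filter` clause (which doesn't affect scoring), and the text match scores only within that candidate set. A sketch of the query body we'd have sent, with an illustrative field name:

```python
def prefiltered_query(query_text, candidate_ids):
    """Build an Elasticsearch query that runs full-text scoring only
    over the document ids returned by the semantic (ANN) stage."""
    return {
        "query": {
            "bool": {
                # Scored full-text match over the page text field.
                "must": {"match": {"text": query_text}},
                # Non-scoring restriction to the ANN candidate set.
                "filter": {"ids": {"values": candidate_ids}},
            }
        }
    }
```

With 100k candidate ids the filter itself is cheap for ES to apply, and the expensive relevance scoring never touches the other hundreds of millions of documents.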

We were lucky in that we wrangled up something like $35k in AWS credits, so while this was running us about $3k a month, we had about 10 months of lead time, and we shut down with about 3 months to go. Both my cofounder and I are very technical, and on top of all this we were talking to users, getting feedback, and looking for avenues to monetize that weren't just based on ads. We wound up pivoting to strictly technical search for businesses, and were talking with those businesses about providing dev tools with specialized search and documentation capabilities built around our search engine and their codebases. Alas, neat tech doesn't pay the bills, so after a few near misses on funding and some early contracts that fell through, we shuttered it.

I'm sure I'm missing details here, since this was about a year of part-time work and six months of full-time work, but hopefully that satisfies the curiosity!


