
zstd decompression should almost always be very fast. It's faster to decompress than DEFLATE or LZ4 in all the benchmarks that I've seen.

You might be interested in converting the pushshift data to Parquet. Using octosql, I'm able to query the submissions data (from the beginning of Reddit to Sept 2022) in about 10 minutes:

https://github.com/chapmanjacobd/reddit_mining#how-was-this-...

Although if you're loading the data into Postgres or BigQuery, you can probably get better query performance via indexes or parallelism.



Unfortunately we're not just searching for things but extracting word frequencies of every user for stylometric analysis, so we need to do custom crunching.

Splitting this task across many sub-slices of the files is annoying because the per-user frequencies add up quickly, so the intermediate results become a massive amount of data.
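For readers who want something concrete, here is a minimal sketch of the kind of per-user counting described above. It is not the actual pipeline: it assumes newline-delimited JSON records with `author` and `body` fields, uses a deliberately naive tokenizer, and the function names `count_words_by_user` and `merge_counts` are made up for illustration.

```python
import json
import re
from collections import Counter, defaultdict

# Naive tokenizer (an assumption): lowercase ASCII letters and apostrophes only.
WORD_RE = re.compile(r"[a-z']+")


def count_words_by_user(lines):
    """Return {author: Counter(word -> count)} for an iterable of JSON lines."""
    freqs = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # "author" and "body" are assumed field names, not confirmed by the thread.
        author = record.get("author")
        body = record.get("body")
        if not author or not body:
            continue
        freqs[author].update(WORD_RE.findall(body.lower()))
    return freqs


def merge_counts(total, partial):
    """Fold the per-user counts of one slice into a running total, in place."""
    for author, counter in partial.items():
        total[author].update(counter)
    return total


# Tiny made-up sample standing in for decompressed dump lines.
sample = [
    '{"author": "alice", "body": "zstd is fast, zstd is small"}',
    '{"author": "bob", "body": "parquet is columnar"}',
    '{"author": "alice", "body": "fast enough"}',
]
```

Each slice yields one table per user it contains, and `merge_counts` has to hold or reload the running total for every user, which is where the volume of intermediate data comes from.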


[flagged]


Two SSDs on an AWS machine only give 3800 MB/sec :(


Meanwhile a single consumer Samsung 980 Pro 2TB for 200€ gives me stable 7000 MB/sec



