I prefer to use clickhouse-local for all my CSV needs as I don't need to learn a new language (or CLI flags) and can just leverage SQL.
clickhouse local --file medias.csv --query "SELECT edito, count() AS count FROM table GROUP BY ALL ORDER BY count FORMAT PrettyCompact"
┌─edito──────┬─count─┐
│ agence     │     1 │
│ agrégateur │    10 │
│ plateforme │    14 │
│ individu   │    30 │
│ media      │   423 │
└────────────┴───────┘
With clickhouse-local, I can do a lot more, since I can leverage the full power of ClickHouse.
How does it compare with DuckDB, which I usually resort to?
What I like about DuckDB is that it's a single binary, no server needed, and it's been happy so far with all the CSV files I've thrown at it.
clickhouse-local is similar to DuckDB in that respect: you don't need a clickhouse-server running in order to use clickhouse-local. You just need to download the clickhouse binary and start using it.
clickhouse local
ClickHouse local version 25.4.1.1143 (official build).
:)
There are a few extra benefits to using clickhouse-local, since ClickHouse can simply do a lot more than DuckDB. One example is handling compressed files: ClickHouse can read compressed files in formats ranging from zstd, lz4, snappy, gz, xz, and bz2 to zip, tar, and 7zip archives.
clickhouse local --query "SELECT count() FROM file('top-1m-2018-01-10.csv.zip :: *.csv')"
1000000
Also, clickhouse-local is much more efficient at handling big CSV files[0].
The Debian package is of poor quality: I'm not even sure clickhouse-local is included in there. I believe it is, but there is no manpage, no docs at all, and no `clickhouse-server -h`.
Went to the official page looking for a tarball to download, found only the `curl|sh` joke.
Went to github looking for tagged tarballs, couldn't find any. Looked for INSTALL.md, couldn't find any.
Will try harder later, have to weep my tears for now.
ClickHouse is a single binary. It can be invoked as clickhouse-server, clickhouse-client, and clickhouse-local. The help is available as `clickhouse-local --help`. clickhouse-local also has a shorthand alias, `ch`.
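For illustration, the dispatch on the single binary looks like this (a quick sketch; output elided):

clickhouse server                      # run a full server
clickhouse client                      # connect to a running server
clickhouse local --query "SELECT 1"    # query ad-hoc data, no server needed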
This binary is packaged inside .deb, .rpm, and .tgz, and it is also available for direct download. The curl|sh script selects the platform (x86_64, aarch64 x Linux, Mac, FreeBSD) and downloads the appropriate binary.
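For reference, that one-liner is just:

curl https://clickhouse.com/ | sh

It drops a `clickhouse` binary into the current directory, which you can then move wherever you like.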
ClickHouse has solid Iceberg integration. It has an Iceberg table function[0] and an Iceberg table engine[1] for interacting with Iceberg data stored in S3, GCS, Azure, Hadoop, etc.
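For example, a minimal sketch of reading an Iceberg table on S3 with the table function (the bucket path and credentials are placeholders, not from the original):

SELECT count()
FROM iceberg('https://my-bucket.s3.amazonaws.com/warehouse/events/', 'ACCESS_KEY', 'SECRET_KEY')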
Oh, and now the developer reopened it because it is not actually fully complete, lol. Yep, Iceberg on ClickHouse is WIP. I am actively watching this because it is relevant for my company.
> If I had to only pick two databases to deal with, I’d be quite happy with just Postgres and ClickHouse - the former for OLTP, the latter for OLAP.
I completely agree with this statement from the author. In fact, many companies like Cloudflare are built on exactly this approach, and it has scaled well for them without the need for any third database.
> Another reason I suggest checking out ClickHouse is that it is a joy to operate - deployment, scaling, backups and so on are well documented - even down to setting the right CPU governor is covered.
Another point mentioned by the author that is worth highlighting is the ease of deployment. Most distributed databases aren't easy to run at scale; ClickHouse is much, much easier, and it has become even easier with efficient storage-compute separation.
Sai from ClickHouse here. I have been living and breathing this for the past year, helping customers integrate Postgres and ClickHouse. Totally agree with this statement - there are numerous production-grade workloads solving most of their data problems with these two purpose-built open source databases.
My team at ClickHouse has been working hard to make the integration even more seamless. We work on PeerDB, an open source tool enabling seamless Postgres replication to ClickHouse: https://github.com/PeerDB-io/peerdb/ This integration is now also natively available in the Cloud through ClickPipes. The private preview was released just last week: https://clickhouse.com/cloud/clickpipes/postgres-cdc-connect...
Out of curiosity: why not MySQL? I am also surprised that no one has even mentioned MySQL in any of the comments so far -- so it looks like the verdict is very clear on that one.
PS: I am also a fan of Postgres, and we are using it for our startup. But I don't know what the answer is if someone asks me "why not MySQL?". Hence asking.
To my knowledge, both Postgres and MySQL have their own strengths and weaknesses. For example: the MVCC implementation, data replication, connection pooling, and difficulty of upgrades were major weaknesses of Postgres that have improved a lot over time. Similarly, MySQL's query optimizer is considered less developed than Postgres's.
Overall, I think Postgres adoption, integrations, and thus its community are much wider than MySQL's, which gives it a major advantage. Also, looking at the number of database-as-a-service companies built on Postgres versus those built on MySQL, we can immediately see that Postgres is much more widely adopted.
I don't know how many of that article's points are still valid.
The other point in favor of MySQL (in my opinion) is that there are lots of companies that use MySQL in production - so the access patterns and its quirks are very well understood.
Companies like Square, YouTube, Meta, Pinterest, and now Uber all use MySQL. From Blind, Stripe was also thinking of moving its whole fleet from Mongo to MySQL.
Perception-wise, it looks like companies needing internet-scale data are using MySQL.
Since analytics data is generally write-heavy, I would recommend using ClickHouse. You can use the async_insert[0] feature of ClickHouse, so you don't need to worry about batching events on your side. If you are looking for an embedded solution, you can use chDB, which is built on top of ClickHouse.
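A minimal sketch of what that looks like (the events table and its columns are hypothetical):

clickhouse-client --async_insert=1 --wait_for_async_insert=0 --query "INSERT INTO events (ts, user_id, action) VALUES (now(), 42, 'click')"

With async_insert=1, the server buffers small inserts and flushes them in batches, so each client can send single-row INSERTs without creating too many parts.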
I tried to check what kind of aircraft fly into the world's most dangerous airport (Lukla, Nepal)[0] and found they use the ATR-72 series.
[0] https://adsb.exposed/?dataset=Planes&zoom=12&lat=27.7136&lng...