Except for projects for which NoSQL is a better fit than a RDBMS, no?

If I'm writing a chat app with millions of messages and very little in the way of "relationships", should I use Postgres or some flavor of NoSQL? Honest question.



Postgres. NoSQL databases are specialized databases. They are best-in-class at some things, but generally that specialization came at great cost to their other capabilities. DynamoDB is an amazing key-value store, but is severely limited at everything else. Elasticsearch is amazing for search and analytics, but is severely limited at everything else. Specialized databases that do speak SQL are also great at what they do: Spark, for example, is a columnar engine with amazing capabilities for massive datasets where you need lots of cross-joins, but it traded latency for throughput and horizontal scalability, which severely limits its ability to act in other roles and restricts what you can do with it.

The super-power of Postgres is that it supports everything. It's a best-in-class relational database, but it's also a decent key-value store, it's a decent full-text search engine, it's a decent vector database, it's a decent analytics engine. So if there's a chance you want to do something else, Postgres can act as a one-stop-shop and doesn't suck at anything but horizontal scaling. With partitioning improving, you can deal with that pretty well.
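To make that concrete, the built-in full-text search covers a lot of ground before you'd ever need Elasticsearch. A minimal sketch (the table and column names here are made up):

    -- hypothetical messages table
    CREATE TABLE messages (
        id      bigserial PRIMARY KEY,
        body    text NOT NULL,
        sent_at timestamptz NOT NULL DEFAULT now()
    );

    -- GIN index over the tsvector so searches don't scan the whole table
    CREATE INDEX messages_fts ON messages
        USING GIN (to_tsvector('english', body));

    -- "find messages mentioning postgres and scaling"
    SELECT id, body
    FROM messages
    WHERE to_tsvector('english', body) @@ to_tsquery('english', 'postgres & scaling');

For fancier ranking or fuzzy matching you might eventually outgrow it, but it goes a surprisingly long way.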

If you're starting a fresh project, there is basically no reason not to use Postgres. It's only when you already know your scale won't work with Postgres that you should reach for a specialized database. And if you think you know because of published wisdom, I'd recommend you set up your own little benchmark: generate the volume of data you want to support, then query it with Postgres and see if that is fast enough for you. It probably will be.
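Something like this is usually enough to find out, reusing the placeholder messages table from the sketch above (adjust the row count to whatever volume you actually expect):

    -- load a realistic volume of fake data
    INSERT INTO messages (body, sent_at)
    SELECT 'message number ' || i, now() - (i || ' seconds')::interval
    FROM generate_series(1, 10000000) AS q(i);

    CREATE INDEX ON messages (sent_at);

    -- then time the queries you actually plan to run
    EXPLAIN ANALYZE
    SELECT id, body FROM messages
    WHERE sent_at > now() - interval '1 day'
    ORDER BY sent_at DESC
    LIMIT 50;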


Golden Rule of data: Use PostgreSQL unless you have an extremely good reason not to.

PostgreSQL is extremely good at append-mostly data (e.g. a chat log) and has powerful partitioning features that allow you to keep said chat logs around for quite some time (with some caveats) while keeping queries fast.
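For illustration, declarative range partitioning on the message timestamp looks roughly like this (table and column names are made up):

    -- parent table, partitioned by month on the send timestamp
    CREATE TABLE chat_messages (
        id      bigserial,
        room_id bigint NOT NULL,
        body    text   NOT NULL,
        sent_at timestamptz NOT NULL
    ) PARTITION BY RANGE (sent_at);

    CREATE TABLE chat_messages_2024_01 PARTITION OF chat_messages
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    CREATE TABLE chat_messages_2024_02 PARTITION OF chat_messages
        FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

    -- queries constrained by sent_at only touch the relevant partitions,
    -- and old months can be detached or dropped cheaply for retention
    SELECT * FROM chat_messages
    WHERE room_id = 42 AND sent_at >= '2024-02-10' AND sent_at < '2024-02-11';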

Generally speaking, though, PostgreSQL has powerful features for pretty much every workload, hence the Golden Rule.


100% this, and even though I work for Google I absolutely agree. BUT, for the folks that need it, PostgreSQL just DOESN'T cut it, which is why we have databases like DynamoDB, Spanner, etc. Arguing that we should "Just use PG" is kind of a moot point.


I think I said this in another comment, but I'm not shitting on Spanner or DDB's right to exist here. Obviously, there are _some_ problems for which a globally distributed, ACID-compliant, SQL-compatible database is useful. However, those problems are few and far between, and many/most of them exist at companies like Google. The fact is your average small to medium size enterprise doesn't need and doesn't benefit from DDB/Spanner, but "enterprise architects" love to push them for some ungodly reason.


Don't forget PostgreSQL extensions. For something like a chat log, TimescaleDB (https://www.timescale.com/) can be surprisingly efficient. It will handle partitioning for you, with additional features like data reordering, compression, and retention policies.
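Roughly, and going from memory, so double-check the TimescaleDB docs for the current function names, the setup looks like:

    -- a regular table, turned into a hypertable chunked by time
    CREATE TABLE chat_log (
        room_id bigint      NOT NULL,
        body    text        NOT NULL,
        sent_at timestamptz NOT NULL
    );
    SELECT create_hypertable('chat_log', 'sent_at');

    -- optional: compress older chunks and drop the really old ones
    ALTER TABLE chat_log SET (timescaledb.compress);
    SELECT add_compression_policy('chat_log', INTERVAL '30 days');
    SELECT add_retention_policy('chat_log', INTERVAL '2 years');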


This is what I've done: sqlite3 for my personal stuff, postgres for everything else. I'm far from a "120 million requests per second" level though, so my experience is limited to small to mid-size ops for businesses.


Millions is tiny. Toy, even. (I work on what could be called a NoSQL database; unfortunately "NoSQL" is a term without much specificity. There are many different ways to be a non-relational database!)

My advice to you is to use PostgreSQL or, heck, don't overthink it, sqlite if it helps you get an MVP done sooner. Do NOT prematurely optimize your architecture. Whatever choice results in you spending less time thinking about this now is the right choice.

In the unlikely event you someday have to deal with billions of messages and scaling problems, a great problem to have, there are people like me who are eager to help in exchange for money.

Lots of people like to throw around the term "big data", just like lots of people incorrectly think that because Google or Amazon need XYZ solution, they too need XYZ solution. Lots of people are wrong.

If there exists a motherboard that money can buy, where your entire dataset fits in RAM, it's not "big data".


I've found it's pretty easy to massage data either way, depending on your preference. The one I'm working on now ultimately went from postgres, to mysql, to dynamo, the latter mainly for cost reasons.

You do have to think about how to model the data in each system, but there are very few cases IMO where one is strictly 'better.'


Postgres; the schema is still structured. But even if you want something less rigid, Postgres has a jsonb type and great operators for querying json.


You can also create arbitrary indices on derived functions of your JSONB data, which I think is something that a lot of people don't realize. Postgres is a really, really good NoSQL database.


Can you expand on this? Documentation or an example so I can learn?


Sure. Suppose that we have a trivial key-value table mapping integer keys to arbitrary jsonb values:

    example=> CREATE TABLE tab(k int PRIMARY KEY, data jsonb NOT NULL);
    CREATE TABLE
We can fill this with heterogeneous values:

    example=> INSERT INTO tab(k, data) SELECT i, format('{"mod":%s, "v%s":true}', i % 1000, i)::jsonb FROM generate_series(1,10000) q(i);
    INSERT 0 10000
    example=> INSERT INTO tab(k, data) SELECT i, '{"different":"abc"}'::jsonb FROM generate_series(10001,20000) q(i);
    INSERT 0 10000
Now, keys in the range 1–10000 correspond to values with a JSON key "mod". We can create an index on that property of the JSON object:

    example=> CREATE INDEX idx ON tab((data->'mod'));
    CREATE INDEX
Then, we can query over it:

    example=> SELECT k, data FROM tab WHERE data->'mod' = '7';
      k   |           data            
    ------+---------------------------
        7 | {"v7": true, "mod": 7}
     1007 | {"mod": 7, "v1007": true}
     2007 | {"mod": 7, "v2007": true}
     3007 | {"mod": 7, "v3007": true}
     4007 | {"mod": 7, "v4007": true}
     5007 | {"mod": 7, "v5007": true}
     6007 | {"mod": 7, "v6007": true}
     7007 | {"mod": 7, "v7007": true}
     8007 | {"mod": 7, "v8007": true}
     9007 | {"mod": 7, "v9007": true}
    (10 rows)
And we can check that the query is indexed, and only ever reads 10 rows:

    example=> EXPLAIN ANALYZE SELECT k, data FROM tab WHERE data->'mod' = '7';
                                                      QUERY PLAN                                                   
    ---------------------------------------------------------------------------------------------------------------
     Bitmap Heap Scan on tab  (cost=5.06..157.71 rows=100 width=40) (actual time=0.035..0.052 rows=10 loops=1)
       Recheck Cond: ((data -> 'mod'::text) = '7'::jsonb)
       Heap Blocks: exact=10
       ->  Bitmap Index Scan on idx  (cost=0.00..5.04 rows=100 width=0) (actual time=0.026..0.027 rows=10 loops=1)
             Index Cond: ((data -> 'mod'::text) = '7'::jsonb)
     Planning Time: 0.086 ms
     Execution Time: 0.078 ms
If we did not have an index, the query would be slower:

    example=> DROP INDEX idx;
    DROP INDEX
    example=> EXPLAIN ANALYZE SELECT k, data FROM tab WHERE data->'mod' = '7';
                                                QUERY PLAN                                             
    ---------------------------------------------------------------------------------------------------
     Seq Scan on tab  (cost=0.00..467.00 rows=100 width=34) (actual time=0.019..9.968 rows=10 loops=1)
       Filter: ((data -> 'mod'::text) = '7'::jsonb)
       Rows Removed by Filter: 19990
     Planning Time: 0.157 ms
     Execution Time: 9.989 ms
Hence, "arbitrary indices on derived functions of your JSONB data". So the query is fast, and there's no problem with the JSON shapes of `data` being different for different rows.

See docs for expression indices: https://www.postgresql.org/docs/16/indexes-expressional.html


Either way can work. Getting to millions of messages is going to be the hard part, not storing them.

As with all data storage, the question is usually how do you want to access that data. I don't have experience with Postgres, but a lot of (older) experience with MySQL, and MySQL makes a pretty reasonable key-value storage engine, so I'd expect Postgres to do ok at that too.

I'm a big fan of pushing the messages to the clients, so the server is only holding messages in transit. Each client won't typically have millions of messages or even close, so you have freedom to store things how you want there, and the servers have more of a queue per user than a database --- but you can use an RDBMS as a queue if you want, especially if you have more important things to work on.
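The usual trick for the RDBMS-as-queue part is FOR UPDATE SKIP LOCKED. A rough sketch, with made-up table and column names:

    -- one row per message still waiting to be delivered
    CREATE TABLE outbox (
        id           bigserial PRIMARY KEY,
        recipient_id bigint NOT NULL,
        payload      jsonb  NOT NULL,
        created_at   timestamptz NOT NULL DEFAULT now()
    );

    -- each delivery worker claims a batch; locked rows are skipped,
    -- so concurrent workers don't block each other or double-deliver
    DELETE FROM outbox
    WHERE id IN (
        SELECT id FROM outbox
        ORDER BY created_at
        LIMIT 100
        FOR UPDATE SKIP LOCKED
    )
    RETURNING recipient_id, payload;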


Seems to me like there are still plenty of relationships in a chat app. Postgres can be used like a NoSQL database via JSONB too if you want.

I think the truth is that you should use the simplest, most effective tech possible until you are absolutely certain you need something more niche.


> If I'm writing a chat app with millions of messages...

Once you have millions of messages, maybe consider moving the data-intensive parts out of postgres, if necessary.

The criticism is often that people look for big data solutions before they have big data.

If you scale out of postgres, you probably have enough users and money that you can fix it :)

But moving to NoSQL before you have to might just slow down development velocity -- also, you haven't yet learned what patterns your users have.


This is going to feel like a non-answer: but if you need to ask this question in this format, save yourself some great pain and use Postgres or MongoDB, doesn't really matter which, just something known and simple.

Normally you'd make a decision like this by figuring out what your peak demand is going to look like, what your latency requirements are, how distributed are the parties, how are you handling attachments, what social graph features will you offer, what's acceptable for message dropping, what is historical retention going to look like...[continues for 10 pages]

But if you don't have anything like that, just use something simple and ergonomic, and focus on getting your first few users. There's a long gap between when the simple choice will stop scaling and those first few users.


Thanks, I really appreciate this. DynamoDB was pretty simple to set up, all things considered. Some growing pains, but it's a dead simple schema.

I'm using SenseDeep's OneTable (https://doc.onetable.io/), which was pretty interesting to learn, in case others reading this are curious.



