Probably not but users tell me when something isn't working. I spent my entire career in this sort of development model where 80% was good enough, put it into production and then moved on. When something breaks, fix it, when something is really complex test it, when the logic is critical mock it.
This is likely to be a core issue for most people. In reality it isn't "if it breaks, fix it", it's "if we notice it breaks, and we can remember what it should do, we can probably fix it". Tests allow you to spot when you break things, and also encode what they should do.
Tests also cost money to write and to maintain. It's a tradeoff. Personally I like to work at places where the tradeoff falls in favour of writing tests. But I understand there are some businesses that just don't have a lot of money and where the cost of things breaking isn't that high, and then the right decision might be to skip tests. And one might argue if the customers don't notice it's broken, does it really matter?
I agree they cost something, although I think they give back far more than they take. In this case though, I would say that not having them in place means people won't adopt these proffered microservices, as everyone will have to individually implement tests around them.
This section was interesting! Somehow I've never realized that row oriented storage is orthogonal to how disks work...
Jd is a columnar (column oriented) RDBMS.
Most RDBMS systems are row oriented. Ages ago they fell into the trap of thinking of tables as rows (records). You can see how this happened. The end user wants the record that has a first name, last name, license, make, model, color, and date. So a row was the unit of information and rows were stored sequentially on disk. Row orientation works for small amounts of data. But think about what happens when there are lots of rows and the user wants all rows where the license starts with 123 and the color is blue or black. In a naive system the application has to read every single byte of data from the disk. There are lots of bytes and reading from disk is, by orders of magnitude, the slowest part of the performance equation. To answer this simple question all the data had to be read from disk. This is a performance disaster and that is where decades of adding bandages and kludges started.
Jd is columnar so the data is 'fully inverted'. This means all of the license numbers are stored together and sequentially on disk. The same for all the other columns. Think about the earlier query for license and color. Jd gets the license numbers from disk (a tiny fraction of the database) and generates a boolean mask of rows that match. It then gets the color column from disk (another small fraction of the data) and generates a boolean mask of matches and ANDS that with the other mask. It can now directly read just the rows from just the columns that are required in the result. Only a small fraction of the data is read. In J, columns used in queries are likely already in memory and the query runs at ram speed, not the sad and slow disk speed.
Both scenarios above are simplified, but the point is strong and valid. The end user thinks in records, but the work to get those records is best organized by columns.
Row oriented is slavishly tied to the design ideas of filing cabinets and manila folders. Column oriented embraces computers.
A table column is a mapped file.
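To make the mask-based query from the quoted section concrete, here is a minimal sketch in Python. It is not Jd's implementation: the three column lists, their contents, and the `query` function are all made up for illustration, and the columns live in memory rather than in mapped files.

```python
# Hypothetical columns: each list holds one field for every row.
license_col = ["123ABC", "999XYZ", "123DEF", "456GHI", "123JKL"]
color_col = ["blue", "black", "red", "blue", "black"]
make_col = ["Ford", "Audi", "Fiat", "Saab", "Kia"]


def query(license_prefix, colors):
    # Scan only the license column and build a boolean mask of matches.
    license_mask = [lic.startswith(license_prefix) for lic in license_col]
    # Scan only the color column and build a second mask.
    color_mask = [color in colors for color in color_col]
    # AND the two masks to find rows that satisfy both conditions.
    mask = [a and b for a, b in zip(license_mask, color_mask)]
    # Only now touch the remaining columns, and only at matching positions.
    return [
        (license_col[i], color_col[i], make_col[i])
        for i, hit in enumerate(mask)
        if hit
    ]


result = query("123", {"blue", "black"})
print(result)
```

The `make_col` list is never scanned; it is only indexed at the positions the combined mask selects, which is the point the excerpt is making.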
> Somehow I've never realized that row oriented storage is orthogonal to how disks work...
The section you posted is very misleading. Storage is arranged in blocks. The secret to database performance is how you lay out data in those blocks and how well your access patterns to the blocks match the capabilities of the device. This choice is the fundamental key to database performance.
If your database stores shopping baskets for an eCommerce site, you want each basket in the smallest number of blocks, ideally 1. It makes inserting, updating, and reading single baskets very fast on most modern storage devices.
If your database stores data for analytic queries, it's better (in general) to store each column as an array of values. That makes compression far better, and also makes scanning single columns very efficient.
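As a rough illustration of the two layouts, here is a sketch that packs the same records both ways. The record shape (basket id, item count, total in cents), the sizes, and every name below are assumptions chosen for the example, not anything from a real database engine.

```python
import struct

# Hypothetical fixed-width record: 8-byte id, 4-byte item count, 8-byte total.
ROW = struct.Struct("<qiq")  # 20 bytes per record, no padding
N = 1000
records = [(i, i % 5, i * 100) for i in range(N)]

# Row layout: whole records stored back to back.
row_store = b"".join(ROW.pack(*r) for r in records)

# Column layout: each field stored as its own contiguous array.
id_col = struct.pack(f"<{N}q", *(r[0] for r in records))
count_col = struct.pack(f"<{N}i", *(r[1] for r in records))
total_col = struct.pack(f"<{N}q", *(r[2] for r in records))


def read_basket(n):
    # Point lookup: the row layout needs one contiguous 20-byte slice.
    return ROW.unpack_from(row_store, n * ROW.size)


def sum_totals_row_layout():
    # Analytic scan over the row layout walks all 20,000 bytes.
    return sum(total for _, _, total in ROW.iter_unpack(row_store))


def sum_totals_column_layout():
    # The same scan over the column layout reads only the 8,000-byte total column.
    return sum(struct.unpack(f"<{N}q", total_col))


print(read_basket(7), len(row_store), len(total_col))
```

Both scans return the same sum; the difference is how many bytes each one has to touch, and how many places a single-record read or write has to visit.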
To say, as the article does, that "row oriented is slavishly tied to the design ideas of filing cabinets and manila folders" is nonsense. Plus there are many other choices about how to access data, including parallelization, alignment with processor caches, trading off memory vs. storage, whether you have a cost-based query optimizer, etc. Even within column stores there are big differences in performance because of these.
(Disclaimer: I work on ClickHouse and love analytic systems. They are great but not for everything.)
I would note that this query behavior (sorted data columns bitmasked together) is further orthogonal to primary-data storage representation. For example, Postgres can give you this same behavior if you declare a multi-column GIN index across the columns you want to be searchable.
If you’re interested in this topic, check out Martin Kleppmann’s book DDIA (Designing Data-Intensive Applications), where he explains storage concepts like this and many more. One of the best architecture books out there!
Ironically - He didn't even mention indexes in his description (which he admitted was simplified) - a good query optimizer will do wonders for not only coming up with the appropriate hints for the query plan, but will also dynamically adjust those hints based on the underlying data patterns.
The example he provided,
"So a row was the unit of information and rows were stored sequentially on disk. Row orientation works for small amounts of data. But think about what happens when there are lots of rows and the user wants all rows where the license starts with 123 and the color is blue or black. In a naive system the application has to read every single byte of data from the disk."
is something no modern database would ever do. The real challenge is not reading only the records that start with 123 or that are blue/black; that part is trivially handled by every database engine I'm familiar with. The query challenge is: do you filter on license # or color first? (If there are 1k records starting with 123 and 5 million blue/black vehicles, the order is pretty critical for performance.) That's one of the features that distinguishes query optimizers.
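A toy sketch of why the order matters, assuming made-up data and a deliberately naive executor that just counts how many predicate checks it runs. Real optimizers work from statistics and indexes rather than brute-force filtering, so treat this only as an illustration of the selectivity argument; `filter_in_order`, `make_row`, and the row counts are all hypothetical.

```python
def filter_in_order(rows, predicates):
    """Apply predicates one after another; return survivors and checks performed."""
    checks = 0
    survivors = rows
    for pred in predicates:
        checks += len(survivors)  # every surviving row is tested once
        survivors = [r for r in survivors if pred(r)]
    return survivors, checks


def make_row(i):
    # Hypothetical data: 1 row in 1000 has a license starting with "123",
    # and every second row is blue.
    prefix = "123" if i % 1000 == 0 else "900"
    color = "blue" if i % 2 == 0 else "red"
    return (f"{prefix}{i:05d}", color)


rows = [make_row(i) for i in range(10_000)]


def by_license(row):
    return row[0].startswith("123")


def by_color(row):
    return row[1] in ("blue", "black")


license_first, license_first_checks = filter_in_order(rows, [by_license, by_color])
color_first, color_first_checks = filter_in_order(rows, [by_color, by_license])
print(license_first_checks, color_first_checks)
```

Both orders return the same rows, but applying the more selective license filter first leaves far fewer rows for the second predicate to examine.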
Columnar databases are awesome when you have columnar data to work with. I've seen 20-30x reductions in disk storage in the wild (and you can obviously create synthetic examples that go way north of that), but a well-indexed SQL database backed by a solid query optimizer/planner can probably hold its own against a columnar database in terms of lookup performance, particularly if your data is row-oriented to begin with.