
Depending on the IOPS rate of your app, SPDK can result in less CPU time spent issuing IO and reaping completions than, e.g., io_uring.

For actual data on this, see "What Modern NVMe Storage Can Do, And How To Exploit It: High-Performance I/O for High-Performance Storage Engines": https://www.vldb.org/pvldb/vol16/p2090-haas.pdf

Of course, if your block size is large enough and/or your design batches heavily enough that you already don't spend much time issuing IO and reaping completions, then, as you say, SPDK will not provide much of a gain.
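
For a sense of what "batching enough" looks like with io_uring, here is a minimal liburing sketch (hypothetical; assumes a ring initialized with at least BATCH entries and a file opened with O_DIRECT plus suitably aligned buffers): one io_uring_submit() issues a whole batch of reads, and completions are reaped in bulk, so the per-IO syscall cost is amortized away.

  #include <liburing.h>
  #include <stdlib.h>

  #define BATCH 32
  #define BLOCK 4096

  /* Issue BATCH reads with a single syscall, then reap whatever
     completions are already available without further syscalls. */
  static void read_batch(struct io_uring *ring, int fd) {
      for (unsigned i = 0; i < BATCH; i++) {
          struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
          void *buf = aligned_alloc(BLOCK, BLOCK);   /* leaked; sketch only */
          io_uring_prep_read(sqe, fd, buf, BLOCK, (unsigned long long)i * BLOCK);
      }
      io_uring_submit(ring);                    /* one syscall for BATCH IOs */

      struct io_uring_cqe *cqes[BATCH];
      unsigned n = io_uring_peek_batch_cqe(ring, cqes, BATCH);
      /* ... consume cqes[0..n) ... */
      io_uring_cq_advance(ring, n);             /* bulk-acknowledge completions */
  }

With SPDK the submit/reap path is a pure userspace poll loop against the NVMe queue pairs, so even this residual per-batch syscall disappears; the paper above quantifies when that difference actually matters.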


From the PDF of the bill, software development is still classified as R&D (page 303 of the bill); the change is that domestic (US) software R&D is no longer forced onto a 5-year amortization schedule.

Foreign software R&D must still be amortized over a 15-year period; see pages 304-306 of the bill.


Nice to see SDC (silent data corruption) concerns being taken more seriously by hardware folks. Once software gets to sufficient quality (which we have achieved in many cases), these kinds of random hardware issues are the only remaining causes of "impossible" bugs that waste endless engineering time to debug.

I wonder how much of this relies on, or is made easier by, the clustered-core architecture of E-Core Xeons. In comparison, each physical core of a P-Core Xeon is basically its own island.


The gap between the efficiency displayed here and that found in, e.g., Postgres/MySQL is insane.


Its overall total bandwidth is better than DDR5 SPR by integer factors; I think they went for minimal investment and time to market with the SPR-with-HBM product rather than heavy investment to hit full bandwidth utilization, which may have made sense for Intel overall given the business context.


Completely agree re: firedancer codebase. There is a level of thought and discipline wrt performance that I have never seen anywhere else.


It's much more than just performance they've thought about. Here are some of the secure programming practices that have been implemented:

  /* All the functions in this file are considered "secure", specifically:
     - Constant time in the input, i.e. the input can be a secret[2]
     - Small and auditable code base, incl. simple types
     - Either, no local variables = no need to clear them before exit (most functions)
     - Or, only static allocation + clear local variable before exit (fd_ed25519_scalar_mul_base_const_time)
     - Clear registers via FD_FN_SENSITIVE[3]
     - C safety
  */
libsodium[4] implements similar mechanisms, and the Linux kernel's crypto code does too (for example, its use of kfree_sensitive)[5]. However, firedancer appears to be better at avoiding moving secrets outside of CPU registers, and [3] explains that libraries such as libsodium have inadequate zeroisation, something firedancer claims to improve upon.
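
For flavor, here is what "constant time in the input" and clearing a secret look like in plain C (a minimal sketch, not firedancer's actual code; explicit_bzero is the glibc/BSD helper that the optimizer won't elide the way it can a plain memset):

  #include <stdint.h>
  #include <stddef.h>
  #include <string.h>

  /* Compare two secrets with no early exit, so the timing does not
     depend on where they first differ. */
  static int ct_equal(const uint8_t *a, const uint8_t *b, size_t n) {
      uint8_t diff = 0;
      for (size_t i = 0; i < n; i++) diff |= a[i] ^ b[i];
      return diff == 0;
  }

  /* Clear a secret before its buffer goes out of scope. */
  static void wipe_secret(uint8_t *buf, size_t n) {
      explicit_bzero(buf, n);
  }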

[1] https://github.com/firedancer-io/firedancer/blob/main/src/ba...

[2] https://en.wikipedia.org/wiki/Elliptic_curve_point_multiplic...

[3] https://eprint.iacr.org/2023/1713

[4] https://libsodium.gitbook.io/doc/internals#security-first

[5] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...


These are table stakes for core cryptographic code, and state-of-the-art crypto code (like the Amazon implementation this story is about) at this point tends to be derived from formal methods.


As an example, the Amazon implementation doesn't use gcc's[1] and clang's[2] "zero_call_used_regs" attribute to zeroise CPU registers when a function working on crypto secrets returns or propagates an exception. OpenSSL doesn't either.[3] firedancer _does_ use "zero_call_used_regs" to have gcc/clang zeroise the used CPU registers.[9]

As another example, the Amazon implementation also doesn't use gcc's "strub" attribute, which zeroises a function's stack frame when a function working on crypto secrets returns or propagates an exception.[4][5] OpenSSL doesn't either.[3] firedancer _does_ use the "strub" attribute to have gcc zeroise the function's stack.[9]
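
For concreteness, applying those attributes looks roughly like this (a sketch; the function name is hypothetical, "used" and "internal" are just two of the documented argument choices, and the details are in the gcc/clang docs at [1], [2], [4] and [5]):

  #include <stdint.h>

  /* Zero the call-used registers on return (gcc >= 11, clang >= 15)
     and scrub this function's stack frame on return (gcc >= 14). */
  __attribute__((zero_call_used_regs("used")))
  __attribute__((strub("internal")))
  static void derive_session_key(uint8_t out[32], const uint8_t secret[32]) {
      /* ... work on the secret, keeping it in locals/registers ... */
  }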

Is there a performance impact? [6] puts the overhead of CPU register and stack zeroisation at 0% for X25519. Compiling the Linux kernel with "CONFIG_ZERO_CALL_USED_REGS=1" for x86_64 (impacting all kernel functions) was found to result in a 1-1.5% performance penalty.[7][8]

[1] https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attribute...

[2] https://clang.llvm.org/docs/AttributeReference.html#zero-cal...

[3] https://github.com/openssl/openssl/discussions/24321

[4] https://gcc.gnu.org/onlinedocs/gcc-14.2.0/gcc/Common-Type-At...

[5] https://gcc.gnu.org/onlinedocs/gcc/Stack-Scrubbing.html

[6] https://eprint.iacr.org/2023/1713.pdf

[7] https://www.phoronix.com/review/zero-used-regs/5

[8] https://lore.kernel.org/lkml/20210505191804.4015873-1-keesco...

[9] FD_FN_UNSANITIZED: https://github.com/firedancer-io/firedancer/blob/master/src/...


Zeroizing a register seems pretty straightforward. Zeroizing any cache that it may have touched seems a lot more complex. I guess that's why they work so hard to keep everything in registers. Lucky for them we aren't in the x86 era anymore and there are a useful number of registers. I'll need to read up on how they avoid context switches while their registers are loaded.


That team is full of world experts in high performance computing.


On prod servers I see a bunch of frontend stalls and L2 code misses for the kernel TCP stack; having each process statically embed its own network stack may make that worse (though using a dynamically shared userspace QUIC library, for example, across multiple processes partially addresses that, with other tradeoffs).

Of course, depending on the use case, the benefit from first-order network behavior improvements is almost certainly more important than the second-order cache pollution effects of replicated/separate network stacks.


When using a userspace stack you can (and should!) optimize your program during and after link to put hot code together on the same or nearby cache lines and pages. You cannot do this, or anything approximating it, between an application and the Linux kernel: when Linux is built, the linker doesn't know which parts of its sprawling network stack are hot or cold.
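
A rough illustration of that kind of control in userspace code (a sketch; the function names are hypothetical, and the usual machinery is hot/cold attributes plus -ffunction-sections with a linker ordering file, or a post-link optimizer such as BOLT fed by a profile):

  struct pkt;  /* hypothetical packet type */

  /* Mark the fast path hot and the error path cold, so the compiler
     and linker can pack hot functions onto the same pages and cache
     lines and keep the cold ones out of the way. */
  __attribute__((hot))  void quic_rx_fast_path(struct pkt *p);
  __attribute__((cold)) void quic_rx_handle_malformed(struct pkt *p);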


I wonder how far we are from "The Birth and Death of JavaScript" making such a thing possible.


Thanks for this link; did not realize that they did this.


Answering only the latter question:

A Primer on Memory Consistency and Cache Coherence, Second Edition

https://www.morganclaypool.com/doi/10.2200/S00962ED2V01Y2019...

(free online book) would help


" Like many modern analytical engines [18, 20], Procella does not use the conventional BTree style secondary indexes, opting instead for light weight secondary structures such as zone maps, bitmaps, bloom filters, partition and sort keys [1]. The metadata server serves this information during query planning time. These secondary structures are collected partly from the file headers during file registration, by the registration server, and partly lazily at query evaluation time by the data server. Schemas, table to file mapping, stats, zone maps and other metadata are mostly stored in the metadata store (in Bigtable [11] and Spanner [12])."

https://storage.googleapis.com/pub-tools-public-publication-...
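
To make the zone-map idea concrete, a minimal sketch (hypothetical types): each block of rows keeps a per-column min/max, and the scan skips any block whose range cannot possibly satisfy the predicate.

  #include <stdbool.h>
  #include <stdint.h>

  /* Per-block zone map entry for one int64 column. */
  struct zone_map {
      int64_t min;
      int64_t max;
  };

  /* True if the block might contain rows with value v; false lets the
     scan skip the block without reading any of its data. */
  static bool zone_may_contain(const struct zone_map *z, int64_t v) {
      return v >= z->min && v <= z->max;
  }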


> modern analytical engines

What about relational databases? Most are best suited for OLTP workloads.

