dmw_ng's comments

I'm a restic user, but have resisted the urge to attempt a bikeshed for a long time, mostly due to perf. Its index format seems to be slow and terrible, and the chunking algorithm it uses (Rabin fingerprints) is very slow compared to more recent alternatives (like FastCDC). Drives me nuts to watch it chugging along backing up or listing snapshots at nowhere close to the IO rate of the system while still making the fans run. Despite that, it still seems to be the best free software option around.
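
For anyone curious about the perf gap: here's a rough Python sketch of the gear-hash loop at the heart of FastCDC-style chunking (not restic's code, and it leaves out FastCDC's normalized-chunking refinement; the size constants are arbitrary). Per byte it's one table lookup, a shift, an add and a mask, versus the modular polynomial arithmetic a Rabin fingerprint needs.

  import random

  # A real chunker ships a fixed, precomputed 256-entry table so boundaries
  # are stable across runs; seeding here just keeps the sketch deterministic.
  random.seed(0)
  GEAR = [random.getrandbits(64) for _ in range(256)]
  MASK = (1 << 13) - 1                      # ~8 KiB average chunk
  MIN_CHUNK, MAX_CHUNK = 2 << 10, 64 << 10  # 2 KiB min, 64 KiB max

  def cut_points(data: bytes):
      # Yields chunk end offsets; one lookup/shift/add/mask per byte.
      h, start = 0, 0
      for i, b in enumerate(data):
          h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
          size = i - start + 1
          if size < MIN_CHUNK:
              continue
          if (h & MASK) == 0 or size >= MAX_CHUNK:
              yield i + 1
              start, h = i + 1, 0
      if start < len(data):
          yield len(data)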


> Its index format seems to be slow and terrible, and the chunking algorithm it uses (Rabin fingerprints) is very slow compared to more recent alternatives (like FastCDC).

Hi, can you elaborate more on those two points? (Specifically, what makes the index format so bad?) Or link to somewhere I can learn more?


You could try running rustic on your repository. It should be a drop-in replacement for restic, and maybe it's faster? I would actually be very interested in this. Would be great if you could do that and report back.


Have you opened issues with suggested algo improvements? They might be open to them.

Even if restic isn't interested, maybe the rustic dev will be.


You should check out Kopia. It’s absolutely wonderful, is fast, has similar features, and has a GUI if you’re into that.


Currently using Kopia because Restic has no GUI and Borg requires adding a community-maintained Synology package, and it didn't "Just work" when I tried.

It's amazing! The GUI isn't perfect but the fact that there is an official GUI at all is great.


I don't need a GUI but an official config format or a manager tool for restic would be cool. There ARE a few good 3rd party ones of course.


That's been a feature of S3 for quite a long time now, called S3 Select: https://docs.aws.amazon.com/AmazonS3/latest/userguide/select...

Despite it being an awesome feature I've been itching to use, I've never actually found a use for it beyond messing around. Most places where S3 Select might make sense seem to be subsumed (for my uses) by Athena. Athena has a rather large amount of conceptual and actual boilerplate to get up and running with, though; S3 Select requires no upfront planning beyond building a fancy query string (or using their SDK wrappers).

Where S3 Select is likely to become fiddly is anywhere multiple files are involved. Athena makes querying large collections of CSVs (etc) straightforward, and handles all the scheduling and results merging for you.
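
For reference, the single-object case really is just a query string plus a bit of event-stream plumbing. A rough boto3 sketch (bucket, key and column names are made up here, and as noted downthread the feature is no longer offered to new accounts):

  import boto3

  s3 = boto3.client("s3")

  # Run a SQL expression against one CSV object; S3 streams back only the
  # matching rows as an event stream.
  resp = s3.select_object_content(
      Bucket="my-bucket",
      Key="logs/2024-01-01.csv",
      ExpressionType="SQL",
      Expression="SELECT s.ip, s.status FROM s3object s WHERE s.status = '500'",
      InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
      OutputSerialization={"CSV": {}},
  )

  for event in resp["Payload"]:
      if "Records" in event:
          print(event["Records"]["Payload"].decode(), end="")
      elif "Stats" in event:
          d = event["Stats"]["Details"]
          print(f"scanned={d['BytesScanned']} returned={d['BytesReturned']}")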


S3 Select is no longer available to new customers. Athena with a columnar file format (e.g. Parquet) in S3, plus partitioning via the Glue Data Catalog, is the solution to OP's problem. The cost of these queries is very low because you only pay for the data actually consumed/requested, and with a columnar format Athena only reads the necessary columns. The data in those columns is usually compressed, so the amount scanned is smaller still.
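
Roughly what that looks like from code via boto3, as a sketch (the database, table and output location are placeholders, and the Glue Data Catalog table and partitions are assumed to exist already):

  import time
  import boto3

  athena = boto3.client("athena")

  # Kick off the query; Athena writes results to the given S3 location.
  qid = athena.start_query_execution(
      QueryString="SELECT status, count(*) AS n "
                  "FROM logs WHERE day = '2024-01-01' GROUP BY status",
      QueryExecutionContext={"Database": "my_glue_db"},
      ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
  )["QueryExecutionId"]

  # Poll until it finishes, then read the result rows back.
  while True:
      state = athena.get_query_execution(QueryExecutionId=qid)[
          "QueryExecution"]["Status"]["State"]
      if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
          break
      time.sleep(1)

  if state == "SUCCEEDED":
      for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
          print([c.get("VarCharValue") for c in row["Data"]])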


> Amazon S3 Select is no longer available to new customers. Existing customers of Amazon S3 Select can continue to use the feature as usual.

:(

But you and patrickthebold are spot on in pointing out Athena. I've always thought of it as a database you load via S3, but of course it's equally a tool for querying data in S3.


Modern .NET on Linux is lovely: you can initialize a project, pull in the S3 client, and write a 1-3 line C# program that AOT-compiles to a single binary with none of the perf issues or GIL hand-wringing that plague life in Python.

Given that modern Python means type annotations everywhere, the convenience edge between it and modern C# (which dispenses with much of the Java-esque boilerplate) is surprisingly thin, and the capabilities of the .NET runtime are far superior in many ways, making it quite an appealing alternative, especially for perf-sensitive stuff.


Do your civic duty and disable telemetry everywhere you go. :)

export DOTNET_CLI_TELEMETRY_OPTOUT=1


I don't understand. How does that help cross platform?

All I see is a manager saying, "the data shows no one uses it"


I recently bought an N100 and within a matter of days got buyer's remorse and impulse-purchased an N305 to go right beside it. It's currently sitting with a wildly overpriced 48 GB stick and a 2TB SN850X installed, and it's an absolute joy, both in perf and in the absence of heat it generates.

The only thing I'd reserve judgement on is the tendency to throttle. I haven't got far enough to characterize it, but it's not clear how much value those extra cores add over the N100 with TDP settings tweaked down in the BIOS, and if the N305 is left to run at max TDP, heat/noise/cost/temperature-related instability may start to become an issue, especially when packing other hot components like a decent SSD into the tiny cases they come in.


Which N305 SBC/computer did you buy?


Seems like a massive distraction from their offering for a small company, wonder why they didn't consider something like tight integration with OnlyOffice or similar. Setting out to build a new office suite feels about as sensible as building a new web browser from scratch. Except at least with a browser, you have open specs helping you through most of the endless supply of compatibility problems.


I don't think they built this from scratch, they acquired a company that did something similar (Standard Notes) and are using their technology to build this.


They can't: all Proton products are end-to-end encrypted, so typical solutions like OnlyOffice won't work.


For anyone: E2E does not mean 'private'.


How?


> converting huge amount of xml files

> pickling

Sounds like, if this is the tooling and the task at hand, about the most complex things that should be passing through the pickler are partitioned lists of filenames rather than raw data. E.g. you can have each partition generate a Parquet file for combining in a final step (pyarrow.concat_tables() looks useful), or, if it were some other format you were working with, potentially send flat arrays back to the parent process as giant bytestrings or similar.

This is not to say the limitations don't suck, just that very often there are simple approaches to avoid most of the pain
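
A hypothetical sketch of that shape of solution, where only filenames and small strings ever cross the process boundary; parse_xml_rows() stands in for whatever the real XML extraction does:

  from multiprocessing import Pool
  import pyarrow as pa
  import pyarrow.parquet as pq

  def parse_xml_rows(path):
      # Placeholder: return a list of dicts extracted from one XML file.
      raise NotImplementedError

  def convert_partition(args):
      part_id, paths = args
      rows = [row for p in paths for row in parse_xml_rows(p)]
      out = f"part-{part_id:05d}.parquet"
      pq.write_table(pa.Table.from_pylist(rows), out)
      return out  # only a filename gets pickled back to the parent

  def convert_all(xml_paths, n_parts=16):
      partitions = [(i, xml_paths[i::n_parts]) for i in range(n_parts)]
      with Pool() as pool:
          parts = pool.map(convert_partition, partitions)
      # Final combine step; each partition file is modest, so this stays cheap.
      combined = pa.concat_tables([pq.read_table(p) for p in parts])
      pq.write_table(combined, "combined.parquet")

  # On spawn-based platforms, call convert_all() under an
  # `if __name__ == "__main__":` guard.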


It's comical to see the sudo codebase mentioned in the same breath as increasing auditability here


Sufficiently fast software often allows leaving out whole layers of crap and needless indirection, the most common being caching. Fixing an algorithm so you can omit a dedicated database of intermediate results can be a huge maintainability/usability improvement. The same principle appears all over the place, e.g. immediate mode UIs, better networking (e.g. CSS image tiling vs. just fixing small request overhead in HTTP1 vs. QUIC), importing giant CSV files via some elaborate ETL process vs. just having a lightning fast parser+query engine, etc.

Depending on how you look at it, you could view large chunks of DOM state through this lens, as intermediate data that only exists to cache HTML parsing results. What's the point of allocating a hash table to represent element attributes if they are unchanged from the source document, and reparsing them from the source is just as fast as keeping around the parsed form? etc. These kinds of tricks only tend to present themselves after optimization work is done, which is annoying, because it's usually so damn hard to justify optimization work in the first place.


I've had a similar experience with DuckDB: it feels nicer to use on the surface, but in terms of perf and actual functionality I've had better results with clickhouse-local.


Are you using it for simple SQL retrieval or complex analytic queries? They’re both similar for the former use case, but DuckDB — being an analytic engine — supports the latter use case much better.
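
To make the distinction concrete, this is roughly the kind of query meant by 'complex analytic' here, using the DuckDB Python API over a Parquet file (filename and column names are made up):

  import duckdb

  # Aggregation plus a window function, pushed straight over a Parquet file
  # with no separate load step; simple retrieval would just be a
  # SELECT ... WHERE id = ... lookup.
  rows = duckdb.sql("""
      SELECT user_id,
             count(*)                             AS events,
             avg(duration_ms)                     AS avg_duration_ms,
             rank() OVER (ORDER BY count(*) DESC) AS rnk
      FROM 'events.parquet'
      GROUP BY user_id
      ORDER BY rnk
      LIMIT 10
  """).fetchall()
  print(rows)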


SPF+DKIM+DMARC are a classic case of Goodhart's law: the amount of spam they stop these days (at least anecdotally) is minimal. Most spam I get seems to come via Salesforce infrastructure and a variety of similar bulk email marketing providers.


SPF definitely stops most 'stupid' spam (with the second-most valuable metric being EHLO-to-rDNS correspondence). Now, Salesforce and most other non-malicious transactional/list-based SaaSes present other challenges, mostly solved by applying SPF to their content From: header in addition to the SMTP 'mail from' address.

This also involves promoting sender domains from 'DATA reject' to 'MAIL FROM reject' based on behavior, since most spammers see 'MAIL FROM accept' as a win, and won't check any further results.


Proper SPF/DKIM/DMARC at least prevents brand reputation abuse via spoofing (in many cases), which at least blocks a good amount of bullshit phishing and BEC efforts.
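
For anyone setting this up, the three mechanisms boil down to a handful of DNS TXT records along these lines (example.com, the DKIM selector and the ESP include are placeholders, and the DKIM public key is elided):

  example.com.                  TXT "v=spf1 mx include:_spf.bulk-esp.example -all"
  sel1._domainkey.example.com.  TXT "v=DKIM1; k=rsa; p=<base64 public key>"
  _dmarc.example.com.           TXT "v=DMARC1; p=reject; rua=mailto:dmarc-reports@example.com"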

