Hacker News
Buku v3.0 – Command Line Bookmark Manager (github.com/jarun)
106 points by apjana on April 26, 2017 | hide | past | favorite | 65 comments


It already seems to have some nice features, but for me the dream bookmark manager would be something really simple with two commands like:

$ bookmark add http://...

That will:

1. Download a static copy of the webpage as a single HTML file, plus an exported PDF copy, taking care to strip ads and unrelated content from what is stored.

2. Run something like http://smmry.com/ to create a summary of the page in a few sentences and store it.

3. Use NLP techniques to extract the principal keywords and use them as tags.

And another command like:

$ bookmark search "..."

That will:

* Not use regexp or complicated search pattern, but instead;

* Search in titles, tags, AND page content smartly and interactively, and;

* Sort/filter results smartly by relevance, number of matches, frecency, or anything else useful

Everything would be stored in a git repository or a simple file structure for easy synchronization; bonus points for browser integrations.
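Step 3 of the `add` command (keyword extraction) could start much simpler than full NLP. A minimal Python sketch using only the stdlib; the stopword list is illustrative, and a real tool would use a proper NLP library (NLTK, spaCy) plus readability-style content extraction first:

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Illustrative stopword list; a real tool would pull one from NLTK or similar.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
             "it", "for", "on", "that", "this", "with", "as", "are"}

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style blocks."""
    def __init__(self):
        super().__init__()
        self.chunks, self._skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def extract_keywords(html, count=5):
    """Naive keyword extraction: most frequent non-stopword terms."""
    parser = TextExtractor()
    parser.feed(html)
    words = re.findall(r"[a-z]{3,}", " ".join(parser.chunks).lower())
    freq = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in freq.most_common(count)]
```

Something like `extract_keywords(downloaded_html, count=5)` would give candidate tags to store alongside the bookmark.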


I've been thinking along these lines, some other features I'd like:

- ability to have certain sites run site-specific extra processing: e.g. running youtube-dl on YouTube links

- ability to have a list of sites archived periodically instead of only once, plus the option to be notified when a site updates, even if it only ran as a batch job

- ability to ingest a PDF or ebook, identify all the URLs, snapshot them all, and present them as a list linking to the original, the cached version, and the page location

- would also be nice if the data could be stored in a human readable structure in a normal filesystem, so your ability to use the data isn't dependent on your ability to run the tool.

Overall I think it is an interesting project but the commercial potential is limited.

EDIT: maybe the document processing and periodic check thing would make more sense as a higher level tool that depended on the bookmarking tool -- and the extra processing also might make more sense as a plugin type architecture.


> would also be nice if the data could be stored in a human readable structure in a normal filesystem, so your ability to use the data isn't dependent on your ability to run the tool.

This is really important yes!

> Overall I think it is an interesting project but the commercial potential is limited.

Indeed. This kind of tool is mostly limited to hackers, and that's not a big enough market to support a business model, I suppose.

This would have to be done for the love of open-source :-)


There's an option to filter the print output. But yes, currently the data is stored in SQLite only.

Yes, buku doesn't intend to be a commercial utility. I use bookmarks as a context pointer for everything I do. So I wrote buku. But it's written as a library so other projects can use it.


On the filesystem, I'm thinking of some sort of (possibly virtual?) document filesystem structured by:

* Title

* Author

* Subject

* Date

* Other?

The problem with a fixed / structured filesystem is that it's largely inflexible. The next option would be to have some sort of hybrid -- a persistent data store plus, say, hardlinks or symlinks onto that store based on other elements, could work, but would be somewhat annoying to maintain (though not impossible, and a tools-based approach might well work).

The advantage to a filesystem-based approach is that you'd be able to use any filesystem-based tools on it: find, grep, ls, cd, etc.
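The hybrid idea above (a flat persistent store plus link farms for each view) is straightforward to sketch. A hedged example assuming a flat store directory and externally supplied tags; `build_tag_view` is a hypothetical helper, not an existing tool:

```python
import os

def build_tag_view(store_dir, view_dir, tags):
    """Create view_dir/<tag>/<name> symlinks into a flat store.

    tags maps a stored filename to its list of tag names. Re-running
    is idempotent: existing links are left alone."""
    for fname, file_tags in tags.items():
        target = os.path.join(store_dir, fname)
        for tag in file_tags:
            tag_dir = os.path.join(view_dir, tag)
            os.makedirs(tag_dir, exist_ok=True)
            link = os.path.join(tag_dir, fname)
            if not os.path.islink(link):
                os.symlink(target, link)
```

After this, `find`, `grep -r`, `ls view/fiction/`, etc. all work against the views, which is exactly the filesystem-tools advantage.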

This also points to the distinct limitations of filesystems and naming conventions for document-oriented systems generally. They're sometimes OK for statically-defined computer systems (/proc, /sys, /dev, /dbus, ... are all exceptions for which virtualising the filesystem is the current fix), but when it comes to the human realm, the oversimplification and reliance on standards create pain.

A related problem is: how do you identify a given work, reasonably uniquely and reasonably persistently across variants?

Taking a content hash works for git, but doesn't apply particularly well to a human-readable document where whitespace, casing, characterset, character substitutions (straight vs. curly quotes, "-" and "--" vs. en and em dash, etc.), translations, multiple output formats, etc., might all create unique (and unrelatable) hash fingerprints.
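One partial fix is to normalize exactly those variations before hashing, so trivially different renderings of the same text collapse to one fingerprint. A sketch (the substitution table is illustrative, not exhaustive, and translations or format changes are of course still out of reach):

```python
import hashlib
import re
import unicodedata

# Character substitutions that commonly vary across editions/formats.
SUBSTITUTIONS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "--",  # en/em dash
}

def normalized_fingerprint(text):
    """Hash of text after unifying case, whitespace runs, Unicode
    compatibility forms, and common punctuation variants."""
    text = unicodedata.normalize("NFKC", text)
    for src, dst in SUBSTITUTIONS.items():
        text = text.replace(src, dst)
    text = re.sub(r"\s+", " ", text).strip().casefold()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```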

In catalogues, some tuple of author, title, and publication date generally suffices, and creates the general outline of a unique, but relatable, identity foundation. A book might have multiple editions or publishers, or multiple formats (ISBN relates these to the core work). Sub-parts (a chapter or section) might be included in other works. Etc.

I'd really like to have the option, say, of creating and relating:

* The source document.

* Some standardised markup (Markdown, LaTeX, some BDSM HTML5 or similar format, etc.).

* Generated outputs (PDF, PS, ePub, DJVU, Mobi, ASCII, etc.)

* Metadata: Author, title, publisher, publication, date, URL, language, various identifiers (LoCCN, ISBN, DOI, ...), etc., subject(s), rating(s), review(s), citation(s).

For research, this could be invaluable.


> The problem with a fixed / structured filesystem is that it's largely inflexible. The next option would be to have some sort of hybrid -- a persistent data store plus, say, hardlinks or symlinks onto that store based on other elements, could work, but would be somewhat annoying to maintain (though not impossible, and a tools-based approach might well work).

What you are describing here is called a semantic file system, and there are some implementations of it, for instance https://www.tagsistant.net/


Thanks, I wasn't aware of the term (though I'm familiar with semantic information systems, e.g., the "semantic Web", in other contexts).

Tagsistant is ... getting there. From the description, still not quite what I'm looking for. Its duplicate-file detection, for example, would note the same logical file being present twice, but not, say, War and Peace in LaTeX, generated PDF, and scanned TIFF forms.

Some way of recognising the latter as, if not the same, then at the very least related somehow, would be highly useful.


Yes, buku can be used as a bookmarking engine in a bigger project that also scrapes data. It has REST APIs to do that.


Except for the exported PDF, ad removal, and git, you've basically described Pinboard.in. I'm pretty sure it searches the content, not just the title, tags, and comment you left. It saves a copy of the page so that the site disappearing doesn't mean you've lost the info. It's not downloadable, but I don't know why I'd want to back up their backup anyway. It suggests tags/keywords (probably by harvesting the plethora of other people bookmarking things).

And it's got an API, so you could make a command line client.
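A command-line client against Pinboard's v1 API would mostly be URL construction. A sketch that builds (but doesn't send) the request URL; parameter names are as documented for `posts/add`, but verify against Pinboard's API docs before relying on them:

```python
import urllib.parse

PINBOARD_ADD = "https://api.pinboard.in/v1/posts/add"

def pinboard_add_url(token, url, title, tags=()):
    """Build the GET URL for Pinboard's posts/add endpoint.

    token is the 'user:APITOKEN' string from the settings page;
    Pinboard calls the bookmark title 'description'."""
    params = {
        "auth_token": token,
        "url": url,
        "description": title,
        "tags": " ".join(tags),
        "format": "json",
    }
    return PINBOARD_ADD + "?" + urllib.parse.urlencode(params)
```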


Ad removal and "git/simple file structure" are mandatory features for me (PDF export is optional though).

But first and foremost, it should be an open-source tool; I forgot to mention it. I don't want to be stuck because (as already said) a company/website closes. Yes, backups are possible, but when the tool is gone, it's gone, and backups are not really useful: people would have to spend days figuring out a new tool and how to import the existing data. Open source protects against that kind of issue.

I'm not saying this tool is bad; it is probably really nice for some people. But at least for me it does not fulfill my needs.


The thing is, Pinboard can decide to just shut down, and there go all your saved webpages. Besides, if you want to save the webpages and have full-text search, you have to pay for the premium package every year, forever.


a) https://pinboard.in/settings/backup

b) Do you seriously not consider $25 per year reasonable for such a useful service?*

* I would pay that just to read maciej's fantastic presentations.


I'll build this. It sounds like a useful and fun project. I'll build it using Crystal to be able to ship a single binary with no dependencies. SQLite will probably be enough for this project, so it'll ship with its own DB.


Also, Mozilla's Readability library[0] should help you extract only the relevant content (this is what's behind Firefox's reading mode). So the only semi-difficult part is the NLP.

[0] https://github.com/mozilla/readability


Let us know the project URL if you start it, so we can follow your work!

If I were doing it, I would probably go with Python because of the existing ecosystem: NLP, readability, beautifying, image processing, deep learning, ... Everything probably already exists; the goal would be to assemble the pieces like playing with Lego :-) in a much more complicated way, of course.

But I'm not doing it, so your choice is the correct one :-) Have a nice time working on this project!

PS: Even image captioning is almost already implemented, amazing: https://github.com/tensorflow/models/tree/master/im2txt


If you want other output formats, there's little you can do to improve over pandoc. That will generate ePub, .mobi, DJVU, PDF, PS, and a multitude of other formats, on the fly. HTML is a valid input for most of those.

The main problem isn't pandoc, but HTML -- the crap that passes for Web-compatible today is simply any asshat's bad idea. I see as highly useful something which looks at what's been downloaded and reduces it to a hugely simplified structure -- Markdown will almost always be sufficient.

I've found, in writing my own strippers and manually re-writing HTML, that body content rarely amounts to more than paragraphs, italic/emphasis, and anchors/hrefs. Better-written content has internal structure via headers. Bold itself is almost never used within CMS systems for body text, it's almost always a tell for advertising or promotional content.

The sad truth is that a manual rewrite or re-tagging of pages in Markdown is often the best option I've got for getting something remotely reasonable. The good news is that that's actually a good way to read an article, even if you find, on reading, that it's not worth keeping :)
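For what it's worth, the stripping described above (paragraphs, emphasis, anchors, headers and little else) fits in a few dozen lines of Python's stdlib `html.parser`. A rough sketch, not production-grade:

```python
from html.parser import HTMLParser

class MarkdownStripper(HTMLParser):
    """Reduce HTML to the handful of structures body text actually
    uses: headers, paragraphs, emphasis, and links."""
    def __init__(self):
        super().__init__()
        self.out, self._href = [], None
    def handle_starttag(self, tag, attrs):
        if tag in ("em", "i"):
            self.out.append("*")
        elif tag in ("h1", "h2", "h3"):
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag == "a":
            self._href = dict(attrs).get("href", "")
            self.out.append("[")
    def handle_endtag(self, tag):
        if tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a":
            self.out.append(f"]({self._href})")
    def handle_data(self, data):
        self.out.append(data)

def html_to_markdown(html):
    """Feed HTML through the stripper and return Markdown-ish text."""
    p = MarkdownStripper()
    p.feed(html)
    return "".join(p.out).strip()
```

Everything not in the handled set (bold, divs, spans, scripts) simply falls away to its text content, which matches the observation that bold is usually promotional noise anyway.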


I agree with all that. Lots of good wisdom there.

As for HTML-to-Markdown conversion, http://markdownrules.com is good.


Neat idea, we'll surely consider it. Thanks!


The autotagging feature would be so useful. I've been looking for something that does it for a long while.


At Wire, we've actually been working on "autotagging" (and more) for the past two years. We're focused on mobile and don't have a desktop version yet. Our current version tags a page based on the keywords in the metadata and title; we're still fine-tuning it and will include the content in the future. The way Wire works is: instead of conventional bookmarks, users just save what they find and use the same search engine they use every day to find it again. What the user saves is stored for offline viewing. In addition, we've also included an offline p2p feature that allows the user and a nearby offline friend to share what they've saved.

Check it out on: https://goo.gl/xMgxfJ


Pinboard.in doesn't "autotag", but it suggests tags and you can click them to have them added to the form you're saving. It's quick and easy.


Glad to hear!

I like the HN community; it gathers a lot of enthusiastic people!


How would that deal with pictures?


I'm not sure at which level you're asking this. For storing the images: Base64-encoded in the HTML files. For making them useful: I don't think it's the most important feature, but if need be it is possible to think about using OCR to extract text, or deep learning to describe them [1].

[1] https://research.googleblog.com/2014/11/a-picture-is-worth-t...
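Inlining images as Base64 data URIs is a one-liner in most languages. A Python sketch:

```python
import base64

def image_to_data_uri(image_bytes, mime="image/png"):
    """Encode raw image bytes as a data: URI for embedding in a
    single-file HTML snapshot. Inflates size by roughly 4/3."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"
```

The resulting string drops straight into an `<img src="...">` attribute, keeping the snapshot a single HTML file at the cost of ~33% size inflation on the image bytes.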


The average web page is around 2 MB, 60% of which is images [1]. Base64 encoding makes the size even larger. Would that trade-off of higher disk usage make sense for you? I like both the OCR and DL ideas. Shameless plug: I've been working on a bookmarking service (https://github.com/crestify/crestify), and local archiving with images is something we're looking at adding. Since it is something a lot of people seem to want, contributors are welcome :).

[1] https://www.soasta.com/blog/page-bloat-average-web-page-2-mb...


The system I propose would certainly be space-consuming. Removing ads and other unneeded content would probably help shrink the page size, though. This is the price to pay to keep the bookmarked content available in the long term, and I'm ready to pay that price.

Assuming 1 bookmark per day for 10 years, this sums up to around 8 GB. Storage is cheap enough nowadays that I can support those 8 GB over 10 years. This will probably not be mobile-friendly, but I almost never consult bookmarks from my phone anyway.
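A back-of-envelope check of that figure, taking the ~2 MB average page weight cited in the parent comment:

```python
PAGE_MB = 2            # average page weight, per the parent comment
BOOKMARKS_PER_DAY = 1
YEARS = 10

total_gb = PAGE_MB * BOOKMARKS_PER_DAY * 365 * YEARS / 1024
# ~7.1 GB raw; Base64-encoding the image portion (~60% of the page,
# inflated by 4/3) pushes it toward 8 GB, matching the estimate above.
```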

Your project seems interesting, thanks for sharing! This may be especially useful to see how you implemented the readability part.


According to this post[1], Pinboard had 26M new bookmarks in one year (2015-16), between 24.5K users, which is close to ~1100 bookmarks per user per year, or 3 per day: ~22 GB over 10 years. Compression and other optimizations could reduce that further.

Crestify is a web application, so you could run it on a server and access it over a mobile app. Really happy to hear you like it. And I'm always open to new ideas about it :).

[1] https://blog.pinboard.in/2016/07/pinboard_turns_seven/


I meant bookmarks to pictures, like infographics or memes.


Oh, I don't usually bookmark that kind of content. I guess in any case a system that checks the type of content (html/image/sound/video) would be needed to select the storage format, so the annotation could then also be adapted to the content type. In any case, a way to manually add tags would exist, so it's feasible to include such content.


He-yo. I'm one of the earliest users of del.icio.us and also Pinboard. I made tools similar to buku for my Linux desktop, and added a custom tabless web browser on top to make bookmarking as convenient as possible. It was a very productive setup, but I don't use it anymore: https://github.com/azer/delicious-surf

Storing bookmarks locally is definitely cool, but it's still not convenient enough for us to bookmark every page we find valuable. If I'm browsing 30 pages about my upcoming trip to Patagonia, I won't bookmark most of them, just because it's not convenient enough. If I google a solution to a problem for hours and go through tens of pages to find information, it's likely I won't bookmark most of those pages either.

You can keep bookmarks in Chrome, Safari, Pinboard, Firefox, whatever. But none of them are innovating in bookmarking, and they likely won't.

And this is exactly why I'm currently building Kozmos. It has a desktop and mobile client, and will bring a completely new perspective on bookmarking. You won't need to organize anything; it'll all be done automatically, and you'll easily find your stuff thanks to an advanced search engine. My goal is to bring good design and good tech together, and provide everyone the most convenient way to bookmark.

You can sign up for the private beta and get an invitation within a week; here is the link: http://getkozmos.com


I’ve started avoiding the whole pretence of “tagging” or “organisation” like I tried to do with Pinboard. My bookmarks are now Safari .webarchive files. I have access to them offline, I can back them up the same way I back everything else up, and I just organise them in folders however I want. I even get search!

A “bookmark manager” doesn’t need to be an app or a service -- it can just be a bunch of files.

Edit: I should say that Buku looks like a good program for those who like that way of things, though!


I hate changing command options, but if you're going to do this, do it early.

Please swap the definitions of '-s' (search any) and '-S' (search all).

Rationale: virtually every time I run a search, I'm interested in the most specific result, especially when I have created the search space myself.

Having to hit the shift key for my default search preference is ... backward.

I know far too many online search tools which OR rather than AND their arguments (probably because the underlying tools support OR more readily than AND) ... and this drives me flipping bananas, because the more specific my search, the less specific the result.

It's the worst possible antifeature in a knowledge management tool.
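The AND-vs-OR distinction is easy to see in code. A toy sketch over plain strings (a real bookmark search would of course hit titles, tags, and content separately):

```python
def search_all(bookmarks, terms):
    """AND semantics: every term must match. More terms means fewer,
    more specific results."""
    terms = [t.lower() for t in terms]
    return [b for b in bookmarks
            if all(t in b.lower() for t in terms)]

def search_any(bookmarks, terms):
    """OR semantics: any term matches. More terms means MORE results,
    which is exactly the complaint above."""
    terms = [t.lower() for t in terms]
    return [b for b in bookmarks
            if any(t in b.lower() for t in terms)]
```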

I'd also suggest providing the capability to search specific fields distinctly:

* URL

* Title

* Tags

* Metadata (author, publisher, date).

A date-ranged search would be particularly useful.


## What's in?

- Edit bookmarks in EDITOR at prompt

- Import folder names as tags from browser html

- Append, overwrite, delete tags at prompt using >>, >, << (familiar, eh? ;))

- Negative indices with `--print` (like `tail`)

- Update in EDITOR along with `--immutable`

- Request HTTP HEAD for immutable records

- Interface revamp (title on top in bold, colour changes...)

- Per-level colourful logs in colour mode

- Changes in program OPTIONS

  - `-t` stands for tag search (earlier `--title`)

  - `-r` stands for regex search (earlier `--replace`)

- Lots of new automated test cases

- REST APIs for server-side apps

- Document, notify behaviour when not invoked from tty

- Fix Firefox tab-opening issues on Windows

Home: https://github.com/jarun/Buku


It's more useful to link to the main/code page, with its README, chock full of "wtf"-dissolving examples and text.

https://github.com/jarun/Buku


Just skimmed through the README, and I stored it in my list of well-documented projects. Well done!


Thank you!


I'm still glad I found pinboard.in. It works great, is a paid service (so I AM the customer) and even archives everything I tag.


Why are you happy to be a customer of something that could easily run locally for free?


I don't use Pinboard—bookmarks are mostly a place I send open tabs I'll never get around to reading so I don't feel bad about closing them—they may as well go to /dev/null.

But storing data (and backing it up, and ensuring that backups work, and having some kind of monitoring so you don't discover one day that everything silently broke and all your data is now gone, and buying the hardware to support all that, and taking time to research the purchase of that hardware, and the extra, constant, low-grade stress associated with all the above) is never free, provided you actually care that said data survive and be accessible/useful.


Bookmark managers should happen in the browser, not in the command line.

I recommend Shaarli: http://sebsauvage.net/wiki/doku.php?id=php:shaarli


> Bookmark managers should happen in the browser, not in the command line.

Really? Why? I would much rather my browsers were clients of a bookmarks server/API, so that what happens in (and is seen from) one browser is exactly the same in all of them.

I use different email clients, and thank $GOD that they all consume IMAP instead of email storage "happening in the client."
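A browser-agnostic client of such a bookmarks server would be thin. A sketch that builds (but doesn't send) a request against a hypothetical endpoint; the URL path and payload shape here are invented for illustration:

```python
import json
import urllib.request

def add_bookmark_request(base_url, url, tags):
    """Build a POST request to a hypothetical bookmark-server API.

    Endpoint path and JSON shape are illustrative only; a real
    server (e.g. buku's REST layer) defines its own."""
    payload = json.dumps({"url": url, "tags": tags}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/bookmarks",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Any browser extension, CLI, or mobile app could then share one store, the same way IMAP clients share one mailbox.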


Author of Buku here. In fact Buku can store bookmarks directly from the browser. You have 2 ways to do that (including a dedicated plugin). It also has 5 different search options with a powerful prompt to find out just any bookmark you have stored (we have users who imported even ~40K bookmarks from Delicious and are happy with Buku), extensive flexibility of editing and manipulation, encryption support, multithreaded full DB refresh and a lot more.

In addition, Buku is also developed as a library and Shaarli can use it as a powerful python backend over REST. ;)

Yes, one of our contributors did want to add a feature to generate a full webpage with thumbnails but we decided not to add it as it seemed simply ornamental when you think about the raw potential of Buku.


Oh, and one more thing about Buku: it is designed to simplify your workflow beyond imagination. You search something in Shaarli and get 10 results. You want to open results 4, 5, 6, 7 and 8. What do you do? Click 5 times? Not with Buku. You enter:

    o 4-8
  
Terminal bliss, yeah!
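The range syntax is simple to implement. A sketch of the expansion (buku's actual grammar may differ; this just shows the idea):

```python
def parse_indices(spec):
    """Expand 'o 4-8'-style range specs: '4-8' -> [4, 5, 6, 7, 8],
    '1,4-6' -> [1, 4, 5, 6]. Whitespace is tolerated."""
    indices = []
    for part in spec.replace(" ", "").split(","):
        if "-" in part:
            lo, hi = part.split("-", 1)
            indices.extend(range(int(lo), int(hi) + 1))
        elif part:
            indices.append(int(part))
    return indices
```

Each resulting index is then dispatched to the browser-open routine.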


Who uses Bookmarks any more? I don't.

Instead, if I like a page I want to revisit, I simply print it to PDF. Then I move all the PDFs from my Desktop every week into their own permanent storage location, meaning that I have every interesting web page I've ever read since 2000.

Trouble is, now I have a large PDF collection to manage. I get along fine with "ls -l | grep <something>" this and "pdf2txt <blah.pdf> | grep <something>" that, but of course this is not as 'clean' as if I had a bookmark manager to do all my searching/grep'ing/grok'ing/etc.


I still use bookmarks (Pinboard) but I like your approach. I'm slowly trying to remove all reliance on third parties, as they're too ephemeral. I'm guessing you went with print-to-PDF because saving the page from the browser would result in broken pages? Have you found that the PDFs don't look very good for some websites?

Alternatively, you could use a tool or extension that does full-page screenshots and then run image optimisation on them. I do this a lot for a local Pinterest-type inspiration store. At the moment I use [Nimbus](https://chrome.google.com/webstore/detail/nimbus-screenshot-...) but it seems like every few months the extension I'm using starts to fail with certain websites (scrolljackers, mostly), and I switch to a different, newer extension.

Alternatively again, but back to saving websites: surely someone's created a nice tool that will download a page and store it as an archive that won't be broken? Pinboard, for example, has archiving for an extra fee, so I wonder how he does it at that scale.

Related to your problem grep'ing, I'm slowly working on a small idea to have a local tagging/metadata approach for finding things.


> Alternatively again, but back to saving websites, surely someone's created a nice tool that will download a page to store it as an archive that won't be broken?

Wallabag[0] does that. It's a self-hosted, Pocket-like read-it-later service that strips the page and saves the text of the article in a local database, giving you full-text search right from your own server. It even adds some neat extra features, like adding notes to the articles.

The only downside: damn those two currently available themes are awful.

[0] https://wallabag.org/en


> Who uses Bookmarks any more?

"Ask HN: Do you still use browser bookmarks?" (19 days ago, 451 comments):

https://news.ycombinator.com/item?id=14064096


"I don't need this" doesn't mean someone else doesn't.

I would agree that bookmarks, generally, are a poor fit to current needs or requirements. When a typical hard drive was, say, 100-500 MB, the idea of saving only the URL, and not the content, could be argued. With mobile devices having 128 GB - 1 TB of microSD storage, there's no reason you cannot store everything you've read, or at least everything you're interested in, locally, on desktop, laptop, or mobile.

Which is what you've done.

But you're running into the underlying problem (as am I): a pile of randomly-titled, poorly-metadata'd PDFs isn't particularly useful.

There are tools -- Zotero and ... some others -- which manage references, but IMO do so quite poorly. The problem is that what they introduce is a metadata vetting and management problem, and one that GUI tools handle quite poorly. The fact that the tools aren't available on mobile (I use an Android tablet almost exclusively, because reasons, and yes, it sucks in a great many ways) makes this problem all the more intractable.

I have the same problem, it turns out, with bookreaders. I use two, mostly: PocketBook and FBReader, with ~2,000 or so references. Unfortunately, other than title and author search, I've little by way of organisation of these, which is ... a major problem.

I'm using Pocket, The Article Management Tool that Gets Worse The More You Use It[tm] (https://redd.it/5x2sfx), which ... suffers many of the same problems, and adds a few more of its own.


That sounds so obvious yet is brilliant... no more link-rot for any of those interesting sites you bookmarked years ago...

I suppose the next logical step would be to save the page with assets so it can be made available for posterity. Not sure what impact that would have on storage use but it would at least enable full text search.


We're working on bubblehunt.com. It's a search platform: add any resources and get full-text search over bookmarks, articles, and more.


Can't tell if you're joking or not. That's the kind of thing using bookmarks avoids.


Coming to personal preferences: I do, and that was one of the main reasons I wrote buku. I don't store the context, just a pointer to the context, just like you don't normally pass a full structure by value on the stack but use a pointer.

When I need the context I check the original link. If it's not there I try Google Cache or archive.org. If it's lost, I find an alternative (thanks to the title and notes fields in buku). That's more or less my workflow when it comes to the 8K-odd bookmarks I have.


But again, I would emphasize it's very much a personal preference.


I actually thought of creating a bookmark utility for the command line, because I have a lot of commands I use. I could create them as aliases, but I also want the ability to keep track of them, have a description of what they do, browse and grep them, change parameters, etc.


I use fzf [1] to achieve something similar. I have a file with commonly used commands and a key binding to select one using fuzzy search and paste it to stdin. Here's a demonstration: https://asciinema.org/a/8xkwh9579ry3u3t8a9nnsa7o9

I can link my dotfiles if you're interested.

[1]: https://github.com/junegunn/fzf


What are the issues you see in using Google Cache or archive.org to search older/lost pages?


...or using a bookmarking service that extracts a copy of the page (Wallabag and Pinboard being two that do that, off the top of my head).


So you have one datapoint and you can declare the obsolescence of bookmarks?


> Hence, Buku (after my son's nickname, meaning close to the heart in my language).

Coincidentally, `Buku` translates to `Book` in Indonesian...


Thanks for sharing!


As someone using Chrome Bookmarks, can someone please explain to me in simple words what this is?


Same thing, but using the command line and giving you full control over the bookmarks (because they stay on your machine).


Thanks!

> (because they're staying on your machine).

What's the advantage over exporting my Chrome Bookmarks as HTML?


Is anyone using Buku more like a read-later system instead of a bookmark manager?


The demo which I'd really need to see to judge this is here: https://github.com/jarun/Buku

I'm still not sure this fits my needs or workflow, though that's more useful than the project link itself.

I'm tremendously interested in this or related tools, as I've got an exploding research problem that nothing I've seen yet comes close to addressing, and most of which introduces numerous additional problems. See:

https://ello.co/dredmorbius/post/fj5rzi8zmouyrmvg8yzzva

Short version: I've got a library of a few thousand articles, plus another few thousand books, plus another few thousand online references, which I've gathered, am continuing to gather, am trying to assess, prioritise reading of, and generate a number of outputs from, as well as use in what's likely to be a several-decades-long research and writing project.

Online services simply don't offer sufficient longevity, even should they meet my other requirements, which they don't.

Assigning metadata is a significant pain point. Coming to some agreement as to what metadata to assign is a significant pain point.

I'm coming to see librarians and library cataloguing as essential domain knowledge and experience. In all seriousness, I suggest any project looking to make use of categories and classification look to the US Library of Congress Classification System: it's extant, expert, unencumbered, comprehensive, hierarchical, extensible, has a change management process, and is applied to a store comprising 164 million works.

https://mammouth.cafe/@dredmorbius/56485 http://www.loc.gov/catdir/cpso/lcco/

There's also a top-level reduction to 21 distinct categories, and the possibility of, say, coming up with a short-list of frequently-used classifications, as well as of assigning multiple classifications to works.
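A short-list lookup like that is trivial to wire in. A sketch with a subset of the LCC top-level classes (see the loc.gov outline linked above for the full set of 21):

```python
# First letter of an LCC call number -> top-level class (subset shown;
# the full outline at loc.gov has 21 classes).
LCC_TOP = {
    "A": "General Works",
    "B": "Philosophy, Psychology, Religion",
    "H": "Social Sciences",
    "P": "Language and Literature",
    "Q": "Science",
    "R": "Medicine",
    "T": "Technology",
    "Z": "Bibliography, Library Science",
}

def lcc_class(call_number):
    """Map a call number like 'QA76.9' to its top-level class."""
    return LCC_TOP.get(call_number[:1].upper(), "Unknown")
```

Multiple classifications per work then reduce to storing a list of call numbers against each document.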

The rationale for only storing bookmarks is ... generally not valid. There are, broadly, a few types of online resources:

1. Interactive or volatile pages.

2. Static pages.

For the first, search engines, web apps, landing pages, etc., storing a static instance isn't tremendously useful (though it can be more useful than you'd think). For the second, a locally-stored version is almost always more useful than the online instance.

And space for text is now beyond cheap.

I'm looking at this problem in terms of desired outputs, workflow, various states of resources, how to (reasonably) uniquely and persistently identify a given document, managing media (images, audio, video, other interactive elements), etc.

And yes, this starts to look rather much like Memex, for similar reasons.



