
Wallabag (https://www.wallabag.org/) if you want self-hosted.

Pinboard (https://pinboard.in) offers archiving for (I believe) $25 a year.

Or Pocket (https://getpocket.com/) which used to be Read-It-Later.



Thanks for the link to Wallabag, I had never heard of it. Looks very interesting.

I use Pinboard, and pay for the archiving option. I even periodically request a tarball of the archive for my own backup. Pinboard archives the entire page, not just a readable version of the content. For archival and reference purposes, I like this. It would be nice if Pinboard also provided a readable option. In fact, a number of the apps that work with Pinboard add support for readable versions.

I will look into adding Wallabag to my workflow.


I'm cool with the idea of providing a readable option in Pinboard, since I already do something similar to get the text out of the page for indexing. Any library for this you particularly like?


Readability (https://github.com/luin/readability) is a classic, and is included as part of Firefox (I think; maybe that's been discontinued). It's essentially a bag of hand-written heuristics, but they're pretty good heuristics.

Some interesting reading is Christian Kohlschütter's thesis on this problem, which academia frames as "how do we assemble good text corpora from web pages for data analysis, which means removing junk (boilerplate) from our HTML crawls" (https://code.google.com/archive/p/boilerpipe/wikis/WSDM2010P...). Boilerpipe would probably be the right way to go, but if you're not using Java it could be harder to integrate.
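
To make "a bag of hand-written heuristics" concrete, here is a toy sketch of the general idea. It is not Readability's or Boilerpipe's actual algorithm: it splits a page into blocks, then keeps the blocks that have enough words and a low share of link text. The tag sets, the names `BlockCollector` and `extract_main_text`, and both thresholds are illustrative assumptions, not values taken from either project.

```python
from html.parser import HTMLParser

# Assumed tag sets, chosen for illustration only.
BLOCK_TAGS = {"p", "div", "li", "td", "article", "section", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style", "nav", "footer", "aside"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts the words inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks as (words, link_word_count)
        self._words = []       # words of the block currently being built
        self._link_words = 0   # how many of those words sit inside <a>
        self._in_link = 0      # nesting depth of <a> tags
        self._skip = 0         # nesting depth of skipped containers

    def flush_block(self):
        """Close the current block and start a new one."""
        if self._words:
            self.blocks.append((self._words, self._link_words))
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.flush_block()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self.flush_block()
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)

    def handle_data(self, data):
        if self._skip:
            return  # ignore text inside script/style/nav/footer/aside
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush_block()
    kept = [
        " ".join(words)
        for words, link_words in parser.blocks
        if len(words) >= min_words and link_words / len(words) <= max_link_density
    ]
    return "\n\n".join(kept)
```

Real extractors layer many more signals on top of this (class names, punctuation, neighbouring blocks), which is why using an existing library is usually the better choice.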


I second Readability; it works great for article-heavy webpages. I used it to build a reading time estimator for Chrome (https://chrome.google.com/webstore/detail/read-time/nccohhim...), and it's open source: https://github.com/usergit/read-time. Bonus: you can click on the extension to show only the main content of the page.
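
For anyone curious how a reading time estimate can work once the main text has been extracted, here is a minimal sketch. It is not the extension's code; the function name `estimate_reading_minutes` and the default of 200 words per minute are assumptions picked for illustration.

```python
import math


def estimate_reading_minutes(text, words_per_minute=200):
    """Estimate reading time in whole minutes, rounding up.

    words_per_minute defaults to 200, an assumed reading speed;
    adjust it for your audience.
    """
    word_count = len(text.split())
    if word_count == 0:
        return 0
    return max(1, math.ceil(word_count / words_per_minute))
```

Rounding up keeps very short articles from showing as "0 min", and counting whitespace-separated tokens is a rough approximation that is good enough for an estimate.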


Thank you and the parent both for these links!


It works great! I used the Pinboard API to download my bookmarks and used Readability to pull the original text from each source URL.
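
The first half of that workflow can be sketched roughly as follows. This is not the commenter's script: it assumes Pinboard's v1 `posts/all` endpoint with `auth_token` and `format=json` parameters, and assumes each returned post carries `href` and `description` fields, so check the Pinboard API documentation before relying on it. The function names are hypothetical, and the text-extraction step is left to whichever Readability port you prefer.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for exporting all bookmarks from Pinboard's v1 API.
PINBOARD_ALL_POSTS = "https://api.pinboard.in/v1/posts/all"


def build_export_url(auth_token):
    """Build the export URL; auth_token is assumed to look like 'user:TOKEN'."""
    query = urlencode({"auth_token": auth_token, "format": "json"})
    return f"{PINBOARD_ALL_POSTS}?{query}"


def parse_bookmarks(payload):
    """Turn the JSON response into (url, title) pairs, skipping posts with no URL."""
    posts = json.loads(payload)
    return [(p["href"], p.get("description", "")) for p in posts if p.get("href")]


def fetch_bookmarks(auth_token):
    """Download and parse every bookmark (makes one network request)."""
    with urlopen(build_export_url(auth_token)) as response:
        return parse_bookmarks(response.read().decode("utf-8"))
```

Each URL from `fetch_bookmarks` can then be fetched and handed to an extractor. Keeping the parsing separate from the network call makes it easy to test against a saved export.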


Firefox maintains a fork of Readability for Firefox's reader mode here:

https://github.com/mozilla/readability


This would be great! A while back I signed up for Paperback (https://readpaperback.com/) to handle this for my Pinboard account, and then wrote my own using the Ruby Readability library.


Instaparser, when it was originally released, was priced competitively with Diffbot: https://www.diffbot.com/products/automatic/#article

Diffbot is a nice API for extracting metadata from a web page, although I understand it might not be worth the cost.


Hopefully you see this. The biggest reason I stick with Pocket (despite the privacy implications) is its text-to-speech functionality. AFAIK, Pinboard is a website only, so TTS is out of the question. However, might this change in the future?


There is a great app called Voice Dream Reader on iOS that takes text from a variety of sources and can do TTS. I think I paid $9.99 for it and bought an Ivona voice for $4.99. I love it. I use it to have custom study guides, ePubs, PDFs, text documents and my Instapaper queue read to me.


Thanks, I'll check them out. Although from what I can tell of Voice Dream it's a big extra step in the workflow.


I would very much like to see this.

I don't know what the state of the art is for a general content extractor. (I have written a fair number of one-off web scrapers for data collection, but nothing this generic.)


Paperback offers a reasonably good reading experience for Pinboard links.

https://readpaperback.com


Paperback looks interesting. I don't see it mentioned, but does it support highlighting & notes?

The demo seems to be broken for me.


One option is to use a browser that can render webpages in a reading mode, like Safari on iOS. Hopefully Chrome and Firefox have extensions as well.


On Android FF has this functionality built in. I use it all the time.


They do


It's funny how even after all those years, people still feel the need to mention "formerly Read It Later" when they talk about Pocket.


Well, given that the parent to the comment you replied to said

>Pocket, I remember. ReadItLater used to exist, maybe still?

there was an obvious reason to name Pocket as "formerly Read It Later"



