Wallabag (https://www.wallabag.org/) if you want self-hosted. Pinboard (https://...

leejoramo · on Aug 23, 2016

Thanks for the link to Wallabag, I had never heard of it. Looks very interesting.

I use Pinboard, and pay for the archiving option. I even periodically request a tarball of the archive for my own backup. Pinboard archives the entire page, not just a readable version of the content. For archival and reference purposes, I like this. It would be nice if Pinboard also provided a readable option. In fact, a number of the apps that work with Pinboard add support for readable versions.

I will look into adding Wallabag into my workflow.

idlewords · on Aug 23, 2016

I'm cool with the idea of providing a readable option in Pinboard, since I already do something similar to get the text out of the page for indexing. Any library for this you particularly like?

kough · on Aug 23, 2016

Readability (https://github.com/luin/readability) is a classic, and was included as part of Firefox (I think; maybe that's been discontinued). It's essentially a bag of hand-written heuristics, but they're pretty good heuristics.

Some interesting reading is Christian Kohlschütter's thesis on this problem, which is framed in academia as "how do we assemble good text corpora from webpages for data analysis, which means removing junk (boilerplate) from our HTML crawls" (https://code.google.com/archive/p/boilerpipe/wikis/WSDM2010P...). Boilerpipe would probably be the right way to go, but if you're not using Java it could be harder to integrate.
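
To make "a bag of hand-written heuristics" a bit more concrete, here is a minimal sketch of the general idea: split the page into text blocks, then keep blocks that are long enough and not dominated by link text. This is not Readability's or Boilerpipe's actual algorithm. The names `BlockCollector` and `extract_main_text`, the tag lists, and both thresholds are assumptions picked for illustration.

```python
from html.parser import HTMLParser

# Assumed tag lists for this sketch; real extractors use richer rules.
BLOCK_TAGS = {"p", "div", "li", "td", "article", "section", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style", "noscript"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts the words inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished (word_count, link_word_count, text) tuples
        self._words = []      # words of the block currently being built
        self._link_words = 0  # how many of those words sit inside <a> tags
        self._in_link = 0
        self._skip = 0

    def flush(self):
        """Close the current block, if it contains any words."""
        if self._words:
            self.blocks.append(
                (len(self._words), self._link_words, " ".join(self._words))
            )
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script/style contents
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)


def extract_main_text(html, min_words=10, max_link_density=0.33):
    """Keep blocks with enough words and a low share of link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = [
        text
        for count, link_count, text in parser.blocks
        if count >= min_words and link_count / count <= max_link_density
    ]
    return "\n\n".join(kept)
```

Navigation bars tend to fail the link-density test and short footers fail the word-count test, which is roughly why simple rules like these get surprisingly far on article pages.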

userhacker · on Aug 23, 2016

I second Readability; it works great for article-heavy webpages. I used it to build a reading time estimator for Chrome (https://chrome.google.com/webstore/detail/read-time/nccohhim...), and it's open source: https://github.com/usergit/read-time. As a bonus, you can click on the extension to show only the main content of the page.
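
For anyone curious what a reading time estimate boils down to once the main text has been extracted, here is a minimal sketch. It is not taken from the read-time extension; the function name `estimate_reading_minutes` and the 200 words-per-minute default are assumptions.

```python
import math


def estimate_reading_minutes(text, words_per_minute=200):
    """Estimate reading time in whole minutes, rounding up.

    The 200 wpm default is an assumed average reading speed.
    """
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    word_count = len(text.split())
    if word_count == 0:
        return 0
    return math.ceil(word_count / words_per_minute)
```

The estimate is only as good as the extraction step: if navigation and comments are counted as article text, the number drifts upward.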

idlewords · on Aug 23, 2016

Thank you and the parent both for these links!

axx · on Aug 23, 2016

It works great! I used the Pinboard API to download my bookmarks and used Readability to extract the original text from each source URL.
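
A rough sketch of the bookmark-download half of that workflow, using only the Python standard library. It is not axx's script. It assumes the Pinboard v1 `posts/all` endpoint with `auth_token` and `format=json` parameters, and `href` and `description` fields in the response; check those against the current Pinboard API documentation. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed API root; verify against the Pinboard API documentation.
API_ROOT = "https://api.pinboard.in/v1"


def posts_all_url(auth_token):
    """Build the request URL for fetching all bookmarks as JSON."""
    query = urlencode({"auth_token": auth_token, "format": "json"})
    return f"{API_ROOT}/posts/all?{query}"


def parse_bookmarks(payload):
    """Turn the JSON response body into (url, title) pairs."""
    return [
        (post["href"], post.get("description", ""))
        for post in json.loads(payload)
    ]


def fetch_bookmarks(auth_token):
    """Download and parse all bookmarks (performs a network request)."""
    with urlopen(posts_all_url(auth_token)) as response:
        return parse_bookmarks(response.read().decode("utf-8"))
```

Each returned URL could then be fetched and passed through whatever readable-text extractor you prefer.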

cpeterso · on Aug 23, 2016

Firefox maintains a fork of Readability for Firefox's reader mode here:

https://github.com/mozilla/readability

heliostatic · on Aug 23, 2016

This would be great! A while back I signed up for Paperback (https://readpaperback.com/) to handle this for my Pinboard account, and then wrote my own using the Ruby Readability library.

conradev · on Aug 23, 2016

Instaparser, when it was originally released, was priced competitively with Diffbot: https://www.diffbot.com/products/automatic/#article

It is a nice API for extracting metadata from a web page, although I understand it might not be worth the cost.

kobayashi · on Aug 23, 2016

Hopefully you see this. The biggest reason I stick with Pocket (despite the privacy implications) is its text-to-speech functionality. AFAIK, Pinboard is website-only, so TTS is out of the question. However, might this change in the future?

mcgrath_sh · on Aug 23, 2016

There is a great app called Voice Dream Reader on iOS that takes text from a variety of sources and can do TTS. I think I paid $9.99 for it and bought an Ivona voice for $4.99. I love it. I use it to have custom study guides, ePubs, PDFs, text documents, and my Instapaper queue read to me.

kobayashi · on Aug 26, 2016

Thanks, I'll check them out. Although from what I can tell of Voice Dream it's a big extra step in the workflow.

leejoramo · on Aug 23, 2016

I would very much like to see this.

I don't know what the state of the art is for a general content extractor. (I have written a fair number of one-off web scrapers for data collection, but nothing this generic.)

Veen · on Aug 23, 2016

Paperback offers a reasonably good reading experience for Pinboard links.

https://readpaperback.com

tedmiston · on Aug 23, 2016

Paperback looks interesting. I don't see it mentioned, but does it support highlighting & notes?

The demo seems to be broken for me.

jasikpark · on Aug 23, 2016

One option is to use a browser that can render webpages in reading mode, like Safari on iOS. Hopefully Chrome and Firefox have extensions for this as well.

sdoering · on Aug 24, 2016

On Android, Firefox has this functionality built in. I use it all the time.

dexterdog · on Aug 24, 2016

They do

miguelrochefort · on Aug 23, 2016

It's funny how even after all those years, people still feel the need to mention "formerly Read It Later" when they talk about Pocket.

eriknstr · on Aug 23, 2016

Well, given that the parent of the comment you replied to said

>Pocket, I remember. ReadItLater used to exist, maybe still?

there was an obvious reason to refer to Pocket as "formerly Read It Later".