Thanks for the link to Wallabag; I had never heard of it. It looks very interesting.
I use Pinboard, and pay for the archiving option. I even periodically request a tarball of the archive for my own backup. Pinboard archives the entire page, not just a readable version of the content. For archival and reference purposes, I like this. It would be nice if Pinboard also provided a readable option. In fact, a number of the apps that work with Pinboard add support for readable versions.
I will look into adding Wallabag to my workflow.
I'm cool with the idea of providing a readable option in Pinboard, since I already do something similar to get the text out of the page for indexing. Any library for this you particularly like?
Readability (https://github.com/luin/readability) is a classic, and it is included as part of Firefox (I think, though that may have been discontinued). It's essentially a bag of hand-written heuristics, but they're pretty good heuristics.
Some interesting reading is Christian Kohlschütter's thesis on this problem, which academia frames as "how do we assemble good text corpora from webpages for data analysis, which means removing junk (boilerplate) from our HTML crawls" (https://code.google.com/archive/p/boilerpipe/wikis/WSDM2010P...). Boilerpipe would probably be the right way to go, but if you're not using Java it could be harder to integrate.
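For a rough idea of what "a bag of heuristics" means here, below is a minimal sketch in Python using only the standard library. To be clear, this is not Readability's or Boilerpipe's actual algorithm; it only borrows two shallow features that come up in this kind of work (word count and link density per block). The names `BlockCollector` and `extract_main_text`, the tag lists, and the thresholds are all made up for illustration and would need tuning on real pages.

```python
from html.parser import HTMLParser

# Assumption: a hand-picked set of tags that start or end a text block.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "blockquote",
              "article", "section", "nav", "header", "footer", "ul", "ol"}
# Tags whose contents are never readable text.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts link characters per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # list of (text, link_chars) tuples
        self._text = []        # text pieces of the block being built
        self._link_chars = 0   # characters of the current block inside <a>
        self._in_link = 0      # nesting depth of <a> tags
        self._skip = 0         # nesting depth of script/style tags

    def flush(self):
        # Collapse whitespace and store the block if it is not empty.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up any trailing text after the last block tag
    kept = []
    for text, link_chars in parser.blocks:
        words = len(text.split())
        link_density = link_chars / len(text)
        if words >= min_words and link_density <= max_link_density:
            kept.append(text)
    return "\n\n".join(kept)
```

Calling `extract_main_text(page_html)` returns the surviving blocks joined by blank lines. Navigation bars and footers tend to fail one of the two checks (few words, mostly link text), which is why even something this crude removes a fair amount of junk; the real libraries layer many more rules on top.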
This would be great! A while back I signed up for Paperback (https://readpaperback.com/) to handle this for my Pinboard account, and then wrote my own using the Ruby Readability library.
Hopefully you see this. The biggest reason I stick with Pocket (despite the privacy implications) is its text-to-speech functionality. AFAIK, Pinboard is a website only, so TTS is out of the question. Might this change in the future?
There is a great app called Voice Dream Reader on iOS that takes text from a variety of sources and can do TTS. I think I paid $9.99 for it and bought an Ivona voice for $4.99. I love it. I use it to have custom study guides, ePubs, PDFs, text documents and my Instapaper queue read to me.
I don't know what the state of the art is for a general content extractor. (I have done a fair number of one-off web scrapers for data collection, but nothing this generic.)
Pinboard (https://pinboard.in) offers archiving for (I believe) $25 a year.
Or Pocket (https://getpocket.com/), which used to be Read It Later.