Leonard and Sumana's personal notebook

Categories: personal | scraping

Overview: Extracting article text from HTML documents | My tech blog.: For work

boilerpipe - Boilerplate Removal and Fulltext Extraction from HTML pages - Google Project Hosting : Whereas this is work-related

http://packages.python.org/pyquery/: A good DSL

PhantomJS: Headless WebKit with JavaScript API: This is going to be great. (Except JavaScript is still awful.)


© 2000-2013 Leonard Richardson.