Blog by Sumana Harihareswara, Changeset founder

18 Nov 2014, 3:21 p.m.

Using Beautiful Soup, Pystache, and Lunr.js for an Archival Site

Hi, reader. I wrote this in 2014 and it's now more than five years old. So it may be very out of date; the world, and I, have changed a lot since I wrote it! I'm keeping this up for historical archive purposes, but the me of today may 100% disagree with what I said then. I rarely edit posts after publishing them, but if I do, I usually leave a note in italics to mark the edit and the reason. If this post is particularly offensive or breaches someone's privacy, please contact me.

In the third week of my 2014 Hacker School batch, I decided to take on a project that I'd originally thought about doing a year before, during my first go at HS.

Between April 2005 and August 2007, I wrote a weekly column called "MC Masala" for the "Inside Bay Area" section of several papers in the San Francisco Bay Area, including the Oakland Tribune. My work circulated to about a million people, I'm told. A few years ago I grabbed a soft copy of almost all my archives off a periodicals database, and then in 2011 I made an abortive attempt to get the columns online, but gave up on all the fiddly text-munging bits.

But a few weeks ago I felt ready to make a go of it, and I figured this would be a fun and useful way to learn Beautiful Soup and learn to finagle a search engine. So I basically stopped doing the Matasano crypto challenges and started a new project.*

Beautiful Soup, Pystache, and sed

I wrote a script to take a list of HTML files of my old newspaper columns and scrape them using Beautiful Soup. (I only needed a tiny bit of live help from Leonard -- to wit, he got me to use the html5lib parser instead of the default.) My script output a Python dictionary containing the stories as structured data: headline, date, & body. And I wrote a second script to render that data through my own Pystache templates and write an HTML file for each story, plus a table of contents page. (I don't intend to add comments or start the column back up again, so I didn't think I'd want a CMS. Pystache, the Python implementation of the lightweight Mustache template language, seemed like a reasonable choice.) I got some help on this, notably from a pairing session with Chase Lambert on testing Unicode stuff, and from a pairing session with Geoff Shannon on a Pystache type and inheritance problem.

Unfortunately I never quite figured out how to get one Pystache template nested in another, so there's some code duplication (perhaps partials are the answer). And I had to hack my way around some loopback issues so as to put chronological next/previous links on each article. (Story URLs are just kebab-cased dates. So, my script gets the headline and date (and thus the URL) of the next or previous story by traversing a date-sorted list of dates-and-headlines dicts, then renders the dates and URLs into variables in the template. Oh right, this is where a CMS would have been nice! Lightweight is great until it's not.)
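The next/previous lookup described above could be sketched roughly like this. It is a minimal illustration, not the script from this project: the `stories` list, its keys, the `.html` suffix, the sample headlines and dates, and the helper names are all assumptions.

```python
from datetime import date


def story_url(day):
    """Build a kebab-cased date URL such as '2005-04-10.html' (the suffix is assumed)."""
    return day.isoformat() + ".html"


def link_for(story):
    """Return the headline and URL a template needs for a link, or None if no story."""
    if story is None:
        return None
    return {"headline": story["headline"], "url": story_url(story["date"])}


def add_neighbors(stories):
    """Return date-sorted copies of the stories with 'previous' and 'next' link data."""
    ordered = sorted(stories, key=lambda story: story["date"])
    linked = []
    for i, story in enumerate(ordered):
        # The first story has no 'previous' and the last has no 'next'.
        before = ordered[i - 1] if i > 0 else None
        after = ordered[i + 1] if i < len(ordered) - 1 else None
        entry = dict(story, url=story_url(story["date"]))
        entry["previous"] = link_for(before)
        entry["next"] = link_for(after)
        linked.append(entry)
    return linked


# Invented sample data, deliberately out of order.
stories = [
    {"headline": "Sample Column B", "date": date(2005, 4, 17), "body": "..."},
    {"headline": "Sample Column A", "date": date(2005, 4, 10), "body": "..."},
    {"headline": "Sample Column C", "date": date(2005, 4, 24), "body": "..."},
]
linked = add_neighbors(stories)
```

Each entry in `linked` can then be handed to the template as its rendering context. Leaving `previous` or `next` as `None` at the ends should let a Mustache section skip the missing link, since sections are not rendered for false or missing values.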

(In the course of all this, I (with help from a sed FAQ) wrote my first real honest-to-goodness "changing a bunch of files in-place with sed" one-liner in years or possibly ever. A ton of links in several files were pointing to the parent directory instead of the current directory. So: sed -i '/head/s/\.\.\///' *.html means "In-place, on each line that matches head, change the first ../ to nil, in all the .html files in this directory." Whoo!)
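For readers who don't speak sed, here is a rough Python equivalent of that one-liner. It is a sketch under assumptions: the function names are made up for illustration, and the files are assumed to be UTF-8. The `/head/` part of the sed command is an address, so only lines that contain "head" are touched, and without a `g` flag only the first `../` on each of those lines is removed.

```python
import re
from pathlib import Path


def strip_parent_refs(text):
    """On every line containing 'head', remove the first '../' and leave the rest alone."""
    fixed = []
    for line in text.splitlines(keepends=True):
        if "head" in line:
            # count=1 mirrors sed's default of replacing only the first match per line.
            line = re.sub(r"\.\./", "", line, count=1)
        fixed.append(line)
    return "".join(fixed)


def fix_directory(directory="."):
    """Rewrite each .html file in the directory in place (UTF-8 is an assumption)."""
    for path in Path(directory).glob("*.html"):
        original = path.read_text(encoding="utf-8")
        path.write_text(strip_parent_refs(original), encoding="utf-8")


sample = '<head><link href="../style.css"></head>\n<a href="../other.html">x</a>\n'
result = strip_parent_refs(sample)
```

Only the first line of `sample` changes, because the second line does not contain "head".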

The look, the feel

(There was a cotton ad on TV when I was a kid, with the jingle, "The look / the feel / the fabric of our lives." Sometimes Nandini and I sing it to each other. I suppose if there were an ad for Cascading Style Sheets on TV today it could use the same motto.)

I wrote the stylesheet and arranged the proper elements in the template with a bunch of help from Mozilla Developer Network's guidance on boxes and tables, and that old standby, CSS Zen Garden. I gratefully and curiously perused several nice-looking styles for inspiration and edification. I now more thoroughly understand the difference between margin and padding, and grok better why modern sites have a zillion divs.

For a "home" image, I used a picture of me that Valerie Aurora took, and for a header decoration, I used the GNU Image Manipulation Program to stitch together repetitions of a photo that Kitt Hodsden took and blogged in 2012.

Lunr.js

I thought about adding a server-side search engine with something like Lucene or ElasticSearch, but then I heard about a client-side search engine, Lunr.js. My previous HS batch had included a little JS exploration, and I'd futzed with JavaScript in my Node project the previous week, so Lunr sounded like a good approach. I got it installed okay, and borrowed Ben Smith's minified JS package and Jared Dominguez's index-builder, and got a ton of experience with Chrome developer tools. Over the course of getting Lunr.js working on my site (with help from Nicholas Cassleman and Vito LaVilla) I wrote JS to query the index and return search results. I especially like that the result shows up in the same page, without the need for a redirect or full page refresh.

I've made database schema decisions before, but I haven't previously decided on search indices. It was cool that I had the power to change up the parsed output once I realized that the structured data ought to have hrefs as the unique IDs, rather than otherwise-useless unique doc IDs.
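To make the "hrefs as unique IDs" idea concrete, here is a small Python sketch of preparing documents for a client-side index. It is not the index-builder mentioned above: the field names (`href`, `title`, `body`), the input keys, and the sample stories are all assumptions for illustration.

```python
import json


def build_search_docs(stories):
    """Turn parsed stories into documents whose unique ID is the page's href."""
    docs = []
    for story in stories:
        docs.append({
            "href": story["url"],  # doubles as the unique ID and the link target
            "title": story["headline"],
            "body": story["body"],
        })
    hrefs = [doc["href"] for doc in docs]
    if len(set(hrefs)) != len(hrefs):
        raise ValueError("hrefs must be unique to serve as document IDs")
    return docs


# Invented sample data; real stories would come from the scraping script.
sample_stories = [
    {"url": "2005-04-10.html", "headline": "First Sample Headline", "body": "Sample body text."},
    {"url": "2005-04-17.html", "headline": "Second Sample Headline", "body": "More sample text."},
]

search_docs = build_search_docs(sample_stories)
docs_json = json.dumps(search_docs, ensure_ascii=False, indent=2)
```

The payoff is on the JavaScript side: a Lunr search result carries back the reference of each matching document, so when that reference is already the href, the result list can link straight to the page without a separate ID-to-URL lookup table.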

My site!

MC Masala is live! I am so happy that these columns have a nice home now, and that I made it. I got to exercise my Python, which is strong, and I got to strengthen a bunch of other skills along the way. It's not perfect, and I have a TODO list, but it's the nicest-looking site I've ever made, and it fulfills its function well. And I made it in just a few days.


* I basically stalled on the Matasano challenges, and will come back to them someday when I don't feel so time-constrained. I did get some use out of doing the ones I did! I have now grokked byte-level stuff much better, and learned about bytearrays thanks to Allison Kaptur. And I got some laughs out of the process. Example: In challenge six, the Hamming distance the player calculates should be 37. First attempt: came up with 14. Next: 598. I literally laughed aloud. Then, when I finally got 37, I thrust my arms into the air with great vigor because I WAS A DEITY OF PURE LIGHT. But then I started getting depressingly wrong answers and kept getting them; I got help from friends, but decided to hold off and only look at one friend's potentially-spoilery explanation when I'm ready to come back, and I still haven't looked at it. I tried to remind myself of a sort of Allison Kaptur/Carol Dweck "the edge of maybe-can't"/"The only thing that makes you smarter is doing hard things" attitude, that I am a Joseph Campbell hero and the greater my struggle the greater my triumph will be. But I was tearing up in frustration, and I decided to give myself a rest from crypto and level up on the main skill I'd come to Hacker School to learn, namely, webdev. And I think that was the right decision. You gotta manage your own morale and momentum -- that's a resource too.
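For the curious, here is a minimal sketch of a bit-level Hamming distance in Python. It is not the code from those attempts, and the function name is made up. The two test strings are the ones the challenge text supplies, as far as I know; treat that as an assumption.

```python
def hamming_distance(a, b):
    """Count the bits that differ between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must be the same length")
    # XOR leaves a 1 bit wherever the two bytes disagree; count those bits.
    return sum(bin(x ^ y).count("1") for x, y in zip(bytearray(a), bytearray(b)))


distance = hamming_distance(b"this is a test", b"wokka wokka!!!")
print(distance)  # prints 37
```

Counting differing bytes instead of differing bits gives 14 for this pair, which is an easy slip to make.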

Comments

Cat Typist
http://bitboost.com
21 Nov 2014, 3:34 p.m.

Thanks for bringing "MC Masala" back to the WWW!

There is one possible conversion bug that is in fact not important: in one place the pseudoword "resum" occurs (for resum + e with acute accent)? It is a good example of the fact that sometimes it's ok to have a few benign bugs. :-)