# (0) 18 Nov 2014, 04:21PM: Using Beautiful Soup, Pystache, and Lunr.js for an Archival Site:
My third week of my 2014 Hacker School batch, I decided to take on a project that I'd originally thought about doing a year before, during my first go at HS.
Between April 2005 and August 2007, I wrote a weekly column called "MC Masala" for the "Inside Bay Area" section of several papers in the San Francisco Bay Area, including the Oakland Tribune. My work circulated to about a million people, I'm told. A few years ago I grabbed a softcopy of almost all my archives off a periodicals database, and then in 2011 I made an abortive attempt to get the columns online, but gave up on all the fiddly textmunging bits.
But a few weeks ago I felt ready to make a go of it, and I figured this would be a fun and useful way to learn Beautiful Soup and learn to finagle a search engine. So I basically stopped doing the Matasano crypto challenges and started a new project.*
Beautiful Soup, Pystache, and sed
I wrote a script to take a list of HTML files of my old newspaper columns and scrape them using Beautiful Soup. (I only needed a tiny bit of live help from Leonard -- to wit, he got me to use the html5lib parser instead of the default.) My script output a Python dictionary containing the stories as structured data: headline, date, & body. And I wrote a second script that renders that data through Pystache templates I wrote and writes an HTML file for each story, plus a table of contents page. (I don't intend on adding comments or starting the column back again, so I didn't think I'd want a CMS. Pystache, the Python implementation for lightweight Mustache templates, seemed like a reasonable choice.) I got some help on this, notably from a pairing session with Chase Lambert on testing Unicode stuff, and from a pairing session with Geoff Shannon on a Pystache type and inheritance problem.
Unfortunately I never quite figured out how to get one Pystache template nested in another, so there's some code duplication (perhaps partials are the answer). And I had to hack my way around some loopback issues so as to put chronological next/previous links on each article. (Story URLs are just kebab-cased dates. So, my script gets the headline and date (and thus the URL) of the next or previous story by traversing a date-sorted list of dates-and-headlines dicts, then renders the dates and URLs into variables in the template. Oh right, this is where a CMS would have been nice! Lightweight is great until it's not.)
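The next/previous lookup described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual script: the function names, the dict keys, and the exact URL shape (an ISO date plus .html) are assumptions made for the example.

```python
from datetime import date


def url_for(story):
    """Build a kebab-cased URL from the story's date (URL shape is assumed)."""
    return story["date"].isoformat() + ".html"


def link_to(story):
    """Return the headline and URL a template needs for a link, or None."""
    if story is None:
        return None
    return {"headline": story["headline"], "url": url_for(story)}


def with_neighbors(stories):
    """Sort stories by date and attach prev/next link data to each one."""
    ordered = sorted(stories, key=lambda s: s["date"])
    contexts = []
    for i, story in enumerate(ordered):
        # The first story has no predecessor and the last has no successor.
        prev_story = ordered[i - 1] if i > 0 else None
        next_story = ordered[i + 1] if i + 1 < len(ordered) else None
        context = dict(story)
        context["url"] = url_for(story)
        context["prev"] = link_to(prev_story)
        context["next"] = link_to(next_story)
        contexts.append(context)
    return contexts


# Made-up sample data, deliberately out of order.
sample = [
    {"headline": "Second", "date": date(2005, 4, 17), "body": "..."},
    {"headline": "First", "date": date(2005, 4, 10), "body": "..."},
    {"headline": "Third", "date": date(2005, 4, 24), "body": "..."},
]
pages = with_neighbors(sample)
```

Each dict in pages could then be handed to the template renderer. In Mustache, a None value for prev or next means a section such as {{#prev}}...{{/prev}} simply renders nothing, which takes care of the first and last stories.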
(In the course of all this, I (with help from a sed FAQ) wrote my first real honest-to-goodness "changing a bunch of files in-place with sed" one-liner in years or possibly ever. A ton of links in several files were pointing to the parent directory instead of the current directory. So:
sed -i '/head/s/\.\.\///' *.html
means "In-place, change ../ to nil, in all the .html files in this directory." Whoo!)
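For readers less used to sed, here is a rough Python equivalent of what that one-liner does to one file's text. Note the /head/ address: the substitution only runs on lines containing "head", and without a g flag only the first ../ on each such line is removed. The function name and the sample HTML are made up for illustration.

```python
def strip_parent_prefix(text):
    """Mimic sed '/head/s/\\.\\.\\///' on a string of file contents."""
    out = []
    for line in text.splitlines(keepends=True):
        # The /head/ address restricts the edit to lines containing "head".
        if "head" in line:
            # Like s/// without the g flag: only the first match is replaced.
            line = line.replace("../", "", 1)
        out.append(line)
    return "".join(out)


before = (
    '<head><link rel="stylesheet" href="../style.css"></head>\n'
    '<a href="../other.html">other</a>\n'
)
after = strip_parent_prefix(before)
```

In this sample only the first line changes, because the second line does not contain "head".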
The look, the feel
(There was a cotton ad on TV when I was a kid, with the jingle, "The look / the feel / the fabric of our lives." Sometimes Nandini and I sing it to each other. I suppose if there were an ad for Cascading Style Sheets on TV today it could use the same motto.)
I wrote the stylesheet and arranged the proper elements in the template with a bunch of help from Mozilla Developer Network's guidance on boxes and tables, and that old standby, CSS Zen Garden. I gratefully and curiously perused several nice-looking styles for inspiration and edification. I now more thoroughly understand the difference between margin and padding, and grok better why modern sites have a zillion
For a "home" image, I used a picture of me that Valerie Aurora took, and for a header decoration, I used the GNU Image Manipulation Program to stitch together repetitions of a photo that Kitt Hodsden took and blogged in 2012.
I've made database schema decisions before, but I haven't previously decided on search indices. It was cool that I had the power to change up the parsed output once I realized that the structured data ought to have hrefs as the unique IDs, rather than otherwise-useless unique doc IDs.
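To illustrate why the href makes a handy ID, here is a small sketch. The field names and sample stories are hypothetical, and naive_search is only a stand-in for a real search library such as Lunr.js; the point is that each hit is already a link target, with no second lookup from doc ID to page.

```python
import json


def to_search_docs(stories):
    """Flatten stories into one record each, keyed by the page's href."""
    return [
        {"href": s["href"], "title": s["headline"], "body": s["body"]}
        for s in stories
    ]


def naive_search(docs, term):
    """Stand-in for a search library: case-insensitive substring match."""
    term = term.lower()
    return [
        d["href"]
        for d in docs
        if term in d["title"].lower() or term in d["body"].lower()
    ]


# Made-up sample stories.
stories = [
    {"href": "2005-04-10.html", "headline": "On Trains", "body": "Commuting by rail."},
    {"href": "2005-04-17.html", "headline": "On Tea", "body": "Chai and trains."},
]
docs = to_search_docs(stories)
# The JSON a browser-side search index could load.
index_json = json.dumps(docs, indent=2)
hits = naive_search(docs, "tea")
```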
MC Masala is live! I am so happy that these columns have a nice home now, and that I made it. I got to exercise my Python, which is strong, and I got to strengthen a bunch of other skills along the way. It's not perfect, and I have a TODO list, but it's the nicest-looking site I've ever made, and it fulfills its function well. And I made it in just a few days.
* I basically stalled on the Matasano challenges, and will come back to them someday when I don't feel so time-constrained. I did get some use out of doing the ones I did! I have now grokked byte-level stuff much better, and learned about bytearrays thanks to Allison Kaptur. And I got some laughs out of the process. Example: In challenge six, the Hamming distance the player calculates should be 37. First attempt: came up with 14. Next: 598. I literally laughed aloud. Then, when I finally got 37, I thrust my arms into the air with great vigor because I WAS A DEITY OF PURE LIGHT. But then I started getting depressingly wrong answers and kept getting them; I got help from friends, but decided to hold off and only look at one friend's potentially-spoilery explanation when I'm ready to come back, and I still haven't looked at it. I tried to remind myself of a sort of Allison Kaptur/Carol Dweck "the edge of maybe-can't"/"The only thing that makes you smarter is doing hard things" attitude, that I am a Joseph Campbell hero and the greater my struggle the greater my triumph will be. But I was tearing up in frustration, and I decided to give myself a rest from crypto and level up on the main skill I'd come to Hacker School to learn, namely, webdev. And I think that was the right decision. You gotta manage your own morale and momentum -- that's a resource too.
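For the curious, a bit-level Hamming distance like the one in that challenge can be computed in a few lines. This is a sketch rather than the code from the batch, and the two test strings are assumed to be the pair the challenge supplies; the distance between them does work out to 37.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the differing bits between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    # XOR leaves a 1 bit wherever the two bytes disagree.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


distance = hamming_distance(b"this is a test", b"wokka wokka!!!")
print(distance)  # 37
```

Counting differing characters instead of differing bits is an easy way to get a number that is far too small.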
: Hacker School
# (0) 18 Nov 2014, 01:01PM: A Node.js Project, And Deciding to Shelve It:
In my second week of my 2014 Hacker School batch, I asked:
What are red flags in scifi/fantasy magazines' calls for submissions? What words/phrases make you think "ew, avoid"? -- @brainwane, 3:48 PM - 13 Oct 2014
As Moss guessed, I was thinking of making an SF&F version of joblint.org, to automatically check for suspect wording in "please submit" pages and posts by speculative fiction publishers.
I take off my hat to Rowan Manning for creating the tool and the site, which I found easy to adapt (my fork of the tool, my fork of the site). The code's in Node.js, and despite an npm problem on Ubuntu, I found it fairly easy to figure out how to change the tests, regular expressions, and error messages, modify the package dependencies and update appropriately (especially thanks to Hacker School colleagues). Check it out:
But conversation with some SF&F community members led me to believe that the joblint approach wouldn't help here. In tech industry job descriptions, you can rely on certain buzzwords and key off them; joblint should be only part of a suite that catches problems, the way a code linter should be in a software engineering process, but it provokes thought and is useful on its own. But problems with SF&F calls for submissions are often in subtler approaches rather than easy-to-match strings. So it didn't feel worthwhile for me to try for a regexes-alone approach, and I didn't want to spend my Hacker School time thinking through the automated literature analysis part of this problem; that's not what I wanted to do in this batch.
So I shelved the project and I have not gotten it even close to launch. But the code's up with a TODO list, and y'all should feel free to grab it and run with it if it strikes your fancy!
: Hacker School
# (1) 18 Nov 2014, 10:46AM: Things I Learned About Drupal And Odd 404s:
Back on October 7th, I offered "Some Tips On Domain Names And Hosting", and said: "So, next step: choosing a provider, spinning up a server, loading it up, and pointing my new domain name at it!" And then an interesting unexpected thing came up, which takes up the majority of this post (see the "Weird spam and HTTP tricks" section).
I chose DigitalOcean mainly because a peer had a $10 referral coupon thing, so I could for free enjoy the benefits of using a service that has a business model that makes sense and won't get all ad skeevy (relevant rant, parts one, two, and three).
I faced some two-factor auth problems basically because the most convenient 2FA solutions assume you are fine with installing a closed-source app on a computing device you control.
Also, when spinning up a DigitalOcean droplet for the first time and SSHing into it, I'd like to establish the authenticity of the host by verifying the ECDSA key fingerprint. Where in one's digitalocean.com settings or in the web UI should one look to find that? The answer: one can't. I looked on the web and asked around, and found a lot of people saying, "when you get to 'the authenticity of this host cannot be established, are you sure,' just say yes." There is apparently no way to verify that key fingerprint in the web UI. The attack vector is microscopic (someone else coming in and spoofing the IP address right after you spin it up and before you have a chance to SSH in). But it still annoys me. I hear Amazon EC2 has solved this problem and does give you a way to verify the fingerprint.
I followed some useful tutorials to refresh my memory so I could set up an Ubuntu server and get a LAMP stack installed. Another helped me install Drupal. I have now successfully installed Drupal!
Generally, if you want to make Drupal do what you want it to do, it's helpful to install modules that other people have made, and maybe themes. You can check out popular modules such as Views, and you can look up how to install modules and themes, and learn how to install modules and themes specifically in Drupal 7.
Thanks to much help from Fureigh (example), when I looked up an "installation profile" ("ngpprofile") that interested me, I found out about Drush and installed it. It seems as though Drush wants or needs to do everything as root, which doesn't feel right to me, so maybe I misunderstood. Then again, a sysadmin of my acquaintance mentioned his "you gotta be kidding me" reaction to a Drupal installation HOWTO that blithely said "now chmod 777 the web directory", so maybe I just have a different attitude to privileging than Drupal does! Some more thoughts on Drush: a slide deck, GitHub, a homepage, and a project page.
And Fureigh submitted a patch to get ngpprofile to work properly with Drush! ... And then I ungratefully did not try to use ngpprofile, and instead looked at a very very simple theme, and then fiddled manually with templates and the admin dashboard to make my site look just slightly different from a regular stock Drupal site. Drupal theming seems to be a pretty deep skill in and of itself.
I got help from the #drupal-support IRC channel on Freenode as I went -- thanks! If I ever dip into Drupal again, I'll check out a video resource they recommended, including a "build your first Drupal 7 website" video sequence.
Weird spam and HTTP tricks
I bought a brand-new domain name via Hover and pointed it to my DigitalOcean droplet. The next day, I looked at various admin logs and noticed strange 404s that had nothing to do with my site. Clearly they were spam and the attackers hoped I would click on their URLs thinking they were referrers, or similar (if the attacked site's 404 logs are public, intentionally or accidentally, then this tactic would increase the spammer's pagerank). I'll reproduce one here, with the actual URL replaced with "myphishingsite.biz" and eliding the IP.
TYPE page not found
DATE Thursday, October 9, 2014 - 10:46
USER Anonymous (not verified)
HOSTNAME [IP address elided]
Hmmm. The spammer left their URL in the LOCATION field somehow, but there's no referer (Drupal spells it "referrer" in the admin console). I found that I could cause a "page not found" log entry by going to a nonexistent page on my site, e.g. /bleeber, but then the LOCATION for that log entry was http://[hostname.tld]/bleeber. How was the spammer manufacturing an entry with a LOCATION of http://myphishingsite.biz? And what was up with the truncated initial "h" in the MESSAGE field?
With a few pointers from two Hacker School colleagues, a bit of reading up on how Drupal logs 404s, what access logs look like in Apache, and what 404 actually means, and some trial-and-error, I began to see what was happening. If I went to http://myhostname.tld/http://panix.com, then my access logs included GET /http://panix.com. But the attacker sent requests that logged as GET http://[spamsite] (notice that there is no leading /). So I began to suspect that the attacker programmatically sends GET requests with some kind of intentionally malformed header. (And then this helped me explain why, in the report overview in the web-based admin console, the spammed URLs miss their first character (the h in http) -- usually you don't care about the leading slash or about the base URL when you're skimming that overview, so Drupal programmers made some kind of "omit the first character" choice.)
Time to break out netcat! Usually, the first string after GET in an HTTP request header is the location of the resource you want on the host that you're sending the request to (below, "myhostname.tld" is the host that I'm sending the request to). You'll often see GET / or GET /favicon.ico, for instance. But there's no reason you can't do something like this:
$ nc myhostname.tld 80
GET http://berkeley.edu HTTP/1.1
When I sent that HTTP request manually, I could replicate precisely what the spammers were doing, in terms of what characters showed up or got clipped in the relevant logs. For instance, the access log entry:
[IP address elided] - - [11/Oct/2014:16:23:47 -0400] "GET http://berkeley.edu HTTP/1.1" 404 7574 "-" "netcat"
And if I were specifically attacking Drupal administrators and wanted them to click on things, and I knew about the initial truncated character in the web-based admin console view, I might send a GET request that includes an initial character to throw away:
$ nc myhostname.tld 80
GET /http://nyc.gov/ HTTP/1.1
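The two netcat sessions above could also be scripted. The following is a minimal sketch, assuming a server you control: the function names are made up, and the Host, User-Agent, and Connection headers are additions for the example (the sessions above show only the GET line). Nothing here validates the request target, which is exactly why a full URL can sit where a path normally goes.

```python
import socket


def build_request(target, host, user_agent="example-client"):
    """Build a raw GET request; target is sent as-is, path or full URL."""
    lines = [
        "GET {} HTTP/1.1".format(target),
        "Host: {}".format(host),
        "User-Agent: {}".format(user_agent),
        "Connection: close",
        "",  # Blank line ends the headers.
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def send_raw(host, port, payload, timeout=5.0):
    """Send raw bytes to host:port and return everything the server replies."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


# The two variants from the netcat sessions above.
full_url_request = build_request("http://berkeley.edu", "myhostname.tld")
padded_request = build_request("/http://nyc.gov/", "myhostname.tld")
```

Calling send_raw("myhostname.tld", 80, full_url_request) would then put the request on the wire, much as typing it into nc does.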
So, my first week of my second Hacker School batch, I succeeded in learning a bunch about using the domain name system, hosting, and Drupal, AND I learned how to do hilariously wrong things with HTTP requests. (The site isn't up anymore, because that wasn't the point.) I then went on to build some more sites with different tools, and I'll blog about the rest of them in upcoming posts.
: Hacker School
# (0) 16 Nov 2014, 06:36PM: Shelter and Memory:
Mary Schmich wrote in that 1997 "wear sunscreen" advicedump, which has stuck with me and overall proven a good guide for adult Sumana:
Understand that friends come and go, but with a precious few you should hold on. Work hard to bridge the gaps in geography and lifestyle, because the older you get, the more you need the people who knew you when you were young.
This weekend I hung out with a couple of Wikimedia engineers I'd known for a while -- heck, I'd helped one of them move. One of them mentioned, "I was looking at the Wikipedia article for Team America: World Police --"
And I joked something like, "Oh, because it was interfering with the Education Program's Team America namespace?"
And he laughed at my joke, because he remembered that two years ago, we tried to help out professors by introducing a Course namespace (basically wiki pages starting with "Course:"), but that this caused a conflict with the article about the Star Trek: Voyager episode "Course: Oblivion". Such an obscure joke.
That's the time and the place for the coziness of an inside joke -- among friends, the ones who've helped you shape your identity, so the homosocial bonding doesn't exclude newbies and imply to them that if they don't get the joke then they don't belong. I wonder what idiom speakers of other languages use; the phrase "inside joke" carries these connotations of shelter and interiority to me.
There's a saying that you know you're a New Yorker when you point to a storefront and say "I remember when that was [something different]." I've been here going on nine years, longer than I have ever lived in any other city, and I can imagine visual diffs for scores of blocks. It makes me feel rooted, like a tree. I can sense -- and sometimes give in to -- the temptation to assume that the change began when I arrived and began to observe it, as though the only important change is the change I witnessed.
My family moved over and over when I was a child, and I was poor at socializing as a teen, and I've only retained a handful of college friendships. Today I'm doing a big inbox scouring, and this musing reminds me to prioritize replying to the old pals, the ones who knew a Sumana I can barely remember.
# (1) 14 Nov 2014, 04:07PM: Sometimes Paths Are Useful:
I just finished a six-week batch at Hacker School. As an alumna, I had the option of asking to come back for three months or for a six-week minibatch, and I decided on the latter. I'll be writing more about my lessons, but today I can mostly point to my programming partner's writeup and add a silly story.
I met Greg Hendershott at !!Con months back, and then we ended up in the same batch and found that we laugh at each other's jokes. So we tried to figure out what to work on together. He's way into functional programming, Racket, Clojure, stuff like that, and has for instance written an emacs mode for Racket. In contrast, I'm only fluent in Python and have been concentrating on web dev. We found common ground in Python and an interest in security, and made a webservice that runs a static analyzer on a user-submitted code sample and returns to the user a "report card" of vulnerabilities in their code. That's what I spent the last two weeks on.
In his post, Greg describes how we rejected smaller and smaller web frameworks, finally settling on subclassing from BaseHTTPServer (built into Python's standard library). When you do that, you have to literally define methods so that the server can handle even the most basic HTTP verbs, like POST. We defined POST but didn't define GET, because we didn't need to! It felt so tremendously subversive, creating a web service that gave you a 501 (Method Not Supported) if you tried to GET /, and yet actually did other things. Deliciously wrong.
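A POST-only server of that kind might look something like the sketch below. It is not the project's code: the class and function names are hypothetical, the reply is a placeholder where the real service ran its analyzer, and it uses http.server, the Python 3 home of what Python 2 called BaseHTTPServer.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class PostOnlyHandler(BaseHTTPRequestHandler):
    """Handles POST and nothing else; there is deliberately no do_GET."""

    def do_POST(self):
        # Read the submitted body (for example, a code sample to analyze).
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        # Placeholder reply; a real service would do its work here.
        reply = b"received %d bytes\n" % len(body)
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, format, *args):
        # Keep the example quiet; the default logs each request to stderr.
        pass


def make_server(port=0):
    """Bind to localhost; port 0 asks the OS for any free port."""
    return HTTPServer(("127.0.0.1", port), PostOnlyHandler)
```

Calling make_server(8000).serve_forever() would start it. The base class dispatches each request to a method named do_ plus the verb, so with no do_GET defined, a GET gets a 501 response without any extra code.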
(Also amazing: reading and subclassing from code whose initial code comments specifically and relevantly cite the work of Tim Berners-Lee and Roy Fielding. I felt such awe and gratitude, that I am part of a grand heritage of innovation and infrastructure. What an inheritance!)
So then a few days later we decided to make a simple web page or two, so that someone using a web browser could use the service. I loved the experience of API-first design, and felt amused when I implemented our server's second method, do_GET. (One nice thing about long-term collaboration is that you can pair some of the time and also do some bits on your own, bringing them to your partner for code review.) My do_GET, like do_POST, didn't care about the path, because there's only one thing a user is ever going to do with our service. No URL routing required. A GET request always caused the server to return index.html.
Then I stubbed out a small index.html page, borrowing bits and pieces from other past projects where I'd solved similar problems. And I thought "well I'll style this a bit" and copied a style.css file from one of my old sites into the project directory, linked to it in the head element of index.html, futzed with some element names and IDs, and reloaded. Hmm, why no styling? Shift-reload. Still looked bare. I opened up the developer toolbar...
...and saw that "style.css" had the text of index.html. Because I had defined GET to always return index.html! And when you want a browser to be able to use a stylesheet, well, it'll have to GET it.
I laughed pretty hard, then inlined the CSS. (And we did end up writing a bit of URL routing so we could serve a favicon to browsers and to serve a capabilities document to service clients.)
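The difference between "every GET returns index.html" and "a bit of URL routing" can be sketched as two small functions. The route table, the /capabilities path, and the file names are all hypothetical stand-ins, not the project's actual layout.

```python
# Hypothetical route table: request path -> (file name, content type).
ROUTES = {
    "/": ("index.html", "text/html"),
    "/index.html": ("index.html", "text/html"),
    "/favicon.ico": ("favicon.ico", "image/x-icon"),
    "/capabilities": ("capabilities.json", "application/json"),
}


def resolve_ignoring_path(path):
    """The first version: every GET gets index.html, whatever it asked for."""
    return ("index.html", "text/html")


def resolve_with_routing(path):
    """Look the path up in ROUTES; None means the server should send a 404."""
    # Drop any query string before matching.
    return ROUTES.get(path.split("?", 1)[0])


# What a browser asking for the stylesheet would get in each version.
first_version = resolve_ignoring_path("/style.css")
routed_version = resolve_with_routing("/style.css")
```

With the first version a request for /style.css comes back as HTML, which is the bare-page symptom described above; with routing it is an honest 404 unless a route is added (or, as here, the CSS is inlined).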
I get so much joy out of playing with the building blocks of the Web. It's a great feeling. Thanks for working on this with me, Greg!
: Hacker School
# 07 Nov 2014, 04:35PM: Snapshot:
Sometime in early 2010, I jotted down a few notes that I meant to blog at the time; I've now expanded them into the following entry. I was in between jobs; I think it was just after my time at Collabora, and the year before I started working for Wikimedia Foundation. I'd been in New York City for a little over four years. It's interesting to look back -- I never did turn any of those ideas into a proper conference talk, and I still remember the atmosphere of that evening, feeling out of place of course among the men in business suits in some dim bar, but still connected to them because of what we'd studied together.
Today I thought up some proposal ideas for conferences... [terrible ideas elided]
Today I also reread bits of Rick Yancey's tax collector memoirs, and I went to dinner/drinks with old colleagues, people I'd done the master's in tech management with a few years previous. Basically all guys (and jeez sexism much?). Evidently SWOT & similar tools really work when you break 'em out appropriately (in the midst of chaos, maybe?). And from what these guys tell me, HR is a mess in most big companies; if I can not just catalyse, but teach other people to replicate my success, that's marketable. The interface between a firm & its clients is crucial, but so is the interface between the firm & its employees.
It sounds like one way to keep those corporate accounting and finance skills honed would be to try looking at the financials of a company without knowing its name, and work out what it is.
What do I want in my next job? I should be open to larger orgs, larger than any I've worked with in the past, but I don't want some things I've heard are common in big organizations:
- stifling bureaucracy
- stifling political atmosphere that stops necessary things from being said or asked
- lengthy processes lasting more than 3 months to get rid of an underperformer
Most touchingly, my old classmate [name] said he's forever remembered my interaction with that executive who came to guest-lecture us, about whether he considers himself a success, and would he do it again. Hearing that answer changed his mind. Before coming into the Master's in Tech Management program, he'd thought, "I want to be a CIO of a big corporation." Afterwards: "I want time for family."
# 04 Nov 2014, 01:04PM: .illusion():
Last night one of my Hacker School peers was practicing sleight-of-hand with a card deck, and another peer walked over and said, "Oh, I used to run a magic tricks website."
I waited with bated breath for the punchline. None came! So I had to make some up.
I used to run a magic tricks website, but it disappeared.
I used to run a magic tricks website; I wrote it in Haspell.
I used to run a magic tricks website; it ran RabbitMQ.
I used to run a magic tricks website; I used SQLAlchemy. (predicated on the false memory that SQLAlchemy's logo is a tophat and cane)
I used to run a magic tricks address book application; pick a .vcard format, any .vcard format!
I used to run a magic tricks website; this is my lovely helper function.
But I felt stymied. When I think of magic tricks, I think of visuals and descriptions, not easy-to-pun jargon. And I couldn't think of any puns on the names of GOB Bluth, Penn and Teller, David Copperfield, or Criss Angel/Mindfreak.
And then Cerek Hillen came up with: "I used to run a magic tricks website; I wrote it in Brainfreak." And I thought: yes. It is done.