Wed Dec 04 2013 14:55 Secrets of (people's responses to) @horse_ebooks--revealed!:
As part of my @pony_strategies project (see previous post), I grabbed the 3200 most recent @horse_ebooks tweets via the Twitter API, and ran them through some simple analysis scripts to figure out how they were made and which linguistic features separated the popular ones from the unpopular.
This let me prove one of my hypotheses about the secret to _ebooks style comedy gold. I also disproved one of my hypotheses re: comedy gold, and came up with an improved hypothesis that works much better. Using these as heuristics I was able to make @pony_strategies come up with more of what humans consider the good stuff.
The timing of @horse_ebooks posts formed a normal distribution with mean of 3 hours and a standard deviation of 1 hour. Looking at ads alone, the situation was similar: a normal distribution with mean of 15 hours and standard deviation of 2 hours. This is pretty impressive consistency since Jacob Bakkila says he was posting @horse_ebooks tweets by hand. (No wonder he wanted to stop it!)
My setup is much different: I wrote a cheap scheduler that approximates a normal distribution and runs every fifteen minutes to see if it's time to post something.
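Here's a minimal sketch of that kind of scheduler. The three-hour mean and one-hour standard deviation are borrowed from the @horse_ebooks figures above as an assumption, and the function names are made up for illustration. It only approximates a normal distribution because the actual post happens at the first fifteen-minute check after the scheduled time.

```python
import random
from datetime import datetime, timedelta

MEAN_HOURS = 3.0      # assumed: the @horse_ebooks mean gap quoted above
STDDEV_HOURS = 1.0    # assumed: the @horse_ebooks standard deviation
MIN_GAP_HOURS = 0.25  # never schedule sooner than the next 15-minute check

def next_post_time(last_post, rng=random):
    """Pick the next posting time by drawing a gap from a normal distribution."""
    gap = max(MIN_GAP_HOURS, rng.gauss(MEAN_HOURS, STDDEV_HOURS))
    return last_post + timedelta(hours=gap)

def should_post(now, scheduled):
    """Called by the job that runs every fifteen minutes: is it time yet?"""
    return now >= scheduled
```

The fifteen-minute job would call `should_post(datetime.now(), scheduled)`, and after each post it would store a fresh `next_post_time(...)` somewhere persistent.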
Beyond this point, my analysis excludes the ads and focuses exclusively on the quotes. Nobody actually liked the ads.
The median length of a @horse_ebooks quote is 50 characters. Quotes shorter than the median were significantly more popular, but very long quotes were also more popular than quotes in the middle of the distribution.
I think that title case quotes (e.g. "Demand Furniture") are funnier than others. Does the public agree? For each quote, I checked whether the last word of the quote was capitalized.
43% of @horse_ebooks quotes end with a capitalized word. The median number of retweets for those quotes was 310, versus 235 for quotes with an uncapitalized last word. The public agrees with me. Title-case tweets are a little less common, but significantly more popular.
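The capitalization check is simple enough to sketch. This assumes the tweets are available as (text, retweet count) pairs; the function names are hypothetical and this isn't the actual script I ran.

```python
from statistics import median

def ends_capitalized(quote):
    """True if the last whitespace-separated word starts with a capital letter."""
    words = quote.split()
    return bool(words) and words[-1][0].isupper()

def median_retweets_by_capitalization(tweets):
    """tweets: iterable of (text, retweet_count) pairs.
    Returns {True: median for capitalized endings, False: median for the rest}."""
    groups = {True: [], False: []}
    for text, retweets in tweets:
        groups[ends_capitalized(text)].append(retweets)
    # Skip a group entirely if no quotes fell into it.
    return {key: median(counts) for key, counts in groups.items() if counts}
```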
Since the last word of a joke is the most important, I decided to take a more detailed look at each quote's last word. My favorite @horse_ebooks tweets are the ones that cut off in the middle of a sentence, so I anticipated that I would see a lot of quotes that ended with boring words like "the".
I applied part-of-speech tagging to the last word of each quote and grouped them together. Nouns were the most common by far, followed by verbs of various kinds, determiners ("the", "this", "neither"), adjectives and adverbs.
I then sorted the list of parts of speech by the median number of retweets a @horse_ebooks quote got if it ended with that part of speech. Nouns and verbs were not only the most common, they were the most popular. (Median retweets for any kind of noun was over 300; verbs ranged from 191 retweets to 295, depending on the tense of the verb.) Adjectives underperformed relative to their frequency, except for comparative adjectives like "more", which overperformed.
I was right in thinking that quotes ending with a determiner or other boring word were very common, but they were also incredibly unpopular. The most popular among these were quotes that repeated gibberish over and over, e.g. "ORONGLY DGAGREE DISAGREE NO G G NO G G G G G G NO G G NEIEHER AGREE NOR DGAGREE O O O no O O no O O no O O no neither neither neither". A quote like "of events get you the" did very poorly. (By late-era @horse_ebooks standards, anyway.)
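The grouping-and-sorting step looks roughly like this. The sketch assumes each quote's last word has already been tagged (for example with Penn Treebank tags such as NN, DT or JJR from an off-the-shelf tagger), so the input is just (tag, retweet count) pairs. The function name is hypothetical.

```python
from collections import defaultdict
from statistics import median

def rank_tags_by_median(tagged):
    """tagged: iterable of (last_word_tag, retweet_count) pairs.
    Returns [(tag, number_of_quotes, median_retweets)], most popular first."""
    by_tag = defaultdict(list)
    for tag, retweets in tagged:
        by_tag[tag].append(retweets)
    rows = [(tag, len(counts), median(counts)) for tag, counts in by_tag.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Keeping the count next to the median is what lets you spot a tag that underperforms relative to its frequency.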
It's funny when you interrupt a noun
I pondered the mystery of the unpopular quotes and came up with a new hypothesis. People don't like interrupted sentences per se; they like interrupted noun phrases. Specifically, they like it when a noun phrase is truncated to a normal noun. Here are a few @horse_ebooks quotes that were extremely popular:
- Don t worry if you are not computer
- Don t feel stupid and doomed forever just because you failed on a science
- You constantly misplace your house
- I have completely eliminated your meal
Clearly "computer", "science", "house", and "meal" were originally modifying some other noun, but when the sentence was truncated they became standalone nouns. Therefore, humor.
How can I test my hypothesis without access to the original texts from which @horse_ebooks takes its quotes? I don't have any automatic way to distinguish a truncated noun phrase from an ordinary noun. But I can see how many of the @horse_ebooks quotes end with a complete noun phrase. Then I can compare how well a quote does if it ends with a noun phrase, versus a noun that's not part of a noun phrase.
About 4.5% of the total @horse_ebooks quotes end in complete noun phrases. This is comparable to what I saw in the data I generated for @pony_strategies. I compared the popularity of quotes that ended in complete noun phrases, versus quotes that ended in standalone nouns.
|Quote ends in|Median number of retweets|
|Standalone noun|330|
|Noun phrase|260|
So a standalone noun does better than a noun phrase, which does better than a non-noun. This confirms my hypothesis that truncating a noun phrase makes a quote funnier when the truncated phrase is also a noun. But a quote that ends in a complete noun phrase will still be more popular than one that ends with anything other than a noun.
At the time I did this research, I had about 2.5 million potential quotes taken from the Project Gutenberg DVD. I was looking for ways to rank these quotes and whittle them down to, say, the top ten percent. I used the techniques that I mentioned in my previous post for this, but I also used quote length, capitalization, and punchword part-of-speech to rank the quotes. I also looked for quotes that ended in complete noun phrases, and if truncating the noun phrase left me with a noun, most of the time I would go ahead and truncate the phrase. (For variety's sake, I didn't do this all the time.)
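The truncation step can be sketched like so. This is a crude stand-in for real noun-phrase chunking: it treats "quote ends in two nouns in a row" as "quote ends in a noun phrase that truncates to a noun", and the 80% probability is an assumption standing in for "most of the time". The input is a quote that has already been split into (word, tag) pairs with Penn Treebank tags; the function name is hypothetical.

```python
import random

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # Penn Treebank noun tags

def maybe_truncate(tagged_words, probability=0.8, rng=random):
    """tagged_words: list of (word, tag) pairs for one quote.
    If the quote ends in noun + noun, usually drop the final noun so the
    quote ends on a standalone noun. Returns the quote as a string."""
    if (len(tagged_words) >= 2
            and tagged_words[-1][1] in NOUN_TAGS
            and tagged_words[-2][1] in NOUN_TAGS
            and rng.random() < probability):
        tagged_words = tagged_words[:-1]
    return " ".join(word for word, _ in tagged_words)
```

Leaving the probability below 1 is what keeps some variety in the output.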
This stuff is currently not in olipy; I ran my filters and raters on the much smaller dataset I'd acquired from the DVD. There's no reason why these things couldn't go into olipy as part of the ebooks.py module, but it's going to be a while. I shouldn't be making bots at all; I have to finish Situation Normal.
Wed Dec 04 2013 09:14 @pony_strategies:
My new bot, @pony_strategies, is the most sophisticated one I've ever created. It is the @horse_ebooks spambot from the Constellation Games universe.
Unlike @horse_ebooks, @pony_strategies will not abruptly stop publishing fun stuff, or turn out to be a cheesy tie-in trying to get you interested in some other project. It is a cheesy tie-in to some other project (Constellation Games), but you go into the relationship knowing this fact, and the connection is very subtle.
When explaining this project to people as I worked on it, I was astounded that many of them didn't know what @horse_ebooks was. But that just proves I inhabit a bubble in which fakey software has outsized significance. So a brief introduction:
@horse_ebooks was a spambot created by a Russian named Alexei Kouznetsov. It posted Twitter ads for crappy ebooks, some of which (but not all, or even most) were about horses. Its major innovative feature was its text generation algorithm for the things it would say between ads.
Are you ready? The amazing algorithm was this: @horse_ebooks ripped strings more or less randomly from the crappy ebooks it was selling and presented them with absolutely no context.
Trust me, this is groundbreaking. I'm sure this technique had been tried before, but @horse_ebooks was the first to make it popular. And it's great! Truncating a sentence in the right place generates some pretty funny stuff. Here are four consecutive @horse_ebooks tweets:
- Not only that, but whether you believe it (or want to believe it) the car salesmen will continue to laugh
- Demand Furniture
- Including simplified four part arrangements for the novice student and
- Just look at everything that I am going
There was a tribute comic and everything.
I say @horse_ebooks "was" a spambot because in 2011 the Twitter account was acquired by two Americans, Jacob Bakkila and Thomas Bender, who took it over and started running it not to sell crappy ebooks, but to promote their Alternate Reality Game. This fact was revealed back in September 2013, and once the men behind the mask were revealed, @horse_ebooks stopped posting.
The whole conceit of @horse_ebooks was that there was no active creative process, just a dumb algorithm. But in reality Bakkila was "impersonating" the original algorithm--most likely curating its output so that you only saw the good stuff. No one likes to be played for a sucker, and when the true purpose of @horse_ebooks was revealed, folks felt betrayed.
As it happens, the question of whether it's artistically valid to curate the output of an algorithm is a major bone of contention in the ongoing Vorticism/Futurism-esque feud between Adam Parrish and myself. He is dead set against it; I think it makes sense if you are using an algorithm as the input into another creative process, or if your sole object is to entertain. We both agree that it's a little sketchy if you have 200,000 fans whose fandom is predicated on the belief that they're reading the raw output of an algorithm. On the other hand, if you follow an ebook spammer on Twitter, you get up with fleas. I think that's how the saying goes.
In any event, the fan comics ceased when @horse_ebooks did. There was a lot of chin-stroking and art-denial and in general the reaction was strongly negative. But that's not the end of the story.
You see, the death of @horse_ebooks led to an outpouring of imitation *_ebooks bots on various topics. (This had been happening before, actually.) As these bots were announced, I swore silent vengeance on each and every one of them. Why? Because those bots didn't use the awesome @horse_ebooks algorithm! Most of them used Markov chains, that most hated technique, to generate their text. It was as if the @horse_ebooks algorithm itself had been discredited by the revelation that two guys from New York were manually curating its output. (Confused reports that those guys had "written" the @horse_ebooks tweets didn't help matters--they implied that there was no algorithm at all and that the text was original.)
But there was hope. A single bot escaped my pronouncements of vengeance: Adam's excellent @zzt_ebooks. That is a great bot which you should follow, and it uses an approximation of the real @horse_ebooks algorithm:
- The corpus is word-wrapped at 35 characters per line.
- Pick a line to use as the first part of a tweet.
- If (random), append the next line onto the current line.
- Repeat until (random) is false or the line is as large as a tweet can get.
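The steps above can be sketched in a few lines of Python. This is not the @zzt_ebooks source or the olipy code, just an illustration: the 50% chance standing in for "(random)" and the 140-character tweet limit are assumptions, and the function name is made up.

```python
import random
import textwrap

def ebooks_quote(corpus, width=35, continue_probability=0.5,
                 max_length=140, rng=random):
    """Word-wrap the corpus, pick a random line, and keep gluing on the
    following lines while the coin flip says so and the result still fits."""
    lines = textwrap.wrap(corpus, width=width)
    if not lines:
        return ""
    i = rng.randrange(len(lines))
    quote = lines[i]
    while i + 1 < len(lines) and rng.random() < continue_probability:
        candidate = quote + " " + lines[i + 1]
        if len(candidate) > max_length:
            break  # the line is as large as a tweet can get
        quote = candidate
        i += 1
    return quote
```

Because the line breaks fall wherever the word-wrap puts them, quotes start and stop mid-sentence, which is the whole point.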
And here are four consecutive quotes from @zzt_ebooks:
- SHAPIRO: Ouch! SHAPIRO: Shapiro cares not! SHAPIRO: Hooray!
- things, but I saw some originality in it. The art was very simple, but it was good
- You're tackled by the opponent!
- Gender: Male Height: 5'9" Pilot? Yes Ph.D.? Yes
The ultimate genesis of @pony_strategies was this conversation I had with Adam about @zzt_ebooks. Recently my anger with *_ebooks bots reached the point where I decided to add a real *_ebooks algorithm to olipy to encourage people to use it. Of course I'd need a demo bot to show off the algorithm...
The @pony_strategies bot has sixty years' worth of content loaded into it. I extracted the content from the same Project Gutenberg DVD I used to revive @everybrendan. There's a lot more where that came from--I ended up choosing about 0.0001% of the possibilities found in the DVD.
I have not manually curated the PG quotes and I have no idea what the bot is about to post. But the dataset is the result of a lot of algorithmic curation. I focused on technical books, science books and cookbooks--the closest PG equivalents to the crap that @horse_ebooks was selling. I applied a language filter to get rid of old-timey racial slurs. I privileged lines that were the beginnings of sentences over lines that were the middle of sentences. I eliminated lines that were boring (e.g. composed entirely of super-common English words).
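As an illustration of the "boring line" filter, here's a minimal sketch. The word list is a tiny made-up sample, not the list I used; a real filter would want a few hundred of the most common English words. The function name is hypothetical.

```python
import re

# Illustrative only: a real list of super-common words would be much longer.
COMMON_WORDS = {
    "the", "of", "and", "a", "to", "in", "is", "it", "that", "was",
    "for", "on", "with", "as", "you", "this", "be", "at", "by", "he",
}

def is_boring(line, common_words=COMMON_WORDS):
    """True if every word in the line is a super-common word.
    A line with no words at all also counts as boring."""
    words = re.findall(r"[a-z']+", line.lower())
    return all(word in common_words for word in words)
```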
I also did some research into what distinguished funny, popular @horse_ebooks tweets from tweets that were not funny and less popular. Instead of trying to precisely reverse-engineer an algorithm that had a human at one end, I tried to figure out which outputs of the process gave results people liked, and focused my algorithm on delivering more of those. I'll post my findings in a separate post because this is getting way too long. Suffice to say that I'll pit the output of my program against the curated @horse_ebooks feed any day. Such as today, and every day for the next sixty years.
Like its counterpart in our universe, @pony_strategies doesn't just post quotes: it also posts ads for ebooks. Some of these books are strategy guides for the "Pôneis Brilhantes" series described in Constellation Games, but the others have randomly generated titles. Funny story: they're generated using Markov chains! Yes, when you have a corpus of really generic-sounding stuff and you want to make fun of how generic it sounds by generating more generic-sounding stuff, Markov chains give the best result. But do you really want to have that on your resume, Markov chains? "Successfully posed as unimaginative writer." Way to go, man.
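For the curious, a word-level Markov chain for titles fits in a dozen lines. This is a generic first-order sketch, not the generator @pony_strategies runs; the function names are made up, and `build_chain` needs at least one title or `generate_title` will have nothing to choose from.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # sentinel tokens marking title boundaries

def build_chain(titles):
    """Map each word to the list of words that followed it in the corpus."""
    chain = defaultdict(list)
    for title in titles:
        words = [START] + title.split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain

def generate_title(chain, max_words=12, rng=random):
    """Walk the chain from START until END comes up or the title gets too long."""
    word, out = START, []
    while len(out) < max_words:
        word = rng.choice(chain[word])
        if word == END:
            break
        out.append(word)
    return " ".join(out)
```

Feed it a few hundred generic ebook titles and it will happily produce more of the same.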
Anyway, @pony_strategies. It's funny quotes, it's fake ads, it's an algorithm you can use in your own projects. Use it!
Mon Dec 02 2013 09:36 November Film Roundup:
What a month! Mainly due to a huge film festival, but I also got another chance to see my favorite film of all time on the big screen. What might that film be? Clearly you haven't been reading my weblog for the past fifteen years.
- Wives (1975): This movie has a 4.9 IMDB rating, and although it's not as good as Ishtar, it deserves a lot better than a 4.9. I mean, John Cassavetes's Husbands has a 7.3, and who needs that guy?
Uh, anyway, Wives is a fun cinema verité piece where three ladies blow off married life for a while and goof off. Columbia professor Jane Gaines introduced the movie by describing the main characters' activities as a "rampage", and I think that's a little strong, but maybe by 1975 Norway standards it was a real barn-burner. The film is sort of a more commercial Celine and Julie go Boating. The humor is less reliant on in-jokes, the men are offscreen instead of totally absent, and it's ninety minutes long instead of three hours. It was pretty fun, but Celine and Julie is still the gold standard.
- Next of Kin (1979): a.k.a. "Heritage". A ha-ha-only-serious farce that prefigures Arrested Development in its depiction of the magnetic power of money to keep a dysfunctional family together. Also has a 4.9 IMDB rating, and since all the movie info is in Norwegian I gotta figure it's Norwegians hating on their own filmmakers. Why the hate, Norwegians? Did you know that Kon-Tiki is the only Norwegian film people outside of Norway have ever heard of? Show some pride and get your name out there.
I guess I'm just stirring up trouble now, so I'll go back to Next of Kin. The centerpiece of the film for me was a long sequence in the house of the late paterfamilias, in which the family argues over who inherits what, then takes everything down off the walls, puts stickers on everything, and carries all the furniture out to their cars. That must have been incredibly difficult to film, and as someone who has lived through that event (minus the arguing) I gotta say Anja Breien nailed it.
Breien attended the screening and after the movie I asked her to talk about that bit. She said she likes "people carrying things" and the "surrealistic piles" you see in Hieronymus Bosch paintings. It symbolizes the alienating effect of materialism, you see. She mentioned that it was really difficult to find all those props; it had to be real expensive silver, paintings by big-name artists, etc. Sounds like they didn't insure it, either. The perfect time-travel heist!
- Gentlemen Prefer Blondes (1953): Man, that was saucy. Jane Russell and Marilyn Monroe really tear it up. Russell's "Anyone Here For Love?" number ("The gayest thing I've ever seen." -Hal) annihilates the male gaze, which spends the rest of the movie trying to recover.
I must admit I'm warming to Marilyn Monroe. I also admit that's a weird thing for a heterosexual man to say, but keep in mind that for most of my life I experienced Marilyn Monroe entirely through the medium of cardboard cutouts used as decor for fake 50s diners. Then I saw her in Love Happy, where she's terrible, and Some Like it Hot, where she's not that great. But as I mentioned a year ago, she's awesome in All About Eve, and she's great in this movie as someone determined to get hers out of a sexist society.
Uh, the worst thing I can say about this movie is the plot bogs it down. I don't really care about the machinations or the milquetoast dudes or the tiara; I just want to see Russell and Monroe hit on some more dumb jocks and maybe commit a little light insurance fraud. Plus, we have a French courtroom conducting an inquiry in English, which may be the most unrealistic thing I've ever seen in a movie.
Finally, I'd just like to point out that this movie ends with the two female characters getting married to their milquetoast dudes, but then it zooms in and cuts the dudes out of frame, so it's just Russell and Monroe standing next to each other in their wedding dresses. I can only imagine what this film would have looked like with the Subtext Glasses they handed out during its original theatrical run.
- The Wind Rises (2013): This was so close to being a good movie that I'm having a hard time pinning down the problem. I think it stems from the fact that this is one of the only Miyazaki films about an adult man. Does that make sense? Because the main character himself is fine but because he's a grown man I guess he's got to have this love interest who is sickly and angelic and apparently highly fictionalized. This would be okay if she was the mostly-offscreen mom from Totoro, but here she's supposed to carry the entire feminine side of the film and it's not good.
The other problem is that the movie doesn't tell its actual, interesting story--it obliquely tells the space around the story. Which, okay, it's a Japanese film and I'm not opposed to this technique in general, and I liked the way the actual story was told through foreshadowing and implication, but it also means we never see the main character directly struggle with the central problem of the film: the fact that he's designing beautiful things that will kill people. It skips past that part to focus on a cheesy fictionalized love story. I did not consider that a good trade.
- Good news, highbrow artists! I figured out how to get me to watch your avant-garde abstract film. Just use a computer to make it before 1988!
The museum had a festival of early computer films, and I didn't see any of the features, but I watched almost all the shorts. It was a mix of really great films and incredibly boring films. (Making your film with a computer before 1988 does not guarantee I will give a good review. Offer still not valid for Andy Warhol.)
The worst offender was Woody Vasulka's Explanation (1974), a twelve-minute film in which a mesh is deformed and rotated before your eyes, over and over again. The mesh is the visual representation of a waveform which is also played aurally, and which always manifests as an obnoxious droning noise. Twelve minutes, folks. Explanation beats out Trent's Last Case to become the worst movie I've ever seen at the museum.
In the Q&A afterwards someone spoke up for the audience and demanded an explanation for Explanation. The answer actually made sense! Films like Explanation weren't meant to be screened in a theater. They were meant to be looped on a television in an art gallery. The essential affordance of an art gallery being that you can leave when you get tired of it, rather than sitting it out because there's an hour of hopefully better stuff afterwards.
It also would have helped if we'd seen the copyright date at the beginning of Explanation instead of the end, because most of the time I was thinking "This mesh deformation stuff would be groundbreaking for the early 70s, but if this turns out to be from 1986 I'm going to hack Woody Vasulka's Twitter account and make him
The other big sonic annoyance was that most of the films up to about 1972 had soundtracks featuring gratuitous sitar/gamelan/Japanese flute music that often didn't even match the animation. With no other point of reference, the new genre of computer graphics was comparable only to the wonders of LSD, so... toss in some hippy Eastern music! This interview about the film series puts it more diplomatically:
Science and Film: Can you discuss the early films’ fascination with Asian music and imagery?
Gregory Zinman: The influence of Asian music and imagery in early computer films can be traced to a couple of intertwining concerns. Following the horrors of the second world war, many people, including artists, were searching for different belief systems and ways of thinking about humanity’s place in the universe. This resulted, in part, in a flowering of interest in Eastern religions and philosophies, which in turn resulted in a number of cinematic works that simultaneously referenced other worlds and altered consciousnesses.
In a bit of cross-cultural revenge, we also saw a Japanese film (1969's Computer Movie No. 2), in which the soundtrack was Wendy Carlos's version of the third Brandenburg from Switched-On Bach, constantly interrupted by modem handshaking sounds. Make it stop!
Enough negativity. Let's cover the highlights, with links to full video or clips or at least semi-official pages about the films where possible.
First, the abstract stuff. I loved Mary Ellen Bute's very early, good-natured Abstronic (1952) and Mood Contrasts (1953). Especially the narrator at the beginning of Abstronic who explains the concept of computer art and then says "Enjoy yourself!" Here's a page with a couple clips of Mood Contrasts and I also discovered another great Bute film called Dada. Probably the cheeriest thing ever to be called Dada.
The Whitney family--John Sr., John Jr., and James, but sadly not my uncle Jon Whitney--were well represented and seem to have set the standard with films like Side Phase Drift (1965), Lapis (1966), Permutations (1968), and Arabesque (1975). The standard being "pointillism because otherwise the computer can't handle the math" and "slap some Asian music on the soundtrack."
But the champion of the abstract section IMO was Larry Cuba's work. 1978's 3/78 (Objects and Transformations) has a clear Whitney influence (moving dots + Japanese flute soundtrack), but by 1985 computer power had advanced to the point where he was able to create what ranks alongside Composition in Blue (1935) as one of my favorite abstract films of all time, the gloriously isometric Calculated Movements (here's a 30-second excerpt).
Cuba made Calculated Movements with a system called GRASS, which I believe he also used to create the animated Death Star infographic in Star Wars (1977). He was present for the screening, and in the Q&A I asked him if he still had the Calculated Movements source code and if there was a framework for running GRASS on modern computers. He dodged the first question and said no to the second--someone was working on something for Windows but the project died. He did mention that he considered Processing to be the successor to GRASS.
Between abstract and representative film sits the surreal, neon candy-colored demo reel for the computer graphics studio of Robert Abel and Associates. Their work was apparently described as "a psychedelic trip gone straight," and if I'm misremembering that quote, I'll use those exact words to describe it right now. We saw the 1974 reel and I can't find that exact one online, but here are a few later ones: 1981 and 1982.
I especially enjoyed RAA's bonkers 1974 ad for 7-Up, which really lightened the mood after a half-hour of the Whitneys, I tell you what. Here's a YouTube playlist of their stuff. Here's a sequel to the 7-Up commercial with a McDonalds tie-in. Outstanding. This studio seems to have driven a big chunk of the late-70s early-80s aesthetic.
And now, my perennial favorite, representative film. Yay!
- La Faim (1974) used computer animation and morphing to create a traditional-style (albeit avant-garde) animated short. I'm surprised the disturbing, grotesque faces on display in this film aren't used in more memes. (See sample meme to the right.)
- Vol Libre (1980): This one really wowed 'em at SIGGRAPH with its fractal geometry. Bonus sci-fi connection: director Loren Carpenter says, "I used an antialiased version of this software to create the fractal planet in the Genesis Sequence of Star Trek 2, the Wrath of Khan."
- Voyager 2 Flyby (1981): We saw the second Saturn flyby, but YouTube also has the first Saturn flyby, as well as the 1986 sequel about Uranus and 1989's chilling "Neptune and Triton".
Jim Blinn, creator of the Saturn flyby film, said, "Our storyboard was the NASA flight plan." (He wasn't there; the guy introducing the films told us that he said this.) The Voyager flyby film was apparently the first time computer graphics were shown on the nightly news as part of the news, rather than just in interstitials and 7-Up commercials from Robert Abel and Associates.
- Human Vectors (1982): This isn't a great work of art, but it was filmed off of a Vectrex, so it looks like nothing else in the show. It was apparently rescued by the New Museum's recent XFR STN project. I laughed at the C debugging joke.
- Big Electric Cat (1982): An 80s rock video. Not that great but I'm including it here because it's so weird. One of the directors was present and he introduced the video by saying: "It was the 80s." It sure was.
- Adventures in Success (1983): Now this is more like it! A funny music video for a good rock song. It's catchy and toe-tapping and satirical and also very 80s. Highly recommended.
- No No Nooky TV (1987): The journal of a love affair between a woman and her Amiga 1000. Funny and dirty and filled with the 16-color joy that flows from late-1980s computer paint programs. A triumph! Vimeo says the video is only 2:40, but the entire film is there.
I would be really interested to hear about the relationship between the demoscene and the computer film scene. I'm pretty sure there was no connection whatsoever, for a variety of reasons, but I would like to hear some people who came into computer art through the "art" side talk about the stuff that came out from the "computer" side. I'm talking about the tension between Human Vectors (which is technically very skilled but nothing special artistically) and No No Nooky TV (which is clearly the work of a professional filmmaker but was made using only the programs that come loaded on the Amiga).
I didn't bring this up in Q&A because I figured no one would know what I was talking about, and if they did it would derail the whole Q&A. Perhaps I should have had more faith in computer animators. I guess I'll have to wait for the Jason Scott documentary.
I also think the museum did a good job of showcasing excellent work by women in a medium dominated (?) by male artists. The earliest films shown were Mary Ellen Bute's, and my two favorite films of the show were made by women: Lynn Goldsmith (who co-directed and sang Adventures in Success) and Barbara Hammer (No No Nooky TV). There was also a whole discussion with Lillian Schwartz which I didn't attend.
If this has whetted your appetite for old-fashioned computer animation, there's plenty more where that came from (the past).
- The Big Lebowski (1998): I'm not someone who rewatches movies, and I've now seen The Big Lebowski six times. What can I say now that I haven't already said?
Well, how about this. My favorite thing about Thomas Pynchon is that each of his characters is surrounded by a protective bubble of literary genre, which colors the way the narrative is reported and even shapes the plot. This is most obvious with the Chums of Chance in Against the Day, who start off having a carefree Tom Swift adventure that, as they grow up, gradually becomes a WWI military novel. The Big Lebowski does the same thing for film.
I admit it took the publication of Inherent Vice, Thomas Pynchon's own version of The Big Lebowski, for me to realize this, but there it is. Walter is in an action movie. Maude Lebowski is in an arty Eurofilm where people trade wisecracks and laugh about nothing. The Stranger is in a Western. Bunny Lebowski is in an acausal porno. Jeffrey Lebowski is in a biopic of himself, with classical music and a narrator sonorously recounting his accomplishments. The Dude doesn't want to be in a movie at all, but his decision to get revenge for the death of his partner rug puts him into a bubble of film noir. And Donny is like a child who wanders into the middle of a movie and wants to know what's going on.
And I don't know what else to say. The Big Lebowski is my favorite movie. It's very nearly the perfect fiasco comedy, and since that's the best kind of movie, it's very nearly the perfect movie. But how many times can you watch the perfect movie? How can I laugh at a really funny joke knowing that my laughter rings hollow because I knew the joke's exact timing?
Here it stands, like Shakespeare's Hamlet or Larry Cuba's Star Wars, the source of cliches that will last a thousand years. Can I set down The Big Lebowski and walk away without betraying my love for it? Nay, and yet I must! For this is not 'Nam. This is Film Roundup. There are rules.
Sat Nov 30 2013 09:43 @everybrendan Season Two:
Last year I wrote one of my first Twitter bots, @everybrendan. Inspired by Adam's infamous @everyword, it ran for two months, announcing possible display names for Brendan's Twitter account (background), taken from Project Gutenberg texts. Then I got tired of individually downloading, preparing, and scraping the texts, so I let it lapse a year ago today, with a call for requests for a "season two" that never materialized.
Well, season two is here, and it's a doozy. I've gone through Project Gutenberg's 2010 dual-layer DVD and found about 300,000 Brendan names in about 20,000 texts, enough to last @everybrendan until the year 2031. At that point I'll get whatever future-dump contains the previous twenty years of Project Gutenberg texts and do season three, which should keep us going until the Singularity. The season two bot announces each new text with a link, so it educates even as it infuriates.
I've been wanting to do this for a while, but it's a very tedious process to handle Project Gutenberg texts in bulk. Most texts are available in a wide variety of slightly different formats. The texts present their metadata in many different ways, especially when it comes to the dividing line between the text proper and the Project Gutenberg information. Some of the metadata is missing, some of it is wrong, and there's one Project Gutenberg book that doesn't seem to be in the database at all.
I started dealing with these problems for my NaNoGenMo project and realized that it wouldn't be difficult to get something working in time for the @everybrendan anniversary. I've put the underlying class in olipy: it's effectively a parser for Gutenberg texts, and a way to iterate over a CD or DVD image full of them. It can also act as a sort of lint for missing and incorrect metadata, although I imagine Project Gutenberg doesn't want to change the contents of files that have been on the net for fifteen years, even if some of the information is wrong.
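To give a flavor of the "dividing line" problem, here's a minimal sketch that strips the Project Gutenberg boilerplate from a text using the common "*** START OF ..." and "*** END OF ..." marker lines. This is not the olipy class or its API, and the function name is made up. It only handles the well-behaved case; many older texts use different markers, and those are exactly the ones that cause the trouble described above.

```python
import re

# The common marker lines; older texts vary, so treat these as an assumption.
START_RE = re.compile(r"^\*\*\*\s*START OF (THE|THIS) PROJECT GUTENBERG", re.I)
END_RE = re.compile(r"^\*\*\*\s*END OF (THE|THIS) PROJECT GUTENBERG", re.I)

def strip_boilerplate(text):
    """Return the text between the START and END marker lines.
    If a marker is missing, fall back to that end of the file."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if START_RE.match(line):
            start = i + 1
        elif END_RE.match(line):
            end = i
            break
    return "\n".join(lines[start:end]).strip()
```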
The Gutenberg iterator still needs a lot of work. It's good enough for @everybrendan, but not for my other projects that will use Gutenberg data, so I'm still working on it. My goal is to cleanly iterate over the entire 2010 DVD without any problems or missing metadata. The problems are concentrated in the earlier texts, so if I can get the 2010 DVD to work it should work going forward.
Wed Nov 27 2013 09:48 Bots Should Punch Up:
Over the weekend I went up to Boston for Darius Kazemi's "bot summit". You can see the four-hour video if you're inclined. I talked about @RealHumanPraise with Rob, and I also went on a long-winded rant that suggested a model of extreme bot self-reliance. If you take your bots seriously as works of art, you should be prepared to continue or at least preserve them once you're inevitably shut off from your data sources and your platform.
We spent a fair amount of time discussing the ethical issues surrounding bot construction, but there was quite a bit of conflation of what's "ethical" with what's allowed by the Twitter platform in particular, and website Terms of Service in general. I agree you shouldn't needlessly antagonize your data sources or your platform, but what's "ethical" and what's "allowed" can be very different things. However, I do have one big piece of ethical guidance that I had to learn gradually and through osmosis. Since bots are many hackers' first foray into the creative arts, it might help if I spell it out explicitly.
Here's an illustrative example, a tale of two bots. Bot #1 is @CancelThatCard. It finds people who have posted pictures of their credit or debit card to Twitter, and lets them know that they really ought to cancel the card and get a new one.
Bot #2 is @NeedADebitCard. It finds the same tweets as @CancelThatCard, but it retweets the pictures, collecting them in one place for all to see.
Now, technically speaking, @CancelThatCard is a spammer. It does nothing but find people who mentioned a certain phrase on Twitter and sends them a boilerplate message saying "Hey, look at my website!" For this reason, @CancelThatCard is constantly getting in trouble with Twitter.
As far as the Twitter TOS are concerned, @NeedADebitCard is the Gallant to @CancelThatCard's Goofus. It's retweeting things! Spreading the love! Extending the reach of your personal brand! But in real life, @CancelThatCard is providing a public service, and @NeedADebitCard is inviting you to steal money from teenagers. (Or, if you believe its bio instead of its name, @NeedADebitCard is a pathetic attempt to approximate what @CancelThatCard does without violating the Twitter TOS.)
At the bot summit I compared the author of a bot to a ventriloquist. Society allows a ventriloquist a certain amount of license to say things via the dummy that they wouldn't say as themselves. I know ventriloquism isn't exactly a thriving art, but the same goes for puppets, which are a little more popular. If you're an MST3K fan, imagine Kevin Murphy saying Tom Servo's lines without Tom Servo. It's pretty creepy.
We give a similar license to comedians and artists. Comedians insult audience members, and we laugh. Artists do strange things like exhibit a urinal as sculpture, and we at least try to take them seriously and figure out what they're saying.
But you can't say absolutely anything and expect "That wasn't me, it was the dummy!" to get you out of trouble. There is a general rule for comedy and art: always punch up, never punch down. We let comedians and artists and miscellaneous jesters do outrageous things as long as they obey this rule. You can poke fun at yourself (Stephen Colbert famously said "There's no status I would not surrender for a joke"), you can make a joke at the expense of someone with higher social status than you, but if you mock someone with lower status, it's not cool.
If you make a joke, and people get really offended, it's almost certainly because you violated this rule. People don't get offended randomly. Explaining that "it was just a joke" doesn't help; everyone knows what a joke is. The problem is that you used a joke as a means of being an asshole. Hiding behind a dummy or a stage persona or a bot won't help you.
@NeedADebitCard feels icky because it's punching down. It's saying "hey, these idiots posted pictures of their debit cards, go take advantage of them." Is there a joke there? Sure. Is it ethical to tell that joke? Not when you can make exactly the same point without punching down, as @CancelThatCard does.
The rules are looser when you're in the company of other craftspeople. If you know about the "Aristocrats" joke, you'll know that comedians tell each other jokes they'd never tell on the stage. All the rules go out the window and the only thing that matters is triggering the primal laughter response. But also note that the must-have guaranteed punchline of the "Aristocrats" joke ensures that it always ends by punching upwards.
You're already looking for loopholes in this rule. That's okay. Hackers and comedians and artists are always attracted to the grey areas. But your bot is an extension of your will, and if you're a white guy like me, most of the grey areas are not grey in your favor.
This is why I went through thousands of movie review blurbs for @RealHumanPraise in an attempt to get rid of the really sexist ones. It's an unfortunate fact that Michelle Malkin has more influence over world affairs than I will ever have. So I have no problem mocking her via bot. But it's really easy to make an incredibly sexist joke about Michelle Malkin as a way of trying to put her below me, and that breaks the rule.
There was a lot of talk at the bot summit about what we can do to avoid accidentally offending people, and I think the key word is 'accidentally.' The bots we've created so far aren't terribly political. Hell, Ed Henry, chief White House correspondent for FOX News, follows @RealHumanPraise on Twitter. If he enjoys it, it's not the most savage indictment.
In comedy terms, we botmakers are on the nightclub stage in the 1950s. We're creating a lot of safe nerdy Steve Allen comedy and we're terrified that our bot is going to accidentally go off and become Andrew Dice Clay for a second. There's nothing wrong with Steve Allen comedy, but I'd also like to see some George Carlin-type bots: bots that will, by design, offend some people. (Darius's @AmIRiteBot is the only example I know of.)
Artists are, socially if not legally, given a certain amount of license to do things like infringe on copyright and violate Terms of Service agreements. If you get in trouble, the public will be on your side, unless you betrayed their trust by breaking the fundamental ethical rule of comedy. So do it right. Design bots that punch up.