Blog by Sumana Harihareswara, Changeset founder

13 Nov 2013, 7:38 a.m.

Missing From Wikipedia: Tool to Help Fight Systemic Bias

Hi, reader. I wrote this in 2013 and it's now more than five years old. So it may be very out of date; the world, and I, have changed a lot since I wrote it! I'm keeping this up for historical archive purposes, but the me of today may 100% disagree with what I said then. I rarely edit posts after publishing them, but if I do, I usually leave a note in italics to mark the edit and the reason. If this post is particularly offensive or breaches someone's privacy, please contact me.

This week I wrote a tool I currently call "missing from Wikipedia", although the name may change. You feed it a list of people's names and the language Wikipedia you want to check, and it tells you who from that list does not currently have Wikipedia pages about them.
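Here is a minimal sketch of how a check like this could work. It is not the actual script; it assumes the standard MediaWiki query API (`action=query` with `titles`), which marks nonexistent pages as `missing`. The function names, the batch size constant, and the User-Agent string are all made up for illustration.

```python
import json
import urllib.parse
import urllib.request

BATCH_SIZE = 50  # titles per request; the API caps this for ordinary clients


def build_query_url(names, lang="en"):
    """Build a MediaWiki API URL asking about one batch of page titles."""
    params = {
        "action": "query",
        "titles": "|".join(names),
        "redirects": 1,        # treat a redirect to an article as "has a page"
        "format": "json",
        "formatversion": 2,    # pages come back as a list, "missing" as a boolean
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)


def missing_from_response(names, response):
    """Return the names in this batch whose pages the API reports as missing."""
    query = response.get("query", {})
    # The API may rewrite a title (normalization, then redirect), so map each
    # requested name to the title the answer is actually filed under.
    normalized = {e["from"]: e["to"] for e in query.get("normalized", [])}
    redirects = {e["from"]: e["to"] for e in query.get("redirects", [])}
    missing_titles = {p["title"] for p in query.get("pages", []) if p.get("missing")}
    result = []
    for name in names:
        title = normalized.get(name, name)
        title = redirects.get(title, title)
        if title in missing_titles:
            result.append(name)
    return result


def fetch_json(url):
    """Fetch a URL and parse the JSON reply (hypothetical User-Agent string)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "missing-from-wikipedia-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def find_missing(names, lang="en", fetch=fetch_json):
    """Check names in batches and collect those with no page on that wiki."""
    missing = []
    for start in range(0, len(names), BATCH_SIZE):
        batch = names[start:start + BATCH_SIZE]
        missing.extend(missing_from_response(batch, fetch(build_query_url(batch, lang))))
    return missing
```

Calling `find_missing(list_of_names, lang="fr")` would ask French Wikipedia instead of English. The `fetch` parameter is there so the parsing logic can be tried out against a canned response without touching the network.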

For instance, I gave it the ~2100 names from the table of contents of the Oxford Dictionary of African Biography (edited by Emmanuel K. Akyeampong and Henry Louis Gates), and asked about English Wikipedia. The list of people who (I think) do not have enwiki articles about them has 948 names. That means we already cover a little more than half of those Africans (roughly 55%), e.g., Nadine Gordimer. (This is an approximation, because I know some names need more finagling; for instance, currently the script messes up Barack Obama Sr.'s name so it wrongly thinks he doesn't have an enwiki page about him.)

I wrote this for Keilana (yay) as a tool to help fight systemic bias on Wikimedia projects. I hope other people find it useful. I've just added some code so that it prints out the percentage of missing people when it's done running, so you have a better measure of (for instance) French Wikipedia's coverage of important Senegalese leaders. I met Keilana in Berlin this past weekend at the Wikimedia Diversity Conference, and got to show her the power of APIs.
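The percentage itself is simple arithmetic. As a rough sketch (the function name is hypothetical, and the inputs are the approximate figures from the African biography example, about 2100 names checked and 948 missing):

```python
def percent_missing(missing_count, total_count):
    """Share of the checked names that have no page, as a percentage."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return 100.0 * missing_count / total_count


# Approximate figures from the African biography run described above.
print(f"{percent_missing(948, 2100):.1f}% missing")
```

Since the total is itself an estimate and some names are misreported as missing, the result is best read as a ballpark figure.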

When I came to Hacker School, I had a general goal: "When I see a problem that could be solved by writing some Python and reading from/writing to an existing API, I want to recognize that and be able to solve the problem that way." Now I'm a little over halfway through and I have done it!

The code's GPL'd. Enjoy.