Contribution Metrics Are Messy: An Example

Blog by Sumana Harihareswara, Changeset founder

13 May 2022, 4:00 p.m.

Open Source and Free Culture

I frequently notice folks asking or answering questions like "how many contributors does this open source project have?" or "how much contribution is this project getting?" Here's an example of why those aren't simple questions.

ffmpeg is a mindbendingly powerful command-line tool to play and transform audio and video files. There are a zillion commands and flags you can use and it's hard to memorize them. So, several years ago, Ashley Blewer started a great website called ffmprovisr. It's a cookbook of useful ffmpeg recipes, like "join 2 files of the same type" and "compare two video files for content similarity using perceptual hashing". It's grown into an open source project with many users and multiple committers, gotten redesigned, and even inspired imitators.

So. Would you call ffmprovisr a contribution to ffmpeg? It's useful documentation for ffmpeg, but doesn't live in the ffmpeg repository/repositories. It helps more people use ffmpeg (and probably reduces the number of support queries its maintainers get). By creating ffmprovisr, has Blewer become a contributor to ffmpeg? Should we have a category like "indirect contribution", and, if so, how would we delineate that?

Let's go older. In the mid-2000s, the biggest ad for the Ruby programming language was the Rails web framework. Rails was the gateway through which a ton of programmers started to learn and love Ruby. So every Rails committer, documenter, trainer, and bug reporter also ended up doing a favor for Ruby. Can we say that Rails is a contribution to Ruby? Is it useful to say that?

It depends on what further question you're trying to answer. We ask "what is a contribution?" as part of asking "how much contribution are we getting?", "how many contributors do they have?", or "who are this project's contributors?". And the right answer depends on what you want to do with it, because questions like these call for different definitions:

We just had a breach. Whose credentials need to get rotated?
Next week we'll host a livechat to work on the roadmap. Whom should we invite?
We're moving the trunk branch from "master" to "main". Who needs notification and maybe training?
Should our company keep putting energy into this as an open source project for the recruiting and marketing benefits, or take it proprietary?
Tidelift has money available for our project. How should we split it up?
If we give your project a financial grant, are we investing in a group that's more gender-diverse than the average open source project?
A conference wants us to give a talk. Who can speak authoritatively about the project? And who can speak on the project's behalf?
Should we recruit more user support/bug triage people who have specific experience with [a popular topic]?
Should we make the effort of having this developer documentation translated into Chinese?
If we participate in Outreachy, which includes both code and process contributions, how many people might be able to mentor the apprentices?
Which projects should my project integrate with/support/depend on?
Of these five projects, which is the most likely to still be around in five years?

And so on. (These questions range through most of the five major ways projects get stuck.) For some of those questions, answers for a project like ffmpeg change depending on whether you ignore ffmprovisr, or catalog it as something like a plugin or extension to ffmpeg, a contribution to the ffmpeg ecology. And some answers for a language, framework, or operating system -- something like Ruby, where usage depends on people making useful tools built on top of the foundation you provide -- only make sense if you incorporate data about ecology inhabitants like Rails.

Sometimes you can answer those questions just by checking some pre-compiled stats in a GitHub repository. Sometimes you can't, because the answers aren't there; they're in a different repository altogether, or on StackOverflow, on mailing lists, or in a mix of places including individuals' private conversations.
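
To make that concrete, here is a minimal sketch of how the "contributor" count shifts with the question being asked. Everything in it is hypothetical: the people, the source names (`core_repo`, `cookbook_site`, `mailing_list`), the activity kinds, and the `contributors` helper are illustrative assumptions, not real data or any tool's API. The pairing of each question with a filter is also just one plausible reading.

```python
# Hypothetical activity records: (person, source, kind). None of this is real data.
EVENTS = [
    ("alice", "core_repo", "commit"),
    ("bob", "core_repo", "bug_report"),
    ("carol", "cookbook_site", "commit"),
    ("dave", "mailing_list", "support_answer"),
    ("alice", "mailing_list", "support_answer"),
]


def contributors(events, sources, kinds=None):
    """Return the set of people with at least one event in the given sources.

    If kinds is given, only events of those kinds count.
    """
    return {
        person
        for person, source, kind in events
        if source in sources and (kinds is None or kind in kinds)
    }


# "Whose credentials need to get rotated?"
# Assumption: only people who commit to the core repository hold credentials.
committers = contributors(EVENTS, {"core_repo"}, {"commit"})

# "Whom should we invite to the roadmap livechat?"
# Assumption: anyone active anywhere in the wider ecology is a candidate.
ecology = contributors(EVENTS, {"core_repo", "cookbook_site", "mailing_list"})

print(sorted(committers))
print(sorted(ecology))
```

The same records give one person for the first question and four for the second, and neither number is "the" contributor count. In practice the hard part is the step this sketch skips: getting records out of places that no single API covers.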

If you want to dive deeper into these questions, I don't know who's thought more about them than CHAOSS (Community Health Analytics Open Source Software). Its GrimoireLab toolkit tries to actually gather information from lots of sources instead of just the GitHub API. And even so, any quantitative measure will probably need to be supplemented by a qualitative assessment that can catch ecology-level factors, as in the case of ffmprovisr, if you want to make a significant choice based on it.

If you want to work in the general area of quantitative contributor metrics, CHAOSS will welcome you. But if you don't, then instead of looking for the One Grand Definition of "contributor" or "contribution", get more concrete about the questions you want to answer so you can go from there.
