2021 Community Contribution Stats by Quarter

Hey everyone!

I’d like to include code contribution stats by quarter in our 2021 Annual Report, as we’ve done in the past. @burke set up some scripts to help with this; that’s how he came up with the stats and fantastic graphs that we put up on our Wiki.

Anyone interested in helping pull this together? Please ping me!

I’m interested :slight_smile: And I think this is a great idea.

The major issue I see is with the automated scripts: right now they exclude all esm repos, so a substantial amount of organization contribution (i.e., to anything related to 3.x) would not be included in the metrics.

@burke could we update your query to include something like “OR repo name includes esm-” to fix this?

@jennifer charts look great :grinning:, openmrs-bot can be filtered out though.


100% agreed with @saurabh above; actually that is critical otherwise the metrics will be very wonky (the openmrs bot is quite a productive little guy!)


@burke or @dkayiwa - how would we update this query to filter out openmrs bot and include the openmrs/esm- repos?

SELECT *
FROM `githubarchive.year.*`
WHERE _TABLE_SUFFIX BETWEEN '2020' AND '2020'
AND (repo.name like '%openmrs%' OR repo.name like '%esm%' OR org.login='openmrs')

Unfortunately, “esm” (short for ECMAScript module; ECMAScript is the standard behind JavaScript) is widely used within repository names outside of OpenMRS. If you search GitHub for repos named “esm”, you’ll quickly notice the vast majority have nothing to do with OpenMRS.

I think most of the community esm repos start with “openmrs”, so they would be included. We could try to get everyone to tag repos with “OpenMRS”, but that’s not sustainable and repo tags aren’t included in the GitHub archive tables we’re using. Probably the most robust approach to catch frontend repos in other orgs or personal accounts that don’t have “openmrs” in their name would be to find repos with “openmrs” as a dependency in a package.json… if that’s something we can search. I’ll have to check.

FYI - we filter out several bots (I think PIH has one too) in the stats.
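One way the question above could be answered, as a rough sketch rather than the query actually used for the stats: keep the “openmrs” name/org match, drop the broad `%esm%` match, and exclude bot accounts by login. `buildStatsQuery` and its parameters are hypothetical names, and `actor.login` is assumed to be the GitHub archive column holding the account name.

```javascript
// Hypothetical helper: builds a BigQuery SQL string for one year of GitHub archive data.
// Assumption: actor.login holds the account name of whoever triggered the event.
function buildStatsQuery(year, botLogins) {
  if (!/^\d{4}$/.test(year)) {
    throw new Error("year must be a four-digit string, got: " + year);
  }
  // Quote each login as a SQL string literal, escaping any single quotes.
  const quoted = botLogins.map((login) => "'" + login.replace(/'/g, "\\'") + "'");
  const lines = [
    "SELECT *",
    "FROM `githubarchive.year.*`",
    "WHERE _TABLE_SUFFIX = '" + year + "'",
    // No '%esm%' match here: it would pull in many unrelated repositories.
    "AND (LOWER(repo.name) LIKE '%openmrs%' OR org.login = 'openmrs')",
  ];
  if (quoted.length > 0) {
    lines.push("AND actor.login NOT IN (" + quoted.join(", ") + ")");
  }
  return lines.join("\n");
}

console.log(buildStatsQuery("2021", ["openmrs-bot", "dependabot[bot]"]));
```

Repos such as openmrs/openmrs-esm-* are already matched by the “openmrs” condition, so they need no separate “esm” clause.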


Unfortunately, I haven’t found a way to include repositories that don’t have “openmrs” in their name (e.g., implementation-specific esm repositories, which have adopted a naming convention without “openmrs” in the name).

I thought we might be able to select all repositories that contain a package.json with a dependency on "openmrs" (which would identify all OpenMRS esms). But the GitHub archive doesn’t include the actual content of commits, only references to the commits. So, these details can’t be used in the BigQuery SQL statement. :confused:

I thought an alternative (less automatic, but better than nothing) approach might be to include all repos with an “openmrs” topic (e.g., we try to convince everyone to add “openmrs” as a topic on their OpenMRS repositories), but it doesn’t look like topic changes generate events, so even this detail isn’t available to us for BigQuery queries.

It looks like we’re limited to repository name, organization, and user from which to select target repositories for our stats. Both organizations and users have non-OpenMRS activity in GitHub that we wouldn’t want to include, which is why I’m using “all repositories that are either under the OpenMRS org or have ‘openmrs’ in their name.”
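To make the trade-off concrete, here is a small sketch of that selection rule applied locally. The function name `isTargetRepo` and all sample repository names are made up for illustration, and the name match is case-insensitive here, which is an assumption.

```javascript
// Mirrors the rule quoted above: under the OpenMRS org, or "openmrs" in the repo name.
// Assumption: the name match ignores case.
function isTargetRepo(repoName, orgLogin) {
  return orgLogin === "openmrs" || repoName.toLowerCase().includes("openmrs");
}

// Hypothetical repositories, for illustration only.
const samples = [
  { repo: "openmrs/some-module", org: "openmrs" },      // included via org
  { repo: "someuser/openmrs-esm-demo", org: null },     // included via name
  { repo: "someorg/esm-loader", org: "someorg" },       // unrelated "esm" repo: excluded
  { repo: "someimpl/esm-clinic-app", org: "someimpl" }, // implementation ESM without "openmrs": missed
];

for (const s of samples) {
  console.log(s.repo, isTargetRepo(s.repo, s.org));
}
```

The last sample is the case discussed in this thread: a real OpenMRS frontend repo that the rule cannot see.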

Other than manually curating a list of repositories to add to the list (which isn’t sustainable), the only other options I can see are:

  • Convince the frontend squad to adopt repo naming conventions that always contain “openmrs” (but I know they like shorter names)
  • Come up with some alternative way (outside of github archive) to find all repositories that meet specific criteria (like “openmrs” as a dependency in a package.json file) and push this list of repository names into a BigQuery table that we could join with our github archive search. GitHub’s advanced search allows you to search for files by name, but I don’t see an option to search for files of a specific name with specific contents. :confused:

From good ol’ Stack Overflow, it looks like it is possible to do an advanced search through GitHub’s web interface like https://github.com/search?p=3&q=%22openmrs%22+filename%3Apackage.json&type=Code, which finds package.json in repositories like samuelmale/ohri-form-engine. That’s good.

Unfortunately, the GitHub API has constraints on search queries and the same search to the API returns a validation error “Must include at least one user, organization, or repository”. :cry: So, we’d need to hack something to invoke the web interface, run through all pages, and grep the repository names. For example,

// List of repositories on current page
Array.from(document.getElementsByClassName("Link--secondary"))
  .map( e => e.innerText ).filter( x => x );

// URL to next page
document.getElementsByClassName("next_page")[0].href;

I’ll play around with it. I don’t think I can automate collecting repository names this way and pushing them to a table in BigQuery. But I can manually collect the names and manually make a list of repos in BigQuery using this technique to see if we can expand our stats to ESMs without “openmrs” in their repo name.

So, here’s my hacky solution:

// Browse to:
// https://github.com/search?p=1&q=%22openmrs%22+filename%3Apackage.json&type=Code
// then paste this code into browser console

function request(url, callback) {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", url);
  xhr.onload = function() {
    callback(xhr.response);
  };
  xhr.send();
}

var reposList = [];

function repoMan(url) {
  console.log("Loading " + url);
  request(url, function(response) {
    var doc = new DOMParser().parseFromString(response, "text/html");

    var repos = Array.from(doc.getElementsByClassName("Link--secondary"))
      .map( e => e.innerText.trim() )
      .filter( x =>
        // non-empty and don't contain "openmrs" in repo name
        x && !x.toLocaleLowerCase().includes("openmrs")
      );

    if (repos) {
      reposList = reposList.concat(repos);
    }

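    // Guard the lookup: a page without a "next_page" element would otherwise throw
    const nextLink = doc.getElementsByClassName("next_page")[0];
    const next = nextLink && nextLink.href;
    if (next) {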
      // avoid GitHub secondary rate limit
      setTimeout(repoMan.bind(null, next), 15000);
    } else {
      reposList = [...new Set(reposList)]; // unique values
      reposList.sort();
      console.log(JSON.stringify(reposList));
    }
  });
}

repoMan(window.location.href);

This loads each search result page from GitHub’s web interface (currently 30 pages) slowly enough to avoid GitHub’s secondary rate limit.

Running it today, I found 57 openmrs-related repositories without “openmrs” in their name.


I’ll load these into a table in BigQuery so we can include stats from these repositories as well.
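For the loading step, one option (an assumption, not necessarily what was done here) is to turn the JSON array printed by the console script into newline-delimited JSON, a format BigQuery can load. The function `toNdjson`, the column name `repo_name`, and the sample repository names are all hypothetical.

```javascript
// Converts a list of "owner/repo" names into newline-delimited JSON,
// one object per line, with duplicates and empty entries removed.
function toNdjson(repoNames) {
  const unique = [...new Set(repoNames.map((n) => n.trim()).filter((n) => n))];
  unique.sort();
  return unique.map((name) => JSON.stringify({ repo_name: name })).join("\n");
}

// Stand-in for the array printed by the console script (names are made up).
const pasted =
  '["example-user/example-esm", "example-org/example-distro", "example-user/example-esm"]';

console.log(toNdjson(JSON.parse(pasted)));
```

Once loaded, such a table could be joined against the archive tables on the repository name so these repos are picked up alongside the “openmrs” matches.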


@burke I’m so sorry, I had no idea that this would turn into such a time sink for you.

Please do not include those 57 repos. 95% of those 57 repos do not look like actual “community contributions” so I really wouldn’t include most of these (it would be misleading - most look like distro-specific or personal forks).

What I should have done is just given you a short list of which ones to confirm are captured. Please just make sure the following repos are included (as I can at least confirm that these are all “Shared Assets”):

That should do the trick I think.

Sure, we can filter down to specific “community” repos in analysis. When searching the GitHub archive, it’s better to cast a wide net, since it lets us find the activity that we don’t know that we don’t know. For example, who knew that rshu, joelss-tech, and savicsorg were forking more OpenMRS repositories than anyone else in 2021? :slight_smile:
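As an illustration of that kind of analysis, here is a minimal sketch that counts forks per account from a set of event rows. It assumes each row has been reduced to a `type` and an `actor` login, that forks appear with the GitHub event type `ForkEvent`, and the function name and sample logins are invented.

```javascript
// Counts ForkEvent rows per actor and returns [login, count] pairs, most forks first.
// Ties are broken alphabetically by login so the order is stable.
function topForkers(events, limit) {
  const counts = new Map();
  for (const e of events) {
    if (e.type !== "ForkEvent") continue;
    counts.set(e.actor, (counts.get(e.actor) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, limit);
}

// Made-up events, for illustration only.
const sampleEvents = [
  { type: "ForkEvent", actor: "alice" },
  { type: "PushEvent", actor: "bob" },
  { type: "ForkEvent", actor: "bob" },
  { type: "ForkEvent", actor: "alice" },
];

console.log(topForkers(sampleEvents, 3));
```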


Our community stats have always looked at the big picture and it’s important that we keep doing so.

It gives us a broad sense of who is working on OpenMRS as a whole - whether they are contributing to OpenMRS repos or their own implementation’s repo. This is about celebrating all of the code contributions that OpenMRS developers are making worldwide to improve core OpenMRS products, distributions, or individual implementations.

We can also look at who is contributing back to core repos, how that changes over time, and use that data to guide us on our efforts to increase contributions back to the core, but let’s not focus only on that single level of contribution.

@jennifer, let me know here or in Slack if there are specific metrics you need. If you run openmrs-contrib-metrics yourself, you’ll want to create your own bot filter until I get around to loading a “NOT: bot” filter by default: create a filter on actor.keyword with “is not one of” and the list of known bot accounts. Once created, you can pin this filter globally to filter bots out of all metrics.

List of known bot accounts:
  • dependabot[bot]
  • openmrs-bot
  • codacy-bot
  • pihinformatics
  • snyk-bot
  • mention-bot
  • whitesource-bolt-for-github[bot]
  • pull[bot]
  • codecov[bot]
  • dependabot-preview[bot]
  • github-actions[bot]
  • houndci-bot
  • renovate[bot]
  • sonatype-depshield[bot]
  • librebot
  • transifex-integration[bot]
  • imgbot[bot]
  • google-cla[bot]
  • sonarcloud[bot]
  • npmcdn-to-unpkg-bot
  • jsparrow-app[bot]
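If the events are exported for analysis elsewhere, the same “is not one of” idea can be applied in a few lines. This is only a sketch: it assumes each event row has been reduced to an object with an `actor` login, and `excludeBots` is a hypothetical name.

```javascript
// Bot accounts taken from the list above.
const KNOWN_BOTS = new Set([
  "dependabot[bot]", "openmrs-bot", "codacy-bot", "pihinformatics", "snyk-bot",
  "mention-bot", "whitesource-bolt-for-github[bot]", "pull[bot]", "codecov[bot]",
  "dependabot-preview[bot]", "github-actions[bot]", "houndci-bot", "renovate[bot]",
  "sonatype-depshield[bot]", "librebot", "transifex-integration[bot]", "imgbot[bot]",
  "google-cla[bot]", "sonarcloud[bot]", "npmcdn-to-unpkg-bot", "jsparrow-app[bot]",
]);

// Keeps only events whose actor is not one of the known bot accounts.
function excludeBots(events) {
  return events.filter((e) => !KNOWN_BOTS.has(e.actor));
}

// Made-up events, for illustration only.
const events = [
  { actor: "openmrs-bot", type: "PushEvent" },
  { actor: "some-contributor", type: "PushEvent" },
  { actor: "dependabot[bot]", type: "PullRequestEvent" },
];

console.log(excludeBots(events));
```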