The major issue I see right now is with the automated scripts: Right now they exclude all esm repos so a substantial amount of organization contribution (i.e. to anything related to 3.x) would not be included in the metrics.
@burke could we update your query to include something like “OR repo name includes esm-” to fix this?
100% agreed with @saurabh above; actually that is critical otherwise the metrics will be very wonky (the openmrs bot is quite a productive little guy!)
@burke or @dkayiwa - how would we update this query to filter out openmrs bot and include the openmrs/esm- repos?
SELECT *
FROM `githubarchive.year.*`
WHERE _TABLE_SUFFIX BETWEEN '2020' AND '2020'
AND (repo.name like '%openmrs%' OR repo.name like '%esm%' OR org.login='openmrs')
Unfortunately, “ esm” (which stands for ECMAScript, the standard behind JavaScript) is widely used within repository names outside of OpenMRS. If you search github for repos names “esm”, you’ll quickly notice the vast majority have nothing to do with OpenMRS.
I think most of the community esm repos start with “openmrs”, so would be included. We could try to get everyone to tag repos with “OpenMRS”, but that’s not sustainable and repo tags aren’t included in the GitHub archive tables we’re using. Probably the most robust approach to catch frontend repos in other orgs or personal accounts that don’t have “openmrs” in their name would be to find repose with “openmrs” as a dependency in a packages.json… if that’s something we can search. I’ll have to check.
FYI - we filter out several bots (I think PIH has one too) in the stats.
Unfortunately, I haven’t found a way to include repositories that don’t have “openmrs” in their name (e.g., implementation-specific esm repositories, which have adopted a naming convention without “openmrs” in the name).
I thought we might be able to select all repositories that contain a packages.json with a dependency on "openmrs" (which would identify all OpenMRS esms). But the github archive doesn’t the actual content of commits, only references to the commits. So, these details can’t be used in the BigQuery SQL statement.
I thought an alternative (less automatic, but better than nothing) approach might be to include all repos with an “openmrs” topic (e.g., we try to convince everyone to add “openmrs” as a topic on their OpenMRS repositories), but it doesn’t look like topic changes generate events, so even this detail isn’t available to us for BigQuery queries.
It looks like we’re limited to repository name, organization, and user from which to select target repositories for our stats. Both organizations and users have non-OpenMRS activity in GitHub that we wouldn’t want to include, which is why I’m using “all repositories that are either under the OpenMRS org or have ‘openmrs’ in their name.”
Other than manually curating a list of repositories to add to the list (which isn’t sustainable), the only other options I can see are:
Convince the frontend squad to adopt repo naming conventions that always contain “openmrs” (but I know they like shorter names)
Come up with some alternative way (outside of github archive) to find all repositories that meet specific criteria (like “openmrs” as a dependency in a package.json file) and push this list of repository names into a BigQuery table that we could join with our github archive search. GitHub’s advanced search allows you to search for files by name, but I don’t see an option to search for files of a specific name with specific contents.
Unfortunately, the GitHub API has constraints on search queries and the same search to the API returns a validation error “Must include at least one user, organization, or repository”. So, we’d need to hack something to invoke the web interface, run through all pages, and grep the repository names. For example,
// List of repositories on current page
Array.from(document.getElementsByClassName("Link--secondary"))
.map( e => e.innerText ).filter( x => x );
// URL to next page
document.getElementsByClassName("next_page")[0].href;
I’ll play around with it. I don’t think I can automate collecting repository names this way and pushing them to a table in BigQuery. But I can manually collect the names and manually make a list of repos in BigQuery using this technique to see if we can expand our stats to ESMs without “openmrs” in their repo name.
@burke I’m so sorry, I had no idea that this would take you down such a time sink.
Please do not include those 57 repos. 95% of those 57 repos do not look like actual “community contributions” so I really wouldn’t include most of these (it would be misleading - most look like distro-specific or personal forks).
What I should have done is just given you a short list of which ones to confirm are captured. Please just make sure the following repos are included (as I can at least confirm that these are all “Shared Assets”):
All OpenMRS Repos (this would cover all the relevant esms), OR just ensure all the esms listed here are covered
Sure, we can filter down to specific “community” repos in analysis. When searching the GitHub archive, it’s better to cast a wide net, since it lets us find the activity that we don’t know that we don’t know. For example, who knew that rshu, joelss-tech, and savicsorg were forking more OpenMRS repositories than anyone else in 2021?
Our community stats have always looked at the big picture and it’s important that we keep doing so.
It gives us a broad sense of who is working on OpenMRS as a whole - whether they are contributing to OpenMRS repos or their own implementation’s repo. This is about celebrating all of the code contributions that OpenMRS developers are making worldwide to improve core OpenMRS products, distributions, or individual implementations.
We can also look at who is contributing back to core repos, how that changes over time, and use that data to guide us on our efforts to increase contributions back to the core, but let’s not focus only on that single level of contribution.
@jennifer, let me know here or in Slack if there are specific metrics you need. If you run openmrs-contrib-metrics yourself, until I get around to getting a “NOT: bot” filter to load by default, you’ll want to create your own by creating a filter actor.keyword “is not one of” with the list of known bot accounts. Once created, you can pin this filter globally to filter bots out from all metrics.