Rebuilding the entire RubyGems in Copr
We took all 166 699 packages from RubyGems.org and rebuilt them in Copr. Let’s explore the results together.
Success rate
Of the 166 699 Gems hosted on RubyGems.org, 98 816 were successfully built in Copr for Fedora Rawhide. That makes a 59.3% success rate. For the rest, it is important to distinguish in which build phase they failed. Out of 67 883 failures, 62 717 happened while converting their Gemfile into a spec file and only 5 166 while building the actual RPM packages. In other words, if a Gem can be properly converted into a spec file, there is a 95% probability that it will be successfully built into an RPM.
By far the majority of failures were caused by a missing license field in the Gems themselves. There is likely nothing wrong with them, and technically they could be built without any issues, but we simply don't have the legal right to do so. Therefore, such builds were aborted before even downloading the sources. This affected 62 049 packages.
More stats
All Gems were rebuilt within the `@rubygems/rubygems` Copr project for `fedora-rawhide-x86_64` and `fedora-rawhide-i386`.
We submitted all builds at once, starting on Sep 11, 2021, and the whole rebuild was finished on Oct 17, 2021. It took Copr a little over a month, and within that time, the number of pending builds peaked at 129 515.
The number of running builds doesn't represent 24 468 `running` builds at once but rather the number of builds that entered the `running` state on that day. It doesn't represent Copr throughput accurately though, as we worked on eliminating performance issues along the way. A similar mass rebuild should take a fraction of the time now.
The resulting RPM packages take up 55GB per chroot, therefore 110GB in total. As a byproduct, 640MB of SRPM packages were created.
The repository metadata is 130MB, and it takes DNF around 5 minutes on my laptop (Lenovo X1 Carbon) to enable the repository and install a package from it for the first time (because it needs to create a cache). Subsequent installations from the repository are instant.
In perspective
To see whether those numbers are significant or interesting, I think we need to compare them with other repositories.
| #                  | @rubygems/rubygems | Fedora Rawhide (F36) | EPEL8    |
|--------------------|--------------------|----------------------|----------|
| Number of packages | 98 816             | 34 062               | 4 806    |
| Size per chroot    | 55GB               | 83GB                 | 6.7GB    |
| Metadata size      | 130MB              | 61MB                 | 11MB     |
| `dnf makecache`    | ~5 minutes         | ~22 seconds          | 1 second |
Motivation
What was the point of this experiment anyway?
The goal was to rebuild all packages from a third-party hosting service that is specific to some programming language. There was no particular reason why we chose RubyGems.org among other options.
We hoped to pioneer this area, figure out the pain points, and make it easier for others to mass-rebuild something that might be helpful to them. While doing so, we had the opportunity to improve the Copr service and test the performance of the whole RPM toolchain against large repositories.
There are reasons to avoid installing packages directly via `gem`, `pip`, etc., but that's a whole other discussion. Let me just reference a brief answer from Stack Overflow.
Internals
Surprisingly enough, the mass rebuild itself wasn't that challenging. The real work manifested itself in its consequences (an unfair queue, slow `createrepo_c`, timeouts everywhere). Rebuilding the whole of RubyGems.org was as easy as:
- Figuring out a way to convert a Gemfile into a spec file. Thank you, gem2rpm!
- Figuring out how to submit a single Gem into Copr. In this case, we have built-in support for gem2rpm (see the documentation), therefore it was as easy as `copr-cli buildgem ...`. Similarly, we have built-in support for PyPI. For anything else, you would have to utilize the Custom source method (at least until support for such a tool/service is built into Copr directly).
- Iterating over the whole RubyGems.org repository and submitting gems one by one; a simple script is more than sufficient (see the sketch after this list), but we utilized copr-rebuild-tools that I wrote many years ago.
- Setting up automatic rebuilds of new Gems. The release-monitoring.org service (aka Anitya) is perfect for that. We check for new RubyGems.org updates every hour, and it would be trivial to add support for any other backend. Thanks to Anitya, the repository will always provide the most up-to-date packages.
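For illustration, a loop along these lines would be enough to submit everything. This is only a minimal sketch: the `--gem` and `--nowait` options and the one-second pause are my assumptions, and copr-rebuild-tools does the same job with proper error handling.

```bash
#!/bin/bash
# Minimal sketch: submit every Gem from RubyGems.org as a Copr build.
# The project name matches the one above; the copr-cli options used here
# (--gem, --nowait) are assumptions, check `copr-cli buildgem --help`.
for gem in $(gem search --remote | cut -d " " -f1); do
    copr-cli buildgem --gem "$gem" --nowait @rubygems/rubygems
    sleep 1  # be gentle to the build queue
done
```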
Takeaway for RubyGems
If you maintain any Gems, please make sure that you have properly set their license. If you develop or maintain any piece of software, for that matter, please make sure it is publicly known under which license it is available.
Contrary to common belief, unlicensed software, even though publicly available on GitHub or RubyGems, is in fact protected by copyright and therefore cannot be legally used (because a license is needed to grant usage rights). As such, unlicensed software is neither Free software nor open source, even though technically it can be downloaded and installed by anyone.
If I could send one wishful message to the RubyGems.org maintainers, it would be this: please consider placing a higher significance on licensing and make the license field required instead of recommended.
For reference, here is a list of all 65 206 unlicensed Gems, generated by the following script (on Nov 14 2021): https://gist.github.com/FrostyX/e324c667c97ff80d7f145f5c2c936f27#file-rubygems-unlicensed-list
```bash
#!/bin/bash
# Print the name of every Gem on RubyGems.org whose license field is not set
for gem in $(gem search --remote | cut -d " " -f1); do
    url="https://rubygems.org/api/v1/gems/$gem.json"
    metadata=$(curl -s "$url")
    if ! echo "$metadata" | jq -e '.licenses | select(type == "array" and length > 0)' \
            >/dev/null; then
        echo "$metadata" | jq -r '.name'
    fi
done
```
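To reproduce the count quoted above, the script's output can simply be piped through `wc`; the file name here is just for illustration.

```bash
# Count the unlicensed Gems reported by the script above
bash rubygems-unlicensed.sh | wc -l
```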
There are also 3 157 packages that don't have their license field set on RubyGems.org, but we were able to parse their license from the sources: https://gist.github.com/FrostyX/e324c667c97ff80d7f145f5c2c936f27#file-rubygems-license-only-in-sources-list
Takeaway for DNF
It turns out DNF handles large repositories without any major difficulties. The only inconvenience is how long it takes to create its cache. To reproduce, enable the repository:
dnf copr enable @rubygems/rubygems
Then create the cache from scratch. It will take a while (around 5 minutes for this single repository on my machine):
dnf makecache
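If the repository metadata has already been downloaded, you can drop it first so that `dnf makecache` really starts from a cold cache; note that `dnf clean metadata` affects all enabled repositories, not just this one.

```bash
# Drop previously downloaded metadata (for all enabled repos) and time
# a cold cache creation
dnf clean metadata
time dnf makecache
```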
I am not that familiar with DNF internals, so I don't really know whether this is something that can be fixed, but it would certainly be worth exploring whether any performance improvements are possible.
Takeaway for createrepo_c
We cooperated with `createrepo_c` developers on multiple performance improvements in the past, and these days `createrepo_c` works perfectly for large repositories. There is nothing crucial left to do, so I would like to briefly describe how to utilize the `createrepo_c` optimization features instead.
The first `createrepo_c` run for a large repo will always be slow, so just get over it. Use the `--workers` parameter to specify how many threads should be spawned for reading RPMs. While this brings a significant speedup (and cuts the time in half), the problem is that even listing a large directory is too expensive; it takes tens of minutes.
Specify the `--pkglist` parameter to let `createrepo_c` generate a new file containing the list of all packages in the repository. It will help speed up the consecutive `createrepo_c` runs. For them, also specify `--update`, `--recycle-pkglist`, and `--skip-stat`. The repository regeneration will then take only a couple of seconds (437451f).
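Put together, the invocation could look roughly like this. Treat it as a sketch only: the repository path, the pkglist file name, and the worker count are placeholders, and the exact option semantics are documented in `createrepo_c --help`.

```bash
# First (slow) run: read all RPMs with several worker threads and record
# the package list for later runs
createrepo_c --workers 8 --pkglist pkglist.txt /path/to/repo

# Consecutive runs: reuse the recorded package list, skip stat() calls,
# and only update metadata for packages that changed
createrepo_c --workers 8 --pkglist pkglist.txt \
    --update --recycle-pkglist --skip-stat /path/to/repo
```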
Takeaway for appstream-builder
On the other hand, `appstream-builder` takes more than 20 minutes to finish, and we didn't find any way to make it run faster. As a (hopefully) temporary solution, we added a possibility to disable AppStream metadata generation for a given project (PR#742), and recommend owners of large projects to do so.
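For owners of similarly large projects, the switch should also be reachable from the command line; the `--appstream` option of `copr-cli modify` is an assumption on my part, so double-check it against your copr-cli version.

```bash
# Turn off AppStream metadata generation for a large project
# (the --appstream option is an assumption, verify with `copr-cli modify --help`)
copr-cli modify --appstream off @rubygems/rubygems
```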
From the long-term perspective, it may be worth checking out whether there are some possibilities to improve the `appstream-builder` performance. If you are interested, see upstream issue #301.
Takeaway for Copr
The month of September turned into one big stress test, causing Copr to be temporarily incapacitated but helping us provide a better service in the long term. Because we had never had such a big project in the past, we experienced and fixed several issues in the UX and data handling on the frontend and backend. Here are some of them:
- Due to periodically logging all pending builds, the apache log skyrocketed to 20GB and consumed all available disk space (PR#1916).
- Timeouts when updating project settings (PR#1968).
- Unfair repository locking caused some builds to take an unjustifiably long time to finish (PR#1927).
- We used to delegate pagination to the client to provide a better user experience (and honestly, to avoid implementing it ourselves). This made listing builds and packages in a large project either take a long time or time out. We switched to backend pagination for projects with more than 10 000 builds/packages (PR#1908).
- People used to scrape the monitor page of their projects, but that isn't an option anymore due to the more conservative pagination implementation. Therefore we added proper support for the project monitor into the API and `copr-cli` (PR#1953); see the example after this list.
- The API call for obtaining all project builds was too slow for large projects. In the case of the `@rubygems/rubygems` project, we managed to reduce the required time from around 42 minutes to 13 minutes (PR#1930).
- The `copr-cli` command for listing all project packages was too slow and didn't continuously print the output. In the case of the `@rubygems/rubygems` project, we reduced its time from around 40 minutes to 35 seconds (PR#1914).
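With that support in place, getting the state of every package in a project is a single command. The exact subcommand syntax below is my assumption based on the PR#1953 description, so check `copr-cli --help` for your version.

```bash
# Show the build state of all packages in the project
copr-cli monitor @rubygems/rubygems
```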
Let’s build more
To achieve such a mass rebuild, no special permissions, proprietary tools, or other special requirements were necessary. Any user could have done it. In fact, some of them already did:
- iucar/cran
- @python/python3.10
- A PyPI rebuild is being worked on by Karolina Surma
But don’t be fooled, Copr can handle more. Will somebody try Npm, Packagist, Hackage, CPAN, ELPA, etc? Let us know.
I would suggest starting with Copr Mass Rebuilds documentation.