Using Software Heritage for technical writing

There’s this nice service called Software Heritage, a large software archive intended to be used as a reference for software in writings (e.g., research papers, blog posts, technical documents). We’ll be showing how to take advantage of it for technical writing as it provides the following benefits.

It offers a centralized and universal way of identifying and referring to software, similar to digital object identifiers (DOIs).
It consolidates sources into one centralized archive, reducing the need to search across and manage different forges such as GitHub, GitLab, Bitbucket, and SourceHut.
It provides long-term preservation, which mitigates problems such as vanishing upstreams and sunsetting services. For example, if you refer to NixOS/nixpkgs@nixos-22.11 and GitHub ever goes down, or the Nix community decides to move to another Git forge, the reference is unaffected: once the software has been archived in Software Heritage, it can be referred to for the rest of time.
It offers granularity in what part of the software technical writers can refer to: from the whole project, to a certain point in history, to certain files and directories, and all the way down to lines of code.

How Software Heritage works

Software Heritage actively archives software from several sources such as…

Software forges like GitHub, GitLab instances, and even Gitea instances.
Linux distribution package archives such as those of Debian, Nix, and Guix.
Several software indices such as PyPI, crates.io, and npm.

The service will periodically capture a snapshot of the same software project (which we have access to, among other things as we’ll see later in the post).

Furthermore, it can save source code that is managed by different version control software such as Git, Mercurial, and Subversion. Software Heritage also offers its services through an API over HTTPS, which is nice if you want to create some neat little scripts or integrate it into your software. Most of the functionality is already available on its website, which I cover in a later section. But first, we’ll have to learn an important concept with Software Heritage: its identifier system to access the archive in the first place.
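To give a feel for the "neat little scripts" part, here is a minimal Ruby sketch that looks up an origin through the HTTPS API. It assumes the public `/api/1/origin/<origin URL>/get/` endpoint; the helper names (`origin_lookup_uri`, `fetch_origin`) are made up for this post, and I’m not assuming anything about which fields the returned JSON contains.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Root of the public Web API (assumption: version 1 of the API).
SWH_API_ROOT = 'https://archive.softwareheritage.org/api/1'

# Builds the lookup URI for an origin such as a repository URL.
# Hypothetical helper, not part of any Software Heritage library.
def origin_lookup_uri(origin_url)
  URI("#{SWH_API_ROOT}/origin/#{origin_url}/get/")
end

# Performs the request and returns the parsed JSON document.
# Needs network access, and the API is rate limited.
def fetch_origin(origin_url)
  response = Net::HTTP.get_response(origin_lookup_uri(origin_url))
  raise "unexpected status #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  JSON.parse(response.body)
end

# Example (requires network access):
#   pp fetch_origin('https://github.com/torvalds/linux')
```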

Dialog on archive statistics

As of 2023-05-11, Software Heritage contains at least 230 million projects with 3.2 billion commits. It is only expected to go up throughout the years. GitHub accounts for the majority of the sources: the archive has 175 million projects from it.

Does that include forks and everything?

Also, that’s a bit scary to think about GitHub having 75% of the archive. What about the second largest?

I’m fairly sure it does include forks in the archive, which makes the actual number of distinct projects a lot smaller, but it is still an impressive count considering the wide coverage of sources they monitor. They only archive public repos, or repos whose users opt in to the archival integration, such as on GitHub.

As for the second largest source, it seems to come from GitLab instances totalling… 4 million projects. That’s at least 1%.

Well, that’s at least… unfortunate.

Its identifier system

The main intention of the project is to provide a centralized archive for identifying and referencing software. The primary way of using such a service is through an identifier system, like DOI for digital objects, ISBN for books, and ISWC for music. Software Heritage uses its own identifier system called SoftWare Heritage persistent IDentifier, or SWHID for short. The identifier follows a certain format.

Parts of SWHID

The following examples should be enough to show what they look like.

Table 1. Examples of SWHIDs
SWHID	Description
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2	GPLv3 document.
swh:1:dir:101a60787ec70986789c64d2379be174ed73e2e5	`maintainers/` directory from nixpkgs.
swh:1:rev:db1e4eeb0f9a9028bcb920e00abbc1409dd3ef36	22.11 branch of nixpkgs.
swh:1:rel:8763b71ed3a51974c61edb7781832a50b176f966	GNOME Shell v3.38.6 release.
swh:1:snp:fc3c21b5f61d1e283ba9ec52f632c372675eaebc	A gnome-shell snapshot.

As you can tell from the table, SWHIDs also offer some control over the granularity of what part of the software we want to refer to: individual files and directories, a certain point in the history of the project, or a certain point in time of capture.
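To make the format concrete, here is a small Ruby sketch that takes apart a core identifier like the ones in the table. The regular expression only covers what the table shows (scheme version 1, the five type tags, and a 40-character hex hash); `parse_core_swhid` and the constant names are made up for illustration.

```ruby
# Core identifier: swh:<version>:<type tag>:<40 hex digits>
SWHID_CORE = /\Aswh:(?<version>1):(?<type>cnt|dir|rev|rel|snp):(?<hash>[0-9a-f]{40})\z/.freeze

# Human-readable names for the type tags seen in the table.
OBJECT_TYPES = {
  'cnt' => 'content',
  'dir' => 'directory',
  'rev' => 'revision',
  'rel' => 'release',
  'snp' => 'snapshot'
}.freeze

# Returns a hash describing the identifier, or nil if it is malformed.
def parse_core_swhid(swhid)
  match = SWHID_CORE.match(swhid)
  return nil unless match

  {
    version: match[:version].to_i,
    type: OBJECT_TYPES.fetch(match[:type]),
    hash: match[:hash]
  }
end

p parse_core_swhid('swh:1:dir:101a60787ec70986789c64d2379be174ed73e2e5')
```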

What about pointing to specific lines of code? Didn’t the preamble mention something like "granularity down to the lines of code"?

You’ll see it later.

An interesting property of SWHIDs is that they are intrinsic identifiers: the identifier is derived from the object itself. Unlike DOIs and ISBNs, where identifiers are assigned to objects by a central authority, SWHIDs are computed from the object’s content. This means SWHIDs are deterministic, so given an object we can always recompute its identifier and look it up. In fact, you can compute them for objects locally on your machine.
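Here is a minimal sketch of that local computation for a content object. It relies on the content identifier being the SHA-1 of the file’s bytes prefixed with a Git-style `blob <size>\0` header, which is how content objects are hashed; `content_swhid` is just a name I picked for this post.

```ruby
require 'digest/sha1'

# Computes the SWHID of a content object from its raw bytes.
# The hashed payload is "blob <size>\0" followed by the bytes themselves.
def content_swhid(bytes)
  data = bytes.b
  header = "blob #{data.bytesize}\0".b
  "swh:1:cnt:#{Digest::SHA1.hexdigest(header + data)}"
end

puts content_swhid("hello world\n")
# To identify a file on disk:
#   puts content_swhid(File.binread('COPYING'))
```

Since nothing here talks to the archive, the same bytes always give the same identifier no matter where the file came from.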

Due to the intrinsic nature of SWHIDs and the ability to refer to various parts of a piece of software, we’re also slowly unraveling the fact that the Software Heritage archive itself is essentially a gigantic Merkle tree containing several kinds of objects. Let’s go back to the previous table of SWHIDs and see what those are.

A content object contains the content of a file.
A directory object contains other directory objects and content objects.
A revision object is a point in time of the development history of the project. It also points to the root directory of the project.
A release object is essentially the same as a revision object but with additional metadata. In practice, this is typically the revision developers tagged for release (e.g., KDE Plasma 5.23, GNOME 42, Linux kernel 6.3).
A snapshot object contains the whole source code including all visible branches at that point in time.

Similarities with Git

Wait… this sounds similar to the Git internals.

That’s because it uses a data model similar to Git’s, with the graph of objects and even the object identifier being a hex-encoded SHA-1 hash.

Does this mean it is compatible with Git then?

Yes, but it is more coincidental than anything. This becomes especially clear once you notice the service supports importing software from version control software other than Git. Just don’t expect that to work every time.

SWHID qualifiers

While the previously shown SWHIDs are enough and work as intended, the identifier alone lacks some information. From the identifier, one cannot easily infer certain information that we often need, such as the URL of the repository and the path relative to the repository. This is also reflected in the website interface if you’ve visited the links: it strictly presents the software artifact (e.g., content, directory, revision) and nothing more.

Let’s take swh:1:dir:101a60787ec70986789c64d2379be174ed73e2e5 as an example where we just see a directory and nothing else.
Or let’s take swh:1:rev:db1e4eeb0f9a9028bcb920e00abbc1409dd3ef36, where we visit a revision of the project but cannot see whether it came from the canonical repository.
As yet another example, let’s take swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2, where the exact same content object can appear in any GPLv3-licensed project. ^[1] Not to mention we can’t tell where the license file is located in the repository, let alone which repository.

This is because the data model of the archive is a gigantic Merkle tree where objects may be shared among multiple projects. This makes certain tasks tedious, such as identifying whether an artifact belongs to the canonical repository or to one of its many forks, which are also included in the archive.

Because of this, SWHIDs may also have a semicolon-delimited (;) list of qualifiers that adds contextual information.

SWHID with qualifiers

Each qualifier means a different thing, all of which is documented nicely on its website. Let’s take the previous table and add more contextual information to it.

Table 2. Previous SWHIDs with contextual information
SWHID	Description
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;origin=https://github.com/pop-os/system76-firmware;lines=471-538	Section 11 of the GPLv3 license from system76-firmware.
swh:1:dir:101a60787ec70986789c64d2379be174ed73e2e5;origin=https://github.com/NixOS/nixpkgs;visit=swh:1:snp:857ce072b5dbf50f1ae55d8233cb321dd42b5992;anchor=swh:1:rev:db1e4eeb0f9a9028bcb920e00abbc1409dd3ef36;path=/maintainers/	`maintainers/` directory from the nixpkgs-22.11 branch from the canonical nixpkgs repository.
swh:1:rev:db1e4eeb0f9a9028bcb920e00abbc1409dd3ef36;origin=https://github.com/NixOS/nixpkgs;visit=swh:1:snp:857ce072b5dbf50f1ae55d8233cb321dd42b5992	A certain revision from the nixpkgs-22.11 branch from the canonical nixpkgs repository.
swh:1:rel:8763b71ed3a51974c61edb7781832a50b176f966;origin=https://gitlab.gnome.org/GNOME/gnome-shell;visit=swh:1:snp:54081c29aa31e4a626a06b70e2a8571fad83e092	The canonical GNOME Shell v3.38.6 release.
swh:1:snp:fc3c21b5f61d1e283ba9ec52f632c372675eaebc;origin=https://gitlab.gnome.org/GNOME/gnome-shell	A snapshot of the canonical gnome-shell repository captured on January 4th of 2023.

If you click each of the links, the website interface is more complete compared to the previous table of SWHIDs.
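Since the qualifiers are just a semicolon-delimited list of `key=value` pairs after the core identifier, splitting them apart takes only a few lines of Ruby. This is a rough sketch: `split_swhid` is a hypothetical helper, it does no validation, and it assumes values never contain a literal `;`.

```ruby
# Splits a SWHID into its core identifier and a hash of qualifiers.
def split_swhid(swhid)
  core, *qualifiers = swhid.split(';')
  pairs = qualifiers.map do |qualifier|
    # Split on the first "=" only, since origin URLs may contain "=" too.
    key, value = qualifier.split('=', 2)
    [key, value.to_s]
  end
  { core: core, qualifiers: pairs.to_h }
end

parts = split_swhid(
  'swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;' \
  'origin=https://github.com/pop-os/system76-firmware;lines=471-538'
)
puts parts[:core]
parts[:qualifiers].each { |key, value| puts "#{key}: #{value}" }
```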

Side-to-side comparison of the website interface for swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 without and with qualifiers

This practice of adding contextual information is recommended, as documented in its FAQ. More specifically, the contextual information should be as complete as possible. You can easily get the identifier with all relevant qualifiers from the archive website interface, which we’ll cover next. You can see more recommendations in Guidelines for referencing SWHIDs.

So what happens when I give a qualifier a wrong value, such as an origin qualifier that points to a non-existent origin or an anchor qualifier that points to an invalid SWHID?

Why don’t you try those out yourself? Here’s a list of them just for starters.

You could also mix and match qualifiers that are not supposed to appear in certain object types such as the lines qualifier in non-content objects.

Using the Software Heritage archive website

Throughout the Software Heritage ecosystem, there are tools that make use of the service. Its main interface, the archive website, is what you’re likely to use the most. The workflow on the website is pretty simple: you search for the origin of the software, enter the corresponding object, and specify what you want to refer to. ^[2]

The most important thing to note about this website is that it works as a resolver for SWHIDs, much like the resolvers used with DOIs. You’ve already seen this in action with the links from the previous tables, Examples of SWHIDs and Previous SWHIDs with contextual information. Using it as a resolver is simple: just append the identifier to the root endpoint of the service.

https://archive.softwareheritage.org/$SWHID

On the user-facing side of the website, what you’ll see first is a search interface. Note that the search results are far from perfect, and barely usable if you’re not aware of the quirks of its search engine. For example, merely entering the name of the software is typically not enough.

The search result for the query "linux kernel"

Even searching with metadata doesn’t help.

The search result for the query "linux kernel" with metadata

It’s pretty obvious that the results aren’t good enough. Instead, I recommend entering the origin URL you’re searching for (e.g., https://github.com/torvalds/linux). If there is an exact match for the given origin, the website goes directly to the page of the software artifact with that origin. This is especially nice for the sources it already monitors such as GitHub, GitLab instances, and Gitea instances. This even works for package indices such as PyPI and npm (e.g., https://pypi.org/project/swh.core, https://www.npmjs.com/package/vue). For more details, there is a dedicated page on which sources are being monitored, from which you can infer what URLs can be resolved this way.

Once you get into the software artifact of your choosing (e.g., directory, file, revision, snapshot), you can get the identifier with the permalink tab on the side of the website.

Using the permalink tab on the website

Other Software Heritage tools

Other than the website, there are tools available to easily make use of the service. The ecosystem of Software Heritage is somewhat limited as Software Heritage itself is relatively young, but it does have nice tools to begin with. Let’s take a closer look at them.

swh identify is a command-line interface that prints the SWHID of the given objects. Since SWHIDs are computed from the objects themselves, this can be done locally, which is nice if you have the codebase on disk and want to refer to it through the archive.
A nice way to explore the archive is with Software Heritage Filesystem (SwhFS) which comes with a command-line interface (swh fs). This tool alongside swh identify is one way to explore the archive entirely on the terminal.
A web client for SWH in Python which is nice if you’re using Python in the first place.
Some SWH-related browser extensions. ^[3] Among them is UpdateSWH, which checks whether a repository is archived and can add it to the archival queue, all in a simple interface.
For those who are writing with LaTeX, there is a package for adding software entry types in BibLaTeX.

Furthermore, there are initiatives to integrate it with projects such as with Guix and peer-to-peer access with IPFS.

SWH tools wishlist

As the ecosystem around Software Heritage is young, there are some tools and services that could use and integrate with the service. The following list is what I would like to see.

More integration with software forges. Though this could be implemented with browser extensions, it would be nicer if forges such as GitHub and Gitea integrated the service themselves, even if only through extensions. GitHub already has some foundations for this feature as it has citation support.
Zotero integration with the service. You could go into the archive and quickly get the reference just as you would on arXiv.

Appendix A: Guidelines for referencing SWHIDs

While using SWHIDs is a done-and-forget procedure (for the most part), there is a set of guidelines to make usage of them a bit easier.

Per the documentation, it is recommended to use swh:1:dir: SWHIDs over swh:1:rev: or swh:1:rel: since swh:1:dir: can be computed without relying on the Software Heritage archive. The revision and release identifiers are mostly used as part of the contextual metadata, as in the examples from Previous SWHIDs with contextual information.
As already mentioned, SWHIDs with full contextual qualifiers are recommended. This should be easy to retrieve considering the website interface gets them for you as seen from this video.
If you want to create a hyperlink, it is advisable to make the core identifier as the link text to address the obvious problem of length making it harder to read (case in point, in this table). For a proper example of a hyperlink, here is one with swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2.
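To illustrate the first guideline, here is a rough Ruby sketch of computing a directory identifier from a source tree on disk. It assumes directory objects are hashed like Git trees (entries sorted by name, each written as `<mode> <name>\0` followed by the raw 20-byte hash of the entry), which is consistent with the Git similarity mentioned earlier. All helper names are mine, and it ignores details such as exclusion patterns and special files, so treat swh identify as the reference.

```ruby
require 'digest/sha1'

# Raw SHA-1 of a Git-style object: "<type> <size>\0<payload>".
def object_digest(type, payload)
  Digest::SHA1.digest("#{type} #{payload.bytesize}\0".b + payload.b)
end

# Returns [mode, raw digest] for a single directory entry.
def entry_for(path)
  stat = File.lstat(path)
  if stat.symlink?
    # A symlink is hashed as a blob holding its target path.
    ['120000', object_digest('blob', File.readlink(path))]
  elsif stat.directory?
    ['40000', directory_digest(path)]
  else
    mode = (stat.mode & 0o100).zero? ? '100644' : '100755'
    [mode, object_digest('blob', File.binread(path))]
  end
end

# Raw digest of a directory, computed recursively.
def directory_digest(dir)
  entries = Dir.children(dir).map do |name|
    mode, digest = entry_for(File.join(dir, name))
    [name.b, mode, digest]
  end
  # Entries are ordered bytewise, with directory names compared
  # as if they ended in "/".
  entries.sort_by! { |name, mode, _| mode == '40000' ? name + '/'.b : name }
  payload = entries.map { |name, mode, digest| "#{mode} #{name}\0".b + digest }.join
  object_digest('tree', payload)
end

def directory_swhid(dir)
  "swh:1:dir:#{directory_digest(dir).unpack1('H*')}"
end

# Example:
#   puts directory_swhid('path/to/source/tree')
```

Note that Git itself never tracks empty directories, while this sketch hashes whatever is on disk, so results can differ from a Git checkout in that corner case.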

Appendix B: Extending Asciidoctor for linking SWHIDs

Linking SWHIDs can be tedious when writing documents. Asciidoctor has features that make this easier. Specifically, we’re talking about storing the identifiers in document attributes.

Using attributes for storing and linking SWHIDs

:swh-system76-firmware-license-core-identifier: swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
:swh-system76-firmware-license: swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;origin=https://github.com/pop-os/system76-firmware;lines=471-538

link:https://archive.softwareheritage.org/{swh-system76-firmware-license}[{swh-system76-firmware-license-core-identifier}]

In my opinion, this is still tedious since we have to store two attributes that need separate changes where there should be only one. Fortunately, Asciidoctor can be extended to introduce new syntax, as I’ve shown in a previous post. We can apply a similar solution here.

For our initial version of the new syntax, it looks like the following.

swh:$SWHID[$CAPTION]

It is an inline macro that accepts an SWHID and, optionally, a caption to use as the link text. If the caption is omitted, the core identifier is used as the link text. The following listing shows the complete list of use cases we considered for this macro.

sample.adoc

// Should produce a link at https://archive.softwareheritage.org/$SWHID with
// '$SWHID_CORE_IDENTIFIER' as the link text.
swh:swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;origin=https://github.com/pop-os/system76-firmware;lines=471-538[]

// Same as above but with the link text replaced with 'replacing the caption'.
swh:swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;origin=https://github.com/pop-os/system76-firmware;lines=471-538[replacing the caption]

// For aesthetic purposes, you could also use the `swh` macro with the `swh:`
// cut off from the SWHID.
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;origin=https://github.com/pop-os/system76-firmware;lines=471-538[]

The inline macro should produce a link target to the default SWHID resolver at https://archive.softwareheritage.org. Anyways, here’s the code for the swh Asciidoctor extension.

lib/asciidoctor/swhid-inline-macro/extension.rb

# frozen_string_literal: true

class SWHInlineMacro < Asciidoctor::Extensions::InlineMacroProcessor
  use_dsl

  named :swh
  name_positional_attributes 'caption'

  def process(parent, target, attrs)
    doc = parent.document

    # We're only considering `swh:` starting with the scheme version. Also, it
    # looks nice aesthetically.
    swhid = target.start_with?('swh:') ? target : %(swh:#{target})
    swhid_core_identifier = (swhid.split ';').at 0

    text = attrs['caption'] || swhid_core_identifier
    target = %(https://archive.softwareheritage.org/#{swhid})

    doc.register :links, target
    create_anchor parent, text, type: :link, target: target
  end
end
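If you want to sanity-check the prefix handling without running Asciidoctor, the same logic can be pulled out into a plain function. This is only a sketch mirroring the `process` method above; `swh_link_parts` is a hypothetical helper and not part of the extension.

```ruby
# Mirrors the target handling of the macro: restores a cut-off `swh:`
# prefix, derives the core identifier, and builds the resolver link.
def swh_link_parts(target, caption = nil)
  swhid = target.start_with?('swh:') ? target : "swh:#{target}"
  core_identifier = swhid.split(';').first

  {
    text: caption || core_identifier,
    href: "https://archive.softwareheritage.org/#{swhid}"
  }
end

qualified = '1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;' \
            'origin=https://github.com/pop-os/system76-firmware;lines=471-538'

# Both spellings of the macro target should give the same link.
p swh_link_parts(qualified)
p swh_link_parts("swh:#{qualified}")
p swh_link_parts(qualified, 'replacing the caption')
```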

You cannot make use of the extension as it is not registered within the Asciidoctor registry yet. Let’s make the file that does that.

lib/asciidoctor-custom-extensions.rb

# frozen_string_literal: true

require 'asciidoctor'
require 'asciidoctor/extensions'

# Path is relative to lib/ and must match where the extension file was saved.
require_relative './asciidoctor/swhid-inline-macro/extension'

Asciidoctor::Extensions.register do
  inline_macro SWHInlineMacro
end

Now with the extension in place, you can use it with Asciidoctor as in the following listing.

asciidoctor -r ./lib/asciidoctor-custom-extensions.rb sample.adoc

Voila! Now you have a nicer way of linking SWHIDs to the archive. This extension should be usable with all backends since it is a simple shorthand for linking SWHIDs to the archive.

1. You cannot modify the GPLv3 document itself since it is copyrighted, so all GPLv3-licensed projects should have the same license text and thus the same content object.

2. Y’know, identifying and referring to parts of software, as has already been hammered in multiple times by this point. :)

3. As of 2023-05-09, only one is made public so far.