Using Software Heritage for technical writing
There’s this nice service called Software Heritage, a large software archive intended to be used as a reference for software in writings (e.g., research papers, blog posts, technical documents). We’ll be showing how to take advantage of it for technical writing as it provides the following benefits.
-
It offers a centralized and universal way of identifying and referring to software, similar to digital object identifiers (DOIs).
-
It consolidates all of the sources into one centralized archive, reducing the need to search across and manage different forges such as GitHub, GitLab, Bitbucket, and Sourcehut.
-
It provides long-term preservation, which mitigates problems such as vanishing upstreams and sunsetting services. For example, if you refer to NixOS/nixpkgs@nixos-22.11 and GitHub ever goes down, or the Nix community decides to move to another Git forge, your reference is unaffected: once the software has been archived in Software Heritage, it can be referred to for the rest of time.
-
It offers granularity in what part of the software technical writers can refer to: from the whole project, to a certain point in its history, to certain files and directories, all the way down to lines of code.
How does Software Heritage work?
Software Heritage actively archives software from several sources such as…
-
Software forges like GitHub, GitLab instances, and even Gitea instances.
-
Distribution package archives such as those from Debian, Nix, and Guix.
-
Several software indices such as PyPI, crates.io, and npm.
The service will periodically capture a snapshot of the same software project (which we have access to, among other things as we’ll see later in the post).
One thing you have to keep in mind with this service is that the project developers don’t check the source code for any issues (e.g., quality, intent). Whatever is stored in the original source will be included as part of the archive.
Furthermore, it can save source code that is managed by different version control software such as Git, Mercurial, and Subversion. Software Heritage also offers its services through an API over HTTPS, which is nice if you want to create some neat little scripts or integrate it into your software. Most of the functionality is already available on its website, which is covered in a later section. But first, we’ll have to learn an important concept in Software Heritage: the identifier system used to access the archive in the first place.
Its identifier system
The main intention of the project is to provide a centralized archive for identifying and referencing software. The primary way of using such a service is through an identifier system, like DOI for digital objects, ISBN for books, and ISWC for musical works. Software Heritage uses its own identifier system called the SoftWare Heritage persistent IDentifier, or SWHID for short. The identifier follows a certain format.
The following examples should be enough to show what they look like.
SWHID | Description |
---|---|
GPLv3 document. |
|
|
|
22.11 branch of nixpkgs. |
|
GNOME Shell v3.38.6 release. |
|
A gnome-shell snapshot. |
As you can tell from the table, SWHIDs also offer some control over which parts of the software we want to refer to: individual files and directories, a certain point in the history of the project, or the state of the project at a certain time of capture.
The parts of software such as files, directories, and revisions are collectively referred to as software artifacts (or objects), as you’ll see in its documentation.
What about pointing to specific lines of code? Didn’t the preamble mention something like "granularity down to the lines of code"?
You’ll see it later.
An interesting property of SWHIDs is that they are intrinsic identifiers, meaning the identifier is derived from the object itself. Unlike DOIs and ISBNs, which are arbitrarily assigned to objects by a central authority, SWHIDs are computed from the object. This means SWHIDs are deterministic, and we can look up the identifier given only the object. In fact, they can be computed from objects locally on your machine.
Due to the intrinsic nature of SWHIDs and their ability to refer to various parts of a software project, we’re also slowly unraveling the fact that the Software Heritage archive itself is essentially a gigantic Merkle tree containing several kinds of objects. Let’s go back to the previous table of SWHIDs and see what those are.
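To make "computed locally" concrete, here is a minimal Ruby sketch for content objects. It assumes that a content SWHID is the same salted SHA-1 that Git uses for blobs (the header `blob <size>\0` followed by the raw bytes); the function name `content_swhid` is made up for this post and is not part of any Software Heritage tool.

```ruby
require 'digest'

# Compute the SWHID of a file's content without contacting the archive.
# Assumption: content identifiers are Git blob hashes, i.e. SHA-1 over
# "blob <size>\0" followed by the raw bytes of the file.
def content_swhid(bytes)
  data = bytes.b # treat the input as raw bytes, whatever its encoding
  header = "blob #{data.bytesize}\0".b
  "swh:1:cnt:#{Digest::SHA1.hexdigest(header + data)}"
end

puts content_swhid('')        # the empty file
puts content_swhid("hello\n") # the same bytes give the same ID on any machine
```

For a file on disk, something like `content_swhid(File.binread('COPYING'))` should do. Since the identifier only depends on the bytes, two people hashing the same file independently end up with the same SWHID.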
-
A content object contains the content of a file.
-
A directory object contains other directory objects and content objects.
-
A revision object is a point in time of the development history of the project. It also points to the root directory of the project.
-
A release object is essentially the same as a revision object but with additional metadata. In practice, this is typically the revision developers tagged for release (e.g., KDE Plasma 5.23, GNOME 42, Linux kernel 6.3).
-
A snapshot object contains the whole source code including all visible branches at that point in time.
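The Merkle structure can be sketched in a few lines of Ruby. This is only an illustration under an assumption: that directory identifiers are hashed the same way as Git tree objects (entries sorted by name, each serialized as `<mode> <name>\0` plus the raw 20-byte hash of the child). The helper names `object_hash` and `directory_swhid` are hypothetical.

```ruby
require 'digest'

# Git-style object hash: SHA-1 over "<type> <size>\0" followed by the payload.
def object_hash(type, payload)
  body = payload.b
  Digest::SHA1.hexdigest("#{type} #{body.bytesize}\0".b + body)
end

# entries: array of { name:, mode:, id: } hashes, where id is the 40-digit hex
# hash of the child object. Assumed modes: "100644" for a regular file and
# "40000" for a subdirectory.
def directory_swhid(entries)
  # Git sorts tree entries bytewise, treating directory names as if they
  # ended with a slash.
  sorted = entries.sort_by { |e| (e[:mode] == '40000' ? "#{e[:name]}/" : e[:name]).b }
  payload = sorted.map { |e| "#{e[:mode]} #{e[:name]}\0".b + [e[:id]].pack('H*') }.join
  "swh:1:dir:#{object_hash('tree', payload)}"
end

readme = object_hash('blob', "hello\n")
docs = directory_swhid([{ name: 'README', mode: '100644', id: readme }])
root = directory_swhid([{ name: 'docs', mode: '40000', id: docs.split(':').last }])
puts docs
puts root
```

Because each directory hash is built from the hashes of its children, changing a single byte in `README` changes the ID of `docs`, which in turn changes the ID of `root`. The same reasoning is why identical files and directories are stored once and shared between projects.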
SWHID qualifiers
While the previously shown SWHIDs are enough and work as intended, the identifier alone lacks some information. From the identifier, one cannot easily infer certain information that we often need, such as the URL of the repository and the path relative to the repository. This is also reflected in the website interface, if you’ve visited the links, where it strictly presents just the software artifact (e.g., content, directory, revision).
-
Let’s take swh:1:dir:101a60787ec70986789c64d2379be174ed73e2e5 as an example where we just see a directory and nothing else.
-
Or let’s take swh:1:rev:db1e4eeb0f9a9028bcb920e00abbc1409dd3ef36, where we visit a revision of the project but cannot see whether it came from the canonical repository.
-
With yet another example, let’s take swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2, where the exact same content object can appear in any GPLv3-licensed project. [1] Not to mention we can’t tell where the license file is located in the repository, let alone which repository it came from.
This is because the data model of the archive is a gigantic Merkle tree where objects may be shared among multiple projects. This makes certain tasks tedious, such as identifying whether an artifact belongs to the canonical repository or to one of its many forks, which are also included in the archive.
Because of this, SWHIDs may also have a semicolon-delimited (;) list of qualifiers that add contextual information.
Each qualifier means a different thing, all of which are documented nicely on its website. Let’s take the previous table and add more contextual information to it.
SWHID | Description |
---|---|
Section 11 of the GPLv3 license from system76-firmware. |
|
|
|
A certain revision from the nixpkgs-22.11 branch from the canonical nixpkgs repository. |
|
swh:1:snp:fc3c21b5f61d1e283ba9ec52f632c372675eaebc;origin=https://gitlab.gnome.org/GNOME/gnome-shell |
A snapshot of the canonical gnome-shell repository captured on January 4th of 2023. |
If you click each of the links, the website interface is more complete compared to the previous table of SWHIDs.
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 without and with qualifiers
This practice of adding contextual information is recommended, as documented in its FAQ. More specifically, the contextual information should be as complete as possible. You can easily get the identifier with all relevant qualifiers from the archive website interface, which we’ll cover next. You can see more of these recommendations in Guidelines for referencing SWHIDs.
While the links in the table use the complete identifier with all qualifiers as the link text, it is recommended to show only the core identifier as the link text. This is to address the obvious problem of length making it harder to read. For a proper example of a hyperlink, here is one with swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2.
So what happens when I give a qualifier with a wrong value, such as an origin qualifier that points to a non-existent origin or an anchor qualifier that points to an invalid SWHID?
Why don’t you try those out yourself? Here’s a list of them just for starters.
You could also mix and match qualifiers that are not supposed to appear in certain object types, such as the lines qualifier in non-content objects.
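To make the shape of a qualified SWHID concrete, here is a small Ruby sketch that splits one into its core identifier and qualifiers. The function name `parse_swhid` is hypothetical, and the checks are deliberately loose: it accepts any `key=value` qualifier and only rejects the one mismatch mentioned above (a lines qualifier on a non-content object), so don't treat it as a full validator.

```ruby
# Core identifier: swh:<scheme version>:<object type>:<40 hex digits>
SWHID_CORE = /\Aswh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})\z/

def parse_swhid(swhid)
  core, *pairs = swhid.split(';')
  match = SWHID_CORE.match(core.to_s)
  raise ArgumentError, "invalid core identifier: #{core.inspect}" unless match

  # Qualifiers are key=value pairs; split on the first "=" only so that
  # values such as URLs with query strings stay intact.
  qualifiers = pairs.to_h do |pair|
    key, value = pair.split('=', 2)
    raise ArgumentError, "malformed qualifier: #{pair.inspect}" if value.nil?

    [key, value]
  end

  # The lines qualifier only makes sense for content objects.
  if qualifiers.key?('lines') && match[1] != 'cnt'
    raise ArgumentError, 'the lines qualifier applies to content objects only'
  end

  { core: core, type: match[1], hash: match[2], qualifiers: qualifiers }
end

parsed = parse_swhid(
  'swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;' \
  'origin=https://github.com/pop-os/system76-firmware;lines=471-538'
)
puts parsed[:core]
puts parsed[:qualifiers]['origin']
```

Keeping the core identifier separate from the qualifiers is handy later on, since the core identifier is what you typically want as the link text.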
Using the Software Heritage archive website
Throughout the Software Heritage ecosystem, there are tools that make use of the service. Its main interface, the archive website, is what you’re likely to use the most. The workflow on the website is pretty simple: you search for the origin of the software, enter the corresponding object, and specify what you want to refer to. [2]
The most important thing to note about this website is that it doubles as a resolver for SWHIDs, similar to the resolvers used for DOIs. You’ve already seen this with the links from the previous tables, such as in Examples of SWHIDs and Previous SWHIDs with contextual information. Using it as a resolver is simple: just append the identifier to the root endpoint of the service.
https://archive.softwareheritage.org/$SWHID
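If you generate links from a script, the rule above boils down to a one-liner. Here is a minimal Ruby sketch; the helper name `resolver_url` and the `base:` parameter are made up for illustration, and the SWHID is appended as-is without any extra escaping.

```ruby
SWH_RESOLVER = 'https://archive.softwareheritage.org'

# Build a resolver link by appending the SWHID (with or without qualifiers)
# to the root endpoint of the archive website.
def resolver_url(swhid, base: SWH_RESOLVER)
  "#{base.chomp('/')}/#{swhid}"
end

puts resolver_url('swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2')
```

The `base:` parameter is only there in case you want to point the links at a different resolver.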
With the user-facing side of the website, what you’ll see first is a search interface. Take note that the search results are neither perfect nor particularly usable if you’re not aware of the quirks of its search engine. For example, merely entering the name of the software is typically not enough.
Even searching with metadata doesn’t help.
It’s pretty obvious that it doesn’t return enough quality results.
Instead, I recommend entering the origin URL that you’re searching for (e.g., https://github.com/torvalds/linux).
If there is an exact match for the given origin, the website will go directly to the page of the software artifact with that origin.
This is especially nice for the sources it already monitors such as GitHub, GitLab instances, and Gitea instances.
This even works for package indices such as PyPI and npm (e.g., https://pypi.org/project/swh.core, https://www.npmjs.com/package/vue).
For more details, there is a dedicated page on what sources are being monitored, from which you can infer what URLs can be resolved in this way.
Once you get to the software artifact of your choosing (e.g., directory, file, revision, snapshot), you can get the identifier from the permalink tab on the side of the website.
Other Software Heritage tools
Other than the website, there are tools available to easily make use of the service. The ecosystem of Software Heritage is somewhat limited as Software Heritage itself is relatively young, but it does have nice tools to begin with. Let’s take a closer look at them.
-
swh identify is a command-line interface that prints the SWHID of the given objects. SWHIDs are computed from the objects themselves, which can be done locally. This is nice if you have the codebase on disk and want to refer to it through the archive.
-
A nice way to explore the archive is with Software Heritage Filesystem (SwhFS), which comes with a command-line interface (swh fs). This tool alongside swh identify is one way to explore the archive entirely from the terminal.
-
A web client for SWH in Python which is nice if you’re using Python in the first place.
-
Some SWH-related browser extensions. [3] Among them is UpdateSWH, which checks whether a repository is archived and can add it to the archival queue, all in a simple interface.
-
For those who are writing with LaTeX, there is a package for adding software entry types in BibLaTeX.
Furthermore, there are initiatives to integrate it with projects such as with Guix and peer-to-peer access with IPFS.
Appendix A: Guidelines for referencing SWHIDs
While using SWHIDs is a done-and-forget procedure (for the most part), there is a set of guidelines to make usage of them a bit easier.
-
Per the documentation, it is recommended to use swh:dir: SWHIDs over swh:rev: or swh:rel: since swh:dir: can be computed without relying on the Software Heritage archive. The revision and release identifiers are mostly used as part of the metadata, such as in the one example from Previous SWHIDs with contextual information.
-
As already mentioned, SWHIDs with full contextual qualifiers are recommended. This should be easy to retrieve considering the website interface gets them for you as seen from this video.
-
If you want to create a hyperlink, it is advisable to make the core identifier as the link text to address the obvious problem of length making it harder to read (case in point, in this table). For a proper example of a hyperlink, here is one with swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2.
Appendix B: Extending Asciidoctor for linking SWHIDs
Linking SWHIDs can be tedious when writing documents. Asciidoctor has features that make this easier. Specifically, we’re talking about storing the identifiers in document attributes.
:swh-system76-firmware-license-core-identifier: swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
:swh-system76-firmware-license: swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;origin=https://github.com/pop-os/system76-firmware;lines=471-538
link:https://archive.softwareheritage.org/{swh-system76-firmware-license}[{swh-system76-firmware-license-core-identifier}]
In my opinion, this is still tedious since we have to store two attributes that need separate changes where there should be only one. Fortunately, Asciidoctor can be extended to introduce new syntax, as I’ve shown in a previous post. We can apply a similar solution here.
This is the very solution used for linking SWHIDs on my website.
For our initial version of the new syntax, it looks like the following.
swh:$SWHID[$CAPTION]
It is an inline macro that accepts an SWHID and, optionally, a caption to use as the link text. If the caption is omitted, the core identifier is used as the link text. The following listing shows the complete list of use cases we considered for this macro.
sample.adoc
// Should produce a link at https://archive.softwareheritage.org/$SWHID with
// '$SWHID_CORE_IDENTIFIER' as the link text.
swh:swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;origin=https://github.com/pop-os/system76-firmware;lines=471-538[]
// Similar as above but with the link text replaced with 'replacing the caption'.
swh:swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;origin=https://github.com/pop-os/system76-firmware;lines=471-538[replacing the caption]
// For aesthetic purposes, you could also use the `swh` macro with the `swh:`
// cut off from the SWHID.
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;origin=https://github.com/pop-os/system76-firmware;lines=471-538[]
The inline macro should produce a link target to the default SWHID resolver at https://archive.softwareheritage.org.
Anyways, here’s the code for the swh Asciidoctor extension.
lib/asciidoctor/swhid-inline-macro/extension.rb
# frozen_string_literal: true

class SWHInlineMacro < Asciidoctor::Extensions::InlineMacroProcessor
  use_dsl

  named :swh
  name_positional_attributes 'caption'

  def process(parent, target, attrs)
    doc = parent.document

    # We're only considering `swh:` starting with the scheme version. Also, it
    # looks nice aesthetically.
    swhid = target.start_with?('swh:') ? target : %(swh:#{target})
    swhid_core_identifier = (swhid.split ';').at 0

    text = attrs['caption'] || swhid_core_identifier
    target = %(https://archive.softwareheritage.org/#{swhid})

    doc.register :links, target
    create_anchor parent, text, type: :link, target: target
  end
end
As an exercise, you could add an option to replace the resolver domain with the |
You cannot make use of the extension as it is not registered within the Asciidoctor registry yet. Let’s make the file that does that.
lib/asciidoctor-custom-extensions.rb
# frozen_string_literal: true

require 'asciidoctor'
require 'asciidoctor/extensions'

# The path matches the location of the extension file shown earlier.
require_relative './asciidoctor/swhid-inline-macro/extension'

Asciidoctor::Extensions.register do
  inline_macro SWHInlineMacro
end
Now with the extension in place, you can use it with Asciidoctor as in the following listing.
asciidoctor -r ./lib/asciidoctor-custom-extensions.rb sample.adoc
Voila! Now you have a nicer way of linking SWHIDs to the archive. This extension should be usable with all backends since it is a simple shorthand for linking SWHIDs to the archive.