Debsources, python3, and funky file names

Posted: January 24, 2022
Categories: debian, debsources, python

Rumors are running that python2 is not a thing anymore.

Well, I'm certainly late to the party, but I'm happy to report that sources.debian.org is now running python3.

Wait, it wasn't?

Back when development started, python3 was very much a real language, but it was hard to adopt because it was not supported by many libraries. So python2 was chosen, meaning print-based debugging was used in lieu of print()-based debugging, and str were bytes, not unicode.

And things were working just fine. One day python2 EOL was announced, with a date far in the future. Far enough to procrastinate for a long time. Combine this with a codebase that is stable enough to not see many commits, and the fact that Debsources is a volunteer-based project that happens at best on week-ends, and you end up with a dormant software and a missed deadline.

But, as dormant as the codebase is, the instance hosted at sources.debian.org is very popular and gets 200k to 500k hits per day. Largely enough to be worth a proper maintenance and a transition to python3.

Funky file names

While transitioning to python3 and juggling left and right with str, bytes and unicode for internal objects, files, database entries and HTTP content, I stumbled upon a bug that has been there since day 1.

Quick recap if you're unfamiliar with this tool: Debsources displays the content of the source packages in the Debian archive. In other words, it's a bit like GitHub, but for the Debian source code.

And some pieces of software out there, that ended up in Debian packages, happen to contain files whose names can't be decoded to UTF-8. Interestingly enough, there's no such thing as a standard for file names: with a few exceptions that vary by operating system, any sequence of bytes can be a legit file name. And some sequences of bytes are not valid UTF-8.

Of course those files are rare, and using ASCII characters to name a file is a much more common practice than using bytes in a non-UTF-8 character encoding. But when you deal with almost 100 million files on which you have no control (those files come from free software projects, and make their way into Debian without any renaming), it happens.

Now back to the bug: when trying to display such a file through the web interface, it would crash because it can't convert the file name to UTF-8, which is needed for the HTML representation of the page.

Bugfix

An often valid approach when trying to represent invalid UTF-8 content is to ignore errors, and replace them with ? or �. This is what Debsources actually does to display non-UTF-8 file content.

Unfortunately, this best-effort approach is not suitable for file names, as file names are also identifiers in Debsources: among other places, they are part of URLs. If an URL were to use placeholder characters to replace those bytes, there would be no deterministic way to match it with a file on disk anymore.

The representation of binary data into text is a known problem. Multiple lossless solutions exist, such as base64 and its variants, but URLs looking like https://sources.debian.org/src/Y293c2F5LzMuMDMtOS4yL2Nvd3NheS8= are not readable at all compared to https://sources.debian.org/src/cowsay/3.03-9.2/cowsay/. Plus, not backwards-compatible with all existing links.

The solution I chose is to use double-percent encoding: this allows the representation of any byte in an URL, while keeping allowed characters unchanged - and preventing CGI gateways from trying to decode non-UTF-8 bytes. This is the best of both worlds: regular file names get to appear normally and are human-readable, and funky file names only have percent signs and hex numbers where needed.

Here is an example of such an URL: https://sources.debian.org/src/aspell-is/0.51-0-4/%25EDslenska.alias/. Notice the %25ED to represent the percentage symbol itself (%25) followed by an invalid UTF-8 byte (%ED).

Transitioning to this was quite a challenge, as those file names don't only appear in URLs, but also in web pages themselves, log files, database tables, etc. And everything was done with str: made sense in python2 when str were bytes, but not much in python3.

What are those files? What's their network?

I was wondering too. Let's list them!

import os

with open('non-utf-8-paths.bin', 'wb') as f:
    for root, folders, files in os.walk(b'/srv/sources.debian.org/sources/'):
        for path in folders + files:
            try:
                path.decode('utf-8')
            except UnicodeDecodeError:
                f.write(root + b'/' + path + b'\n')

Running this on the Debsources main instance, which hosts pretty much all Debian packages that were part of a Debian release, I could find 307 files (among a total of almost 100 million files).

Without looking deep into them, they seem to fall into 2 categories:

File names that are not valid UTF-8, but are valid in a different charset. Not all software is developed in English or on UTF-8 systems.
File names that can't be decoded to UTF-8 on purpose, to be used as input to test suites, and assert resilience of the software to non-UTF-8 data.

That last point hits home, as it was clearly lacking in Debsources. A funky file name is now part of its test suite. ;)