Binary data in URLs

Posted: November 04, 2021
Category: python

I've been trying to transmit arbitrary binary data in a URL, more precisely in the resource path. It turned out to be more complex than I expected.

Bytes in filenames

Let's say we have a web application serving files located in a filesystem. It responds to HTTP GET requests, where the resource path matches the path of a file in that filesystem: https://example.org/path/to/resource.txt.

However, on POSIX systems, the various components of a path (path, to, resource.txt) are made of bytes, not characters of a given charset. This means some directory or file names can contain bytes that can't be decoded as ASCII or UTF-8. For example, using hexadecimal notation for a file name on disk:

on disk  | 66 69 6C 65 80 2E 74 78 74
as ascii | f  i  l  e  ?? .  t  x  t
as utf-8 | f  i  l  e  ?? .  t  x  t

Most of the bytes of the file name can be decoded as ASCII or UTF-8, except one: 80. 80 (or 128 in decimal) is not in the ASCII table (which only covers bytes from 00 to 7F), and is a continuation byte in UTF-8, so it is only valid as part of a multi-byte character, and can't be the first byte of one. So we can have legitimate file names that can't be understood as ASCII or UTF-8. They might be valid in another charset, or not.
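To check this concretely, here is a small Python sketch that tries to decode the file name from the table above. The helper name can_decode is only illustrative:

```python
# The file name from the table above, as raw bytes.
raw_name = b"file\x80.txt"

def can_decode(data: bytes, charset: str) -> bool:
    """Return True if every byte sequence in data is valid in charset."""
    try:
        data.decode(charset)
    except UnicodeDecodeError:
        return False
    return True

# Both decoders reject the 0x80 byte.
for charset in ("ascii", "utf-8"):
    print(charset, can_decode(raw_name, charset))
```

Both lines print False: neither codec accepts the byte 80 at that position.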

So how can we construct a URL to represent that file in our web application?

Percent-encoding

There exists a well-known solution to this exact problem: percent-encoding.

The idea is to encode the value of any byte in the form %XY, where XY is the hexadecimal representation of that byte.

This way, all bytes can be represented with characters valid in a URL (from %00 to %FF), while keeping the URL readable because other characters are not modified.

Percent-encoding has been described in RFC 1738, 2.2 and RFC 3986, 2.1. They specify it can be used to represent non-printable characters and arbitrary binary data, whose interpretation is then left to the application (to be used as bytes or decoded in a given charset, for example).

Using percent-encoding, the URL for the weirdly-named file above would be https://example.org/file%80.txt. All good: it is readable, unambiguous, and easy to decode back into bytes.
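As an illustration, Python's standard library can do this conversion in both directions. This is a minimal sketch using urllib.parse; the variable names are my own:

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

raw_name = b"file\x80.txt"

# Bytes that are not URL-safe are replaced by %XY; the rest is left alone.
encoded = quote_from_bytes(raw_name)

# Decoding goes straight back to bytes, without assuming any charset.
decoded = unquote_to_bytes(encoded)

print(encoded)
print(decoded == raw_name)
```

The round trip is lossless because no charset is involved at any point: the encoded form is pure ASCII and the decoded form is raw bytes.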

Enter CGI and WSGI

A URL, as displayed for example in a browser URL bar, is a series of characters used to represent the address of a web resource somewhere on a system. The various components of a URL are used by various Internet protocols; for example the scheme (http or https) maps directly to HTTP and HTTPS, the authority (example.org) is used to initiate a DNS request and as part of the Host header in HTTP, and so on.
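To make these components visible, the following sketch splits the example URL with urllib.parse.urlsplit. Note that urlsplit only separates the parts; it does not percent-decode them:

```python
from urllib.parse import urlsplit

parts = urlsplit("https://example.org/file%80.txt")

# scheme -> protocol, netloc -> authority (DNS lookup, Host header),
# path -> resource path, still percent-encoded at this stage.
print(parts.scheme)
print(parts.netloc)
print(parts.path)
```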

When a client initiates an HTTP request to ask a server for https://example.org/file%80.txt, the request will look like this:

GET /file%80.txt HTTP/1.1
Host: example.org
Accept: */*

As you can see, the URL is already broken down. In our case, that's not an issue: our application receives the resource path of the URL (/file%80.txt) and can serve the corresponding file.

    client               app            filesystem
    ------               ---            ----------
GET /file%80.txt  ->  ok, 1 sec  ->  /app/file\x80.txt
                 HTTP          stat(2)

This works as long as every system out there agrees not to do anything with the resource path of the URL. If everybody leaves the %80 in place so the application can use that byte to map it to the right file, all is good.

Unfortunately, this is not always the case. CGI, the Common Gateway Interface, is a web standard that defines the interface between web servers (e.g. Apache) and web applications. Such a standard is very useful: it allows application developers to easily expose their applications behind web servers, as long as both sides are compatible with CGI.

In order for a web server to expose the necessary request information to any application running behind it, CGI defines a set of environment variables that are given to the application. One of them is PATH_INFO, which contains the resource path of the URL. RFC 3875 4.1.5 says about this variable: "Unlike a URI path, the PATH_INFO is not URL-encoded" and "treatment of non US-ASCII characters in the path is system-defined". This means that the percent-encoded bytes are decoded before the application sees them, and might be lost in translation; the interpretation of these bytes is left to the server, not the application.

WSGI, which is basically CGI for Python and is specified in PEP 3333, doesn't say anything more about this: it assumes PATH_INFO follows the same rules as in CGI.

Updating our schema:

    client             server          app                filesystem
    ------             ------          ---                ----------
GET /file%80.txt  ->  ok, 1 sec  -> /file�.txt  ->  /app/file\xef\xbf\xbd.txt
                 HTTP         PATH_INFO       stat(2)

As you can see, the byte data is lost when translating that information to the value of an environment variable. In this example, it's been substituted by the Unicode replacement character (encoded as EFBFBD in UTF-8) - this behaviour is server-dependent.
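The loss can be reproduced in Python. The sketch below imitates one possible server behaviour (percent-decoding the path as UTF-8 and replacing invalid bytes); it is an assumption made for illustration, not a description of what any particular server does. The function name lossy_path_info is hypothetical:

```python
from urllib.parse import unquote

def lossy_path_info(raw_path: str) -> str:
    """Imitate a server that decodes the path as UTF-8 text.

    errors="replace" substitutes U+FFFD for bytes that are not valid UTF-8.
    """
    return unquote(raw_path, encoding="utf-8", errors="replace")

path_info = lossy_path_info("/file%80.txt")

# What the application would look up on disk if it re-encoded PATH_INFO.
on_disk = path_info.encode("utf-8")

print(repr(path_info))
print(on_disk)
```

Once the replacement character is in place, the original 80 byte cannot be recovered from PATH_INFO.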

No way to easily map URLs to file paths anymore!

Alternatives

I can think of a few ways to work around this. These different alternatives might satisfy some needs, but don't perfectly solve the problem at hand.

Percent-encoding in the query string

The query string of a URL (the part that comes after ?, such as ?foo=bar&x=1) is transmitted in CGI through a different variable, aptly named QUERY_STRING. When it contains percent-encoded bytes, they are left as-is, and no information is lost. See RFC 3875 4.1.7 for the specification.

Of course, it is not always convenient or elegant to encode this information in the query string rather than the resource path.
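Here is a minimal sketch of how an application could recover the bytes from QUERY_STRING. The parameter name "name" and the function filename_from_query are hypothetical, and the parsing is deliberately simplistic (it does not treat + as a space, for instance):

```python
from typing import Optional
from urllib.parse import unquote_to_bytes

def filename_from_query(query_string: str, key: str = "name") -> Optional[bytes]:
    """Return the raw bytes of the first `key` parameter, or None."""
    for pair in query_string.split("&"):
        name, _, value = pair.partition("=")
        if name == key:
            # Decode straight to bytes so no charset is ever assumed.
            return unquote_to_bytes(value)
    return None

print(filename_from_query("name=file%80.txt&x=1"))
```

In a CGI script the argument would come from os.environ["QUERY_STRING"], and in a WSGI application from environ["QUERY_STRING"].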

Other means to encode binary data

Percent-encoding is one way to encode bytes into human-readable ASCII characters, but there are others. See for example base64url (RFC 4648, section 5), a Base64 variant whose alphabet is safe to include in URLs.

It has one major downside: regular characters are also encoded, so the result can't easily be read by the human eye, whereas percent-encoding encodes only bytes that are not allowed in a URL and leaves the other ones untouched.
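For comparison, here is the same file name encoded with base64url using Python's base64 module (a short sketch; variable names are mine):

```python
import base64

raw_name = b"file\x80.txt"

# urlsafe_b64encode uses '-' and '_' instead of '+' and '/'.
token = base64.urlsafe_b64encode(raw_name).decode("ascii")
restored = base64.urlsafe_b64decode(token)

print(token)
print(restored == raw_name)
```

The round trip is lossless, but the token no longer shows the readable "file" and ".txt" parts of the name.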

Obtain the URL raw path

Some web servers add additional environment variables to the set transmitted to applications. For example, Apache can transmit the non-decoded REQUEST_URI. However, this is not standardized, so there's no guarantee an application using it can be ported to a different server. Moreover, its behaviour is not always clearly defined, for example in its treatment of encoded slashes, or how URL-rewriting rules come into play.
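If the server does provide such a variable, an application could use it along these lines. This is only a sketch: the function raw_path_bytes is hypothetical, it assumes REQUEST_URI holds the path and query string exactly as sent by the client, and it ignores the rewriting and encoded-slash caveats mentioned above:

```python
from urllib.parse import unquote_to_bytes

def raw_path_bytes(environ: dict) -> bytes:
    """Recover the request path as bytes from the non-standard REQUEST_URI."""
    request_uri = environ.get("REQUEST_URI")
    if request_uri is None:
        # Not every server sets this variable.
        raise KeyError("REQUEST_URI is not provided by this server")
    # Drop the query string, then percent-decode directly to bytes.
    path, _, _query = request_uri.partition("?")
    return unquote_to_bytes(path)

print(raw_path_bytes({"REQUEST_URI": "/file%80.txt?x=1"}))
```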

Double-percent encoding

Finally, it is possible to percent-encode a URL twice. It looks like this:

/file\x80.txt
    | URL encode
    v
/file%80.txt
    | URL encode
    v
/file%2580.txt

During the second conversion, the % sign produced by the first one is itself encoded as %25, its hexadecimal value in the ASCII table. When percent-decoding happens in CGI, %25 becomes % again, hence %2580 becomes %80, which the application can then decode into the byte that was originally intended. The conversion looks like this in Python:

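import urllib.parse

def url_encode(filename: bytes) -> str:
    # Inner call: raw bytes -> 'file%80.txt'. Outer call: '%' -> '%25'.
    return urllib.parse.quote(urllib.parse.quote_from_bytes(filename))

def url_decode(url: str) -> bytes:
    # Single pass only: the server has already decoded the path once.
    return urllib.parse.unquote_to_bytes(url)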

>>> url_encode(b"file\x80.txt")
'file%2580.txt'

>>> url_decode("file%80.txt")
b'file\x80.txt'

Of course, these two functions don't mirror each other: url_encode encodes twice while url_decode decodes only once, because the first round of decoding is done by the CGI server.
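The full round trip can be simulated as follows. The server's decoding step is imitated here with urllib.parse.unquote, which is an assumption: a real server may decode differently.

```python
from urllib.parse import quote, quote_from_bytes, unquote, unquote_to_bytes

raw_name = b"file\x80.txt"

# Client side: encode twice before building the URL.
sent = quote(quote_from_bytes(raw_name))

# Server side (simulated): one round of percent-decoding fills PATH_INFO.
path_info = unquote(sent)

# Application side: decode the remaining layer straight to bytes.
received = unquote_to_bytes(path_info)

print(sent, path_info, received == raw_name)
```

Because the server only ever sees ASCII characters after its own decoding pass, it has nothing to misinterpret, and the application gets the original bytes back.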

However, this solution feels very hackish, and clutters the URL with extra characters.

Conclusion

Note that I used the term URL throughout this post, and not URI. This is for consistency, and because my original problem concerned URLs. Likewise, I used the term "server" to refer to systems that speak HTTP on one side and CGI/WSGI on the other; both terms "gateway" and "server" are used in RFC documents.

There are certainly other alternatives that I didn't think of to reliably transmit binary information in URLs. Hit me up if you know any!