Binary data in URLs
Posted: November 04, 2021
Category: python
I've been trying to transmit arbitrary binary data in a URL, more precisely in the resource path. It turns out to be more complex than I envisioned.
Bytes in filenames
Let's say we have a web application serving files located in a filesystem. It
responds to HTTP GET queries, where the resource path matches the path of a file
in that filesystem: https://example.org/path/to/resource.txt.
However, on POSIX systems, the various components of a path (path, to,
resource.txt) are made of bytes, not characters of a given charset. This means
some directory or file names can contain bytes that can't be decoded as ASCII or
UTF-8. For example, using hexadecimal notation for a file name on disk:
on disk  | 66 69 6C 65 80 2E 74 78 74
as ascii | f  i  l  e  ?? .  t  x  t
as utf-8 | f  i  l  e  ?? .  t  x  t
Most of the bytes of the file name can be decoded as ASCII or UTF-8, except one:
80. This byte (128 in decimal) is not in the ASCII table (which only covers
bytes from 00 to 7F), and is a continuation byte in UTF-8, so it is only valid
as part of a multi-byte character, and can't be the first byte of one. So we can
have legitimate file names that can't be understood as ASCII or UTF-8. They
might be valid in another charset, or not.
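To make this concrete, here is a minimal Python sketch that tries to decode the file name from the table above. The helper name try_decode is made up for this illustration; it is not part of any library.

```python
# File name from the table above, as raw bytes.
name = b"file\x80.txt"


def try_decode(raw: bytes, charset: str) -> str:
    """Hypothetical helper: decode raw, or report why decoding failed."""
    try:
        return raw.decode(charset)
    except UnicodeDecodeError as exc:
        return f"cannot decode as {charset}: {exc.reason}"


print(try_decode(name, "ascii"))   # fails: 0x80 is outside 00..7F
print(try_decode(name, "utf-8"))   # fails: 0x80 cannot start a character
# latin-1 maps every byte to a character, so this one succeeds, which
# illustrates that the name may still be valid in another charset.
print(repr(try_decode(name, "latin-1")))
```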
So how can we construct a URL that represents that file in our web application?
Percent-encoding
There exists a well-known solution to this exact problem: percent-encoding.
The idea is to encode the value of any byte in the form %XY, where XY is the
hexadecimal representation of that byte.
This way, all bytes can be represented with characters valid in a URL (from %00
to %FF), while keeping the URL readable because other characters are not
modified.
Percent-encoding has been described in RFC 1738, 2.2 and RFC 3986, 2.1. They specify it can be used to represent non-printable characters and arbitrary binary data, whose interpretation is then left to the application (to be used as bytes or decoded in a given charset, for example).
Using percent-encoding, the URL for the weirdly-named file above would be
https://example.org/file%80.txt. All good: it is readable, unambiguous, and easy
to decode back into bytes.
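As a small illustration, Python's standard library can do this round trip with urllib.parse. This is only a sketch of the encoding step on the example file name; it says nothing yet about what servers do with the result.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

name = b"file\x80.txt"

# Only the byte that is not allowed in a URL gets encoded.
encoded = quote_from_bytes(name)
print(encoded)  # file%80.txt

# Decoding to bytes (not to str) gives back the original file name.
decoded = unquote_to_bytes(encoded)
print(decoded == name)  # True
```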
Enter CGI and WSGI
A URL, as displayed for example in a browser URL bar, is a series of characters
used to represent the address of a web resource somewhere on a system. Various
components of a URL are used by various Internet protocols; for example the
scheme (http or https) translates directly to HTTP and HTTPS, the authority
(example.org) is used to initiate a DNS request and as part of the Host header
in HTTP, and so on.
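The components mentioned above can be inspected with urllib.parse.urlsplit. A short sketch on the example URL; note that splitting does not touch the percent-encoding in the path.

```python
from urllib.parse import urlsplit

parts = urlsplit("https://example.org/file%80.txt")

print(parts.scheme)  # https
print(parts.netloc)  # example.org
print(parts.path)    # /file%80.txt
```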
When a client initiates an HTTP request to ask a server for
https://example.org/file%80.txt, the request will look like this:
GET /file%80.txt HTTP/1.1
Host: example.org
Accept: */*
As you can see, the URL is already broken down. In our case, that's not an
issue: our application receives the resource path of the URL (/file%80.txt) and
can serve the corresponding file.
client                    app                 filesystem
------                    ---                 ----------
GET /file%80.txt    ->    ok, 1 sec    ->     /app/file\x80.txt
                   HTTP               stat(1)
This works as long as every system out there agrees not to do anything with the
resource path of the URL. If everybody leaves the %80 in place so the
application can use that byte to map it to the right file, all is good.
Unfortunately, it is not always the case. CGI, the Common Gateway Interface, is a web standard to define the interface between web servers (e.g. Apache) and web applications. Defining such a standard is very useful, and allows application developers to easily expose their applications behind web servers, as long as they are compatible with CGI.
In order for a web server to expose the necessary request information to any
application running behind it, CGI defines a set of environment variables that
are given to the application. One of them, PATH_INFO, contains the resource path
of the URL. RFC 3875 4.1.5 says about this variable: "Unlike a URI path, the
PATH_INFO is not URL-encoded" and "treatment of non US-ASCII characters in the
path is system-defined". This means that the percent-encoded bytes are decoded
and might be lost in translation; the interpretation of these bytes is left to
the server, and not the application.
WSGI, which is basically CGI for Python, specified in PEP 3333, doesn't say
anything about this: it assumes PATH_INFO uses the same rules as for CGI.
Updating our schema:
client                   server             app                filesystem
------                   ------             ---                ----------
GET /file%80.txt   ->    ok, 1 sec    ->    /file�.txt    ->   /app/file\xef\xbf\xbd.txt
                  HTTP              PATH_INFO             stat(1)
As you can see, the byte data is lost when translating that information to the
value of an environment variable. In this example, it's been substituted by the
Unicode replacement character � (encoded as EFBFBD in UTF-8); this behaviour is
server-dependent.
No way to easily map URLs to file paths anymore!
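The loss can be reproduced in a few lines of Python. This sketch only simulates one possible server behaviour, under the assumption that the server percent-decodes the path as UTF-8 and replaces invalid bytes; real servers may do something else.

```python
from urllib.parse import unquote

request_path = "/file%80.txt"

# Assumed server behaviour: decode as UTF-8, replacing undecodable bytes
# with U+FFFD (the Unicode replacement character).
path_info = unquote(request_path, encoding="utf-8", errors="replace")

# What the application would look up on disk: the 0x80 byte is gone.
on_disk = path_info.encode("utf-8")
print(on_disk)  # b'/file\xef\xbf\xbd.txt'
```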
Alternatives
I can think of a few ways to work around this. These different alternatives might satisfy some needs, but don't perfectly solve the problem at hand.
Percent-encoding in the query string
The query string of a URL (the part that comes after ?, such as ?foo=bar&x=1) is
transmitted in CGI through a different variable, fittingly named QUERY_STRING.
When it contains percent-encoded bytes, they are left as-is, and information is
not lost. See RFC 3875 4.1.7 for the specification.
Of course, it is not always convenient or elegant to encode this information in the query string rather than the resource path.
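Here is a minimal sketch of how an application could recover the bytes from QUERY_STRING. The helper query_param_bytes and the parameter name "file" are hypothetical, and the sketch assumes the query string itself contains only ASCII characters. It relies on latin-1 mapping each byte from 00 to FF to exactly one code point, so decoding and re-encoding with it is lossless.

```python
from urllib.parse import parse_qs


def query_param_bytes(query_string: str, name: str) -> bytes:
    """Hypothetical helper: return the first value of `name` as raw bytes."""
    # Percent-decode with latin-1 so no byte is replaced or rejected,
    # then encode back with latin-1 to get the original bytes.
    params = parse_qs(query_string, encoding="latin-1", errors="strict")
    return params[name][0].encode("latin-1")


# In a CGI application the input would come from os.environ["QUERY_STRING"].
print(query_param_bytes("file=file%80.txt&x=1", "file"))  # b'file\x80.txt'
```

Note that parse_qs also turns "+" into a space, as is usual for query strings, so a literal "+" in a file name has to be sent as %2B.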
Other means to encode binary data
Percent-encoding is one way to encode bytes into human-readable ASCII characters, but there are others. See for example base64url, an encoding meant exactly to be included in URLs.
It has one major downside: regular characters are also encoded, and can't easily be read by the human eye, whereas percent-encoding encodes only bytes that are not allowed in a URL and leaves the other ones untouched.
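For comparison, a short sketch of base64url on the same file name, using the standard base64 module. The printed token shows the downside: nothing of the original name remains readable.

```python
import base64

name = b"file\x80.txt"

# urlsafe_b64encode uses "-" and "_" instead of "+" and "/".
token = base64.urlsafe_b64encode(name).decode("ascii")
print(token)

# The round trip is lossless.
restored = base64.urlsafe_b64decode(token)
print(restored == name)  # True
```

Depending on the input length, the token may end with "=" padding characters, which some applications strip before putting the token in a path.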
Obtain the URL raw path
Some web servers add additional environment variables to the set transmitted to
applications. For example, Apache can transmit the non-decoded REQUEST_URI.
However, this is not standardized, so there's no guarantee an application using
it can be ported to a different server. Moreover, its behaviour is not always
clearly defined, for example in its treatment of encoded slashes, or how
URL-rewriting rules come into play.
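A rough sketch of what using it could look like, assuming the server puts the raw request target (path plus optional query string) in a REQUEST_URI variable. The function name raw_path_bytes is made up for this example, and the sketch ignores the rewriting and encoded-slash caveats mentioned above.

```python
from urllib.parse import unquote_to_bytes


def raw_path_bytes(environ: dict) -> bytes:
    """Hypothetical helper: recover the path bytes from REQUEST_URI."""
    # REQUEST_URI is not part of the CGI standard, so it may be missing.
    request_uri = environ.get("REQUEST_URI")
    if request_uri is None:
        raise KeyError("REQUEST_URI is not provided by this server")
    # Drop the query string, then percent-decode straight to bytes.
    path = request_uri.split("?", 1)[0]
    return unquote_to_bytes(path)


print(raw_path_bytes({"REQUEST_URI": "/file%80.txt?x=1"}))  # b'/file\x80.txt'
```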
Double-percent encoding
Finally, it is possible to percent-encode a URL twice. It looks like this:
/file\x80.txt
| URL encode
v
/file%80.txt
| URL encode
v
/file%2580.txt
After the first conversion, the % sign is encoded into its own hexadecimal value
in the ASCII table, %25. When percent-decoding happens in CGI, %25 becomes %,
hence %2580 becomes %80, the original value that was intended to be transmitted.
The conversion looks like this in Python:
import urllib.parse

def url_encode(filename: bytes) -> str:
    # Quote twice: raw bytes become %XY, then each "%" becomes "%25".
    return urllib.parse.quote(urllib.parse.quote_from_bytes(filename))

def url_decode(url: str) -> bytes:
    # The server has already decoded once; decode the remaining layer to bytes.
    return urllib.parse.unquote_to_bytes(url)

>>> url_encode(b"file\x80.txt")
'file%2580.txt'
>>> url_decode("file%80.txt")
b'file\x80.txt'
Of course, these two functions don't mirror each other, as one round of URL decoding is already done by the CGI server.
However, this solution feels very hackish, and clutters the URL with extra characters.
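To see the whole round trip in one place, here is a small simulation. The middle step stands in for the server and assumes it percent-decodes the path exactly once; that is an assumption of this sketch, not a guarantee for every server.

```python
from urllib.parse import quote, quote_from_bytes, unquote, unquote_to_bytes

name = b"file\x80.txt"

# Client side: encode twice.
url_path = "/" + quote(quote_from_bytes(name))
print(url_path)   # /file%2580.txt

# Server side (assumed): a single percent-decoding fills PATH_INFO.
# %25 is plain ASCII, so nothing is lost or replaced here.
path_info = unquote(url_path)
print(path_info)  # /file%80.txt

# Application side: decode the remaining layer straight to bytes.
recovered = unquote_to_bytes(path_info).lstrip(b"/")
print(recovered == name)  # True
```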
Conclusion
Note that I used the term URL throughout this post, and not URI. This is for consistency, and because my original problem concerned URLs. Likewise, I used the term "server" to refer to systems that speak HTTP on one side and CGI/WSGI on the other; both terms "gateway" and "server" are used in RFC documents.
There are certainly other alternatives that I didn't think of to reliably transmit binary information in URLs. Hit me up if you know any!