Description
This module contains functions for parsing and handling URIs (RFC 3986
) and form-urlencoded query strings (HTML 5.2
).
Parsing and serializing non-UTF-8 form-urlencoded query strings are also supported (HTML 5.0
).
A URI is an identifier consisting of a sequence of characters matching the syntax rule named URI in RFC 3986
.
The generic URI syntax consists of a hierarchical sequence of components referred to as the scheme, authority, path, query, and fragment:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part = "//" authority path-abempty
/ path-absolute
/ path-rootless
/ path-empty
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
authority = [ userinfo "@" ] host [ ":" port ]
userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
The interpretation of a URI depends only on the characters used and not on how those characters are represented in a network protocol.
The functions implemented by this module cover the following use cases:
There are four different encodings present during the handling of URIs:
- Inbound binary encoding in binaries
- Inbound percent-encoding in lists and binaries
- Outbound binary encoding in binaries
- Outbound percent-encoding in lists and binaries
Functions with uri_string()
argument accept lists, binaries and mixed lists (lists with binary elements) as input type. All of the functions but transcode/2
expects input as lists of unicode codepoints, UTF-8 encoded binaries and UTF-8 percent-encoded URI parts ("%C3%B6" corresponds to the unicode character "ö").
Unless otherwise specified the return value type and encoding are the same as the input type and encoding. That is, binary input returns binary output, list input returns a list output but mixed input returns list output.
In case of lists there is only percent-encoding. In binaries, however, both binary encoding and percent-encoding shall be considered. transcode/2
provides the means to convert between the supported encodings, it takes a uri_string()
and a list of options specifying inbound and outbound encodings.
RFC 3986
does not mandate any specific character encoding and it is usually defined by the protocol or surrounding text. This library takes the same assumption, binary and percent-encoding are handled as one configuration unit, they cannot be set to different values.
Quoting functions are intended to be used by URI producing application during component preparation or retrieval phase to avoid conflicts between data and characters used in URI syntax. Quoting functions use percent encoding, but with different rules than for example during execution of recompose/1
. It is user responsibility to provide quoting functions with application data only and using their output to combine an URI component.
Quoting functions can for instance be used for constructing a path component with a segment containing '/' character which should not collide with '/' used as general delimiter in path component.