URL Encoding Differences Across High-Level Languages

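While idling in a group chat, I saw someone run into a program error because the URL encoding of the * character did not match what they expected. That prompted this post, which tests how the URL encoding implementations of several high-level languages differ.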

Relevant standards

Because RFC 1738: Uniform Resource Locators (URL) is not an Internet Standard, this post follows the Internet Standard RFC 3986: Uniform Resource Identifier (URI): Generic Syntax instead. That standard recommends the general term "URI" over the more restrictive terms "URL" and "URN" (RFC 3305).

RFC 3986 defines the unreserved characters of a URI as follows:

unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

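When a URI is encoded, unreserved characters should be left unescaped. The standard also states, however, that if a URI arrives with these characters percent-encoded, they should still be decoded back to the original characters: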

URIs that differ in the replacement of an unreserved character with
its corresponding percent-encoded US-ASCII octet are equivalent: they
identify the same resource.  However, URI comparison implementations
do not always perform normalization prior to comparison (see Section
6).  For consistency, percent-encoded octets in the ranges of ALPHA
(%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
underscore (%5F), or tilde (%7E) should not be created by URI
producers and, when found in a URI, should be decoded to their
corresponding unreserved characters by URI normalizers.
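
The normalization rule quoted above can be sketched in a few lines of Python. This is only an illustration, not part of the tests in this post: the function name normalize_unreserved is made up here, and the choice to upper-case the hex digits of the escapes that remain follows the case normalization described in RFC 3986 section 6.2.2.1.

```python
import re

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"  (RFC 3986)
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_unreserved(uri: str) -> str:
    """Decode escapes that stand for unreserved characters; keep the rest."""

    def repl(match):
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            return ch  # e.g. %7E -> ~
        # Any other escape stays escaped; only its hex digits are upper-cased.
        return "%" + match.group(1).upper()

    return _PCT.sub(repl, uri)


print(normalize_unreserved("/%7Euser/a%2db%2fc"))  # /~user/a-b%2Fc
```

Note that %2f is left escaped: "/" is a reserved character, so decoding it could change how the path is split into segments.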

The standard also points out that older URI implementations often escape the ~ character as %7E:

For example, the octet
corresponding to the tilde ("~") character is often encoded as "%7E"
by older URI processing implementations; the "%7E" can be replaced by
"~" without changing its interpretation.

The reserved characters, which may need to be escaped, are divided into two groups:

reserved    = gen-delims / sub-delims
gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

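The gen-delims delimit the structural components of a URI, so they must be escaped when they appear as data where they would otherwise be read as delimiters. Whether a sub-delims character needs to be escaped depends on where in the URI it appears. In addition, because escaping itself uses the % sign, a literal % must also be escaped.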
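
As a small illustration of "it depends on the position", Python's urllib.parse.quote lets the caller choose which characters stay literal through its safe parameter. The sample value below is made up for this sketch, and treating & and = as separators is the common key=value convention for queries, not something RFC 3986 itself requires.

```python
from urllib.parse import quote

value = "a=b&c 100%"

# As a single query value: "=" and "&" would be mistaken for separators,
# so nothing outside the unreserved set is kept. "%" itself becomes %25.
as_value = quote(value, safe="")
print(as_value)  # a%3Db%26c%20100%25

# As an already assembled key=value list: "=" and "&" are the separators
# and must stay literal, while the space and "%" are still escaped.
as_pairs = quote(value, safe="=&")
print(as_pairs)  # a=b&c%20100%25
```

The same input therefore has two different correct encodings, which is why a general-purpose encoder has to pick a policy for the sub-delims.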

A typical URI is made up of the following components:

      foo://example.com:8042/over/there?name=ferret#nose
      \_/   \______________/\_________/ \_________/ \__/
       |           |            |            |        |
    scheme     authority       path        query   fragment
       |   _____________________|__
      / \ /                        \
      urn:example:animal:ferret:nose
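
To see the same decomposition programmatically, the two example URIs from the diagram can be passed to Python's urllib.parse.urlsplit. This is a sketch only; the helper name components is made up here, and urlsplit reports a missing component as an empty string rather than distinguishing "absent" from "empty".

```python
from urllib.parse import urlsplit


def components(uri: str) -> dict:
    """Split a URI into the five generic components named in the diagram."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


for uri in (
    "foo://example.com:8042/over/there?name=ferret#nose",
    "urn:example:animal:ferret:nose",
):
    print(components(uri))
```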

The grammar fragments that involve sub-delims are:

authority     = [ userinfo "@" ] host [ ":" port ]
userinfo      = *( unreserved / pct-encoded / sub-delims / ":" )

host          = IP-literal / IPv4address / reg-name
IP-literal    = "[" ( IPv6address / IPvFuture  ) "]"
IPvFuture     = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
reg-name      = *( unreserved / pct-encoded / sub-delims )

path          = path-abempty    ; begins with "/" or is empty
              / path-absolute   ; begins with "/" but not "//"
              / path-noscheme   ; begins with a non-colon segment
              / path-rootless   ; begins with a segment
              / path-empty      ; zero characters
path-abempty  = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty    = 0<pchar>
segment       = *pchar
segment-nz    = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
              ; non-zero-length segment without any colon ":"
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

query         = *( pchar / "/" / "?" )

fragment      = *( pchar / "/" / "?" )

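Following this grammar, the sub-delims characters may appear unescaped in the authority, path, query, and fragment components. To find out how different high-level languages actually treat these characters, I ran a simple test. The results are given first; the test code and its output are listed at the end.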
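
The query rule from the grammar above can be transcribed into a regular expression to check this claim. The sketch below is a direct translation of the pchar and query productions; the names QUERY and is_valid_query are made up for this illustration.

```python
import re

UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = rf"(?:[{UNRESERVED}{SUB_DELIMS}:@]|{PCT_ENCODED})"
# query = *( pchar / "/" / "?" )
QUERY = re.compile(rf"(?:{PCHAR}|[/?])*")


def is_valid_query(text: str) -> bool:
    """Return True if text matches the RFC 3986 query production."""
    return QUERY.fullmatch(text) is not None


print(is_valid_query("param=!$&'()*+,;=-._~"))  # True: sub-delims are allowed as-is
print(is_valid_query("a b"))                    # False: a space must be escaped
print(is_valid_query("x=[1]"))                  # False: "[" and "]" are gen-delims
```

So a query string containing every sub-delims character unescaped is still grammatical; whether an encoder escapes them anyway is a policy decision, which is what the test below compares.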

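Test results

Only encoding and decoding in the query component were tested. For every encoding test, the reference result is that all sub-delims characters are escaped and none of the special unreserved characters (- . _ ~) are escaped; the table lists the characters that deviate from this reference. The decoding tests use a string in which every special character is escaped. Since all decoders produced the same result, decoding is not shown in the table.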

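Language    Module / Function                          sub-delims left unescaped  unreserved escaped
----------  -----------------------------------------  -------------------------  ------------------
Python      urllib.parse                               (none)                     (none)
Go          net/url                                    (none)                     (none)
Java        java.net.URLEncoder / java.net.URLDecoder  *                          ~
JavaScript  URLSearchParams                            *                          ~
Node.js     querystring                                !'()*                      (none)
C#          System.Net.WebUtility                      !()*                       ~
PHP         http_build_query / parse_str               (none)                     ~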

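Although the encoders differ in which symbols they escape, every implementation tested correctly decoded a string in which all sub-delims and all special unreserved characters were escaped.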
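
The two difference columns can be derived mechanically from an encoder's output. The following sketch shows one way to do it; the function name diff_from_reference is made up here, and it expects only the encoded value (without a leading "param="), because the "=" separator would otherwise be counted as an unescaped sub-delim.

```python
SUB_DELIMS = "!$&'()*+,;="
UNRESERVED_MARKS = "-._~"


def diff_from_reference(encoded: str) -> tuple:
    """Return (sub-delims left literal, unreserved marks that were escaped)."""
    upper = encoded.upper()  # hex digits in escapes may be lower-case
    literal = "".join(c for c in SUB_DELIMS if c in encoded)
    escaped = "".join(
        c for c in UNRESERVED_MARKS if f"%{ord(c):02X}" in upper
    )
    return literal, escaped


# Encoded values copied from the Java and Node.js outputs in this post.
java_out = "%21%24%26%27%28%29*%2B%2C%3B%3D-._%7E"
node_out = "!%24%26'()*%2B%2C%3B%3D-._~"

print(diff_from_reference(java_out))  # ('*', '~')
print(diff_from_reference(node_out))  # ("!'()*", '')
```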

The detailed results are listed below.

Test code

Python:

from urllib.parse import urlencode, unquote

print(urlencode({"param":"!$&'()*+,;=-._~"}))
print(unquote("param=%21%24%26%27%28%29%2A%2B%2C%3B%3D%2d%2e%5f%7e"))

Output:

param=%21%24%26%27%28%29%2A%2B%2C%3B%3D-._~
param=!$&'()*+,;=-._~

Go:

package main

import (
    "fmt"
    "net/url"
)

func main() {
    fmt.Println(url.QueryEscape("!$&'()*+,;=-._~"))
    fmt.Println(url.QueryUnescape("%21%24%26%27%28%29%2A%2B%2C%3B%3D%2d%2e%5f%7e"))
}

Output:

%21%24%26%27%28%29%2A%2B%2C%3B%3D-._~
!$&'()*+,;=-._~ <nil>

Java:

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class Main {
    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(URLEncoder.encode("!$&'()*+,;=-._~", StandardCharsets.UTF_8.toString()));
        System.out.println(URLDecoder.decode("%21%24%26%27%28%29%2A%2B%2C%3B%3D%2d%2e%5f%7e", StandardCharsets.UTF_8.toString()));
    }
}

Output:

%21%24%26%27%28%29*%2B%2C%3B%3D-._%7E
!$&'()*+,;=-._~

JavaScript:

const encode = new URLSearchParams();
encode.set("param", "!$&'()*+,;=-._~");
console.log(encode.toString());
const decode = new URLSearchParams("param=%21%24%26%27%28%29%2A%2B%2C%3B%3D%2d%2e%5f%7e");
// Read the value back from the decoded instance, not from `encode`.
console.log(decode.get("param"));

Output:

param=%21%24%26%27%28%29*%2B%2C%3B%3D-._%7E
!$&'()*+,;=-._~

Node.js (JavaScript):

const querystring = require("querystring");
console.log(querystring.stringify({ param: "!$&'()*+,;=-._~" }));
console.log(querystring.parse("param=%21%24%26%27%28%29%2A%2B%2C%3B%3D%2d%2e%5f%7e").param);

Output:

param=!%24%26'()*%2B%2C%3B%3D-._~
!$&'()*+,;=-._~

C#:

using System;

class Program
{
    static void Main()
    {
        Console.WriteLine(System.Net.WebUtility.UrlEncode("!$&'()*+,;=-._~"));
        Console.WriteLine(System.Net.WebUtility.UrlDecode("%21%24%26%27%28%29%2A%2B%2C%3B%3D%2d%2e%5f%7e"));
    }
}

Output:

!%24%26%27()*%2B%2C%3B%3D-._%7E
!$&'()*+,;=-._~

PHP:

<?php
echo http_build_query(["param" => "!$&'()*+,;=-._~"]) . "\n";
$decoded = [];
parse_str("param=%21%24%26%27%28%29%2A%2B%2C%3B%3D%2d%2e%5f%7e", $decoded);
echo $decoded["param"] . "\n";
?>

Output:

param=%21%24%26%27%28%29%2A%2B%2C%3B%3D-._%7E
!$&'()*+,;=-._~
