http::formatQuery

::http::formatQuery is a command provided by the http package.

Inverse Operation

ncgi provides the inverse operation:

ncgi::input $query
ncgi::nvlist
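
For example, a round trip through both packages might look like this (a minimal sketch with made-up parameter names and values; the handling of non-ASCII characters differs between the two packages, as discussed further down this page):

package require http
package require ncgi

set query [::http::formatQuery name value1 mode value2]
# -> name=value1&mode=value2

ncgi::input $query        ;# parse the query string again
ncgi::nvlist              ;# -> name value1 mode value2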

CGI Parameter Encoding Character Case

Does someone know whether it's allowed to use upper- OR lower-case characters in CGI parameter encoding? E.g., are the following two results equivalent?

% http::formatQuery äöü
%c3%a4%c3%b6%c3%bc
%

% http::formatQuery2 äöü
%C3%A4%C3%B6%C3%BC
%

I read a few lines of RFC 3875, but did not fully understand everything yet...

MJ: The URI escaping is described in RFC 2396, which is referenced by the CGI RFC 3875. There it states:

   An escaped octet is encoded as a character triplet, consisting of the
   percent character "%" followed by the two hexadecimal digits
   representing the octet code. For example, "%20" is the escaped
   encoding for the US-ASCII space character.

      escaped     = "%" hex hex
      hex         = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                            "a" | "b" | "c" | "d" | "e" | "f"

So yes, they are equivalent.
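
A quick sanity check from Tcl is to decode both spellings and compare the results (a sketch; it relies on ncgi::decode accepting both upper- and lower-case hex digits):

package require ncgi

set lower [ncgi::decode %c3%a4%c3%b6%c3%bc]
set upper [ncgi::decode %C3%A4%C3%B6%C3%BC]
expr {$lower eq $upper}    ;# -> 1: both spellings decode to the same octets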

Mat 2009-10-27: Though they should be equivalent, many implementations don't accept both of those variants.

As it turns out, even Amazon's Product Advertising API requires query strings to use upper-case character triplets.

This proc will convert them:

proc formatQuery2 val {
    return [subst [regsub -all {(%.{2})} [
        ::http::formatQuery $val] {[string toupper \1]}]]
}

PYK: If $val can contain [ or $, formatQuery2 would be vulnerable to injection attacks.
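
One way to avoid the [subst] altogether is to walk the encoded string and upper-case each %xx triplet directly. The following is only a sketch (the name formatQuery3 is made up here, and it keeps formatQuery2's single-value interface):

proc formatQuery3 val {
    set query [::http::formatQuery $val]
    # Split into %xx triplets and runs of other characters, then
    # upper-case only the triplets.  No [subst] is involved, so
    # [ and $ in the input cannot trigger command or variable
    # substitution.
    set result ""
    foreach piece [regexp -all -inline {%..|[^%]+} $query] {
        if {[string index $piece 0] eq "%"} {
            append result [string toupper $piece]
        } else {
            append result $piece
        }
    }
    return $result
}

Called as formatQuery3 äöü it should return %C3%A4%C3%B6%C3%BC, matching the formatQuery2 example above.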

Default URL encoding

MHo 2010-03-06: Confusion:

% info pa
8.5.8
% package require http
2.7.5
% http::formatQuery umlaut1 ä umlaut2 ö umlaut3 Ü
umlaut1=%c3%a4&umlaut2=%c3%b6&umlaut3=%c3%9c
% package require ncgi
1.3.2
% ncgi::encode äöÜ
%E4%F6%DC

Does http::formatQuery produce the right encoding for German umlauts here?

Lars H, 2010-03-07: Looks like http::formatQuery is using utf-8, whereas ncgi::encode is using iso8859-1 (a.k.a. binary). AFAIK, an HTTP URL is just an octet sequence, so there is no "default encoding" — providing encoding information requires a higher level mechanism, so it probably depends on who you're talking to. I notice Google search queries tend to contain substrings &ie=UTF-8&, so that is probably how they do it. Others might do it differently. HTML 4 seems to indicate that only ASCII is valid for HTML form data.

A guess would be that ncgi::encode expects its caller to handle encoding conversions (try it with a non-iso8859-1 character such as ğ), whereas http::formatQuery has the assumption of utf-8 built-in.
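
If that guess is right, the caller can choose the octet encoding explicitly before calling ncgi::encode, along these lines (an untested sketch against the 8.5-era packages shown above):

package require ncgi

set s äöÜ
ncgi::encode $s                              ;# iso8859-1 octets: %E4%F6%DC
ncgi::encode [encoding convertto utf-8 $s]   ;# utf-8 octets: %C3%A4%C3%B6%C3%9C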

AMG: I discovered during the development of Wibble that HTTP 1.1 uses the ISO8859-1 encoding unless explicitly overridden with Content-Type headers (RFC 2616).

Lars H: That concerns the body content, though. http::formatQuery is more about constructing the URL, which (I believe) cannot rely on headers for its interpretation, can it?