Hi all! We have built a tcl daemon that receives and processes emails over pop3 (tcllib). all in all it's working great, but if we receive an email (in our case sent from thunderbird) that contains special characters (other than the 7-bit ascii characters) in the subject, the email header line is encoded like this:
Subject: Re: Session =?ISO-8859-1?Q?l=E4uft_ab!_sid=23CBFE0=23?=
which is in fact correct, because RFC5322 says emails shall only contain 7-bit ascii chars. but the tcllib mime package does not decode this line. we tried it with tcllib 1.9 and 1.12 on Tcl 8.4
has anyone an idea how we could parse this ourselves? i did not find any info about this type of quoting. If it is interesting, here are some other Header lines:
MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit
btw, if we send emails over tcllib mime/smtp package, special characters aren't encoded as well, so it seems tcllib doesn't fully follow rfc5322.
we have made a simple proc which works for our cases:
proc decodeRFC5322 {string} { # if the line is to long, it is separated to more lines, each line beginning with a single space -> join the lines set string [string map {"\n " "" "\n" ""} $string] set result "" set encodings [encoding names] set lEncodings [string tolower $encodings] # the encoded text has the following format: =?<encoding>?Q?<encoded_text>?= # in <encoded_text> every non-7-bit-ascii character is encoded like =<hexcode> (<hexcode> is exactly 2 characters long) # additionally, spaces are replaced with underlines. # It's possible, that more than one encoding parts are in the same header while {[set startIndex [string first "=?" $string]] > -1} { set delimiterIndex [string first "?Q?" $string $startIndex] set endIndex [string first "?=" $string $startIndex] # look, if the found indices match the pattern, otherwise break... if {$delimiterIndex - 1 > $startIndex + 2 && $endIndex > $delimiterIndex + 3} { if {$startIndex > 0} { append result [string range $string 0 [expr $startIndex -1]] } set encoding [string range $string [expr $startIndex + 2] [expr $delimiterIndex - 1]] set encodedText [string range $string [expr $delimiterIndex + 3] [expr $endIndex - 1]] set string [string range $string [expr $endIndex + 2] end] set i -1 set lEncoding [string tolower $encoding] # try to examine the encoding name if {[set i [lsearch -exact $lEncodings $lEncoding]] == -1} { # e.g. the iso-encodings have a dash in the string, but in tcl's encoding list, not. perhaps this string map has to be extended in future? if {[set i [lsearch -exact $lEncodings [string map {"iso-" "iso"} $lEncoding]]] == -1} { } } # only if we found an encoding, we proceed... otherwise append the quoted text as is... if {$i > -1} { set targetEncoding [lindex $encodings $i] # spaces are encoded with underlines set text2decode [string map {"_" " "} $encodedText] # subst [string map {"=" "\\x"} $text2decode] does not work, because \x possibly takes more than 2 following characters # => iterate over it... while {[regexp {^([^=]*)=([a-fA-F0-9]{2})(.*)$} $text2decode dummy pre code post]} { append result $pre append result [encoding convertfrom $targetEncoding [subst -nocommands -novariables "\\x$code"]] set text2decode $post } append result $text2decode } else { append result "=?$encoding?Q?$encodedText?=" } } else { break } } append result $string return $result }
So, with this proc we process the mails like this:
set m [mime::initialize -string $mail] array set mailArray [list] set mailArray(source) $mail set headers {From To Subject Date} foreach h $headers { set mailArray($h) [list] if ![catch {set temp [mime::getheader $m $h]} e] { set decoded [list] foreach entry $temp { lappend decoded [decodeRFC5322 $entry] } set mailArray($h) $decoded } }
I'm pretty sure that this work around proc is not the ultimate solution, but it's working for all cases we tested :-) I'm especially unsure with the way it's looking for the right encoding, and I'm not sure, if I used the encoding convertfrom command in the right way... Perhaps anyone could read over it and comment it if needed? Thanks!
jnc 2011-20-10: I made this proc and use it w/success so far, but can't say I really tested it on all possible cases:
proc ConvertEncodedString { s } { set values [regexp -all -inline {=\?([\w\-]+)\?(.)\?(.*?)\?=} $s] if { $values == {} } { return $s } lassign $values _ charset enctyp value if { [string match iso-* $charset] } { set charset [string replace $charset 3 3] } if { $enctyp == "B" } { set value [::base64::decode $value] } elseif { $enctyp == "Q" } { set value [::mime::qp_decode $value] } return [encoding convertfrom $charset $value] }
This handles items that are not encoded, encoded with base64 and encoded with quoted-printable.
Two simple unit tests I have for the proc:
test ConvertEncodedString-1.0 {Decoding a base64 encoded string} { ::email::ConvertEncodedString {=?iso-8859-1?B?Q2l0aSBDYXJkcw==?=} } {Citi Cards} test ConvertEncodedString-1.1 {Decoding a quoted-printed encoded string} { ::email::ConvertEncodedString {=?cp1252?Q?John=20Doe?=} } {John Doe}