Parsing RFC 1036 headers

One of the applications of the code, described in Reading and parsing RFC 822 headers, is to manage a news archive. Messages in the newsgroups are formatted according to the RFC 1036 [L1 ], which is a variant of RFC 822, but somewhat more restrictive.

The code below is to provide some sanity checks (e. g., multiple instances of the same header are disallowed) and parsing for the values of a couple of RFC 1036 header fields (Message-ID and Xref.)

Not much, but still better than nothing.

Namespace
::rfc-1036

The procs below belong to this namespace, exported and ready to use.

Function
trim string

Returns string with any leading spaces and tabs removed.

Function
get-field arrayName field

Returns the value of the header field field, as stored in array arrayName, with any leading spaces and tabs removed. If there're no, or multiple, instances of the field, an error is signaled.

Function
parse-Message-ID value ?checkOnlyP?

Parses value as a Message-ID and returns the list of two elements: the unique part and the host part. If value cannot be parsed as a Message-ID, an error is signaled. If checkOnlyP is given and is a true value, an empty string is returned, but the error checking is nevertheless performed.

Function
parse-Xref value ?checkOnlyP?

Parses value as an Xref and returns the list of two values: the host part and the list of newsgroup-article number pairs, suitable for a later foreach:

 foreach { group number } $pairs {
    ...
 }

If value cannot be parsed as an Xref, an error is signaled. If checkOnlyP is given and is a true value, an empty string is returned, but the error checking is nevertheless performed.

TODO: implement parsing of more RFC 1036 header fields; Newsgroups, References probably will be most useful, and Path is probably easiest to implement.


 ### rfc1036p.tcl --- Parsing RFC 1036 headers  -*- Tcl -*-
 ## $Id: 16304,v 1.1 2006-08-21 18:00:33 jcw Exp $

 ### Copyright (C) 2006 Ivan Shmakov

 ##  This library is free software; you can redistribute it and/or modify
 ##  it under the terms of the GNU Lesser General Public License as
 ##  published by the Free Software Foundation; either version 2.1 of the
 ##  License, or (at your option) any later version.

 ##  This library is distributed in the hope that it will be useful, but
 ##  WITHOUT ANY WARRANTY; without even the implied warranty of
 ##  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 ##  Lesser General Public License for more details.

 ##  You should have received a copy of the GNU Lesser General Public
 ##  License along with this program; if not, write to the Free Software
 ##  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301
 ##  USA

 ### Code:

 namespace eval ::rfc-1036 {
     ## basic REs
     ## FIXME: the RFC says nothing about these two
     variable group-re   "\[0-9a-zA-Z_.+-\]+"
     variable host-re    "\[0-9a-zA-Z.-\]+"
     ## header value REs
     variable message-id-re \
         "^<(\[^<@> \t\]+)@(\[^<@> \t\]+)>\$"
     variable xref-pair-re "(${group-re}):(\[0-9\]+)"
     variable xref-re \
         "^(${host-re})((?:\[ \t\]+${xref-pair-re})+)\$"
     ## exports
     namespace export \
         trim \
         get-field \
         parse-Message-ID \
         parse-Xref
 }

 ## : Return STR with any leading spaces and tabs removed
 proc ::rfc-1036::trim { str } {
     ## .
     string trimleft $str " \t"
 }

 ## : Return the value of FIELD of the header
 proc ::rfc-1036::get-field { arrayName field } {
     upvar 1 $arrayName header

     set varn header($field)
     if { ! [ info exists $varn ] } {
         error "no `$field' header field"
     }

     set lst [ set $varn ]
     ## NB: RFC 1036 says nothing about this case;
     ##     at least my INN rejects articles with duplicate headers
     if { [ llength $lst ] != 1 } {
         error "multiple `$field' header fields"
     }

     ## .
     trim [ lindex $lst 0 ]
 }

 ## : Return the unique part and the host part of VALUE
 proc ::rfc-1036::parse-Message-ID { value { check-only? 0 } } {
     variable message-id-re

     if { ! [ regexp -- ${message-id-re} $value \
                 dummy unique host ] } {
         error "`Message-ID' does not match the pattern"
     }

     if { ${check-only?} } {
         ## .
         return
     }

     ## .
     list $unique $host
 }

 ## : Return the host part and the list of locations of VALUE
 proc ::rfc-1036::parse-Xref { value { check-only? 0 } } {
     variable xref-pair-re
     variable xref-re

     if { ! [ regexp -- ${xref-re} $value \
                  dummy host rest ] } {
         error "`Xref' does not match the pattern"
     }

     if { ${check-only?} } {
         ## .
         return
     }

     set pairs [ list ]
     foreach s [ split $rest " \t" ] {
         if { ! [ string length $s ] } { continue }
         if { ! [ regexp -- ${xref-pair-re} $s \
                      dummy group number ] } {
             error "unreachable"
         }
         lappend pairs $group $number
     }

     ## .
     list $host $pairs
 }

 package provide rfc1036::parse 0.1.1

 ### Emacs stuff
 ## Local variables:
 ## fill-column: 72
 ## indent-tabs-mode: nil
 ## ispell-local-dictionary: "english"
 ## mode: outline-minor
 ## outline-regexp: "###\\|proc"
 ## End:
 ## LocalWords:  
 ### rfc1036p.tcl ends here