Version 58 of XML

Updated 2007-11-21 14:32:51 by LV

XML = eXtensible Markup Language [L1 ]. Very generally speaking, it is a simplified form of SGML, but stricter (more regular) in some respects:

  • Singleton elements must end with />
  • Attribute values must be quoted

Example:

 <father name="Jack" att1="1">
   <child name="Tom" born="1997" />
 </father>

"Programming XML in Tcl" [L2 ] surveys the state-of-the-art as of spring 2001, mainly from a Zveno-biased perspective.

One deficiency of that article is its neglect of Jochen Loewer's tDOM work.


One way of specifying the valid tag structure of a class of documents is to use a Document Type Definition, DTD for short. This way was inherited from SGML. There are alternatives, such as XML Schema and RELAX NG.


Perhaps the single most important introductory point to make to Tcl developers about XML is that it's built-in! Almost--while the core Tcl distribution doesn't know about XML, it does have excellent Unicode abilities, and both the ActiveTcl and Kitten installations of Tcl include XML packages.


tDOM has a built-in pretty-printing serialization option. Those with an interest in a comparable function for TclDOM are welcome to try/use/improve/... dom_pretty_print [L3 ]. "XML pretty-printing" will eventually have more on this topic.


How can you start to generate your own XML documents with Tcl? In answering just that question in a mailing list [reference?], Steve Ball succinctly advised, "When creating XML, I generally use TclDOM. Create a DOM tree in memory, and then use 'dom::DOMImplementation serialize $doc' to generate the XML. The TclDOM package will make sure that the generated XML is well-formed.

Alternatively, XML is just text so there's no reason why you can't just create the string directly. Eg:

        puts "<document>$content</document>"

The problem with this is that (a) you have to worry about the XML syntax nitty-gritty and (b) the content variable may contain special characters which you have to deal with.

There are also some generation packages available, like the 'html' package in tcllib (this will be added to TclXML RSN, when my workload permits)."
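The DOM route described in the quote above might look roughly like the following. This is only a sketch: it assumes the TclDOM package is installed and loadable as "dom", and the element name, attribute and content are made up for illustration. The special characters in the content are left for the serializer to escape.

```tcl
# Sketch only: needs the TclDOM package, which is not part of core Tcl.
if {[catch {package require dom} err]} {
    puts "TclDOM not available: $err"
} else {
    # Illustrative content with characters that need escaping
    set content {Tom & Jerry <1997>}

    # Build the tree in memory
    set doc [dom::DOMImplementation create]
    set top [dom::document createElement $doc document]
    dom::element setAttribute $top name Jack
    dom::document createTextNode $top $content

    # Let TclDOM produce the well-formed XML text
    puts [dom::DOMImplementation serialize $doc]
    dom::DOMImplementation destroy $doc
}
```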

DKF - If you're going for the cheap-hack method of XML generation mentioned above, you'll want this:

  proc asXML {content {tag document}} {
     set XML_MAP {
        < &lt;
        > &gt;
        & &amp;
        \" &quot;
        ' &apos;
     }
     return <$tag>[string map $XML_MAP $content]</$tag>
  }

Naturally, the XML_MAP variable is factorisable...

MHo: Why not use html::quoteFormValue for this purpose?
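One possible way to factor the map out, as a sketch: the namespace name xmlutil and the proc names escape and wrap are invented here, not taken from any package. The map is stored once in a namespace variable, and escaping is split from wrapping so it can be reused on its own.

```tcl
# Hypothetical names: namespace xmlutil, procs escape and wrap.
namespace eval xmlutil {
    # The map is set up once instead of on every call.
    variable map {
        < &lt;
        > &gt;
        & &amp;
        \" &quot;
        ' &apos;
    }
}

# Escape the five characters that have predefined XML entities.
proc xmlutil::escape {text} {
    variable map
    return [string map $map $text]
}

# Wrap escaped content in a single element, like asXML does.
proc xmlutil::wrap {content {tag document}} {
    return <$tag>[escape $content]</$tag>
}

puts [xmlutil::wrap {Tom & "Jerry" <cat>}]
# prints: <document>Tom &amp; &quot;Jerry&quot; &lt;cat&gt;</document>
```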

For generation of XML (HTML) the pure Tcl way, have a look at the xmlgen module of TclXML on sourceforge: http://sourceforge.net/projects/tclxml/ .


If you want to get particular about entity-encoding arbitrary text, this is working for me:

 variable entityMap [list & &amp\; < &lt\; > &gt\; \" &quot\;\
        \u0000 &#x0\; \u0001 &#x1\; \u0002 &#x2\; \u0003 &#x3\;\
        \u0004 &#x4\; \u0005 &#x5\; \u0006 &#x6\; \u0007 &#x7\;\
        \u0008 &#x8\; \u000b &#xB\; \u000c &#xC\; \u000d &#xD\;\
        \u000e &#xE\; \u000f &#xF\; \u0010 &#x10\; \u0011 &#x11\;\
        \u0012 &#x12\; \u0013 &#x13\; \u0014 &#x14\; \u0015 &#x15\;\
        \u0016 &#x16\; \u0017 &#x17\; \u0018 &#x18\; \u0019 &#x19\;\
        \u001A &#x1A\; \u001B &#x1B\; \u001C &#x1C\; \u001D &#x1D\;\
        \u001E &#x1E\; \u001F &#x1F\;]

 proc entityEncode {text} {
    variable entityMap
    return [string map $entityMap $text]
 }

Notice that \t and \n are dropped from the map, as those are acceptable characters; \r (\u000d) is still mapped, to &#xD;. DG
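The same kind of table can also be generated instead of typed out. The following is a sketch under the assumption that the goal is the list as written above: it skips tab (9) and newline (10) and keeps every other code below 32. The names buildEntityMap, generatedMap and entityEncodeGenerated are invented for this example.

```tcl
# Build a string-map list: the four markup characters plus the
# control characters 0..31, except tab (9) and newline (10).
proc buildEntityMap {} {
    set map [list & {&amp;} < {&lt;} > {&gt;} \" {&quot;}]
    for {set i 0} {$i < 32} {incr i} {
        if {$i == 9 || $i == 10} continue
        lappend map [format %c $i] [format "&#x%X;" $i]
    }
    return $map
}

variable generatedMap [buildEntityMap]

proc entityEncodeGenerated {text} {
    variable generatedMap
    return [string map $generatedMap $text]
}

# The tab between c and d passes through unchanged.
puts [entityEncodeGenerated "a<b\u0001c\td"]
```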


 What: xml2rfc
 Where: http://xml.resource.org/ 
        http://www.ietf.org/rfc/rfc2629.txt 
 Description: A tool that converts XML source into ASCII, HTML, or nroff
        format.  Intended for support of RFC 2629.  On the above web
        page is both a CGI for converting an XML file into the various
        formats, as well as links to the conversion tool itself.  The
        tool itself includes a Tcl/TclXML tool.
 Updated: 11/2001 
 Contact: See web site

It's remarkable that there are two reasonably well-supported XML editors written (mostly) in Tcl: waX Me Lyrical (WAX), which replaces the earlier Swish in the TclXML project, and xe, maintained as part of tDOM.

de: With all respect, xe isn't an XML editor. It's an XML query tool (query language is XPath). - RS: See starDOM for a simple tDOM-based browser that allows editing, reparsing and validating XML source.


XML-RPC -- TclSOAP


RS notes that Internet Explorer makes for a convenient utility to confirm that an XML document is well-formed (although not necessarily valid). Now (since Fall 2002) he only uses starDOM, because of speed and scriptability ;-)

de: IE is useful, to some degree (up to a few MByte XML data size), as an XML Viewer, because it displays the XML document in a tree-like structure.

If you need XML validation, I recommend rxp http://www.ltg.ed.ac.uk/~richard/rxp.html . It avoids any Java installation hassle (and the start-up time of the Java virtual machine), is open source, and runs on every relevant OS; an MS platform binary is available if you need one. It is very conformant and mature, and it is the fastest among the more common validating XML parsers. Since rxp is a command-line application, it is easily usable from a Tcl program via exec.
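Driving a command-line validator through exec could be sketched as follows. The proc name validateWith and the file name doc.xml are made up, and the rxp options shown (-V to validate, -s to suppress normal output) are an assumption to check against the rxp documentation. The sketch relies on exec raising an error when the program exits non-zero or writes to stderr, and needs Tcl 8.5 for {*} and lassign.

```tcl
# Run an external validator command on a file.
# Returns a two-element list: 1/0 for success/failure, then the output.
proc validateWith {validatorCmd file} {
    if {[catch {exec {*}$validatorCmd $file} output]} {
        return [list 0 $output]
    }
    return [list 1 $output]
}

# Assumed invocation; adjust the options to your rxp version.
lassign [validateWith {rxp -V -s} doc.xml] ok msg
if {$ok} {
    puts "doc.xml passed"
} else {
    puts "validation failed: $msg"
}
```

A missing rxp binary or a missing file also ends up in the failure branch, with the reason in the message.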

If you insist on doing XML validation with a Tcl extension, there are only two (and maybe a half) options:

Newer tDOM distributions include a validation extension, tnc, which is usable for both SAX and DOM processing. It's pretty fast (even faster than rxp).

Xerces-C++ is, among other things, a validating XML parser. Some time ago Steve Ball started to wrap it as a Tcl extension: http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/tclxml/xercessax/ More recently, Steve Ball wrote on the TclXML list: "I never got the Xerces-C++ wrapper working, but instead I've got a working libxml2 wrapper for TclDOM. At the moment you need to checkout the CVS development tree to get access to it." libxml2 also includes a validating XML parser.

And the half option? Well, it should be doable to use one of the various Java XML parsers with tclblend. I strongly recommend sticking with one of the options above. But if you are a tclblend hero and figure it out, I would be interested in the exact steps.


Joe English in c.l.t: What I usually do to get indented XML is to generate whitespace *inside* the tags, like so:

    <foo x='a' y='b' z='c'
      ><bar
          ><baz>stuff</baz
          ><qux>stuff</qux   ></bar
    ></foo>

This style looks a little weird at first, but it's the most reliable way to "pretty-print" XML without changing the content.
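A generator for this style could be sketched in Tcl like this. The tree representation (each node a list of name, children, text) and the proc name renderInside are invented for the example; attributes are left out and the text is not entity-encoded. Each element is rendered without its final ">", which the caller then places after the line break and indentation, so no whitespace ever lands in the content.

```tcl
# Render a node as XML with all line breaks and indentation placed
# inside the tags. The result lacks the final ">" of the end tag;
# the caller adds it after its own whitespace.
proc renderInside {node {depth 0}} {
    lassign $node name children text
    if {[llength $children] == 0} {
        return "<$name>$text</$name"
    }
    set pad [string repeat "  " $depth]
    set out "<$name"
    foreach child $children {
        append out "\n$pad  >" [renderInside $child [expr {$depth + 1}]]
    }
    append out "\n$pad></$name"
    return $out
}

set tree {foo {{bar {{baz {} stuff} {qux {} stuff}}}}}
puts [renderInside $tree]>
```

For the tree above this prints foo, bar and the two leaves on separate lines, each continuation line starting with the ">" that closes the previous tag.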


See also A little XML browser using tDOM and BWidgets' Tree, and its refinement starDOM - A little XML parser in pure Tcl


The Perl/Tk folks have written an XML viewer [L4 ].


Overheard in the Tcl chatroom: "Cameron Laird: XML is the moral equivalent of ASCII. 'Wouldn't want to leave home without it; 'scares me that managers think it's a big deal." CL adds, some weeks later: It continues to surprise me how many developers I encounter who tell me they've been instructed to backstitch XML into working applications for no functional reason.


A cute entity encoder when producing XML from arbitrary text:

 interp alias {} xesc {} string map {< &lt; > &gt; & &amp;} ;# RS

A cute XML generator (sorry, no attributes, no entities):

 proc < {name args} {return <$name>[join $args ""]</$name>\n}  ;#RS

You can control the tree structure by the nesting of the calls to "<" (here using the auto-indentation of emacs):

 < root \
    [< branch 1 \
         [< leaf 1] \
         [< leaf 2]] \
    [< branch 2 \
         [< leaf 3] \
         [< leaf 4]]

produces this semi-prettyprint:

 <root><branch>1<leaf>1</leaf>
 <leaf>2</leaf>
 </branch>
 <branch>2<leaf>3</leaf>
 <leaf>4</leaf>
 </branch>
 </root>

Another variation, again by RS:

 proc < {element {value ""} {attributes {}}} {
   set res <$element
   foreach {att attval} $attributes {append res " $att='$attval'"}
   if {$value eq ""} {
       append res " />"
   } else {
       append res >[string map {& &amp; < &lt; > &gt;} $value]</$element>
   }
 }
 % < try "this is <test> & value" {lang EN}
 <try lang='EN'>this is &lt;test&gt; &amp; value</try>
 % < try "" {lang EN}
 <try lang='EN' />
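The variation above encodes the element value but inserts attribute values as they are, so a value containing ' or & would break the output. A sketch of one way to handle that follows; the proc is renamed to the made-up xmlElement to keep the example separate, and the explicit return is added only for clarity.

```tcl
proc xmlElement {element {value ""} {attributes {}}} {
    set res <$element
    foreach {att attval} $attributes {
        # Escape &, < and the single quote used as the delimiter.
        set safe [string map {& &amp; < &lt; ' &apos;} $attval]
        append res " $att='$safe'"
    }
    if {$value eq ""} {
        append res " />"
    } else {
        append res >[string map {& &amp; < &lt; > &gt;} $value]</$element>
    }
    return $res
}

puts [xmlElement note "" {title "Tom's & Co"}]
# prints: <note title='Tom&apos;s &amp; Co' />
```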

LV In the news: [L5 ] is an article about a company with a couple of patents that they claim are infringed upon by use of XML. They are working out an agreement with a firm that will handle contacting anyone using XML to collect licensing fees... so far, they've contacted 47 companies. RLH Yes, but there is prior art, so I hope those companies fight it.


EKB Long ago (before I knew about this Wiki), I wrote xtt: XML <--> Text Tag translator. I just rediscovered it in my files, so I'm sharing it with the world.


XML tutorials xmi2txt