[http://groups.google.com/groups?th=4bbebbb242ec1e1e] - [NEM] This link is dead for me on 2 June, 2005. This [http://groups-beta.google.com/group/comp.lang.tcl/browse_thread/thread/cf0ae7f9cba4c0df/021f15cfa5b61862?q=regexp+xml&rnum=6#021f15cfa5b61862] article from comp.lang.tcl certainly looks relevant, however.

[[Wiki page on e-mail addresses]]

[[different meanings of "[regular expressions]"]]

[[ [Perl] disease]

[[When REs go wrong]]

[Regular expression examples]

----

05Apr03 [Brian Theado] - For XML, I'm guessing the title of this page is referring to one-off regular expressions, but see [http://www.cs.sfu.ca/~cameron/REX.html] for a paper describing shallow parsing of [XML] using only a regular expression. The regular expression is about 30 lines long, but the paper documents it well. The Appendix includes sample implementations in [Perl], Javascript and Flex/Lex. The Appendix also includes an interactive demo (apparently using the [Javascript] implementation). The demo helped me understand what they meant by "shallow parsing". For a [Tcl] translation, see [XML Shallow Parsing with Regular Expressions].

----

Why are regular expressions not suited for parsing email addresses? "[Regular expression to validate e-mail addresses]" comments on this. A few more comments appear in "The Limits to Regular Expressions" [http://www.unixreview.com/documents/s=2472/uni1037388368795/] and "Regular Expressions Do Not Solve All Problems" [http://informit.com/articles/article.asp?p=102171&redir=1], themselves descendants of Jamie Zawinski's notorious judgment [http://slashdot.org/comments.pl?sid=19607&cid=1871619] that REs multiply, rather than solve, problems.

----

[D. McC]: OK, so what can you use instead of REs to solve, rather than multiply, problems?

[AM] In Tcl you have a number of options, depending on what you really want to do:

   * Searching for individual words - consider [[lsearch]]
   * Searching for particularly simple patterns - consider [[string match]]
   * Try coming up with simple REs that solve the matching problem to, say, 80 or 90% and use a second step to get rid of the "false positives"
   * Use a combination of all three
   * If you are trying to match text that spans multiple lines (which is not uncommon), turn it into one long string first, removing any unnecessary characters (like \ or \n)

That is just a handful of methods. I am sure others can come up with more.

[DKF]: For [XML] and [HTML], use a proper parser to build a [DOM] tree. For email addresses, do a cheap hack that handles the 99.999% of cases seen in practice. :^)

[NEM]: I'm not sure if [[lsearch]] or [[string match]] would be the way to go if [[regexp]] wasn't good enough. The direction I'd go in would be to use one of the many parser generators available for Tcl (e.g. I've heard good things about [taccle]), or check out some of the tools in tcllib (look at the [grammar_fa] stuff by [AKU]). Or, you could roll your own [parser using recursive descent]. At some point soon, I'd like to experiment with parser combinators [http://www.cs.nott.ac.uk/Department/Techreports/96-4.html], which look great. Note that most of these techniques probably make use of regexps as part of the solution. However, regular expressions, in their most basic form, can only recognise regular grammars (see [http://en.wikipedia.org/wiki/Chomsky_hierarchy] for a description of the Chomsky language hierarchy), whereas what needs to be parsed is often context-free or context-sensitive (XML is context-free, IIRC).

[AM] I use [[lsearch]] and [[string match]] for identifying lines of interest - quite often the first thing you need to do. I do not intend them as replacements for splitting up the text into smaller pieces...
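
A minimal sketch of the options [AM] lists above, for readers who want to see them side by side. The sample lines and the proc names (`hasWord`, `looksLikeError`, `datedLines`, `flatten`) are invented for this illustration and come from no package; the date check is deliberately crude and only stands in for whatever "second step" a real task would need.

```tcl
# Hypothetical sample input, invented for this illustration.
set lines [list \
    "2005-06-02 ERROR disk full" \
    "2005-06-02 INFO  ERRORS were logged earlier" \
    "2005-13-45 ERROR bogus date" \
    "2005-06-03 WARN  retrying"]

# 1. Individual words: split the line into a list and use lsearch.
proc hasWord {line word} {
    expr {[lsearch -exact [split $line] $word] >= 0}
}

# 2. Simple patterns: glob-style matching with string match.
proc looksLikeError {line} {
    string match "* ERROR *" $line
}

# 3. Two steps: a loose RE picks candidates, then plain Tcl weeds out
#    the false positives (here: impossible month or day numbers).
proc datedLines {lines} {
    set result {}
    foreach line $lines {
        if {![regexp {^(\d{4})-(\d\d)-(\d\d) } $line -> y m d]} continue
        # scan avoids the octal trap for values such as "08".
        scan $m %d month
        scan $d %d day
        if {$month >= 1 && $month <= 12 && $day >= 1 && $day <= 31} {
            lappend result $line
        }
    }
    return $result
}

# 4. Text spanning several lines: flatten it into one string first,
#    replacing backslash-newline continuations and bare newlines.
proc flatten {text} {
    string map [list "\\\n" " " "\n" " "] $text
}
```

With the sample data, `hasWord` rejects the second line (it contains ERRORS, not ERROR), and `datedLines $lines` drops the third line because the loose RE accepts month 13 but the follow-up check does not.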
----

[CMcC] It should be realised that the shallow parser in [XML Shallow Parsing with Regular Expressions] treats attributes and DOCTYPE as opaque - it doesn't attempt to parse them, and therefore the regexp parser doesn't parse XML, but rather a simpler language that looks a lot like XML (and is for all practical purposes largely equivalent.)

Above, [NEM] speaks, correctly, of ''parsing'' as identical in meaning to ''recognising''. It is important to remember, however, that very few people are interested merely in answering the question ''is this string a valid XML document?''. Most of what we mean when we speak of ''parsing'' is actually translation - we wish to take an XML DI (or a putative XML DI) and transform it into something else with which one can directly work, for example a tree.

It may be possible to translate an XML DI into something without ever recognising it. That translation and translators often include a parser, or something we call a parser, does not mean that all translation is predicated on parsing.

It may be that a process which does not ''parse'' XML (in that it fails to recognise that a given string does not conform to the XML grammar) nonetheless produces a coherent, isomorphic, structure- and semantic-preserving translation for those strings which happen to conform to the XML grammar. As the paper referenced in the shallow parsing page points out, this behaviour may actually be more desirable than parsing per se.

Nothing I've written here should be taken to mean that I think unrestricted, undisciplined, indiscriminate, careless or naive use of regexps is a good thing. None of those adjectives describes the shallow XML parser used as a translator, so I feel the title of this page is insufficiently nuanced.

[NEM] I disagree with the argument that you can do a meaningful translation without doing any parsing. The XML shallow parser ''is'' a parser, just not a complete one. Perhaps that is sufficient for many problems, but it is still parsing.
You have to recognise ''something'' in order to perform ''any'' meaningful operation on some data. Now, you can go far with regular expressions, and as implemented in most languages they are more powerful than the regular expressions of formal language theory. However, a brief look over the XML shallow parser source code reveals, to me, that using solely regexps for any substantial pattern matching is ''not a good idea'', which is what this page is all about (not that I started this page). Perhaps we should reverse the question: what are considered to be the advantages of using regexps for this sort of work? Certainly not simplicity or code clarity, I'd guess.

[CMcC] The argument is rather that one can do meaningful translation of XML DIs without parsing XML. The XML shallow parser is a complete parser, just not of XML. I don't know that the contention 'you have to recognise something in order to perform a meaningful operation' is well founded in all its generality (consider [[incr x]].) I think, to be fair, that a look over the shallow parser needs to be balanced against a similar perusal of a pure-Tcl and a C implementation. It's certainly shorter than them, which leads me to an attempted answer to your question: it may be that a regexp-based translation runs faster than a pure-Tcl, less regexp'd translation.
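
To make the two positions above concrete, here is a minimal sketch, and emphatically ''not'' the REX expression or the code on [XML Shallow Parsing with Regular Expressions]. It assumes that no ">" occurs inside attribute values, comments or CDATA sections, which real XML allows. The proc names `tokens` and `wellNested` are invented for the illustration. One small regexp splits the input into markup and text without recognising XML as such; a stack then checks element nesting, which is the part a regular grammar cannot express.

```tcl
# Split text into markup and character-data tokens with one small RE.
# Assumption: no ">" inside attribute values, comments, or CDATA;
# the REX expression handles those cases, this one does not.
proc tokens {xml} {
    regexp -all -inline {<[^>]*>|[^<]+} $xml
}

# Check that element tags nest properly. The REs only classify each
# token; the stack tracks nesting to arbitrary depth.
proc wellNested {xml} {
    set stack {}
    foreach tok [tokens $xml] {
        if {[regexp {^</([^\s>]+)\s*>$} $tok -> name]} {
            # End tag: must match the most recently opened element.
            if {[lindex $stack end] ne $name} { return 0 }
            set stack [lrange $stack 0 end-1]
        } elseif {[regexp {^<([^\s/>!?]+)} $tok -> name]} {
            # Start tag; an empty-element tag such as <br/> opens nothing.
            if {![string match "*/>" $tok]} { lappend stack $name }
        }
        # Text, comments, and declarations are passed over unexamined.
    }
    expr {[llength $stack] == 0}
}
```

`tokens` alone is enough for a translation that merely re-emits or filters the pieces; `wellNested` shows where something beyond a single regexp comes in once nesting has to be recognised.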