[L1 ] - NEM This link is dead for me on 2 June 2005. This [L2 ] article from comp.lang.tcl certainly looks relevant, however.

[Wiki page on e-mail addresses]

[different meanings of "regular expressions"]

[Perl disease]

[When REs go wrong]

Regular expression examples


05Apr03 Brian Theado - For XML, I'm guessing the title of this page is referring to one-off regular expressions, but see [L3 ] for a paper describing shallow parsing of XML using only a regular expression. The regular expression is about 30 lines long, but the paper documents it well. The Appendix includes sample implementations in Perl, JavaScript and Flex/Lex, as well as an interactive demo (apparently using the JavaScript implementation). The demo helped me understand what they meant by "shallow parsing". For a Tcl translation, see XML Shallow Parsing with Regular Expressions.
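
For a flavour of the idea (nowhere near the paper's 30-line regexp), here is a much simplified sketch; the pattern below is invented for illustration only and, among other things, breaks on attribute values that contain ">":

   # Split a string into markup and text tokens with a single regexp.
   # Anything between < and > is treated as opaque markup; everything
   # else is text.  This is only a toy illustration of "shallow" parsing.
   proc shallowTokens {xml} {
       regexp -all -inline {<[^>]*>|[^<]+} $xml
   }

   # % shallowTokens {<a href="x">hi</a>}
   # {<a href="x">} hi </a>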


Why are regular expressions not suited for parsing email addresses? "Regular expression to validate e-mail addresses" comments on this.

A few more comments appear in "The Limits to Regular Expressions" [L4 ] and "Regular Expressions Do Not Solve All Problems" [L5 ], themselves descendants of Jamie Zawinski's notorious judgment [L6 ] that REs multiply, rather than solve, problems.


D. McC: OK, so what can you use instead of REs to solve, rather than multiply, problems?

AM In Tcl you have a number of options, depending on what you really want to do:

  • Searching for individual words - consider [lsearch]
  • Searching for particularly simple patterns - consider [string match]
  • Try coming up with simple REs that solve the matching problem to, say, 80 or 90%, and use a second step to get rid of the "false positives"
  • Use a combination of all three
  • If you are trying to match text that spans multiple lines (not uncommon), turn it into one long string first, removing any unnecessary characters (like \ or \n)

That is just a handful of methods; I am sure others can come up with more. A small sketch of how the first three fit together appears below.
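
The data and variable names in this sketch are made up purely for illustration:

   # Hypothetical log lines, used only to demonstrate the commands.
   set lines [list \
       "ERROR disk full on /dev/sda1" \
       "INFO  all systems nominal" \
       "ERROR timeout contacting 10.0.0.7"]

   # Individual words/prefixes: lsearch uses glob matching by default.
   set errors [lsearch -all -inline $lines {ERROR*}]

   # Particularly simple patterns: string match on a single line.
   if {[string match {*disk full*} [lindex $lines 0]]} {
       puts "disk problem reported"
   }

   # A deliberately loose RE first, then a second step to weed out
   # "false positives" (dotted quads with an octet greater than 255).
   set addresses {}
   foreach line [lsearch -all -inline -regexp $lines {\d+\.\d+\.\d+\.\d+}] {
       regexp {\d+\.\d+\.\d+\.\d+} $line ip
       set ok 1
       foreach octet [split $ip .] {
           if {$octet > 255} {set ok 0}
       }
       if {$ok} {lappend addresses $ip}
   }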

DKF: For XML and HTML, use a proper parser to build a DOM tree. For email addresses, do a cheap hack that handles 99.999% of the cases seen in practice. :^)
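
For example, a deliberately cheap pattern along these lines (an illustration only, not a validator: it accepts plenty of technically invalid addresses and rejects some exotic but legal ones):

   # Good enough for the common case: something@something.tld
   proc looksLikeEmail {addr} {
       regexp {^[^@\s]+@[^@\s]+\.[A-Za-z]{2,}$} $addr
   }

   # % looksLikeEmail jane.doe@example.org
   # 1
   # % looksLikeEmail "not an address"
   # 0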

NEM: I'm not sure if [lsearch] or [string match] would be the way to go if [regexp] wasn't good enough. The direction I'd go in would be to use one of the many parser generators available for Tcl (e.g. I've heard good things about taccle), or check out some of the tools in tcllib (look at the grammar_fa stuff by AKU). Or, you could roll your own parser using recursive descent. At some point soon, I'd like to experiment with parser combinators [L7 ], which look great. Note that most of these techniques probably make use of regexps as part of the solution. However, regular expressions, in their most basic form, can only recognise regular languages (see [L8 ] for a description of the Chomsky language hierarchy), but often the language that needs to be parsed is context-free or context-sensitive (XML is context-free, IIRC).
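
To give a flavour of the roll-your-own route, here is a toy recursive descent parser, invented for this page (it is not taken from taccle, tcllib or any other package mentioned above). The only regexp is in the lexer; the grammar structure lives in ordinary procs:

   # A toy recursive descent evaluator for expressions such as 1+2*(3+4),
   # written only as an illustration of the technique.  Grammar:
   #   expression ::= term   { "+" term }
   #   term       ::= factor { "*" factor }
   #   factor     ::= number | "(" expression ")"
   namespace eval rd {
       variable tokens {}
       variable pos 0

       # The only regexp is in the lexer: split the input into numbers,
       # "+", "*" and parentheses; any other characters (e.g. whitespace)
       # are simply skipped.
       proc parse {s} {
           variable tokens [regexp -all -inline {\d+|[+*()]} $s]
           variable pos 0
           set v [expression]
           if {$pos < [llength $tokens]} {
               error "unexpected token: [lindex $tokens $pos]"
           }
           return $v
       }
       proc peek {} {
           variable tokens; variable pos
           lindex $tokens $pos
       }
       proc next {} {
           variable tokens; variable pos
           set t [lindex $tokens $pos]
           incr pos
           return $t
       }
       proc expression {} {
           set v [term]
           while {[peek] eq "+"} {
               next
               set v [expr {$v + [term]}]
           }
           return $v
       }
       proc term {} {
           set v [factor]
           while {[peek] eq "*"} {
               next
               set v [expr {$v * [factor]}]
           }
           return $v
       }
       proc factor {} {
           set t [next]
           if {$t eq "("} {
               set v [expression]
               if {[next] ne ")"} {error "expected )"}
               return $v
           }
           if {![string is integer -strict $t]} {
               error "expected a number, got \"$t\""
           }
           return $t
       }
   }

   # % rd::parse "1+2*(3+4)"
   # 15

Swapping the arithmetic for tree building gives the usual shape of a hand-written parser: one proc per grammar rule, each consuming tokens and calling the procs for its sub-rules.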

AM I use [lsearch] and [string match] for identifying lines of interest - quite often the first thing you need to do. I do not intend them as replacements for splitting up the text into smaller pieces...


CMcC It should be realised that the shallow parser in XML Shallow Parsing with Regular Expressions treats attributes and DOCTYPE as opaque - it doesn't attempt to parse them. The regexp parser therefore doesn't parse XML, but rather a simpler language that looks a lot like XML (and is for all practical purposes largely equivalent).

Above, NEM speaks, correctly, of parsing as identical in meaning to recognising. It is important to remember, however, that very few people are interested merely in answering the question "is this string a valid XML document?"

Most of what we mean when we speak of parsing is actually translation - we wish to take an XML DI (document instance), or a putative one, and transform it into something else with which one can directly work, for example a tree. It may be possible to translate an XML DI into something without ever recognising it. That translations and translators often include a parser, or something we call a parser, does not mean that all translation is predicated on parsing.

It may be that a process which does not parse XML (in that it fails to recognise that a given string does not conform to the XML grammar) nonetheless produces a coherent, isomorphic, structure- and semantic-preserving translation for those strings which happen to conform to the XML grammar.

As the paper referenced in the shallow parsing page points out, this behaviour may actually be more desirable than parsing per se.

Nothing I've written here should be taken to mean that I think unrestricted, undisciplined, indiscriminate, careless or naive use of regexps is a good thing. None of those adjectives describe the shallow XML parser used as a translator, so I feel the title of this page is insufficiently nuanced.

NEM I disagree with the argument that you can do a meaningful translation without doing any parsing. The XML shallow parser is a parser, just not a complete one. Perhaps that is sufficient for many problems, but it is still parsing. You have to recognise something in order to perform any meaningful operation on some data. Now, you can go far with regular expressions, and as implemented in most languages they are more powerful than the regular expressions of formal language theory. However, a brief look over the XML shallow parser source code reveals to me that using solely regexps for any substantial pattern matching is not a good idea, which is what this page is all about (not that I started this page). Perhaps we should reverse the question: what are considered to be the advantages of using regexps for this sort of work? Certainly not simplicity or code clarity, I'd guess.