
Jason White wrote:
James Harper <james.harper@bendigoit.com.au> wrote:
The problem is that my sed script says to start at the "(" and then read up until a ")", but I really mean to say read up until a matching ")". Can I do this with sed or should I be using something else?
I understand that there are fundamental reasons related to finite-state automata that explain why a regular expression can't find matching quotation marks, parentheses, opening and closing XML tags, etc.
Correct. Languages of matching parens/braces/anything are not regular, but they are context-free. Thus you cannot use a regular expression, but you can use a parser that accepts some subset of CFGs, such as LALR (e.g. yacc) or LL(k) (e.g. parsec). For shell scripting with XML, this is best achieved with xmlstarlet (which uses XSLT/XPath under the hood). For HTML, turn it into XHTML first with tidy. I don't have a good solution for *correctly* dealing with anything less formal in a shell script; usually I write a "good enough" regexp, or switch to a "real" programming language.
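To make the "real programming language" option concrete, here is a minimal sketch in Python, assuming the input is plain text with no quoting or escaping rules for parentheses. The function name match_paren is made up for this illustration. It keeps a nesting depth counter, which is exactly the unbounded memory a finite automaton lacks, and contrasts the result with a non-greedy regexp that stops at the first ")".

```python
import re


def match_paren(text, start=0):
    """Return the substring from the first '(' at or after `start`
    through its matching ')', or None if there is no balanced match.

    Hypothetical helper for illustration; it ignores quoting and escapes.
    """
    open_idx = text.find("(", start)
    if open_idx == -1:
        return None  # no opening parenthesis at all
    depth = 0
    for i in range(open_idx, len(text)):
        ch = text[i]
        if ch == "(":
            depth += 1  # one level deeper
        elif ch == ")":
            depth -= 1  # one level back out
            if depth == 0:
                return text[open_idx:i + 1]
    return None  # ran out of input before the parens balanced


def first_close_regex(text):
    """What a naive non-greedy regexp does: stop at the first ')'."""
    m = re.search(r"\(.*?\)", text)
    return m.group(0) if m else None


if __name__ == "__main__":
    sample = "f(a(b)c)d"
    print(match_paren(sample))        # balanced span
    print(first_close_regex(sample))  # truncated at the first ')'
```

The counter is enough here because there is only one kind of bracket; mixing (), [] and {} would need a stack instead.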
I would still like to learn it though.
Here is a reading list for you:

https://en.wikipedia.org/wiki/Regular_language
https://en.wikipedia.org/wiki/Kleene_star
https://en.wikipedia.org/wiki/Deterministic_finite_automaton (as an equivalence)
https://en.wikipedia.org/wiki/Context-free_grammar
https://en.wikipedia.org/wiki/LL(k)
https://en.wikipedia.org/wiki/LALR

PS: note that many "regexp" libraries are super-regular; they can recognize more than just regular languages. In particular, backreferences go beyond regular expressions proper: a pattern like (.)b\1 matches aba but not abb. Strictly speaking, that one pattern still describes a finite (hence regular) language over a fixed alphabet, but the general form (.+)b\1, which requires an arbitrarily long string to be repeated, is no longer regular.
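The backreference behaviour is easy to check with Python's re module, which supports \1. This is only a small sketch; the helper name matches is invented for the example, and fullmatch is used so the whole string has to fit the pattern.

```python
import re


def matches(pattern, text):
    """True if the whole of `text` matches `pattern` (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    # Single-character backreference: the third character must equal the first.
    print(matches(r"(.)b\1", "aba"))  # True
    print(matches(r"(.)b\1", "abb"))  # False

    # Arbitrary-length backreference: the text after 'b' must repeat
    # whatever group 1 captured, however long that was.
    print(matches(r"(.+)b\1", "xyzbxyz"))  # True
    print(matches(r"(.+)b\1", "xyzbxyw"))  # False
```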