Re: sed and matching braces in the source string

9 Aug 2012

      Jason White wrote:
...
James Harper <james.harper@bendigoit.com.au> wrote:
...
The problem is that my sed script says to start at the "(" and then
read up until a ")", but I really mean to say read up until a
matching ")". Can I do this with sed or should I be using something
else?

I understand that there are fundamental reasons related to
finite-state automata that explain why a regular expression can't
find matching quotation marks, parentheses, opening and closing XML
tags, etc.

Correct.

Languages of matched parens/braces/anything are not regular, but they
are context-free.

Thus you cannot use a regular expression, but you can use a parser
that accepts some subset of CFGs, such as LALR (e.g. yacc) or LL(k)
(e.g. parsec).
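
For the simple case of one kind of bracket, a full parser generator is
not needed; a depth counter (the degenerate form of the stack a
context-free parser keeps) is enough.  A minimal sketch in Python,
assuming the input has no quoted strings or escaped parens; the
function names are made up for illustration:

```python
def find_matching_paren(text, start=0, open_ch="(", close_ch=")"):
    """Return the index of the close_ch that balances the first open_ch
    found at or after start, or -1 if there is none."""
    begin = text.find(open_ch, start)
    if begin == -1:
        return -1
    depth = 0
    for i in range(begin, len(text)):
        ch = text[i]
        if ch == open_ch:
            depth += 1          # one level deeper
        elif ch == close_ch:
            depth -= 1          # one level back out
            if depth == 0:
                return i        # this is the matching close
    return -1                   # ran out of input: unbalanced


def extract_balanced(text, start=0):
    """Return the first balanced "(...)" group, parens included, or None."""
    begin = text.find("(", start)
    end = find_matching_paren(text, start)
    if begin == -1 or end == -1:
        return None
    return text[begin:end + 1]
```

The counter is exactly the piece of state a finite automaton lacks: it
can grow without bound as the nesting gets deeper.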

For shell scripting with XML, this is best achieved with xmlstarlet
(which uses XSLT/XPath under the hood).  For HTML, turn it into XHTML
first with tidy.  I don't have a good solution for *correctly* dealing
with anything less formal in a shell script -- usually I write a "good
enough" regexp, or switch to a "real" programming language.
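
As an illustration of the "real" programming language route (not of
xmlstarlet itself), here is a small sketch using Python's standard
xml.etree.ElementTree, which understands a limited subset of XPath.
The sample document and the item_names helper are invented for the
example:

```python
import xml.etree.ElementTree as ET

# Hypothetical input: note the nested <b> inside <name>, the kind of
# thing a regexp on "<name>...</name>" tends to get wrong.
SAMPLE = """<list>
  <item id="1"><name>outer <b>bold</b> text</name></item>
  <item id="2"><name>second</name></item>
</list>"""


def item_names(xml_text):
    """Map each item's id attribute to the full text of its <name>."""
    root = ET.fromstring(xml_text)
    names = {}
    # "./item" is a path in ElementTree's limited XPath subset.
    for item in root.findall("./item"):
        name = item.find("name")
        # itertext() walks nested elements, so inner tags are handled.
        names[item.get("id")] = "".join(name.itertext())
    return names
```

The parser tracks the open and close tags, so nesting inside <name>
needs no special handling in the script.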
...
I would still like to learn it though.

Here is a reading list for you.

https://en.wikipedia.org/wiki/Regular_language
https://en.wikipedia.org/wiki/Kleene_star
https://en.wikipedia.org/wiki/Deterministic_finite_automaton (as an equivalence)

https://en.wikipedia.org/wiki/Context-free_grammar
https://en.wikipedia.org/wiki/LL(k)
https://en.wikipedia.org/wiki/LALR

PS: note that many "regexp" libraries are super-regular; they can
recognize more than just regular languages.  In particular, a pattern
with a backreference like (.+)b\1, which matches aba but not abb, is
no longer regular (I think).
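
The backreference behaviour is easy to check with Python's re module.
A small sketch with two patterns of the kind described above: (.)b\1
with a single-character group, and (.+)b\1 with an arbitrarily long
group.  The matches helper is just for the example:

```python
import re

SINGLE = re.compile(r"(.)b\1")   # one character, "b", same character again
MULTI = re.compile(r"(.+)b\1")   # any non-empty string, "b", same string again


def matches(pattern, text):
    """True if the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None
```

With these, matches(SINGLE, "aba") is true and matches(SINGLE, "abb")
is false.  The second pattern is where the argument against regularity
is clearest: to accept "aaabaaa" but reject "aaabaa", the matcher has
to remember a prefix of unbounded length.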

Trent W. Buck