sed and matching braces in the source string

I have a string in the following format: ALTER TABLE QuestionnaireCategory ADD CONSTRAINT DF_QuestionaireCategory_QuestionaireCategoryID DEFAULT ("123") FOR QuestionnaireCategoryID That I need to convert to: ALTER TABLE QuestionnaireCategory ALTER QuestionnaireCategoryID SET DEFAULT ("123") The former being MSSQL format and the latter being postgresql. I have a sed script that does it and works great except for when the DEFAULT value itself contains braces, eg: DEFAULT (newid()) The problem is that my sed script says to start at the "(" and then read up until a ")", but I really mean to say read up until a matching ")". Can I do this with sed or should I be using something else? Thanks James

James Harper <james.harper@bendigoit.com.au> wrote:
The problem is that my sed script says to start at the "(" and then read up until a ")", but I really mean to say read up until a matching ")". Can I do this with sed or should I be using something else?
If you can be sure it's the last quote on the line that you want to match, you could match, say, "[^"]*$ I understand that there are fundamental reasons related to finite-state automata that explain why a regular expression can't find matching quotation marks, parentheses, opening and closing XML tags, etc. Unfortunately I don't have access to a book on the subject, and I'm not a computer science graduate (computer science graduates are supposed to have learned this material - if not, then there's something amiss in the computer science course). I would still like to learn it though.

Jason White wrote:
James Harper <james.harper@bendigoit.com.au> wrote:
The problem is that my sed script says to start at the "(" and then read up until a ")", but I really mean to say read up until a matching ")". Can I do this with sed or should I be using something else?
I understand that there are fundamental reasons related to finite-state automata that explain why a regular expression can't find matching quotation marks, parentheses, opening and closing XML tags, etc.
Correct. Matching parens/braces/anything are irregular, but they are context-free. Thus you cannot use a regular expression, but you can use an parser that accepts some subset of CFGs, such as LALR (e.g. yacc) or LL(k) (e.g. parsec). For shell scripting with XML, this is best achieved with xmlstarlet (which uses XSLT/XPATH under the hood). For HTML, turn it into XHTML first with tidy. I don't have a good solution for *correctly* dealing with anything less formal in a shell script -- usually I write a "good enough" regexp, or switch to a "real" programming language.
I would still like to learn it though.
Here is a reading list for you. https://en.wikipedia.org/wiki/Regular_language https://en.wikipedia.org/wiki/Kleene_star https://en.wikipedia.org/wiki/Deterministic_finite_automaton (as an equivalence) https://en.wikipedia.org/wiki/Context-free_grammar https://en.wikipedia.org/wiki/LL(k) https://en.wikipedia.org/wiki/LALR PS: note that many "regexp" libraries are super-regular; they can recognize more than just regular languages. In particular, a match like (.)b\1 that will match aba but not abb, is no longer regular (I think).

Warning: Simplest method presented last. On 09.08.12 14:13, Trent W. Buck wrote:
Jason White wrote:
James Harper <james.harper@bendigoit.com.au> wrote:
The problem is that my sed script says to start at the "(" and then read up until a ")", but I really mean to say read up until a matching ")". Can I do this with sed or should I be using something else? ...
Thus you cannot use a regular expression, but you can use an parser that accepts some subset of CFGs, such as LALR (e.g. yacc) or LL(k) (e.g. parsec).
Just modelling the mechanics mentally, it seems that it might be quicker to implement with just a lexer (e.g. lex) to pick out the open and closing braces, then increment/decrement a counter (in the lexer) till a nesting level match is detected, signalling the end of the token. Hacking the other text rearrangements in C, either in the lexer or in a grammar, can be a bit clumsy, though that is moderated by suitable lexer regexes, possibly with the help of lexer states. However, if the text really is as presented, then KISS ought to do it. In awk, the line is by default seen as space-separated fields, so newid() or (newid()) or ((newid(Ooh)de)elephants!) is always detected as one field, making braces nesting irrelevant. If at some stage, input text with random spaces, e.g. "(newid ( ) )" is encountered, then it is a simple matter to add a line or two of prefiltering to the awk script, to effect repair. These operations on the line would cause the fields to be automatically re-evaluated, allowing the rearrangements to then be made on complete fields. (As I see it, repair merely involves detection of / +[)(]/ , and elision of the spaces; / +/ ) And if at some stage, arithmetic expressions with spaces crop up, then detecting something along the lines of / +[)(0-9*/+-]/ might cover that use case, still without having to delve into grammars or even lexer gymnastics. The problem looks like fun. :-) Erik -- Withdrawals from Spain in May were 41.3 billion euros and are now up to 163 billion in total. Switzerland, meanwhile, is buying 3 billion euros a day with newly printed Swiss francs to try to keep its currency down because of all the money flooding into Zurich. http://www.abc.net.au/news/2012-08-01/kohler-era-of-deleveraging/4168778

Warning: Simplest method presented last.
On 09.08.12 14:13, Trent W. Buck wrote:
Jason White wrote:
James Harper <james.harper@bendigoit.com.au> wrote:
The problem is that my sed script says to start at the "(" and then read up until a ")", but I really mean to say read up until a matching ")". Can I do this with sed or should I be using something else? ...
Thus you cannot use a regular expression, but you can use an parser that accepts some subset of CFGs, such as LALR (e.g. yacc) or LL(k) (e.g. parsec).
Just modelling the mechanics mentally, it seems that it might be quicker to implement with just a lexer (e.g. lex) to pick out the open and closing braces, then increment/decrement a counter (in the lexer) till a nesting level match is detected, signalling the end of the token. Hacking the other text rearrangements in C, either in the lexer or in a grammar, can be a bit clumsy, though that is moderated by suitable lexer regexes, possibly with the help of lexer states.
However, if the text really is as presented, then KISS ought to do it. In awk, the line is by default seen as space-separated fields, so newid() or (newid()) or ((newid(Ooh)de)elephants!) is always detected as one field, making braces nesting irrelevant.
If at some stage, input text with random spaces, e.g. "(newid ( ) )" is encountered, then it is a simple matter to add a line or two of prefiltering to the awk script, to effect repair. These operations on the line would cause the fields to be automatically re-evaluated, allowing the rearrangements to then be made on complete fields. (As I see it, repair merely involves detection of / +[)(]/ , and elision of the spaces; / +/ )
And if at some stage, arithmetic expressions with spaces crop up, then detecting something along the lines of / +[)(0-9*/+-]/ might cover that use case, still without having to delve into grammars or even lexer gymnastics.
The problem looks like fun. :-)
I've solved it for now within the limited scope of my problem, which are: "DEFAULT (newid()) " "DEFAULT ('some string') " The first never has any spaces, so I can do " *\([^ ]*\) *" The second never has any braces, so I can do " *\(('[^']')\) *" (or something like that) So it's working now and will get me through this conversion, even if it's a bit fragile for general use. Thanks for the suggestions! James

Erik Christiansen wrote:
Warning: Simplest method presented last.
On 09.08.12 14:13, Trent W. Buck wrote:
Jason White wrote:
James Harper <james.harper@bendigoit.com.au> wrote:
The problem is that my sed script says to start at the "(" and then read up until a ")", but I really mean to say read up until a matching ")". Can I do this with sed or should I be using something else? ...
Thus you cannot use a regular expression, but you can use an parser that accepts some subset of CFGs, such as LALR (e.g. yacc) or LL(k) (e.g. parsec).
Just modelling the mechanics mentally, it seems that it might be quicker to implement with just a lexer (e.g. lex) to pick out the open and closing braces [...]
Well, note that my second example lacks the traditional separation of lexing and parsing phases, as seen in lex/yacc. http://research.microsoft.com/en-us/um/people/daan/download/parsec/parsec.pd... In that paper, ยง2.2 (pp. 6) covers a language with matching parens and nothing else: parens = do { char '('; parens; char ')'; parens; } <|> return () A one-liner is pretty quick and simple ;-) PS: there are better references that was simply the first that ddg returned, and it amused me to remind readers that a lot of Haskell comes out of MS Research :-)

On Thu, 09 Aug 2012 12:21:13 +1000, James Harper <james.harper@bendigoit.com.au> wrote:
The problem is that my sed script says to start at the "(" and then read up until a ")", but I really mean to say read up until a matching ")". Can I do this with sed or should I be using something else?
If you mean that you are using a regular expression, then the theoretical answer is that you can't match an arbitrarily nested set of parentheses. This is because a regular expression only has a finite state sequence, whereas nested parentheses can be nested arbitrarily deep (the state machine has no memory apart from the current state). However, if you know in advance the maximum nesting, you can construct a regular expression to do the matching. For example, the following regex will match 1 or 2 levels of parentheses: ([^()]*\(([^()]*)[^()]*\)*) It works by using an inner \(...\)* to match the (optional) inner parentheses. So for example if you give it the string: ab(cd(ef)gh(ij)kl)mn it will match: (cd(ef)gh(ij)kl) Whereas if you give it the string: cd(ef)gh it will match: (ef) It gets rather unreadable and typo prone if you have to be able to nest more than a few times; if so, you might be better off using a parsing language which is more powerful than regular expressions, or making use of some quirk in the input to find the last parenthesis (eg a trailing space or end of line). Glenn -- sks-keyservers.net 0xb1e82ec9228ac090
participants (5)
-
Erik Christiansen
-
Glenn McIntosh
-
James Harper
-
Jason White
-
Trent W. Buck