
Warning: Simplest method presented last.
On 09.08.12 14:13, Trent W. Buck wrote:
Jason White wrote:
James Harper <james.harper@bendigoit.com.au> wrote:
The problem is that my sed script says to start at the "(" and then read up until a ")", but I really mean to say read up until a matching ")". Can I do this with sed or should I be using something else? ...
Thus you cannot use a regular expression, but you can use a parser that accepts some subset of CFGs, such as LALR (e.g. yacc) or LL(k) (e.g. parsec).
Just modelling the mechanics mentally, it seems that it might be quicker to implement with just a lexer (e.g. lex) to pick out the opening and closing parentheses, then increment/decrement a counter (in the lexer) until the nesting level returns to where it started, signalling the end of the token. Hacking the other text rearrangements in C, either in the lexer or in a grammar, can be a bit clumsy, though that is moderated by suitable lexer regexes, possibly with the help of lexer states.
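The counter idea can be sketched without lex at all. The following is a minimal Python illustration of the mechanics, not a lex specification; the function name match_paren and the sample line are made up for the example.

```python
def match_paren(text, start):
    """Return the index of the ")" matching the "(" at text[start].

    Returns -1 if the parentheses never balance.
    """
    if text[start] != "(":
        raise ValueError("text[start] must be an opening parenthesis")
    depth = 0
    for i in range(start, len(text)):
        ch = text[i]
        if ch == "(":
            depth += 1      # one level deeper
        elif ch == ")":
            depth -= 1      # one level back out
            if depth == 0:  # nesting level is back where we started
                return i
    return -1


# Hypothetical input line, invented for this sketch.
line = "DEFAULT ((newid(Ooh)de)elephants!) NOT NULL"
start = line.index("(")
end = match_paren(line, start)
token = line[start:end + 1]
print(token)  # ((newid(Ooh)de)elephants!)
```

Note that this sketch counts every parenthesis it sees, so a ")" inside a quoted string would throw the count off; a real lexer would presumably handle quoting with a separate state.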
However, if the text really is as presented, then KISS ought to do it. In awk, each line is split by default into whitespace-separated fields, so newid() or (newid()) or ((newid(Ooh)de)elephants!) is always seen as a single field, making the nesting of parentheses irrelevant.
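To see why the field view sidesteps the nesting question, here is a small sketch in Python, where str.split() with no arguments splits on runs of whitespace much as awk's default field splitting does. The sample lines are invented for illustration.

```python
# Hypothetical input lines; only the middle field differs.
samples = [
    "DEFAULT newid() NOT",
    "DEFAULT (newid()) NOT",
    "DEFAULT ((newid(Ooh)de)elephants!) NOT",
]

for line in samples:
    fields = line.split()   # whitespace-separated fields, like awk's $1, $2, ...
    print(fields[1])        # the parenthesised expression is one whole field
```

However deeply the parentheses nest, the expression stays in the second field as long as it contains no whitespace.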
If at some stage input text with stray spaces, e.g. "(newid ( ) )", is encountered, then it is a simple matter to add a line or two of prefiltering to the awk script to effect repair. Modifying the line in this way causes the fields to be re-evaluated automatically, so the rearrangements can then be made on complete fields. (As I see it, repair merely involves detecting / +[)(]/ and eliding the spaces matched by / +/.)
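As a rough illustration of that repair step, the same substitution can be tried in Python. This is only a sketch of the regex, not the awk prefilter itself; the function name close_up_parens is hypothetical.

```python
import re


def close_up_parens(line):
    """Delete any run of spaces that directly precedes "(" or ")"."""
    return re.sub(r" +([()])", r"\1", line)


print(close_up_parens("(newid ( ) )"))          # (newid())
print(close_up_parens("DEFAULT (newid ( ) )"))  # DEFAULT(newid())
```

As the second call shows, applying the pattern to the whole line also swallows the space that separates the expression from the preceding word, so a real prefilter would probably need to restrict the substitution to the text after the first "(".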
And if at some stage, arithmetic expressions with spaces crop up, then detecting something along the lines of / +[)(0-9*/+-]/ might cover that use case, still without having to delve into grammars or even lexer gymnastics.
The problem looks like fun. :-)
I've solved it for now within the limited scope of my problem, which covers two cases:

"DEFAULT (newid()) "
"DEFAULT ('some string') "

The first never has any spaces, so I can do
" *\([^ ]*\) *"

The second never has any braces, so I can do
" *\(('[^']*')\) *"
(or something like that)

So it's working now and will get me through this conversion, even if it's a bit fragile for general use.

Thanks for the suggestions!

James
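For anyone wanting to try the two patterns outside sed, here is a hedged Python rendering. It assumes the quoted-string class is meant to repeat (that is, [^']* rather than a single character) and it anchors on the literal word DEFAULT; both are assumptions, as are the names NO_SPACES and QUOTED.

```python
import re

# Python spelling of the two sed patterns: in sed's basic syntax \( \) group
# and a bare ( is literal, while in Python it is the other way round.
# Assumptions: the DEFAULT prefix, and "*" after [^'] so that a string of
# any length matches.
NO_SPACES = re.compile(r"DEFAULT *([^ ]*) *")
QUOTED = re.compile(r"DEFAULT *(\('[^']*'\)) *")

print(NO_SPACES.match("DEFAULT (newid()) ").group(1))        # (newid())
print(QUOTED.match("DEFAULT ('some string') ").group(1))     # ('some string')
print(NO_SPACES.match("DEFAULT ('some string') ").group(1))  # ('some
```

The last line shows why the second case needs its own pattern: the no-spaces pattern stops at the first space inside the quoted string.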