[prog] lex/yacc problem

Almut Behrens almut-behrens at gmx.net
Wed May 28 22:50:12 EST 2003


On Wed, May 28, 2003 at 09:14:01AM -1000, Jimen Ching wrote:
> On Wed, 28 May 2003, Almut Behrens wrote:
> >do you have a rule somewhere that handles the occurrence of whitespace, e.g.
> >
> ><INITIAL>{WS}  { }
> >
> >in its most simple case (i.e. to ignore whitespace)?  If not, the lexer
> >will pass the string " = " to the parser, instead of "=", as your yacc
> >grammar expects.
> 
> I have the following two patterns in the lex specification:
> 
> WSs			{WS}+
> 
> <INITIAL>{NL}		{ cur_lineno++; }
> <INITIAL>{WSs}		{ }
> 
> Note, these two patterns are at the top of the lexer specification (comes
> before all other patterns).


okay, that should be fine then.

Taking a closer look, I think your lexer is already tokenizing the
"#1 'b1001" into the sequence YYPOUND, Binary instead of YYPOUND,
Number, Binary.  (I'm assuming that you do, in fact, want to split it
up into (#1)('b1001), with the (#1) being the optional_delay; at least
that's how I read your grammar...)

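The thing is that, at the lexer level, this is ambiguous: your {Binary}
can also take an optional number at the beginning, so the split
(#)(1 'b1001) is just as valid as (#1)('b1001).  And since lex always
goes for the longest match, {Binary} (matching "1 'b1001") wins over
{Number} (matching just "1").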
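
To make the ambiguity concrete, here is a small C sketch that
hand-codes the {Number} and {Binary} patterns from your spec and
compares how much input each would consume.  The function names
(match_number, match_binary, binary_wins) are made up for illustration
and are not part of lex; the only lexer behaviour assumed is the usual
longest-match rule.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* WS from the lexer spec: [ \t\r\b] */
static int is_ws(char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\b';
}

/* One of [01xXzZ?]; the NUL check is needed because strchr() would
   otherwise report a match on the string terminator. */
static int is_bin_digit(char c)
{
    return c != '\0' && strchr("01xXzZ?", c) != NULL;
}

/* Length of {Number} = {Digit}{DigitU}* at the start of s, 0 if none. */
size_t match_number(const char *s)
{
    size_t n = 1;
    assert(s != NULL);
    if (!isdigit((unsigned char)s[0]))
        return 0;
    while (isdigit((unsigned char)s[n]) || s[n] == '_')
        n++;
    return n;
}

/* Length of {Binary} at the start of s, 0 if none.
   Binary = ([-+]?{Number}{WS}*)?\'[bB]{WS}*[01xXzZ?][01xXzZ?_]* */
size_t match_binary(const char *s)
{
    size_t i = 0, sign, num;
    assert(s != NULL);

    /* optional prefix: sign, number, trailing whitespace */
    sign = (s[0] == '-' || s[0] == '+') ? 1 : 0;
    num = match_number(s + sign);
    if (num > 0) {
        i = sign + num;
        while (is_ws(s[i]))
            i++;
    }

    if (s[i] != '\'')
        return 0;
    i++;
    if (s[i] != 'b' && s[i] != 'B')
        return 0;
    i++;
    while (is_ws(s[i]))
        i++;
    if (!is_bin_digit(s[i]))
        return 0;
    i++;
    while (is_bin_digit(s[i]) || s[i] == '_')
        i++;
    return i;
}

/* Nonzero if the longest-match rule would pick Binary over Number. */
int binary_wins(const char *s)
{
    return match_binary(s) > match_number(s);
}
```

For the input left over after the '#', i.e. "1 'b1001;", match_number()
reports 1 character and match_binary() reports 8, so a lexer that
prefers the longest match takes the whole thing as one Binary.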

One way to get around this would be to use a start condition that is
enabled when the '#' is encountered.  In that state, you'd return a
{Number} right away (i.e. before the Binary pattern gets a chance to
match), and then switch back to the normal state.

Just played around a little... I think the following simplified lexer
would return a token sequence that your grammar can handle:

%{
    int yywrap(void) { return 1; }
%}

WS            [ \t\r\b]
Digit         [0-9]
DigitU        [0-9_]
Letter        [a-zA-Z]
LetterU       [a-zA-Z_]
WordNum       [0-9a-zA-Z]
WordNumU      [0-9a-zA-Z_]
Number        {Digit}{DigitU}*
Word          {LetterU}{WordNumU}*
Binary        ([-+]?{Number}{WS}*)?\'[bB]{WS}*[01xXzZ?][01xXzZ?_]*

%x delay

%%
#                 { BEGIN(delay);   return 1; }
<delay>{Number}   { BEGIN(INITIAL); return 2; }
{Binary}          return 3;
{Word}            return 4;
{WS}              /* eat up whitespace */
.                 return (int) yytext[0];
  /* default rule for literal character tokens such as '=', ';' */
  
%%

#include <stdio.h>

int main(void) {
    int r;
    while ((r = yylex()) != 0) { printf("[r=%d]", r); }  /* 0 = end of input */
    return 0;
}


When you run this standalone, you should get (from the printf in the
while loop) for your string "result = #1 'b1001;":

  [r=4][r=61][r=1][r=2][r=3][r=59]

which corresponds to the token sequence

  Word, '=', '#', Number, Binary, ';'
  |               |       |
  result          1       'b1001

I think this is a sequence your parser should be able to reduce
correctly...
(Instead of the 1, 2, 3, 4 constants here, you'd of course have other
values corresponding to YYPOUND, YYNUMBER, etc.)
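
If the raw numbers in the trace are hard to read, a small helper along
these lines could translate them back into names.  This is only a
sketch for the toy lexer above: token_name is a made-up function, the
codes 1 to 4 are the throwaway constants from this example, and
anything else is assumed to be a literal character returned by the
default rule.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical debugging helper for the demo lexer: returns a readable
   name for a code returned by yylex().  buf is scratch space used only
   for literal character tokens such as '=' or ';'. */
const char *token_name(int r, char *buf, size_t buflen)
{
    switch (r) {
    case 1: return "'#'";
    case 2: return "Number";
    case 3: return "Binary";
    case 4: return "Word";
    default:
        /* everything else came from the "." rule: show the character */
        assert(buf != NULL && buflen >= 4);
        snprintf(buf, buflen, "'%c'", r);
        return buf;
    }
}
```

In the test loop you could then print token_name(r, buf, sizeof buf)
with "%s" instead of the bare number, using a small local char buf[8].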


Hope that makes a bit more sense,

Almut


