  1. ORIGINAL: http://antlr.org/papers/Clarke-expr-parsing-1986.pdf
  2. PAGE 1
  3. The top-down parsing of expressions
  4. Keith Clarke
  5. Dept. of Computer Science and Statistics, Queen Mary College, London, E1 4NS.
  7. 13th June 1986 Page 1
  10. Summary
  11. Using grammar and program transformation a compact, efficient and easily used method of parsing infix
  12. expressions is derived. The resulting algorithm does not require the construction of precedence matrices,
  13. neither does it normally require transformation of the reference grammar for expressions. The method uses
  14. only two parsing procedures with functions which give the numerical precedence of each operator, and
  15. indicate which operators are left-associative and which right-associative.
  16. KEYWORDS Recursive-descent parsing Expression parsing Operator-precedence parsing
  17. Introduction
  18. Infix expressions can easily be parsed by conventional recursive descent methods, but the resulting parser
  19. is very inefficient - even for a trivial expression a number of procedure calls equal to the number of
  20. precedence levels is needed. It has been stated4 that the transformations needed to make the grammar
  21. suitable as the basis of such a parser are tedious, and the final parser too opaque. For these reasons
  22. compiler writers often recourse to operator-precedence parsers for expressions, embedded in a top-down
  23. parser for the rest of the language.
  24. Richards1 gives a procedure for parsing expressions which uses recursion rather than an explicit stack
  25. despite being derived using the operator-precedence technique. The procedure is essentially identical to the
  26. one given here (performing the same number of tests of each operator symbol), but requires the
  27. construction of the operator-precedence matrix. We show how the procedure can be derived by grammar-
  28. and program-transformation from the recursive descent method, using only numerical precedences and an
  29. indication of left- or right-associativity. The resulting program can parse an expression using a number of
  30. procedure calls approximately equal to the number of nodes in its abstract syntax tree, independent of the
  31. number of precedence levels in the grammar. The operator symbols of any particular expression grammar
  32. appear only in two small functions, which frequently removes the need for grammar transformation before
  33. the method can be used. Hanson2 shows how the number of procedure definitions used in a recursive
  34. descent parser can be reduced, but does not show how the number of procedure activations can be reduced.
  35. For this reason, Hanson’s method is not in fact equivalent to Richards’.
  39. The algorithm
  40. Two simple functions (Priority and RightAssoc) are used to map character values to numbers
  41. (representing precedences) and booleans, respectively. Examples are given below. These encode the usual
  42. relative precedences and associativities of addition, multiplication and exponentiation. Non-operator
  43. characters are assigned a precedence of zero, so that it is not necessary to specify what characters can
  44. follow an expression.
  45. proc Priority(c) =
  46.   case c of
  47.     "+" => 1 [] "*" => 2 [] "↑" => 3
  48.     otherwise 0
  49.   endcase
  50. endproc
  51. proc RightAssoc(c) =
  52.   case c of "↑" => true
  53.     otherwise false
  54.   endcase
  55. endproc
  56. The next two procedures complete the parser for formulae composed of these infix operators and single-
  57. character variable names, constructing the correct abstract syntax tree.
  58. The notation used is similar to that of Bornat3. The keyword initialise is used for a declaration of an
  59. updatable location, initialised to the value of the expression given; let similarly declares a non-updatable
  60. variable. The two functions mkleaf and mknode construct new syntax tree nodes containing operands and
  61. operators respectively. Function CurrC returns the character at the current reading position, without side-
  62. effects. Similarly, function GetC returns the current character, but also advances the reading position by
  63. one. A version of Dijkstra’s guarded commands (eg Gries5) has been used in the imperative parts of the
  64. program to express conditional execution and iteration.
  65. An expression is parsed using an initial call E(1). It is clear by inspection that the number of calls of E is
  66. one more than the number of operator nodes of the syntax tree constructed, while the number of calls of P
  67. is equal to the number of leaves of the tree. Each use of parentheses in the expression parsed leads to one
  72. additional call of each procedure.
  73. proc E(prec) =
  74.   initialise p := P();
  75.   do Priority(CurrC()) ≥ prec ->
  76.     let oper = GetC();
  77.     let oprec = Priority(oper);
  78.     p := mknode(p, oper, E(if RightAssoc(oper) then oprec else oprec+1))
  79.   od;
  80.   return p
  81. endproc
  82. proc P() =
  83.   case CurrC() of
  84.     "(" =>
  85.       begin
  86.         GetC();
  87.         let p = E(1);
  88.         if GetC() ≠ ")" -> fail fi;
  89.         return p
  90.       end
  91.     otherwise
  92.       return mkleaf(GetC())
  93.   endcase
  94. endproc
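The two procedures above can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's own code: the ASCII character "^" stands in for the exponentiation arrow, nested tuples stand in for the nodes built by mknode and mkleaf, and the letter check on leaves, the end-of-input check and the two call counters are additions made for demonstration.

```python
PRIORITY = {"+": 1, "*": 2, "^": 3}  # "^" is an assumed stand-in for the arrow
RIGHT_ASSOC = {"^"}


class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.calls_e = 0  # activations of E (added for illustration)
        self.calls_p = 0  # activations of P (added for illustration)

    def curr_c(self):
        # Current character without side effects; "" once input is exhausted.
        return self.text[self.pos:self.pos + 1]

    def get_c(self):
        # Current character, advancing the reading position by one.
        c = self.curr_c()
        self.pos += 1
        return c

    def parse_e(self, prec):
        self.calls_e += 1
        p = self.parse_p()
        # Non-operators have priority 0, so they always end the loop.
        while PRIORITY.get(self.curr_c(), 0) >= prec:
            oper = self.get_c()
            oprec = PRIORITY[oper]
            right = self.parse_e(oprec if oper in RIGHT_ASSOC else oprec + 1)
            p = (oper, p, right)
        return p

    def parse_p(self):
        self.calls_p += 1
        if self.curr_c() == "(":
            self.get_c()
            p = self.parse_e(1)
            if self.get_c() != ")":
                raise SyntaxError("expected ')'")
            return p
        c = self.get_c()
        if not c.isalpha():  # assumption: leaves are single letters
            raise SyntaxError(f"expected a letter, found {c!r}")
        return c


def parse(text):
    """Return (tree, calls of E, calls of P) for a complete expression."""
    parser = Parser(text)
    tree = parser.parse_e(1)
    if parser.pos != len(text):
        raise SyntaxError("unexpected trailing input")
    return tree, parser.calls_e, parser.calls_p
```

For the input a+b*c this sketch makes three calls of E (two operator nodes plus one) and three calls of P (three leaves), in line with the counts stated in the text; wrapping a subexpression in parentheses adds one call of each.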
  95. The structure of the parsing routine is directly related to the ambiguous operator grammar:
  96. E = P { "+" E | "*" E | "↑" E }
  97. P = "(" E ")" | letter
  98. where curly brackets indicate zero or more repetitions of their contents, and vertical bar indicates choice.
  99. The analyser uses ‘pragmatic’ information about the operators to guide a deterministic one-track parse in
  100. such a way that simple formulae are recognised more efficiently than with conventional one-track analysers
  101. based on unambiguous grammars.
  102. Conventional top-down parsing
  103. There is a standard technique (eg. Aho4) for constructing a top-down parser for infix expressions, given
  104. an ambiguous grammar and precedence/associativity information. A grammar rule is introduced for each
  105. level of precedence. For each level, either all the operators are left associative or else they are right
  106. associative. In the former case the rule takes the form:
  107. En = En+1 { On En+1 }
  108. where n is the precedence of the operators On introduced in this rule. For right associative operators a
  113. right-recursive rule is used:
  114. En = En+1 [ On En ]
  115. Square brackets indicate an optional element. The rule for the highest precedence level uses the rule for
  116. expression primaries, which we have already seen, in place of the reference to a greater precedence level.
  117. eg. for the operators as used in the program above:
  118. E1 = E2 { "+" E2 }
  119. E2 = E3 { "*" E3 }
  120. E3 = P [ "↑" E3 ]
  121. P = "(" E1 ")" | id
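For comparison, the conventional grammar above can be transcribed directly into one procedure per rule. The following Python sketch does so under the assumptions that "^" stands in for the exponentiation arrow, that id is a single letter, and that tuples represent tree nodes; the calls counter is an addition that makes the cost visible.

```python
def parse_conventional(text):
    """Return (tree, number of procedure activations)."""
    pos = 0
    calls = 0  # counted for illustration only

    def curr_c():
        return text[pos:pos + 1]

    def get_c():
        nonlocal pos
        c = curr_c()
        pos += 1
        return c

    def e1():  # E1 = E2 { "+" E2 }
        nonlocal calls
        calls += 1
        tree = e2()
        while curr_c() == "+":
            get_c()
            tree = ("+", tree, e2())
        return tree

    def e2():  # E2 = E3 { "*" E3 }
        nonlocal calls
        calls += 1
        tree = e3()
        while curr_c() == "*":
            get_c()
            tree = ("*", tree, e3())
        return tree

    def e3():  # E3 = P [ "^" E3 ]
        nonlocal calls
        calls += 1
        tree = p()
        if curr_c() == "^":
            get_c()
            tree = ("^", tree, e3())
        return tree

    def p():  # P = "(" E1 ")" | id
        nonlocal calls
        calls += 1
        if curr_c() == "(":
            get_c()
            tree = e1()
            if get_c() != ")":
                raise SyntaxError("expected ')'")
            return tree
        c = get_c()
        if not c.isalpha():  # assumption: id is a single letter
            raise SyntaxError(f"expected a letter, found {c!r}")
        return c

    tree = e1()
    if pos != len(text):
        raise SyntaxError("unexpected trailing input")
    return tree, calls
```

Even the one-symbol input x costs four activations here (e1, e2, e3 and p), one per precedence level plus one for the primary, which is the inefficiency the introduction describes.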
  122. Derivation of the parser
  123. This is shown using the example grammar already introduced. The reader is invited to confirm that at no
  124. step is a particular property of the grammar used in any way that would reduce the generality of the result.
  125. We proceed by substituting, in each of the rules for expressions, for the leading non-terminal symbol,
  126. giving:
  127. E1 = P [ "↑" E3 ] { "*" E3 } { "+" E2 }
  128. E2 = P [ "↑" E3 ] { "*" E3 }
  129. E3 = P [ "↑" E3 ]
  130. P = "(" E1 ")" | id
  131. In the programs below the construction of the parse tree is not shown and as the definitions of P, Priority
  132. and RightAssoc are the same in each example they are not repeated.
  133. The analyser produced from these rules is even larger than that produced directly, but is amenable to
  134. optimisation. Consider for example the procedure for rule E1:
  135. proc E1() =
  136.   P();
  137.   if CurrC()="↑" -> GetC(); E3() fi;
  138.   do CurrC()="*" -> GetC(); E3() od;
  139.   do CurrC()="+" -> GetC(); E2() od
  140. endproc
  141. This would be the procedure called, say, to parse the right hand side of an expression. For example, given input
  142. such as "x:=0" the symbol following the zero would be tested three times before the procedure returned.
  143. However, it is possible to collapse the succession of tests into one.
  148. Clearly the final loop can only be executed in an initial state such that the current input character is not
  149. "*". This is an invariant of the loop, since the procedure E2 finishes with another loop that also establishes
  150. this state. There is a general result for loop-transformations that is useful here: provided it is not possible
  151. for B and C to be true at the same time, and not B is an invariant of instruction T, the following two
  152. programs are equivalent.
  153. I: do B->S od; do C->T od
  154. and
  155. J: do B->S [] C->T od
  156. Assuming termination, program I executes statement S a finite number of times, then executes statement T
  157. a finite number of times. In general, program J produces an interleaving of executions of S and T. But the
  158. invariance of not B means that no further executions of S are possible after the first execution of T. The
  159. requirement that B and C cannot be true at the same time ensures that program J is deterministic. Both
  160. programs produce the state not B solely by repeated execution of S, so the number of executions must be
  161. the same. Similarly, given the same initial state the number of executions of T must be the same.
  162. Another useful, standard result3 is that:
  163. do B->S [] C->T od
  164. is equivalent to
  165. do B or C -> if B->S [] C->T fi od
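The two loop results can be checked on a small concrete instance. In the Python sketch below, all names and the choice of B, C, S and T are illustrative assumptions: B is "the current character is a", C is "the current character is b", S consumes one a, and T consumes one b followed by any run of a, so that not B holds after every execution of T, as the first result requires.

```python
def statement_t(text, pos):
    """T: consume one 'b', then any run of 'a', so 'not B' holds afterwards."""
    pos += 1
    while text[pos:pos + 1] == "a":
        pos += 1
    return pos


def program_i(text):
    """do B -> S od; do C -> T od"""
    pos = s_count = t_count = 0
    while text[pos:pos + 1] == "a":      # B -> S
        pos += 1
        s_count += 1
    while text[pos:pos + 1] == "b":      # C -> T
        pos = statement_t(text, pos)
        t_count += 1
    return pos, s_count, t_count


def program_j(text):
    """do B or C -> if B -> S [] C -> T fi od"""
    pos = s_count = t_count = 0
    while text[pos:pos + 1] in ("a", "b"):
        if text[pos:pos + 1] == "a":     # B -> S
            pos += 1
            s_count += 1
        else:                            # C -> T
            pos = statement_t(text, pos)
            t_count += 1
    return pos, s_count, t_count
```

Both functions return the final reading position together with the number of executions of S and of T, and they agree on every input, because B and C are mutually exclusive and T re-establishes not B.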
  166. Using these results the two loops of procedure E1 can be transformed, eventually giving the following:
  167. proc E1() =
  168.   P();
  169.   if CurrC()="↑" -> GetC(); E3() fi;
  170.   do CurrC() ∈ {"*", "+"} ->
  171.     case GetC() of "*" => E3() [] "+" => E2() endcase
  172.   od
  173. endproc
  174. The if statement is replaced by a loop since the call E3() ensures that such a loop can never iterate more
  175. than once. Then a similar argument to the above allows us to merge the resulting loop with the final one,
  180. giving:
  181. proc E1() =
  182.   P();
  183.   do CurrC() ∈ {"↑", "*", "+"} ->
  184.     case GetC() of "↑" => E3() [] "*" => E3() [] "+" => E2() endcase
  185.   od
  186. endproc
  187. The two other procedures E2 and E3 are readily transformed to:
  188. proc E2() =
  189.   P();
  190.   do CurrC() ∈ {"↑", "*"} ->
  191.     case GetC() of "↑" => E3() [] "*" => E3() endcase
  192.   od
  193. endproc
  194. proc E3() =
  195.   P();
  196.   do CurrC() ∈ {"↑"} ->
  197.     case GetC() of "↑" => E3() endcase
  198.   od
  199. endproc
  200. Obviously the most general version of the case statement could be used in all three procedures. The
  201. condition-parts of the loops are encoded using tests of numerical precedences. In E1, E2 and E3 the
  202. substitution is Priority(CurrC()) ≥ 1, Priority(CurrC()) ≥ 2 and Priority(CurrC()) ≥ 3, respectively.
  203. The three procedures are now identical, except in the use of the constants 1, 2 and 3 respectively in
  204. procedures E1, E2 and E3. We therefore parameterize on this number, collapsing the procedures into one:
  205. proc E(n) =
  206.   P();
  207.   do Priority(CurrC()) ≥ n ->
  208.     case GetC() of "↑" => E(3) [] "*" => E(3) [] "+" => E(2) endcase
  209.   od
  210. endproc
  211. This version is further simplified and generalised on the basis of two observations. First, the call following
  212. each left associative operator is to the procedure for the next higher precedence. Second, the call after a
  213. right associative operator is found is to the procedure for the same precedence. This is a direct consequence
  218. of the way the unambiguous grammar was constructed. Thus,
  219. proc E(n) =
  220.   P();
  221.   do Priority(CurrC()) ≥ n ->
  222.     let oper = GetC();
  223.     let oprec = Priority(oper);
  224.     E(if RightAssoc(oper) then oprec else oprec+1)
  225.   od
  226. endproc
  227. This procedure no longer depends directly on the operator symbols or the number of precedence levels.
  228. User-declared infix operators
  229. In many applications the precedence and associativity functions could, for efficiency, be encoded as
  230. tables indexed on characters or lexical items. Alternatively, the syntactic properties of each operator could
  231. be stored in the symbol table. This facilitates the provision of user-defined infix operators. Precedence and
  232. infix status ‘declarations’ are properly regarded as compiler directives, and need to be recognised by the
  233. parser. If it is required to make such directives follow the usual scope rules, the parser will also have to
  234. undo the effects of the directives on block exit.
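One way to picture scoped operator declarations is a table whose entries are layered per block. The Python sketch below is an assumption-laden illustration, not something the paper specifies: the class OperatorTable, its method names, the use of collections.ChainMap for the scope stack and the "^" stand-in for the exponentiation arrow are all hypothetical choices.

```python
from collections import ChainMap


class OperatorTable:
    """Maps an operator symbol to (precedence, is_right_associative)."""

    def __init__(self):
        self.scopes = ChainMap({"+": (1, False), "*": (2, False), "^": (3, True)})

    def enter_block(self):
        # New declarations go into a fresh innermost layer.
        self.scopes = self.scopes.new_child()

    def exit_block(self):
        # Dropping the innermost layer undoes the block's declarations.
        self.scopes = self.scopes.parents

    def declare(self, symbol, precedence, right_assoc=False):
        self.scopes[symbol] = (precedence, right_assoc)

    def priority(self, c):
        return self.scopes.get(c, (0, False))[0]

    def right_assoc(self, c):
        return self.scopes.get(c, (0, False))[1]


def parse(text, table):
    """Parse single-character operands and infix operators known to table."""
    pos = 0

    def get_c():
        nonlocal pos
        c = text[pos:pos + 1]
        pos += 1
        return c

    def e(prec):
        tree = p()
        while table.priority(text[pos:pos + 1]) >= prec:
            oper = get_c()
            oprec = table.priority(oper)
            tree = (oper, tree, e(oprec if table.right_assoc(oper) else oprec + 1))
        return tree

    def p():
        if text[pos:pos + 1] == "(":
            get_c()
            tree = e(1)
            if get_c() != ")":
                raise SyntaxError("expected ')'")
            return tree
        return get_c()

    tree = e(1)
    if pos != len(text):
        raise SyntaxError("unexpected trailing input")
    return tree
```

In this sketch, declaring "-" at precedence 1 inside a block makes a-b-c parse as a left-associative chain, and after exit_block the symbol "-" reverts to priority zero, so it is no longer treated as an operator. The parsing procedures themselves are unchanged from the fixed-operator version.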
  235. Unary prefix operators
  236. These are conveniently recognised by the procedure P - this is simplest if such operators bind more
  237. tightly than any infix operator (as in Algol 68, but not in Pascal). The extra grammar productions required
  238. appear as new alternatives in the rule for primaries (ie. P), and hence each prefix operator is recognised by
  239. its own guard in the case-statement in procedure P. If it is required that "- -3" be interpreted as "-(-3)", the
  240. following production may be used:
  241. P = "-" P | id | "(" E1 ")"
  242. It is possible to arrange that "-3↑4" be interpreted as "-(3↑4)", but "-3+4" as "(-3)+4", using a procedure for
  243. primaries derived from:
  244. P = "-" E2 | id | "(" E1 ")"
  245. Thus one may define a language in which "-maxint+maxint" evaluates to zero without overflow, and also
  250. "-3↑2" evaluates to minus nine. The Pascal productions6 that exclude "4*-3" from the language are
  251. difficult to enforce efficiently using the present method. It is not unknown for compilers to accept such
  252. ‘superlanguage’ features, perhaps because it is hard to imagine a useful error report.
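The second production for primaries can be tried out with a small evaluator. The Python sketch below rests on several assumptions made only for illustration: operands are single decimal digits rather than identifiers, "^" stands in for the exponentiation arrow, a prefix minus is represented by a ("neg", operand) tuple, and the evaluate function is a hypothetical helper.

```python
PRIORITY = {"+": 1, "*": 2, "^": 3}  # "^" is an assumed stand-in for the arrow
RIGHT_ASSOC = {"^"}


def parse(text):
    pos = 0

    def curr_c():
        return text[pos:pos + 1]

    def get_c():
        nonlocal pos
        c = curr_c()
        pos += 1
        return c

    def e(prec):
        tree = p()
        while PRIORITY.get(curr_c(), 0) >= prec:
            oper = get_c()
            oprec = PRIORITY[oper]
            tree = (oper, tree, e(oprec if oper in RIGHT_ASSOC else oprec + 1))
        return tree

    def p():  # P = "-" E2 | id | "(" E1 ")"
        c = get_c()
        if c == "-":
            return ("neg", e(2))
        if c == "(":
            tree = e(1)
            if get_c() != ")":
                raise SyntaxError("expected ')'")
            return tree
        if not c.isdigit():  # assumption: operands are single digits
            raise SyntaxError(f"expected a digit, found {c!r}")
        return c

    tree = e(1)
    if pos != len(text):
        raise SyntaxError("unexpected trailing input")
    return tree


def evaluate(tree):
    if isinstance(tree, str):
        return int(tree)
    if tree[0] == "neg":
        return -evaluate(tree[1])
    oper, left, right = tree
    a, b = evaluate(left), evaluate(right)
    if oper == "+":
        return a + b
    if oper == "*":
        return a * b
    return a ** b
```

With this sketch -3^2 parses as the negation of 3^2 and evaluates to minus nine, while -3+4 parses as (-3)+4. It also accepts 4*-3, the form that the Pascal productions exclude.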
  253. Operator precedence
  254. The method described here requires that all the operators for a particular precedence level have the same
  255. associativity. The operator precedence method does not have this restriction; see for example the rules for
  256. the shift operators in BCPL1. However, although production of the operator precedence matrix is usually
  257. considered trivial, there is no obvious relationship between the grammar and the final parser. Previously4
  258. there has often been some doubt as to whether or not an operator precedence parser ‘accepts exactly the
  259. desired language’. Bornat7 gives a useful discussion of the use of recursion in operator precedence parsing,
  260. together with a readily accessible, simplified version of Richards’ algorithm1.
  261. Conclusion
  262. The recursive descent technique makes possible the construction of a compact and easily comprehended
  263. parser directly from the usual informal description of the expression part of a programming language. No
  264. construction of precedence matrices is necessary. It is gratifying that the resulting parser is more efficient
  265. than one directly constructed from an unambiguous grammar, although the gain is in practice small. On
  266. favourable input (expressions involving only two levels of precedence out of a possible eight), using Pascal
  267. for the implementation, an improvement of about 25% in the execution time of a simple calculator was
  268. obtained. Simple expressions, on which the algorithm does well, are of course more commonly
  269. encountered than complex ones.
  270. Since the operator symbols are recognised as such solely by their precedence, user-declaration of infix
  271. operators is trivial, and carries no executional overhead.
  272. In contrast with other top-down techniques, additional levels of precedence can be added without
  273. changing the cost of parsing simple or complex expressions.
  278. Many language definitions use ambiguous expression grammars, disambiguated by ‘informal’ remarks
  279. about the precedences and associativities of the various operators. The algorithm shown can be used in
  280. these cases without modification, by encoding the informal remarks into two simple functions.
  281. References
  282. 1. M. Richards and C. Whitby-Strevens, BCPL - the language and its compiler,
  283. Cambridge University Press, Cambridge, 1979.
  284. 2. D.R. Hanson, ‘Compact Recursive-descent Parsing of Expressions’,
  285. Software Practice and Experience, vol 15(12), 1205-1212 (1985).
  286. 3. R. Bornat, Programming from First Principles, Prentice-Hall, London, 1986. (to appear)
  287. 4. A.V. Aho, R. Sethi and J.D. Ullman, Compilers: Principles, Techniques and Tools,
  288. Addison-Wesley, 1985.
  289. 5. D. Gries, The Science of Programming, Springer-Verlag, New York, 1981.
  290. 6. K. Jensen and N. Wirth, Pascal: User Manual and Report, Springer-Verlag, New York, 1978.
  291. 7. R. Bornat, Understanding and Writing Compilers, Macmillan, London, 1979.