@Spongcer 2015-01-09T11:27:13.000000Z

Fix Hive ParseException of Grouping Sets

ParseException GroupingSets Antlr Hive

Current Hive GROUPING SETS

Currently, when Hive parses a GROUPING SETS clause, a parenthesized grouping set that is composed of two or more subexpressions is accepted only if its first subexpression is a plain identifier without any qualifier; otherwise Hive throws a ParseException during the parse stage. As a result, both explain statements below fail to parse:

drop table test;
create table test(tc1 int, tc2 int, tc3 int);
explain select test.tc1, test.tc2 from test group by test.tc1, test.tc2 grouping sets(test.tc1, (test.tc1, test.tc2)); 
explain select tc1+tc2, tc2 from test group by tc1+tc2, tc2 grouping sets(tc2, (tc1 + tc2, tc2));
drop table test;

The log for the first statement, including the ParseException stack trace:

2015-01-07 09:53:34,718 INFO  [main]: log.PerfLogger (PerfLogger.java:PerfLogBegin(108)) - <PERFLOG method=Driver.run from=org.apache.hadoop.hive.ql.Driver>
2015-01-07 09:53:34,719 INFO  [main]: log.PerfLogger (PerfLogger.java:PerfLogBegin(108)) - <PERFLOG method=TimeToSubmit from=org.apache.hadoop.hive.ql.Driver>
2015-01-07 09:53:34,721 INFO  [main]: ql.Driver (Driver.java:checkConcurrency(158)) - Concurrency mode is disabled, not creating a lock manager
2015-01-07 09:53:34,721 INFO  [main]: log.PerfLogger (PerfLogger.java:PerfLogBegin(108)) - <PERFLOG method=compile from=org.apache.hadoop.hive.ql.Driver>
2015-01-07 09:53:34,724 INFO  [main]: log.PerfLogger (PerfLogger.java:PerfLogBegin(108)) - <PERFLOG method=parse from=org.apache.hadoop.hive.ql.Driver>
2015-01-07 09:53:34,724 INFO  [main]: parse.ParseDriver (ParseDriver.java:parse(185)) - Parsing command: explain select test.tc1, test.tc2 from test group by test.tc1, test.tc2 grouping sets(test.tc1, (test.tc1, test.tc2))
2015-01-07 09:53:34,734 ERROR [main]: ql.Driver (SessionState.java:printError(545)) - FAILED: ParseException line 1:105 missing ) at ',' near '<EOF>'
line 1:116 extraneous input ')' expecting EOF near '<EOF>'
org.apache.hadoop.hive.ql.parse.ParseException: line 1:105 missing ) at ',' near '<EOF>'
line 1:116 extraneous input ')' expecting EOF near '<EOF>'
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:210)
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:404)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:322)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:975)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1040)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:792)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
2015-01-07 09:53:34,745 INFO  [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(135)) - </PERFLOG method=compile start=1420595614721 end=1420595614745 duration=24 from=org.apache.hadoop.hive.ql.Driver>
2015-01-07 09:53:34,745 INFO  [main]: log.PerfLogger (PerfLogger.java:PerfLogBegin(108)) - <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
2015-01-07 09:53:34,746 INFO  [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(135)) - </PERFLOG method=releaseLocks start=1420595614745 end=1420595614746 duration=1 from=org.apache.hadoop.hive.ql.Driver>
2015-01-07 09:53:34,746 INFO  [main]: log.PerfLogger (PerfLogger.java:PerfLogBegin(108)) - <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
2015-01-07 09:53:34,746 INFO  [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(135)) - </PERFLOG method=releaseLocks start=1420595614746 end=1420595614746 duration=0 from=org.apache.hadoop.hive.ql.Driver>
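
The two positions in the error message can be mapped back onto the statement. The sketch below assumes that the reported columns are 0-based offsets into the command string logged by ParseDriver; the helper name mark is made up for this illustration. Under that assumption, column 105 is the comma right after (test.tc1 and column 116 is the final closing parenthesis.

```python
# The command exactly as logged by ParseDriver (no trailing semicolon).
QUERY = (
    "explain select test.tc1, test.tc2 from test "
    "group by test.tc1, test.tc2 "
    "grouping sets(test.tc1, (test.tc1, test.tc2))"
)


def mark(text, column):
    """Return the text plus a second line with a caret under the 0-based column."""
    return text + "\n" + " " * column + "^"


# Assumption: the columns in "line 1:105" and "line 1:116" are 0-based.
for column in (105, 116):
    print(f"column {column}: {QUERY[column]!r}")
    print(mark(QUERY, column))
```

The characters just before column 105 are (test.tc1, so the parser reports the missing right parenthesis exactly where the comma of the two-element grouping set appears.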

Hive throws a ParseException for the first statement because the grouping set (test.tc1, test.tc2) is composed of two subexpressions, test.tc1 and test.tc2, and the first one, test.tc1, carries the qualifier test.

Hive throws a ParseException for the second statement because the grouping set (tc1 + tc2, tc2) is composed of two subexpressions, tc1 + tc2 and tc2, and the first one, tc1 + tc2, is an arithmetic expression rather than a plain, unqualified identifier.

Hive does not throw a ParseException for the following statements:

drop table test;
create table test(tc1 int, tc2 int, tc3 int);
explain select tc1, test.tc2 from test group by tc1, test.tc2 grouping sets(tc1, (tc1, test.tc2)); 
explain select tc1+tc2, tc1 from test group by tc1+tc2, tc1 grouping sets(tc1, (tc1, tc1 + tc2));
explain select test.tc1, test.tc1 + test.tc2 from test group by test.tc1, test.tc1 + test.tc2 grouping sets(test.tc1, (test.tc1), (test.tc1 + test.tc2));
drop table test;

The first two statements parse because the first subexpression of both (tc1, test.tc2) and (tc1, tc1 + tc2) is tc1, a plain identifier without any qualifier.

The third statement also parses because the grouping sets (test.tc1) and (test.tc1 + test.tc2) each contain exactly one expression, not two or more, even though test.tc1 is qualified by test and test.tc1 + test.tc2 is an arithmetic expression.
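
The rule described in this section can be written down as a small check. The Python sketch below only restates that rule; it is not Hive code. The names split_top_level and risky_grouping_sets are made up for this illustration, and the comma splitting is a simple heuristic that assumes balanced parentheses and no string literals.

```python
import re

# A plain identifier without qualifier, e.g. tc1 (but not test.tc1 or tc1 + tc2).
BARE_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def split_top_level(text):
    """Split text on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def risky_grouping_sets(clause_body):
    """Return the grouping sets that, by the rule above, fail to parse.

    clause_body is the text between the outer parentheses of
    GROUPING SETS ( ... ).
    """
    risky = []
    for element in split_top_level(clause_body):
        if not (element.startswith("(") and element.endswith(")")):
            continue  # a single unparenthesized expression is fine
        members = split_top_level(element[1:-1])
        # Two or more subexpressions whose first one is not a bare identifier.
        if len(members) >= 2 and not BARE_IDENTIFIER.match(members[0]):
            risky.append(element)
    return risky
```

Applied to the examples of this section, risky_grouping_sets("test.tc1, (test.tc1, test.tc2)") and risky_grouping_sets("tc2, (tc1 + tc2, tc2)") each report the two-element grouping set, while the three statements that parse yield an empty list.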

Hive Parse Steps

The grammar rules relevant to the GROUPING SETS clause in Hive are:

groupingSetExpression
@init {gParent.pushMsg("grouping set expression", state); }
@after {gParent.popMsg(state); }
   :
   groupByExpression
   -> ^(TOK_GROUPING_SETS_EXPRESSION groupByExpression)
   |
   LPAREN 
   groupByExpression (COMMA groupByExpression)*
   RPAREN
   -> ^(TOK_GROUPING_SETS_EXPRESSION groupByExpression+)
   |
   LPAREN
   RPAREN
   -> ^(TOK_GROUPING_SETS_EXPRESSION)
   ;
atomExpression
    :
    KW_NULL -> TOK_NULL
    | dateLiteral
    | constant
    | castExpression
    | caseExpression
    | whenExpression
    | (functionName LPAREN) => function
    | tableOrColumn
    | LPAREN! expression RPAREN!
    ;
precedenceFieldExpression
    :
    atomExpression ((LSQUARE^ expression RSQUARE!) | (DOT^ identifier))*
    ;
precedencePlusExpression
    :
    precedenceStarExpression (precedencePlusOperator^ precedenceStarExpression)*
    ;

These rules are defined in the file ql/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g. The lexer file HiveLexer.g and three other related grammar files, SelectClauseParser.g, FromClauseParser.g and HiveParser.g, are in the same directory.

ANTLR 3.4 compiles these lexer and grammar files into the Java source files HiveLexer.java, HiveParser.java, HiveParser_FromClauseParser.java, HiveParser_IdentifiersParser.java and HiveParser_SelectClauseParser.java, and puts them into the package org.apache.hadoop.hive.ql.parse.

So why does Hive throw a ParseException for the GROUPING SETS clauses shown above? After debugging the ANTLR 3.4 parsing stages in depth, we found the following cause:

While parsing HQL, ANTLR 3.4 uses a function named predict, defined in class org.antlr.runtime.DFA, to choose one of the branches (alternatives) defined in the grammar files above. predict is implemented as a non-exclusive greedy algorithm, so ANTLR 3.4 tries to choose a branch at the lowest possible cost: it prefers the branch with the smallest number, and it examines as little of the input as possible to decide whether that branch fits. As soon as a branch matches, its number is returned and the remaining branches are ignored, even if one of them would also match the input.
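
The behaviour described above can be illustrated with a toy model. This is a minimal sketch, not the code of org.antlr.runtime.DFA: the lookahead depth K = 3, the token type names and the three accepts_* functions are assumptions chosen so that the model reproduces the cases listed in this article for groupingSetExpression.

```python
K = 3  # lookahead depth of this toy model; an assumption, not read from Hive


def accepts_single(la):
    """Alternative 1: groupByExpression, i.e. one plain expression."""
    if la[0] == "ID":
        return True
    # A parenthesized expression: '(' ID followed by '.', '+' or ')'.
    return la[0] == "LPAREN" and la[1] == "ID" and la[2] in {"DOT", "PLUS", "RPAREN"}


def accepts_list(la):
    """Alternative 2: LPAREN groupByExpression (COMMA groupByExpression)* RPAREN."""
    return la[0] == "LPAREN" and la[1] == "ID" and la[2] in {"DOT", "PLUS", "COMMA", "RPAREN"}


def accepts_empty(la):
    """Alternative 3: LPAREN RPAREN."""
    return la[0] == "LPAREN" and la[1] == "RPAREN"


ALTERNATIVES = [accepts_single, accepts_list, accepts_empty]


def pad(lookahead):
    """Cut or extend the lookahead to exactly K token types."""
    return (list(lookahead) + ["EOF"] * K)[:K]


def viable(lookahead):
    """All alternative numbers that could start with this lookahead."""
    la = pad(lookahead)
    return [n for n, accepts in enumerate(ALTERNATIVES, start=1) if accepts(la)]


def predict(lookahead):
    """Return the lowest-numbered alternative that accepts the lookahead."""
    la = pad(lookahead)
    for number, accepts in enumerate(ALTERNATIVES, start=1):
        if accepts(la):
            return number  # later alternatives are never consulted
    raise ValueError("no viable alternative")
```

In this model the lookahead LPAREN ID DOT, as in (test.tc1, ..., is viable for alternatives 1 and 2, and predict returns 1. The lookahead LPAREN ID COMMA, as in (tc1, ..., is viable only for alternative 2, which is why a plain identifier in first position is handled correctly.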

First, consider the expression (test.tc1, test.tc2) in grouping sets(test.tc1, (test.tc1, test.tc2)). ANTLR 3.4 enters through the first branch of groupingSetExpression and descends recursively into the ninth branch of atomExpression. Through the second branch of precedenceFieldExpression it finally finds the matching path LPAREN! atomExpression DOT^ identifier RPAREN!, which matches the prefix (test.tc1 of (test.tc1, test.tc2) under the non-exclusive greedy algorithm.

Next, consider the expression (tc1 + tc2, tc2) in grouping sets(tc2, (tc1 + tc2, tc2)). ANTLR 3.4 again enters through the first branch of groupingSetExpression and descends recursively into the ninth branch of atomExpression. Through precedencePlusExpression it finally finds the matching path LPAREN! precedenceStarExpression precedencePlusOperator^ precedenceStarExpression RPAREN!, which matches the prefix (tc1 + tc2 of (tc1 + tc2, tc2) under the non-exclusive greedy algorithm.

Now look at these two matched paths: the next token each of them requires is a right parenthesis, RPAREN!, but the token that actually follows (test.tc1 and (tc1 + tc2 is a comma ','. ANTLR 3.4 therefore throws ParseException: missing ) at ',' near '<EOF>'.

Hive ParseException Solution

So, how can this problem be solved?

From the analysis above it is clear that the second branch of groupingSetExpression contains a path that matches the whole expression, so we can make ANTLR 3.4 prefer that branch during prediction simply by swapping the first two branches of groupingSetExpression. That is all.

The grammar definition of groupingSetExpression after the swap is as follows:

groupingSetExpression
@init {gParent.pushMsg("grouping set expression", state); }
@after {gParent.popMsg(state); }
   :
   LPAREN 
   groupByExpression (COMMA groupByExpression)*
   RPAREN
   -> ^(TOK_GROUPING_SETS_EXPRESSION groupByExpression+)
   |
   groupByExpression
   -> ^(TOK_GROUPING_SETS_EXPRESSION groupByExpression)
   |
   LPAREN
   RPAREN
   -> ^(TOK_GROUPING_SETS_EXPRESSION)
   ;
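
The effect of the branch order can be tried out on a toy parser. The sketch below is not the parser ANTLR generates for Hive: it is a hand-written recursive-descent parser without backtracking for a tiny expression subset (identifiers, '.', '+', parentheses), and its three-token viability check is an assumption chosen to reproduce the cases in this article. The names ToyParser and parse_grouping_sets are hypothetical.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z_]\w*|[().,+])")


def tokenize(text):
    """Split text into identifiers and the punctuation ( ) . , +"""
    tokens, pos, text = [], 0, text.strip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens + ["<EOF>"] * 3  # padding so lookahead never runs off the end


def is_identifier(token):
    return re.fullmatch(r"[A-Za-z_]\w*", token) is not None


class ToyParser:
    def __init__(self, text, list_first):
        self.tokens, self.pos = tokenize(text), 0
        # Branch order of groupingSetExpression: swapped or original.
        self.order = (["list", "single", "empty"] if list_first
                      else ["single", "list", "empty"])

    def peek(self, offset=0):
        return self.tokens[self.pos + offset]

    def expect(self, token):
        if self.peek() != token:
            raise SyntaxError(f"missing {token} at '{self.peek()}'")
        self.pos += 1

    def atom(self):  # ID | ( expression )
        if self.peek() == "(":
            self.expect("(")
            self.expression()
            self.expect(")")
        elif is_identifier(self.peek()):
            self.pos += 1
        else:
            raise SyntaxError(f"unexpected '{self.peek()}'")

    def field(self):  # atom (. ID)*
        self.atom()
        while self.peek() == ".":
            self.pos += 1
            if not is_identifier(self.peek()):
                raise SyntaxError(f"unexpected '{self.peek()}'")
            self.pos += 1

    def expression(self):  # field (+ field)*
        self.field()
        while self.peek() == "+":
            self.pos += 1
            self.field()

    def viable(self, alt):
        """Toy three-token lookahead check (an assumption of this sketch)."""
        t0, t1, t2 = self.peek(0), self.peek(1), self.peek(2)
        if alt == "single":
            return is_identifier(t0) or (t0 == "(" and t1 != ")" and t2 != ",")
        if alt == "list":
            return t0 == "(" and t1 != ")"
        return t0 == "(" and t1 == ")"  # empty

    def grouping_set(self):
        # Commit to the first viable branch; there is no backtracking.
        alt = next((a for a in self.order if self.viable(a)), None)
        if alt == "single":
            self.expression()
        elif alt == "list":
            self.expect("(")
            self.expression()
            while self.peek() == ",":
                self.pos += 1
                self.expression()
            self.expect(")")
        elif alt == "empty":
            self.expect("(")
            self.expect(")")
        else:
            raise SyntaxError(f"no viable alternative at '{self.peek()}'")
        return alt


def parse_grouping_sets(body, list_first):
    """Parse the text inside GROUPING SETS ( ... ); return the branches chosen."""
    parser = ToyParser(body, list_first)
    chosen = [parser.grouping_set()]
    while parser.peek() == ",":
        parser.pos += 1
        chosen.append(parser.grouping_set())
    parser.expect("<EOF>")
    return chosen
```

With the original order, parse_grouping_sets("test.tc1, (test.tc1, test.tc2)", list_first=False) raises SyntaxError: missing ) at ','. With list_first=True the same input parses and the chosen branches are ['single', 'list']. Within this toy the swapped order in turn rejects a body such as (tc1 + tc2) + tc3, which the original order accepts; the sketch cannot tell whether the parser generated for Hive behaves the same way.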

This change, however, breaks an implicit rule for writing lexer and grammar files: simple before complex, atom before combination.

We ran the Hive unit tests after the modification and all cases passed. Still, we cannot guarantee that this is the best solution, because it is impossible for us to exhaustively check all cases outside the unit tests. Better solutions are welcome: zhaohm3@asiainfo.com.
