Apache Hive fails to collect stats

甩祸 · posted on 2018-11-21 11:35:43

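Environment:
hive: apache-hive-1.1.0
hadoop: hadoop-2.5.0-cdh5.3.2
Both the Hive metastore and the stats are stored in MySQL.
The stats-related Hive parameters are as follows:
hive.stats.autogather: gathers statistics automatically during INSERT OVERWRITE commands. Default: true. Set to: true
hive.stats.dbclass: the database that stores Hive's temporary statistics. Default: jdbc:derby. Set to: jdbc:mysql
hive.stats.jdbcdriver: the JDBC driver for the database that stores the temporary statistics. Set to: com.mysql.jdbc.Driver
hive.stats.dbconnectionstring: connection string for the temporary statistics database. Default: jdbc:derby:databaseName=TempStatsStore;create=true. Set to: jdbc:mysql://[ip:port]/[dbname]?user=[username]&password=[password]
hive.stats.default.publisher: the publisher class used by default when dbclass is neither jdbc nor hbase; it must implement the StatsPublisher interface. Default: empty. Left at the default
hive.stats.default.aggregator: the aggregator class used when dbclass is neither jdbc nor hbase; it must implement the StatsAggregator interface. Default: empty. Left at the default
hive.stats.jdbc.timeout: JDBC connection timeout. Default: 30 seconds. Left at the default
hive.stats.retries.max: maximum number of retries when the stats publisher or aggregator hits an exception while updating the database. Default: 0 (no retries). Left at the default
hive.stats.retries.wait: waiting window between retries. Default: 3000 milliseconds. Left at the default
hive.client.stats.publishers: comma-separated list of stats publisher classes invoked on the counters of each job; each must implement the org.apache.hadoop.hive.ql.stats.ClientStatsPublisher interface. Default: empty. Left at the default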
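To make the non-default settings above concrete, here is a minimal sketch that collects them in a map and fills in the connection-string template. The class name HiveStatsSettings, its methods, and the host, database, user and password values are all hypothetical placeholders, not values from this post.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HiveStatsSettings {
    // Collects the four settings changed from their defaults in the list above.
    // All argument values are hypothetical placeholders.
    static Map<String, String> build(String hostPort, String db, String user, String password) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("hive.stats.autogather", "true");
        conf.put("hive.stats.dbclass", "jdbc:mysql");
        // The driver class name is case-sensitive.
        conf.put("hive.stats.jdbcdriver", "com.mysql.jdbc.Driver");
        conf.put("hive.stats.dbconnectionstring",
                "jdbc:mysql://" + hostPort + "/" + db + "?user=" + user + "&password=" + password);
        return conf;
    }

    // Renders the map as key=value lines, one setting per line.
    static String render(Map<String, String> conf) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(build("127.0.0.1:3306", "hive_stats", "hive", "secret")));
    }
}
```

Each key=value pair can then go into hive-site.xml or be passed with --hiveconf; on a shell command line the connection string needs quoting because of the & character.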
Symptoms:

Running INSERT OVERWRITE TABLE does not return correct numRows and rawDataSize values; the result looks like this:
[numFiles=1, numRows=0, totalSize=59, rawDataSize=0]
No stats rows are inserted into the Hive stats database in MySQL either.
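The symptom can be checked mechanically. The sketch below parses a summary line like the one above and flags the case where data exists on disk but no rows were counted. The class StatsLineCheck and the heuristic itself are assumptions made for illustration; they are not part of Hive.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class StatsLineCheck {
    // Parses a summary such as "[numFiles=1, numRows=0, totalSize=59, rawDataSize=0]".
    static Map<String, Long> parse(String line) {
        Map<String, Long> stats = new LinkedHashMap<>();
        String body = line.trim();
        if (body.startsWith("[") && body.endsWith("]")) {
            body = body.substring(1, body.length() - 1);
        }
        for (String pair : body.split(",")) {
            String[] kv = pair.trim().split("=", 2);
            if (kv.length == 2) {
                stats.put(kv[0].trim(), Long.parseLong(kv[1].trim()));
            }
        }
        return stats;
    }

    // Heuristic (an assumption of this sketch): bytes on disk but zero rows and
    // zero raw data size suggests the row-level stats were never published.
    static boolean looksUncollected(Map<String, Long> s) {
        return s.getOrDefault("totalSize", 0L) > 0
                && s.getOrDefault("numRows", 0L) == 0
                && s.getOrDefault("rawDataSize", 0L) == 0;
    }

    public static void main(String[] args) {
        Map<String, Long> s = parse("[numFiles=1, numRows=0, totalSize=59, rawDataSize=0]");
        System.out.println(s + " uncollected=" + looksUncollected(s));
    }
}
```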
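This pointed to a problem in Hive stats, but the console output was too sparse to pinpoint it, so I started Hive with
hive --hiveconf hive.root.logger=INFO,console
to print the detailed log, which showed the following: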
[Error 30001]: StatsPublisher cannot be initialized. There was a error in the initialization
of StatsPublisher, and retrying might help. If you dont want the query to fail because accurate
statistics could not be collected, set hive.stats.reliable=false
Specified key was too long; max key length is 767 bytes
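The cause is straightforward: in Hive 1.1.0 the ID column of the stats table defaults to a length of 4000 and is also declared as the primary key, so the key exceeds MySQL's 767-byte limit and the statement fails.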
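The arithmetic behind the error can be sketched as follows, assuming the stats database uses MySQL's 3-byte utf8 character set. The class KeyLengthCheck and its method names are hypothetical and exist only for this illustration.

```java
final class KeyLengthCheck {
    static final int MAX_KEY_BYTES = 767;      // limit quoted in the MySQL error message
    static final int UTF8_BYTES_PER_CHAR = 3;  // assumption: MySQL utf8, up to 3 bytes per character

    // Worst-case size in bytes of a key on a VARCHAR(varcharLength) column.
    static int keyBytes(int varcharLength) {
        return varcharLength * UTF8_BYTES_PER_CHAR;
    }

    // True when a key on the whole column stays within the limit.
    static boolean fits(int varcharLength) {
        return keyBytes(varcharLength) <= MAX_KEY_BYTES;
    }

    // Longest VARCHAR length that can be used as a key under the limit.
    static int maxKeyLength() {
        return MAX_KEY_BYTES / UTF8_BYTES_PER_CHAR;
    }

    public static void main(String[] args) {
        System.out.println("VARCHAR(4000): " + keyBytes(4000) + " bytes, fits=" + fits(4000));
        System.out.println("VARCHAR(255): " + keyBytes(255) + " bytes, fits=" + fits(255));
        System.out.println("max length: " + maxKeyLength());
    }
}
```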
org.apache.hadoop.hive.ql.stats.jdbc.JDBCStatsSetupConstants:
  // MySQL - 65535, SQL Server - 8000, Oracle - 4000, Derby - 32762, Postgres - large.
  public static final int ID_COLUMN_VARCHAR_SIZE = 4000;
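org.apache.hadoop.hive.ql.stats.jdbc.JDBCStatsPublisher, in public boolean init(Configuration hconf):
            if (colSize < JDBCStatsSetupConstants.ID_COLUMN_VARCHAR_SIZE) {
              String alterTable = JDBCStatsUtils.getAlterIdColumn();
              stmt.executeUpdate(alterTable);
            }
This code shows that whenever the table's ID column is narrower than 4000 it is automatically widened to 4000, so shrinking the column by hand in MySQL does not help. The only fix is to patch the source and change 4000 to 255 (MySQL uses utf8 encoding, where one character takes up to 3 bytes, so 255*3=765 bytes, which fits within the 767-byte limit).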
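The effect of changing the constant can be seen by isolating the comparison from init. This is a simplified model rather than Hive's code: the class IdColumnWidening and the method wouldAlter are hypothetical, and only the comparison itself is taken from the snippet quoted in the post.

```java
final class IdColumnWidening {
    // Mirrors the condition quoted from JDBCStatsPublisher.init: the ALTER is
    // issued only when the existing column is narrower than the constant.
    static boolean wouldAlter(int currentColSize, int idColumnVarcharSize) {
        return currentColSize < idColumnVarcharSize;
    }

    public static void main(String[] args) {
        // Stock constant: a column shrunk by hand to 255 is widened back to 4000.
        System.out.println("constant 4000, column 255: alter=" + wouldAlter(255, 4000));
        // Patched constant: the 255-wide column is left untouched.
        System.out.println("constant 255, column 255: alter=" + wouldAlter(255, 255));
    }
}
```

With the stock value, every initialization would try to restore the 4000-wide column and hit the key-length error again, which is why the change has to be made in the source.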
