|
最近基于大数据的实时分析比较热,也出来了很多解决方案,由于现有的很多方案是基于Hive的,所以大家都以Hive为参照,看从实现上有哪些不同,性能上有哪些提升。不过具体到应用场景,还是有很多不同的,比如Spark基于内存来做快速迭代,要求数据量不能太大,但是性能可以很好,Impala和Drill因为采用了和Dremel类似的存储结构,数据如果导入到其内部的列存储,就可以支持非常大的数据的硬盘快速扫描。Spark的RDD是基于行记录的,而Drill和Impala都支持嵌套数据。Spark以及目前的Impala基本上都需要先导入数据再做分析,而drill支持外部数据源。当然性能肯定会打折扣。
但是从成熟度上来看,Spark和Impala相对好一些,一些企业也开始用了,Drill还在不紧不慢的开发,距离自己的架构还远,只能安装后试一下。因为工作需要,我和实习生做了一下简单的实践,记在这里,希望有些参考作用。
Drill是apache参考Google的Dremel所创建的开源项目,项目的网址是http://incubator.apache.org/drill/。有两个可以公开订阅的邮件列表, User List和Developer List,虽然目前人不太多,
但是基本有问必答。
一、环境搭建
我们当时是安装这个guide编译安装的
https://cwiki.apache.org/confluence/display/DRILL/Compiling+Drill+from+source
安装环境:Xeon x5680 + Ubuntu 12.04
1)apt-get install maven
mvn-version如果显示java版本不是1.7,需要将javah、javadoc等的版本改为1.7的版本
protoc: errorwhile loading shared libraries: libprotoc.so.8: cannot open shared object file:No such file or directory
或-bash: /usr/bin/protoc: Nosuch file or directory
错误原因:/usr/local/lib,而Ubuntu体系默认的lib
so库路径:1) /etc/ld.so.conf.d/libprotobuf.conf 包含内容:2)
sudoldconfig
(这是因为so库时需要知道路径)
4)apt-get install git-all
5)git clone 如果有代理无法使用git下载,可以通过http下载zip包
cdincubator-drill
mvnclean install –DskipTests
如果有代理,需要设置mvn的代理:
vi /etc/maven/ settings.xml
如果有代理,需要设置mvn的代理:
optional DependencyResolutionExceptionorg.apache.drill.exec.exception.SetupException
通过邮件列表问了一下,应该是目前的drill不支持hdfs 2.0,所以应该使用hdfs1.1或者hdfs1.2
3)改成Hdfs1.2.1之后,启动sqline没有问题了。
4) 把文件region.parquet拷到HDFS中的test目录下
hadoop fs -copyFromLocalsample-data/nation.parquet test
5)执行查询,注意要使用select* from "/user/hadoop/test/region.parquet";
四、复杂查询
这里是一些查询例子,表明drill对sql的支持是比较好的,不再赘述
SELECT nations.name, regions.nameFROM (SELECT _MAP['N_REGIONKEY'] as regionKey, _MAP['N_NAME'] as name FROM"/user/hadoop/test/nation.parquet") nations join (SELECT_MAP['R_REGIONKEY'] as regionKey, _MAP['R_NAME'] as name FROM "/user/hadoop/test/region.parquet")regions on nations.regionKey = regions.regionKey order by nations.name;
(
SELECT nations.name, regions.name FROM (
SELECT_MAP['N_REGIONKEY'] as regionKey, _MAP['N_NAME'] as name
FROM"/user/hadoop/test/nation.parquet") nations
SELECT_MAP['R_REGIONKEY'] as regionKey, _MAP['R_NAME'] as name
FROM"/user/hadoop/test/region.parquet") regions
onnations.regionKey = regions.regionKey
命令写成下面这样容易看,但执行时命令中间不能有回车,所以需要写成上面那样
)
查询结果:
+-----------------+--------------+
| NAME | NAME |
+-----------------+--------------+
| ALGERIA | AFRICA |
| ARGENTINA | AMERICA |
| BRAZIL | AMERICA |
| CANADA | AMERICA |
| CHINA | ASIA |
| EGYPT | MIDDLE EAST |
| ETHIOPIA | AFRICA |
| FRANCE | EUROPE |
| GERMANY | EUROPE |
| INDIA | ASIA |
| INDONESIA | ASIA |
| IRAN | MIDDLE EAST |
| IRAQ | MIDDLE EAST |
| JAPAN | ASIA |
| JORDAN | MIDDLE EAST |
| KENYA | AFRICA |
| MOROCCO | AFRICA |
| MOZAMBIQUE | AFRICA |
| PERU | AMERICA |
| ROMANIA | EUROPE |
| RUSSIA | EUROPE |
| SAUDI ARABIA | MIDDLE EAST |
| UNITED KINGDOM | EUROPE |
| UNITED STATES | AMERICA |
| VIETNAM | ASIA |
+-----------------+--------------+
25 rows selected (0.991 seconds)
Notice the use of sub-queries to avoid thespread of the _MAP idiom.(子查询)
This query illustrates how a cast iscurrently necessary to make the parser happy(投影):
SELECT _MAP['N_REGIONKEY'] as regionKey,_MAP['N_NAME'] as name FROM "/user/hadoop/test/nation.parquet"WHERE cast(_MAP['N_NAME'] as varchar) IN ('MOROCCO', 'MOZAMBIQUE');
查询结果:
+------------+-------------+
| REGIONKEY | NAME |
+------------+-------------+
| 0 | MOROCCO |
| 0 | MOZAMBIQUE |
+------------+-------------+
2 rows selected (0.504 seconds)
// count distinct
SELECT count(distinct _MAP['N_REGIONKEY'])FROM "/user/hadoop/test/nation.parquet";
查询结果:
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING:org.eigenbase.rel.InvalidRelException: DrillAggregateRel does not supportDISTINCT a ggregates
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING:org.eigenbase.rel.InvalidRelException: DrillAggregateRel does not supportDISTINCT a ggregates
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING:org.eigenbase.rel.InvalidRelException: DrillAggregateRel does not supportDISTINCT a ggregates
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING:org.eigenbase.rel.InvalidRelException: DrillAggregateRel does not supportDISTINCT a ggregates
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING:org.eigenbase.rel.InvalidRelException: DrillAggregateRel does not supportDISTINCT a ggregates
Dec 09, 2013 10:15:45 PM org.apache.drill.optiq.DrillAggregateRuleonMatch
WARNING:org.eigenbase.rel.InvalidRelException: DrillAggregateRel does not supportDISTINCT a ggregates
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING:org.eigenbase.rel.InvalidRelException: DrillAggregateRel does not supportDISTINCT a ggregates
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING:org.eigenbase.rel.InvalidRelException: DrillAggregateRel does not supportDISTINCT a ggregates
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING:org.eigenbase.rel.InvalidRelException: DrillAggregateRel does not supportDISTINCT a ggregates
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING: org.eigenbase.rel.InvalidRelException:DrillAggregateRel does not support DISTINCT a ggregates
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING: org.eigenbase.rel.InvalidRelException:DrillAggregateRel does not support DISTINCT a ggregates
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING: org.eigenbase.rel.InvalidRelException:DrillAggregateRel does not support DISTINCT a ggregates
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING: org.eigenbase.rel.InvalidRelException:DrillAggregateRel does not support DISTINCT a ggregates
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING:org.eigenbase.rel.InvalidRelException: DrillAggregateRel does not supportDISTINCT a ggregates
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING:org.eigenbase.rel.InvalidRelException: DrillAggregateRel does not supportDISTINCT a ggregates
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING:org.eigenbase.rel.InvalidRelException: DrillAggregateRel does not supportDISTINCT a ggregates
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING:org.eigenbase.rel.InvalidRelException: DrillAggregateRel does not supportDISTINCT a ggregates
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING:org.eigenbase.rel.InvalidRelException: DrillAggregateRel does not supportDISTINCT a ggregates
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING:org.eigenbase.rel.InvalidRelException: DrillAggregateRel does not supportDISTINCT a ggregates
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING:org.eigenbase.rel.InvalidRelException: DrillAggregateRel does not supportDISTINCT a ggregates
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING:org.eigenbase.rel.InvalidRelException: DrillAggregateRel does not supportDISTINCT a ggregates
Dec 09, 2013 10:15:45 PMorg.apache.drill.optiq.DrillAggregateRule onMatch
WARNING:org.eigenbase.rel.InvalidRelException: DrillAggregateRel does not supportDISTINCT a ggregates
+---------+
| EXPR$0 |
+---------+
| 5 |
+---------+
1 row selected (0.464 seconds)
// aliases(别名)
SELECT _MAP['N_REGIONKEY'] as regionKey,_MAP['N_NAME'] as name FROM "/user/hadoop/test/nation.parquet";
查询结果:
+------------+-----------------+
| REGIONKEY | NAME |
+------------+-----------------+
| 0 | ALGERIA |
| 1 | ARGENTINA |
| 1 | BRAZIL |
| 1 | CANADA |
| 4 | EGYPT |
| 0 | ETHIOPIA |
| 3 | FRANCE |
| 3 | GERMANY |
| 2 | INDIA |
| 2 | INDONESIA |
| 4 | IRAN |
| 4 | IRAQ |
| 2 | JAPAN |
| 4 | JORDAN |
| 0 | KENYA |
| 0 | MOROCCO |
| 0 | MOZAMBIQUE |
| 1 | PERU |
| 2 | CHINA |
| 3 | ROMANIA |
| 4 | SAUDI ARABIA |
| 2 | VIETNAM |
| 3 | RUSSIA |
| 3 | UNITED KINGDOM |
| 1 | UNITED STATES |
+------------+-----------------+
25 rows selected (0.354 seconds)
// order by(排序)
SELECT _MAP['N_REGIONKEY'] as regionKey,_MAP['N_NAME'] as name FROM "/user/hadoop/test/nation.parquet" ORDERBY _MAP['N_NAME'] DESC;
查询结果:
+------------+-----------------+
| REGIONKEY | NAME |
+------------+-----------------+
| 2 | VIETNAM |
| 1 | UNITED STATES |
| 3 | UNITED KINGDOM |
| 4 | SAUDI ARABIA |
| 3 | RUSSIA |
| 3 | ROMANIA |
| 1 | PERU |
| 0 | MOZAMBIQUE |
| 0 | MOROCCO |
| 0 | KENYA |
| 4 | JORDAN |
| 2 | JAPAN |
| 4 | IRAQ |
| 4 | IRAN |
| 2 | INDONESIA |
| 2 | INDIA |
| 3 | GERMANY |
| 3 | FRANCE |
| 0 | ETHIOPIA |
| 4 | EGYPT |
| 2 | CHINA |
| 1 | CANADA |
| 1 | BRAZIL |
| 1 | ARGENTINA |
| 0 | ALGERIA |
+------------+-----------------+
25 rows selected (0.444 seconds)
// subquery order by
select * from (SELECT _MAP['N_REGIONKEY']as regionKey, _MAP['N_NAME'] as name FROM "/user/hadoop/test/nation.parquet")as x ORDER BY name DESC;
查询结果:
+------------+-----------------+
| REGIONKEY | NAME |
+------------+-----------------+
| 2 | VIETNAM |
| 1 | UNITED STATES |
| 3 | UNITED KINGDOM |
| 4 | SAUDI ARABIA |
| 3 | RUSSIA |
| 3 | ROMANIA |
| 1 | PERU |
| 0 | MOZAMBIQUE |
| 0 | MOROCCO |
| 0 | KENYA |
| 4 | JORDAN |
| 2 | JAPAN |
| 4 | IRAQ |
| 4 | IRAN |
| 2 | INDONESIA |
| 2 | INDIA |
| 3 | GERMANY |
| 3 | FRANCE |
| 0 | ETHIOPIA |
| 4 | EGYPT |
| 2 | CHINA |
| 1 | CANADA |
| 1 | BRAZIL |
| 1 | ARGENTINA |
| 0 | ALGERIA |
+------------+-----------------+
25 rows selected (0.418 seconds)
// String where
SELECT _MAP['N_REGIONKEY'] as regionKey,_MAP['N_NAME'] as name FROM "/user/hadoop/test/nation.parquet" WHERE cast(_MAP['N_NAME']as varchar) > 'M';
查询结果:
+------------+-----------------+
| REGIONKEY | NAME |
+------------+-----------------+
| 0 | MOROCCO |
| 0 | MOZAMBIQUE |
| 1 | PERU |
| 3 | ROMANIA |
| 4 | SAUDI ARABIA |
| 2 | VIETNAM |
| 3 | RUSSIA |
| 3 | UNITED KINGDOM |
| 1 | UNITED STATES |
+------------+-----------------+
9 rows selected (0.363 seconds)
// INNER Join + Order (parquet)
SELECT n.name, r.name FROM (SELECT_MAP['N_REGIONKEY'] as regionKey, _MAP['N_NAME'] as name FROM "/user/hadoop/test/nation.parquet")njoin (SELECT _MAP['R_REGIONKEY'] as regionKey, _MAP['R_NAME'] as name FROM "/user/hadoop/test/region.parquet")rusing (regionKey);
查询结果:
+-----------------+--------------+
| NAME | NAME |
+-----------------+--------------+
| ETHIOPIA |AFRICA |
| MOROCCO | AFRICA |
| KENYA | AFRICA |
| ALGERIA | AFRICA |
| MOZAMBIQUE | AFRICA |
| UNITED STATES | AMERICA |
| ARGENTINA | AMERICA |
| BRAZIL | AMERICA |
| CANADA | AMERICA |
| PERU | AMERICA |
| INDONESIA | ASIA |
| INDIA | ASIA |
| JAPAN | ASIA |
| CHINA | ASIA |
| VIETNAM | ASIA |
| FRANCE | EUROPE |
| RUSSIA | EUROPE |
| UNITED KINGDOM | EUROPE |
| ROMANIA | EUROPE |
| GERMANY | EUROPE |
| EGYPT | MIDDLE EAST |
| SAUDI ARABIA | MIDDLE EAST |
| IRAQ | MIDDLE EAST |
| IRAN | MIDDLE EAST |
| JORDAN | MIDDLE EAST |
+-----------------+--------------+
25 rows selected (0.501 seconds)
// INNER Join + Order (parquet)
SELECT n.name, r.name FROM (SELECT_MAP['N_REGIONKEY'] as regionKey, _MAP['N_NAME'] as name FROM "/user/hadoop/test/nation.parquet")njoin (SELECT _MAP['R_REGIONKEY'] as regionKey, _MAP['R_NAME'] as name FROM"/user/hadoop/test/region.parquet")r on n.regionKey = r.regionKey orderby n.name;
查询结果:
+-----------------+--------------+
| NAME | NAME |
+-----------------+--------------+
| ALGERIA | AFRICA |
| ARGENTINA | AMERICA |
| BRAZIL | AMERICA |
| CANADA | AMERICA |
| CHINA | ASIA |
| EGYPT | MIDDLE EAST |
| ETHIOPIA | AFRICA |
| FRANCE | EUROPE |
| GERMANY | EUROPE |
| INDIA | ASIA |
| INDONESIA | ASIA |
| IRAN | MIDDLE EAST |
| IRAQ | MIDDLE EAST |
| JAPAN | ASIA |
| JORDAN | MIDDLE EAST |
| KENYA | AFRICA |
| MOROCCO | AFRICA |
| MOZAMBIQUE | AFRICA |
| PERU | AMERICA |
| ROMANIA | EUROPE |
| RUSSIA | EUROPE |
| SAUDI ARABIA | MIDDLE EAST |
| UNITED KINGDOM | EUROPE |
| UNITED STATES | AMERICA |
| VIETNAM | ASIA |
+-----------------+--------------+
25 rows selected (0.565 seconds)
select * from"/user/hadoop/test/nation.parquet";
查询结果:
+-------------------------------------------------------------------------------------------+
| |
+-------------------------------------------------------------------------------------------+
|{"N_NAME":"ALGERIA","N_REGIONKEY":0,"N_COMMENT":"haggle. carefully final deposits detect |
|{"N_NAME":"ARGENTINA","N_REGIONKEY":1,"N_COMMENT":"alfoxes promise slyly according to th |
|{"N_NAME":"BRAZIL","N_REGIONKEY":1,"N_COMMENT":"yalongside of the pending deposits. care |
|{"N_NAME":"CANADA","N_REGIONKEY":1,"N_COMMENT":"eashang ironic, silent packages. slyly r |
|{"N_NAME":"EGYPT","N_REGIONKEY":4,"N_COMMENT":"yabove the carefully unusual theodolites. |
|{"N_NAME":"ETHIOPIA","N_REGIONKEY":0,"N_COMMENT":"venpackages wake quickly. regu","N_NAT |
|{"N_NAME":"FRANCE","N_REGIONKEY":3,"N_COMMENT":"refullyfinal requests. regular, ironi"," |
| {"N_NAME":"GERMANY","N_REGIONKEY":3,"N_COMMENT":"lplatelets. regular accounts x-ray: unu |
|{"N_NAME":"INDIA","N_REGIONKEY":2,"N_COMMENT":"ssexcuses cajole slyly across the package |
|{"N_NAME":"INDONESIA","N_REGIONKEY":2,"N_COMMENT":"slyly express asymptotes. regular dep |
|{"N_NAME":"IRAN","N_REGIONKEY":4,"N_COMMENT":"efullyalongside of the slyly final depende |
|{"N_NAME":"IRAQ","N_REGIONKEY":4,"N_COMMENT":"nicdeposits boost atop the quickly final r |
|{"N_NAME":"JAPAN","N_REGIONKEY":2,"N_COMMENT":"ously.final, express gifts cajole a","N_N |
|{"N_NAME":"JORDAN","N_REGIONKEY":4,"N_COMMENT":"icdeposits are blithely about the carefu |
|{"N_NAME":"KENYA","N_REGIONKEY":0,"N_COMMENT":"pending excuses haggle furiously deposits |
|{"N_NAME":"MOROCCO","N_REGIONKEY":0,"N_COMMENT":"rns.blithely bold courts among the clos |
|{"N_NAME":"MOZAMBIQUE","N_REGIONKEY":0,"N_COMMENT":"s.ironic, unusual asymptotes wake bl |
|{"N_NAME":"PERU","N_REGIONKEY":1,"N_COMMENT":"platelets.blithely pending dependencies us |
| {"N_NAME":"CHINA","N_REGIONKEY":2,"N_COMMENT":"cdependencies. furiously express notornis |
|{"N_NAME":"ROMANIA","N_REGIONKEY":3,"N_COMMENT":"ularasymptotes are about the furious mu |
| {"N_NAME":"SAUDIARABIA","N_REGIONKEY":4,"N_COMMENT":"ts. silentrequests haggle. closely |
|{"N_NAME":"VIETNAM","N_REGIONKEY":2,"N_COMMENT":"helyenticingly express accounts. even, |
|{"N_NAME":"RUSSIA","N_REGIONKEY":3,"N_COMMENT":"requests against the platelets use never |
| {"N_NAME":"UNITEDKINGDOM","N_REGIONKEY":3,"N_COMMENT":"eans boostcarefully special requ |
| {"N_NAME":"UNITEDSTATES","N_REGIONKEY":1,"N_COMMENT":"y finalpackages. slow foxes cajol |
+-------------------------------------------------------------------------------------------+
25 rows selected (0.278 seconds) |
|