Hadoop on OpenStack: Elastic Data Processing (EDP) with Savanna 0.3

by Alexander Kuznetsov
August 26, 2013

  Now that version 0.2 of Project Savanna is out, it’s time to start looking at what will be coming up in version 0.3. The goal for this next development phase is to provide elastic data processing (EDP) capabilities, creating a Savanna component that enables data analysis and transformation in an easy and resource-effective way, running a Hadoop cluster only when it’s needed.
  To provide this functionality, Savanna needs you to give it these three things:

  • The data for processing
  • A program that specifies how the data will be processed
  • The location where that program will be executed
  These three items define the architecture for the EDP part of Savanna. You can see the high-level architecture in this figure:

  As you can see from the image, EDP is made up of the following components:

  • Data discovery component—Enables pulling data from various data sources. Data can be pulled from Swift, GlusterFS, or a NoSQL database such as Cassandra or HBase.
  • Job origin—Supplies a task for data processing and execution on the cluster, and allows users to execute several types of jobs: jar files, Pig and Hive scripts, and Oozie job flows.
  • Dispatcher component—Responsible for scheduling jobs on new or existing clusters, provisioning new clusters, resizing clusters, and gathering information from clusters about current jobs and utilization.
  • UI component—Enables integration with the OpenStack Dashboard (Horizon). Its future intent is to provide instruments for job creation, monitoring, and so on. Hue already provides part of this functionality: submitting jobs (jar file, Hive, Pig, Impala), viewing job status, and viewing output.
  We’ll circle back to how EDP uses these three components, but first let’s consider in more detail how Savanna interacts with OpenStack as a whole:

  The Savanna product communicates with the following OpenStack components:

  • Horizon—Provides a GUI with the ability to use all of Savanna’s features.
  • Keystone—Authenticates users and provides a security token that is used to work with the rest of OpenStack, hence limiting a user’s abilities in Savanna to his or her OpenStack privileges.
  • Nova—Used to provision VMs for the Hadoop cluster.
  • Glance—Stores Hadoop VM images, each of which contains an installed OS and Hadoop.
  • Swift—Used as storage for data that will be processed by Hadoop jobs.
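  To make the Keystone interaction above more concrete, here is a minimal, illustrative sketch using the python-keystoneclient library. The endpoint URL, credentials, and tenant name are placeholders, and Savanna performs this exchange internally; the snippet only shows where the token that accompanies every Savanna request comes from.

```python
# Illustrative only: obtain a Keystone token the way any OpenStack client does.
# All credentials and URLs below are placeholders, not real deployment values.
from keystoneclient.v2_0 import client as keystone_client

keystone = keystone_client.Client(
    username="demo",                          # placeholder user
    password="secret",                        # placeholder password
    tenant_name="demo",                       # placeholder tenant/project
    auth_url="http://controller:5000/v2.0",   # placeholder Keystone endpoint
)

# The issued token travels with every request to Savanna (and the rest of
# OpenStack), so a user's privileges in Savanna mirror their OpenStack ones.
print("X-Auth-Token:", keystone.auth_token)
```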
  Let’s consider how Savanna performs elastic data processing using the three components we mentioned in the beginning of this post.
Data for processing
  The data for processing can be stored in various locations and have various representations. Let’s take a look at the different use cases for data locations that Savanna plans to support:

  • Raw files stored in distributed storage such as Swift, Gluster, Ceph, or an existing Hadoop cluster. Hadoop was initially created to address this use case, and it remains a major strength when you need to process huge amounts of data quickly, when the order of data processing is not critical, and when you expect to process all of the data, as opposed to just parts of it.
  • Data stored in a NoSQL database such as HBase, Cassandra, or MongoDB. Data is placed in this type of storage for random access. Placing data in NoSQL can speed up processing when only a subset of the data needs to be processed. For example, if a user needs to process only 10 percent of the data according to some criteria, they can reduce the data processing time by putting the data in NoSQL and building an index on the chosen filter.
  • Data stored in an RDBMS, as is common in many legacy applications. One way Savanna can provide access to this data is to integrate with Apache Sqoop. Sqoop was designed for transferring bulk data between Hadoop and structured data stores such as relational databases. So after starting cluster services and before submitting a job, you would populate the cluster using Sqoop (a minimal sketch of that step follows this list).
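  The Sqoop step mentioned in the last bullet could look roughly like the following. This is a hypothetical sketch only: the JDBC URL, credentials, table, and HDFS paths are placeholders, and the wrapper simply shells out to the standard sqoop CLI from a node that has it installed.

```python
# Hypothetical illustration: bulk-load a relational table into HDFS with Sqoop
# after the cluster is up and before the processing job is submitted.
import subprocess

sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host/sales",       # placeholder JDBC URL
    "--username", "etl",                              # placeholder DB user
    "--password-file", "/user/etl/.sqoop-password",   # placeholder password file
    "--table", "orders",                              # placeholder source table
    "--target-dir", "/data/orders",                   # HDFS directory for the import
    "--num-mappers", "4",                             # parallel import tasks
]

subprocess.check_call(sqoop_import)
```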
Processing the data: job workflow
  In its simplest form, a job is defined by a single jar file. It contains all of the necessary code—in other words, implementation of the map and reduce functions. This means that writing a data processing procedure typically requires knowledge of Java. Another option is to use the Hadoop Streaming API. In that case, you can implement the map and reduce functions in any language supported by the operating system on which Hadoop is running, for example, Python or C++.
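  As a concrete illustration of the Streaming option, a word-count job can be expressed with two small Python scripts; the file names and the job itself are just an example, not something Savanna ships.

```python
#!/usr/bin/env python
# mapper.py - Hadoop Streaming mapper: reads lines on stdin, emits "word<TAB>1".
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py - Hadoop Streaming reducer: sums counts per word.
# Hadoop delivers mapper output sorted by key, so equal words arrive together.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

  Both scripts would be passed to the hadoop-streaming jar as the -mapper and -reducer arguments, together with -input and -output HDFS paths.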
  For those who are not familiar with Java or other programming languages, Savanna provides an opportunity to use a high-level scripting language such as Pig or Hive. These languages can easily be learned by users who are not professional programmers.
  Looking ahead, an interesting possibility is support of Mahout as a service. Mahout contains sophisticated algorithms that are difficult to implement in a scripting language like Hive or Pig, and the end user often doesn’t want to know the implementation details. In the future, we hope that Savanna will enable running Mahout jobs in an elastic way.
  Currently, Savanna also targets users who are professional Hadoop programmers and have to run complex data processing algorithms involving several tasks. The abstraction that describes these general workflows is the directed acyclic graph (DAG): vertices correspond to individual tasks, and edges represent dependencies between steps. In Hadoop, this process is usually managed via Oozie, a Hadoop workflow engine and job coordinator. The next version of Savanna, 0.3, will likely include a mechanism for scheduling Oozie workflows for elastic data processing.
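  To give a feel for what scheduling an Oozie workflow involves, here is a hedged sketch of a submission call against Oozie’s standard web-services API. The host name, user, and HDFS path are placeholders, and this is the kind of request a dispatcher could issue on the user’s behalf rather than Savanna’s actual implementation.

```python
# Illustrative sketch: start a workflow that is already deployed to HDFS by
# POSTing its properties to the Oozie web-services API. Values are placeholders.
import requests

OOZIE_JOBS_URL = "http://oozie-host:11000/oozie/v1/jobs?action=start"  # placeholder host

# Oozie expects the job properties as an XML <configuration> document.
job_properties = """<configuration>
  <property><name>user.name</name><value>hadoop</value></property>
  <property>
    <name>oozie.wf.application.path</name>
    <value>hdfs://namenode:8020/user/hadoop/workflows/etl</value>
  </property>
</configuration>"""

resp = requests.post(OOZIE_JOBS_URL,
                     data=job_properties,
                     headers={"Content-Type": "application/xml"})
resp.raise_for_status()
print("Submitted Oozie job:", resp.json()["id"])
```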
  Now let’s talk about how Savanna can get the job code needed to begin data processing.
  The simplest way to provide a program for data processing is to store it in a distributed object store such as Swift. Savanna uses the Horizon dashboard to provide a user-friendly interface for file upload. This works well when you want a quick, one-time execution, for example, for an ad-hoc query. On the other hand, Savanna should also be able to pull code from a version control system (VCS) repository, for example, Git or Mercurial. This is critical for users who have a continuous delivery process and need a way to propagate code from development and testing environments to production.
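  As an illustration of the Swift-based option, the snippet below stages a job binary in an object container with python-swiftclient. The credentials, container, and object names are placeholders, and in practice the Horizon upload form hides this step from the user.

```python
# Illustrative sketch: upload job code to Swift so it can be fetched at run time.
# Credentials, container, and object names are placeholders.
import swiftclient

conn = swiftclient.Connection(
    authurl="http://controller:5000/v2.0",  # placeholder Keystone endpoint
    user="demo",                            # placeholder user
    key="secret",                           # placeholder password
    tenant_name="demo",                     # placeholder tenant/project
    auth_version="2",
)

conn.put_container("job-binaries")          # placeholder container name

with open("wordcount.jar", "rb") as f:
    conn.put_object("job-binaries", "wordcount.jar",
                    contents=f,
                    content_type="application/java-archive")
```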
  We’ll now drill down and explain job execution in more detail.
Job Execution
  The first question when it comes to job execution is whether Savanna should start a new cluster or run the job on one that already exists. To answer that, Savanna takes into account various aspects, such as the current load on existing clusters, proximity to the data location, and the required speed. Savanna provides this information to the end user, who can then decide where to schedule the job.
  One very useful feature is cluster autoscaling during job execution. Often, different stages of data processing require different amounts of resources. Autoscaling allows you to have only what you need at the time you need it. The Savanna team is still deciding how to handle autoscaling; it is likely that we will support it first within Savanna itself and then integrate with Heat.
Summary
  Plans for version 0.3 of Savanna include both analytics as a service and elastic data processing (EDP). Making this possible requires providing Savanna with the data, the implementation of the processing routine, and guidance on where to process the data. Once these capabilities are in place, Savanna’s EDP features will let you choose where and how to process your data, with resource autoscaling along the way.
  To follow progress on the Savanna project, or even to contribute, please visit the Savanna wiki.
