[Experience Sharing] Hadoop Don'ts: What not to do to harvest Hadoop's full potential

We've all heard this story. All was fine until one day your boss heard somewhere that Hadoop and NoSQL are the new black and mandated that the whole company switch whatever it was doing over to the Hadoop et al. technology stack, because that's the only way to get your solution to scale to web proportions while maintaining reliability and efficiency.

So you threw away your old relational database back end and maybe all or part of your middle-tier code, bought a couple of books, and after a few days of swearing got your first MapReduce jobs running. But as you finished re-implementing your entire solution, you found that not only is the system far less efficient than the old one, it's not even scalable or reliable, and your meetings are starting to resemble the Hadoop Downfall parody more and more.

So what went wrong?

The problem with Hadoop is that it is relatively easy to start using it without in-depth knowledge of what gives it its power, and without that knowledge you are more likely than not to design your solution in a way that takes all of that power away. So let's take a look at a few key features of Hadoop and what not to do to keep them.

Feature: Efficiency

There are a couple of things Hadoop does to ensure efficiency: 


  • It brings the computation to the data. So instead of sending large amounts of data over the network to the machines which execute the computations, it tries to run the computations directly on the nodes which contain the required inputs.
  • It processes files sequentially, reducing the number of costly disk seeks. Unless the Hadoop cluster is running on SSDs, it takes on the order of 10 ms to seek to a new place on the disk, while in that same time it can read on the order of 10 megabits of data sequentially (at roughly 100 MB/s, 10 ms of sequential reading yields about 1 MB, i.e. 8 megabits). So you can't process large amounts of data if the processing involves frequent seeks.
  • It uses compression, memory buffers and other optimizations to make the data flow in the system very efficient.
Do not: Physically separate your data cluster from your compute cluster. Whether your data is coming from HDFS, HBase, Cassandra or somewhere else, if it is not stored on the same machines the MapReduce jobs run on, it is impossible for Hadoop to bring the computation to the data. There are cases where such a setup is acceptable, in particular when the jobs are compute-intensive rather than data-intensive (they do a lot of computing and not a whole lot of reading and writing of data), and Hadoop can still be a good option for such jobs, but its full potential is with jobs that process large datasets with relatively simple computations.

Do not: Create location-unaware input formats. Often the input formats that come with the data storage you are using do not exactly fit your system and you have to write your own. As you do, make sure you implement location awareness by returning a non-empty result from InputSplit.getLocations(), or by inheriting this functionality from a superclass if you are extending an existing input format.
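As a rough illustration, here is a minimal location-aware split written against the Hadoop 2 mapreduce API. The class name, the constructor, and the idea of carrying the replica hostnames inside the split are assumptions made up for the example; the part that matters is the non-empty getLocations() override, which is what lets the scheduler place tasks near the data.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

// Hypothetical split for a custom data store, for illustration only.
public class MyStoreInputSplit extends InputSplit implements Writable {

  private long length;      // size of this chunk in bytes
  private String[] hosts;   // nodes holding a replica of this chunk

  public MyStoreInputSplit() { }  // no-arg constructor for deserialization

  public MyStoreInputSplit(long length, String[] hosts) {
    this.length = length;
    this.hosts = hosts;
  }

  @Override
  public long getLength() {
    return length;
  }

  @Override
  public String[] getLocations() {
    // Non-empty: this is what makes the split "location aware" and
    // allows the framework to bring the computation to the data.
    return hosts;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(length);
    out.writeInt(hosts.length);
    for (String h : hosts) out.writeUTF(h);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    length = in.readLong();
    hosts = new String[in.readInt()];
    for (int i = 0; i < hosts.length; i++) hosts[i] = in.readUTF();
  }
}
```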

Do not: Read or write external data in the map or reduce tasks (apart from reading the input and writing the output as orchestrated by the framework). For example, you may be tempted to make database read or write calls directly in the map or reduce code, and there is nothing to prevent you from doing so. But especially if you are doing a lot of small random lookups, you are losing all of the optimization the framework provides for streaming data efficiently through the system. Also, if you are accessing a limited-availability resource, such as a SQL database, you may be introducing a bottleneck that prevents the solution from scaling (see Scalability below).

Do not: Write code to copy files around manually. Once again, the Hadoop framework does a lot to optimize the flow of data during a compute job, but copying files around manually will always incur heavy network traffic (as data is replicated) and cannot be optimized, since the system does not know what you are trying to do. Apart from the input/output mechanism, Hadoop also offers a distributed cache which can be used to bring data to your tasks, as sketched below. These provided mechanisms are usually the most efficient way of bringing data and computation together.
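For example, here is a minimal sketch of shipping a lookup file to every task through the distributed cache, using the Hadoop 2 API (older releases expose the same functionality through the DistributedCache class). The HDFS path and the class names are made up for the example:

```java
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {

  public static class LookupMapper
      extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void setup(Context context)
        throws IOException, InterruptedException {
      // Files added with addCacheFile() are copied to each node once
      // by the framework and exposed to the task as local files.
      URI[] cached = context.getCacheFiles();
      // ... open the local copy of cached[0] and load it into an
      // in-memory map for use in map() ...
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cache example");
    job.setJarByClass(CacheExample.class);
    // Hypothetical HDFS path; the framework distributes it efficiently
    // instead of every task hammering the same file.
    job.addCacheFile(new URI("/data/lookup/cities.tsv"));
    job.setMapperClass(LookupMapper.class);
    // ... input/output paths and the rest of the job setup ...
  }
}
```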

Feature: Scalability

Hadoop is highly scalable in the sense that you can grow your cluster to many thousands of nodes, and computing throughput increases linearly with it. That's because Hadoop itself has no central bottleneck. But it doesn't follow that a system implemented on top of Hadoop is automatically free of bottlenecks, either.

Do not: Synchronize tasks. Since tasks execute in parallel, it may be tempting to add synchronization between them, and with tools such as ZooKeeper it is fairly easy to add all sorts of distributed synchronization mechanisms such as locks or queues. But every synchronization point is a bottleneck, so workflows using them almost never scale.

Do not: Use a constant number of mapper or reducer tasks. The number of map tasks a job is split into is proportional to the size of the input, so the number of mappers scales naturally as the input grows (as long as you keep this in mind when implementing a custom input format). The number of reducers, however, is set by the programmer. Since the number of reducers also defines the partitioning of the output, it may be tempting to keep it constant, or even to always use a single reducer so that the output is just one file. But if you do this, you prevent the system from scaling out as the data and the cluster grow; a sketch of sizing the reducer count from the input follows below.
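A simple alternative is to derive the reducer count from the input size when the job is set up. The sketch below assumes a target of roughly 1 GB of input per reducer, which is an arbitrary figure; the right value depends on the cluster and the workload:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public final class ReducerSizing {

  // Derive the reducer count from the total input size instead of
  // hard-coding it, so the job keeps scaling out as the data grows.
  static void setReducersByInputSize(Job job, Path input) throws IOException {
    long bytesPerReducer = 1L << 30;  // ~1 GB per reducer (assumed target)
    FileSystem fs = input.getFileSystem(job.getConfiguration());
    long inputBytes = fs.getContentSummary(input).getLength();
    job.setNumReduceTasks((int) Math.max(1, inputBytes / bytesPerReducer));
  }
}
```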

Do not: Talk to the job tracker. It may be tempting to have tasks talk directly to the job tracker that scheduled the job, to find out extra information about the job or the other tasks. For example, you might have the reducer ask the job tracker for the total number of input records, in order to turn the classic word count example into an IDF calculation. But the job tracker is a single resource, so such practices can prevent your job from scaling, and can also give misleading results, since the job tracker may not be serving up-to-date statistics while the job is still running. A safer pattern is sketched below.
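A minimal sketch of that pattern, assuming the Hadoop 2 API (in older releases the counter lives in Task.Counter rather than TaskCounter): run the first job to completion, read the aggregated counter, and pass the value to the follow-up job through its configuration. The job names and the "idf.total.records" key are made up for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class IdfDriver {
  public static void main(String[] args) throws Exception {
    // First pass: the counting job (inputs, outputs, mapper and
    // reducer configuration omitted here).
    Job countJob = Job.getInstance(new Configuration(), "term count");
    // ... configure countJob ...
    if (!countJob.waitForCompletion(true)) {
      System.exit(1);
    }

    // Read the aggregated counter only after the job has finished,
    // so the value is final and consistent.
    long totalRecords = countJob.getCounters()
        .findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();

    // Second pass: hand the value to the next job through its
    // configuration; reducers read it back with
    // context.getConfiguration().getLong("idf.total.records", -1).
    Configuration idfConf = new Configuration();
    idfConf.setLong("idf.total.records", totalRecords);
    Job idfJob = Job.getInstance(idfConf, "idf");
    // ... configure idfJob and run it ...
  }
}
```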

Feature: Reliability

The Hadoop MapReduce framework handles failures gracefully. If a node in the cluster fails mid-task, it will simply re-run the task on a different node. It also hedges against having to wait for slow nodes by starting duplicate tasks on multiple machines (referred to as "speculative execution"). But if you are not aware of these mechanisms, your system may end up not functioning as expected.

Do not: Write an output format without considering output committer functionality. With FileOutputFormat and its derivatives, only the output of successful task attempts makes it to the job's output directory; the rest is discarded. This logic is handled by the FileOutputCommitter. Some custom formats may not need committers, for example if they write to a key-value store that keeps only one value per key, but in other cases failed and subsequently restarted tasks, or tasks scheduled speculatively, can leave the output inconsistent, for example with duplicate records. A skeleton committer follows below.
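For reference, this is the shape of a custom committer, sketched against the Hadoop 2 mapreduce API with all bodies left as comments. A real committer, like FileOutputCommitter, stages each attempt's output in a private temporary location and publishes it only in commitTask(), so failed or speculative duplicate attempts never reach the final output:

```java
import java.io.IOException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Hypothetical committer skeleton for a custom output format.
public class MyStoreOutputCommitter extends OutputCommitter {

  @Override
  public void setupJob(JobContext context) throws IOException {
    // create the job-level staging area
  }

  @Override
  public void setupTask(TaskAttemptContext context) throws IOException {
    // create a private staging area for this task attempt
  }

  @Override
  public boolean needsTaskCommit(TaskAttemptContext context) throws IOException {
    return true;  // this attempt produced output that must be published
  }

  @Override
  public void commitTask(TaskAttemptContext context) throws IOException {
    // atomically move this attempt's staged output to the final
    // location; called for exactly one successful attempt per task
  }

  @Override
  public void abortTask(TaskAttemptContext context) throws IOException {
    // delete this attempt's staged output
  }
}
```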

Do not: Write tasks with side effects. If a task has side effects, for example if it writes directly to an external data store, then retried tasks, or duplicate tasks scheduled through speculative execution, may end up producing unintended results.
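If a side effect genuinely cannot be avoided, one mitigation is to switch speculative execution off, so that no two attempts of the same task run concurrently (failed tasks can still be retried, so the writes should be idempotent as well). A minimal sketch, assuming Hadoop 2 property names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SideEffectJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hadoop 2 property names; Hadoop 1 used
    // mapred.map.tasks.speculative.execution and
    // mapred.reduce.tasks.speculative.execution.
    conf.setBoolean("mapreduce.map.speculative", false);
    conf.setBoolean("mapreduce.reduce.speculative", false);
    Job job = Job.getInstance(conf, "side-effect job");
    // ... rest of the job setup ...
  }
}
```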
