[Experience Sharing] Online Apache HBase Backups with CopyTable - Bill zhang - 51CTO博客 (posted 2018-11-21)
  Source: http://blog.cloudera.com/blog/2012/06/online-hbase-backups-with-copytable-2/
  CopyTable is a simple Apache HBase utility that, unsurprisingly, can be used for copying individual tables within an HBase cluster or from one HBase cluster to another. In this blog post, we’ll talk about what this tool is, why you would want to use it, how to use it, and some common configuration caveats.
Use cases:
  CopyTable is at its core an Apache Hadoop MapReduce job that uses the standard HBase Scan read-path interface to read records from an individual table and writes them to another table (possibly on a separate cluster) using the standard HBase Put write-path interface. It can be used for many purposes:

  •   Internal copy of a table (Poor man’s snapshot)
  •   Remote HBase instance backup
  •   Incremental HBase table copies
  •   Partial HBase table copies and HBase table schema changes
Assumptions and limitations:
  The CopyTable tool has some basic assumptions and limitations. First, if being used in the multi-cluster situation, both clusters must be online, and the target instance needs to have the target table present with the same column families defined as the source table.
  Since the tool uses standard scans and puts, the target cluster doesn’t have to have the same number of nodes or regions. In fact, it can have different numbers of tables, different numbers of region servers, and completely different region split boundaries. Since we are copying entire tables, you can use performance-optimization settings, such as larger scanner-caching values, for more efficiency. Using the put interface also means that copies can be made between clusters of different minor versions (0.90.4 -> 0.90.6, CDH3u3 -> CDH3u4) or versions that are wire compatible (0.92.1 -> 0.94.0).
  Finally, HBase only provides row-level ACID guarantees; this means that while a CopyTable is running, rows may be newly inserted or updated, and each of these concurrent edits will either be completely included or completely excluded. While individual rows will be consistent, there are no guarantees about the consistency, causality, or order of puts on the other rows.
Internal copy of a table (Poor man’s snapshot)
  Versions of HBase up to and including the most recent 0.94.x versions do not support table snapshotting. Despite HBase’s ACID limitations, CopyTable can be used as a naive snapshotting mechanism that makes a physical copy of a particular table.
  Let’s say that we have a table, tableOrig with column-families cf1 and cf2. We want to copy all its data to tableCopy. We need to first create tableCopy with the same column families:
  srcCluster$ echo "create 'tableOrig', 'cf1', 'cf2'" | hbase shell
  We can then create and copy the table with a new name on the same HBase instance:
  srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=tableCopy tableOrig
  This starts an MR job that will copy the data.
Remote HBase instance backup
  Let’s say we want to copy data to another cluster. This could be a one-off backup, a periodic job or could be for bootstrapping for cross-cluster replication. In this example, we’ll have two separate clusters: srcCluster and dstCluster.
  In this multi-cluster case, CopyTable is a push process — your source will be the HBase instance your current hbase-site.xml refers to, and the added arguments point to the destination cluster and table. This also assumes that all of the MR TaskTrackers can access all the HBase and ZK nodes in the destination cluster. This mechanism for configuration also means that you could run this as a job on a remote cluster by overriding the hbase/mr configs to use settings from any accessible remote cluster and specifying the ZK nodes in the destination cluster. This could be useful if you wanted to copy data from an HBase cluster with lower SLAs and didn’t want to run MR jobs on it directly.
  You will use the --peer.adr setting to specify the destination cluster’s ZK ensemble (i.e., the cluster you are copying to). For this we need the ZK quorum’s IP and port as well as the HBase root ZK node for our HBase instance. Let’s say one of these machines is srcClusterZK (listed in hbase.zookeeper.quorum) and that we are using the default ZK client port 2181 (hbase.zookeeper.property.clientPort) and the default ZK znode parent /hbase (zookeeper.znode.parent). (Note: If you had two HBase instances using the same ZK, you’d need a different zookeeper.znode.parent for each cluster.)
  # create new tableOrig on destination cluster
  dstCluster$ echo "create 'tableOrig', 'cf1', 'cf2'" | hbase shell
  # on source cluster run copy table with destination ZK quorum specified using --peer.adr
  # WARNING: In older versions, you are not alerted about any typo in these arguments!
  srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=dstClusterZK:2181:/hbase tableOrig
  Note that you can use the --new.name argument with --peer.adr to copy to a differently named table on the dstCluster.
  # create new tableCopy on destination cluster
  dstCluster$ echo "create 'tableCopy', 'cf1', 'cf2'" | hbase shell
  # on source cluster run copy table with destination --peer.adr and --new.name arguments.
  srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=dstClusterZK:2181:/hbase --new.name=tableCopy tableOrig
  This will copy data from tableOrig on the srcCluster to the dstCluster’s tableCopy table.
Incremental HBase table copies
  Once you have a copy of a table on a destination cluster, how do you copy new data that is later written to the source cluster? Naively, you could run the CopyTable job again and copy over the entire table. However, CopyTable provides a more efficient incremental copy mechanism that copies only the rows updated within a specified window of time from the srcCluster to the backup dstCluster. Thus, after the initial copy, you could have a periodic cron job that copies only the previous hour’s data from srcCluster to the dstCluster.
  This is done by specifying the --starttime and --endtime arguments. Times are specified as decimal milliseconds since the Unix epoch.
  # WARNING: In older versions, you are not alerted about any typo in these arguments!
  # Copy from the beginning of time until timeEnd.
  # NOTE: You must include a start time for the end time to be respected. The start time cannot be 0.
  srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable ... --starttime=1 --endtime=timeEnd ...
  # Copy from timeStart (inclusive) until the end of time.
  srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable ... --starttime=timeStart ...
  # Copy rows from timeStart (inclusive) to timeEnd (exclusive).
  srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable ... --starttime=timeStart --endtime=timeEnd
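  To make the hourly cron job concrete, here is a minimal sketch (not part of the original post) that computes the previous full hour as a [start, end) window in epoch milliseconds and prints the resulting CopyTable invocation. The cluster and table names (dstClusterZK, tableOrig) are the placeholders used above; it assumes a POSIX shell with `date +%s`.

```shell
#!/bin/sh
# Compute the previous full hour as a half-open [start, end) window.
END_SEC=$(( $(date +%s) / 3600 * 3600 ))   # top of the current hour
START_SEC=$(( END_SEC - 3600 ))            # one hour earlier
START_MS=$(( START_SEC * 1000 ))           # CopyTable expects milliseconds
END_MS=$(( END_SEC * 1000 ))
# Print the job we would run (remove `echo` to actually launch it):
echo hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --starttime="${START_MS}" --endtime="${END_MS}" \
  --peer.adr=dstClusterZK:2181:/hbase tableOrig
```

  Running this from cron at a few minutes past each hour keeps the windows contiguous and non-overlapping, since each run covers exactly the preceding hour.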
Partial HBase table copies and HBase table schema changes
  By default, CopyTable copies all column families from matching rows. CopyTable also provides options for copying data from only specific column families. This can be useful for copying original source data while excluding derived-data column families that are added by follow-on processing.
  By adding these arguments we only copy data from the specified column families.

  •   --families=srcCf1
  •   --families=srcCf1,srcCf2
  Starting from 0.92.0 you can copy while changing the column family name:

  •   --families=srcCf1:dstCf1

    •   copy from srcCf1 to dstCf1

  •   --families=srcCf1:dstCf1,dstCf2,srcCf3:dstCf3

    •   copy from srcCf1 to dstCf1, copy dstCf2 to dstCf2 (no rename), and srcCf3 to dstCf3

  Please note that dstCf* must be present in the dstCluster table!
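  As a concrete (hypothetical) illustration of combining a rename with an as-is copy, the sketch below builds such an invocation using the placeholder names from above and prints it rather than running it:

```shell
# Copy srcCf1 into dstCf1 (rename) and srcCf3 as-is; all other families are skipped.
# Both dstCf1 and srcCf3 must already exist on the destination table.
CMD="hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --families=srcCf1:dstCf1,srcCf3 \
  --peer.adr=dstClusterZK:2181:/hbase tableOrig"
echo "$CMD"
```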

  Starting from 0.94.0, new options are offered to copy delete markers and to include a limited number of overwritten versions. Previously, if a row was deleted in the source cluster, the delete would not be copied; instead, a stale version of that row would remain in the destination cluster.

  •   --versions=vers

    •   where vers is the number of cell versions to copy (the default is 1, i.e., the latest version only)

  •   --all.cells

    •   also copy delete markers and deleted cells
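  For example, a backup that also carries over delete markers and up to three versions of each cell could look like the following sketch (again using the placeholder cluster and table names from above, and requiring 0.94.0+ support for these flags):

```shell
# Build a CopyTable invocation that keeps deletes and up to 3 cell versions.
CMD="hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --versions=3 --all.cells \
  --peer.adr=dstClusterZK:2181:/hbase tableOrig"
echo "$CMD"
```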

Common Pitfalls

  The HBase client in the 0.90.x, 0.92.x, and 0.94.x versions always uses zoo.cfg if it is on the classpath, even if an hbase-site.xml file specifies other ZooKeeper settings. A stray zoo.cfg can therefore make CopyTable silently connect to the wrong ZooKeeper ensemble, so if the job cannot reach the destination cluster, check the client classpath first.
Conclusion
  CopyTable provides simple but effective disaster-recovery insurance for HBase 0.90.x (CDH3) deployments. In conjunction with the replication feature supported in CDH4’s 0.92.x-based HBase, CopyTable’s incremental features become less valuable, but its core functionality is important for bootstrapping a replicated table. While more advanced features such as HBase snapshots (HBASE-50) may aid with disaster recovery once implemented, CopyTable will still be a useful tool for the HBase administrator.

