Data Placement
In distributed storage, data placement solves the problem of distributing data sensibly across multiple nodes. It needs to:
- adapt to different cluster sizes and I/O loads;
- support multiple distribution strategies: physical-location preferences (local node first, spreading across racks, multiple AZs), media-type restrictions (SSD/HDD), storage-pool restrictions, and capacity/IO-capability limits (see the sketch after this list);
- handle different scenarios and I/O types: heterogeneous machine models, newly added machines, shrinking after failures; foreground business I/O, data replication, data rebalancing, and redistribution.
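To make the constraints above concrete, here is a minimal, purely illustrative sketch of a placement-constraint descriptor; the type and field names are hypothetical and not taken from any particular system:

from dataclasses import dataclass
from enum import Enum
from typing import Optional

class MediaType(Enum):
    SSD = "ssd"
    HDD = "hdd"

@dataclass
class PlacementConstraints:
    # Hypothetical descriptor of the requirements listed above.
    prefer_local: bool = True                # prefer placing one replica on the writer's node
    spread_across_racks: bool = True         # avoid putting all replicas in a single rack
    min_availability_zones: int = 1          # spread replicas over at least this many AZs
    media_type: Optional[MediaType] = None   # restrict targets to SSD or HDD; None = any
    storage_pool: Optional[str] = None       # restrict targets to a named storage pool
    min_free_bytes: int = 0                  # minimum free capacity required on a target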
1 Technical Points
1.1 Probability distribution functions
2 HDFS Implementation
2.1 Translation
2.1.1 Introduction
By default, HDFS uses BlockPlacementPolicyDefault. Under this policy, one replica is placed on the local node and the other two replicas on two different nodes of the same remote rack. In addition, HDFS supports several pluggable block placement policies, and users can choose a policy based on their infrastructure and use case. This document describes these policies in detail, along with their use cases and configuration.
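As a rough illustration of the default layout just described (one replica on the writer's node, two on one remote rack), here is a minimal sketch; the function name and the flat rack map are hypothetical, not HDFS APIs, and the writer is assumed to run on a datanode:

import random

def choose_default_targets(writer_node, nodes_by_rack):
    # Sketch of the default 3-replica layout: replica 1 on the writer's node,
    # replicas 2 and 3 on two different nodes of one randomly chosen remote rack.
    writer_rack = next(r for r, nodes in nodes_by_rack.items() if writer_node in nodes)
    remote_rack = random.choice([r for r in nodes_by_rack if r != writer_rack])
    second, third = random.sample(nodes_by_rack[remote_rack], 2)
    return [writer_node, second, third]

nodes_by_rack = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
    "rack3": ["dn5", "dn6"],
}
print(choose_default_targets("dn1", nodes_by_rack))  # e.g. ['dn1', 'dn4', 'dn3']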
2.1.2 BlockPlacementPolicyRackFaultTolerant
BlockPlacementPolicyRackFaultTolerant is used when blocks need to be spread across multiple racks. With 3 replicas, BlockPlacementPolicyDefault behaves as follows: if the writer runs on a datanode, one replica is placed on that machine, otherwise on a random datanode in the writer's rack; the second replica goes to a datanode in a different (remote) rack, and the last replica to a different datanode in that same remote rack. Only 2 racks are used in total, so if those 2 racks go down at the same time the data becomes unavailable. BlockPlacementPolicyRackFaultTolerant instead places the 3 replicas on 3 different racks. For more details see HDFS-7891.
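By contrast, a rack-fault-tolerant layout uses one distinct rack per replica. A minimal sketch of that idea, again with hypothetical names rather than the actual HDFS implementation:

import random

nodes_by_rack = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
    "rack3": ["dn5", "dn6"],
}

def choose_rack_fault_tolerant_targets(nodes_by_rack, replicas=3):
    # One replica per rack, on as many distinct racks as there are replicas,
    # so any two racks can fail and at least one replica survives.
    racks = random.sample(list(nodes_by_rack), replicas)
    return [random.choice(nodes_by_rack[rack]) for rack in racks]

print(choose_rack_fault_tolerant_targets(nodes_by_rack))  # e.g. ['dn2', 'dn3', 'dn6']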
Configuration file hdfs-site.xml:
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant</value>
</property>
2.1.3 BlockPlacementPolicyWithNodeGroup
The new 3-layer topology introduces the concept of a node group, which maps well onto infrastructure based on a virtualized environment. In a virtualized environment, multiple VMs run on the same physical machine, and VMs on the same physical host are affected by the same hardware failures. By mapping each physical host to a node group, this placement policy guarantees that no more than one replica is ever placed in the same node group, so when a node group fails, at most one replica is lost.
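The invariant can be stated as a simple check: a candidate datanode is rejected if a replica has already been placed in its node group. A minimal sketch under that assumption (the names are hypothetical, not HDFS code):

def violates_node_group_rule(chosen_nodes, candidate, node_group_of):
    # Reject a candidate if some already-chosen replica sits in the same
    # node group (i.e. on the same physical host).
    used_groups = {node_group_of[n] for n in chosen_nodes}
    return node_group_of[candidate] in used_groups

node_group_of = {
    "vm1": "/rack1/nodegroup1", "vm2": "/rack1/nodegroup1",
    "vm3": "/rack1/nodegroup2", "vm4": "/rack2/nodegroup3",
}
print(violates_node_group_rule(["vm1"], "vm2", node_group_of))  # True: same physical host
print(violates_node_group_rule(["vm1"], "vm3", node_group_of))  # False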
Configurations:
- core-site.xml
<property>
  <name>net.topology.impl</name>
  <value>org.apache.hadoop.net.NetworkTopologyWithNodeGroup</value>
</property>
<property>
  <name>net.topology.nodegroup.aware</name>
  <value>true</value>
</property>
- hdfs-site.xml:
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithNodeGroup</value>
</property>
- Topology script
The topology script is the same as in the examples above; the only difference is that the script returns /{rack}/{nodegroup} instead of just /{rack}. Following is an example topology mapping table (a minimal script sketch follows the table):
192.168.0.1 /rack1/nodegroup1
192.168.0.2 /rack1/nodegroup1
192.168.0.3 /rack1/nodegroup2
192.168.0.4 /rack1/nodegroup2
192.168.0.5 /rack2/nodegroup3
192.168.0.6 /rack2/nodegroup3
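A minimal sketch of such a topology script in Python, using the mapping table above. It assumes the usual topology-script contract (the script configured via net.topology.script.file.name is invoked with one or more host names or IPs as arguments and prints one topology path per argument); the fallback path for unknown hosts is an assumption here:

#!/usr/bin/env python3
# Minimal topology-script sketch that returns /{rack}/{nodegroup} per host.
import sys

MAPPING = {
    "192.168.0.1": "/rack1/nodegroup1",
    "192.168.0.2": "/rack1/nodegroup1",
    "192.168.0.3": "/rack1/nodegroup2",
    "192.168.0.4": "/rack1/nodegroup2",
    "192.168.0.5": "/rack2/nodegroup3",
    "192.168.0.6": "/rack2/nodegroup3",
}
DEFAULT_PATH = "/default-rack/default-nodegroup"  # assumed fallback, adjust as needed

for host in sys.argv[1:]:
    print(MAPPING.get(host, DEFAULT_PATH))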
For more details see HDFS-8468.
2.1.4 AvailableSpaceBlockPlacementPolicy
A space-balanced placement policy. It is similar to BlockPlacementPolicyDefault, but for new blocks, datanodes with a lower space-usage percentage have a slightly higher probability of being chosen (a small sketch of this weighting follows the reference below).
For more details see HDFS-8131.
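The weighting can be pictured as a biased random draw: every datanode remains selectable, but nodes with more free space get a slightly larger share of the probability mass. A minimal sketch of that idea (the 0.5 base weight and the free-space ratios are made-up numbers, not the actual HDFS weighting):

import random

def pick_datanode(free_ratio):
    # Space-biased selection: all nodes stay selectable, but nodes with more
    # free space get a somewhat higher probability of being chosen.
    nodes = list(free_ratio)
    weights = [0.5 + free_ratio[n] for n in nodes]  # base weight keeps full nodes in play
    return random.choices(nodes, weights=weights, k=1)[0]

free_ratio = {"dn1": 0.10, "dn2": 0.50, "dn3": 0.90}  # fraction of capacity still free
picks = [pick_datanode(free_ratio) for _ in range(10_000)]
print({n: picks.count(n) for n in free_ratio})  # dn3 chosen most often, dn1 least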
2.1.5 AvailableSpaceRackFaultTolerantBlockPlacementPolicy
A space-balanced placement policy similar to BlockPlacementPolicyRackFaultTolerant: blocks are distributed across as many racks as possible, while datanodes with a lower space-usage percentage have a slightly higher probability of being chosen.
For more details see HDFS-15288.
2.2 Original text
BlockPlacementPolicies
Introduction
By default HDFS uses BlockPlacementPolicyDefault, where one replica is placed on the local node and copies on 2 different nodes of the same remote rack. In addition to this, HDFS supports several different pluggable block placement policies. Users can choose the policy based on their infrastructure and use case. This document describes the detailed information about the types of policies with their use cases and configuration.
BlockPlacementPolicyRackFaultTolerant
BlockPlacementPolicyRackFaultTolerant can be used to split the placement of blocks across multiple racks. By default, with replication of 3, BlockPlacementPolicyDefault will put one replica on the local machine if the writer is on a datanode, otherwise on a random datanode in the same rack as that of the writer, another replica on a node in a different (remote) rack, and the last on a different node in the same remote rack. So in total 2 racks will be used, and in a scenario where those 2 racks go down at the same time the data becomes unavailable, whereas using BlockPlacementPolicyRackFaultTolerant will help in placing the 3 replicas on 3 different racks.
For more details check HDFS-7891.
Configurations: hdfs-site.xml
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant</value>
</property>
BlockPlacementPolicyWithNodeGroup
With the new 3-layer hierarchical topology, a node group level got introduced, which maps well onto an infrastructure that is based on a virtualized environment. In a virtualized environment multiple VMs will be hosted on the same physical machine. VMs on the same physical host are affected by the same hardware failure. So, by mapping the physical host to a node group, this block placement policy guarantees that it will never place more than one replica on the same node group (physical host); in case of node group failure, only one replica will be lost at the maximum.
Configurations:
- core-site.xml
<property>
  <name>net.topology.impl</name>
  <value>org.apache.hadoop.net.NetworkTopologyWithNodeGroup</value>
</property>
<property>
  <name>net.topology.nodegroup.aware</name>
  <value>true</value>
</property>
- hdfs-site.xml
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithNodeGroup</value>
</property>
- Topology script
The topology script is the same as in the examples above; the only difference is that instead of returning only /{rack}, the script should return /{rack}/{nodegroup}. Following is an example topology mapping table:
192.168.0.1 /rack1/nodegroup1
192.168.0.2 /rack1/nodegroup1
192.168.0.3 /rack1/nodegroup2
192.168.0.4 /rack1/nodegroup2
192.168.0.5 /rack2/nodegroup3
192.168.0.6 /rack2/nodegroup3
For more details check HDFS-8468.
BlockPlacementPolicyWithUpgradeDomain
To address the limitation of block placement policy on rolling upgrade, the concept of upgrade domain has been added to HDFS via a new block placement policy. The idea is to group datanodes in a new dimension called upgrade domain, in addition to the existing rack-based grouping. For example, we can assign all datanodes in the first position of any rack to upgrade domain ud_01, nodes in the second position to upgrade domain ud_02 and so on. It will make sure replicas of any given block are distributed across machines from different upgrade domains. By default, 3 replicas of any given block are placed on 3 different upgrade domains. This means all datanodes belonging to a specific upgrade domain collectively won't store more than one replica of any block.
For more details check HDFS-9006.
For detailed configuration information, see the Upgrade Domain Policy documentation.
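To illustrate the grouping described above, here is a minimal sketch that assigns the i-th datanode of every rack to upgrade domain ud_0(i+1) and checks that a replica set spans distinct domains; the helper names and the node layout are hypothetical:

def assign_upgrade_domains(nodes_by_rack):
    # The i-th node of every rack goes to upgrade domain ud_0{i+1}.
    return {node: f"ud_{i + 1:02d}"
            for nodes in nodes_by_rack.values()
            for i, node in enumerate(nodes)}

nodes_by_rack = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5", "dn6"]}
domains = assign_upgrade_domains(nodes_by_rack)
print(domains)  # dn1/dn4 -> ud_01, dn2/dn5 -> ud_02, dn3/dn6 -> ud_03

# With replicas spread over 3 different upgrade domains, all nodes of one domain
# (e.g. the first node of every rack) can be taken down for upgrade together while
# at most one replica of any block is offline.
replicas = ["dn1", "dn5", "dn6"]
print(len({domains[r] for r in replicas}))  # 3 distinct upgrade domains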
AvailableSpaceBlockPlacementPolicy
The AvailableSpaceBlockPlacementPolicy is a space-balanced block placement policy. It is similar to BlockPlacementPolicyDefault but will choose datanodes with a low used percentage for new blocks with a slightly higher probability.
AvailableSpaceRackFaultTolerantBlockPlacementPolicy
The AvailableSpaceRackFaultTolerantBlockPlacementPolicy is a space-balanced block placement policy similar to AvailableSpaceBlockPlacementPolicy. It extends BlockPlacementPolicyRackFaultTolerant and distributes the blocks amongst the maximum number of racks possible, and at the same time will try to choose datanodes with a low used percentage with high probability.