
Distributed Metadata Management Scheme in Cloud Computing


Bing Li, Yutao He, Ke Xu

PCN&CAD Center, Beijing University of Posts and Telecommunications, China
Vincent.lisiyi@gmail.com, heyutao1987@gmail.com, xuke@bupt.edu.cn

Abstract
Metadata management is critical to distributed file systems. Many large-scale storage systems such as HDFS decouple metadata management from file data access: a single master server manages all metadata, while a number of data servers store file data. This architecture cannot meet the exponentially increasing storage demand of cloud computing, as the single master server may become a performance bottleneck. This paper presents a metadata management scheme based on HDFS for cloud computing. Our scheme employs multiple Namenodes and divides the metadata into "buckets" that can be dynamically migrated among Namenodes according to system workloads. To maintain reliability, metadata is replicated on different Namenodes with log replication technology, and the Paxos algorithm is adopted to keep the replicas consistent. Besides, caching and prefetching technology is integrated to improve system performance.

Keywords: Cloud Storage, HDFS, Metadata Management, Replication, Paxos

1. Introduction

With the rapid development of the Internet, the amount of data is growing exponentially, and large-scale data storage and processing has become a problem. Cloud computing is one of the most popular solutions to meet this demand: it decreases the cost of hardware resources and increases equipment utilization. Users can access all kinds of Internet services and virtual computing power through lightweight portable devices rather than traditional PCs. Cloud storage is a key issue for cloud computing, and metadata management plays a critical role in cloud storage. Metadata is the data that describes the structure of a file system, such as the hierarchical namespace, file and directory attributes, and the mapping of files to data chunks. Although the size of metadata is relatively small in a file system, metadata operations can take up over 50 percent of all file system operations [1], so metadata management is critically important for file system performance.

There have been many distributed storage systems, and most decouple metadata management from data access, such as GFS [2] and HDFS [3]. The Hadoop Distributed File System (HDFS) is an open source distributed file system with a master/slave architecture. A file is split into one or more blocks, and these blocks are stored in a set of DataNodes. A single Namenode manages the file system namespace, determines the mapping of files to blocks, and regulates access to files. In HDFS, all metadata is kept in the memory of the single Namenode, which may become a performance bottleneck as the amount of metadata increases. We therefore extended HDFS from a single-Namenode architecture to multiple Namenodes, and propose a metadata management scheme that fulfills the following goals:

(1) High performance: We distribute metadata among multiple Namenodes and provide an efficient route for metadata requests to the appropriate Namenode. A caching and prefetching strategy is also adopted to improve system performance.

(2) Well-balanced load: We adopt a variant of consistent hashing to assign metadata among multiple Namenodes according to each Namenode's processing ability, and redistribute metadata when system workloads change.

(3) Reliability: To guarantee system fault tolerance, we propose a group-based log replication technology and integrate the Paxos algorithm to maintain replication consistency under failures.

(4) Scalability: A Namenode can be added to or removed from the cluster without restarting the system, and metadata can be redistributed to keep the load balanced.

The rest of the paper is organized as follows: We discuss related work in Section 2. In Section 3, we present the architecture of our metadata management system and discuss detailed design and optimization issues. Performance tests based on a prototype implementation are given in Section 4. Finally, we conclude the paper in Section 5.


2. Related Work

Currently, sub tree partitioning [4][5] and hashing [6][7][8] are two common techniques to distribute metadata among multiple servers. In sub tree partitioning, the namespace is divided into many directory sub trees, each of which is managed by an individual metadata server. This strategy provides good locality because metadata in the same sub tree is assigned to the same metadata server, but metadata may not be evenly distributed, and the computing and transferring of metadata may generate high time and network overhead. The hashing technique applies a hash function to the path name to determine the metadata location. In this scheme, metadata can be distributed uniformly across the cluster, but the directory locality feature is lost, and if a path is renamed, some metadata has to migrate. Hierarchical Bloom Filter Arrays (HBA) [9] is a special hash strategy that uses Bloom filters to map metadata to metadata servers. We can check whether a piece of metadata is on a server by hashing its path into positions in the Bloom filter and checking the values at those positions. HBA is fast and space-efficient, but each server has to keep Bloom filters for all metadata servers in order to maintain all metadata hash results, so a large amount of memory is consumed.

Consistent hashing [10] is another hash method, used in Amazon Dynamo [11]. In basic consistent hashing, the output range of the hash function is treated as a ring. Not only the data but also each node is hashed to a value on the ring, and each node is responsible for the data in the range between it and its predecessor node. In consistent hashing, the addition or removal of a node only affects its neighbor nodes. An optimization of consistent hashing is the introduction of "virtual nodes": instead of mapping a physical node to a single point on the ring, each physical node is assigned to multiple positions, each of which is called a virtual node. With virtual nodes, data and workload are distributed over the nodes more uniformly.
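To make the ring construction concrete, the following is a minimal Java sketch of consistent hashing with virtual nodes in the style of [10][11]; the number of virtual nodes per server and the use of MD5 are illustrative assumptions, not details from the paper.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hashing ring with virtual nodes (illustrative sketch).
public class ConsistentHashRing {
    private static final int VNODES_PER_SERVER = 100; // assumed value
    private final SortedMap<Long, String> ring = new TreeMap<>();

    public void addServer(String server) {
        // Each physical server maps to many ring positions ("virtual nodes")
        // so data and load spread more uniformly across servers.
        for (int i = 0; i < VNODES_PER_SERVER; i++)
            ring.put(hash(server + "#" + i), server);
    }

    public void removeServer(String server) {
        for (int i = 0; i < VNODES_PER_SERVER; i++)
            ring.remove(hash(server + "#" + i));
    }

    public String serverFor(String key) {
        // Walk clockwise: the first virtual node at or after the key's
        // position owns the key; wrap around at the end of the ring.
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        Long pos = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(pos);
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0; // use the first 8 digest bytes as a 64-bit position
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```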
3. System Design

3.1. Architecture

The metadata in this paper consists of directory metadata and file metadata. Directory metadata includes the hierarchical namespace and directory attributes, and file metadata includes file attributes and the mapping from files to data blocks. Generally, a metadata entry is a tuple as shown in Figure 1.

Figure 1. Metadata format

GLOBAL_ID is the globally unique identifier that is invariable once the path is created. USER_ID is the identifier of the user that created the path. PARENT_GLOBAL_ID is the GLOBAL_ID of the parent directory of the path. OTHER_META saves other information, such as permission, last access time and update time. BLOCK_PTR is the pointer to the file's data blocks, and is used only for files. With this metadata format, the hierarchical namespace is turned into flat data structures that can be stored in the database and in Namenodes. Figure 2 shows the architecture of our metadata management system.

Figure 2. Metadata management system architecture

The system mainly consists of four components: Client, multiple Namenodes, Namenode Manager and Database. Client exposes interfaces to access metadata; to improve system performance, recently used metadata is cached in Client. Namenodes are responsible for managing metadata and handling metadata requests from Client, and the updated metadata in each Namenode is persisted to the database periodically. Namenode Manager provides a route for Client to reach the target Namenode; besides, it manages metadata distribution and load balancing among Namenodes by periodically receiving heartbeat messages from the Namenodes. In the following sections, we illustrate the detailed design of metadata management, covering metadata partitioning, metadata access, system reliability and scalability.
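As a reference point for the sketches that follow, here is a minimal Java rendering of the flat metadata tuple from Figure 1; the field types and the string representation of OTHER_META are assumptions.

```java
// Illustrative flat metadata tuple following the fields named in the paper
// (Figure 1). Exact types are assumptions; BLOCK_PTR applies only to files,
// so directories leave it null.
public class Metadata {
    long globalId;        // GLOBAL_ID: invariable once the path is created
    long userId;          // USER_ID: user that created the path
    long parentGlobalId;  // PARENT_GLOBAL_ID: GLOBAL_ID of parent directory
    String name;          // NAME: last component of the path
    String otherMeta;     // OTHER_META: permission, access/update time, ...
    long[] blockPtr;      // BLOCK_PTR: pointers to file data blocks
}
```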
3.2. Metadata partitioning

In our system, we adopt a variant of consistent hashing to partition metadata. We divide the consistent hashing ring into Q equally sized parts, each of which is called a "bucket". The bucket is the basic unit for data distribution and transferring: when a bucket is migrated, all metadata in that bucket is moved. Metadata is first partitioned into buckets, and then all buckets are evenly distributed across the Namenodes.

The mapping of a path's metadata to a bucket works like consistent hashing. We first hash the USER_ID and PARENT_GLOBAL_ID of the path to yield its position p on the consistent hashing ring, and walk clockwise to find the first bucket with a position larger than p. The USER_ID and PARENT_GLOBAL_ID are invariable, so no metadata migration happens when the path is renamed. Moreover, hashing on the PARENT_GLOBAL_ID allows all files and directories in a given directory to be mapped to the same bucket. This provides good directory locality, which supports operations such as ls, and it also facilitates prefetching: when Client requests a metadata entry, all the metadata in the same directory is returned, which greatly reduces network traffic and improves system performance.

As to the mapping of buckets to Namenodes, Namenode Manager assigns each Namenode a certain number of buckets based on its actual processing ability, such as CPU and memory capacity. Namenode Manager uses an in-memory Bucket Look-Up Table (BLT) to keep the mapping from buckets to Namenodes. Each entry of the BLT consists of a BUCKET_ID field and a NAMENODE field; by modifying NAMENODE, the metadata distribution can be changed. To prevent the performance bottleneck that Client's frequent access to the BLT in the single Namenode Manager would cause, the BLT is cached into Client's memory when Client initializes.
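A minimal sketch of the path-to-bucket mapping just described, assuming Q equal ring parts and an MD5-based hash (both illustrative choices):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch: hash USER_ID and PARENT_GLOBAL_ID onto the ring, which is cut
// into Q equally sized buckets. Q and the hash function are assumptions.
public class BucketMapper {
    private final int numBuckets; // Q

    public BucketMapper(int numBuckets) { this.numBuckets = numBuckets; }

    public int bucketFor(long userId, long parentGlobalId) {
        // All entries in one directory share USER_ID and PARENT_GLOBAL_ID,
        // so they land in the same bucket (directory locality, cheap ls),
        // and renames never move metadata because neither input changes.
        long position = hash(userId + ":" + parentGlobalId);
        // With Q equal parts, "walk clockwise to the next bucket boundary"
        // reduces to an index into [0, Q).
        return (int) Long.remainderUnsigned(position, numBuckets);
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```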
3.3. Metadata access

In HDFS, the namespace hierarchy is organized as a B-tree-like data structure that supports quick lookup operations with O(log n) complexity. In our system, we adopt a hash table instead of a B-tree to imitate the namespace hierarchy, as a hash table has O(1) complexity and better space utilization than a B-tree. Figure 3 shows the data structure used to manage metadata in each Namenode. To get the metadata in a specified bucket quickly, a Namenode organizes its managed buckets as an array, and each bucket has a hash table from PARENT_GLOBAL_ID to a metadata list, in which every metadata entry has the same PARENT_GLOBAL_ID.

Figure 3. Namenode data structure
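A minimal sketch of the per-Namenode structure in Figure 3, reusing the Metadata class sketched earlier; class and method names are assumptions:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One bucket as in Figure 3: a hash table from PARENT_GLOBAL_ID to the
// list of metadata entries sharing that parent. A Namenode would hold an
// array of these, indexed by bucket id.
public class Bucket {
    final int bucketId;
    // O(1) lookup replaces HDFS's B-tree-like namespace structure.
    final Map<Long, List<Metadata>> byParent = new HashMap<>();

    Bucket(int bucketId) { this.bucketId = bucketId; }

    // All entries of one directory, used to answer (and prefetch) lookups.
    List<Metadata> list(long parentGlobalId) {
        return byParent.getOrDefault(parentGlobalId, List.of());
    }

    void put(Metadata m) {
        byParent.computeIfAbsent(m.parentGlobalId, k -> new ArrayList<>())
                .add(m);
    }
}
```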

For a file metadata access to path /A/B/C/filename, Client first looks up its cached metadata. If not found, Client gets the user_id and the GLOBAL_ID of /A/B/C as parent_global_id, and computes the consistent hashing result. With the result, Client looks up the cached BLT to find the bucket_id and the corresponding Namenode, Namenode i. Then Client sends Namenode i a request of the form <bucket_id, user_id, parent_global_id, filename>. Namenode i first looks up its bucket array by bucket_id. If the bucket is not found, it means this bucket has migrated, and a BUCKET_NOT_FOUND message is returned. Otherwise, the bucket looks up its hash table with parent_global_id and traverses the found metadata list to locate the desired metadata whose NAME is filename. If successful, all metadata in this list is returned and cached in Client.
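The client side of this walkthrough might look like the following sketch, reusing the Metadata and BucketMapper sketches above; the cache, BLT map and RPC interface are assumed helpers, not the prototype's actual API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the lookup path for /A/B/C/filename described above.
public class MetadataClient {
    interface NamenodeRpc {
        // Returns all entries of the directory so the client can prefetch,
        // or null when the Namenode answers BUCKET_NOT_FOUND after migration.
        List<Metadata> lookup(int bucketId, long userId,
                              long parentGlobalId, String name);
    }

    private final Map<String, Metadata> cache = new HashMap<>(); // client cache
    private final Map<Integer, NamenodeRpc> blt;                 // cached BLT
    private final BucketMapper mapper = new BucketMapper(1024);  // assumed Q

    MetadataClient(Map<Integer, NamenodeRpc> blt) { this.blt = blt; }

    Metadata lookupFile(long userId, long parentGlobalId, String name) {
        Metadata hit = cache.get(parentGlobalId + "/" + name);
        if (hit != null) return hit;                 // 1. local cache first

        int bucketId = mapper.bucketFor(userId, parentGlobalId);
        NamenodeRpc namenode = blt.get(bucketId);    // 2. route via cached BLT
        List<Metadata> dir = namenode.lookup(bucketId, userId,
                                             parentGlobalId, name);
        if (dir == null) {
            // 3. bucket migrated: a real client would refresh its BLT from
            //    the Namenode Manager and retry; omitted in this sketch.
            return null;
        }
        Metadata result = null;                      // 4. prefetch directory
        for (Metadata m : dir) {
            cache.put(m.parentGlobalId + "/" + m.name, m);
            if (m.name.equals(name)) result = m;
        }
        return result;
    }
}
```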
3.4. Metadata persistence

In HDFS, the entire namespace is persisted in an "FsImage", which is stored as a file in the local file system. In our system, we use a database instead, as a database provides better data retrieval efficiency and transaction support. When the system starts up, each Namenode gets its managed bucket ranges from Namenode Manager and loads the metadata in these ranges from the database into memory to begin metadata service. During metadata service, all metadata operations are performed in memory rather than written to the database at once. Like HDFS, we use a log file in each Namenode to record all metadata operations. With each metadata write, a new log record is appended to the log file. A log record contains the GLOBAL_ID of the written metadata, the operation type, the operation arguments and a unique log number that defines the order of operations.


For each Namenode, only one log file is "active" for writing. If the log file reaches a size threshold, it is closed, and the Namenode switches to a newly created active log file to continue log writing without stopping metadata service to Client. The log records in the old log file are then "flushed" into the database by a log flush thread, as Figure 3 shows: each record is translated into the corresponding SQL and executed against the database. After all log records have been flushed successfully, the old log file can be safely deleted.
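A sketch of this flush path under stated assumptions: the JDBC URL, table layout and the single generic statement are illustrative, and a real flusher would map each operation type to its own INSERT/UPDATE/DELETE statement.

```java
import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

// Background flush of a closed log file into the database (sketch).
public class LogFlusher {
    static class LogRecord {
        long logNumber;  // unique number defining the order of operations
        long globalId;   // GLOBAL_ID of the written metadata
        String opType;   // operation type, e.g. CREATE or DELETE
        String args;     // serialized operation arguments
    }

    void flush(File closedLogFile, List<LogRecord> records) throws Exception {
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost/metadata")) {      // assumed URL
            db.setAutoCommit(false);
            try (PreparedStatement ps = db.prepareStatement(
                    "REPLACE INTO metadata (global_id, op_type, args) "
                    + "VALUES (?, ?, ?)")) {               // stand-in SQL
                for (LogRecord r : records) {              // replay in order
                    ps.setLong(1, r.globalId);
                    ps.setString(2, r.opType);
                    ps.setString(3, r.args);
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            db.commit();
        }
        // The old log file may be deleted only after every record has been
        // flushed successfully.
        if (!closedLogFile.delete())
            throw new IllegalStateException("cannot delete " + closedLogFile);
    }
}
```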
3.5. Replication

The metadata is distributed across the Namenodes' memory, so an availability mechanism is needed to keep metadata safe. In this paper, we present a solution called "log replication". As the metadata is periodically persisted into the database, we can reconstruct the newest metadata by applying the latest log records to the last-persisted metadata in the database, so we only replicate the latest log records rather than the in-memory metadata. As Figure 4 shows, each bucket is replicated on N Namenodes, which are called a "group"; N is 3 by default. The first Namenode is the primary, and the next N - 1 are called secondaries. The primary keeps both the metadata in memory and the corresponding log records on disk, while the secondaries keep only the log records on disk. Since the metadata lives only in the primary's memory, the primary is in charge of all metadata read and write requests. Log records consume only the secondaries' disk space, so less memory is consumed compared to in-memory metadata replication.
Figure 4. Log replication

To avoid log record consensus problems between the primary and the secondaries, we adopt the Paxos algorithm [12][13]. Basic Paxos consists of two phases. In the first phase, the leader node selects a proposal number and sends a prepare message to the other nodes, called acceptors. An acceptor replies with a promise message to the proposer if it has not received a prepare message with a higher proposal number. In the second phase, if the proposer receives promise messages from a majority of acceptors, it proposes a value and sends it with an accept message. If a majority of acceptors reply with an accepted message, the value is agreed on and consensus is reached. The proposer then sends a success message to all nodes to announce the consensus. In our system, Multi-Paxos is adopted: it executes phase one only once and repeatedly executes phase two in the following rounds, which reduces the number of message rounds.

A data structure called the commit queue is introduced in each Namenode to manage received log records. When a Namenode receives a log record, it is not committed to the log file immediately; instead, it is appended to the commit queue. Figure 5 shows the replication consensus protocol of our system in the steady state (phase two). When a write request arrives at the primary, the primary transforms the metadata operation into a log record, appends it to its commit queue, and sends the log record with an accept message to all secondaries in parallel. Each secondary receiving the accept message appends the log record to its commit queue and replies with an accepted message to the primary. If a majority of secondaries reply with an accepted message, the primary updates the metadata in memory and writes the log record to its log file. Once the log record has been replicated to the commit queues of all secondaries, the primary can respond to Client and send an asynchronous success message to all secondaries. When a secondary receives the success message, the log record in its commit queue can be committed to disk. As most of the time is spent on the accept and accepted message delays, and the commit queue guarantees replication consistency at the memory level, the replication process is quick.

Figure 5. Replication consensus protocol
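A primary-side sketch of one phase-two round as in Figure 5; the RPC interface, thread pool and return conventions are assumptions, and the proposal numbers and leader-change machinery of full Multi-Paxos are omitted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Steady-state (phase two) handling of a single write at the primary.
public class PrimaryReplicator {
    interface Secondary {
        boolean accept(long logNumber, byte[] logRecord); // replied "accepted"?
        void success(long logNumber);                     // async commit signal
    }

    private final List<Secondary> secondaries;
    private final Queue<byte[]> commitQueue = new ArrayDeque<>();
    private final ExecutorService pool = Executors.newCachedThreadPool();

    PrimaryReplicator(List<Secondary> secondaries) {
        this.secondaries = secondaries;
    }

    boolean replicate(long logNumber, byte[] logRecord) throws Exception {
        commitQueue.add(logRecord);                 // 1. append to local queue
        List<Future<Boolean>> replies = new ArrayList<>();
        for (Secondary s : secondaries)             // 2. accept in parallel
            replies.add(pool.submit(() -> s.accept(logNumber, logRecord)));
        int accepted = 0;
        for (Future<Boolean> f : replies)
            if (f.get()) accepted++;
        if (accepted <= secondaries.size() / 2)     // need a majority
            return false;
        // 3. Majority reached: the primary would now apply the operation to
        //    its in-memory metadata and write the record to its log file
        //    (omitted), then announce success asynchronously so that the
        //    secondaries commit the queued record to disk.
        for (Secondary s : secondaries)
            pool.submit(() -> s.success(logNumber));
        return true;
    }
}
```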


3.6. Failure detection and handling

In our system, failure detection is implemented with a heartbeat mechanism: Namenode Manager considers a Namenode failed if it does not send a heartbeat within a configured time. Failure handling includes metadata recovery and bucket redistribution. In the recovery phase, all metadata managed by the failed primary is recovered. First, the secondary with the largest log number is selected as the new leader, which is responsible for recovering the metadata from its log records. The new leader may have accepted some log records from the failed primary without receiving the corresponding success messages, so such records are still in its commit queue. Log records that are already in the log file, or that have received a success message in the commit queue, can be flushed to the database directly. For the records that have not received a success message, the leader sends them to all secondaries with an accept message, as in phase two of the Paxos algorithm; if a majority accepts a record, the record is applied to the database. After the metadata of the failed primary has been recovered, Namenode Manager redistributes the buckets managed by the failed primary to other Namenodes by modifying the BLT. Any client that has sent a request for metadata m to the failed primary without receiving a reply will contact the new primary of m according to the BLT; the new primary loads m from the database into memory and handles the request. In failure handling, most time is spent on electing the new leader and flushing the latest log records into the database. As the failed primary flushed log records into the database periodically, the amount of log records the new leader must flush is small, and failure handling is quick.
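The detection side might look like the following sketch; the timeout value and method names are assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Heartbeat-based failure detection at the Namenode Manager (sketch).
public class FailureDetector {
    private static final long TIMEOUT_MS = 30_000; // assumed; configurable
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Called whenever a heartbeat message arrives from a Namenode.
    void onHeartbeat(String namenodeId) {
        lastHeartbeat.put(namenodeId, System.currentTimeMillis());
    }

    // Called periodically; any Namenode silent for longer than the
    // configured interval is declared failed.
    void checkFailures() {
        long now = System.currentTimeMillis();
        lastHeartbeat.forEach((id, t) -> {
            if (now - t > TIMEOUT_MS) handleFailure(id);
        });
    }

    private void handleFailure(String namenodeId) {
        lastHeartbeat.remove(namenodeId);
        // 1. Elect the secondary with the largest log number as new leader.
        // 2. Flush acknowledged records to the database; re-run Paxos phase
        //    two for records that never received a success message.
        // 3. Redistribute the failed primary's buckets via the BLT.
        System.out.println("Namenode " + namenodeId + " declared failed");
    }
}
```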
3.7. Adding/Removing Namenodes

Another key design issue for our system is the ability to scale incrementally. When a Namenode is added to or removed from the cluster, the metadata should be dynamically redistributed to keep workloads balanced. When a new Namenode joins, Namenode Manager adds a new group that uses the new Namenode as primary and N - 1 random Namenodes as secondaries. Then, according to the new Namenode's processing ability, Namenode Manager computes the number of buckets that should be transferred to the new Namenode from the other Namenodes, and sends these bucket ids with a BUCKET-TRANSFER command as the heartbeat response. Once each Namenode receiving the command has successfully persisted these buckets into the database, the BLT is modified to route these buckets to the new Namenode. When the new Namenode receives a request, it loads the metadata from the database into memory and handles the request. When a Namenode leaves the system, the buckets managed by this Namenode are first persisted into the database. Then Namenode Manager modifies the BLT to distribute these buckets to the remaining Namenodes, each of which loads its share of buckets from the database into memory when handling requests for them. A sketch of the rebalancing decision follows.
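The paper does not spell out how the bucket count is computed, so the following is one plausible reading: give the newcomer a share of buckets proportional to its capacity, taken from the most loaded nodes. The capacity weights and the donor-selection policy are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Rebalancing decision when a Namenode joins (illustrative policy).
public class Rebalancer {
    List<Integer> bucketsForNewNode(Map<String, List<Integer>> owned,
                                    Map<String, Double> capacity,
                                    String newNode, int totalBuckets) {
        double capSum = capacity.values().stream()
                .mapToDouble(Double::doubleValue).sum();
        int target = (int) Math.round(
                totalBuckets * capacity.get(newNode) / capSum);
        List<Integer> moves = new ArrayList<>();
        while (moves.size() < target) {
            // Pick the donor currently owning the most buckets.
            Map.Entry<String, List<Integer>> donor = null;
            for (Map.Entry<String, List<Integer>> e : owned.entrySet()) {
                if (e.getKey().equals(newNode)) continue;
                if (donor == null
                        || e.getValue().size() > donor.getValue().size())
                    donor = e;
            }
            if (donor == null || donor.getValue().isEmpty()) break;
            // These bucket ids would be sent to the donor in a
            // BUCKET-TRANSFER command carried by the heartbeat response.
            moves.add(donor.getValue().remove(donor.getValue().size() - 1));
        }
        return moves;
    }
}
```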
4. Test and evaluation


We implemented a prototype of the proposed scheme in Java and tested its performance on a cluster of five nodes connected by a 100 Mbps Ethernet switch in a campus network. Each Namenode has a dual 2.0 GHz processor, 1 GB of memory and an 80 GB disk. In our experiment, each client writes about 10000 directories and files to the file system, and then reads 5000 metadata entries randomly. The client metadata cache size is set to 1000. Varying the number of clients, we test three different metadata management schemes:

(1) DB: All metadata is stored in MySQL, and Client accesses metadata directly from MySQL.

(2) Single-Namenode: All metadata is stored in a single Namenode, as in HDFS.

(3) Multi-Namenodes: The scheme proposed in this paper; all metadata is distributed among multiple Namenodes.

Figure 6 shows the average latency of metadata writes. The database has the worst performance, as it handles each request with frequent disk I/O. Single-Namenode and Multi-Namenodes perform similarly when the number of clients is small, partly because in the latter log records have to be replicated to the secondaries. But as client requests increase greatly, Multi-Namenodes clearly outperforms Single-Namenode because the metadata is distributed across different nodes.
Figure 6. Average latency of write operations


Figure 7 shows the average latency of metadata read. As shown, Multi-Namenodes has the best read performance because requests are assigned to multiple Namenodes. Besides, the prefetched and cached metadata in Client greatly improves read performance.
Figure 7. Average latency of read operations

5. Conclusion

In this paper, we present a metadata management system based on HDFS with multiple Namenodes for cloud storage. With consistent hashing and buckets, the metadata can be evenly distributed and dynamically rebalanced among Namenodes according to system workloads. The BLT provides an efficient metadata retrieval mechanism for Client to find the target Namenode. To guarantee metadata availability under cluster failure, log replication with the Paxos algorithm is adopted. In addition, system performance benefits from metadata caching and prefetching, and the test results show that our system performs well for metadata reads and writes. However, although the BLT cached in Client decreases the single Namenode Manager's workload, the Namenode Manager is still a bottleneck, as it is responsible for monitoring all Namenodes, and the system may fail if the Namenode Manager goes down. So a replication server for Namenode Manager is needed, like the secondary Namenode for the Namenode in HDFS. Alternatively, a decentralized architecture without a Namenode Manager is also a solution.

Acknowledgements

This work is supported by the National Key Project of Scientific and Technical Supporting Programs of China (Grant Nos. 2008BAH24B04, 2009BAH39B03); the National Natural Science Foundation of China (Grant No. 61072060); the Program for New Century Excellent Talents in University (No. NECET-08-0738); the Co-construction Program with Beijing Municipal Commission of Education; and the Engineering Research Center of Information Networks, Ministry of Education.

References

[1] D. Roselli, J. Lorch, and T. Anderson, "A Comparison of File System Workloads", Proceedings of the 2000 USENIX Annual Technical Conference, June 2000, pp. 41-54.
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System", In 19th ACM Symposium on Operating Systems Principles, Lake George, New York, 2003, pp. 29-43.
[3] http://hadoop.apache.org/core/.
[4] M. Satyanarayanan, J. J. Kistler, P. Kumar, et al., "Coda: A Highly Available File System for a Distributed Workstation Environment", IEEE Transactions on Computers, 1990, 39(4):447-459.
[5] S. A. Weil, S. A. Brandt, E. L. Miller, et al., "Ceph: A Scalable, High-Performance Distributed File System", In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, 2006, pp. 307-320.
[6] O. Rodeh and A. Teperman, "zFS - A Scalable Distributed File System Using Object Disks", In Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, 2003, pp. 207-218.
[7] D. Feng, J. Wang, F. Wang, et al., "DOIDFH: An Effective Distributed Metadata Management Scheme", The 5th International Conference on Computational Science and Applications, 2007, pp. 245-250.
[8] W. Li, W. Xue, J. Shu, et al., "Dynamic Hashing: Adaptive Metadata Management for Petabyte-scale File Systems", In 23rd IEEE/14th NASA Goddard Conference on Mass Storage Systems and Technologies, 2006, pp. 93-98.
[9] Y. Zhu, H. Jiang, J. Wang, and F. Xian, "HBA: Distributed Metadata Management for Large Cluster-Based Storage Systems", IEEE Transactions on Parallel and Distributed Systems, June 2008, 19(6):750-763.
[10] D. Karger, E. Lehman, T. Leighton, R. Panigrahy, M. Levine, and D. Lewin, "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web", In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, ACM Press, New York, 1997, pp. 654-663.
[11] G. DeCandia, D. Hastorun, M. Jampani, et al., "Dynamo: Amazon's Highly Available Key-Value Store", In Proceedings of the Twenty-First ACM SIGOPS Symposium on Operating Systems Principles, ACM, 2007, pp. 205-220.
[12] L. Lamport, "The Part-Time Parliament", ACM Transactions on Computer Systems (TOCS), 1998, 16(2):133-169.
[13] L. Lamport, "Paxos Made Simple", ACM SIGACT News, December 2001, 32(4):18-25.