Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality

7th USENIX Conference on File and Storage Technologies, USENIX Association

Mark Lillibridge†, Kave Eshghi†, Deepavali Bhagwat‡, Vinay Deolalikar†, Greg Trezise§, and Peter Camble§
†HP Labs  ‡UC Santa Cruz  §HP Storage Works Division

Abstract

We present sparse indexing, a technique that uses sampling and exploits the inherent locality within backup streams to solve, for large-scale backup (e.g., hundreds of terabytes), the chunk-lookup disk bottleneck problem that inline, chunk-based deduplication schemes face. The problem is that these schemes traditionally require a full chunk index, which indexes every chunk, in order to determine which chunks have already been stored; unfortunately, at scale it is impractical to keep such an index in RAM, and a disk-based index with one seek per incoming chunk is far too slow.

We perform stream deduplication by breaking up an incoming stream into relatively large segments and deduplicating each segment against only a few of the most similar previous segments. To identify similar segments, we use sampling and a sparse index. We choose a small portion of the chunks in the stream as samples; our sparse index maps these samples to the existing segments in which they occur. Thus, we avoid the need for a full chunk index. Since only the sampled chunks' hashes are kept in RAM and the sampling rate is low, we dramatically reduce the RAM-to-disk ratio for effective deduplication. At the same time, only a few seeks are required per segment, so the chunk-lookup disk bottleneck is avoided. Sparse indexing has recently been incorporated into a number of Hewlett-Packard backup products.

1 Introduction

Traditionally, magnetic tape has been used for data backup. With the explosion in disk capacity, it is now affordable to use disk for data backup. Disk, unlike tape, is random access and can significantly speed up backup and restore operations.
Accordingly, disk-to-disk backup (D2D) has become the preferred backup option for organizations [3].

Deduplication can increase the effective capacity of a D2D device by one or two orders of magnitude [4]. Deduplication can accomplish this because backup sets have massive redundancy, due to the facts that a large proportion of data does not change between backup sessions and that files are often shared between machines. Deduplication, which is practical only with random-access devices, removes this redundancy by storing duplicate data only once and has become an essential feature of disk-to-disk backup solutions.

We believe chunk-based deduplication is the deduplication method best suited to D2D: it deduplicates data both across backups and within backups and does not require any knowledge of the backup data format. With this method, data to be deduplicated is broken into variable-length chunks using content-based chunk boundaries [20], and incoming chunks are compared with the chunks in the store by hash comparison; only chunks that are not already there are stored. We are interested in inline deduplication, where data is deduplicated as it arrives rather than later in batch mode, because of its capacity, bandwidth, and simplicity advantages (see Section 2.2).

Unfortunately, inline, chunk-based deduplication when used at large scale faces what is known as the chunk-lookup disk bottleneck problem: traditionally, this method requires a full chunk index, which maps each chunk's hash to where that chunk is stored on disk, in order to determine which chunks have already been stored. However, at useful D2D scales (e.g., 10-100 TB), it is impractical to keep such a large index in RAM, and a disk-based index with one seek per incoming chunk is far too slow (see Section 2.3).

This problem has been addressed in the literature by Zhu et al. [28], who tackle it by using an in-memory Bloom filter and caching index fragments, where each fragment indexes a set of chunks found together in the input.
In this paper, we show a different way of solving this problem in the context of data stream deduplication (the D2D case). Our solution has the advantage that it uses significantly less RAM than Zhu et al.'s approach.

To solve the chunk-lookup disk bottleneck problem, we rely on chunk locality: the tendency for chunks in backup data streams to reoccur together. That is, if the last time we encountered chunk A, it was surrounded by chunks B, C, and D, then the next time we encounter A (even in a different backup) it is likely that we will also encounter B, C, or D nearby. This differs from traditional notions of locality because occurrences of A may be separated by very long intervals (e.g., terabytes). A derived property we take advantage of is that if two pieces of backup streams share any chunks, they are likely to share many chunks.

We perform stream deduplication by breaking up each input stream into segments, each of which contains thousands of chunks. For each segment, we choose a few of the most similar segments that have been stored previously. We deduplicate each segment against only its chosen few segments, thus avoiding the need for a full chunk index. Because of the high chunk locality of backup streams, this still provides highly effective deduplication.

To identify similar segments, we use sampling and a sparse index. We choose a small portion of the chunks as samples; our sparse index maps these samples' hashes to the already-stored segments in which they occur. By using an appropriately low sampling rate, we can ensure that the sparse index is small enough to fit easily into RAM while still obtaining excellent deduplication.
At the same time, only a few seeks are required per segment to load its chosen segments' information, avoiding any disk bottleneck and achieving good throughput.

Of course, since we deduplicate each segment against only a limited number of other segments, we occasionally store duplicate chunks. However, due to our lower RAM requirements, we can afford to use smaller chunks, which more than compensates for the loss of deduplication the occasional duplicate chunk causes. The approach described in this paper has recently been incorporated into a number of Hewlett-Packard backup products.

The rest of this paper is organized as follows: in the next section, we provide more background. In Section 3, we describe our approach to doing chunk-based deduplication. In Section 4, we report on various simulation experiments with real data, including a comparison with Zhu et al., and on the ongoing productization of this work. Finally, we describe related work in Section 5 and our conclusions in Section 6.

2 Background

2.1 D2D usage

There are two modes in which D2D is performed today: using a network-attached-storage (NAS) protocol and using a Virtual Tape Library (VTL) protocol.

In the NAS approach, the backup device is treated as a network-attached storage device, and files are copied to it using protocols such as NFS and CIFS. To achieve high throughput, typically large directory trees are coalesced, using a utility such as tar, and the resulting tar file is stored on the backup device. Note that tar can operate either in incremental or in full mode.

The VTL approach is for backward compatibility with existing backup agents. There is a large installed base of thousands of backup agents that send their data to tape libraries using a standard tape library protocol.
To make the job of migrating to disk-based backup easier, vendors provide Virtual Tape Libraries: backup storage devices that emulate the tape library protocol for I/O, but use disk-based storage internally.

In both NAS- and VTL-based D2D, the backup data is presented to the backup storage device as a stream. In the case of VTL, the stream is the virtual tape image, and in the case of NAS-based backup, the stream is the large tar file that is generated by the client. In both cases, the stream can be quite large: a single tape image can be 400 GB, for example.

2.2 Inline versus out-of-line deduplication

Inline deduplication refers to deduplication processes where the data is deduplicated as it arrives and before it hits disk, as opposed to out-of-line (also called post-process) deduplication, where the data is first accumulated in an on-disk holding area and then deduplicated later in batch mode. With out-of-line deduplication, the chunk-lookup disk bottleneck can be avoided by using batch processing algorithms, such as hash join [24], to find chunks with identical hashes.

However, out-of-line deduplication has several disadvantages compared to inline deduplication: (a) the need for an on-disk holding area large enough to hold an entire backup window's worth of raw data can substantially diminish storage capacity, (b) all the functionality that a D2D device provides (data restoration, data replication, compression, etc.) must be implemented and/or tested separately for the raw holding area as well as the deduplicated store, and (c) it is not possible to conserve network or disk bandwidth, because every chunk must be written to the holding area on disk.

2.3 The chunk-lookup disk bottleneck

The traditional way to implement inline, chunk-based deduplication is to use a full chunk index: a key-value index of all the stored chunks, where the key is a chunk's hash and the value holds metadata about that chunk, including where it is stored on disk [22, 14].
When an incoming chunk is to be stored, its hash is looked up in the full index, and the chunk is stored only if no entry is found for its hash. We refer to this approach as the full index approach.

Using a small chunk size is crucial for high-quality chunk-based deduplication because most duplicate data regions are not particularly large. For example, for our data set Workgroup (see Section 4.2), switching from 4 to 8 KB average-size chunks reduces the deduplication factor (original size/deduplicated size) from 13 to 11; switching to 16 KB chunks further reduces it to 9.

This need for a small chunk size means that the full chunk index consumes a great deal of space for large stores. Consider, for example, a store that contains 10 TB of unique data and uses 4 KB chunks. Then there are 2.7 x 10^9 unique chunks. Assuming that every hash entry in the index consumes 40 bytes, we need 100 GB of storage for the full index.

It is not cost effective to keep all of this index in RAM. However, if we keep the index on disk, then, due to the lack of short-term locality in the stream of incoming chunk hashes, we will need one disk seek per chunk-hash lookup. If a seek on average takes 4 ms, this means we can look up only 250 chunks per second, for a processing rate of 1 MB/s, which is not acceptable. This is the chunk-lookup disk bottleneck that needs to be avoided.

3 Our Approach

Under the sparse indexing approach, segments are the unit of storage and retrieval. A segment is a sequence of chunks. Data streams are broken into segments in a two-step process: first, the data stream is broken into a sequence of variable-length chunks using a chunking algorithm, and, second, the resulting chunk sequence is broken into a sequence of segments using a segmenting algorithm. Segments are usually on the order of a few megabytes.
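The back-of-the-envelope costs in Section 2.3 can be reproduced in a few lines of Python. Binary units (1 TB = 2^40 bytes) match the paper's figures; the 40-byte entry size and 4 ms seek time are the paper's stated assumptions:

```python
# Back-of-the-envelope costs of a full chunk index, using the paper's
# assumptions: 10 TB of unique data, 4 KB mean chunks, 40-byte index
# entries, and 4 ms average disk seek time (binary units throughout).
unique_data = 10 * 2**40               # 10 TB
chunk_size = 4 * 2**10                 # 4 KB average chunk
entry_size = 40                        # bytes per full-index entry
seek_time = 0.004                      # 4 ms per random seek

chunks = unique_data // chunk_size     # about 2.7e9 unique chunks
index_size = chunks * entry_size       # about 100 GB: too big for RAM
lookups_per_sec = 1 / seek_time        # 250 chunk lookups per second
throughput = lookups_per_sec * chunk_size   # about 1 MB/s: far too slow

print(f"{chunks / 1e9:.1f}e9 chunks, {index_size / 2**30:.0f} GB index, "
      f"{throughput / 1e6:.2f} MB/s")
```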
We say that two segments are similar if they share a number of chunks.

Segments are represented in the store using their manifests: a manifest or segment recipe [25] is a data structure that allows reconstructing a segment from its chunks, which are stored separately in one or more chunk containers to allow for sharing of chunks between segments. A segment's manifest records its sequence of chunks, giving for each chunk its hash, where it is stored on disk, and possibly its length. Every stored segment has a manifest that is stored on disk.

Incoming segments are deduplicated against similar, existing segments in the store. Deduplication proceeds in two steps: first, we identify among all the segments in the store some that are most similar to the incoming segment, which we call champions, and, second, we deduplicate against those segments by finding the chunks they share with the incoming segment, which do not need to be stored again.

[Figure 1: Block diagram of the deduplication process]

To identify similar segments, we sample the chunk hashes of the incoming segment and use an in-RAM index to determine which already-stored segments contain how many of those hashes. A simple and fast way to sample is to choose as a sample every hash whose first n bits are zero; this results in an average sampling rate of 1/2^n; that is, on average, one in 2^n hashes is chosen as a sample. We call the chosen hashes hooks.

The in-memory index, called the sparse index, maps hooks to the manifests in which they occur. The manifests themselves are kept on disk; the sparse index holds only pointers to them. Once we have chosen champions, we can load their manifests into RAM and use them to deduplicate the incoming segment.
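Hook sampling and the sparse index can be sketched as follows. This is our illustration, not the paper's code: hashes are modeled as integers, the low-order-bits test stands in for "first n bits are zero" (same 1/2^n rate for uniform hashes), and the `MAX_IDS_PER_HOOK` cap value is an arbitrary choice (the cap itself is described in Section 3.3):

```python
from collections import defaultdict

SAMPLE_BITS = 7          # n; sampling rate is 1 / 2**7 = 1/128
MAX_IDS_PER_HOOK = 16    # illustrative cap on manifest IDs per hook

def is_hook(chunk_hash: int) -> bool:
    # The paper samples hashes whose first n bits are zero; testing
    # the low-order n bits instead gives the same 1/2**n rate for
    # uniformly distributed hashes.
    return chunk_hash & ((1 << SAMPLE_BITS) - 1) == 0

# The sparse index: hook -> IDs of manifests containing it, oldest first.
sparse_index = defaultdict(list)

def index_manifest(manifest_id: int, chunk_hashes: list) -> None:
    """Add a newly stored manifest's hooks to the sparse index."""
    for h in chunk_hashes:
        if is_hook(h):
            ids = sparse_index[h]
            ids.append(manifest_id)
            if len(ids) > MAX_IDS_PER_HOOK:
                ids.pop(0)            # evict the oldest manifest

def candidate_manifests(chunk_hashes: list) -> dict:
    """Map each candidate manifest ID to how many of the incoming
    segment's hooks it contains."""
    votes = defaultdict(int)
    for h in set(chunk_hashes):       # count each distinct hook once
        if is_hook(h):
            for mid in sparse_index.get(h, []):
                votes[mid] += 1
    return dict(votes)
```

The vote counts returned by `candidate_manifests` are exactly the raw similarity scores that champion choosing (Section 3.2) starts from.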
Note that although we choose champions because they share hooks with the incoming segment (and thus the chunks with those hashes), as a consequence of chunk locality they are likely to share many other chunks with the incoming segment as well.

We will now describe the deduplication process in more detail. A block diagram of the process can be found in Figure 1.

3.1 Chunking and segmenting

Content-based chunking has been studied at length in the literature [1, 16, 20]. We use our Two-Threshold Two-Divisor (TTTD) chunking algorithm [13] to subdivide the incoming data stream into chunks. TTTD produces variable-sized chunks with smaller size variation than other chunking algorithms, leading to superior deduplication.

We consider two different segmentation algorithms in this paper, each of which takes a target segment size as a parameter. The first algorithm, fixed-size segmentation, chops the stream of incoming chunks just before the first chunk whose inclusion would make the current segment longer than the goal segment length. "Fixed-size" segments thus actually have a small amount of size variation because we round down to the nearest chunk boundary. We believe that it is important to make segment boundaries coincide with chunk boundaries to avoid split chunks, which have no chance of being deduplicated.

Because we perform deduplication by finding segments similar to an incoming segment and deduplicating against them, it is important that the similarity between an incoming segment and the most similar existing segments in the store be as high as possible. Fixed-size segmentation does not perform as well here as we would like because of the boundary-shifting problem [13]: consider, for example, two data streams that are identical except that the first stream has an extra half-a-segment's worth of data at the front.
With fixed-size segmentation, segments in the second stream will only have 50% overlap with the segments in the first stream, even though the two streams are identical except for some data at the start of the first stream.

To avoid the segment boundary-shifting problem, our second segmentation algorithm, variable-size segmentation, uses the same trick used at the chunking level to avoid the boundary-shifting problem: we base the boundaries on landmarks in the content, not distance. Variable-size segmentation operates at the level of chunks (really chunk hashes) rather than bytes and places segment boundaries only at existing chunk boundaries. The start of a chunk is considered to represent a landmark if that chunk's hash modulo a predetermined divisor is equal to -1. The frequency of landmarks, and hence the average segment size, can be controlled by varying the size of the divisor.

To reduce segment-size variation, variable-size segmentation uses TTTD applied to chunks instead of data bytes. The algorithm is the same, except that we move one chunk at a time instead of one byte at a time, and that we use the above notion of what a landmark is. Note that this ignores the lengths of the chunks, treating long and short chunks the same. We obtain the needed TTTD parameters (minimum size, maximum size, primary divisor, and secondary divisor) in the usual way from the desired average size. Thus, for example, with variable-size segmentation, mean-size 10 MB segments using 4 KB chunks have from 1,160 to 7,062 chunks, with an average of 2,560 chunks, each of which, on average, contains 4 KB of data.

3.2 Choosing champions

Looking up the hooks of an incoming segment S in the sparse index results in a possible set of manifests against which that segment can be deduplicated. However, we do not necessarily want to use all of those manifests to deduplicate against, since loading manifests from disk is costly. In fact, as we show in Section 4.3, only a few well-chosen manifests suffice.
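The landmark idea behind variable-size segmentation (Section 3.1) can be sketched over a sequence of chunk hashes. This is a simplification of ours: it keeps only the landmark test (hash mod divisor equal to divisor - 1, i.e., congruent to -1) and omits TTTD's minimum/maximum thresholds and secondary divisor:

```python
def segment_by_landmarks(chunk_hashes: list, divisor: int) -> list:
    """Split a chunk-hash sequence into segments at landmark chunks.

    A chunk is a landmark when its hash mod divisor == divisor - 1
    (the paper's "equal to -1"), so the average segment length is
    roughly `divisor` chunks. TTTD's min/max segment-size thresholds
    and secondary divisor are omitted for brevity.
    """
    segments, current = [], []
    for h in chunk_hashes:
        if h % divisor == divisor - 1 and current:
            segments.append(current)   # a landmark starts a new segment
            current = []
        current.append(h)
    if current:
        segments.append(current)
    return segments
```

Because boundaries depend only on the content-derived hashes, inserting data before a landmark shifts only the segment containing the insertion; all later segments are unchanged, which is exactly how landmarking defeats the boundary-shifting problem.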
So, from among all the manifests produced by querying the sparse index, we choose a few to deduplicate against. We call the chosen manifests champions.

The algorithm by which we choose champions is as follows: we choose champions one at a time until the maximum allowable number of champions is found, or we run out of candidate manifests. Each time we choose, we choose the manifest with the highest non-zero score, where a manifest gets one point for each hook it has in common with S that is not already present in any previously chosen champion. If there is a tie, we choose the manifest most recently stored. The choice of which manifests to choose as champions is based solely on the hooks in the sparse index; that is, it does not involve any disk accesses.

We don't give points for hooks belonging to already-chosen manifests because those chunks (and the chunks around them, by chunk locality) are most likely already covered by the previous champions. Consider the following highly simplified example showing the hooks of S and three candidate manifests (m1-m3), with hooks in common with S marked by an asterisk:

    S:   b  c  d  e  m  n
    m1:  a  b* c* d* e* f
    m2:  z  a  b* c* d* f
    m3:  m* n* o  p  q  r

The manifests are shown in descending order of how many hooks they have in common with S. Our algorithm chooses m1 then m3, which together cover all the hooks of S, unlike m1 and m2.

3.3 Deduplicating against the champions

Once we have determined the champions for the incoming segment, we load their manifests from disk. A small cache of recently loaded manifests can speed this process up somewhat, because adjacent segments sometimes have champions in common.

The hashes of the chunks in the incoming segment are then compared with the hashes in the champions' manifests in order to find duplicate chunks. We use the SHA1 hash algorithm [15] to make false positives here extremely unlikely.
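The greedy selection of Section 3.2 can be sketched as below. The representation is our choice, not the paper's: hooks are integers, and higher manifest IDs are assumed to be more recently stored (for tie-breaking):

```python
def choose_champions(segment_hooks: set, candidates: dict,
                     max_champions: int) -> list:
    """Greedily pick up to max_champions manifest IDs.

    candidates maps each manifest ID (found via the sparse index) to
    that manifest's set of hooks; higher IDs are assumed to be more
    recently stored, which is how ties are broken.
    """
    champions = []
    covered = set()                   # hooks covered by chosen champions
    while len(champions) < max_champions:
        best_id, best_score = None, 0
        for mid, hooks in candidates.items():
            if mid in champions:
                continue
            # one point per hook shared with the segment that no
            # previously chosen champion already has
            score = len(segment_hooks & (hooks - covered))
            if score > best_score or \
               (score == best_score > 0 and mid > best_id):
                best_id, best_score = mid, score
        if best_id is None:           # only zero-score manifests remain
            break
        champions.append(best_id)
        covered |= candidates[best_id]
    return champions
```

On the Section 3.2 example (encoding hooks a-r as 0-17 and z as 25), this picks m1 first and then m3, skipping m2 because its shared hooks are already covered by m1.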
Those chunks that are found not to be present in any of the champions are stored on disk in chunk containers, and a new manifest is created for the incoming segment. The new manifest contains the location on disk where each incoming chunk is stored. In the case of chunks that are duplicates of a chunk in one or more of the champions, the location is that of the existing chunk, which is obtained from the relevant manifest. In the case of new chunks, the on-disk location is where that chunk has just been stored. Once the new manifest is created, it is stored on disk in the manifest store.

Finally, we add entries for this manifest to the sparse index with the manifest's hooks as keys. Some of the hooks may already exist in the sparse index, in which case we add the manifest to the list of manifests that are pointed to by that hook. To conserve space, it may be desirable to set a maximum limit on the number of manifests that can be pointed to by any one hook. If the maximum is reached, the oldest manifest is removed from the list before the newest one is added.

3.4 Avoiding the chunk-lookup disk bottleneck

Notice that there is no full chunk index in our approach, either in RAM or on disk. The only index we maintain in RAM, the sparse index, is much smaller than a full chunk index: for example, if we only sample one out of every 128 hashes, then the sparse index can be 128 times smaller than a full chunk index.

We do have to make a handful of random disk accesses per segment in order to load in champion manifests, but the cost of those seeks is amortized over the thousands of chunks in each segment, leading to acceptable throughput. Thus, we avoid the chunk-lookup disk bottleneck.

3.5 Storing chunks

We do not have room in this paper, alas, to describe how best to store chunks in chunk containers. The scheme described in Zhu et al.
[28], however, is a pretty good starting point and can be used with our approach. They maintain an open chunk container for each incoming stream, appending each new (unique) chunk to the open container corresponding to the stream it is part of. When a chunk container fills up (they use a fixed size for efficient packing), a new one is opened.

This process uses chunk locality to group together chunks likely to be retrieved together so that restoration performance is reasonable. Supporting deletion of segments requires additional machinery for merging mostly empty containers, garbage collection (when is it safe to stop storing a shared chunk?), and possibly defragmentation.

3.6 Using less bandwidth

We have described a system where all the raw backup data is fed across the network to the backup system and only then deduplicated, which may consume a lot of network bandwidth. It is possible to use substantially less bandwidth, at the cost of some client-side processing, if the legacy backup clients could be modified or replaced. One way of doing this is to have the backup client perform the chunking, hashing, and segmentation locally. The client initially sends only a segment's chunks' hashes to the back end, which performs champion choosing, loads the champion manifests, and then determines which of those chunks need to be stored. The back end notifies the client of this, and the client sends only the chunks that need to be stored, possibly compressed.

4 Experimental Results

In order to test our approach, we built a simulator that allows us to experiment with a number of important parameters, including some parameter values that are infeasible in practice (e.g., using a full index). We apply our simulator to two realistic data sets and report below on locality, overall deduplication, RAM usage, and throughput. We also report briefly on some optimizations and an ongoing productization that validates our approach.
4.1 Simulator

Our simulator takes as input a series of (chunk hash, length) pairs, divides it into segments, determines the champions for each segment, and then calculates the amount of deduplication obtained. Available knobs include the type of segmentation (fixed or variable size), mean segment size, sampling rate, maximum number of champions loadable per segment, how many manifest IDs to keep per hook in the sparse index, and whether or not to use a simple form of manifest caching (see Section 4.7).

We (or others, when privacy is a concern) run a small tool we have written called chunklite in order to produce chunk information for the simulator. Chunklite reads from either a mounted tape or a list of files, chunking using the TTTD chunking algorithm [13]. Except where we say otherwise, all experiments use chunklite's default 4 KB mean chunk size, which we find a good trade-off between maximizing deduplication and minimizing per-chunk overhead.

The simulator produces various statistics, including the sum of the lengths of every input chunk (original size) and the sum of the lengths of every non-removed chunk (deduplicated size). The estimated deduplication factor is then original size/deduplicated size.
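As a toy illustration of this accounting, the following computes the deduplication factor over a stream of (chunk hash, length) pairs for an idealized full index, where every duplicate chunk is removed. This is our sketch, not chunklite or the simulator itself; sparse indexing can only approach this bound, since it occasionally stores duplicates:

```python
def dedup_factor(stream: list) -> float:
    """Return original size / deduplicated size for a stream of
    (chunk_hash, length) pairs, assuming an idealized full chunk
    index in which every duplicate chunk is removed."""
    seen = set()
    original = deduplicated = 0
    for chunk_hash, length in stream:
        original += length
        if chunk_hash not in seen:    # first occurrence: must be stored
            seen.add(chunk_hash)
            deduplicated += length
    return original / deduplicated
```

For example, a stream of four 4 KB chunks in which one chunk repeats once has an original size of 16 KB and a deduplicated size of 12 KB, for a factor of 4/3.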