I was looking at the backup policy for our organization and thought I'd check what's new, and I found something called "Data Deduplication". Interesting...
Data deduplication, data reduction, commonality factoring, capacity optimized storage -- whatever you call it -- is a process designed to make network backups to disk faster and more economical.
The idea is to eliminate large amounts of redundant data that can chew up disk space. Proponents also say it lets you keep more data available online, for longer, in the same amount of disk space.
In deduplication, as data is backed up to a disk-based virtual tape library (VTL) appliance, a catalog of the data is built. This catalog, or repository, indexes individual pieces of data in a file or block of information, assigns each a metadata reference that can be used to rebuild the file if it needs to be recovered, and stores it on disk. The catalog is also used on subsequent backups to identify which data elements are unique. Nonunique data elements are not backed up; unique ones are committed to disk.
For instance, a 20-slide PowerPoint file is initially backed up. The user then changes a single slide in the file, saves it and e-mails it to 10 counterparts. When a traditional backup occurs, the entire PowerPoint file and its 10 e-mailed copies are backed up. With deduplication, after the PowerPoint file is modified, only the unique elements of data -- the single changed slide -- are backed up, requiring significantly less disk capacity.
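The catalog-and-chunk process described above can be sketched in a few lines. This is a toy model, not any vendor's implementation: the class name, fixed chunk size and file names are all illustrative, and real products chunk data at much finer and often variable granularity.

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative fixed-size chunks; real products vary

class DedupStore:
    """Toy dedup repository: unique chunks keyed by fingerprint, plus a per-file recipe."""
    def __init__(self):
        self.chunks = {}    # hash -> chunk bytes (the unique data committed to disk)
        self.catalog = {}   # file name -> ordered list of chunk hashes (metadata references)

    def backup(self, name, data):
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            h = hashlib.sha256(chunk).hexdigest()
            if h not in self.chunks:   # unique element: commit to disk
                self.chunks[h] = chunk
            recipe.append(h)           # nonunique element: store only a reference
        self.catalog[name] = recipe

    def restore(self, name):
        # Rebuild the file from its metadata references
        return b"".join(self.chunks[h] for h in self.catalog[name])

store = DedupStore()
deck = b"slide" * 5000                # stand-in for the PowerPoint file
store.backup("deck.ppt", deck)
for i in range(10):                   # the ten e-mailed copies add no new chunks
    store.backup(f"copy{i}.ppt", deck)
```

After the loop, the catalog lists 11 files, but the chunk store holds only the handful of unique chunks from the original -- and backing up a lightly edited copy adds only the chunks the edit touched.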
"The data-reduction numbers are great," says Randy Kerns, an independent storage analyst. "Most vendors are quoting a 20-to-1 capacity reduction by only storing uniquely changed data."
Data deduplication uses a couple of methods to identify unique information. Some vendors use a cryptographic algorithm called hashing to tell whether data is unique. The algorithm is applied to the data and compared with previously calculated hashes. Other vendors, such as Diligent, use a pattern-matching and differencing algorithm that identifies duplicate data. Diligent says this method is more efficient, because it is less CPU- and memory-intensive.
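The hash-based method boils down to comparing fingerprints instead of the data itself. A minimal sketch, assuming SHA-256 as the cryptographic algorithm (vendors' actual choices differ):

```python
import hashlib

seen = set()  # fingerprints of previously backed-up data elements

def is_unique(chunk: bytes) -> bool:
    """Identify a chunk by its hash rather than a byte-by-byte comparison."""
    h = hashlib.sha256(chunk).digest()
    if h in seen:
        return False      # fingerprint already cataloged: skip the backup
    seen.add(h)
    return True

assert is_unique(b"slide one")
assert not is_unique(b"slide one")   # duplicate detected from the hash alone
assert is_unique(b"slide ONE")       # one changed byte yields a new fingerprint
```

The trade-off Diligent alludes to is visible here: every chunk must be hashed, which costs CPU, and every fingerprint must be kept for lookup, which costs memory.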
Dedupe Differentiation

Data deduplication differs from compression in that compression looks only for repeating patterns of information and reduces them. Brad O'Neill, senior analyst with the Taneja Group, offers this example: the pattern of data '123412341234123412341234' would be compressed to '6 1234', or 6x1234 -- a fivefold compression of 24 digits. Data deduplication would initially reduce the unique data to four digits -- 1234 -- and subsequent backups would recognize that no additional unique data was being transmitted, so nothing more would be backed up.

Deduplication also differs from incremental backup in that only the byte-level changes are backed up. In an incremental backup, an entire file or block of information is backed up whenever it changes. For instance, a user changes the single word 'Bob' to 'Steve' in a file and saves it. An incremental backup, rather than backing up just the unique data -- 'Steve' -- backs up the entire file. Data-deduplication technology would recognize that 'Steve' is the only unique element of the file and back up only that.
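O'Neill's digit example can be made concrete. In this sketch, zlib stands in for generic compression and a hash-keyed dictionary stands in for a dedup store; both are illustrative, not how any particular product works:

```python
import zlib
import hashlib

data = b"1234" * 6                    # '123412341234123412341234'

# Compression: the repeating pattern is encoded compactly, but every
# backup of this data produces and stores the compressed copy again.
compressed = zlib.compress(data)

# Deduplication: the unique element '1234' is stored once, and any later
# backup of the same data matches existing fingerprints, so nothing new is kept.
store = {}
for i in range(0, len(data), 4):
    chunk = data[i:i + 4]
    store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)

print(len(compressed))   # small, but re-stored on every backup
print(len(store))        # 1 -- only '1234' is kept, across all backups
```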
The size of the catalog and cache are also important in differentiating deduplication products.
"The efficiency of deduplication technology all comes down to how the index is architected and how large it is," O'Neill says. "For instance, Diligent spends a lot of time talking about the speed and size of its index -- that it's small and resides completely in RAM."
Data deduplication takes place in one of two ways -- in-line or postprocessing. With in-line processing, data is deduplicated as it is backed up; with postprocessing, data is deduplicated after it is backed up.
Analysts say there is not much of a difference in the outcome between using either method.
"The in-line vendors make claims about performance and scalability; the postprocessing vendors are generally making the same claims," O'Neill says. "From everything I see, it comes down to the particular workload profile of the user. One of the disadvantages of postprocessing is it can potentially extend the time it takes to back up the data."
ADIC, Asigra, Avamar, Data Domain, Diligent, Falconstor and Microsoft all use in-line processing; Copan and Sepaton use postprocessing.