Data Deduplication in Simple Terms
May 7th, 2011 Mike Theriault
The prime cause of out-of-control data duplication is, ironically, the current standard backup practice of keeping numerous copies of every document just in case. Ever-expanding legal requirements complicate the situation further. The best way out of this quagmire is data deduplication, a key technology for any organization that wants to optimize the performance, efficiency, and cost-effectiveness of its data storage environment.
The term data deduplication is commonly shortened to data dedup. Essentially, it's a process of identifying and removing multiple occurrences of the same data. The first time a deduplication system identifies a file, block, or bit, it flags it. From there, the system replaces each subsequent identical item with a placeholder and removes the duplicate from the system. The placeholder links to the original data, so users always bring up the original data when they try to open a removed duplicate.
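The flag-and-placeholder idea can be sketched in a few lines of Python. This is only an illustration, not how any particular product works: the DedupStore class, its method names, and the file names are hypothetical, and using a SHA-256 digest to recognize identical data is an assumption (the article does not say how duplicates are detected). In this sketch every name, including the first, is stored as a link to the single kept copy.

```python
import hashlib


class DedupStore:
    """Toy in-memory store that keeps one copy of each unique piece of data."""

    def __init__(self):
        self.blobs = {}         # digest -> the single stored copy of the data
        self.placeholders = {}  # item name -> digest (the "placeholder" link)

    def put(self, name, data):
        """Store data under name; return True if the data was a duplicate."""
        digest = hashlib.sha256(data).hexdigest()
        is_duplicate = digest in self.blobs
        if not is_duplicate:
            self.blobs[digest] = data     # first occurrence: keep the data
        self.placeholders[name] = digest  # every name points at the kept copy
        return is_duplicate

    def get(self, name):
        """Follow the placeholder back to the original data."""
        return self.blobs[self.placeholders[name]]


store = DedupStore()
report = b"Quarterly report, final version"
first = store.put("alice/report.doc", report)   # new data, stored once
second = store.put("bob/report.doc", report)    # duplicate, only a link is kept
```

Opening "bob/report.doc" through store.get returns the same bytes as the original, even though only one copy is held.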
This deduplication process significantly reduces the amount of storage space needed in the system. For example, a system that has 200 copies of the same 5 MB document, one in each employee's personal folder, can reduce them to a single copy of the original file plus 199 placeholders that link back to the original document. This means 200 copies of a 5 MB file will take up 5 MB of space (plus the size of the index file) instead of 1,000 MB.
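The arithmetic behind this example is easy to check. In the short calculation below, the placeholder size of about 1 KB each is purely an assumption made for illustration; the real overhead depends on the product and its index format.

```python
copies = 200
file_size_mb = 5
placeholder_size_mb = 0.001  # assumed: roughly 1 KB per placeholder entry

# Every copy stored in full.
without_dedup_mb = copies * file_size_mb

# One full copy plus a small placeholder for each of the other 199.
with_dedup_mb = file_size_mb + (copies - 1) * placeholder_size_mb

savings_pct = 100 * (1 - with_dedup_mb / without_dedup_mb)
```

Under that assumption the deduplicated total stays just above 5 MB, a saving of more than 99 percent compared with the 1,000 MB needed for 200 full copies.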
You're probably already compressing files to save storage space. Compression reduces the size of each file by eliminating redundant bits within that file. Compression is better than doing nothing, but it doesn't eliminate redundant files. In fact, it just stores multiple compressed copies of the same files.
Data deduplication goes a step further by eliminating those redundant copies, storing only one. Storage-wise, this makes a big difference. Simply compressing files reduces storage space by about 50 percent, but data deduplication reduces storage space by a much greater percentage, as the 5 MB document example above illustrates.
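The difference between the two techniques can be seen in a small experiment. This is a sketch under stated assumptions: the sample document is made up, zlib stands in for whatever compressor is in use, and SHA-256 digests are assumed as the way to spot identical copies. The sample text is artificially repetitive, so its compression ratio says nothing about real-world files; the point is only that the cost of compression alone grows with the number of copies.

```python
import hashlib
import zlib

# Hypothetical data: 200 identical copies of one document.
document = b"Meeting notes: nothing changed this week.\n" * 500
copies = [document] * 200

# Compression alone: every copy is still compressed and stored separately.
raw_bytes = sum(len(c) for c in copies)
compressed_each = sum(len(zlib.compress(c)) for c in copies)

# Deduplication: keep one copy per unique digest.
unique = {hashlib.sha256(c).digest(): c for c in copies}
dedup_bytes = sum(len(c) for c in unique.values())

# Both together: compress only the unique copies.
dedup_then_compress = sum(len(zlib.compress(c)) for c in unique.values())
```

Here compressed_each is 200 times the size of one compressed copy, while dedup_then_compress is the size of a single compressed copy.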
A number of characteristics differentiate deduplication processes. Some approaches are inline; others are postprocess. Inline deduplication means that data is deduplicated before it's committed to a disk drive. The postprocess approach first writes data to disk, and once a job or a dataset has been completed, deduplication follows.
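A minimal sketch of the two approaches follows. Plain dictionaries stand in for the disk, the staging area, and the index, and all function names are hypothetical; real systems work on blocks streamed from backup jobs rather than on whole in-memory values.

```python
import hashlib


def digest(block):
    return hashlib.sha256(block).hexdigest()


def inline_write(disk, index, name, block):
    """Inline: look for a duplicate before the block is committed to disk."""
    d = digest(block)
    if d not in disk:
        disk[d] = block
    index[name] = d


def postprocess_write(staging, name, block):
    """Postprocess, step 1: write everything to disk as it arrives."""
    staging[name] = block


def postprocess_dedup(staging, disk, index):
    """Postprocess, step 2: after the job completes, deduplicate staged data."""
    for name in list(staging):
        block = staging.pop(name)
        d = digest(block)
        disk.setdefault(d, block)
        index[name] = d


blocks = [("a.doc", b"hello"), ("b.doc", b"hello"), ("c.doc", b"world")]

# Inline path: the duplicate never reaches the disk.
inline_disk, inline_index = {}, {}
for name, block in blocks:
    inline_write(inline_disk, inline_index, name, block)

# Postprocess path: all three blocks land in staging first.
staging, post_disk, post_index = {}, {}, {}
for name, block in blocks:
    postprocess_write(staging, name, block)
peak_staged = len(staging)
postprocess_dedup(staging, post_disk, post_index)
```

Both paths end with the same two unique blocks on disk. The trade-off this sketch shows is that the postprocess path temporarily holds all three blocks, while the inline path does its duplicate check during the write.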
Benefits
With data deduplication, users can streamline backup, facilitate emergency data restoration, and reduce costs. Primary benefits include:
Efficiency - Streamline Backup
As system backups become quicker and easier, users will be able to create and maintain more backup sets that stretch further back in time. This lets users keep a complete set of document versions without straining the system.
Performance - Facilitate Emergency Data Recovery
Until the advent of reliable data deduplication software, data compression was the only way to reduce the size of the data stored offsite. With data deduplication, the backup set is both compressed and reduced, which means only the data that changed that day is backed up at the end of the day. Because less data has to be moved, transfer time in the event of a data restoration drops significantly. Users no longer have to compromise on performance to take advantage of extended retention capability.
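The idea that only changed data is backed up at the end of the day can be sketched as follows. The backup function, the file names, and the use of SHA-256 digests to recognize data already held by the backup target are all assumptions made for illustration.

```python
import hashlib


def backup(files, stored_digests):
    """Return only the files whose content is not already in the backup set.

    files: dict of name -> bytes for today's data.
    stored_digests: set of digests already held by the backup target;
    it is updated in place as new data is sent.
    """
    to_transfer = {}
    for name, data in files.items():
        d = hashlib.sha256(data).hexdigest()
        if d not in stored_digests:
            to_transfer[name] = data
            stored_digests.add(d)
    return to_transfer


stored = set()
monday = {"plan.doc": b"v1", "budget.xls": b"numbers"}
tuesday = {"plan.doc": b"v2", "budget.xls": b"numbers"}

sent_monday = backup(monday, stored)    # first run: everything is new
sent_tuesday = backup(tuesday, stored)  # second run: only plan.doc changed
```

On the second run only the edited plan.doc is transferred; the unchanged budget.xls is recognized as already stored.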
Cost-Reduction
By eliminating duplicate data and ensuring that data archives are as compact as possible, companies can keep more data online longer at significantly lower costs. It's common to see a backup appliance with data deduplication technology holding 10 to 50 times more backup data than a conventional disk storage product. The advantage depends on the data being backed up, the backup methodology, and the length of time data is retained.
Drawbacks
Data loss can be an issue when backing up data using data deduplication. If the index file becomes corrupted, the data processed using data deduplication may be lost, since it will be nearly impossible to rebuild the data. However, this concern may be alleviated with leading-edge software and by making multiple copies of especially important data.
Focus on the total solution. Before deciding on a data deduplication system, be sure that you understand the entire process, from backup to restore, and know how to manage it.
Finding a Vendor
While the benefits of data deduplication can be astounding, pay attention to facts, not hype. The ideal vendor will have a network of experts on staff and offer an optimal combination of experience, product choice, technical support, and competitive pricing. Another consideration is the vendor's ability to proactively address your data deduplication hardware and software needs.
About the Author:
Mike Theriault is President & CEO of B2B Computer Products LLC. Award-winning B2B Computer was identified by Inc. magazine as one of the fastest growing businesses of its type in the U.S. in 2009 & 2010. It is a single-source provider of products and manufacturer-certified services that include data deduplication, virtualization, VoIP systems, disaster recovery, SAN storage, and server consolidation. Visit their website at http://www.B2BComp.com.