Big data comes from everywhere: existing databases, mountains of unstructured and semi-structured documents, and entirely new sources on the web or in the evolving “network of things” in the enterprise. It creates unprecedented volumes of data to winnow and analyze for meaning. The presence of duplicate data in those streams (which may be necessary, since the same blocks of information can mean different things in different contexts) creates a kind of double jeopardy for the enterprise.
First, of course, there’s just the question of where to keep it all. Storage capacity is cheap and getting cheaper to acquire, but it isn’t free, and even free capacity is not free to configure, manage, and safeguard.
Faced with the need to cope with new volumes of data, you have an opportunity to drive storage efficiency deeper into the data center. Deduplication is the key technology for making storage itself more efficient (whereas content management and the like may make our use of storage more efficient). Storing only one copy of each piece of data, at the file level or, better still, at the block level, can vastly reduce the amount of space a given business’s data consumes. This reduces the load on backup systems as well and, by extension, the WAN traffic generated by storage array replication and by backup among sites.
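To make the block-level idea concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular storage product works: the `BlockStore` class, its method names, and the fixed 4 KB block size are all assumptions made for the example. Each block is identified by its SHA-256 digest, and a block whose digest has already been seen is not stored a second time.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size in bytes; real systems vary


class BlockStore:
    """Toy content-addressed store (hypothetical): each unique block is kept once."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}  # SHA-256 digest -> block bytes (stored once)
        self.files = {}   # file name -> ordered list of block digests

    def put(self, name, data):
        """Split data into fixed-size blocks and store only the unseen ones."""
        digests = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # no-op if already stored
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        """Rebuild a file from its list of block digests."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def logical_bytes(self):
        """Bytes the files would occupy without deduplication."""
        return sum(len(self.blocks[d]) for ds in self.files.values() for d in ds)

    def physical_bytes(self):
        """Bytes actually held after deduplication."""
        return sum(len(block) for block in self.blocks.values())


# Two small "files" that share their first two 4-byte blocks.
store = BlockStore(block_size=4)
store.put("a", b"AAAABBBBCCCC")
store.put("b", b"AAAABBBBDDDD")
print(store.logical_bytes(), store.physical_bytes())  # prints: 24 16
```

In this toy case the two files hold 24 bytes between them, but only 16 bytes are stored, because the shared blocks are kept once. File-level deduplication would have saved nothing here, since the two files are not identical as a whole.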
Second, there’s the problem of how to make use of the big data. Analytics software turns the mountains of raw data into manageable foothills of useful information. However, its speed and efficiency depend in part on the volume of data it has to retrieve from storage, ingest, and process. Reducing the volume retrieved from storage should, all else being equal, improve performance.
There are limits, though: not all kinds of data can be deduplicated in this context, and the performance cost of deduplication itself may undercut the advantages.
Bottom line: A savvy CIO will try to avoid the double-jeopardy pinch of redundant data in an emerging big data environment, and examine closely the roles that deduplication can play.