A Data Protection Taxonomy

Building on the revised data protection and backup methodology described in my previous post, I think there is another important notion here: that of a data protection taxonomy.

As we think about moving backup away from a host-centric application and toward a data-centric service, I think we need a way to consistently describe the data protection characteristics of a data set. This description needs to be completely independent of any storage array, application, data type, vendor, target, or network.

We need a simple, consistent means of sharing an entire data protection policy between any device or application responsible for providing data protection services.

I could then associate the data protection policy with a data set and any service provider for data protection could interpret it, and provide the mandated service level.

Basically, any data object could have such a policy associated with it. And I could bind that policy to it in all kinds of interesting places.

How about binding a policy to a VM and making it available through the APIs on ESX or vSphere? How about binding a policy to a database and making it available through the Oracle APIs? How about binding a policy to an OS? Better yet, how about binding it to a consistency group or LUN on an array?

At this point, with the appropriate credentials, any data protection service provider, whether it offers archival, Continuous Data Protection (CDP), or backup services (hosted on an appliance, on an array, or in a traditional backup application), can read or request the policy and provide the required data protection service.

More practically: any service provider, from any vendor, can act upon any data set, resident upon any storage.
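To make the binding idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `PolicyRegistry` stands in for whatever API (hypervisor, database, or array) would expose the bound policy, the provider and object names are invented, and the policy is treated as an opaque string.

```python
class PolicyRegistry:
    """Hypothetical stand-in for an API that exposes policies bound to data objects."""

    def __init__(self, authorized: set[str]) -> None:
        # Providers that hold "the appropriate credentials" (assumed model).
        self._authorized = authorized
        self._policies: dict[str, str] = {}

    def bind(self, data_object: str, policy: str) -> None:
        """Associate a policy with a data object (a VM, a database, a LUN...)."""
        self._policies[data_object] = policy

    def request(self, provider: str, data_object: str) -> str:
        """Return the bound policy to any authorized provider, whatever the vendor."""
        if provider not in self._authorized:
            raise PermissionError(f"{provider} may not read policies")
        return self._policies[data_object]


registry = PolicyRegistry(authorized={"backup-appliance", "archive-service"})
registry.bind("vm:web01", "2,0,0,14,0,12,0")
policy = registry.request("backup-appliance", "vm:web01")
```

The point of the sketch is only that the registry knows nothing about how a provider will act on the policy; it just hands the same description to whoever is entitled to read it.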

How is that for no more vendor lock-in?

If such a taxonomy could be widely agreed upon and adopted, the potential for increased architectural freedom for data protection services is enormous. The role of the traditional backup server would likely be reduced to that of a simple metadata catalog. I am fine with that. But our ability to meet the architectural goals I described in the last post on broken backup would be significantly increased. I might go so far as to say that without this taxonomy, getting to that goal would be difficult to impossible.

The taxonomy I propose is described below.

There are two forms: the simple, for those inclined to simplify data protection and retention, and the complex, for those who want very exact control over when backups are taken and how they are retained. In either form, I think the taxonomy captures any characteristic we need to fully describe a comprehensive data protection scheme.

To clarify, each number is a flag value, and the flags that apply to a data set are summed. The sum describes which data protection services are required. So, in the simple scheme, a 7 (1 + 2 + 4) designates a data set that needs CDP, backup, and archive. A 2 designates a data set that gets traditional backup only.
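The summing rule behaves like a bit mask. Here is a minimal sketch in Python, using the simple form's flag values of 1 (CDP), 2 (backup), and 4 (archive); the names `Service` and `decode_services` are mine, not part of the proposal.

```python
from enum import IntFlag


class Service(IntFlag):
    """Flag values from the simple form; the class name is illustrative."""
    CDP = 1
    BACKUP = 2
    ARCHIVE = 4


def decode_services(x: int) -> list[str]:
    """Return the names of the services encoded in the summed value x."""
    if not 0 < x < 8:
        raise ValueError("x must satisfy 0 < x < 8 in the simple form")
    # Test each flag explicitly, in the order the taxonomy lists them.
    flags = (Service.CDP, Service.BACKUP, Service.ARCHIVE)
    return [flag.name for flag in flags if x & flag]


print(decode_services(7))  # CDP, backup, and archive
print(decode_services(2))  # traditional backup only
```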

The letters each stand for a numeric value: either an actual threshold (such as the age or size before archive) or the length of a retention.

So, using the simple case again, a data set described by 2,0,0,14,0,12,0 gets backup only, and retains 14 daily images (2 weeks of daily backups) and 12 monthly images (12 months of backups).
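Reading such a policy mechanically is straightforward. The sketch below assumes the simple form's field order x,a,b,c,d,e,f and a comma-separated string encoding; the class, field, and function names are my own illustration, not a defined format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimplePolicy:
    services: int      # x: summed service flags (1 CDP, 2 backup, 4 archive)
    archive_age: int   # a: age before archive, in hours
    archive_size: int  # b: size before archive, in bytes
    daily: int         # c: number of daily images retained
    weekly: int        # d: number of weekly images retained
    monthly: int       # e: number of monthly images retained
    annual: int        # f: number of annual images retained


def parse_simple_policy(text: str) -> SimplePolicy:
    """Parse 'x,a,b,c,d,e,f' and apply the bound 0 < x < 8."""
    fields = [int(part) for part in text.split(",")]
    if len(fields) != 7:
        raise ValueError("expected 7 comma-separated values")
    if not 0 < fields[0] < 8:
        raise ValueError("x must satisfy 0 < x < 8")
    if any(value < 0 for value in fields[1:]):
        raise ValueError("thresholds and retention counts cannot be negative")
    return SimplePolicy(*fields)


policy = parse_simple_policy("2,0,0,14,0,12,0")
print(policy.daily, policy.monthly)
```

For the example policy this yields backup only, 14 daily images, and 12 monthly images, matching the description above.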

Note that I have deliberately excluded the notion of full, incremental, progressive, deduplicated, and differential backups. In principle these simply offer different means of retaining the same data: from a logical retention perspective, they are identical. The implementation of these is determined by the protection methodology used by the data protection service provider, but is irrelevant to the actual policy. Put another way: if your service provider is Avamar, you might do a source deduplication, and always do incrementals (or fulls; they really are the same thing from an Avamar perspective). If it is TSM, you might do incrementals forever for some data sets, and fulls with standard incrementals for others. This is a choice best determined by the data protection service provider, and does not need to be described in our data protection taxonomy.
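The separation between policy and implementation can be sketched as follows. This is only an illustration in Python: the two provider classes are hypothetical and do not model Avamar, TSM, or any real product's API. The point is that both read the same retention value and choose their own method.

```python
from typing import Protocol


class ProtectionProvider(Protocol):
    """Hypothetical interface a data protection service provider might expose."""

    def plan(self, policy: str) -> str:
        ...


def daily_retention(policy: str) -> int:
    """Read field c (number of daily images) from an x,a,b,c,d,e,f policy."""
    return int(policy.split(",")[3])


class IncrementalForeverProvider:
    def plan(self, policy: str) -> str:
        return f"incrementals forever, keep {daily_retention(policy)} daily images"


class FullPlusIncrementalProvider:
    def plan(self, policy: str) -> str:
        return f"fulls with incrementals, keep {daily_retention(policy)} daily images"


def plans_for(policy: str, providers: list[ProtectionProvider]) -> list[str]:
    """Hand the same policy to every provider; each decides how to implement it."""
    return [provider.plan(policy) for provider in providers]


for line in plans_for("2,0,0,14,0,12,0",
                      [IncrementalForeverProvider(), FullPlusIncrementalProvider()]):
    print(line)
```

Both plans retain 14 daily images; only the mechanism differs, which is why the mechanism stays out of the policy.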

The two forms are:


Simple form:

Frequency of protection (CDP or backup) 1,2

  • Duration of CDP protection – # of hours

Eligible for archive (yes or no) 4

  • Age before archive (# of hours) a
  • Size before archive (bytes) b

Number of backups retained:

  • # of daily images c
  • # of weekly images d
  • # of monthly images e
  • # of annual images f

Therefore, backup policy = x,a,b,c,…,f; where x>0 and x<8


Complex form:

Frequency of protection (CDP or backup) 1,2

  • Duration of CDP protection – # of hours

Eligible for archive (yes or no) 4

  • Age before archive (# of hours) a
  • Size before archive (bytes) b

Number of backups retained:

  • # of daily images
      • Monday 8,c
      • Tuesday 16,d
      • Wednesday 32,e
      • Thursday 64,f
      • Friday 128,g
      • Saturday 256,h
  • # of weekly images
      • First 512,i
      • Second 1024,j
      • Third 2048,k
      • Fourth 4096,l
      • Fifth 8192,m
  • # of monthly images n
  • # of annual images o
  • Weeklies, monthlies, and annuals on first weekend of calendar 16384

Therefore, backup policy = x,a,b,c,…,o; where x>0 and x<32,768
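Decoding the complex form's x works the same way as in the simple form, just with more flags. The sketch below is in Python and rests on one assumption of mine: every flag is a distinct power of two from 1 through 16384 (which is what the bound x < 32,768 implies), so monthly and annual images contribute counts (n and o) but no flag of their own. The flag names are illustrative.

```python
# Flag values taken from the complex form; names are illustrative only.
COMPLEX_FLAGS = {
    1: "cdp",
    2: "backup",
    4: "archive",
    8: "daily_monday",
    16: "daily_tuesday",
    32: "daily_wednesday",
    64: "daily_thursday",
    128: "daily_friday",
    256: "daily_saturday",
    512: "weekly_first",
    1024: "weekly_second",
    2048: "weekly_third",
    4096: "weekly_fourth",
    8192: "weekly_fifth",
    16384: "align_to_first_weekend",
}


def decode_complex_flags(x: int) -> list[str]:
    """Return the names of all flags set in the summed value x."""
    if not 0 < x < 32768:
        raise ValueError("x must satisfy 0 < x < 32,768 in the complex form")
    return [name for bit, name in COMPLEX_FLAGS.items() if x & bit]


# Backup (2) + Monday (8) + Friday (128) + first weekly (512) = 650
print(decode_complex_flags(650))
```

Because each flag occupies its own bit, any sum decodes to exactly one combination of services and schedule days.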