Friday 27 July 2007

Driving data security forwards

I realise that so far what I've written this week may seem a little repetitive, or even off the point at times, but as always, there are methods to my madness, themes and threads to come back to and pick up, much like Harry Potter in many respects. Apart from the magic, wizards, owls, hats and bodycount. I digress (far too often).

I talked yesterday about data classification again, and how tagging was a difficult thing to do. In the outside world, "the internet", or any system which does not have one single point of control, tagging is meaningless. If I told you I was holding a fish, you might imagine a small, orange thing with an eye on either side, like a goldfish. I may be holding a shark. I may even be holding a goat, but just calling it a fish. You get the picture.

If I have a central point of control, I can specify what tags are available, whether they are then relevant, and how they might change. Assuming for now that this will never be achieved on the internet, the next best thing is a closed system which I have ultimate control of. Now I can start doing some interesting things.

This approach is already used effectively by the military, with models like Bell-LaPadula, Biba and Clark-Wilson controlling access based on confidentiality and integrity. It's enough to make any security-head start dribbling with anticipation.
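Purely as an illustration of what Bell-LaPadula boils down to (the classification levels below are my own hypothetical examples, not any particular system's), here is a minimal Python sketch of its two famous rules: no reading up, and no writing down.

    from enum import IntEnum

    class Level(IntEnum):
        """Hypothetical classification levels, ordered low to high."""
        UNCLASSIFIED = 0
        CONFIDENTIAL = 1
        SECRET = 2
        TOP_SECRET = 3

    def can_read(subject: Level, obj: Level) -> bool:
        """Simple security property: no read up (clearance must cover the label)."""
        return subject >= obj

    def can_write(subject: Level, obj: Level) -> bool:
        """*-property: no write down (can't leak information to a lower level)."""
        return subject <= obj

    # A SECRET-cleared user may read CONFIDENTIAL data, but not write to it,
    # because writing down could leak higher-classified information.
    print(can_read(Level.SECRET, Level.CONFIDENTIAL))   # True
    print(can_write(Level.SECRET, Level.CONFIDENTIAL))  # False

A Biba-style integrity check is essentially the mirror image: no reading down, no writing up.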

If I can control a whole network of machines in this way, I can encrypt information that I deem to be confidential, apply integrity controls for that which I need to monitor closely, change groups and user access rights, etc.

Great! BUT, how do you persuade someone to classify all the data on their network? As I mentioned previously, the military do this already, but not many others. Can you imagine the time and effort it would take to trawl through every piece of data in your organisation and create meaningful meta-tags for every piece?

And what if those tags are erroneous? Computers aren't foolproof, and neither are fools, er, humans. On the internet we all tag our own blogs by picking out words we think describe them well. Think for a moment of how many different meanings words can have: "set", for example, means at least a dozen different things. What about different languages? What about dyslexics? What about txt spk or h4x0r tags? A machine, on the other hand, cannot understand the content of the files: a list of figures may go unnoticed as a highly confidential document, or a useless piece of information may be marked as top secret.

In our closed system, therefore, there needs to be an element of human interaction, but a controlled one. Even with a finite list of tags, the number of combinations is huge, so you can see again why the task of tagging the whole internet becomes so enormous, so quickly. Explaining why this is a good idea to a security person is straightforward. Explaining it to the CFO is not. Why? Because it costs money, takes time and achieves very little at first glance.

The extremely clever answer so far has been de-duplication, or "de-duping". Instead of spending hundreds of man-hours going through each piece of data, a crawler is set to work on the filesystem, hashing each file and picking out what it believes to be relevant meta-data along the way. If (and, in practice, when) two identical hashes are discovered, the match is noted and logged. Once the crawler has been through all the data in the system, it reports how much duplication there is and where the duplicates live, at which point an intelligent decision can be made about whether the duplication is necessary and the correct adjustments made (deletion of copies, shortcuts, etc.).
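To make that concrete, here is a minimal sketch of the kind of crawler described above (the root path and the choice of SHA-256 are my own assumptions, not any vendor's product):

    import hashlib
    import os
    from collections import defaultdict

    def hash_file(path, chunk_size=1 << 20):
        """Return the SHA-256 digest of a file, read in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def find_duplicates(root):
        """Walk the tree under root and group file paths by content hash."""
        by_hash = defaultdict(list)
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    by_hash[hash_file(path)].append(path)
                except OSError:
                    continue  # unreadable file: skip and move on
        # Only hashes seen more than once represent duplicated storage
        return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

    if __name__ == "__main__":
        for digest, paths in find_duplicates("/srv/fileshare").items():
            print(digest, len(paths), "copies")
            for p in paths:
                print("   ", p)

The output is exactly the raw material described: a log of which content is duplicated and where, ready for a human to decide what stays and what goes.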

Estimates range up to a 30% saving in storage space for these solutions, which finally gives the CFO a compelling argument, and de-duping has already proven a popular technology. It also shows a good partnership between man and machine, each filling in where the other is more prone to error. Of course there is still a need to make the tags relevant, but every file now carries a unique identifier (its hash), which makes the task simpler in the first place.

In a closed system it is arguable that the relevant tags can be applied as a "work in progress", i.e. added by users via a desktop client as files are accessed. If access and integrity controls are applied to all data from the start, then any unauthorised access that happens before a file is properly classified can still be dealt with retrospectively, depending on the sensitivity of the information within. Thus a system evolves: data not accessed within a certain time can be flagged for attention, data can be classified (and that classification approved) based on the attributes assigned to it, and policies can then be written over the top.
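Again purely as a sketch of that "work in progress" idea (the 90-day review window and the field names are assumptions of mine), something like this could sit behind a desktop client:

    import time

    REVIEW_AFTER_DAYS = 90  # assumption: flag anything untouched for 90 days

    class ClassificationRegistry:
        """Toy registry: tags applied as files are touched, gaps flagged for review."""

        def __init__(self):
            self.records = {}  # path -> {"tag": str or None, "last_access": float}

        def record_access(self, path, user, tag=None):
            rec = self.records.setdefault(path, {"tag": None, "last_access": 0.0})
            rec["last_access"] = time.time()
            if tag is not None:
                rec["tag"] = tag  # user supplies a classification via the client
            elif rec["tag"] is None:
                # Access before classification: log it so retrospective
                # action can be taken if the data turns out to be sensitive
                print(f"REVIEW: {user} accessed unclassified file {path}")

        def stale(self):
            """Files not accessed within the review window, flagged for attention."""
            cutoff = time.time() - REVIEW_AFTER_DAYS * 86400
            return [p for p, r in self.records.items() if r["last_access"] < cutoff]

The point is not the code itself but the shape of it: classification happens gradually, the gaps are visible, and the policy layer sits over the top.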

It would, of course, have been easier to have started like this when filesystems were first invented, just as it would have been easier to assign a system of tagging for the entire internet in 1985, but that didn't happen, so we have to find a better way of protecting our data. Starting inside our networks doesn't seem like such a bad compromise to me.

