Sandline Discovery - Ideas, Education and eDiscovery
For file type filtering, do you use inclusion or exclusion lists?
We rarely recommend inclusion lists, because the workflow may miss file types that turn out to be important. But first, let's revisit what the question implies.
Typically, the list consists of file extensions (though it may instead hold something like the automatic classifications produced by a file type sniffer), but the principle and caveats are the same. With file extensions, for example:
If using such a list for filtering, this implies you apply the filters after extracting files from archives, such as zips.
If whitelisting, that is, using an "inclusion" list, you would keep the extensions on the list and discard the rest; if blacklisting, that is, using an "exclusion" list, you would drop the extensions on the list and keep the rest.
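To make the two modes concrete, here is a minimal Python sketch, not a description of any particular tool. It unpacks zip archives first and then filters the resulting names by extension. The function names (extension, expand, filter_names) and the sample file names are made up for illustration, and treating only the .zip extension as an archive is a simplifying assumption.

```python
import io
import zipfile
from pathlib import PurePosixPath


def extension(name):
    """Lower-cased extension of the last path component, e.g. '.docx'."""
    return PurePosixPath(name).suffix.lower()


def expand(name, data):
    """Yield (name, bytes) pairs, unpacking .zip archives recursively.

    Assumption: only the .zip extension marks an archive here. Sniffing the
    content instead would also unpack .docx/.xlsx files, which are zip
    containers themselves.
    """
    if extension(name) == ".zip":
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for member in archive.infolist():
                if not member.is_dir():
                    inner = f"{name}/{member.filename}"
                    yield from expand(inner, archive.read(member))
    else:
        yield name, data


def filter_names(names, listed, mode):
    """Apply an inclusion or exclusion list of extensions to file names."""
    listed = {ext.lower() for ext in listed}
    if mode == "include":   # whitelist: keep only what is on the list
        return [n for n in names if extension(n) in listed]
    if mode == "exclude":   # blacklist: drop what is on the list
        return [n for n in names if extension(n) not in listed]
    raise ValueError("mode must be 'include' or 'exclude'")


# Build a small archive in memory so the example needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("report.docx", b"placeholder")
    archive.writestr("cache.html", b"placeholder")

names = [n for n, _ in expand("evidence.zip", buffer.getvalue())]
names += [n for n, _ in expand("notes.txt", b"placeholder")]

print(filter_names(names, {".docx"}, "include"))
print(filter_names(names, {".html"}, "exclude"))
```

Note that the inclusion list silently drops notes.txt because nobody thought to list it, while the exclusion list keeps it.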
Here's the Rub
It seems that an inclusion list would be better at culling your data down to the stuff you want, so why don't you recommend that?
It's a classic problem. Whitelists tend to be better at getting you only the things you want, but there are too many unknowns: you will exclude items that should have gotten through the filter.
Pop quiz: how many extensions can a Word document have? I can think of about a half dozen off the top of my head. Now go look at the available extensions in the "save as" dialog; there are more than a dozen, and those are just the popular ones.
Ok, but nobody really saves Word docs as .html.
You would think, but I've seen it in the wild.
Fine, exclusion list it is. But then you're forced to deal with lots of junk: thousands, maybe hundreds of thousands of file extensions. This seems unmanageable.
You are right. Besides, an exclusion list doesn't work either, at least not if it's just a file extension list. Take the .html extension, for example. Most of the time, it's an Internet cache file and can be excluded, but sometimes it's a Word doc.
If neither works, what are we to do?
Despair not. We just need to use the exclusion list principle, but get a little more sophisticated.
At Sandline, we classify files using rules based on a number of attributes, often, but not always, including extensions. These rules can be:
General – the set that we've built from experience and that is portable between cases.
Specific – rules that apply to characteristics unique to a particular case or customer.
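As an illustration of an attribute other than the extension, the sketch below peeks at the start of an .html file to guess whether Word produced it. This is a heuristic of my own for the sake of example, not Sandline's actual rule: it assumes that Word's "Save as Web Page" output carries the Office XML namespace or a "Microsoft Word" generator tag near the top of the file, and the name looks_like_word_html is hypothetical.

```python
import re

# Assumed markers of Word-generated HTML; other generators will differ.
WORD_HTML_MARKERS = (
    re.compile(rb"urn:schemas-microsoft-com:office:word", re.IGNORECASE),
    re.compile(
        rb"<meta[^>]+name=[\"']?generator[\"']?[^>]+content=[\"']?microsoft word",
        re.IGNORECASE,
    ),
)


def looks_like_word_html(data, probe_bytes=4096):
    """Return True if the first probe_bytes of data contain a Word marker."""
    head = data[:probe_bytes]
    return any(marker.search(head) for marker in WORD_HTML_MARKERS)


word_export = (
    b'<html xmlns:w="urn:schemas-microsoft-com:office:word">'
    b'<head><meta name=Generator content="Microsoft Word 15"></head></html>'
)
cached_page = b"<html><head><title>News</title></head><body></body></html>"

print(looks_like_word_html(word_export))
print(looks_like_word_html(cached_page))
```

A check like this yields one more attribute to feed into a rule; on its own it will miss files whose markers sit outside the probed window.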
We put files into several taxonomies. Files with a .csv extension, for example, usually fall into two hierarchical classifications: "Text/Delimited" and "Office/Excel". An .html file can likewise fall into several types, and from there we can build exclusion rules like "exclude browser cache unless it is an office type".
In this way we can build exclusion rules that reduce the universe of file types to a scale understandable by humans.
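A rule like "exclude browser cache unless it is an office type" could be sketched roughly as follows. This is only an illustration under assumptions: the FileRecord type, the "Web/BrowserCache" and "Office/Word" labels, the vendor_logs folder and the helper names are all hypothetical, and only "Text/Delimited" and "Office/Excel" come from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    path: str
    # A file may carry several hierarchical labels at once.
    classes: set = field(default_factory=set)


def in_taxonomy(record, prefix):
    """True if any label equals prefix or sits beneath it in the hierarchy."""
    return any(c == prefix or c.startswith(prefix + "/") for c in record.classes)


def browser_cache_unless_office(record):
    """General rule: exclude browser cache unless it is an office type."""
    return in_taxonomy(record, "Web/BrowserCache") and not in_taxonomy(record, "Office")


def vendor_logs(record):
    """Case-specific rule (hypothetical): exclude a folder known to be noise."""
    return record.path.startswith("vendor_logs/")


GENERAL_RULES = [browser_cache_unless_office]
CASE_RULES = [vendor_logs]


def cull(records, rules):
    """Keep the records that no exclusion rule matches."""
    return [r for r in records if not any(rule(r) for rule in rules)]


records = [
    FileRecord("export.csv", {"Text/Delimited", "Office/Excel"}),
    FileRecord("cache/index.html", {"Web/BrowserCache"}),
    FileRecord("cache/memo.html", {"Web/BrowserCache", "Office/Word"}),
    FileRecord("vendor_logs/run.txt", {"Text/Plain"}),
]

for kept in cull(records, GENERAL_RULES + CASE_RULES):
    print(kept.path)
```

Because the rules work on classifications rather than raw extensions, the Word document saved as .html survives while the ordinary cache page is dropped.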
Whatever data volumes and requirements a case involves, there is a solution that will cull appropriately and cut down on unnecessary data management and review.
Written by Joe Ulfers