Search for "This" and "Not That."

· Quick Tips

Can you tell me what documents were only produced to party A and NOT produced to party B?

You get this question a lot in litigation support. It's a set difference question. What from set A is not in set B?


(I made some beautiful art to help!)

Here are a few ways to tackle this:


We've written about VLOOKUP before. If your data is in Excel and it's reasonably small, use VLOOKUP with an exact match and filter for the rows that return #N/A, which are the ones with no match. It's expedient, and here it has the advantage that the platform for analysis is also the platform for presentation. All done.


If you have access to a SQL server, you can load those big DAT files into two new tables, A and B, and run the analysis with little worry about computational constraints.

-- Anti-join: keep rows from a that have no matching BegDoc in b
SELECT a.BegDoc
FROM a
LEFT JOIN b
  ON a.BegDoc = b.BegDoc
WHERE b.BegDoc IS NULL;

This joins the two sets together and returns only the records from A that do not appear in B. With BegDoc indexed, it will typically run on millions of records in the blink of an eye.


I used this approach in the past when I didn't have access to SQL but did have access to a Linux-like shell. The comm command takes two sorted lists and compares them line by line, printing three columns: lines only in the first file, lines only in the second file, and lines in both. The switches -1, -2, and -3 each suppress the corresponding column, so you can keep only the values unique to one file or the other. Example:

comm -23 columnA.txt columnB.txt

-2 suppresses values unique to file 2

-3 suppresses the values that match

That leaves only the values that appear in columnA.txt but not in columnB.txt. Changing the switches to -13 suppresses those instead, leaving the values that appear only in columnB.txt.
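
One catch with comm is that both inputs have to be sorted the same way first. Here is a minimal sketch of the whole round trip, assuming a bash-like shell with the standard sort, comm, and mktemp utilities. The DOC001-style values and the a.sorted/b.sorted file names are made up for illustration; swap in your own exported BegDoc columns.

```shell
#!/usr/bin/env bash
# Hypothetical sample data: document IDs produced to each party (unsorted on purpose).
workdir=$(mktemp -d)
printf '%s\n' DOC003 DOC001 DOC004 DOC002 > "$workdir/columnA.txt"
printf '%s\n' DOC003 DOC005 DOC001 > "$workdir/columnB.txt"

# comm needs sorted input. LC_ALL=C keeps sort and comm on the same byte ordering,
# and -u drops duplicate values within each list.
LC_ALL=C sort -u "$workdir/columnA.txt" > "$workdir/a.sorted"
LC_ALL=C sort -u "$workdir/columnB.txt" > "$workdir/b.sorted"

# -23 hides column 2 (only in B) and column 3 (in both): what remains is only in A.
only_in_a=$(LC_ALL=C comm -23 "$workdir/a.sorted" "$workdir/b.sorted")
# -13 hides column 1 (only in A) and column 3 (in both): what remains is only in B.
only_in_b=$(LC_ALL=C comm -13 "$workdir/a.sorted" "$workdir/b.sorted")

echo "Only in A:"; echo "$only_in_a"
echo "Only in B:"; echo "$only_in_b"

# Clean up the scratch directory.
rm -r "$workdir"
```

With this sample data, DOC002 and DOC004 come back as only in A, and DOC005 as only in B. If the lists were exported on Windows, stray carriage returns at the ends of lines can make identical values look different, so it may be worth stripping those before sorting.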

When using these methods, it's important to realize that you are using a practical application of set theory. Feel free to exaggerate what you do at dinner parties: "Well, lately I've been applying set theory to the problems of legal data analysis." Smooth.

Written by Jon Canty
