By Will Uppington, Clearwell Systems, Inc.
In my last post, I started a discussion on the myths surrounding concept search. The first myth I dispelled was the “concept search is concept search” myth: the idea that there is an agreed-upon definition of concept search. In actuality, when people in e-discovery use the term concept search, they don’t always mean the same thing. Frequently they are not talking about concept search technology at all but about concept or content categorization technology, which is very different. The second myth that needs dispelling is that concept search is better than keyword search.
The thinking behind this myth goes something like this:
Keyword search has a lot of problems. It is prone to being over-inclusive, i.e., finding some non-relevant documents, and under-inclusive, i.e., not finding some relevant documents. Concept search technologies are new and interesting and using these technologies you can find documents that keyword search can’t find. Therefore, concept search must be better than keyword search.
Let’s examine this thinking. The first two statements are accurate. Keyword search is not perfect and can produce over- and under-inclusive results. And concept search and content categorization technologies can both help identify documents that keyword search technologies might not find. However, the conclusion that concept search is better than keyword search doesn’t follow from these two statements. Why?
In order to answer this question, we first need to go back to the difference between concept search and content categorization. Because these are different technologies, we really need to separately compare concept search versus keyword search and content categorization versus keyword search. Let’s start with content categorization and keyword search.
The issue with this comparison is that keyword search and content categorization do different things. Keyword search can be used in many ways in e-discovery. The two most common are: (1) analysis or case assessment: finding the hot documents and understanding the matter by determining who knew what, when, how and why, and (2) culling: removing non-responsive documents and/or identifying potentially privileged documents in order to reduce a large starting set of documents to a smaller set before review.
Content categorization, on the other hand, has historically been used within the review phase of e-discovery. Categorization can help reviewers to better understand the documents they are reviewing and thus potentially increase the speed of review. Practitioners with whom I have worked also find that categorization can be useful during analysis by helping to understand a matter and identify potentially important keywords.
However, content categorization has not been used as part of culling, for two reasons. First, culling needs to be transparent. You need to be able to explain to the opposing side and the court exactly how you have culled the data set, and ideally get their agreement. If you cull based on categories of documents that have been generated by a proprietary, black-box algorithm, it’s going to be difficult to explain your culling methodology or gain agreement on it. This is why the typical method of culling is still to use keyword search and either to agree on the set of search terms with the opposing side or to use e-discovery search best practices to perform keyword searches on your own.
Second, content categorization has its own issues when it comes to being over- and under-inclusive. There is no guarantee that the group of documents categorized as being related to, for example, a company’s hiring policies includes all of the documents in your matter related to hiring policies, or that it excludes every document that is not really related to hiring policies. Content categorization, like keyword search and virtually every information retrieval technology, is not perfect.
So what about concept search technology? Surely, concept search technology is better than old, boring keyword search. Well, actually it’s not that clear-cut. The problem with concept search technology is that while it might find more relevant documents than plain keyword search, it will also likely find more false positives. Imagine searching for documents containing “terminate” in an employment matter and your concept search technology automatically searching for “fire”, “dismiss”, etc. as well. You’ll find more documents related to the termination of employees, but you’ll also find a lot more non-relevant documents concerning house fires, the fire department, etc.
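The trade-off in the “terminate” example can be illustrated with a small Python sketch. Everything in it is hypothetical: the six-document corpus, the relevance labels, and the hand-written synonym list are assumptions made for illustration, and plain term expansion stands in for whatever a real concept search engine does internally. Precision here means the share of returned documents that are relevant (low precision is the over-inclusive problem), and recall means the share of relevant documents that are returned (low recall is the under-inclusive problem).

```python
import re

# Hypothetical toy corpus: (text, relevant to employee termination?)
DOCS = [
    ("We plan to terminate his contract on Friday.", True),
    ("HR decided to fire the regional manager.", True),
    ("The board voted to dismiss the CFO.", True),
    ("The fire department inspected the warehouse.", False),
    ("Insurance claim for the house fire was filed.", False),
    ("Quarterly sales figures are attached.", False),
]


def search(terms):
    """Return the indices of documents containing any term as a whole word."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return {i for i, (text, _) in enumerate(DOCS) if pattern.search(text)}


def precision_recall(hits):
    """Compute (precision, recall) of a hit set against the hand labels."""
    relevant = {i for i, (_, is_relevant) in enumerate(DOCS) if is_relevant}
    true_positives = len(hits & relevant)
    precision = true_positives / len(hits) if hits else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Plain keyword search: only the literal term.
keyword_hits = search(["terminate"])

# Stand-in for concept search: the query is expanded with assumed synonyms.
expanded_hits = search(["terminate", "fire", "dismiss"])

for label, hits in [("keyword", keyword_hits), ("expanded", expanded_hits)]:
    p, r = precision_recall(hits)
    print(f"{label:9s} precision={p:.2f} recall={r:.2f}")
# keyword   precision=1.00 recall=0.33
# expanded  precision=0.60 recall=1.00
```

In this toy corpus the expanded query finds every relevant document but also pulls in the fire department and house fire documents. The perfect recall is an artifact of how the corpus was constructed; on a real document set, expansion would typically raise recall without reaching it.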
So concept search can help address the under-inclusive problem with keyword search (though it won’t solve it) and can be helpful during analysis. But it can often increase the over-inclusive problem. In addition, today’s concept search technologies share the transparency problem with content categorization. These technologies have largely been designed as “black boxes”, which, as I have discussed in the past, makes sense for enterprise search but not for e-discovery search. As a result, they could also be difficult to explain and defend. For these reasons, concept search technology isn’t used very much in e-discovery today. In order for its use to become widespread, it will need to become more transparent. But that’s a topic for another day.
The bottom line here is that despite all the hype, concept search and content categorization technologies do not solve all the challenges of e-discovery search. Both of these technologies can be very useful and the technology behind them is always improving. However, as most of the experienced practitioners I work with already know, these technologies are generally better thought of as supplements to keyword search, not replacements. The important question is not whether to use one technology over the other but which technology is best suited to your objectives and how best to use all the available technologies to achieve the desired goal.