Analytics for Human Information: The New Top 10 Myths of Big Data - Myth #9

In myth #8, I emphasized that “what” questions are much less relevant and much less valuable than “why” questions. We have spent decades asking “what” questions, and most organizations have become pretty good at answering them. Over time, our analytic approaches, tools, techniques, and expertise have aligned to answering “what” as effectively, efficiently, and accurately as possible. But these approaches to uncovering the “what” fly directly in the face of our new imperative: to ask and answer “why.” Indeed, our historical approaches to “what” may actually prevent us from answering “why,” which is the topic of this myth.

 


Big Data myth #9: Big Data requires good data

 

The entire concept of data “goodness” stems from historical data-processing approaches such as Extract, Transform, and Load (ETL). ETL has been around for quite some time, and it is the standard approach to making data “good” for analysis.

In ETL, we take data from one or more sources, transform that data (which is effectively a cleansing step), and then load the various sources of data into some analytic tool to gain insight. The “T” of ETL exists because of the belief that data is inherently “dirty”: that corporate data is full of errors, inconsistencies, mistakes, null values, and so on, and that all of this makes the data less valuable and more prone to misinterpretation. While this perspective was fine in a world obsessed with “what,” in a new world realigning toward “why” the ETL approach is analytic suicide.
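
To make the pattern concrete, here is a minimal sketch of an ETL pass in plain Python. The schema (customer_id, name, home_phone) and the sample rows are hypothetical, and real ETL tools express these rules in their own languages, but the effect is the same: the “T” step throws information away by design.

# A minimal ETL sketch. Hypothetical schema: customer_id, name, home_phone.

source_rows = [
    {"customer_id": 1, "name": "Ann Lee",  "home_phone": "555-0100"},
    {"customer_id": 1, "name": "Ann  Lee", "home_phone": None},  # "dirty" near-duplicate
    {"customer_id": 2, "name": "Bo Chan",  "home_phone": None},  # null field
]

def extract(rows):
    """E: pull rows from a source system (here, just a list of dicts)."""
    return list(rows)

def transform(rows):
    """T: 'cleanse' the data -- drop null phones, de-duplicate on customer_id."""
    cleaned, seen = [], set()
    for row in rows:
        if row["home_phone"] is None:    # null field? discard the row...
            continue                     # ...and, with it, its context
        if row["customer_id"] in seen:   # apparent duplicate? keep only the first
            continue
        seen.add(row["customer_id"])
        cleaned.append(row)
    return cleaned

def load(rows, warehouse):
    """L: write the cleansed rows into the analytic store."""
    warehouse.extend(rows)

warehouse = []
load(transform(extract(source_rows)), warehouse)
print(warehouse)  # only Ann's first record survives; Bo has vanished entirely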

 

The noise is the signal

What if the truly valuable information for understanding “why” (the contextual stuff that is rich in new insight) were actually the so-called dirty data that ETL purposefully discards? What if the so-called noise in the data used to answer “what” were actually the signal for answering “why”? This is exactly what I would propose is the case for many, if not most, sets of corporate data. And yet most data warehousing and Big Data analytic tools purposefully eliminate this data to make it “clean.”

 

A data cleansing example: deleting redundant data

Sound dubious? Let me give you an example. In data cleansing, I look to delete redundant data, where perhaps the same person appears twice in a data set with only minor variations between the records. Or I look to fill in fields where data is missing and shouldn’t be. Wikipedia’s definition of “data cleansing” (admittedly not the final authority on such things) offers an example of when you would “cleanse” data: you have a volume of customer data, and one of the fields you’re analyzing is the customer’s home phone number.

 

Let’s assume that our ETL is set up so that if two customer records appear to be identical, except that one has a home phone number and the other does not, the system will merge the two records as part of the “T” of ETL (as part of the cleanse). ETL systems make such data changes routinely; it is part of their function.
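
Here is that merge rule as a minimal sketch, again in plain Python with hypothetical field names; the “prefer the non-null value” logic is the heart of the problem.

# A sketch of the merge step: two apparent duplicates are folded into
# one record, and the non-null value always "wins."

def merge_duplicates(rec_a, rec_b):
    """Fold two apparent duplicates into one, preferring non-null values."""
    merged = {}
    for field in rec_a.keys() | rec_b.keys():
        a, b = rec_a.get(field), rec_b.get(field)
        merged[field] = a if a is not None else b
    return merged

newer = {"customer_id": 42, "name": "Pat Roe", "home_phone": None}
older = {"customer_id": 42, "name": "Pat Roe", "home_phone": "555-0199"}

merged = merge_duplicates(newer, older)
print(merged["home_phone"])  # 555-0199 -- it now looks as if Pat still has a home phone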

 

However, let’s say that I’m a local phone carrier, and the question I’m trying to answer is how many customers used to have a home phone and no longer do. What happens when I merge those two records? The record that included a home phone number wipes out the record that didn’t, making it look like the customer still has a home phone number when they might not. Hence, I might have deleted the answer to the very question I was after. This happens all the time with ETL “cleansing”: context is destroyed in the effort to remove “noise” from data.
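
By contrast, here is a sketch of the same question answered against the raw, un-merged history. The snapshot rows and the as_of field are hypothetical, but the point stands: the “redundant” records are precisely what makes the question answerable.

# Hypothetical snapshots of customers over time -- the kind of
# "redundant" records an ETL merge would collapse into one.

history = [
    {"customer_id": 42, "as_of": "2013-01-01", "home_phone": "555-0199"},
    {"customer_id": 42, "as_of": "2014-01-01", "home_phone": None},
    {"customer_id": 7,  "as_of": "2014-01-01", "home_phone": "555-0123"},
]

def dropped_home_phone(history):
    """Customers who once had a home phone and no longer do."""
    by_customer = {}
    for row in sorted(history, key=lambda r: r["as_of"]):
        by_customer.setdefault(row["customer_id"], []).append(row["home_phone"])
    return [cid for cid, phones in by_customer.items()
            if any(phones) and phones[-1] is None]

print(dropped_home_phone(history))  # [42] -- the answer lives in the "dirty" duplicate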

 

Well, “why” questions are necessarily noisy, because they are context questions. They aren’t asking about the obvious transactional data, but rather about what else is going on around the transaction. When you ask such questions, null fields, misspellings, and changes without obvious meaning start to leap out as potentially rich sources of context—sources of answers to “why.”
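
One way to act on this, sketched under the same hypothetical schema: rather than cleansing differences between record versions away, record them as events, so that a field going null becomes a data point instead of a defect.

# A sketch: turn the differences between two versions of a record
# into context events instead of "cleaning" them away.

def change_events(old, new):
    """Yield (field, old_value, new_value) for every field that changed."""
    for field in sorted(old.keys() | new.keys()):
        before, after = old.get(field), new.get(field)
        if before != after:
            yield field, before, after

v1 = {"customer_id": 42, "name": "Pat Roe",  "home_phone": "555-0199"}
v2 = {"customer_id": 42, "name": "Pat  Roe", "home_phone": None}

for field, before, after in change_events(v1, v2):
    print(field, repr(before), "->", repr(after))
# home_phone '555-0199' -> None   (a possible churn signal)
# name 'Pat Roe' -> 'Pat  Roe'    (a possible data-entry context clue)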

 

Don’t throw the signal baby out with the noisy bathwater

What I am driving at here is the need to fundamentally rethink how your organization approaches data analytics. Stop throwing the signal baby out with the noisy bathwater. Stop assuming that data that doesn’t fit certain preconceived notions is noise and has no value. The truth is that this noise might be incredibly valuable to your organization, if you ask the right questions.

 

Also keep in mind that transactional data is usually pretty “clean” by its nature, while unstructured data is not. In fact, the lack of structure in unstructured data makes it extremely dirty, and exceedingly rich in context. While it runs counter to nearly everything you may have been trained to do as a data analyst, I urge you to stop hitting the ‘delete’ button and start refocusing your analysis towards the noise—towards the dirty data at your disposal.

 

Finding the “why” in dirty data

In that dirty data—in that noise—could be the answers to “why” questions that can truly transform how your business operates. In that digital crud could be the insights that may completely change your understanding of your customers and your business. But that can only happen if you capture that noise, keep that noise, and understand that noise.

 

In our approach to data analytics, HP is emphasizing the merging of structured and unstructured data sources. In so doing, our customers are starting to ask, and answer, questions that were not ask-able before. This approach leads to new insights into how their businesses operate and how their customers think, which leads to a game-changing competitive advantage. If you’re not yet there, and if you’re still throwing out what may be the most valuable information in your organization, give us a call and let us show you the diamonds that might exist in your mountain of dirty data.

 

In my final installment of The New Top 10 Myths of Big Data, I’ll address the single most important factor in surviving and thriving in a Big Data world. 

 


#HPAHIB

 

Edited by Robin Hardy

 

About the Author
Chris Surdak is a Subject Matter Expert on Information Governance and eDiscovery for HP Autonomy.

