[This article was written by Bradley S. Fordham, Ph.D., and was posted by Zach Piester]
When it comes to analytics, in particular for product ideation and optimization, listening to what the data does not say is often as important as listening to what it does. There can be various types of “silences” in data that we must get past to take the right actions. Here I will focus on the most common.
Very large data sets frequently contain a proportionately small number of items that will not “parse” (that is, cannot be converted from raw data into meaningful observations) in the standard way. A common response is to ignore them on the assumption that there are too few to matter. The problem is that these items often fail to parse for similar reasons and are therefore related to one another. So, even though they may make up only 0.1% of the overall population, they form a coherent sub-population that could be telling us something if we took the time to fix the syntactic problems. Do not allow syntactically inconsistent data to be silent.
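One way to keep unparseable items from going silent is to count them by failure reason rather than discard them. The sketch below is only an illustration: the record layout (`VIN,sale_date,price`), the function names `parse_record` and `parse_all`, and the sample rows are all assumptions made up for this example, not anything taken from a real data set.

```python
from collections import Counter
from datetime import datetime


def parse_record(raw):
    """Parse a hypothetical 'VIN,sale_date,price' line.

    Raises ValueError with a short reason when the line does not parse.
    """
    fields = raw.split(",")
    if len(fields) != 3:
        raise ValueError("wrong field count")
    vin, sale_date, price = (f.strip() for f in fields)
    try:
        parsed_date = datetime.strptime(sale_date, "%Y-%m-%d")
    except ValueError:
        raise ValueError("bad date format") from None
    try:
        parsed_price = float(price)
    except ValueError:
        raise ValueError("bad price") from None
    return vin, parsed_date, parsed_price


def parse_all(raw_records):
    """Return the parsed records plus a tally of failures by reason."""
    parsed, failures = [], Counter()
    for raw in raw_records:
        try:
            parsed.append(parse_record(raw))
        except ValueError as err:
            failures[str(err)] += 1
    return parsed, failures


# Made-up sample rows: two share the same date problem.
sample = [
    "VIN001,2024-01-15,23500",
    "VIN002,01/20/2024,19900",
    "VIN003,02/02/2024,31000",
    "VIN004,2024-03-01,n/a",
]
parsed, failures = parse_all(sample)
print(failures.most_common())
```

If most failures share one reason, as the two date-format rows do here, that is a hint the rejected items form a related group (for example, one source system) and are worth repairing rather than dropping.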
In real data sets, we often find semantic discrepancies (differences in meaning) from one item to the next where we expect similarity. A common example is “omission values”. Some items may have a zero, some may have the special value NULL, some may have blanks, and some may have user-entered values such as “?” or “N/A”. Do these all mean the same thing for our analysis, or not? Another place semantic gaps often form is in the relationships between data items or records. For example, if we expect to see the VIN of a new car linked to the final assembly plant in which it was produced, then what does it mean if the pointer to that plant is invalid (that is, it contains one of these “omission values” or a value that simply doesn’t resolve to an existing plant in our data set)? Presumably all cars have to undergo final assembly in some plant, so the silence here is trying to tell us something that we should not ignore.
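The two kinds of semantic gap described above can be separated with a small check. This is a sketch under stated assumptions: the field names `vin` and `plant_id`, the set `OMISSION_VALUES`, and the sample data are hypothetical, and whether a zero should count as an omission is left as a decision for the analyst (it is deliberately not in the set here).

```python
# Hypothetical spellings that we decide to treat as "no value given".
# Zero is intentionally excluded: it may be a real value in some fields.
OMISSION_VALUES = {"", "?", "n/a", "null", "none"}


def is_omitted(value):
    """True when the value is None or one of the agreed omission spellings."""
    if value is None:
        return True
    return str(value).strip().lower() in OMISSION_VALUES


def classify_plant_links(cars, plants):
    """Sort each car's plant reference into ok / omitted / dangling."""
    result = {"ok": [], "omitted": [], "dangling": []}
    for car in cars:
        plant_id = car.get("plant_id")
        if is_omitted(plant_id):
            result["omitted"].append(car["vin"])
        elif plant_id not in plants:
            # A value is present but resolves to no known plant.
            result["dangling"].append(car["vin"])
        else:
            result["ok"].append(car["vin"])
    return result


# Made-up sample data for illustration only.
plants = {"P1": "Plant One", "P2": "Plant Two"}
cars = [
    {"vin": "VIN001", "plant_id": "P1"},
    {"vin": "VIN002", "plant_id": None},
    {"vin": "VIN003", "plant_id": "N/A"},
    {"vin": "VIN004", "plant_id": "P9"},
    {"vin": "VIN005", "plant_id": " "},
]
links = classify_plant_links(cars, plants)
print(links)
```

Keeping “omitted” and “dangling” apart matters because they tend to have different causes: the first suggests the value was never captured, the second that it was captured but points at something missing from our data.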
Assume we are given the data printed on car window stickers for vehicles actually sold in the first quarter of the year. We may find that 41% of them were blue while the next most prevalent color accounted for only 18% of the vehicles sold. We might conclude that customers bought more blue cars because they preferred blue. In drawing that conclusion, however, we have allowed all the other data to be silent. With a bit more thought, we realize that we only have sticker prices, so direct evidence of discounting is not available. However, this data set does have model year, so we are able to see that 70% of the blue cars sold were from the previous model year. It is likely they were discounted to clear them off the lots, thereby inflating the proportion of blue cars sold. So, maybe blue wasn’t so popular after all. Do not allow relevant data to be silent in drawing inferences.
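The reasoning above amounts to checking a headline share against a second attribute before trusting it. A minimal sketch follows; the helper `share_by`, the field names, and the twenty toy sales records are invented for illustration and do not reproduce the percentages quoted in the article.

```python
from collections import Counter


def share_by(records, key):
    """Fraction of records falling under each value of `key`."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def make(color, model_year, n):
    """Build n identical toy sales records."""
    return [{"color": color, "model_year": model_year} for _ in range(n)]


# Toy data: 2023 stands in for the previous model year, 2024 for the current one.
sales = (
    make("blue", 2023, 7) + make("blue", 2024, 3)
    + make("red", 2023, 1) + make("red", 2024, 4)
    + make("white", 2024, 5)
)

# Headline view: blue looks like the clear favorite.
color_share = share_by(sales, "color")

# Second look: how much of blue comes from the previous model year?
blue_sales = [r for r in sales if r["color"] == "blue"]
blue_year_share = share_by(blue_sales, "model_year")

# Same question restricted to current-model-year cars only.
current_sales = [r for r in sales if r["model_year"] == 2024]
current_color_share = share_by(current_sales, "color")

print(color_share)
print(blue_year_share)
print(current_color_share)
```

In this toy data blue leads overall, yet it is the least common color among current-model-year cars, which is the kind of reversal the model-year field can reveal.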
Just because we are using more sophisticated types of analyses does not mean that we are doing a better job of listening to all of our data appropriately. As an example, let’s consider a clustering analysis of the cars that were sold in the first quarter. We decide to use every attribute we have for these vehicles (price, color, type of interior…) and let our very sophisticated algorithms automatically decide how many significant clusters exist. Voilà! Four clusters are produced and we are instantly presented with these “meaningful” groups of vehicles sold. When we truly listen to more of the detailed data elements, however, we find that the clustering is really only highlighting that there were four major price points. The mid-size and luxury cars were spread across the top two clusters, and the compact and sub-compact cars across the bottom two, depending in each case on the options installed. Just because the analysis was more sophisticated did not mean it listened to the data better. Unacceptable silences can still exist, and in fact are often harder to find.
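A quick sanity check on clusters like these is to ask whether a single attribute reproduces them. The sketch below assumes the cluster labels have already been produced by whatever clustering method was used; the field names, the helper functions, and the eight toy records are hypothetical. It tests only one simple condition, namely that the clusters occupy non-overlapping price ranges.

```python
def price_range_by_cluster(records):
    """Map each cluster label to its (min_price, max_price)."""
    ranges = {}
    for r in records:
        lo, hi = ranges.get(r["cluster"], (r["price"], r["price"]))
        ranges[r["cluster"]] = (min(lo, r["price"]), max(hi, r["price"]))
    return ranges


def clusters_are_price_bands(ranges):
    """True when no two clusters overlap in price, so price alone separates them."""
    ordered = sorted(ranges.values())
    return all(
        prev_hi < next_lo
        for (_, prev_hi), (next_lo, _) in zip(ordered, ordered[1:])
    )


def segments_by_cluster(records):
    """Which vehicle segments appear in each cluster."""
    segments = {}
    for r in records:
        segments.setdefault(r["cluster"], set()).add(r["segment"])
    return segments


# Toy records with cluster labels assumed to come from an earlier analysis.
cars = [
    {"segment": "sub-compact", "price": 14000, "cluster": 0},
    {"segment": "compact", "price": 15000, "cluster": 0},
    {"segment": "sub-compact", "price": 18500, "cluster": 1},
    {"segment": "compact", "price": 19000, "cluster": 1},
    {"segment": "mid-size", "price": 28000, "cluster": 2},
    {"segment": "luxury", "price": 30000, "cluster": 2},
    {"segment": "mid-size", "price": 36000, "cluster": 3},
    {"segment": "luxury", "price": 42000, "cluster": 3},
]

ranges = price_range_by_cluster(cars)
only_price = clusters_are_price_bands(ranges)
segments = segments_by_cluster(cars)
print(ranges)
print(only_price)
print(segments)
```

When the check returns true and each cluster mixes several segments, as in this toy data, the “meaningful” groups are probably just price bands, and the other attributes have contributed little to the result.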
At first glance, knowing everything on the window sticker of every car sold in the first quarter seems to provide a great set of data for understanding what customers wanted and therefore were buying. At least it did until we got a sinking feeling in our stomachs because we realized that this data only covers what the auto manufacturer actually built. That field of view is too limited to answer the important questions being asked about customer desire and motivation. We need to break the silence around all the things customers wanted that were not built.
In summary, we need to be careful to listen to all the relevant data, especially the data that is silent within our current analyses. Applying that discipline will help avoid many costly mistakes that companies make by taking the wrong actions from data even with the best of techniques and intentions.
The (ART+DATA) Institute