[NEW] Data Centrifuge: Cleaner Data for Reliable Insights

Last week at IIeX, aytm’s CEO, Lev Mazin, and our Director of Product Strategy, Dale Gilliam, unveiled a project that we’ve been secretly working on for months. 

In the age of click farms and survey bots, the consumer insights industry is plagued by a growing number of online fraud cases, which compromise the quality and integrity of respondent data.

We get it. Data quality is a scary topic. But you no longer need to tackle it alone. To combat junk data, we’re developing a next-generation, response-level data quality engine, called Data Centrifuge.

The problem with current data cleansing methods 

For a human analyst, data cleaning is more of an art than a science. You may start with some prior assumptions and rules, but once you get into the data, you find that you must adapt to what you’re seeing. While you’re in it, the challenge becomes knowing how much cleaning is enough. 

Dale describes it as “like cleaning a cast-iron skillet.” He explains, “Obviously you want to get the food particles out of there, but trust me when I say from experience, you can over-clean a cast-iron skillet and totally ruin it.” Likewise, over-cleaning your data can create bias if you unintentionally clean out some good respondents.

When cleaning data manually, you typically start with around three flags, or more if you had the foresight to set up some trap questions, and you remove anyone who fails more than one. That part is simple enough. Then, once you get in, you start looking at straight-liners and open-ends for obvious bad behavior.
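To make the basic rule concrete, here is a minimal Python sketch of "remove anyone who fails more than one flag." The respondent IDs, the flag data, and the function name `respondents_to_remove` are all hypothetical and only illustrate the manual approach described above. They are not part of Data Centrifuge.

```python
from typing import Dict, List


def respondents_to_remove(
    flags: Dict[str, List[bool]], max_failures: int = 1
) -> List[str]:
    """Return IDs of respondents who failed more than `max_failures` checks."""
    return [rid for rid, failed in flags.items() if sum(failed) > max_failures]


# Hypothetical data: three quality flags per respondent (True = failed).
flags = {
    "r1": [False, False, False],  # clean
    "r2": [True, False, False],   # one failure: kept under this rule
    "r3": [True, True, False],    # two failures: removed
}

removed = respondents_to_remove(flags)
print(removed)
```

The rule is easy to apply when every respondent has the same three flags. It gets harder to defend once the number of flags grows and the cutoff is no longer obvious.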

But if you have five grid-type questions, for example, how do you count those, and how do you create flags? Do you add one flag for every grid question a respondent straight-lines? The same goes for open ends. Depending on how respondents answer those questions, you could wind up with 12-15 flags, and now you don’t know where to draw the line.

Some of those flags may even vary in importance. Determining how to read the flags and clean out the obvious bad actors and inattentive respondents can be likened to reading tea leaves, as opposed to following a scientific method. 

As an analyst, you’re trading off three things. First, you’re trying to make sure you’ve cleaned out the obvious bad actors. Second, you’re trying to make sure you haven’t over-cleaned the data. And third, you’re weighing the labor cost and time of continuing to search for that next bad behavior.

At some point, you have to call it. You’ve done your due diligence, but you never really know if you’ve done the job well enough. That’s the challenge we’re trying to solve with Data Centrifuge. 

Data Centrifuge Demo at IIeX 2020

Panel data quality vs. response data quality

At aytm, we take panel data quality seriously. We built our panel, PaidViewpoint.com, from the ground up. Our 5th consecutive ranking as the #1 User-Rated USA Survey Platform by the independent review site SurveyPolice.com is a testament to the care we take to ensure our panelists feel appreciated and have a great survey experience. This, along with our extensive quality verifications and well-designed user interface, leads to high-quality panel data. In fact, fewer than 1% of respondents are rejected by our clients and internal admins.

But, while panel data quality is critical, that’s not what we’re talking about here. The problem is that none of us can solve the fielding problem by ourselves. We have panel partners, and you have your suppliers, but we don’t have a direct relationship with the respondents on the other end of that transaction. So how do we tell if the quality of their responses is good or bad? 

We want to draw a line between panel quality and response quality. Data Centrifuge focuses on the raw data we have purely in survey responses and what we can do to ensure that it’s of the highest quality. 

What is Data Centrifuge? 

At its core, Data Centrifuge utilizes a spectrum of vectors that analyze respondent behavior from various independent angles, automatically identifying factors that affect the integrity of survey responses while recognizing the good completes. 

Text-based data goes through leading Natural Language Processing (NLP) methods, image responses must pass computer vision methods, and categorical data is modeled with Bayesian network learning algorithms.

Extending beyond mindless AI, we’ve augmented Data Centrifuge with context-aware statistical models that draw on external expert knowledge in an unsupervised fashion to produce unmatched accuracy and reliability.

Data → Cleansed 

The aytm automated data cleaning engine features sophisticated algorithms that analyze patterns and spot anomalies in the data, some of which would be invisible to human analysts. Numerical statistics then indicate how strong or weak each identified inconsistency is for each respondent.

Data Centrifuge is driven by a growing number of vectors (independent, automatic approaches, techniques, and statistical models) that work together to separate convincing respondents from questionable ones. It identifies and removes:

●     Speeders

●     Straight-liners

●     Duplicate responses

●     Random responses

●     Exotic responses

●     Bad and meaningless open-ended responses

●     Inadequate image responses

●     Interrupted sessions

●     and much more 
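As a rough illustration of two items on this list, the Python sketch below flags speeders and duplicate responses. It is an assumption-laden example, not a description of how Data Centrifuge works. The cutoff of one third of the median completion time, the exact-match definition of a duplicate, and the names `flag_speeders` and `flag_duplicates` are all hypothetical.

```python
from collections import Counter
from statistics import median


def flag_speeders(durations, fraction=0.33):
    """Flag respondents faster than `fraction` of the median duration.

    The one-third-of-median cutoff is an assumed rule of thumb.
    """
    cutoff = median(durations.values()) * fraction
    return {rid for rid, secs in durations.items() if secs < cutoff}


def flag_duplicates(answers):
    """Flag respondents whose full answer vector matches someone else's."""
    counts = Counter(tuple(a) for a in answers.values())
    return {rid for rid, a in answers.items() if counts[tuple(a)] > 1}


# Hypothetical data: completion time in seconds and answers per respondent.
durations = {"r1": 300, "r2": 280, "r3": 60, "r4": 320}
answers = {
    "r1": [1, 4, 2],
    "r2": [3, 3, 5],
    "r3": [1, 4, 2],  # identical to r1
    "r4": [2, 5, 1],
}

speeders = flag_speeders(durations)
duplicates = flag_duplicates(answers)
print(sorted(speeders), sorted(duplicates))
```

On a real survey, identical answers to a handful of closed-ended questions can easily occur by chance, so an exact-match check like this would need a much longer answer vector or extra signals before anyone is removed.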

Data Centrifuge highlights issues that a researcher wouldn’t be able to discover in any reasonable time frame. For example, when analyzing straight-lining, it looks for other patterns (like zig-zagging) that could only be spotted after de-rotating/unrandomizing the data — which would be a pain for a human analyst to perform manually. 
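The sketch below shows why unrandomizing matters for pattern checks. It assumes the export stores grid answers in a fixed, canonical row order while each respondent saw the rows in a randomized order. The row names, the data, and the helpers `as_seen`, `is_straight_line`, and `is_zigzag` are hypothetical, and the zig-zag test only covers the simplest alternating pattern.

```python
def as_seen(answers_by_row, display_order):
    """Rebuild the answer sequence in the order the respondent saw the rows."""
    return [answers_by_row[row] for row in display_order]


def is_straight_line(seq):
    """True when every row in the grid received the same answer."""
    return len(seq) > 1 and len(set(seq)) == 1


def is_zigzag(seq):
    """True when answers alternate between two values: a, b, a, b, ..."""
    return (
        len(seq) >= 4
        and seq[0] != seq[1]
        and all(seq[i] == seq[i % 2] for i in range(len(seq)))
    )


# Hypothetical respondent: answers stored in canonical row order A..F.
answers_by_row = {"A": 5, "B": 5, "C": 1, "D": 1, "E": 5, "F": 1}
# Order in which this respondent actually saw the rows.
display_order = ["C", "A", "F", "B", "D", "E"]

stored = list(answers_by_row.values())          # [5, 5, 1, 1, 5, 1]
seen = as_seen(answers_by_row, display_order)   # [1, 5, 1, 5, 1, 5]

print(is_zigzag(stored), is_zigzag(seen))
```

In the stored order the answers look unremarkable, but in the order the respondent saw them they alternate perfectly. Doing this remapping by hand for every respondent and every grid is the tedious part.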

Time → Saved 

Spend less time manually combing through panel responses to weed out bad apples and more time uncovering actionable insights that can drive the business forward.

Confidence → Restored 

Fraudulent panel data leads to misguided recommendations. Rest assured, knowing that your most important business decisions are being informed by the highest quality panel data available. 

An Algorithm that Learns 

While it is a highly advanced and formidable defense, Data Centrifuge is not a silver bullet. It’s designed to be an ever-learning mechanism that evolves and improves over time as human ingenuity on both sides, respondents’ and researchers’, exposes it to new ways of looking at the problem.

Let’s tackle this together 

So, while it can be scary, it’s important that we all have the courage to take a flashlight and look at the monster hiding under the bed. Take a critical look at your raw data to make sure it meets the highest quality standards. 

Response-level data quality is an issue of such magnitude that we can’t solve it alone. We need your support. 

So please join the conversation inside the Insighter Community. What are your common pain points around manual data cleaning? We’d love to hear from you.