Careful selection of an online survey sample provider is critical, as it can impact data quality, data integrity, and, ultimately, study results. If respondents fail data quality checks and all responses are included anyway, the results can be adversely affected and far less accurate than those produced by a sample selection process backed by rigorous checks.
Ideally, data integrity should be addressed even before the data is collected, and transparent and objective evaluation of the panel and all panel partners is the first step in building industry-wide criteria aimed at ensuring that survey data collected can be trusted.
Aytm’s Principal Statistician, Ivan Konanykhin, and I recently conducted an exploratory study comparing the data quality of US online panels, measuring both the number of respondents who failed data quality checks and the extent to which their inclusion in the survey data affects the study results. In this article, I’ll reveal the results of the data quality evaluation across fourteen online sample sources.
Hypotheses
We had two hypotheses as part of our study into panel data quality: first, that panel providers differ in data quality; and second, that if no quality control measures are undertaken, results are significantly affected. We were also interested in the different factors that correlate with poor data quality. Our social media study (conducted between March 2020 and February 2021) compared the data quality of 14 non-probability sample-based US online panels, measuring both the number of respondents who failed quality checks and the extent to which their inclusion affected the survey results. Each panel had between 527 and 681 survey completes, for a total of 7,391 respondents.
Experiment
We performed quality checks according to several parameters (a rough scoring sketch follows this list):
- Whether respondents sped through the survey (answered in under 3 minutes and 38 seconds)
- Straight-lining (giving identical or nearly identical answers to every question, or selecting the same scale response for every item in a grid-format question)
- Consistency of survey responses in questions related to:
- Age
- Number of children
- Active social media use
- Open-ended verbatim checks to ensure respondents paid attention to the question's wording
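As a rough illustration of how checks like these can be scored per respondent, here is a minimal Python sketch. The speeding cutoff mirrors the one above, but the DataFrame column names (duration_sec, grid_answers, the screener/profile pairs, open_end_flagged) are hypothetical stand-ins, not the actual survey fields we used.

```python
import pandas as pd

SPEED_CUTOFF_SECONDS = 3 * 60 + 38  # under 3 minutes 38 seconds counts as speeding


def qc_failures(row: pd.Series) -> int:
    """Count how many quality checks a single respondent fails.
    All column names are hypothetical placeholders for the real survey fields."""
    failures = 0

    # Speeding check
    if row["duration_sec"] < SPEED_CUTOFF_SECONDS:
        failures += 1

    # Straight-lining: every grid item received the same scale response
    grid = row["grid_answers"]  # e.g. a list of Likert responses
    if len(set(grid)) == 1:
        failures += 1

    # Consistency checks: screener answers should match profile answers
    if abs(row["age_screener"] - row["age_profile"]) > 1:
        failures += 1
    if row["children_screener"] != row["children_profile"]:
        failures += 1
    if row["social_media_screener"] != row["social_media_profile"]:
        failures += 1

    # Open-ended check: verbatim flagged upstream as too short or off-topic
    if row["open_end_flagged"]:
        failures += 1

    return failures


# df["qc_failures"] = df.apply(qc_failures, axis=1)
```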
Methodology
Hypothesis 1 - We regressed the number of failed quality control checks against the panel indicator, controlling for demographics. The five demographic variables most correlated with the data were chosen as covariates to eliminate the potential effect of respondents’ backgrounds on the outcome variable; pairwise interactions were also considered. This allowed us to isolate the effect of the panel on the quality of the data.
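To make the setup concrete, here is a minimal sketch using statsmodels. The OLS link, the column names, and the five covariates shown are illustrative assumptions rather than the exact model we fit; `df` is assumed to hold one row per respondent with the QC failure count computed as above.

```python
import statsmodels.formula.api as smf

# Regress the number of failed QC checks on the panel indicator,
# controlling for demographic covariates (placeholders here) and
# their pairwise interactions, which the (...)**2 term expands to.
model = smf.ols(
    "qc_failures ~ C(panel) + (age + gender + income + education + region) ** 2",
    data=df,
).fit()
print(model.summary())
```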
Hypothesis 2 - We conducted multiple regressions (one per survey variable) to determine whether each outcome changes depending on the number of failed quality control checks. A conservative Bonferroni multiple-comparisons correction was applied to ensure that any surplus of significant results we detected was truly unexpected. Coefficients and p-values from each regression were extracted and analyzed in an aggregate summary.
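A hedged sketch of these per-variable regressions with a Bonferroni-corrected threshold might look like the following; the outcome column names are hypothetical, and the study examined 189 variables rather than the two shown.

```python
import statsmodels.formula.api as smf

# Hypothetical outcome columns standing in for the survey variables examined.
survey_vars = ["q1_1_1", "q1_1_2"]
alpha = 0.05 / len(survey_vars)  # Bonferroni-corrected significance threshold

results = []
for var in survey_vars:
    fit = smf.ols(f"{var} ~ qc_failures", data=df).fit()
    results.append({
        "variable": var,
        "coef": fit.params["qc_failures"],
        "pvalue": fit.pvalues["qc_failures"],
        "significant": fit.pvalues["qc_failures"] < alpha,
    })
```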
For research question 1, we conducted stepwise regression to determine which demographic factors increased or decreased data quality. We considered 11 standard demographic questions and their pairwise interactions, and regressed the quality-control score against the variables selected by the stepwise algorithm using the BIC criterion.
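statsmodels has no built-in stepwise procedure, so the sketch below uses a simple forward-selection loop that adds whichever candidate term lowers BIC the most. The candidate terms are hypothetical placeholders for the 11 demographic questions and their interactions, and this is only one way to implement the selection we describe.

```python
import statsmodels.formula.api as smf

# Hypothetical demographic terms; pairwise interactions are written as "a:b".
candidates = ["age", "gender", "income", "education", "region",
              "age:gender", "income:education"]

selected, best_bic = [], float("inf")
improved = True
while improved and candidates:
    improved = False
    # Try adding each remaining term and keep the one that lowers BIC the most.
    trial_bics = {}
    for term in candidates:
        formula = "qc_failures ~ " + " + ".join(selected + [term])
        trial_bics[term] = smf.ols(formula, data=df).fit().bic
    best_term = min(trial_bics, key=trial_bics.get)
    if trial_bics[best_term] < best_bic:
        best_bic = trial_bics[best_term]
        selected.append(best_term)
        candidates.remove(best_term)
        improved = True

print("Selected terms:", selected)
```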
Results
One-third of the respondents passed all of the data quality checks. On average, panelists failed two data quality checks, and half of the respondents failed at most one check. Most discrepancies were observed on quality checks related to social media use and questions about children; age discrepancies and straight-lining were in the minority.
It’s apparent from the data that panels differ in their data quality. Two panels were significantly better than the average, and two were significantly worse. On average, a sample from a better provider incurred half a QC failure point to a full failure point less than the alternatives.
For the second part of the study, we checked 189 variables and found that 76% of them show a strong, significant correlation with the QC variable. To demonstrate the effect, consider one variable in particular (Question 1, sub-question 1, answer 1): respondents who failed our quality checks were more likely to agree that they are knowledgeable about how third parties and companies sell and/or use their personal data. Since we would expect few people to be genuinely knowledgeable about such matters, the finding matches our expectation.
For each additional failed QC check, agreement with the statement increased by 0.31 points, which translates to roughly 7% per failed check.
We find that younger respondents, students working in real estate, and young hospitality workers can be trusted less, while female respondents, older respondents, Asian American IT specialists, and retired IT specialists can be trusted more. Education levels that include a professional degree also seem to decrease the likelihood of bad data.
Our meta-research (research performed on research itself) was devoted to the frequently debated issue of online sample data quality in market and survey research. We found evidence that data quality varies strongly between different sources of online samples, and we assume that much of this variation results from differences between the online panels in recruitment sources and practices, as well as in how respondents are treated and incentivized.
Our review of the data also highlights the fact that survey results will be affected if no quality control measures are undertaken (and the lower the sample quality, the more severe such an effect will be). Lastly, we identified demographic variables that also correlate with lower sample quality and, ultimately, data quality. We hope our research can help researchers make informed decisions when it comes to the choice of online sample provider and data cleaning for online surveys.
Panel Data Cleaning Solutions
This research-on-research study of panel integrity brought another need to light: automated data cleaning. Aytm has come up with a solution to this problem: Data Centrifuge. Data Centrifuge uses a number of automated vectors to separate convincing survey respondents from questionable ones by analyzing data anomalies and patterns. We hope it will set a gold standard for survey quality across the industry, making data quality quantifiable and more repeatable.