Last week, we reviewed results from a study analyzing data quality across different Mechanical Turk populations. We measured quality using the Roundtable Alias API on an open-ended question in the survey and detected a large amount of fraud. Can we validate these findings?

To do so, we added a set of questions that acted as a ‘ground-truth bot detector.’ For these questions, participants were asked to estimate the average high temperature in Boston for every month (our population was American). These questions are powerful differentiators because (1) responses are continuous and (2) humans’ knowledge about climate is imperfect but highly structured - Americans know that July is hotter than January, but are less sure whether the average July high temperature in Boston is 82°F, 84°F, or 86°F (the true answer is 82°F). A perfect set of answers is thus highly likely to have been generated by a GPT agent, whereas responses that don’t follow basic seasonal temperature trends suggest low-quality bot or ‘junk’ responses.
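To make this concrete, here is a minimal sketch of how such a knowledge check could be scored, assuming each participant supplies twelve monthly estimates. The `knowledge_flag` helper and its thresholds are illustrative, not the exact logic used in the study, and the Boston temperatures are approximate climate normals.

```python
import numpy as np

# Approximate average monthly high temperatures in Boston (°F), Jan-Dec.
TRUE_HIGHS = np.array([36, 39, 45, 56, 66, 76, 82, 80, 72, 61, 51, 41], dtype=float)

def knowledge_flag(responses, perfect_mae=2.0, min_seasonal_corr=0.6):
    """Classify one participant's twelve monthly estimates (illustrative thresholds)."""
    responses = np.asarray(responses, dtype=float)
    mae = np.mean(np.abs(responses - TRUE_HIGHS))      # overall accuracy
    corr = np.corrcoef(responses, TRUE_HIGHS)[0, 1]    # seasonal structure

    if mae <= perfect_mae:
        return "likely GPT agent"   # suspiciously close to ground truth
    if corr < min_seasonal_corr:
        return "likely junk bot"    # no coherent seasonal pattern
    return "likely human"           # structured but imperfect knowledge

# A human-like profile: right seasonal shape, off by a few degrees each month.
print(knowledge_flag([40, 42, 50, 60, 70, 80, 86, 84, 75, 64, 52, 44]))  # likely human
```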

As an aside, part of the intellectual joy of ‘survey fraud and bot detection’ (which, understandably, is not often the most joyous of phrases) is that it is essentially a modern-day Turing Test in the real world and at scale. Places like Kantar have seen a huge bot problem, with artificial agents able to programmatically answer surveys and collect payments. It is increasingly important to identify which data is generated by humans, not only for market research but also for data labeling and training in AI.

The plot below shows participants’ response profiles on the temperature questions. Each curve shows the predictions made by one participant, with the dark curve showing the true high temperatures in Boston. The left plot shows human participants, whose predictions follow seasonal patterns but with some systematic error. On the right are bots, both low-quality and GPT-based. GPT-based agents achieve near-perfect accuracy, whereas low-quality bots’ predictions don’t follow basic seasonal patterns and are highly inaccurate.
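If you want to reproduce this kind of figure, the sketch below draws comparable response profiles. The human and bot response matrices here are simulated placeholders, not the study data; they serve only to illustrate the plotting approach.

```python
import numpy as np
import matplotlib.pyplot as plt

MONTHS = np.arange(1, 13)
TRUE_HIGHS = np.array([36, 39, 45, 56, 66, 76, 82, 80, 72, 61, 51, 41], dtype=float)

# Placeholder profiles (participants x 12 months), standing in for survey data:
# humans track the seasonal curve with noise, junk bots show no structure.
human_responses = TRUE_HIGHS + np.random.normal(0, 5, size=(20, 12))
bot_responses = np.random.uniform(20, 90, size=(20, 12))

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, responses, title in [(axes[0], human_responses, "Humans"),
                             (axes[1], bot_responses, "Bots")]:
    for profile in responses:
        ax.plot(MONTHS, profile, color="lightgray", alpha=0.7)
    ax.plot(MONTHS, TRUE_HIGHS, color="black", linewidth=2, label="True Boston highs")
    ax.set_title(title)
    ax.set_xlabel("Month")
    ax.legend()
axes[0].set_ylabel("Predicted high temperature (°F)")
plt.tight_layout()
plt.show()
```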

How well does the Alias API do at identifying participants flagged by this knowledge check? Below, we group the data into a 2x2 grid based on whether each response was flagged by Alias (which only analyzes open-ends) and whether it was flagged by our temperature check (the knowledge flag).
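The grid itself is just a cross-tabulation of the two flags. Here is a minimal sketch assuming per-participant boolean flags; the data frame below is toy data, not the study results.

```python
import pandas as pd

# Hypothetical flags for six participants; in practice, alias_flag comes from the
# Alias API and knowledge_flag from the temperature check described above.
df = pd.DataFrame({
    "alias_flag":     [True, True, False, False, True, False],
    "knowledge_flag": [True, True, True,  False, False, False],
})

# 2x2 grid: rows = knowledge (ground-truth) flag, columns = Alias flag.
print(pd.crosstab(df["knowledge_flag"], df["alias_flag"]))

# Treating the knowledge check as ground truth, score the Alias API.
detection_rate = df.loc[df["knowledge_flag"], "alias_flag"].mean()
false_positive_rate = df.loc[~df["knowledge_flag"], "alias_flag"].mean()
print(f"Detection rate: {detection_rate:.0%}, false-positive rate: {false_positive_rate:.0%}")
```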

Using just a single open-ended question, the Alias API detected over 85% of bots with a low false-positive rate. Researchers have the option of including up to six open-ended questions, which increases the precision of bot detection and reduces both false negatives and false positives. However, even in this simple case the Alias API effectively separates participants.

Bots are systematic in their behavior, but also ever-evolving. It doesn’t make sense for researchers to rewrite their internal bot-detection algorithms every few months to keep up. Rather, they should be able to trust a service so they can spend their time generating high-quality insights. At Roundtable, we’re continuously updating our API so researchers can cut through the noise and find what’s real.