Fraudulent Responses on Amazon Mechanical Turk: A Fresh Cautionary Tale
A bit more than two years ago, I wrote a post laying out some of the evidence I had seen showing that people were submitting fraudulent responses on MTurk. Since that time, at least two research teams have conducted far more systematic investigations than what I originally did. This paper has been published at Political Science Research and Methods. I suspect that this one is not far behind.
These papers arrive at basically the same conclusion. In brief: There is fraudulent responding on MTurk. In large part, it comes from non-U.S. respondents who use VPNs to circumvent location requirements. These respondents are mostly clicking through surveys as fast as possible, probably paying hardly any attention to anything. As a result, they degrade data quality considerably, attenuating treatment effects by perhaps 10-30%. These papers also give some good news: you can use free web services to identify and filter out these responses on the basis of suspicious IP addresses, either at the recruitment phase or after you have fielded your study.
The current situation
Those are valuable contributions and the authors have done the research community a service. At the same time, I'm writing this post because I saw some things that make me think they paint an incomplete picture of what is going on, and that we have more work to do. I'll get to the details in a second, but let me put the essentials up front:
- Is there still fraud on MTurk?: Yes.
- Can you use attention-check questions to screen these responses out?: I wouldn't count on it. Most attention checks are closed-ended, and I see suspicious responders who appear to be quite good at passing closed-ended attention checks.
- Does screening out suspicious IP addresses solve the problem?: It probably helps, but much of the problem remains.
- Is there anything else that can be done?: I think looking at responses to open-ended questions is a promising avenue for detecting fraud.
A fresh study
Ok, so let's get into it. Yanna Krupnikov and I ran a study last week. This is actually the same project that led to my post two years ago, wherein we got majorly burned by fraudulent responding. So we were quite careful this time. We included a bunch of closed-ended attention check questions. We also included one open-ended attention check (of sorts), and an open-ended debriefing question. And, we fielded the study via CloudResearch (a.k.a. TurkPrime), which is a platform that facilitates MTurk studies. CloudResearch is quite aware of MTurk data integrity problems, and has been at the forefront of combating them (e.g. here). We used the following CloudResearch settings:
- We blocked duplicate IP addresses.
- We blocked IPs with suspicious geocode locations.
- We used the option to verify worker country location.
- We blocked participants that CloudResearch has flagged as being low-quality. This is the default setting on CloudResearch, though note that CloudResearch also offers a more restrictive option, which is to only recruit participants who have been actively vetted and approved by them. (See here for more details.) I cannot say how things might be different if we had chosen that option. [Edit 8/30/21: My subsequent experiences with pre-approved respondents have been very positive. They appear to perform very well on attention checks, and do not show the signs of fraud described herein. See here for a working paper by CR researchers providing a more detailed look at their pre-approved list.]
- We did not set a minimum HIT approval rate. When you do so, CloudResearch limits recruitment to MTurkers who have completed at least 100 HITs in the past. So, we were concerned that setting such a minimum would limit recruitment to "professional respondents," which could bear on our results. I do not think this makes too big a difference as concerns fraud (Turkers tend to have high approval ratings, and I have previously seen fraudulent responding even with the approval rating set above 95%), but I cannot be certain. We have another study coming up and will hopefully be able to say more then.
An aside on CloudResearch: I was really impressed with the functionality of their platform and the work they have done to ensure data integrity. This post should not be construed as a slight against their efforts. As I say, it might be the case that the more restrictive option I mention above solves all the problems I describe here, albeit at the cost of a smaller participant pool. (Someone should try it and let us know!) But plenty of people do work on MTurk not through CloudResearch, so these issues merit discussion in any event.
Let's get to the open-ended attention check I mention above. Our motivation in including this was that fraudulent respondents appeared to have trouble with these in the past (see my old post). We asked them, "For you, what is the most important meal of the day, and why? Please write one sentence." We didn't really care what these respondents found to be the most important meal of the day. We just wanted a not-too-onerous way to confirm that they could read an English sentence and write something coherent in response.
More often than not, they could. The vast majority of our respondents wrote sensible things, like "I think breakfast is the most important meal of the day as it gives a fresh start to a new day" or "Dinner is the most important meal of the day because I need to be full to be able to sleep afterward."
Very nice and good
But several did not. Skimming through the responses, I spotted 28 that looked suspicious to me. Our survey had 300 responses, so this is nearly 10%. The suspicious responses appear below. The second column you can see here is the open-ended debriefing question. I'll refer to these 28 rows in the dataset as the "suspicious responders."
Some remarks. First, these look very much like the sort of thing we were contending with two years ago. In particular, various permutations of the words "good" and "nice" in the debriefing field are very common. We also see some people who misunderstood the question. ("Sunday, because it is a day of rest.")
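For researchers who want to triage open-ended answers like these at scale, here is a minimal sketch of the kind of screen one could run. It is not the procedure we used (I flagged responses by eye), the column names (meal_answer, debrief) and the filler-word pattern are assumptions you would adapt to your own data, and anything it flags should still be reviewed manually.
```python
import re
import pandas as pd

# Hypothetical export with one row per respondent; adjust column names to your survey.
df = pd.read_csv("survey_responses.csv")

# Answers consisting only of filler words like "good", "nice", "very nice and good".
FILLER = re.compile(
    r"^\s*(very\s+)?(good|nice|ok|okay)"
    r"([\s.,!]+(and\s+)?(very\s+)?(good|nice|ok|okay))*[\s.!]*$",
    re.IGNORECASE,
)

def looks_suspicious(text) -> bool:
    """Flag blank, extremely short, or filler-only answers for manual review."""
    if not isinstance(text, str) or len(text.strip()) < 15:
        return True
    return bool(FILLER.match(text.strip()))

df["flag_open_ended"] = df["meal_answer"].apply(looks_suspicious) | df["debrief"].apply(looks_suspicious)
print(df["flag_open_ended"].sum(), "responses flagged for manual review")
```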
The responses on rows 14 and 21 merit particular attention. They are exactly the same (and plagiarized from this website). Moreover, these two responses were entered contemporaneously, as if the same person had our survey instrument open in two separate windows and was filling out the survey twice, simultaneously.
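That kind of duplication is also easy to check for mechanically. Here is a sketch, assuming a hypothetical export with the open-ended text in a meal_answer column and a start_time timestamp; it looks for identical answers submitted within a few minutes of each other.
```python
import pandas as pd

# Hypothetical columns: "meal_answer" (open-ended text) and "start_time" (when the survey was opened).
df = pd.read_csv("survey_responses.csv", parse_dates=["start_time"])

# Keep only rows whose open-ended answer appears more than once.
dupes = (
    df[df["meal_answer"].notna() & df.duplicated("meal_answer", keep=False)]
    .sort_values(["meal_answer", "start_time"])
)

# Within each group of identical answers, report any pair started within ten minutes of each other.
for text, grp in dupes.groupby("meal_answer"):
    gaps = grp["start_time"].diff().dt.total_seconds().dropna()
    if (gaps < 600).any():
        print(f"Near-simultaneous duplicates ({len(grp)} rows): {text[:60]!r}")
```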
Traditional attention checks
How did the people I flagged as having suspicious open-endeds do on our closed-ended attention check questions? They did quite well:
- We presented participants with a picture of an eggplant, and asked them to identify it, from the options "eggplant," "aubergine," "squash," "brinjal." (This question has been used in past work to identify people who are probably not in the U.S. Eggplants are called aubergine in the U.K. and elsewhere, and brinjal is the Indian word for eggplant.) All 28 of the suspicious responders chose the correct option (eggplant).
- We asked them, "For our research, careful attention to survey questions is critical! To show that you are paying attention please select 'I have a question.'" The response options were "I understand," "I do not understand," and "I have a question." 27 suspicious responders chose the correct answer, and one skipped the question.
- We asked respondents, "People are very busy these days and many do not have time to follow what goes on in the government. We are testing whether people read questions. To prove that you've read this much, answer both 'extremely interested' and 'very interested.'" There were five response options, ranging from "not interested at all" to "extremely interested." 26 suspicious responders passed this check, choosing the two correct options.
- We showed respondents a calendar with the date of October 11, 2019 circled, and asked them to write that date. (Technically I suppose this is an open-ended question, though a more restrictive one than the "favorite meal" question.) This question comes from past work confirming that non-U.S. participants tend to swap the month and date, relative to norms in the U.S. So they'd write "11/10/19" rather than "10/11/19". 17 suspicious responders passed this question, which is 60.7%. In comparison, 91% of non-suspicious responders passed. Of the 11 suspicious responders who did not pass, 6 wrote "11/10/19" and 5 wrote various other seemingly-arbitrary dates. (A sketch of how one might score this item automatically appears below.)
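As promised in the last bullet, here is a rough sketch of scoring the calendar item automatically. Answer formats vary, so treat it as a first pass; the function and the column it would run over are hypothetical.
```python
import re

def score_calendar(answer) -> str:
    """Classify a written date for the circled October 11, 2019 calendar item."""
    if not isinstance(answer, str):
        return "other"
    m = re.search(r"(\d{1,2})\s*[/\-.]\s*(\d{1,2})\s*[/\-.]\s*(\d{2,4})", answer)
    if not m:
        return "other"
    first, second, year = (int(g) for g in m.groups())
    if year not in (19, 2019):
        return "other"
    if (first, second) == (10, 11):
        return "pass"      # "10/11/19": U.S. month/day ordering
    if (first, second) == (11, 10):
        return "swapped"   # "11/10/19": day/month ordering common outside the U.S.
    return "other"

print(score_calendar("10/11/19"), score_calendar("11/10/2019"), score_calendar("12/25/19"))
# -> pass swapped other
```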
We can look at one more thing. CloudResearch asks MTurk workers demographic questions after research studies, and it keeps track of whether respondents answer a question about gender in a consistent fashion. (Repeatedly-changing gender responses would be adduced as evidence that a person is answering randomly.)
For the respondents in our study, 167 (55%) had perfect consistency. 131 (44%) had missing values on this measure. (I assume people new to CloudResearch?) Only 3 (1%) had other scores: 90% consistency, 80% consistency, or 63.64% consistency. We can't link these consistency scores to individual responses, so I don't know whether the suspicious responders are the people with missing consistency scores.
Using IP Addresses to Detect Fraud
How about suspicious IPs? As I note above, the recent work on this issue recommends checking respondent IP addresses against public databases of suspicious IPs. We used the CloudResearch option to block as many of these as possible before they entered the survey, but perhaps we could find others ex post.
I used four separate IP checkers to assess whether these responses are valid. Here is what they showed:
- IPHub, which is recommended by the Kennedy et al. PSRM piece as being the best service, flagged 1 of the 28 responses as suspicious. (It flagged 8 respondents with seemingly-fine open-ended responses as suspicious, but that's another story.)
- IPIntel flagged 5 out of 28 as being suspicious.
- Proxycheck.io flagged 1 out of 28 as being suspicious (the same one as IPHub).
- IPVoid flagged 5 out of 28 as being questionable, and an additional two as being suspicious.
So, the IP checkers identified some questionable responses, but they would let most of them through. Consider this: if we went to extreme lengths and ran all four IP checkers, excluding a response if it is flagged by any of the four, we would only flag 11 responses total, less than half of those with suspicious open-endeds. (And if we did that, we would also need to evaluate the false-positive problem. These checkers might also flag responses that are actually fine. I am not looking at that here.)
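For anyone who wants to run this kind of ex-post screen themselves, here is a rough sketch of the pattern using IPHub; the same loop extends to the other services, and you could exclude a response if any of them flags it. The endpoint and the "block" field reflect my understanding of IPHub's v2 API (0 = residential, 1 or 2 = non-residential/proxy), so verify against their current documentation, terms, and rate limits; the API key and the ip_address column are placeholders.
```python
import time
import pandas as pd
import requests

IPHUB_KEY = "your-api-key-here"  # placeholder: register with IPHub for a real key

def iphub_block_level(ip):
    """Return IPHub's block level for an IP, or None if the lookup fails."""
    resp = requests.get(
        f"https://v2.api.iphub.info/ip/{ip}",
        headers={"X-Key": IPHUB_KEY},
        timeout=10,
    )
    if resp.status_code != 200:
        return None
    return resp.json().get("block")

df = pd.read_csv("survey_responses.csv")  # hypothetical export with an "ip_address" column
levels = []
for ip in df["ip_address"]:
    levels.append(iphub_block_level(ip))
    time.sleep(1)  # stay well under free-tier rate limits

df["iphub_block"] = levels
df["flag_ip"] = df["iphub_block"].fillna(0).astype(int) >= 1
print(df["flag_ip"].sum(), "responses flagged by IPHub")
```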
Of course, as I write above, we invoked several CloudResearch tools that might have filtered out a bunch of problematic IP addresses before they even entered our survey. Maybe the IP-checking tools struggle here due to survivorship effects: only the especially wily fraudsters remain. But even if this is the case, these still represent about 10% of our data, easily enough to matter.
Is this really fraud?
Maybe I'm making something out of nothing here. If these folks pass four attention-check questions, consistently report the same gender, and have non-suspicious IP addresses, maybe they're actually okay and we should just consider them to be valid responses.
Maybe, but I'm skeptical. Not being able to write a simple sentence about what meal is most important to you seems, to me, like per se evidence that a respondent is not mentally invested to a degree that most studies require. Furthermore, the patterns in the open-endeds I display above are facially suspicious. The repeated invocations of "nice," "good," "very good," "very nice," etc. seem to reflect that these responses share a common origin, and are not merely getting a little lazy at the end of a survey instrument. And of course, I mention one clear-cut case of maliciousness above (plagiarizing from an external website).
And there's a little more: we see some tentative evidence that these 28 suspicious responders behaved differently in the main part of our study. (28 is a small comparison group, so I will merely describe general patterns.) Our study was a simple instrumentation check: a within-subjects design that involved viewing both a high-quality and a low-quality campaign advertisement (in a random order), and rating them. Among most respondents, the effects were huge (meaning our instrumentation did what we expected). But the suspicious responders rated both ads nearly the same, seemingly oblivious to obvious differences. They also appeared to favor the high ends of our various rating scales, compared to the non-suspicious responders.
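To make that comparison concrete, here is the sort of summary one could compute; the column names (rating_high_ad, rating_low_ad, suspicious) are placeholders rather than our actual variable names.
```python
import pandas as pd

# Hypothetical wide-format data: one row per respondent, a rating of each ad,
# and a boolean "suspicious" flag from the open-ended screen.
df = pd.read_csv("survey_responses.csv")
df["ad_gap"] = df["rating_high_ad"] - df["rating_low_ad"]

# Non-suspicious respondents should show a large positive gap; suspicious ones, a gap near zero.
print(df.groupby("suspicious")["ad_gap"].describe())

# Check whether suspicious responders cluster at the high end of the scales.
print(df.groupby("suspicious")[["rating_high_ad", "rating_low_ad"]].mean())
```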
If you asked me for my mental picture of what is going on, I suppose at this point I just think there is a person or people (I don't know whether they are in the U.S. or not) who have come up with an apparatus to complete MTurk surveys in high volume. They might pay a modicum of attention, probably just enough to pass attention checks that multiple researchers use, which might by now be familiar. But probably not enough attention to count as a valid response in survey research. I suspect free-response, open-ended questions are fairly effective at identifying these respondents because of the effort required to answer them.
I encourage other researchers doing studies on MTurk to experiment with including other free-response questions that might be revealing here. There is no need to re-use my "favorite meal" question. The fraudsters might adapt to it. Try asking respondents to write one sentence about a hobby they enjoy, about a musical performance they remember, or about the main ways they consume caffeine (if they do). And please include a simple "Please use this box to report any comments on the survey" question, as suspicious responders appear especially prone to writing "good" and "nice" in response to such questions. It would also be helpful to have data points from people who recently asked open-ended questions and set MTurk approval ratings to be quite high.
Be nice to MTurk
A closing remark. Please do not use this information to besmirch MTurk as a resource for conducting survey research. Remember that about 90% of the responses we received look fine. Indeed, they appear to be higher in attention and quality than what I am accustomed to seeing on non-convenience samples. This is not a categorical problem with MTurk as a data source (and it drives me bananas when reviewers dismiss MTurk out of hand). It is a specific issue with an otherwise great resource, and our studies will be better if we solve it.