(1) Overview


Collection Date(s)

Fall 2018


Vigilance is defined as “the ability of organisms to maintain their focus of attention and to remain alert to stimuli over prolonged periods of time” [1, p. 433]. Vigilance tasks, then, are tasks that require sustained attention to identify rare but important stimuli over an extended period [1, 2, 3]. Research suggests that humans are poorly suited to vigilance tasks; performance declines markedly after approximately 15 to 30 minutes [3, 4, 5, 6, 7]. Concern over this “vigilance decrement” and the associated performance errors has inspired researchers to seek methods of alleviating it [e.g., 8, 9, 10, 11, 12, 13, 14]. Other researchers have focused specifically on the vigilance task of monitoring automated systems for errors and have used vigilance tasks to examine automation use, complacency, and associated issues [e.g., 15, 16, 17, 18].

The X-ray screening task was popularized by Merritt and Ilgen [17], who used it to examine participants’ trust in a fictional decision aid. They proposed that the task can be considered a “microworld” in that it contains essential elements of a real-life situation but also provides the opportunity for increased experimental control [19, p. 65]. Participants view a series of X-ray images of luggage containing various items, some of which may be dangerous (guns or knives). The participants’ task is to indicate whether or not each image contains a weapon.

(2) Methods


A total of 991 U.S. adults (aged 18 or older) were recruited via Amazon Mechanical Turk (MTurk) and compensated $4.00. The mean age was 36 years (SD = 11.17). The sample was 56.6% male, 43.1% female, and 0.3% other genders. By race and ethnicity, participants were 73.7% white/Caucasian, 12.1% African-American, 4.6% Latin, 5.3% East or South Asian, 2.6% multi-racial, and 1.6% other ethnicities.


Scans of individual items (e.g., a baby bottle, a pair of shoes, a cell phone) were provided by the Transportation Security Administration for research purposes in approximately 2003. The researchers then created luggage images by combining selected individual items into top-down views of scanned luggage. This was done by layering, rotating, and positioning the individual items as needed using Adobe Photoshop. This process resulted in 150 luggage images, 20% of which contained weapons. The images varied in the number of items contained and the degree to which the items overlapped. The images were consistent in overall size and luminance.


To curtail respondent fatigue, each respondent viewed half of the 150 X-ray stimuli, so each X-ray image was rated by approximately 500 respondents. For each image, participants selected “search” if they believed the image contained a weapon and “clear” if they did not.

The study took participants 23.38 minutes on average. Each slide was viewable for as long as the participant liked, and there was no programmed delay between slides. To keep participants as naïve as possible, no performance feedback was provided. In addition to completing the X-ray task, participants completed five demographic items (before the X-ray task) and 81 self-report items related to another study (after the X-ray task).

Due to limitations of randomization in our survey platform, we created four sets of slide combinations. The sets contained equal numbers of slides with weapons, and each slide appeared in two of the four sets and thus should have been evaluated by approximately N = 500 participants. Participants were randomly assigned to a slide set, and the order of the slides within each set was randomized for each participant.
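The set construction described above can be illustrated with a minimal sketch. The exact procedure used to build the four sets is not reported here, so the approach below (shuffling and halving each stratum twice) is only one way to satisfy the stated constraints, and the image identifiers and function names are hypothetical.

```python
import random
from collections import Counter


def build_slide_sets(weapon_ids, clear_ids, seed=0):
    """Build four slide sets so that every image appears in exactly two.

    Within each stratum (weapon / no weapon) the images are shuffled and
    halved twice: the first split fills sets 0 and 1, the second fills
    sets 2 and 3. With even stratum sizes, every set receives the same
    number of weapon slides. This is an assumed procedure, not
    necessarily the one used by the authors.
    """
    rng = random.Random(seed)
    slide_sets = [[], [], [], []]
    for stratum in (weapon_ids, clear_ids):
        for first, second in ((0, 1), (2, 3)):
            shuffled = list(stratum)
            rng.shuffle(shuffled)
            half = len(shuffled) // 2
            slide_sets[first].extend(shuffled[:half])
            slide_sets[second].extend(shuffled[half:])
    return slide_sets


def assign_participant(slide_sets, rng):
    """Randomly pick a set and a fresh presentation order for one participant."""
    set_index = rng.randrange(len(slide_sets))
    order = rng.sample(slide_sets[set_index], k=len(slide_sets[set_index]))
    return set_index, order


# Hypothetical identifiers: 150 images, 20% (30) of which contain weapons.
weapon_ids = [f"weapon_{i:03d}" for i in range(30)]
clear_ids = [f"clear_{i:03d}" for i in range(120)]

slide_sets = build_slide_sets(weapon_ids, clear_ids)
appearances = Counter(img for s in slide_sets for img in s)
weapons_per_set = [sum(img.startswith("weapon") for img in s) for s in slide_sets]

# Each image is in 2 of 4 sets, so about half of the 991 participants see it.
expected_raters_per_image = 991 * 2 / 4

set_index, order = assign_participant(slide_sets, random.Random(1))
```

Under these assumptions each set holds 75 slides, 15 of them with weapons, and the expected number of raters per image follows directly from the sample size and the two-of-four design.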

Quality Control

Respondents viewed a training video instructing them on which items to search for (i.e., guns and knives). The training video was just under 3 minutes long and is available in the repository. Following the training video, they were required to pass a multiple-choice attention check question about the instructions in the video; specifically, they needed to correctly indicate that they should click “search” if the image contains a gun or knife, but not any other items. All respondents in our sample passed the attention check, with 92.8% passing on the first attempt and the remaining 7.2% passing on the second attempt.

Ethical issues

This research complied with the American Psychological Association Code of Ethics and was approved by the Institutional Review Board at the University of Missouri-St. Louis. Informed consent was obtained from each participant. Data were identified only by a random ID number.

(3) Dataset description

Object name




Data type

Primary Data

Processed Data

Format names and versions

The images are available as .jpg files. The slide norms are presented in Excel format, and the training video is in .mp4 format.

Data Collectors

The online data collection through MTurk was supervised by Dr. Stephanie M. Merritt in Fall 2018.







Repository location

https://irl.umsl.edu/psychology-faculty/61/. The files may also be found in the Harvard Dataverse Repository at https://doi.org/10.7910/DVN/Z6R79K.

Publication date


(4) Reuse potential

These materials can be used to produce new original research on vigilance, trust in automation, and related topics. Other relevant topics suggested by reviewers include sustained attention and fatigue, visual detection, decision certainty, threat biases, training, perceptual learning, and (if a per-image deadline is added) decision-making under stress. Researchers may also consider measuring per-slide response times as a component of these efforts.

The norms provided can be used to equate expected difficulty across different subsets of items. The average accuracy rate on the full set of slides was 70%. Researchers could use this information to set automation reliability at, above, or below the level of the average participant. We suspect that our norms best represent the accuracy of inexperienced screeners.
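One way to equate expected difficulty across subsets is sketched below. It ranks slides by normed accuracy and deals them out in alternating ("snake") order so that subset means stay close. The accuracy values and image names are made up for illustration and are not taken from the norms file; the function names are likewise hypothetical.

```python
def split_matched_subsets(item_accuracy, n_subsets=2):
    """Deal items, ranked by normed accuracy, into subsets in snake order.

    Snake order (0, 1, 1, 0, 0, 1, ...) keeps the subset means closer
    than a simple alternating deal would.
    """
    ranked = sorted(item_accuracy, key=item_accuracy.get)
    subsets = [[] for _ in range(n_subsets)]
    for rank, item in enumerate(ranked):
        block, pos = divmod(rank, n_subsets)
        idx = pos if block % 2 == 0 else n_subsets - 1 - pos
        subsets[idx].append(item)
    return subsets


def mean_accuracy(items, item_accuracy):
    """Average normed accuracy of the given items."""
    return sum(item_accuracy[i] for i in items) / len(items)


# Hypothetical norms: proportion of participants answering each slide correctly.
norms = {
    "img_a": 0.95, "img_b": 0.40, "img_c": 0.70, "img_d": 0.65,
    "img_e": 0.85, "img_f": 0.55, "img_g": 0.75, "img_h": 0.60,
}

subset_1, subset_2 = split_matched_subsets(norms)
gap = abs(mean_accuracy(subset_1, norms) - mean_accuracy(subset_2, norms))
```

In practice it may be preferable to run this separately for weapon and non-weapon slides so that each subset also keeps the same proportion of targets.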

We recommend that researchers create a scoring scheme for the task that matches the relative costliness of mistakes and the relative value of correct versus incorrect screening decisions. In this data collection, no formal scoring scheme was presented, and the average sensitivity was d’ = .24 (average hit rate = 77%; average false alarm rate = 31%).
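As an illustration of such a scoring scheme, the sketch below pairs a payoff table with the standard equal-variance signal detection formula, d’ = z(hit rate) − z(false alarm rate). The payoff values and function names are hypothetical and should be replaced with values that reflect the costs relevant to a given study.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """Equal-variance sensitivity index: z(hit rate) - z(false alarm rate).

    Rates of exactly 0 or 1 have no finite z-score and raise an error,
    so they would need a correction before being passed in.
    """
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


# Hypothetical payoffs: misses are assumed to be far costlier than false alarms.
PAYOFFS = {
    ("weapon", "search"): 10,     # hit
    ("weapon", "clear"): -50,     # miss
    ("no_weapon", "search"): -5,  # false alarm
    ("no_weapon", "clear"): 1,    # correct rejection
}


def score_trials(trials, payoffs=PAYOFFS):
    """Sum the payoff for each (ground truth, response) pair."""
    return sum(payoffs[(truth, response)] for truth, response in trials)


# One trial of each outcome type, scored with the hypothetical payoffs.
example_trials = [
    ("weapon", "search"),
    ("weapon", "clear"),
    ("no_weapon", "search"),
    ("no_weapon", "clear"),
]
example_score = score_trials(example_trials)

# Sensitivity implied by the averaged rates reported above (77% and 31%).
aggregate_d_prime = d_prime(0.77, 0.31)
```

Applied to the averaged hit and false alarm rates, the formula gives a value of about 1.23. A figure computed from averaged rates is a property of the aggregate and need not equal an average of d’ values computed per participant.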

When reusing these images, researchers should consider whether to provide feedback on each individual decision, periodically, or not at all. More frequent feedback is likely to produce greater improvements in performance over the course of the task than less frequent feedback or none. Further, when participants work with an automated system, feedback affects the development of their trust in that system.
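The three feedback options can be treated as a single experimental parameter. The sketch below is a minimal illustration; the schedule names, the default interval of 10 trials, and the function name are assumptions rather than part of the original materials.

```python
def feedback_due(trial_number, schedule, interval=10):
    """Return True if feedback should follow this trial (numbered from 1).

    schedule: "every_trial", "periodic" (after every `interval` trials),
    or "none". These labels are hypothetical.
    """
    if schedule == "every_trial":
        return True
    if schedule == "periodic":
        return trial_number % interval == 0
    if schedule == "none":
        return False
    raise ValueError(f"unknown feedback schedule: {schedule!r}")


# Count feedback events over a 75-slide session under each schedule.
n_trials = 75
feedback_counts = {
    schedule: sum(feedback_due(t, schedule) for t in range(1, n_trials + 1))
    for schedule in ("every_trial", "periodic", "none")
}
```

Keeping the schedule as an explicit parameter makes it straightforward to manipulate feedback frequency between conditions and to report it alongside the results.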