Normed Images for X-ray Screening Vigilance Tasks

Vigilance performance and trust in automation have attracted considerable research interest because of their implications for public safety. This work provides an experimental resource for scholars in need of a vigilance-style task. The dataset includes 150 X-ray images of luggage; participants indicate whether or not they believe each image contains a dangerous item, simulating airport security screening. Using a sample of 991 adults recruited via MTurk, we normed these images for difficulty. These stimuli can be used to study vigilance performance, trust in an automated decision aid, and related topics.


Background
Vigilance is defined as "the ability of organisms to maintain their focus of attention and to remain alert to stimuli over prolonged periods of time" [1, p. 433]. Vigilance tasks, then, are tasks that require sustained attention to identify rare but important stimuli over a length of time [1,2,3]. Research suggests that humans are poorly suited to vigilance tasks; a marked decrease in performance occurs after approximately 15-30 minutes [3,4,5,6,7]. Concern over this "vigilance decrement" and its associated errors has inspired researchers to seek methods of alleviating it [e.g., 8,9,10,11,12,13,14]. Other researchers have focused specifically on the vigilance task of monitoring automated systems for errors, using vigilance tasks to examine automation use, complacency, and related issues [e.g., 15,16,17,18].
The X-ray screening task was popularized by Merritt and Ilgen [17], who used it to examine participants' trust in a fictional decision aid. They proposed that the task can be considered a "microworld" in that it contains essential elements of a real-life situation while also providing the opportunity for increased experimental control [19, p. 65]. Participants view a series of X-ray images of luggage containing various items, some of which may be dangerous (guns or knives). Participants' task is to indicate whether or not each image contains a weapon.

Materials
Scans of individual items (e.g., a baby bottle, a pair of shoes, a cell phone) were provided by the Transportation Security Administration for research purposes in approximately 2003. The researchers then created luggage images by combining selected individual items into top-down views of scanned luggage, layering, rotating, and positioning the individual items as needed in Adobe Photoshop. This process resulted in 150 luggage images, 20% of which contained weapons. The images varied in the number of items they contained and the degree to which the items overlapped, but were consistent in overall size and luminance.

Procedures
To curtail respondent fatigue, each respondent viewed half of the 150 X-ray stimuli (each X-ray image was rated by approximately 500 respondents). For each image, participants selected "search" if they believed the image contained a weapon and "clear" if not. The study took participants 23.38 minutes on average. Each slide remained visible for as long as the participant liked, and there was no programmed delay between slides. To keep participants as naïve as possible, no performance feedback was provided. In addition to completing the X-ray task, participants completed 5 demographic items (before the X-ray task) and 81 self-report items related to another study (after the X-ray task).
Due to limitations of randomization in our survey platform, we created four sets of slide combinations. The sets contained equal numbers of slides with weapons, and each image appeared in two of the four sets and thus would be evaluated by approximately N = 500 participants. Participants were randomly assigned to a slide set, and the order of the slides within each set was randomized for each participant.
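For illustration, this counterbalancing scheme can be reproduced in a few lines of code. The following is a minimal sketch under our own assumptions (placeholder image IDs and Python's standard random module), not the survey platform's actual implementation:

    import random

    # Minimal sketch of the counterbalancing described above: 150 images
    # (20% containing weapons) are distributed into four sets of 75 such
    # that each image appears in exactly two sets and every set contains
    # the same number of weapon images. Image IDs are placeholders.
    weapon_ids = [f"weapon_{i:02d}" for i in range(30)]
    benign_ids = [f"benign_{i:03d}" for i in range(120)]

    def halves(items):
        """Shuffle a list and split it into two equal halves."""
        shuffled = random.sample(items, len(items))
        return shuffled[:len(shuffled) // 2], shuffled[len(shuffled) // 2:]

    w1, w2 = halves(weapon_ids)   # 15 weapon images per half
    b1, b2 = halves(benign_ids)   # 60 benign images per half

    # Crossing the halves places each image in exactly two of the four sets.
    slide_sets = [w1 + b1, w1 + b2, w2 + b1, w2 + b2]

    def assign(participant_id):
        """Randomly assign a participant to a set and shuffle slide order."""
        rng = random.Random(participant_id)  # reproducible per participant
        slides = rng.choice(slide_sets)
        return rng.sample(slides, len(slides))

    print(len(assign("P001")))  # 75 slides, 15 (20%) containing weapons

With random assignment across the four sets, each image is expected to be rated by roughly half of the sample (here, roughly 500 of the 991 participants).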

Quality Control
Respondents viewed a training video instructing them on what items to search for (i.e., guns and knives). The video was just under 3 minutes long and is available in the repository. Following the video, respondents were required to pass a multiple-choice attention check question about the instructions; specifically, they needed to correctly indicate that they should click "search" if the image contains a gun or knife, but not any other item. All respondents in our sample passed the attention check, with 92.8% passing on the first attempt and the remaining 7.2% on the second attempt.

Ethical issues
This research complied with the American Psychological Association Code of Ethics and was approved by the Institutional Review Board at the University of Missouri-St. Louis. Informed consent was obtained from each participant. Data were identified only by a random ID number.

Reuse Potential
The norms provided can be used to equate expected difficulty across different subsets of items. The average accuracy rate on the full set of slides was 70%; researchers could use this information to set automation reliability at, above, or below the level of the average participant. We suspect that our norms best represent the accuracy of inexperienced screeners.
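To illustrate, the norms can be used to construct difficulty-matched subsets as sketched below. This is our own example; the per-image accuracy values are placeholders, not values from the dataset:

    # Placeholder norms: per-image accuracy rates (not the actual values
    # from this dataset).
    norms = {
        "img_001": 0.91, "img_002": 0.48, "img_003": 0.77, "img_004": 0.62,
        "img_005": 0.85, "img_006": 0.55, "img_007": 0.70, "img_008": 0.66,
    }

    # Rank images by normed accuracy, then deal them alternately into two
    # subsets so the subsets have similar mean difficulty (an ABBA "snake"
    # order would balance them even further).
    ranked = sorted(norms, key=norms.get, reverse=True)
    subset_a, subset_b = ranked[0::2], ranked[1::2]

    def mean_accuracy(image_ids):
        """Mean normed accuracy for a subset of image IDs."""
        return sum(norms[i] for i in image_ids) / len(image_ids)

    print(mean_accuracy(subset_a), mean_accuracy(subset_b))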
We recommend that researchers create a scoring scheme for the task that matches the relative costliness of mistakes and the relative value of correct versus incorrect screening decisions. In this data collection, no formal scoring scheme was presented, and the average sensitivity was d' = 1.24 (average hit rate = 77%; average false alarm rate = 31%).
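For reference, d' is the z-transformed hit rate minus the z-transformed false alarm rate. The short check below (our own snippet, using SciPy) recovers the reported sensitivity from the reported rates:

    from scipy.stats import norm

    # d' = z(hit rate) - z(false alarm rate)
    hit_rate = 0.77          # average hit rate reported above
    false_alarm_rate = 0.31  # average false alarm rate reported above

    d_prime = norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)
    print(round(d_prime, 2))  # ~1.23; matches the reported 1.24 up to
                              # rounding of the hit and false alarm rates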
When reusing these images, researchers should consider whether to provide feedback after each individual decision, periodically, or not at all. More frequent feedback is likely to produce greater improvements in performance over the course of the task than infrequent or absent feedback. Further, when participants work alongside an automated system, feedback shapes the development of trust in that system.