A Large Dataset of Generalization Patterns in the Number Game

We present a dataset of 272,700 two-alternative forced-choice responses in a simple numerical task modeled after Tenenbaum's "number game" experiment [6]. Subjects were shown a set of numbers (e.g. {16, 12}) and asked what other numbers were likely to belong to that set (e.g. 1, 5, 2, 98). Their generalization patterns reflect both rule-like (e.g. 'even numbers,' 'powers of two') and distance-based (e.g. 'numbers near 50') generalization. This dataset is available for further analysis of these simple and intuitive inferences, for the development of hands-on modeling instruction, and for attempts to understand how probability and rules interact in human cognition.


Background
Numbers and related mathematical ideas form a complex set of interrelated concepts that can be used to study the origin and use of structured mental representations. To examine learning and generalization in this domain, we present an extension of the "number game" task originally developed by Tenenbaum [6]. In the number game, a subject is given a list of numbers sampled from an unknown rule. Subjects are asked to generalize from the samples and predict what other numbers ("targets") are likely to obey the rule. For example, a subject could be told that an unknown program generated the numbers {4, 16, 8}, and then asked to rate whether 12 might be generated as well. In this example, subjects might rate 12 as relatively unlikely since the observed data suggests a rule like 'powers of two'. However, if the shown set were instead {4, 16, 8, 10}, the concept of 'even numbers' now seems like a better explanation than 'powers of two,' and subjects should generalize accordingly. This setup provides a tractable toy domain for studying rule-like generalization: the relevant hypotheses are likely to be very simple and concrete (e.g. basic arithmetic concepts), the input sets provided to subjects are small, and the findings are intuitive.
Tenenbaum [6,7,8] showed that subject generalizations in this task followed statistically sensible inferences combining both rule-like generalizations (e.g. 'even numbers,' 'powers of two,' 'multiples of ten') and magnitude/similarity-based generalizations (e.g. 'numbers near 50'). For instance, subjects' patterns of generalization depend strongly on the amount of data provided, consistent with models that quantify the likelihood of sampling the observed set given a possible rule. For example, if the participant sees {80, 10, 30}, the likelihoods for the hypotheses 'multiples of 10' and 'multiples of 5' should be significantly higher than for 'multiples of 2', since the latter includes more numbers and thus assigns {80, 10, 30} a smaller likelihood of being sampled. With short input lists like {8, 32}, or even single-number lists like {8}, responses are much more revealing of a priori inductive biases. For instance, given {8}, we can measure whether a participant prefers to generalize to 10 (suggestive of 'even numbers') or 16 (suggestive of 'powers of two'). A comparison of these generalizations may therefore tell us what types of numerical concepts subjects possess before our experiment: that is, the mathematical concepts that are most likely (highest prior) before data is observed. The number game task has also been used to study information-gathering behavior [4], similar to that of Wason's 2-4-6 task [9].
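The size-principle reasoning described above can be sketched in a few lines of Python. This is our illustration of the general idea, not the authors' exact model: under strong sampling, the likelihood of a hypothesis h given n independently sampled numbers is (1/|h|)^n, where |h| counts the members of h in the domain 1..100.

```python
def members(pred, lo=1, hi=100):
    """All numbers in [lo, hi] satisfying a predicate."""
    return [n for n in range(lo, hi + 1) if pred(n)]

# Three candidate hypotheses for the observed set {80, 10, 30}.
hypotheses = {
    "multiples of 10": members(lambda n: n % 10 == 0),
    "multiples of 5": members(lambda n: n % 5 == 0),
    "multiples of 2": members(lambda n: n % 2 == 0),
}

data = [80, 10, 30]

for name, h in hypotheses.items():
    if all(x in h for x in data):  # hypothesis must contain the data
        likelihood = (1.0 / len(h)) ** len(data)
        print(f"{name}: |h| = {len(h)}, likelihood = {likelihood:.2e}")
```

'Multiples of 10' (10 members) comes out with the highest likelihood, 'multiples of 2' (50 members) with the lowest, matching the intuition that the smallest consistent hypothesis best explains the sample.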
In Tenenbaum's original task [6], 8 subjects were presented with the same 8 sets of input numbers, each with a list of 30 hand-selected targets. Subjects rated each number on a scale of 1-7, according to how likely the number was to be accepted by the program; they were instructed to take as much time as needed. Our experiment aims to replicate this general framework, but with a much larger number of subjects, concepts, and targets. We constructed a large space of concepts by applying several simple and intuitive transformations (e.g. x → 2x) across a variety of basic number patterns (e.g. 'even numbers,' 'prime numbers,' etc.). We then sampled 255 sets from these concepts, and tested approximately 10 subjects on a two-alternative forced choice (in the set or not) for the targets 1, 2, …, 100 for each input set. This resulted in a total of 606 participants, providing 253,564 total responses after trimming. We used a two-alternative forced choice instead of Tenenbaum's Likert scale rating in order to simplify later data analysis and model comparisons. Our binary response data is easily incorporated into either mixed-effect logistic regressions, or as a binomial likelihood in more general probabilistic models. In forthcoming work, we present analysis aimed at capturing people's priors in the context of a Bayesian data analysis and model comparison [1].
Methods

Sample
606 participants were recruited via Amazon Mechanical Turk (MTurk). We configured PsiTurk [3] to require participants to have an approval rating of at least 95% and be from the U.S. (note that the latter constraint is not absolute, as a small number of workers from outside the U.S. are able to circumvent this qualification). The first run included 510 subjects, of which 25 were rejected on qualitative criteria of not adequately attempting the task (see Quality Control). Rejected Human Intelligence Tasks (HITs) were replaced in follow-up runs. After the first 485 subjects, an additional 126 were run and five were rejected. The experiment was run in March and April of 2015.
The modal education level participants reported was a "Bachelor's degree," accounting for 37.6% of subjects, followed by "some college" with 24.9%. Mean age of participants was 34.2, the median was 31.0, and the standard deviation was 11.0. Reported gender was nearly even, with 51.1% of participants being male. The vast majority of subjects (97.4%) reported English as their first language. Age and education level for our subjects are typical of the MTurk population [2], though we find a higher than average proportion of male subjects.

Materials
Our version of the number game task was implemented in HTML and JavaScript, and distributed to Mechanical Turk workers via the PsiTurk interface [3]. Figure 1 (left) shows the sequence of instruction pages shown before the beginning of the experiment. Once the experiment began, subjects rated a series of targets for each of 15 sets, shown in Figure 1 (right). For each concept, the subject first saw a screen with the concept shown on it, and when they were ready they proceeded by pressing the spacebar. The subject then saw a target, and responded 'yes' or 'no' to the question, "Is it likely that the program generates this number next?" by pressing either the 'y' or 'n' key. Once they responded, another target was shown. This was repeated for 30 targets, after which the first screen for the next set appeared.

Stimuli Sets
Numerical sets used as stimuli were constructed with the goal of spanning an interesting space of generalization stimuli. To generate sets, we first generated a collection of concepts. Beginning with six "primordial sets" (all numbers, evens, odds, squares, cubes, and primes), the following functions were mapped across each primordial set: f(n) = n, f(n) = n + 1, f(n) = n − 1, f(n) = n + 2, f(n) = n − 2, f(n) = 2 * n, f(n) = 3 * n, f(n) = 2 * n + 1, f(n) = 3 * n + 1, f(n) = 3 * n − 1, f(n) = 2^n, f(n) = 2^(n+1), f(n) = 2^n + 1, f(n) = 2^n − 1. Numbers in our task were restricted to the domain of integers 1 through 100. We selected these primordial sets and functions in order to span concepts studied in previous work [6,7,8], and also to include some degree of pseudo-random number sets - for example, subjects are likely to perceive 2^prime − 1 (i.e. {3, 7, 31}) as a set of random numbers, perhaps limited to some interval. Duplicate and extremely short (length < 3) full concepts were removed. We then added 21 additional full concepts: 4 * n, 5 * n, 6 * n, 7 * n, 8 * n, 9 * n, 10 * n; 5 * n + 1 through 5 * n + 4; and 10 * n + 1 through 10 * n + 9. Given these 79 full concepts, we generated sets by the following procedure: if the length of the full concept was greater than 4, we chose 4 random numbers from the list without replacement; otherwise, if the length was greater than 3, we chose 3 numbers; otherwise, if the length was greater than 2, we chose 2 numbers. For each full concept, between 1 and 3 sets were created, for a total of 200 sets. Finally, 55 single-item sets were added to these 200, for a total of 255 sets. 16 of the single-item sets were hand selected (all integers 1 through 15, and 100), and the rest were chosen randomly from the range 16 to 99. See Table 1 for a full list of sets.
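The concept construction above can be sketched as follows. This is our reconstruction of the described procedure, not the original generation script, and it stops before the 21 hand-added concepts and the random sampling of sets:

```python
# Build "full concepts" by mapping transformations over six primordial sets,
# restricting to the domain 1..100, then dropping duplicates and
# concepts shorter than 3 (as described in the text).
DOMAIN = range(1, 101)

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

primordial = {
    "all": list(DOMAIN),
    "even": [n for n in DOMAIN if n % 2 == 0],
    "odd": [n for n in DOMAIN if n % 2 == 1],
    "square": [n * n for n in range(1, 11)],
    "cube": [n ** 3 for n in range(1, 5)],
    "prime": [n for n in DOMAIN if is_prime(n)],
}

transforms = [
    lambda n: n, lambda n: n + 1, lambda n: n - 1,
    lambda n: n + 2, lambda n: n - 2,
    lambda n: 2 * n, lambda n: 3 * n,
    lambda n: 2 * n + 1, lambda n: 3 * n + 1, lambda n: 3 * n - 1,
    lambda n: 2 ** n, lambda n: 2 ** (n + 1),
    lambda n: 2 ** n + 1, lambda n: 2 ** n - 1,
]

concepts = set()
for base in primordial.values():
    for f in transforms:
        c = tuple(sorted({f(n) for n in base if 1 <= f(n) <= 100}))
        if len(c) >= 3:                 # drop extremely short concepts
            concepts.add(c)             # set() removes duplicates
```

For example, applying f(n) = 2^n − 1 to the primes yields exactly the pseudo-random-looking concept {3, 7, 31} mentioned above.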
While we generated sets like {16, 8} from underlying concepts (e.g. 'powers of 2'), analysis of our data should primarily concern what generalizations the set {16, 8} leads subjects to, not whether subjects can recover the generating concept itself. There will often be too little data to infer the generating concept, particularly given the profusion of close alternative concepts (e.g. 'numbers between ten and twenty'). Examining how subjects generalize from very little data, as in these small sets, will be informative about their underlying inductive biases.

Targets
These 255 sets were divided into 17 groups of 15 sets each, and each participant was assigned one of these groups. For each set presented to a participant, 30 targets (in 1…100) were shown, randomly selected without replacement, so that each participant made 450 decisions. Altogether, at least nine two-alternative forced-choice ratings were collected for each number from 1 to 100. Due to a small randomization error, targets for each set were slightly non-independent relative to one another, with no obvious effect on the experiment.
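The assignment scheme above can be sketched as follows. This is an illustration of the intended design, not the original experiment code (set labels are placeholders, and targets here are drawn fully independently, unlike the small randomization error noted):

```python
import random

# 255 sets split into 17 groups of 15; each participant sees one group.
all_sets = [f"set_{i}" for i in range(255)]          # placeholder labels
groups = [all_sets[i:i + 15] for i in range(0, 255, 15)]

def trials_for_participant(group, rng):
    """30 targets per set, sampled from 1..100 without replacement,
    giving 15 * 30 = 450 decisions per participant."""
    return [(s, rng.sample(range(1, 101), 30)) for s in group]

rng = random.Random(0)
trials = trials_for_participant(groups[0], rng)
```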

Procedures
After seeing the instructions, subjects provided 30 ratings for each of 15 different sets. At the conclusion of these forced-choice trials, the subject filled out a brief questionnaire. Basic demographics were collected: age, gender, first language, ZIP (postal) code, and highest level of education. Subjects were also asked to describe in English 5 sets randomly selected from the 15 they were shown during the experiment.

Quality Control
HITs were rejected based on qualitative criteria, under the determination that the task was not properly attempted. Many rejected HITs were exceedingly fast, including a number of HITs completed in less than 5 minutes. Other rejected HITs included significant repetitive answering patterns, such as many 'yes' responses followed by many 'no's, or alternating 'yes'/'no'. A small fraction of reaction times were corrupted during data recording; these were replaced with NA in the dataset.

The Dataset
The dataset contains a single row for each response collected, with columns for rating (1 for 'yes,' 0 for 'no'), set ("set"), target, subject id ("id"), trial number for subject ("trial"), reaction time ("rt"), subject demographics, number of HITs for this set and target pairing ("hits"), as well as probability of responding yes ("p"), entropy of responses ("H"), and a typicality measure. Probability was calculated as the number of 'yes' responses for a given set and target pairing, divided by the total number of responses for this pairing. Entropy was calculated as: −(p*log(p) + (1 − p)*log(1 − p)). We have also used a response typicality metric to identify subjects whose responses strayed far from the average (e.g. due to low task effort), calculated as log(p) for 'yes' ratings and log(1 − p) for 'no' ratings.
To illustrate our collected data, Figure 2 shows ratings for targets under three related sets: {3, 63}, {33, 3}, and {93, 43, 83, 53}. Plots in the left and right columns of the figure are colored according to two distinct rules, 'numbers ending in 3' (left) and 'multiples of 3' (right). The data for {93, 43, 83, 53} strongly supports the first, but not the second, pattern. The data for {3, 63} is more uniform, and seems to correspond more closely to 'multiples of 3'. The distribution for {33, 3} may be interpreted as a mixture of the two patterns: 'numbers ending in 3' are generally rated highest, with some probability mass assigned to 'multiples of 3'. The human ratings illustrate how the set may push generalizations one way or the other, and reveal both categorical (all-or-nothing) and gradient generalizations across subjects. While we intended our experiment to present subjects with sets (unordered collections), it is possible that some subjects interpreted the goal as generalizing from sequences (ordered data). Further analysis may distinguish these possibilities in detail, although the sets were presented in random order and many subjects did not use sequential terminology in their descriptions of the concepts. Out of 3030 concept descriptions, only 21 included sequential terms such as "increasing order," "decreasing," or "ascending".

Ethical issues
All data collected in this experiment were anonymized prior to public release. This work was approved by the Research Subjects Review Board at the University of Rochester as part of a protocol for experimental data collection on Mechanical Turk.

Object name
Our dataset consists of one primary file, numbergame_data.csv (described above), and 3 supplementary files containing additional subject data:
• instructions_rt.csv: time spent looking at each instruction page
• set_descriptions.csv: qualitative set descriptions provided in the questionnaire
• show_set_rt.csv: time spent looking at the set presentation page, before responding to targets (see Figure 1, right)
To prevent ambiguity in file loading, commas have been replaced with underscores in the 'set' column of numbergame_data.csv, set_descriptions.csv, and show_set_rt.csv, and in the 'descr' column of set_descriptions.csv.
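A minimal sketch of reading the data and undoing the comma-to-underscore substitution in the 'set' column. We assume sets are encoded as underscore-joined integers (e.g. "16_8"); the sample rows below are hypothetical, and the exact encoding should be checked against the files themselves:

```python
import csv
import io

def parse_set(field):
    """Convert an underscore-separated set string back to a list of ints."""
    return [int(x) for x in field.split("_")]

# Hypothetical rows standing in for numbergame_data.csv.
sample = "rating,set,target\n1,16_8,32\n0,16_8,7\n"
rows = list(csv.DictReader(io.StringIO(sample)))
sets = [parse_set(r["set"]) for r in rows]
```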

Data type
Raw data file.

Format names and versions
All data are in comma-delimited CSV format; scripts are provided in R and Python.

Data Collectors
Eric Bigelow.