We present a dataset with 272,700 two-alternative forced choice responses in a simple numerical task modeled after Tenenbaum’s “number game” experiment [

March–April 2015.

Numbers and related mathematical ideas form a complex set of interrelated concepts that can be used to study the origin and use of structured mental representations. To examine learning and generalization in this domain, we present an extension of the “number game” task originally developed by Tenenbaum [

Tenenbaum [

In Tenenbaum’s original task [^{x}) across a variety of basic number patterns (e.g. ‘even numbers,’ ‘prime numbers,’ etc.). We then sampled 255 sets from these concepts, and tested approximately 10 subjects on a two-alternative forced choice (in the set or not) for each of the targets 1, 2, …, 100 for each input set. This resulted in a total of 606 participants, providing 253,564 total responses after trimming. We used a two-alternative forced choice instead of Tenenbaum’s Likert scale rating in order to simplify later data analysis and model comparisons. Our binary response data can easily be incorporated into mixed-effects logistic regressions, or used with a binomial likelihood in more general probabilistic models. In forthcoming work, we present an analysis aimed at capturing people’s priors in the context of a Bayesian data analysis and model comparison [

606 participants were recruited via Amazon Mechanical Turk (MTurk). We configured PsiTurk [

The modal education level participants reported was a “Bachelor’s degree,” accounting for 37.6% of subjects, followed by “some college” with 24.9%. The mean age of participants was 34.2, the median was 31.0, and the standard deviation was 11.0. Reported gender was nearly even, with 51.1% of participants being male. The vast majority of subjects (97.4%) reported English as their first language. Age and education level for our subjects are typical of the MTurk population [

Our version of the number game task was implemented in HTML and JavaScript, and distributed to Mechanical Turk workers via the PsiTurk interface [

Sequence of instructions shown before the experiment begins (left) and the display shown for main data collection (right).

Numerical sets used as stimuli were constructed with the goal of spanning an interesting space of generalization stimuli. To generate sets, we first generated a collection of concepts. Beginning with six “primordial sets” (all numbers, evens, odds, squares, cubes, and primes), we mapped the following functions across each primordial set: f(n) = n, f(n) = n + 1, f(n) = n – 1, f(n) = n + 2, f(n) = n – 2, f(n) = 2 * n, f(n) = 3 * n, f(n) = 2 * n + 1, f(n) = 3 * n + 1, f(n) = 3 * n – 1, f(n) = 2^{n}, f(n) = 2^{n+1}, f(n) = 2^{n} + 1, f(n) = 2^{n} – 1. Numbers in our task were restricted to the domain of integers 1 through 100. We selected these primordial sets and functions in order to span concepts studied in previous work [6, 7, 8], and also to include some degree of pseudo-random number sets; for example, subjects are likely to perceive 2^{primes} – 1 (i.e. {3, 7, 31}) as being a set of random numbers, perhaps limited to some interval. Duplicates and extremely short (length < 3) full concepts were removed. We then added 21 additional full concepts: 4 * n, 5 * n, 6 * n, 7 * n, 8 * n, 9 * n, 10 * n, 5 * n + k for k = 1, 2, 3, 4, and 10 * n + k for k = 1, …, 9.
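The concept-generation step described above can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the script used to build the stimuli: the names `PRIMORDIAL`, `FUNCTIONS`, `full_concept`, and `generate_concepts` are hypothetical, and the number of concepts it yields is not claimed to match the count reported in the text.

```python
DOMAIN = range(1, 101)  # numbers in the task are restricted to 1..100


def is_prime(k):
    """Trial-division primality test, sufficient for k <= 100."""
    return k > 1 and all(k % d for d in range(2, int(k ** 0.5) + 1))


# The six "primordial sets" (dictionary keys are hypothetical labels).
PRIMORDIAL = {
    "all": list(DOMAIN),
    "evens": [n for n in DOMAIN if n % 2 == 0],
    "odds": [n for n in DOMAIN if n % 2 == 1],
    "squares": [n * n for n in range(1, 11)],
    "cubes": [n ** 3 for n in range(1, 5)],
    "primes": [n for n in DOMAIN if is_prime(n)],
}

# The fourteen functions mapped across each primordial set.
FUNCTIONS = {
    "n": lambda n: n,
    "n+1": lambda n: n + 1,
    "n-1": lambda n: n - 1,
    "n+2": lambda n: n + 2,
    "n-2": lambda n: n - 2,
    "2n": lambda n: 2 * n,
    "3n": lambda n: 3 * n,
    "2n+1": lambda n: 2 * n + 1,
    "3n+1": lambda n: 3 * n + 1,
    "3n-1": lambda n: 3 * n - 1,
    "2^n": lambda n: 2 ** n,
    "2^(n+1)": lambda n: 2 ** (n + 1),
    "2^n+1": lambda n: 2 ** n + 1,
    "2^n-1": lambda n: 2 ** n - 1,
}


def full_concept(base, f):
    """Image of `base` under `f`, keeping only values inside 1..100."""
    return sorted({f(n) for n in base if 1 <= f(n) <= 100})


def generate_concepts(min_length=3):
    """Map every function over every primordial set, dropping duplicate
    and very short (length < min_length) full concepts."""
    concepts = {}  # keyed by contents, so duplicate concepts collapse
    for set_name, base in PRIMORDIAL.items():
        for fn_name, f in FUNCTIONS.items():
            members = tuple(full_concept(base, f))
            if len(members) >= min_length:
                concepts.setdefault(members, f"{fn_name} over {set_name}")
    return concepts
```

For instance, `full_concept(PRIMORDIAL["primes"], FUNCTIONS["2^n-1"])` reproduces the {3, 7, 31} example mentioned in the text.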

Given these 79 full concepts, we generated sets by the following procedure: if the length of the full concept was greater than 4, we chose 4 random numbers from the list without replacement; if the length was greater than 3, we chose 3 numbers; if the length was greater than 2, we chose 2 numbers. For each full concept, between 1 and 3 sets were created, for a total of 200 sets. Finally, 55 single-item sets were added to these 200, for a total of 255 sets. Sixteen of the single-item sets were hand-selected (all integers 1 through 15, plus 100), and the rest were chosen randomly from the range of 16 to 99. See Table
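As a minimal sketch of this sampling procedure, the following Python code draws up to three sets from a single full concept and builds a list of 55 single-item sets. The function names and the use of a seeded `random.Random` are assumptions made for illustration; the original sampling code may differ.

```python
import random


def sample_sets(concept, rng):
    """Draw one set of each size (4, 3, 2) that is strictly smaller
    than the full concept, sampling without replacement."""
    return [rng.sample(concept, size)
            for size in (4, 3, 2)
            if len(concept) > size]


def single_item_sets(rng):
    """16 hand-selected singletons (1..15 and 100) plus 39 drawn at
    random from 16..99, for 55 single-item sets in total."""
    hand_selected = list(range(1, 16)) + [100]
    random_items = rng.sample(range(16, 100), 55 - len(hand_selected))
    return [[n] for n in hand_selected + random_items]


rng = random.Random(0)  # seed chosen arbitrarily for reproducibility
short = sample_sets([3, 7, 31], rng)          # length 3: one set of size 2
longer = sample_sets(list(range(2, 101, 2)), rng)  # evens: sizes 4, 3, 2
singles = single_item_sets(rng)
```

A full concept of length 3 yields one set, a concept of length 4 yields two, and any longer concept yields three, which matches the "between 1 and 3 sets" per concept stated above.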

A complete list of sets used in the experiment, sorted by length. Each cell shows a set of numbers that was presented to subjects, who then decided which other numbers in 1 … 100 were likely to be generated.

1 | 2 | 3 | 4 | 5 |

6 | 7 | 8 | 9 | 10 |

11 | 12 | 13 | 14 | 15 |

16 | 17 | 18 | 22 | 23 |

25 | 26 | 30 | 31 | 33 |

35 | 36 | 39 | 43 | 44 |

47 | 48 | 49 | 50 | 51 |

53 | 55 | 61 | 62 | 64 |

66 | 67 | 69 | 72 | 73 |

78 | 81 | 84 | 85 | 86 |

89 | 91 | 93 | 95 | 100 |

75, 4 | 81, 9 | 5, 51 | 13, 91 | 98, 83 |

16, 8 | 33, 9 | 3, 31 | 52, 24 | 25, 17 |

85, 7 | 71, 11 | 64, 4 | 5, 65 | 3, 63 |

55, 29 | 6, 74 | 3, 87 | 63, 67 | 94, 70 |

8, 92 | 2, 8 | 33, 3 | 7, 31 | 81, 25 |

26, 2 | 24, 35 | 83, 11 | 14, 47 | 50, 2 |

75, 27 | 19, 73 | 28, 13 | 2, 47 | 8, 64 |

28, 2 | 7, 63 | 10, 3 | 6, 25 | 16, 54 |

3, 81 | 55, 3 | 25, 82 | 23, 80 | 53, 2 |

60, 54 | 22, 96 | 59, 3 | 34, 26 | 15, 93 |

15, 11 | 94, 7 | 92, 56 | 8, 32 | 8, 16 |

5, 9 | 3, 7 | 83, 77 | 70, 15 | 6, 66 |

7, 67 | 73, 33 | 59, 14 | 10, 80 | 21, 71 |

42, 62 | 63, 43 | 64, 44 | 75, 95 | 76, 26 |

37, 57 | 68, 58 | 79, 59 | 84, 56 | 66, 78 |

28, 98 | 96, 48 | 90, 45 | 87, 8, 52 | 66, 93, 51 |

7, 51, 23 | 10, 55, 58 | 29, 62, 98 | 32, 4, 16 | 33, 3, 65 |

31, 3, 1 | 50, 76, 28 | 61, 9, 45 | 85, 19, 91 | 5, 77, 89 |

33, 5, 19 | 10, 74, 22 | 63, 27, 81 | 67, 99, 15 | 16, 82, 28 |

20, 32, 92 | 100, 1, 16 | 65, 26, 2 | 48, 99, 3 | 38, 18, 3 |

23, 62, 98 | 98, 18, 50 | 75, 48, 3 | 73, 33, 3 | 49, 76, 13 |

47, 2, 74 | 1, 8, 27 | 9, 65, 28 | 10, 66, 29 | 53, 47, 41 |

98, 54, 18 | 4, 16, 12 | 69, 19, 85 | 81, 87, 27 | 6, 34, 82 |

51, 39, 87 | 15, 39, 35 | 52, 22, 94 | 92, 68, 20 | 77, 17, 8 |

90, 100, 5 | 16, 26, 96 | 82, 67, 72 | 48, 63, 53 | 64, 94, 84 |

10, 50, 100 | 71, 31, 21 | 62, 32, 22 | 63, 23, 43 | 84, 94, 34 |

15, 85, 45 | 46, 76, 66 | 27, 37, 67 | 78, 48, 88 | 59, 89, 79 |

12, 84, 8 | 48, 30, 78 | 14, 98, 91 | 8, 80, 48 | 18, 45, 72 |

3, 33, 99 | 65, 47, 55, 13 | 39, 60, 12, 45 | 73, 35, 53, 59 | 31, 19, 49, 100 |

11, 41, 38, 92 | 16, 32, 2, 8 | 33, 17, 5, 9 | 31, 3, 1, 15 | 20, 8, 84, 100 |

29, 77, 37, 17 | 91, 25, 61, 31 | 17, 11, 65, 53 | 29, 9, 3, 31 | 34, 86, 30, 94 |

63, 27, 99, 33 | 99, 59, 31, 67 | 70, 10, 88, 40 | 68, 14, 8, 26 | 4, 64, 25, 81 |

17, 50, 5, 82 | 8, 24, 48, 63 | 6, 51, 66, 27 | 79, 47, 62, 98 | 18, 8, 72, 98 |

75, 48, 12, 3 | 9, 33, 73, 99 | 49, 28, 76, 13 | 11, 26, 74, 2 | 67, 31, 17, 61 |

98, 24, 30, 3 | 4, 66, 78, 96 | 43, 4, 91, 9 | 15, 17, 21, 5 | 22, 62, 26, 82 |

69, 9, 39, 21 | 23, 35, 95, 83 | 70, 88, 7, 22 | 92, 14, 20, 5 | 68, 20, 35, 92 |

55, 60, 25, 100 | 71, 61, 26, 81 | 92, 82, 27, 42 | 83, 38, 58, 68 | 59, 34, 14, 89 |

80, 70, 90, 100 | 81, 71, 21, 31 | 72, 92, 82, 42 | 93, 43, 83, 53 | 54, 14, 94, 34 |

55, 15, 75, 45 | 96, 86, 36, 66 | 97, 77, 87, 47 | 48, 78, 38, 98 | 59, 49, 99, 69 |

76, 44, 8, 48 | 96, 90, 6, 42 | 42, 35, 21, 84 | 64, 96, 24, 56 | 36, 45, 54, 81 |

While we generated sets like {16, 8} from underlying concepts (e.g. ‘powers of 2’), analysis of our data should primarily be interested in what generalizations the set {16, 8} leads subjects to,

These 255 sets were divided into 17 groups of 15 sets each, and each participant was assigned one of these groups. For each set presented to a participant, 30 targets (in 1…100) were shown, randomly selected without replacement, so that each participant made 450 decisions. Altogether, at least nine two-alternative forced-choice ratings were collected for each number from 1 to 100. Due to a small randomization error, targets for each set were slightly non-independent relative to one another, with no obvious effect on the experiment.
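The arithmetic of this design can be checked with a short Python sketch. It assumes that sets were assigned to groups by random shuffling, which the text does not specify, and it draws targets independently for each set, so it does not reproduce the randomization error noted above. All names are hypothetical.

```python
import random

N_SETS, N_GROUPS, SETS_PER_GROUP, TARGETS_PER_SET = 255, 17, 15, 30


def make_groups(set_ids, rng):
    """Partition the set ids into consecutive groups after shuffling
    (random assignment to groups is an assumption)."""
    ids = list(set_ids)
    rng.shuffle(ids)
    return [ids[i:i + SETS_PER_GROUP]
            for i in range(0, len(ids), SETS_PER_GROUP)]


def trials_for_participant(group, rng):
    """For each set in the group, draw 30 distinct targets from 1..100."""
    return [(set_id, target)
            for set_id in group
            for target in rng.sample(range(1, 101), TARGETS_PER_SET)]


rng = random.Random(0)  # arbitrary seed
groups = make_groups(range(N_SETS), rng)
trials = trials_for_participant(groups[0], rng)

# With 606 participants spread over 17 groups and 30 of 100 targets shown
# per set, each (set, target) pair is expected to receive roughly
# 606 / 17 * 30 / 100 responses, i.e. about 10.7 before trimming.
expected_per_pair = 606 / N_GROUPS * TARGETS_PER_SET / 100
```

Each participant's trial list has 15 × 30 = 450 entries, and 606 participants × 450 decisions gives the 272,700 responses collected before trimming.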

After seeing the instructions, subjects provided 30 ratings for each of 15 different sets. At the conclusion of these forced-choice trials, the subject filled out a brief questionnaire. Basic demographics were collected: age, gender, first language, ZIP (postal) code, and highest level of education. Subjects were also asked to describe in English 5 sets randomly selected from the 15 they were shown during the experiment.

HITs were rejected based on qualitative criteria, under the determination that the task was not properly attempted. Many rejected HITs were exceedingly fast, including a number of HITs completed in less than 5 minutes. Other rejected HITs included significant repetitive answering patterns, such as many ‘yes’ responses followed by many ‘no’s, or alternating ‘yes’/‘no’. A small fraction of reaction times were corrupted during data recording; these were replaced with NA in the dataset.

The dataset contains a single row for each response collected, with columns for rating (1 for ‘yes,’ 0 for ‘no’), set (“set”), target, subject id (“id”), trial number for subject (“trial”), reaction time (“rt”), subject demographics, number of HITs for this set and target pairing (“hits”), as well as probability of responding yes (“p”), entropy of responses (“H”), and a typicality measure. Probability was calculated as the number of ‘yes’ responses for a given set and target pairing, divided by the total number of responses for this pairing. Entropy was calculated as: – (p*log(p) + (1 – p)*log(1 – p)). Though not included, we have also used a response typicality metric to identify subjects whose responses strayed far from the average (e.g. due to low task effort) – calculated as log(p) for ‘yes’ ratings, and log(1– p) for ‘no’ ratings.
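The derived quantities can be recomputed from the raw ratings. The sketch below uses only the Python standard library and operates on (set, target, rating) triples. It assumes natural logarithms and the convention 0 · log(0) = 0, neither of which is stated in the text, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def entropy(p):
    """Binary entropy -(p*log(p) + (1-p)*log(1-p)), using natural logs and
    treating 0*log(0) as 0 (both conventions are assumptions)."""
    return -sum(q * math.log(q) for q in (p, 1 - p) if q > 0)


def summarize(responses):
    """responses: iterable of (set_label, target, rating), rating in {0, 1}.
    Returns {(set_label, target): {"hits": n, "p": p, "H": entropy}}."""
    counts = defaultdict(lambda: [0, 0])  # (set, target) -> [yes, total]
    for set_label, target, rating in responses:
        counts[(set_label, target)][0] += rating
        counts[(set_label, target)][1] += 1
    return {key: {"hits": total, "p": yes / total, "H": entropy(yes / total)}
            for key, (yes, total) in counts.items()}


def typicality(rating, p):
    """log(p) for a 'yes' rating and log(1 - p) for a 'no' rating. This is
    finite whenever p was computed from data that include the rating."""
    return math.log(p if rating == 1 else 1 - p)


# Toy data, not taken from the dataset.
toy = [("16_8", 4, 1), ("16_8", 4, 1), ("16_8", 4, 0), ("16_8", 5, 0)]
stats = summarize(toy)
```

Averaging `typicality` over a subject's responses gives one number per subject; strongly negative values indicate responses that disagree with the majority.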

To illustrate our collected data, Figure

Human predictive distributions for 3 concepts (rows), selectively highlighting target numbers corresponding to ‘multiples of 3’ (left column) and ‘ends in 3’ (right column).

While we intended our experiment to present subjects with

All data collected in this experiment were anonymized prior to public release. This work was approved by the Research Subjects Review Board at the University of Rochester as part of a protocol for experimental data collection on Mechanical Turk.

Our dataset consists of one primary file,

To prevent ambiguity in file loading, commas have been replaced with underscores in the ‘set’ column of
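When working with the data, entries in the ‘set’ column can be converted back into lists of integers. The following Python sketch assumes that a set such as {16, 8} is stored as the string `16_8` and that single-item sets may be read as bare integers by a CSV parser. Both the exact encoding and the helper name `parse_set` are assumptions.

```python
def parse_set(label):
    """Convert an underscore-delimited set label, e.g. "16_8", into a list
    of integers. Accepts ints too, since a CSV reader may parse
    single-item sets as numbers."""
    return [int(part) for part in str(label).split("_")]


pair = parse_set("16_8")
single = parse_set(7)
```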

Raw data file.

All data are in comma-delimited CSV format; scripts are provided in R and Python.

Eric Bigelow.

English.

Creative Commons Attribution (CC-BY).

NA.

This data will be useful to basic research on human conceptual representation and generalization. Because of the simplicity of the task, we expect that it will provide a compelling teaching example, showing ways in which structured concepts influence generalization in a domain that is both simple and intuitive. Work using this data to infer the priors on human concepts is ongoing as part of LOTlib, a Python library for Language of Thought models [

The authors declare that they have no competing interests.