(1) Background

During infancy and toddlerhood, when children develop the ability to represent and interact with objects, they show a variety of action errors (Jiang & Rosengren, 2018). Scale errors, in which young children attempt to act on miniature-sized artifacts in an impossible way (e.g., trying to get into a palm-sized car; DeLoache, Uttal, & Rosengren, 2004), are one such intriguing phenomenon. Scale errors are observed in children aged 12 to 36 months (Ware, Uttal, & DeLoache, 2010), with a peak at approximately 18 to 24 months of age (DeLoache et al., 2004; Grzyb, Cangelosi, Cattani, & Floccia, 2019). The robustness of this phenomenon has been verified in laboratory settings, classrooms (Rosengren, Carmichael, Schein, Anderson, & Gutiérrez, 2009; Rosengren, Schein, & Gutiérrez, 2010), and at home (Rosengren, Gutiérrez, Anderson, & Schein, 2009; Ware et al., 2010).

Several developing cognitive abilities have been considered to play a role in producing scale errors, including immature inhibitory control (DeLoache, LoBue, Vanderborght, & Chiong, 2013; Ishibashi & Moriguchi, 2021), understanding of object size or of one’s own body size (Brownell, Zerwas, & Ramani, 2007; Grzyb, Cangelosi, Cattani, & Floccia, 2017; Ishibashi & Moriguchi, 2017; Ware, Uttal, Wetter, & DeLoache, 2006), and integration of multiple object features (Ishibashi, Twomey, Westermann, & Uehara, 2021). However, some developmental scientists have argued that a strong bias toward specific object features, such as object shape (Grzyb, Cangelosi, et al., 2019) or object function (Casler, Eshleman, Greene, & Terziyan, 2011), leads to scale errors because such a bias weakens children’s attention to other object features such as size.

Recently, the relationships between scale errors and children’s lexical development have attracted considerable attention (Grzyb, Cangelosi, et al., 2019; Grzyb, Cattani, Cangelosi, & Floccia, 2014; Grzyb, Nagai, et al., 2019; Hunley & Hahn, 2016; Oláh, Elekes, Pető, Peres, & Király, 2016). Lexical development organizes the semantic conceptual system, and this system is considered to influence the way children perceive and interact with objects. Furthermore, lexical development is closely related to increased attention to object shape (Gershkoff-Stowe & Smith, 2004; Jones, 2003; Landau, Smith, & Jones, 1988; Smith et al., 2002) or function (Kemler Nelson, Russell, Duke, & Jones, 2000; Kobayashi, 1997; Zuniga-Montanez et al., 2021). For instance, training attentional bias to object shape increased toddlers’ productive vocabulary size of object words (Smith et al., 2002). In the age range of 18 to 30 months, toddlers whose noun vocabulary size was relatively larger were more likely to produce scale errors (Grzyb, Cangelosi, et al., 2019; Grzyb et al., 2014).

Although the target age range was older than that in usual scale error tasks (18 to 30 months) and the task differed from the ordinary body-based scale error task, one study directly investigated the effect of semantic information, induced by lexical development, on scale errors (Hunley & Hahn, 2016). This study examined whether the activation of children’s semantic representations by providing specific object labels increased the frequency of tool-based scale errors. The results suggested that object labeling strengthened children’s semantic conceptual system comprising shape, category, and function. However, owing to the different age ranges and tasks in Hunley and Hahn (2016), the link between scale errors and children’s rapid change in semantic representations remained unclear. If semantic information truly influences scale error production, the effect of object labeling on scale errors should also be observed in 18- to 30-month-old toddlers since such labels are thought to activate semantic representations such as object shape and function.

Therefore, based on similar studies using body-based scale error tasks (DeLoache et al., 2004; Grzyb, Cangelosi et al., 2019), we performed a study to address the labeling effect on toddlers’ scale errors (Hagihara, Ishibashi, Moriguchi, & Shinya, 2022). To explore how labeling effects were related to toddlers’ development, we used their age in months and their productive vocabulary size of nouns, verbs, and adjectives as candidate developmental indices, as seen in previous studies investigating the relationships between scale errors and the noun or adjective vocabulary size (Grzyb, Cangelosi, et al., 2019; Grzyb et al., 2014). The primary purpose of this study was to investigate whether object labeling increased scale errors and, if so, how it was related to toddlers’ chronological or lexical development. To the best of our knowledge, this is the first experimental in-lab study that collected scale error data repeatedly (i.e., twice from the same participant). Thus, we believe that the data will be especially helpful to researchers who aim to run a scale error study in a within-participant design as well as those who perform a meta-analysis of scale errors.

(2) Methods

2.1 Study design

A within-participant design with two conditions was adopted: in the noun condition, children heard specific object words during the task (e.g., “chair”), whereas in the pronoun condition, children heard only general pronouns (e.g., “this”). Participants were assigned to one of these conditions at their first laboratory visit. They then engaged in the same task in the other condition approximately 2 weeks later. The order of the conditions was counterbalanced.

The number of scale errors produced was treated as the dependent variable. Two developmental indices were used as independent variables: participants’ age in months in the first session and productive vocabulary size of nouns, verbs, and adjectives, calculated from the Words and Grammar form of the Japanese MacArthur Communicative Development Inventory (J-MCDI; Ogura & Watamaki, 2004) completed in the first session. The other primary independent variables were the condition (noun or pronoun, as a within-participant variable) and the interactions between the developmental indices and condition. As covariates, we included the study order (i.e., first or second session) and the effortful control score from the Japanese version of the Early Childhood Behavior Questionnaire (ECBQ; Putnam, Gartstein, & Rothbart, 2006; Sukigara, Nakagawa, & Mizuno, 2015), completed in the second session. Effortful control in the ECBQ is a temperament factor related to participants’ executive function (Shinya et al., 2022) and is operationally defined by primary loadings for inhibitory control, attention shifting, low-intensity pleasure, cuddliness, and attention focusing (Putnam et al., 2006).

2.2 Time of data collection

The data collection was performed from July 2019 to March 2020.

2.3 Location of data collection

All participants took part in the study in a laboratory room at the Center for Early Childhood Development, Education, and Policy Research (CEDEP), Graduate School of Education, The University of Tokyo, Japan. The room size was 7.0 m × 3.4 m × 3.4 m. The room was divided by tall curtains and standing partitions into three spaces: a study space for the scale error task (3.5 m × 2.4 m × 3.4 m), a waiting space used before the task and during the interval (3.5 m × 3.4 m × 3.4 m), and a storage space for the study instruments (3.5 m × 1.0 m × 3.4 m).

2.4 Sampling, sample, and data collection

We included 72 typically developing Japanese toddlers (43 girls; M = 23.1 months, SD = 3.9; range = 18–30 months) in the final analyses. We determined the sample size based on previous in-lab experimental studies (n = 81 in Grzyb et al., 2014; n = 76 in Ishibashi & Uehara, 2020). The sample size was also equivalent to that of a more recent study (n = 72 in Rivière et al., 2020). An additional 16 participants took part but were excluded from the final analyses due to fussiness (n = 2), declining to participate in the second session (n = 9), or experimental error (n = 5). On average, the interval between the first and second sessions was 13.1 days (SD = 6.0).

All participants included in the final analyses lived in the suburbs of Tokyo (Tokyo-to: n = 61 [84.7%]; Saitama-ken: n = 4 [5.6%]; Chiba-ken: n = 4 [5.6%]; Kanagawa-ken: n = 2 [2.8%]; Gunma-ken: n = 1 [1.4%]). Most of them were considered middle class or above based on census data from their area of residence (e-Stat, 2021). In the case of Tokyo-to, the Gini coefficient of equivalized yearly disposable income was relatively small (0.30), and the average annual household income of the participants was around 6.2 million yen, which was higher than the average for the whole of Japan (5.5 million yen).

2.5 Materials/Survey instruments

We used a slide, car, desk set (chair and table), cap, and shoes for the study. Each object came in a child-sized and a miniature-sized version, except for the shoes, which were miniature-sized only, following the procedure in previous studies (Ishibashi & Moriguchi, 2017; Ishibashi & Uehara, 2020). The sizes of the objects are shown in Table 1.

Table 1

The dimensions of the objects used in the study.


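| Object | Child-sized | Miniature-sized |
| --- | --- | --- |
| Slide | 46 × 110 × 72 | 5 × 21 × 14 |
| Car | 75 × 42 × 85 | 6 × 3 × 7 |
| Desk | 61 × 41 × 48 | 10 × 18 × 15 |
| Chair | 36 × 29 × 32 | 8 × 10 × 10 |
| Cap | 25 × 18 × 15 | 5 × 2 × 2 |
| Shoes | N/A | 4 × 7 × 2 |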

Although parents were present in the experimental session, they were instructed not to talk about the objects, their specific use, or their size. In each session, a toddler played freely with the child-sized objects for 5 minutes. The child and parent were then asked to leave the room temporarily while the experimenter replaced the child-sized objects with the miniature-sized objects. The child and parent re-entered the room, and the child then played with the miniature-sized objects for 5 minutes. Throughout the task, the experimenter interacted with toddlers to maintain their attention and interest in the objects. The experimenter provided different verbal cues depending on the condition: utterances with object labels in the noun condition, and pronouns without object labels in the pronoun condition.

We assigned the participants to either the noun or the pronoun condition for the first session, and to the other condition for the second session approximately 2 weeks later. The condition order was counterbalanced. Before the experiment in the noun condition, the parents were asked how they or their children referred to each of the objects used in the task in their everyday lives. The experimenter used the object labels reported by the parents in the noun condition and general pronouns in the pronoun condition. For instance, the experimenter said “Mite, isu dayo! Hora, isu!” [Look at the chair! It’s the chair!] in the noun condition and “Kore o mite! Hora, kore!” [Look at this! This is it!] in the pronoun condition. If a participant did not seem to be interested in any of the objects, the experimenter encouraged the child to interact with them with “Isu de asobo!” [Let’s play with the chair!] in the noun condition or “Kore wa do?” [How about this?] in the pronoun condition. Throughout the scale error tasks for both conditions, the experimenter avoided using specific verbs referring to object use (e.g., sit down, put on, or get in) and used only general verbs (e.g., play, do, or see) in order to examine the labeling effect of object words, not action words, on scale error production. The overall number and timing of labeling varied across participants because the experimenter interacted with toddlers freely and naturally, as if engaged in spontaneous toy play. However, the experimenter kept the number of labels referring to the objects (i.e., object labels in the noun condition and pronouns in the pronoun condition) equivalent across sessions.

The parents were asked to complete the Words and Grammar form of the J-MCDI (Ogura & Watamaki, 2004) during the first session to assess the participant’s productive vocabulary size of nouns, verbs, and adjectives. The questionnaire included a maximum of 281 items for object words (i.e., nouns), 103 items for action words (i.e., verbs), and 63 items for descriptive words (i.e., adjectives). We classified items according to these word types based on previous studies (Caselli et al., 1999; Ogura et al., 2016; see Table 2).

Table 2

Category classification of the vocabulary items in this study.


| Word class (max. items) | J-MCDI sub-categories (max. items) |
| --- | --- |
| Nouns (281) | Animals (43), Vehicles (14), Toys (18), Food and Drink (68), Clothing (28), Body parts (27), Furniture and Rooms (33), Small household objects (50) |
| Verbs (103) | Action words (103) |
| Adjectives (63) | Descriptive words (63) |

Note: Values in parentheses indicate the maximum number of items in the Words and Grammar form of the J-MCDI.

The parents also completed the Japanese version of the ECBQ (Putnam et al., 2006; Sukigara et al., 2015) during the second session. Because inhibitory control is considered to be related to scale errors (DeLoache et al., 2013; Ishibashi & Moriguchi, 2021), we included the effortful control score as a covariate. Note that the participants were equipped with physiological sensors when they performed the scale error task during the second session, but we do not report their physiological responses here since we collected those data for a different objective not included in this study.

2.6 Quality Control

During the experiments, a second experimenter counted the number of labels referring to the objects uttered by the primary experimenter (i.e., the number of nouns in the noun condition and the number of pronouns in the pronoun condition). No significant difference in the number of labels was found between the two conditions (t(71) = 1.73, p = 0.088; the noun condition, M = 50.2, SD = 10.7; the pronoun condition, M = 46.3, SD = 16.0).
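The 71 degrees of freedom reported above, with 72 participants, are consistent with a paired comparison of each participant's label counts across the two conditions. The following is a minimal sketch of that computation, assuming a paired t-test (the test is not named explicitly in the text). The function name `paired_t` and the counts are hypothetical and are not taken from the dataset.

```python
import math
from statistics import mean, stdev


def paired_t(noun_counts, pronoun_counts):
    """Return (t, df) for a paired t-test on per-participant label counts."""
    if len(noun_counts) != len(pronoun_counts):
        raise ValueError("each participant needs one count per condition")
    diffs = [n - p for n, p in zip(noun_counts, pronoun_counts)]
    n = len(diffs)
    # t is the mean within-participant difference divided by its standard error
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical label counts for five participants (not the study's data)
noun = [52, 48, 61, 45, 50]
pronoun = [47, 50, 55, 40, 49]
t_stat, df = paired_t(noun, pronoun)
print(f"t({df}) = {t_stat:.2f}")
```

With the real data, the two lists would hold the `labeling` values of each participant in the noun and pronoun sessions, in the same participant order.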

The play sessions were recorded by three video cameras set at different positions in the room, and children’s behavior regarding scale error production was coded later. We coded participants’ interactions with the miniature-sized objects according to the coding scheme of DeLoache et al. (2004). There were three criteria: (a) whether the participant tried to interact with the miniature-sized objects as they had interacted with the child-sized objects; (b) whether the participant’s hands or mouth (for example) touched the objects’ proper part(s); (c) to what extent the participant’s interaction with the miniature-sized objects was serious (1: definitely serious, 2: probably serious, 3: not clear, 4: probably pretending, 5: definitely pretending). Criterion (c) was satisfied when the action was scored as 1 or 2. We regarded a participant’s behavior as a scale error only when all three criteria were satisfied. If repetitive attempts with a single object were observed (e.g., trying to get in the miniature-sized car a few times in a row), we counted them as a single scale error. The first coder annotated all the data, whereas the second coder annotated 25% of them. The inter-rater reliability was sufficiently high (κ = .92) according to the criteria of Landis and Koch (1977). The number of scale errors per session was used as a representative variable for scale error production.
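The coding rule above can be sketched in a few lines. This is an illustration only, not the coders' actual procedure: the `CodedAction` structure and function names are hypothetical, and "in a row" is read here as consecutive qualifying actions on the same object, with any intervening action ending the run. That reading is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CodedAction:
    obj: str                    # miniature object acted on, e.g. "car"
    same_as_child_sized: bool   # criterion (a)
    touched_proper_part: bool   # criterion (b)
    seriousness: int            # criterion (c): 1 = definitely serious ... 5 = definitely pretending


def is_scale_error(action):
    """All three criteria must hold; criterion (c) requires a score of 1 or 2."""
    return (action.same_as_child_sized
            and action.touched_proper_part
            and action.seriousness in (1, 2))


def count_scale_errors(actions):
    """Count scale errors in chronological order, collapsing repeated
    consecutive attempts on the same object into a single error."""
    count = 0
    last_error_obj = None
    for action in actions:
        if is_scale_error(action):
            if action.obj != last_error_obj:
                count += 1
            last_error_obj = action.obj
        else:
            last_error_obj = None  # assumption: any other action ends the run
    return count


# Hypothetical session: two attempts on the car in a row, one on the chair,
# a pretend action with the car, then another serious attempt on the car.
session = [
    CodedAction("car", True, True, 1),
    CodedAction("car", True, True, 2),
    CodedAction("chair", True, True, 2),
    CodedAction("car", True, True, 4),
    CodedAction("car", True, True, 1),
]
n_errors = count_scale_errors(session)
print(n_errors)
```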

As developmental indices, we used the participants’ age in months during the first session and the productive vocabulary sizes of nouns, verbs, and adjectives, which were calculated by summing the checked items from the J-MCDI. We also included the effortful control score from the ECBQ as a covariate to take individual differences in inhibitory control into account, as in Shinya et al. (2022). This score was calculated by averaging the five sub-factors: inhibitory control, attention shifting, low-intensity pleasure, cuddliness, and attention focusing (Putnam et al., 2006). One participant’s parent did not have enough time to complete the ECBQ, resulting in a missing value. This was handled using the hot deck imputation method (Myers, 2011). That is, the missing value was replaced with the value from a similar participant within the same dataset, matched on age in months, gender, and total vocabulary size.
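As an illustration of the hot deck idea described above, the sketch below fills a missing effortful control score with the value from the most similar complete record. How "similar" was operationalized is not stated in the text, so the rule used here is an assumption: same gender first, then smallest age difference, then smallest vocabulary difference. The function name and records are hypothetical; the keys follow the column names in `sedata.csv`.

```python
def hot_deck_impute(records):
    """Return copies of the records with an 'ec' field: the raw score
    ('ecrow') where available, otherwise a similar participant's score."""
    donors = [r for r in records if r["ecrow"] is not None]
    if not donors:
        raise ValueError("no complete records available as donors")
    completed = []
    for rec in records:
        if rec["ecrow"] is not None:
            completed.append({**rec, "ec": rec["ecrow"]})
            continue
        # Assumed similarity rule: restrict to the same gender when possible,
        # then prefer the closest age, breaking ties by closest vocabulary size.
        pool = [d for d in donors if d["gender"] == rec["gender"]] or donors
        donor = min(pool, key=lambda d: (abs(d["age1st"] - rec["age1st"]),
                                         abs(d["vcball"] - rec["vcball"])))
        completed.append({**rec, "ec": donor["ecrow"]})
    return completed


# Hypothetical records (not the study's data)
records = [
    {"id": "p01", "gender": "f", "age1st": 20, "vcball": 150, "ecrow": 4.8},
    {"id": "p02", "gender": "f", "age1st": 24, "vcball": 300, "ecrow": 5.2},
    {"id": "p03", "gender": "m", "age1st": 20, "vcball": 140, "ecrow": 4.1},
    {"id": "p04", "gender": "f", "age1st": 20, "vcball": 160, "ecrow": None},
]
completed = hot_deck_impute(records)
print(completed[3]["ec"])
```

The input records are left unchanged, which mirrors the dataset keeping both the raw score (`ecrow`) and the imputed score (`ec`).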

2.7 Data anonymization and ethical issues

All parents provided verbal and written consent for their children to take part in the study. This study was conducted with the approval of the Ethical Committee of the Life Science Research Ethics and Safety of the University of Tokyo. The participants’ names were replaced with anonymized IDs for each data file. The video recordings of the scale error study were stored and viewed only at the Center for Early Childhood Development, Education, and Policy Research (CEDEP), Graduate School of Education, The University of Tokyo.

2.8 Existing use of data

The study using this dataset has been published in the Journal of Experimental Child Psychology (Hagihara et al., 2022).

(3) Dataset description and access

3.1 Repository location

The dataset and analysis codes for Hagihara et al. (2022) are available at https://doi.org/10.17605/OSF.IO/KR93J.

3.2 Object/file name

In the OSF repository, there are six files and one folder, excluding the Rproj file. A brief description of each file/folder is shown in Table 3.

Table 3

Brief description of what is in the repository.


| File/folder | Description |
| --- | --- |
| sedata.csv | Dataset used in Hagihara et al. (2022). |
| sedata_detailed.csv | Additional detailed dataset. |
| 0_DescriptiveStats.R | R scripts to analyze the sample characteristics and experimental manipulations. |
| 1_WithinDesign_Discrete.R | R scripts to perform the primary analysis. |
| 2_BetweenDesign_Discrete.R | R scripts for the Supplementary analysis 1. |
| 3_WithinDesign_Continuous.R | R scripts for the Supplementary analysis 2. |
| Models (folder) | Stan codes to perform Bayesian statistics. |

3.3 Data type

The main dataset file, “sedata.csv,” contains processed data for each participant per session. The information contained in this file is shown in Table 4.

Table 4

Description of the dataset “sedata.csv”.


| Variable | Type | Description |
| --- | --- | --- |
| id | Qualitative | Participant ID. |
| time | Qualitative | Time when data were collected (i.e., first or second session). |
| age1st | Quantitative (integer) | Participants’ age in months in the first session. |
| ageexp | Quantitative (integer) | Participants’ age in months when they performed the task. |
| ageD1st | Quantitative (integer) | Participants’ age in days in the first session. |
| ageDexp | Quantitative (integer) | Participants’ age in days when they performed the task. |
| gender | Qualitative | Participants’ gender (“f” = female; “m” = male). |
| cond | Qualitative | Condition of the task (“noun” = noun condition; “wonoun” = pronoun condition). |
| vcball | Quantitative (integer) | Participants’ total vocabulary size. |
| noun | Quantitative (integer) | Vocabulary size of common nouns. |
| verb | Quantitative (integer) | Vocabulary size of verbs. |
| adj | Quantitative (integer) | Vocabulary size of adjectives. |
| se_occ | Quantitative (binary) | Whether scale errors were observed or not (“1” = a participant produced at least 1 scale error). |
| se_sum | Quantitative (integer) | The number of scale errors observed. |
| labeling | Quantitative (integer) | The number of labels referring to the objects uttered by the experimenter. |
| ecrow | Quantitative (continuous) | Raw score of the effortful control (with missing values). |
| ec | Quantitative (continuous) | Effortful control score with hot deck imputation. |
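Because “sedata.csv” holds one row per participant per session, a within-participant comparison first requires pairing each participant's two rows. The sketch below shows one way to do this with the Python standard library. It uses the documented column names `id`, `cond`, and `se_sum` and the documented `cond` values; the function name, the participant IDs, and the counts are hypothetical.

```python
import csv
import io


def scale_errors_by_condition(csv_file):
    """Map each participant id to {"noun": se_sum, "wonoun": se_sum}."""
    wide = {}
    for row in csv.DictReader(csv_file):
        wide.setdefault(row["id"], {})[row["cond"]] = int(row["se_sum"])
    return wide


# Hypothetical rows standing in for the real file (other columns omitted)
sample = io.StringIO(
    "id,cond,se_sum\n"
    "p01,noun,2\n"
    "p01,wonoun,0\n"
    "p02,wonoun,1\n"
    "p02,noun,1\n"
)
wide = scale_errors_by_condition(sample)
# Within-participant difference: noun condition minus pronoun condition
diffs = {pid: counts["noun"] - counts["wonoun"] for pid, counts in wide.items()}
print(diffs)
```

With the downloaded file, the same function can be called as `scale_errors_by_condition(open("sedata.csv", newline=""))`, assuming the file is in the working directory.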

The other dataset file, “sedata_detailed.csv,” includes additional detailed data on the sub-categories of the J-MCDI, sub-factors of the ECBQ, and additional participant information (Table 5). We provide these additional data to improve the usefulness of our dataset, allowing other researchers to conduct secondary analyses.

Table 5

Description of the dataset “sedata_detailed.csv”.


| Variable(s) | Type | Description |
| --- | --- | --- |
| BabyTalk1, Animals, Vehicles, Toys, FoodAndDrink, Clothing, BodyParts, FurnitureAndRooms, SmallHouseholdObjects, OutsideThings, PlacesToGo, People, GamesAndRoutines, ActionWords, TimeWords, DescriptiveWords, Pronouns, QuestionWords, PositionsAndLocations, Quantifiers, ConnectingWords, BabyTalk2, ConversationalWords, Others | Quantitative (integer) | Vocabulary size of each semantic sub-category of J-MCDI. |
| InhibitoryControl, LowIntensityPleasure, AttentionalFocusing, AttentionalShifting, Cuddliness, ActivityLevel, HighIntensityPleasure, Impulsivity, PositiveAnticipation, Sociability, Sadness, Shyness, Discomfort, Fear, Frustration, MotorActivation, PerceptualSensitivity, Soothability | Quantitative (continuous) | Raw score of sub-factors of the ECBQ. |
| Sibling | Qualitative | Whether a participant had any younger siblings (=1), older siblings (=2), or not (=0). |
| daycare | Quantitative (binary) | Whether a participant was enrolled in a daycare (=1) or not (=0). |
| residence | Qualitative | Which of the following prefectures the participant lived in: Tokyo, Saitama, Chiba, Kanagawa, or Gunma. |

Note: The other columns are identical to those in the data file “sedata.csv.”
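The sub-category columns make it possible to rebuild the noun, verb, and adjective vocabulary sizes following the classification in Table 2. The sketch below assumes that the `noun`, `verb`, and `adj` columns correspond exactly to these sums, which the text implies but does not state column by column. The function name and the example row are hypothetical.

```python
# Sub-category columns classified as nouns in Table 2
NOUN_COLUMNS = [
    "Animals", "Vehicles", "Toys", "FoodAndDrink", "Clothing",
    "BodyParts", "FurnitureAndRooms", "SmallHouseholdObjects",
]


def vocabulary_sizes(row):
    """Recompute noun, verb, and adjective vocabulary sizes from one row
    of sub-category counts (values may be strings, as read from a CSV)."""
    return {
        "noun": sum(int(row[col]) for col in NOUN_COLUMNS),
        "verb": int(row["ActionWords"]),
        "adj": int(row["DescriptiveWords"]),
    }


# Hypothetical row with only the relevant columns (not the study's data)
row = {
    "Animals": "10", "Vehicles": "3", "Toys": "4", "FoodAndDrink": "20",
    "Clothing": "5", "BodyParts": "8", "FurnitureAndRooms": "6",
    "SmallHouseholdObjects": "9", "ActionWords": "12", "DescriptiveWords": "7",
}
sizes = vocabulary_sizes(row)
print(sizes)
```

Comparing these recomputed values against the `noun`, `verb`, and `adj` columns is a quick consistency check before any secondary analysis that regroups the sub-categories.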

3.4 Format names and versions

The datasets are CSV files. To reproduce the analyses reported in Hagihara et al. (2022), the following versions of the R platform and packages are required: RStudio: 1.4.1106, R: 4.0.4, cmdstanr: 0.3.0, ggmcmc:, here: 1.0.1, loo: 2.4.1, rstan: 2.21.2, tidyverse: 1.3.0. For those who are less familiar with cmdstanr, the installation guide can be found at Gabry and Češnovar (n.d.). To use cmdstanr, CmdStan must be installed. Troubleshooting information can be found at Stan Development Team (n.d.; section 1.4).

3.5 Language

American English.

3.6 License

CC-By Attribution 4.0 International.

3.7 Limits to sharing

The dataset does not contain any identifiable personal information. Thus, researchers can use the data for their own purposes, such as performing secondary analyses.

3.8 Publication date

The dataset has been publicly available since 18/07/2022.

3.9 Fair data/Codebook

We used the OSF platform to publish the dataset with a DOI link. This ensures that the dataset is searchable and accessible. We described the structure of the data above so that other researchers can easily reuse them.

(4) Reuse potential

We collected participants’ scale error data with their vocabulary and inhibitory control measures in order to investigate the labeling effect of object words (e.g., chair) on scale error production. To the best of our knowledge, this is the first in-lab experiment on scale errors conducted under a within-participant design, which would be methodologically useful for other researchers if a similar study design is adopted.

Although a previous study had a larger sample size (Grzyb, Cangelosi et al., 2019; n = 125, 18–30 months), the number of participants in our dataset, uniformly distributed across the toddlerhood period, is equivalent to that of many other studies on scale errors (e.g., Grzyb et al., 2014; Ishibashi & Uehara, 2020; Rivière et al., 2020). This dataset can be used to perform a power analysis to estimate the required sample size for future in-lab scale error studies. Furthermore, researchers can estimate the effect size of scale error production in terms of the difference between the two conditions or developmental change, whether they are designing a within-participant or a between-participant study, by using the dataset entirely or partially (e.g., data from the first session only can be regarded as a between-participant experiment).
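As a sketch of the partial use mentioned above, the following code keeps only first-session rows and compares the proportion of participants who produced at least one scale error (`se_occ`) between conditions. The values used to code the `time` column are not documented in this paper, so the `first_session_value` parameter and its default of "1" are assumptions to be checked against the file. The function name and rows are hypothetical.

```python
from collections import defaultdict


def first_session_error_rates(rows, first_session_value="1"):
    """Proportion of participants with at least one scale error per condition,
    using first-session rows only (a between-participant comparison)."""
    totals = defaultdict(int)
    with_error = defaultdict(int)
    for row in rows:
        if str(row["time"]) != first_session_value:
            continue  # drop second-session rows
        totals[row["cond"]] += 1
        with_error[row["cond"]] += int(row["se_occ"])
    return {cond: with_error[cond] / totals[cond] for cond in totals}


# Hypothetical rows (not the study's data); "time" coding is assumed
rows = [
    {"id": "p01", "time": "1", "cond": "noun", "se_occ": "1"},
    {"id": "p01", "time": "2", "cond": "wonoun", "se_occ": "0"},
    {"id": "p02", "time": "1", "cond": "wonoun", "se_occ": "0"},
    {"id": "p02", "time": "2", "cond": "noun", "se_occ": "1"},
    {"id": "p03", "time": "1", "cond": "noun", "se_occ": "0"},
    {"id": "p04", "time": "1", "cond": "wonoun", "se_occ": "0"},
]
rates = first_session_error_rates(rows)
print(rates)
```

Such per-condition proportions could serve as rough inputs to a power analysis, although the published analyses modeled the number of scale errors with Bayesian models rather than simple proportions.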

Another potential use of this dataset would be to merge it with other datasets that include participants’ vocabulary measures and to perform a secondary analysis investigating which developmental indices better predict scale error production. Some previous studies (e.g., Grzyb et al., 2014; Grzyb, Cangelosi et al., 2019) collected children’s vocabulary sizes. Combining those datasets would clarify the relationships between scale errors and lexical development more reliably. Relatedly, the dataset provided here is expected to be used for future meta-analyses because, to the best of our knowledge, no meta-analysis has been performed since the phenomenon was originally reported (DeLoache et al., 2004). Note that, in general, the experimenter’s verbal instructions are not restricted during the scale error task. This suggests that the noun condition in our study more closely resembles other scale error studies (e.g., Grzyb, Cangelosi et al., 2019) than the pronoun condition does. Toddlers were exposed to different linguistic instructions depending on the condition. Therefore, our dataset cannot be treated as the product of a test-retest experiment despite its within-participant design.

In addition, this dataset allows researchers to investigate more fine-grained developmental trends in scale errors because it contains continuous values of participants’ age in months and vocabulary size of nouns, verbs, and adjectives. The previous experimental studies (DeLoache et al., 2004; DeLoache et al., 2013; Grzyb et al., 2014; Grzyb, Cangelosi et al., 2019; Ishibashi & Moriguchi, 2017; Ware et al., 2006) as well as the recently published study using this dataset (Hagihara et al., 2022) generally classified participants into several categorical developmental groups. However, such discretization of inherently continuous variables may lead to statistical concerns (Naggara et al., 2011; Royston et al., 2006; Rucker et al., 2015). Given that some studies reported nonlinear developmental trends in scale error production (e.g., DeLoache et al., 2004) and that these developmental trends are still under debate (Grzyb, Cangelosi et al., 2019), addressing such concerns would be important for scale error research. With a larger sample that includes our dataset, polynomial or spline functions could be used (Royston et al., 2006; Rucker et al., 2015) to express nonlinear trends in scale errors while maintaining the continuous nature of the variables.
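To make the polynomial option concrete, the sketch below fits a quadratic trend to a continuous predictor such as age in months, using ordinary least squares with a centered predictor. This is a deliberate simplification for illustration: scale error counts are discrete, the published analyses used Bayesian models in Stan, and ordinary least squares is swapped in here only to show how a continuous age variable can carry a nonlinear trend. The function name and the data are hypothetical.

```python
from statistics import mean


def fit_quadratic(x, y):
    """Least-squares fit of y = b0 + b1*(x - m) + b2*(x - m)**2,
    where m is the mean of x. Returns (b0, b1, b2, m).
    Requires at least three distinct x values."""
    m = mean(x)
    rows = [[1.0, xi - m, (xi - m) ** 2] for xi in x]
    # Normal equations: (X'X) b = X'y
    a = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    # Gaussian elimination with partial pivoting
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for k in range(col + 1, 3):
            factor = a[k][col] / a[col][col]
            for c in range(col, 3):
                a[k][c] -= factor * a[col][c]
            b[k] -= factor * b[col]
    # Back substitution
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        rest = sum(a[i][j] * coef[j] for j in range(i + 1, 3))
        coef[i] = (b[i] - rest) / a[i][i]
    return coef[0], coef[1], coef[2], m


# Hypothetical mean scale error counts by age in months (inverted-U shape)
ages = [18, 20, 22, 24, 26, 28, 30]
errors = [0.2, 1.2, 1.8, 2.0, 1.8, 1.2, 0.2]
b0, b1, b2, age_mean = fit_quadratic(ages, errors)
# Age at which the fitted curve peaks (meaningful only when b2 < 0)
peak_age = age_mean - b1 / (2 * b2)
print(round(peak_age, 1))
```

Centering age before squaring reduces the correlation between the linear and quadratic terms, which keeps the fit numerically stable. A count model (e.g., Poisson) with the same centered terms would be closer to the nature of the data.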

In conclusion, we believe that our dataset will be useful for future scale error research, and we encourage other researchers to share their datasets to foster more rigorous and reproducible developmental studies.