A foundational postulate in personality research – the “Lexical Hypothesis” – is that all relevant psychological differences between people are marked by linguistic descriptors. A major benefit of this lexical approach is that it helps to constrain the scope of differences between people, as differences that cannot be succinctly described are presumed to be less salient. This logic has led to the development of several widely used assessment models in personality, each based on data collected from self-ratings and ratings of others using subsets of these descriptors [2, 9, 14, 18, 21, 22].
Though the universe of descriptors is finite, there are more trait descriptive adjectives (TDAs) than can be used to collect ratings from any single rater; the most exhaustive lists have counted nearly 18,000 terms. The data described herein were collected to extend research seeking to deal with this multiplicity of terms [1, 2, 4, 7, 15]. Specifically, the aim of work in this tradition is to identify a subset of terms that are among the most familiar and unambiguous for a representative population of English speakers.
The primary challenge of identifying useful subsets stems from the fact that many descriptors are used infrequently in reference to personality. Allport and Odbert (1936), for example, suggested that only about one quarter of the long list of terms they cataloged from the unabridged version of Webster’s New International Dictionary was suitable for use as personality trait names. The remaining 13,500 terms were deemed beyond their operationalization of personality (see the original source for more detail).
The infrequent use of many terms is not only a matter of scope, however, as a substantial proportion of terms are highly obscure. Some have fallen out of everyday use; others are rarely used outside of specific contexts. One consequence of obscurity is that some of the terms that are reasonably related to personality can be set aside due to a lack of familiarity among the general population. A second – and less intuitive – consequence is that new terms occasionally enter the lexicon despite having a highly similar meaning to one or more terms that already exist. This implies that the subset of widely-used personality terms evolves over time and contains a non-trivial degree of synonymity.
Prior efforts to reduce the universe of terms into a tractable subset uniformly credit the list of Allport and Odbert (1936) as a starting point. Cattell used human judgments of synonymity to reduce his list, focusing mainly on the first category (about 4,500 terms) plus a “few hundred additional terms” [4, p. 437]. His judgments produced a list of roughly 170 terms (and, for many, a corresponding antonym).
Norman took a more exhaustive approach. He supplemented Allport and Odbert’s list with new terms, then culled terms he deemed obscure, broad, non-psychological in nature, evaluative, or classified as quantifiers of degree (rather than directly descriptive). This produced a list of nearly 2,800 terms. Norman sought to be over-inclusive in his trimming, and this was confirmed by subsequent itemetric analyses based on social desirability ratings as well as self- and other-ratings.
Two important comments about Norman’s analyses are relevant to the current work. First, Norman reported that only 34.5% of the terms were known to all the “bright, literate, university undergraduates” who rated them [15, p. 17]. As only 8.4% of the U.S. population had completed a college degree in 1967, the level of literacy among this group was less common at the time than it is currently; 34.8% of the U.S. population held a college degree in 2020. Aside from gender (roughly half female), no additional demographic information about the raters was provided. As the data were collected from undergraduates at the University of Michigan in the mid-1960s, raters in the sample are likely to have been young White individuals from the midwestern U.S. (especially Michigan) of above average socioeconomic status. In other words, though roughly 2,000 terms were known to 90% of Norman’s raters, the generalizability of this information is uncertain. Second, Norman reported having asked raters to provide definitions for the 200 terms evaluated by each, but he did not report on the accuracy or the degree of ambiguity of these definitions. Thus, Norman’s data regarding self-reported familiarity can only partially inform the question of the suitability of each term for analyses of personality structure.
Goldberg subsequently winnowed Norman’s terms using five procedures. Most dropped terms were (1) obscure (roughly one-third), and/or (2) nouns (232 terms), though a small number were removed due to having (3) extremely high or low ratings of social desirability, or (4) a high dispersion of social desirability ratings (a proxy for ambiguity). An unreported number of additional terms were (5) dropped using “intuitive judgments of suitability” [7, p. 209]. These procedures left 1,657 adjectives from Norman’s list. An additional 13 terms were retained in alternate forms (4 nouns were turned to adjectives), and 40 terms were added, including 38 non-overlapping terms from the Adjective Check List. Goldberg’s final list of 1,710 terms has been highly influential, both in his own subsequent work and in personality structure research conducted by others [3, 20].
The current project aimed to update Goldberg’s list in several ways. First, we sought to add terms that are missing from the 1,710, whether due to oversight or the evolution of descriptors over the last 40 years. Second, we sought to evaluate knowledge of the new and existing terms with a metric that can provide some indication of consensus about the meaning of each term beyond self-reports of familiarity. Third, we sought to collect a modern and more representative sample of participants in terms of age, race/ethnicity, gender, and education. Fourth, we sought to provide an open and accessible database of terms to encourage further use among psychology researchers.
2.1 Study design
The study design involved several distinct steps, including (1) aggregating a trait descriptive adjective set with 2,818 terms; (2) sourcing definitions for all of the terms; (3) creating multiple choice vocabulary questions based on each term-definition pair; (4) designing and completing survey-based data collection on the pool of questions; and (5) developing tools to disseminate results from the analyses of these data and other characteristics of the 2,818 TDA set. Details of each step are given in the sub-sections below.
Step 1: Aggregation of the 2,818 Trait Descriptive Adjectives
Given the centrality of the 1,710 item TDA set derived by Goldberg and Norman, the TDA set reported on in this work is a super-set of those terms. This does not imply, however, that all these terms are necessarily well-suited for subsequent administration in personality structure research. In fact, 483 of these terms are not among the 100,000 most frequently occurring terms in the Corpus of Contemporary American English (COCA), and an additional 242 are outside of the top 50,000. In many instances, this obscurity was driven by the inclusion of prefixes indicating negation (e.g., unstudious) or extremity (e.g., overtrusting, ultrademocratic). Goldberg provides considerable discussion of the over-inclusive nature of these 1,710 and the rationale for retaining terms with various prefixes.
Note that 15 of the 1,710 terms were edited slightly, as shown in Table 1. In 14 of these cases, the spelling was changed based on recommendations from more than one online dictionary. In most cases, these changes reflected the typical progression of spelling revisions over time during lexicalization, from hyphenation to compound words. The remaining case (“satiric”) was a change of form (“satirical”).
Table 1. Columns: PRIOR FORM | CHANGED TO
To evaluate the need to extend beyond the 1,710, we reviewed all the terms dropped by Goldberg from Norman’s list of 2,797 terms. Given our aims, it seemed likely that some of the more obscure terms may have become more widely used over the last few decades. This prompted the re-introduction of 204 terms. Based on the decision to remain consistent with the over-inclusiveness of prior efforts, terms were added if they were deemed potentially personality-relevant and were among the 100,000 most frequent terms in the COCA (137 of these 204 were among the top 50,000 most frequent).
Repeating this same procedure with the remaining Allport and Odbert list prompted the re-introduction of 847 further terms that were among the top 100,000 most frequent terms in the COCA (754 of these were in the top 50,000). It should be acknowledged that many of these terms were likely dropped by Allport and Odbert or Norman despite their familiarity because they were deemed overly evaluative, broad, or too strongly indicative of affective states. Still, retaining such terms at this stage seemed preferable as they could always be dropped later (i.e., prior to structural analyses, possibly on the basis of more rigorous criteria). In addition, we felt that many of these terms belong in even the most exclusive lists – “private”, “modern”, and “academic” (to name a few) are familiar, reasonably specific, and non-evaluative descriptors. See Section 4 on “Re-use Potential” for more discussion of this issue.
Finally, an additional 57 terms were added from a variety of sources. Of these, 7 were shared with the first author in personal correspondence with Gerard Saucier (“appreciated”, “controversial”, “exciting”, “supportive”, “well-adjusted”, “well-known”, and “well-liked”). The remainder were added by the first author following review of lists of new terms added to the Merriam-Webster dictionary from 1968 to 2019, and by reviewing adjectives among the 100,000 most frequently used words in the COCA list that were not already included.
The final list contains 2,818 terms. To summarize, this includes all 1,710 of Goldberg’s terms (with minor revisions listed in Table 1) and overlaps with 1,914 of Norman’s 2,797 terms. Of the 904 additional terms, 847 were also present in the lists of Allport and Odbert. The 57 new terms (i.e., previously unconsidered by Allport and Odbert, Norman, and Goldberg) are shown in Table 2.
Step 2: Sourcing definitions for the 2,818 Trait Descriptive Adjectives
Definitions for each word in the list were obtained from the Oxford Dictionaries website, available under license through Google Search. The Oxford Dictionaries site is maintained by Oxford University Press, which also publishes the Oxford English Dictionary. From the OED website:
“The dictionary content in Oxford Dictionaries focuses on current English and includes modern meanings and uses of words. Where words have more than one meaning, the most important and common meanings in modern English are given first, and less common and more specialist or technical uses are listed below. The OED, on the other hand, is a historical dictionary and it forms a record of all the core words and meanings in English over more than 1,000 years, from Old English to the present day, and including many obsolete and historical terms.”
Definitions for each term were sourced individually (i.e., manually rather than via API) to ensure that the definition used was relevant to personality. In rare cases where more than one definition for a term may have been relevant to personality, the first definition (i.e., the most important and common, per the statement above from OED) was used.
Edits to these definitions were made infrequently in cases where the given definition was lengthy, as we sought to keep all definitions shorter than 100 characters in length (including punctuation and spaces). Similarly, editing and/or the use of definitions from other dictionaries was required for a small number of terms that did not have a definition in Oxford Dictionaries. Without exception, this issue was caused by the inclusion of prefixes of negation or extremity (e.g., “unwilful”, “insuppressible”, “oversuspicious”).
Step 3: Creating multiple choice term-definition vocabulary items
The next step was to create two multiple choice vocabulary questions from each term-definition pair – 5,636 questions in total. Questions were designed such that the definitions were used as stimuli, and respondents were expected to identify the matching word from several options. For each of the 2,818 pairs, 5 of the other 2,817 terms were drawn at random as distractors. All six terms – the 5 distractors and the term that correctly matches the definition – were then used as possible response options, along with two other possibilities: “I don’t know” and “None of these” (note that “None of these” was never the correct response). During item development, the order of presentation for the correct response and the 5 distractors was randomized; the last two options were the same for all items. For example, the following item was developed using the term-definition pair for “spontaneous”:
Free, natural, and unconstrained in behavior.
- I don’t know
- None of these
Though the questions were developed algorithmically, all items were reviewed by each member of the authorship team to identify questions that included one or more close synonyms as a distractor, as this would have reduced the validity of the question for evaluating respondents’ knowledge of the target term. For cases where this issue occurred, the questions were replaced with new randomly generated substitutes and reviewed again. The decision to use two questions for each term-definition pair was made to further reduce the effect of this and similar concerns, as large differences in the proportion of correct responses across the two versions were expected to signal idiosyncratic effects caused by one or more distractors.
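The item-generation procedure described above can be sketched roughly as follows. This is a minimal illustration rather than the authors' actual code: the function names, the dictionary layout, the seed, and the toy vocabulary are all assumptions introduced here, and the manual review for close synonyms is not modeled.

```python
import random

# The two fixed response options that close every item.
FIXED_OPTIONS = ["I don't know", "None of these"]

def make_item(term, definition, all_terms, rng, n_distractors=5):
    """Build one multiple choice item for a term-definition pair (hypothetical helper)."""
    pool = [t for t in all_terms if t != term]
    distractors = rng.sample(pool, n_distractors)  # drawn at random, without replacement
    options = distractors + [term]
    rng.shuffle(options)  # randomize the position of the correct response
    return {
        "stimulus": definition,
        "options": options + FIXED_OPTIONS,  # fixed options always come last
        "answer": term,
    }

def make_item_pair(term, definition, all_terms, rng):
    """Generate two independent versions of the question for one pair."""
    return [make_item(term, definition, all_terms, rng) for _ in range(2)]

# Toy vocabulary for illustration only; not the study materials.
terms = ["spontaneous", "private", "modern", "academic",
         "supportive", "exciting", "controversial", "well-liked"]
rng = random.Random(2020)  # arbitrary seed, for reproducibility
pair = make_item_pair(
    "spontaneous", "Free, natural, and unconstrained in behavior.", terms, rng
)
```

Because the two versions draw distractors independently, a large gap in accuracy between them would point to a problem with a particular distractor rather than with the target term.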
Note that the text of these questions (i.e., the definitions and the 5 random distractor choices associated with each term-definition pair) has not been made openly available online or as part of the dataset described in this project. This was done to maintain their validity for subsequent research. More specifically, if these questions were made publicly available, participants may have the opportunity to study them and even post answers online, thus invalidating the items. Contact the first author to inquire about access to the questions.
Step 4: Survey-based data collection
Data were collected on these 5,636 questions using a cross-sectional, planned missingness design. The aim of this aspect of the project was to evaluate the extent to which the meaning of each term was known among a relatively representative sample of respondents. By relatively representative, we mean in relation to prior efforts to evaluate the familiarity of terms [17, 13]. Terms with higher proportions of correct responses can be considered to be more familiar than terms with lower proportions of correct responses.
To create the survey, two forms (A and B) were used to split the 5,636 questions, with the two versions of each term-definition pair assigned to different forms. Respondents to the survey were administered 75 questions drawn at random from a single form, along with 9 demographic questions (see Section 2.2 below for more information). As such, there was no chance of presenting the same term-definition pair twice within an administration.
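A small sketch may help show why sampling within one form rules out repeated term-definition pairs. It assumes, purely for illustration, that version 1 of every question goes to form A and version 2 to form B; the source does not state how versions were allocated, and the names below are hypothetical.

```python
import random

def build_forms(terms):
    """Split the two versions of each question across forms A and B (assumed allocation)."""
    return {
        "A": [(term, 1) for term in terms],
        "B": [(term, 2) for term in terms],
    }

def draw_administration(forms, form_name, rng, n_questions=75):
    """Sample the questions for one survey administration from a single form."""
    return rng.sample(forms[form_name], n_questions)

# Placeholder labels standing in for the 2,818 TDAs.
terms = [f"term_{i}" for i in range(2818)]
forms = build_forms(terms)

rng = random.Random(1)  # arbitrary seed
administration = draw_administration(forms, "A", rng)

# Each form holds exactly one version per term, so no term can repeat.
terms_seen = {term for term, _version in administration}
```

With 75 of 2,818 questions shown per administration, each respondent sees under 3% of a form, which is what makes this a planned missingness design.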
Participants were recruited through two online crowdsourcing portals (again, see Section 2.2). The study was posted with the title “Trait-Descriptive Adjective Vocabulary,” and the description stated that respondents would be helping to evaluate the familiarity of adjectives used to describe people. Participants who consented to the survey were instructed as follows: “For each question, choose the option that matches the definition given. If you think none of the options match the definition, select the option labeled ‘None of these’. If you don’t know the answer, select the option labeled ‘I don’t know’. Please do not look up the definition!” (emphasis included in the original). Only one question was presented on each page. The survey was set to auto-advance after a response was selected, but participants were allowed to go back to change their answers to earlier questions. No feedback was given at the end of the survey.
We sought to collect a sample large enough that each item would have approximately 30–40 responses. A minimum of 30 was chosen because, at this value, the standard error of the estimated proportion is below .10 for all true values of the proportion. Given the goal of identifying TDAs that are widely familiar, we believed this level of precision to be sufficient.
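The precision claim can be checked with the usual standard error of a sample proportion, sqrt(p(1 - p)/n), which is largest at p = .50. The snippet below is only an illustrative check; the function name is ours.

```python
import math

def se_proportion(p, n):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)

# The standard error peaks at p = 0.5, so this is the worst case for n = 30.
worst_case = se_proportion(0.5, 30)  # about 0.091

# Scan a grid of true proportions to confirm none exceeds the worst case.
grid_max = max(se_proportion(k / 100, 30) for k in range(101))
```

With 40 responses per item, the same worst case falls to roughly .08.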
Step 5: Analyses and database development
Analyses of the survey data collected as described in Step 4 include descriptive statistics about the sample and each of the 5,636 questions. This also included aggregation across both forms of each term-definition pair. The analytic code and output are available online at https://pie-lab.github.io/tda/. This resource also provides a database of the 2,818 TDAs that can be filtered and searched according to several criteria. These include the sample size and mean proportion of correct responses to the vocabulary questions, the frequency of each term’s presence in books indexed by Google, and the inclusion/exclusion of the term in other influential subsets of TDAs. The other subsets of TDAs include Goldberg’s (1982) 1,710 terms, the 100 terms in Goldberg’s Big Five Factor Markers, and the subset of 435 terms used in validation work on the Big Five by Saucier and Goldberg. Note that the database does not reflect inclusion/exclusion in the lists by Norman or Allport and Odbert, as this list is only partially overlapping with those lists. Similarly, it does not show the frequency of each term indexed in the COCA as those data are proprietary.
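As a rough illustration of the aggregation step, the sketch below pools scored responses across both forms for each term and also reports the gap between forms. The row layout `(term, form, correct)` and the choice to pool responses rather than average the two form-level proportions are assumptions made here; the published analytic code at the URL above is the authoritative version.

```python
from collections import defaultdict

def summarize(responses):
    """Aggregate scored responses into per-term statistics (hypothetical layout)."""
    # counts[term][form] = [number correct, number of responses]
    counts = defaultdict(lambda: {"A": [0, 0], "B": [0, 0]})
    for term, form, correct in responses:
        cell = counts[term][form]
        cell[0] += int(correct)
        cell[1] += 1

    summary = {}
    for term, by_form in counts.items():
        n = by_form["A"][1] + by_form["B"][1]
        k = by_form["A"][0] + by_form["B"][0]
        props = {f: (c / t if t else None) for f, (c, t) in by_form.items()}
        gap = None
        if props["A"] is not None and props["B"] is not None:
            gap = abs(props["A"] - props["B"])  # large gaps may flag a distractor problem
        summary[term] = {"n": n, "prop_correct": k / n, "form_gap": gap}
    return summary

# Toy scored responses for illustration only.
responses = [
    ("modern", "A", True), ("modern", "A", True),
    ("modern", "B", True), ("modern", "B", False),
    ("unstudious", "A", False), ("unstudious", "B", True),
]
summary = summarize(responses)
```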
2.2 Sampling, sample, and data collection
Participants (N = 1,572; 57% female) were recruited from two different crowdsourcing platforms: Prolific (90.7%) and Amazon’s Mechanical Turk (MTurk; 9.3%). Data collection was conducted across numerous small waves to meet stratified quotas across several categories simultaneously, including the form of the survey, sex, age, and race/ethnicity. To increase the generalizability with respect to literacy, the survey was only made visible to respondents who had previously identified themselves to Prolific or MTurk as not having attained a college/university degree. Similarly, the survey was only made visible to respondents who reported being current residents of the U.S. (this necessarily implies that the data are not generalizable to English speakers outside the U.S.). See Section 2.4 on Quality Control for more information about exclusion criteria.
Participants were compensated US$ 2.50 for completing the survey, as this was approximately equivalent to the U.S. federal minimum wage at the time of data collection (US$ 7.25 per hour for roughly 20 minutes of work, on average). Participants were allowed to take the survey multiple times (including both forms A and B). Across all 1,572 participants, we obtained 3,290 full responses to the survey. Approximately 44% (N = 691) of participants took the survey one time, 35% (N = 554) took the survey twice, and the rest took the survey between 3 and 10 times. Given the relatively small proportion of items administered to each respondent, there are few instances in which a participant saw the same item multiple times. More specifically, across the 241,506 item answers, there were only 2,419 instances (1%) in which a participant saw the same question more than once.
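One way such repeat exposures could be tallied is sketched below. The input layout, a list of `(participant_id, question_id)` pairs with one entry per answer, is hypothetical, and the counting rule (each participant-question combination answered more than once counts as one instance) is only one plausible reading of the figure reported above.

```python
from collections import Counter

def count_repeat_exposures(answers):
    """Count participant-question combinations that were answered more than once."""
    times_answered = Counter(answers)
    return sum(1 for n in times_answered.values() if n > 1)

# Toy answer log for illustration only.
answers = [
    ("p1", "q1"), ("p1", "q1"), ("p1", "q2"),
    ("p2", "q1"), ("p2", "q3"), ("p2", "q3"), ("p2", "q3"),
]
n_repeats = count_repeat_exposures(answers)
repeat_rate = n_repeats / len(answers)
```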
The resulting sample contained participants from a wide range of ages, household incomes, and different geographies (by state) within the U.S. Please see the supplemental website for figures summarizing the demographic characteristics of this sample (https://pie-lab.github.io/tda/sample.html). The sample included a higher proportion of participants identifying as White (73%) relative to the US population (64% of US adults according to the 2020 Census) and a lower proportion of respondents identifying as Black or Hispanic (9% vs 12% and 5% vs 16%, respectively).
Most participants had either some college-level education (42%) or a high school degree/GED equivalent (40%). Approximately 12% of respondents reported having attained a college/university degree. The cause of this inconsistency with the recruitment strategy is unclear, but it is likely that the Prolific/MTurk workers experienced a change in degree status since first joining the platform. Both the age and geographic distributions reflected considerable diversity. All the U.S. states and the District of Columbia were represented in the sample except for Vermont and Alaska. Approximately 66% of the sample had a household income of $60,000 or less.
2.3 Time of data collection
The survey-based data collection described in Step 4 of Section 2.1 occurred between May and July of 2020.
2.4 Quality Control
To facilitate generalizability of the data to native speakers of American English as spoken in the U.S. at the time the data were collected, participants were ineligible to complete the survey if they did not self-report speaking English “fluently” or “very well”, or if they currently lived or grew up outside the United States. Responses were excluded if participants took less than 3 minutes to complete the survey.
2.5 Data anonymisation and ethical issues
A consent form outlining the study rationale, including potential benefits and risks, was presented to participants prior to taking the survey. Participants were given the option to decline or consent to participation in the study as outlined by this document; participants who declined did not go on to complete the survey. Anonymity was maintained as no individually identifying data were collected from participants. This procedure was reviewed and approved by the Institutional Review Board at the University of Oregon (Protocol #02012020.001).
2.6 Existing use of data
Coughlin, J., Condon, D. M., & Weston, S. J. (2021, February). Identifying Unbiased Trait-Descriptive Adjectives for Personality Psychology. Poster session presented at the annual meeting of the Society for Personality and Social Psychology (virtual).
(3) Dataset description
3.1 Repository location
Condon, D. M., Coughlin, J., & Weston, S. J. (2021). Trait Descriptive Adjectives, Harvard Dataverse. https://doi.org/10.7910/DVN/5T80PF.
3.2 Object name
The data repository contains 5 data files. The raw data files are labeled ‘TDA_data_scored’, ‘TDA_frequencies’, and ‘masterkey’.
The repository also includes two output files that match the content in the database on the Github site. These files are labeled ‘item_difficulties’ and ‘TDA_properties.csv’.
3.3 Data type
3.4 Format names and versions
The data are published in CSV format. The accompanying website was built using R version 4.1.1 and RStudio version 2021.9.0.
The data have been published in the public domain with a CC0 license.
3.7 Limits to share or embargo
3.8 Publication date
The final version of the data was published 2021-10-26. The data were first deposited on 2021-04-27.
3.9 FAIR data/Codebook
These data conform with FAIR guidelines in that they have been posted in the public domain using a secure and accessible data repository with appropriate meta-data (including a persistent identifier). Interoperability is demonstrated with openly accessible analytic code and a searchable database on the Github website.
(4) Reuse potential
The primary opportunities for re-use of these data relate to subsequent work on the lexical structure of personality descriptors in American English. For example, the information provided about these 2,818 TDAs could be used to inform the collection of self- and other-ratings for all or, more likely, some subset of the terms. While similar research has been done extensively before now, several of the most influential efforts have relied on relatively small and/or homogeneous samples of raters, and they have mainly used human judgment to winnow the number of TDAs down to a tractable size. The data provided here could be used to replicate prior work and/or (re-)evaluate the effects of using different sets of terms on structural analyses.
Further, some researchers have recently noted that claims about the universality of the so-called Big Few models are logically problematic. For example, the evidence of similar statistical covariation among the self- and other-ratings of terms across groups may not adequately account for differences in means or the potential exclusion of other meaningful content [5, 19]. As the list reported here contains only American English terms, it is not independently sufficient for evaluations with cultures primarily using other languages, but it may be useful in conjunction with lists from other languages.
Similarly, this list of terms and the methods described here can contribute to studies of the generalizability of lexical models within populations speaking American English. For example, our method for assessing the extent of knowledge about each term may be useful for subsequent attempts to evaluate terms that are specific to one or more of the many variations of American English spoken throughout the U.S. These include cultural (e.g., African American English, Cajun Vernacular English, Mexican American English) and regional (e.g., New England English, Upper Midwestern English) dialects as well as American English-based hybrid languages such as Hawai’i Creole English (known locally as Hawaiian Pidgin) and Gullah English. Differences in the knowledge and scope of TDAs across these variations may be useful for sensitivity analyses of personality structure.
These terms are also useful for lexical research that does not rely only on survey-based methods. Recent advances in natural language processing (NLP) techniques, for example, offer considerable potential for novel applications of language analysis, especially with respect to the breadth and diversity of study populations over time. These applications will benefit from the availability of an updated and more comprehensive collection of personality descriptors.
More specifically, these data can be used to identify commonly known (or uncommonly known) trait descriptive adjectives (TDAs) for use in personality scale development and/or personality-relevant vocabulary tests. While the commonly known TDAs may be preferred when developing generalizable personality assessments, the use of uncommon TDAs may have merit in vocabulary-based ability measures, as they allow a test creator to generate items at various levels of difficulty. Ability measures such as these could be used for trait-recognition tasks (as they were used here), or for more general reading comprehension (i.e., English-language literacy) measures.
As over half of participants completed the survey more than once – with about 20% completing the survey 3 or more times – these data also offer an opportunity to study consistency in performance over repeated attempts (improvement vs fatigue). Though the number of participants who saw the same item multiple times was low, the data may also be useful as a metric of consistency in responses by Prolific/MTurk participants.
Finally, these data offer numerous possibilities for use in instructional contexts. They could be used to provide materials for subsequent data collection or to teach statistical techniques such as binary logistic (multilevel) regression, chi-square tests, and point-biserial correlations.