Ten (mostly) simple rules to future‐proof trait data in ecological and evolutionary sciences

Traits have become a crucial part of ecological and evolutionary sciences, helping researchers understand the function of an organism's morphology, physiology, growth and life history, with effects on fitness, behaviour, interactions with the environment and ecosystem processes. However, measuring, compiling and analysing trait data comes with data‐scientific challenges. We offer 10 (mostly) simple rules, with some detailed extensions, as a guide in making critical decisions that consider the entire life cycle of trait data. This article is particularly motivated by its last rule, that is, to propagate good practice. It has the intention of bringing awareness of how data on the traits of organisms can be collected and managed for reuse by the research community. Trait observations are relevant to a broad interdisciplinary community of field biologists, synthesis ecologists, evolutionary biologists, computer scientists and database managers. We hope these basic guidelines can be useful as a starter for active communication in disseminating such integrative knowledge and in how to make trait data future‐proof. We invite the scientific community to participate in this effort at http://opentraits.org/best‐practices.html.


| INTRODUCTION
As early as 300 BC, Greek philosophers such as Theophrastus forged the first formal systems defining and classifying organisms by their combination of morphological, physiological, behavioural and phenological characteristics, that is, their traits (Weiher et al., 1999).
Knowing an organism's traits often allows deeper understanding of its life history, behaviour, fitness, biotic interactions and potential responses to and effects on ecosystem processes (Violle et al., 2007).
Traits are commonly defined as a measure of an entity (Garnier et al., 2017), where the entity can be the whole individual, or a specific organ or tissue (e.g. a fish, its tail fin or wood) and the quantity is an observable characteristic of that entity (e.g. the length or colour of a fish, the length of its tail fin or the density of wood tissue).
Together, all traits of an individual organism represent its phenotype, which results from the historical evolution of the genotype and potential current interactions with the environment. Therefore, a trait record should inform not only about the entity that was observed (e.g. taxonomic classification or age) and the quantity/characteristic that was measured, but also about the environment in which the individual has developed that trait (de Bello, Carmona, et al., 2021; Kattge et al., 2011), for example, where a fish was caught, where a tree lived or the soil depth where an invertebrate was observed.
There are many ways to describe and measure the traits of organisms (Kearney et al., 2021; Walker et al., 2022). For example, a plant leaf can be described by several hundred measurable characteristics, or 'traits'. These include surface area, sodium concentration, phenology and maximum photosynthetic rate (see e.g. Kattge et al., 2020). On the one hand, different traits of an individual are often correlated, for example, for a tree to grow tall it usually needs a thick stem. Recognising these correlations in how the data are collected (e.g. on the same tree) and stored is essential. In this case, for a trait record to be meaningful, it needs to be connected to a combination of multiple trait measurements. In contrast, a trait record can also be rather simple, if the given trait is well defined, if it depends 'only' on the genotype, or if it is not affected by current interaction with the environment.
In essence, trait data are a special kind of data: they are diverse (e.g. categorical or numeric, with a multitude of units), relatively simple (e.g. length) or potentially complex (e.g. behavioural traits), largely independent of one another (e.g. fish colour) or correlated with other traits (e.g. brain and body mass), and range between cheap and costly to measure (e.g. simple colour vs. metabolome data). However, they are very informative as they represent the evolutionary adaptation or developmental acclimation of the individual organisms to their environment and allow for quantitative and predictive ecology and biodiversity research. Therefore, if collected, stored and published in a meaningful way, organismal trait data have an extraordinary value for reuse, as indicated by, for example, the >20,000 data requests to the TRY Plant Trait Database since 2015.
To enable the reuse of trait data beyond their original research campaign, to make them meaningful in other contexts and to avoid data degradation, observation records must be clearly defined, the environmental context given where possible, and provenance as well as sampling and measuring protocols documented (Michener, 2006). Recent efforts to expand trait knowledge across the Tree of Life (Gallagher et al., 2020) Wilkinson et al., 2016), fundamental principles at the heart of the emergent Open Science movement (Nosek et al., 2015). Global and local datasets of organismal traits have rapidly grown since the 1990s (e.g. Herberstein et al., 2022; Kattge et al., 2020; Madin et al., 2016, 2020). However, these datasets bear various new challenges linked to harmonisation, biases, expertise and communication (Salguero-Gómez et al., 2021). These challenges result in a significant trade-off between investing in collecting new trait data or reusing open trait data (Westoby et al., 2021). Indeed, many studies in trait-based research reuse available trait data or collect additional trait data and/or assemble new data (e.g. examples in Kattge et al., 2020). Thus, these studies also often involve linking different types of data, which requires interoperability between datasets (Gallagher et al., 2020).

KEYWORDS
data life cycle, data science, FAIR principles, good practices, metadata, open science, phenotype, trait data

These key aspects are just a few of many dimensions illustrating how and why researchers have to make biological decisions, and a wide range of data-science choices, when collecting and working with trait data. Multiple complexities of trait data structure and manipulation are not obvious at first glance (Michener, 2006). For instance, there is sometimes confusion about, and a lack of awareness of, trait standards and measurement units, and trait data are particularly prone to errors in recording, language translation and understanding (Dawson et al., 2021; Kunz et al., 2022). By offering a larger perspective, a 'trait data life cycle' (i.e. a data life cycle specific for trait data, Rüegg et al., 2014) can help clarify these confusions and inform about good practices when working with trait data (Figure 1). In this article, we highlight some common pitfalls in the usage of trait data and offer 10 rules for making critical decisions that consider the entire life cycle of trait data. We start each rule with a general and simple statement and develop the complexity of each rule within more detailed subsections.

| RULE 1: SELECT THE RIGHT TRAIT
Let your study question or hypothesis determine both the trait(s) to be used and how those traits are collected and analysed. Clear, upfront definitions of traits will avoid errors through, for example, confusion of scales and definitions, data gaps or inclusion of inadequate traits (Dawson et al., 2021; González-Suárez et al., 2012; Hulme et al., 2013; Messier et al., 2017).

| Follow your hypothesis
Increasingly, trait data describing organisms of interest are publicly available for reuse. However, primary trait collection is necessary for a large number of research questions, for instance those involving rare species, understudied regions or small spatial scales. Vast public availability extends the potential scope of what is possible with limited resources (e.g. Falster et al., 2021; Kattge et al., 2020). However, when reusing trait data, we relinquish control of what variables are collected, which species are sampled, and the methods used for collection (Koricheva et al., 2013). Undirected fishing expeditions for traits can yield large datasets. Still, these may not be appropriate to answer a given research question, for various reasons (e.g. coverage, geographical origin, distribution, meaningfulness, and resolution, Violle et al., 2015). Furthermore, the wealth of available trait data may distract from initial hypotheses, risking random exploration of the available traits and fishing for significant relationships without a clear focus. Thus, trait selection and collection should in most cases be primarily tethered to a concrete hypothesis, not defined by the availability of existing data. This rule does, however, not completely exclude extensive data exploration and data-driven discovery within

FIGURE 1 Ten (mostly) simple rules and where they apply in the overall trait data life cycle. Each rule is primarily applied to a specific element of the cycle (in bold) but can also be necessary to other elements (secondary application). Rules 9 and 10 apply to the whole cycle.

| Consider the scale
Research questions define the appropriate hierarchical level for sampling: a continental-scale study of thousands of species may treat the intraspecific variation as statistical noise. In contrast, this variation may be the focus of the study in locally scaled projects. There is no 'correct' scale, either in terms of spatial grain (e.g. km², m²), temporal duration (e.g. seconds, years) or taxonomic coverage (e.g. clade, species, population or individual), but not every scale will be appropriate for every question. So, when defining the traits of interest, it is important to determine the scale at which these need to be collected or aggregated to match the research question (Messier et al., 2017).

| Be aware of existing trait definitions and homologies
Much effort has already gone into creating definitions and protocols for trait collection (Pérez-Harguindeguy et al., 2013). Yet, trait naming and corresponding descriptions may differ between studies and trait databases (Ankenbrand et al., 2018; Dawson et al., 2021; Kunz et al., 2022). For example, the activity cycle of animals is sometimes reported as a discrete value (e.g. Jones et al., 2009), or sometimes split into multiple binary traits such as 'nocturnal', 'crepuscular', 'diurnal' (e.g. Wilman et al., 2014). Similarly, values may differ between resources (e.g. 'therophyte' and 'annual' are synonyms). Furthermore, when comparing traits and trait states across organisms, it is important to be aware of the 'homology' of the character. Homologous traits share similarity of structure, physiology or development (often by common evolutionary ancestry), whereas non-homologous (or analogous) characters may perform a similar function, but differ in structure, physiology or development.
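The two encodings of the activity cycle and the synonym problem described above can be reconciled mechanically once the mapping is written down. The following is a minimal Python sketch, not a reference implementation: the synonym table, the choice of 'annual' as preferred term and the function names are assumptions made for illustration only.

```python
# Hypothetical synonym table: maps labels used in different resources to one
# preferred term. Choosing 'annual' as the preferred term is arbitrary here.
SYNONYMS = {"therophyte": "annual", "annual": "annual"}

# Activity states as named in the text above.
ACTIVITY_STATES = ("nocturnal", "crepuscular", "diurnal")


def to_binary_traits(activity):
    """Expand one discrete activity value into one binary trait per state."""
    if activity not in ACTIVITY_STATES:
        raise ValueError(f"unknown activity state: {activity!r}")
    return {state: int(state == activity) for state in ACTIVITY_STATES}


def preferred_term(label):
    """Return the preferred term for a label, ignoring case and padding."""
    return SYNONYMS[label.strip().lower()]


binary = to_binary_traits("nocturnal")
term = preferred_term(" Therophyte ")
```

Note that the reverse direction may be lossy: a species flagged as both nocturnal and crepuscular in a binary scheme may have no single discrete equivalent, which is one reason to document the encoding that was used.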

| Be pragmatic and transparent
In a perfect trait research world, we could measure or retrieve the exact traits for the precise scale and organisms needed to answer our specific question. This vision is rarely applicable in practice.
Instead, we often need to work with proxies for traits that are difficult to measure (e.g. hairiness of pollinators as a proxy for pollination effectiveness, Stavert et al., 2016), for inference of fitness (e.g. reproductive output as a performance trait, McGraw & Caswell, 1996; Violle et al., 2007) or for traits that are incomplete in a database (e.g. diet or behavioural traits are less complete than morphological traits, Oliveira et al., 2017). There is a common understanding of these technical or financial limitations in the scientific community; ultimately, we must be pragmatic to advance research questions. However, it is crucial to explain and justify the choice of traits, especially when these are used as proxies or 'best available data', to allow fellow researchers to understand and evaluate whether such choices were valid for the specific research question at hand.

| RULE 2: CONSULT EXISTING DATA
Build on existing trait resources to reduce the likelihood of redundancy and ensure compatibility with current data. The decision when to collect new trait data is generally based on the research question, the scope of the analysis (e.g. local, global), and the availability of the existing data. Financial and geographic constraints may also influence the decision to use current trait data instead of embarking on a measurement campaign. However, the existing trait data must be 'fit for purpose' to avoid compromising the capacity to answer the research question and in many cases, new trait measurements will still be needed.

| Check public data sources
Most data probably exist decentralised as individual trait datasets in the form of raw data attachments to publications, data papers or data uploads to unspecific public databases (e.g. Zenodo https://zenodo.org, Dryad https://datadryad.org). However, these datasets can be challenging to find if not registered at central hubs. Common to these efforts is the fact that they contain already harmonised, error-checked and standardised values. These resources usually provide user-friendly interfaces for searches and dynamic, up-to-date aggregations of data. Particularly for studies of larger scale (e.g. many taxa, many bioregions), it often makes sense to consult these existing big databases and data registries.

| Identify and cite data origins
Trait data are not always raw or first-hand: they can be created and perhaps aggregated from original observations and measurements (e.g. Kattge et al., 2020) but also mobilised from literature or undigitised legacy trait data (e.g. Parr et al., 2015), synthesised as imputed trait data (e.g. Penone et al., 2014), reused from data publications (e.g. Kattge et al., 2020) or mined from texts with automated algorithms or other contexts (Thessen et al., 2018).
Thus, when reusing trait data, it is essential to check and report information about the source to downstream analyses and subsequent publications (i.e. data provenance). Importantly, providing this information also gives credit to the original trait data collectors.

| Fill the gaps
Existing databases are taxonomically and biogeographically biased, 'gappy', and traits assigned to the same species are rarely collected in the exact locations or conditions (Etard et al., 2020). (Leitão et al., 2016), or for threatened species which will benefit from functional approaches to their conservation.
These handbooks provide precise, domain-specific definitions and recommended methods for trait measurement, measurement precision and replication. They also provide considerations and warnings of misconception and error, and point to the key literature debating the methodology. Taking formalisation of trait concepts even one step further are thesauri of trait concepts (Garnier et al., , 2017 (Calder, 1982).

| RULE 4: CONTEXT IS CRUCIAL
Always pair your data points with metadata. Sampling protocols ideally also define metadata that can be considered as covariates of the measurement procedure or inform the user about the provenance of the trait data. Together with the trait measurements, metadata defines an observation and its context (Madin et al., 2008).
While such metadata may already be necessary for the proximate research question, it also helps future users to better understand and reproduce the methods and to correctly interpret the trait values.
The reuse value of existing datasets increases with the quantity and quality of metadata, so datasets with sufficient context information are more likely to be reused in future synthesis analyses or included in more extensive databases.

| Cover the domain-specific standard, if possible
Deciding which further metadata to collect often involves a trade-off between which data are commonly collected in a specific domain (e.g. plants) and the time and expense involved in collecting or processing such data. Metadata preferably includes detailed documentation and code of how traits were measured (e.g. manufacturer and version of devices used) and processed (e.g. standardisations or species means). We recommend checking existing well-used datasets and databases of the specific domains before collecting new trait data to determine which common metadata should be covered.

| Link to other data by metadata
A good practice is to link the data directly with publications (e.g. by DOI) for the scientific context and further information in the materials and methods sections, and to identify trait data providers (e.g. by ORCID) to provide opportunities for feedback and requests for additional information. Traits are often measured together with other data, such as ecosystem function (e.g. Bongers et al., 2021) or species composition or interactions (e.g. Breitschwerdt et al., 2018). In these cases, the functions measured and the species composition recorded would be part of the metadata, or the metadata would link to those data in other repositories.

| RULE 5: STRUCTURE TRAIT DATA
Do not underestimate the importance of the structure of your dataset. It might sound trivial at first glance to think about how to structure the data, but poorly structured data may become a nightmare to work with in downstream analyses, or to reformat for publication, deposit in a public database, or synthesise in meta-analyses. It thus makes sense to consider structural aspects even in the early stages of a project using traits.

| Minimum trait data standards
The minimal, essential information for a trait record includes taxon name, trait name, observation ID, trait value, unit (if applicable) and source. Several standards are available to help structure this minimal information set (Fegraus et al., 2005; Kattge et al., 2011; Madin et al., 2007; Parr et al., 2015; Schneider et al., 2019; Wieczorek et al., 2012). A good start for data structuring is to adopt one of these well-established schemes.
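As an illustration of the minimal information set listed above, the following Python sketch stores one trait observation per record in a long format. It is a hedged example only: the field names, the example values and the helper function are assumptions and do not reproduce any of the cited standards.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class TraitRecord:
    """One trait observation in long format (field names are illustrative)."""
    taxon_name: str
    trait_name: str
    observation_id: str
    trait_value: Union[float, str]
    source: str
    unit: Optional[str] = None  # None for unitless or categorical traits


# Made-up example values: two traits measured on the same observation.
records = [
    TraitRecord("Fagus sylvatica", "leaf area", "obs-001", 1520.0,
                "hypothetical field campaign", "mm2"),
    TraitRecord("Fagus sylvatica", "leaf phenology", "obs-001", "deciduous",
                "hypothetical field campaign"),
]


def traits_of_observation(all_records, observation_id):
    """Collect all trait values that share one observation ID."""
    return {r.trait_name: r.trait_value
            for r in all_records if r.observation_id == observation_id}


same_individual = traits_of_observation(records, "obs-001")
```

Keeping the observation ID on every record is what allows several traits measured on the same individual to be reconnected later, which matters when traits are correlated.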

| Apply version control
The process from gathering to analysing trait data is long, or trait data may change as measurement technologies improve (e.g.

| Harmonise trait data
If trait data originate from multiple sources, each source may identify the same entities or concepts differently (Kunz et al., 2022).
Harmonisation is crucial to reconcile equivalent entities and explicitly connect related entities by 'similar' or subclass relationships.

| Derive traits from raw data
Research questions may concern composite or derived traits, such as the 'hand-wing index' (a wing's aspect ratio in birds). It is advisable to calculate derived traits directly from the raw data where possible to avoid bias and allow for new calculations. This procedure may not always be possible because of data gaps; in this case the calculation can be done at a higher level (e.g. at the taxonomic level of interest).
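The difference between deriving a trait from raw data and deriving it at a higher level can be made concrete with a small Python sketch. It assumes the commonly used definition of the hand-wing index, HWI = 100 × (wing length − first-secondary length) / wing length, which the text above does not spell out; the measurements are made up for illustration.

```python
from statistics import mean


def hand_wing_index(wing_length, secondary_length):
    """Hand-wing index under the assumed definition given in the lead-in."""
    if wing_length <= 0:
        raise ValueError("wing length must be positive")
    return 100.0 * (wing_length - secondary_length) / wing_length


# Made-up raw measurements (mm) for three individuals of one species.
individuals = [
    {"wing_length": 100.0, "secondary_length": 80.0},
    {"wing_length": 120.0, "secondary_length": 90.0},
    {"wing_length": 80.0, "secondary_length": 70.0},
]

# Option 1: derive the trait per individual, then aggregate.
hwi_from_raw = mean(
    hand_wing_index(i["wing_length"], i["secondary_length"])
    for i in individuals
)

# Option 2: aggregate the raw measurements first, then derive the trait.
hwi_from_means = hand_wing_index(
    mean(i["wing_length"] for i in individuals),
    mean(i["secondary_length"] for i in individuals),
)
```

With these made-up numbers the two routes give different values (about 19.2 versus 20.0), because a mean of ratios is generally not the ratio of means. This is one reason to keep the raw measurements alongside any derived trait.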

| Transform and standardise where applicable
As with other types of data, transformations such as the natural logarithm or square root may be essential to conform to the requirements of analytical models. Beyond these, data challenges include how to combine binary, categorical and continuous traits
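As a hedged illustration of the transformations mentioned above, the following Python sketch applies a natural-log transform followed by standardisation to zero mean and unit standard deviation. The body-mass values and function names are made up, and whether either step is appropriate depends on the analytical model.

```python
import math
from statistics import mean, stdev


def log_transform(values):
    """Natural-log transform; only defined for strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("natural log requires strictly positive values")
    return [math.log(v) for v in values]


def standardise(values):
    """Centre to mean 0 and scale to sample standard deviation 1."""
    m = mean(values)
    s = stdev(values)
    if s == 0:
        raise ValueError("cannot standardise a constant trait")
    return [(v - m) / s for v in values]


# Made-up body masses (g) spanning several orders of magnitude.
body_mass_g = [12.0, 150.0, 4300.0, 52000.0]
z_scores = standardise(log_transform(body_mass_g))
```

Storing the untransformed values and applying the transformation in the analysis code keeps the original measurements available for reuse.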

| Work with relative errors
Units are essential when we deal with approximations, uncertainties and errors (Langtangen & Pedersen, 2016).
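One practical property of relative errors is that they are unitless, so they stay the same when a measurement is converted between units, whereas absolute errors do not. The Python sketch below illustrates this with made-up leaf lengths; the function name and values are assumptions for illustration.

```python
def relative_error(measured, reference):
    """Absolute deviation divided by the magnitude of the reference value."""
    if reference == 0:
        raise ValueError("relative error is undefined for a zero reference")
    return abs(measured - reference) / abs(reference)


# Made-up example: the same leaf length expressed in mm and in cm.
abs_error_mm = abs(52.0 - 50.0)   # absolute error depends on the unit
abs_error_cm = abs(5.2 - 5.0)
rel_error_mm = relative_error(52.0, 50.0)
rel_error_cm = relative_error(5.2, 5.0)
```

Here the absolute errors differ by a factor of ten between the two units, while both relative errors are about 0.04 (up to floating-point rounding).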

| RULE 7: KNOW THE LIMITATIONS
Follow the latest developments for best practices in trait data analyses. As the downstream part of data analysis is directly linked to the research question, generalisation of analytical methods is rarely possible. Given the diversity of research questions, the analytical steps can thus broadly diverge. However, the following notions can help identify some common mistakes made with trait data due to their nature.
Beyond this, we recommend referring to closely related, domain-specific literature that can provide appropriate solutions.

| Mind the level
Traits encompass different levels: organ, individual, population, species and community (Violle et al., 2007), and this structure determines the tools used for data analyses. For instance, trait-environment relationships investigated at the species or community level require different analysis types (e.g. comparative models vs. simple linear models, see below). It is important to choose the appropriate level early in the research programme to fit the target scientific question and to be able to analyse the data correctly.

| Do not confuse richness and abundance signals in trait metrics
Metrics aggregating traits at the community level (e.g. functional diversity or community-weighted means, CWMs) are influenced by the richness, the abundance of species and the overall species composition of the community. Choosing metrics unrelated to abundance (e.g. unweighted means) or null models (Hawkins et al., 2017) is necessary to separate species abundance, composition or richness signals from trait information.
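The abundance signal in a community-weighted mean can be seen in a small Python sketch. The species names, trait values and abundances are made up, and the two functions are a minimal illustration rather than a replacement for dedicated functional-diversity tools.

```python
def community_weighted_mean(trait_by_species, abundance_by_species):
    """Mean trait value weighted by each species' abundance in the plot."""
    total = sum(abundance_by_species.values())
    if total <= 0:
        raise ValueError("total abundance must be positive")
    weighted_sum = sum(trait_by_species[sp] * n
                       for sp, n in abundance_by_species.items())
    return weighted_sum / total


def unweighted_mean(trait_by_species, abundance_by_species):
    """Mean trait value over species present, ignoring abundance."""
    present = [sp for sp, n in abundance_by_species.items() if n > 0]
    if not present:
        raise ValueError("no species present")
    return sum(trait_by_species[sp] for sp in present) / len(present)


# Made-up trait values (e.g. height in cm) and two plots that share the same
# species list but differ in which species dominates.
height = {"sp_a": 10.0, "sp_b": 20.0, "sp_c": 60.0}
plot_1 = {"sp_a": 8, "sp_b": 1, "sp_c": 1}
plot_2 = {"sp_a": 1, "sp_b": 1, "sp_c": 8}

cwm_1 = community_weighted_mean(height, plot_1)
cwm_2 = community_weighted_mean(height, plot_2)
mean_1 = unweighted_mean(height, plot_1)
mean_2 = unweighted_mean(height, plot_2)
```

Both plots contain the same species and therefore the same unweighted mean (30.0), yet their CWMs differ (16.0 versus 51.0). The difference in CWM here is purely an abundance signal, not a difference in the trait pool.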

| Handle correlations with care
Traits are often correlated, causing issues with statistical analyses (e.g. collinearity in linear models when traits are explanatory variables). Often, these correlations are due to biological constraints (e.g. allometries), or 'strategies' (Díaz et al., 2016).
In some cases, it is possible to use multivariate analyses (e.g.

| Consider correction for phylogenetic relatedness
When analysing data from multiple species in trait-trait correlations, or when using traits as responses, and depending on whether the focus of the question is ecological or evolutionary, it may become necessary to account for the fact that species are not independent units (Pillar et al., 2021). The whole field of comparative analyses tackles this issue. It proposes tools to account for phylogenetic relatedness in trait analyses (e.g. see Garamszegi, 2014), although care should be taken to justify the use of such analytical corrections relative to the aims of the research question (Freckleton, 2000; Westoby et al., 1995).

| Account for variability and uncertainty
Very often, intraspecific data are aggregated at the species level to obtain one trait value per species. All information on variability and measurement uncertainty is then lost. When information on variability is available and reasonable in the scope of the study, it is possible to include it, for example, by weighting species-level measures in functional diversity metrics (de Bello, Carmona, et al., 2021) or by explicitly including it when inferring trait evolution across lineages (Kostikova et al., 2016; Purschke et al., 2017). This can be an issue, especially if variability is phylogenetically structured (Garamszegi, 2014; Paterno et al., 2018).
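One simple way to avoid discarding variability during aggregation is to store the spread and sample size next to each species mean. The following Python sketch shows this under stated assumptions: the observations are made up, and the summary fields (mean, sd, n) are an illustrative choice, not a prescribed standard.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarise_by_species(observations):
    """Aggregate (species, value) pairs to mean, sample sd and n per species."""
    grouped = defaultdict(list)
    for species, value in observations:
        grouped[species].append(value)
    summary = {}
    for species, values in grouped.items():
        summary[species] = {
            "mean": mean(values),
            # A standard deviation needs at least two observations.
            "sd": stdev(values) if len(values) > 1 else None,
            "n": len(values),
        }
    return summary


# Made-up individual-level observations.
observations = [("sp_a", 10.0), ("sp_a", 14.0), ("sp_a", 12.0), ("sp_b", 30.0)]
summary = summarise_by_species(observations)
```

A species represented by a single observation keeps an explicit missing value for the standard deviation, so downstream users can see that its variability is unknown rather than zero.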

| RULE 8: PUBLISH TRAIT DATA TOGETHER WITH METADATA
Openly publish trait data to facilitate answering yet unknown questions beyond their original study, lay the groundwork for understanding ecological processes beyond clear-cut niches (Elton, 1927; Schneider et al., 2019), and democratise access to valuable trait datasets (Soranno et al., 2015). Each data point of trait measurements has a considerable value for the scientific community and future generations working on trait-related research questions.

| Consider the stakeholders
As our scholarly processes evolve to better find, access, integrate and reuse scientific data, we face the communal task of treating trait datasets as first-class research citizens. However, doing so is not easy as it involves different stakeholders: publishers have to make their publications open and FAIR (Wilkinson et al., 2016), scientists have to improve their skills to publish, reuse and correctly cite datasets, and funding agencies have to find ways to reward exemplary projects.
A welcome development is that many publishers now consider trait data papers (e.g. Falster et al., 2021; Guerrero-Ramírez et al., 2021; Tobias et al., 2022; Vandvik et al., 2020), which allow for a detailed methodological and context description, open access, and at the same time, accreditation of trait data collectors by citations.

| Accept the additional responsibility
Erroneous data might bias not only a current project, but also the future work of others. Currently, no commonly established practices exist for extending peer review to trait data. A way to ensure that a dataset conforms to community standards is to submit it to an established curated database (e.g. TRY for plant traits; Coral Traits (Madin et al., 2016) for anthozoans). Furthermore, consider publicly depositing raw and processed data and clearly differentiating between the two types. This allows tracing errors generated during processing and grants future users access to the original values.

| Aim for redundancy
Public trait data suffer from the same generic issues as other data, for example, hardware failures, linkrot (URLs not entirely reliable) or content-drift (content changes, but URLs do not, Koehler, 1999).
To mitigate such issues and reliably preserve data in the long term, data can be submitted to multiple repositories, for example, beside trait databases, also in general storage platforms such as  (Elliott et al., 2020) that rely on machine-readable data.

| Register trait data
Independent of the choice of actual data deposition, it is important that datasets are registered in a trait data registry (e.g. https://opentraits.org) to allow fellow scientists to find the data quickly.

| RULE 9: REVIEW DATA AND CODE LIKE THE RESEARCH ITSELF
Best practices in peer review have already been discussed in detail (Roberts, 2004; Spigt & Arts, 2010), but can perhaps be summarised with this statement: 'Be polite, fair, specific, and constructive'. A reviewer should provide information for the editorial team to decide; this process also applies to the data. Specifically for trait-based papers, it includes considering the entire life cycle of the trait data:
1. First, are the traits themselves appropriate for the question being asked? It should be considered how these traits have been used in the past and how they fit into biological theory. Are they being contextualised appropriately, and are they fit for the purpose for which they are being used?
2. How were the data collected? Does the protocol conform to current standards, bearing in mind that the purpose of many papers is to improve standards and so they may not? Is the collection of new data well justified? Are units and metadata properly provided?
3. How were the data processed? Consider not just quality assurance and quality control but also how the traits were generally processed into a format that can be analysed.

| Train students
Courses specific to trait-based research are often lacking at undergraduate and graduate levels. Where courses or modules are taught, the focus may be limited to a subset of the trait data life cycle (e.g. Collection and Analysis; Figure 1), leaving students lacking critical skills. Open Educational Resources, including those built using incubators (Ryder et al., 2020), are one promising method for implementing such courses and modules more easily. In particular, authentic teaching experiences provide several benefits over traditional lectures or 'cook-book' experiments (Brownell et al., 2012). They seem well suited to trait-based ecology given that many traits can be collected quickly and inexpensively and that many tools are available (see, e.g. de Bello, Carmona, et al., 2021).
One example of such authentic teaching experiences, the TraitTrain plant functional trait courses (https://plantfunctionaltraitscourses.w.uib.no/), has provided training across the entire trait data life cycle to hundreds of participants and has created scientific (Henn et al., 2018), data, methodological and pedagogical (Geange et al., 2021) publications.

| Train colleagues
Making colleagues aware of important developments in trait-based research via either formal (e.g. publishing protocols, giving talks) or informal means (e.g. conversations, social media, email) is a critical way of helping to advance the field. Furthermore, trait-based research is an integrative field. It provides many opportunities for collaboration and idea-sharing across branches of life science, so discussing traits with a wide variety of colleagues is useful.

| Train the world
There is an urgent need for more comprehensive trait data across the globe and the tree of life, thus, increasing global access to training. Open access publications, tools, data and educational resources help lower the barriers to participation (Evans & Reimer, 2009). Furthermore, due to the relative ease, low cost and

When working with trait data, we gain particularly as an interdisciplinary community of field biologists, synthesis ecologists, evolutionary biologists, computer scientists and database managers from a broad taxonomic range. This allows for the development of tools, methods and infrastructures that connect the entirety of trait science in an interoperable fashion. We hope that these basic guidelines can be useful as a starter for active communication in disseminating such integrative knowledge and how to make trait data future-proof. We encourage the scientific community to contribute to these rules when new tools and practices emerge in the 'living document' version of this article at http://opentraits.org/best-practices.html.

-193921238). Open Access funding enabled and organized by Projekt DEAL.

CONFLICT OF INTEREST
All authors declare that they have no conflict of interest.

PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1111/2041-210X.14033.

DATA AVAILABILITY STATEMENT
This paper does not use data.