Glossary | Sources
Review this guide for definitions and descriptions of terms related to data management and sharing.
Term | Definition | Source |
---|---|---|
Access controls | Procedures and controls that limit or detect access to critical information resources, such as through software, biometrics devices, or physical access to a controlled space. | NIST Computer Security Resource Center |
Accessible | The “A” in FAIR. The ability for a user to know how data can be accessed, possibly including authentication and authorization. | The FAIR Guiding Principles for Scientific Data Management and Stewardship |
Anonymization | Process that removes the association between the identifying dataset and the data subject. | NIST Computer Security Resource Center |
Archiving | Ensuring that data are properly selected, stored, and can be accessed, and for which logical and physical integrity are maintained over time, including security and authenticity. | Committee on Data Research Data Management Terminology |
Attribution | Providing acknowledgement to the source of the information (text, image, dataset, etc.) you are using. | UNC Health Sciences Library |
Certificate of Confidentiality (CoC) | A certificate that protects the privacy of research subjects by prohibiting disclosure of identifiable, sensitive research information to anyone not connected to the research. | NIH Policy and Compliance |
Codebook | A human-readable document that provides information on each data element. Though often used interchangeably with data dictionary, the data dictionary can contain more information about the structure of a database than a codebook. (More about codebooks here.) | NNLM Data Glossary |
Common Data Elements (CDEs) | Standardized, precisely defined questions, each paired with a set of specific allowable responses, that are used systematically across different sites, studies, or clinical trials so that data are collected in the same way. | NLM CDE Tutorial |
Content standards | Agreed-upon shared guidance on how items in datasets are collected and represented. Content standards may include definitions of conditions, units of measurement, terminology allowable within each variable or data element, and any other guidance about the content of data points when collected, stored, transformed, or loaded to another system. | Working Group on NIH DMSP Guidance |
Controlled access | Data sharing that requires a request for access to the dataset to be approved; it is usually limited to researchers with a specific, relevant research question. Data sharing restrictions are determined by the owner of the data prior to collection of the data and included on the informed consent signed by those entering the data. | NCATS Toolkit for Patient-Focused Therapy Development |
Copyright | A type of intellectual property that protects original works of authorship as soon as an author fixes the work in a tangible form of expression. | US Copyright Office |
Data | The term “data” does not have one clear definition and can be interpreted differently depending on the context. For example, the NIH Data Management and Sharing Policy defines data as “The recorded factual material commonly accepted in the scientific community as of sufficient quality to validate and replicate research findings, regardless of whether the data are used to support scholarly publications.” | NNLM Data Glossary |
Data availability statement | A section, typically towards the end of a research article, that describes how the reader can access the data and whether they are publicly available, available upon request, or otherwise restricted. (More about data availability statements here.) | NNLM Data Glossary |
Data dictionary | A document that outlines the structure, content, and meaning of a given variable. This includes what type of data is being collected (e.g. free text, numerical, categorical or group data), the full wording of a question, what values are allowable (e.g. numeric ranges, multiple choice codes), and what those values mean (e.g. 0 = no high blood pressure diagnosis, 1 = borderline high blood pressure, 2 = high blood pressure). (More on data dictionaries here.) | NNLM Data Glossary |
Data element | A basic unit of information that has a unique meaning and subcategories (data items) of distinct value. Examples include gender, race, and geographic location. | NIST Computer Security Resource Center |
Data science | An interdisciplinary field which uses statistics, computer science, programming, and domain knowledge to collect, process, and analyze data for the purpose of acquiring knowledge or solving a problem. It also includes sharing acquired knowledge through storytelling, visualization, and other means of communication, and often employs methods such as machine learning, AI, natural language processing, algorithms, and other analytic tools to process and understand data. | NNLM Data Glossary |
Data stewardship | A process that involves ensuring effective control and use of data assets and can include creating and managing metadata, applying standards, managing data quality and integrity, and additional data governance activities related to data curation. It also may include creating educational materials, policies, and guidelines around data at an institution. | NNLM Data Glossary |
Data use agreement | An executed agreement between a data provider and a data recipient that specifies the terms under which the data can be used. (See UI Guidance on data use agreements.) | NIST Computer Security Resource Center |
De-identification | General term for any process of removing the association between a set of identifying data and the data subject. | NIST Computer Security Resource Center |
Documentation | Information needed for the data to be understood, interpreted, and used. It can describe the research project as well as the resulting data. Documentation for tabular data (e.g., spreadsheets) can include variable names and descriptions, explanations of the codes and classification schemes used, etc. (See Ten simple rules… (Rule 2)) | Ten Simple Rules…(Rule 2) |
Encoding standards | Agreed-upon shared guidance on the technological process by which data and metadata files are made into a computer-readable format. Common encoding standards include HTML, PDF, CSV, DOCX, and more. Less common encoding standards require specific software. | Working Group on NIH DMSP Guidance |
FAIR | FAIR data is Findable, Accessible, Interoperable, Reusable. The principles emphasize machine-actionability (i.e., the capacity of computational systems to find, access, interoperate, and reuse data with none or minimal human intervention) because humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity, and creation speed of data. | The FAIR Guiding Principles for Scientific Data Management and Stewardship |
Findable | The “F” in FAIR. Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services. | The FAIR Guiding Principles for Scientific Data Management and Stewardship |
Informed consent | The process by which a volunteer confirms their willingness to participate in a research study, such as a clinical trial, after having been informed of all aspects of the study that are relevant to the volunteer’s decision to participate. It is documented by means of a written, signed, and dated informed consent form. | Working Group on NIH DMSP Guidance |
Intellectual property | Any product of the human intellect that the law protects from unauthorized use by others. IP is traditionally comprised of four categories: patent, copyright, trademark, and trade secrets. | Cornell Legal Information Institute |
Interoperability (of data) | The ways in which data is formatted to allow diverse datasets to be merged or aggregated in meaningful ways. It relies on metadata and data documentation so that researchers know which datasets and variables are comparable, and is frequently accomplished through the use of data standards. | |
Interoperable | The “I” in FAIR. The capability to communicate, execute programs, or transfer data among various functional units in a useful and meaningful manner that requires the user to have little or no knowledge of the unique characteristics of those units. | Committee on Data Research Data Management Terminology |
License | A legal instrument that enables the data owner/creator to provide permissions to other users, under specific terms, to use the dataset. | |
Metadata | Information that describes, explains, locates, classifies, contextualizes, or documents an information resource. It is what enables you to search for books in your local library catalog, videos on YouTube, or find journal articles through PubMed. It can also help manage data, by tracking attributes like data provenance and versioning. (More about metadata here.) | NNLM Data Glossary |
Non-commercial license | A license that allows others to adapt, reuse, and remix a work, so long as it is not intended for commercial purposes (i.e., as long as there is no monetary compensation.) | Creative Commons |
Non-derivative license | A license that allows people to copy and distribute a work, but disallows them from creating adaptations of it (e.g., no remixes, no transformations, etc.) | Creative Commons |
Open Data Commons | A set of legal tools and licenses for data and databases that helps people publish, provide, and use open data. | Open Data Commons |
Open-source license | A license that allows software to be freely used, modified, and shared. | Open Source Initiative |
Persistent Identifiers (PIDs, PDIs, or GUIDs) | A string of letters and numbers used to distinguish between and locate different objects, people, or concepts. Examples include the Digital Object Identifier (DOI), which is used to locate specific digital objects such as journal articles, and ORCID, a PID for researchers. Also known as Persistent Digital Identifiers (PDIs) or Globally Unique Identifiers (GUIDs). | NNLM Data Glossary |
Preservation (of data) | A series of managed activities necessary to ensure continued stability and access to data for as long as necessary. For data to be preserved, at minimum, it must be stored in a secure location, stored across multiple locations (e.g. ‘Rule of Three’), and saved in open file formats that will likely have the greatest utility in the future. It can also include depositing data in a repository, which allows for publication and preservation. | NNLM Data Glossary |
Public domain | A work of authorship that is no longer under copyright protection, or that failed to meet the requirements for copyright protection. Works in the public domain may be used freely without the permission of the former copyright owner. | US Copyright Office |
Qualitative data | Data representing information and concepts that are not represented by numbers; used more frequently in the humanities and social sciences. Includes data gathered from interviews and focus groups, personal diaries and lab notebooks, maps, photographs, and other printed materials or observations. | NNLM Data Glossary |
Quantitative data | Data represented numerically, including anything that can be counted, measured, or given a numerical value. It can be classified in different ways, including categorical data that contain categories or groups (like countries), discrete data that can be counted in whole numbers (like the number of students in a class), and continuous data that is a value in a range (like height or temperature). | NNLM Data Glossary |
README | Provides information about a data file and is intended to help data be correctly interpreted, by yourself at a later date or by others when sharing or publishing data. (More on README files here.) | NNLM Data Glossary |
Repository | A tool to share, preserve, and discover research outputs, including data or datasets. Generally speaking, researchers submit and describe their own data which is then ingested into the repository for storage. Other researchers can then download, or request to download, the data directly from the repository. (More on data repositories here.) | NNLM Data Glossary |
Reusable | The “R” in FAIR. Reusability is the ultimate goal of FAIR data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings. | The FAIR Guiding Principles for Scientific Data Management and Stewardship |
Reuse | The analysis of existing data collected by other individuals or institutions for a new research purpose. It can refer to statistical, quantitative data or descriptive, qualitative data. Also known as secondary data analysis. | NNLM Data Glossary |
Standards (data) | An agreed-upon approach to allow for consistent measurement, qualification or exchange of an object, process, or unit of information, such as the metric system of measurement. Includes methods of organizing, documenting, and formatting data in order to aid in data aggregation, sharing and reuse. Data standards can be generated by a research community, a governmental organization, or other large organizations. Metadata standards are also data standards as they standardize how metadata is formatted in order to ease the sharing of metadata across platforms. | NNLM Data Glossary |
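
To make the data dictionary entry above more concrete, the following is a minimal Python sketch of how one variable's entry might be recorded and then used to check collected values. The blood pressure codes come from the definition in the table; the class name `VariableEntry`, the variable name `bp_status`, the question wording, and the helper `find_invalid` are hypothetical and chosen only for illustration. Real data dictionaries are often kept as spreadsheets or in repository-specific formats instead.

```python
from dataclasses import dataclass, field


@dataclass
class VariableEntry:
    """One data dictionary entry describing a single variable (hypothetical layout)."""
    name: str                  # variable name as it appears in the dataset
    question: str              # full wording of the question asked
    data_type: str             # e.g. "free text", "numerical", "categorical"
    allowed_values: dict = field(default_factory=dict)  # code -> meaning

    def is_valid(self, value) -> bool:
        """Return True if the value is one of the allowed codes."""
        return value in self.allowed_values

    def label(self, value) -> str:
        """Return the meaning of an allowed code."""
        return self.allowed_values[value]


# Hypothetical entry; the codes mirror the example in the glossary definition.
bp_status = VariableEntry(
    name="bp_status",
    question="Have you ever been told that you have high blood pressure?",
    data_type="categorical",
    allowed_values={
        0: "no high blood pressure diagnosis",
        1: "borderline high blood pressure",
        2: "high blood pressure",
    },
)


def find_invalid(entry: VariableEntry, values: list) -> list:
    """List the collected values that the data dictionary does not allow."""
    return [v for v in values if not entry.is_valid(v)]


collected = [0, 2, 1, 3, 2]  # made-up responses; 3 is not an allowed code
print(find_invalid(bp_status, collected))
print(bp_status.label(1))
```

Keeping the allowed codes and their meanings in one place lets the same entry serve both as documentation for people reusing the data and as a simple validation rule during collection.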
Sources
These sources may define additional terms and provide more detail.
Committee on Data Research Data Management Terminology
The FAIR Guiding Principles for Scientific Data Management and Stewardship
NCATS Toolkit for Patient-Focused Therapy Development
NIST Computer Security Resource Center
Ten Simple Rules for Maximizing the Recommendations of the NIH Data Management and Sharing Plan
Working Group on NIH DMSP Guidance