IRO Metadata Best Practices - Research Data Services

A. Pre-deposit considerations

When possible, convert data to open, non-proprietary formats, which facilitate long-term access. For instance, Excel files should be converted to .csv or .tsv format. More information about open formats can be found here.

For tabular data, review the data structure and revise it as needed to facilitate reuse. The steps below make it easier to convert from Excel to .csv, and enable others to more easily open, view, and understand your data.

Don’t use colored text or backgrounds; use a single column header row; use columns for variables and rows for observations; exercise caution with date fields in an Excel file, and check them after conversion to .csv; and do not embed formulas in cells (they will not convert to .csv). See the short guides on best practices for tabular data for more information.
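
As a rough illustration of the checks listed above, the following Python sketch inspects a converted .csv file for a clean header row, a consistent number of fields per row, leftover formulas, and dates that are not in ISO 8601 form. The function name check_csv, the sample column names, and the choice of YYYY-MM-DD as the target date format are assumptions made for this example, not requirements of any repository.

```python
import csv
import io
import re

# Assumed target format for date columns: ISO 8601 calendar dates (YYYY-MM-DD).
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def check_csv(text, date_columns=()):
    """Return a list of human-readable problems found in CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]

    problems = []
    header = rows[0]
    if any(not name.strip() for name in header):
        problems.append("header row has an empty column name")
    if len(set(header)) != len(header):
        problems.append("header row has duplicate column names")

    # Row 1 is the header, so data rows are numbered from 2.
    for row_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_no}: expected {len(header)} fields, found {len(row)}"
            )
            continue
        for name, value in zip(header, row):
            if value.startswith("="):
                problems.append(f"row {row_no}: '{name}' looks like a formula")
            if name in date_columns and value and not ISO_DATE.match(value):
                problems.append(
                    f"row {row_no}: '{name}' is not an ISO 8601 date: {value}"
                )
    return problems


# Hypothetical sample data with one altered date and one leftover formula.
sample = "site,sample_date,pcb_ng_m3\nA1,2019-03-07,0.42\nA2,3/8/19,=B2*2\n"
for problem in check_csv(sample, date_columns={"sample_date"}):
    print(problem)
```

An empty list from check_csv means none of these particular problems were found; it does not replace opening the file and reviewing it by eye.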

If there are multiple files, name and organize them to help the reader; list them in the order you want them displayed in the repository, and include that list in the readme.txt (see E.1 below).

Create a data dictionary for each tabular dataset, or a codebook for surveys (see E.2 and E.3 below).

If article/peer reviewers need to have temporary access to the data before it is publicly accessible, we can provide a temporary link to the dataset from IRO that you can provide to reviewers.

B. Obtain a DOI for the dataset from the repository, so you can cite the data in your article manuscript

Many repositories can reserve a DOI for a dataset prior to publishing the data, so that you can include a reference/citation for the dataset(s) in your manuscript, even before the data is made openly accessible*.

See C and D below for adding a reference to your data in your article and creating a data availability statement to include in your article.

*The DOI will not work until it is activated. Once it is activated, the data is published, and this is not reversible.

Required information: To reserve a DOI for your data, a repository will usually need the following:

Title of the Data Set: Not the same as the title of the article.
Example: PCB Emissions data from Paint Colorants.
List all Creators (aka authors) of the dataset, in the order you want them to be listed in the data record and citation for the data. Please provide first name (and middle initial if so desired), and last name. For UI researchers, please make sure you have an ORCID and that it is connected with the UI:
- If you don’t have an ORCID yet, start here.
- If you have an ORCID, connect it to the University of Iowa.
- If there are Creators/Authors from other institutions, please provide their ORCIDs if you can.
Publication Year: the repository will usually use the current year, unless you tell them that the dataset has previously been made available elsewhere.
Funder(s) and funder-supplied grant number(s) (see the guidelines for NIH grant numbers).
The license you would like to use for the data.
- If you have multiple types of materials (e.g., data, code/software, instructional materials), you may need to select licenses appropriate for each type of material.
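
One way to gather this information before contacting the repository is to keep it in a small structured record. The Python sketch below is only an illustration: the class names Creator and DoiRequest, the field names, and the missing_items helper are hypothetical and do not correspond to any repository's submission form. The ORCID check only tests the general shape of an ORCID iD (four hyphenated groups of four characters, the last of which may be X); it does not confirm that the iD exists.

```python
from dataclasses import dataclass, field
import re

# Shape of an ORCID iD: four groups of four characters; the final one may be "X".
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


@dataclass
class Creator:
    first_name: str      # may include a middle initial
    last_name: str
    orcid: str = ""      # may be unknown for creators at other institutions


@dataclass
class DoiRequest:
    title: str
    creators: list       # Creator objects, in the order wanted in the citation
    publication_year: int
    license: str
    funders: dict = field(default_factory=dict)  # funder name -> grant number

    def missing_items(self):
        """List what still needs attention before sending the request."""
        issues = []
        if not self.title.strip():
            issues.append("title")
        if not self.creators:
            issues.append("at least one creator")
        for creator in self.creators:
            if creator.orcid and not ORCID_PATTERN.match(creator.orcid):
                issues.append(
                    f"valid ORCID for {creator.first_name} {creator.last_name}"
                )
        if not self.license.strip():
            issues.append("license")
        return issues


request = DoiRequest(
    title="PCB Emissions data from Paint Colorants",
    creators=[Creator("Jacob C.", "Jahnke"), Creator("Keri C.", "Hornbuckle")],
    publication_year=2019,
    license="",  # not chosen yet in this example
)
print(request.missing_items())  # ['license']
```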

C. Cite the data in your article manuscript/thesis, and include it in the references section

This will ensure that the citation to the data is displayed in both the pdf and online versions of your article, helping readers of online or print versions of the article to find the dataset. It also enables systems to generate citation statistics for your data.

Once you have the DOI from the repository, the citation would include the following elements, most of which are described above in B.

The exact order and punctuation will depend on the citation style of the journal in which you are publishing.

Creator (PublicationYear). Title. Publisher. (resourcetype). Identifier

Publisher is the institution with the repository, or the repository itself. In the example below, it is the University of Iowa, since the dataset is published in the UI’s repository.
resourcetype is usually ‘dataset’ but might also be ‘collection,’ ‘model,’ ‘software,’ etc., depending on the nature of the material being deposited.
Identifier is https://doi.org/ + DOI.

Example:

Jahnke, Jacob C. and Hornbuckle, Keri C. (2019): Dataset for PCB Emissions from Paint Colorants. University of Iowa. (dataset). https://doi.org/10.25820/vtd8-n771
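
The way these elements combine can be sketched in a few lines of Python. The function name format_data_citation and its parameters are hypothetical, and the punctuation follows the generic pattern given above rather than any particular journal style, so the result differs slightly from the example (a period rather than a colon after the year).

```python
def format_data_citation(creators, year, title, publisher, doi,
                         resource_type="dataset"):
    """Assemble the citation elements in the order of the generic pattern."""
    creator_text = " and ".join(creators)
    identifier = "https://doi.org/" + doi  # Identifier is https://doi.org/ + DOI
    return (
        f"{creator_text} ({year}). {title}. {publisher}. "
        f"({resource_type}). {identifier}"
    )


citation = format_data_citation(
    creators=["Jahnke, Jacob C.", "Hornbuckle, Keri C."],
    year=2019,
    title="Dataset for PCB Emissions from Paint Colorants",
    publisher="University of Iowa",
    doi="10.25820/vtd8-n771",
)
print(citation)
```

In practice a reference manager or the journal's style guide determines the final form; the sketch only shows which element goes where.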

D. Include a Data Availability Statement in your article manuscript

A data availability statement is a succinct note to inform the reader about the location of the data record and the data.

Note: This is in addition to citing the dataset as described in C above.

In the text of the document, include a statement describing how the data underlying the findings of your article can be accessed and reused. This should include a footnote or endnote to the citation for the dataset, in the article’s references section.

Your journal or publisher might include guidelines for the data availability statement. If they do not, see these examples of data availability statements from Taylor & Francis.

E. Provide these other details about the dataset:

The following information should be provided before the DOI is activated. We usually activate the DOI (making the data accessible) when the article is published. Once the DOI is activated, it can’t be deactivated.

This information will help others find, understand, and reuse the data.

1. A Readme.txt file

The readme file provides context about the data, explains the methods, and is indexed by search engines for use in web searches, so the more detail, the better.

Cornell University’s Research Data Management Service Group offers a great outline. Find more information here.

Example from Iowa Research Online:

See the readme file in this record: https://doi.org/10.25820/data.006135
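
As a small illustration, a readme skeleton can be generated from the pieces this guide asks for elsewhere: a title, an abstract, a methods statement, and the list of files in display order. The Python sketch below is an assumption-laden starting point, not the Cornell outline or an IRO template; the function name build_readme, the section labels, and the sample file names are all hypothetical.

```python
def build_readme(title, abstract, methods, files):
    """Return readme text; `files` holds (filename, description) in display order."""
    lines = [
        f"Title: {title}",
        "",
        "Abstract:",
        abstract,
        "",
        "Methods:",
        methods,
        "",
        "Files (in display order):",
    ]
    for number, (name, description) in enumerate(files, start=1):
        lines.append(f"{number}. {name} - {description}")
    return "\n".join(lines) + "\n"


# Hypothetical content; replace with details of the actual dataset.
readme_text = build_readme(
    title="PCB Emissions data from Paint Colorants",
    abstract="Brief description of the data and the context of collection.",
    methods="Summary of the methods relevant to the data.",
    files=[
        ("emissions.csv", "measured emissions, one row per observation"),
        ("emissions_dictionary.csv", "data dictionary for emissions.csv"),
    ],
)
print(readme_text)
```

The generated text would then be saved as readme.txt and expanded by hand with the additional detail the outline recommends.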

2. For spreadsheet/tabular data, create a Data Dictionary

A data dictionary is critical to making your research more reproducible because it allows others to understand your data. The purpose of a data dictionary is to explain what all the variable names and values in your spreadsheet really mean.[1]

Data dictionaries are usually in .csv or .txt file format. If there is a group of data files that are similar, one dictionary might describe all of them. If each data file in a collection of files has a unique set of variables, it may be better to provide a data dictionary for each file.

These are some of the most common elements described in data dictionaries:

Variable names
Readable variable name (may include a definition/description of the variable)
Measurement units
Allowed values, or range of values, if applicable
Are null values allowed for the variable?
Other codes for the variable (e.g., for missing data, data below limit of detection/quantitation)
What data type is the variable (text, string, number, ISO 8601 date, etc.)
Synonyms for the variable name (optional)

Other resources

See examples from: Open Science Framework, the Smithsonian (.pdf), and this USDA blank template.
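
A data dictionary covering the elements listed above could be drafted as a .csv file along the following lines. This Python sketch is illustrative only: the column names in FIELDS, the two sample variables, and the -999 missing-data code are assumptions for the example, and the helper undocumented simply reports data columns that have no dictionary entry yet.

```python
import csv
import io

# Assumed dictionary columns, mirroring the common elements listed above.
FIELDS = [
    "variable", "readable_name", "units", "allowed_values",
    "nulls_allowed", "other_codes", "data_type",
]

# Hypothetical entries for a hypothetical emissions table.
entries = [
    {
        "variable": "sample_date",
        "readable_name": "Date the sample was collected",
        "units": "",
        "allowed_values": "",
        "nulls_allowed": "no",
        "other_codes": "",
        "data_type": "ISO 8601 date",
    },
    {
        "variable": "pcb_ng_m3",
        "readable_name": "PCB concentration in air",
        "units": "ng/m3",
        "allowed_values": ">= 0",
        "nulls_allowed": "yes",
        "other_codes": "-999 = below limit of detection",
        "data_type": "number",
    },
]


def dictionary_as_csv(entries):
    """Serialize the dictionary entries as CSV text with a single header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def undocumented(data_header, entries):
    """Return data columns that do not yet have a dictionary entry."""
    documented = {entry["variable"] for entry in entries}
    return [name for name in data_header if name not in documented]


print(dictionary_as_csv(entries))
print(undocumented(["site", "sample_date", "pcb_ng_m3"], entries))  # ['site']
```

Comparing the data file's header against the dictionary in this way is a quick check that every variable has been described before deposit.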

3. For surveys, create a Codebook

A codebook is a type of data dictionary that is more appropriate for survey and interview data.
See here.

4. Create an abstract and a methods statement (if you did not already include these in the readme.txt file, above):

The abstract should be a brief description of the data and the context in which the data was collected or created.

Focus on the data, rather than reusing the abstract from the related article. Abstracts help make your data more discoverable, and they provide context and information about the dataset for the researcher who finds your data.

An abstract might also be included in the readme.txt file; in fact, we recommend having these texts in both places. If it is in your readme file, in IRO we will use that for the Abstract when we create the data record.

A methods section can also provide important information to researchers and others who may find and view your data. Here too, try to focus on the methods that are relevant to the data. This should describe the methodology employed for the study or research.

5. Subjects (keywords, descriptors)

Most repositories will allow you to select or enter a list of descriptors or keywords about the data. Your data record will be improved if you include terms from thesauri or other sources, such as gene names, species, or chemical substance identifiers.

6. Are any related works already published?

If other materials (software, data, articles) are about to be published or have already been published that are related to the data, provide the full citation (including DOI, ISBN, etc.), so we can add that to the record and link the two together.

For instance, if:

the dataset is a subset of a published dataset, or incorporates data from other sources,
code or software is published elsewhere and associated with the dataset, or
another article has been published on this dataset,

provide the citation(s), including DOIs, for those sources.

F. Contact the repository when the article is published

Send the DOI for the article (or other related materials) to the repository, and they will update the data record so that it has a link to the article, and activate the dataset DOI, making it publicly accessible.

This will enable links in both directions between article and dataset.

[1] https://help.osf.io/hc/en-us/articles/360019739054-How-to-Make-a-Data-Dictionary; https://www.lib.uiowa.edu/data/manage/documenting/readme/#codebooks

Questions? Contact us: lib-data@uiowa.edu or brian-westra@uiowa.edu