Readme files | Data Dictionaries | Codebooks

Readme Files

A readme file is an important addition to your raw data collection.

Creating and updating a readme file during active research helps you track your work and makes it much easier to share or publish your data when you are ready.

Readme files allow others to understand and reuse your data after you have submitted it to repositories by explaining the nuances of your unique data collection.

Here’s an example of a data record that includes a readme file:


These guidelines may help you organize and create a readme file for your dataset.

  1. Start during the research project
    You can start preparing for your readme file when you start collecting your data. Make notes that will help you and others interpret and understand your data later. “If you didn’t document it, it didn’t happen. In two years you’ll wish you had written more down.” (Mike Harms Lab wiki)
  2. Use an outline
    Starting with a well-organized outline can help ease the process. For example:

    • Cornell University’s Research Data Management Service Group offers a great outline
  3. Update the readme as the project progresses
  4. Deposit your readme file with your data
    Once you have finished preparing your dataset and readme files, you can submit them to your chosen repository. Including a readme file with your data ensures that others will be able to understand and reuse your data (with respect to licenses/permissions) for years to come.

Data Dictionaries

Data dictionaries provide critical information about data by describing the names, definitions, and attributes of the data elements in the file. In some cases, data dictionaries and codebooks may provide overlapping information. Dictionaries are deposited with the data when the data is shared or published. In some cases, the repository might assist with creating the dictionary and the codebook (below).

A data dictionary is a file that describes each element of your dataset. If your dataset includes tabular data, R code, and images, the data dictionary would include a list of the fields in the table and what they mean, including units and precision; a brief overview of the purpose of the code (if not already contained in comments); and information about the images and how they relate to the dataset (more detailed metadata for the images should be embedded). Source: Smithsonian Data Management Best Practices, Describing Your Data: Data Dictionaries (pdf).

Data dictionaries can serve several purposes, including:

  1. Improving efficiency and reducing the risk of mistakes and data loss by keeping things consistent across a project. The dictionary can define data names, labels, units, constraints such as the acceptable range of values, and other characteristics.
  2. Enabling software to process a data file by providing details about the file. This information might include the name of each column; the type of data it holds (integer, character, date, etc.); the physical units, if relevant; and whether nulls are allowed.
  3. Increasing interoperability and reuse of the data that you want to share and publish.
  4. Providing “human-readable” details to support discovery, interpretation and analysis.
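To illustrate the first two purposes, here is a minimal Python sketch of a machine-readable data dictionary used to check a small table. Everything in it is an assumption made for illustration: the column names (site_id, temp_c, count), the dictionary keys (type, units, nullable, min, max), and the sample values are hypothetical and do not come from any particular standard or repository.

```python
import csv
import io

# Hypothetical data dictionary: one entry per column of an illustrative table.
DATA_DICTIONARY = {
    "site_id": {"type": str, "units": None, "nullable": False,
                "description": "Identifier of the sampling site"},
    "temp_c": {"type": float, "units": "degrees Celsius", "nullable": True,
               "min": -50.0, "max": 60.0,
               "description": "Air temperature at sampling time"},
    "count": {"type": int, "units": "individuals", "nullable": False,
              "min": 0, "description": "Number of individuals observed"},
}


def validate_rows(rows, dictionary):
    """Return a list of (row_number, column, message) problems."""
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for column, spec in dictionary.items():
            raw = row.get(column, "")
            if raw is None or raw == "":
                # Empty cells are only a problem when nulls are not allowed.
                if not spec["nullable"]:
                    problems.append((row_number, column, "missing value"))
                continue
            try:
                value = spec["type"](raw)  # convert text to the declared type
            except ValueError:
                name = spec["type"].__name__
                problems.append((row_number, column, f"not a valid {name}"))
                continue
            if "min" in spec and value < spec["min"]:
                problems.append((row_number, column, "below minimum"))
            if "max" in spec and value > spec["max"]:
                problems.append((row_number, column, "above maximum"))
    return problems


# Made-up sample data containing a few deliberate problems.
SAMPLE_CSV = """site_id,temp_c,count
A1,21.5,3
A2,,7
,19.0,2
A4,99.9,-1
A5,warm,4
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
problems = validate_rows(rows, DATA_DICTIONARY)
for problem in problems:
    print(problem)
```

Keeping the definitions in one structure means the same dictionary can drive validation and also be exported as the human-readable documentation that is deposited with the data.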

For more details on what might be in a data dictionary, how to make one, and examples, see:


Codebooks

Codebooks are used by survey researchers to provide information about the data from a survey instrument. The codebook documents the layout and structure of the data file, the response codes that are used to record survey responses, and other information.

Some tools (e.g., REDCap) may generate a codebook for you. In other cases you may need to augment what the tool generated, or create one from scratch.

A codebook enables the user to quickly ascertain some of the details about a dataset before downloading the file. Like data dictionaries, codebooks can provide the information that facilitates the integration of datasets from different sources.
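As a rough sketch of how response codes documented in a codebook can be applied, the Python example below decodes numeric survey responses into their labels. The variable names (q1_satisfaction, q2_used_before), the question wording, the codes, and the use of 9 for "No answer" are all hypothetical. Tools such as REDCap produce their own codebook formats, which this sketch does not attempt to reproduce.

```python
from collections import Counter

# Hypothetical codebook for two survey questions.
CODEBOOK = {
    "q1_satisfaction": {
        "label": "Overall, how satisfied are you with the service?",
        "codes": {1: "Very dissatisfied", 2: "Dissatisfied", 3: "Neutral",
                  4: "Satisfied", 5: "Very satisfied", 9: "No answer"},
        "missing": {9},  # codes to treat as missing data
    },
    "q2_used_before": {
        "label": "Have you used the service before?",
        "codes": {0: "No", 1: "Yes", 9: "No answer"},
        "missing": {9},
    },
}


def decode_response(variable, code, codebook):
    """Translate a numeric response code into its label.

    Returns None for codes the codebook marks as missing and raises
    KeyError for codes the codebook does not document.
    """
    entry = codebook[variable]
    if code in entry["missing"]:
        return None
    return entry["codes"][code]


# Made-up raw responses, stored as numeric codes.
responses = [
    {"q1_satisfaction": 4, "q2_used_before": 1},
    {"q1_satisfaction": 9, "q2_used_before": 0},
    {"q1_satisfaction": 4, "q2_used_before": 1},
]

decoded = [
    {variable: decode_response(variable, code, CODEBOOK)
     for variable, code in response.items()}
    for response in responses
]

# Count the decoded answers to the first question.
tally = Counter(row["q1_satisfaction"] for row in decoded)
print(tally)
```

Raising an error on undocumented codes is a deliberate choice in this sketch: a code that appears in the data but not in the codebook usually means one of the two needs correcting.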