Readme files | Data Dictionaries | Codebooks

Readme Files

A readme file is an important addition to your raw data collection.

Creating a readme file and updating it during active research helps you track your work, and makes it much easier to share or publish your data when you are ready.

Readme files allow others to understand and reuse your data after you have submitted it to repositories by explaining the nuances of your unique data collection.

See the “PCBsSchoolAirREADME v3” file in this record for an example of a readme file.

Guidelines

  1. Start during the research project
    Documentation will help you and others interpret and understand your data later.

  2. Use an outline
  3. Update the readme as the project progresses
  4. Deposit your readme file with your data
    Once you have finished preparing your dataset and readme files, you can submit them to your chosen repository. Including a readme file with your data helps ensure that others will be able to understand and reuse your data (subject to any licenses and permissions) for years to come.
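
As an illustration of the "use an outline" guideline, here is a minimal sketch that writes a plain-text readme skeleton to fill in as the project progresses. The section names and the build_readme helper are hypothetical choices made for this example, not a required standard; check what your chosen repository asks for.

```python
from datetime import date

# Hypothetical outline: these section names are assumptions for illustration,
# not a required standard.
README_SECTIONS = [
    "General information",
    "Files in this dataset",
    "Methods",
    "Variables and units",
    "License and permissions",
    "Change log",
]


def build_readme(title, sections, today=None):
    """Return a plain-text readme skeleton with one placeholder per section."""
    today = today or date.today()
    lines = [title, "=" * len(title), "", f"Last updated: {today.isoformat()}", ""]
    for section in sections:
        # Each section gets an underlined heading and a TODO to replace later.
        lines += [section, "-" * len(section), "TODO", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_readme("Example dataset", README_SECTIONS))
```

Regenerating or editing the "Last updated" line whenever the data changes is one simple way to keep the readme in step with the project.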

Data Dictionaries

A data dictionary is a file that describes each element of your dataset. If your dataset includes tabular (spreadsheet) data, the data dictionary would include a list of the fields in the table and what they mean, including units and precision.

If your data includes R or Python code or scripts, the dictionary would provide a brief overview of the purpose of the code (if not already contained in comments) and information about how the code relates to the dataset. [From Smithsonian Data Management Best Practices. Describing Your Data: Data Dictionaries (pdf)].
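
For tabular data, a first draft of a data dictionary can be produced from the table itself and then completed by hand. The sketch below is one possible approach using only the Python standard library. The sample columns, the type labels, and the helper names (infer_type, draft_dictionary) are invented for illustration; units and descriptions are left blank because only the researcher can supply them.

```python
import csv
import io

# Hypothetical sample data: column names and values are invented for illustration.
SAMPLE_CSV = """site_id,sample_date,pcb_conc
A01,2021-03-04,12.5
A02,2021-03-05,8.25
A03,2021-03-06,
"""


def infer_type(values):
    """Guess a simple type label from the non-empty values in a column."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "unknown"
    for label, caster in (("integer", int), ("decimal", float)):
        try:
            for v in non_empty:
                caster(v)
            return label
        except ValueError:
            continue
    return "text"


def draft_dictionary(csv_text):
    """Return one draft dictionary entry per column of the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for name in reader.fieldnames or []:
        values = [(row[name] or "") for row in rows]
        entries.append({
            "field": name,
            "type": infer_type(values),
            "has_missing": any(v == "" for v in values),
            "units": "",        # to be filled in by the researcher
            "description": "",  # to be filled in by the researcher
        })
    return entries


if __name__ == "__main__":
    for entry in draft_dictionary(SAMPLE_CSV):
        print(entry)
```

The inferred types are only a starting point. A date column, for example, is reported here as text and would need to be corrected by hand.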

Data dictionaries have several benefits:

  1. Keeping things consistent across a project. The dictionary can define data names, labels, units, constraints such as acceptable range of values, and other characteristics.
  2. Enabling software to process a data file, by providing details to the software about the file. This information might include the name of each column; the type of data it holds (integer, character, date, etc.); the physical units, if relevant; and whether nulls are included.
  3. Increasing interoperability and reuse of the data that you want to share and publish.
  4. Providing “human-readable” details to support discovery, interpretation and analysis.
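
The first two benefits above can be sketched in code: once a data dictionary is machine-readable, software can check each row of a data file against it. The example below is a minimal sketch, not a full validation tool. The field names, units, ranges, and the validate_row helper are all hypothetical.

```python
# Hypothetical data dictionary: field names, units, and ranges are invented
# for illustration only.
DATA_DICTIONARY = {
    "site_id": {"type": str, "nullable": False},
    "pcb_conc": {
        "type": float,
        "units": "ng/m3",
        "min": 0.0,
        "max": 1000.0,
        "nullable": True,
    },
}


def validate_row(row, dictionary):
    """Return a list of problems found in one row (empty means the row passed)."""
    problems = []
    for field, spec in dictionary.items():
        raw = row.get(field, "")
        if raw == "":
            # Missing values are acceptable only where the dictionary allows them.
            if not spec["nullable"]:
                problems.append(f"{field}: missing value")
            continue
        try:
            value = spec["type"](raw)
        except ValueError:
            problems.append(f"{field}: cannot read {raw!r} as {spec['type'].__name__}")
            continue
        if "min" in spec and value < spec["min"]:
            problems.append(f"{field}: {value} is below the minimum {spec['min']}")
        if "max" in spec and value > spec["max"]:
            problems.append(f"{field}: {value} is above the maximum {spec['max']}")
    return problems


if __name__ == "__main__":
    print(validate_row({"site_id": "A01", "pcb_conc": "12.5"}, DATA_DICTIONARY))
    print(validate_row({"site_id": "", "pcb_conc": "-3"}, DATA_DICTIONARY))
```

Keeping the constraints in the dictionary rather than in the checking code means the same definitions can serve both human readers and software.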

For more details on what might be in a data dictionary, how to make one, and examples, see:

Codebooks

Codebooks are used by survey researchers to provide information about the data from a survey instrument. The codebook documents the layout and structure of the data file, the response codes that are used to record survey responses, and other information.
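
To make the idea of response codes concrete, here is a minimal sketch of a codebook held as a Python dictionary and used to translate coded survey responses back into their labels. The variable names, question wording, and code values are hypothetical, and the structure shown is an assumption rather than the format of any particular survey tool.

```python
# Hypothetical codebook: variable names, labels, and codes are invented
# for illustration only.
CODEBOOK = {
    "q1_satisfaction": {
        "label": "Overall satisfaction with the service",
        "codes": {
            1: "Very dissatisfied",
            2: "Dissatisfied",
            3: "Neutral",
            4: "Satisfied",
            5: "Very satisfied",
            -9: "No response",
        },
    },
    "q2_returning": {
        "label": "Has used the service before",
        "codes": {0: "No", 1: "Yes", -9: "No response"},
    },
}


def decode_response(record, codebook):
    """Replace each coded value in a survey record with its codebook label."""
    decoded = {}
    for variable, code in record.items():
        entry = codebook.get(variable)
        if entry is None:
            # Variable is not documented in the codebook; keep the raw value.
            decoded[variable] = code
        else:
            decoded[variable] = entry["codes"].get(code, f"undocumented code {code}")
    return decoded


if __name__ == "__main__":
    print(decode_response({"q1_satisfaction": 4, "q2_returning": -9}, CODEBOOK))
```

Flagging undocumented codes instead of silently dropping them makes gaps in the codebook visible while the data is still being prepared.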

Some tools (e.g., REDCap) may generate a codebook for you. In other cases you may need to augment what the tool generated, or create one from scratch.

A codebook enables the user to quickly ascertain some of the details about a dataset before downloading the file. Like data dictionaries, codebooks can provide the information that facilitates the integration of datasets from different sources.

Examples: