Spreadsheets are ubiquitous in research. They provide an easy way to quickly record and organize tabular data.
However, researchers often have to spend large amounts of time cleaning and organizing this spreadsheet data before they can do analysis.
By investing some time at the start of a project to plan and optimize the structure of the data table you can:
- reduce the amount of time spent later on reorganizing and restructuring the data
- make it easier to:
- sort and review the data
- analyze the data
- visualize the data
- transform or import it into a database
- integrate it with other datasets
Best Practices for Organizing Tabular Data
Tidy Data
Creating “tidy data” is a subset of data cleaning, and generally refers to data that has been “structured for data analysis,” or “making it easier for computers to work with the data.”
Hadley Wickham provides a set of guidelines for Tidy Data.
Data Carpentry
Data Carpentry has a nice, step by step activity for creating tidy data, based on a set of 10 general rules.
Deeper Dive
Ready for a deeper dive into the nuances of data structure? See these:
- Oak Ridge National Laboratories’ (ORNL) Best Practices for Preparing Environmental Data Sets to Share and Archive has very practical in-depth examples of data structure issues and best practices.
- Broman, K. W., & Woo, K. H. (2018). Data Organization in Spreadsheets. American Statistician, 72(1), 2–10.
Tools
OpenRefine
OpenRefine is an open source tool that is free for you to download and install on Windows, Mac, and Linux computers. Use OpenRefine to review and clean up (tidy) tabular data, split or combine data columns, and insert linked data from other sources, among other things.
Download and installation instructions
The Libraries provide individual, team, and class-based training on OpenRefine (request via lib-data@uiowa.edu)
REDCap
REDCap can be a useful tool for creating and defining a data structure by basing it on a codebook or survey tool. Use REDCap for data entry for tabular data and surveys. It can be set up for HIPAA data, and is frequently used for biomedical research.
REDCap is supported by the Institute for Clinical and Translational Sciences (ICTS). Access and use of the system requires an account, which is easy to request by emailing ICTS.
More info: introductory and training videos, and overview page. Vanderbilt University created REDCap and provides more information.