The goal: make your data machine-readable so you have the flexibility to import the data into a variety of analysis tools and data repositories in the future.
Provide column labels or a header line.
Label each column with a short but informative name.
Follow the same conventions recommended for file names -- use only letters, numbers, or underscores. Avoid spaces and special characters.
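As an illustration of the naming rule above, here is a minimal Python sketch that converts free-form headers into names made only of letters, numbers, and underscores. The function name clean_column_name and the sample headers are hypothetical, and lowercasing the result is an extra choice made here, not something the recommendation requires.

```python
import re


def clean_column_name(raw_name: str) -> str:
    """Return a header made only of letters, digits, and underscores."""
    # Replace every run of other characters (spaces, punctuation) with one underscore.
    name = re.sub(r"[^A-Za-z0-9]+", "_", raw_name.strip())
    # Drop underscores left at either end, then lowercase (an optional choice).
    return name.strip("_").lower()


# Hypothetical headers, as they might appear in a hand-built spreadsheet.
headers = ["Subject ID", "Height (cm)", "Weight-kg", "visit_date"]
cleaned = [clean_column_name(h) for h in headers]
print(cleaned)  # ['subject_id', 'height_cm', 'weight_kg', 'visit_date']
```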
Document the definitions of codes, abbreviations, and variable names.
Abbreviations and variable names don't mean the same thing to everyone. Creating a list that defines each variable or code ensures that all project staff are collecting the same data -- this list will also help future users understand your data.
The simplest version of this documentation would be a "ReadMe" text file that resides in the same folder as your data. The social sciences often refer to this information as a codebook, while other disciplines use the term "data dictionary."
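One possible way to keep such a list consistent is to generate the ReadMe text from a single list of variable definitions. The sketch below is only an illustration: the variable names, definitions, units, and the file name study_data.csv are all hypothetical.

```python
# Hypothetical variables: (name, definition, units or type).
variables = [
    ("subject_id", "Unique identifier assigned to each participant", "text"),
    ("height_cm", "Standing height", "centimeters"),
    ("weight_kg", "Body weight", "kilograms"),
]


def build_readme(title, variable_list):
    """Return plain ReadMe text with one line per variable."""
    lines = [title, "", "Variables:"]
    for name, definition, units in variable_list:
        lines.append(f"{name}: {definition} ({units})")
    return "\n".join(lines) + "\n"


readme_text = build_readme("Data dictionary for study_data.csv", variables)
print(readme_text)
```

The resulting text could then be saved as a ReadMe file in the same folder as the data.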
Columns should contain one single type of data.
Are your data text, numeric, categorical, or another type?
Format dates and times according to the ISO 8601 standard.
If you're using text data, be sure to use a standard naming convention (see examples below).
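For the date recommendation above, a short Python sketch shows what ISO 8601 output looks like. It assumes the original dates were typed in month/day/year order; that input format, the helper name to_iso_date, and the sample values are assumptions for illustration.

```python
from datetime import datetime


def to_iso_date(text: str, source_format: str) -> str:
    """Parse a date written in source_format and return it as YYYY-MM-DD."""
    parsed = datetime.strptime(text, source_format)
    return parsed.date().isoformat()


# A date typed as month/day/year (assumed input format).
print(to_iso_date("04/15/2014", "%m/%d/%Y"))  # 2014-04-15

# A date with a time of day uses the combined ISO 8601 form.
stamp = datetime(2014, 4, 15, 13, 30, 0)
print(stamp.isoformat())  # 2014-04-15T13:30:00
```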
Record component variables, not compound variables.
For example: If you're measuring each subject's BMI (Body Mass Index), don't just record the BMI itself. Also record the data you used to calculate the BMI (height and weight). This gives you more options later (you could re-calculate the BMI using a different formula if desired).
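The BMI example can be sketched in Python as follows. The sketch assumes the common formula (weight in kilograms divided by the square of height in meters); the subject records are made up for illustration. Because height and weight are stored, the derived value can be recomputed at any time.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2


# Made-up records that store the component variables.
records = [
    {"subject_id": "S01", "height_m": 1.75, "weight_kg": 70.0},
    {"subject_id": "S02", "height_m": 1.60, "weight_kg": 64.0},
]

# The compound variable is derived from the stored components.
for record in records:
    record["bmi"] = round(bmi(record["weight_kg"], record["height_m"]), 1)

print(records[0]["bmi"])  # 22.9
```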
Agree on a standard representation for missing data.
Does your field or your preferred analysis software have a standard notation to represent missing data?
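As a rough illustration, the Python sketch below reads a small table in which the project has agreed to write NA for every missing value. The token NA, the column names, and the sample rows are assumptions; substitute whatever notation your field or software expects.

```python
import csv
import io

MISSING_TOKEN = "NA"  # assumed project-wide convention for missing values

# Stand-in for a CSV file on disk (hypothetical contents).
raw = io.StringIO(
    "subject_id,height_cm,weight_kg\n"
    "S01,175,70\n"
    "S02,NA,64\n"
)


def parse_value(text):
    """Return None for the missing-data token, otherwise the number."""
    return None if text == MISSING_TOKEN else float(text)


rows = []
for row in csv.DictReader(raw):
    rows.append({
        "subject_id": row["subject_id"],
        "height_cm": parse_value(row["height_cm"]),
        "weight_kg": parse_value(row["weight_kg"]),
    })

missing_heights = [r["subject_id"] for r in rows if r["height_cm"] is None]
print(missing_heights)  # ['S02']
```

Using one agreed token means a single check finds every missing value, instead of hunting for blanks, dashes, and zeros.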
Avoid visual cues and ambiguous/dependent information.
Programs like Excel allow you to make very visual spreadsheets -- but remember:
highlighting and font colors will be lost if you need to export data to different software,
merging cells could also hinder data export,
notes such as "see above cell" could become meaningless if someone re-sorts the spreadsheet.
A fictional example of a poorly-structured spreadsheet:
The same spreadsheet, with suggested corrections:
Recommendations are based on: Andrea Horne Denton and Sherry Lake's "Workshop on the Best Practices in Data Collection and Management," presented at the National Network of Libraries of Medicine, Middle Atlantic Region's Symposium: "Doing It Your Way: Approaches to Research Data Management for Libraries" (April 2014)
Data documentation (metadata)
What are metadata? (and what are metadata standards or schemas?)
In a nutshell, metadata are documentation about data (or data about data).
The longer definition is: "structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource." (from the National Information Standards Organization's "Understanding Metadata", pg 1).
Metadata standards or schemas provide "common terms, definitions and structure that allow for consistent communication" -- the structure also provides a "reliable and predictable format for computer interpretation" (from DataONE Education Module: Lesson 07: Metadata -- http://www.dataone.org/education-modules).
Why add metadata to your data?
Good metadata help other researchers discover, understand, and use your data.
Any known intellectual property rights, licenses, or restrictions on use of the data.
Where and how your data can be accessed by other researchers.
The following elements may not be required to deposit your dataset in a repository, but they will help future users interpret your data correctly:
How the data were generated -- specific observations or measurements recorded; equipment and software used (model and version numbers); formulae, algorithms; experimental protocols; and other things one might include in a lab notebook.
How the data have been altered or processed (e.g. normalized).
Why the data were created or collected.
All applicable physical locations.
Variable and code list
Definitions of all variables in the data files, where applicable. Explanations of codes or abbreviations used in either the file names or the variables in the data files (e.g. 999 indicates a missing value in the data).
All files forming part of the dataset, including hierarchy structure and extensions (e.g. ‘photo1023.jpeg’, ‘participant12.pdf’).
Formats of the data file(s), e.g. SPSS, HTML, PDF, JPEG.
Unique date/time stamp and identifier for each version.
Names of any special-purpose software packages required to create, view, analyze, or otherwise use the data.
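To show how a variable and code list like the one described above can be put to work, here is a minimal Python sketch that decodes raw values and treats 999 as missing, following the example in the list. The variable smoker and its codes are hypothetical.

```python
MISSING_CODE = 999  # missing-value code taken from the example above

# Hypothetical code list: one entry per coded variable.
CODE_LIST = {
    "smoker": {0: "no", 1: "yes"},
}


def decode(variable, value):
    """Translate a raw code into its label; None marks a missing value."""
    if value == MISSING_CODE:
        return None
    labels = CODE_LIST[variable]
    if value not in labels:
        # A code that is not in the list signals a documentation gap.
        raise ValueError(f"Undocumented code {value!r} for variable {variable!r}")
    return labels[value]


raw_values = [1, 0, 999, 1]
decoded = [decode("smoker", v) for v in raw_values]
print(decoded)  # ['yes', 'no', None, 'yes']
```

Raising an error on an undocumented code is a design choice in this sketch: it surfaces gaps in the code list early rather than passing unknown values through silently.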