Data Management @ NAU

Structuring data

The goal: make your data machine-readable so you have the flexibility to import the data into a variety of analysis tools and data repositories in the future.

Provide column labels or a header line. Label each column with a short but informative name.

Follow the same conventions recommended for file names -- use only letters, numbers, or underscores. Avoid spaces and special characters.
Document the definitions of codes, abbreviations, and variable names. Abbreviations and variable names don't mean the same thing to everyone. Creating a list that defines each variable or code ensures that all project staff are collecting the same data -- this list will also help future users understand your data.

The simplest version of this documentation would be a "ReadMe" text file that resides in the same folder as your data. The social sciences often refer to this information as a codebook, while other disciplines use the term "data dictionary."
Each column should contain a single type of data: text, numeric, categorical, and so on.

  • Format dates and times according to the ISO 8601 standard.
  • If you're using text data, be sure to use a standard naming convention (see examples below).
Record component variables, not compound variables. For example, if you're measuring each subject's BMI (Body Mass Index), don't record only the BMI itself. Also record the data you used to calculate it (height and weight). This gives you more options later: you could recalculate the BMI using a different formula if desired.
Agree on a standard representation for missing data. Does your field or your preferred analysis software have a standard notation to represent missing data?
Avoid visual cues and ambiguous/dependent information. Programs like Excel allow you to make very visual spreadsheets -- but remember:
  • highlighting and font colors will be lost if you need to export data to different software,
  • merging cells could also hinder data export,
  • notes such as "see above cell" could become meaningless if someone re-sorts the spreadsheet.
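
Several of the recommendations above (column labels limited to letters, numbers, and underscores; ISO 8601 dates; component variables; an agreed missing-data code) can be applied with a short script. The following is a minimal Python sketch, not part of the original workshop material. The function names, the "NA" missing-data code, the month/day/year source format, and the choice of the metric BMI formula (weight in kilograms divided by height in metres squared) are all assumptions made for illustration.

```python
import re
from datetime import datetime

MISSING = "NA"  # assumed project-wide code for missing data


def clean_column_name(label: str) -> str:
    """Reduce a free-form label to letters, numbers, and underscores."""
    name = re.sub(r"[^0-9A-Za-z]+", "_", label.strip())
    return name.strip("_")


def to_iso_date(text: str, source_format: str = "%m/%d/%Y") -> str:
    """Convert a date in a known source format to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(text, source_format).date().isoformat()


def bmi(weight_kg: str, height_m: str) -> str:
    """Derive BMI from its component columns; pass the missing code through."""
    if MISSING in (weight_kg, height_m):
        return MISSING
    return f"{float(weight_kg) / float(height_m) ** 2:.1f}"


print(clean_column_name("Body Weight (kg)"))  # Body_Weight_kg
print(to_iso_date("4/7/2014"))                # 2014-04-07
print(bmi("70", "1.75"))                      # 22.9
print(bmi(MISSING, "1.75"))                   # NA
```

Because height and weight are stored alongside the derived value, the hypothetical bmi function could later be replaced with a different formula without collecting the data again.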

 

A fictional example of a poorly-structured spreadsheet:

[Image: example of a poorly constructed spreadsheet]

The same spreadsheet, with suggested corrections:

[Image: example of a well-constructed spreadsheet]

Recommendations are based on:
Andrea Horne Denton and Sherry Lake's "Workshop on the Best Practices in Data Collection and Management," presented at the National Network of Libraries of Medicine, Middle Atlantic Region's Symposium: "Doing It Your Way: Approaches to Research Data Management for Libraries" (April 2014)

Data documentation (metadata)

What are metadata? (and what are metadata standards or schemas?) In a nutshell, metadata are documentation about data (or data about data).
The longer definition is: "structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource" (from the National Information Standards Organization's "Understanding Metadata," p. 1).

Metadata standards or schemas provide "common terms, definitions and structure that allow for consistent communication" -- the structure also provides a "reliable and predictable format for computer interpretation" (from DataONE Education Module: Lesson 07: Metadata -- http://www.dataone.org/education-modules).
Why add metadata to your data? Good metadata helps other researchers discover, understand, and use your data.

Learn more:
  • DataONE Education Modules' "Lesson 07: Metadata"
  • Australian National Data Service's "Metadata Guide (Working Level)"
What metadata standards are used in your field or discipline? The Digital Curation Centre maintains a list of disciplinary metadata standards -- if your discipline isn't listed, check the "General Research Data" category.

  • For the social sciences, also look at ICPSR's Best Practices in Creating Metadata.
  • For the biological, natural, or biomedical sciences, also look at FAIRsharing.org.
What's the easiest way to add metadata to your data? Document this information as your research progresses -- this saves you from having to scramble to collect this information at the end of your project!

If your discipline has a preferred metadata standard, you can collect the information in that structured format.

Common metadata elements

Metadata elements commonly required by repositories:

(tables below adapted from Curtin Library's Research data management guide)

General overview
  • Title: Name of the dataset.
  • Creator: Name(s) and contact information for the organization(s) or people who created the data.
  • Identifier: A unique identifying number assigned to the dataset, such as a DOI.
  • Date: Key dates associated with the data, including collection date(s) as well as project start and end dates. Preferred format is the ISO 8601 standard.

Content description
  • Subjects: Keywords or phrases describing the subject or content of the data. If your field of research* does not use a standardized vocabulary or ontology**, consider using the Library of Congress Subject Authority Headings.

    *FAIRsharing.org maintains a list of ontologies/terminologies for the biological, natural, and biomedical sciences. The Royal Society of Chemistry has created several ontologies for the chemical sciences. For geospatial and location data, see the Open Geospatial Consortium (OGC). For other research fields, contact us.
    **What's an ontology? The most famous definition of ontology is from Tom Gruber of Stanford's Knowledge Systems Laboratory.

Access
  • Rights: Any known intellectual property rights, licenses, or restrictions on use of the data.
  • Access information: Where and how your data can be accessed by other researchers.
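
One way to collect the commonly required elements listed above as the project progresses is to keep them in a small structured file next to the data. The sketch below is an assumption-laden illustration in Python, not a repository's actual schema: the key names, the completeness check, and every value (including the DOI and e-mail address) are hypothetical placeholders. Which elements are truly required varies by repository.

```python
import json

# Elements treated as required in this sketch; real repositories differ.
REQUIRED = (
    "title", "creator", "identifier", "date",
    "subjects", "rights", "access_information",
)

# Hypothetical example record; none of these values describe a real dataset.
record = {
    "title": "Example stream temperature survey",
    "creator": [{"name": "Example Researcher",
                 "contact": "researcher@example.edu"}],
    "identifier": "doi:10.0000/example",
    "date": {
        "collected": "2014-04-07/2014-09-30",  # ISO 8601 interval
        "project_start": "2014-01-15",
        "project_end": "2015-01-14",
    },
    "subjects": ["stream temperature", "water quality"],
    "rights": "CC BY 4.0",
    "access_information": "Available on request from the creator",
}


def missing_elements(rec: dict) -> list:
    """Return the required elements that are absent or empty in rec."""
    return [key for key in REQUIRED if not rec.get(key)]


print(missing_elements(record))      # [] when every element is filled in
print(json.dumps(record, indent=2))  # text that can be saved beside the data
```

If your discipline has a preferred metadata standard, the same information would instead be entered using that standard's element names and structure.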

 

The following elements may not be required to deposit your dataset in a repository, but they will help future users interpret your data correctly:

General overview
  • Methods: How the data were generated -- specific observations or measurements recorded; equipment and software used (model and version numbers); formulae, algorithms; experimental protocols; and other things one might include in a lab notebook.
  • Processing: How the data have been altered or processed (e.g. normalized).
  • Abstract: Why the data were created or collected.

Content description
  • Place: All applicable physical locations.
  • Variable and code list: Definitions of all variables in the data files, where applicable. Explanations of codes or abbreviations used in either the file names or the variables in the data files (e.g. 999 indicates a missing value in the data).

Technical description
  • File inventory: All files forming part of the dataset, including hierarchy structure and extensions (e.g. ‘photo1023.jpeg’, ‘participant12.pdf’).
  • File formats: Formats of the data file(s), e.g., SPSS, HTML, PDF, JPEG, etc.
  • Version: Unique date/time stamp and identifier for each version.
  • Necessary software: Names of any special-purpose software packages required to create, view, analyze, or otherwise use the data.
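
The file inventory, file formats, and version elements above can be generated rather than typed by hand. The following Python sketch shows one possible approach under stated assumptions: the function names, the dictionary keys, and the "label plus UTC timestamp" version format are hypothetical choices, not a convention prescribed by this guide or by any repository.

```python
from datetime import datetime, timezone
from pathlib import Path


def file_inventory(root: str) -> list:
    """List every file under root with its relative path, format, and size."""
    base = Path(root)
    rows = []
    for path in sorted(base.rglob("*")):
        if path.is_file():
            rows.append({
                # Relative path preserves the folder hierarchy.
                "path": path.relative_to(base).as_posix(),
                # File extension, lowercased, as a rough format indicator.
                "format": path.suffix.lstrip(".").lower(),
                "bytes": path.stat().st_size,
            })
    return rows


def version_stamp(label: str) -> str:
    """Combine a version label with a UTC date/time stamp in ISO 8601 form."""
    now = datetime.now(timezone.utc).replace(microsecond=0)
    return f"{label}_{now.isoformat()}"
```

Calling file_inventory on the dataset folder returns one entry per file, which can be pasted into a ReadMe or saved with the rest of the documentation; version_stamp("v1") returns a string that begins with "v1_" followed by the current UTC date and time.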