Data

Data are the foundation of research and science. Once an appropriate research topic is determined, proper data collection, retention, and sharing are vital to the research enterprise.

 

Data embrace any collection of facts, measurements, or observations. Different disciplines have different notions of what constitutes data, ranging from material created in a wet laboratory, such as an electrophoresis gel or a DNA sequence, to material obtained in social-science research, such as a filled-out questionnaire, video or audio recordings, or photographs. Data can be astronomical measurements, microscope slides, climate patterns, cell lines, field notes, soil samples, or results of statistical analyses.

There are a number of methodological issues of which researchers should be aware when selecting data. These include choices about:

  • Data types (e.g., nominal, ordinal or interval measures).
  • Samples ("frames"), sample size, and instruments.
  • Methodologies.

Different disciplines have preferences for different approaches, and for what constitutes acceptable "rigor" for reliability and validity of results. This is one reason why a careful prior review of the existing literature on a topic is imperative when designing a research protocol. For example, a key component of most protocol designs will be the sample size (or "n"). From a purely methodological perspective, that decision hinges on how large an error one is willing to tolerate in estimating population parameters or, put differently, on the smallest effect size the study must be able to detect as statistically significant. These choices must be made before data collection begins. Statistical power, however, must be balanced against time, cost, and other practical considerations, just like every other element of the protocol.
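As a rough illustration of how the tolerated error or the target effect size drives the sample size, the sketch below uses two textbook normal-approximation formulas: n = (z·σ/E)² for estimating a mean to within a margin of error E, and n = 2·((z_α/2 + z_β)/d)² per group for comparing two group means with a standardized effect size d. The function names and the example values (σ = 15, E = 3, d = 0.5, 80% power) are assumptions chosen for illustration only. A real protocol should use the method accepted in its own discipline, ideally with a statistician's input.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, margin, confidence=0.95):
    """Observations needed to estimate a mean to within +/- margin.

    Assumes a known (or well-estimated) standard deviation sigma and a
    normal approximation; both are simplifying assumptions.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin) ** 2)


def sample_size_two_groups(effect_size, alpha=0.05, power=0.80):
    """Per-group n for a two-sided comparison of two group means.

    effect_size is the standardized difference (difference / sigma).
    Uses the normal approximation, so it slightly understates the n
    that an exact t-test calculation would give.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


if __name__ == "__main__":
    # Hypothetical planning values, not taken from any real study.
    print(sample_size_for_mean(sigma=15, margin=3))   # 97
    print(sample_size_two_groups(effect_size=0.5))    # 63
```

Halving the tolerated margin of error, or halving the effect size to be detected, roughly quadruples the required sample. This is why the trade-off against time and cost has to be settled before collection starts.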

Data collection methods vary by discipline and by the data types of interest, but the emphasis on ensuring accurate and honest collection remains the same. Consequences of improperly collected data include:

  • Inability to answer research questions accurately.
  • Inability to repeat and validate the study.
  • Distorted findings resulting in wasted resources.
  • Misleading other researchers to pursue fruitless avenues of investigation.
  • Compromising decisions for public policy or private decision-making.
  • Causing harm to human participants and animal subjects.

As with data selection, it is critical that researchers have sufficient methodological skills to ensure the quality of data collection efforts. Everyone who participates in the investigative effort should be trained in the methods. Where possible, researchers should build checks and balances into the collection process.

In information security, it is conventional to speak of three core goals for information protection:

  • Confidentiality - limiting information access and disclosure to authorized users.
  • Integrity - ensuring that data are not changed inappropriately after recording, whether by accident or deliberately. Integrity also covers whether the person or entity in question entered the right information: that is, whether the information reflects the actual circumstances ("validity") and whether the same circumstances would generate identical data (what statisticians call "reliability").
  • Availability - ensuring that information resources remain accessible to authorized users. Everyday risks such as fire, water, or other environmental damage, and simple technical failures such as hard disk crashes, must be considered. It is essential to make frequent, periodic backup copies of a data collection and to store them in a secure secondary location that is protected from both intruders and environmental threats.
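
One common way to support both the integrity and the availability goals is to record a cryptographic checksum when a backup copy is made, then recompute it later to confirm that the copy has not changed. The sketch below is a minimal illustration using Python's standard library. The function names and the choice of SHA-256 are assumptions, not part of any institutional procedure, and a checksum only detects changes; it does not prevent them or restrict access.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_with_checksum(source, backup_dir):
    """Copy a file into backup_dir and verify the copy against the original.

    Returns the path of the copy and the digest to record alongside it.
    """
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    copy_path = backup_dir / source.name
    shutil.copy2(source, copy_path)  # copies contents and file metadata
    recorded = sha256_of_file(source)
    if sha256_of_file(copy_path) != recorded:
        raise OSError(f"Backup of {source} does not match the original")
    return copy_path, recorded


def is_unchanged(path, recorded_digest):
    """Check a stored file against the digest recorded at backup time."""
    return sha256_of_file(path) == recorded_digest
```

In practice the recorded digest would be kept separately from the backup itself (for example, in the research notebook), so that a later call to `is_unchanged` can reveal accidental corruption or deliberate alteration.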

UTA Guidance regarding information security and data can be found here: https://www.uta.edu/security/encryption/fulldiskencryption/index.php

Read more about The Practice of Keeping Research Notebooks: Paper vs. Electronic.

Data handling procedures should describe who may handle data, when, and how, for storage, retrieval, sharing, archiving, and disposal purposes. These procedures may depend on the nature of the project, the cost of maintaining the data, research sponsors' requirements, and similar factors.

Retaining data in paper files and on electronic media long past the end of a project can increase the chances of unauthorized access. Disposal of sensitive data requires care and technical expertise to ensure that the information cannot be reconstructed from the storage media. Review UT Arlington's Records Information Management policies here: http://www.uta.edu/ouc/rim/

Like data selection criteria, the choice of statistical analysis methods should always precede data collection. Waiting until later in the research process increases the risk that analytic decisions will be driven by which method produces the most favorable results. Any bias in the collection of the data, or in the selection of the method of analysis, increases the likelihood of drawing a biased inference. Every field of study has developed its accepted practices for data analysis; if an unconventional approach is used, it is crucial to state this clearly, to show how the new and possibly unaccepted method is being used, and to explain how it differs from more traditional methods. Whether statistical or non-statistical methods are used, researchers should be clear - to themselves and to the persons to whom the analyses are presented - about the limitations and possible biases of their methods.
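
The risk of choosing an analysis after seeing which one "works" can be made concrete. If k independent tests are each run at significance level α on data with no real effect, the chance that at least one of them appears significant is 1 - (1 - α)^k. The sketch below computes that figure and the Bonferroni-adjusted per-test threshold α/k. The function names are illustrative, and the independence of the tests is a simplifying assumption: different analyses of the same data are usually correlated, which changes the exact numbers but not the direction of the problem.

```python
def familywise_error_rate(alpha, num_tests):
    """Probability of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level that keeps the overall rate at or below alpha."""
    return alpha / num_tests


if __name__ == "__main__":
    # Trying more analyses on the same null data inflates the false-positive rate.
    for k in (1, 5, 10, 20):
        rate = familywise_error_rate(0.05, k)
        print(f"{k:>2} tests: chance of a spurious 'significant' result = {rate:.3f}")
```

With ten such tests at α = 0.05, the chance of at least one spurious result is about 40%, which is one reason to fix the analysis plan before data collection.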

The practice of ensuring research integrity extends to the stage of documenting and preparing results for publication. Publishing in peer-reviewed journals or presenting at scholarly meetings is the primary mechanism for investigators to disseminate their findings to the research community. This community relies on authors to report the events of a study honestly and accurately. All researchers should be aware of the issues that compromise the integrity of data reporting and publishing:

  • Misrepresentation of data quality, or of the data itself.
  • Analysis of data by several methods to find a significant result.
  • Fabrication or falsification of data.
  • Inadequate evaluation of prior research.
  • Misleading discussion of observations.
  • Reporting conclusions that are not supported.
  • Failure to disclose conflicts of interest.
  • Plagiarism.
  • Unjust attribution of authorship.

Data "ownership" generally refers to both the possession of and responsibility for information. As a legal concept, it embraces the range of rights and obligations with respect to a data collection, including rights and obligations to share. All investigators and research staff should review the institution's policies with respect to data ownership, to make sure their understanding matches the institution's. If a specific third-party sponsor is involved, the sponsor or granting agency may set out the terms of copyright.

Review UT Arlington's policies here: http://www.uta.edu/research/administration/departments/tm/index.php

Data and data books collected by undergraduate students, graduate students, and postdoctoral fellows on a research project generally belong to the grantee institution, or the PI under conditions described above. In any case, students should not assume that they will be permitted to take "their data" when they leave. Appropriate arrangements need to be made in advance. If the faculty PI does not raise the issue, the student or fellow must. Usually, arrangements can be made for students and fellows to take copies of the data when they leave.