Data Cleaning

CARL Open Education Working Group

Learning Outcomes

By the end of this chapter, learners will:

Understand and explain the importance of data cleaning in the data analysis process
Determine the impact of missing data on their analyses
Develop strategies for data cleaning

Introduction

Data cleaning is the process of modifying or correcting the data you have collected so that it is ready for analysis. This process can involve:

Determining the impact of missing data
Filling in incomplete data
Standardizing data where there are inconsistencies
Splitting data
Restructuring data

Completing this step improves the quality of the data analysis. Inconsistencies in the formatting and spelling of locations or institutions can cause problems when conducting analyses based on location, and any duplicated data can reduce the accuracy of results.

The Data Cleaning Process

Costanzo (2023) explains the steps in the data cleaning process:

Exploring Data – a preliminary analysis to gain a general sense of the data that’s been collected.
Structuring Data – organizing the data in a way that supports the analysis you hope to conduct, and the research questions you seek to answer.
Cleaning Data
Enriching Data – adding any additional data that helps to contextualize what you already have (e.g., if you have data about the institutional affiliation of OER authors, you could add data on the province that the OER was published in).
Validating Data – checking over the data to ensure it is error free.
Publishing Data
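
The enriching and validating steps can be illustrated with a minimal Python sketch. It assumes the data is held as a list of dictionaries; the lookup table, field names, and sample records are hypothetical and exist only for illustration.

```python
# Hypothetical lookup table: institution -> province (illustrative values only).
PROVINCE_BY_INSTITUTION = {
    "MUN": "Newfoundland and Labrador",
    "UPEI": "Prince Edward Island",
}


def enrich_with_province(records, lookup):
    """Return copies of the records with a 'province' field added."""
    enriched = []
    for record in records:
        row = dict(record)  # copy, so the original data is left untouched
        # Unknown institutions get None so validation can catch them later.
        row["province"] = lookup.get(row["institution"])
        enriched.append(row)
    return enriched


def validate(records, required_fields):
    """Return (row index, field) pairs for every required value that is empty."""
    problems = []
    for index, record in enumerate(records):
        for field in required_fields:
            if record.get(field) in (None, ""):
                problems.append((index, field))
    return problems


# Made-up sample data.
oer_records = [
    {"title": "Intro Statistics", "institution": "MUN"},
    {"title": "Open Chemistry", "institution": "UPEI"},
    {"title": "Writing Guide", "institution": "Unknown College"},
]

enriched_records = enrich_with_province(oer_records, PROVINCE_BY_INSTITUTION)
problems = validate(enriched_records, ["title", "institution", "province"])
print(problems)  # [(2, 'province')]
```

Here the third record is flagged because its institution is not in the lookup table, which is the kind of gap the validation step is meant to surface before publishing.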

Missing Data

Part of data cleaning (Costanzo’s Steps 1 and 3) can include examining datasets to determine whether any data is missing. Missing data refers to any values in the dataset that should have been recorded but were not.

Example

Students are surveyed about their experiences buying course materials. Some students might not respond to all questions, such as their academic discipline, which results in missing data.

The impact of missing data depends on the size of the dataset and on any patterns in what is missing. A general rule of thumb is that if fewer than 5% of the data values are missing, the impact will be negligible (Montelpare et al., 2020). Missing data has a smaller impact on larger datasets. However, missing data in a smaller dataset, or missing data that follows a pattern, will affect the analysis.
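
One way to apply the 5% rule of thumb is to compute the share of missing values for each question. The sketch below is a rough illustration, not a prescribed method: the sample survey, the field names, and the set of markers treated as "missing" are all assumptions.

```python
# Assumption: blanks and these markers are how missing answers were recorded.
MISSING_MARKERS = {"", "na", "n/a"}


def is_missing(value):
    """Treat None, blank strings, and the markers above as missing."""
    return value is None or str(value).strip().lower() in MISSING_MARKERS


def missing_share(records, fields):
    """Return the fraction of missing values for each field."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(is_missing(record.get(field)) for record in records) / total
        for field in fields
    }


# Made-up survey responses.
survey = [
    {"year": "1", "discipline": "Biology", "spent": "250"},
    {"year": "2", "discipline": "", "spent": "400"},
    {"year": "1", "discipline": "History", "spent": "120"},
    {"year": "3", "discipline": "N/A", "spent": ""},
]

shares = missing_share(survey, ["year", "discipline", "spent"])
# Flag fields at or above the 5% rule of thumb.
flagged = [field for field, share in shares.items() if share >= 0.05]
print(shares)   # {'year': 0.0, 'discipline': 0.5, 'spent': 0.25}
print(flagged)  # ['discipline', 'spent']
```

With only four responses, a single blank already exceeds the threshold, which reflects the point that missing data matters more in small datasets.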

Example, continued

If students responding to your survey often skipped the question about their academic discipline, there is a pattern in the missing data. It would be best to avoid any analysis that looks at trends by academic discipline (e.g., looking at the prevalence of costly ancillary materials by discipline).

Any patterns or significant amounts of missing data should be taken into consideration when conducting an analysis. You can also report the missing data when writing up your results.
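
A simple way to look for a pattern is to compare how often a question was skipped across groups of respondents. The following sketch assumes string-valued survey responses stored as dictionaries; the field names (`level`, `discipline`) and the sample data are hypothetical.

```python
from collections import defaultdict


def missing_rate_by_group(records, group_field, target_field):
    """Return, for each group, the fraction of records with a blank target field."""
    counts = defaultdict(lambda: [0, 0])  # group -> [missing, total]
    for record in records:
        group = record[group_field]
        counts[group][1] += 1
        # Assumes answers are strings; None or whitespace counts as missing.
        if not str(record.get(target_field) or "").strip():
            counts[group][0] += 1
    return {group: missing / total for group, (missing, total) in counts.items()}


# Made-up survey responses.
survey = [
    {"level": "undergraduate", "discipline": "Biology"},
    {"level": "undergraduate", "discipline": ""},
    {"level": "undergraduate", "discipline": ""},
    {"level": "graduate", "discipline": "History"},
    {"level": "graduate", "discipline": "Physics"},
]

rates = missing_rate_by_group(survey, "level", "discipline")
for group, rate in rates.items():
    print(f"{group}: {rate:.0%} missing")
```

A large gap between groups, as in this invented sample, suggests the missing answers are not random and is worth reporting alongside the results.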

Tips for Cleaning Data

There are several techniques that can be used to clean data, whether qualitative or quantitative:

Check for misspelled words or inconsistent spelling (e.g., colour vs. color)
Standardize names (e.g., Memorial University of Newfoundland, Memorial, MUN -> MUN)
Remove duplicated data
Use find and replace to standardize spelling and letter case
Remove any unnecessary spaces (e.g., a single space at the beginning of a response)
Standardize numbers and signs (e.g., one vs. 1)
Turn qualitative data into numeric data (e.g., replace Yes and No responses with 1 and 2)
Ensure dates and times are formatted consistently (e.g., MM-DD-YYYY)
Merge and split columns (e.g., location data that is City, Province can be split into two columns, one each for City and Province)
Subset data (i.e., take a smaller set of the data that meets a criterion, such as only responses that come from undergraduate students)
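
Several of these techniques can be combined in a short script. The sketch below uses only the Python standard library and is one possible approach, not a required workflow. The column names, the name-variant mapping, the incoming date format (YYYY/MM/DD), and the sample rows are all assumptions made for illustration, and it assumes every location contains a comma.

```python
import re
from datetime import datetime

# Hypothetical mapping of name variants to one standard label.
INSTITUTION_NAMES = {
    "memorial university of newfoundland": "MUN",
    "memorial": "MUN",
    "mun": "MUN",
}
YES_NO_CODES = {"yes": 1, "no": 2}


def tidy_text(value):
    """Trim leading/trailing spaces and collapse repeated internal spaces."""
    return re.sub(r"\s+", " ", value).strip()


def clean_row(row):
    """Return a cleaned copy of one survey row."""
    institution = tidy_text(row["institution"])
    # Split "City, Province" into two columns (assumes a comma is present).
    city, province = [part.strip() for part in row["location"].split(",", 1)]
    # Assumes incoming dates are written as YYYY/MM/DD.
    date = datetime.strptime(row["date"].strip(), "%Y/%m/%d")
    return {
        "institution": INSTITUTION_NAMES.get(institution.lower(), institution),
        "city": city,
        "province": province,
        "bought_textbook": YES_NO_CODES[tidy_text(row["bought_textbook"]).lower()],
        "date": date.strftime("%m-%d-%Y"),  # MM-DD-YYYY
        "level": tidy_text(row["level"]).lower(),
    }


def drop_duplicates(rows):
    """Keep the first occurrence of each identical row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up raw responses with inconsistent spacing, case, and names.
raw_rows = [
    {"institution": " Memorial ", "location": "St. John's, NL",
     "bought_textbook": "Yes", "date": "2024/01/15", "level": "Undergraduate"},
    {"institution": "MUN", "location": "St. John's, NL",
     "bought_textbook": "yes ", "date": "2024/01/15", "level": "undergraduate"},
    {"institution": "Memorial University of Newfoundland", "location": "Corner Brook, NL",
     "bought_textbook": "No", "date": "2024/02/03", "level": "Graduate"},
]

cleaned = drop_duplicates([clean_row(row) for row in raw_rows])
# Subset: keep only responses from undergraduate students.
undergraduates = [row for row in cleaned if row["level"] == "undergraduate"]
print(len(cleaned), len(undergraduates))  # 2 1
```

Note that the first two rows only become recognizable as duplicates after standardization, so removing duplicates is usually more effective once names, spacing, and letter case have been cleaned.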

Tools for Cleaning Data

The previous chapter on Tools and Technologies contains a list of tools that can support the data cleaning process.

Conclusion

The data cleaning process involves several steps, and working through them helps ensure a quality analysis. Missing data might affect which analyses you choose to conduct. Subsetting and structuring data can make the analysis easier by organizing the data around the questions you want to answer. Cleaning data can involve several techniques, and it is important to review your results to ensure accuracy.

Resources

Data Cleaning During the Research Data Management Process

References

Costanzo, L. (2023). Data cleaning during the research data management process. In K. Thompson, E. Hill, E. Carlisle-Johnston, D. Dennie, & E. Fortin (Eds.), Research data management in the Canadian context. Pressbooks. https://ecampusontario.pressbooks.pub/canadardm/

Montelpare, W. J., Read, E., McComber, T., Mahar, A., & Ritchie, K. (2020). Applied statistics in health care research. Pressbooks. https://pressbooks.library.upei.ca/montelpare/chapter/working-with-missing-data/

University of Queensland. (2023). Work with data and files. Pressbooks. https://uq.pressbooks.pub/digital-essentials-data-and-files/chapter/clean-data/

License

Creative Commons Attribution-ShareAlike 4.0 International License