Data Preparation: How to Cleanse Your Data
We know that in the real world, data is more likely to come to us ‘dirty’, but there is some disagreement among practitioners as to how and when data should be ‘cleaned’.
Some people clean before they transform, and others wait until they have loaded it into the database. My preference is to clean the data at each stage of the Extract, Transform, Load (ETL) process – it might seem an inefficient use of your time but I have found this to be the best way to protect yourself against obstacles further on.
Unfortunately, data preparation is always going to be time-consuming, but the more due diligence you take in this stage, the more you will speed up the Data Science Process as a whole.
1. Extract your data
We need to extract data in the first instance for two reasons: to ensure that we are not altering the original source in any way, and because the data that we want to analyse is often stored across a number of different locations.
Some examples of possible locations are:
- a database
- an Excel spreadsheet
- a website
- a .csv file
- paper reports
If we are using data from multiple sources, then we will have to extract it into a single database or warehouse in order for our analyses to work. But it is not always easy to extract data from systems that use their own particular formatting.
Why it’s important to extract data even if it is only in one location
Technically, you could analyse the data directly within its storage facility (the original database, an Excel spreadsheet and so on). While it is not recommended, this method is acceptable for making quick calculations, such as computing the sum of a column of values in Excel.
However, for serious data science projects, carrying out data tasks within its original storage facility is a huge red flag. In doing so, you might accidentally modify the raw data, thereby jeopardizing your work.
And this is the best-case scenario, as it only affects you and your individual project. Working within the storage facility rather than extracting the original data to a test database leaves it vulnerable to user damage, and your work may even end up crashing the internal systems of your institution.
That should give any data scientist pause when they start working with an organization’s data. The organization is entrusting us with important, if not essential, company information, so we must ensure that we leave the data exactly as it was when we started on the project.
Software for extracting data
There are a couple of exceptional free-to-use programs for extracting and reading data that are sure to wean you off any bad habits you might have formed as an Excel user. These programs work well with data that is in a raw .csv file format.5
Although it can take time, data can in most cases be stripped down to a raw .csv file. And if you’re working for a large organization where you have to request data extracts, then good news: the data will most likely be given to you in a .csv format anyway.
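As a minimal sketch of what extraction can look like in practice (assuming Python with the pandas library; the file names are hypothetical placeholders), the idea is simply to read working copies of the data while leaving the original sources untouched:

```python
import pandas as pd

# Read working copies of the data; the original files are only read, never written to.
# File names here are hypothetical placeholders.
orders = pd.read_csv("orders_export.csv")                             # a raw .csv extract
customers = pd.read_excel("customers.xlsx", sheet_name="Customers")   # an Excel spreadsheet

# Quick sanity check of what has been extracted
print(orders.shape, customers.shape)
print(orders.head())
```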
2. Transform your data
You cannot simply dump your data from its original source directly into a data warehouse. Not unless you want to work with a messy dataset. By transforming your data, you can reformat the information you plan to use into a language that will suit your objectives.
In a broad sense, the transformation step includes alterations such as joining, splitting and aggregating data. These are functions that allow us to create derived tables to better suit the problem at hand.
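As a rough illustration of these functions (again assuming pandas; the tables and column names are invented for the example), joining, splitting and aggregating might look like this:

```python
import pandas as pd

# Toy tables standing in for extracted data (all names are hypothetical)
orders = pd.DataFrame({"customer_id": [1, 1, 2], "units": [10, 5, 7]})
customers = pd.DataFrame({"customer_id": [1, 2],
                          "full_name": ["Ada Lovelace", "Alan Turing"]})

# Join: attach customer details to each order via a shared key
combined = orders.merge(customers, on="customer_id", how="left")

# Split: break 'full_name' into separate first and last name columns
combined[["first_name", "last_name"]] = combined["full_name"].str.split(" ", n=1, expand=True)

# Aggregate: a derived table of total units per customer
units_per_customer = combined.groupby("customer_id", as_index=False)["units"].sum()
print(units_per_customer)
```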
But the most important function of transformation is data cleaning – and that’s what we will focus on.
In this step, we must identify and manage any errors in our original database, which in the real world will often run the gamut from formatting inconsistencies, through outliers, to significant gaps in information.
But to do so, we first have to understand what we are looking for. So, how can we identify dirty data?
Dirty data
Dirty data is information that is either incorrect, corrupt or missing. Each of these problems arises in a different way:
Incorrect data
In these instances, information has been (partially or completely) incorrectly added to the database (e.g. inputting a currency value into a date cell).
Sometimes, we will know that data is incorrect. It may be evident when there is a mismatch between columns. For example, if we had a single row, where the country cell was ‘France’ and the city cell was ‘Rome’, we would know that one was incorrect.
We may also be able to identify incorrect data by simply using our common sense – we would know that an entry in a date of birth column that has been given as ‘12/41/2001’ simply cannot be correct.
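A couple of automated checks along these lines might look as follows (a sketch only, assuming pandas; the lookup table and column names are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["France", "Italy"],
    "city": ["Rome", "Rome"],
    "date_of_birth": ["12/41/2001", "03/15/1990"],
})

# Mismatch check: flag rows whose city does not belong to the stated country
# (the lookup table here is a tiny illustrative sample)
city_to_country = {"Rome": "Italy", "Paris": "France"}
df["country_mismatch"] = df["city"].map(city_to_country) != df["country"]

# Common-sense check: dates that cannot be parsed become NaT (not a time)
df["dob_invalid"] = pd.to_datetime(df["date_of_birth"], format="%m/%d/%Y",
                                   errors="coerce").isna()

print(df)
```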
Corrupt data
Corrupt data refers to information that may originally have been correct in the dataset but is now mangled.
Information can become corrupted in different ways. Contributing factors include physical damage to the database it belongs to, alteration by other software, or a previous extraction carried out in inadvisable ways.
Sometimes, data can simply become corrupted due to transfer to a database that does not support the format it had in the previous storage.
Missing data
Missing data occurs either when no information is available for a given cell, or when the person responsible for inserting the data has neglected to add it to the cell.
Missing data is a common topic in data science, and it is most likely to occur because of human error.
Fixing corrupt data
To fix corrupt data so that it can be read by a machine, we can first try the following:
- re-extract it from its original file to see if something has corrupted the file during the first extraction
- talk to the person in charge of the data to see if they can cast light on what the actual data should be
- exclude the rows that contain corrupt data from your analysis.6 (A sketch of this last option follows below.)
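For that last option, a minimal sketch (assuming pandas and a hypothetical 'units' column) of excluding rows whose values cannot even be parsed:

```python
import pandas as pd

df = pd.DataFrame({"units": ["10", "7", "##CORRUPT##", "12"]})

# Values that cannot be parsed as numbers become NaN; those rows are then dropped
df["units"] = pd.to_numeric(df["units"], errors="coerce")
clean = df.dropna(subset=["units"])
print(clean)
```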
Fixing missing data
If we cannot resolve our problem by using any one of these methods, then we must consider our data as missing. There are various methods for resolving the problem of missing fields in spreadsheets:
- Predict the missing data with 100% accuracy. We can do this for information that we can derive from other data. For example, say we have a spreadsheet of customer location data with columns for both ‘State’ and ‘City’. If the State entry is missing but the City entry is ‘Salt Lake City’, then we can be certain that the state is ‘Utah’. It is also possible to derive a missing value from more than one other value, for example, to derive a profit value from both revenue and expenses values. Bear in mind that when we input information in both examples, we do so on the assumption that there were no errors in the collection of the data.
- Leave the record as it is. In this scenario, you would simply leave the cell with no data empty. This is most useful when specific fields have no bearing on our analysis and therefore can be left out of our testing, but it can also be used if we are planning to use a method that isn’t significantly affected by missing data (i.e. methods that can work with averaged values) or if we use a software package that can deal appropriately with this lack of information. In cases where you leave the record as it is, I would recommend keeping notes of where your data contains gaps, so that any later anomalies can be accounted for.
- Remove the record entirely. Sometimes, the data that is missing would have been critical to our analysis. In these instances, our only option is to remove the entire row of data from our analysis, as the missing information means it cannot contribute. Obviously, the major drawback in this case is that our results will become less significant because the sample size has decreased. So this approach is likely to work best with large datasets, where the omission of a single row will not greatly affect the dataset’s statistical significance.
- Replace the missing data with the mean/median value. This is a popular approach for columns that contain numerical information, as it allows us to fill any gaps without tampering too significantly with our dataset. To calculate the mean, we add all of the values together and divide that total by the number of values. To calculate the median, we find the sequential middle value in our data range (if there is an even number of values, just add the two middle numbers and divide that total by two). Using the median rather than the mean is usually preferable, because the median is less affected by outliers, which means that extreme values on either side will not skew our results. (A sketch of this and two of the other approaches follows this list.)
- Fill in by exploring correlations and similarities. This approach is again dependent on your missing data value being numerical, and it requires the use of models to predict what the missing values might have been. For instance, we could use a predictive algorithm to forecast the missing data based on existing similarities among records in your dataset.
- Introduce a dummy variable for missing data. This requires adding a column to our dataset: wherever a value is missing, we mark the new column ‘yes’; wherever it is not missing, we mark it ‘no’. We can then explore how this variable correlates with other values in our analysis, and so retrospectively consider the implications of why this data might be missing.
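To make a few of these options concrete, here is a rough sketch in pandas. The column names are hypothetical apart from the revenue/expenses example above, and the code is illustrative rather than prescriptive:

```python
import pandas as pd

df = pd.DataFrame({
    "revenue":  [100.0, 250.0, 80.0, None],
    "expenses": [60.0, 90.0, None, 40.0],
    "profit":   [40.0, None, None, None],
    "units":    [10, None, 7, 12],
})

# Predict with 100% accuracy: profit can be derived from revenue and expenses
derivable = df["profit"].isna() & df["revenue"].notna() & df["expenses"].notna()
df.loc[derivable, "profit"] = df.loc[derivable, "revenue"] - df.loc[derivable, "expenses"]

# Introduce a dummy variable recording where 'units' was missing, before filling the gaps
df["units_missing"] = df["units"].isna()

# Replace missing numerical values with the median, which is less sensitive to outliers
df["units"] = df["units"].fillna(df["units"].median())

print(df)
```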
Dealing with outliers
Let’s say that we are working for a company selling phone accessories and we want to find the average number of units that we have sold of one of our phone cases to each of our distributors.
We have been in business for years, and so our datasets are large. The person responsible for inputting these values into our database was having a bad day, and instead of inputting the number of product units into the product column, they put the distributor’s telephone number.
That error would abnormally raise our average in this column (and would mean that a single distributor has purchased at least 100 million units!).
If we were to analyze that record on its own, we would probably notice the error. But if we simply calculated the average without looking at the data, our report would be skewed by that outlier – and that would make the report unusable.
Nevertheless, it is important to distinguish between outliers that can be attributed to erroneous information and outliers that are correct but that fall outside the normal range of values. The value for a distributor that really did purchase 100 million units of your product will still be an outlier, as it is far higher than the typical number of units purchased.
Many datasets will have outliers – our job is to understand where they are and to ensure that they do not unfairly skew our reports. This will largely depend on the type of analysis that we want to carry out.
For example, if we were working for a publishing house and wanted to work out the average number of units sold to bookstores around the world, and we knew that the outlier was an exceptional purchase order, we might choose to remove the record even though it’s valid.
It is possible to find outliers in your dataset without searching for them manually, by generating a distribution curve (also known as a bell curve for normal distributions) from your column values.
Distribution curves graphically depict the most probable value or event from your data by way of their apex, and it is simple enough to create them directly, even in Excel. Once you have created your distribution curve, you can identify the values that fall outside the normal range.
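Outside Excel, the same idea can be sketched in Python. The data below is simulated, and the three-standard-deviation cut-off is just one common rule of thumb rather than a fixed rule:

```python
import numpy as np
import pandas as pd

# Simulated unit sales for 200 distributors, plus one record where a telephone
# number was typed into the units column by mistake
rng = np.random.default_rng(0)
units_sold = pd.Series(np.append(rng.normal(110, 15, size=200).round(), 100_000_000))

# Flag values more than three standard deviations from the mean
# (only sensible with a reasonably large sample)
mean, std = units_sold.mean(), units_sold.std()
outliers = units_sold[(units_sold - mean).abs() > 3 * std]
print(outliers)

# A histogram of the column approximates the distribution curve visually:
# units_sold.plot(kind="hist", bins=50)
```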
3. Load your data
Once we have transformed our data into the format we need, we can load our data into our end target: the warehouse.
Once this process is complete, we should manually look through our data one last time before we run it through a machine algorithm, to be absolutely certain that we are not working with underprepared data.
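As a minimal illustration of the load step (using pandas with a local SQLite file standing in for the warehouse; the table and file names are hypothetical):

```python
import sqlite3
import pandas as pd

# A stand-in for the cleaned, transformed dataset
clean_data = pd.DataFrame({"customer_id": [1, 2], "units": [10, 7]})

# Load the data into the target database
with sqlite3.connect("warehouse.db") as conn:
    clean_data.to_sql("unit_sales", conn, if_exists="replace", index=False)

    # Read a few rows back as a first sanity check that the load worked
    print(pd.read_sql("SELECT * FROM unit_sales LIMIT 5", conn))
```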
Quality assurance after the load
Loading the data into a warehouse can sometimes cause problems. You may have missed cleaning up some of the dirty data in the previous stage, or some of the data may have simply been loaded incorrectly.
For that reason, you must learn to double-check your data within the warehouse.
The following are the quality assurance (QA) checks that you should always make at this stage:
- Count the number of rows that you have in your final dataset and compare it to the initial dataset. If the counts differ, return to the initial dataset to find out what happened. Unfortunately, sometimes the quickest way to check is just by looking at it, and this will mean scrolling through the data line by line. The quickest way to do this is to go from the bottom up rather than the top down, because any errors in the data are likely to carry downwards. (A sketch of this and some of the other checks follows this list.)
- Check the columns for skewness. To completely safeguard yourself against problems in the analysis stage, check both the top 100 and the bottom 100 rows.
- Check the columns that are susceptible to corruption. This usually refers to dates and balances, as we have earlier established that they are the most prone to error.
- Check text values. If we have free-form text values from surveys where respondents have typed up answers to an open-ended question, then uploading this kind of text to a database can be tricky. Usually, databases will limit the maximum number of characters in a column. That might cut off an answer, which leaves our data incomplete or can even sometimes affect the rest of the dataset. Free-form text can also contain symbols that databases either cannot recognize or misinterpret because they are qualifier characters, such as quotation marks.
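A couple of these checks translate naturally into code. A rough sketch, continuing the hypothetical SQLite example above (the 255-character limit is an invented illustration, not a universal database setting):

```python
import sqlite3
import pandas as pd

# The dataset as it looked before the load
initial = pd.DataFrame({"customer_id": [1, 2], "units": [10, 7],
                        "comment": ["Great product", "Too expensive"]})

with sqlite3.connect("warehouse.db") as conn:
    initial.to_sql("unit_sales", conn, if_exists="replace", index=False)
    loaded = pd.read_sql("SELECT * FROM unit_sales", conn)

# Row counts should match between the initial and the loaded dataset
assert len(loaded) == len(initial), "Row counts differ - investigate the load"

# Inspect the top and bottom rows for anything that looks off
print(loaded.head(100))
print(loaded.tail(100))

# Free-form text: flag answers that sit right at a column length limit,
# since they may have been truncated
suspect = loaded[loaded["comment"].str.len() >= 255]
print(f"{len(suspect)} possibly truncated text answers")
```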
Think (again) like a consultant
Quality assurance is the most important part of data preparation and, because it comes right at the end of the process, it is easy to lose steam at this stage. Be careful not to.
I was lucky to enter the field of data science through the world of consulting, which places great emphasis on QA. With quality assurance, work is peer-reviewed. The numbers need to add up, and results have to make sense. Don’t be afraid of this stage – it’s not designed to trip you up, it is there to help protect you from making errors later on in the process.
Those companies that have worked with data for a while have set up rigorous, predetermined procedures that data scientists must follow to the letter before any analysis can be carried out.
Some companies will even have consultants to check your process, and expect them to take a lot of time with this step. Delivering an incorrect result will at the very least cost money and, in the worst case, may severely affect business operations. That is why it is so important to ensure that QA is carried out before you move on to the next step.
Now that you have a beautiful warehouse of squeaky-clean data, and you know the question or series of questions that you want to pose to it, you can finally move on to my favourite part: the analysis.
SuperDataScience (2016) SDS 008: data science in computer games, learning to learn and a 40m euro case study, by Ulf Morys [Podcast] 28 October www.superdatascience.com/8 (archived at https://perma.cc/P72H-JP3H)