Basic Data Pre-processing for Machine Learning

Nazmul Ahsan
5 min read · Aug 26, 2019
Photo by Anete Lūsiņa on Unsplash

Traveling is fun. You get to see new places, meet new people, learn about new cultures and their heritage, and what not. If you are a photographer, you get amazing scenery to shoot great pictures too! But travel comes with a boring part: preparing for the journey. I have hardly seen anyone who is excited about packing their bag. But you need to do some careful preparation to ensure that your bag doesn't weigh so much that it becomes a burden to carry, and that you are not short of anything you'll need on your trip.

Just like preparing for a trip, you need to do some data preprocessing before you actually implement your machine learning algorithm. Your data does not always arrive in fully prepared form. You need to tweak it so that the computer can perform the mathematical calculations on it and irrelevant information doesn't crowd your calculation.

So let's see some basic data preprocessing steps for machine learning. I will be using Python for the coding. If you don't have Python installed, you can download and install it from the Anaconda Distribution. Also, I will upload the code to my GitHub repository as a Jupyter notebook. You can access it here.

Getting Started

import pandas as pd

If you are not already familiar with it, pandas is a Python library used to handle data in tabular form. We imported the library and gave it a short name to refer to it by. Later in our code, we will be able to refer to pandas as pd instead of writing out its full name.

The next thing we will do is to import our data in a pandas DataFrame object. There are numerous ways for getting your data. For example, here we are getting the data from a .csv file stored in our hard drive.

# The file is saved in our working directory, so the file name alone is enough.
# If it is saved elsewhere, pass the full path, e.g.
# pd.read_csv('c:/user/desktop/data.csv'), or change the working directory
# to the file's location.
dataset = pd.read_csv('data_preprocessing_tutorial.csv')

I have uploaded the file we are working with here as example in my github repository. Feel free to download it. It is from the “Titanic: Machine Learning from Disaster” competition of Kaggle. I shortened it a little bit for the ease of our learning. If you want the full version, you can get it directly from the Kaggle site here. The data contains some information about the passengers of Titanic and if they survived or not. The challenge is to predict if a passenger survived or not based on his/her information.
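If you don't have the CSV handy, here is a minimal sketch of what comes back from read_csv, using a hypothetical five-row miniature of the Titanic data built in memory (the column names match the real dataset; the rows are made up for illustration):

```python
import pandas as pd

# A hypothetical miniature of the Titanic data, standing in for the CSV
dataset = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5],
    'Survived':    [0, 1, 1, 1, 0],
    'Pclass':      [3, 1, 3, 1, 3],
    'Sex':         ['male', 'female', 'female', 'female', 'male'],
    'Age':         [22.0, 38.0, 26.0, 35.0, None],
    'SibSp':       [1, 1, 0, 1, 0],
    'Embarked':    ['S', 'C', 'S', 'S', 'Q'],
})

print(dataset.head())   # eyeball the first rows and column names
print(dataset.shape)    # (number of rows, number of columns)
```

A quick head() and shape like this is usually the first thing to run after loading any dataset.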

You will want to divide the data into two parts: one will carry all the independent variables, the other will carry the dependent variable(s).

X = dataset.iloc[:, [2,4,5,6]].values
y = dataset.iloc[:, 1].values

If you check the data, column 1 (the passenger ID) is just an identifier and does not affect our dependent variable in any way. So we omitted it and started from column 3. Indexing in Python starts at 0, hence column 3 is indexed 2. And our dependent variable (whether the passenger survived) is in column 2, which is indexed 1. The colon (:) means "take all the rows" — we include the information of every passenger.

Dealing with missing values

You will notice some of the observations have data missing. Check our example data: the missing values are represented as NaN. Calculations will fail when they hit a missing value. You have two options: either you remove that observation, or you enter some value in that missing place, like the mean value. You want to do this before separating the data into X and y as we did in the previous section.

# Dropping rows with a missing Age value
dataset = dataset.dropna(subset=['Age'])
# Or filling missing Age values with the column mean
dataset['Age'] = dataset['Age'].fillna(dataset['Age'].mean())
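Before choosing between dropping and filling, it helps to count how many values are actually missing. A minimal sketch, using a tiny made-up frame in place of the real dataset:

```python
import pandas as pd

# Toy frame with one missing Age, standing in for the real dataset
dataset = pd.DataFrame({'Age': [22.0, None, 26.0],
                        'Sex': ['male', 'female', 'female']})

print(dataset.isnull().sum())   # NaN count per column; Age shows 1

# Fill only the Age column with its own mean
filled = dataset.copy()
filled['Age'] = filled['Age'].fillna(filled['Age'].mean())
print(filled['Age'].tolist())   # [22.0, 24.0, 26.0]
```

If only a handful of rows are missing, dropping is cheap; if many are, filling preserves more of your data.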

Encoding Categorical Data

Machine learning algorithms work with numeric values, so your categorical data needs to be represented in numeric form. For example, you might want to represent Yes or No as 1 or 0, because 'Yes' and 'No' don't make any sense to the mathematical calculations but 1 and 0 do. If you have three categories, they will be represented as 0, 1, 2. More categories, more numbers; you get the idea.
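The idea can be sketched without any library at all: build a mapping from each category to an integer and apply it. (The answers list and mapping here are made up for illustration.)

```python
# Hand-rolled version of integer encoding
answers = ['Yes', 'No', 'No', 'Yes']
mapping = {'No': 0, 'Yes': 1}
encoded = [mapping[a] for a in answers]
print(encoded)  # [1, 0, 0, 1]
```

This is exactly what LabelEncoder, introduced next, automates for us.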

We will be using the LabelEncoder class from the sklearn.preprocessing module for this purpose.

from sklearn.preprocessing import LabelEncoder

We imported the LabelEncoder class. Now we will create an object of this class to encode the second column of X. Let's name the object labelencoder_X_1, since it encodes the column indexed 1.

labelencoder_X_1 = LabelEncoder()

Now we need to fit our label encoder object on the column so that it knows how many categories there are and what number to assign to each, and then transform the column accordingly. Both tasks can be done with a single method, fit_transform.

X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])

For each categorical column, we need to create a separate LabelEncoder object (or fit the same encoder separately on each column). So in our case, we will create another encoder object for the Embarked column.

labelencoder_X_3 = LabelEncoder()
X[:, 3] = labelencoder_X_3.fit_transform(X[:, 3])
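You can inspect which integer each category received through the encoder's classes_ attribute: the position of a category in that array is its code. A small sketch with made-up Embarked values:

```python
from sklearn.preprocessing import LabelEncoder

embarked = ['S', 'C', 'Q', 'S', 'C']   # toy Embarked values
le = LabelEncoder()
codes = le.fit_transform(embarked)

print(list(le.classes_))  # ['C', 'Q', 'S'] — sorted, so C→0, Q→1, S→2
print(list(codes))        # [2, 0, 1, 2, 0]
```

This is why, in the next section, S, C and Q come out as 2, 0 and 1: the encoder assigns codes in sorted (alphabetical) order.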

Dummy Variable

Categories to Dummy variable

In the last section we talked about categorical data. As you can see, the Embarked column had three categories, S, C and Q, which are encoded as 2, 0 and 1 respectively (the encoder codes them alphabetically). Maybe the Embarked column is correlated with our dependent variable, but we don't know how much it contributes; our machine learning algorithm will find that out for us. The problem is that 2 is greater than 1 and 1 is greater than 0, and since ML algorithms work on numerical mathematics, an algorithm might think that S (encoded as 2) has more impact than C (encoded as 0). But we know the numbers don't actually mean that. There is no ordinal relationship between the categories, so we cannot compare them this way. What we need instead is dummy variables.

Dummy variables will convert the column in question into three columns in our case, as we have three categories. Each column represents one category and holds either 1 or 0. So the column for S will have 1 if that passenger embarked at S, and the other columns will have 0. Similarly, if the passenger embarked at C, then the C column will have 1 for that passenger and the S and Q columns will have 0.

We will use the OneHotEncoder class from sklearn.preprocessing to create our dummy variables. (In older scikit-learn versions, OneHotEncoder took a categorical_features argument to select the column; that argument has since been removed, and in current versions you select the column with a ColumnTransformer instead.)

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 3 (Embarked); pass the remaining columns through unchanged.
# The transformed (dummy) columns are placed first in the output.
ct = ColumnTransformer([('embarked', OneHotEncoder(), [3])],
                       remainder='passthrough')
X = ct.fit_transform(X)

One more thing. We need to drop one of the three new dummy columns to avoid the dummy variable trap.

# Excluding the first column to avoid dummy variable trap
X = X[:, 1:]
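As an aside, pandas can do the one-hot encoding and the trap avoidance in a single step with get_dummies and its drop_first parameter. A sketch on made-up Embarked values, not the exact pipeline above:

```python
import pandas as pd

df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
# drop_first=True drops the alphabetically first category ('C')
dummies = pd.get_dummies(df['Embarked'], drop_first=True)
print(list(dummies.columns))  # ['Q', 'S']
```

This is handy when your data lives in a DataFrame anyway; the scikit-learn route is preferable when the encoding must be reapplied to new data later.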

Conclusion

That ends today's discussion. But this does not cover all the preprocessing you may need to do. There is more, depending on the data and what you want to do with it. And that's where you need to apply your creativity and skill.


Nazmul Ahsan

Software engineer at Optimizely. Find me on twitter @AhsanShihab_