Data Preprocessing.

Dhruv Patel
6 min read · May 25, 2021

In this post, I will walk through data preprocessing step by step. It is the first, and one of the most important, steps in the machine learning process. So let's first define what data preprocessing actually is.


What is Data Preprocessing?

In the real world, data comes in many forms: text, images, video, tables, audio, and so on. Usually it lives in large datasets made up of rows and columns. A machine doesn't understand these forms of data directly; ultimately it only understands binary 1's and 0's.

So when we have a large dataset in any of the forms mentioned above, we have to transform or encode it into a numerical representation the machine can work with, i.e., binary 1's and 0's. Data in this form can be easily parsed by the machine.

Now that we have seen what data preprocessing is, let's talk about the steps involved. Note: not every step applies to every dataset; it varies from dataset to dataset. Below are the general steps involved in data preprocessing:

1. Importing the libraries:

In this section I will discuss how to import the rich Python libraries used for data preprocessing. The three core Python libraries used for data preprocessing in machine learning are NumPy, Pandas, and Matplotlib.

NumPy: The fundamental package for scientific computing with Python. It provides powerful N-dimensional arrays and numerical computing tools for many different operations.

Pandas: A fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool. It is generally used for importing and managing datasets, and it provides high-performance, easy-to-use data structures and data analysis tools.

Matplotlib: Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, web application servers, and various graphical user interface toolkits.
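If you are following along, a minimal sketch of the conventional imports used throughout the rest of this post looks like this (the np, pd, and plt aliases are just the usual community conventions):

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt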

2. Importing the Dataset:

The first step in building a machine learning model is to import the dataset you have gathered for your project. For this, set the directory containing your dataset as the working directory:

dataset = pd.read_csv('Data.csv')

or you can pass the full path of your dataset in your Python code (note the raw string prefix r, since Windows paths contain backslashes):

dataset = pd.read_csv(r'C:\Users\my files\Downloads\data\Data.csv')

After importing, our data is a regular table with the independent columns Country, Age, and Salary, plus a final column that we want to predict (the dependent variable).

Now, to extract the independent and dependent variables we will use pandas' iloc[:, :] indexer. The first colon (:) selects the rows; the second colon selects the columns.

The code below extracts all rows of every column except the last one (Country, Age, and Salary) and stores the result in the variable X.

X = dataset.iloc[:,:-1].values

The code below extracts all rows of the last column and stores it in the variable y.

y = dataset.iloc[:,-1]
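To make the row/column selection more concrete, here is a tiny illustrative use of iloc (the slice bounds below are arbitrary and only meant to show the row-selector/column-selector pattern; Age and Salary are assumed to be the second and third columns, as in our example dataset):

# first five rows, every column except the last one
subset = dataset.iloc[0:5, :-1]

# all rows, only the Age and Salary columns (positions 1 and 2)
ages_and_salaries = dataset.iloc[:, 1:3].values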

3. Identifying and Handling missing values:

Generally, we can't afford to have missing values in the dataset we use to train our machine learning model. In our dataset we can see that there are two rows with missing values, and there are two common methods to handle them.

  1. Deleting that particular row: This method works when we have a large dataset and only about 1% of the data is missing, because deleting 1% of the data will not affect the learning quality of the model (a small pandas sketch for this is shown after the imputer code below).
  2. Calculating the mean: This method works when more than 1% of the data is missing, so we must handle it properly. Here, we replace each missing value with the mean of all the values in the column in which the data is missing. This is the classic way to handle missing data, and for it we will use the sklearn library.

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

imputer.fit(X[:,1:3])

X[:,1:3] = imputer.transform(X[:,1:3])

After running the above code, we can see that every missing value is replaced by the mean of its particular column.
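For the first method mentioned above (deleting rows with missing values), a minimal pandas sketch would be the following; note that this should be done before extracting X and y:

# drop every row that contains at least one missing value
dataset = dataset.dropna()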

4. Encoding the categorical values:

From the dataset we can see that one column contains words/strings. A machine can't compute correlations between words and numbers, so we have to encode these string values as numerical values. As seen in our example dataset, the country column will cause problems, so we must convert it into numerical values. The code is as follows –

Encoding the independent variables

from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

X = np.array(ct.fit_transform(X))

Encoding the dependent variable

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

y = le.fit_transform(y)
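As a quick sanity check on what LabelEncoder does, here is a tiny stand-alone example (the 'No'/'Yes' labels are made up for illustration and are not necessarily the labels in your dataset):

from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()

# each distinct label gets an integer: ['No', 'Yes', 'No', 'Yes'] -> [0 1 0 1]
print(le_demo.fit_transform(['No', 'Yes', 'No', 'Yes']))

# classes_ keeps the original labels that were encoded
print(le_demo.classes_)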

5. Splitting the dataset into a train set and a test set

Every dataset must be divided into some proportion of training and test data. Generally we split it in an 80–20 ratio: 80% training set and 20% test set. The training set is used to train the machine learning model from its inputs and known outputs; in other words, this is the data our model actually learns from. The test set is used to evaluate the performance of our machine learning model. Sometimes we divide the dataset into three groups: train, validation, and test. The validation set is used to tune the model's hyperparameters; the model doesn't learn anything from it.
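A minimal sketch of this split with scikit-learn's train_test_split (test_size=0.2 gives the 80–20 ratio mentioned above; random_state is fixed only so the split is reproducible):

from sklearn.model_selection import train_test_split

# 80% of the rows go to the training set, 20% to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)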

Here a question arises: when should we apply feature scaling? Feature scaling is applied after we split the data into train and test sets. Let's understand what feature scaling is and why we should apply it after splitting the dataset.

Feature scaling: Feature scaling puts all the values on the same scale. We do this because we don't want any one feature to dominate the others, which would cause those other features to be neglected by the model. A feature scaling technique calculates statistics of the feature, such as its mean and standard deviation, in order to perform the scaling.

The reason to apply feature scaling only after splitting the data is that, if we scale the whole dataset, the mean and standard deviation are computed over all the values, including the data that will end up in the test set. We call that information leakage: we are not supposed to work with the test data at training time. If we do, we are no longer giving unseen, brand-new data to our model at test time.

6. Feature scaling

In the section above we discussed what feature scaling is and why it should be applied after splitting the data. In this section I will show how to apply feature scaling to the dataset. There are two techniques to scale the features of a dataset.

1. Standardization: It consists of subtracting the mean of the feature from each value and dividing by the standard deviation, which is the square root of the variance: x_scaled = (x − mean) / standard deviation. Standardization transforms most of the feature's values to lie roughly between −3 and +3. This technique works well practically all the time.

2. Normalization: It consists of subtracting the minimum value of the feature from each value and dividing by the difference between the maximum and minimum values of the feature: x_scaled = (x − min) / (max − min). Normalization transforms all the feature's values to lie between 0 and 1. This technique is recommended when most of your features follow a normal distribution.
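A minimal sketch of standardization with scikit-learn's StandardScaler, applied after the split as discussed above (here I assume the first three columns of X are the one-hot encoded country dummies, which we leave unscaled; for normalization you would use MinMaxScaler in the same way):

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# fit the scaler on the training set only, then reuse it for the test set
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])

X_test[:, 3:] = sc.transform(X_test[:, 3:])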

I hope you liked this post. If you want to see the whole data preprocessing code, it is available on GitHub.
