At the heart of machine learning is data, and your models are only as good as the quality of the data you feed them. This post walks through the steps of cleaning data. Your data needs to go through a few preprocessing steps before it can be used for making predictions.
Steps involved in data preprocessing:
Step 1: Importing the required Libraries
To follow along, you will need to download the dataset used in this post. Every time we build a new model, we need to import NumPy and Pandas. NumPy is a library of mathematical functions used for scientific computing, while Pandas is used to import and manage data sets.
import pandas as pd
import numpy as np
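The code in this post was written against an older scikit-learn release, and a couple of the classes used below have since been renamed or removed, so it is worth checking which version you have installed:
import sklearn
print(sklearn.__version__)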
Step 2: Importing the Dataset
Data sets are usually available in .csv format. A CSV file stores tabular data in plain text, with each line of the file representing one data record. We use the read_csv method of the Pandas library to read a local CSV file as a DataFrame.
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values  # every column except the last one holds the features
y = dataset.iloc[:, 3].values    # the fourth column (index 3) holds the target
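Before handling missing values it helps to see where they are. A quick sanity check (assuming Data.csv has the three feature columns plus one target column implied by the slicing above):
print(dataset.head())          # peek at the first rows
print(dataset.isnull().sum())  # count missing values per column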
Step 3: Handling the Missing Data
The data we collect is rarely complete, and missing values can break most estimators. A common fix is to replace each missing value with the mean of its column, which is what scikit-learn's Imputer class does here.
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
Our object name is imputer. The Imputer class can take parameters like:
missing_values : the placeholder for missing entries, here "NaN"
strategy : how to fill them in, here "mean" (the alternatives are "median" and "most_frequent")
axis : 0 to impute along columns, 1 to impute along rows
imputer = imputer.fit(X[:, 1:3])          # learn the column means on columns 1 and 2
X[:, 1:3] = imputer.transform(X[:, 1:3])  # fill in the missing values
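Note that the Imputer class was removed in scikit-learn 0.22. If the import above fails for you, here is a minimal sketch of the same step with the current SimpleImputer API:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])  # replace missing values with the column mean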
Step 4: Encoding categorical data
But why encoding?
We cannot use values like "Male" and "Female" in the mathematical equations of the model, so we need to encode these variables into numbers. To do this we import the LabelEncoder class from the sklearn.preprocessing library and create an object labelencoder_X of the LabelEncoder class. After that we use the fit_transform method on the categorical features.
After encoding, the numbers in the column look ordered (0 < 1 < 2) even though the categories are not, so we still need to distinguish between the values in the same column. For this we will use the OneHotEncoder class from the sklearn.preprocessing library.
One-Hot Encoding
One-hot encoding transforms a categorical feature into a set of binary columns, one per category, a format that works better with classification and regression algorithms.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
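As with Imputer, the categorical_features argument was removed from OneHotEncoder in newer scikit-learn releases. A minimal sketch of the modern equivalent, assuming column 0 is the only categorical feature:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode column 0, pass the remaining columns through unchanged,
# and force a dense array so the result matches .toarray() above
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])],
                       remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X)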
Step 5: Splitting the Data set into Training set and Test Set
Now we divide our data into two sets: one for training our model, called the training set, and one for testing its performance, called the test set. The split is generally 80/20. To do this we import the train_test_split method from the sklearn.model_selection library.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
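Setting random_state = 0 makes the split reproducible, so you get the same rows in each set on every run. A quick check that the 80/20 split worked (the exact numbers depend on your dataset's size):
print(X_train.shape, X_test.shape)  # the training set should hold about 80% of the rows
print(y_train.shape, y_test.shape)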
Step 6: Feature Scaling
Most machine learning algorithms use the Euclidean distance between two data points in their computations. Because of this, features with high magnitudes will weigh more in the distance calculations than features with low magnitudes. To avoid this, feature standardization (also called Z-score normalization) is used: each value x is replaced by (x - μ) / σ, where μ is the feature's mean and σ its standard deviation. This is done using the StandardScaler class of sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)  # fit the scaler on the training set, then scale it
X_test = sc_X.transform(X_test)        # scale the test set with the training set's mean and std
Note that we fit the scaler on the training set only and reuse its parameters on the test set, so no information from the test set leaks into training.
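To see exactly what StandardScaler computes, here is the same transformation written out by hand with NumPy on a tiny made-up array (the numbers are just for illustration):
data = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 4000.0]])
mu = data.mean(axis=0)        # per-feature mean
sigma = data.std(axis=0)      # per-feature standard deviation
scaled = (data - mu) / sigma  # identical to StandardScaler().fit_transform(data)
print(scaled)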