Predict CO2 Emission using Polynomial Regression
Introduction:
In this blog, we learn how to use scikit-learn to implement polynomial regression on fuel consumption and Carbon dioxide emission of cars data. Then use model to predict unknown value.
It contains the following parts:
- Setup your environment
- Data Preparation
- Exploratory data analysis
- Polynomial Regression Model
- Model Evaluation
Setup your environment
To run the program on your local computer, install the following required libraries, These libraries are
- python 3.8.0
- numpy
- pandas
- matplotlib
- scikit-learn
Data Preparation
Understand the data
We have downloaded a fuel consumption dataset, FuelConsumption.csv
, which contains model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale in Canada. You can find more information about the data, go to Fuel consumption ratings.
- MODELYEAR e.g. 2014
- MAKE e.g. Acura
- MODEL e.g. ILX
- VEHICLE CLASS e.g. SUV
- ENGINE SIZE e.g. 4.7
- CYLINDERS e.g. 6
- TRANSMISSION e.g. A6
- FUEL CONSUMPTION in CITY(L/100 km) e.g. 9.9
- FUEL CONSUMPTION in HWY (L/100 km) e.g. 8.9
- FUEL CONSUMPTION COMB (L/100 km) e.g. 9.2
- CO2 EMISSIONS (g/km) e.g. 182 --> low --> 0
Import the Packages
Create a python file (for example model.py). After installed the required packages and import packages in your python file.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Read the Data
Read the data using Pandas
df = pd.read_csv('FuelConsumptionCo2.csv')
Information of the Data
Information about given data,
df.info()
RangeIndex: 1067 entries, 0 to 1066
Data columns (total 13 columns):
MODELYEAR 1067 non-null int64
MAKE 1067 non-null object
MODEL 1067 non-null object
VEHICLECLASS 1067 non-null object
ENGINESIZE 1067 non-null float64
CYLINDERS 1067 non-null int64
TRANSMISSION 1067 non-null object
FUELTYPE 1067 non-null object
FUELCONSUMPTION_CITY 1067 non-null float64
FUELCONSUMPTION_HWY 1067 non-null float64
FUELCONSUMPTION_COMB 1067 non-null float64
FUELCONSUMPTION_COMB_MPG 1067 non-null int64
CO2EMISSIONS 1067 non-null int64
dtypes: float64(4), int64(4), object(5)
memory usage: 108.5+ KB
Missing Values
Find the missing values of given data,
missing_values = df.isnull().sum()
missing_values[0:13]
ENGINESIZE 0
CYLINDERS 0
FUELCONSUMPTION_CITY 0
FUELCONSUMPTION_HWY 0
FUELCONSUMPTION_COMB 0
FUELCONSUMPTION_COMB_MPG 0
CO2EMISSIONS 0
dtype: int64
Exploratory data analysis
Lets start exploratory data analysis on our data. First plot each of these features vs the Emission, to see how linear is their relation.
ENGINESIZE vs CO2EMISSIONS
CYLINDERS vs CO2EMISSIONS
Polynomial Regression Model
Polynomial regression, where the relationship between the independent variable x and the dependent variable y is modeled as an nth degree polynomial in x.
First split the data, We want to create a model, must have split it into training and testing. The model trained by training dataset and then apply the evaluation of model used by testing dataset.
train_x = np.asanyarray(df[['ENGINESIZE']])
train_y = np.asanyarray(df[['CO2EMISSIONS']])
test_x = np.asanyarray(df[['ENGINESIZE']])
test_y = np.asanyarray(df[['CO2EMISSIONS']])
Define the Model
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
poly = PolynomialFeatures(degree=2)
train_x_poly = poly.fit_transform(train_x)
lr = linear_model.LinearRegression()
lr.fit(train_x_poly, train_y)
Coefficient and Intercept in the simple linear regression, are the parameters of the fit line.
print('Coefficient:',lr.coef_)
print('Intercept:',lr.intercept_)
Out[]:
Coefficient: [[ 0. 50.21277583 -1.47931126]]
Intercept: [107.72954763]
we can plot the fit line over the data,
plt.scatter(df.ENGINESIZE,
df.CO2EMISSIONS,
color='blue')
XX = np.arange(0.0, 10.0, 0.1)
YY = lr.intercept_[0]+ lr.coef_[0][1]*XX+
lr.coef_[0][2]*np.power(XX, 2)
plt.plot(XX,YY)
plt.xlabel('Engine Size')
plt.ylabel('CO2 Emission')
test_x_poly = poly.fit_transform(test_x)
pred = lr.predict(test_x_poly)
from sklearn.metrics import r2_score
print('Mean Absolut Square Error:%.2f'
% np.mean(np.absolute(pred - test_y)))
print("Residual sum of squares (MSE): %.2f"
% np.mean((pred - test_y) ** 2))
print("R2-score: %.2f" % r2_score(pred, test_y) )
Out[]:
Mean Absolut Square Error:23.31
Residual sum of squares (MSE): 936.33
R2-score: 0.70
- Mean absolute error: It is the mean of the absolute value of the errors. This is the easiest of the metrics to understand since it’s just average error.
- Root Mean Squared Error (RMSE): This is the square root of the Mean Square Error.
- R-squared is not error, but is a popular metric for accuracy of your model. It represents how close the data are to the fitted regression line. The higher the R-squared, the better the model fits your data. Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse).
No comments:
Post a Comment