• vinayak sable

Linear Regression With Python

Hello friends! Relationships play an important role in your life; likewise, studying the relationship between two or more variables is an interesting topic for every data enthusiast. In this post we explain linear regression and use Python to solve a real-life problem.

Linear models play a central part in data science and machine learning methods. These models can approximate a large number of metric data structures over their entire range of definition, or at least piecewise.

Linear Models And Regression Analysis

Suppose the output of a process is denoted by a random variable y, called the dependent (or study) variable, which depends on k independent variables denoted by x1, x2, ..., xk. Suppose the behavior of y can be explained by a relationship of the form

$$y = f(x_1, x_2, \ldots, x_k, \beta_1, \beta_2, \ldots, \beta_k) + \epsilon$$

where f is some well-defined function and β1, β2, ..., βk are the parameters that characterize the contribution of x1, x2, ..., xk. The term 𝜖 represents the stochastic nature of the relationship between y and x1, x2, ..., xk. When 𝜖 = 0, the relationship is called a mathematical model; otherwise it is a statistical model. A relationship is called linear if it is linear in the parameters, and nonlinear if it is not.

Regression analysis investigates and models the relationship between dependent and independent variables. The linear regression model with one independent variable is called the simple linear regression model, and it is written as

𝑦 = 𝛽0 + 𝛽1𝑥 + 𝜖


  y is the response or dependent variable

  x is the regressor, predictor, or independent variable

  𝛽0 is the intercept of the regression line

  𝛽1 is the slope of the regression line

  𝜖 is the error term representing noise in the regression model
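To make the roles of the intercept, slope, and error term concrete, here is a minimal sketch that simulates the simple regression model. The values 2.0 and 3.0 for the intercept and slope, and the noise level, are made up for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values, chosen only for this illustration
beta0, beta1 = 2.0, 3.0
x = np.linspace(0, 10, 50)

# With epsilon = 0 every point lies exactly on the line (mathematical model)
y_exact = beta0 + beta1 * x

# With random epsilon the points scatter around the line (statistical model)
eps = rng.normal(loc=0.0, scale=1.0, size=x.size)
y_noisy = beta0 + beta1 * x + eps

# The gap between the two versions is exactly the error term
print(np.max(np.abs(y_noisy - y_exact - eps)))
```

Plotting y_noisy against x would show a cloud of points around the straight line y_exact.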

In general, when the dependent variable y depends linearly on k independent variables x1, x2, ..., xk, the model is

𝑦 = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 +...+ 𝛽𝑘𝑥k +𝜖

This is called the multiple linear regression model because it contains more than one independent variable.

Assumptions of Linear Regression

Linear regression is a parametric model, so it needs to satisfy some assumptions:

  1. The relationship between the independent and dependent variables is linear, and the observations of the dependent variable are independently distributed.

  2. The errors have mean zero and constant variance 𝜎².

  3. The errors are uncorrelated, i.e. the value of one error does not depend on the value of any other error.
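As a rough illustration of how assumptions 2 and 3 might be examined in practice, the sketch below fits a line to synthetic data and inspects the residuals. The data, the split into two halves, and the lag-1 correlation are informal choices made for this example, not formal statistical tests.

```python
import numpy as np

# Hypothetical data: y = 1 + 2x + noise, used only to illustrate the checks
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=500)

# Fit a straight line and compute the residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Constant variance: compare residual variance for small x and large x
order = np.argsort(x)
low, high = residuals[order][:250], residuals[order][250:]
var_ratio = high.var() / low.var()

# Uncorrelated errors: correlation between consecutive residuals
lag1_corr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]

print(var_ratio, lag1_corr)
```

A variance ratio far from 1, or a lag-1 correlation far from 0, would be a hint that the corresponding assumption deserves a closer look.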


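Least-Squares Estimation

Least squares can be used to estimate the regression coefficients. Suppose that n > k observations are available (more observations than regressors), and let yi denote the i-th observed response and xij denote the i-th observation or level of regressor xj. The least-squares function is then given by

$$SS(\beta_0, \beta_1, \ldots, \beta_k) = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \right)^2$$

We minimize the function SS with respect to the coefficients β0, β1, ..., βk. The least-squares estimators of the coefficients must satisfy

$$\frac{\partial SS}{\partial \beta_0} = -2 \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \sum_{j=1}^{k} \hat{\beta}_j x_{ij} \right) = 0$$

$$\frac{\partial SS}{\partial \beta_j} = -2 \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \sum_{l=1}^{k} \hat{\beta}_l x_{il} \right) x_{ij} = 0, \quad j = 1, 2, \ldots, k$$
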
In matrix notation the model is

$$y = X\beta + \epsilon$$

where y is an n × 1 vector of the observations, X is an n × p matrix of the regressor variables (p = k + 1, counting a column of ones for the intercept), β is a p × 1 vector of the regression coefficients, and 𝜖 is an n × 1 vector of random errors or noise.

Solving the least-squares conditions for the coefficients gives the normal equations

$$X'X\hat{\beta} = X'y$$

From this equation, provided X′X is invertible, the estimate of β is

$$\hat{\beta} = (X'X)^{-1}X'y$$
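The normal equations can be solved directly with NumPy. The following is a minimal sketch on synthetic data; the sample size, the "true" coefficients, and the noise level are all assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical problem size: n observations, k regressors
n, k = 100, 3
X_raw = rng.normal(size=(n, k))

# Design matrix with a leading column of ones for the intercept (p = k + 1)
X = np.column_stack([np.ones(n), X_raw])

# Made-up "true" coefficients used to generate the response
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Solve X'X beta = X'y; np.linalg.solve avoids forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```

Using np.linalg.solve rather than np.linalg.inv is a common choice because it is usually more stable numerically, though both correspond to the same formula.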

Applying Linear Regression to the FIFA 19 Complete Player Dataset

  1. First, download the FIFA 19 complete player dataset from Kaggle.

  2. Unzip it in the working directory and start coding.

# Importing libraries
import numpy as np  # for matrix and algebraic manipulation
import pandas as pd  # for dataframe manipulation
import seaborn as sns  # graphical representation
import matplotlib.pyplot as plt

# Read data.csv file for modeling
df = pd.read_csv('data.csv')

# Data cleaning: first drop unnecessary columns
drop_cols = df.columns[25:54]
df = df.drop(drop_cols, axis=1)
df = df.drop(['Unnamed: 0', 'ID', 'Nationality', 'Position', 'Work Rate', 'Name', 'Photo', 'Club', 'Club Logo', 'Flag', 'Jersey Number', 'Joined', 'Special', 'Loaned From', 'Body Type', 'Wage', 'Value', 'Release Clause'], axis=1)

Find the null-value count in every column:

df.isnull().sum()

# Output
Age                          0
Overall                      0
Potential                    0
Preferred Foot              48
International Reputation    48
Weak Foot                   48
Skill Moves                 48
Real Face                   48
Crossing                    48
Finishing                   48
HeadingAccuracy             48
ShortPassing                48
Volleys                     48
Dribbling                   48
Curve                       48
FKAccuracy                  48
LongPassing                 48
BallControl                 48
Acceleration                48
SprintSpeed                 48
Agility                     48
Reactions                   48
Balance                     48
ShotPower                   48
Jumping                     48
Stamina                     48
Strength                    48
LongShots                   48
Aggression                  48
Interceptions               48
Positioning                 48
Vision                      48
Penalties                   48
Composure                   48
Marking                     48
StandingTackle              48
SlidingTackle               48
GKDiving                    48
GKHandling                  48
GKKicking                   48
GKPositioning               48
GKReflexes                  48
dtype: int64

# Drop Null Values
df = df.dropna()
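To see what the null count and dropna() do on a small scale, here is a sketch with a made-up three-column frame. The column names mimic the dataset, but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing values
toy = pd.DataFrame({
    "Age": [21, 25, 30, 28],
    "Crossing": [60.0, np.nan, 75.0, 70.0],
    "Finishing": [55.0, np.nan, 80.0, np.nan],
})

# Number of missing values per column
null_counts = toy.isnull().sum()
print(null_counts)

# dropna() removes every row that has at least one missing value
cleaned = toy.dropna()
print(cleaned.shape)
```

Note that dropna() drops whole rows, so a row with a single missing field is lost entirely; with only 48 incomplete rows in the dataset this costs very little data.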

Convert the nominal variables in the Real Face and Preferred Foot columns to numeric

# Handling categorical variables
df['Real Face']=df['Real Face'].apply(lambda x : 1 if x=='Yes' else 0)
df['Preferred Foot']=df['Preferred Foot'].apply(lambda x : 1 if x=='Right' else 0)
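The same 0/1 encoding can also be written as a vectorized comparison, which is one possible alternative to apply with a lambda. The toy frame below is invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame with the two nominal columns
toy = pd.DataFrame({
    "Real Face": ["Yes", "No", "Yes"],
    "Preferred Foot": ["Right", "Left", "Right"],
})

# Comparison yields booleans; astype(int) turns True/False into 1/0
toy["Real Face"] = (toy["Real Face"] == "Yes").astype(int)
toy["Preferred Foot"] = (toy["Preferred Foot"] == "Right").astype(int)

print(toy)
```

Both versions map 'Yes' and 'Right' to 1 and everything else to 0.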

Take the overall rating as the target variable and split the data into train and test sets

#overall rating as target variable
target = df.Overall
df = df.drop(['Overall'], axis = 1)

#Splitting into test and train
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2)

#Convert categorical variable into dummy/indicator variables.
X_train = pd.get_dummies(X_train)
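X_test = pd.get_dummies(X_test)

# Check the shapes of the test and train splits
print(X_test.shape, X_train.shape)
print(y_test.shape, y_train.shape)

# Output
(3642, 41) (14565, 41)
(3642,) (14565,)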
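One thing to watch when calling get_dummies separately on the train and test sets: if a category appears in only one of them, the two frames end up with different columns. The sketch below shows one possible way to align them with reindex; the toy frames and the column name foot are hypothetical.

```python
import pandas as pd

# Hypothetical toy splits: "Left" appears only in the training data
train = pd.DataFrame({"foot": ["Right", "Left", "Right"], "age": [20, 25, 30]})
test = pd.DataFrame({"foot": ["Right", "Right"], "age": [22, 27]})

train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test)
print(list(test_d.columns))  # the foot_Left column is missing here

# Give the test frame the training columns, filling absent ones with 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_d.columns))
```

When every remaining column is already numeric, get_dummies leaves the frame unchanged and this alignment step makes no difference.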

Fit Linear Regression Model using sklearn

# Modelling linear regression
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

#Find out the R-Square score and root mean squared error
from sklearn.metrics import r2_score, mean_squared_error
print('R2 : '+str(r2_score(y_test, predictions)))
print('RMSE : '+str(np.sqrt(mean_squared_error(y_test, predictions))))

# Output
R2 : 0.9272154465330666
RMSE : 1.8753691700529491
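To connect the sklearn fit with the least-squares formula, here is a small sketch on synthetic data (not the FIFA dataset) that compares intercept_ and coef_ of LinearRegression with the normal-equations solution. The coefficients 4.0, 1.5, and -2.0 are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data with two regressors
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 4.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Fit with sklearn (an intercept is fitted by default)
model = LinearRegression()
model.fit(X, y)

# Same estimate from the normal equations, with a column of ones added
X_design = np.column_stack([np.ones(len(X)), X])
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

print(model.intercept_, model.coef_)
print(beta_hat)
```

The two printed results should agree up to floating-point error, with beta_hat[0] playing the role of the intercept.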

Coefficient Of Determination (R-Square)

It is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It is given by

$$R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSRes}}{\mathrm{SST}}$$

where

  SSR is the sum of squares due to the regressors

  SST is the total sum of squares

  SSRes is the sum of squares due to residuals

From the output above, our model explains about 92.72% of the variation in the target variable.
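As a small worked example, R-square can be computed by hand from the residual and total sums of squares. The four observed and predicted values below are invented for the arithmetic.

```python
import numpy as np

# Hypothetical observed and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# SSRes: squared differences between observed and predicted values
ss_res = np.sum((y_true - y_pred) ** 2)          # 0.25 + 0.25 + 0 + 0.25 = 0.75

# SST: squared differences between observed values and their mean (6.0)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 9 + 1 + 1 + 9 = 20

r2 = 1 - ss_res / ss_tot
print(r2)  # approximately 0.9625
```

This 1 - SSRes/SST form is also what sklearn's r2_score computes.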

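Root Mean Square Error (RMSE)

Root Mean Square Error (RMSE) is the standard deviation of the residuals. The formula is given below,

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$

where yi is the i-th observed value, ŷi is the corresponding prediction, and n is the number of observations.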
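As a quick numerical check of the RMSE definition, the sketch below computes it by hand for four made-up observations, using the same square root of the mean squared error as the np.sqrt(mean_squared_error(...)) call in the post.

```python
import numpy as np

# Hypothetical observed and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Mean of the squared residuals: (0.25 + 0.25 + 0 + 0.25) / 4 = 0.1875
mse = np.mean((y_true - y_pred) ** 2)

# RMSE is the square root of the MSE, in the same units as y
rmse = np.sqrt(mse)
print(rmse)  # approximately 0.433
```

Because RMSE is in the units of the target, the value 1.875 reported above can be read as a typical error of roughly two rating points.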

Visualizing the results

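# Plot predicted vs. actual ratings (a scatter with a fitted line is assumed here)
sns.regplot(x=predictions, y=y_test)
plt.xlabel('Predicted Rating')
plt.ylabel('Overall Rating')
plt.title("Linear Prediction of Player Rating")
plt.show()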


©2020 by AI Katta.