- vinayak sable

# Linear Regression With Python

Hello friends! Just as relationships play an important role in our lives, the study of the relationship between two or more variables is an interesting part of every data enthusiast's work. In this post we explain linear regression and solve a real-life problem using Python.

Linear models play a central role in data science and machine learning methods. These models are able to approximate a large class of metric data structures over their entire range of definition, or at least piecewise.

**Linear Models And Regression Analysis**

Suppose the output of a process is denoted by a random variable y, called the dependent (or study) variable, which depends on k independent variables denoted by x1, x2, ..., xk. Suppose the behaviour of y can be explained by a relationship given by,

**𝑦 = 𝑓(𝑥1, 𝑥2, ..., 𝑥𝑘; 𝛽1, 𝛽2, ..., 𝛽𝑘) + 𝜖**

Where f is some well-defined function and β1, β2, ..., βk are the parameters which characterize the contribution of x1, x2, ..., xk. The term 𝜖 represents the stochastic nature of the relationship between y and x1, x2, ..., xk. When 𝜖 = 0, the relationship is called a mathematical model; otherwise it is a statistical model. A relationship is called **linear** if it is linear in the parameters and **nonlinear** if it is not.
Regression analysis is the investigation and modeling of the relationship between dependent and independent variables. The linear regression model with one independent variable is the **simple regression model**, represented below,

**𝑦 = 𝛽0 + 𝛽1𝑥 + 𝜖**

Where,

y is the response or dependent variable

x is the regressor, predictor, or independent variable

β0 is the intercept of the regression line

𝛽1 is the slope of the regression line

𝜖 is the error term representing noise in the regression model

In general, when the dependent variable y depends linearly on k independent variables x1, x2, ..., xk, the model is given below,

𝑦 = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 +...+ 𝛽𝑘𝑥k +𝜖

This is called the **Multiple Linear Regression Model** because more than one independent variable is in the model.

**Assumption of Linear Regression**

Linear regression is a parametric model, hence it needs to satisfy some assumptions:

The relationship between the independent and dependent variables is linear, and the observations of the dependent variable are independently distributed.

The errors are assumed to have mean zero and variance 𝜎2.

The errors are uncorrelated, i.e. the value of one error does not depend on the value of any other error.
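These error assumptions can be checked empirically once a model is fitted. Below is a minimal sketch with synthetic data (all names and numbers are illustrative, not from the post's dataset): for a least-squares fit with an intercept, the residuals average to zero by construction, and their spread estimates 𝜎.

```python
import numpy as np

# Simulate data from a known simple regression model y = 1 + 2x + error,
# where the errors satisfy the assumptions: mean zero, constant variance.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)

# Least-squares estimates for simple regression: b1 = Sxy / Sxx
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

# With an intercept in the model, residuals sum to zero by construction;
# a pattern in residuals vs fitted values would signal a violated assumption.
print(abs(residuals.mean()) < 1e-9)  # True
print(b0, b1)  # close to the true values 1.0 and 2.0
```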

**THE MODEL PARAMETERS ESTIMATION**

**Least-Squares Estimation**
Least squares can be used to estimate the regression coefficients. Suppose the number of observations n is greater than the number of regressors k, and let yi denote the i-th observed response and xij denote the i-th observation of regressor xj. Then the least-squares function is given by,

**𝑆(𝛽0, 𝛽1, ..., 𝛽𝑘) = Σi=1..n (𝑦𝑖 − 𝛽0 − Σj=1..k 𝛽𝑗𝑥𝑖𝑗)²**

We minimize the function S with respect to the coefficients β0, β1, ..., βk. The least-squares estimators must satisfy,

**∂𝑆/∂𝛽0 = −2 Σi=1..n (𝑦𝑖 − 𝛽0 − Σj=1..k 𝛽𝑗𝑥𝑖𝑗) = 0**

and

**∂𝑆/∂𝛽𝑗 = −2 Σi=1..n (𝑦𝑖 − 𝛽0 − Σj=1..k 𝛽𝑗𝑥𝑖𝑗)𝑥𝑖𝑗 = 0, for 𝑗 = 1, 2, ..., 𝑘**

The model expressed in matrix notation is given below,

**y = Xβ + 𝜖**

where y is an n × 1 vector of the observations, X is an n × p matrix of the regressor variables, β is a p × 1 vector of the regression coefficients, and 𝜖 is an n × 1 vector of random errors or noise.

Solving the equations for the estimated coefficients gives the least-squares normal equations,

**𝑋′𝑋𝛽 = 𝑋′𝑦**

From the above equation, the estimate of **𝛽** is,

**𝛽̂ = (𝑋′𝑋)⁻¹𝑋′𝑦**
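The closed-form estimate can be verified numerically. Below is a minimal sketch (synthetic data, illustrative coefficients) that builds the design matrix X with a column of ones for the intercept and solves the normal equations 𝑋′𝑋𝛽 = 𝑋′𝑦 directly:

```python
import numpy as np

# Synthetic data from a known model y = 2 + 3*x1 - 1.5*x2 + noise
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 - 1.5 * x2 + rng.normal(scale=0.1, size=n)

# n x p design matrix; the column of ones models the intercept beta0
X = np.column_stack([np.ones(n), x1, x2])

# Solve X'X beta = X'y rather than inverting X'X explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [2.0, 3.0, -1.5]
```

Using `np.linalg.solve` on the normal equations is numerically preferable to forming the inverse (𝑋′𝑋)⁻¹ explicitly.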

**Apply linear regression on FIFA 19 complete player dataset**

First download the __FIFA 19 complete player dataset__ from Kaggle, unzip it in the working directory, and start coding.

```
# Importing libraries
import numpy as np # for matrix and algebraic manipulation
import pandas as pd # for dataframe manipulation
import seaborn as sns # for graphical representation
import matplotlib.pyplot as plt
sns.set_style('dark')
# Read the data.csv file for modeling
df = pd.read_csv('data.csv')
# Data cleaning: first drop unnecessary columns
drop_cols = df.columns[25:54]
df = df.drop(drop_cols, axis = 1)
df = df.drop(['Unnamed: 0','ID','Nationality','Position','Work Rate','Name','Photo','Club','Club Logo','Flag','Jersey Number','Joined','Special','Loaned From','Body Type','Wage','Value','Release Clause' ], axis = 1)
df.head()
```

Find the null value counts in every column,

```
df.isna().sum()
# Output
Age 0
Overall 0
Potential 0
Preferred Foot 48
International Reputation 48
Weak Foot 48
Skill Moves 48
Real Face 48
Crossing 48
Finishing 48
HeadingAccuracy 48
ShortPassing 48
Volleys 48
Dribbling 48
Curve 48
FKAccuracy 48
LongPassing 48
BallControl 48
Acceleration 48
SprintSpeed 48
Agility 48
Reactions 48
Balance 48
ShotPower 48
Jumping 48
Stamina 48
Strength 48
LongShots 48
Aggression 48
Interceptions 48
Positioning 48
Vision 48
Penalties 48
Composure 48
Marking 48
StandingTackle 48
SlidingTackle 48
GKDiving 48
GKHandling 48
GKKicking 48
GKPositioning 48
GKReflexes 48
dtype: int64
# Drop Null Values
df = df.dropna()
```

Convert the nominal variables **Real Face** and **Preferred Foot** to numeric,

```
# Handling categorical variables
df['Real Face'] = df['Real Face'].apply(lambda x: 1 if x == 'Yes' else 0)
df['Preferred Foot'] = df['Preferred Foot'].apply(lambda x: 1 if x == 'Right' else 0)
```

Set the overall rating as the target variable and split the data into train and test sets,

```
#overall rating as target variable
target = df.Overall
df = df.drop(['Overall'], axis = 1)
#Splitting into test and train
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2)
#Convert categorical variable into dummy/indicator variables.
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)
print(X_test.shape,X_train.shape)
print(y_test.shape,y_train.shape)
# Output
(3642, 41) (14565, 41)
(3642,) (14565,)
```
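A caution about the split above: applying `pd.get_dummies` separately to the train and test sets can produce mismatched columns if a category appears in only one split. A common safeguard (not in the original code) is `DataFrame.align`; a minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical example: 'Left' foot never appears in the test split,
# so get_dummies produces different columns for train and test.
train = pd.DataFrame({'Preferred Foot': ['Right', 'Left', 'Right']})
test = pd.DataFrame({'Preferred Foot': ['Right', 'Right']})

X_train = pd.get_dummies(train)
X_test = pd.get_dummies(test)

# Reindex the test frame to the train columns, filling missing dummies with 0
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)
print(list(X_train.columns) == list(X_test.columns))  # True
```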

Fit the linear regression model using scikit-learn,

```
#Modelling Linear Regression
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
#Find out the R-Square score and root mean squared error
from sklearn.metrics import r2_score, mean_squared_error
print('R2 : '+str(r2_score(y_test, predictions)))
print('RMSE : '+str(np.sqrt(mean_squared_error(y_test, predictions))))
# Output
R2 : 0.9272154465330666
RMSE : 1.8753691700529491
```

**Coefficient Of Determination (R-Square)**

It is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It is given by,

**𝑅² = 𝑆𝑆𝑅 / 𝑆𝑆𝑇 = 1 − 𝑆𝑆𝑅𝑒𝑠 / 𝑆𝑆𝑇**

Where,

SSR is the sum of squares due to regression, SST is the total sum of squares, and SSRes is the sum of squares due to residuals.

From the above output, our model explains **92.72%** of the variation in the target variable.

**Root Mean Square Error (RMSE)**

Root Mean Square Error (RMSE) is the standard deviation of the residuals. The formula is given below,

**𝑅𝑀𝑆𝐸 = √( Σi=1..n (𝑦𝑖 − 𝑦̂𝑖)² / 𝑛 )**
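As a check, R² and RMSE can be computed directly from their definitions. A minimal sketch with made-up observed and predicted values (not the FIFA model output):

```python
import numpy as np

# Made-up observed ratings and predictions, for illustration only
y_true = np.array([60.0, 65.0, 70.0, 75.0, 80.0])
y_pred = np.array([61.0, 64.0, 71.0, 74.0, 81.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # sum of squares of residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # coefficient of determination
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # root mean square error

print(r2, rmse)  # 0.98 1.0
```

These match what `sklearn.metrics.r2_score` and the square root of `mean_squared_error` return for the same inputs.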

**Visualizing the results**

```
plt.figure(figsize=(20,12))
sns.regplot(x=predictions, y=y_test, scatter_kws={'alpha':0.5,'color':'red'}, line_kws={'color':'green','alpha':0.7})
plt.xlabel('Predicted Rating')
plt.ylabel('Overall Rating')
plt.title("Linear Prediction of Player Rating")
plt.savefig("plot.png",format='png')
plt.show()
```