- vinayak sable

# Decisions with Decision Tree

Updated: Apr 17, 2020

Hello everyone in this post you will learn about Decision Tree in machine learning. This algorithm is mostly based on the tree data structure of computer science and this algorithm not required more mathematics knowledge. This is a greedy algorithm and a top-down recursive tree induction algorithm in machine learning.
First, we learn **what is the tree in computer science?** From Wikipedia, a tree is an abstract data type that simulates a hierarchical tree structure with a root value and subtrees of the children with a parent node represented as the set of link nodes.

**Decision Tree
**The decision tree is a non-parametric supervised learning technique used for generating the tree and the set rules from the given data. It is a classification and regression method in machine learning. The goal of the decision tree is the create a model that predicts the value of the study or target variable by learning simple decision rules inferred from the data feature.
J.R. Quilan a researcher in machine learning developed a decision tree algorithm known as the **Interactive Dichotomizer 3** (ID3). ID3 is based on the entropy(H) or Information Gain metrics after that Quilan present **C4.5** algorithm for decision tree, It is a successor of the ID3. In 1984, C. Stone, R. Olshen, J. Friedman and L. Breiman published the book ** Classification and Regression Trees (CART)** this book explains the generation of the binary decision trees. The structure of the decision tree given below,

Terms in Decision Tree

Root Node -: It represents the entire population or sample. This further divided into two or more homogeneous set.

Decision Node -: It is a sub-node that represent rule for splitting data into further classification.

Terminal / Leaf Node -: It does not split also show predicted target variable is called terminal or leaf node.

**Type of Decision Tree**

There are two types of decision trees. It depends on the target variable,

Binary Variable Decision Tree : A decision tree which has binary target variables, then it is called a binary tree. It is also called

**Classification Tree**.Continues Variable Decision Tree : A decision tree which has continuous target variables, then it is called a continuous tree. It is also called

**Regression Tree**.

**Algorithm**

Input:

**Data partition(S)**: A Set of training tuples and their associated class labels;**List of attribute**: A set of candidate attributes**Attribute selection method**: a procedure to determine the splitting criterion that “best” partitions the data tuples into individual classes. It contains a splitting attribute and either a split-point or splitting subset.

Procedure:

Create a single node as root node.

If all of the tuple in the S are all of the same class C then return that notes as a leaf node named with the class C

List of attribute is empty then return that node as a leaf node named with the majority class in S and stop

Apply attribute selection method on remaining data and attributes for the finding next best-splitting attribute and label that node as a splitting criterion.

There are presently three partitioning scenarios with splitting attribute a. If a splitting attribute has discrete value and it's allowed to multiway splitting then a branch is created for each known value in that splitting attribute b. If the splitting attribute has continuous values then find out the splitting point. Split point is often taken as a midpoint of the two known adjacent values of splitting attribute. And create a binary branch. (< splitting_point and ≥ splitting_point) c. It splitting attribute has discrete values and binary tree must be produced. Here we find of a subset of the features in an attribute using attribute selection method then create two branches one for belonging into that subset and another for not belonging into a subset

If the number of remaining observations is zero then create a leap node with a majority class in observation present in S and stop.

Otherwise, start from step 1 with a remaining attribute in an attribute list

**Attribute selection metrics**.

There are various attribute selection metrics but in this post we have explained only three metrics that are commonly used for building a decision tree. The attribute having the best score for the metrics is select as the splitting attribute for the given Decision Node.

*Notations: *

S - Training set of class labeled tuples

Ci – ith class label in the target variable

Ci, S – Set of tuples of class Ci in S

|S| - No. of tuples in S

|Ci, S| - No. of tuples in Ci,S

**1.** **Information Gain**

Information gain used as an attribute selection metric in ID3. Its amount of information that's gained by knowing the value of the attribute. The attribute which has the highest information gain is selected as the splitting attribute for the present node.

From information theory, information of S is given below,

Where a log function to the base 2 is used. Because the information is encoded in bits. It's also called **entropy** of S.

Suppose attribute A having k distinct values hence attribute can be used to split S into k subsets

Where sj contains those tuples in S that have the outcomes aj of A.

Information for attribute A is,

Where |Sj|/|S| is the weight of the jth partition.

Info(S) w.r.t A is expected information required to classify the tuple from S based on the partitioning by A. Hence gain of the attribute A is given below,

** Gain(A) =Info(S) – Infoa(S)**

** Gain(A) **tells us how much would be gained by branching on A. The attribute A with the highest information gain. Gain(A) is selected as the splitting attribute for the particular node.

**2.** **Gain Ratio**

The information gain measure is biased. When attributes have the many unique values and the corresponding information of s with respect to that attribute is zero. In the C4.5 algorithm uses an extension to information gain is known as the **gain ratio**. This overcomes the bias in information gain. It's simply normalization of information gain. Using a splitting information value define the analogously with info(s) as

Gain ratio is defined as,

** GainRatio(A) = Gain(A)/splitInfoA(S)**

the attribute with maximum gain ratio is selected as the splitting attribute.

**3.** **Gini index**

The Gini index is used in **CART**. It is a measure the impurity of S, set of the training tuples as,

Where Pi is a probability of tuple in S belong to class Ci.

It's used for binary split. For the discrete-valued attribute, the subset which gives the minimum Gini index for that attribute is selected as a splitting subset. For continuous-valued attributes, find out the possible split point. Hence Gini index for attribute A is given below,

The attribute that maximizes the reduction in impurity. In other words it has the minimum Gini index is selected as a splitting attribute.

**Tree Pruning**

A fully grown tree many times may suffer from overfitting the data. The bigger tree is less useful from a practical point of view and the tree pruning method addresses this problem of overfitting the data. Tree pruning is a method of reducing the size of the tree. The pruned tree is smaller and is less complex.

There are mainly two approaches to tree pruning

**1.** **Prepruning**

In a prepruning approach a tree is pruned by the halting its construction only. i.e we stop splitting of a node at a certain stage. We may use criteria like the number of tuples in the node available for splitting should be above some specified number. Other statistical measures can also be used to make a decision of terminology the tree growth

**2.** **Postpruning**

In post pruning, we allow trees to fully grow and then remove the subtree from it. A subtree at a given node is removed and the node is labeled by the class which is of most frequent class. Looking at the percentage of the miss-classified tuples decision is taken to remove subtree.

**Problem solving with python.**

Let's start to create a decision tree with python.
1. First Download __Titanic: Machine Learning from Disaster__ Dataset from Kaggle

2. Unzip in the working directory and starts coding

```
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.metrics import accuracy_score
from IPython.display import Image as PImage
from sklearn.model_selection import GridSearchCV
%matplotlib inline
train = pd.read_csv('train.csv') #data load in computer
train.head()
```

Data Preprocessing In which, find out missing values and covert categorical variable in numerical variable ex. Sex convert into 0 or 1.

```
# count of null values
train.isna().sum()
# Output
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
# Cabin convert into present absent
train["Cabin"] = train["Cabin"].apply(lambda x: 0 if type(x) == float else 1) #for null value 0 and not null value 1
# create new variable Family size and Is_alone
train['family_Size'] = train['SibSp'] + train['Parch'] + 1
train["Is_alone"]=train['family_Size'].apply(lambda x : 1 if x==1 else 0)
# Drop variable contain unrelative information
drop_elements = ['PassengerId', 'Name', 'Ticket', 'SibSp','Parch']
train = train.drop(drop_elements, axis = 1)
```

```
# Missing Value and labelling
train['Age'] = train['Age'].fillna(train['Age'].median())
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])
# label encoder for Sex and Embarked variable
from sklearn.preprocessing import LabelEncoder
train['Sex'] = LabelEncoder().fit_transform(train['Sex'] )
train['Embarked'] = LabelEncoder().fit_transform(train['Embarked'] )
# Correlation in Data
plt.figure(figsize=(14,14))
sns.heatmap(train.astype(float).corr(),linewidths=0.1, cmap=plt.cm.gist_earth, linecolor='white', annot=True,vmax=1.0, square=True,)
plt.savefig("correlation.png",format='png')
plt.show()
```

**Finding the best tree depth with of Cross-Validation**

```
y_train = train['Survived']
x_train = train.drop(['Survived'], axis=1).values
parameters = {'max_depth':range(1,x_train.shape[1]+1)}
clf = GridSearchCV(tree.DecisionTreeClassifier(), parameters) # also use KFold
clf.fit(X=x_train, y=y_train)
tree_model = clf.best_estimator_
print (clf.best_score_, clf.best_params_)
# Output
0.8092031425364759 {'max_depth': 3}
```

We Create a decision tree with a maximum depth 3

```
# fit model
decision_tree = tree.DecisionTreeClassifier(max_depth=3)
decision_tree.fit(x_train, y_train)
# output
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=None, splitter='best')
```

where Sklearn "Gini Index" criteria by default.

**Plot Decision Tree **

```
from sklearn import tree
import pydotplus
from io import StringIO
dotfile = StringIO()
tree.export_graphviz(decision_tree, out_file=dotfile,
impurity = True,
feature_names = list(train.drop(['Survived'], axis=1)),
class_names = ['Died', 'Survived'],
rounded = True,
filled= True)
graph=pydotplus.graph_from_dot_data(dotfile.getvalue())
graph.write_png("dtree.png")
PImage("dtree.png")
```

Accuracy of a given decision tree is

```
acc_decision_tree = round(decision_tree.score(x_train, y_train) * 100, 2)
acc_decision_tree
# Output
82.38
```

**Advantages:**

1. Decision trees require less effort for data preparation during pre-processing.

2. A decision tree does not require normalization and scaling of data.

3. Missing values in the data does not affect building a decision tree.

4. Decision trees do not make any assumption about the spatial distribution and classifier structure.

**Disadvantage:**

1. Due to continuous variables, Decision Tree not fit good.

2. Decision tree often overfit.

3. Decision tree training is relatively expensive as complexity and time taken is more.

4. A small change in the data can cause a large change in the structure of the decision tree causing instability.

5. Decision Tree algorithm is inadequate for applying regression and predicting continuous values.

**Summary:**

The decision tree is a top-down recursive algorithm and supervised learning algorithm. Each non-leaf node is based on the attribute selection measure.

There are two types of decision tree which are based on target variable type (Discrete and Continues)

ID3, C4.5, and CART are examples of such algorithms using different attribute selection measures.

Tree pruning algorithms used to improve accuracy by removing tree branches reflecting noise in the data. Several scalable algorithms, such as Random Forest, have been proposed for scalable tree induction. In future posts we will be writing about Random Forest. And remember, any suggestions, comments or critics are welcome!

Thanks for reading.

References :

Jiawei Han, Micheline Kamber, and Jian Pei: Data Mining Concepts and Techniques

L. A. Breslow and D.W. Aha: Simplifying decision trees: A survey

D. E. Brown, V. Corruble, and C. L. Pittard: A comparison of decision tree classifiers with backpropagation neural networks for multimodal classification problems.