In this article we will discuss: 1. Meaning of Decision Tree 2. Expressiveness of Decision Tree 3. Examples of Inducing Decision Trees 4. Strengths and Weaknesses 5. Practical Uses.

Meaning of Decision Tree:

A decision tree represents a particular way of breaking up a data set into classes or categories. The root of the tree implicitly contains all the data to be classified, while the leaf nodes represent the final classes after categorization. Intermediate nodes, also called decision nodes, represent choice points or tests upon attributes of the data, which serve to further subdivide the data at that node.

Decision tree induction is one of the simplest and yet most successful forms of machine learning. It serves as a good introduction to the area of inductive learning and is easy to implement. We first describe the decision tree as the performance element of the learning agent and how it learns while accepting percepts and then selecting external actions. The important connected terms are also explained.

It is a tree where each node is either:


1. A leaf node—indicates the value of the target attribute (class) of examples or

2. A decision node—specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test.

A decision tree can be used to classify an example by starting at the root of the tree and moving through it until a leaf node is reached, which classifies the instance.

The decision tree takes as input an object or a situation described by a set of attributes and returns a 'decision', the predicted output value for the input. The input attributes can be discrete or continuous. Here, we assume discrete inputs; the output value can be discrete or continuous.


Learning a discrete-valued function is called classification, whereas learning a continuous-valued function is called regression. We will concentrate on Boolean classification, wherein each example is classified as true (positive) or false (negative).

A decision tree reaches its decision by performing a sequence of tests on decision nodes. Each decision node in the tree specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome. Each leaf node in the tree specifies the value to be returned if that leaf is reached.

The decision tree representation seems very natural for humans. A decision tree can be used to classify an example by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance. Decision tree induction is a typical inductive approach to learning knowledge on classification (Fig. 9.2a).
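To make this concrete, here is a minimal Python sketch (illustrative only; the nested-dictionary representation and function name are assumptions of this sketch, not taken from the text) in which a decision node is a dictionary naming the attribute it tests and its branches, a leaf is a plain class label, and classification walks from the root to a leaf:

    def classify(node, example):
        # Start at the root; at each decision node follow the branch matching
        # the example's value for the tested attribute, until a leaf is reached.
        while isinstance(node, dict):                  # decision node
            tested_attribute = node['attribute']
            node = node['branches'][example[tested_attribute]]
        return node                                    # leaf: the class label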

To explain inductive learning through decision trees, let us take a simple example of dining in a restaurant. We define a goal predicate (hypothesis), WillWait (whether to wait for a table in the restaurant).


As a learning example, we need to know the attributes available; let us assume that the following list of attributes is available:

1. Alternate:

Whether there is a suitable alternate restaurant nearby.


2. Bar:

Whether the restaurant has a comfortable bar area to wait in.

3. Sun/ Hol:

Open on Sundays and holidays.


4. Hungry:

Whether we are hungry.

5. Patrons:

How many people are in the restaurant (values are None, Some, Full).

6. Price:

The restaurant’s price range (Rs., Rs. Rs., Rs. Rs. Rs.).

7. Raining:

Whether it is raining outside.

8. Reservation:

Whether we had made a reservation.

9. Type:

The kind of restaurant (Indian, Chinese, Thai, Fast-Food).

10. Wait Estimate:

The wait estimated by the restaurant authorities (0-10, 10-30, 30-60, >60 minutes). One possible decision tree could be as shown in Fig. 9.3.

Here, eight out of the ten attributes have been used; Price and Type, being irrelevant in the present situation, are not used. The tree is traced starting at its root, following the appropriate branch at each decision node till we reach a leaf. The example selected here, with Patrons = Full and WaitEstimate = 10-30, is labelled positive (that is, yes, we will wait for a table) when the estimated wait time is 10-30 min.
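As an illustration only, the following Python fragment encodes a hypothetical slice of such a tree (the full Fig. 9.3 tree is not reproduced here, and the branch labels other than the traced path are placeholders) and traces the example just described:

    # Hypothetical fragment consistent with the path described above; the
    # actual Fig. 9.3 tree contains further attributes and branches.
    fragment = {
        'attribute': 'Patrons',
        'branches': {
            'None': 'No',
            'Some': 'Yes',
            'Full': {
                'attribute': 'WaitEstimate',
                'branches': {'0-10': 'Yes', '10-30': 'Yes',
                             '30-60': 'No', '>60': 'No'},
            },
        },
    }

    example = {'Patrons': 'Full', 'WaitEstimate': '10-30'}
    node = fragment
    while isinstance(node, dict):
        node = node['branches'][example[node['attribute']]]
    print(node)   # 'Yes' -- wait for a table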

Expressiveness of Decision Tree:

The decision tree is really describing a logical relationship between the hypothesis WillWait and some logical combination of attribute values, expressed as:

∀s WillWait(s) ↔ (P₁(s) ∨ P₂(s) ∨ … ∨ Pₙ(s)) … (9.1)

where each condition Pᵢ(s) is a conjunction of tests corresponding to a path from the root of the tree to a leaf with a positive outcome. The above expression (9.1) looks like a first-order logical expression, but actually it is a propositional logical sentence because it contains just one variable and all the predicates are unary.
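For instance (this particular path is illustrative, not read off Fig. 9.3 itself), one such condition might be P₁(s) ≡ Patrons(s, Full) ∧ WaitEstimate(s, 10-30), the conjunction of the attribute tests along a single root-to-leaf path ending in a positive (yes) leaf.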

The decision tree is not applicable for tests which refer to two or more different objects, for example 'is there a cheaper restaurant nearby?', expressed as:

∃r₂ Nearby(r₂, r) ∧ Price(r, p) ∧ Price(r₂, p₂) ∧ Cheaper(p₂, p)

Decision trees are fully expressive within the class of propositional languages, that is, any Boolean function (true or false) can be written as a decision tree. This can be done trivially by having each row in the truth table for the function correspond to a path in the tree. This would yield an exponentially large decision tree, as the truth table has exponentially many rows. In practice, however, decision trees can represent many functions with much smaller trees, although for some kinds of functions this does pose a problem.

For example, if the function is a parity function, which returns 1 if and only if an even number of inputs are 1, then an exponentially large decision tree will be needed. It is also difficult to use a decision tree to represent a majority function, which returns 1 if more than half of its inputs are 1. Hence, decision trees are good for some kinds of functions and bad for others. This is because there are 2ⁿ rows in the truth table for n attributes, so, irrespective of representation, there are some functions which will require 2ⁿ bits to represent.
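As a small illustration (not from the original text; the function and variable names are chosen for this sketch), the following Python snippet enumerates the truth table of the even-parity function for n inputs; the number of rows, and hence the number of root-to-leaf paths a decision tree would need to distinguish, grows as 2ⁿ:

    from itertools import product

    def even_parity(bits):
        # Returns 1 if and only if an even number of inputs are 1.
        return 1 if sum(bits) % 2 == 0 else 0

    n = 4
    rows = [(bits, even_parity(bits)) for bits in product((0, 1), repeat=n)]
    print(len(rows))   # 2**n = 16 rows; every attribute matters on every path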

Examples of Inducing Decision Trees:

An example for a Boolean decision tree consists of a vector of input attributes X and a single Boolean output value Y. A set of examples (X₁, Y₁), (X₂, Y₂), …, (X₁₂, Y₁₂) is shown in Table 9.1.

The positive examples are those in which the goal WillWait is true; the negative examples are those in which it is false. The complete set of examples is called the training set, and the assigning of the label positive or negative to each example is called classification. Finding a decision tree which agrees with the training set might seem difficult, but this is not so.

A decision tree can be constructed which has one path to a leaf for each example, where the path tests each attribute in turn and follows the value for that example, and the leaf has the classification of the example.

Given the earlier examples, such a decision tree will come up with the right classification for those examples, though for other examples it may not hold good. The decision tree just memorises the observations; it does not extract any pattern from the examples. Applying Ockham's razor, we should find the smallest decision tree which is consistent with the examples. Unfortunately, finding this smallest tree is intractable, but with suitable heuristics a tenable tree can be found.

The basic idea behind the DECISION-TREE-LEARNING algorithm is to test the most important attribute first. 'Most important' means the attribute which makes the most difference to the classification of an example. This way we hope to get to the correct classification with a small number of tests, meaning thereby that all paths in the tree will be short and the tree as a whole will be small.

Fig. 9.4 shows how the algorithm gets started. Here we have 12 training examples, which are classified into positive and negative sets. We then decide which attribute to use as the first test in the tree. Fig. 9.4 (b) shows that the attribute Type is a poor one, because for each of its four values it leaves us with a mix of positive and negative examples.

Patrons, shown in Fig. 9.4 (a), is a fairly important attribute, because if the value is None or Some then we are left with example sets for which we can answer definitively (No and Yes respectively). If the value is Full then we have a mixed set of examples. In this way, we consider all possible attributes and choose the most important one as the root test. Let us assume that the most important attribute is Patrons.

In general, after the first attribute test splits up the examples, each outcome is a new decision tree learning problem in itself, with fewer examples and one fewer attribute.

There are four cases to consider for these recursive problems:

1. If there are some positive and some negative examples, then choose the best attribute to split them. Fig. 9.4 (c) shows Hungry being used to split the remaining examples in the Full branch of Patrons.

2. If all the remaining examples are positive (or all negative) we can answer Yes or No. Fig. 9.4 (a) shows examples of this in the None and Some branches of Patrons.

3. If there are no examples left, it means that no such example has been observed and we return a default value calculated from the majority classification at the node’s parent.

4. If there are no attributes left, but both positive and negative examples, we have a problem. It means that these examples have exactly the same description but different classifications. This happens when some of the data are incorrect, that is, there is 'noise' in the data. It also happens when the attributes do not give enough information to fully describe the situation, or when the domain is truly non-deterministic. One simple way out of the problem is to use a majority vote.

The DECISION-TREE-LEARNING algorithm shown below continues to be applied until we get the tree shown in Fig. 9.5.

Algorithm:

function DECISION-TREE-LEARNING(examples, attributes, default) returns a decision tree
  inputs: examples, set of examples
          attributes, set of attributes
          default, default value for the goal predicate
  if examples is empty then return default
  else if all examples have the same classification then return the classification
  else if attributes is empty then return MAJORITY-VALUE(examples)
  else
      best ← CHOOSE-ATTRIBUTE(attributes, examples)
      tree ← a new decision tree with root test best
      m ← MAJORITY-VALUE(examples)
      for each value vᵢ of best do
          examplesᵢ ← {elements of examples with best = vᵢ}
          sub-tree ← DECISION-TREE-LEARNING(examplesᵢ, attributes − best, m)
          add a branch to tree with label vᵢ and sub-tree sub-tree
      return tree
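The following Python sketch mirrors the pseudocode above; it is an illustrative rendering, not the author's code. The nested-dictionary tree shape, the target label key, the values table (mapping each attribute to its possible values) and the placeholder choose_first heuristic are all assumptions of this sketch; a better attribute-selection measure (information gain) is sketched later, under "Which attribute is the best classifier?".

    from collections import Counter

    def majority_value(examples, target):
        # MAJORITY-VALUE: the most common classification among the examples.
        return Counter(e[target] for e in examples).most_common(1)[0][0]

    def choose_first(attributes, examples, target):
        # Placeholder for CHOOSE-ATTRIBUTE; see the information-gain sketch later.
        return attributes[0]

    def decision_tree_learning(examples, attributes, default, target, values,
                               choose_attribute=choose_first):
        # Follows the pseudocode: trees are nested dicts, leaves are labels.
        if not examples:
            return default
        classifications = {e[target] for e in examples}
        if len(classifications) == 1:
            return classifications.pop()              # all examples agree
        if not attributes:
            return majority_value(examples, target)
        best = choose_attribute(attributes, examples, target)
        m = majority_value(examples, target)
        tree = {'attribute': best, 'branches': {}}
        for v in values[best]:                        # each value of best
            subset = [e for e in examples if e[best] == v]
            remaining = [a for a in attributes if a != best]
            tree['branches'][v] = decision_tree_learning(
                subset, remaining, m, target, values, choose_attribute)
        return tree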

The two decision trees in Fig. 9.3 and Fig. 9.5 are distinctly different. This does not mean that the learning algorithm is doing a bad job. It looks at the examples, not at the correct function, and in fact its hypothesis (Fig. 9.5) not only agrees with all the examples but is simpler than the original tree.

The learning algorithm need not include tests for the Raining and Reservation attributes, because it can classify all the examples without them. Fig. 9.5 has also detected an interesting pattern in the data: we should wait for fast food on Sundays/holidays, which was not present in the tree drawn in Fig. 9.3.

If we were to gather more examples, we might induce a tree more similar to the original. The tree in Fig. 9.5 is also bound to make mistakes; for example, it has never seen a case where the wait is 0-10 min but the restaurant is full. In such a case, when the attribute Hungry is false the tree says not to wait, yet one would certainly wait for such a small estimated waiting time as 0-10 min.

This raises an obvious question: if the algorithm induces a consistent but incorrect tree from the examples, how incorrect will the tree be?

Performance of the Decision Tree Learning Algorithm:

A learning algorithm is good if it produces hypotheses which do a good job of predicting the classifications of unseen examples.

The standard methodology for assessing its performance is the following:

1. Collect a large set of examples.

2. Divide it into two disjoint sets: the training set and test set.

3. Use the learning algorithm with the training set as examples to generate a hypothesis H.

4. Measure the percentage of examples in the test set which are correctly classified by H.

5. Repeat steps 1 through 4 for different sizes of training sets and different randomly selected training sets of each size.

The result of this is a set of data which can be processed to give the average prediction quality as a function of the size of the training set. The key idea of this methodology is to keep the training set and the test set separate, for the same reason that the result of a test would not be a good measure of intelligence if the students had seen the question paper beforehand.
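A minimal sketch of this methodology follows; it is illustrative only, and assumes any learner and predictor with the shapes of the sketches above (learn maps a training set to a hypothesis, predict maps a hypothesis and an example to a label), a 'WillWait' goal attribute, and arbitrarily chosen split ratio and trial count:

    import random

    def assess(examples, learn, predict, target='WillWait',
               train_fraction=0.7, trials=20):
        # Steps 2-5: split into disjoint training and test sets, learn a
        # hypothesis H from the training set only, and measure the fraction
        # of test examples H classifies correctly; repeat and average.
        scores = []
        for _ in range(trials):
            shuffled = examples[:]
            random.shuffle(shuffled)
            cut = int(len(shuffled) * train_fraction)
            training, test = shuffled[:cut], shuffled[cut:]
            hypothesis = learn(training)
            correct = sum(predict(hypothesis, e) == e[target] for e in test)
            scores.append(correct / len(test))
        # Varying train_fraction over several values gives the learning curve.
        return sum(scores) / len(scores)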

Which attribute is the best classifier?

The scheme used in decision tree learning for selecting attributes is designed to minimise the depth of the final tree. The idea is to pick the attribute which goes as far as possible towards providing an exact classification of the examples. A perfect attribute divides the examples into sets which are all positive or all negative.

The attribute Patrons is not perfect, though fairly good, whereas the attribute Type is a useless one, as it leaves the example sets with roughly the same proportion of positive and negative examples as the original set.

We need a formal measure of 'fairly good' and 'really useless' in the spirit of the function CHOOSE-ATTRIBUTE of the decision tree algorithm.

The measure should have its maximum value when the attribute is perfect and its minimum value when the attribute is of no use at all. The detailed analysis, using the theory of statistics, is involved and is not given here. It has been observed that when there are two or more examples with the same description but different classifications, the decision tree learning algorithm must fail to find a decision tree consistent with all the examples.
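Returning to the measure itself: one common concrete choice, not derived in this article but used by ID3 (mentioned later under practical uses), is information gain based on entropy. The sketch below is illustrative and assumes examples represented as dictionaries with a 'WillWait' goal attribute, as in the earlier sketches:

    from math import log2

    def entropy(examples, target='WillWait'):
        # Entropy of the class distribution over the examples, in bits.
        total = len(examples)
        counts = {}
        for e in examples:
            counts[e[target]] = counts.get(e[target], 0) + 1
        return -sum((c / total) * log2(c / total) for c in counts.values())

    def information_gain(attribute, examples, target='WillWait'):
        # Expected reduction in entropy from splitting on the attribute:
        # maximal for a perfect attribute, zero for a useless one.
        total = len(examples)
        remainder = 0.0
        for v in {e[attribute] for e in examples}:
            subset = [e for e in examples if e[attribute] == v]
            remainder += (len(subset) / total) * entropy(subset, target)
        return entropy(examples, target) - remainder

    def choose_attribute(attributes, examples, target='WillWait'):
        # CHOOSE-ATTRIBUTE: pick the attribute with the highest information gain.
        return max(attributes, key=lambda a: information_gain(a, examples, target))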

When such duplicate descriptions occur, the solution is to have each node report either the majority classification for its set of examples or the estimated probabilities of each classification, using the relative frequencies. The former is appropriate for an agent which requires the decision tree to represent a strict logical function, whereas the latter can be used by a decision-theoretic agent.

It may also happen that the decision-tree algorithm will find a decision tree which is consistent with all the examples even when vital information is missing. This is because the algorithm can use the irrelevant attributes, which are quite often present, to make spurious distinctions among the examples.

As long as there are no two examples with identical descriptions, DECISION-TREE-LEARNING will find an exact hypothesis. Whenever there is a large set of possible hypotheses, we have to be careful not to use that freedom to find meaningless 'regularity' in the data. This problem is called overfitting.

Overfitting is a significant practical difficulty for decision tree learning and many other learning methods. There are several approaches to avoid overfitting in decision tree learning.

They are:

1. Approaches which stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data.

2. Approaches which allow the tree to overfit the data and then post-prune the tree.

Both of the above approaches are used in practice, and in both the key question is what criterion should be used to determine the correct final tree size.

Approaches include:

1. Use a separate set of examples, distinct from the training examples, to evaluate the utility of post-pruning nodes from the tree.

2. Use all the available data for training, but apply a statistical test to estimate whether expanding or pruning a particular node is likely to produce an improvement beyond the training set.

3. Use an explicit measure of the complexity of encoding the training examples and the decision tree, halting growth of the tree when this encoding size is minimised. This approach is based on a heuristic called the Minimum Description Length principle.

In principle, the decision-tree algorithm can grow each branch of the tree just deeply enough to perfectly classify the training examples. While this is sometimes a reasonable strategy, it can lead to difficulties when there is noise in the data or when the number of training examples is too small; in either case the algorithm can produce trees which overfit the training examples.
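As an illustration of the first criterion listed above (a separate validation set), the following sketch shows a simplified variant of reduced-error post-pruning. It is not the author's method; it assumes the nested-dictionary tree shape and 'WillWait' goal attribute used in the earlier sketches, and it prunes greedily, bottom-up, using the validation examples that reach each node:

    from collections import Counter

    def predict(tree, example):
        # Leaves are plain labels; internal nodes are dicts with a test attribute.
        while isinstance(tree, dict):
            tree = tree['branches'][example[tree['attribute']]]
        return tree

    def accuracy(tree, examples, target='WillWait'):
        return sum(predict(tree, e) == e[target] for e in examples) / len(examples)

    def reduced_error_prune(tree, validation, target='WillWait'):
        # At each internal node, compare the subtree with a single leaf holding
        # the majority class of the validation examples reaching that node, and
        # keep whichever classifies those examples at least as well.
        if not isinstance(tree, dict) or not validation:
            return tree
        attr = tree['attribute']
        pruned = {'attribute': attr, 'branches': {}}
        for value, subtree in tree['branches'].items():
            reaching = [e for e in validation if e[attr] == value]
            pruned['branches'][value] = reduced_error_prune(subtree, reaching, target)
        majority = Counter(e[target] for e in validation).most_common(1)[0][0]
        if accuracy(majority, validation, target) >= accuracy(pruned, validation, target):
            return majority                    # collapse the subtree into a leaf
        return pruned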

Strength and Weakness of Decision Tree Methods:

Strengths of Decision Tree Methods:

1. They are able to generate understandable rules.

2. They perform classification without requiring much computation.

3. They are able to handle both continuous and discrete variables.

4. They provide a clear indication of which fields are most important for prediction or classification.

Weaknesses of Decision Tree Methods:

1. They are less appropriate for estimation tasks where the goal is to predict the value of a continuous variable.

2. They are prone to errors in classification problems with many classes and a relatively small number of training examples.

3. They can be computationally expensive to train. The process of growing a decision tree is computationally expensive. At each node, each candidate splitting field must be sorted before its best split can be found.

4. They do not handle non-rectangular regions well. Most decision-tree algorithms only examine a single field at a time. This leads to rectangular boxes which may not correspond well with the actual distribution of records in the decision space.

Practical Uses of Decision Tree Learning:

Decision trees provide a simple representation for propositional logic knowledge which can be used for decision making and classification of objects. The ID3 program, written by Quinlan (1986), created a decision tree out of positive and negative examples of a concept (an economy car from a specific country, Japan).

ID3 is one such system which employs decision trees to learn object classifications (cars from Japan) from labelled training instances. For obtaining a decision tree of smaller size (with fewer nodes), ID3 measures the information gain of attributes and expands the nodes of the tree based on the order of the attributes, following the descending sequence of their information gain.

Although decision tree learning cannot generate interesting scientific theories because of its representational restrictions, it has been used in a wide variety of applications; a couple of them are:

1. Designing Oil Platform Equipment:

Michie (1986) applied decision tree learning to an expert system, GASOIL, for designing gas-oil separation systems for offshore oil platforms. Gas-oil separation is done at the well-head by a very large, complex and expensive separation system, whose design depends on a number of attributes including the relative proportions of gas, oil and water, and the flow rate, pressure, density, viscosity, temperature and susceptibility to waxing.

At that time GASOIL was the largest commercial expert system in the world, containing approximately 2,500 rules. Building such an expert system by hand would have taken roughly 10 person-years; using decision-tree learning it was developed in 100 person-days.

2. Learning to Fly:

For very complex systems, such as aircraft, the development of a model may be infeasible. Sammut et al. (1992) applied decision tree learning to the task of learning to fly a Cessna on a flight simulator. The data was generated by watching three skilled human pilots performing an assigned flight plan 30 times each. Each time a pilot took an action by setting one of the control variables, such as thrust or flaps, a training example was created.

In all, 90,000 examples were obtained, each described by 20 state variables and labelled by the action taken. From these examples a decision tree was extracted, which was converted into C code and inserted into the flight simulator's control loop so that it could fly the plane itself.

The results were surprising: not only did the program learn to fly, but it learned to fly better than its creators.

This is because the generalisation process cleans up the occasional mistakes made by humans. Such results suggest that machine learning techniques may yield controllers which are more robust than conventional, manually programmed autopilots. For difficult tasks, such as flying helicopters carrying heavy loads in high winds, no autopilots are available and very few humans are competent. Such tasks are potentially suitable for systems based on automated learning.