The following points highlight the two main types of machine learning. The types are: 1. Inductive Learning 2. Ensemble Learning.

Type # 1. Inductive Learning:

This type of learning can be understood with the help of an example pair (x, f(x)) where x is the input and f(x) is the output of the function applied to x. Both x and f(x) are real numbers.

The task of pure inductive inference (or induction) is the following:

Given a collection of examples of f, inference returns a function h which approximates f.


The function h is called a hypothesis. The reason that this type of learning is difficult, from a conceptual point of view, is that it is not easy to tell at the very outset whether any particular h is a good approximation of f. A good hypothesis will generalise well, that is, will predict unseen examples correctly. In formal terms, how do we know that the hypothesis h is close to the target function f? This is a fundamental problem of inductive learning.

Issues in Inductive Learning:

1. How do we choose one hypothesis from among multiple consistent hypotheses (those which agree with all the data)?

One answer is Ockham's Razor (named after the 14th-century English philosopher William of Ockham), stated below:


Prefer the simplest hypothesis consistent with the data. Intuitively this makes sense, because hypotheses which are no simpler than the data themselves fail to extract any pattern from the data.
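The razor can be illustrated with a toy sketch (the candidate hypotheses, data and complexity ordering below are illustrative assumptions, not from the text): order the candidates from simplest to most complex and return the first one consistent with every example pair (x, f(x)).

```python
# Toy inductive inference with Ockham's razor: scan candidate hypotheses
# from simplest to most complex and keep the first one consistent with
# all example pairs (x, f(x)). Candidates and data are illustrative.
candidates = [
    ("h0: constant 3", lambda x: 3),
    ("h1: linear 2x+1", lambda x: 2 * x + 1),
    ("h2: quadratic x*x+2", lambda x: x * x + 2),
]  # ordered simplest first
examples = [(0, 1), (1, 3), (2, 5)]  # pairs (x, f(x)) with f(x) = 2x + 1

def consistent(h, data):
    """True if hypothesis h agrees with every example."""
    return all(h(x) == y for x, y in data)

# Ockham's razor: prefer the simplest consistent hypothesis.
chosen = next(name for name, h in candidates if consistent(h, examples))
print(chosen)  # → h1: linear 2x+1
```

Note that h0 is rejected only because it disagrees with the data; had the examples all been (x, 3), the razor would have stopped at the constant hypothesis.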

2. The possibility of finding a simple consistent hypothesis depends upon the hypothesis space chosen (the set of hypotheses under consideration). We say that a learning problem is realisable if the hypothesis space contains the true function; otherwise it is unrealisable. Unfortunately, we cannot tell whether a given learning problem is realisable, because the true function is not known.

One way to get around this bottleneck is to use prior knowledge to derive a hypothesis space in which we know the true function must lie.

3. Yet another approach is to use the largest possible hypothesis space. For example, why not let H be the class of all Turing machines? After all, every computable function can be represented by some Turing machine.


The problem with this idea is that it does not take into account the computational complexity of learning. There is a trade-off between the expressiveness of a hypothesis space and the complexity of finding a simple, consistent hypothesis within that space.

For these reasons, most work on learning has focused on relatively simple representations, such as propositional logic or first-order logic. The expressiveness-complexity trade-off, however, is not as simple as it looks on paper.

It has been found quite often that an expressive language makes it possible for a simple theory to fit the data, whereas restricting the expressiveness of the language means that any consistent theory must be very complex.

For example, the rules of chess can be written in a page or two of first-order logic but require thousands of pages when written in propositional logic. In such cases, it should be possible to learn much faster by using the more expressive language.


The performance element is responsible for accepting percepts and then selecting external actions. What if the decision tree acts as the performance element?

Type # 2. Ensemble Learning:

In the inductive learning considered so far, a single hypothesis, chosen from a hypothesis space, is used to make predictions. When a whole collection, or ensemble, of hypotheses from the hypothesis space is used to make predictions (ensemble learning), how is the prediction of the goal achieved?

Suppose we generate a hundred different decision trees from the same training set and have them vote on the best classification for a new example.

Such a choice should obviously be preferred for two reasons:


a. Consider that we have an ensemble of five hypotheses from the hypothesis space. In order for the ensemble to misclassify a given example, at least three of the five must misclassify it, which is far less likely than a single hypothesis misclassifying the example.

b. The other reason comes from the generic idea of enlarging the hypothesis space: think of the ensemble itself as a hypothesis, and of the new hypothesis space as the set of all possible ensembles constructible from the hypotheses in the original space.
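Reason (a) can be checked numerically. Assuming, purely for illustration, that each of the five hypotheses errs independently with probability p, the majority vote errs only when at least three of them err (independence is an idealisation; real hypotheses trained on the same data are correlated):

```python
from math import comb

# Probability that a majority vote of m hypotheses misclassifies an
# example, assuming each errs independently with probability p (an
# idealising assumption; real hypotheses are correlated).
def majority_error(p, m=5):
    need = m // 2 + 1          # at least 3 of 5 must err
    return sum(comb(m, k) * p**k * (1 - p)**(m - k)
               for k in range(need, m + 1))

# A single hypothesis errs with probability 0.1; the ensemble far less.
print(round(majority_error(0.1), 5))  # → 0.00856
```

Even with this crude model, the five-member ensemble's error rate drops by more than an order of magnitude compared with a single hypothesis.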

The most widely used ensemble method is called boosting. It works on the concept of a weighted training set, in which each example has an associated weight wi ≥ 0. The higher the weight of an example, the higher the importance attached to it during the learning of a hypothesis. Boosting starts with wi = 1 for all examples in the training set. From this the first hypothesis h1 is generated.

This hypothesis will classify some of the training examples correctly and some incorrectly. In order for the next hypothesis to fare better on the misclassified examples, their weights are increased. From this new weighted training set the hypothesis h2 is generated. The process continues until M hypotheses have been generated (M is an input to the boosting algorithm).

The final ensemble hypothesis is a weighted-majority combination of the M hypotheses, each weighted according to how well it performed on the training set. The process is explained conceptually in Fig. 9.6. Each rectangle corresponds to an example; the height of the rectangle corresponds to its weight. The ticks and crosses indicate whether the example is classified correctly by the current hypothesis in the final ensemble.
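The weight-update loop described above can be sketched roughly as follows. This is an AdaBoost-style sketch; the toy data, the decision-stump weak learner, and the particular weight and α formulas are illustrative assumptions, not prescribed by the text:

```python
import math

# Toy labelled data: x in 0..7, positive iff x > 3.
examples = [(x, 1 if x > 3 else -1) for x in range(8)]
weights = [1.0] * len(examples)          # boosting starts with w_i = 1

def best_stump(data, w):
    """Weak learner: the threshold stump with the lowest weighted error."""
    best = None
    for t in range(9):
        for sign in (1, -1):
            err = sum(wi for (x, y), wi in zip(data, w)
                      if (sign if x > t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

hypotheses = []
M = 3                                    # number of rounds (input to boosting)
for _ in range(M):
    err, t, sign = best_stump(examples, weights)
    err /= sum(weights)                  # weighted error rate
    alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))  # hypothesis weight
    hypotheses.append((alpha, t, sign))
    for i, (x, y) in enumerate(examples):  # raise weights of misclassified
        if (sign if x > t else -sign) != y:
            weights[i] *= math.exp(2 * alpha)

def ensemble_predict(x):
    """Weighted-majority combination of the M stumps."""
    s = sum(a * (sg if x > t else -sg) for a, t, sg in hypotheses)
    return 1 if s > 0 else -1

print([ensemble_predict(x) for x in range(8)])  # → [-1, -1, -1, -1, 1, 1, 1, 1]
```

On this trivially separable data the first stump is already perfect, so the loop mainly illustrates the mechanics: weighted error, per-hypothesis weight α, and the increase of misclassified examples' weights between rounds.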

It has been observed that the predictions improve as the ensemble hypothesis gets more complex.

Now we review the premise with which we started, the example pair (x, f(x)): how do we know that the hypothesis h is close to the target function f if we do not know f? The answer to this puzzle is based on computational learning theory, a field at the intersection of AI, statistics and theoretical computer science.

The principle is: any hypothesis which is seriously wrong will almost certainly be found out with high probability after a small number of examples. Thus, any hypothesis which is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong, that is, it must be Probably Approximately Correct (PAC). Any learning algorithm which returns hypotheses that are probably approximately correct is called a PAC-learning algorithm.

This leads to one important criterion on the connection between the training and the test examples: the hypothesis must be approximately correct on the test set, not just on the training set. The key assumption is that the training and test sets are drawn randomly and independently from the same population of examples with the same probability distribution. This is called the stationarity assumption. It is important because without it, learning cannot make predictions about unseen examples at all.

How many examples are needed?

Let:

a. X be the set of all possible examples.

b. D be the distribution from which examples are drawn.

c. H be the set of possible hypotheses.

d. N be the number of examples in the training set.

From an analysis (not reproduced here) it turns out that

N ≥ (1/∈)(ln(1/δ) + ln│H│)

that is, if a learning algorithm returns a hypothesis which is consistent with this number of examples N, then with probability at least 1 − δ it has error at most ∈; in other words, the hypothesis is probably approximately correct. The number of required examples, as a function of ∈ and δ, is called the sample complexity of the hypothesis space.
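A back-of-the-envelope sketch of the standard PAC sample-complexity bound, N ≥ (1/∈)(ln(1/δ) + ln│H│), with illustrative values of ∈, δ and │H│:

```python
from math import ceil, log

# Sample complexity N >= (1/eps) * (ln(1/delta) + ln|H|): the number of
# examples needed so that, with probability >= 1 - delta, a consistent
# hypothesis has error <= eps. The values below are illustrative.
def sample_complexity(eps, delta, h_size):
    return ceil((log(1 / delta) + log(h_size)) / eps)

print(sample_complexity(0.1, 0.05, 2 ** 10))  # → 100
```

Note how the bound depends only logarithmically on │H│ and 1/δ but linearly on 1/∈: halving the tolerated error doubles the number of examples required.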

So the size of the hypothesis space is of key importance. If H is the set of Boolean functions on n attributes, then │H│ = 2^(2^n), so the sample complexity of the space grows as 2^n. Because the number of possible examples is also 2^n, this says that any learning algorithm for the space of all Boolean functions will do no better than a look-up table if it merely returns a hypothesis which is consistent with all known examples.
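The doubly-exponential growth of this hypothesis space is easy to tabulate (a small illustrative sketch):

```python
# All Boolean functions on n attributes: 2**n input rows, each mapped to
# 0 or 1, hence 2**(2**n) distinct functions; so ln|H| grows as 2**n.
sizes = [(n, 2 ** n, 2 ** (2 ** n)) for n in range(1, 5)]
for n, inputs, functions in sizes:
    print(f"n={n}: {inputs} possible examples, {functions} Boolean functions")
```

Already at n = 4 there are 65,536 Boolean functions over only 16 possible examples, which is why the unrestricted space cannot beat a look-up table.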

Another way of looking at this is to observe that, for any unseen example, the hypothesis space contains as many consistent hypotheses which predict a positive outcome as hypotheses which predict a negative one.

So the space of functions needs to be restricted; but in that case the true function may be eliminated. A way out of this dilemma is to focus on learnable subsets of the space of Boolean functions, using more restricted languages. This is what is done in the case of the Decision List.