In this article we will discuss reasoning with uncertain knowledge under the following heads:- 1. Introduction to Probabilistic Reasoning 2. Syntax of Probability Used for Probabilistic Reasoning 3. Conditional Probability 4. Axioms of Probability 5. Limitations.

Introduction to Probabilistic Reasoning:

Now consider the knowledge required to diagnose a dental disease or to repair an automobile. Such tasks involve knowledge that is never certain, and First Order Logic (FOL) fails in such cases.

This can be shown by trying to write the rules in FOL:

Consider the rule:

∀p Symptoms(p, Toothache) → Disease(p, Cavity)

That is, if a patient has a toothache he has a cavity.

This rule is wrong because not all patients with a toothache have cavities; there may be other causes of toothache, such as gum disease, an abscess or some other problem:

∀p Symptoms(p, Toothache) → Disease(p, Cavity) ∨ Disease(p, Gum Disease) ∨ Disease(p, Abscess)

To make the rule true we have to add an almost unlimited list of possible causes.

The rule can instead be written as a causal rule:

∀p Disease(p, Cavity) → Symptoms(p, Toothache)

This rule is also not true; not all cavities cause pain.

To fix the rule, it has to be made logically exhaustive, that is, the left-hand side must be augmented with everything required for a cavity to actually cause a toothache. Even then, for the purpose of diagnosis, we must allow for the possibility that the patient has a cavity yet feels no pain.

This shows that the use of first order logic in a domain like medical diagnosis fails for three main reasons:

1. Laziness:

Too many antecedents or consequents need to be added to make a rule exceptionless, and such rules become too hard to use.

2. Theoretical Ignorance:

Medical science has no complete theory for the domain.

3. Practical Ignorance:

Even if a doctor knows all the rules, he might be uncertain about a particular patient because not all the necessary tests have been, or can be, performed on the patient.

The connection between toothache and cavity enumerated above may or may not hold in real life. This illustration from the domain of medicine is, however, a good example of uncertain knowledge. In other domains such as law, business, Indian marriage or even politics, the knowledge required for problem solving can at best be expressed as a degree of belief in the relevant propositions. Degrees of belief are handled by probability theory, which assigns to each proposition a numerical degree of belief between 0 and 1.

Probability provides a way of summarising the uncertainty which comes from laziness and ignorance. This can be illustrated by a statement such as: there is an 80% chance, or a probability of 0.8, that the patient has a cavity if he has a toothache.

This belief could be derived from statistical data, that 80% of the toothache patients seen by the doctor have had cavities, or from general rules or other evidence. The missing 20% summarises all the other possible causes of toothache which we are too lazy or too ignorant to confirm or deny.

Assigning probability 0 to a given statement corresponds to an unequivocal belief that the statement is false, while assigning a probability of 1 corresponds to an unequivocal belief that the statement is true. Probabilities between 0 and 1 correspond to intermediate degrees of belief in the truth of the sentence.

The sentence itself may, in fact, be true or false, since degree of belief is different from degree of truth. A probability of 0.8 does not mean 80% true, but an 80% degree of belief. Thus probability theory makes the same ontological commitment as logic: sentences either do or do not hold in the world.

In logic, a sentence such as “The patient has a cavity” is true or false depending on the interpretation and the world; it is true when the fact it refers to actually exists. In probability theory, a sentence such as “The probability that the patient has a cavity is 0.8” is about the problem solver’s beliefs, not directly about the world.

Such probability assertions are based on the agent’s experiences, which constitute the evidence. For example, suppose we have drawn a card from a pack. Before looking at the card, we may assign a probability of 1/52 to the proposition that it is the ace of spades. After looking at the card, the appropriate probability for the same proposition would be either 0 or 1.

Thus, assigning a probability to a proposition is analogous to saying whether a given logical sentence (or its negation) is entailed by the knowledge base, rather than whether it is true or not. As more evidence is added to the knowledge base, the probability can change.

Syntax of Probability Used for Probabilistic Reasoning:

Any notation for describing degrees of belief must be able to deal with two main issues- the nature of the statements to which degrees of belief are assigned and the dependence of the degree of belief on the agent’s (system’s) experience.

This version of probability theory uses an extension of propositional logic for its sentences. The dependence on experience is reflected in the syntactic distinction between prior probability statements, which apply before any evidence is obtained, and conditional probability statements, which include the evidence explicitly.

We recapitulate that propositions are assertions which may be true or false. Degrees of belief are applied to propositions; we confine ourselves to elementary propositions and generate complex propositions from them using the standard logical connectives.

For example,

Elementary propositions such as

cavity = true, cavity = false (abbreviated ¬cavity) and toothache = false

can be combined to form complex propositions using all standard logical connectives.

For example,

cavity = true ∧ toothache = false

or cavity ∧ ¬toothache

is a proposition to which we may ascribe a degree of (dis)belief.

Atomic Events:

The concept of atomic events is useful for probability theory. An atomic event is a complete specification of the state of the world. It can be thought of as an assignment of particular values to all the variables of which the world is composed.

For example, if the world consists of only the Boolean variables Cavity and Toothache, then there are just four distinct atomic events:

cavity = true ∧ toothache = true

cavity = true ∧ toothache = false

cavity = false ∧ toothache = true

cavity = false ∧ toothache = false

and the proposition cavity = false ∧ toothache = true is one such atomic event.

Atomic events have some important properties:

1. They are mutually exclusive: at most one can be true. For example, cavity ∧ toothache and cavity ∧ ¬toothache cannot both be true.

2. The set of all possible atomic events is exhaustive: at least one must be true. That is, the disjunction of all atomic events is logically equivalent to true.

3. Any particular atomic event entails the truth or falsehood of every proposition, whether simple or complex. For example, the atomic event cavity ∧ ¬toothache entails the truth of cavity and the falsehood of cavity → toothache.

4. Any proposition is logically equivalent to the disjunction of all atomic events which entail its truth. For example, the proposition cavity is equivalent to the disjunction of the atomic events cavity ∧ toothache and cavity ∧ ¬toothache. A small sketch illustrating these properties follows.
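The following is a minimal sketch (the variable names are chosen here for illustration) which enumerates the atomic events of the two-variable world above:

```python
# A minimal sketch: enumerate the atomic events of a world built from
# Boolean variables. Each atomic event assigns a value to every variable.
from itertools import product

variables = ["cavity", "toothache"]

atomic_events = [dict(zip(variables, values))
                 for values in product([True, False], repeat=len(variables))]

print(len(atomic_events))  # 4 distinct atomic events
# Mutual exclusivity: two distinct atomic events disagree on at least one
# variable, so at most one of them describes the actual world.
# Exhaustiveness: the actual world must match exactly one of the 4 events.
```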

All probability statements must indicate the evidence with respect to which the probability is being assessed. As the agent receives new percepts, its probability assessments are updated to reflect the new evidence. Before the evidence is obtained, the probability is called prior or unconditional probability; after the evidence is obtained, it is called posterior or conditional probability.

The prior probability or unconditional probability associated with a proposition is the degree of belief accorded to it in the absence of any other information; it is written as P(a).

For example, if the prior probability that a patient has a cavity is 0.1 it is written as:

P(Cavity = true) = 0.1 or P(Cavity) = 0.1

P(a) can be used only when no other information is available.

As soon as some new information becomes known, the relevant probability is the conditional probability.

When we want to talk about all the possible values of a random variable such as Weather, we write P(Weather).

This denotes a vector of values for the probabilities of each individual state of the weather.

Thus the four values are shown below:

P(Weather = sunny) = 0.7

P(Weather = rain) = 0.2

P(Weather = cloudy) = 0.08

P(Weather = snow) = 0.02

and may be written as one equation:

P(Weather) = < 0.7, 0.2, 0.08, 0.02 >

This statement defines a prior probability distribution for the random variable, Weather.
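As a small illustration, the same distribution can be sketched as a simple mapping (the rendering is an assumption made here; the four numbers are those given above):

```python
# Sketch: the prior distribution P(Weather) as a mapping from values to
# probabilities, using the four numbers given above.
P_weather = {"sunny": 0.7, "rain": 0.2, "cloudy": 0.08, "snow": 0.02}

assert abs(sum(P_weather.values()) - 1.0) < 1e-9  # a distribution sums to 1
```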

The probabilities of all combinations of the values of a set of random variables are represented by P(Weather, Cavity) and can be expressed by a 4 × 2 table of probabilities. This is called the joint probability distribution of Weather and Cavity.

If the complete world consists of the three variables Cavity, Toothache and Weather, then the full joint distribution is P(Cavity, Toothache, Weather).

Probability distributions for continuous variables are called Probability Density Functions.

Conditional Probability in Probabilistic Reasoning:

Once one variable in a domain becomes known and we want to determine the probability of another, we have a case of conditional probability. For any two propositions a and b, P(a | b) is read as the probability of a given that all we know is b. For example,

P(cavity | toothache) = 0.8

indicates that if a patient is observed to have a toothache and no other information is yet available, then the probability of the patient having a cavity is 0.8. The prior probability P(cavity) can be thought of as a special case of the conditional probability, where the probability is conditioned on no evidence. In practice, conditional probability is mostly used.

Conditional probabilities can be defined in terms of unconditional probabilities by the equation:

P(a | b) = P(a ∧ b)/P(b), whenever P(b) > 0 … (7.1)
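As a small sketch of equation 7.1 (the function name is chosen here for clarity; the numbers are those of the toothache computation worked out later in the article):

```python
# Sketch of equation 7.1: conditional probability from unconditional ones.
def conditional(p_a_and_b: float, p_b: float) -> float:
    """Return P(a | b) = P(a ∧ b) / P(b); defined only when P(b) > 0."""
    if p_b <= 0:
        raise ValueError("P(b) must be positive")
    return p_a_and_b / p_b

# 0.12 and 0.2 come from the full joint distribution computation below.
print(conditional(0.12, 0.2))  # 0.6
```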

Axioms of Probability Used for Probabilistic Reasoning:

The following three axioms are important in probability:

1. All probabilities are between 0 and 1: for any proposition a, 0 ≤ P(a) ≤ 1.

2. Valid (necessarily true) propositions have probability 1 and unsatisfiable (necessarily false) propositions have probability 0:

P(true) = 1   P(false) = 0

3. Probability of a disjunction is

P(a ∨ b) = P(a) + P(b) − P(a ∧ b)

These three axioms are often called Kolmogorov’s axioms in honour of the Russian mathematician Andrei Kolmogorov, who showed how to build up the rest of probability theory from them. These axioms deal with prior probabilities, but with the help of eq. 7.1 they can be extended to conditional probabilities.

Use of Axioms:

From axioms 2 and 3, P(a ∨ ¬a) = P(a) + P(¬a) − P(a ∧ ¬a), that is, P(true) = P(a) + P(¬a) − P(false), so P(a) + P(¬a) = 1. That is, any probability distribution on a single variable must sum to 1.

From axiom 3 it follows that the probability of a proposition is equal to the sum of the probabilities of the atomic events in which it holds:

P(a) = Σ_{ei ∈ e(a)} P(ei) … (7.2)

This is because:

Any proposition a is equivalent to the disjunction of all the atomic events in which a holds; call this set e(a). Since the atomic events are mutually exclusive, the probability of any conjunction of two distinct atomic events is zero (axiom 2). Equation 7.2 then follows from axiom 3.

This equation provides a simple method for computing the probability of any proposition, given a full joint distribution which specifies the probabilities of all atomic events.

We now turn to the use of these axioms in deriving inferences.

Inference Based on Probability:

We use the full joint distribution of the world of Cavity, Catch and Toothache as the “knowledge base” from which answers to all questions may be derived.

Let us continue with the example of toothache due to the presence of a cavity and a catch (say, the dentist’s broken needle has got stuck in the patient’s tooth). The full joint distribution is a 2 × 2 × 2 table, shown in Table 7.2.
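The entries of Table 7.2, as used in the computations below, are as follows (the two entries for ¬cavity ∧ ¬toothache, 0.144 and 0.576, do not appear in those computations and are the standard textbook values, chosen so that the whole distribution sums to 1):

                toothache             ¬toothache
             catch     ¬catch      catch     ¬catch
cavity       0.108     0.012       0.072     0.008
¬cavity      0.016     0.064       0.144     0.576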

As already noted, the axioms of probability require that the probabilities in the joint distribution sum to 1. Equation 7.2 also gives us a direct way to calculate the probability of any proposition, however simple or complex. This is done just by identifying those atomic events in which the proposition is true and adding their probabilities.

For example, there are six atomic events in which cavity ∨ toothache holds:

P(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28
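The same sum can be sketched in code (the dictionary layout is an assumption made here; keys are (cavity, toothache, catch) truth values with entries from Table 7.2):

```python
# Sketch: the full joint distribution as a dict, and equation 7.2 applied
# to the proposition cavity ∨ toothache.
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

p = sum(pr for (cavity, toothache, _), pr in joint.items()
        if cavity or toothache)
print(round(p, 3))  # 0.28
```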

A particularly common derivation is the distribution over some subset of the variables or over a single variable.

For example, adding entries in the first row gives the unconditional or marginal probability of cavity.

P(cavity) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2

This process is called marginalisation, or summing out, because the variables other than Cavity have been summed out.

In general, the marginalisation rule for any sets of variables Y and Z can be written as:

P(Y) = Σ_z P(Y, z)

That is, a distribution over Y can be obtained by summing out all other variables from any joint distribution containing Y.

A variant of this rule, called conditioning, uses conditional probabilities instead of joint probabilities (via the product rule):

P(Y) = Σ_z P(Y | z) P(z)

Marginalisation and conditioning rules are useful for the derivation of probability expressions.
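Continuing the earlier sketch, marginalisation can be expressed directly as a sum over the unwanted variables (the dictionary is the same assumed layout of Table 7.2):

```python
# Sketch of marginalisation: sum out Toothache and Catch to obtain P(Cavity).
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

P_cavity = {v: sum(pr for (c, _, _), pr in joint.items() if c == v)
            for v in (True, False)}
print(P_cavity)  # {True: 0.2, False: 0.8} (up to floating-point rounding)
```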

In most cases, we will be interested in computing conditional probabilities of some variables, given evidence about others. Conditional probabilities can be found by first using equation (7.1) to obtain an expression in terms of unconditional probabilities and then evaluating the expression from the full joint distribution. This is illustrated by computing the probability of a cavity, given evidence of a toothache.

Using the Product Rule:

P(cavity | toothache) = P(cavity ∧ toothache)/P(toothache) = (0.108 + 0.012)/(0.108 + 0.012 + 0.016 + 0.064) = 0.12/0.2 = 0.6

P(¬cavity | toothache) = P(¬cavity ∧ toothache)/P(toothache) = (0.016 + 0.064)/0.2 = 0.4

We may note that the denominator P(toothache) remains the same whether we calculate the probability of cavity or of ¬cavity. It can be viewed as a normalisation constant for the distribution P(Cavity | toothache), ensuring that it adds up to 1.

Let α be such a constant; then the above two equations can be rewritten as a single equation:

P(Cavity | toothache) = α P(Cavity, toothache)

= α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]

= α [<0.108, 0.016> + <0.012, 0.064>]

= α <0.12, 0.08>

= <0.6, 0.4>
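A sketch of the same normalisation step in code (entries from Table 7.2; here α is computed explicitly rather than left symbolic):

```python
# Sketch: unnormalised values of P(Cavity, toothache), then normalisation,
# so P(toothache) never has to be computed explicitly.
unnormalised = {
    True:  0.108 + 0.012,  # P(cavity ∧ toothache), catch summed out
    False: 0.016 + 0.064,  # P(¬cavity ∧ toothache), catch summed out
}
alpha = 1.0 / sum(unnormalised.values())        # 1 / 0.2 = 5
posterior = {v: round(alpha * p, 3) for v, p in unnormalised.items()}
print(posterior)  # {True: 0.6, False: 0.4}
```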

Thus, normalisation is very helpful in the general inference procedure.

General Inference Procedure (for a single query variable):

Let X be the query variable (Cavity in the example), E the set of evidence variables (Toothache in the example), e the observed values for them, and Y the remaining unobserved variables (Catch in the example).

Then the query is:

P(X | e) = α P(X, e) = α Σ_y P(X, e, y)

where the summation is over all possible y’s (all possible combinations of values of the unobserved variables Y).
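A minimal sketch of this procedure as inference by enumeration (the dict representation of the joint distribution and the restriction to Boolean variables are assumptions made here for brevity):

```python
# Sketch: inference by enumeration over a full joint distribution stored as
# a dict mapping (cavity, toothache, catch) truth values to probabilities.
VARS = ("cavity", "toothache", "catch")
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def enumerate_query(X, evidence):
    """Return P(X | evidence), summing out the unobserved variables Y."""
    dist = {}
    for x in (True, False):
        total = 0.0
        for event, pr in joint.items():
            a = dict(zip(VARS, event))
            if a[X] == x and all(a[e] == v for e, v in evidence.items()):
                total += pr            # sum over all y consistent with x, e
        dist[x] = total
    alpha = 1.0 / sum(dist.values())   # normalise
    return {x: round(alpha * p, 3) for x, p in dist.items()}

print(enumerate_query("cavity", {"toothache": True}))
# {True: 0.6, False: 0.4}
```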

Limitations of Probabilistic Reasoning:

We may note that the variables X, E and Y together constitute the complete set of variables for the domain, so P(X, e, y) is simply a subset of probabilities from the full joint distribution.

This probabilistic inference algorithm does not scale well. For a domain described by n Boolean variables it requires an input table of size O(2^n) and takes O(2^n) time to process the table. In real problems there might be hundreds or thousands of random variables, so it becomes impracticable to define the vast number of probabilities required. That is why the full joint distribution is not used in practice, and more subtle methods are required.

Bayes’ Rule:

We have defined conditional probability in terms of unconditional probabilities in equation 7.1. The product rule can be written in two forms:

P(a ∧ b) = P(a | b) P(b) and P(a ∧ b) = P(b | a) P(a)

Equating the two right-hand sides and dividing by P(a), we get:

P(b | a) = P(a | b) P(b)/P(a) … (7.4)

This equation is called Bayes’ rule (Bayes’ law or Bayes’ theorem).

The more general case of multivalued variables can be written as:

P(Y | X) = P(X | Y) P(Y)/P(X)

This equation represents a set of equations, each dealing with specific values of the variables.

When background evidence e is also available, this equation becomes:

P(Y | X, e) = P(X | Y, e) P(Y | e)/P(X | e)

But Bayes’ rule is useful in practice because there are many cases where we do have probability estimates for three of the four numbers and need to compute the fourth. In the domain of medical diagnosis, we often have conditional probabilities for causal relationships and want to derive a diagnosis.

In the medical profession, an expert doctor knows that the disease meningitis causes the patient to have a stiff neck, say, 50% of the time. The doctor also knows some unconditional facts: the prior probability that a patient has meningitis (m) is 1/50,000 and the prior probability that any patient has a stiff neck (s) is 1/20.

These propositions are represented by:

P(s | m) = 0.5

P(m) = 1/50,000

P(s) = 1/20

P(m | s) = P(s | m) P(m)/P(s) = (0.5 × 1/50,000)/(1/20) = 0.0002

That is, we expect only 1 in 5,000 patients with a stiff neck to have meningitis. Even though a stiff neck is quite strongly indicated by meningitis (with probability 0.5), the probability of meningitis in patients with a stiff neck remains small (0.0002). This is because the prior probability of a stiff neck is much higher than the prior probability of meningitis.
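The arithmetic can be checked with a few lines (a sketch using the numbers above):

```python
# Sketch of the meningitis computation via Bayes' rule.
p_s_given_m = 0.5          # P(s | m): stiff neck given meningitis
p_m = 1 / 50_000           # P(m): prior probability of meningitis
p_s = 1 / 20               # P(s): prior probability of a stiff neck

p_m_given_s = p_s_given_m * p_m / p_s   # Bayes' rule, eq. 7.4
print(round(p_m_given_s, 6))            # 0.0002, i.e. 1 in 5,000
```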

Apparently Bayes’ rule (7.4) does not seem to be that useful, since to compute just one conditional probability it requires three terms: one conditional probability and two unconditional probabilities.

We have already seen a process by which we can avoid assessing the probability of the evidence P(s): compute an unnormalised value for each value of the query variable (here m and ¬m) and then normalise the results. Applying this procedure to Bayes’ rule, we get the same result:

P(M | s) = α <P(s | m) P(m), P(s | ¬m) P(¬m)>

that is, we need to estimate P(s | ¬m) instead of P(s).

Thus the general form of Bayes’ rule with normalisation is:

P(Y | X) = α P(X | Y) P(Y)

where α is the normalisation constant needed to make the entries in P(Y | X) sum to 1.
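A sketch of the normalised form for the meningitis example; P(s | ¬m) is not given above, so here it is back-derived from P(s), P(s | m) and P(m), purely so that the result can be compared with the direct computation:

```python
# Sketch: Bayes' rule with normalisation; P(s) itself is never used in the
# final step, only P(s | m) and P(s | not-m).
p_s_given_m, p_m, p_s = 0.5, 1 / 50_000, 1 / 20
p_s_given_not_m = (p_s - p_s_given_m * p_m) / (1 - p_m)  # back-derived value

unnorm = {"m": p_s_given_m * p_m, "not_m": p_s_given_not_m * (1 - p_m)}
alpha = 1.0 / sum(unnorm.values())
print({k: round(alpha * v, 6) for k, v in unnorm.items()})
# {'m': 0.0002, 'not_m': 0.9998}
```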

One may raise a doubt about Bayes’ rule: why should the conditional probability be available in one direction and not in the other?

Suppose that in the meningitis domain the doctor knows that a stiff neck implies meningitis in 1 out of 5,000 cases; that is, the doctor has quantitative information in the diagnostic direction, from symptoms to causes. Such a doctor has no need to use Bayes’ rule. Unfortunately, diagnostic knowledge is often more fragile than causal knowledge. If there is a sudden epidemic of meningitis, the unconditional probability of meningitis, P(m), will go up.

The doctor who derived the diagnostic probability P(m | s) directly from statistical observation of patients before the epidemic will have no idea how to update the value, but the doctor who computes P(m | s) from the other three values will see P(m | s) go up proportionately with P(m). Most importantly, the causal information P(s | m) is unaffected by the epidemic, because it simply reflects the way meningitis works. The use of this kind of direct causal or model-based knowledge provides the crucial robustness needed to make probabilistic systems feasible in the real world.

Bayes’ theorem was used in the expert system PROSPECTOR to find commercially significant mineral deposits. The aim was to determine the likelihood of finding specific minerals by observing the geological features of an area.

In spite of such successful uses, Bayes’ theorem makes certain assumptions which render it intractable in many domains. First, it assumes that the statistical data on the relationships between evidence and hypotheses are known, which is often not the case. Secondly, it assumes that the relationships between the pieces of evidence and the hypotheses are all independent.

In general, given a prior body of evidence e and a new observation E, the conditional probability arising from their conjunction (not just the sum of their individual effects) is given by:

P(H | E, e) = P(H | E) P(e | E ∧ H)/P(e | E)

Unfortunately, in an arbitrarily complex world, the number of joint probabilities which we require in order to compute this function grows as 2^n if there are n different propositions being considered.

This makes using Bayes’ theorem intractable for the following reasons:

1. The knowledge acquisition problem is very unwieldy, as too many probabilities have to be supplied.

2. The space required to store all probabilities is too large.

3. The time required to compute probabilities would be too large.

In spite of these limitations, Bayesian statistics has been used as the basis for a number of probabilistic reasoning systems. As a result, several mechanisms for exploiting its power while at the same time keeping it tractable have been developed.