In this article we will discuss the pattern-matching problems in natural language processing systems.

The inherent problems in pattern matching become clearer when we discuss the following programs:

SIR:

The basic algorithm used in Bertram Raphael’s Ph.D. thesis work, SIR (Semantic Information Retrieval), is a pattern-matching scheme similar to that used in ELIZA, but SIR differs from ELIZA in a very significant way. The difference is that SIR saves and ‘understands’ input information at a high enough level to allow it to make deductions from the input data.

SIR could understand a restricted subset of 24 patterns of the following types:

* is *

* is part of *

Is * *?

How many * does * have?

What is the * of *?

where the wild-card symbols, *, represent arbitrary nouns or noun phrases containing a noun plus such quantifiers as a, the, every, each, or a number. The basic logic used in SIR is the hypothetical syllogism, which can be stated as: if A is B and B is C, then A is C. This enabled SIR to make logical deductions from information provided by the user.
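To make the deduction step concrete, here is a minimal sketch in Python (SIR itself was written in LISP; the names and data structures below are ours, purely illustrative) of how “* is *” facts can be stored and chained using the hypothetical syllogism:

facts = set()  # pairs (a, b) meaning "a is b"

def tell(a, b):
    facts.add((a, b))

def is_a(a, c, seen=None):
    # Depth-first chaining: a is c if stated directly, or if
    # a is b and b is c for some intermediate b.
    seen = seen or set()
    if (a, c) in facts:
        return True
    for (x, b) in facts:
        if x == a and b not in seen:
            seen.add(b)
            if is_a(b, c, seen):
                return True
    return False

tell("boy", "person")          # "every boy is a person"
tell("person", "animal")       # "every person is an animal"
print(is_a("boy", "animal"))   # True, by the hypothetical syllogism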

Through the careful choice of patterns to match the most common forms of statement and query, SIR was able to achieve very good performance with a minimum of analysis. It could even interact with the user to resolve ambiguities and expand its information base. However, since its performance was limited to the standard pattern subset, it cannot be considered a general natural language processing system.

STUDENT:

Daniel Bobrow wrote STUDENT at MIT as his Ph.D. thesis, which was published in 1968.

The complete set of templates used by STUDENT in solving high-school-algebra story problems consists of the following 12 patterns:

(WHAT ARE * AND *)

(WHAT IS *)

(HOW MANY *1 IS *)

(HOW MANY * DO * HAVE)

(HOW MANY * DOES * HAVE)

(FIND *)

(FIND * AND *)

(* IS MULTIPLIED BY *)

(* IS DIVIDED BY *)

(* IS *)

(* (*1/VERB) *1 *)

(* (*1/VERB) * AS MANY * AS * (*1/VERB) *)

where:

* ⇒ a string of arbitrary length

*1 ⇒ one word

(*1/VERB) ⇒ the matching element must be a dictionary verb

Although this set of templates appears very restricted, STUDENT could solve complex word problems.

STUDENT, using a very simple template-matching system, was able to solve problems as fast as MIT graduate students could. A somewhat modified version of the program was found to be a good model for human problem solving.
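As an illustration, templates of this kind can be approximated with regular expressions. The sketch below is in Python and is our own construction, not Bobrow’s: it treats * as an arbitrary string and *1 as a single word.

import re

def template_to_regex(template):
    # Templates contain only letters, spaces and wildcards, so a direct
    # substitution is enough: *1 -> one word, * -> any string.
    pattern = template.replace("*1", r"(\w+)").replace("*", "(.+?)")
    return re.compile("^" + pattern + "$", re.IGNORECASE)

templates = ["* IS MULTIPLIED BY *", "* IS DIVIDED BY *", "WHAT IS *"]

sentence = "the number of boys is multiplied by 3"
for t in templates:
    m = template_to_regex(t).match(sentence)
    if m:
        print(t, "->", m.groups())
        # prints: * IS MULTIPLIED BY * -> ('the number of boys', '3'),
        # which STUDENT-style processing would turn into an equation.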

A more viable approach to syntactic analysis, however, is sentence parsing, in which the input sentence is converted into a hierarchical structure indicating the sentence constituents.

Parsing systems have two main components:

1. Grammar:

A declarative representation of the syntactic facts about the language.

2. Parser:

A procedure to compare the input sentence with the grammar. Parsing may be top-down, in which case it starts with the symbol for a sentence and tries to map possible rules to the input (or target) sentence, or bottom-up, where it starts with the input sentence and works towards the sentence symbol, considering all the possible representations of the input sentence.

Top-down parsing can be defined precisely as a search problem (a minimal code sketch follows this list) where:

1. The Initial State is a parse tree consisting of the root S and unknown children: [S: ?].

In general, each state in the search tree is a parse tree.

2. The Successor Function selects the leftmost node in the tree with unknown children. It then looks in the grammar for rules whose left-hand side matches the label of that node. For each such rule, it creates a successor state in which the ? is replaced by a list corresponding to the right-hand side of the rule.

3. The Goal Test checks that the leaves of the parse tree correspond exactly to the input string, with no unknowns and no uncovered inputs.
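Here is that minimal sketch in Python. It uses a toy grammar of our own and represents each search state as the list of symbols still to be matched, rather than as an explicit tree; the grammar and lexicon are illustrative assumptions.

grammar = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["the"]],
    "N":   [["dog"], ["cat"]],
    "V":   [["chased"]],
}

def parse(symbols, words):
    # Goal test: success when the symbol list and the input are both empty.
    if not symbols:
        return not words
    first, rest = symbols[0], symbols[1:]
    if first in grammar:
        # Successor function: expand the leftmost non-terminal using each
        # rule whose left-hand side matches its label.
        return any(parse(rhs + rest, words) for rhs in grammar[first])
    # Terminal symbol: it must match the next input word.
    return bool(words) and words[0] == first and parse(rest, words[1:])

print(parse(["S"], "the dog chased the cat".split()))  # True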

The choice of which type of parsing to use (similar to the choice between backward and forward chaining) depends on factors such as the amount of branching each will require and the availability of heuristics for evaluating progress. In practice, a combination is sometimes used. There are a number of parsing methods. These include context-free grammars, transition networks, context-sensitive grammars and augmented transition networks. As we shall see, each has its benefits and drawbacks.

Grammar:

A grammar is a declarative representation of the syntactic facts of the language. It is a specification of the legal structures of a language. It is essentially a set of rewrite rules, each of which allows any element matching the left-hand side of the rule to be replaced by the right-hand side. For example, the rule

A → B

allows the string XAX to be rewritten as XBX. Unlike template matching, a grammar explicitly shows how words of different types can be combined, and defines the type of any given word.
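As a minimal illustration in Python (the function is ours):

def rewrite(string, lhs, rhs):
    # Replace the first occurrence of the rule's left-hand side.
    return string.replace(lhs, rhs, 1)

print(rewrite("XAX", "A", "B"))   # prints XBX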

A grammar has three basic components:

i. Terminal symbols,

ii. Non-terminal symbols and

iii. Rules.

i. Terminal Symbol:

Terminal symbols are the actual words which make up the language (this part of the grammar is called the lexicon). So ‘cat’, ‘dog’, ‘chase’ are all terminal symbols.

ii. Non-Terminal Symbol:

Non-terminal symbols are special symbols designating structures of the language.

These are of three types:

a. Lexical categories, which are the grammatical categories of words, such as noun, pronoun or verb.

b. Syntactic categories, which are the permissible combinations of lexical categories, for instance “noun-phrase”, “verb-phrase”.

c. A special symbol representing a sentence (the start-symbol).

iii. Rule:

The third component of the grammar, the rules, or phrase structures, govern the valid combinations of the words in the language. A rule is usually of the form

S → NP VP

where S represents the sentence, NP a noun-phrase and VP a verb-phrase. This rule states that a noun-phrase followed by a verb-phrase is a valid sentence.

The grammar can generate all syntactically valid sentences in the language and can be implemented in a number of ways, for example as a production system implemented in PROLOG. We will look at how a grammar is generated and how it parses sentences by considering a detailed example.

Imagine we want to produce a grammar for database queries about the teachers of a college. We have examples of possible queries, and we can generate a grammar fragment by analysing each query sentence. If a sentence can be parsed by the grammar we have, we do nothing. If it cannot, we add rules and words to the grammar to deal with the new sentence. For example, take the queries:

Who belongs to a college?

Does Amit Sharma work in the IT Department?

In the case of the first sentence, “Who belongs to a college?”, we would start with the sentence symbol (S) and generate a rule to match the sentence in the example. To do this we need to identify the sentence constituents (the non-terminal symbols). The choice of these, however, does not depend on any grammar of English we may have learned at school.

We can choose any symbols, as long as they are used consistently. We designate the symbol RelP to indicate a relative pronoun, such as ‘who’ or ‘what’ (a lexical category), and the symbol VP to designate a verb-phrase (a syntactic category). We then require rules to show how our syntactic categories are constructed.

In this case VP has the structure V (verb) PP (prepositional phrase), which can be further decomposed as P, a preposition, followed by NP, a noun-phrase. Finally the NP category is defined as Det (determiner) followed by N (noun). The terminal symbols are associated with a lexical category to show how they can fit together in a sentence. We end up with the grammar fragment of a sentence shown in Fig. 10.1.
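Although Fig. 10.1 is not reproduced here, the fragment just described presumably amounts to the following rules (our reconstruction from the text):

S → RelP VP

VP → V PP

PP → P NP

NP → Det N

with a lexicon assigning ‘who’ and ‘what’ to RelP, ‘belongs’ to V, ‘to’ to P, ‘a’ to Det and ‘college’ to N.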

Such rules are supposed to capture the constraints on what is acceptable in language and what is not. They can be used either for synthesis or for analysis of natural language.

This will successfully parse our sentence, as shown in the parse tree in Fig. 10.2, which represents the hierarchical breakdown of the sentence.

The root of the tree is the sentence symbol. Each branch of the tree represents a non-terminal symbol, either a syntactic category or a lexical category. The leaves of the tree are the terminal symbols.
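Since Fig. 10.2 is not reproduced here, the parse tree of “Who belongs to a college?” can be written in bracketed form (using the reconstructed rules above):

[S [RelP who] [VP [V belongs] [PP [P to] [NP [Det a] [N college]]]]]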

We can write a program to use a set of rewrite rules for generating grammatical sentences by basing it on the following algorithm:

1. Given a sequence of symbols, find the first non-terminal.

2. Find a rule which has that non-terminal as its left-hand side, and replace the occurrence of the symbol by the right-hand side of the rule. Go to (1).

3. When all the non-terminals in the string have been replaced by terminals, replace each terminal by a word of the appropriate sort.

A terminal symbol is one which does not appear as the left-hand side of any rule.

This algorithm makes use of two global data structures, namely a set of rules and a dictionary. The set of rules might conveniently be represented as a list of two-element lists, where the first element of each list is the single symbol which appears on the left-hand side of some rule, and the second element is the list of symbols which appear on the right-hand side.

For example,

[S [NP VP]]

[NP [determiner adjective noun]]

[VP [verb NP]]

Such a list of lists may easily be searched for a rule with a particular symbol as its left hand side. We can also use it for checking whether a symbol is terminal or not – if we look for a rule which has it as its left hand side and we cannot find one, then the symbol must be terminal.

The dictionary could also be represented as a list of two-element lists:

i. Terminal symbols appearing as the first element of lists and

ii. Examples or models of the categories named by these symbols appearing as the second element, as in

[determiner a] [verb ate] [noun bath] [noun bus] [verb drives]
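A sketch of the generation algorithm in Python, using the rule-list and dictionary representations just described. An ‘adjective’ entry is added to the dictionary, as an assumption, so that every terminal category has at least one word.

import random

rules = [["S", ["NP", "VP"]],
         ["NP", ["determiner", "adjective", "noun"]],
         ["VP", ["verb", "NP"]]]

dictionary = [["determiner", "a"], ["verb", "ate"], ["noun", "bath"],
              ["noun", "bus"], ["verb", "drives"], ["adjective", "big"]]

def find_rule(symbol):
    # A symbol with no rule on its left-hand side is terminal.
    for lhs, rhs in rules:
        if lhs == symbol:
            return rhs
    return None

def generate(symbols=("S",)):
    symbols = list(symbols)
    for i, symbol in enumerate(symbols):
        rhs = find_rule(symbol)
        if rhs is not None:                    # first non-terminal found
            return generate(symbols[:i] + rhs + symbols[i + 1:])
    # Only terminals remain: replace each by a word of the right category.
    return " ".join(random.choice([w for cat, w in dictionary if cat == s])
                    for s in symbols)

print(generate())   # e.g. "a big bath ate a big bus"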

However, our grammar is still very limited. To extend the grammar, we need to analyse many sentences in this way, until we end up with a very large grammar and lexicon. As we analyse more sentences, the grammar becomes more complete and so more acceptable.

We will analyse the second query, “Does Amit Sharma work in the IT Department?” First, we check whether our earlier grammar can parse this sentence successfully. Our only definition of a sentence so far is as shown in Fig. 10.1.

Taking the VP part first, “work in the IT Department” does meet our definition of a verb phrase, if we interpret “IT Department” loosely as a noun. However, “Does Amit Sharma” is certainly not a RelP. We therefore need another definition of a sentence. In this case a sentence is an auxiliary verb (AuxV) followed by an NP followed by a VP.

Since “Amit Sharma” is a proper noun we also need an additional definition of NP, and for good measure we will call “IT Department” a proper noun as well, giving us another definition of NP. The additional grammar rules are shown in Fig. 10.3.
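Although Fig. 10.3 is not reproduced here, the description suggests additional rules along these lines (our reconstruction):

S → AuxV NP VP

NP → ProperNoun

with ‘Amit Sharma’ and ‘IT Department’ entered in the lexicon as proper nouns, and ‘does’ as an auxiliary verb.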

It may be noted that we do not need to add a rule to define VP since our previous rule fits the structure of this sentence as well. A parse tree for this sentence using this grammar is shown in Fig. 10.4.

Grammars such as this are powerful tools for natural language understanding. They can also be used to generate legal sentences, constructing them from the sentence symbol down, using appropriate terminal symbols from the lexicon.

Grammar is intimately connected with word meaning. Each syntactic rule encodes some semantic relationship between the entities it deals with: we combine such rules to produce a piece of text (an utterance), and we interpret an utterance by recognising the rules which were used in its construction and decoding their significance.

We recognise the difference in the ways the following sentences were constructed:

The tiger has eaten the man.

The man has been eaten by the tiger.

She is going to score a goal.

Is she going to score a goal?

Thus, the connection between grammar and meaning is crucial: in order to understand meaning we must also know how the grammar works, and vice versa.

Syntactic Parsers:

It is legitimate to consider the template-matching schemes described above as elementary parsing (“picking apart”) algorithms in which semantic knowledge is embedded in the structure of the template. We next consider more general parsers, one of the most successful of which is the Augmented Transition Network (ATN) approach of William Woods (1970).

The ATN method of parsing sentences integrates many concepts from Chomsky’s (1957) formal grammar theory with a matching process resembling a dynamic semantic network. Such parsing techniques may be considered as recursive pattern matching, in which the string of words of the sentence is mapped onto a meaningful syntactic pattern.

We can understand the characteristics of an ATN by studying the progression from finite state transition diagrams through recursive transition networks to augmented transition networks. Each level of this progression incorporates more options for semantic input into the formal syntactic structure of the parsing program. The goal of all this effort is to take an arbitrary input sentence and assign each word of the sentence its proper part of speech.

Finite State Transition Diagrams:

The starting point in studying ATNs is the Finite State Transition Diagram (FSTD), as shown in Fig. 10.5.

FSTDs consist of a set of nodes and arcs connecting the nodes, according to the following rules:

a. All terminal symbols (words of the sentence) are represented as arcs.

b. One state (or node) is defined as the START state ‘S’, and one subset of states as the FINAL states ‘F’.

c. The FSTD specifies a (dynamic) process for operating on arbitrary sentences instead of a static relationship between words, as in a semantic network. For this reason, an FSTD is sometimes called a finite state machine.

The FSTD processes the input sentence one word at a time, checking to see if the word matches the prescribed structure. If it matches, the word is placed on the appropriate arc and removed from the input sentence. When the FINAL state is reached with no words left over in the sentence, we say the FSTD has accepted the sentence; that is, we know that the sentence matches the syntactic structure specified by the FSTD.

The FSTD shown above will accept any sentence segment beginning with ‘the’, ending with a noun and containing an arbitrary number of adjectives in between. For instance, ‘the big, hairy, black, male cat’ would be accepted by this FSTD.
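A minimal sketch of such an FSTD in Python, with a small illustrative lexicon of our own:

ADJECTIVES = {"big", "hairy", "black", "male"}
NOUNS = {"cat", "dog"}

def accepts(words):
    state = "S"                        # START state
    for word in words:
        if state == "S" and word == "the":
            state = "Q1"               # consumed the determiner
        elif state == "Q1" and word in ADJECTIVES:
            state = "Q1"               # loop: arbitrarily many adjectives
        elif state == "Q1" and word in NOUNS:
            state = "F"                # FINAL state
        else:
            return False
    # Accept only if we end in the FINAL state with no words left over.
    return state == "F"

print(accepts("the big hairy black male cat".split()))   # True
print(accepts("big cat".split()))                        # False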

The main problem with the FSTD is that it can recognise only sentences written in a regular language (type 3), which is the most restrictive of Chomsky’s four categories of language. It cannot handle sentences from the more general context-free (type 2) grammar.