There is great value in offering natural language interfaces (NLI) to application users. In this paper we will discuss the features, benefits and challenges associated with natural language and present an agent-oriented software toolkit that supports the development of domain-specific NLI. We will focus on how one would develop such an interface and integrate it with an existing application environment.
Natural language refers to the way we communicate with each other. It is the words we use, the meanings we agree to assign to them and the ways we can combine those words into larger structures. In this definition we are separating the language -- the words and sentences and paragraphs -- from the particular mode of transmission we may be using. Our language is essentially the same whether we type the words, write them out by hand or say them aloud.
Most of our interaction with other people involves the use of our natural language. (Or, in the case of those who are better educated than I, multiple natural languages.) We all have tremendous experience expressing complex concepts and navigating great subtlety of meaning using natural language. The ideal behind NLI is to let us apply our experience, skill and comfort with natural language to our interactions with technology.
Why NLI? One obvious benefit would be a shorter learning time. If we can ask a piece of technology for a service the way we would ask another person, we can become comfortable with that technology much more quickly. And with that greater comfort comes an opportunity for much greater productivity.
Natural language also provides more direct access for a user faced with large numbers of options. Web portals are a good example. To be attractive to visitors they need to offer a variety of interesting services. But as soon as those services are in place, they have the challenge of organizing them all so that visitors can see what's available without becoming overwhelmed. The usual solution is to create a hierarchy of menus, with the user moving through this tree structure one level at a time. Search functions provide a more direct way of finding a specific service, assuming of course that the visitor knows what words to use in their search.
Natural language is the logical (one is tempted to say, natural) enhancement to text search. An NLI would allow users of a service to describe exactly what they want; the service would respond with a list of services that are relevant to the request. And unlike text search, the NLI can be smart enough to understand that a request for "bakeries in Paris" refers to their location and not their name.
Another advantage of natural language is that we can combine many concepts into a single request. If we're looking for a flight, it is far more direct to be able to ask for "the cheapest nonstop between San Francisco or San Jose and New York that arrives before 7PM", instead of having to run a series of requests and then comparing them manually.
When we talk about natural language, we need to be just as clear regarding what it isn't as what it is. Natural language isn't about handwriting or speech processing. These are ways of representing language and can sometimes be thought of as separate from the language itself. For example, an English speaker who can read cursive text will be able to read a handwritten note in French. He just won't be able to understand what he's read.
Speech recognition by computers is a particularly difficult technical problem. Many speech-based systems are not based on an understanding of natural language. Instead they rely on a combination of command keywords and menu structures. Work is being done to combine continuous speech recognition and natural language understanding to produce the ideal user interface for many tasks.
The most common approach to natural language is to start at the top: with the language. Natural language systems begin with the components of the language -- the words -- and analyze text in terms of the rules of the language: parts of speech, the structure of sentences and so on. A good top down NLI will need a large dictionary and thesaurus and a full understanding of the ways people create sentences. It will also require a great deal of general cultural knowledge, including the (often elusive) relationship between the meaning of idiomatic expressions like "he bought the farm" and the literal meanings of the words themselves.
Building an NLI using this approach requires expert linguists. This kind of solution is highly dependent on the specific language being supported. Moving an NLI from one human language to another will likely require more of a rewrite than a simple port, since even languages with a common ancestry often have more differences than they do similarities. This limitation to a single language also has implications for an NLI's ability to handle ungrammatical input or shortcuts. An ability to deal with shortcuts is considered crucial to any acceptance of NLIs on cell phones and other input-challenged devices.
By their nature, top down NLIs tend to be offered as black box solutions. Customizing the NLI, for example to add domain- or application-specific vocabulary, must be done by the developer of the NLI. Such packages are also fairly large, since they must encompass such a large body of knowledge. These two characteristics place limits on the kinds of applications that can make use of natural language interfaces.
An alternative is to consider language from the bottom up, starting with the tasks to be performed and considering all the different ways potential users might phrase their wishes. Such an NLI doesn't need to understand every possible sentence in the target language. It merely has to recognize the subset that is relevant to its particular responsibilities.
A domain-specific NLI has several potential advantages. By limiting itself to a set of tasks, it can be much smaller than a general purpose package. It can be more forgiving of shortcuts and incomplete or ungrammatical input, since the limited number of services provided by the application makes it easier to associate the user's request with the correct one. A domain-specific NLI also has fewer issues with homonyms. For a sports-oriented service, a phrase like "Bat Day" can have only one meaning. And nocturnal flying mammals need not apply.
Domain-specific NLIs can provide an impressive degree of accuracy, measured as the percentage of requests that translate into the operation intended by the user. They can also be extended to support additional vocabulary and additional application functions.
Although top down NLIs are useful for general language understanding, for example to capture and catalogue the ideas found in large numbers of documents, domain-specific NLIs are likely to work better as adjuncts to graphical user interfaces in query or command-and-control applications.
One example will show the difficulties inherent in natural language understanding. Consider the sentence, "Time flies like an arrow." What do we mean by a line like this? There are three potential translations hidden in this admittedly tricky statement. Perhaps we mean:
"Time moves with a swiftness that brings to mind the flight of an arrow through the air."
To arrive at this interpretation, the NLI must understand similes. It must also recognize that the verb 'to fly' can be applied to a concept like time and that there is some way to compare the passage of time to the way an arrow flies. But here's another interpretation:
"I command you to measure the velocity of small insects using the same technique as would be applicable to measuring that of arrows."
In this second version, 'time' and 'flies' reverse their roles as noun and verb. This interpretation is certainly valid. Yet most people would choose the first interpretation without even noticing that a second exists. Is this because most of us have never thought to measure flies? Or is it the power of a cliche like "time flies" to keep our attention away from other possibilities?
There is even a third interpretation:
"There is some class of creature called a 'time fly' that has a fondness for arrows."
We would dismiss this version out of hand, assuming we considered it in the first place. We have never heard of time flies and would probably assume they don't exist. But an NLI will have difficulty with such assumptions. Either it must be provided with a knowledge base encompassing a vast array of information or it must (as we should) allow for the possibility that it is seeing a concept that it has not encountered before.
Assuming an NLI can handle "Time flies like an arrow", how would it do with the second line of Groucho Marx's famous quote: "Fruit flies like a banana"? Would it recognize the humor of the remark? Or would it understand it to refer to the relative airworthiness of different kinds of produce? Perhaps an advantage of domain-specific NLIs is that they don't need to get the joke. That would make them the perfect companion to many in the computer field.
Although natural language has a wide range of applications, we will focus here on its value in the user interface, as what we might call natural interaction. We will use an example from the UNIX operating system: the find command. This command searches a directory hierarchy and identifies files that match a set of criteria: file name, file type, age, size and so on. It is both one of the more useful commands and among the most complex. We will apply natural language to a few of its more common capabilities.
Imagine a UNIX shell (command) window with a natural interaction front end. We can ask it to find files that interest us:
show me web pages that are less than a week old and more than 30kb
The front end examines this input and translates it into its equivalent as a shell command:
find -mtime -7 -size +30720c \( -name \*.html -o -name \*.htm -o -name \*.shtml \) -printf "%10s %t %P\n"
This command tells find to start in the current directory (the default when no directories are specified) and identify files whose modification times are less than seven days (-mtime -7; the minus indicates less than), bigger than 30,720 bytes (-size +30720c; the plus indicates greater than and the c specifies characters rather than the default of disk blocks) and whose names end in .html, .htm or .shtml. Files that match the criteria will be printed according to the -printf specification: file size, modification date/time and the file's path.
The shell executes the command and returns the files it finds to us:
84531 Thu Jan  4 06:49:00 PST 2001 Docs/PolicyReference.html
38392 Fri Jan  5 22:15:48 PST 2001 public_html/Bookshelf.html
44392 Fri Jan  5 22:17:08 PST 2001 public_html/Java-Embed.html
This example should make the role of a natural interaction interface clearer: to serve as a translator between the free form text provided by the user and a more structured command language that will be used to drive the application.
An important aspect of any natural language solution is that it must deal with a moving target. This is especially true for a domain-specific interface, since new users will present it with new words and new ways of phrasing requests. But it also applies to whole language solutions. The constant invention of technical acronyms and slang and the absorption of foreign words (a fact of life in English and, I imagine, equally true of other languages) conspire to keep any NLI from standing still. Because we expect our system's vocabulary to expand as new features are added and new users encounter it, our ability to extend and enhance it is of paramount importance.
A lot of the early work on natural language understanding came from artificial intelligence (AI) research. Natural language is a perfect example of an AI sort of problem. Contrary to popular belief, artificial intelligence is not centered on an attempt to model and duplicate the way people think about and solve problems. The aim of many researchers has become simpler and more pragmatic: to develop techniques to solve problems that aren't a good fit for the step-by-step algorithmic design of our computing platforms and programming languages. Their efforts have led to the development of a variety of interesting approaches to programming. Object-oriented languages owe a debt to AI, as do rule-based systems and other, less common programming models.
What most AI programming has in common is a focus on the productivity of developers and only secondarily on the efficiency of the code they produce. Obviously, people trying to solve problems that have never been solved before will be more concerned with the speed with which they can try out their ideas than with how fast the program will run if they ever get it working. And, in any event, improvements in hardware have changed our whole concept of efficiency as it applies to software.
AI developers created languages and tools to match their approach to the problem: object-oriented languages for more flexible and modular algorithmic tasks, rule-based languages for tasks that could be described in terms of rules, pattern matching languages for recognition tasks and so on. The idea is to pick the programming model that's the best fit for the problem being attacked.
These new models present us with an interesting opportunity. Instead of using them to work on problems we don't know how to solve, why not apply them to problems we can solve algorithmically? There are programming tasks that are easier and more natural to represent as rules or patterns. By using non-algorithmic approaches, we can reduce our initial development time considerably. But these approaches have a second benefit. Because the code (whether represented as text in a file or glyphs on a screen) maps more directly to the solution than it would with procedural code, it becomes a much simpler process to maintain, enhance and extend the code.
Although we can write a natural language interface using a procedural programming language, the techniques used by the code to recognize and interpret specific words and phrases won't be obvious to a reader of that code and perhaps not even to the developer. That means that adding new words and new phrases can't be a straightforward or efficient process. If we're to build a maintainable and extensible NI interface we need a programming model that fits the task. An agent-oriented approach to development is a natural fit for this kind of user interface.
What do we mean by agents? A quick search of the web found this definition in the ComputerUser High-Tech Dictionary:
Agent: A software program that performs a service, such as alerting the user of something that needs to be done on a certain day; or monitoring incoming data and giving an alert when a message has arrived; or searching for information on electronic networks. An intelligent agent is enabled [to] make decisions about information it finds.
This definition relates to software agents and is the use of the term that will be most familiar to developers. But there is a second definition of agent that relates to its role in a programming model. In a sense this second agent relates to software agents as objects in object-oriented programming relate to distributed object systems like CORBA: one is a low level programming construct and the other is a linkage between large chunks of software.
One of the early papers on agents defines an agent as "an entity whose state is viewed as consisting of mental components such as beliefs, capabilities, choices, and commitments", an explanation that doesn't tell us anything terribly useful. For the purposes of our discussion we will describe agent-oriented programming as a development model based on networks of small independently-operating software modules that communicate via messages and collaborate to solve problems. Agent-oriented programming is a much more significant leap for developers than the move from procedural languages to object orientation. Working with agents changes more than the way we package our procedures; it's more than the "syntactic sugar" a language like C++ provides to make it easier for developers to do things they were already doing in C.
Because agent-oriented programming has seen less general agreement and standardization than objects, we will focus on a particular implementation of agents and its use in natural language interfaces. Some of what follows will apply to other agent systems. But our goal will be to discuss an application of agents to a specific problem domain, rather than to consider all the ways agents might be applied to that problem or all the other problems for which they have value. The toolkit under discussion has been written entirely in Java, which has some implications for its implementation and use that we will discuss later.
Agent-oriented programming defines the agent as the fundamental unit of code. An agent processes requests either directly or by combining its processing with results produced by other agents. Agents are wired together into a network or hyperstructure. This network structure defines the communication paths between agents, which in turn determines the way agents get requests and provide responses. The hyperstructure can be thought of as a basic tree layout with one enhancement: a node may be connected to multiple nodes above it in the tree. (This explanation assumes a tree with its root at the top, a popular layout in computer science that's rather less common in nature.)
The agent network operates by passing requests from agent to agent. A request begins at the root of the tree and flows down to every agent. Leaf agents (agents with no agents connected downchain from them) examine the request and decide for themselves whether they have anything to contribute. Responses flow back up the tree using the same message paths as the request.
We mentioned a moment ago that the hyperstructure permits an agent to have multiple upchain connections. In such cases the downchain agent will receive the same request from every agent above it. It will only process the request once, however, and will send the same response to all of its upchain agents.
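To make this message flow concrete, here is a minimal Java sketch of how such an agent node might be modeled. Everything here is illustrative rather than the toolkit's actual API; the point is the shape of the processing: requests propagate downchain, each agent answers a given request only once and the same response goes to every upchain agent.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: this models the hyperstructure's message
// flow, not the toolkit's actual classes or API.
class SketchAgent {
    private final String name;
    private final List<SketchAgent> downchain = new ArrayList<>();
    // Responses are cached so that a request arriving from several
    // upchain agents is processed only once.
    private final Map<String, String> processed = new HashMap<>();

    SketchAgent(String name) { this.name = name; }

    void connectDownchain(SketchAgent child) { downchain.add(child); }

    // Phase one: pass the request down, gather responses, answer once.
    String interpret(String requestId, String request) {
        String cached = processed.get(requestId);
        if (cached != null) return cached;   // already seen via another parent

        List<String> claims = new ArrayList<>();
        for (SketchAgent child : downchain) {
            String response = child.interpret(requestId, request);
            if (!response.equals("no claim")) claims.add(response);
        }
        String response = decide(request, claims);
        processed.put(requestId, response);
        return response;
    }

    // A stand-in for the agent's own judgment about the request.
    String decide(String request, List<String> downchainClaims) {
        if (downchainClaims.isEmpty()) return "no claim";
        return name + "(" + String.join(", ", downchainClaims) + ")";
    }

    public static void main(String[] args) {
        SketchAgent find = new SketchAgent("FIND");
        SketchAgent size = new SketchAgent("SIZE");
        SketchAgent age = new SketchAgent("AGE");
        SketchAgent number = new SketchAgent("NUMBER") {
            String decide(String request, List<String> claims) {
                return request.contains("30") ? "claim:30" : "no claim";
            }
        };
        find.connectDownchain(size);
        find.connectDownchain(age);
        size.connectDownchain(number);   // NUMBER has two upchain agents...
        age.connectDownchain(number);    // ...but answers each request once
        System.out.println(find.interpret("req-1", "more than 30kb"));
    }
}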
Figure 1 shows an example of an agent network for the UNIX find command we discussed earlier. Each circle represents an agent; the arrows show the paths a request will take to reach every agent in the network. The tree begins with the FIND agent, which receives the request from the user and passes it along to the agents that are downchain from it. These next level agents map to the different kinds of requests the network understands: the FOLDER agent understands where the search is to begin; the SIZE agent understands file size specifications; AGE understands relative modification dates; FILETYPE identifies files by their extensions; and ORDER handles requests to sort the results or to provide a subset (e.g. the five biggest or newest files). Agents further down the chain break the problem down further, into specific file types to be found or different ways of specifying time or date. We will explore this example in more detail shortly.
The agent network processes a request in two phases. The first phase relates to interpretation of the request -- that is, the determination of the user's intent. Phase two is the actuation phase, where the network uses its understanding of the request to generate a command to the application.
Phase one begins when the top level agent receives the request from the outside world. It passes the request to all of its downchain agents, which pass it along to their downchain agents, and so on until every agent has seen the request. The leaf nodes then examine the request, each one deciding whether it recognizes anything in the request that it knows how to process. If it sees anything, it makes a claim on whatever part of the request it thinks it understands. An agent may make multiple claims on multiple parts of the request, including claims on overlapping parts of the request. If it sees nothing of interest in the request, it sends an explicit "no claim" message up the chain.
An upchain agent waits until it gets a response from every agent down the chain from it. It looks at the claims it receives and combines those claims with its own thoughts about the request. It may make its own claim based on the downchain agent claims; it may reject those claims based on its own, better understanding of the request and make a claim unrelated to those it received; or it may decide that neither it nor its downchain agents have anything to contribute and send a "no claim" message to its upchain agent. In this way claims and "no claim" responses travel up the tree until they reach the top level agent.
It is possible, in fact likely, that an agent will get multiple claims from the agents below it. A set of heuristics is used to determine the relative strength of each claim. It is up to the upchain agent to decide whether to pass along multiple claims or to send only the strongest.
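As one hypothetical example of such a heuristic, a claim that covers more of the request might outrank a claim that covers less. The toolkit's real heuristics weigh more factors than this sketch:

// A hypothetical illustration of claim-strength comparison; the
// toolkit's actual heuristics are more involved than this sketch.
class Claim implements Comparable<Claim> {
    final String agent;
    final int start, end;   // the span of request text this claim covers

    Claim(String agent, int start, int end) {
        this.agent = agent;
        this.start = start;
        this.end = end;
    }

    int coverage() { return end - start; }

    // One plausible heuristic: a claim covering more of the request
    // outranks one covering less, as when a claim on "two months old"
    // beats a claim on "two" alone.
    public int compareTo(Claim other) {
        return Integer.compare(other.coverage(), this.coverage());
    }
}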
Once the top level agent has received responses from the rest of the agents, it begins the second phase: the generation of the command to the application. This time, the request is passed only to those agents whose claims were accepted in phase one. Each leaf agent has its chance to contribute some part of the command. For example, given a request for "files more than 30kb", in phase one the BIGGER agent claims the string "more than", the NUMBER agent claims "30" and the SIZE agent combines these claims with its own to claim "more than 30kb".
In phase two the BIGGER agent generates "+" (its translation of "more than" to the find command's requirements), NUMBER generates "30" and SIZE combines and extends these into "-size +$((30 * 1024))c". (The expression in parentheses will be computed by the shell to produce 30720, the number of bytes in 30k.) This processing continues up the tree until the FIND agent has a complete command. It passes the command on to a special actuation agent that tells the application what to do. The application sends back an answer, which is fed to the user.
Figure 2 shows the agent network after processing the request "What pictures are there in my agents folder?" In this figure, agents that made winning claims are filled in; agents that made no claim are not. In this case the IMAGES agent claims the word "pictures", FILETYPE forwards the claim from IMAGES; the FOLDERNAME agent claims "agents" (based on its knowledge that there is a folder of that name) and FOLDER claims "in my agents folder", combining the claim of the FOLDERNAME agent with its own rules about the ways we might specify a folder. In this example none of the agents make competing claims for the same words in the request.
To understand what's special about agents, it may help to think about objects for a moment. When we learn about objects, we spend a lot of time understanding the concept of inheritance, the ability to define a new category of object by starting with one we already have and defining what it does that its predecessor doesn't do and what it does differently.
This inheritance hierarchy is fundamental to the way we define classes of objects and the way we think of them. But it turns out that the inheritance hierarchy is only one of three important hierarchies in object-oriented programming. And although it is the only one that is obvious from the code, it is in many ways the least relevant to our understanding of the way a program works.
The three hierarchies can be thought of as the is-a, the has-a and the uses-a hierarchies. They represent inheritance (e.g. a Volkswagen Beetle is-a car; Beetle inherits much of its structure and behavior from a more generic car class), containment (a Beetle has-a transmission; Beetle objects have transmission objects inside them, either directly or through a reference) and invocation (a Beetle uses-a highway; Beetle objects interact with highway objects by invoking methods on them).
Of the three, the uses-a hierarchy is often the most important and the most difficult to understand from a reading of the code. Reading source code may tell us that method a of class x invokes method b of class y. What it can't tell us is when that happens, or why, or whether the conditions even permit it to happen at all. Dynamic languages like Lisp, Smalltalk and Java make our understanding of uses-a relationships even more difficult, as they permit programs to construct calls to methods whose names are built during execution and never appear in the source at all!
In agent programming, flow of control is implicit; what counts is flow of information. As we've already seen, agents decide for themselves whether they have a role to play in processing a request. Another difference is that agent programming is based on the idea that agents are aggressive. It is not at all unusual for agents to be wrong in their belief that they have a contribution to make. It is the responsibility of upchain agents to overrule those further down based on their greater understanding of the problem and the solution. Agents are allowed to be wrong. The important thing is for the agent network as a whole to be right.
And sometimes it isn't a simple matter of wrong or right. Sometimes there are multiple ways of interpreting a request. Dealing with ambiguous requests is a fact of life for agents. We will address ambiguity as part of the next section.
Figure 3 shows the request "find music files that are about two months old". Here two of the agent circles are half-filled, marking agents whose claims were later rejected.
In this example, the MEDIA agent claims "music", FILETYPE combines with that to claim "music files", NUMBER claims "two" and AGE combines with that claim to claim "two months old". These are the successful claims. But they aren't the only ones. Since NUMBER is connected to both AGE and SIZE, it sends its claim on "two" to both agents. SIZE doesn't see anything else in the request that it understands, so it forwards the claim on "two" up to the FIND agent. FIND compares "two" to "two months old", prefers the latter claim and rejects the claim from SIZE in favor of the one from AGE. Finally, the FOLDERNAME agent makes a claim on "music", since there is a folder called "Music". FOLDER rejects that claim, since the request lacks any of the surrounding words the agent requires for a folder specification.
Let's try one last request: "Are there any word docs under 200?" This request is ambiguous; without any kind of unit after 200 we don't know whether we're talking about file size or age.
As we can see in Figure 4, both BIGGER and OLDER make claims against the word "under". (Despite their names, these agents also handle concepts of smaller and newer.) The SIZE agent combines the claim of BIGGER with that of NUMBER to make its own claim on "under 200". AGE makes the same claim. ORDER also makes a claim on "200", which the FIND agent rejects in favor of the stronger claims from SIZE and AGE. (FOLDERNAME made a claim against a folder called "docs", which was rejected by FOLDER as in the previous example.)
What do we do in the face of two claims of equal strength? The decision belongs to the programmer, who could give precedence to one or the other agent or let the heuristics decide. But in this case there's really no reason for the network to believe that we meant one thing vs. another. Here is a situation where the best course is to explain the problem to the user and let her decide how she wants to proceed.
The programmer allows for ambiguity by identifying places where it should be permitted and then providing an action to take place when it occurs. The network knows what caused the ambiguity and generates a specific set of choices back to the user:
Which one do you mean?
1: File size in kilobytes
2: Last modification in days
3: Ignore input
Selection (choose one):
We can pick one of the options presented, at which point the request will be resubmitted with one of the competing claims chosen over the other. Or we can forget the whole thing and make a different request. The important thing is for the network to be as precise as possible in telling us what it needs to know. Far better to provide a targeted request for clarification than to return some generic "huh?" response.
Up to this point we have talked about the decisions each agent makes without explaining precisely how we tell it to make those decisions. Agents are programmed using policies. The policy language is based on a combination of pattern matching and a rule-based condition and response syntax.
Let's look at a simple example. Here are the policies that make up the BIGGER agent:
(P1:
    ((('bigger' | 'larger' | 'more') ['than']) | 'over' | 'above' | '>')+
    {action: {execute:'+'}}
),
(P2:
    ((('smaller' | 'less' | 'fewer') ['than']) | 'under' | 'below' | '<')+
    {action: {execute:'-'}}
),
(P10:
    (/P1 | /P2)+
    {action: delegate to /P1, /P2}
)
Each policy begins with an identification number. These numbers are used by other policies within the same agent to combine behaviors. The next clause defines the condition. If the condition is met, the policy is relevant. We can see the list of words recognized by the first policy: "bigger", "larger", "more", "than" and so on. The vertical bar indicates an Or condition. Parentheses mean that this is a required match; square brackets show optional words. Adjacent entries identify words that must be adjacent. There is also a less-than operator for words that must be in order but don't have to be adjacent and an ampersand operator for words that must appear but may be in any order.
We can read the first policy as matching any of the following words and phrases: "bigger", "bigger than", "larger", "larger than", "more", "more than", "over", "above" and ">". The second policy matches all the equivalent "small" words: "smaller", "smaller than", "less", "less than" and so on. We make words like "than" optional to allow shortcuts to work: "word files more 200k" in addition to the more grammatical "more than 200k".
The third policy uses some special syntax to permit multiple claims from this agent. It says that if either P1 or P2 generates a claim, delegate back to that policy to process the claim. The plus sign indicates that this policy can make multiple claims. If both P1 and P2 make claims, they should both go up the chain, as in a request like "java files that are more than 20kb but less than a meg". Either P1 or P2 may itself make multiple claims. For example, we can process a request like "more than 20kb and more than 50kb". It will work correctly, although it's kind of silly.
It would be logical to read policies in terms of condition and action. But that isn't quite accurate. We need to remember the two phases of agent processing: the condition is evaluated during the interpretation phase to produce claims, while the action runs during the actuation phase to produce its part of the command. A more concrete description of the process is this:
Given a request of "java files that are more than 20kb but less than a meg", policy P1 will match "more than", P2 will match "less than" and P10 will forward both claims up to the SIZE agent. In the actuation phase, P1 will generate the literal "+", P2 will return "-" and P10 will pass both actuations up the chain.
Now that we understand how a leaf agent like BIGGER works, let's see how an upchain agent combines the behavior of its downchain agents to continue the processing. Here are a few of the policies from the SIZE agent:
(P1:
    NUMBER ('b' | 'byte' | 'char' | 'character')
    {action: {execute: NUMBER}}
),
(P2:
    NUMBER ['k' | 'kb' | 'kilobyte' | 'kilo' | 'kbyte']
    {title: 'File size in kilobytes'}
    {action: {execute: '$((', NUMBER, ' * 1024))'}}
),
(P3:
    NUMBER ('m' | 'mb' | 'meg' | 'megs' | 'megabyte' | 'mbyte')
    {action: {execute: '$((', NUMBER, ' * 1024 * 1024))'}}
),
(P4:
    'empty'
    {action: {execute: '-size 0 '}}
),
(P10:
    [BIGGER] (/P1 | /P2 | /P3) & ['size' | 'length']
    {action: {execute: '-size ', BIGGER, /P1, /P2, /P3, 'c '}}
)
The first policy in this agent uses the result of the NUMBER agent (which, not surprisingly, recognizes numbers, as well as a few special words like "zero", "a", and "one") followed by words that indicate that we're talking about a number of bytes. P2 handles requests for kilobytes; its actuation generates the shell arithmetic that multiplies the number by 1024. P3 does the same for megabytes. P4 just recognizes the word "empty" in phrases like "empty files"; it's an example of the way an upchain agent may also behave like a leaf agent.
P10 combines the results of several other policies and agents into one. It combines a request for a number of bytes, kilobytes or megabytes (the P1, P2 and P3 policies) with an optional greater- or less-than to generate a complete size specification. There is a similar P12 policy for the AROUND agent (sizes and ages within plus or minus 10%) and a P20 that uses the plus syntax to permit this agent to generate multiple claims.
The other point worth noting is the presence of a title on the P2 policy. Titles are used in the case of ambiguity to label the various choices. P2 says that the various expressions for kilobytes are optional, so expressions like "bigger than 200" will be interpreted as more than 200 kilobytes instead of 200 bytes. This was a conscious decision on the part of the developer, and one that might not stand closer examination. But it is useful for our purposes, allowing us to explore ambiguity behavior between the SIZE and AGE agents.
All of the policy examples we have seen here are based on matching of literal values and constrained strings like numbers. The policy language also provides features for matching requests against the contents of flat text files and relational databases. These features are useful for dynamic words that would not be appropriate to code directly in policies. An obvious example is the set of folder names recognized by the FOLDERNAME agent. In the current network, FOLDERNAME uses a dynamically generated file of directory paths. But a more general solution is presented in the next section.
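As an illustration of where such a file might come from, here is a small stand-alone Java sketch that walks a directory tree and writes one folder path per line, producing the kind of dynamically generated file an agent like FOLDERNAME could match against. The file name and format here are assumptions for the example.

import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Sketch: regenerate the list of folder names the network recognizes.
public class FolderListGenerator {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        try (PrintWriter out = new PrintWriter("folders.txt");
             Stream<Path> dirs = Files.walk(root)) {
            dirs.filter(Files::isDirectory)
                .forEach(p -> out.println(p));   // one directory path per line
        }
    }
}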
The pattern matching approach of the policy language has powerful implications for applications that need to support multiple human languages, whether as separate versions for different locales or as integrated multilingual systems. Supporting a different language involves the replacement of existing terms with those of the new language and some amount of reordering of elements within policies. It doesn't require anything like the kind of wholesale reconstruction that a whole language NLI would demand.
As powerful as the pattern matching capabilities provided by policies can be, in some cases it is not the easiest or most effective way to identify relevant portions of a request. There are a number of recognition tasks that may be served better using algorithmic code. We can support such tasks by writing agents directly in Java.
What kinds of agents are better suited to Java? The best candidates are recognition tasks that are fundamentally algorithmic: matching that depends on computation, validation or lookups in external systems rather than on specific words and phrases.
All of the policy interpretation agents are instances of a particular Java class. Creating a custom agent is simply a matter of defining a subclass of this interpretation agent class and then overriding two methods: the interpretInput() method that examines each request and generates one or more claims from it; and the handleDelegation() method that generates the strings of text that will make up the actuation command.
Each custom agent will be a new object class, although to the rest of the network it will look like any other interpretation agent. It is important to note that while Java agents substitute object-oriented code for pattern-matching policies, they are still operating within the agent network's framework of requests, claims and actuations.
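As a sketch of what such an agent might look like, consider one that recognizes ISO-style dates, a task better suited to a regular expression than to word-matching policies. The two overridden method names come from the toolkit; the signatures and the abstract base shown here are simplified assumptions so the example stands on its own.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified stand-in for the toolkit's interpretation agent class;
// only the two method names below come from the text.
abstract class CustomAgentBase {
    abstract List<String> interpretInput(String request);
    abstract String handleDelegation(String claimedText);
}

class DateAgent extends CustomAgentBase {
    private static final Pattern ISO_DATE =
        Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

    // Phase one: claim every ISO-style date found in the request,
    // something awkward to express as word-matching policies.
    List<String> interpretInput(String request) {
        List<String> claims = new ArrayList<>();
        Matcher m = ISO_DATE.matcher(request);
        while (m.find())
            claims.add(m.group());
        return claims;
    }

    // Phase two: translate a winning claim into a fragment of the
    // actuation command (GNU find's -newermt test, in this case).
    String handleDelegation(String claimedText) {
        return "-newermt " + claimedText + " ";
    }
}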
Writing an agent directly in Java gives it tremendous flexibility. However, we should be cautious in our use of such agents. Keep in mind that the purpose of an agent network is to understand and translate the user's question, not to determine the answer. Java agents should be employed only where they can help that recognition process.
In addition to conceptual reasons to limit our use of Java agents, there are practical reasons. Policies are easy to write, modify and test. The edit/compile/debug cycle for Java agents will be at least an order of magnitude longer than that of policies. Use each where the benefit is clear and the cost is acceptable.
Up to now we have been discussing the inner workings of agent networks without any consideration of what is required to integrate them into a larger application. We have assumed that user requests arrive at the top level agent as if by magic, that the command generated by the network somehow causes something useful to happen and that the result of that something useful makes its way back to the user. It is time to understand how all the pieces fit together.
Figure 5 represents the architecture of a server-side application with a natural interaction interface. The system is multimodal, with users submitting requests from a variety of wired and wireless devices. These requests are received by specialized servers based on the wrapper for the request: an SMS server for cell phone messages transmitted using the Short Message Service protocol; a web or WAP server for cell phones, PDAs and computers using web browsers; an email responder for PDAs and computers sending requests to a special email address; and perhaps a speech recognition system that receives voice requests from cell and wired phones.
All of these requests are routed to the agent network via a special agent called an interaction agent. This agent is responsible for receiving the request and using some service-specific bit of information to associate it with earlier requests that should be treated as part of the same conversation. Different services will use different information to make this association: a telephone number, an email address, an HTTP cookie. All that matters is that we have some way of knowing whether this request is within the context of an ongoing conversation, or whether it should begin a new conversation.
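A minimal sketch of this association might look like the following. All of the names here are assumptions, but the pattern -- a conversation keyed by channel plus identifier -- is the essential idea:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: key conversations on whatever identifier each channel
// provides -- a phone number for SMS, an address for email, a cookie
// for the web.
class ConversationRegistry {
    private final Map<String, Conversation> active = new HashMap<>();

    Conversation conversationFor(String channel, String id) {
        String key = channel + ":" + id;   // e.g. "sms:+14155551212"
        return active.computeIfAbsent(key, k -> new Conversation());
    }
}

class Conversation {
    // prior requests, preferences, learned keywords and so on
    final List<String> requests = new ArrayList<>();
}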
The interaction agent passes the request on to a user agent. The user agent retrieves information associated with this particular user and this conversation, including personal preferences, learned keywords and other behavior and contextual information from prior interactions. An example of the latter would be an interaction like this:
User: "show me web files more than a month old"
The system returns a list of files.
User: "which
of these is the most recent?"
The system
understands that "these" refers to the result of the previous request.
The user agent passes the request on to the top level interpretation agent, which sends it to the rest of the network as we've already discussed. At the end of the actuation phase, the top level agent sends the actuation command to a custom actuation agent. This is the agent that interacts with the back end application. How it does so depends on the developer and the specific application. If the application has a Java interface, the actuation agent can use its methods directly. Alternatively, it may need to use the Java Native Interface to integrate with a non-Java application. If the application runs as a separate process, there may be remote procedure calls, Java's Remote Method Invocation or some similar networking protocol involved.
In most cases the application will send a response back to the actuation agent. The actuation agent forwards the response to the user agent, which sends it to the interaction agent and on to the user.
Figure 6 shows the interface agents for the find command network. Both the interaction agent (labeled io in the diagram) and the user agent are generally used without modification. (The connection between the application server or servers and the interaction agent can be customized using the IOMode and IOFormat helper classes, as we will see shortly.) Every application will have its own custom actuation agent. We will discuss the kshactuator used by the find network next.
As we have already seen, an agent network processes a request in two phases. In phase one the network determines which agents understand the request; in phase two those agents work together to create an actuation command. This command is fed to the actuation agent, which uses it to control the application and return a result. Every application will have its own unique actuation agent that uses the programming interface provided by the program to be controlled.
In our example, the program to be controlled is an implementation of the Korn shell. Since the actuation command we have been building is one the shell can understand without further assistance, our actuation agent will be rather simple: it must write the command where the shell can find it, launch a shell process to execute it, capture the shell's output and return that output as the response.
What follows is an annotated and slightly edited version of the actuation class used by this application. It has been simplified (including the removal of the exception handlers that pepper every good programmer's code) and reformatted to fit the constraints of this document.
Each actuation agent will be implemented as a unique class. We don't have any initialization in this one, so our constructor simply passes the argument values to the constructor for the base class:
public class kshActuationAgent extends ActuationAgent {

    public kshActuationAgent(String name, String className,
                             String domainName) {
        super(name, className, domainName);
    }
The execute() method is the only one we need to write. The single argument to this method is a string with the command generated by the network; the return value is another string with the response.
    public String execute(String command) {
Our first step is to write a file containing the actuation command. We also populate the string that will be the response by storing a copy of the command there:
        File file = new File("c:\\.findcmd");
        try {
            PrintStream ps = new PrintStream(new FileOutputStream(file));
            ps.print(command + "\n");
            ps.close();
        } catch (Exception e) { }

        StringBuffer result = new StringBuffer(command + "\n\n");
Our next task is to start up a process containing the shell program. We have arranged things so the shell will automatically read and execute the file we've created:
        Process p = null;
        try {
            p = Runtime.getRuntime().exec("c:\\login.exe");
        } catch (IOException e) { }
Next we attach an input stream to the process we just created. We read everything produced by the shell and store it in our result buffer:
        BufferedReader br = new BufferedReader(
            new InputStreamReader(p.getInputStream()));
        String s = null;
        try {
            while ((s = br.readLine()) != null)
                result.append(s + "\n");
            br.close();
        } catch (Exception e) { }
Finally, we get rid of our temporary file and return the result:
        file.delete();
        return result.toString();
    }
}
As we can see, the structure of an actuation agent is very simple: get the command, control the application, capture the response, return it. The details of these operations depend entirely on how we communicate with the back end application.
We have now followed a request through the agent network and seen how an actuation command is used to generate a response from a back end application. The last stage of the journey remains: the response is passed to the user agent, then to the interaction agent and, finally, to the user who made the request.
As we've said, the request may have come from the user via a number of different devices. The specific device may not matter to us. However, the kind of device does matter, at least insofar as it may affect the answer we provide. Considerations of screen size, graphic capabilities, memory capacity and network bandwidth translate into a requirement for different result displays. This capability is encapsulated in the IOFormat class.
The job of the IOFormat object is to receive the response generated by the system and apply some formatting to it based on the needs of the target device. This formatting might include the insertion of HTML tags to improve layout. Alternatively, it might reduce the number of responses delivered to the user to match the limited screen and slow network of a cell phone. Here is a simple HTML formatter for our find application.
public class findIOFormat extends WebIOFormat {

    public findIOFormat() {
        super();
    }

    public Object display(String result) {
        int i = result.indexOf('\n');
        if (i < 0)
            return "";
        else
            return "The UNIX command is:<BR>\n<STRONG><TT>" +
                   result.substring(0, i) + "</TT></STRONG>\n" +
                   "<PRE>" + result.substring(i + 1) + "</PRE>\n";
    }
}
The result string generated by the actuation agent consists of the actuation command, two newline characters and then all of the results from the execution of the command. This display() method finds the position of the first newline. It returns the command in a bold, fixed width font (the <STRONG> and <TT> tags), followed by the command's results. Text within the <PRE> (preformatted) tag uses that same fixed font and will appear exactly as we provide it, with every space and newline. Figure 7 shows the result of a query in a web browser.
In addition to the display() method, we may wish to override the displayMenu() method. This is the method that is invoked when we have detected an ambiguous situation and wish to present the user with a menu of choices. Depending on our target device, we might be able to create a more descriptive or more compact interface for resolving the ambiguity.
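For instance, a formatter for a phone-sized screen might override displayMenu() along these lines. The method name comes from the toolkit, but the signature here is an assumption, following the pattern of the display() method shown above:

public class compactIOFormat extends WebIOFormat {

    public compactIOFormat() {
        super();
    }

    // Assumed signature: render the ambiguity choices as a short,
    // numbered plain-text menu suited to a small screen.
    public Object displayMenu(String[] choices) {
        StringBuffer menu = new StringBuffer("Which one?\n");
        for (int i = 0; i < choices.length; i++)
            menu.append((i + 1) + ": " + choices[i] + "\n");
        return menu.toString();
    }
}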
There is a related need on the input side for some way to identify the user so we can maintain context. This capability is encapsulated in the IOMode class. Most applications will not require a custom IOMode. The standard ServerIOMode class can handle most server application environments: web, email and so on, as well as standalone applications.
It should be obvious that giving an application a natural interaction interface requires a somewhat different development process than the one we use for traditional graphical and text user interfaces. We can't wait until the testing phase to involve potential users. In fact, in a fundamental way those users are the designers of the interface.
The most important aspect of a successful implementation of natural interaction is what's called corpus collection, the job of amassing all the different ways people will find to request particular services of our application. The larger and more representative the user base used for corpus collection, the greater the potential for success of the finished application.
With that in mind, here is one model of the steps we will follow to build an NI interface: collect a corpus of sample requests from a representative group of users; design an agent network around the tasks those requests describe; write and refine the policies for each agent; test the network's accuracy against the corpus; and deploy, continuing to collect new vocabulary and retest as real users arrive.
There is likely to be considerable overlap among the phases described here. It is also important to note that vocabulary collection and accuracy testing will be an ongoing process. In web-based applications in particular, it is a good idea to capture user requests and sift them periodically for valid syntax that the network does not handle well.
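A capture hook for this purpose can be as simple as appending each request to a log file for later review. This is a hypothetical helper, not part of the toolkit's API:

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Sketch: record every user request for periodic corpus review.
class RequestLogger {
    private final String logFile;

    RequestLogger(String logFile) { this.logFile = logFile; }

    synchronized void log(String request) {
        try (PrintWriter out = new PrintWriter(new FileWriter(logFile, true))) {
            out.println(System.currentTimeMillis() + "\t" + request);
        } catch (IOException e) {
            // logging must never break request handling
        }
    }
}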
Java has had considerable success as an application development language and some lesser success in building software tools. In part this dichotomy is by design. Java was intended for applications; its designers made many architectural decisions that trade off raw performance and performance predictability for ease of development, debugging and deployment. Characteristics like platform independence, dynamic class loading and built-in support for threads and networking make it a good implementation language for an agent-oriented software development kit.
In this paper we have discussed some elements of an agent-oriented programming model and examined a particular implementation of agents and its application to user interface design. As the number of technological products we encounter in our daily lives increases, the need for these products to present convenient, expressive and natural styles of interface becomes ever more obvious. Developing these interfaces will require that we consider new models of development that offer much greater productivity and flexibility, even as they maintain the level of reuse, ease of integration, performance and memory footprint we require.
No one tool is right for every job. And no one programming language is best for every task. What is true of languages is equally true of programming models. Given the right set of questions, agents are a compelling answer.
Comments to: Hank Shiffman, Mountain View, California