Claus Zinn, Johanna D. Moore, and Mark G. Core
Intelligent Information Presentation for Tutoring Systems
School of Informatics, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, UK
Abstract. Effective human tutoring has been compared to a delicate balancing act. Students must be allowed to discover and correct problems on their own, but the tutor must intervene before the student becomes frustrated or confused. Natural language dialogue offers the tutor many ways to lead the student through a line of reasoning, to notify the student of an error indirectly, and to use a series of hints and follow-up questions to get the student back on track. These sequences typically unfold across several conversational turns, during which the student can make more errors, initiate topic changes, or give more information than requested. Thus, to support tutorial interactions, we require an intelligent information presentation system that can plan ahead, but is also able to adapt its plan to the dynamically changing situation. In this paper we discuss how we have adapted the three-layer architecture developed by researchers in robotics to the management of tutorial dialogue.
Much of the work on intelligent information presentation has focused on systems that support information-seeking (e.g., providing flight times and fares), assist decision-making (e.g., comparison shopping, logistics planning), or describe objects and artefacts (e.g., museum guides). In such applications, while we expect that the system has more information than the user, we also assume that the user understands the domain of discourse and is able to use the information the system provides in order to choose an option, make a decision, or assimilate the information they receive. In this paper, we focus on intelligent information presentation in the domain of tutoring, where the system is trying to teach the user new concepts and correct user misconceptions. To motivate our approach, we describe the unique characteristics of human tutorial interaction, and present an architecture for intelligent information presentation for tutorial applications.
1.1 Intelligent Information Presentation for Tutoring
Studies show that one-to-one human tutoring is more effective than other modes of instruction. A meta-analysis of the findings from 65 independent evaluations of school tutoring programs found that tutoring raised students' performance by 0.40 standard deviations (Cohen, Kulik, & Kulik, 1982). Results with good tutors are even more promising. For example, the average student who received one-to-one tutoring with a good tutor scored 2.0 standard deviations above the average student who received standard classroom instruction, and 1.0 standard deviation above students in a mastery learning condition (Bloom, 1984).
From its inception, the goal of research in computer-based tutoring environments has been to model the effective behaviors of good human tutors, and in so doing to create an optimal educational tool. There is mounting evidence from cognitive psychology that important (or what many call “deep”) learning is most likely to occur when students encounter obstacles and work around them, explaining to themselves what worked and what did not, and how new information fits in with what they already know (Chi, Bassok, Lewis, Reimann, & Glaser, 1989; Chi, de Leeuw, Chiu, & LaVancher, 1994; Ohlsson & Rees, 1991; VanLehn, 1990). This is consistent with the constructivist movement in education, which argues that students learn best when they are active participants in the learning process and construct knowledge for themselves.
Debates about what makes human tutoring effective, and how this might be captured in a computer-based learning environment, led to several detailed studies of human tutoring (Lepper & Chabay, 1988; McArthur, Stasz, & Zmuidzinas, 1990; Merrill, Reiser, & Landes, 1992; Fox, 1993; Graesser & Person, 1994). The consensus from these studies is that experienced human tutors maintain a delicate balance, allowing students to do as much of the work as possible and to maintain a feeling of control, while providing students with enough guidance to keep them from becoming frustrated or confused. Maintaining this delicate balance requires that a tutor be flexible. Our and others' analyses of human tutorial interactions show that human tutors use a variety of strategies, including hinting (Hume, Michael, Rovick, & Evens, 1996), drawing students' attention to an error (often indirectly) and providing students an opportunity for repair (Fox, 1993; Lepper & Chabay, 1988), pointing out features of the solution that are incorrect, scaffolding (Chi, Siler, Jeong, Yamauchi, & Hausmann, 2001), and so on. In addition, human tutors strategically moderate their feedback. They sometimes intervene immediately after an error has occurred, but at other times allow the student to proceed with the solution, returning to the error later (Littman, Pinto, & Soloway, 1990). Merrill, Reiser, Ranney, and Trafton (1992) argue that human tutorial guidance appears to be structured around impasses, and the content and timing of feedback are dependent on the error or impasse encountered.
Human tutoring is a collaborative process, in which tutor and student work together to repair errors. It is a highly interactive process, with the tutor providing constant feedback to support students' problem solving. Merrill, Reiser, Ranney, and Trafton (1992) argue that regardless of the timing or content of the intervention, human tutors carefully design their feedback to allow students to do as much of the work as possible, while still preventing floundering. Fox (1993, p. 122) observes that “the tutor and student both make use of strategies which maximize the student's opportunity to correct his/her own mistake.” In addition, tutors avoid directly telling students that they are wrong or precisely how a step is incorrect. Instead, tutors indirectly guide students through the process of error detection and correction.
Human tutors interact with students via natural language dialogue, sometimes including equations or references to diagrams or simulated models of the domain. They prompt students to construct knowledge, give explanations, assess the student's understanding, and so on, all via natural language. They get and give linguistic cues about how the dialogue is progressing. These cues give tutors information about the student's understanding of the material, and allow the tutor to determine when a strategy is working or when another tactic is needed.
Natural language dialogue is an ideal medium for this type of interaction because it offers many indirect techniques for notifying students that a step in the solution requires repair. Fox (1993) found that tutors provide frequent feedback indicating that a step is okay. A short hesitation in responding “okay” typically led the student to assume that something was amiss with the current step, and frequently led students to repair their own errors. When more explicit help was required, the tutor focused the student's attention on the part of the solution that required modification or on information that was useful for repairing the error. Although students sometimes explicitly request guidance or affirmation that their step is correct, this usually is not necessary because the tutor provides such information through hints, leading questions, verbal agreement, and other indirect methods.
1.2 Managing Tutorial Dialogue
Sect. 1.1 described the benefits of natural language dialogue as a presentation modality for tutoring. In this section, we describe the requirements that must be met to build such a system for intelligent information presentation:
Presentations must unfold over many conversational turns, even when it would be possible to present all of the information in a single contribution. This is crucial, because the system must give the student opportunities to contribute to the solution and must not ignore the student's signs of confusion.
The tutor system must have the ability to ask students questions that it “knows” the answer to, either to prompt the student to provide the information to facilitate knowledge construction, or to diagnose the level of the student's knowledge.
The tutor must understand student utterances well enough to respond appropriately.
The tutor system must have the ability to react to unexpected events. By evaluating the current dialogue situation, it must be able to revise its current plan or postpone the refinement of a sketchy plan until the situation provides the necessary information. In particular, the tutor is required
(1) not to ignore student confusion,
(2) to encourage the student to recognise and correct their own errors,
(3) to abandon questions that are no longer relevant,
(4) to handle multiple student actions in a single turn, and
(5) to deal with student-initiated topic changes.
These five sub-requirements stress the need for a tutorial agent to monitor the execution of its dialogue strategies. When a strategy fails, the agent needs to adapt its plan to the new situation: inserting a plan for a sub-dialogue to handle student confusion or a student misconception (1 and 2), deleting parts of a dialogue plan because their effects are now irrelevant or already achieved (3 and 4), or reorganising sub-plans to handle topic changes (5).
Consequently, tutorial dialogue managers need not generate elaborate discourse plans in advance. Given the dynamics of tutorial dialogue – the large number of potential student actions at any point and the limited ability of the tutor to predict them – it is more viable to enter a tutorial conversation with a sketchy high-level dialogue plan. As the dialogue progresses, the dialogue manager refines the high-level plan into low-level dialogue activities by considering the incrementally constructed dialogue context. The dialogue manager therefore interleaves high-level tutorial planning with on-the-fly, situation-adaptive plan refinement and execution.
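This interleaving of sketchy high-level planning with context-sensitive refinement can be sketched as follows. The goal names, dialogue acts, and refinement rules below are purely illustrative assumptions, not the system's actual implementation: the point is only that the expansion of a high-level goal (here, `give_feedback`) is deferred until the dialogue context supplies the information it depends on.

```python
# Sketch of interleaved planning and execution: the tutor starts with a
# sketchy high-level plan and refines each goal only when it is reached,
# using the dialogue context accumulated so far. All names are hypothetical.

high_level_plan = ["elicit_step", "diagnose_answer", "give_feedback"]

def refine(goal, context):
    """Expand one high-level goal into concrete dialogue acts, given context."""
    if goal == "elicit_step":
        return ["ask: What is the next step?"]
    if goal == "diagnose_answer":
        return [f"classify: {context['last_answer']}"]
    if goal == "give_feedback":
        # Refinement is deferred until we know whether the answer was correct.
        if context.get("correct"):
            return ["acknowledge"]
        return ["hint", "re-elicit"]

# Hypothetical context built up during earlier turns of the dialogue.
context = {"last_answer": "F = m/a", "correct": False}
executed = []
for goal in high_level_plan:
    executed.extend(refine(goal, context))
print(executed)  # -> ['ask: What is the next step?', 'classify: F = m/a', 'hint', 're-elicit']
```

Had the context recorded a correct answer, the same plan would have been refined into an acknowledgement instead of a hinting sub-dialogue.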
In addition to these tutoring-specific requirements, the fact that the computer is participating in a conversation with a human means that it must perform the following dialogue management tasks, as described by Lewin, Rupp, Hieronymus, Milward, Larsson, and Berman (2000):
turn-taking management: determining who can speak next, when, and for how long;
topic management: determining what can be spoken about next;
utterance understanding: understanding the content of an utterance in the context of the previous dialogue;
intention understanding: understanding the point or aim behind an utterance in the context of the previous dialogue;
context maintenance: maintaining a dialogue context;
intention generation: generating a system objective given the current dialogue context; and
utterance generation: generating a suitable form to express an intention in the current dialogue context.
Although we focus on intention generation here, the other tasks are discussed briefly as they are all inter-related, and are sometimes conflated in the literature. In the next section, we discuss the state of the art in dialogue management and why it is not sufficient to meet the requirements for managing tutorial dialogue.
2. State-of-the-art Dialogue Management
We review three industrial-strength models of dialogue processing, namely, finite state machines (FSMs), form-filling, and VoiceXML, as well as an interesting cross-breed of FSMs and planning.
In the FSM approach, dialogue management is performed by a set of hierarchical FSMs that represent all possible dialogues. The top-level FSM is typically based on the structure of the task to be performed (e.g., flight information, phone banking transactions). For each state of the FSM, the dialogue engineer needs to specify its transitions to successor states, many of which handle numerous exceptions (e.g., user help and cancel requests, timeouts). The dialogue engineer thus manually defines all possible dialogue flows; every alternative route is drawn in full. Consequently, the construction of FSMs is not only domain-specific but also labour-intensive and error-prone. On the positive side, FSMs run in real time, and a well-designed FSM can produce help messages and issue re-prompts that are sensitive to the task context. On the negative side, the dialogues are system-driven: turn-taking as well as system feedback are hardwired, and only a limited and well-defined amount of dialogue context is stored in each state of the network. This makes it hard to produce responses sensitive to unexpected input or to the linguistic context, and to provide personalised or customised advice or feedback. It also makes it hard to port an FSM-based dialogue system to a new domain. The size of an FSM is limited practically rather than theoretically; a typical industrial dialogue system in the area of banking has circa 1500 states (personal communication, Arturo Trujillo, Vocalis plc). FSM construction is supported by various toolkits, e.g., AT&T's FSM library (Mohri, Pereira, & Riley, http://www.research.att.com/sw/tools/fsm).
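To make the hand-specified nature of this approach concrete, the following minimal Python sketch spells out a fragment of such a network. The states, input classes, and flight-information task are our own illustrative assumptions, not taken from any deployed system; note that every transition, including the exception-handling ones for help requests and timeouts, must be written out by hand.

```python
# Minimal finite-state dialogue manager: every transition, including
# exception handling (help, timeout), is specified explicitly by the
# dialogue engineer. States and input classes are hypothetical.

TRANSITIONS = {
    # state: {class of user input: successor state}
    "ask_origin":       {"city": "ask_destination", "help": "help_origin",
                         "timeout": "ask_origin"},
    "help_origin":      {"any": "ask_origin"},
    "ask_destination":  {"city": "ask_date", "help": "help_destination",
                         "timeout": "ask_destination"},
    "help_destination": {"any": "ask_destination"},
    "ask_date":         {"date": "confirm", "help": "ask_date",
                         "timeout": "ask_date"},
    "confirm":          {"yes": "done", "no": "ask_origin"},
}

def step(state, input_class):
    """Advance the FSM by one user turn; unrecognised input re-prompts."""
    options = TRANSITIONS[state]
    if input_class in options:
        return options[input_class]
    if "any" in options:
        return options["any"]
    return state  # stay in place and re-prompt

state = "ask_origin"
for user_input in ["help", "city", "city", "city", "date", "yes"]:
    state = step(state, user_input)
print(state)  # -> done
```

Even this toy network needs a transition table entry for every contingency; anything not anticipated (say, the user volunteering a date while being asked for a city) simply triggers a re-prompt, which is exactly the rigidity criticised above.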
Form-filling is a less rigid approach to dialogue management. Instead of anticipating and encoding all plausible dialogues, the dialogue engineer specifies the information the dialogue system must obtain from the user as a set of forms composed of slots. The structural complexity of possible dialogues is limited only by the form design and the intelligence of the form interpretation and filling algorithm. This algorithm may be able to fill more than one slot at a time, maintain several active forms simultaneously, and switch among them. In contrast to an FSM-based dialogue system, the user of a form-filling dialogue system can therefore supply more information than the system requested (the system performs question accommodation), or start a task before the system has offered to perform it (the system performs task accommodation).
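The contrast with the FSM approach can be seen in a minimal sketch of slot filling. The slot names and the flight-information task are again our own illustrative assumptions; the point is that the prompt is computed from whatever remains unfilled, so the user may volunteer several values in one turn (question accommodation) without any hand-written transition for that case.

```python
# Minimal form-filling sketch: the system prompts for unfilled slots, but
# the update step fills every slot the user's utterance provides, whether
# or not it was requested. Slot names and values are hypothetical.

form = {"origin": None, "destination": None, "date": None}

def next_prompt(form):
    """Return a prompt for the first unfilled slot, or None if complete."""
    for slot, value in form.items():
        if value is None:
            return f"Please give your {slot}."
    return None

def update(form, user_values):
    """Fill every slot the user's utterance provides, requested or not."""
    for slot, value in user_values.items():
        if slot in form:
            form[slot] = value

update(form, {"origin": "Edinburgh", "date": "Friday"})  # more than was asked
print(next_prompt(form))  # -> Please give your destination.
update(form, {"destination": "London"})
print(next_prompt(form))  # -> None: the form is complete
```

Because the dialogue flow is derived from the form's state rather than drawn as an explicit network, volunteering the date early simply removes one future prompt instead of causing a re-prompt.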
VoiceXML (W3C, 2002) augments the form-filling approach with an XML-based specification language and support for speech input and output. It is an evolving industry standard that is designed for modeling audio dialogues including synthesised speech, digitised audio, spoken and DTMF key (“touch-tone”) input, and mixed-initiative conversations. A VoiceXML document or set of documents defines a space of possible human-computer dialogues. Two types of dialogues are supported: forms that collect values for variables, and menus that present the user with a choice of options. Each dialogue specifies the next dialogue in the sequence; if there is no successor, execution stops. Subdialogues provide a mechanism for modularising common tasks such as confirmation sequences. Like a subroutine in a computer program, once the subdialogue is complete, the form interpreter returns to the place in the document where it was invoked. Subdialogues can be used to build libraries of short interactions shared among documents comprising a variety of applications. The acquisition and processing of normal input is complemented by an event handler that uses application-specific XML code to cope with user help and cancel requests as well as with no input or no match situations. VoiceXML-based dialogues can be less rigid than our description suggests. The order in which the machine collects the information from the user is not entirely pre-determined. Mixed-initiative dialogues allow the user to provide information in a flexible order or to provide multiple pieces of information in succession without the interruption of intermediary prompts.
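For concreteness, a minimal VoiceXML form might look as follows. The field name, prompt text, and grammar file are hypothetical; only the element names and attributes come from the VoiceXML 2.0 specification.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- A minimal, hypothetical VoiceXML form: the <prompt> supplies the
     system utterance, the <grammar> constrains what can be recognised,
     and <filled> runs once the field variable has received a value. -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="flight">
    <field name="destination">
      <prompt>Where would you like to fly to?</prompt>
      <grammar src="cities.grxml" type="application/srgs+xml"/>
      <filled>
        <prompt>Flying to <value expr="destination"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

The form interpretation algorithm visits each unfilled field in turn, so adding further fields (origin, date) extends the dialogue without any explicit transition network.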
A major advantage of these three approaches is the robustness of their language understanding capabilities. Each approach has strong expectations about user input based on the state of the system. In the FSM approach, the nodes of the network can be associated with special grammars and language models; in the form-filling approach, one can associate actions with slot-filling events, for example, controlling the activation and combination of scoped grammars; in the VoiceXML approach, user input, provided in response to a system utterance produced by the interpretation of the contents of the <prompt> tag, is recognised using the grammar supplied by the associated <grammar> tag. The form elements