Invented by Rashmi Gangadharaiah, Balakrishnan Narayanaswamy, Charles Elkan, Amazon Technologies Inc
The Amazon Technologies Inc invention works as follows:
Techniques for automating intelligent, multi-turn, task-oriented dialog systems are described. A sequence-to-sequence (seq2seq) model can be trained using a corpus of training data and a loss function that is based at least in part on a distance from a goal. A user utterance can be provided as input to the seq2seq model, and a nearest-neighbor algorithm can select one or more candidate responses to that user utterance. In some embodiments, the specially adapted seq2seq machine learning model can be trained using unsupervised training and can then be adapted to select intelligent, coherent agent responses that move a dialog toward completion.

Background for "Task-oriented Dialog Systems Using Combined Supervised and Reinforcement Learning"
Conversational agents have been proposed and used for many commercial, domain-specific applications. These applications can be task-oriented in the sense that they are designed to help customers or users achieve a certain goal, such as making a hotel or airline reservation. To achieve this goal, the agent must collect relevant information from the user (e.g., preferences), provide the user with relevant knowledge (e.g., prices and availability), and issue appropriate system calls (e.g., to make a payment) in order to complete the task effectively.
Chatbots are now ubiquitous thanks to recent advances in speech recognition, reaching many people via speech-based services such as smart speakers in the home, mobile applications, and computer programs. Recently, "chit-chat" has received a lot of attention in open-ended contexts. The term refers to systems that can generate fluent responses to questions and other utterances that are reasonable within the context of a conversation. This is in contrast to the task-oriented settings discussed above, where the aim is to guide or conduct a conversation in order to complete a specific task.
BRIEF DESCRIPTION OF THE DRAWINGS
The drawings illustrate various embodiments of the present disclosure, including the following:
FIG. 6 is a diagram of an illustrative environment in which machine learning models can be trained and hosted in accordance with some embodiments.
FIG. 8 is a diagram that illustrates an example computer system which may be used in some embodiments.
FIG. 10 is a diagram of an example system for implementing various aspects according to different embodiments.
The present description includes various embodiments of intelligent, multi-turn, task-oriented dialog systems. In some embodiments, a machine learning (ML) model, such as a sequence-to-sequence (seq2seq) model, can be trained with a corpus (e.g., prior multi-turn, task-oriented dialogs) and a loss function that is based at least in part on a distance from a goal. The ML model can be given a user utterance, and the output of the ML model (e.g., a vector of values from a plurality of hidden units in a seq2seq model) can be used to select a candidate response to the user utterance. The specially adapted ML model can thus be trained by unsupervised learning and adapted to select intelligent agent responses that move a dialog toward completion.
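As an illustration only, the following sketch shows one way such a nearest-neighbor selection step could look: the dialog history is embedded with the trained seq2seq encoder, and candidate agent responses are ranked by their distance from that embedding. The helper names (encode_dialog, encode_response) and the use of Euclidean distance are assumptions made for the sketch, not details taken from the patent text.

```python
import numpy as np

def select_responses(dialog_embedding: np.ndarray,
                     candidate_embeddings: np.ndarray,
                     candidates: list[str],
                     k: int = 3) -> list[str]:
    """Return the k candidate responses whose embeddings are nearest
    (by Euclidean distance) to the current dialog-history embedding."""
    # Distance from the dialog-history vector to every candidate vector.
    dists = np.linalg.norm(candidate_embeddings - dialog_embedding, axis=1)
    nearest = np.argsort(dists)[:k]  # indices of the k closest candidates
    return [candidates[i] for i in nearest]

# Hypothetical usage: the embeddings would come from the trained seq2seq model.
# dialog_vec = encode_dialog(model, dialog_history)                 # shape (d,)
# cand_vecs  = np.stack([encode_response(model, c) for c in pool])  # shape (n, d)
# top_responses = select_responses(dialog_vec, cand_vecs, pool)
```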
Large-domain, task-oriented dialog systems are widely used. In such systems, agents may be required to perform actions, such as database queries, while also generating fluent responses in natural language. Approaches to implementing such systems include reinforcement learning (RL) and supervised learning (SL).
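For concreteness, a single agent turn in such a system might interleave a structured database or API call with a fluent natural-language reply, along the lines of the hypothetical exchange below; the api_call format and the slot names are invented for illustration and are not drawn from the patent.

```python
# Hypothetical multi-turn exchange in which the agent must both query a
# database (structured action) and reply fluently (natural language).
dialog = [
    ("user",  "I need a table for two in Seattle tomorrow night."),
    ("agent", "api_call restaurant_search city=seattle party_size=2 date=tomorrow"),
    ("agent", "I found 12 restaurants with availability. Do you have a cuisine preference?"),
    ("user",  "Italian, please."),
]
```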
RL refers to a class of techniques that allow machines to learn sequential decision making from sparse and distant rewards. In RL approaches to dialog, a policy is learned online via interactions with users who provide feedback. RL has the advantage that it learns models that optimize the appropriate long-term reward, in this case fast and accurate completion of the user's task. However, RL approaches usually require separate Natural Language Understanding (NLU) and Natural Language Generation (NLG) components that are tuned separately to generate states, and they also typically use predefined templates with slots and values to specify actions. Thus, the state space, the action space, and the rewards need to be carefully defined, requiring expensive human annotations or domain knowledge. Requiring domain-specific knowledge in the form of rules or templates limits the expressive power of the models, since responses must belong to the predefined set of possible responses, making such systems difficult to deploy in the real world.
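To make the template limitation concrete, the snippet below sketches a small predefined action space with slot-value templates of the kind such RL systems rely on; the action names and slots are illustrative assumptions rather than examples from the patent.

```python
# Hypothetical predefined action templates for an RL dialog policy.
# Every agent response must be one of these templates with its slots filled,
# which is what limits the expressiveness of template-based systems.
ACTION_TEMPLATES = {
    "request_date":    "What date would you like to travel?",
    "request_city":    "Which city are you departing from?",
    "inform_price":    "The fare is {price} for {date}.",
    "confirm_booking": "Your reservation for {date} from {city} is confirmed.",
}

def render_action(action: str, **slots: str) -> str:
    """Fill a template's slots to produce the agent utterance."""
    return ACTION_TEMPLATES[action].format(**slots)

# e.g. render_action("inform_price", price="$420", date="May 3")
```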
In contrast, SL-based approaches learn a dialog policy offline from expert trajectories. These approaches are appealing because the dialog policy can be learned offline, without requiring interaction with live users. However, SL methods require many example dialogs to reach acceptable performance levels. This trade-off is reasonable for some dialog applications, such as customer service and support, where many example dialogs created by human agents already exist.
A major disadvantage of SL approaches is that they do not optimize for future reward. They learn to match every utterance of a training dialog using a loss function such as the cross-entropy between the predicted word distributions and the true agent utterances, but they do not take into account the task, the dialog history, or the final goal. This is important to consider, given that dialog systems are characterized by repeated, sequential interactions. Cross-entropy losses have the further flaw that they penalize small changes in the order or choice of words even when the sentences are semantically the same, so the loss can differ widely across responses that are all equally valid within the context of a conversation.
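As a reminder of how that standard objective behaves, the toy sketch below computes a per-token cross-entropy over fixed, invented decoder word distributions (ignoring autoregressive conditioning for simplicity); note how a reply with the same words in a different order receives a noticeably higher loss even though it may be equally valid.

```python
import math

def cross_entropy(predicted_probs: list[dict], target_tokens: list[str]) -> float:
    """Average negative log-likelihood of the target tokens under the
    decoder's predicted per-step word distributions."""
    nll = 0.0
    for step_dist, token in zip(predicted_probs, target_tokens):
        nll -= math.log(step_dist.get(token, 1e-9))  # low-probability targets are penalized heavily
    return nll / len(target_tokens)

# Toy decoder distributions for a three-token reply (numbers invented for illustration).
dists = [
    {"sure": 0.7, "it": 0.2, "booked": 0.1},
    {"it": 0.6, "is": 0.3, "sure": 0.1},
    {"is": 0.5, "booked": 0.4, "sure": 0.1},
]
print(cross_entropy(dists, ["sure", "it", "is"]))  # low loss: matches the reference word order
print(cross_entropy(dists, ["it", "is", "sure"]))  # higher loss: same words, different order
```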
Embodiments disclosed in this document can use aspects of both SL-type and RL-type approaches to implement a highly accurate task-oriented dialog system. Embodiments may provide a reward at every step of the dialog based on the goal state. Embodiments can use SL-type techniques to learn an embedding of the dialog history at every turn, without needing additional annotation. Embodiments may add a negative reward term to the cross-entropy loss at each turn, where this term measures the deviation between the predicted (learned) state embedding and the final state embedding for the dialog. The final embedding can capture information such as the goal API call issued by the agent, or any other event/state that ends the dialog, and it may also include information gathered from the customer during the dialog. This reward term encourages the agent to respond in a way that both moves the conversation in a positive direction within the latent space and reduces the cross-entropy. This does not imply that the dialog agent looks ahead to the customer's final goal at inference time; instead, the rewards are shaped during training so as to encourage good behavior.
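Read as a training objective, one plausible form of this combined loss is the usual cross-entropy on the agent utterance plus a weighted penalty for the distance between the current predicted dialog-state embedding and the embedding of the dialog's final (goal) state. The PyTorch-style sketch below is only an interpretation of that description; the weight lambda_goal, the choice of squared Euclidean distance, and the function signature are assumptions rather than details from the patent.

```python
import torch
import torch.nn.functional as F

def combined_turn_loss(decoder_logits: torch.Tensor,        # (seq_len, vocab_size)
                       target_tokens: torch.Tensor,         # (seq_len,)
                       turn_state_embedding: torch.Tensor,  # (d,) predicted state at this turn
                       goal_state_embedding: torch.Tensor,  # (d,) embedding of the final/goal state
                       lambda_goal: float = 0.1) -> torch.Tensor:
    """Per-turn loss: standard cross-entropy on the agent utterance plus a
    penalty proportional to the distance from the dialog's goal embedding."""
    ce = F.cross_entropy(decoder_logits, target_tokens)
    # Negative reward (added penalty) for being far from the goal state in latent space.
    goal_distance = torch.sum((turn_state_embedding - goal_state_embedding) ** 2)
    return ce + lambda_goal * goal_distance
```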