Invented by Stefanos ELEFTHERIADIS, James HENSMAN, Sebastian John, Hugh SALIMBENI, Secondmind Ltd
The Secondmind Ltd invention works as follows. A reinforcement-learning system consists of an environment (having a number of possible states), a policy learner, and an agent. The agent receives state information indicative of the current state of the environment and generates an action signal based on that state information and the policy associated with the agent. The action signal can be used to change the state of the environment. The agent is further arranged to generate experience data based on the state information and the information conveyed by the action signal, and the policy learner can be configured to use the experience data to update the agent's policy. The reinforcement learning system comprises a probabilistic model arranged to produce, based on the current state of the environment, probabilistic information relating to future states of the environment, and the agent is arranged to generate the action signal based on that probabilistic information.
Background for Machine learning system
Machine learning is the study of computers that learn what to do by analysing data, rather than being explicitly programmed. Although the field has been studied for more than fifty years, machine learning research has increased markedly in recent years, and this research is largely focused on pattern recognition systems.
Machine learning is not limited to pattern recognition; it can also be used for decision-making. Such decision-making has been applied in many settings, from controlling non-playable characters in computer games to managing a taxi fleet. The amount or quality of data available to a decision-making algorithm is often limited, which makes it difficult to learn the optimal behaviour of a system.
According to one aspect of the invention, a reinforcement-learning system is provided that comprises an agent, an environment, and a policy learner. The environment can have multiple states. The agent receives state information that indicates the current state of the environment and generates an action signal based on this state information and the policy associated with the agent. This action signal can be used to change the state of the environment. The agent can also be configured to generate experience data based on the state information and the information conveyed by the action signal, and the policy learner can be configured to use the experience data to update the agent's policy. The reinforcement learning system further comprises a probabilistic model that can generate probabilistic data about future states of the environment based on its current state; the agent is then arranged to produce the action signal based on this probabilistic information.
The probabilistic model can include a stochastic intensity function that associates a stochastic intensity with each point within a domain in the environment, the intensity being indicative of the probability of an event happening at that point. A further aspect of the invention provides a computer-implemented method for generating a distribution of a stochastic intensity function, which is also applicable to a wider range of situations. The method includes receiving data corresponding to a set of points in the domain and generating a variational Gaussian process for a latent function, dependent on a prior Gaussian process and a plurality of randomly-distributed inducing variables. The inducing variables have a variational distribution and can be expressed in terms of multiple Fourier components. The method also comprises determining a set of parameters for the variational distribution using the data from the set of points in the domain; this is done by iteratively updating intermediate parameters in order to find an optimal value of an objective function. Finally, the method determines the distribution of the stochastic intensity function from the variational Gaussian process and the determined parameters, the intensity corresponding to a quadratic function of the latent function.
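As a rough illustration of the quadratic link between the latent function and the intensity, the sketch below draws latent functions from a Gaussian process prior and squares them to obtain non-negative intensities. This is a minimal sketch using a standard squared-exponential kernel, not the patent's variational Fourier-feature construction; all names and parameter values here are illustrative.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=0.2, variance=1.0):
    # Squared-exponential covariance between two sets of 1-D points.
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_intensity(xs, n_samples=3, jitter=1e-6, seed=0):
    # Draw latent functions f ~ GP(0, k) at the points xs and return the
    # implied intensities lambda(x) = f(x)**2 (the quadratic link above).
    rng = np.random.default_rng(seed)
    K = rbf_kernel(xs, xs) + jitter * np.eye(len(xs))  # jitter for stability
    L = np.linalg.cholesky(K)
    f = L @ rng.standard_normal((len(xs), n_samples))
    return f ** 2

xs = np.linspace(0.0, 1.0, 50)
lam = sample_intensity(xs)
print(lam.shape)  # one intensity curve per sample
```

Squaring the latent function is what guarantees a valid (non-negative) intensity everywhere in the domain, whatever values the Gaussian process takes.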
Reinforcement learning: Definitions and formulation
For the purpose of this description and the accompanying drawings, a reinforcement-learning problem is defined by specifying characteristics of agents and environments. The systems and methods described herein can be applied to a variety of reinforcement learning problems, including continuous and discrete state and action spaces. Although a particular problem is often used as an illustration, the example given here, managing a taxi fleet in a city, is only meant to be a guide.
A software agent (also called an agent) is a component of a computer program that performs actions in response to a set of input signals. In some reinforcement learning applications, each agent is linked to a specific entity within a physical system. In a first example of managing a taxi fleet, an agent represents each taxi in the fleet; in a second example, an agent is assigned to every subset of taxis. In other applications of reinforcement learning, an agent has no connection to a physical entity: an agent could, for example, be assigned to an NPC in a game. In some cases, an agent is associated with a physical entity and sends it control signals. The agent may be embedded in hardware or software that is part of the physical entity associated with it (such as an autonomous robot), or it may be implemented in a computer system located far away from the physical entity.
An environment can be defined as a virtual system that agents interact with. A task is the complete description of an environment. In many examples of reinforcement-learning, the environment is a simulated model of a physical system that contains information relevant to the problem. In the case of managing taxis in a large city, the model is a simulation of the entire city. The information that’s relevant to managing taxis includes, for example, a detailed city map, the location of every taxi, information on the weather and time of day variations, information on the average income of households within different parts of the metropolis, information about the traffic, and so forth.
It is assumed that interactions between an agent and an environment occur at discrete time steps n = 0, 1, 2, 3, . . . The discrete time steps do not have to be separated by fixed intervals. At each time step, the agent receives reward data and data corresponding to an observation of the environment; the observation data may also contain data indicative of possible future states. Together, the reward data and the observation data are referred to as a state signal. Sn denotes the state perceived by the agent at time step n. The state perceived by the agent can depend on variables associated with the agent; in the taxi fleet management problem, for example, the agent's observations can be affected by the location of the vehicle.
In response to receiving a state signal indicating a state Sn at time step n, an agent can select and perform an action An according to a Markov Decision Process (MDP). In some cases it is not possible to determine the true state of the environment from the state signal; the agent then selects and executes the action An according to a Partially Observable Markov Decision Process (POMDP). In general, performing a selected action has an impact on the environment. In cases where an agent is linked to an entity within a physical system, sending a control signal to that entity may be equivalent to performing a chosen action. Action signals are the data sent by an agent as it performs an action.
At a later time step n+1, the agent receives a new state signal from the environment indicating a state Sn+1. The new state signal may be initiated by the agent completing an action An, or it can be triggered by a change in the environment. In examples where the agent is linked to an entity, the state signal can include data from sensors within the physical system. An agent that represents a taxi in a fleet may receive a state signal indicating that the taxi has just dropped off a passenger at a certain point A in the city. The available actions might include: waiting for passengers at point A, driving to another point B, and driving continuously around a loop C on the map. The number of states, and the number of actions available in each state, can be finite or infinite depending on the configuration of agents and environment; the systems and methods described in this document are applicable to any of these situations.
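The interaction loop described above, state signal in, action signal out, reward back at the next time step, can be sketched as follows. The environment, states, actions, and reward rule here are toy stand-ins for the taxi example, not part of the patent.

```python
import random

class TaxiStandEnv:
    """Toy stand-in for the taxi example: states are map points A, B, C,
    and each action moves the taxi to one of them. Illustrative only."""
    ACTIONS = ["wait_at_A", "drive_to_B", "circle_loop_C"]

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = "A"

    def step(self, action):
        # The action signal changes the environment state; the new state
        # signal and a scalar reward come back at the next time step.
        self.state = {"wait_at_A": "A", "drive_to_B": "B",
                      "circle_loop_C": "C"}[action]
        reward = 1.0 if self.rng.random() < 0.3 else 0.0  # passenger picked up?
        return self.state, reward

env = TaxiStandEnv()
trajectory = []
state = env.state
for n in range(5):
    action = env.rng.choice(TaxiStandEnv.ACTIONS)  # placeholder policy
    next_state, reward = env.step(action)
    trajectory.append((state, action, reward, next_state))
    state = next_state
print(len(trajectory))  # 5 (state, action, reward, next-state) tuples
```

The trajectory accumulated in the loop is exactly the sequence of states, actions, and rewards that the later sections use to define returns and value functions.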
Having performed an action An, the agent receives a signal corresponding to a numerical reward Rn+1. This reward depends on the state Sn, the action An, and the resulting state Sn+1. The agent is thus associated with a sequence of states, actions, and rewards (Sn, An, Rn+1, Sn+1, . . . ) known as a trajectory. The reward is a real number, which may be negative, positive, or zero. For a fleet of taxis, one possible reward strategy is to give an agent representing a taxi a positive reward every time a passenger pays a fare, with the reward proportional to that fare. A second strategy would be for an agent to receive a positive reward every time a customer is picked up, the value of the reward depending on how long the customer waited between calling the taxi company and being picked up. In a reinforcement-learning problem, an agent's objective is to maximise the expected value of a return. The return Gn at a given time step n depends on the rewards the agent receives at future time steps. In some reinforcement learning problems the trajectory has a finite number T of time steps, and the agent eventually reaches a terminal state ST where no further actions are possible; the finite sequence of time steps is called an episode, and the associated task is known as an episodic task. In other reinforcement learning problems the trajectory is infinite and there are no terminal states; such problems are referred to as infinite horizon (or continuing) tasks. Managing a taxi fleet in a city is a continuing task, whereas an agent learning to play blackjack is an episodic task in which each round is an episode. Equation (1) gives one possible definition of the return.
Gn = Σ_{j=0}^{T−n−1} γ^j R_{n+j+1},   (1)

in which γ is a discount factor satisfying 0 ≤ γ ≤ 1; the value γ = 1 is only allowed if T is finite. Equation (1) says that the return received by an agent is the sum of the future rewards the agent will receive, multiplied by increasing powers of the discount factor. The value of the discount factor determines the extent to which the agent takes future rewards into account when making decisions. If the sequence of rewards Rj is bounded, the series in Equation (1) is guaranteed to converge. This is not the only definition of the return: in R-learning algorithms, for example, Equation (1) is replaced by an infinite sum of undiscounted rewards minus an expected average reward. The methods and systems described in this document are not limited to the definition of return given by Equation (1).
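Equation (1) translates directly into code for a finite episode; the sketch below takes the list of rewards R_{n+1}, . . . , R_T observed after time step n and accumulates them with increasing powers of the discount factor.

```python
def discounted_return(rewards, gamma):
    """Return G_n = sum over j of gamma**j * R_{n+j+1}, Equation (1),
    where `rewards` lists R_{n+1}, ..., R_T for a finite episode."""
    g = 0.0
    for j, r in enumerate(rewards):
        g += (gamma ** j) * r
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With gamma close to 0 the agent is myopic, caring almost only about the next reward; with gamma close to 1 distant rewards count nearly as much as immediate ones.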
On receiving a state signal, an agent selects an action according to its policy. A policy is a mapping from states to actions. If an agent following a policy π receives, at time step n, a signal indicating a state Sn = s, then the probability that the agent selects a particular action An = a is denoted π(a|s). A policy for which π(a|s) takes values of either 0 or 1 for all possible combinations of a and s is a deterministic policy. Reinforcement-learning algorithms describe how an agent's policy is changed in response to the states, actions and rewards it experiences.
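A stochastic policy π(a|s) over a small discrete state and action space can be represented directly as a nested mapping; the states and actions below are hypothetical, and a state whose probabilities are all 0 or 1 behaves deterministically, as described above.

```python
import random

# pi(a|s) as a nested mapping; probabilities for each state sum to 1.
pi = {
    "A": {"wait": 0.6, "drive": 0.4},   # stochastic in state A
    "B": {"wait": 0.0, "drive": 1.0},   # deterministic in state B
}

def select_action(pi, state, rng):
    # Sample an action a with probability pi(a|state).
    actions, probs = zip(*pi[state].items())
    return rng.choices(actions, weights=probs)[0]

rng = random.Random(0)
print(select_action(pi, "B", rng))  # always "drive", since pi(drive|B) = 1
```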
The goal of a reinforcement-learning algorithm is to determine a policy that maximises the expected value of the return. Two different expectation values are commonly referred to: the state value and the action value. For a policy π, the state value function is vπ; for each state s, vπ(s) = Eπ[Gn | Sn = s], which states that the state value of state s under policy π is the expected return at time step n, given that the agent received a signal at time step n indicating the state Sn = s. For a policy π, the action value function is qπ; for each possible state-action pair (s, a), qπ(s, a) = Eπ[Gn | Sn = s, An = a], which states that the action value of the pair (s, a) under policy π is the expected return at time step n, given that the agent at time step n receives a signal indicating the state Sn = s and chooses the action An = a. A computation that produces a value or approximate value for a particular state or state-action pair is referred to as a backup. In general, a reinforcement learning algorithm seeks a policy that maximises the state value or action value function for all possible states or state-action pairs. In many practical reinforcement learning applications, the number of possible state-action pairs is large or infinite, and it is then necessary to approximate the state value or action value function using the sequences of states, actions and rewards experienced by an agent. In such cases, approximate value functions v̂(s, w) and q̂(s, a, w) are introduced to approximate the value functions vπ(s) and qπ(s, a); the parameter vector w defines the approximate functions.
Reinforcement learning algorithms then adjust the parameter vector w to minimise an error (for example, a root-mean-square error) between the approximate value functions v̂(s, w) or q̂(s, a, w) and their respective value functions vπ(s) or qπ(s, a).
In many reinforcement learning algorithms, a policy is defined using approximate value functions. An agent that follows a greedy policy always chooses, with probability 1, an action that maximises an approximate value function. An agent following an ε-greedy policy instead selects, with probability 1 − ε, an action that maximises the approximate value function, and otherwise selects an action at random, where ε is a parameter satisfying 0 < ε < 1.
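An ε-greedy selection rule of the kind described above can be sketched as follows; the action names and value estimates are illustrative.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability 1 - epsilon pick the action maximising the
    approximate action value; otherwise pick uniformly at random
    (0 < epsilon < 1)."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)            # explore
    return max(actions, key=q_values.get)     # exploit (greedy choice)

rng = random.Random(1)
q = {"wait": 0.2, "drive": 0.9, "loop": 0.1}
picks = [epsilon_greedy(q, epsilon=0.1, rng=rng) for _ in range(1000)]
print(picks.count("drive"))  # the greedy action dominates the tally
```

The occasional random choice is what lets the agent keep discovering whether its value estimates are wrong, at a small cost in immediate return.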
FIG. 1 shows an example of an agent interacting with an environment. The horizontal axis represents increasing time, and the dashed lines 103 and 105 represent the agent and the environment respectively. At time step n the agent receives from the environment a first state signal 107 indicating a state Sn, along with a reward Rn associated with that state. In response to the first state signal, the agent selects, according to a policy π, and executes an action An. The action An affects the environment and is completed by time step n+1, after which the environment sends the agent a second state signal 109 indicating the new state Sn+1, along with the reward Rn+1 linked to that state. The agent then performs an action An+1, resulting in a state Sn+2 with a reward Rn+2, and so on, as shown in FIG. 1.
A variety of reinforcement-learning algorithms are well known, and different algorithms can be used depending on the characteristics of the environment and the agents that define the reinforcement-learning problem. Examples include dynamic programming methods, Monte Carlo methods, temporal difference learning, and actor-critic methods. The present application introduces methods and systems that allow existing and future reinforcement-learning algorithms to be applied to problems with large numbers of agents and/or large or infinite numbers of states.
Systems and methods in accordance with this invention are especially advantageous in situations where more than one agent interacts with an environment; in the example of managing taxis in a large city, many agents are likely to be involved. FIG. 2 shows an example of two agents interacting with an environment. The top dashed line 201 represents Agent 1, the bottom dashed line 203 represents Agent 2, and the middle dashed line 205 represents the environment. Agent 1 has the trajectory (Sn(1), An(1), Rn+1(1), . . . ) and Agent 2 has the trajectory (Sm(2), Am(2), Rm+1(2), . . . ). In this example, the time steps at which Agent 1 receives state signals differ from the time steps at which Agent 2 receives state signals. The agents do not send each other signals directly, but interact indirectly through the environment (in other examples, signals can be sent between agents directly). Agent 1's action An(1) (represented by the arrow 207) has an impact on the environment, which alters the state signal 209 indicating a state Sm+1(2) to Agent 2. In an example where Agent 1's action causes a first taxi to drive to a taxi rank, Am(2) can represent a second taxi driving from a different part of the town to the same taxi rank. When the second taxi reaches the taxi rank, Agent 2 receives a state signal indicating a state Sm+1(2) in which the first taxi is already at the taxi rank. Agent 2 is given a negative reward Rm+2(2), because a state in which the first taxi was already waiting at the taxi rank is unfavourable to Agent 2. Agent 2 then selects an action that causes the second taxi to drive to a new taxi rank.
In the example of FIG. 2, the two agents, Agent 1 and Agent 2, are autonomous, meaning that each makes decisions independently of the other; they interact indirectly through the effects each has on the environment, and each agent chooses actions based on a policy distinct from those of the other agents. In FIG. 3, a policy source sends data 305 to agent 301a, updating that agent's policy; in a similar way, the policy source sends data to all of the agents 301, updating their policies. In some cases, the policies of the agents 301 are updated at the same time; in other examples, the policies may be updated at different times. The configuration shown in FIG. 3 is decentralised, with a separate reinforcement learning algorithm applied to each agent. In such decentralised configurations, the computing resources required, including memory, processing power and storage, can be scaled proportionally to the number of agents, and the reinforcement learning algorithm for each agent can run on a separate processor or processor core, allowing parallelised reinforcement learning.
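The decentralised arrangement, each agent holding its own policy and experience while a policy learner updates the agents independently, can be sketched as follows. The classes, update rule, and data are toy illustrations, not the patent's algorithm.

```python
class Agent:
    """Each agent keeps its own policy and its own experience; agents
    are updated independently of one another (illustrative only)."""
    def __init__(self, agent_id):
        self.agent_id = agent_id
        self.policy = {}        # state -> preferred action
        self.experience = []    # (state, action, reward) tuples

class PolicyLearner:
    def update(self, agent):
        # Toy update rule: in each state, prefer the action that has
        # earned the most total reward in this agent's experience.
        totals = {}
        for s, a, r in agent.experience:
            totals[(s, a)] = totals.get((s, a), 0.0) + r
        for (s, a), total in sorted(totals.items(), key=lambda kv: kv[1]):
            agent.policy[s] = a  # highest-total action is assigned last

agents = [Agent(i) for i in range(3)]
learner = PolicyLearner()
agents[0].experience = [("A", "wait", 0.0), ("A", "drive", 2.0)]
for agent in agents:             # each agent is updated separately,
    learner.update(agent)        # so the loop could run in parallel
print(agents[0].policy)          # {'A': 'drive'}
```

Because each `update` call touches only one agent's state, the per-agent updates are independent and can be distributed across processors or cores, which is the scaling property the paragraph above describes.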