Class: FiniteMDP::Solver
Inherits: Object
Defined in: lib/finite_mdp/solver.rb
Overview
Find optimal values and policies using policy iteration and/or value iteration. The methods here are suitable for finding deterministic policies for infinite-horizon problems.
The computations are carried out on an intermediate form of the given model, which is stored using nested arrays:
model[state_num][action_num] = [[next_state_num, probability, reward], ...]
The solver assigns numbers to each state and each action automatically. Note that the successor state data are stored in sparse format, and any transitions that are in the given model but have zero probability are not stored.
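For example, a state numbered 0 with two actions might be stored as follows (the numbers are purely illustrative):

model[0][0] = [[0, 0.8, 2.0], [1, 0.2, 2.0]]  # action 0: stay in state 0 with probability 0.8, move to state 1 with probability 0.2
model[0][1] = [[1, 1.0, 0.0]]                 # action 1: move to state 1 with certainty; the zero-probability self-transition is not stored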
TODO implement backward induction for finite-horizon problems
TODO maybe implement a 'dense' storage format for models with many successor states, probably as a different solver class
Instance Attribute Summary
- #discount ⇒ Float (readonly): Discount factor, in (0, 1].
- #model ⇒ ArrayModel (readonly): The model being solved; do not change the model while it is being solved.
Instance Method Summary
- #evaluate_policy ⇒ Float: Refine the estimate of the value function for the current policy.
- #evaluate_policy_exact ⇒ nil: Evaluate the value function for the current policy by solving a linear system of n equations in n unknowns, where n is the number of states in the model.
- #improve_policy(tolerance: Float::EPSILON) ⇒ Integer: Make the policy greedy with respect to the current value function.
- #initialize(model, discount, policy: nil, value: Hash.new(0)) ⇒ Solver (constructor): A new instance of Solver.
- #policy ⇒ Hash<state, action>: Current estimate of the optimal action for each state.
- #policy_iteration(value_tolerance:, policy_tolerance: value_tolerance / 2.0, max_value_iters: nil, max_policy_iters: nil) {|num_policy_iters, num_actions_changed, num_value_iters, value_delta| ... } ⇒ Boolean: Solve with policy iteration using approximate (iterative) policy evaluation.
- #policy_iteration_exact(max_iters: nil) {|num_iters, num_actions_changed| ... } ⇒ Boolean: Solve with policy iteration using exact policy evaluation.
- #state_action_value ⇒ Hash<[state, action], Float>: Current state-action value estimates; whereas #value returns $V(s)$, this returns $Q(s,a)$, in the usual notation.
- #value ⇒ Hash<state, Float>: Current value estimate for each state.
- #value_iteration(tolerance:, max_iters: nil) {|num_iters, delta| ... } ⇒ Boolean: Value iteration; calls #value_iteration_single up to max_iters times until the largest change in the value function (delta) is less than tolerance.
- #value_iteration_single ⇒ Float: A single iteration of value iteration.
Constructor Details
#initialize(model, discount, policy: nil, value: Hash.new(0)) ⇒ Solver
Returns a new instance of Solver.
# File 'lib/finite_mdp/solver.rb', line 38

def initialize(model, discount, policy: nil, value: Hash.new(0))
  @discount = discount

  # get the model data into a more compact form for calculation; this means
  # that we number the states and actions for faster lookups (avoid most of
  # the hashing)
  @model = if model.is_a?(FiniteMDP::ArrayModel)
             model
           else
             FiniteMDP::ArrayModel.from_model(model)
           end

  # convert initial values and policies to compact form
  @array_value = @model.states.map { |state| value[state] }
  @array_policy = if policy
                    @model.states.map do |state|
                      @model.actions(state).index(policy[state])
                    end
                  else
                    [0] * @model.num_states
                  end

  raise 'some initial values are missing' if @array_value.any?(&:nil?)
  raise 'some initial policy actions are missing' if @array_policy.any?(&:nil?)

  @policy_A = nil
end
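A usage sketch follows. The model is a made-up two-state battery example, and it assumes FiniteMDP::HashModel accepts a nested hash of the form hash[state][action][next_state] = [probability, reward]; any model that ArrayModel.from_model can convert works the same way.

require 'finite_mdp'

# illustrative model only; probabilities out of each (state, action) sum to one
hash = {
  high: { search:   { high: [0.8, 2.0], low: [0.2, 2.0] },
          wait:     { high: [1.0, 1.0] } },
  low:  { search:   { low:  [0.6, 2.0], high: [0.4, -3.0] },
          wait:     { low:  [1.0, 1.0] },
          recharge: { high: [1.0, 0.0] } }
}

model  = FiniteMDP::HashModel.new(hash)
solver = FiniteMDP::Solver.new(model, 0.95) # discount factor in (0, 1]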
Instance Attribute Details
#discount ⇒ Float (readonly)
Returns discount factor, in (0, 1].
# File 'lib/finite_mdp/solver.rb', line 79

def discount
  @discount
end
#model ⇒ ArrayModel (readonly)
Returns the model being solved; read only; do not change the model while it is being solved.
# File 'lib/finite_mdp/solver.rb', line 74

def model
  @model
end
Instance Method Details
#evaluate_policy ⇒ Float
Refine the estimate of the value function for the current policy. This is done by iterating the Bellman equations; see also #evaluate_policy_exact for a different approach.
This is the 'policy evaluation' step in Figure 4.3 of Sutton and Barto (1998).
Returns the largest absolute change made to the value function.
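In the usual textbook notation (an interpretation, not quoted from the library docs), each sweep applies the fixed-policy Bellman backup $V(s) \leftarrow \sum_{s'} P(s' \mid s, \pi(s)) \, [ R(s, \pi(s), s') + \gamma V(s') ]$ to every state, where $\pi(s)$ is the current policy's action and $\gamma$ is the discount factor.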
# File 'lib/finite_mdp/solver.rb', line 136

def evaluate_policy
  delta = 0.0
  model.array.each_with_index do |actions, state_n|
    next_state_ns = actions[@array_policy[state_n]]
    new_value = backup(next_state_ns)
    delta = [delta, (@array_value[state_n] - new_value).abs].max
    @array_value[state_n] = new_value
  end
  delta
end
#evaluate_policy_exact ⇒ nil
Evaluate the value function for the current policy by solving a linear system of n equations in n unknowns, where n is the number of states in the model.
This routine currently uses dense linear algebra, so it requires that the full n-by-n matrix be stored in memory. This may be a problem for moderately large n.
All of the coefficients (A and b in Ax = b) are computed on the first call; subsequent calls recompute only those rows for which the policy has changed since the last call.
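In textbook notation (again an interpretation, not quoted from the docs), the system solved is $(I - \gamma P_\pi) V = R_\pi$, where $(P_\pi)_{ss'} = P(s' \mid s, \pi(s))$ and $(R_\pi)_s = \sum_{s'} P(s' \mid s, \pi(s)) \, R(s, \pi(s), s')$. The implementation below uses NMatrix and NVector, which appear to come from the narray gem.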
# File 'lib/finite_mdp/solver.rb', line 162

def evaluate_policy_exact
  if @policy_A
    # update only those rows for which the policy has changed
    @policy_A_action.zip(@array_policy)
                    .each_with_index do |(old_action_n, new_action_n), state_n|
      next if old_action_n == new_action_n
      update_policy_Ab state_n, new_action_n
    end
  else
    # initialise the A and the b for Ax = b
    num_states = model.num_states
    @policy_A = NMatrix.float(num_states, num_states)
    @policy_A_action = [-1] * num_states
    @policy_b = NVector.float(num_states)
    @array_policy.each_with_index do |action_n, state_n|
      update_policy_Ab state_n, action_n
    end
  end

  value = @policy_b / @policy_A # solve linear system
  @array_value = value.to_a
  nil
end
#improve_policy(tolerance: Float::EPSILON) ⇒ Integer
Make the policy greedy with respect to the current value function.
This is the 'policy improvement' step in Figure 4.3 of Sutton and Barto (1998).
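In textbook notation (an interpretation, not quoted from the docs), the update sets $\pi(s) \leftarrow \arg\max_a \sum_{s'} P(s' \mid s, a) \, [ R(s, a, s') + \gamma V(s') ]$ for every state. In the implementation below, a candidate action replaces the best found so far only if its backed-up value is larger by more than tolerance, which guards against policy churn caused by floating-point noise.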
# File 'lib/finite_mdp/solver.rb', line 199

def improve_policy(tolerance: Float::EPSILON)
  changed = 0
  model.array.each_with_index do |actions, state_n|
    a_max = nil
    v_max = -Float::MAX
    actions.each_with_index do |next_state_ns, action_n|
      v = backup(next_state_ns)
      if v > v_max + tolerance
        a_max = action_n
        v_max = v
      end
    end
    raise "no feasible actions in state #{state_n}" unless a_max
    changed += 1 if @array_policy[state_n] != a_max
    @array_policy[state_n] = a_max
  end
  changed
end
#policy ⇒ Hash<state, action>
Current estimate of the optimal action for each state.
Note: changes made to the returned object will not affect the solver.
# File 'lib/finite_mdp/solver.rb', line 119

def policy
  Hash[model.states.zip(@array_policy).map do |state, action_n|
    [state, model.actions(state)[action_n]]
  end]
end
#policy_iteration(value_tolerance:, policy_tolerance: value_tolerance / 2.0, max_value_iters: nil, max_policy_iters: nil) {|num_policy_iters, num_actions_changed, num_value_iters, value_delta| ... } ⇒ Boolean
Solve with policy iteration using approximate (iterative) policy evaluation.
# File 'lib/finite_mdp/solver.rb', line 319

def policy_iteration(value_tolerance:,
                     policy_tolerance: value_tolerance / 2.0,
                     max_value_iters: nil, max_policy_iters: nil)
  num_actions_changed = nil
  num_policy_iters = 0
  loop do
    # policy evaluation
    num_value_iters = 0
    loop do
      value_delta = evaluate_policy
      num_value_iters += 1

      if block_given?
        yield(num_policy_iters, num_actions_changed,
              num_value_iters, value_delta)
      end

      break if value_delta < value_tolerance
      break if max_value_iters && num_value_iters >= max_value_iters
    end

    # policy improvement
    num_actions_changed = improve_policy(tolerance: policy_tolerance)
    num_policy_iters += 1
    return true if num_actions_changed == 0
    return false if max_policy_iters && num_policy_iters >= max_policy_iters
  end
end
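A usage sketch, assuming a solver built as in the constructor example above (the tolerance is arbitrary):

converged = solver.policy_iteration(value_tolerance: 1e-6) do |policy_iter, actions_changed, value_iter, delta|
  # actions_changed is nil during the first round of policy evaluation
  puts "policy sweep #{policy_iter}, value sweep #{value_iter}: delta=#{delta}"
end
puts solver.policy.inspect if converged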
#policy_iteration_exact(max_iters: nil) {|num_iters, num_actions_changed| ... } ⇒ Boolean
Solve with policy iteration using exact policy evaluation.
# File 'lib/finite_mdp/solver.rb', line 364

def policy_iteration_exact(max_iters: nil)
  num_iters = 0
  loop do
    evaluate_policy_exact
    num_actions_changed = improve_policy
    num_iters += 1
    yield num_iters, num_actions_changed if block_given?
    return true if num_actions_changed == 0
    return false if max_iters && num_iters >= max_iters
  end
end
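A corresponding sketch for the exact variant (the iteration cap is arbitrary):

converged = solver.policy_iteration_exact(max_iters: 20) do |num_iters, num_actions_changed|
  puts "exact iteration #{num_iters}: #{num_actions_changed} action(s) changed"
end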
#state_action_value ⇒ Hash<[state, action], Float>
Current state-action value estimates; whereas #value returns $V(s)$, this returns $Q(s,a)$, in the usual notation.
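Concretely, the values returned are $Q(s, a) = \sum_{s'} P(s' \mid s, a) \, [ R(s, a, s') + \gamma V(s') ]$, where $V$ is the solver's current value estimate; this follows directly from the implementation below.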
# File 'lib/finite_mdp/solver.rb', line 100

def state_action_value
  q = {}
  model.states.each_with_index do |state, state_n|
    model.actions(state).each_with_index do |action, action_n|
      q_sa = model.array[state_n][action_n].map do |next_state_n, pr, r|
        pr * (r + @discount * @array_value[next_state_n])
      end.inject(:+)
      q[[state, action]] = q_sa
    end
  end
  q
end
#value ⇒ Hash<state, Float>
Current value estimate for each state.
The result is converted from the solver's internal representation, so you cannot affect the solver by changing the result.
# File 'lib/finite_mdp/solver.rb', line 90

def value
  Hash[model.states.zip(@array_value)]
end
#value_iteration(tolerance:, max_iters: nil) {|num_iters, delta| ... } ⇒ Boolean
Value iteration; calls #value_iteration_single up to max_iters times until the largest change in the value function (delta) is less than tolerance.
# File 'lib/finite_mdp/solver.rb', line 267

def value_iteration(tolerance:, max_iters: nil)
  delta = Float::MAX
  num_iters = 0
  loop do
    delta = value_iteration_single
    num_iters += 1
    break if delta < tolerance
    break if max_iters && num_iters >= max_iters
    yield num_iters, delta if block_given?
  end
  delta < tolerance
end
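A usage sketch (the tolerance and iteration cap are illustrative):

if solver.value_iteration(tolerance: 1e-6, max_iters: 1_000)
  optimal_value  = solver.value   # Hash<state, Float>
  optimal_policy = solver.policy  # Hash<state, action>
end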
#value_iteration_single ⇒ Float
A single iteration of value iteration.
This is the algorithm from Figure 4.5 of Sutton and Barto (1998). It is mostly equivalent to calling #evaluate_policy and then #improve_policy, but it is somewhat more efficient.
Returns the largest absolute change made to the value function.
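In textbook notation (an interpretation, not quoted from the docs), each call applies the Bellman optimality backup $V(s) \leftarrow \max_a \sum_{s'} P(s' \mid s, a) \, [ R(s, a, s') + \gamma V(s') ]$ to every state, recording the maximising action as the new policy.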
# File 'lib/finite_mdp/solver.rb', line 228

def value_iteration_single
  delta = 0.0
  model.array.each_with_index do |actions, state_n|
    a_max = nil
    v_max = -Float::MAX
    actions.each_with_index do |next_state_ns, action_n|
      v = backup(next_state_ns)
      if v > v_max
        a_max = action_n
        v_max = v
      end
    end
    delta = [delta, (@array_value[state_n] - v_max).abs].max
    @array_value[state_n] = v_max
    @array_policy[state_n] = a_max
  end
  delta
end