Policy Gradient Theorem

Question 1 of 8

120s | intermediate (6/10) | conceptual

In the REINFORCE algorithm, the policy gradient is $\nabla_\theta J = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$. Why is subtracting a baseline $b(s_t)$ from $G_t$ useful even though it does not change the expected gradient?
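The premise of the question can be checked numerically. The sketch below is illustrative only and assumes a setup that is not part of the question: a one-step problem with three actions, a softmax policy over the logits `theta`, made-up returns, and a baseline equal to the expected return under the policy. It computes the exact mean of the per-sample gradient estimate and its total variance, with and without the baseline.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(probs, action):
    # Gradient of log softmax w.r.t. the logits: one_hot(action) - probs.
    g = -probs.copy()
    g[action] += 1.0
    return g

def gradient_stats(theta, returns, baseline):
    # Exact mean and total variance of the single-sample estimator
    # grad log pi(a) * (G(a) - b), taken over a ~ pi.
    probs = softmax(theta)
    samples = np.stack([
        grad_log_pi(probs, a) * (returns[a] - baseline)
        for a in range(len(theta))
    ])
    mean = probs @ samples
    variance = probs @ samples**2 - mean**2
    return mean, variance.sum()

# Hypothetical values chosen for illustration.
theta = np.array([0.2, -0.1, 0.5])
returns = np.array([10.0, 11.0, 12.0])

mean_plain, var_plain = gradient_stats(theta, returns, baseline=0.0)

# Assumed baseline: the expected return under the current policy.
baseline = softmax(theta) @ returns
mean_base, var_base = gradient_stats(theta, returns, baseline=baseline)

print("mean without baseline:", mean_plain)
print("mean with baseline:   ", mean_base)
print("total variance without baseline:", var_plain)
print("total variance with baseline:   ", var_base)
```

In this toy setting the two means agree, because $\sum_a \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a) = 0$, while the spread of individual samples differs. Other baselines, such as a learned value function, may behave differently.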