
A Detailed Explanation of the DDPG Model, with Complete PyTorch Code


DDPG (Deep Deterministic Policy Gradient) is a deep reinforcement learning algorithm for continuous action spaces. It combines the Actor-Critic framework, policy gradients, and deep neural networks. DDPG is model-free, and it targets exactly the setting where traditional Q-learning is hard to apply: with a continuous action space there is no practical way to take the maximum over all actions, so DDPG instead learns a deterministic policy that outputs an action directly.
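In standard notation, the Actor plays the role of the action maximization that Q-learning would otherwise require:

    \mu_\theta(s) \approx \arg\max_a Q(s, a)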

DDPG consists of four key components:
1. Actor network: selects actions. It takes a state as input and outputs the action chosen by the current deterministic policy.
2. Critic network: evaluates the Actor's actions. It takes a state and an action as input and outputs a Q value.
3. Target Actor and Critic networks: delayed copies of the Actor and Critic, used to stabilize the learning process.
4. Replay buffer: stores past transitions to break the correlation between consecutive samples and improve learning.

The key steps of DDPG are:
- Initialize the Actor and Critic networks together with their corresponding target networks.
- Initialize the replay buffer.
- At each time step:
  - Use the Actor network to select an action and execute it in the environment.
  - Store the transition (state, action, reward, next state) in the replay buffer.
  - Randomly sample a mini-batch of transitions from the replay buffer.
  - Update the Critic by minimizing the loss on the Q value.
  - Update the Actor by minimizing the policy loss given by the policy gradient.
  - Softly update the target network parameters.
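Written out, with Q_phi the Critic, mu_theta the Actor, primed symbols their target copies, gamma the discount factor and tau the soft-update rate, the quantities computed on each sampled mini-batch of size N are:

    y = r + \gamma (1 - d)\, Q_{\phi'}\big(s', \mu_{\theta'}(s')\big)
    L_{\mathrm{critic}}(\phi) = \frac{1}{N} \sum \big(Q_\phi(s, a) - y\big)^2
    L_{\mathrm{actor}}(\theta) = -\frac{1}{N} \sum Q_\phi\big(s, \mu_\theta(s)\big)
    \phi' \leftarrow \tau\,\phi + (1 - \tau)\,\phi', \qquad \theta' \leftarrow \tau\,\theta + (1 - \tau)\,\theta'

Here d is the done flag; these are exactly the expressions the code below evaluates.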

Below is a simplified PyTorch implementation of DDPG. To keep the code readable, many details that matter in practice are simplified or omitted:

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import gym
from collections import deque
import random

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Actor network: maps a state to a deterministic action in [-max_action, max_action]
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()
        self.layer1 = nn.Linear(state_dim, 400)
        self.layer2 = nn.Linear(400, 300)
        self.layer3 = nn.Linear(300, action_dim)
        self.max_action = max_action

    def forward(self, state):
        a = torch.relu(self.layer1(state))
        a = torch.relu(self.layer2(a))
        return self.max_action * torch.tanh(self.layer3(a))

# Critic network: maps a (state, action) pair to a scalar Q value
class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        self.layer1 = nn.Linear(state_dim + action_dim, 400)
        self.layer2 = nn.Linear(400, 300)
        self.layer3 = nn.Linear(300, 1)

    def forward(self, state, action):
        q = torch.relu(self.layer1(torch.cat([state, action], 1)))
        q = torch.relu(self.layer2(q))
        return self.layer3(q)

# Replay buffer: stores transitions and serves random mini-batches
class ReplayBuffer:
    def __init__(self, max_size=1000000):
        self.buffer = deque(maxlen=max_size)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        state, action, reward, next_state, done = zip(*random.sample(self.buffer, batch_size))
        return np.array(state), np.array(action), np.array(reward), np.array(next_state), np.array(done)

    def size(self):
        return len(self.buffer)

# DDPG agent: Actor/Critic networks, their target copies, and the replay buffer
class DDPG:
    def __init__(self, state_dim, action_dim, max_action):
        self.actor = Actor(state_dim, action_dim, max_action).to(device)
        self.actor_target = Actor(state_dim, action_dim, max_action).to(device)
        self.actor_target.load_state_dict(self.actor.state_dict())
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=1e-4)

        self.critic = Critic(state_dim, action_dim).to(device)
        self.critic_target = Critic(state_dim, action_dim).to(device)
        self.critic_target.load_state_dict(self.critic.state_dict())
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=1e-3)

        self.max_action = max_action
        self.gamma = 0.99   # discount factor
        self.tau = 0.005    # soft-update rate for the target networks
        self.replay_buffer = ReplayBuffer()

    def select_action(self, state):
        state = torch.FloatTensor(state.reshape(1, -1)).to(device)
        return self.actor(state).cpu().data.numpy().flatten()

    def train(self, iterations, batch_size=64):
        for _ in range(iterations):
            # Sample a batch of transitions from the buffer
            state, action, reward, next_state, done = self.replay_buffer.sample(batch_size)
            state = torch.FloatTensor(state).to(device)
            action = torch.FloatTensor(action).to(device)
            reward = torch.FloatTensor(reward).reshape(-1, 1).to(device)
            next_state = torch.FloatTensor(next_state).to(device)
            done = torch.FloatTensor(done).reshape(-1, 1).to(device)

            # Compute the target Q value using the target networks
            target_Q = self.critic_target(next_state, self.actor_target(next_state))
            target_Q = reward + ((1 - done) * self.gamma * target_Q).detach()

            # Optimize the Critic on the TD error
            current_Q = self.critic(state, action)
            critic_loss = nn.MSELoss()(current_Q, target_Q)
            self.critic_optimizer.zero_grad()
            critic_loss.backward()
            self.critic_optimizer.step()

            # Optimize the Actor: maximize the Critic's estimate of its actions
            actor_loss = -self.critic(state, self.actor(state)).mean()
            self.actor_optimizer.zero_grad()
            actor_loss.backward()
            self.actor_optimizer.step()

            # Soft update the target networks
            for target_param, param in zip(self.actor_target.parameters(), self.actor.parameters()):
                target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)
            for target_param, param in zip(self.critic_target.parameters(), self.critic.parameters()):
                target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)

if __name__ == "__main__":
    env = gym.make("Pendulum-v0")
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]
    max_action = env.action_space.high[0]
    ddpg = DDPG(state_dim, action_dim, max_action)
    total_episodes = 100
    expl_noise = 0.1  # std (relative to max_action) of the exploration noise
    for episode in range(total_episodes):
        state = env.reset()
        episode_reward = 0
        done = False
        while not done:
            # Deterministic action plus Gaussian exploration noise
            # (a common simple choice; the DDPG paper uses Ornstein-Uhlenbeck noise)
            action = ddpg.select_action(np.array(state))
            action = (action + np.random.normal(0, expl_noise * max_action, size=action_dim)).clip(
                -max_action, max_action)
            next_state, reward, done, _ = env.step(action)
            ddpg.replay_buffer.add(state, action, reward, next_state, float(done))
            state = next_state
            episode_reward += reward
            if ddpg.replay_buffer.size() > 1000:
                ddpg.train(50)
        print(f"Episode: {episode}, Reward: {episode_reward}")
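The training loop above adds simple Gaussian noise for exploration; the original DDPG paper instead uses an Ornstein-Uhlenbeck process, which produces temporally correlated noise suited to inertial control tasks. A minimal sketch (the class name OUNoise and the parameter values are my own choices, not part of the listing above):

import numpy as np

class OUNoise:
    # Ornstein-Uhlenbeck process: noise that drifts back toward mu,
    # giving temporally correlated exploration.
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(action_dim)
        self.theta = theta
        self.sigma = sigma
        self.reset()

    def reset(self):
        # Call at the start of each episode.
        self.state = np.copy(self.mu)

    def sample(self):
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(len(self.state))
        self.state = self.state + dx
        return self.state

To use it, create noise = OUNoise(action_dim), call noise.reset() after each env.reset(), and replace the Gaussian term in the loop with noise.sample() (optionally scaled by max_action) before clipping.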

Notes
- The example uses the Pendulum-v0 environment from the gym library; pick an environment appropriate to your own problem. Newer gym and gymnasium releases rename it Pendulum-v1 and change the reset/step API (see the sketch after these notes).
- In practice, obtaining more stable results may require tuning the network initialization, the hyperparameters, and the exploration strategy.
- The code runs on the GPU when one is available: torch.cuda.is_available() selects the device, and the code falls back to the CPU otherwise.
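For a recent gym release or for gymnasium, the interaction loop has to be adapted; a sketch, assuming the gymnasium package is installed and exposes Pendulum-v1:

import gymnasium as gym
import numpy as np

env = gym.make("Pendulum-v1")
state, _ = env.reset()               # reset() now returns (observation, info)
done = False
while not done:
    action = ddpg.select_action(np.array(state))
    next_state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated   # the episode ends on either signal
    ddpg.replay_buffer.add(state, action, reward, next_state, float(done))
    state = next_state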

I hope this code and explanation help you understand the DDPG algorithm and its implementation.
