
### From Q-Learning to DQN: a brief introduction

DRL (Deep Reinforcement Learning) made its debut when DeepMind applied the DQN (Deep Q-Network) algorithm to Atari games in 2013. Today (in 2017), DRL is still one of the most cutting-edge research areas. In just four years it has evolved from playing Atari to playing Go (AlphaGo) and video games (Dota AI, StarCraft AI), overturning our assumptions again and again.

### 1. What is Q-Learning?

Q-Learning is a temporal-difference method for solving the reinforcement learning control problem. Given the current state $S$, action $A$, immediate reward $R$, discount factor $\gamma$, and exploration rate $\epsilon$, it learns the optimal action-value function $Q$ and the optimal policy $\pi$.

• $S$: the state of the environment; $S_t$ is the state at time $t$

• $A$: the agent's action; $A_t$ is the action taken at time $t$

• $R$: the reward from the environment; when the agent takes action $A_t$ in state $S_t$ at time $t$, the corresponding reward $R_{t+1}$ is received at time $t+1$

• $\gamma$: the discount factor, i.e. the weight given to delayed rewards relative to the immediate reward

• $\epsilon$: the exploration rate; because Q-learning always selects the action with the largest Q value in the current iteration, some actions might never be tried again, so when the agent chooses an action, with a small probability $\epsilon$ it picks a random action instead of the currently most valuable one

#### 1.1 Q-Learning Introduction to the algorithm

First, based on the state $S$, we use the $\epsilon$-greedy policy to select an action $A$, then execute $A$, receive a reward $R$, and move to the next state $S'$. The $Q$ value is updated as follows (where $\alpha$ is the learning rate):

$Q(S,A) \leftarrow Q(S,A)+\alpha\left(R+\gamma \max_a Q(S',a)-Q(S,A)\right)$

#### 1.2 Q-Learning algorithm flow

• Randomly initialize the value of every state-action pair (i.e. initialize the $Q$ table).

• for i from 1 to T ($T$ is the total number of iterations):

a) Initialize $S$ as the first state of the current episode

b) Use the $\epsilon$-greedy policy to choose action $A$ in the current state $S$

c) Execute action $A$ in state $S$, obtaining the new state $S'$ and reward $R$

d) Update the value function $Q(S,A)$: $Q(S,A) \leftarrow Q(S,A)+\alpha\left(R+\gamma \max_a Q(S',a)-Q(S,A)\right)$

e) $S=S'$

f) If $done$ (a terminal state is reached), end the current episode
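The flow above can be written as a short tabular Q-learning routine. This is a minimal sketch: the `env` object, with its `reset()`/`step()` methods and `n_states`/`n_actions` attributes, is an assumed gym-style interface, not code from the original post.

```python
import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning; `env` is assumed to expose reset()/step()
    with integer states and actions (a gym-style interface)."""
    Q = np.zeros((env.n_states, env.n_actions))  # initialize the Q table
    for _ in range(episodes):
        s = env.reset()                          # step a)
        done = False
        while not done:
            # step b): epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(env.n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)        # step c)
            # step d): TD update toward r + gamma * max_a Q(s', a)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a])
            s = s_next                           # step e)
    return Q
```

Note how the bootstrap term is zeroed on terminal transitions, so the reward at the exit is propagated backward unchanged.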

#### 1.3 A Q-table example

(1) The game map

• The black cells are traps

• The yellow cell is the exit (reaching it yields the reward)

(2) The Q table after one round of training

(3) A simple example

• If the agent enters the maze at position "1", it looks up the Q table: the largest Q value, 0.59, is for moving down, so the agent moves to position "5".
• At position "5", it looks up the Q table again: the largest Q value, 0.66, is for moving down, so the agent moves to position "9".
• At position "9", the largest Q value, 0.73, is for moving right, so the agent moves to position "10".
• At position "10", the largest Q value, 0.81, is for moving down, so the agent moves to position "14".
• At position "14", the largest Q value, 0.9, is for moving right, so the agent moves to position "15".
• At position "15", the largest Q value, 1, is for moving right, so the agent moves to position "16" and reaches the end point.

In the end, the agent's path is 1 --> 5 --> 9 --> 10 --> 14 --> 15 --> 16.

The $Q$-table values vary from run to run, but the principle is the same.
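The walkthrough above amounts to greedily following the largest Q value from cell to cell. A small sketch: the Q values echo the ones quoted in the text, but representing the table as a dict of available moves per cell (on a 4x4 grid numbered 1 to 16) is my own simplification.

```python
# Hypothetical greedy rollout on a learned Q-table: from each cell,
# follow the action with the largest Q value, as in the walkthrough.
Q = {
    1:  {"down": 0.59},
    5:  {"down": 0.66},
    9:  {"right": 0.73},
    10: {"down": 0.81},
    14: {"right": 0.9},
    15: {"right": 1.0},
}
# cell-number offsets on a 4x4 grid numbered 1..16
step = {"down": 4, "right": 1}

cell, path = 1, [1]
while cell != 16:
    best = max(Q[cell], key=Q[cell].get)  # greedy action at this cell
    cell += step[best]
    path.append(cell)
print(path)  # [1, 5, 9, 10, 14, 15, 16]
```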


### 2. DQN (Deep Q-Network)

As described above, Q-Learning makes decisions based on the values in the Q table: the action that yields the higher expected reward is the one executed. The state and action spaces above are very small; if they become very large, can we still represent everything with a Q table? Obviously not, and that is why value function approximation was introduced.

#### 2.1 The value function approximates

Because practical problems can be very large in scale, one possible solution is value function approximation. We introduce a state-value function $\hat v$ parameterized by a weight vector $\omega$; it takes a state $s$ as input and outputs an estimate of the value of $s$: $\hat v(s,\omega)\approx v_\pi(s)$

The $\omega$ mentioned above plays the role of the parameters of a neural network: given an input state $s$, we compute the value target with MC (Monte Carlo) or TD (temporal difference) methods, then train the weights $\omega$ until they converge. In fact, the so-called DQN is exactly the combination of a neural network with Q-Learning: the Q table becomes a Q network.
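As a minimal sketch of this idea with a linear approximator $\hat v(s,\omega)=\omega^\top\phi(s)$: the feature map `phi_s` and the semi-gradient TD(0) update below are standard textbook ingredients, not code from the original post.

```python
import numpy as np

def td0_update(w, phi_s, r, phi_s_next, gamma=0.9, alpha=0.01, done=False):
    """One semi-gradient TD(0) step for a linear value function
    v_hat(s, w) = w . phi(s), moved toward the target r + gamma * v_hat(s')."""
    v = w @ phi_s
    target = r + (0.0 if done else gamma * (w @ phi_s_next))
    # the gradient of v_hat with respect to w is simply phi(s)
    return w + alpha * (target - v) * phi_s
```

The same pattern, with the linear map replaced by a neural network and the state value replaced by action values, is what DQN trains.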

#### 2.2 Deep Q-Learning Algorithm ideas

DQN is an off-policy algorithm; in Prof. Li Hongyi's words, it can "learn by watching others". Why can DQN learn this way? Because it learns through experience replay: each time the agent interacts with the environment, the reward, the current state, and the next state are stored for later Q-network updates.
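The experience replay idea can be sketched as a small ring buffer. This is a hypothetical minimal version; the actual implementation in section 2.4 uses a fixed-size numpy array instead of a deque.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s_next, done) transitions and samples them
    uniformly at random for Q-network updates."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest entries are dropped first

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly from past transitions breaks the correlation between consecutive steps, which is what makes the later gradient updates stable.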

Now let's look at Nature DQN. Nature DQN is actually the second generation of DQN; DQN NIPS is the most primitive version, and many variants have been built on top of it, such as Double DQN and Dueling DQN. Here I will introduce Nature DQN, which I think is the most classic version. Let's see how it performs reinforcement learning.

#### 2.3 algorithm flow chart

Input: the total number of iterations $T$, state feature dimension $n$, action set $A$, step size $\alpha$, discount factor $\gamma$, exploration rate $\epsilon$, the current Q network $Q$, the target Q network $Q'$, the batch size $m$ for gradient descent, and the target Q network parameter update frequency $P$.

Output: the Q network parameters

• Randomly initialize all parameters $\omega$ of the current $Q$ network (which implicitly randomly initializes the $Q$ value of every state-action pair), initialize the target Q network $Q'$ with parameters $\omega'=\omega$, and empty the experience pool $D$

• for i from 1 to T (iterate):

a) Initialize the environment, get the first state $S$, and compute its feature vector $\phi(S)$

b) Feed $\phi(S)$ into the $Q$ network to obtain the $Q$ values of all actions; use the $\epsilon$-greedy policy to select the action $A$ from this output

c) Execute action $A$ in state $S$, obtaining the new state $S'$ (with feature vector $\phi(S')$), the reward $R$, and the termination flag $isdone$

d) Store the 5-tuple { $\phi(S)$, $A$, $R$, $\phi(S')$, $isdone$ } in the experience pool $D$

e) $S=S'$

f) Sample $m$ transitions { $\phi(S_j)$, $A_j$, $R_j$, $\phi(S'_j)$, $isdone_j$ }, $j=1,2,\dots,m$, from the experience pool $D$, and compute the target $Q$ value $y_j$:

$y_j = \begin{cases} R_j, & isdone_j \text{ is true} \\ R_j + \gamma \max_{a'} Q'(\phi(S'_j),a',\omega'), & isdone_j \text{ is false} \end{cases}$

g) Use the mean squared error loss function $\frac{1}{m}\sum_{j=1}^m \left(y_j-Q(\phi(S_j),A_j,\omega)\right)^2$ to update the parameters $\omega$ by gradient descent with backpropagation

h) If i % P == 0, update the target Q network parameters: $\omega'=\omega$

i) If $S'$ is a terminal state, the current episode is complete; otherwise jump back to step b)

#### 2.4 DQN Implementation code

(1) Network structure

```python
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(N_STATES, 50)
        self.fc1.weight.data.normal_(0, 0.1)   # initialize weights
        self.out = nn.Linear(50, N_ACTIONS)
        self.out.weight.data.normal_(0, 0.1)   # initialize weights

    def forward(self, x):
        x = self.fc1(x)
        x = F.relu(x)
        actions_value = self.out(x)
        return actions_value
```
• The two networks have the same structure.
• Their parameter weights differ: one is updated in real time, the other is updated only after an interval.

(2) Choosing an action

```python
def choose_action(self, x):  # x is the current state (4 values)
    x = torch.unsqueeze(torch.FloatTensor(x), 0)  # add a batch dimension at dim 0
    # input only one sample
    if np.random.uniform() < EPSILON:   # greedy
        actions_value = self.eval_net.forward(x)  # use eval_net to score the actions
        action = torch.max(actions_value, 1)[1].data.numpy()  # index of the largest value in the row
        action = action[0] if ENV_A_SHAPE == 0 else action.reshape(ENV_A_SHAPE)  # return the argmax index
    else:   # random exploration
        action = np.random.randint(0, N_ACTIONS)
        action = action if ENV_A_SHAPE == 0 else action.reshape(ENV_A_SHAPE)
    return action
```

Thanks to the exploration rate $(\epsilon)$, there is a small probability of choosing a random action instead of the greedy one.

(3) The experience pool

```python
def store_transition(self, s, a, r, s_):  # s and s_ are each 4 values: position, velocity, angle, angular velocity
    transition = np.hstack((s, [a, r], s_))
    # replace the old memory with new memory
    index = self.memory_counter % MEMORY_CAPACITY
    self.memory[index, :] = transition  # overwrite slot `index` with the new transition
    self.memory_counter += 1
```

(4) Updating the network parameters

```python
def learn(self):
    # target parameter update: every TARGET_REPLACE_ITER learning steps,
    # copy eval_net's parameters into target_net
    if self.learn_step_counter % TARGET_REPLACE_ITER == 0:
        self.target_net.load_state_dict(self.eval_net.state_dict())
    self.learn_step_counter += 1
    # sample a batch of transitions
    sample_index = np.random.choice(MEMORY_CAPACITY, BATCH_SIZE)  # pick BATCH_SIZE indices out of MEMORY_CAPACITY
    b_memory = self.memory[sample_index, :]
    b_s = torch.FloatTensor(b_memory[:, :N_STATES])                        # states
    b_a = torch.LongTensor(b_memory[:, N_STATES:N_STATES+1].astype(int))   # actions
    b_r = torch.FloatTensor(b_memory[:, N_STATES+1:N_STATES+2])            # rewards
    b_s_ = torch.FloatTensor(b_memory[:, -N_STATES:])                      # next states
    # Q values of the actions actually taken, computed with eval_net
    q_eval = self.eval_net(b_s).gather(1, b_a)   # shape (batch, 1)
    # Q values of the next states, computed with target_net;
    # detach from the graph so no gradients flow back into target_net
    q_next = self.target_net(b_s_).detach()
    q_target = b_r + GAMMA * q_next.max(1)[0].view(BATCH_SIZE, 1)   # shape (batch, 1)
    loss = self.loss_func(q_eval, q_target)
    self.optimizer.zero_grad()   # clear old gradients before backpropagation
    loss.backward()              # backpropagation
    self.optimizer.step()        # apply the gradient step
```
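For completeness, here is how the pieces above would be driven by a training loop. This is a generic sketch: `env` and `agent` are assumed to expose a gym-style `reset()`/`step()` interface and the `choose_action`/`store_transition`/`learn` methods shown above, and learning starts only after the experience pool has filled past a warm-up threshold.

```python
def train(env, agent, episodes, warmup):
    """Generic DQN training loop: interact with the environment, store each
    transition in the experience pool, and start learning once more than
    `warmup` transitions have been collected."""
    stored = 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = agent.choose_action(s)             # epsilon-greedy action
            s_next, r, done = env.step(a)          # interact with the environment
            agent.store_transition(s, a, r, s_next)
            stored += 1
            if stored > warmup:
                agent.learn()                      # sample a batch and update the Q network
            s = s_next
```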

DQN is the gateway to deep reinforcement learning; once you step through it, the rest of your study will come much more easily.

PS: For more technical content, follow the official account 【xingzhe_ai】 and join the discussion!

https://cdmana.com/2021/01/20210128041721931Q.html
