
A brief talk on getting from Q-Learning to DQN

This article was first published by: Walker AI

DRL (Deep Reinforcement Learning) first made its mark in 2013, when DeepMind applied the DQN (Deep Q Network) algorithm to Atari games. Today (2017), DRL is still one of the most cutting-edge research areas. In just four years it has evolved from playing Atari to playing Go (AlphaGo) and video games (Dota AI, StarCraft AI), reshaping our expectations again and again.

1. What is Q-Learning?

The Q-Learning algorithm solves the reinforcement learning control problem using temporal-difference learning. From the current state $S$, action $A$, immediate reward $R$, discount factor $\gamma$, and exploration rate $\epsilon$, it derives the optimal action-value function $Q$ and the optimal policy $\pi$.

  • $S$: the state of the environment; at time $t$ the environment is in state $S_t$.

  • $A$: the agent's action; at time $t$ the agent takes action $A_t$.

  • $R$: the reward from the environment; when the agent takes action $A_t$ in state $S_t$ at time $t$, the corresponding reward $R_{t+1}$ is received at time $t+1$.

  • $\gamma$: the discount factor, i.e. the weight given to delayed rewards.

  • $\epsilon$: the exploration rate. Q-Learning normally selects the highest-valued action in the current iteration, which could mean some actions are never tried again; therefore, when the agent chooses an action, with a small probability it does not pick the currently highest-valued action.

1.1 Introduction to the Q-Learning algorithm

First, based on the state $S$, we use $\epsilon$-greedy to select an action $A$, then perform action $A$, receive the reward $R$, and move into the state $S'$. The $Q$-value update formula is as follows:

$Q(S,A) = Q(S,A) + \alpha \left( R + \gamma \max_{a} Q(S',a) - Q(S,A) \right)$
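As a quick worked example (the numbers are chosen arbitrarily for illustration and are not from the article): with $\alpha = 0.1$, $\gamma = 0.9$, $Q(S,A) = 0.5$, $R = 1$, and $\max_a Q(S',a) = 0.8$, the update gives

$Q(S,A) \leftarrow 0.5 + 0.1 \times (1 + 0.9 \times 0.8 - 0.5) = 0.622$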

1.2 Q-Learning algorithm flow

  • Randomly initialize the value of every state-action pair (initialize the $Q$ table).

  • for i from 1 to T (T is the total number of iterations; a minimal Python sketch of this loop follows the list):

            a) Initialize $S$ as the first state of the current episode

            b) Use $\epsilon$-greedy to choose the action $A$ in the current state $S$

            c) Perform the current action $A$ in state $S$, obtaining the new state $S'$ and the reward $R$

            d) Update the value function $Q(S,A)$: $Q(S,A) = Q(S,A) + \alpha \left( R + \gamma \max_{a} Q(S',a) - Q(S,A) \right)$

            e) $S = S'$

            f) If done, the current episode is complete
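Here is the minimal Python sketch of the loop above. It assumes a small discrete environment object with reset() and step(a) methods that return the next state, the reward, and a done flag; these names and the hyperparameter defaults are illustrative, not from the original article.

import numpy as np

def q_learning(env, n_states, n_actions, T=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))            # initialize the Q table
    for _ in range(T):                             # for i from 1 to T
        s = env.reset()                            # a) first state of the episode
        done = False
        while not done:
            if np.random.rand() < epsilon:         # b) epsilon-greedy selection
                a = np.random.randint(n_actions)   #    explore: random action
            else:
                a = int(np.argmax(Q[s]))           #    exploit: best-known action
            s_next, r, done = env.step(a)          # c) execute A, observe S' and R
            # d) TD update: Q(S,A) += alpha * (R + gamma * max_a Q(S',a) - Q(S,A))
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next                             # e) S = S'
    return Q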

1.3 A Q-table example

(1) Game map

  • The black box is a trap

  • The yellow box is the exit (reaching it gives the reward)

(2) The Q table after the model has been trained

(3) A simple example

  • If the agent enters the maze at position “1”, it looks up the Q table; the largest Q value, 0.59, is for moving down, so the agent moves to position “5”.
  • At position “5”, it looks up the Q table again; the largest Q value, 0.66, is for moving down, so it moves to position “9”.
  • At position “9”, the largest Q value, 0.73, is for moving right, so it moves to position “10”.
  • At position “10”, the largest Q value, 0.81, is for moving down, so it moves to position “14”.
  • At position “14”, the largest Q value, 0.9, is for moving right, so it moves to position “15”.
  • At position “15”, the largest Q value, 1, is for moving right, so it moves to position “16” and reaches the goal.

In the end, the agent's path is 1 → 5 → 9 → 10 → 14 → 15 → 16.

The values in the $Q$ table vary from run to run, but the principle is the same.
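To make the table lookup above concrete, here is a hedged sketch of reading the greedy path off a trained Q table; the 4-wide grid layout, the 1-based cell numbering, and the action order (up, down, left, right) are assumptions for this illustration, not values taken from the article.

import numpy as np

def greedy_path(q_table, start, goal, grid_width=4):
    # assumed action order: 0 = up, 1 = down, 2 = left, 3 = right
    moves = {0: -grid_width, 1: grid_width, 2: -1, 3: 1}
    path = [start]
    state = start
    while state != goal:
        action = int(np.argmax(q_table[state - 1]))  # pick the action with the largest Q value
        state = state + moves[action]                # move to the neighbouring cell
        path.append(state)
    return path                                      # e.g. [1, 5, 9, 10, 14, 15, 16]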

For a more intuitive visualization, click here.

2. DQN(Deep Q Network)

As discussed above, Q-Learning makes its decisions from the values in the Q table: the action that brings the larger reward is the one selected for execution. The state and action spaces above are very small; if they become very large, can we still represent them with a Q table? Obviously not, which is why value function approximation is introduced.

2.1 Value function approximation

Because practical problems can be very large in scale, one possible solution is value function approximation. We introduce a state-value function $\hat v$, described by weights $\omega$, which takes a state $s$ as input and computes an estimate of the value of state $s$: $\hat v(s,\omega) \approx v_\pi(s)$

The $\omega$ mentioned above is equivalent to the parameters of a neural network. Given the input state $s$, we use MC (Monte Carlo) / TD (temporal difference) to compute the value function as the training target, and then train the weights $\omega$ until convergence. In fact, the so-called DQN is simply the combination of a neural network with Q-Learning: the Q table becomes a Q network.
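As a minimal sketch of what $\hat v(s,\omega)$ can look like in code (the network size, optimizer, and TD(0) target below are illustrative assumptions, not details from the article), a small PyTorch network replaces the table:

import torch
import torch.nn as nn

class ValueNet(nn.Module):
    # v_hat(s, w): maps a state feature vector to a scalar value estimate
    def __init__(self, state_dim, hidden=50):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, s):
        return self.net(s)

def td0_update(v_net, optimizer, s, r, s_next, gamma=0.9):
    # one TD(0) step: push v_hat(s) toward the bootstrap target r + gamma * v_hat(s')
    target = r + gamma * v_net(s_next).detach()   # no gradient through the target
    loss = (v_net(s) - target).pow(2).mean()      # squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()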

2.2 The idea of the Deep Q-Learning algorithm

DQN is an off-policy algorithm; in Mr. Li Hongyi's words, it can learn by watching others. Why can DQN learn by watching others? Because DQN learns through experience replay: every time the agent interacts with the environment, the reward, the current state, and the next state are saved for later updates of the Q network.
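The array-based experience pool in the code of section 2.4 plays exactly this role; the following is a minimal deque-based sketch of the same idea (class and method names are illustrative, not from the article).

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=2000):
        self.buffer = deque(maxlen=capacity)     # old transitions are dropped automatically

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(list(self.buffer), batch_size)
        return zip(*batch)                       # tuples of states, actions, rewards, next states, dones

    def __len__(self):
        return len(self.buffer)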

Next let's look at Nature DQN. Nature DQN is actually the second generation of DQN; DQN NIPS is the most primitive version, and many more variants have been built on top of it, such as Double DQN, Dueling DQN, and so on. Here I will introduce Nature DQN, which I think is the most classic version. Now let's see how DQN carries out reinforcement learning.

2.3 Algorithm flow

Input: the total number of iterations $T$, the state feature dimension $n$, the action dimension $A$, the step size $\alpha$, the discount factor $\gamma$, the exploration rate $\epsilon$, the current Q network $Q$, the target Q network $Q'$, the batch size of gradient descent $m$, and the target Q network parameter update frequency $P$.

Output: the Q network parameters

  • Randomly initialize the values $Q$ of all states and actions; randomly initialize all parameters $\omega$ of the current $Q$ network; initialize the target $Q$ network $Q'$ with parameters $\omega' = \omega$; clear the experience pool $D$.

  • for i from 1 to T (iterate over episodes):

              a) Initialize the environment, get the first state $s$, and obtain its feature vector $\phi(S)$

              b) Use $\phi(S)$ as the input to the $Q$ network to obtain the $Q$ values of all actions, and use $\epsilon$-greedy to select the corresponding action $A$ from the $Q$-value outputs

              c) Execute the action $A$ in state $S$, obtaining the new state $S'$ with its feature vector $\phi(S')$, the reward $R$, and whether it is a terminal state ($isdone$)

              d) Store the 5-element tuple {$\phi(S)$, $A$, $R$, $\phi(S')$, $isdone$} in the experience pool $D$

              e) $S = S'$

              f) Sample $m$ transitions {$\phi(S_j)$, $A_j$, $R_j$, $\phi(S'_j)$, $isdone_j$}, $j = 1,2,3,...,m$, from the experience pool $D$, and compute the target $Q$ value $y_j$ (the standard Nature DQN target): $y_j = R_j$ if $isdone_j$ is true, otherwise $y_j = R_j + \gamma \max_{a'} Q'(\phi(S'_j), a', \omega')$

              g) Use the mean squared error loss $\frac{1}{m}\sum_{j=1}^m \left( y_j - Q(\phi(S_j), A_j, \omega) \right)^2$ and update the parameters $\omega$ by gradient descent with backpropagation through the neural network

              h) If $i \% P = 0$, update the target $Q$ network parameters: $\omega' = \omega$

              i) If $S'$ is a terminal state, the current episode is complete; otherwise go back to step b)

2.4 DQN Implementation code

(1) Network structure

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(N_STATES, 50)      # hidden layer: state features -> 50 units
        self.fc1.weight.data.normal_(0, 0.1)    # initialize weights
        self.out = nn.Linear(50, N_ACTIONS)     # output layer: one Q value per action
        self.out.weight.data.normal_(0, 0.1)    # initialize weights

    def forward(self, x):
        x = self.fc1(x)
        x = F.relu(x)
        actions_value = self.out(x)             # Q values of all actions in state x
        return actions_value
  • The two networks (eval and target) have the same structure.
  • Their parameter weights differ: one is updated in real time, the other only after an interval.
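For clarity, a minimal sketch of how the two networks might be created and periodically synchronized (assuming the Net class above; the variable names are illustrative, and the actual synchronization appears in the learn() method below):

eval_net, target_net = Net(), Net()                 # same structure, separate weights
# eval_net is trained every step; every TARGET_REPLACE_ITER steps copy its weights across:
target_net.load_state_dict(eval_net.state_dict())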

(2) The choice of action

    def choose_action(self, x):  # x is the current state (4 values)
        x = torch.unsqueeze(torch.FloatTensor(x), 0)  # add a batch dimension at dim 0
        # input only one sample
        if np.random.uniform() < EPSILON:   # greedy: exploit the current Q network
            actions_value = self.eval_net.forward(x)  # feed the state into eval_net to get the Q values
            action = torch.max(actions_value, 1)[1].data.numpy()  # index of the largest value in the row
            action = action[0] if ENV_A_SHAPE == 0 else action.reshape(ENV_A_SHAPE)  # return the argmax index
        else:   # random exploration
            action = np.random.randint(0, N_ACTIONS)
            action = action if ENV_A_SHAPE == 0 else action.reshape(ENV_A_SHAPE)
        return action

An exploration rate ($\epsilon$) is added, so that with a small probability the action is chosen at random.

(3) Experience pool

    def store_transition(self, s, a, r, s_):  # s and s_ are each 4 values: position, velocity, angle, angular velocity
        transition = np.hstack((s, [a, r], s_))
        # replace the old memory with new memory
        index = self.memory_counter % MEMORY_CAPACITY
        self.memory[index, :] = transition  # overwrite the entry at position `index` with the new transition
        self.memory_counter += 1

(4) Update network parameters

    def learn(self):
        # target parameter update
        if self.learn_step_counter % TARGET_REPLACE_ITER == 0:
            self.target_net.load_state_dict(self.eval_net.state_dict())  # every TARGET_REPLACE_ITER steps, copy eval_net's parameters into target_net
        self.learn_step_counter += 1
        # sample a batch of transitions
        sample_index = np.random.choice(MEMORY_CAPACITY, BATCH_SIZE)  # pick BATCH_SIZE indices at random from MEMORY_CAPACITY
        b_memory = self.memory[sample_index, :]
        b_s = torch.FloatTensor(b_memory[:, :N_STATES])                       # batch of current states
        b_a = torch.LongTensor(b_memory[:, N_STATES:N_STATES+1].astype(int))  # batch of actions
        b_r = torch.FloatTensor(b_memory[:, N_STATES+1:N_STATES+2])           # batch of rewards
        b_s_ = torch.FloatTensor(b_memory[:, -N_STATES:])                     # batch of next states
        # q_eval: Q value of the action actually taken, computed with eval_net
        q_eval = self.eval_net(b_s).gather(1, b_a)   # shape (batch, 1)
        q_next = self.target_net(b_s_).detach()      # next-state Q values from target_net; detach so no gradients flow back into target_net
        q_target = b_r + GAMMA * q_next.max(1)[0].view(BATCH_SIZE, 1)   # shape (batch, 1)
        loss = self.loss_func(q_eval, q_target)
        self.optimizer.zero_grad()   # reset all gradients to zero
        loss.backward()              # backpropagation
        self.optimizer.step()        # apply one optimization step
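The article shows the pieces of the implementation but not the loop that ties them together. Below is a hedged sketch of a main training loop, assuming a DQN class that wraps the eval/target networks, the memory, and the methods above, and an OpenAI Gym environment such as CartPole (the four state values in the comments match CartPole's observation); the exact gym API calls depend on the installed version.

import gym

env = gym.make('CartPole-v0')
dqn = DQN()                                    # assumed wrapper around eval_net, target_net, memory, learn(), ...

for episode in range(400):
    s = env.reset()
    while True:
        a = dqn.choose_action(s)               # epsilon-greedy action from eval_net
        s_, r, done, info = env.step(a)        # interact with the environment
        dqn.store_transition(s, a, r, s_)      # save the transition in the experience pool
        if dqn.memory_counter > MEMORY_CAPACITY:
            dqn.learn()                        # start learning once the pool is full
        if done:
            break
        s = s_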

DQN is the threshold of deep reinforcement learning; once you step through this gate, the later study becomes much easier.


PS: For more technical content, follow the official account 【xingzhe_ai】 and discuss with Walker AI!

Copyright notice
This article was created by [Walker AI]. Please include a link to the original when reposting. Thank you.
https://cdmana.com/2021/01/20210128041721931Q.html
