
Proximal Policy Optimization (PPO): A Near-Perfect Reinforcement Learning Optimization Algorithm

Published: 2019-05-24

Proximal Policy Optimization (PPO) performs on par with today's state-of-the-art algorithms while being far easier to implement and tune. Thanks to this combination of ease of use and strong performance, PPO has become OpenAI's default reinforcement learning algorithm.

PPO lets us train locomotion policies for agents in fairly complex environments, such as the Roboschool task shown above, in which an agent must reach a target (the pink ball on the field). The agent has to learn on its own to walk, run, and turn, to adjust its momentum to recover from light hits (the flying white cubes in the video knock the model over or make it stagger; translator's note), and to stand back up after being knocked to the ground.

Policy gradient methods are effective: the breakthroughs we have achieved with deep neural networks in video game playing, 3D locomotion control, and Go are all built on policy gradient methods. Getting good results with them is hard, however, because they are very sensitive to the choice of stepsize: if the stepsize is too small, training is painfully slow; if it is too large, the signal is overwhelmed by noise and fails to converge, or performance can drop catastrophically. Policy gradient methods also tend to have poor sample efficiency, and we often burn through an enormous number of timesteps to learn even a few simple tasks.
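
To make the stepsize issue concrete, here is a toy numpy sketch (my own illustration, not from the original post) of a vanilla REINFORCE update for a softmax policy on a four-armed bandit; the raw learning_rate is the only thing limiting how far each update moves the policy, which is exactly the knob PPO later constrains.

import numpy as np

np.random.seed(0)
n_actions = 4
theta = np.zeros(n_actions)                    # policy parameters (softmax logits)
true_reward = np.array([1.0, 0.0, 0.0, 0.5])   # hypothetical expected reward per action
learning_rate = 0.5                            # the stepsize: too small is slow, too large destabilizes

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(200):
    probs = softmax(theta)
    action = np.random.choice(n_actions, p=probs)
    reward = true_reward[action] + np.random.normal(scale=0.1)  # noisy return
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0                 # gradient of log pi(action) w.r.t. the logits
    theta += learning_rate * reward * grad_log_pi   # REINFORCE update, no trust region

print(softmax(theta))  # inspect the learned policy; try learning_rate = 5.0 or 0.005 to see instability vs. slowness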

Researchers have tried to patch these flaws, for example by constraining or otherwise optimizing the size of each policy update; TRPO and ACER are two well-known attempts. These methods come with their own trade-offs. ACER needs extra code for off-policy corrections and a replay buffer, which makes it considerably more complex than PPO, and on the Atari benchmark that complexity buys only a marginal advantage. TRPO, while very useful for continuous control tasks, is not easily compatible with algorithms that share parameters between the policy and a value function or auxiliary losses, and such algorithms are common in approaches to Atari and other domains that rely mainly on visual input.


We have already used PPO-trained locomotion policies to build interactive agents: in an environment provided by Roboschool, we can use the keyboard to set new target positions for the robot, and even when the input sequence differs from anything seen during training, the robot still manages to complete the task.

We have also used PPO to teach a complex simulated robot to walk, such as the 'Atlas' model from Boston Dynamics shown below. This model has 30 joints, compared with 17 for the humanoid above. Other researchers report that simulated robots trained with PPO can even pull off surprising parkour moves while clearing obstacles.

     相對(duì)而言,監(jiān)督學(xué)習(xí)是很容易的,我們只要實(shí)現(xiàn)一個(gè)價(jià)值函數(shù),對(duì)此采用梯度下降法,幾乎不用做什么進(jìn)一步的參數(shù)調(diào)優(yōu)就能很驕傲地宣稱自己得到了一個(gè)很優(yōu)秀的訓(xùn)練結(jié)果?上У氖,在增強(qiáng)學(xué)習(xí)中這一切并不會(huì)如此簡(jiǎn)單——增強(qiáng)學(xué)習(xí)的算法有著太多難以調(diào)試的可變部分,我們必須花費(fèi)大把的精力才能得到一個(gè)良好的訓(xùn)練結(jié)果。而為了平衡易于實(shí)現(xiàn)、樣本復(fù)雜度以及易于優(yōu)化這幾個(gè)問(wèn)題,并力求在每一次收斂步驟中都得到策略更新,同時(shí)保證與之前的策略偏差相對(duì)較小,我們必須尋找一種新的優(yōu)化方法,這就是PPO算法。


import os
import shutil
import time
from time import sleep

import numpy as np
import tensorflow as tf  # this script relies on TF 1.x APIs (Session, global_variables_initializer, ...)

from ppo.renderthread import RenderThread
from ppo.models import *  # expected to provide create_agent_model, save_model and export_graph
from ppo.trainer import Trainer
from agents import GymEnvironment

# ## Proximal Policy Optimization (PPO)
# Contains an implementation of PPO as described [here](https://arxiv.org/abs/1707.06347).

# Algorithm parameters
# batch-size=<n>           How many experiences per gradient descent update step [default: 64].
batch_size = 128
# beta=<n>                 Strength of entropy regularization [default: 2.5e-3].
beta = 2.5e-3
# buffer-size=<n>          How large the experience buffer should be before gradient descent [default: 2048].
buffer_size = batch_size * 32
# epsilon=<n>              Acceptable threshold around ratio of old and new policy probabilities [default: 0.2].
epsilon = 0.2
# gamma=<n>                Reward discount rate [default: 0.99].
gamma = 0.99
# hidden-units=<n>         Number of units in hidden layer [default: 64].
hidden_units = 128
# lambd=<n>                Lambda parameter for GAE [default: 0.95].
lambd = 0.95
# learning-rate=<rate>     Model learning rate [default: 3e-4].
learning_rate = 4e-5
# max-steps=<n>            Maximum number of steps to run environment [default: 1e6].
max_steps = 15e6
# normalize                Activate state normalization for this many steps and freeze statistics afterwards.
normalize_steps = 0
# num-epoch=<n>            Number of gradient descent steps per batch of experiences [default: 5].
num_epoch = 10
# num-layers=<n>           Number of hidden layers between state/observation and outputs [default: 2].
num_layers = 1
# time-horizon=<n>         How many steps to collect per agent before adding to buffer [default: 2048].
time_horizon = 2048

# General parameters
# keep-checkpoints=<n>     How many model checkpoints to keep [default: 5].
keep_checkpoints = 5
# load                     Whether to load the model or randomly initialize [default: False].
load_model = True
# run-path=<path>          The sub-directory name for model and summary statistics.
summary_path = './PPO_summary'
model_path = './models'
# summary-freq=<n>         Frequency at which to save training statistics [default: 10000].
summary_freq = buffer_size * 5
# save-freq=<n>            Frequency at which to save model [default: 50000].
save_freq = summary_freq
# train                    Whether to train model, or only run inference [default: False].
train_model = False
# render environment to display progress
render = True
# save recordings of episodes
record = True

os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # GPU is not efficient here

env_name = 'RocketLander-v0'
env = GymEnvironment(env_name=env_name, log_path="./PPO_log", skip_frames=6)
env_render = GymEnvironment(env_name=env_name, log_path="./PPO_log_render", render=True, record=record)
fps = env_render.env.metadata.get('video.frames_per_second', 30)

print(str(env))
brain_name = env.external_brain_names[0]

tf.reset_default_graph()

ppo_model = create_agent_model(env, lr=learning_rate,
                               h_size=hidden_units, epsilon=epsilon,
                               beta=beta, max_step=max_steps,
                               normalize=normalize_steps, num_layers=num_layers)

is_continuous = env.brains[brain_name].action_space_type == "continuous"
use_observations = False
use_states = True

if not load_model:
    shutil.rmtree(summary_path, ignore_errors=True)

if not os.path.exists(model_path):
    os.makedirs(model_path)

if not os.path.exists(summary_path):
    os.makedirs(summary_path)

tf.set_random_seed(np.random.randint(1024))
init = tf.global_variables_initializer()
saver = tf.train.Saver(max_to_keep=keep_checkpoints)

with tf.Session() as sess:
    # Instantiate model parameters
    if load_model:
        print('Loading Model...')
        ckpt = tf.train.get_checkpoint_state(model_path)
        if ckpt is None:
            raise FileNotFoundError(
                'The model {0} could not be found. Make sure you specified the right --run-path'.format(model_path))
        saver.restore(sess, ckpt.model_checkpoint_path)
    else:
        sess.run(init)

    steps, last_reward = sess.run([ppo_model.global_step, ppo_model.last_reward])
    summary_writer = tf.summary.FileWriter(summary_path)
    info = env.reset()[brain_name]
    trainer = Trainer(ppo_model, sess, info, is_continuous, use_observations, use_states, train_model)
    trainer_monitor = Trainer(ppo_model, sess, info, is_continuous, use_observations, use_states, False)
    render_started = False

    while steps <= max_steps or not train_model:
        if env.global_done:
            info = env.reset()[brain_name]
            trainer.reset_buffers(info, total=True)
        # Decide and take an action
        if train_model:
            info = trainer.take_action(info, env, brain_name, steps, normalize_steps, stochastic=True)
            trainer.process_experiences(info, time_horizon, gamma, lambd)
        else:
            sleep(1)
        if len(trainer.training_buffer['actions']) > buffer_size and train_model:
            if render and render_started:
                # Pause the monitoring thread while the policy is being updated
                renderthread.pause()
            print("Optimizing...")
            t = time.time()
            # Perform gradient descent with experience buffer
            trainer.update_model(batch_size, num_epoch)
            print("Optimization finished in {:.1f} seconds.".format(time.time() - t))
            if render and render_started:
                renderthread.resume()
        if steps % summary_freq == 0 and steps != 0 and train_model:
            # Write training statistics to tensorboard.
            trainer.write_summary(summary_writer, steps)
        if steps % save_freq == 0 and steps != 0 and train_model:
            # Save Tensorflow model
            save_model(sess=sess, model_path=model_path, steps=steps, saver=saver)
        if train_model:
            steps += 1
            sess.run(ppo_model.increment_step)
            if len(trainer.stats['cumulative_reward']) > 0:
                mean_reward = np.mean(trainer.stats['cumulative_reward'])
                sess.run(ppo_model.update_reward, feed_dict={ppo_model.new_reward: mean_reward})
                last_reward = sess.run(ppo_model.last_reward)
        if not render_started and render:
            # Start the monitoring/render thread once, after the first pass through the loop
            renderthread = RenderThread(sess=sess, trainer=trainer_monitor,
                                        environment=env_render, brain_name=brain_name, normalize=normalize_steps, fps=fps)
            renderthread.start()
            render_started = True
    # Final save Tensorflow model
    if steps != 0 and train_model:
        save_model(sess=sess, model_path=model_path, steps=steps, saver=saver)
env.close()
export_graph(model_path, env_name)
os.system("shutdown")  # attempts to power the machine off once the run finishes; remove if undesired
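
A note on the advantage estimation: trainer.process_experiences(info, time_horizon, gamma, lambd) above receives gamma and lambd (documented as the GAE lambda parameter), which suggests it is where advantages are computed with generalized advantage estimation. For reference, here is a standalone numpy sketch of GAE with the same meaning of gamma and lambd; the function gae_advantages is hypothetical and independent of this codebase.

import numpy as np

def gae_advantages(rewards, values, bootstrap_value, gamma=0.99, lambd=0.95):
    """Generalized Advantage Estimation over one trajectory segment."""
    values = np.append(values, bootstrap_value)           # append the value estimate used to bootstrap the tail
    deltas = rewards + gamma * values[1:] - values[:-1]   # one-step TD residuals
    advantages = np.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):               # discounted, lambda-weighted sum of residuals
        gae = deltas[t] + gamma * lambd * gae
        advantages[t] = gae
    return advantages

print(gae_advantages(np.array([1.0, 0.0, 1.0]),
                     np.array([0.5, 0.4, 0.6]), bootstrap_value=0.0))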
