\documentclass[ngerman]{scrartcl} % loads the document class
\usepackage{gnuplottex}
\usepackage{amsmath}
\begin{document}
\section{Naive vs.\ episodic array learning}
With the \verb+QArray+ function approximator, several training strategies are possible:
\begin{itemize}
\item Immediately after a new $(\text{oldstate}, \text{action}, \text{newstate}, \text{reward})$ tuple has been observed,
update the $Q$ array by:
$$Q_{\text{oldstate},\text{action}} \leftarrow Q_{\text{oldstate},\text{action}} + \alpha \cdot (\text{reward} + \gamma \max_{a'} Q_{\text{newstate},a'} - Q_{\text{oldstate},\text{action}})$$
This is standard textbook Q-learning and is referred to as the ``old'' approach below.
\item Alternatively, we can store the $(\text{oldstate}, \text{action}, \text{newstate}, \text{reward})$ tuples in a list,
without immediately training them into our array. Then, at the end of an episode, we go through this list \emph{in reverse order} and update as above (see the sketch after this list).
This has two effects:
\begin{itemize}
\item During an episode, we learn nothing about the actions taken earlier in that same episode.
\item At the end of an episode, a newly discovered path is learned in its entirety, whereas with the first approach only the last state before already known terrain is updated for any newly discovered path.
\end{itemize}
\end{itemize}
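For concreteness, the following Python sketch contrasts the two strategies. It is a minimal illustration under assumed names, not the actual \verb+sol.py+ implementation: the class layout and the names \verb+update+, \verb+remember+, \verb+alpha+, and \verb+gamma+ are assumptions; only \verb+flush_learnbuffer+ appears in the code referenced below.
\begin{verbatim}
import numpy as np

class QArray:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9):
        self.q = np.zeros((n_states, n_actions))
        self.alpha = alpha
        self.gamma = gamma
        self.learnbuffer = []  # used only by the episodic ("new") strategy

    def update(self, oldstate, action, newstate, reward):
        # Standard one-step Q-learning update (the "old" approach).
        target = reward + self.gamma * self.q[newstate].max()
        self.q[oldstate, action] += self.alpha * (target - self.q[oldstate, action])

    def remember(self, oldstate, action, newstate, reward):
        # Episodic ("new") approach: only store the transition for now.
        self.learnbuffer.append((oldstate, action, newstate, reward))

    def flush_learnbuffer(self):
        # At the end of an episode, replay the stored transitions in
        # reverse order, so the final reward propagates along the
        # whole newly discovered path within a single episode.
        for oldstate, action, newstate, reward in reversed(self.learnbuffer):
            self.update(oldstate, action, newstate, reward)
        self.learnbuffer = []
\end{verbatim}
With the naive strategy, \verb+update+ is called after every step; with the episodic strategy, \verb+remember+ is called after every step and \verb+flush_learnbuffer+ once per episode.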
The graphs below show the total reward earned as a function of the number of episodes run. They show that the second (``new'') approach converges to an equally good training result, but reaches it faster.
\gnuplotloadfile[terminal=pdf]{array_naive_vs_episodic.gnuplot}
\gnuplotloadfile[terminal=pdf]{array_naive_vs_episodic_deriv.gnuplot}
Code used: \verb+f9a9a51884aadef97b8952b2807541d31b7e9917+ for the ``new'' plots, and the same code with line 254 (\verb+self.flush_learnbuffer() # TODO TRYME+) enabled for the ``old'' plots. Use \verb+sol.py+ or the \verb+test.sh+ script.
\end{document}