\documentclass[ngerman]{scrartcl} % loads the document class

\usepackage{gnuplottex}
\usepackage{amsmath}

\begin{document}
\section{Naive vs Episodic array learning}
With the \verb+QArray+ function approximator, different training strategies can be used:

\begin{itemize}
\item As soon as a new $(\text{oldstate}, \text{action}, \text{newstate}, \text{reward})$ tuple has been observed,
	update the $Q$ array by:
	$$Q_{\text{oldstate},\text{action}} = Q_{\text{oldstate},\text{action}} + \alpha \cdot (\text{reward} + \gamma \max_{a'} Q_{\text{newstate},a'} - Q_{\text{oldstate},\text{action}})$$

	This is standard textbook reinforcement learning and is referred to as the ``old'' approach below.

\item Alternatively, we can store the $(\text{oldstate}, \text{action}, \text{newstate}, \text{reward})$ tuples in a list,
	without immediately training them into our $Q$ array. Then, at the end of an episode, we go through this list \emph{in reverse order} and update as above (see the sketch after this list).

	This has two effects:
	\begin{itemize}
	\item During an episode, nothing is learned from the actions already taken in that episode.
	\item At the end of an episode, a newly discovered path is learned completely, whereas with the first approach only the last state before already-known terrain is updated for such a path.
	\end{itemize}
\end{itemize}
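
For illustration, both strategies can be sketched in Python as follows. This is a minimal sketch, assuming the $Q$ array is indexed as \verb+Q[state][action]+; the names \verb+ALPHA+, \verb+GAMMA+ and \verb+transitions+ are illustrative and not taken from \verb+sol.py+.

\begin{verbatim}
ALPHA = 0.1   # learning rate (assumed value)
GAMMA = 0.9   # discount factor (assumed value)

def q_update(Q, oldstate, action, newstate, reward):
    # One standard Q-learning update ("old" approach).
    best_next = max(Q[newstate])
    Q[oldstate][action] += ALPHA * (
        reward + GAMMA * best_next - Q[oldstate][action])

def episodic_update(Q, transitions):
    # Replay the stored (oldstate, action, newstate, reward) tuples of
    # one episode in reverse order ("new" approach), so the reward of a
    # newly discovered path is propagated back to its start at once.
    for oldstate, action, newstate, reward in reversed(transitions):
        q_update(Q, oldstate, action, newstate, reward)
\end{verbatim}

In the ``old'' approach, \verb+q_update+ is called after every step; in the ``new'' approach, the transitions are only collected during the episode and \verb+episodic_update+ is called once at its end.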

The graphs show the total reward earned as a function of the number of episodes run. We see that the second (``new'') approach converges to an equally good training result, but reaches it faster.

\begin{itemize}
	\item new: second approach as above
	\item old: first approach as above
	\item nn2: neural network with a hidden layer of 50 neurons and a sigmoidal-symmetric-stepwise-linear activation function
	\item nn: same as nn2, but with a friendliness of $1$ instead of $0.7$
\end{itemize}

Note that the nn and nn2 runs get a reward of only $0.5$ for reaching the goal, while the old and new runs get $10$. Therefore, the nn/nn2 results have been multiplied by $20$ to make them comparable.
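The scaling factor is simply the ratio of the two goal rewards, $10 / 0.5 = 20$.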

\gnuplotloadfile[terminal=pdf]{array_naive_vs_episodic.gnuplot}

\gnuplotloadfile[terminal=pdf]{array_naive_vs_episodic_deriv.gnuplot}

Code used:
\begin{itemize}
	\item \verb+f9a9a51884aadef97b8952b2807541d31b7e9917+ for the ``new'' plots
	\item the same code, but with line 254 (\verb+self.flush_learnbuffer() # TODO TRYME+) enabled, for the ``old'' plots
	\item \verb+2483f393d9a3740b35606ca6acb6cb2df8ffdcd2+ for the nn/nn2 plots
\end{itemize}
Use \verb+sol.py+ or the \verb+test.sh+ script.

\end{document}