\documentclass[ngerman]{scrartcl} % loads the document class
\usepackage{gnuplottex}
\usepackage{amsmath}

\begin{document}

\section{Naive vs.\ episodic array learning}

With the \verb+QArray+ function approximator, two different training strategies can be used:
\begin{itemize}
  \item Immediately after a new $(\text{oldstate}, \text{action}, \text{newstate}, \text{reward})$ tuple has been observed, update the $Q$ array by
  $$Q_{\text{oldstate},\text{action}} \leftarrow Q_{\text{oldstate},\text{action}} + \alpha \cdot \left(\text{reward} + \gamma \max_{a'} Q_{\text{newstate},a'} - Q_{\text{oldstate},\text{action}}\right)$$
  This is standard textbook reinforcement learning and is referred to as the ``old'' approach.
  \item Alternatively, we can store the $(\text{oldstate}, \text{action}, \text{newstate}, \text{reward})$ tuples in a buffer without immediately training them into our array. At the end of an episode, we then go through this buffer \emph{in reverse order} and update as above (a short code sketch of this variant is given at the end of this section). This has two effects:
  \begin{itemize}
    \item During an episode, nothing is learned from the actions taken earlier in that episode.
    \item At the end of an episode, a newly discovered path is learned along its entire length, whereas with the approach above only the last state before already known terrain is updated, because the reward has not yet propagated backwards through the $Q$ array.
  \end{itemize}
\end{itemize}

The graphs show the total reward earned as a function of the number of episodes run. We see that the second (``new'') approach converges to an equally good training result, but reaches it faster.

\gnuplotloadfile[terminal=pdf]{array_naive_vs_episodic.gnuplot}
\gnuplotloadfile[terminal=pdf]{array_naive_vs_episodic_deriv.gnuplot}

Code used: \verb+f9a9a51884aadef97b8952b2807541d31b7e9917+ for the ``new'' plots, and the same code with line 254 (\verb+self.flush_learnbuffer() # TODO TRYME+) enabled for the ``old'' plots. Run \verb+sol.py+ directly, or use the \verb+test.sh+ script.
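To make the difference between the two strategies concrete, the following Python sketch shows both update schemes side by side. It is only an illustration, not the code behind the plots (that is the commit referenced above): the \verb+QTable+ class, the environment interface (\verb+env.reset()+, \verb+env.step()+) and the constants $\alpha$ and $\gamma$ are assumptions made for this example.

\begin{verbatim}
import random
from collections import defaultdict

ALPHA = 0.1   # learning rate (assumed value)
GAMMA = 0.95  # discount factor (assumed value)

class QTable:
    """Minimal stand-in for the QArray approximator: q[(state, action)] -> value."""
    def __init__(self, actions):
        self.actions = actions
        self.q = defaultdict(float)

    def best_value(self, state):
        return max(self.q[(state, a)] for a in self.actions)

    def update(self, oldstate, action, newstate, reward):
        # One Q-learning step; identical for both strategies.
        # (Special handling of terminal states is omitted for brevity.)
        target = reward + GAMMA * self.best_value(newstate)
        self.q[(oldstate, action)] += ALPHA * (target - self.q[(oldstate, action)])

def run_episode_naive(env, table):
    """'Old' strategy: update immediately after every transition."""
    state = env.reset()
    done = False
    while not done:
        action = random.choice(table.actions)         # exploration policy omitted
        newstate, reward, done = env.step(action)     # assumed environment interface
        table.update(state, action, newstate, reward) # learn right away
        state = newstate

def run_episode_episodic(env, table):
    """'New' strategy: buffer the transitions, replay them in reverse at episode end."""
    state = env.reset()
    done = False
    learn_buffer = []
    while not done:
        action = random.choice(table.actions)
        newstate, reward, done = env.step(action)
        learn_buffer.append((state, action, newstate, reward))
        state = newstate
    # Replaying in reverse order lets the terminal reward propagate through
    # the whole newly discovered path within a single episode.
    for oldstate, action, newstate, reward in reversed(learn_buffer):
        table.update(oldstate, action, newstate, reward)
\end{verbatim}

The only difference between the two episode loops is \emph{when} \verb+update()+ is called; replaying the buffer in reverse order is what allows the terminal reward to propagate along the entire newly discovered path at the end of a single episode.

\end{document}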