

Documentation of an undergraduate research project on reinforcement learning and robotics. Specifically, I designed an algorithm that can associate an immediate conditioned stimulus with a delayed natural stimulus.

Media Summary

Video introducing the research project

Conference poster (submitted to the 2021 Consortium for Computer Science in Colleges, where it received 2nd Prize in the Student Poster Presentation)


Research Process

I reviewed recent literature to learn from current findings in the field of reinforcement learning. Then, I formulated a problem statement that encapsulates the area of investigation. Finally, I designed, built, and ran an experiment that requires both a physical and a virtual environment.

Building the environments for the experiment

Documentation on the robotic arena, the design of the robotic agent, as well as the simulator can be found on GitHub


Problem Statement

In a typical reinforcement learning problem, the solution assumes that the Markov property is satisfied, that is, the future state is independent of the past states given the present state. Yet most problems of interest in the real world violate the Markov property. Observations often do not reach the agent immediately, and instant detection of certain signals may be limited by physics; both factors can delay rewards. Furthermore, techniques that maximize sample efficiency are favored when training RL agents in the real world, since this kind of training is costly compared to training in simulation.
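To make the assumption concrete, the Markov property and one simple way a delayed reward breaks it can be written as follows. This is only an illustrative sketch: the fixed delay d and the reward function r are assumed notation, not something defined in this project.

```latex
% Markov property: the next state depends only on the current state and action.
P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_0, A_0, \dots, S_t, A_t)

% Assumed example of a delayed reward with a fixed delay of d steps:
% the reward observed at time t was caused by the state and action d steps earlier,
% so it is not a function of (S_t, A_t) alone.
R_t = r(S_{t-d}, A_{t-d}), \qquad d > 0
```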

In practical applications, two methods are known to mitigate the problems of delayed rewards and sample efficiency: eligibility traces, which allow information to propagate over multiple time steps, and experience replay, where past transitions are stored and reused as training input. With experience replay, past experiences are processed out of order, which makes eligibility traces challenging to implement. This problem has been explored by several research teams in recent years. Harb and Precup [1] explored using eligibility traces with the deep recurrent Q-network (DRQN) introduced by Hausknecht and Stone [2], which requires replaying entire trajectories end to end, or at least sub-trajectories. Daley and Amato [3] proposed using offline λ-return calculation to emulate eligibility traces when using experience replay to improve learning performance. Han et al. [4] proposed a modified Q-function that takes into account signal intervals and trajectories, coupled with a History-Current (HC) decomposition prediction framework.
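As a rough illustration of how a λ-return lets a delayed reward reach earlier time steps, the sketch below computes λ-returns backward over one stored trajectory. It is a minimal, generic example and not the implementation from any of the cited papers or from this project; the function name `lambda_returns` and its parameters are hypothetical.

```python
def lambda_returns(rewards, next_values, gamma=0.99, lam=0.9):
    """Compute lambda-returns for one stored trajectory (illustrative sketch).

    rewards[t]     -- reward received after step t
    next_values[t] -- current value estimate of the state reached after step t
                      (use 0.0 when that state is terminal)
    """
    returns = [0.0] * len(rewards)
    # Seed the recursion so the last step reduces to a one-step return.
    next_return = next_values[-1] if rewards else 0.0
    # Work backward so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        blended = (1.0 - lam) * next_values[t] + lam * next_return
        returns[t] = rewards[t] + gamma * blended
        next_return = returns[t]
    return returns


# A reward that only arrives at the final step of a three-step trajectory.
delayed = [0.0, 0.0, 1.0]
no_estimates = [0.0, 0.0, 0.0]
full_trace = lambda_returns(delayed, no_estimates, gamma=0.5, lam=1.0)
one_step = lambda_returns(delayed, no_estimates, gamma=0.5, lam=0.0)
```

With λ = 1 the final reward is discounted back to every earlier step, while with λ = 0 and uninformative value estimates the earlier steps see nothing, which is the credit-assignment gap that eligibility traces are meant to close.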


The experiments in the aforementioned studies took place in simulated environments, not in the real world. In the real world, besides latency in reward signals, the learning agent also has to perform with unreliable, often noisy signals. We are interested in comparing the performance of methods designed to tackle the delayed-reward problem in the real world versus in simulation.


[1] Investigating Recurrence and Eligibility Traces in Deep Q-Networks (

[2] Deep Recurrent Q-Learning for Partially Observable MDPs (

[3] Reconciling λ-Returns with Experience Replay (

[4] Off-Policy Reinforcement Learning with Delayed Rewards (
