A New Reward System Based on Human Demonstrations for Hard Exploration Games

Tareq, Wadhah; AMASYALI, Mehmet

doi:10.32604/cmc.2022.020036

A New Reward System Based on Human Demonstrations for Hard Exploration Games

Atıf İçin Kopyala

Tareq W. Z., AMASYALI M. F.

CMC-COMPUTERS MATERIALS & CONTINUA, cilt.70, sa.2, ss.2401-2414, 2022 (SCI-Expanded)

Yayın Türü: Makale / Tam Makale
Cilt numarası: 70 Sayı: 2
Basım Tarihi: 2022
Doi Numarası: 10.32604/cmc.2022.020036
Dergi Adı: CMC-COMPUTERS MATERIALS & CONTINUA
Derginin Tarandığı İndeksler: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Aerospace Database, Communication Abstracts, Compendex, INSPEC, Metadex, zbMATH, Civil Engineering Abstracts
Sayfa Sayıları: ss.2401-2414
Anahtar Kelimeler: Deep reinforcement learning, human demonstrations, prioritized double deep q-networks, atari
Yıldız Teknik Üniversitesi Adresli: Evet

Özet

The main idea of reinforcement learning is evaluating the chosen action depending on the current reward. According to this concept, many algorithms achieved proper performance on classic Atari 2600 games. The main challenge is when the reward is sparse or missing. Such environments are complex exploration environments like Montezuma's Revenge, Pitfall, and Private Eye games. Approaches built to deal with such challenges were very demanding. This work introduced a different reward system that enables the simple classical algorithm to learn fast and achieve high performance in hard exploration environments. Moreover, we added some simple enhancements to several hyperparameters, such as the number of actions and the sam-pling ratio that helped improve performance. We include the extra reward within the human demonstrations. After that, we used Prioritized Double Deep Q-Networks (Prioritized DDQN) to learning from these demonstra-tions. Our approach enabled the Prioritized DDQN with a short learning time to finish the first level of Montezuma's Revenge game and to perform well in both Pitfall and Private Eye. We used the same games to compare our results with several baselines, such as the Rainbow and Deep Q-learning from demonstrations (DQfD) algorithm. The results showed that the new rewards system enabled Prioritized DDQN to out-perform the baselines in the hard exploration games with short learning time.