Challenges in Reward Design for Reinforcement Learning-based Traffic Signal Control: An Investigation using a CO2 Emission Objective




Traffic Signal Control, Reinforcement Learning, Reward Modeling, Pollutant Emissions


Deep Reinforcement Learning (DRL) is a promising data-driven approach for traffic signal control, especially because DRL can learn to adapt to varying traffic demands. To do so, DRL agents maximize a scalar reward by interacting with an environment. However, formulating a suitable reward that aligns agent behavior with user objectives is an open research problem. We investigate this problem in the context of traffic signal control with the objective of minimizing CO2 emissions at intersections. Because CO2 emissions can be affected by multiple factors outside the agent’s control, it is unclear whether an emission-based metric works well as a reward or whether a proxy reward is needed. To obtain a suitable reward, we evaluate various rewards and combinations of rewards. For each reward, we train a Deep Q-Network (DQN) on homogeneous and heterogeneous traffic scenarios. We use the SUMO (Simulation of Urban MObility) simulator and its default emission model to monitor the agent’s performance on the specified rewards and on CO2 emissions. Our experiments show that a CO2 emission-based reward is inefficient for training a DQN, that the agent’s performance is sensitive to variations in the parameters of combined rewards, and that some reward formulations do not work equally well in different scenarios. Based on these results, we identify desirable reward properties that have implications for reward design in reinforcement learning-based traffic signal control.
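The sensitivity to combined-reward parameters mentioned in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that a combined reward is a weighted sum of two negative-cost components; the abstract does not specify this form, and the component names, the numeric values, and the functions `combined_reward` and `preferred_phase` are hypothetical.

```python
from typing import Dict

# Hypothetical per-step measurements for two candidate signal phases.
# Each component is expressed as a negative cost, so larger is better.
# These numbers are invented for illustration, not taken from the paper.
PHASE_A = {"co2": -4.0, "waiting_time": -1.0}
PHASE_B = {"co2": -3.0, "waiting_time": -3.0}


def combined_reward(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of reward components (an assumed formulation)."""
    return sum(weights[name] * value for name, value in components.items())


def preferred_phase(w_co2: float) -> str:
    """Return the phase with the higher combined reward for a given CO2 weight."""
    weights = {"co2": w_co2, "waiting_time": 1.0 - w_co2}
    reward_a = combined_reward(PHASE_A, weights)
    reward_b = combined_reward(PHASE_B, weights)
    return "A" if reward_a > reward_b else "B"


if __name__ == "__main__":
    # Changing only the weight changes which phase the reward favors.
    for w_co2 in (0.2, 0.8):
        print(f"w_co2={w_co2}: preferred phase {preferred_phase(w_co2)}")
```

With these invented values, the preferred phase switches from A to B as the CO2 weight increases past 2/3, which shows how a small change in a weighting parameter can change the behavior a combined reward encourages.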






How to Cite

Schumacher, M., Adriano, C. M., & Giese, H. (2023). Challenges in Reward Design for Reinforcement Learning-based Traffic Signal Control: An Investigation using a CO2 Emission Objective. SUMO Conference Proceedings, 4, 131–151.
