Challenges in Reward Design for Reinforcement Learning-based Traffic Signal Control: An Investigation using a CO2 Emission Objective

Abstract: Deep Reinforcement Learning (DRL) is a promising data-driven approach for traffic signal control, especially because DRL can learn to adapt to varying traffic demands. For that, DRL agents maximize a scalar reward by interacting with an environment. However, one needs to formulate a suitable reward, aligning agent behavior and user objectives, which is an open research problem. We investigate this problem in the context of traffic signal control with the objective of minimizing CO2 emissions at intersections. Because CO2 emissions can be affected by multiple factors outside the agent's control, it is unclear if an emission-based metric works well as a reward, or if a proxy reward is needed. To obtain a suitable reward, we evaluate various rewards and combinations of rewards. For each reward, we train a Deep Q-Network (DQN) on homogeneous and heterogeneous traffic scenarios. We use the SUMO (Simulation of Urban MObility) simulator and its default emission model to monitor the agent's performance on the specified rewards and on CO2 emissions. Our experiments show that a CO2 emission-based reward is inefficient for training a DQN, that the agent's performance is sensitive to variations in the parameters of combined rewards, and that some reward formulations do not work equally well in different scenarios. Based on these results, we identify desirable reward properties that have implications for reward design in reinforcement learning-based traffic signal control.


Introduction
Deep reinforcement learning (DRL) is a data-driven approach that holds promise for improving traffic signal control (TSC), because DRL can learn to adapt to changing traffic demands [1]- [3]. To achieve this, a DRL agent interacts with its environment and learns to take actions that maximize a cumulative scalar reward. By doing so, the agent can optimize the flow of traffic and improve the overall efficiency of the system.
In TSC, the actions correspond to changes in traffic lights and rewards correspond to traffic flow metrics (e.g., average vehicle speed, braking accelerations, and queuing lengths at intersections). However, in real-world applications of DRL, the agent's reward should also reflect the users' goals, which in TSC could be to minimize traffic delays [4], [5] and CO2 emissions [6], [7]. Nonetheless, it is not obvious how to select reward formulations that are also effective in satisfying users' goals. This is an open and challenging research problem known as the "agent alignment problem". The optimization goal of minimizing travel time in the context of TSC is challenging due to the influence of various external factors, such as free-flow speed and current congestion level, which are beyond the agent's immediate control [8]. While this makes travel time an ineffective reward in practice [4], it is also not obvious which traffic flow metrics are guaranteed to be effective for this goal. Several studies combine traffic metrics as rewards for DRL agents [4], [5], [9]. Similarly, there is a growing body of research on training DRL agents to minimize pollutant emissions in TSC [6], [7]. However, these DRL-based approaches provide limited insight into how the convergence curves of traffic metrics behave relative to CO2 emissions during the training of DRL agents. This information is important for designing reward functions that are effectively aligned with the users' goals.
We investigate how to bridge this knowledge gap by performing a systematic study of the reward design space, which comprises single-metric rewards, combined-metric rewards, and their corresponding parameterizations (weights in a linear function). For each candidate reward, we train a Deep Q-Network (DQN) [10] on two traffic scenarios, one with homogeneous traffic and one with heterogeneous traffic.
To evaluate the various reward model formulations, we adopt the SUMO (Simulation of Urban MObility) simulator and its default emission model (Handbook Emission Factors for Road Transportation, HBEFA 3.1) [11]. Our evaluation consists of measurements of convergence curves of the agent's reward and the corresponding CO2 emissions, producing the following results:
1. a CO2 emission-based reward is inefficient for training a DQN agent;
2. only a few single-metric rewards were capable of minimizing CO2 emissions;
3. metrics that individually did not produce effective reward formulations were, when combined, successful in minimizing CO2 emissions;
4. even when there exists an effective instance of a combined reward (e.g., a combination of queue and brake), there are still variations (i.e., from different parameterizations) of those same traffic flow metrics that produce ineffective rewards.
These results hold under both homogeneous and heterogeneous traffic flow scenarios. Based on these results, we make two contributions in the form of systematic analyses.
1. Property-based analysis of convergence curves. This analysis generates explanations for the cases of insufficient alignment between the agent's reward model and the CO2 emission goal. The explanations consist of a paradigmatic classification of the reward models through orthogonal categories defined by two properties: informativeness captures how well the agent approximates the given proxy reward, and expressiveness reflects how strongly episode rewards correlate with episode CO2 emission levels.
2. Sensitivity analysis of the challenges of aligning combined reward models with CO2 emission goals. This analysis shows that alignment has two levels of sensitivity: the choice of traffic flow metrics, and the parameterization of these metrics in a linear reward formulation.
The remainder of the paper is organized as follows. In Section 2, we present the problem of agent alignment and its impact on TSC and emissions. We contextualize our work in relation to DRL for TSC, and for minimization of pollutant emissions (Section 3).
The approach and experimental setup is detailed in Section 4, while the corresponding results are presented in Section 5. The analyses of these results in terms of contributions, implications, and threats to validity are discussed in Section 6. Finally, we offer our conclusions and ideas for future work in Section 7.

Foundations
Deep reinforcement learning (DRL) is a popular approach that combines deep neural networks with reinforcement learning to enable agents to learn optimal behavior in complex environments. However, ensuring that the goals of the system align with the goals of the user is a critical challenge in DRL systems. Misaligned goals can result in unintended and potentially harmful outcomes that undermine the users' goals. This section examines the challenges that make alignment difficult in DRL systems and describes how reward models align with user goals.

The Agent Alignment Problem
The AI alignment problem [12] consists of finding ways to ensure that, quoting [13]: "... these [machine learning] models capture our norms and values, understand what we mean or intend, and, above all, do what we want". In other words, it involves matching agent rewards and users' goals regarding behavior [14], intent [15], incentives [16], inner and outer alignment [17], and instruction alignment [18]. Behavior alignment consists of producing predictions for given inputs, whereas intent alignment looks at more general specifications that cover different desired behaviors. Incentive alignment studies how rewards induce desired behaviors, whereas inner and outer alignment deal with partitioning the alignment into scopes that present specific dynamics. Instruction alignment consists of communicating human intent as a sequence of instructions that must be learned. These various definitions of alignment make specification, measurement, and evaluation challenging.
Therefore, a more pragmatic approach is to look at the failure of the agent to align with the user's goals (misalignment). Misalignment can have unintended consequences that are counterproductive (optimizing against the users' goals), futile (having no effect on users' goals), or that simply jeopardize users' goals (suboptimal behavior). Additionally, misalignment in DRL can increase the chances of reward hacking [19], [20]. For instance, in a boat racing game, an agent maximized its reward by indefinitely hitting a nearby target without ever finishing the race [21], violating what the user intended.
One can argue for a proper definition of the user's goal and how it should be reflected in the reward model; however, this is still challenging, as evident in the many recent AI failure cases reported in the "Artificial Intelligence Incident Database" 1 . In other words, there is no perfect alignment [15]. Instead, one needs to specify models that satisfy the conditions of being sufficiently meaningful and precise to steer the process of achieving user goals (e.g., reducing CO2 emissions) by optimizing traffic flow metrics. For that, one needs a systematic way to evaluate how reward models align with user goals. Our approach, presented in this paper, is to partition the alignment specification problem into two metrics that allow one to express a meaningful goal and to inform precisely enough how this goal can be achieved.

Alignment Challenges
Partial observability in the form of hidden states (inherent to DRL environments) makes alignment more difficult to achieve by preventing the agent from observing all the effects of its actions (in particular the delayed ones). The hidden states can result both from misspecified (wrong) [22] and underspecified (incomplete) [23] models. In the context of DRL, wrong or incomplete models can cause the agent to show good convergence curves at training time, but present unexpected behaviors after deployment. This can have consequences for the safety and cost of applications like autonomous vehicles and robotics.
Delayed and stochastic effects of actions are challenges when performing credit assignment, i.e., determining how each action contributed to achieving the users' goal. While delay and stochasticity cannot be eliminated, as they are properties of the environment, one can have reward models that are less sensitive to these factors. In the case of emissions, one can compare how different traffic flow metrics (e.g., average speed versus queue length) relate to changes in CO2 emissions.

Deep Q-Network
Q-Learning is a popular reinforcement learning algorithm that helps agents make decisions based on rewards in their environment. It involves estimating the action-value function, which maps a state and action to the expected future rewards. In tabular Q-learning, the action-value function is represented as a table, but this becomes impractical for large or continuous state and action spaces [24]. Function approximation can solve this problem by representing the function using a neural network or another approximator.
Neural Fitted Q-Iteration (NFQ) [25] is an extension of tabular Q-learning with function approximation, improving scalability to large state-action spaces. However, NFQ uses a fixed dataset; thus, it is susceptible to overfitting on the training data. To mitigate this problem, Deep Q-Network (DQN) was introduced [10]. DQN builds on NFQ and introduces two key components: the experience replay buffer and the target network. The replay buffer stores the agent's experiences, which can be retrieved for updating the Q-value estimates. The target network is used to set the TD targets, which are calculated based on the immediate reward and discounted future returns. Finally, our choice of DQN was based on its simplicity (off-policy and model-free), as it allows us to establish a comparison baseline for more sophisticated approaches such as Proximal Policy Optimization (PPO) [26].
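To make the two key components concrete, the sketch below shows a minimal, hypothetical DQN core: the replay buffer feeds mini-batch updates, and TD targets come from a periodically refreshed target network. A linear Q-approximator stands in for the MLP used later in the paper; all names and defaults are illustrative assumptions, not the paper's implementation.

```python
import random
from collections import deque

import numpy as np

class MinimalDQN:
    """Sketch of DQN's two key components: the experience replay buffer
    and the periodically updated target network."""

    def __init__(self, n_features, n_actions, buffer_size=2000,
                 gamma=0.99, lr=1e-3, target_update=100):
        self.gamma, self.lr, self.target_update = gamma, lr, target_update
        self.w = np.zeros((n_actions, n_features))   # online (linear) network
        self.w_target = self.w.copy()                # frozen target copy
        self.buffer = deque(maxlen=buffer_size)      # experience replay buffer
        self.step = 0

    def store(self, s, a, r, s_next, done):
        """Record one transition for later (off-policy) reuse."""
        self.buffer.append((s, a, r, s_next, done))

    def train_step(self, batch_size=32):
        """One mini-batch update; TD targets come from the target network."""
        if len(self.buffer) < batch_size:
            return
        for s, a, r, s_next, done in random.sample(list(self.buffer), batch_size):
            q_next = 0.0 if done else np.max(self.w_target @ s_next)
            td_error = r + self.gamma * q_next - self.w[a] @ s
            self.w[a] += self.lr * td_error * s      # gradient step on online net
        self.step += 1
        if self.step % self.target_update == 0:
            self.w_target = self.w.copy()            # periodic hard update
```

Decoupling the target network from the online weights is what stabilizes the TD targets; in the sketch this is visible in `train_step`, which reads `w_target` but only writes `w`.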

The State of the Art
This section introduces the topic of reward modeling in deep reinforcement learning (DRL) and its application to traffic signal control (TSC).

Reward Modeling
Reward modeling consists of learning to achieve specific user goals without requiring human feedback [14]. It has become a popular approach that precludes manually solving the credit assignment problem (e.g., via reward shaping [27]). However, because designed rewards can still be tampered with by a learning agent [19], one still has to evaluate how alignment is achieved via reward modeling. This gives rise to the Optimal Reward Problem (ORP) [28], which aims to reduce the alignment problem to a reward modeling problem. This might involve defining intrinsic or extrinsic rewards [29]. The intrinsic reward constrains the agent in how it can learn, whereas the extrinsic reward instruments the user's goal by steering the agent in what it can learn 2 . We translate these intuitions respectively into two convergence properties named informativeness and expressiveness (formalized in Section 6.1).

Reward Models for Traffic Signal Control
Minimizing travel time is the main goal of a TSC policy. However, because travel time is affected by a multitude of factors and by actions with delayed effects [8], traffic engineers rely on proxy reward metrics, like average waiting time, average intersection speed, or total braking acceleration. Accordingly, in DRL, various combinations of reward models have been investigated: queue length and delay in [5], queue length and pressure in [4], stop time, average speed, and time lost in [9], and many others (see Table 5 in [8]). We extend this family of work by combining more metrics (vehicle speed, brake acceleration) and evaluating their impact on CO2 emissions.

Pollutant Emissions in Traffic
Traditionally, the first solutions comprised non-DRL control (both with SUMO [31], [32] and with other simulators [33]-[35]). More recently, DRL-based TSC approaches to minimize pollutant emissions have been investigated [6], [7]. However, these DRL-based approaches provide limited understanding of the relationship between metrics for emissions and traffic flow, in particular regarding how the convergence curves of metrics behave during the training of DRL agents. Without a proper understanding of this relationship, one is hindered in the task of reward modeling for aligning the agent's reward with CO2 emission goals in TSC. Therefore, to bridge this gap, we investigated various formulations of a linear reward function based on traffic flow metrics, and computed the corresponding CO2 emissions using SUMO's default emission model from the Handbook Emission Factors for Road Transportation (HBEFA 3.1) [11].

Deep Reinforcement Learning for Traffic
The specification of the DRL approach goes beyond the choice of reward function: one needs to choose an algorithm and how to model the state space. Among the many DRL algorithms that have been adopted [8], the DQN [10] algorithm has been one of the most popular choices (Table 1 in [3]). The adoption of DQN for TSC stems from its relative simplicity of having discrete actions, while still providing good convergence behavior [36].
Concerning the state-space, the traffic environment has been modeled at various levels of resolution, from coarse (flow) to fine (vehicle speed and position) [8], resulting in tabular discretized metrics [37], and image representations [36]. We opted for a lane segment level resolution and discretized metrics because studies could not show better results when adopting higher resolution [38] or more complex state representations [5].

Methodology
In this section, we outline the methods used to study CO2 emissions produced at signalized intersections. Our approach builds on the principles of reinforcement learning, where an agent learns to make decisions based on the interaction with its environment. We outline the traffic simulation scenario in Section 4.1 and formulate the reinforcement learning task in Section 4.2, defining the states, actions, and rewards used by the agent. Finally, in Section 4.3, we provide a detailed description of the experimental setup, including the neural network architecture, hyperparameters, and the setup of the traffic environment used to train and evaluate the DQN algorithm.

Traffic Scenarios
We propose a scenario that comprises a controlled intersection (shown in Fig. 1), featuring two incoming and two outgoing lanes. The intersection allows two types of phases: either green or yellow in the north-south direction (NSG, NSY), or green or yellow in the east-west direction (WEG, WEY). In both cases, the orthogonal direction is set to red. Fig. 1 illustrates the intersection in the NSG phase. We combine this infrastructure with traffic flows as shown in Fig. 2, consisting of two types: a time-varying Bernoulli distribution, and a traffic flow that remains constant throughout the simulation. At each second and on each road (north-south, west-east, etc.), a car is released into the simulation with probability p. Each traffic demand combined with the signalized intersection infrastructure gives rise to one scenario: a heterogeneous traffic scenario (using the time-varying demand) and a homogeneous traffic scenario (using the fixed demand).
For the heterogeneous traffic scenario, depicted in blue in Fig. 2, we deliberately chose a peak traffic volume of p = 0.25, the maximum probability of releasing a car. This level of peak traffic makes the scenario challenging, as it exceeds the maximum intersection throughput and causes temporary congestion. In contrast, the homogeneous traffic flow, depicted in red in Fig. 2, has a fixed probability of releasing a car of p = 0.2. This value represents the maximum intersection throughput, ensuring that the flow remains steady throughout the simulation.
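The two demand types can be sketched as per-second Bernoulli release processes. The exact shape of the time-varying profile in Fig. 2 is not specified here, so the triangular ramp below is only an illustrative assumption; the function and parameter names are ours.

```python
import random

def release_vehicle(p, rng=random):
    """Bernoulli trial: a vehicle enters a road this second with probability p."""
    return rng.random() < p

def homogeneous_demand(t, p=0.2):
    """Fixed release probability, matching the maximum intersection throughput."""
    return p

def heterogeneous_demand(t, horizon=3600, p_peak=0.25):
    """Illustrative time-varying profile that ramps up to p_peak at the middle
    of the simulation and back down (the actual shape in Fig. 2 may differ)."""
    return p_peak * (1.0 - abs(2.0 * t / horizon - 1.0))
```

Calling `release_vehicle(heterogeneous_demand(t))` once per second and per road would then reproduce a time-varying Bernoulli arrival process of the kind described above.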

The Reinforcement Learning Task
Traffic signals play a critical role in ensuring safe and efficient traffic flow at intersections. Fixed pre-timed controllers are often insufficient for optimizing traffic flow, as traffic volume and driving behavior vary widely. Adaptive traffic signal control (ATSC) provides a solution: it uses electrical sensors and sets signals based on the sensed data, adapting to the current traffic situation. One of the simplest methods to achieve ATSC is actuated signal control, which triggers a specific signal based on sensory data gathered around the intersection. Reinforcement learning (RL) is a possible way to obtain a program for ATSC. The output of the RL algorithm, the agent's policy, becomes the desired ATSC program, which works fully automated and can be scaled. ATSC with DRL has achieved outstanding results, outperforming conventional methods in many situations. The agent repeatedly collects state information, acts, and updates its policy with a scalar reward, while being trained in safe or simulated environments. For the remainder of this section, we will assume the environment described in Section 4.1 and specify the components of the reinforcement learning problem: the states, actions, and rewards of the agent.
The agent's state or observation is a representation of the environment that the agent perceives at any given time, including relevant information that the agent can use to take actions that maximize its rewards. In the case of traffic signal control with reinforcement learning, the DTSE (Discrete Traffic State Encodings) state [39] is a commonly used representation that consists of two 2D matrices. The first matrix is a binary position matrix that encodes the presence or absence of a vehicle at each intersection, as depicted in Fig. 3 (b). The second matrix is a normalized velocity matrix that tracks the average speed of the vehicles on a given segment, as depicted in Fig. 3 (c). . Image source: [39] Our approach uses DTSE representations, which capture the position and speed of vehicles -key factors in determining CO 2 emissions. This allows the agent to make informed decisions about when to change traffic lights to achieve the goal of our RL task, which is to minimize CO 2 emissions.
The agent's actions are determined by the traffic scenario, as described previously. That is, the agent takes action every ∆t (in seconds) and chooses from the set of allowed phases A = {N SG, W EG}. Additionally, on each phase change, a yellow transition phase (N SY or W EY ) is induced to ensure safety. In contrast to our approach, the agent could also cycle through a pre-defined sequence or operate in non-fixed intervals. However, using a fixed action interval with a set of allowed phases provides a balance between flexibility and difficulty, as non-fixed intervals make the problem harder, and a pre-defined sequence limits the agent's options.
The agent's reward is composed of one or multiple of the following average traffic metrics, aggregated over all lanes: queuing length (queue reward), vehicle speed (speed reward), braking acceleration (brake reward), and CO 2 emission rates (emission reward). Additionally, we provide linear combinations of average queuing length and braking acceleration (queue+brake reward) as well as queuing length and speed metrics (queue+speed reward).

Experimental Setup
Each experiment uses one of the intersection scenarios described in Section 4.1, with either heterogeneous or homogeneous traffic. Each training run uses simulations that last for 3600 seconds (simulation time), and the agent interacts in intervals of ∆t = 5s, resulting in 720 steps t = 1, . . . , 720 per episode. At episode termination, the simulation is reset, and the agent continues training. For a phase switch, we selected a yellow time of t_yellow = 2s. The agents observe DTSE features with speed and position information. To compute DTSE features, we split each road into 30 segments (each of length c ≈ 8.33 m). Table 1 summarizes this general setup.
The DQN agent uses a Multi-Layer Perceptron (MLP) with two hidden layers as the neural network, each containing 64 neurons, and a linear output layer with four neurons (one for each action). We use the Adam optimizer [40] for mini-batch gradient descent, with a batch size of 64 and an initial learning rate of α = 1e-4. To explore the environment, the agent begins with 100% exploration (ϵ = 1) and linearly decreases exploration to 10% over the first third of training. The replay buffer holds up to 2000 samples, and learning begins after the first episode (720 steps of initial experience). The target network is updated every C = 10000 steps, and the agent's discount factor is γ = 0.99, which captures long-term rewards. Hyperparameters and training setup are summarized in the second section of Table 1.
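The exploration schedule described above can be sketched as a simple linear decay; the function and parameter names are ours.

```python
def epsilon(step, total_steps, eps_start=1.0, eps_end=0.1, decay_frac=1/3):
    """Linear epsilon decay: from eps_start down to eps_end over the first
    decay_frac of training, then held constant at eps_end."""
    decay_steps = total_steps * decay_frac
    if step >= decay_steps:
        return eps_end
    return eps_start + (eps_end - eps_start) * (step / decay_steps)
```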

Results
This section is organized into three parts. In Section 5.1, we evaluate the suitability of CO2 emission rates as a reward. In Section 5.2, we compare the performance of agents trained on proxy rewards to those trained on a CO2 reward. Finally, in Section 5.3, we examine how different combinations of reward parameters impact the agent's alignment.

CO2 Emissions as Reward
In Fig. 4 we show the performance of two DQN agents: one agent was trained on a speed reward, and the other agent was trained on the CO2 emission reward. The solid line represents the median episode emission rate in g/h, and the shaded area shows the 95% confidence intervals. Our results demonstrate that while the agent trained on the CO2 emission reward does improve in the first episode of training, it converges to a higher emission rate than the agent trained on the speed reward, and does not show any further improvement over time.
These findings suggest that training with the CO2 reward leads to suboptimal behavior, as the agent is constrained in maximizing this reward and fails to learn an effective policy for minimizing CO2 emissions. In contrast, the agent trained on the speed reward is able to converge to a better policy for emission minimization, ultimately achieving a lower emission rate.

Proxy Rewards for CO2 Minimization
To explore alternative approaches to incentivizing emission reduction, we investigate the use of various proxy rewards in this section. Specifically, we analyze the performance of DQN agents trained on rewards based on queue lengths, braking accelerations, average speed, CO2 emissions, a combination of queue length and braking acceleration, and a combination of queue length and average speed.
We present the results of our experiments in Fig. 5. This figure summarizes the performance of each agent on the different reward models, with each subplot showing the CO2 emission rates (red line), the proxy reward (green line), and the maximum observed proxy reward (dotted yellow line). In addition, the shaded areas in each subplot represent 95% confidence intervals for the emission rates and rewards.
Based on the results presented in Fig. 5, we observe that the DQN agent trained on the CO2 emission reward converged to a suboptimal policy after one episode, resulting in comparatively high emission levels. Similar behavior was observed for the DQN agent trained on the queue reward, which achieved a reduction in CO2 emissions, but at suboptimal levels. The agent trained on the brake reward had a positive correlation between CO2 emissions and the episode reward, leading to no reduction in CO2 emissions.
Good emission performance was achieved by agents using a speed reward and a combined queue and brake reward, denoted as the queue+brake reward. The DQN agent trained on the speed reward achieved a relatively low CO2 emission rate, while also achieving the highest speed reward among all agents. The DQN agent trained on the queue+brake reward achieved the lowest CO2 emission levels overall, showing a negative correlation with CO2 emissions (see Table 2).
Overall, the queue+brake reward was the most effective in reducing CO2 emissions, while the speed reward was effective in achieving a relatively low CO2 emission rate and a high speed reward. Conversely, the emission and queue rewards resulted in suboptimal emission levels.
We calculated the degree of association between the episode CO2 emissions and the episode rewards as a measure of agent alignment (see Table 2). For that, we adopted the Kendall-tau rank correlation coefficient 3 .
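For reference, the Kendall-tau coefficient can be computed from concordant and discordant pairs as sketched below. This is the tau-a form without tie correction; the paper may use a tie-corrected variant (e.g., tau-b).

```python
def kendall_tau(x, y):
    """Kendall rank correlation (tau-a: no tie correction) between two
    equal-length sequences, e.g. episode rewards vs. episode CO2 emissions."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1   # pair ordered the same way in both sequences
            elif s < 0:
                discordant += 1   # pair ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A value near -1 (rewards rise as emissions fall) indicates the negative association that an emission-minimizing reward should exhibit.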

Sensitivity to Reward Parameters
Figure 5. CO2 emission rates in g/h (red) and absolute episode emission reward (green) of DQN agents over training time. The solid lines depict the median values, while the shaded areas depict 95% confidence intervals. Each reward is compared to a "best reward" (yellow) that corresponds to the highest value of this reward observed among all agents (trained with various rewards).

In this experiment, we explored the impact of reward parameter combinations on the performance of a DQN agent in managing traffic flow with the aim of minimizing CO2 emissions. We varied the ratio of queue and brake in one combined reward, and of queue and speed in another. For each combination of metrics, we trained six DQN agents for 21800 steps and evaluated their performance in both heterogeneous and homogeneous traffic scenarios.
In Fig. 6 we show the average episode CO2 emissions in g/h (y-axis) and the weightings of the queue and brake rewards (x-axis). We observed that a combination of both the queue and brake rewards was necessary to achieve the lowest CO2 emissions.
Interestingly, we found that the combination ratio of (queue, brake) = (0.5, 0.5) provided the best performance across both traffic scenarios. Additionally, we observed that combinations close to (queue, brake) = (1.0, 0) or (0, 1.0) demonstrated performance similar to these single-metric extremes. This suggests that the agent focuses on only one reward component, which does not lead to the best outcome. For the combined queue+speed reward, we would expect to see a similar trend in terms of the combination of weights required to achieve good performance. However, as shown in Fig. 7, we observed that the weight of the queue metric must be zero (or close to zero) to achieve good performance.
Overall, our findings highlight the importance of reward parameter selection in training agents to optimize traffic flow and minimize CO2 emissions.

Discussion
Next we discuss the results in terms of general properties for reward models, implications for modeling, and the threats to the validity of our results.

Informativeness and Expressiveness
The two convergence curves shown in Section 5 correspond to: (1) how well the reward model informs the agent towards achieving the user's goal (CO2 minimization), and (2) how well the reward model expresses the behavior of the emissions. We call these two properties of the reward model informativeness and expressiveness. In other words, if the agent fails to converge to the optimal reward, we deem the reward model uninformative (see the queue and emission rewards in Fig. 5). Meanwhile, if the agent optimizes in the wrong direction, in our case a positive correlation between reward and emissions (see the brake reward in Table 2), then the reward model is not expressive.
These two properties are important because together they indicate whether the agent is sufficiently aligned with the user's goals (minimizing CO2 emissions). The judgment of sufficient alignment depends on how informative and expressive a reward model is. This is challenging because informativeness and expressiveness are continuous metrics based, respectively, on measures of distance (from the optimum) and correlation (between reward and goal). Therefore, for the purpose of illustration and discussion, we assumed two arbitrary thresholds, which we introduce next.

Informativeness.
A reward model (R_mod) is informative (I(R_mod) = 1) if the distance between the reward at convergence (R_con) and the optimal reward (R_opt) is smaller than δ. Formally,

I(R_mod) = 1 if |R_con − R_opt| < δ, and I(R_mod) = 0 otherwise,   (1)

where R_con is the episode reward at convergence and R_opt is the R_con of the best-performing agent with respect to that reward.

Expressiveness.
A reward model (R_mod) is expressive (E(R_mod) = 1) if the correlation (Corr) between the sequence of the agent's episode rewards (R) and the corresponding episode CO2 emissions (G) has a certain direction (positive or negative) and its magnitude is above a certain strength (ρ). The correlation should be negative (∈ [−1, ρ]) if the user's goal G has to be minimized, and positive (∈ [ρ, 1]) otherwise.
We formalize E for the case where G has to be minimized:

E(R_mod) = 1 if Corr(R, G) ∈ [−1, ρ], and E(R_mod) = 0 otherwise,   (2)

where the threshold ρ depends on the use case. For the purpose of illustration and discussion, we set |ρ| ≥ 0.30, which corresponds to at least a medium-strength correlation [42], and require the correlation to be negative (as emissions are to be minimized); hence, the threshold becomes Corr(·) ∈ [−1, −0.30].
Applying these formulas (Eq. (1) and Eq. (2)) as threshold criteria for classification, we populated a Venn diagram (Fig. 8) with the results from Section 5. The intersection area shows the reward models that are both expressive and informative; hence, they are considered to be sufficiently aligned with the users' goal (minimizing CO2 emissions). Only the brake reward is considered not expressive, whereas the queue, emission, and queue+speed rewards are considered non-informative. Next, we discuss the implications of this classification.
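The two threshold criteria of Eq. (1) and Eq. (2) can be sketched as simple predicates; the function names and the default rho value are placeholders reflecting the illustrative thresholds chosen above.

```python
def informative(r_con, r_opt, delta):
    """Eq. (1): a reward model is informative if its reward at convergence
    lies within delta of the best observed reward for that metric."""
    return abs(r_con - r_opt) < delta

def expressive(corr, rho=-0.30):
    """Eq. (2), minimization case: episode rewards must correlate
    negatively with episode CO2 emissions, with at least strength |rho|."""
    return -1.0 <= corr <= rho

def sufficiently_aligned(r_con, r_opt, delta, corr, rho=-0.30):
    """Sufficient alignment = informative AND expressive, i.e. the
    intersection area of the Venn diagram in Fig. 8."""
    return informative(r_con, r_opt, delta) and expressive(corr, rho)
```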

Implications
Independent properties. The informativeness property did not necessarily imply expressiveness, and vice versa. Therefore, one has to monitor both properties while designing reward models. This is an additional requirement that involves a careful study of the thresholds that lead to agent alignment -satisfying users' goals. Uninformativeness detection. Many of the reward models that were deemed uninformative showed a very early convergence to a local minimum, e.g., queue reward and emission reward -their green curves follow a step-like function (Fig. 5). This suggests that the reward models provided a target that was too easy to learn; in other words, the agent is overfitting to the data collected in the first epoch.
Combining metrics. The design of the reward model should therefore incorporate metrics that make learning more challenging, for instance, metrics that are less correlated with emissions (lower expressiveness). This might explain why the combination of brake (low expressiveness) and queue (low informativeness) produced a sufficiently aligned agent that minimizes CO2 emissions. Looking at the convergence curve of the proxy reward (green curve in Fig. 5), we can see diminishing returns over time, which suggests an increasing degree of difficulty for the agent to learn better policies as training progresses. In other words, relative to the queue reward model, adding the brake metric made the learning more difficult. Conversely, adding the queue metric to the brake reward model provided the expressiveness that was missing.
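A minimal sketch of such a combined reward, assuming a queue-length metric and a count of braking vehicles are available from the simulator state (the function name, sign convention, and default weights are our own choices):

```python
def combined_reward(queue_length, num_braking, w_queue=0.5, w_brake=0.5):
    """Weighted proxy reward combining a queue metric (informative)
    with a brake metric (expressive). Both metrics are penalties,
    so the reward is their negated weighted sum. The weights are
    illustrative; their effect is what the sensitivity analysis
    (Fig. 6) explores."""
    return -(w_queue * queue_length + w_brake * num_braking)
```

For instance, with equal weights, a state with 10 queued and 4 braking vehicles yields a reward of -7.0.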
Complementary properties. However, looking for complementary properties is not enough. Take, for instance, the speed and queue metrics. Although the speed reward model is complementary to the queue reward model regarding informativeness and expressiveness (the speed reward has a higher correlation with emissions than the queue reward, see Table 2), the queue+speed combination did not produce an aligned agent. As we can see in Fig. 8, queue+speed is categorized as expressive, but not informative. This is confirmed by the sensitivity analysis of the parameter weights for the queue+speed combined reward (see Fig. 7).
Reward parameterization. Choosing the right traffic metrics to combine is not enough: one still has to decide on the weight each metric should have in the reward model. While we showed an optimum region for the queue-brake reward (see Fig. 6), there is no guarantee that combinations of other metrics present the same kind of global optimum. This makes it important to design methods that systematically and efficiently search for the optimal parameterization. The shape of this parameterization space determines how informative and expressive a reward model must be to be considered aligned with the users' goal. Because a search in this space can be seen as a balance between exploitation (following an informative signal) and exploration (expressing desired behavior), one has to decide how to measure these properties. We note that assuming these properties keep uniform values during training is not realistic.
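One naive way to explore this parameterization space is an exhaustive grid over the weight pairs. In the sketch below, `evaluate_emissions` is a hypothetical stand-in for the expensive step of training an agent with a given (w_queue, w_brake) pair and measuring the resulting episode CO2 emissions:

```python
import itertools

def grid_search_weights(evaluate_emissions,
                        weights=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Exhaustively try all (w_queue, w_brake) pairs and return
    (best_emission, best_pair). `evaluate_emissions(w_q, w_b)` is a
    hypothetical callback: train + evaluate one parameterization."""
    best = None
    for w_q, w_b in itertools.product(weights, repeat=2):
        emission = evaluate_emissions(w_q, w_b)
        if best is None or emission < best[0]:
            best = (emission, (w_q, w_b))
    return best
```

With a synthetic emission surface that has a single optimum at (0.5, 0.5), the search recovers that pair; on a real agent each evaluation would be a full training run, which is exactly why more efficient search methods matter.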
Property uncertainty. Defining how expressive or how informative a reward is might require new properties, for instance, properties that evaluate the uncertainty in the learning (convergence) process. The brake reward model illustrates this case, with a reward variance larger than 10% (green curve in Fig. 5) in the second half of training. This makes it challenging to decide how many training steps to execute or when training should stop, because slightly different stopping points could produce very different policies. Ideally, an engineer would like to know the trade-off between reward model simplicity (using only the brake metric) and the risk of suboptimal rewards (high uncertainty at convergence).
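Such an uncertainty-at-convergence property could, for instance, be sketched as the coefficient of variation of the rewards over the second half of training; the function and any cutoff applied to it (e.g., flagging values above 0.10, mirroring the >10% variance of the brake reward) are illustrative assumptions:

```python
import statistics

def late_training_cv(epoch_rewards):
    """Coefficient of variation of the rewards in the second half of
    training: a simple indicator of uncertainty at convergence.
    Assumes a nonzero mean reward over that window."""
    half = epoch_rewards[len(epoch_rewards) // 2:]
    mean = statistics.fmean(half)
    return statistics.pstdev(half) / abs(mean)
```

A flat late-training curve yields 0, while a curve that keeps oscillating around its mean yields a proportionally larger value, signaling that the stopping point matters.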

Threats to Validity
Threats to validity [43], [44] act in ways that can hinder the reproducibility of the experimental results and corresponding interpretations.
Internal Validity evaluates whether the causes of the measured effects can be attributed to our experimental design decisions [45]. In our case, we chose a benchmark (the best reward across agents, dotted lines in Fig. 5) and a set of proxy reward metrics (speed, brake, queue). We computed the effects on CO2 emissions by varying the weights of the metrics in combined reward models (e.g., the X-axis in Fig. 6). When we claim that a given reward model is more or less informative or expressive, we are interpreting a measurement, i.e., the effect of a parameterization choice, that can still be confounded by what we did not control for, i.e., the metrics not included in the given reward model, which might still indirectly affect CO2 emissions. To improve internal validity, we suggest more extensive simulations with more complex scenarios, for instance, including real-world data.
External Validity discusses the situations in which the research outcomes might not generalize beyond the current experimental setup [45], comprising both the dataset and the model parameters. Concerning the dataset, we showed similar results in two distinct traffic demand scenarios. Although this might seem a straightforward mitigation of the external validity threat (Fig. 6), a recent survey [2] reported that only seven out of 21 studies evaluated their models under distinct traffic scenarios. With respect to parameterization, we showed that certain pairs of weights for the queue-brake reward produced suboptimal CO2 emissions (see the extremes of the bar chart in Fig. 6). This highlights the challenge of generalizing the combined-reward results across a range of parameter values.
Construct Validity concerns situations in which the performance indicators (thresholds) do not measure the actual concepts (constructs) [45]. This might happen because of bias in data generation, incorrect definitions, or inappropriate analysis methods (see Statistical Conclusion Validity). In our study, a mismatch between thresholds and the convergence properties (constructs) can arise through misspecification of the reward model or of the properties themselves. One example is mistakenly deeming a reward model to be informative or expressive enough when it is not. The reason for the mistake could be an inappropriate threshold or an incomplete reward model. To mitigate this, we specified convergence properties that are independent of the traffic signal control domain but can easily be instantiated by choosing classification thresholds that are meaningful to what a user considers a sufficiently aligned reward model.

Statistical Conclusion Validity concerns violations of the assumptions of the adopted statistical methods [45]. One example of a possible violation is wrongly assuming normally distributed data. As we worked with small samples of reward outcomes, we adopted a non-parametric method (Kendall's tau) to compute the correlations, which we reported with their corresponding p-values (Table 2). Regarding conclusions about the categorization within the two properties (informativeness and expressiveness), although we specified thresholds that were appropriate to discriminate among convergence curves, we did not take into account the inherent uncertainty in the convergence curves. A possible improvement is to incorporate uncertainty measurements into the convergence analysis, for instance, the reward variance at the late training stages (ideally to be minimized).

Conclusion
In the theory of bounded rationality [46], agents are bounded in their learning by the quality of the information they can access. We investigated this essential limitation in terms of the reward model, which we evaluated with respect to the agent's alignment with the users' goals. Our main result is that, for the agent to align with the goal of minimizing CO2 emissions, the corresponding reward model formulation must be both expressive and informative.

Results and Contributions.
Results. We showed that not all reward models are sufficiently aligned with users' goals (e.g., the models outside the intersection set depicted in Fig. 8). These results were reproduced in two distinct traffic scenarios. The sufficiently aligned reward models shared the characteristic of being both informative and expressive. However, the result from queue+speed indicated that, to determine if an agent is aligned, it is not sufficient to look at the properties of the single-metric rewards. Only after combining the individual traffic flow metrics into a properly parameterized reward formulation can one ultimately assess the agent's alignment (again, by evaluating its convergence properties).

Contributions.
We provided two systematic analyses: (1) a property-based classification that explains the failure of an agent to align with users' goals and (2) a sensitivity analysis that explains the challenges of aligning combined reward models with CO2 emission goals.

Future work
Towards principles for reward model selection. We showed that combining complementary metrics worked to some extent. However, some outcomes are still counterintuitive, i.e., we do not yet know how to predict good and bad combinations from the properties of single-metric rewards. This is critical, because one still has to rely on post hoc explanations (as we showed) instead of principles that prioritize reward model combinations systematically.
Reproducibility in more challenging scenarios. A natural next step is to reproduce our findings in more complex situations, for instance, by incorporating real-world data 4 into the simulations and a larger set of traffic flow metrics. Besides creating opportunities to falsify our current claims, we could explore more challenging questions, such as the effects of partial observability and confounding in reward modeling for agent alignment in TSC.
Alternative convergence property formulations. To evaluate non-linear relationships, we plan to study expressiveness in terms of mutual information or metrics like the Wasserstein distance. Concerning informativeness, we plan to look at methods that incorporate variance as a criterion for the quality of convergence.
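The mutual-information alternative mentioned above could be sketched with a simple histogram-based estimator; the estimator, bin count, and function name are our own choices, not part of the study:

```python
import numpy as np

def histogram_mutual_information(x, y, bins=8):
    """Histogram-based estimate of the mutual information (in nats)
    between episode rewards x and episode emissions y; a candidate
    nonlinear replacement for the correlation used in Eq. (2)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()            # joint distribution
    p_x = p_xy.sum(axis=1, keepdims=True) # marginal of x
    p_y = p_xy.sum(axis=0, keepdims=True) # marginal of y
    mask = p_xy > 0                       # avoid log(0)
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())
```

Unlike a correlation, this quantity is high for any deterministic relationship between rewards and emissions (linear or not) and close to zero when they are independent, though histogram estimates are biased for small samples.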

Data availability statement
To promote the reproducibility of our results, we have made our experimental setup, source code, and dataset publicly available 5 . This further facilitates the extension of our results and serves as a baseline for comparing different approaches w.r.t. reward modeling. Note that the convergence properties are independent of the choice of traffic flow metrics; therefore, they can be instantiated for TSC environments with different sensors and state-space configurations.