Simulation-based origin-destination matrix reduction: a case study of Helsinki city area

Estimating travel demand in the form of an origin-destination (OD) matrix is a necessary step in city-scale simulation of vehicular mobility. However, input data on travel demand may be available only for a specific set of traffic assignment zones (TAZs). Thus, there is a need to infer the OD matrix for a region of interest (which we call the 'core' area) given the OD matrix for a larger region (the 'extended' area), which is challenging because trip counts are given only for zones of the initial region. To perform the reduction, we explicitly simulate vehicle trajectories for the extended area and supplement trip values in the core TAZs based on the trajectories recorded on the border between the core and extended areas. To keep validation results consistent between the extended and core simulations, we introduce an edge-based origin-destination assignment algorithm which preserves the properties of traffic flows on the border of the core area while keeping randomness in instantiating the core simulation. The experimental study is performed for the Helsinki city area using the Simulation of Urban MObility (SUMO) tool. Validation was performed using Digitraffic data from traffic counting stations within the city area for workdays of autumn 2018. Validation results show that the reduced OD matrix combined with the edge-based OD assignment algorithm keeps the simulated traffic counts in good agreement with results from the extended area simulation, with an average MAPE between observed and simulated traffic counts of 34%. Simulation time after reduction is 20 minutes, compared to 6 hours for the extended OD matrix.


Introduction
Data-driven models of vehicular urban mobility are widely used to develop and test strategies of future transportation, to estimate the economic, societal and environmental effects of traffic planning decisions, and to develop novel algorithms for controlling vehicles and city infrastructure. The applicability of a model for these purposes is largely determined by the extent to which it resembles existing patterns of traffic flows within the city. Thus, developing a realistic simulation of urban vehicular mobility requires collecting, preprocessing and fusing heterogeneous data sources, including data about road network layout, traffic signal controls, speed limits, types and numbers of vehicles, and travel demand during different times of day and different seasons.
The situation is complicated by the fact that the initial data required for creating and validating a model may not be directly available for the area of interest. For example, data on travel demand may be delivered by another model and may be available not for the city itself but for a larger region. This poses the problem of how to extract and compose data for training and validation of the model in a reproducible and time-efficient way.
In this study, we focus on the problem of estimating origins and destinations of vehicles for the case when the simulated area (which we call the 'core' area) is located within a larger ('extended') area, and data on travel demand are available only for this larger area. This is a typical case when travel demand is produced by an external four-step mobility model (see e.g. [1]). Given an origin-destination (OD) matrix for the extended area, we aim to solve the following two problems: (i) to infer an origin-destination matrix for the core area; (ii) to estimate origins and destinations of vehicles (starting and ending edges) for the core area.
For the first problem, a naïve approach is to use the submatrix of the extended OD matrix that corresponds to traffic assignment zones which also belong to the core area. However, this underestimates travel demand, as trips which start or end outside the core area are not accounted for. For the second problem, a basic approach is to use random assignment for the core OD matrix. The disadvantage of this approach is that traffic flows on the border of the core area become less accurate.
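The underestimation caused by the submatrix approach can be illustrated with a small Python sketch. The matrix values and the choice of core zones below are invented for illustration and are not Helmet data.

```python
# Toy extended OD matrix for 4 zones (values are illustrative only).
# Zones 0 and 1 are assumed to form the core area.
extended_od = [
    [5, 3, 2, 1],
    [4, 6, 0, 2],
    [1, 2, 7, 3],
    [2, 0, 1, 4],
]
core_zones = [0, 1]


def submatrix(od, zones):
    """Return the rows and columns of `od` restricted to `zones`."""
    return [[od[i][j] for j in zones] for i in zones]


def boundary_trips(od, zones):
    """Count trips with exactly one end inside `zones` (in-out plus out-in)."""
    inside = set(zones)
    n = len(od)
    return sum(
        od[i][j]
        for i in range(n)
        for j in range(n)
        if (i in inside) != (j in inside)
    )


naive = submatrix(extended_od, core_zones)
in_in = sum(map(sum, naive))
crossing = boundary_trips(extended_od, core_zones)
print("trips kept by the submatrix:", in_in)
print("trips crossing the core border that the submatrix drops:", crossing)
```

Trips between two outer zones that pass through the core are also missed by the submatrix, but they cannot be counted from the matrix alone because this requires knowledge of the routes.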
To tackle these problems, we use vehicle trajectories from the extended simulation to infer origins and destinations for the core simulation. Data on trajectories are used in our novel edge-wise OD assignment algorithm. This algorithm may be used to obtain an arbitrary number of instances of the core simulation with a sufficient level of variability after a single run of the extended simulation. As simulation of an extended area may be much more time-consuming than simulation of a core area, the proposed approach can reduce the overall time of the modelling workflow.
The experimental study of the method is performed on the case study of the Helsinki city area using the Simulation of Urban MObility (SUMO) tool [2]. The extended area comprises 1972 traffic assignment zones while the core area (Helsinki municipalities) has 381 zones. In this case, a single run of the extended simulation takes more than 6 hours. For the purposes of validation, we use data from traffic counting stations located within the city. In the experimental study, we show that: (i) the core simulation reproduces traffic flows of the extended simulation, (ii) edge-wise assignment provides sufficient variability of simulation instances and at the same time does not significantly influence model quality. The code implementing the proposed approach is available on GitHub [3].
The paper is organized as follows. Section 2 gives an overview of the related work. Section 3 contains formal problem statement. Section 4 describes the proposed method and its SUMO implementation. Section 5 describes the data for the experimental study, and the results are discussed in Section 6.

Related work
The problem of reduction of OD data has often been considered in the studies aiming at creation of validated models of vehicular traffic in particular urban areas. In the recent study [1], authors propose large-scale agent-based traffic microsimulation for Barcelona city. As in our study, extended area (577 traffic assignment zones) is simulated to get input data for core area (296 zones). However, instantiating of core simulation is performed by cropping the paths for extended simulation, which requires launching extended simulation each time when one needs to get new instance of core simulation. In our study, we propose edge-wise assignment to avoid multiple launches of time-consuming extended simulation. The case study of Nanjing, China, is considered in [4]. Initial values of OD matrix are supposed to be given, but they are improved by Adaptive Fine-Tuning algorithm to minimize the error between simulated SUMO results and real-world Radio Frequency Identification Data. This is an example of OD matrix calibration problem when an initial matrix is tuned to increase its correspondence to the observed urban data. In [5], the source of origin-destination matrix for traffic modelling of Köln, Germany, is Travel and Activity Patterns Simulation (TAPAS) framework which uses population information, data on points of interests within the city as well as the time use patterns. Authors report that the resulting demand still needed to be improved: TAPAS model provides demand not limited to vehicular traffic, and variability of traffic over short time scales is not realistic. To cope with that, initial OD matrix is adjusted (only trips corresponding to vehicular traffic were considered) and smoothed (random offsets were added to departure times).
Another strand of research is related to estimation of origin-destination matrix using various types of available data. When estimates of a number of travellers in different areas are provided, a gravity model [6] is commonly used. For instance, in [7], the gravity model is applied to infer OD matrix for Bogor city. For the case when data from urban sensors are available (such as link counts, flows and travel times), more sophisticated methods of data fusion are usually applied. In [8], time-dependent demand is estimated via Bayesian approach when the posterior distribution of OD matrices is updated based on traffic counts data. Traffic counts are also used for OD matrix estimation in [9], when an iterative bilevel framework is proposed to minimize the deviation between estimated and real-time link counts. [10] presents combination of approaches: initial OD matrix is estimated by gravity model, and after that the process of dynamic OD matrices' estimation is performed. The latter includes using a macroscopic traffic simulator to model the traffic flows and an optimization algorithm which aims to minimize the normalized variation between the historical and the simulated link flows. These studies assume that the input data are available for the whole area for which OD matrix is estimated. In contrast, in the current study, we focus on the case of OD reduction, that is, estimation of demand for a subarea of the initial area, which is rarely considered in the field of OD estimation.
In [11], authors consider the problem of origin-destination trip demand estimation for subarea analysis. They propose two-step procedure: (i) generation of induced OD demand for a subarea network, (ii) OD updating based on the induced demand and archived traffic measurements. Generation of induced demand is performed based on the path-based traffic assignment results as in our study; however, in [11] the focus is mostly on demand calibration while we consider algorithmic and computational aspects of demand estimation.
Beyond the estimation of the origin-destination matrix, there are algorithms and tools for trip assignment available in urban mobility simulation frameworks. In [12], the authors perform an experimental comparison of different demand generation tools available in SUMO. They divide all tools into two groups: (i) countless tools, which do not require any extra data, namely randomTrips, SAGA and randomActivityGen; (ii) tools which use traffic counts, namely dfrouter, flowrouter, cadyts and routesampler. For the considered use case (Wildau, Germany), routesampler showed the best results in terms of root mean squared error of vehicle count and network coverage.

Problem statement
Let us assume that we have an initial origin-destination (OD) matrix $A_{N \times N}$, where $N$ is the number of traffic assignment zones (TAZs). Each element of the matrix $a_{ij}$, $i, j \in \{1, \dots, N\}$, represents the number of trips between TAZ $i$ and TAZ $j$ during a certain time period (e.g. one hour). The set of TAZs for matrix $A$ is denoted as $Z_{ext}$. We will call matrix $A$ an extended OD matrix, the area covered by TAZs from $Z_{ext}$ an extended area, and the simulation of vehicle movement according to origins and destinations in matrix $A$ an extended simulation.
Let us also assume that we have another set of TAZs, denoted as $Z_{core}$, which is a subset of the extended area: $Z_{core} \subset Z_{ext}$. We will call $Z_{core}$ a core area. The problem is to obtain an instance of core simulation for area $Z_{core}$ given $A$, $Z_{ext}$, $Z_{core}$ as inputs, that is, to induce the core simulation from the extended one.
Extended and core areas are depicted in Figure 1a. TAZs in Figure 1a are shown schematically as squares for simplicity, but they may have any shape, which is usually represented in polygon format. Depending on whether origin/destination zones belong to $Z_{core}$ or to $Z_{ext} \setminus Z_{core}$, all trips may be categorized into four groups:
1. in-in, including trips starting and ending in $Z_{core}$.
2. in-out, including trips starting in $Z_{core}$ and ending in $Z_{ext} \setminus Z_{core}$.
3. out-in, including trips starting in $Z_{ext} \setminus Z_{core}$ and ending in $Z_{core}$.
4. out-out, including trips starting in $Z_{ext} \setminus Z_{core}$ and ending in $Z_{ext} \setminus Z_{core}$:
   a) out-out trips where the whole trajectory belongs to $Z_{ext} \setminus Z_{core}$;
   b) out-out trips where part of the trajectory belongs to $Z_{core}$, and part of the trajectory belongs to $Z_{ext} \setminus Z_{core}$.
Figure 1. (a) Extended ($Z_{ext}$) and core ($Z_{core}$) simulation areas. Arrows show different types of trips in the extended simulation. "in" denotes trips which start/end in $Z_{core}$, "out" denotes trips which start/end in $Z_{ext} \setminus Z_{core}$. The core area is depicted in blue; yellow depicts TAZs belonging to $Z_{ext} \setminus Z_{core}$. Numbers represent indices of traffic assignment zones in $Z_{ext}$. (b) Edge assignment problem for a fixed TAZ.
If a trip does not pass through the core area $Z_{core}$, it does not need to be reproduced in a core simulation. Therefore, for the core simulation, all trips should be accounted for except those of type 4a. Moreover, only in-in trips (type 1) have the same origins and destinations in the extended and core simulations. For in-out, out-in, and out-out (passing both $Z_{ext} \setminus Z_{core}$ and $Z_{core}$) trips, only part of the trajectory passes through $Z_{core}$. This means that for these types of trips we need to find new starting/ending points which are inside $Z_{core}$. We call this problem an origin-destination reduction problem and address it in this study.
The origin-destination reduction may imply two steps:
1. Estimation of a core origin-destination matrix $Q_{M \times M}$, where $M$ is the number of traffic assignment zones in $Z_{core}$, $M < N$. Rows and columns of matrix $Q$ correspond to TAZs of the core area. It should include all the trips from matrix $A$ which have at least part of the path inside the core area $Z_{core}$. For example, if we assume that we have the 5 trips represented with arrows in Figure 1a, the non-zero elements of matrix $Q$ will be $q_{2,6} = 1$, $q_{4,4} = 1$, $q_{5,3} = 1$ and $q_{7,6} = 1$.
2. Trip origin/destination edge assignment, executed for each trip in the core OD matrix $Q$. Each traffic assignment zone (Figure 1b) has a corresponding road network which may be represented as a set of edges which are inside this TAZ. To perform a simulation, for each trip one needs to specify starting and ending edges. The most common way to assign the edges is a random selection from the set of edges of the origin/destination TAZ.
It is worth mentioning that in general step 2 (origin/destination edge assignment) does not require stating the core OD matrix explicitly (step 1), but not vice versa (after obtaining an OD matrix, one still needs to specify starting and ending points of the trajectories).
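The four trip categories can be expressed as a short Python sketch. The function name, the labels for types 4a and 4b, and the representation of a trajectory as a sequence of visited TAZ identifiers are assumptions made for this illustration only.

```python
def classify_trip(origin_taz, dest_taz, route_tazs, core_tazs):
    """Classify a trip relative to the core area.

    origin_taz, dest_taz: TAZ ids taken from the extended OD matrix.
    route_tazs: TAZ ids visited along the trajectory (hypothetical input).
    core_tazs: set of TAZ ids that form the core area.
    """
    origin_in = origin_taz in core_tazs
    dest_in = dest_taz in core_tazs
    if origin_in and dest_in:
        return "in-in"
    if origin_in:
        return "in-out"
    if dest_in:
        return "out-in"
    # Both ends are outside: type 4b if the path crosses the core, else 4a.
    passes_core = any(z in core_tazs for z in route_tazs)
    return "out-out-through" if passes_core else "out-out-outside"


core = {1, 2, 3}
print(classify_trip(1, 3, [1, 2, 3], core))
print(classify_trip(8, 9, [8, 2, 9], core))
print(classify_trip(8, 9, [8, 7, 9], core))
```

Only trips labelled "out-out-outside" (type 4a) can be dropped from the core simulation.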

Method
As described in Section 3, the problem of reducing an extended simulation to a core simulation consists of two steps: (i) origin-destination matrix reduction, (ii) edge assignment. Both steps are presented in Figure 2.
Origin-destination matrix reduction is represented by steps 1, 2, 3, 4a, 5, 6a, 6c in Figure 2. The input for the algorithm is the OD matrix for the extended area, denoted as $A$ in Section 3. For each pair of traffic assignment zones $i$ and $j$ (which correspond to rows and columns of the input matrix) and for each trip between them, origin and destination edges are assigned at random (step 1). After that, the SUMO simulation for the extended area is launched (step 2). The results of the simulation are vehicle trajectories (step 3). Given the vehicle trajectories, one may determine, for in-out, out-in and out-out trips, the TAZs of the core area which serve as origin and destination areas for the core simulation (step 4a). Then, the resulting OD matrix for the core area is composed (steps 6a, 6b) from the submatrix of the extended OD matrix for in-in trips (step 5) and the OD matrices for other types of trips created at step 4a. The pseudocode for OD matrix reduction is presented in Table 3. The second output from parsing vehicle trajectories depicted in Figure 2 (step 4b) is the set of indices of edges at which a vehicle enters or exits the core area. These edges are also used to determine new origin and destination traffic assignment zones for the core simulation (Table 3).
The second stage of simulation-based OD reduction is edge assignment, after which one can start the simulation for the core area (step 8). The simplest strategy of edge assignment is random assignment (step 6c). With random assignment, for each trip between a pair of origin and destination TAZs in the core OD matrix, we assign a starting and an ending edge by random selection of an edge from the corresponding TAZs. The drawback of this approach is that it does not account for the pre-calculated flow directions from the extended simulation. In practice, traffic flows coming to the core area mostly use major roads which serve as main entrances to the city. The random assignment strategy does not account for the size of the roads, and therefore leads to unrealistic flows on the border of the core area.
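A minimal Python sketch of random assignment is given below. The data layout (a dictionary of trip counts per TAZ pair and a dictionary of edges per TAZ) and all identifiers are assumptions for illustration; they do not describe the data structures of the published implementation.

```python
import random


def random_assignment(core_od, taz_edges, rng):
    """Draw a start and an end edge uniformly at random for every trip.

    core_od: {(origin_taz, dest_taz): number_of_trips}
    taz_edges: {taz_id: [edge ids inside this TAZ]}
    rng: a random.Random instance, so that runs can be reproduced.
    """
    trips = []
    for (origin_taz, dest_taz), count in core_od.items():
        for _ in range(count):
            start_edge = rng.choice(taz_edges[origin_taz])
            end_edge = rng.choice(taz_edges[dest_taz])
            trips.append((start_edge, end_edge))
    return trips


# Hypothetical toy input: two zones, a few edges each.
toy_od = {(1, 2): 3, (2, 1): 1}
toy_edges = {1: ["e1", "e2"], 2: ["e3"]}
toy_trips = random_assignment(toy_od, toy_edges, random.Random(42))
print(toy_trips)
```

Every edge of a TAZ is equally likely to be chosen, which is why a minor residential street on the border receives as many entering vehicles as a motorway in the same zone.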
For this reason, we propose another assignment strategy that we call edge-wise assignment (steps 7a and 7b in Figure 2). The idea of edge-wise assignment is to keep the edges which were used by vehicles to enter and to exit the core simulation area. On the other hand, if one just keeps all the trajectories from the extended simulation, one will get only one fixed instance of a core simulation. As our goal here is to provide input data for multiple instances of core simulation, we need to keep some randomness in the assignment process. The proposed algorithm for edge-wise assignment is summarized in Table 2.
For trips which start and end inside the core area, we apply random assignment. For trips which start and/or end in the extended area, we fix the edges from the extended simulation which vehicles use to enter and/or exit the core area.
Table 2. Edge-wise assignment algorithm. RA denotes random selection of an edge within the corresponding TAZ; the other entries denote selection of the entry/exit edge from the extended simulation.
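The assignment rule described above can be sketched in Python as follows. The trip record layout and the key names (`type`, `entry_edge`, `exit_edge`) are hypothetical and chosen only for this illustration.

```python
import random


def edgewise_assignment(trips, taz_edges, rng):
    """Assign start/end edges for core trips.

    Ends that lie inside the core are drawn at random from the TAZ edges;
    ends that lie outside are replaced by the entry/exit edge recorded in
    the extended simulation.

    trips: list of dicts with keys 'type', 'origin_taz', 'dest_taz',
           'entry_edge', 'exit_edge' (hypothetical record layout).
    taz_edges: {taz_id: [edge ids inside this TAZ]}
    """
    assigned = []
    for trip in trips:
        if trip["type"] in ("out-in", "out-out"):
            start_edge = trip["entry_edge"]  # fixed from extended simulation
        else:
            start_edge = rng.choice(taz_edges[trip["origin_taz"]])
        if trip["type"] in ("in-out", "out-out"):
            end_edge = trip["exit_edge"]  # fixed from extended simulation
        else:
            end_edge = rng.choice(taz_edges[trip["dest_taz"]])
        assigned.append((start_edge, end_edge))
    return assigned


toy_edges = {1: ["e1", "e2"], 2: ["e3", "e4"]}
toy_trips = [
    {"type": "in-in", "origin_taz": 1, "dest_taz": 2,
     "entry_edge": None, "exit_edge": None},
    {"type": "out-in", "origin_taz": 1, "dest_taz": 2,
     "entry_edge": "e1", "exit_edge": None},
    {"type": "out-out", "origin_taz": 1, "dest_taz": 2,
     "entry_edge": "e2", "exit_edge": "e4"},
]
result = edgewise_assignment(toy_trips, toy_edges, random.Random(0))
print(result)
```

Calling the function with different random seeds changes only the ends that lie inside the core, so border flows stay the same across instances while inner origins and destinations vary.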
Functions:
originExtTAZ(t): get origin TAZ from matrix A for a trip t
destExtTAZ(t): get destination TAZ from matrix A for a trip t
coreEntryEdge(t): get first edge of a trip t belonging to the core area
coreExitEdge(t): get last edge of a trip t belonging to the core area
getTAZ(edge): get TAZ index by an edge index

Q ← 0  // initialize reduced OD matrix Q with zeroes
for each trip in the extended simulation which passes through the core area:
    // by default, copy origin and destination TAZ from matrix A (final for in-in trips)
    originTAZ = originExtTAZ(trip)
    destTAZ = destExtTAZ(trip)
    // for out-in and out-out trips, find the TAZ to which the entry edge in the core simulation belongs
    if trip is of type "out-in" or trip is of type "out-out":
        originTAZ = getTAZ(coreEntryEdge(trip))
    // for in-out and out-out trips, the same for the destination edge
    if trip is of type "in-out" or trip is of type "out-out":
        destTAZ = getTAZ(coreExitEdge(trip))
    // increment the number of trips for the pair of origin and destination zones
    Q[originTAZ, destTAZ] += 1

The final scheme of the implementation of simulation-based OD reduction in SUMO is presented in Figure 3. It shows the programming routines and data files which are used during the simulation and validation processes. For routing, SUMO's default routing procedure is used. To find positions of vehicles on the border of the core area, we use the SUMO FCD device because we need to check all border edges (the other option was to use induction loops, but their manual placement is too time-consuming and also needs to be redone for each new core area). For out-in and out-out trips we also record departure times of the vehicles from the extended simulation to make the starting times of vehicles in the core simulation more realistic. To validate the results of the core simulation, we use data on traffic counts from traffic counting stations (TCS), so we record simulated traffic counts at the same positions where real traffic counting stations are located.
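The reduction pseudocode above can be sketched as runnable Python under simplifying assumptions: each trip is given as a tuple of its origin TAZ, its destination TAZ and the ordered list of edges it traversed, and a dictionary maps every core edge to its TAZ (edges missing from the dictionary are treated as lying outside the core). These structures and names are illustrative and are not taken from the published repository.

```python
from collections import defaultdict


def reduce_od(trips, edge_to_core_taz, core_tazs):
    """Build the reduced OD matrix as {(origin_taz, dest_taz): trip count}.

    trips: iterable of (origin_taz, dest_taz, edges) from the extended run.
    edge_to_core_taz: {edge_id: taz_id} for edges inside the core area.
    core_tazs: set of TAZ ids that form the core area.
    """
    q = defaultdict(int)
    for origin_taz, dest_taz, edges in trips:
        core_edges = [e for e in edges if e in edge_to_core_taz]
        origin_in = origin_taz in core_tazs
        dest_in = dest_taz in core_tazs
        if not (origin_in and dest_in) and not core_edges:
            continue  # the trip never enters the core area (type 4a)
        # An outer origin is replaced by the TAZ of the first core edge.
        new_origin = origin_taz if origin_in else edge_to_core_taz[core_edges[0]]
        # An outer destination is replaced by the TAZ of the last core edge.
        new_dest = dest_taz if dest_in else edge_to_core_taz[core_edges[-1]]
        q[(new_origin, new_dest)] += 1
    return dict(q)


# Hypothetical toy network: edges c1, c2 lie in TAZ 10 and c3 in TAZ 11.
edge_to_core_taz = {"c1": 10, "c2": 10, "c3": 11}
core_tazs = {10, 11}
toy_trips = [
    (10, 11, ["c1", "c3"]),              # in-in
    (10, 99, ["c2", "c3", "x1"]),        # in-out, leaves via c3
    (98, 11, ["x2", "c1", "c3"]),        # out-in, enters via c1
    (98, 99, ["x2", "c1", "c2", "x1"]),  # out-out through the core
    (98, 99, ["x2", "x1"]),              # out-out outside the core
]
reduced = reduce_od(toy_trips, edge_to_core_taz, core_tazs)
print(reduced)
```

In this sketch a trip that leaves the core and re-enters it is still counted once, from its first to its last core edge.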

Data
In this study, we test the proposed approach of OD reduction using the Helsinki city area as an example. Initial traffic assignment zones and origin-destination matrices were obtained from Helmet [13], a transport demand model system developed by HSL (the Helsinki Regional Transport Authority).
Extended and core simulation areas are shown in Figure 4. Extended area includes 1972 TAZs, which are the same as the zones used in the Helmet model. The core simulation area includes 381 TAZs which cover municipalities included in a Helsinki city area. One may observe that traffic assignment zones have different sizes being fine-grained for dense urban areas.
Origin-destination data include matrices for different transportation means (e.g. private car, bicycle, truck, van) and different times of the day (morning rush hour, day hour, evening rush hour). To get the OD matrix for passenger transportation, we summed up the matrices for private cars and vans. Each element of a matrix is the number of trips between two traffic assignment zones during a selected time of day. In this study, we created a model for the morning rush hour (07:20-08:19).
Demand matrices in Helmet are calculated using a set of models and data including land use data, growth factors of external traffic, car ownership data, models of destination and mode choice, applied for a particular time period. In this study, demand matrices were generated based on the data from workdays, Autumn 2018. Available demand matrices represent average demand for e.g. morning rush hour over this period. We used the same time period while collecting validation data from traffic counting stations.
For validation purposes, we used a Digitraffic dataset [14] containing hourly traffic counts for a set of 15 traffic counting stations (TCSs) located within the core simulation area. All stations may be divided into two types: (i) border TCSs, which are located at the edge of the core simulation area and mostly measure flows of traffic coming from / to the extended simulation area; (ii) inner TCSs, which are located within the core simulation area.
From all available TCSs, 7 are border TCSs and 8 are inner TCSs.
Data for morning rush hour for separate days were averaged to get mean traffic counts comparable with Helmet demand matrices.
Traffic assignment zones were uploaded to SUMO together with the Helsinki city infrastructure network obtained from OpenStreetMap. The traffic assignment zones were originally in shapefile format (.shp) and had to be converted into SUMO's TAZ format (.taz). This was done by first converting the original file into OpenStreetMap format (.osm) using the Java OpenStreetMap Editor (JOSM) and then from OpenStreetMap format to SUMO's polygon format (.poly) using a custom Python script. After that, the tool edgesInDistricts.py that comes with SUMO was used to create a TAZ file from the polygons. Traffic counting stations were simulated as induction loops placed on all lanes in both directions at the same locations as the real TCSs. Counts from the same direction were saved to the same file.
In the experimental study, default SUMO parameters were used.
A Python implementation of the algorithms proposed in the study is available on GitHub [3].

Experimental study
Table 3 shows the results of the comparison of validation metrics for two instances of SUMO simulation: simulation of the extended area for the initial Helmet OD matrix and simulation of the reduced area with origins and destinations assigned by the proposed algorithm. Metrics are calculated for all 15 traffic counting stations (TCSs), as well as separately for border and inner TCSs.
We may observe that the average mean absolute percentage error (MAPE) is between 34% and 45%. Traffic flows at border stations are reproduced significantly better (by 10%) than at inner stations. This may be explained by the fact that border stations are located at the main entries / exits of the city area (Figure 5) and measure in-out and out-in traffic flows. These flows are generally less dependent on the routing procedure than flows simulated within the city. The second observation is that there is no difference in MAPE between the initial and reduced simulations, which shows that our algorithm performs the reduction correctly.
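For reference, MAPE over a set of stations can be computed as in the following Python sketch. It uses the standard definition (mean of absolute errors relative to the observed counts, in percent), which is assumed here to match the metric reported in Table 3; the station counts are invented for illustration.

```python
def mape(observed, simulated):
    """Mean absolute percentage error between observed and simulated counts."""
    if len(observed) != len(simulated) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    relative_errors = [abs(o - s) / o for o, s in zip(observed, simulated)]
    return 100.0 * sum(relative_errors) / len(relative_errors)


# Hypothetical hourly counts for two stations (not Digitraffic data).
observed_counts = [100, 200]
simulated_counts = [90, 250]
example_mape = mape(observed_counts, simulated_counts)
print(f"MAPE = {example_mape:.1f}%")
```

Because each error is divided by the observed count, stations with low observed traffic can dominate the average, which is worth keeping in mind when comparing border and inner stations.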
Generally, there are more traffic counting events in the Digitraffic data than in the SUMO model. For example, for the extended area simulation the sum of traffic counts over all stations is equal to 51.3K in the real data and to 42.2K in the model (16% difference). Here we also see that border stations are reproduced better than inner stations (7-10% difference for border stations and 22-23% for inner stations). This, again, supports our conclusion that routing results have a larger influence inside the city than at the entries to and exits from the city. The reduction process results in a slightly (2%) smaller number of traffic counts. This may be related to the cases when the routing procedure cannot find a path from an origin edge to a destination edge (these trips are eliminated from the simulation), because the number of such cases tends to be larger for smaller areas. In summary, there is no significant difference from the basic model after reduction considering both metrics, that is, the reduction algorithm provides consistent results.
In Figure 5, stations with a green label have MAPE lower than the average MAPE for border or inner stations, and red labels denote MAPE larger than average. We may see that stations located in the northern part of Helsinki are reproduced better than those in the southern and eastern parts. In an extension of this study, flows crossing 'red' stations need to be examined manually in more detail.
Validation results show moderate correspondence of the SUMO model to the observed traffic counts, which may be explained by peculiarities of the initial data. Firstly, Helmet OD matrices are produced by a basic 4-step traffic simulation model [15] using census and land use data, and thus are not well tailored to actual traffic count measurements.
Secondly, the initial OD matrices cover an area more than 100 times larger than the Helsinki city area (the extended area covers 24500 km² while the reduced area covers only 210 km²), that is, they were aimed at reproducing coarse-grained patterns of vehicular mobility rather than city-scale behaviour. The overall quantity of simulated traffic resembles the observed data with an error of 16%, but the distribution of cars between the roads is still not reproduced sufficiently well (MAPE is 40%).
To give an idea about the typical values of reported metrics, in [1] it is stated that "according to the established literature, any value of [%RMSE] below 30% can be considered good". The current validation quality of our Helsinki model may be further improved using several approaches including OD calibration based on traffic measurements, tuning hyperparameters of models and using SUMO calibrator objects for removing or inserting vehicles according to the desired flows on the inspected edges. We consider this as an extension of the current study as the main goal of this paper was to investigate the quality of reduction process itself.
To check that the proposed edge assignment algorithm allows for getting different instances of core simulation while keeping the correspondence of simulated traffic counts to the observed ones, we analysed the results of 10 different runs of core simulation with the edge-wise origin/destination assignment algorithm. To measure the level of randomness between different runs, we calculate the coefficient of variation (1) of simulated counts for different traffic counting stations:

$$CV_i = \frac{\sigma_i}{\mu_i} \qquad (1)$$

Here $\sigma_i$ is the standard deviation of simulated traffic counts for the $i$-th traffic counting station, and $\mu_i$ is the mean of simulated traffic counts for the $i$-th TCS.
The mean value of $CV_i$ for border stations is equal to 1.3%, and for inner stations it is equal to 3.5%. Variation of traffic counts for border stations is in the range 0.05%-5%, and for inner stations it is in the range 0.4%-10%. Thus, we see that in both cases edge-wise assignment allows for getting random instances of a core simulation. At the same time, the level of randomness is lower for border stations than for inner ones. This follows from the design of the proposed algorithm: for border areas, a larger number of edges are fixed and do not change between runs, as they are inherited from the extended simulation to keep validation metrics sufficiently high. The latter is confirmed by the average value of the coefficient of variation for MAPE, which is equal to 6% for these 10 runs. Thus, the proposed algorithm allows for generation of random instances of core simulation with sufficiently stable values of validation metrics.
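The coefficient of variation per station can be computed as in the short Python sketch below. The use of the population standard deviation is an assumption, since the text does not state which estimator was used, and the counts are invented for illustration.

```python
from statistics import mean, pstdev


def coefficient_of_variation(counts):
    """Return the standard deviation divided by the mean, in percent.

    counts: simulated traffic counts of one station across several runs.
    The population standard deviation is assumed here.
    """
    mu = mean(counts)
    if mu == 0:
        raise ValueError("mean count is zero, CV is undefined")
    return 100.0 * pstdev(counts) / mu


# Hypothetical counts of one station over two runs.
station_counts = [90, 110]
station_cv = coefficient_of_variation(station_counts)
print(f"CV = {station_cv:.1f}%")
```

With only 10 runs, the sample standard deviation (`statistics.stdev`) would give slightly larger values than the population estimator used in this sketch.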

Conclusions and discussion
In this study, we consider the problem of subarea demand estimation. Given an origin-destination matrix for a larger area, we aim to infer an origin-destination matrix and perform trip assignment for the reduced area. The proposed approach is simulation-based, which means that we use a traffic modelling tool to get the results for the extended area, and then post-process these results to get input data for the simulation of the core (reduced) area. As simulation of the extended area may be time-consuming, we proposed an edge-wise origin/destination assignment heuristic to avoid multiple runs of the extended simulation when one wants to test different simulation instances for a core area. The experimental study for the Helsinki city area has shown the applicability of our approach to the problem of OD reduction.
The proposed method is mainly intended for reconstructing external traffic flows coming from an extended area to a core simulation area (and vice versa), that is, for reproducing the observed values of traffic counts at border traffic counting stations. To reduce MAPE for inner traffic counting stations, methods of data-driven OD calibration may be applied. For example, in [11] the authors propose a dynamic OD estimation method for subarea analysis which includes an iterative OD updating procedure based on induced OD demand (similar to our reduced OD) and archived traffic measurements. Thus, a two-step procedure for minimizing the errors may be proposed as an extension of the current study: (i) calibrating SUMO parameters (e.g. device.rerouting.probability and weights.priority-factor using grid search as in [1]) using traffic counts from border TCSs; (ii) calibrating the reduced OD matrix using traffic counts from inner TCSs.
The proposed procedure of OD reduction may be applied to any case study. The only assumptions are that the core area is a subset of the extended area, and that the borders of traffic assignment zones within the core area are the same as those of the corresponding zones in the extended area. To support modelers in solving this typical task, we share a code repository [3] with an implementation of the OD reduction procedure which may be reused during the creation of large-scale traffic models for different cities.

Data availability statement
The data supporting the results of this contribution can be accessed on GitHub [3].

Underlying and related material
The code of the implemented method as well as the manual on its usage can be accessed on GitHub [3].