The use of traffic data from automatic monitoring systems to obtain day-to-day time series of vehicle traffic volumes and origin-destination flows in urban networks

Joana Maia Fernandes Barroso, João Lucas Albuquerque Oliveira, Francisco Moraes de Oliveira Neto


INTRODUCTION
The road travel pattern in a city can be represented mainly by two variables: the origin-destination (OD) flows, which indicate the number of trips made between zones in a study area over a given period of the day; and the traffic volume, which indicates the demand at nodes and arcs of the transportation network over a specific time interval or daily period. Both the distribution of OD flows and the magnitude of traffic volumes are basic information for transportation planning and design, as well as for traffic management and control (Cremer and Keller, 1987).
OD flows result from the trip decisions of a population seeking to carry out activities located in the urban environment. Traffic volumes, in turn, result from the distribution of OD flows over the network and may be defined as the number of vehicles that travel on a road section in a certain direction over a specific time interval (Roess and McShane, 2004). OD flows are costly to measure directly, requiring individual interviews or plate surveys. In contrast, the development of traffic monitoring systems opened up the possibility of acquiring traffic volume data automatically and at low cost (Pitombeira et al., 2017). This sparked an increasing interest in indirectly estimating the OD matrix through mathematical models that use traffic counts on the network (Pitombeira-Neto and Loureiro, 2016; Pitombeira-Neto et al., 2018).
According to Cascetta (2009), knowledge of users' travel patterns is essential for formulating and implementing travel demand models that support decision-making about the system supply. One important aspect of modelling travel behavior is the day-to-day, as well as within-day, temporal variation of traffic volumes and OD flows (Pitombeira-Neto et al., 2018). Currently, these travel dynamics can be obtained from different data sources, such as mobile phone data (Järv et al., 2014), which provide information about the movement of mobile users; Global Positioning System (GPS) data (Li et al., 2004; Anda et al., 2017), which provide information on vehicle movement on the road network; and smart card data (Anda et al., 2017; Milne and Watling, 2019), which help to understand public transport demand. As stated by Pitombeira-Neto et al. (2018), recent approaches consider the development of day-to-day dynamic models to represent the traffic states on a network. However, there is a lack of research assessing the modelling assumptions made about the multiday dynamics of travel behavior.
In the city of Fortaleza, Brazil, approximately three hundred sites are monitored by a Traffic Monitoring System that includes an Automatic Number Plate Recognition component (TMS-ANPR). The system comprises a set of sensors installed on the road network, mainly on arterial roads, that record the passage and speed of each vehicle. It is also equipped with video cameras and a license plate recognition system that captures and reads, through an Optical Character Recognition (OCR) algorithm, the number plates of the detected vehicles. It is worth noting that the cameras and the OCR algorithm are not capable of capturing and reading the license plates of all vehicles. From this dataset, it is possible to extract the vehicle volume recorded by each piece of equipment, as well as to associate license plate readings between equipment in different regions of the city and thereby estimate the OD flows between regions.
Despite its wide availability, the TMS-ANPR data of Fortaleza have some limitations for use in demand studies, depending on the number and distribution of equipment. Since the main purpose of the system is traffic enforcement, the equipment is concentrated on arterial roads, which makes it impossible to observe many of the routes used in the central area. In addition, travel information obtained from the association of license plate readings from two pieces of equipment is not guaranteed to be an actual OD flow, given the small amount of equipment in certain areas. Possible equipment failures during the operation of the data collection system, in both vehicle detection and plate reading, should also be considered. Besides, in both day-to-day and within-day traffic variability applications it is necessary to define off-peak and peak periods, i.e., to separate the day into periods in which vehicle flow can be considered stable. In transportation planning, this definition of different periods within a day with constant traffic flow is important for understanding the day-to-day variability of traffic volumes and OD flows (Cheng et al., 2012; Stathopoulos and Karlaftis, 2001). Specifically, in the analysis of day-to-day traffic dynamics, this definition is essential for analyzing the correlation of traffic volumes and OD flows between consecutive days.
TRANSPORTES | ISSN: 2237-1346
In this work, we take a step forward in the understanding of the multiday dynamics of travel behavior by acquiring adequate data on day-to-day traffic volumes and OD flows. Such knowledge is essential for real-time traffic operations and for identifying the mechanisms behind travel behavior. Thus, the main goal of this paper is to propose a methodology for treating TMS-ANPR data so that they can be used for empirical analysis of day-to-day travel dynamics. To this end, the specific objectives are: i) to treat the original TMS-ANPR data by eliminating suspicious data caused by failures in the traffic sensors, by the limitations of the license plate recognition system, and by atypical traffic variations; ii) to propose a method based on clustering analysis for defining typical periods of the day with constant traffic, allowing the analysis of the day-to-day variation of traffic volumes and OD flows; iii) to evaluate the data obtained by checking whether the probability distributions of the resulting variables fit what is stated in the literature. The main products of this data treatment are the day-to-day time series of traffic volumes and OD flows. The treated data can then be applied to assess the multiday dynamics of travel behavior and to verify the model assumptions stated in the literature about such dynamic behavior. To our knowledge, such an analysis, which is outside the scope of this work, has not yet been done using empirical data. The treatment method can be applied to any TMS equipped with an ANPR system and can be used to generate data for other purposes, such as the operational analysis of traffic control systems.

BACKGROUND
As stated by Loureiro et al. (2009), an efficient alternative for urban traffic management consists of implementing traffic management centers (TMCs), which collect, model, and store data on traffic conditions. The traffic monitoring system (TMS) is an essential component of the TMC: it automatically collects and stores traffic data from loop detectors (traffic sensors) placed on the road or at intersection approaches and sends the data, every second, to the management center through private phone lines.
Besides collecting traffic data, the TMS can also be equipped with an ANPR system, which allows vehicles to be tracked between different locations. According to Oliveira-Neto et al. (2013), these systems were developed with the main objective of interpreting the alphanumeric characters on vehicle plates without human intervention. They typically rely on four main components: an image acquisition processor, a license plate detection system, a character segmentation and recognition engine, and a computer to store the data. ANPR is a mature but imperfect technology. As stated by Oliveira-Neto et al. (2012, 2013), the ANPR accuracy is around or below 60%, depending on the model, installation, variation of the license plates in the traffic stream, lighting conditions, and other factors. Exploiting the fact that most errors made by ANPR hardware involve only one or two misread characters of the vehicle plate, Oliveira-Neto et al. (2012, 2013) proposed a method for matching imperfect readings between two locations, even when the ANPR accuracies are unknown, increasing the number of matches, or observed trips, between two locations. The data generated by TMS and ANPR, or TMS-ANPR, support real-time information systems and help researchers to better understand travel behavior in urban network systems. The extraction of vehicular traffic information such as average speed, travel time, volume, and OD flows from TMS-ANPR has been explored in the literature, as in Castillo et al. (2008) and Rao et al. (2018), who used license plate data, along with traffic volume counts on network arcs, to estimate an OD matrix, and in Bertini et al. (2005) and Liu et al. (2011), who used license plate data to predict vehicle travel time on the network. However, as can be seen in these studies, the focus is on using the TMS data to reconstruct OD flows and/or predict the traffic network performance. Little detail is given on the preceding data treatment stage, and there is no concern with treating the data for the purpose of analyzing multiday travel behavior, which is the main purpose of this work.
The treatment of TMS-ANPR data is an essential step before any traffic analysis and demand modelling. At this step, we seek to eliminate any bias, whether due to failures in automatic collection or because only a sample of specific categories of the population can be collected. The importance of data treatment is highlighted by Oliveira and Loureiro (2006), who presented a method for treating traffic data obtained from several loop detectors of the Real-Time Traffic Control System of Fortaleza, with the main objective of identifying outliers or atypical data. As pointed out by Cheng et al. (2012), if a prior analysis of the data is not performed, any conclusion based on the analysis of the variables of interest may be misleading.
Furthermore, before any analysis following the treatment phase, it is also important to define periods in which the traffic is stable during a typical day (i.e., peak and off-peak periods with constant traffic flow). As stated by Cheng et al. (2012) and Stathopoulos and Karlaftis (2001), these periods are usually defined to represent the variation of traffic in the urban environment, with the purpose of designing different traffic control strategies according to the demand variation. According to Cheng et al. (2012), an arbitrary categorization of traffic data into different time periods is not adequate to isolate specific states of network traffic. Stathopoulos and Karlaftis (2001) highlight the importance of defining different periods with similar traffic characteristics in terms not only of traffic volumes but also of OD flows, following the idea that commuting patterns are caused by the activities carried out in certain periods of the day. Another important issue related to traffic patterns and the definition of traffic states is that different traffic profiles can be observed at different locations on an urban network, resulting in different peak and off-peak periods. Some authors have approached this problem using clustering techniques. As suggested by Weijermars (2007), clustering techniques (e.g., the k-means algorithm) can be used to classify traffic volume profiles and can be useful for identifying peak times for each profile, considering that traffic may vary according to the region and trip direction.
Finally, the assumptions about the probability models that represent day-to-day traffic volumes and OD flows in urban networks are also an important issue discussed in the literature. Although this is not strictly a part of data treatment, a first analysis of the variables can be carried out to verify the hypotheses about the probability distributions stated in the literature. As stated by Pitombeira-Neto et al. (2017) and Pitombeira-Neto et al. (2018), several models have been proposed to estimate OD flows from traffic volumes. To account for the variability of OD flows, the early models assumed that OD flows are the result of a Poisson process (Vardi, 1996; Tebaldi and West, 1998). Hazelton (2000, 2001, 2003) proposed approximating the distribution of OD flows with a multivariate normal density, which has more tractable computational properties. In this latter case, the traffic volumes can also be modeled by a multivariate normal distribution, and each traffic volume on a given arc by a normal distribution. All these earlier works share the main assumption that the OD flows are the result of a stationary process in which the OD means and variances are constant day after day for the same period of the day. Pitombeira-Neto and Loureiro (2016) and Pitombeira-Neto et al. (2018, 2020) also suggested the normal distribution as an approximation for the Poisson process, but they relaxed the assumption that the OD flows, and consequently the traffic volumes, are independently distributed random variables by assuming some dynamic structure for the day-to-day variation. We state in this paper that in the case of weak dependence (i.e., when the state of the system, represented in terms of route flows or route costs, does not depend strongly on the previous history of the system's states), it is possible to verify the hypothesis of normality.

METHODS
The TMS-ANPR data used in this study were collected in the city of Fortaleza, Brazil, in 2017. Three companies are responsible for the TMS-ANPR equipment. The system was designed to monitor drivers who exceed the speed limit or run red lights. A total of 358 sites (including intersections and mid-block locations) on the road network were monitored in 2017 (Figure 1a). Comma-Separated Values (CSV) files with the data collected by each company are provided by the Municipal Traffic and Citizenship Authority (AMC). For each vehicle detected, the files contain a record with the following information: equipment identification code, date and time of detection, lane in which the vehicle was detected, speed limit of the road site, measured speed, estimated vehicle size in meters, estimated vehicle classification, and the reading of the detected vehicle license plate. The license plate is encrypted for privacy purposes.
The study area, the Fortaleza urban network, was divided into six regions, as shown in Figure 1b. The regions were originally constructed from the census tracts by joining tracts with similar employment and demographic characteristics. For more details, see Lima (2017). The result was a central region with mixed land use (i.e., residential and commercial uses) and five peripheral regions with mostly residential characteristics but distinct levels of income. The southeast region was excluded from the analysis due to the small number of devices in this region. This definition of distinct regions, based on socioeconomic characteristics and with a coarse level of spatial aggregation, allows not only the OD flows between regions to be obtained from the TMS-ANPR data, but mainly the day-to-day variability of OD flows to be analyzed in relation to the context of the urban environment (i.e., the land use and socioeconomic characteristics of the different regions). Regarding the OD flows from TMS-ANPR data, it is believed that at this aggregation level most of the traffic flows between two devices located in two different regions are origin-destination trips between those two regions.
[...] the main periods, peak and off-peak, of typical days (excluding holidays and weekends). The data were treated according to the following steps: equipment selection for the analysis of traffic volumes, definition of peak and off-peak periods of analysis, selection of devices for the analysis of OD flows, and detection of outliers or anomalies in the time series of traffic volumes and OD flows. The first step was to organize the data from each company into a single file per day. Next, the data were aggregated at 5-minute intervals. This time interval was set to be short enough to represent the daily variation of traffic flows, allowing the different traffic states during a typical day to be identified, and long enough to make it possible to detect any missing or faulty data in the dataset.
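The 5-minute aggregation can be illustrated with a minimal sketch. The record layout (an equipment code paired with a detection timestamp) and the names `five_minute_counts` and `sample` are assumptions made for illustration; the actual CSV layout of the AMC files is not reproduced here.

```python
from collections import Counter
from datetime import datetime


def five_minute_counts(records):
    """Count detections per (equipment, date, 5-minute interval index).

    `records` is an iterable of (equipment_id, timestamp) pairs. The
    interval index runs from 0 (00:00-00:05) to 287 (23:55-24:00).
    """
    counts = Counter()
    for equipment_id, ts in records:
        interval = (ts.hour * 60 + ts.minute) // 5
        counts[(equipment_id, ts.date(), interval)] += 1
    return counts


# Hypothetical detections: three at one site, one at another.
sample = [
    ("EQ001", datetime(2017, 3, 6, 7, 1, 12)),
    ("EQ001", datetime(2017, 3, 6, 7, 4, 59)),
    ("EQ001", datetime(2017, 3, 6, 7, 5, 0)),
    ("EQ002", datetime(2017, 3, 6, 23, 58, 30)),
]
counts = five_minute_counts(sample)
```

Intervals that never appear in `counts` have zero recorded volume, which is how missing or faulty data would show up in the later selection step.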

Equipment selection
The equipment selection step (1) aims to select equipment whose operation during the analysis period is considered acceptable for obtaining the data of interest. Initially, only the equipment that operated in all months of a given year was selected. In addition, the proportion of 5-minute intervals between 5:00 a.m. and 10:00 p.m. with non-zero traffic volume was used as a selection criterion. Since most of the surveillance equipment is on arterial roads of the city, all intervals within this period on a given day are expected to have a traffic volume greater than zero. Any interval with zero traffic volume was therefore classified as faulty or missing data. Based on this proportion, a day of traffic observations was classified as either acceptable or not acceptable. The threshold used for this classification was defined based on a trade-off between the number of days of traffic volume necessary to describe the day-to-day variation of traffic and the number of 5-minute intervals necessary to describe the within-day traffic variation. To this end, a sensitivity analysis was performed to assess the effect of the proportion of non-zero intervals on the number of acceptable days for each piece of equipment. This analysis was also performed for other aggregation intervals (10 min, 15 min, and 30 min). Finally, the data for a given piece of equipment were considered acceptable for analysis when the proportion of acceptable days in a given year was greater than 90%, assuming that this proportion of valid days makes it possible to analyze the day-to-day variation of traffic for a given year of data.
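The two selection criteria can be sketched as follows. This is an illustration under stated assumptions: each day is a list of 288 interval counts, the day-level threshold defaults to the 90% adopted later in the application, and whether each comparison is strict or inclusive is not fixed by the text, so the choices below (and all function names) are hypothetical.

```python
FIRST_INTERVAL = 5 * 12   # index of the 5-minute interval starting at 5:00 a.m.
LAST_INTERVAL = 22 * 12   # index of the interval starting at 10:00 p.m. (excluded)


def nonzero_proportion(day_counts):
    """Share of 5-minute intervals between 5:00 a.m. and 10:00 p.m. with volume > 0."""
    window = day_counts[FIRST_INTERVAL:LAST_INTERVAL]
    return sum(1 for volume in window if volume > 0) / len(window)


def is_acceptable_day(day_counts, threshold=0.90):
    # Assumption: a day exactly at the threshold counts as acceptable.
    return nonzero_proportion(day_counts) >= threshold


def is_acceptable_equipment(days, min_share=0.90):
    """`days` holds one 288-element list of counts per observed day."""
    acceptable = sum(1 for day in days if is_acceptable_day(day))
    return acceptable / len(days) > min_share


good_day = [5] * 288   # every interval has traffic
bad_day = [0] * 288    # sensor silent all day
mostly_working = [good_day] * 19 + [bad_day]
often_failing = [good_day] * 9 + [bad_day]
```

Repeating `is_acceptable_day` over a grid of thresholds (for example 0.60 to 1.00) gives the kind of sensitivity curve described above.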

Traffic volume profile identification
The next step refers to equipment clustering (2) according to the relative average traffic volume profile of each piece of equipment. The traffic profile of each piece of equipment is defined by calculating the average volume in each 5-minute interval over a given number of typical days, corresponding to a week, a month, or a year. The clustering analysis was performed by applying the k-means algorithm, as suggested by Song et al. (2019). The k-means technique defines groups based on a predefined number of clusters and a measure of similarity between the elements to be grouped. Among the different clustering techniques in the literature, k-means was chosen on the assumption that some types of traffic profiles are usually expected in an urban network, given that activities are usually concentrated in certain areas of the city and take place at specific hours of the day. Therefore, the number of clusters could be defined beforehand based on knowledge of the temporal and spatial distribution of activities.
The similarity measure adopted to compare the traffic volume profiles of every two sites was the Euclidean distance between the profiles. Since the main goal was to identify groups with similar shape, the relative values (fraction of the total daily traffic volume) in each 5-minute interval throughout the day were used instead of the 288 absolute values (corresponding to the number of 5-minute intervals in a day) of the traffic volume profiles. To define the number of clusters, the average silhouette width criterion was adopted. This criterion seeks to maximize the separability (variation between groups) and compactness (variation within each group) of the clusters. Finally, to validate this number of clusters, a sensitivity analysis was performed by varying the number of clusters around this initial value and visually analyzing the average profile of each group, as well as the spatial distribution of the equipment groups in the network. The final number of clusters was chosen to represent the different variations of traffic volume according to the road site of the equipment in the network and the direction of the monitored traffic.
The k-means method was compared to the Functional Clustering method (Jacques and Preda, 2014), a technique designed for data generated by a process that occurs in a continuous space (e.g., continuous time). In this case, a probabilistic approach was adopted that consists of assuming a probability density on a finite number of parameters describing the profiles of each cluster. However, the profiles obtained by this technique were not much different from those obtained using k-means. Therefore, k-means was chosen due to its simplicity.
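A minimal sketch of the profile normalization and of a plain k-means (Lloyd's algorithm) on Euclidean distance is given below. It is not the implementation used in the study: the farthest-first initialization is an assumption chosen to keep the example deterministic, the silhouette-based choice of the number of clusters is omitted, and the toy profiles have 4 intervals instead of 288 for readability.

```python
import math


def relative_profile(volumes):
    """Convert absolute interval volumes into fractions of the daily total."""
    total = sum(volumes)
    return [v / total for v in volumes]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(profiles, k, iterations=100):
    """Plain Lloyd's algorithm; returns one cluster label per profile."""
    # Assumption: farthest-first initialisation instead of random seeding.
    centroids = [list(profiles[0])]
    while len(centroids) < k:
        farthest = max(
            profiles, key=lambda p: min(euclidean(p, c) for c in centroids)
        )
        centroids.append(list(farthest))
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:   # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:            # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical sites: two with a morning peak, two with an evening peak.
profiles = [
    relative_profile([40, 10, 10, 10]),
    relative_profile([10, 10, 10, 40]),
    relative_profile([80, 25, 20, 15]),
    relative_profile([30, 20, 30, 120]),
]
labels = kmeans(profiles, k=2)
```

Because the profiles are expressed as fractions of the daily total, the two morning-peak sites fall in the same group even though one carries roughly twice the volume of the other.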

Peak and off-peak periods identification
The identification of peak periods (3) is performed for each profile identified in the previous step, aiming to define periods of the day in which the vehicle volume can be considered stable, including the identification of an off-peak period. To this end, the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm proposed by Ester et al. (1996) was applied. It is suitable for identifying arbitrary clusters of different sizes and for identifying elements that do not belong to any cluster, called noise, without the need to provide preliminary information about the groups.
The DBSCAN technique is based on the spatial density of elements: points that are tightly packed together are grouped, while elements that lie in low-density regions or regions with high variation are classified as noise elements or outliers. This algorithm was used instead of k-means because of its capability to identify the different states of traffic for a given traffic profile, corresponding to different densities of traffic volumes by time of day, and also to identify the transitions between every two states of traffic, which we defined as periods of traffic with high variation of intensity. In sum, DBSCAN is best suited for generating groups whose elements have similar intensity, while k-means generates groups whose elements have similar variation trends. After this clustering analysis, the day-to-day series of traffic volume for each daily period can be generated.
Basically, two global parameters are required to run the DBSCAN algorithm: the radius of the search circle, ε, such that every two points located within this radius are said to belong to the same group; and the minimum number of points, MinPts, required to form a cluster within the radius ε. Points belong to the same group if they are either inside a search circle formed by at least MinPts points or can be reached by any point of a set of points inside another search circle with at least MinPts points. Points that are not packed together (those that do not fit any of these criteria) are classified as noise points.
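The role of ε and MinPts can be seen in a minimal DBSCAN sketch. The representation of each profile point as an (interval index, volume) pair, the axis scaling, and the parameter values are assumptions made for illustration; the text does not specify them.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, with -1 marking noise."""

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:   # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


# Hypothetical profile: a low plateau, one transition point, a high plateau.
profile_points = [(0, 10), (1, 11), (2, 10), (3, 30), (4, 50), (5, 51), (6, 50)]
states = dbscan(profile_points, eps=1.5, min_pts=3)
```

In this toy profile the two plateaus form separate groups, while the isolated point between them is left as noise, mirroring the idea that transitions between traffic states are periods of high variation.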

Equipment selection and daily periods of analysis
The next step (4) concerns the selection of suitable equipment for extracting the OD flows from the sample of equipment previously selected for volume extraction. As pointed out in Sections 1 and 2, the TMS-ANPR has the limitations of not registering the license plates of all detected vehicles and not covering the entire road network. However, since most arterial roads are monitored (Figure 1b), it is believed that a set of equipment can be selected to provide data on the day-to-day flow variation between urban areas for specific time periods of the day. The equipment suitable for this analysis was selected based on the proportion of plates read by each piece of equipment (reading rates).
The reading rates of each piece of equipment are calculated for different daily periods of analysis (i.e., morning, midday, and afternoon), which are determined based on the traffic states defined in Subsection 3.1.3. To determine the daily periods for OD flow analysis, the average traffic profiles of every two regions are combined to define time periods that include the same state of traffic observed in the two regions (e.g., the morning period for OD flow observation between a peripheral and a central region includes the morning peaks of both regions).
The equipment selection criterion for each daily period was based on the standard deviation of the reading rates over a given year of observation. For each piece of equipment, a maximum threshold of 0.12 was set for the standard deviation of the reading rates, seeking to select equipment with small variation of the reading rates (reducing the effect of the reading proportion on the observed day-to-day traffic pattern between regions) while keeping a minimum number of equipment in each region that allows the analysis of the day-to-day variation of OD flows.
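As a rough illustration of this criterion, the sketch below computes daily reading rates and keeps the equipment whose standard deviation does not exceed 0.12. The use of the sample (rather than population) standard deviation, the inclusive comparison, and the example counts are assumptions.

```python
from statistics import stdev

MAX_STD = 0.12  # maximum standard deviation of the reading rates


def reading_rate(plates_read, vehicles_detected):
    """Proportion of detected vehicles whose plate was read."""
    return plates_read / vehicles_detected


def select_equipment(daily_rates, max_std=MAX_STD):
    """Return the equipment codes whose reading rates vary little over time."""
    return sorted(
        eq for eq, rates in daily_rates.items() if stdev(rates) <= max_std
    )


# Hypothetical (plates read, vehicles detected) pairs for one daily period.
daily_counts = {
    "EQ001": [(600, 1000), (620, 1000), (580, 1000), (610, 1000)],
    "EQ002": [(200, 1000), (700, 1000), (450, 1000), (900, 1000)],
}
daily_rates = {
    eq: [reading_rate(read, seen) for read, seen in days]
    for eq, days in daily_counts.items()
}
selected = select_equipment(daily_rates)
```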

License plate association between regions
The equipment association step (5) is simply the identification of trips made by vehicles between regions during the daily periods of analysis. Since the ANPR accuracy is unknown, trips between two regions are identified by counting the number of exact matches between plate readings within the adopted analysis period. It is assumed that two identical license plate readings are unlikely to come from different vehicles. To avoid replications (when a vehicle is registered by more than two pieces of equipment in sequence), the first reading record is associated with the last exact reading record, eliminating the intermediate records of the same vehicle.
The main concern at this point of the treatment is how to represent the pattern of OD flows between city regions through the passage flows observed by associating plate readings of equipment located in different regions. Firstly, as already argued, each exact match of plate readings between regions is likely to represent a trip between those regions, since the adopted zones are large areas representing different land use and socioeconomic characteristics of the city. Secondly, although the day-to-day series of OD matrices obtained from the proposed process may not represent the whole pattern of OD flows among all regions in absolute terms, it allows the analysis of the day-to-day OD flow variation for each pair of regions, which is the main goal of this data treatment. Finally, as discussed in Subsection 3.2.3, the time difference between reading associations was also used as a criterion to define a trip, based on expected travel times between regions.
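The first-to-last association of exact plate matches can be sketched as follows. The record layout, the use of seconds as timestamps, the `region_of` lookup, and the rule of discarding matches that start and end in the same region are assumptions made for illustration; all readings are assumed to belong to a single analysis period.

```python
from collections import defaultdict


def match_trips(readings, region_of):
    """Associate the first and last exact reading of each plate.

    `readings` holds (plate, equipment_id, timestamp_in_seconds) tuples.
    Returns (plate, origin_region, destination_region, time_difference)
    tuples; intermediate readings of the same plate are ignored.
    """
    by_plate = defaultdict(list)
    for plate, equipment_id, ts in readings:
        by_plate[plate].append((ts, equipment_id))

    trips = []
    for plate, seen in by_plate.items():
        if len(seen) < 2:
            continue                      # a single reading cannot form a trip
        seen.sort()
        (t_first, eq_first), (t_last, eq_last) = seen[0], seen[-1]
        origin, destination = region_of[eq_first], region_of[eq_last]
        if origin != destination:         # keep only trips between regions
            trips.append((plate, origin, destination, t_last - t_first))
    return trips


# Hypothetical encrypted plates and equipment-to-region lookup.
region_of = {"EQ1": "west", "EQ2": "center", "EQ3": "center", "EQ4": "west"}
readings = [
    ("a9f3", "EQ1", 0),
    ("a9f3", "EQ2", 600),
    ("a9f3", "EQ3", 1500),
    ("77c1", "EQ1", 100),
    ("0be2", "EQ1", 200),
    ("0be2", "EQ4", 900),
]
trips = match_trips(readings, region_of)
```

Only the plate seen in both the west and center regions yields a trip here; its intermediate reading at EQ2 is dropped.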

Travel time filter
The last step (6) of the method concerns the analysis of the time differences between reading associations. Analyzing the distribution of the resulting time differences makes it possible to eliminate associations with very short time differences, possibly corresponding to associations between very close pieces of equipment located on the same arterial street, which tend to generate many observations (which are likely to represent a traffic volume associated with any pair of regions), leading to a misrepresentation of the OD flows. Further, the observed time differences are compared to the travel times obtained from the OpenStreetMap (OSM) platform through the 'osrm' package for the R software, developed by Giraud (2018).
Thus, the time differences were filtered according to the following criterion: $0.85\,t_{osm} < \Delta t < 3\,t_{osm}$, where $\Delta t$ is the time difference between reading associations and $t_{osm}$ is the corresponding travel time obtained from OSM. Any time difference outside this interval is assumed to be unreliable, since it is likely to originate from an association of misread plates (i.e., an association between very distant equipment with a very short time difference, or an association between very close equipment, located a few blocks apart on the same road, with a long time difference). The upper bound of 3 times the OSM travel time was set to allow for intermediate stops. The lower bound of 85% was defined mainly to accommodate synchronization failures between equipment, which may result in shorter time differences due to small clock differences.
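The filter itself reduces to a single comparison, sketched below. The function name and the use of seconds are assumptions; the reference travel time is taken as given, since the routing query is not reproduced here.

```python
def keep_association(delta_t, t_osm, lower=0.85, upper=3.0):
    """Keep a reading association only if lower*t_osm < delta_t < upper*t_osm."""
    return lower * t_osm < delta_t < upper * t_osm


# Hypothetical pair of sites with a 600-second reference travel time.
reference = 600
observed = [120, 500, 900, 1700, 2000]
kept = [dt for dt in observed if keep_association(dt, reference)]
```

With a 600-second reference, the accepted range is roughly 510 to 1800 seconds, so very short differences (likely nearby equipment or misread plates) and very long ones are both dropped.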

Outlier detection in the time series and probability distribution of the traffic variables
After generating the initial time series of day-to-day traffic volumes and day-to-day OD flows, an outlier analysis is performed to identify atypical values that were not detected by the previous analyses. Atypical variations in the data can occur due to accidents and weather conditions, among other reasons. Since the main interest lies in generating traffic flow series that represent typical day-to-day variations of traffic, such atypical events should be left out. The criterion adopted was to consider as an extreme value, or outlier, any observation outside the following range (see distance-based methods in Aggarwal, 2015): $[\tilde{x} - 3s,\ \tilde{x} + 3s]$, where $\tilde{x}$ is the median and $s$ is the sample standard deviation of a sample obtained using a time window of 20 days. This 20-day time window was defined to estimate the data dispersion and to represent the local pattern of the data, avoiding the elimination of observations due to variation between months. A shorter time window (e.g., 7 days) could be considered, but in this case the sample size would be smaller.
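A minimal sketch of this criterion is shown below. How the 20-day window is positioned relative to each observation is not stated, so the roughly centered window that is shifted inward at the ends of the series is an assumption, as are the function name and the synthetic series.

```python
from statistics import median, stdev


def flag_outliers(series, window=20, k=3.0):
    """Flag values outside [median - k*s, median + k*s] of a local window."""
    n = len(series)
    flags = []
    for i, value in enumerate(series):
        # Assumption: window roughly centred on day i, kept inside the series.
        start = max(0, min(i - window // 2, n - window))
        sample = series[start:start + window]
        centre = median(sample)
        spread = stdev(sample)          # sample standard deviation
        flags.append(abs(value - centre) > k * spread)
    return flags


# Hypothetical day-to-day volumes with one atypical day.
volumes = [1000 + (day % 5) * 10 for day in range(40)]
volumes[25] = 2000
flags = flag_outliers(volumes)
typical = [v for v, is_outlier in zip(volumes, flags) if not is_outlier]
```

Using the median as the center keeps the reference level stable when an atypical day falls inside the window, although the standard deviation is still inflated by it.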
To analyze which probability distribution may represent the variation of the time series obtained, the histograms of the treated data, considering only the typical months of the year (excluding January, July, and December), were compared with the Normal distribution by applying the Shapiro-Wilk test. It is believed that the values of the traffic variables fluctuate around a central measure, since the series are generated from stable periods of traffic throughout the day, or commuting periods, in which the traffic flows tend to repeat day by day.

APPLICATION TO THE URBAN NETWORK OF FORTALEZA, BRAZIL
This section presents the results obtained from the application of the proposed method for treatment of the 2017 TMS-ANPR data in Fortaleza, Brazil.

Generating the day-to-day traffic volumes
From the total of 358 pieces of equipment, 271 devices with detection data for all months were initially selected. The number of equipment with acceptable days was determined by varying the threshold for the proportion of non-zero 5-minute intervals from 60% to 100%, as shown in Figure 3. It was found that for a proportion of up to 90%, around 200 days were classified as acceptable, and that above this proportion the number of acceptable days decreases considerably, as compared with the other aggregation intervals of 10, 15, and 30 minutes. Hence, a threshold of 90% was adopted for the data aggregated at 5-minute intervals, which corresponds to a balance between the desirable sample size (number of days) and data quality (data representing the daily variation of traffic flow). With the acceptable days identified for each piece of equipment, the second criterion was applied regarding the minimum proportion of acceptable days (at least 90% of the total workdays present in the sample), resulting in a sample of 179 pieces of equipment, about 50% of the initial number.

Traf ic volume variation patterns
As for the de inition of daily traf ic pro iles, the k-means method was applied to determine the daily traf ic pro iles for typical and atypical months (January, July, and December), and for different weekdays, to account for possible seasonal variation at the daily traf ic pro iles. The method suggested three traf ic patterns to represent the different traf ic dynamics existing in the city, as shown in Figure 4, which shows the average traf ic pro iles for typical and atypical months. As can be seen in Figure 4, these 3 pro iles represent three main traf ic patterns corresponding to the mainly low directions of traf ic in the city of Fortaleza: with links of intense traf ic going towards the central region in the morning (pro ile 3), links of intense traf ic going outwards the central region in the late afternoon (pro ile 1), and some locations with two less intense peaks and an off-peak period of less intensity between them (pro ile 2). The results did not show evidence of differences between the traf ic pro iles of the weekdays. According to Figure 4, it was observed a change of traf ic pattern between atypical and typical moths. Several traf ic segments of pro ile 2 and 3 switched to pro ile 1 during atypical months. This pattern changing probably re lects the change in daily activities during vacation months, which usually have less intense traf ic during the morning period and higher intense traf ic during the afternoon period.
Regarding the typical months, the spatial locations of the devices in each group, together with the traffic directions they monitor, confirm the three traffic patterns detected by the clustering analysis. Profile 3 devices are located mostly on arterial roads within residential regions, predominantly in the direction of the central region (suburb to center), while profile 1 devices are also mostly located close to residential areas of the city, usually in the opposite direction. This reveals the predominantly commuting character of travel: towards the center early in the day and back to the residence at the end of the day. Profile 2 devices indicate locations where traffic generally does not change significantly with the time of day. They are usually on major roads in the central area, with no dominant direction of traffic, linking either commercial or residential neighborhoods, which may account for the two well-defined peaks in addition to the plateau between them. Therefore, the results revealed traffic patterns that are a consequence of an urban environment in which most activities are concentrated in a single central region of the city, generating intense traffic at certain locations and periods of the day.
From the identified profiles, the DBSCAN parameters were defined by a sensitivity analysis based on two criteria: the silhouette width and the number of traffic periods. Higher values of the silhouette width indicate better cluster quality. Figure 5 shows this analysis for profile 2 of the typical months. The silhouette indicator was calculated only for observations that were assigned to a group (excluding noise). Since the daily traffic profile is quite variable, small values of both MinPts and ε result in too many groups or even no group (only noise). As the parameters increase, the DBSCAN algorithm tends to identify a single group. Hence, the criterion of the number of groups was used to define a set of parameters that better represents the daily variation of traffic in each profile.
For the case shown in Figure 5, the adopted parameters were MinPts = 8 intervals of 5 minutes and ε = 14. A MinPts of 8 corresponds to a period of at least 40 minutes for each cluster. The value ε = 14 is the maximum search radius, obtained by inspecting a cumulative distance graph of the 8 (the defined MinPts) nearest neighbors. Recall that the value of ε has no direct physical meaning, since the distances between observations are calculated from both the volume attribute (vehicles/5 minutes) and the time attribute (minutes). The same analysis was carried out for the other traffic profiles. The clusters identified for each profile are presented in Figure 4. The observations assigned to cluster 0 are noise and, as expected, correspond to the transition intervals between stable periods of traffic.
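To make the role of MinPts and ε concrete, the sketch below is a compact, generic DBSCAN written with the Python standard library. Noise is labeled 0, following the convention used in the text. The example is not the authors' code: the points are assumed to be (time, volume) pairs, and how the two attributes were scaled before computing distances is not specified in the source, so any numerical values passed to the function are illustrative only.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: 1, 2, ... for clusters and 0 for noise.

    points: list of (time, volume) tuples (attribute scaling is an assumption).
    eps: search radius; min_pts: minimum neighbors (including the point itself)
    required for a point to be a core point.
    """
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = 0  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == 0:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # only core points extend the cluster
                queue.extend(reachable)
    return labels
```

Applied to a profile made of two stable plateaus joined by a steep ramp, the plateau observations form two clusters and the ramp observations are left as noise, which mirrors the interpretation of cluster 0 as transition intervals.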
The traffic peaks identified revealed that the duration and intensity of the morning peak periods are quite different for the three profiles, with profile 3 presenting the most intense peak, between 6:55 a.m. and 8:00 a.m. The time intervals of the late afternoon peaks, mostly between 5:00 p.m. and 7:00 p.m., are similar for the three profiles, with profile 1 presenting the sharpest peak. It is worth noting that only profile 1 has a well-defined midday peak, between 11:00 a.m. and 4:00 p.m., probably representing trips mainly for lunch purposes. Profile 3, on the other hand, presents a well-defined peak between 1:00 p.m. and 5:00 p.m., perhaps representing trips mainly for non-work purposes (e.g., health and shopping), since the devices of this group are in the suburban areas and monitor the traffic going toward the central region. Finally, profile 2 does not present any sharp peak, unlike the other two profiles, and presents a long period of low traffic variation between 8:00 a.m. and 4:40 p.m., probably representing trips of different purposes having the central area as either origin or destination. The major difference between atypical and typical months is that profile 3 for atypical months shows little variation between 10:00 a.m. and 5:00 p.m.

Day-to-day time series of traffic volumes
The 179 time series of traffic volumes were generated for each daily period. Three time periods were assumed for the analysis of each profile: the morning peak, the afternoon peak, and a midday period. The midday period was defined to be between 1:00 p.m. and 5:00 p.m. for all profiles. Figure 6 shows the resulting time series for the morning peak of a given device, classified as profile 3, along with the timeline trend, the identified outliers, and the histogram of the variable. For most series of traffic volumes, the atypical months (January, July, and December) presented less intense traffic.
For the histogram in Figure 6, the Shapiro-Wilk normality test (p-value = 0.52) gave no evidence that the distribution of the traffic volume differs significantly from the normal distribution. The test was performed for all series of traffic volumes, considering only the typical months, and for most samples the null hypothesis of normality was not rejected at a significance level of 5%. It is worth noting that the traffic volumes presented overdispersion relative to what would be expected from a Poisson variable. This indicates that the traffic volumes are not completely random, since they are the result of user decisions (trip or route decisions), which in turn are influenced by the temporal and spatial distribution of activities in the city and probably by the previous history of travel decisions.
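The overdispersion statement can be checked with the index of dispersion, the ratio of the sample variance to the sample mean: a Poisson variable has variance equal to its mean, so a ratio well above 1 indicates overdispersion. The following sketch uses only the Python standard library. The function name and the volumes in the usage lines are hypothetical and are not taken from the Fortaleza data.

```python
from statistics import mean, variance


def dispersion_index(volumes):
    """Sample variance divided by sample mean of a series of counts.

    Values close to 1 are consistent with a Poisson variable;
    values well above 1 indicate overdispersion.
    """
    return variance(volumes) / mean(volumes)


# Hypothetical morning-peak volumes of one device on eight workdays.
example_volumes = [980, 1050, 1010, 940, 1100, 1020, 960, 1070]
is_overdispersed = dispersion_index(example_volumes) > 1
```

The same function applies unchanged to the day-to-day series of OD flows.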

Generating the day-to-day OD flows
Regarding the daily periods for the OD flow analysis, Figure 7a illustrates how the daily time period for an OD pair was defined. The figure shows a sequence of hourly traffic volume profiles of three devices located on a roadway connecting the northwest region to the central region. As can be seen, the predominant profile between these two regions is profile 3, which was used to define the time periods for the corresponding OD flow analysis.
For each daily period (morning, afternoon, and midday), the analysis of the variation in reading rates throughout 2017 resulted in 106 devices suitable for license plate association. Figures 7b and 7c show the variation in the proportion of readings throughout 2017 for one device that was selected and another that was rejected, respectively.
After performing the license plate association between regions and the travel time treatment based on the OSM tool, the day-to-day OD flows for each daily period were generated. Figure 8 shows the time series for the morning peak of the OD flow between the northwest and central regions, along with the identified outliers, the timeline trend, and the histogram of the variable. As in Figure 6a, for most series of OD flows the atypical months (January, July, and December) presented fewer trips.
The Shapiro-Wilk normality test (p-value = 0.19) gave no evidence, at the 5% significance level, that the OD flow distribution differs significantly from the normal distribution. The test was performed for all OD pairs, considering only typical months, and the null hypothesis of normality could not be rejected in most tests at a significance level of 5%. As with the traffic volumes, overdispersion was observed for the OD flows, even higher than that of the traffic volumes. This is a result not only of the variation of the OD flow throughout the year, but also of the variation in reading rates and of the spatial distribution of the selected devices. Assuming that the detected trips were randomly sampled, it is possible to analyze the dynamics of the OD flow between regions from the generated day-to-day time series. This analysis is outside the scope of this paper.
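The license plate association and travel time treatment used to obtain the OD flows can be illustrated with the simplified sketch below. It counts a trip between two regions when the same plate is read consecutively in both and the elapsed time does not exceed a maximum plausible travel time. The record layout, the function name, and the `max_travel_min` table (standing in for travel times obtained from a routing tool) are all assumptions of this example, and the matching rule is a generic one rather than the authors' exact procedure.

```python
from collections import Counter


def od_flows(readings, region_of, max_travel_min):
    """Count trips between regions from license plate readings.

    readings: iterable of (plate, device, minute_of_day) tuples (assumed layout).
    region_of: dict mapping a device id to the region where it is installed.
    max_travel_min: dict mapping (origin, destination) to the longest plausible
        travel time in minutes (hypothetical values in this sketch).
    """
    last_seen = {}  # plate -> (region, minute) of its previous reading
    flows = Counter()
    for plate, device, minute in sorted(readings, key=lambda r: r[2]):
        region = region_of[device]
        if plate in last_seen:
            prev_region, prev_minute = last_seen[plate]
            pair = (prev_region, region)
            if prev_region != region and pair in max_travel_min:
                # Discard matches that take longer than a plausible direct trip.
                if minute - prev_minute <= max_travel_min[pair]:
                    flows[pair] += 1
        last_seen[plate] = (region, minute)
    return flows
```

Running the function once per day and per daily period yields one observation of each OD pair per day, that is, the day-to-day series analyzed above.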

CONCLUSIONS
This work presented a methodology for TMS-ANPR data treatment for generating day-to-day time series of traffic volumes and OD flows in urban networks. The study contributes to the use of data from TMS systems for the generation of time series that allow evaluating the variability of urban traffic, thus supporting studies that aim to empirically verify theoretical assumptions about day-to-day assignment methods and OD matrix estimation models, such as the probability distributions adopted and the temporal correlation of the variables.
As discussed so far, the process of handling large volumes of data from automatic collection systems is not only essential for reliable analysis, but is also the first step in understanding the dynamics of traffic in an urban environment. In other words, in addition to the cleaning (eliminating failures and anomalies in the data) and organization procedures, the treatment stage can also include the definition of stable traffic periods and a preliminary descriptive analysis of the resulting data. This analysis makes it possible to identify traffic patterns that assist in raising hypotheses to be tested about the phenomenon of interest. Such hypotheses may concern the probability distribution of the variables, seasonal effects (monthly, weekly, and daily differences), and differences in traffic patterns between regions of the urban network. Specifically, this work contributed to the definition and analysis of traffic variation patterns by applying clustering techniques, which in the case of Fortaleza-CE revealed the tendency of commuting towards the central region of the city, where most commercial and service activities are concentrated.
The study also contributed to the use of TMS-ANPR data to generate OD flows in urban areas, which usually require a great effort to collect. Such data make it possible to analyze the day-to-day variability of OD flows, which is essential for the urban planning of major cities. Therefore, we stress three efforts made in this study: to adopt a division of the study area into regions that represent different land use and socioeconomic characteristics and that yield an adequate sample of OD trips; to select for analysis a set of devices that worked properly for most of the year (i.e., considered reliable according to the proportion of non-zero traffic observed in 5-minute intervals) and with low variation in reading rates; and to treat the obtained OD flows based on the travel time expected from the OSM tool, avoiding some bias in the data.
One limitation of this study was that several devices were not suitable for generating traffic volume and OD flow series, probably due to failures (e.g., lack of synchronization between devices, plate reading failures, and vehicle registration failures). Another limitation concerns the distribution of devices in the analyzed area, with some regions having few devices. The effect of the spatial distribution of the TMS-ANPR equipment (installed for enforcement purposes) on the network is an important issue for further studies. Furthermore, the accuracy of the ANPR system of Fortaleza is unknown. This issue has been addressed in previous studies that proposed methods to improve the matching of imperfect license plate readings, even when the ANPR accuracy is unknown (Oliveira et al., 2012, 2013). Such techniques can be incorporated into the proposed method and should be a subject for future work. Finally, the data treatment method can support future studies on the influence of network performance on the multiday dynamics of traffic volumes and OD flows.