Evidence for a Conserved Quantity in Human Mobility

Recent seminal works on human mobility have shown that individuals constantly exploit a small set of repeatedly visited locations. A concurrent literature has emphasized the explorative nature of human behavior, showing that the number of visited places grows steadily over time. How to reconcile these seemingly contradictory facts remains an open question. Here, we analyze high-resolution multi-year traces of $\sim$40,000 individuals from 4 datasets and show that this tension vanishes when the long-term evolution of mobility patterns is considered. We reveal that mobility patterns evolve significantly yet smoothly, and that the number of familiar locations an individual visits at any point is a conserved quantity with a typical size of $\sim$25 locations. We use this finding to improve state-of-the-art modeling of human mobility. Furthermore, shifting the attention from aggregated quantities to individual behavior, we show that the size of an individual's set of preferred locations correlates with the number of her social interactions. This result suggests a connection between the conserved quantity we identify, which as we show cannot be understood purely on the basis of time constraints, and the `Dunbar number' describing a cognitive upper limit to an individual's number of social relations. We anticipate that our work will spark further research linking the study of Human Mobility and the Cognitive and Behavioral Sciences.

Here, we investigate individuals' routines across months and years. We reveal how individuals balance the trade-off between the exploitation of familiar places and the exploration of new opportunities, we point out that predictions of state-of-the-art models can be significantly improved if a finite memory is assigned to individuals, and we show that individuals' exploration-exploitation behaviours in the social and spatial domain are correlated.
Our study is based on the analysis of ∼40,000 high-resolution mobility trajectories of two samples of individuals measured for at least 12 months (Table 1). Our datasets rely on different types of location data and collection methods (see section Data Description, and SI Data pre-processing), but share the high spatial resolution and temporal sampling necessary to capture mobility patterns beyond highly regular ones such as home-work commuting [46].

Table 1: Characteristics of the datasets considered. N is the number of individuals, δt the temporal resolution (for the Lifelog dataset, location is recorded at every change in motion), T the duration of data collection, δx the spatial resolution, TC the median weekly time coverage, defined as the fraction of time an individual's location is known. See also SI Figs. S1, S2, S3.

All datasets display statistical properties consistent with those reported in previous studies focusing on shorter timescales [1,2] (see also SI, Fig. S7), and their temporal resolution and duration make them ideal for investigating the evolution of individual geo-spatial behaviors on longer timescales. Moreover, three of the datasets considered (CNS, MDC, RM) include also information on individuals' interactions across multiple social channels (phone calls, sms, Facebook, see also SI Data pre-processing), allowing us to connect individuals' spatial and social behaviours across long timescales.
The results presented below hold for the four considered datasets.

Results
Individuals' set of visited locations grows with characteristic sub-linear exponent. When initiating a transition from one place to another, individuals may either return to a previously visited place or explore a new location. To characterize this exploration-exploitation trade-off, we represent individual geo-spatial trajectories as sequences of locations, where 'locations' are defined as places where participants in the study stopped for more than 10 minutes (Fig. 1A, see also SI, Data pre-processing). A first question concerning the long-term exploration behavior of individuals is whether an individual's set of known locations continuously expands or saturates over time. We find that the total number of unique locations $L_i(t)$ an individual $i$ has discovered up to time $t$ grows as $L_i \propto t^{\alpha_i}$ (Fig. 1B), and that individuals' exploration is homogeneous across the populations studied, with $\alpha_i$ peaked around $\overline{\alpha}$ (Lifelog: $\overline{\alpha} = 0.73$, CNS: $\overline{\alpha} = 0.61$, MDC: $\overline{\alpha} = 0.69$, RM: $\overline{\alpha} = 0.76$) (Fig. 1C).
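As an illustration of how the exponent $\alpha_i$ could be estimated for one individual, the following minimal Python sketch counts the cumulative number of distinct locations and fits a straight line in log-log space by ordinary least squares. The function names and the choice of a log-log least-squares fit are assumptions made for illustration; the fitting procedure is not specified here.

```python
import math


def cumulative_unique_locations(visits):
    """Return L(t) for t = 1..len(visits), where visits is a time-ordered
    list of location identifiers (hypothetical input format)."""
    seen = set()
    counts = []
    for location in visits:
        seen.add(location)
        counts.append(len(seen))
    return counts


def fit_power_law_exponent(times, counts):
    """Estimate alpha in counts ~ times**alpha as the least-squares slope
    of log(counts) against log(times). All inputs must be positive."""
    xs = [math.log(t) for t in times]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A slope below 1 corresponds to the sub-linear growth described above; a slope equal to 1 would indicate that every new visit reaches a previously unseen location.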
This sub-linear growth occurs regardless of how locations are defined or when in time the measurement starts (see SI, Fig. S15). This behavior is a characteristic signature of Heaps' law [47], and is consistent with findings from previous studies focusing on shorter time-scales [2].
[5,48,49,50,51]. Hence, at any point in time, each individual is characterized by an activity space (Fig. 1D), defined as the set of locations she visits as a result of her daily activities [52,49]. Operationally, we define the activity space as the set $AS_i(t) = \{\ell_1, \ell_2, ..., \ell_k, ..., \ell_C\}$ of locations $\ell_k$ that individual $i$ visited in at least two different weeks and where she spent on average more than 10 minutes/week during a time-window of 20 consecutive weeks preceding time $t$. The results presented below are robust with respect to variations of this definition, such as changes of the time-window size or the definition of a location (see SI, Figs. S9, S11, S14, Tables S1 and S2).
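The operational definition of the activity space can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the input format (tuples of week index, location identifier and minutes spent), the function name, and the choice to average the time spent over the full window length are all assumptions.

```python
from collections import defaultdict


def activity_space(stops, t_week, window=20, min_weeks=2,
                   min_minutes_per_week=10.0):
    """Return the set of locations in the activity space at week t_week.

    stops: iterable of (week_index, location_id, minutes) tuples for one
    individual (hypothetical format). Only the `window` weeks strictly
    preceding t_week are considered.
    """
    weeks_seen = defaultdict(set)   # location -> distinct weeks with a visit
    minutes = defaultdict(float)    # location -> total minutes in the window
    for week, location, mins in stops:
        if t_week - window <= week < t_week:
            weeks_seen[location].add(week)
            minutes[location] += mins
    return {
        location
        for location in minutes
        if len(weeks_seen[location]) >= min_weeks
        # Assumption: the weekly average is taken over the whole window.
        and minutes[location] / window > min_minutes_per_week
    }
```

The default arguments mirror the thresholds quoted above (20 weeks, two distinct weeks, 10 minutes per week), so varying them gives a simple way to probe the robustness of the definition.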
Thus, individuals continually explore new places, yet they are loyal to a limited number of familiar locations forming their activity space. But how does the discovery of new places affect an individual's activity space?
We find that the average probability $P$ that a newly discovered location will become part of the activity space stabilizes at $P$ (CNS: $P$ = 15%, Lifelog: $P$ = 7%, MDC: $P$ = 15%, RM: $P$ = 20%) over the long term, indicating that individuals' activity spaces are inherently unstable and new locations are continually added. However, over time individuals may also cease to visit locations that are part of the activity space. The balance between newly added and dismissed familiar locations is captured by the temporal evolution of the activity space, which we characterize by the spatial capacity and the net gain. We define the spatial capacity $C_i$ as the number of an individual's familiar locations, i.e. the activity space size, at any given moment. The net gain $G_i$ is defined as the difference between the number of locations that are respectively added ($A_i$) and removed ($D_i$) at a specific time, $G_i = A_i - D_i$. Fig. 2A shows the evolution of the average capacity $\overline{C}$ for the populations considered, normalized to account for the effects due to different data collection methods (see SI, Data pre-processing).
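To make the definitions of capacity and net gain concrete, the sketch below computes both from a sequence of activity spaces measured at successive times. It is an illustration under assumptions: the input is taken to be a list of Python sets (one per time step), and the function name is hypothetical.

```python
def capacity_and_gain(activity_spaces):
    """Compute capacity, additions, removals and net gain over time.

    activity_spaces: list of sets of location ids, one per time step
    (hypothetical format). Returns (capacity, added, removed, gain);
    the last three lists have one entry per consecutive pair of steps.
    """
    capacity = [len(space) for space in activity_spaces]
    pairs = list(zip(activity_spaces, activity_spaces[1:]))
    added = [len(current - previous) for previous, current in pairs]
    removed = [len(previous - current) for previous, current in pairs]
    gain = [a - d for a, d in zip(added, removed)]
    return capacity, added, removed, gain
```

By construction the net gain equals the change in capacity between consecutive time steps, so a gain that fluctuates around zero is equivalent to a capacity that does not drift, even when locations are continually added and removed.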
We find no evidence for rejecting the hypothesis that the average capacity does not change in time. Thus, despite individual activity spaces evolving over time, the average capacity is a conserved quantity.
Conservation of spatial capacity holds for individuals. The conservation of the average spatial capacity may result from either (i) each individual maintaining a stable number of familiar locations over time or (ii) a substantial heterogeneity of the populations considered, with certain individuals shrinking their set of familiar locations and others expanding theirs. We test the two hypotheses by measuring the individual average net gain across time $\langle G_i \rangle$ and its standard deviation $\sigma_{G,i}$. If a participant's average gain is within one standard deviation of 0, hence $|\langle G_i \rangle| / \sigma_{G,i} < 1$, then the net gain is consistent with zero.
To interpret the information contained in the measured value of the spatial capacity, we randomize the temporal sequences of locations in two ways, preserving routines of individuals only up to the daily level.
After breaking individual time series into modules of 1 day length, (a) we randomize individual time series preserving the module/day units (local randomization) or (b) we create new sequences by assembling together modules extracted randomly from the whole set of individual traces (global randomization) (see SI, Fig. S16). Due to the absence of temporal correlations, the capacity is constant in time also for the randomized datasets. However, the capacity of the randomized sets is significantly higher than in the real time series for both randomizations under the Kolmogorov-Smirnov test (see SI, Table S3), implying that the observed value in real data is not a simple consequence of time constraints. Instead, the fixed capacity is an inherent property of human behavior.
Time-evolution shows that activity space changes gradually. The time evolution of the activity space supports this finding. We measure the turnover of familiar locations using the Jaccard similarity $J_i(t, \gamma)$ between the weekly activity space at $t$ and at $t + \gamma$.
In order to characterize the structure of the activity space, we investigate how individuals allocate time among different classes of locations defined on the basis of their average visit duration. We consider intervals $\Delta T$, with $\Delta T$ ranging from 10 to 30 minutes per week (the time it takes to visit a bus stop or grocery shop) up to 48 to 168 hours per week (such as for home locations). For each of these location classes, we compute the evolution of the capacity $c_i^{\Delta T}$ and the gain $G_i^{\Delta T}$, and test the hypothesis $G_i^{\Delta T} = 0$, as above. We find that, although the activity space subsets are continuously evolving, $c_i^{\Delta T}$ is conserved for each $\Delta T$ (Fig. 4, see also SI, Figs. S18, S19, S20, S21 and Table S4).
Including evolution of activity space improves modeling. Our results have consequences for the modeling of human mobility.
The state-of-the-art exploration and preferential return model [2,34] describes agents that, when not exploring a new location, return to a previously visited place selected with a probability proportional to the number of former visits. The model reproduces some of the empirical observations described above, including the conservation of the spatial capacity (Fig. 2), but fails to describe the time evolution of the activity space (Fig. 3) and the differences between randomized and non-randomized series. To overcome this limitation, we start from the observation that the exploitation probability for a location is time-dependent [53,54] and endow the agents with a finite memory $M$.
[55,56,32,57], due to cognitive constraints [55], and it has been hypothesized that the size of one's activity space is proportional to one's social network geography [58]. Motivated by these observations, we test the hypothesis of a correlation between individuals' spatial capacity and the size of their social circle, as measured by the number of people contacted by either phone call or SMS over a period of 20 weeks. We find that a significant positive correlation exists (see Fig. 5) and consider that this observation calls for further analyses of the connections between human social and spatial exploration-exploitation behaviour.
The spatial capacity is hierarchically structured, indicating that individual time allocation for categories of places is also conserved. These results have allowed us to improve existing models of human mobility, which are unable to fully account for long-term instabilities and fixed-capacity effects.
Taken together, these findings shed new light on the underlying dynamics shaping human mobility, with potential impact for a better understanding of phenomena such as urban development and epidemic spreading. Extending our scope beyond mobility, we have shown that individuals' spatial capacity is correlated with the size of their social circles. In this respect, it is interesting to note that fixed-size effects in the social domain [55,56,32,57] have been put in direct relation with human cognitive abilities [55]. We anticipate that our results will stimulate new research exploring this connection.

Data description
Reality To preserve privacy, GPS traces were pre-processed (internally at SONY Mobile) to infer stop-locations

Supplementary Information for Evidence for a Conserved Quantity in Human Mobility
1 Data pre-processing
The four datasets considered collect different types of location data. For each of them we obtained sequences of intervals describing individuals' pauses at a given location:

User | Interval Start | Interval End | Location
In this section, we describe data collection and the pre-processing applied to obtain such records.
Characteristics of the datasets are shown in Figs. S1, S2, S3. For all datasets, we consider only intervals longer than 10 minutes.
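A record of this form, together with the 10-minute filter, could be represented as in the following minimal Python sketch. The class and function names are hypothetical, and the field types (string identifiers, `datetime` boundaries) are assumptions, since the storage format of the records is not described.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Stop:
    """One pause of a user at a location (hypothetical record type)."""
    user: str
    start: datetime
    end: datetime
    location: str

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


def keep_long_stops(stops, min_duration=timedelta(minutes=10)):
    """Keep only intervals strictly longer than min_duration."""
    return [stop for stop in stops if stop.duration > min_duration]
```

Keeping the threshold as a parameter makes it straightforward to repeat an analysis with a different minimum stop duration.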

Lifelog dataset
Data Collection
Data was collected by the Lifelog Sony app [62]. The app collects location data opportunistically (i.e., if another app requests location data for the device, Lifelog will get a copy of the location). The app does not collect locations at a fixed time interval. Instead, the heuristic is to get updates when there is a change in the motion-state of the device (if the accelerometer registers a change), or when the app uploads/downloads data to/from the servers, which by default happens at least once per day. Communication with servers can be more frequent, as the app connects to the servers every time it is opened. If two data-points are close together in time (less than 15 minutes) and space, the backend aggregates them. The spatial distribution of data points is shown in Fig. S4.

Selection of users
We have selected users who have data for at least 365 days (∼36,000 users).
Definition of Locations
GPS data is pre-processed to infer stop-locations using the distance grouping method described in [61]. The method is built on the idea that a stop corresponds to a temporal sequence of locations within a maximal distance $d_{max}$ from each other. In the main text, results are presented for $d_{max}$ = 50 m. Below, we show that the same results hold for $d_{max}$ = 30 m, and
- Down-sample weeks with weekly time-coverage higher than $w_m$ by selecting a random sample of total duration $tC(w_m) * 60$ minutes.
Results presented in the main text are produced with method (a). We show below (see section Robustness Tests) that results hold also under method (b).
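The idea behind distance grouping, a stop being a temporal sequence of positions that all lie within $d_{max}$ of each other, can be illustrated with the simplified greedy sketch below. This is not the procedure of [61]: the greedy strategy, the planar coordinates in meters, the centroid as stop position, and all names are assumptions made for illustration.

```python
import math


def group_stops(points, d_max=50.0, min_duration=600.0):
    """Greedy grouping of time-ordered positions into stops.

    points: list of (timestamp_seconds, x_meters, y_meters), time-ordered
    (hypothetical format, planar coordinates assumed).
    Returns a list of (start, end, centroid_x, centroid_y) for groups
    lasting longer than min_duration seconds.
    """
    stops = []
    group = []

    def flush():
        # Emit the current group as a stop if it lasted long enough.
        if group and group[-1][0] - group[0][0] > min_duration:
            cx = sum(p[1] for p in group) / len(group)
            cy = sum(p[2] for p in group) / len(group)
            stops.append((group[0][0], group[-1][0], cx, cy))

    for point in points:
        # A point joins the group only if it is within d_max of every member.
        if all(math.dist(point[1:], q[1:]) <= d_max for q in group):
            group.append(point)
        else:
            flush()
            group = [point]
    flush()
    return stops
```

With `d_max=50.0` and `min_duration=600.0` the sketch matches the 50 m threshold and the 10-minute minimum interval quoted in this section; passing `d_max=30.0` corresponds to the alternative threshold mentioned above.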

CNS mobility dataset
Location data is obtained by combining Wi-Fi data (sampled every ∼15 s) with GPS data (high spatial resolution). The following methodology was implemented to estimate the sequences of individuals' stop-locations:
Estimation of Wi-Fi Access Points (AP) position
Access Point (AP) positions were estimated using participants' sequences of GPS scans. We discarded mobile APs, which are located on buses or trains, and moved APs, which were displaced during the experiment (for example by residents of Copenhagen changing apartment and taking their APs with them). Then, we considered all Wi-Fi scans happening within the same second as a GPS scan to estimate AP locations. The AP location estimation error is below 50 meters in 99% of cases. Most of the APs are located in the Copenhagen area (see [60] for a detailed description of the methodology).

Definition of Locations
We find locations by clustering APs based on the distance between them.
First, we build the undirected graph of simultaneous AP detections $G = (V, E)$, where $V$ is the set of geolocalised APs and links $e(j, k)$ exist between pairs of access points that have ever been scanned in the same 1 min bin by at least one user. Then, we compute the physical distances $dist(j, k)$ for all pairs $(j, k) \in E$, and we consider the set of links $E_d \subset E$ such that $dist(j, k) < d$, where $d$ is a threshold value, to define a new graph $G_d = (V, E_d)$. Finally, we define a location as a connected component in the graph $G_d$. For $d$ = 5 m the maximal distance between two APs in the same location is smaller than 10 m for most locations and at most ∼200 m (see Fig. S6-A). The number of APs in the same location is lower than 10 for most locations, but reaches ∼1000 for dense areas such as the University campuses (see Fig. S6-B). An example of AP clustering for $d$ = 5 m and $d$ = 10 m is shown in Figs. S6-C and S6-D. We show below that our findings do not depend on the choice of the threshold (see section Robustness Tests).
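The clustering step, thresholding co-detection links by distance and taking connected components, can be sketched with a small union-find structure. This is an illustrative sketch rather than the code used for the study: the function name, the input format, and the use of planar coordinates in meters are assumptions.

```python
import math


def cluster_access_points(positions, co_detected, d):
    """Group access points into locations.

    positions: dict mapping AP id -> (x, y) in meters (planar coordinates
    assumed). co_detected: iterable of (j, k) pairs of APs scanned in the
    same time bin (the link set E). d: distance threshold in meters.
    Returns the connected components of the thresholded graph as sets.
    """
    parent = {ap: ap for ap in positions}

    def find(ap):
        # Follow parent pointers to the root, compressing the path.
        while parent[ap] != ap:
            parent[ap] = parent[parent[ap]]
            ap = parent[ap]
        return ap

    for j, k in co_detected:
        # Keep only links shorter than the threshold (the link set E_d).
        if math.dist(positions[j], positions[k]) < d:
            root_j, root_k = find(j), find(k)
            if root_j != root_k:
                parent[root_j] = root_k

    components = {}
    for ap in positions:
        components.setdefault(find(ap), set()).add(ap)
    return list(components.values())
```

Because membership is transitive, two APs in the same component can be farther apart than $d$, which is why the maximal distance within a location can exceed the threshold.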
Temporal aggregation
Data was aggregated in bins of length 1 min, where for each bin we selected the most likely location.

MDC mobility dataset
Data collection is described in [43] and [42]. We used the GSM data, sampled every 60 seconds.

RM mobility dataset
Data collection is described in [59] and [44]. We used the GSM data.

Comparison with previous research
Our datasets display statistical properties consistent with previously analyzed data on human mobility.
• The visitation frequency $f$ of a location, defined as the fraction of visits to that location, scales with the location rank $r$ as $r^{-\zeta}$, with $\zeta \sim 1$ (Fig. S7-A). Our result is consistent with [1], where the authors found $f(r) \propto 1/r$, and [2], where it was found that $f(r) \propto r^{-1.2}$.
• The growth of individuals' radius of gyration (see [1], SI for definition) over time is consistent with the logarithmic growth described in [2] (Fig. S7-C).
• Individuals are distributed heterogeneously with respect to their radius of gyration measured at the end of the experiment, with the probability distribution $P(r_g)$ (Fig. S7-D) decaying as a power-law with coefficient β = −1.47. This is comparable with the results found in [1], β = −1.65, and [2], β = −1.55, where both studies relied on CDRs.
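For the first of the properties listed above, the visitation frequency as a function of rank can be computed as in the short sketch below. The function name and the input format (a flat list with one location identifier per visit) are assumptions made for illustration.

```python
from collections import Counter


def visitation_frequency_by_rank(visits):
    """Return the fraction of visits to each location, ordered by rank.

    visits: list with one location id per visit (hypothetical format).
    The first entry corresponds to the most visited location (rank 1).
    """
    counts = Counter(visits)
    total = sum(counts.values())
    return [n / total for _, n in counts.most_common()]
```

The exponent ζ could then be estimated from the slope of these frequencies against rank in log-log space.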

Robustness Tests
The results presented in the main text do not depend on how locations are defined, nor on the time-window used to investigate the long-term behavior. In this section, we show how the results are derived and we demonstrate their statistical robustness. To avoid confusion, we will indicate with $\overline{x}$ the average value of a quantity $x$ across the population, and with $\langle x \rangle$ the average across time.

Conservation of the spatial capacity
The activity space is defined here as the set $AS_i(t) = \{\ell_1, \ell_2, ..., \ell_k, ..., \ell_C\}$ of locations $\ell_k$ that individual $i$ visited at least twice and where she spent on average more than 10 minutes/week during a time-window of $W$ consecutive weeks preceding time $t$. In Fig. S8, we show that for $W$ = 10 weeks, the activity space contains on average a small fraction of all locations seen during the same 10 weeks. Yet, the time spent in these locations is on average close to the total time (Fig. S8).
Given this definition, the number of locations an individual i visits regularly is equivalent to the activity space size C i (t) = |AS i (t)|. We call this quantity spatial capacity.
Evidence 1 The average individual capacity C is constant in time regardless of the definition of location or the choice of the window size W (Table S1 and Table S2).

This result is tested in several ways:
Linear Fit Test We perform a linear fit of the form C(t) = a + b · t, computed with the least squares method. We test the hypothesis H 0 : b = 0, under independent 2-samples t-tests.

Power Law Fit Test
We perform a power-law fit of the form C(t) ∝ t β , computed with the least squares method. We test the hypothesis H 1 : β = 0, under independent 2-samples t-tests.

Multiple intervals test
We compare the value of C across different time-intervals δt k . We divide the total time range into time-intervals δt k spanning w weeks. We compute the average capacity C(δt k ) and its standard deviation σ C (δt k ) for each time-interval δt k . We test the hypotheses  Table S1.

Evidence 2
The individual weekly net gain of locations is equal to zero. The net gain defined  Table S2 and Fig. S9.
Evidence 3 The average value of spatial capacity saturates for increasing values of the time-window $W$. We find that for all datasets the average capacity $\overline{C} \sim 25$. This result is obtained after accounting for the differences in data collection by considering the normalized spatial capacity, where $TC_i$ is the weekly time coverage of individual $i$ (see Figs. S10, S11). Individuals' capacity values are distributed homogeneously around the mean (Fig. S12).

Evolution of the activity space: Invariance under time translation
We verified that the evolution of the activity space is not influenced by the particular time at which the data collection started or by the time elapsed from that moment. We borrow the concept of aging from the physics of glassy systems [63, 64]. A system is said to be in equilibrium when it shows invariance under time translations; if this holds, any observable comparing the system at time t with the system at time t + γ is independent of the starting time t. In contrast, a system undergoing aging is not invariant under time translation. This property can be revealed by measuring correlations of the system at different times.
We measure the evolution of the activity space starting at different initial times $t$ to verify whether the system undergoes aging effects. The evolution is quantified by measuring the Jaccard similarity $J_i(t, \gamma) = |AS_i(t) \cap AS_i(t+\gamma)| / |AS_i(t) \cup AS_i(t+\gamma)|$. The average similarity $\overline{J}(t, \gamma)$ decreases in time: power-law fits of the form $\overline{J}(t, \gamma) \sim \gamma^{\lambda(t)}$ yield $\lambda < 0$ for all $t$. The fit coefficient $\lambda(t)$ fluctuates around a typical value because of seasonality effects, but does not change substantially as a function of the starting time $t$ (Fig. S13), hence $\overline{J}(t, \gamma) = \overline{J}(\gamma)$. This implies that the rate at which the activity space evolves does not substantially depend on when the measurement is initiated. We conclude that our data reflect the 'equilibrium' behavior of the monitored individuals. The fact that our datasets allow us to replicate measures performed on other datasets obtained with different methods (see above) further confirms this finding.
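The similarity measurement can be illustrated with the following sketch, which uses the standard Jaccard definition (size of the intersection divided by size of the union) and evaluates it for one individual from a chosen starting week. The function names, the list-of-sets input format, and the convention for two empty sets are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets."""
    union = a | b
    if not union:
        # Assumption: two empty activity spaces are treated as identical.
        return 1.0
    return len(a & b) / len(union)


def similarity_curve(weekly_spaces, t, max_lag):
    """Return [J(t, 1), J(t, 2), ...] for one individual.

    weekly_spaces: list of sets, the activity space at each week
    (hypothetical format). Lags that run past the end of the data
    are omitted.
    """
    last_lag = min(max_lag, len(weekly_spaces) - 1 - t)
    return [
        jaccard(weekly_spaces[t], weekly_spaces[t + lag])
        for lag in range(1, last_lag + 1)
    ]
```

Comparing the curves obtained for different starting weeks `t` is one way to check the time-translation invariance discussed above: if there is no aging, the curves should overlap up to fluctuations.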

Sub-linear growth of number of locations
Individual exploration behavior is quantified by measuring the number of locations $L_i(t)$ discovered up to day $t$. In the main text, we show that $L_i(t)$ grows sub-linearly in time. Here, we show that this also holds when the definition of locations is changed (see Fig. S14). This property of exploration behavior is not affected by the waiting time before starting the measurement, as we verify by repeating the same measurements starting $M$ months after the participant received the phone, for several values of $M$ (see Fig. S15).

Discrepancy relative to the randomized cases
Individual capacity is lower than it could be if individuals were only subject to time constraints. We show this by randomizing individual temporal sequences of stop-locations 100 times, and then comparing the average randomized capacity $C_{rand,i}$ with the real capacity $C_i$. We perform two types of randomizations (see Fig. S16):
• (1) Local randomization: For each individual $i$, we split her digital traces into segments of length 1 day. We shuffle the days of each individual.
• (2) Global randomization: For each individual $i$, we split her digital traces into segments of length 1 day. We shuffle days across different individuals.
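The two randomizations listed above can be sketched as follows. The sketch assumes that each trace has already been split into day-long segments; the function names are hypothetical, and the global variant shown here (pooling all days and redistributing them without replacement while preserving each individual's number of days) is one possible reading of the procedure, not a confirmed detail.

```python
import random


def local_randomization(days, rng):
    """Shuffle the day-long segments of a single individual.

    days: list of day segments (each segment is left intact).
    Returns a new list; the input is not modified.
    """
    shuffled = list(days)
    rng.shuffle(shuffled)
    return shuffled


def global_randomization(days_by_user, rng):
    """Shuffle day-long segments across individuals.

    days_by_user: dict mapping user id -> list of day segments.
    Assumption: days are pooled and redistributed without replacement,
    so that each user keeps the same number of days.
    """
    pool = [day for days in days_by_user.values() for day in days]
    rng.shuffle(pool)
    randomized = {}
    start = 0
    for user, days in days_by_user.items():
        randomized[user] = pool[start:start + len(days)]
        start += len(days)
    return randomized
```

Passing a seeded generator such as `random.Random(0)` makes each of the repeated randomizations reproducible.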
The individual randomized capacity $C_{rand,i}$, averaged across time (see Fig. S17), is higher than in the real case for both the global and the local randomization. We compute the Kolmogorov-Smirnov test statistic (Table S3) to compare the real sample with the randomized samples. We reject the hypothesis that the two samples are extracted from the same distribution since p < α with α = 0.05.
We test several choices of intervals $\Delta T$. We find that when $\Delta T$ increases, the subsets are empty for many individuals, since no locations satisfy the above-mentioned criteria. In Figs. S18, S19, S20, S21 we show the distribution of average individual sub-capacities $C_i^{\Delta T}$. Only subsets with small enough $\Delta T$ are significant for more than 50% of the population, and typically each individual has 1 location where she spends more than 48 hours per week.

Conservation of time allocation
The average sub-capacities $C^{\Delta T}(t)$ are constant in time for several choices of $\Delta T$ and different definitions of location. This is verified with the linear fit test as detailed in a previous section (see Table S4).