## Issues with Data Uncertainty in Tour Optimization

- 18 février 2020
- 4 mins

For a long time, Operations Research algorithms were built to solve problems where data was assumed to be known perfectly, which is approximately never the case in reality. This is obviously paradoxical for a science whose purpose is to help solving problems from the real world with mathematics.

To be fair, true data availability is recent and it really is the arrival of databases filled with (sometimes very) dubious data that helped the scientific community to realise that poor data quality or even unknown data could be problematic for real world applications.

But is it actually an issue ?

In this article, we will try to show that it can actually be an issue, if not THE issue, when dealing with optimization, and that there is no easy answer to that problem.

# The Optimizer’s Curse

To better explain how data uncertainty can be highly detrimental to optimization contexts, we can refer to the concept of Optimizer’s Curse, introduced by Smith and Winkler in [1]. The key idea behind that curse is that when data is uncertain, optimized solutions will tend to be highly deceptive once applied in reality, due to a selection bias from the algorithm itself.

To illustrate that effect, I will take an example from Daniel Kuhn, based on portfolio optimization. Imagine a very simple optimization problem where you want to select the best k items out of n, based on their value. The algorithm for solving that problem is very simple: you just need to rank items by decreasing value and take the top k items.

In that example with n=6 and k=3, the best decision would be to pick the top 3 objects, thus leading to a global picked value of 25.

Now, let us imagine that values are actually unknown (in the context of portfolio optimization, it models the fact that return on investments cannot be known exactly in advance) and we have no choice but to predict them. It is thus safe to assume that the prediction method, be it an average value from past realization or an advanced machine learning approach, will have produced values that are either over or under estimated.

Let us imagine that the predictions for each value are as such (in orange):

If we only know the prediction, our choice will be to pick items 1, 2 and 5, for an expected value of 27. But then the outcome would be a total value of 22, almost 20% less than expected ! In addition to this, we saw that 27 is also higher than the actual optimal value for this problem which is 25. To wrap things up, uncertainty makes us predict a total gain that is actually impossible to achieve even with real data, for a very deceptive outcome.

To conclude, with uncertain data, the decision produced by optimization is overly optimistic. This occurs because making the best decision with uncertain or predicted data will lead to favor decisions based on data biased towards our goal. Here: maximizing the total amount of value collected leads to selecting over-estimated items.

Actually, this is not only true with that example, but with all optimization problems. Indeed, with a bit of thinking, you will find that an optimization problem over uncertain data will always produce solutions that are at most as good as the true optimal solution.

That is the Optimizer’s Curse: in the presence of uncertain data, decisions will always be deceptive in reality.

That rather fundamental property of optimization problems has impacts on many real-world applications. Let’s review how it can impact Vehicle Routing Problems.

# Uncertainty in Vehicle Routing Problems

We’ve seen that data uncertainty, if not handled, will make optimization algorithms producing decisions whose performance will turn out to be deceptive in practice. Routing problems in practice are actually full of data uncertainty. Here are a few examples:

- Travel durations: traffic is obviously something that makes travel durations highly uncertain in practice, even with traffic prediction models that can manage to reduce uncertainty.
- Weights and volumes: in many real-world applications, weights and volume are only declared and not measured, or only a part can be measured accurately.
- Customer demand: when dimensioning a fleet, the major difficulty is that customer demand cannot be known exactly in advance.
- Etc.

If we consider the most prevalent source of uncertainty, traffic, one can easily understand why and how it can be detrimental to tour optimization. Indeed, optimizing a tour can always be seen as finding the best sequence of stops to visit for each driver, so as to optimize some objectives, like the total time spent on the road, under some constraints. Those constraints can be time windows on some stops, or working hours of the drivers. In that context, objective and constraints rely on travel time data, which is highly uncertain due to traffic variability.

That problem is so relevant that it has been investigated in many ways with advanced techniques [3-6]. We will give details of these techniques in another blog post, but for now, we feel like it’s important to give the reader some hints about why that issue cannot be tackled simply.

An intuitive first action one could take to overcome the difficulty of not knowing what data are over or under estimated, is to add a “security margin” on each data. For example, with travel durations, it would make sens to add a 10~20% travel duration margin on every possible trip. Therefore, all tours will be feasible with a higher probability. But even if that seem to be true, it poses two major problems:

- Adding a margin on all durations will also lead to a lower tour performance
- How should we choose that margin ?

The first issue tells us that adding robustness to our solutions will always come at a cost. This is the famous “Price of Robustness”, popularized by Bertsimas and Sim [2]. As a consequence, this example show that while seeking robustness, one should take care of finding the right tradeoff with performance, and it also show that it cannot be done easily.

In our example, the “margin applied to all data” approach is obviously too conservative. Indeed, we know that when data is actually uncertain but a value is provided, it is either over or under estimated. Adding a margin to everything means over estimating everything. One could argue that margins could be added randomly to only a part of the data to mitigate that effect, but as we saw in the first part of the article, the Optimizer’s Curse will make optimized solution relying more on under estimated values, thus making things worse than before !

# Conclusion

To conclude, we saw that data uncertainty can deal fatal blows to an optimized solution and make it very deceptive in reality, due to the Optimizer’s Curse, and that there is no easy answer to this problem. Moreover, we saw that in Routing Optimization, data uncertainty is almost everywhere, and that it’s posing a major threat on most approaches if it is not handled appropriately. What we did not see in that article is how to actually handle it, and that will be our focus on the next blog post.

# References

[1] James E. Smith, Robert L. Winkler. **The Optimizer’s Curse: Skepticism and Postdecision Surprise in Decision Analysis**. Management Science Vol. 52, No. 3 (2006).

[2] Dimitris Bertsimas, Melvyn Sim. **The Price of Robustness**. Operations Research Vol. 52, No. 1 (2004).

[3] Astrid S. Kenyon, David P. Morton. **Stochastic Vehicle Routing with Random Travel Times**. Transportation Science Vol 37, Issue 1 (2003).

[4] Patrick Jaillet, Jin Qi, Melvyn Sim. **Routing Optimization Under Uncertainty**. Operations Research Vol. 64, No. 1 (2016).

[5] Yu Zhang, Roberto Baldacci, Melvyn Sim, Jiafu Tang. **Routing optimization with time windows under uncertainty**. Mathematical Programming Vol. 175, Issue 1-2, pp. 263-305 (2019).

[6] Yu Zhang, Zhenzhen Zhang, Andrew Lim, Melvyn Sim. **Robust Data-Driven Vehicle Routing with Time Windows**. Optimization Online, November 2018.