Network Troubleshooting: Creative Problem Solving as applied to BGP Leaks
Recently, a customer opened a ticket with an interesting problem: one of their third-party leased IPv4 prefixes could reach only part of the internet. For many destinations, packets were disappearing into a black hole. In particular, a speed test server in Atlanta, GA, was identified as a problem. They provided traceroutes showing that the vanished packets never made it to the target network, even though traces originating from a different server on their network worked fine. Why did this leased prefix fail while others worked?
In any connectivity issue, traceroutes generally provide a good place to start. When a customer can provide them in both directions, you can compare routes, identify likely bottlenecks, and tell whether the route is symmetric or asymmetric. Unfortunately, in this case we only had the outbound traceroutes, so the first thing to do was to confirm the failure with a new traceroute and compare it to a successful traceroute from a working source IP. Because we provide managed router services to this client, we were able to conduct our testing from their IP addresses directly, observing both the successful and the unsuccessful traceroutes.
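To make that comparison concrete, here is a minimal Python sketch of how a working and a failing trace might be lined up hop by hop. It is not the tooling we used on this ticket. The function names are hypothetical, the hop lists are assumed to have been parsed from traceroute output already, and the addresses are placeholders from the documentation ranges.

```python
from itertools import zip_longest

def first_divergence(good_hops, bad_hops):
    """Return (index, good_hop, bad_hop) for the first hop that differs, or None."""
    for i, (good, bad) in enumerate(zip_longest(good_hops, bad_hops)):
        if good != bad:
            return i, good, bad
    return None

def last_responding_hop(hops):
    """Return (index, hop) for the last hop that answered; "*" marks a timeout."""
    for i in range(len(hops) - 1, -1, -1):
        if hops[i] != "*":
            return i, hops[i]
    return None

# Placeholder data: one trace from a working source IP, one from the failing prefix.
good_trace = ["192.0.2.1", "198.51.100.1", "203.0.113.1", "203.0.113.50"]
bad_trace = ["192.0.2.1", "198.51.100.1", "*", "*"]

divergence = first_divergence(good_trace, bad_trace)
last_seen = last_responding_hop(bad_trace)
```

The last hop that answers in the failing trace marks roughly where to start asking questions. It does not say who is at fault, since the drop may be happening on the return path.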
The problem appeared after leaving one of our transit providers, so either a route wasn’t getting propagated, or someone was dropping the route for that prefix. With a handle on the issue, we started looking at the BGP routing for these ranges. We were originating both the working and failing source prefixes, and they were all working in other places on the net, so our origins were probably not the problem. Just to be sure, we confirmed that everything was working properly and that there were no differences in our advertisements. We also checked the routes we were receiving from the failing target, and everything looked normal.
If the problem wasn’t on our side, we’d have to look around the internet. Unfortunately, the target didn’t have a looking glass or traceroute server of their own, so we’d have to track the route down the hard way. Using several BGP looking glasses, we were able to determine that some networks were getting a much longer route to the problematic prefix than others. More telling, these longer routes all had one thing in common: the AS upstream of NCHC wasn’t one of our expected transit providers, even though one of our transit providers appeared in the AS path. This unexpected AS had inserted itself between us and one of our transit providers, and was siphoning off traffic for the failing prefix, blocking delivery to us! Someone was leaking a peering route they shouldn’t have been passing on, and because they also used one of our transit providers, they had created a routing black hole.
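The check we were doing by eye can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production leak detector: the function names are hypothetical, the AS paths are assumed to have been copied from looking glass output as lists of AS numbers with the origin last, and the AS numbers are placeholders from the range reserved for documentation.

```python
def upstream_of_origin(as_path, origin_asn):
    """Return the AS adjacent to the origin, skipping any prepending.

    Returns None if the path does not end at origin_asn or has no other AS.
    """
    if not as_path or as_path[-1] != origin_asn:
        return None
    for asn in reversed(as_path):
        if asn != origin_asn:
            return asn
    return None

def suspicious_paths(paths, origin_asn, expected_transits):
    """Return (upstream_asn, path) for each path whose upstream is unexpected."""
    flagged = []
    for path in paths:
        upstream = upstream_of_origin(path, origin_asn)
        if upstream is not None and upstream not in expected_transits:
            flagged.append((upstream, path))
    return flagged

# Placeholder values: 64500 is the origin, 64501 and 64502 are its transits.
ORIGIN = 64500
TRANSITS = {64501, 64502}
observed = [
    [64505, 64501, 64500],          # normal path through a transit provider
    [64506, 64501, 64510, 64500],   # 64510 sits between a transit and the origin
    [64507, 64502, 64500, 64500],   # normal path with origin prepending
]

flagged = suspicious_paths(observed, ORIGIN, TRANSITS)
```

A flagged path is only a lead. An unexpected upstream could also be a legitimate peer or a new provider, so the result still has to be checked against what you actually intend to advertise and to whom.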
AS-PATHs in hand, we immediately knew who to talk to and were able to quickly communicate the problem to their NOC. After we laid out the evidence for a route leak, they realized what was up pretty quickly and were able to correct their filters and block the improper advertisements.
Watching the traceroutes and looking glasses, we were able to see the moment they fixed the issue as the bad routes disappeared from the internet. Moments later, the traceroutes to Atlanta were completing normally. Problem solved!
This is the kind of creative problem solving that can benefit our customers, especially managed service providers and business owners who are serving their own customers and don’t have the time or expertise to reverse-engineer sophisticated issues like this. It is one more reason to ask the right questions and to build confidence in your data center and its people, who can make the difference between downtime and productivity.