Saturday, November 9, 2013

Cisco Object Tracking using IP SLA

One of the most useful features of Cisco IOS is the track command, which lets you track certain events and take actions based on them. Since this is a big topic, I will split it across several posts. Let's first get familiar with the track command, then build our scenarios on what it can do.

The track command, as the name implies, monitors the state of a certain object. That object can be one of many things, so I'll just give a few examples of what it can observe:


  • a route in the routing table, whether it exists or not
  • an Interface state
  • reachability to a certain host ( in conjunction with IP SLA )
This can be very useful in situations where the router or switch should act on its own when an event happens. Let's illustrate this with the following topology.



Let's assume here that R1 is connected to two switches, and both switches are connected to R4.
R1 has two default routes: the main default route with an admin distance of 1, and the backup default route with an admin distance of 200. There is no IGP running in this topology.

The first default route points to next-hop IP 10.1.4.4, and the backup default route points to next-hop IP 20.1.4.4. Now, here's the real problem. Ethernet, unlike many other L2 protocols, doesn't detect remote link failures. If the link between R4 and SW1 goes down, the link between R1 and SW1 stays up. Even though the subnet 10.1.4.0/24 terminates at Layer 3 on R1 and R4, R1 will know nothing about the failed link between R4 and SW1.

Let’s first check the normal operation of the setup we have on hand.

R1#show run | i ip route
ip route 0.0.0.0 0.0.0.0 10.1.4.4
ip route 0.0.0.0 0.0.0.0 20.1.4.4 200

R1#show ip route
Gateway of last resort is 10.1.4.4 to network 0.0.0.0

     1.0.0.0/32 is subnetted, 1 subnets
C       1.1.1.1 is directly connected, Loopback0
     20.0.0.0/24 is subnetted, 1 subnets
C       20.1.4.0 is directly connected, FastEthernet0/1
     10.0.0.0/24 is subnetted, 1 subnets
C       10.1.4.0 is directly connected, FastEthernet0/0
S*   0.0.0.0/0 [1/0] via 10.1.4.4

As you can see, the first default route is the only one installed in the routing table due to its lower admin distance.

R1#show ip int brief
Interface                  IP-Address      OK? Method Status                Protocol
FastEthernet0/0            10.1.4.1        YES manual up                    up     
FastEthernet0/1            20.1.4.1        YES manual up                    up     
Loopback0                  1.1.1.1         YES manual up                    up

All the interfaces are up and everything looks good. Now let’s ping R4's loopback, sourced from R1's loopback.

R1#ping 4.4.4.4 source lo0

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 4.4.4.4, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/52/124 ms

Seems legit. Now let’s simulate a failure between R4 and SW1 by shutting down the interface on the R4 side.

R4(config)#int f0/1
R4(config-if)#shut
R4(config-if)#
*Mar  1 00:47:51.691: %LINK-5-CHANGED: Interface FastEthernet0/1, changed state to administratively down
*Mar  1 00:47:52.691: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/1, changed state to down

Now the whole path of the main default route is unusable, but R1's F0/0 interface is still up and the main default route still points out of F0/0.

R1#show ip int brief
Interface                  IP-Address      OK? Method Status                Protocol
FastEthernet0/0            10.1.4.1        YES manual up                    up     
FastEthernet0/1            20.1.4.1        YES manual up                    up     
Loopback0                  1.1.1.1         YES manual up                    up     

R1#show ip route
Gateway of last resort is 10.1.4.4 to network 0.0.0.0

     1.0.0.0/32 is subnetted, 1 subnets
C       1.1.1.1 is directly connected, Loopback0
     20.0.0.0/24 is subnetted, 1 subnets
C       20.1.4.0 is directly connected, FastEthernet0/1
     10.0.0.0/24 is subnetted, 1 subnets
C       10.1.4.0 is directly connected, FastEthernet0/0
S*   0.0.0.0/0 [1/0] via 10.1.4.4

Now if we try to ping the next hop, the ping will of course fail.

R1#ping 10.1.4.4

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.1.4.4, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)

This is where tracking comes into play. Since R1 isn’t aware of remote Ethernet link state by default, we can make it track events that indicate the link is down, and based on that it can remove the main default route and install the backup route instead. Here’s how we can do this.

To keep it simple, we’re going to send probes to 10.1.4.4 through interface F0/0 and track them. If the probes fail, the main default route is removed, which means the other default route gets installed instead.

First let’s create an SLA object to start pinging our next-hop IP from the desired interface.

R1(config)#ip sla 1?
<1-2147483647> 

R1(config)#ip sla 1
R1(config-ip-sla)#icmp-echo 10.1.4.4 source-interface f0/0

Now we tune three parameters so the probe reacts quickly (the default frequency of 60 seconds is too slow for failover).

R1(config-ip-sla-echo)#frequency ?
  <1-604800>  Frequency in seconds (default 60)
R1(config-ip-sla-echo)#frequency 5

R1(config-ip-sla-echo)#timeout ?
  <0-604800000>  Timeout in milliseconds
R1(config-ip-sla-echo)#timeout 1000

R1(config-ip-sla-echo)#threshold ?
  <0-2147483647>  Millisecond threshold value
R1(config-ip-sla-echo)#threshold 1000

Basically, here’s the definition of each of those:

  • Frequency (sec) is how often a probe is sent
  • Timeout (msec) is the maximum time to wait for a reply before the probe counts as failed
  • Threshold (msec) is the round-trip time above which a reply that did arrive is flagged as too slow

Keep in mind that the threshold cannot be higher than the timeout, which makes sense: a reply that arrives after the timeout is never counted. Here both are set to 1000 ms.
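
To make the relationship between the three timers concrete, here is a small Python sketch. It is only a model of the idea, not how IOS implements it: the class and function names are made up for illustration, the ordering check (threshold no greater than timeout, timeout no greater than frequency) is my understanding of the constraints, and the "Over threshold" label is an assumed name for the threshold return code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaTimers:
    """Timers for one ICMP echo probe (field names are illustrative)."""
    frequency_s: int    # how often a probe is sent, in seconds
    timeout_ms: int     # how long to wait for a reply, in milliseconds
    threshold_ms: int   # a reply slower than this counts as over threshold

    def validate(self) -> None:
        # Assumed ordering: threshold <= timeout <= frequency.
        if self.threshold_ms > self.timeout_ms:
            raise ValueError("threshold must not exceed timeout")
        if self.timeout_ms > self.frequency_s * 1000:
            raise ValueError("timeout must not exceed frequency")


def classify(rtt_ms, timers):
    """Map one probe result to a return code; rtt_ms is None if no reply."""
    if rtt_ms is None or rtt_ms > timers.timeout_ms:
        return "Timeout"
    if rtt_ms > timers.threshold_ms:
        return "Over threshold"  # assumed label, see the note above
    return "OK"


# The values configured on R1 in this post.
timers = SlaTimers(frequency_s=5, timeout_ms=1000, threshold_ms=1000)
timers.validate()
print(classify(32, timers))    # a reply after 32 ms
print(classify(None, timers))  # no reply at all
```

Note that with threshold and timeout both at 1000 ms, the middle case can never trigger in this model. Lowering the threshold (say to 500 ms) is what would let a slow but successful reply be told apart from a lost one.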

After creating the SLA object, we need to activate it by specifying when it should start and for how long it should keep running.

R1(config)#ip sla schedule 1 life forever start-time now

We just told the router to start SLA object 1 immediately and keep it running forever.

Let’s check if it’s working

R1#sh ip sla statistics 1

Round Trip Time (RTT) for       Index 1
        Latest RTT: 32 milliseconds
Latest operation start time: *01:16:38.391 UTC Fri Mar 1 2002
Latest operation return code: OK
Number of successes: 1
Number of failures: 13
Operation time to live: Forever


R1#sh ip sla statistics 1

Round Trip Time (RTT) for       Index 1
        Latest RTT: 32 milliseconds
Latest operation start time: *01:16:38.391 UTC Fri Mar 1 2002
Latest operation return code: OK
Number of successes: 12
Number of failures: 0
Operation time to live: Forever

It seems that our probes are working just fine. Now all we need to do is track the state of these probes so we can take action if they fail.

R1(config)#track 1 rtr 1 reachability

R1(config-track)#delay ?
  down  Delay down change notification
  up    Delay up change notification

R1(config-track)#delay up ?
  <0-180>  Seconds to delay
R1(config-track)#delay up 2

R1(config-track)#delay down ?
  <0-180>  Seconds to delay

R1(config-track)#delay down 2

Keep in mind that Cisco has not always been consistent with its commands. The old name for IP SLA was RTR (Response Time Reporter). The configuration syntax was changed from rtr to ip sla, but on this IOS release the track command still uses the old keyword, so rtr 1 here refers to the SLA 1 object. Newer IOS releases accept track 1 ip sla 1 reachability instead.

What we just configured is a tracking instance that observes the state of the SLA object's reachability to 10.1.4.4. Delay up is how long the track waits before reporting Up after the SLA regains reachability, and delay down is how long it waits before reporting Down after reachability is lost. This is useful because with flapping links you don’t want the router to act instantly. You may want to give it time to settle before it switches between the two states.
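
The effect of the two delays can be pictured with a short Python sketch. This is a simplified model under my own assumptions, not the actual IOS logic: DelayedTrack and its fields are hypothetical names, and the model assumes a pending change is cancelled if the raw state flips back before the delay expires.

```python
class DelayedTrack:
    """Reports a state change only after it has persisted for the delay."""

    def __init__(self, delay_up_s, delay_down_s, initial_up=True):
        self.delay_up_s = delay_up_s
        self.delay_down_s = delay_down_s
        self.state_up = initial_up   # state reported to clients (e.g. a route)
        self._pending_since = None   # when the raw state began to differ

    def update(self, now_s, raw_up):
        """Feed the raw reachability at time now_s; return the reported state."""
        if raw_up == self.state_up:
            self._pending_since = None  # the flap ended before the delay expired
            return self.state_up
        if self._pending_since is None:
            self._pending_since = now_s
        delay = self.delay_up_s if raw_up else self.delay_down_s
        if now_s - self._pending_since >= delay:
            self.state_up = raw_up
            self._pending_since = None
        return self.state_up


# Same delays as configured on R1: 2 seconds up, 2 seconds down.
track = DelayedTrack(delay_up_s=2, delay_down_s=2)
samples = [(0, True), (1, False), (2, True), (3, False), (4, False), (5, False)]
for t, raw in samples:
    print(t, raw, track.update(t, raw))
```

In this run the one-second blip at t=1 is ignored, while the outage that starts at t=3 is only reported as Down at t=5, once it has lasted the full two seconds.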

R1#show track 1
Track 1
  Response Time Reporter 1 reachability
  Reachability is Up
    1 change, last change 00:08:31
  Delay up 2 secs, down 2 secs
  Latest operation return code: OK
  Latest RTT (millisecs) 84

Associating the track with our main default-route, we should be good to go

R1(config)#ip route 0.0.0.0 0.0.0.0 10.1.4.4 track 1

R1#sh run | i route
ip route 0.0.0.0 0.0.0.0 10.1.4.4 track 1
ip route 0.0.0.0 0.0.0.0 20.1.4.4 200

Now let’s simulate a failure again between R4 and SW1 by shutting the interface F0/1 on R4. Here’s what happens afterwards on R1 (the RT: lines come from debug ip routing):

R1#
*Mar  1 01:34:18.587: %TRACKING-5-STATE: 1 rtr 1 reachability Up->Down
R1#
*Mar  1 01:34:18.587: RT: del 0.0.0.0 via 10.1.4.4, static metric [1/0]
*Mar  1 01:34:18.591: RT: delete network route to 0.0.0.0
*Mar  1 01:34:18.591: RT: NET-RED 0.0.0.0/0
*Mar  1 01:34:18.591: RT: NET-RED 0.0.0.0/0
*Mar  1 01:34:18.591: RT: add 0.0.0.0/0 via 20.1.4.4, static metric [200/0]
*Mar  1 01:34:18.591: RT: NET-RED 0.0.0.0/0
*Mar  1 01:34:18.591: RT: default path is now 0.0.0.0 via 20.1.4.4
*Mar  1 01:34:18.595: RT: new default network 0.0.0.0
*Mar  1 01:34:18.595: RT: NET-RED 0.0.0.0/0

R1#show ip route

Gateway of last resort is 20.1.4.4 to network 0.0.0.0

     1.0.0.0/32 is subnetted, 1 subnets
C       1.1.1.1 is directly connected, Loopback0
     20.0.0.0/24 is subnetted, 1 subnets
C       20.1.4.0 is directly connected, FastEthernet0/1
     10.0.0.0/24 is subnetted, 1 subnets
C       10.1.4.0 is directly connected, FastEthernet0/0
S*   0.0.0.0/0 [200/0] via 20.1.4.4

The backup default route is now installed in the routing table, working around the Ethernet problem we had before.
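
The failover we just saw boils down to a simple selection rule, sketched here in Python. Treat it as an illustration only: StaticRoute and best_route are hypothetical names, and the rule (ignore a tracked route while its track is down, then pick the lowest admin distance) is my simplified reading of the behaviour shown in the output above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StaticRoute:
    prefix: str
    next_hop: str
    distance: int = 1               # administrative distance
    track_id: Optional[int] = None  # None means the route is not tracked


def best_route(routes, prefix, track_up):
    """Return the usable route with the lowest distance, or None."""
    usable = [
        r for r in routes
        if r.prefix == prefix
        and (r.track_id is None or track_up.get(r.track_id, False))
    ]
    return min(usable, key=lambda r: r.distance, default=None)


# The two default routes configured on R1.
routes = [
    StaticRoute("0.0.0.0/0", "10.1.4.4", distance=1, track_id=1),
    StaticRoute("0.0.0.0/0", "20.1.4.4", distance=200),
]

print(best_route(routes, "0.0.0.0/0", {1: True}).next_hop)   # track 1 is up
print(best_route(routes, "0.0.0.0/0", {1: False}).next_hop)  # track 1 is down
```

The backup route is always a candidate because it has no track attached. It only wins when the tracked route drops out, which is why it needs the higher admin distance.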

R1#sh ip sla statistics

Round Trip Time (RTT) for       Index 1
        Latest RTT: NoConnection/Busy/Timeout
Latest operation start time: *01:35:58.391 UTC Fri Mar 1 2002
Latest operation return code: Timeout
Number of successes: 176
Number of failures: 70
Operation time to live: Forever

The IP SLA statistics show that the failure was due to a timeout. Had the reply arrived but exceeded the threshold, the return code would have reported a threshold violation instead. The latest RTT shows that there is no connection.

R1#show track
Track 1
  Response Time Reporter 1 reachability
  Reachability is Down
    4 changes, last change 00:04:01
  Delay up 2 secs, down 2 secs
  Latest operation return code: Timeout
  Tracked by:
    STATIC-IP-ROUTING 0

Now we should be able to ping from R1 lo0 to R4 lo0 without a problem.

R1#ping 4.4.4.4 source lo0

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 4.4.4.4, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 20/25/36 ms



In the next couple of posts, we’ll dig deeper with advanced configuration of SLA and tracking, overcoming lots of problems we face in our networks.