MAC Flaps – why are they bad?

What is a MAC Flap?

A MAC Flap occurs when a switch receives frames with the same source MAC address on two different interfaces. If this makes no sense, perhaps a quick summary of how switching at layer 2 works will help.

Switches learn where hosts are by examining the source MAC address in frames received on a port, and populating their MAC address-table with an entry for that MAC address and port. Say a device ‘A’ with MAC aaaa.aaaa.aaaa (hereafter aaaa) sends a frame to device ‘B’ with MAC address bbbb. Assume A is on port 0/1 and B is on port 0/2. The switch populates its MAC address-table something like:

Port		Host
0/1		aaaa

and floods the frame out of all other ports. When B replies the MAC address table becomes:

Port		Host
0/1		aaaa
0/2		bbbb

and the switch forwards the frame to port 0/1 – there is no need to flood now since the location of A is known.

If the switch were then to receive a frame on port 0/2 with a source MAC address of aaaa, there would be a clash and the switch would log something like this:

1664321: Nov 14 11:18:16 UTC: %MAC_MOVE-SP-4-NOTIF:
Host aaaa.aaaa.aaaa in vlan A is flapping between
port 0/1 and port 0/2

and the MAC address-table would become:

Port		Host
0/1
0/2		bbbb
0/2		aaaa

What happens when B tries to send A a frame now? The switch won’t flood the frame, as it has a table entry for the destination, and it won’t send the frame back out of the port it arrived on (0/2), so the frame gets dropped.
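The learning, flooding and dropping behaviour described above can be sketched in a few lines of Python. This is only a toy model under simplifying assumptions (a single VLAN, no ageing of entries, and a hypothetical third port 0/3 and MAC cccc added for illustration); it is not how any real switch is implemented.

```python
class LearningSwitch:
    """Toy model of layer-2 MAC learning on a single VLAN."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.table = {}   # MAC address -> port it was last seen on
        self.flaps = []   # (mac, old_port, new_port) for every move

    def receive(self, in_port, src, dst):
        """Learn the source address, then return the list of egress ports."""
        old_port = self.table.get(src)
        if old_port is not None and old_port != in_port:
            # Same source MAC seen on a different port: record the move.
            self.flaps.append((src, old_port, in_port))
        self.table[src] = in_port

        out_port = self.table.get(dst)
        if out_port is None:
            # Unknown destination: flood out of every other port.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            # Destination appears to live on the ingress port: drop.
            return []
        return [out_port]


switch = LearningSwitch(["0/1", "0/2", "0/3"])
print(switch.receive("0/1", "aaaa", "bbbb"))  # A to B, B unknown: flooded
print(switch.receive("0/2", "bbbb", "aaaa"))  # B to A, A known: forwarded
print(switch.receive("0/2", "aaaa", "cccc"))  # aaaa now seen on 0/2: a move
print(switch.receive("0/2", "bbbb", "aaaa"))  # B to A again: dropped
print(switch.flaps)
```

The last call returns an empty list, which is the dropped frame from the example: the table says aaaa lives on the same port the frame arrived on.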

Lab time…

Let’s see if we can mimic this. This isn’t an easy thing to replicate so please forgive the artificial nature of the lab. I configured a switch with three hosts directly connected on VLAN 30. The hosts could ping each other and the MAC address-table was as follows:


3750-1#show mac address-table dynamic vlan 30
          Mac Address Table
-------------------------------------------

Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
  30    0008.7c82.5409    DYNAMIC     Fa1/0/1
  30    001a.2f22.d0c2    DYNAMIC     Fa1/0/2
  30    0024.97f0.3a70    DYNAMIC     Fa1/0/3
Total Mac Addresses for this criterion: 3

Host A had an IP of 192.168.30.1 and was on port 1. Host B was 192.168.30.30 and on port 2. Host C was 192.168.30.254 and on port 3.

So, ping with host A:

Host A# ping 192.168.30.254
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.30.254,
timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5),
round-trip min/avg/max = 1/201/1000 ms

Ping with host B:

Host B#ping 192.168.30.254

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.30.254,
timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5),
round-trip min/avg/max = 1/2/8 ms

Next I manually set host A to have the same MAC address as host B (001a.2f22.d0c2). The results? Host B lost connectivity for a few seconds.

Host A# int vlan 30
Host A(config-if)# mac-address 001a.2f22.d0c2
Host A# ping 192.168.30.254
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.30.254,
timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)

Here is the switch MAC address-table after the clone:

3750-1#show mac address-table dynamic vlan 30
 Mac Address Table
-------------------------------------------

Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
 30    0008.7c82.5409    DYNAMIC     Fa1/0/1
 30    001a.2f22.d0c2    DYNAMIC     Fa1/0/1
 30    0024.97f0.3a70    DYNAMIC     Fa1/0/3
Total Mac Addresses for this criterion: 3
3750-1#
*Mar 17 04:22:02.620: %SW_MATM-4-MACFLAP_NOTIF:
Host 001a.2f22.d0c2 in vlan 30 is flapping between
port Fa1/0/2 and port Fa1/0/1
3750-1#

Here is what happened to Host B:

Host B#ping 192.168.30.254

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.30.254,
timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)
Host B#ping 192.168.30.254

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.30.254,
timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5),
round-trip min/avg/max = 1/2/8 ms
Host B#ping 192.168.30.254

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.30.254,
timeout is 2 seconds:
.!!!!
Success rate is 80 percent (4/5),
round-trip min/avg/max = 1/1/1 ms

Yes, this is the same impact you would have if two hosts had the same MAC on your network – there is a reason they need to be unique!

What does all this mean?

When you have an annexe VLAN [1] the backbone can be thought of as a series of Layer 2 switches for that VLAN. The ‘Broadcast Domain’ stretches over the entire Backbone. This means the CPU of every host (including our core switches) on a VLAN will receive every broadcast from every other host – this is not ideal but the only way we can offer the same subnet at multiple sites in this generation of the backbone. Another term sometimes used is ‘Failure Domain’. That is, a failure in part of the VLAN could impact the entire core. It is because of this risk to other units that we are keen to make sure annexe VLANs are tightly managed.

[1] These are known as Layer 2 end-to-end VLANs as there is no routing involved. We have called them ‘switched’ VLANs in the past. VLANs with a Layer 3 interface or SVI on the backbone are known as Layer 3 Routed VLANs.

To return to the issues MAC flaps will cause on your network: each switch in the backbone has a MAC address-table for your VLAN. If for some reason your MAC addresses appear from different locations you will get dropped packets, and our logs will fill up with messages. Those messages cause issues when we raise a support case with Cisco, as our network appears to have loops.
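As a rough illustration of how such log messages could be summarised, here is a minimal Python sketch that counts flap notifications per VLAN and host. It assumes the message wording shown in the lab output earlier in this post; the function name and the choice to key on (vlan, mac) are my own, and other platforms may word the message differently.

```python
import re
from collections import Counter

# Whitespace is matched loosely because the message may be wrapped
# across several lines in a log file.
FLAP_RE = re.compile(
    r"Host\s+(?P<mac>\S+)\s+in\s+vlan\s+(?P<vlan>\S+)\s+is\s+flapping\s+"
    r"between\s+port\s+(?P<port_a>\S+)\s+and\s+port\s+(?P<port_b>\S+)"
)


def count_flaps(log_text):
    """Return a Counter keyed by (vlan, mac) for every flap message found."""
    counts = Counter()
    for match in FLAP_RE.finditer(log_text):
        counts[(match["vlan"], match["mac"])] += 1
    return counts


sample = """\
*Mar 17 04:22:02.620: %SW_MATM-4-MACFLAP_NOTIF:
Host 001a.2f22.d0c2 in vlan 30 is flapping between
port Fa1/0/2 and port Fa1/0/1
"""

for (vlan, mac), n in count_flaps(sample).most_common():
    print(f"vlan {vlan} host {mac}: {n} flap message(s)")
```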

What could cause it?

There are three common causes that we see.

  1. Local loops
  2. NAC
  3. Wireless

1. Local Loops

If you don’t run STP then you are far more likely to suffer from network loops. Here are a couple of resources: STP is your friend and Implementing Spanning Tree. The issue with an annexe VLAN is that a local loop is no longer so local and could cause problems everywhere, both for you and others.

2. NAC

There is a legitimate but ill-advised network design which can cause issues. If you have a L2 NAC which forces all traffic through itself then it is possible that a frame will need to leave site A, get switched through to site B only to return to site A, all with the same MAC address. See the image below. I’ve represented the Backbone as one red switch and the ingress and egress ports as tunnel entrances and exits. This design mustn’t be used with the current generation of the backbone.

[Image: NAC issue]

3. Wireless

We used to run OWL and eduroam (Phase 1) over two VLANs which spanned the entire core. Due to the issues I’ve mentioned we changed this last year. Now the VLANs are local to the FroDos and routed through the core. Prior to doing this it was possible to roam between access points connected to different FroDos and cause MAC flaps.

What should I do next?

We’re going to keep an eye on the logs and will let Units know if they are causing MAC flaps. We’ll work with you as far as possible to locate the source of the issue and get things stable. If you aren’t yet running STP, please can I urge you to consider doing so. The new backbone is still some years off so for the good of everyone we need to work together to reduce this. For units which cannot resolve this we may need to look at reverting to a fully routed connection, with each Annexe having its own subnet.

Do get in touch if you have any questions.

Posted in Backbone Network, Best Practices, Cisco Networks

IPv6 Stateful Active/Standby Failover with Cisco ASAs

There was some debate on the Cisco ASA failover situation with regard to IPv6. Since we’re potentially about to make an interim firewall purchase for the main university IPv6 traffic (we route IPv6 separately to IPv4 to avoid a limitation of the older FWSM firewall modules that currently handle the University’s IPv4 traffic) we tested the capabilities to ensure they matched what was required – namely stateful failover of IPv6 traffic. In layman’s terms: your communications with the Internet over IPv6 shouldn’t be interrupted when one firewall is unplugged.

We’ve enough equipment to be able to test, so I set up an airgapped network using IPv6 only, roughly mimicking a basic dual site setup. In production it would hopefully have redundant crosslinks, and fibre would be used to connect the ASAs due to the physical distance between two separate sites (in case one burns down or similar). I used addresses from our public provision but there were no physical connections from the test network. The ASAs need matching software. I applied 8.3(2), although I’ve since been told that anything from 8.2.2 onwards should match my results – obviously I can only confirm the version I tested. The ASA 5510 upwards have identical software/commands so this test should be valid for 5520s, 5540s etc.; it’s the smaller 5505 that is different to the rest of the range in some ways.

I am not a Cisco expert; my own background is system administration, so some of the test was perhaps needlessly complex (the dual switches at each end) but was useful for my own switch revision and practice. If I’ve accidentally left out any configuration from my test writeup that you think would be helpful for people, let me know in the comments and I’ll add it in (the intended audience is IT officers in colleges or departments). The basic plan looked like:

There was one firmware difference between the switches involved, which I ignored. The configuration of the switches isn’t important, but on one of the green/inside 2960s I used

interface Vlan5
 description internal ipv6 network
 ipv6 address 2001:630:440:400::1/64
!
ipv6 route ::/0 2001:630:440:400::EE

…plus the etherchannel and interface vlan memberships which if the above made sense to you, you are most likely already familiar with.

On the red/outside switches

interface Vlan4
 description outside networks
 ipv6 address 2001:630:440:401::1/64
!
ipv6 route 2001:630:440:400::/64 2001:630:440:401::EE

…again, plus the etherchannel and interface vlan memberships which are as expected.

Interfaces

On the ASAs themselves the important parts are firstly the interfaces:

!
interface Ethernet0/0
 description RED (outside) to 3750-1
 nameif outside
 no ip address
 ipv6 address 2001:630:440:401::ee/64 standby 2001:630:440:401::ed
 ipv6 enable
!
interface Ethernet0/1
 description GREEN (inside) to 2960-1
 nameif inside
 ipv6 address 2001:630:440:400::ee/64 standby 2001:630:440:400::ed
 ipv6 enable
!

Just put the above on one ASA of the pair. I left off a management interface for this test as it wasn’t needed.
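Since long IPv6 addresses are easy to mistype, a quick sanity check of the addressing can help before applying it. The following is a minimal Python sketch using the standard ipaddress module, with the addresses copied from the configuration fragments above; the helper name same_subnet is my own and the checks are only the ones I would think to make, not a complete validation.

```python
import ipaddress

# Active and standby addresses copied from the ASA interface configuration.
inside = ipaddress.ip_interface("2001:630:440:400::ee/64")
inside_standby = ipaddress.ip_address("2001:630:440:400::ed")
outside = ipaddress.ip_interface("2001:630:440:401::ee/64")
outside_standby = ipaddress.ip_address("2001:630:440:401::ed")

# Next hops used in the static routes on the switches.
inside_next_hop = ipaddress.ip_address("2001:630:440:400::EE")
outside_next_hop = ipaddress.ip_address("2001:630:440:401::EE")


def same_subnet(active, standby):
    """True if the standby address is in the active interface's subnet
    and is not simply a copy of the active address."""
    return standby in active.network and standby != active.ip


print(same_subnet(inside, inside_standby))
print(same_subnet(outside, outside_standby))
# The switches should route via the ASA's active address on each side.
print(inside_next_hop == inside.ip)
print(outside_next_hop == outside.ip)
```

The comparison is case-insensitive because the module parses the hex digits into a number, so ::EE on the switch and ::ee on the ASA are treated as the same address.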

Failover Link

Then it’s a case of configuring the failover link

On the ASA that you configured the interfaces on, set it as the initial primary unit in the pair

failover lan unit primary

Then configure the failover interface

failover lan interface FOCtrlIntf Ethernet0/3
failover key *****
failover link FOCtrlIntf Ethernet0/3
failover interface ip FOCtrlIntf 2001:630:440:402::1/64 standby 2001:630:440:402::ee
failover

Type exactly the same failover configuration from the above section on the second ASA, but excluding the ‘primary’ statement. Don’t swap the interface addresses around when configuring the second device or it won’t work. You should see a message saying it’s found the second ASA and it’s mirroring the configuration across. You no longer need to type any configuration on the secondary (non-active) ASA, and it will warn you if you attempt to do so.

Firewall Rules

I don’t care about firewall rules for this test, but we want to pass traffic. Obviously on a production system you probably have some more restrictive rules in mind:

ipv6 access-list inbound remark test acl
ipv6 access-list inbound permit icmp6 any any
ipv6 access-list inbound permit ip any any
ipv6 access-list outside remark test outside acl
ipv6 access-list outside permit icmp6 any any
ipv6 access-list outside permit ip any any
access-group outside in interface outside
access-group inbound in interface inside

and I’d like to be able to ping the firewall interfaces themselves while setting up the network in case of human error on my part.

ipv6 icmp permit any outside
ipv6 icmp permit any inside

HTTP Gotcha

If you now test sending traffic from a host on the outside to a host on the inside, all transfers will be fine during failover except http – you have to expressly turn this on. This caught me out initially, as SSH transfers continued fine when the network cable was wrenched from the active ASA but http connections died. If I’d set aside some time and read the failover section of the ASA book properly instead of skim reading it this wouldn’t have been a surprise, as p539 of the Cisco Press ASA book states:

“HTTP connections usually have a short lifetime and therefore are not replicated by default. Additionally, they add considerable load on the security appliance if the amount of http traffic is large in comparison to other traffic.”

The command to enable it is

failover replication http

…after which http transfers during a failover condition will continue fine.

Testing

I tested by transferring a large file via http and ssh (I used a 120MB file) then removing the network cable from one of the active interfaces on the live ASA. When you pull out the network interface you’ll see a pause of about 2 seconds but the transfer will then continue (the session has not died).

For my test a Windows 7 machine was the client, GNU/Linux from a Live CD was the server, although it was just what I had to hand and shouldn’t make any difference. For these I used 2001:630:440:400::2 on the client and 2001:630:440:401::2 on the server.

Without the http replication feature on you’ll see the transfer hang, despite the secondary ASA having successfully taken over the duties of the first. Without stateful failover in general your users would notice a failover. This is why the state information is needed: to remove the impact of a fault on your users.

Conclusion

Everything worked fine. Yes, you may already be aware of this, but we wanted to test to be sure before considering making any purchase.

Posted in Cisco Networks, Firewall, IPv6

IPv6 from a Systems Development perspective

[Guest article by Dominic Hargreaves from the Systems Development and Support team]

Regular readers of the Networks team blog will know that preparatory work to enable IPv6 across the University backbone has been underway for some time. This article offers a perspective of this from the point of view of my team which is concerned with the server side of things (in particular GNU/Debian based). Some of this will be fairly closely tied to the bespoke system management infrastructure we use, but I hope it may be of interest in any case.


Posted in IPv6

Maintenance and Development, January End

So last week was more steady progress. Here’s the rundown of what I’ve been doing and will be doing next week.

DHCP servers maintenance

The new DHCP servers arrived and a base install was completed for both hosts (our new-ish teammate performing one of the installs). This week should see them racked up in their appropriate cabinets and the testing beginning. I’ve already prepared a plan for a deployment day but I’ll go back over it and do a dry run near the end of the week. We might be able to replace these in the Tuesday JA.NET at risk period on the 1st Feb depending on how well this week goes.

Off site servers

When moving a DNS warmspare offsite I hit a snag in that one of our off site locations belongs to another section and for various rather odd physical reasons we’ve run out of space at that site and it’s going to be disruptive to the site providers to fix the issue. We can’t complain too loudly (our section gets hosting there as a favour) so instead we’ve decided to take the three servers to another remote site but the dismantling, travel and re-racking might take up half a day.

DNS Warmspare

I’ve another DNS warmspare to install and deploy. I might do this before moving the off site servers so that I can rack it up at the same location.

IPv6 Firewall

The testing network is set up without any IPv4 and ICMPv6 is forwarding across the firewall. I need to set up a sending and receiving host of some kind, set up the failover configuration and then test the failover behaviour. This needs to be higher priority than the DNS and DHCP work due to political/management time frames.

DNS Web Interface

The DNS interface that local IT officers use has had a bug fixed (warts and all: originally reported in September – sorry, I had to prioritise). There was case sensitivity in a created DNS record, causing CNAME records to be rejected by the web interface at creation time. It’s hard to believe that the bug has been there since the interface was originally written over a decade ago, but it has. The reporter mentioned it should just be a case of ‘strtolower’ or similar on the user’s input, which indeed sounds a sane assumption. For instance, I can’t stand web applications that make me manually format whitespace in a postcode instead of the lazy programmer writing something to do it on the backend, and how hard can it be to lower case user input?

Sadly the DNS interface is 4k+ lines of code with no comments, no Perl ‘strict’ or ‘warnings’, HTML in with the code, single letter variables, and all variables are globals (in the Perl sense, not the PHP sense). The interface also seems to use deliberate user input case sensitivity in certain places and, due to the way the data is tracked by the application, might delete all your CNAMEs if the case sensitivity fix is incorrectly implemented, which I luckily discovered when testing on our dedicated test data. Anyway, it appears to be fixed now.

Since I was working on it I’ve also done some additional minor work so that the success and error messages use our modern CSS styles to make them stand out. We get a lot of RT helpdesk queries to the team from people who have missed the error message the interface was trying to show them, since it didn’t visually stand out.

[edit] This went wrong – I tested on data that included no MX records. The interface handles aliases and MX records in the same process, and in production it decided it would delete and recreate all MX records whenever the edit page was submitted. This caused some issues. I think I’ll have to document the interface as having a case sensitive bug and leave it for another decade.

I reverted the case sensitivity fix but left the interface visual changes in (I try to do separate SVN commits when fixing different issues so if needed the specific changes can be rolled back quickly)[/edit]
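To make the idea concrete, here is a hypothetical Python sketch of a case-insensitive comparison that leaves unchanged records alone rather than deleting and recreating them. It is not the interface’s Perl code and makes no claim about how that code tracks its data; the function names and the example.org records are invented for illustration.

```python
def normalise(name):
    """Lower-case a DNS name and drop any trailing dot, for comparison only."""
    return name.rstrip(".").lower()


def plan_changes(existing, submitted):
    """Compare two {name: target} mappings case-insensitively.

    Returns (to_add, to_delete). Records whose name and target match once
    normalised appear in neither, so they are never deleted and recreated.
    """
    old = {normalise(n): normalise(t) for n, t in existing.items()}
    new = {normalise(n): normalise(t) for n, t in submitted.items()}
    to_add = {n: t for n, t in new.items() if old.get(n) != t}
    to_delete = {n: t for n, t in old.items() if new.get(n) != t}
    return to_add, to_delete


existing = {"WWW.example.org.": "web1.example.org."}
submitted = {
    "www.example.org": "Web1.Example.Org",   # same record, different case
    "mail.example.org": "mx1.example.org",   # genuinely new record
}
to_add, to_delete = plan_changes(existing, submitted)
print(to_add)
print(to_delete)
```

Only the genuinely new record ends up in to_add, and nothing is scheduled for deletion, which is the behaviour the failed fix was missing for MX records.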

Possibility of [www.]ox.ac.uk having a AAAA for the Global IPv6 Day

Last week I made initial approaches to the three groups that make up the www.ox.ac.uk provision. We’ve a meeting arranged for Friday 28th to discuss the politics and technicalities. Another (friendly rival?) university has already informally mentioned they will be IPv6 enabling their main site address for that day. I don’t know about respective plans for lboro.ac.uk and soton.ac.uk but knowing their staff I suspect they’ll be prime suspects for taking part.

Broken IPv6 websites for Testing

For a testing scenario I’ve set up two broken websites. Both of the following sites should work fine if you’ve an IPv4 only host. If your host is dual stacked then the behaviour I suspect you’ll see is documented below. An IPv6 only host should get neither site.

http://broken-ipv6.oucs.ox.ac.uk/

This site has working IPv4 connectivity but IPv6 connections are being silently dropped by a firewall as a simulation of a misconfigured server.

On a dual stack client that favours IPv6 you should see a long delay (~20 seconds?) followed by success.

http://broken-aaaa.oucs.ox.ac.uk/

This site has working IPv4 connectivity and a correct AAAA DNS record exists but the webserver is not configured to listen on the IPv6 address as a simulation of a transitioning or otherwise misconfigured server.

On a dual stack client you may seem to be instantly connected to the host, the web browser trying IPv4 after getting the refusal on IPv6.

I’m not sure the site names are perfect but they’ll do. These sites aren’t to prove any point, they just exist and are there for technical behaviour confirmation on different hosts/software. If you use these, please sanity check the behaviour before each formal testing session in case one day I’m no longer here and someone discovers these sites and mistakenly ‘fixes’ them.
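The two behaviours described above (a long delay when IPv6 packets are silently dropped, a near-instant fallback when the connection is refused) can be reproduced with a simple client that tries IPv6 first. This is a minimal Python sketch, not how any particular browser works; the function name and its injectable resolve/connect parameters are my own, added so the logic can be exercised without touching the network.

```python
import socket


def connect_with_fallback(host, port, timeout=5.0,
                          resolve=socket.getaddrinfo,
                          connect=socket.create_connection):
    """Try each resolved address in turn, IPv6 first, and return
    (family, socket) for the first one that connects."""
    infos = resolve(host, port, type=socket.SOCK_STREAM)
    # Stable sort so AF_INET6 entries are attempted before AF_INET ones.
    infos = sorted(infos, key=lambda info: info[0] != socket.AF_INET6)
    last_error = None
    for family, _type, _proto, _canon, sockaddr in infos:
        try:
            # sockaddr[:2] is (address, port) for both IPv4 and IPv6.
            sock = connect(sockaddr[:2], timeout=timeout)
            return family, sock
        except OSError as exc:
            # A timeout (packets dropped) or a refusal: try the next address.
            last_error = exc
    raise last_error or OSError("no addresses returned")
```

Against the first site a dual stacked client running this would wait for the full timeout on the IPv6 address before trying IPv4; against the second the refusal should come back straight away. Calling connect_with_fallback("broken-ipv6.oucs.ox.ac.uk", 80) would show this, assuming the sites still exist and behave as described.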

Posted in Uncategorized

DNSSEC first steps

DNSSEC is a security extension to the Domain Name System which offers

  • origin authentication of DNS data
  • data integrity
  • authenticated denial of existence

This is useful in helping to protect against attacks such as DNS cache poisoning.

Information on DNSSEC can be found at

We have built a development DNS infrastructure in order to be able to experiment with DNSSEC without risk of adversely affecting the production DNS service. This consists of a hidden master and two secondary authoritative servers running ISC BIND on Debian GNU/Linux.

We wanted to use a zone that was small, static, low-profile and from a DNSSEC-capable registrar – ‘oxford-university.edu.’ matched on all counts.

There are two types of keys used in DNSSEC, zone signing keys (ZSK) that sign the individual resource records in the zone file, and key signing keys (KSK) that sign the ZSKs. The public part of the KSK is registered with the parent zone. This allows frequent changing of ZSK without having to bother the parent zone every time – this only has to be done with a change of KSK. The general consensus seems to be that ZSKs should be changed every month and KSKs every year.

There are various cryptographic algorithms to choose from. To start with, we’ve chosen to generate our KSK with 2048-bit RSASHA1-NSEC3-SHA1, the ZSK with 1024-bit RSASHA1-NSEC3-SHA1, and to use the SHA-256 hash function to generate the digest of the KSK that is used by the parent zone in the Delegation Signer (DS) record.
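For the curious, the DS digest is just a hash over the zone’s owner name (in DNS wire format) followed by the KSK’s DNSKEY record data, as laid out in RFC 4034. Below is a minimal Python sketch of that calculation. The key material is a placeholder and not our real key, and the function names are my own; in practice a tool such as BIND’s dnssec-dsfromkey does this for you.

```python
import base64
import hashlib
import struct


def name_to_wire(name):
    """Encode a domain name in canonical (lower-case) DNS wire format."""
    wire = b""
    for label in name.rstrip(".").lower().split("."):
        if label:
            encoded = label.encode("ascii")
            wire += bytes([len(encoded)]) + encoded
    return wire + b"\x00"   # the root label terminates the name


def ds_digest_sha256(owner, flags, protocol, algorithm, public_key_b64):
    """SHA-256 over owner name + DNSKEY RDATA, returned as upper-case hex."""
    rdata = struct.pack("!HBB", flags, protocol, algorithm)
    rdata += base64.b64decode(public_key_b64)
    return hashlib.sha256(name_to_wire(owner) + rdata).hexdigest().upper()


# Placeholder key material for illustration only.
placeholder_key = base64.b64encode(b"placeholder, not a real public key").decode()

# 257 = KSK flags, 3 = protocol, 7 = RSASHA1-NSEC3-SHA1.
print(ds_digest_sha256("oxford-university.edu.", 257, 3, 7, placeholder_key))
```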

Now that we’ve gone to the trouble of generating keys, signing our zone and giving our parent a DS record, what does this achieve? Any DNSSEC-enabled DNS resolver in the world can now follow a chain of trust all the way from the top (root) of the DNS tree down to an individual resource record in our zone. The resolver must be configured to trust the public component of the root’s KSK.

The excellent site http://dnsviz.net/ offers a visualisation tool which makes it easier to understand the chain of trust. In this example we’re drilling down to the A record for www.oxford-university.edu.

The double ellipse at the top of the diagram indicates that we’re using root’s KSK as the trust anchor and the blue/green arrows represent trusted relationships.

In contrast, here’s what happens when we try to validate the A record for bad.oxford-university.edu which has a deliberately broken signature.

The red colour at the bottom of the diagram shows that the signature for bad.oxford-university.edu is bogus.

Digging around

We can use the dig utility on a host with a DNSSEC-enabled resolver to explore a little (some output lines have been omitted for clarity).

Lookup a known valid record

$ dig good.oxford-university.edu. a

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7141
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 0

;; ANSWER SECTION:
good.oxford-university.edu. 14400 IN    A       163.1.0.90

Lookup a known bogus record

$ dig bad.oxford-university.edu. a

;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 16837
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

Here we’ve received SERVFAIL which has prevented us from using a potentially compromised answer.

Lookup a known bogus record with checking disabled

We can look up the bogus record again, but this time setting the Checking Disabled (CD) bit in our query

$ dig +cd bad.oxford-university.edu. a

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27875
;; flags: qr rd ra cd; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 0

;; ANSWER SECTION:
bad.oxford-university.edu. 14400 IN     A       163.1.0.90

Now that we don’t care about validation, we get an answer returned.
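The pattern in the last two lookups (SERVFAIL normally, NOERROR with checking disabled) is a handy way to tell a validation failure apart from an ordinary resolution failure. Here is a minimal Python sketch of that comparison, assuming you have captured the dig output as text; the function names and result strings are my own.

```python
import re

STATUS_RE = re.compile(r"status:\s*(\w+)")


def status_of(dig_output):
    """Extract the response code (NOERROR, SERVFAIL, ...) from dig output."""
    match = STATUS_RE.search(dig_output)
    return match.group(1) if match else None


def classify(normal_output, cd_output):
    """Compare a normal lookup with one made using +cd."""
    normal, cd = status_of(normal_output), status_of(cd_output)
    if normal == "NOERROR":
        return "resolves (validated or unsigned)"
    if normal == "SERVFAIL" and cd == "NOERROR":
        return "validation failure (bogus)"
    return "other failure"


# Header lines copied from the bad.oxford-university.edu lookups above.
normal = ";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 16837"
with_cd = ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27875"
print(classify(normal, with_cd))
```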

Next steps

Key generation and roll-over is one of the key (sorry!) components of managing a signed zone. We may run our production DNS service on an appliance in the medium-term which would take care of the tedium of key management so it doesn’t make sense to invest time in developing a local solution at this stage.

The root zone was signed on 2010-07-15. The uk zone was signed on 2010-03-01. We need JANET(UK) to sign the ac.uk zone before there is a possibility of a chain of trust from the root to our ox.ac.uk zone. At the time of writing, JANET(UK) has not given any indication as to when it might get round to signing ac.uk.

Posted in DNS

Global IPv6 Day

On the 8th of June, for 24 hours, the major names that make up the web experience for a large proportion of users of the Internet will be enabling IPv6 on their services.

The announcement: http://isoc.org/wp/worldipv6day/

What does this mean?

Up until now there’s been an argument made by some network administrators that there’s no point deploying IPv6 as the home Internet Service Providers haven’t, and the ISPs might say there’s no point as a lot of websites aren’t IPv6 enabled, while the website owners are worried that 1/2000 of their visitors might have IPv6 issues and go to a competitor’s site instead. The network hardware vendors have a similar opinion and so you risk a monotonous stalemate, with the occasional voice of ‘have we run out of addresses yet?’.

This date means all of the above groups joining in, all having the same risks on the same date.

This is great as it means actual progress now, rather than when it’s a panic later. This means ISPs, website owners and even end users[1] taking notice.

[1] Perhaps ideally they shouldn’t know anything has happened but if they’re seeing the publicity and putting pressure on ISPs, vendors and websites then that’s fine.

What about Oxford?

  • With regard to www.ox.ac.uk, I’ve had no involvement with the running of it but I believe it’s maintained by a lot of teams from different parts of the university. I think by June it will be running on hardware from a non OUCS section of the university (I think currently it is NSMS, later it will be BSP), the backend is written by a contracted company and the political control of the website content is via a dedicated team at the Public Affairs Directorate. This makes it all slightly tricky but I’ll begin prodding the contacts involved tomorrow.
  • For smaller university websites hosted by OUCS or via NSMS the outlook is much better: the technical and political challenges are much smaller and we’d like to get as many sites on a AAAA for the date as possible. The systems development team in OUCS have already started deploying sites (such as this blog) with a AAAA.
  • As our first test unit the Maths Institute already has IPv6 connectivity and I’ll be trying to assist them to get their websites IPv6 enabled (if they need my help of course; they might not).
  • For units themselves: (If you aren’t from the university it may help to first explain that the networks team doesn’t supply networking to the end user. We supply networking to the ‘front door’ of each department/college/unit, and the unit has its own politically separate IT staff that maintain the unit.)
  1. For IPv6 connectivity look at the checklist then get in contact when ready. If in doubt you can phone me.
  2. You can start today – when someone asks how your IPv6 deployment preparation is going, don’t say that you can’t do anything because OUCS haven’t yet given you IPv6 connectivity. Do an audit of switch hardware, check your firewall’s IPv6 support, make a list of the services you run, plan how you will lay out your network (these tasks may take months whilst doing your normal duties, please start now).
  3. Please listen to the technical advice given and remain professional. 128-bit numbers are long and no one expects you to be perfect because humans make mistakes. We don’t mind mistakes and the move to IPv6 is tricky, but we’ll assist, and providing you don’t expect us to configure your hardware for you we’ll give advice when asked. As time allows we do go out of our way for approachable IT staff, but please don’t refuse to listen to the advice given.

What about the Networks Team?

You might remember from previous posts that our three main issues were/are:

  1. The firewall: It’s always dangerous to suggest dates in a blog but the IPv6 firewall should be replaced with something more sturdy in late February. The replacement should be quite straightforward and it should be transparent to most users (we’ll see how it goes but at worst IRC server users might notice a disconnection at some dark hour of the morning).
  2. The IPAM (DNS and DHCP management for units): We had a lot of discussions with the vendor late last year for our replacement system; publicly, I’m expecting it to be early May before I can state anything. In the meantime our existing system requires entries to be made to the forward and reverse zones by hand. This isn’t so bad for individual website entries so for the June 8th date it should be survivable.
  3. Security blocking: We’ve some code to re-write, I think we can have it done by June.
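On the point about forward and reverse entries being made by hand: the reverse (ip6.arpa) name for an IPv6 address is 32 reversed hex digits, which is tedious and error prone to type. As a small aid, here is a Python sketch using the standard ipaddress module to generate both lines. The hostname is hypothetical, the address is the test client from the failover post, and the output is only a generic zone-file style line, not the format of our existing system.

```python
import ipaddress


def ptr_name(address):
    """Return the fully qualified ip6.arpa (or in-addr.arpa) name."""
    return ipaddress.ip_address(address).reverse_pointer + "."


def aaaa_record(hostname, address):
    """Forward zone entry, with the address in its compressed form."""
    return f"{hostname} IN AAAA {ipaddress.ip_address(address).compressed}"


def ptr_record(address, hostname):
    """Reverse zone entry pointing back at the hostname."""
    return f"{ptr_name(address)} IN PTR {hostname}"


# Hypothetical hostname for illustration.
host = "client.example.org."
addr = "2001:630:440:400::2"
print(aaaa_record(host, addr))
print(ptr_record(addr, host))
```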

With the delay in the IPAM I’m thinking about possibly sacrificing some time to modify one of the shorter scripts that pushes out configurations on the existing DNS infrastructure. The current script can’t deal with both an IPv4 and an IPv6 address being pushed to the host’s DNS service configuration, although the hosts themselves (resolver and authoritative DNS) have working IPv6 connectivity. It might be that on the 8th June we can get the auth and resolver DNS systems to have IPv6 service addresses.

I’ll need to consult with my teammates, but it might be that with reasonably little pain we can get eduroam and/or the VPN network to have IPv6 client connectivity, since they are self contained networks that we administer the service for.

I should stop now and make no more promises, but I’m glad there’s a firm date and I’m looking forward to this.

Posted in IPv6

Early 2011 Work

So an overview of my own individual tasks for early 2011 looks like:

Replacing the DHCP servers for the university

This was scheduled for last year but the sequence of events needed to free up hardware for the service to move to has been more awkward than expected so instead we’re going to purchase two low end fault tolerant servers. Hopefully the order, delivery and base install will take place in January with testing at the end of the month. Actual deployment may take place either at the immediate start of a Tuesday standard Janet at risk period (e.g. 7am) or on a weekend, but we’ll decide closer to the time and make an announcement to the IT officers beforehand.

This is important because it’s already behind schedule and the hardware it’s replacing is out of warranty. Essentially if a DHCP server’s hardware failed now, although it would fail over to the other, we’d be redeploying the nearest development server rather quickly as the replacement. The new hardware will be in 5 year warranty, which should be well past when the system is replaced by either an integrated IPAM (DNS/DHCP) system or virtualised.

On the virtualisation note, and before there are any comments of ‘why don’t you put this service on a virtual host?’: I believe there’s a university virtualisation service in the works from other sections but I don’t know enough detail to talk about it. NSMS currently have a smaller service, but we’ll be keeping the DHCP changeover simple for now, as a large number of people would be affected if there were an issue with moving the service to an offering our own team isn’t familiar with. We do virtualise the majority of our development hosts but our own team doesn’t currently have a public service virtualised. We will in the future, probably as the warranty runs out on more minor services.

ASA IPv6 firewall

The second project in January is to set up and test the intended IPv6 firewall configuration on an ASA 5510 platform that’s currently available for testing here. The decision on purchasing isn’t until the end of the month; if it went ahead I’d expect deployment near the end of February.

This is important in order to replace the temporary IPv6 firewall we currently have. It also means we can get on with deploying websites in OUCS onto IPv6 (e.g. with an AAAA record) and (hopefully) websites in Maths. The Mathematical Institute has capable IT staff of its own, but I’m keen on seeing some things deployed before others so have offered to assist.

LMS

At the start of February I’d like to spend some time trialling Cisco LMS and, if this goes well, perhaps the Cisco Security Manager. Specifically, instead of developing our own in-house scripts to manage IPv6 network restrictions (via a Perl Expect module and similar), perhaps we might have better visibility and fewer maintenance issues with the Cisco tools.

We also have our own in house inventory and network monitoring systems, with various overlapping reporting – I’d like to check that we aren’t needlessly making our lives hard.

Aside from saving on maintenance and misunderstandings, an important aspect I’m interested in is problem visibility in a disaster. Specifically, if something that should never happen does, I’d like a magical arrow that points to the exact issue. From experience I think we currently have the information needed, but it takes some time to realise which place to dig it out from and what to compare it with; the integration and usability are low.

DNS warmspares

February should also see the deployment of two DNS warmspare hosts, to replace a host lost to hardware failure. These will be the old DHCP servers, since the hardware need not be in warranty. This will start as soon as the hardware is available and the new DHCP service has been running a couple of days.

Other

I’ve planned beyond this, however with an upcoming change in management it could well be that my priorities change. Well-laid plans are also vulnerable to some unrelated work with a high technical or political priority suddenly cropping up halfway through the timeframe and needing all other projects postponed.

We’ll also be continuing normal duties, so for 2 days a week I’m on the support queue for our team.

There’s also been progress on the new IPAM system over December, however I’m not keen on making promises with regards to this project. We’re hoping for a significant development from the vendor involved in April.

Posted in Uncategorized

BBC iPlayer and the University VPN

[edit] Since writing this, an iPlayer developer has passed on via informal channels that they’re using the Quova geolocation service. In this database part of our VPN address range was designated an ‘international proxy’. While this may or may not be regarded as true for restricted access VPN clients, I simply wanted a decision, and so I’ve contacted Quova stating that users in that range are told their network will behave as if in Oxford. They appear to accept this, so as of ~15th of January this issue may be fixed, unless there’s also an additional system the BBC use to override this.

Note that I’m not arguing whether the VPN range should or shouldn’t be able to use the iPlayer. I simply wanted a response from the BBC to my contact through their iPlayer support channel, and wanted to give the users a definitive answer about iPlayer access without having to vaguely reverse engineer the way iPlayer works. There are more important things on the backbone network for our team to be working on than iPlayer access.

The BBC appear to have blocked the university VPN address range from iPlayer: you will get a message stating that content is not available for your region, no matter where you are when connected to the VPN.

This was originally reported to our team in August 2010. We were asked to look at an issue with the BBC iPlayer service from central VPN service connections. The original user reports made quite a few claims but I’ll stick to what we were able to verify, since it seems there were some changes and maintenance at the iPlayer end affecting results at the time of the initial reports.

The initial reports suggested that all users of the VPN were affected, but we were unable to reproduce the issue. It then transpired that the requestor was a university member who was abroad and that it was only for them that there was an issue. Hence at this point our interest waned and it was pointed out to the user that BBC policy is not to provide content to overseas users. It’s not an especially good/sane (legal?) use of our resources to try and get around the BBC content restrictions, so we were not interested. If this had been the end of it, that would have been fine.

A few more queries followed, however, with the occasional suggestion by requestors that something ‘special’ about the VPN service we provide must be causing the problem. So I contacted the BBC to clarify what the BBC position was, and asked for clarification on the technicalities to ensure that what was seen was expected and not in fact the symptoms of a technical issue rather than the assumed intended restriction. As part of the technical information in this request we provided our VPN address range, for which access is restricted to university members. What we were hoping for was a BBC statement which we’d pass on to the users (I seem to recall that the BBC policy documentation at the time wasn’t quite as well explained as the current BBC link I’ve given in the paragraph above, but I could be wrong). In hindsight this was not such a good idea.

It now appears that although there was no email response, the VPN client address range we provided was added by the BBC to some form of iPlayer blacklist. Note that this is an educated guess based on the evidence, as we have had no response from the BBC, nor do we have visibility into the access control mechanism of the BBC iPlayer. The current state is that all users of the university VPN service, whether inside or outside the UK, are denied content by the iPlayer application with a message that content is not available in the person’s region.

The knock on effect is that this also stops access to iPlayer for university members using the OWL campus wireless service.

The OWL campus wireless service (which pre-dates WPA technology in consumer devices) uses an unauthenticated/unencrypted network that (to simplify) has destination access restricted to the university VPN service and hence clients make encrypted VPN connections across the unencrypted network to the VPN server in order to provide ‘normal’ and secure network access. The eduroam wireless service also offered in the university is based on WPA enterprise and so needs no VPN connection, leaving it unaffected.

  1. Hence if you’re on campus using the wireless services, connect to eduroam rather than OWL wherever possible (there are also other reasons to prefer the WPA based service but I don’t want to drift off topic). If your site only offers OWL, ask your local IT support if/when they are hoping to deploy eduroam – they will contact our team when they need assistance with doing this. The BBC have affected the service; it’s not something we have implemented, nor can we affect it, so complaints should be directed to the BBC (feel free to link to this post).
  2. If you are outside the UK using the VPN – the BBC policy is not to provide service to you, write to the BBC if this annoys you.
  3. If you are inside the UK using our VPN for internet connectivity but not on wireless (which is an odd situation), then you’ll need to find a different mechanism for internet connectivity that doesn’t use our VPN.

My only comment to the BBC would be that the restriction that was initially in place worked fine (users abroad couldn’t access iPlayer), but your new restriction is over the top.

Posted in General Maintenance, VPN, Wireless

AOL mail

Just a minor post about an issue some people might have seen (things are fairly quiet in the runup to Christmas).

If you had an issue delivering mail to or from an aol.com address today this post explains why. I don’t currently see anything on AOL’s postmaster blog with regards to the outage.

At approx 07:00 GMT today AOL appear to have removed the MX record for aol.com.

Here we look up their nameservers, the servers that hold all the DNS records for their domains:

$ dig NS aol.com +short
dns-02.ns.aol.com.
dns-01.ns.aol.com.
dns-06.ns.aol.com.
dns-07.ns.aol.com.

So (during the outage) let’s ask one of those DNS servers where the mailserver for the domain aol.com is – we’re querying their nameserver directly:

$ dig MX aol.com @dns-02.ns.aol.com.

; <<>> DiG 9.7.2-P3-RedHat-9.7.2-1.P3.fc13 <<>> MX aol.com @dns-02.ns.aol.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48542
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;aol.com. IN MX

;; AUTHORITY SECTION:
aol.com. 300 IN SOA dns-02.ns.aol.com. hostmaster.aol.net. 304268691 43200 60 1209600 300

;; Query time: 115 msec
;; SERVER: 205.188.157.232#53(205.188.157.232)
;; WHEN: Tue Dec 21 10:07:02 2010
;; MSG SIZE rcvd: 89

So in short, there’s nothing – nowhere to deliver mail and so the domain will not handle mail. This means mail to @aol.com addresses was returned as unroutable and mail from that domain was rejected by basic sender verification (e.g. does the domain you’re claiming to be from actually exist as a mail domain?).
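The reading of the dig output above can be sketched in Python. This is a minimal illustration and not the verification logic of any particular mail server: the function name `classify_mx_response` and its return strings are made up for this example, and it only looks at the `status:` and `ANSWER:` fields of dig's header text.

```python
import re


def classify_mx_response(dig_output: str) -> str:
    """Classify a dig MX reply from its header fields (illustrative only)."""
    status = re.search(r"status: (\w+)", dig_output)
    answers = re.search(r"ANSWER: (\d+)", dig_output)
    if status is None or answers is None:
        return "unparseable"
    if status.group(1) == "NXDOMAIN":
        return "domain does not exist"
    if status.group(1) != "NOERROR":
        return "lookup failed"
    if int(answers.group(1)) == 0:
        # The name exists, but the server returned no MX records for it.
        return "no MX record"
    return "mail domain"


# Header lines copied from the outage-time reply shown above.
outage_reply = """\
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48542
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
"""
print(classify_mx_response(outage_reply))
```

Note that the sketch treats "no MX record" as the end of the story, which matches the behaviour described here. Real mail software may also try the domain's address record when no MX exists, and that fallback is not modelled.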

It appears to have been fixed at 10:30 GMT; the mailservers are now listed:

$ dig MX aol.com @dns-02.ns.aol.com. +short
0 mailin-04.mx.aol.com.
0 mailin-01.mx.aol.com.
0 mailin-02.mx.aol.com.
0 mailin-03.mx.aol.com.
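For anyone wanting to script this kind of check, the `+short` output is easy to parse. The following is a small sketch under the assumption that each line is simply a preference number followed by a host name, as in the output above; the function name `parse_mx_short` is invented for the example.

```python
def parse_mx_short(output: str) -> list[tuple[int, str]]:
    """Parse `dig MX ... +short` output into sorted (preference, host) pairs."""
    records = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        preference, host = parts
        records.append((int(preference), host.rstrip(".").lower()))
    # Lower preference values are tried first; ties are ordered by name here.
    return sorted(records)


short_output = """\
0 mailin-04.mx.aol.com.
0 mailin-01.mx.aol.com.
0 mailin-02.mx.aol.com.
0 mailin-03.mx.aol.com.
"""
for preference, host in parse_mx_short(short_output):
    print(preference, host)
```

An empty result from this function corresponds to the outage situation: no MX records were returned.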

Posted in Mail Relay

Webcache IPv6 enabled

By now people are probably getting bored of hearing about the webcache so you’ll be glad to know that this should be the last post on the subject, the webcache having been successfully enabled for IPv6 this morning.

Note that the following is a “warts and all” description of the deployment. I was pressed for time and it could have been better, but this is not a best practice guide; it’s about what a similar system administrator might face, in case the experience helps others.

Pre Deployment

It didn’t go entirely to plan. I worked through constructing and testing a formal configuration checklist for the service. Yesterday I made an announcement to our university IT support staff that there would be service downtime for the host (there would at least be a reboot to apply the IPv6 connection tracking enabled kernel as discussed in the previous post here). Then a little later (cue various interruptions and help tickets with exclamation marks and ‘ASAP’ in them) I discovered that I had no method of applying the preferred_lft 0 settings, discussed here previously and required to make the IPv6 interfaces use the expected source addresses.

Under Debian it had been a simple case of adding a pre-up command that, when a sub interface was brought up, would apply the preferred_lft 0 setting, essentially telling it to prefer another interface for outgoing traffic but otherwise use the interface as normal. Under CentOS I couldn’t manually issue a command to alter it (as far as I can see, ‘ip add…’ rejected the preferred_lft option as junk and ‘ip change’ was not supported) and needed to update the iproute package. This was fairly painless (download latest source, butcher a copy of the previous package’s .spec file and then rpmbuild the package on our development host) but is yet another custom package needed. I’ll be glad to redeploy with CentOS 6 when it is released and so have a dedicated package maintainer rather than have extra work ourselves.
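The effect of preferred_lft 0 can be illustrated with a small Python model. This is a sketch only: the `Address` class, the `pick_source` function and the 2001:db8:: documentation addresses are all made up for illustration, and real source address selection in the kernel applies several more rules than this one. It shows just the rule relied on here: an address with a preferred lifetime of 0 is deprecated, so it still accepts traffic but is avoided as the source for new outgoing connections while a non-deprecated address is available.

```python
from dataclasses import dataclass


@dataclass
class Address:
    address: str
    preferred_lft: int  # seconds; 0 means the address is deprecated


def pick_source(candidates: list[Address]) -> Address:
    """Return the first non-deprecated address, else the first candidate."""
    if not candidates:
        raise ValueError("no candidate addresses")
    preferred = [a for a in candidates if a.preferred_lft > 0]
    return (preferred or candidates)[0]


# Hypothetical host: one deprecated service address and one main address.
addresses = [
    Address("2001:db8::80", preferred_lft=0),      # service address, deprecated
    Address("2001:db8::1", preferred_lft=604800),  # main address
]
print(pick_source(addresses).address)
```

In this model the service address is listed first but is skipped for outgoing traffic, which is the behaviour the preferred_lft 0 setting is meant to produce.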

As I’d already announced the service would be going down I either had to stay late or lose a little face and do the work the next week (delaying the webcache IPv6 enabling yet another week). It’s important to have a work/life balance but this was one occasion I decided to stay late.

Aspects of the (re)deployment for IPv6 were:

  • ip6tables configuration
  • Squid reconfiguration (ipv6 acls etc)
  • Apache reconfiguration (it serves a .pac file to some clients used to supply webcache information)
  • tests to check the service configuration after the change
  • installation of the new kernel, mcelog and iproute packages
  • interface configuration

Prior to this work the service didn’t have a formal checklist. I constructed one in our team’s documentation and wrote a script to conduct the tests in succession (currently only 6 tests for the main functionality, but I’ll add more).
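A script that runs a checklist of tests in succession can be very small. The following Python sketch is not the script used for the webcache; the `run_checks` function and the check names are hypothetical, and the placeholder lambdas stand in for real tests such as fetching a page through the proxy.

```python
from typing import Callable


def run_checks(checks: list[tuple[str, Callable[[], bool]]]) -> list[tuple[str, bool]]:
    """Run each named check in order and report PASS or FAIL for it."""
    results = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            # A check that raises is counted as a failure, not a crash.
            ok = False
        results.append((name, ok))
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
    return results


# Hypothetical checklist; each lambda is a placeholder for a real test.
checklist = [
    ("proxy answers over IPv4", lambda: True),
    ("proxy answers over IPv6", lambda: True),
    (".pac file is served", lambda: True),
]
results = run_checks(checklist)
print("all passed" if all(ok for _, ok in results) else "some checks failed")
```

Keeping every check as a named function returning True or False makes it straightforward to add more tests later or to feed the same checks into monitoring software.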

I was able to test the Squid and Apache configurations in advance on the hosts with static commands (e.g. squid -k parse -f /etc/squid/squid.conf) but (due to time) there is currently no identical webcache test host so there is room for improvement.

Deployment Day

The testing work paid off: the existing IPv4 service was down for about 2 minutes 30 seconds shortly after 7am, with a minor outage of about the same duration a little later. The full IPv6 service was up before 8am.

There were a few hiccups:

  • The workaround to apply preferred_lft 0 to IPv6 sub interfaces didn’t work; I’ve applied this manually for now and will make a ticket in our team’s RT ticket system.
  • Sometimes really simple issues slip through: due to an oversight the IPv6 firewall wasn’t set to apply on boot; I applied it and fixed the boot commands.
  • One of my squid.conf IPv6 acls was valid syntax but wrong for service operation.

The script for service testing was useful and speeded up testing greatly. I’ll aim to incorporate the tests into our service monitoring software.

End Result and Service Behaviour

The important results of this work are:

  • The service is now reachable via IPv6
  • It’s now possible to use wwwcache.ox.ac.uk from a University of Oxford host to visit an IPv6 only website even if your host is IPv4 only.
  • The opposite is also true. If for some reason a host is IPv6 only, the webcache can be used to visit IPv4 only websites.

Someone queried how the service would behave if the destination is available via both IPv4 and IPv6 (or more accurately has both A and AAAA DNS records). The answer is that IPv6 will be attempted first. This is typically the default behaviour for modern operating systems, and while it’s possible to alter this we will be leaving it as expected.

Related to this, if you have a good memory, I stated in a previous post that we wouldn’t add an AAAA record for a service but would make the service use a slightly different name, e.g. ntp6.oucs.ox.ac.uk instead of ntp.oucs.ox.ac.uk for the stratum 3 IPv6 provision. We also wanted to add IPv6 cautiously, enabling a service for IPv6 but making IPv4 the default where possible. For this service we’ve seemingly done an about-face and added AAAA records for the same IPv4 service name and associated interfaces. My reasoning is subjective but based on the following:

  • If a user has an issue contacting wwwcache.ox.ac.uk we’re likely to get a support ticket complaining and so be aware of the issue quickly. Compare this with a user’s computer not accessing an NTP service correctly, in which case (in my experience) the machine’s clock silently drifts out over the course of weeks or months, and when this is noticed the user wrongly assumes that the entire university NTP service must be out by the same amount of time as their local clock.
  • I don’t want to have end users as my experiment subjects as such, however wwwcache.ox.ac.uk is a less used system than – for example – the main university mail relay. Hence it’s a more suitable place to use an AAAA record for the first time on a main service.
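The "IPv6 first" ordering can be pictured with a few lines of Python. This is a deliberately simplified sketch: the function name `order_destinations` and the documentation addresses are assumptions for illustration, and the real ordering is decided by the resolver and address selection rules of the software involved, not by code like this.

```python
import ipaddress


def order_destinations(addresses: list[str]) -> list[str]:
    """Put IPv6 addresses ahead of IPv4 ones, keeping the original order otherwise."""
    parsed = [ipaddress.ip_address(a) for a in addresses]
    # sorted() is stable, so addresses of the same family keep their order.
    return [str(a) for a in sorted(parsed, key=lambda a: a.version != 6)]


# Hypothetical destination with one A record and one AAAA record.
print(order_destinations(["192.0.2.10", "2001:db8::10"]))
```

A client following this order tries the IPv6 address first and only moves on to the IPv4 address if that attempt fails.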
  • We’re getting a bit more confident with the IPv6 deployment and as a result changing some previous opinions.

Remember that formal tests were run on the service; the end users are not being used as the test. However, I am aware of Google stating that 1 in 1000 IPv6 users had misconfigured connectivity, so I’m still keeping an eye out for odd reports that might be related to odd behaviour in a certain device or in a given network situation.

The service is lightly used and so, due to a funding decision (remembering the current UK academic budget cuts), the service is not fault tolerant. That is, the host is in warranty, has dual power supplies and RAID (and is powerful) but there is only one host.

Performance, Ethics and Privacy

In terms of performance the host has 8GB of RAM. Before the work a quick check revealed 7.5GB was in use (caching via squid and the operating system itself), so the service is making good use of the hardware. The CPU is the low energy version, which is powerful enough for the task, and the disks are RAID1 (no, RAID5 would not be a good idea with squid). I believe I’ve covered what we would have purchased if there had been more budget in a previous post so I won’t dwell on it further.

In terms of ethics, the networks team and security team have access to the logs, which are preserved for 90 days, but (without getting into an entire post on the subject) within the same ethical and conduct rules as for the mail relay logs. Specifically, if a person were to request logs for disciplinary procedures (or to see how hard coworker X is working) they are directed to the University Proctors’ office, who would scrutinise the request. Most frivolous requesters, including (on occasion) loud, angry and forceful managers demanding access or metadata about an account, give up at this point. I can’t speak for the security team but in 3 years I’ve only had 2 queries from the Proctors’ office relating to the mail logs, and in both cases this was with regard to an external [ab]user sending unwanted mail into the university. This whole subject area might deserve a page on the main OUCS site, but in short I use the webcache myself and consider it private. We supply logs to the user for connectivity issues and have to be careful of fuzzy areas when troubleshooting with unit IT support staff on behalf of a user, but personal dips into the logs are gross misconduct. We process the logs for tasks relating to the service, for example we might make summaries from the logs (“we had X users per day for this service in February”) or process them for troubleshooting (“the summary shows one host has made 38 million queries in one day, all the other hosts are less than 10k queries, I suspect something is stuck in a loop.”).

There are no pornography, censorship or similar filters on the webcache; people do research in areas that cover this as part of the university, and frankly there’s nothing to be gained from filtering it. If there is a social problem in a unit with an employee viewing pornography (and so generating a hostile working environment for other employees) then it is best dealt with via the local personnel/HR/management as a social/disciplinary issue, not a technical one – no block put in place on the webcache will cure the employee of inappropriate behaviour. On a distantly related note we haven’t been asked to implement the IWF filter list by JANET and I have strong opinions on the uselessness of the IWF filter list. I don’t think I’m giving away any of the security team’s secrets if I reveal they have fewer than 10 regular expression blocks in place on the webcache, which target specific virus executables and appear to have been added almost a decade ago – these won’t interfere with normal browsing (unless you really need to run a visual basic script called “AnnaKournikova.jpg.vbs” – remember that?). There are no other filters. Network access to the webcache is restricted to university IP address ranges.
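The troubleshooting summary mentioned above (one host making vastly more queries than the rest) is the kind of thing a short script can produce. This Python sketch is an illustration and not our actual tooling: the function names, the `factor` threshold and the sample log lines are all invented, and it assumes Squid's native access.log layout, in which the client address is the third whitespace-separated field.

```python
from collections import Counter


def requests_per_client(log_lines: list[str]) -> Counter:
    """Count requests per client address from native-format access.log lines."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        counts[fields[2]] += 1
    return counts


def suspicious_clients(counts: Counter, factor: int = 100) -> list[str]:
    """Clients whose request count is at least `factor` times the median count."""
    if not counts:
        return []
    ordered = sorted(counts.values())
    median = ordered[len(ordered) // 2]
    return [client for client, n in counts.items() if n >= factor * median]


# Made-up sample: two ordinary clients and one stuck in a loop.
sample = (
    ["1292925600.000 45 192.0.2.10 TCP_MISS/200 1024 GET http://example.org/ - DIRECT/198.51.100.7 text/html"]
    + ["1292925601.000 40 192.0.2.11 TCP_HIT/200 512 GET http://example.org/a - NONE/- text/html"]
    + ["1292925602.000 1 192.0.2.99 TCP_MISS/200 256 GET http://example.org/loop - DIRECT/198.51.100.7 text/html"] * 500
)
counts = requests_per_client(sample)
print(suspicious_clients(counts))
```

The summary only needs per-client totals, so a script like this never has to look at which URLs an individual visited.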
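To give a feel for what a regular expression block of this sort does, here is a minimal Python sketch. The pattern below is hypothetical and written only from the file name mentioned above; the actual expressions on the webcache are not reproduced here, and on the real service the matching is done by Squid's own ACLs, not by Python.

```python
import re

# Hypothetical block list: one pattern per known-bad file name.
BLOCKED_PATTERNS = [
    re.compile(r"AnnaKournikova\.jpg\.vbs$", re.IGNORECASE),
]


def is_blocked(url: str) -> bool:
    """Return True if the URL matches any of the blocked patterns."""
    return any(pattern.search(url) for pattern in BLOCKED_PATTERNS)


print(is_blocked("http://example.com/files/AnnaKournikova.jpg.vbs"))
print(is_blocked("http://example.com/photos/AnnaKournikova.jpg"))
```

Because the pattern is anchored to a very specific file name, ordinary browsing, including an image with a similar name, is not affected.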

This is an old service with some of the retained configuration referring to issues raised 10 years ago. I believe all the above is correct but if you think I’ve made an error please raise it with me (either by email to our group or in the comments here) and assume the error is the result of inheriting a service that’s over a decade old and not deliberate.

Posted in IPv6