AOL mail

Just a minor post about an issue some people might have seen (things are fairly quiet in the runup to Christmas).

If you had an issue delivering mail to or from an aol.com address today this post explains why. I don’t currently see anything on AOL’s postmaster blog with regards to the outage.

At approx 07:00 GMT today AOL appear to have removed the MX record for aol.com.

Here we look up their nameservers – the servers that hold all the DNS records for their domains:

$ dig NS aol.com +short
dns-02.ns.aol.com.
dns-01.ns.aol.com.
dns-06.ns.aol.com.
dns-07.ns.aol.com.

So (during the outage) let’s ask one of those DNS servers where the mailserver for the domain aol.com is – we’re querying their nameserver directly:

$ dig MX aol.com @dns-02.ns.aol.com.

; <<>> DiG 9.7.2-P3-RedHat-9.7.2-1.P3.fc13 <<>> MX aol.com @dns-02.ns.aol.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48542
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;aol.com. IN MX

;; AUTHORITY SECTION:
aol.com. 300 IN SOA dns-02.ns.aol.com. hostmaster.aol.net. 304268691 43200 60 1209600 300

;; Query time: 115 msec
;; SERVER: 205.188.157.232#53(205.188.157.232)
;; WHEN: Tue Dec 21 10:07:02 2010
;; MSG SIZE rcvd: 89

So in short, there’s nothing – nowhere to deliver mail and so the domain will not handle mail. This means mail to @aol.com addresses was returned as unroutable and mail from that domain was rejected by basic sender verification (e.g. does the domain you’re claiming to be from actually exist as a mail domain?).
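The response above is worth reading closely: the status is NOERROR (the domain exists) but the answer count is zero, which is different from the domain not existing at all. A minimal sketch in Python of telling those cases apart from saved dig output; the function name and the category labels are my own, invented for illustration, and not part of dig or any mail software:

```python
import re

def classify_dig_response(dig_output: str) -> str:
    """Classify saved `dig` text output by header status and answer count."""
    status = re.search(r"status: (\w+)", dig_output)
    answers = re.search(r"ANSWER: (\d+)", dig_output)
    if status is None or answers is None:
        return "unparseable"
    if status.group(1) == "NXDOMAIN":
        return "no such domain"
    if status.group(1) == "NOERROR" and int(answers.group(1)) == 0:
        # The name exists but has no record of the requested type.
        return "no records of this type"
    if status.group(1) == "NOERROR":
        return "records returned"
    return "error: " + status.group(1)

# Header lines copied from the outage output above.
outage = """;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48542
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0"""
print(classify_dig_response(outage))
```

Fed the header from the outage, this reports that the name exists but carries no MX records, which is the situation described above.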

It appears to have been fixed at 10:30 GMT; the mailservers are now listed:

$ dig MX aol.com @dns-02.ns.aol.com. +short
0 mailin-04.mx.aol.com.
0 mailin-01.mx.aol.com.
0 mailin-02.mx.aol.com.
0 mailin-03.mx.aol.com.


Webcache IPv6 enabled

By now people are probably getting bored of hearing about the webcache so you’ll be glad to know that this should be the last post on the subject, the webcache having been successfully enabled for IPv6 this morning.

Note that the following is a “warts and all” description of the deployment. I was pressed for time and it could have been better, but this is not a best practice guide; it’s an account of what a system administrator in a similar position might face, in case the experience helps others.

Pre Deployment

It didn’t go entirely to plan. I worked through constructing and testing a formal configuration checklist for the service. Yesterday I made an announcement to our university IT support staff that there would be service downtime for the host (there would at least be a reboot to apply the IPv6 connection tracking enabled kernel, as discussed in the previous post here). A little later (cue various interruptions and help tickets with exclamation marks and ‘ASAP’ in them) I discovered that I had no method of applying the preferred_lft 0 settings, discussed here previously, that are required to make the IPv6 interfaces use the expected source addresses.

Under Debian it had been a simple case of adding a pre-up command that, when a sub interface was brought up, would apply the preferred_lft 0 setting, essentially telling it to prefer another interface for outgoing traffic but otherwise use the interface as normal. Under Centos I couldn’t manually issue a command to alter it (as far as I can see: ‘ip add…’ rejected the preferred_lft option as junk and ‘ip change’ was not supported) and needed to update the iproute package. This was fairly painless (download the latest source, butcher a copy of the previous package’s .spec file and then rpmbuild the package on our development host) but is yet another custom package needed – I’ll be glad to redeploy with Centos 6 when it is released and so have a dedicated package maintainer rather than extra work for ourselves.
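For anyone wondering why preferred_lft 0 has this effect: an address with a preferred lifetime of zero is treated as deprecated, so it still accepts incoming traffic but is avoided as the source of new outgoing connections when a non-deprecated address is available. A deliberately simplified Python sketch of that single rule follows. The real selection process has several more rules, the `Address` type is invented for illustration, and the addresses come from the 2001:db8::/32 documentation prefix:

```python
from dataclasses import dataclass

@dataclass
class Address:
    ip: str
    preferred_lft: int  # seconds; 0 means the address is deprecated

def pick_source(candidates: list[Address]) -> Address:
    """Prefer a non-deprecated address; fall back to any if none exists."""
    preferred = [a for a in candidates if a.preferred_lft > 0]
    return (preferred or candidates)[0]

addresses = [
    Address("2001:db8::80", preferred_lft=0),      # service address, deprecated
    Address("2001:db8::10", preferred_lft=86400),  # management address
]
print(pick_source(addresses).ip)  # 2001:db8::10
```

The service address stays usable for inbound connections; it simply loses the contest when the host picks a source for its own traffic.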

As I’d already announced the service would be going down I either had to stay late or lose a little face and do the work the next week (delaying the webcache IPv6 enabling yet another week). It’s important to have a work/life balance but this was one occasion I decided to stay late.

Aspects of the (re)deployment for IPv6 were

  • ip6tables configuration
  • Squid reconfiguration (ipv6 acls etc)
  • Apache reconfiguration (it serves a .pac file to some clients used to supply webcache information)
  • making tests to check the service configuration after change
  • install the new kernel, mcelog and iproute packages
  • interface configuration

Prior to this work the service didn’t have a formal checklist. I constructed one in our team’s documentation and wrote a script to conduct the tests in succession (currently only 6 tests for the main functionality, but I’ll add more).
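A test script of this kind doesn’t need to be clever. The following is not our actual script, just a minimal Python sketch of the idea: run named checks in order, print a result for each and collect the failures. The check names and the placeholder lambdas are invented; real checks would fetch a page through the cache, request the .pac file, connect over IPv6 and so on.

```python
from typing import Callable

def run_checks(checks: list[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run each named check in order and return the names that failed."""
    failed = []
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False  # a check that crashes counts as a failure
        print("PASS" if ok else "FAIL", name)
        if not ok:
            failed.append(name)
    return failed

# Placeholder checks standing in for real service tests.
checks = [
    ("proxy answers over IPv4", lambda: True),
    ("proxy answers over IPv6", lambda: True),
    ("pac file is served", lambda: False),
]
failed = run_checks(checks)
```

Returning the list of failures rather than stopping at the first one means a single run shows everything that is broken after a change.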

I was able to test the Squid and Apache configurations in advance on the hosts with static commands (e.g. squid -k parse -f /etc/squid/squid.conf) but (due to time) there is currently no identical webcache test host so there is room for improvement.

Deployment Day

The testing work paid off: the existing IPv4 service was down for about 2 minutes 30 seconds shortly after 7am, with a minor outage of about the same duration a little later. The full IPv6 service was up before 8am.

There were a few hiccups

  • The workaround to apply preferred_lft 0 to IPv6 sub interfaces didn’t work. I’ve applied this manually for now and will make a ticket in our team’s RT ticket system.
  • Sometimes really simple issues slip through: Due to oversight the IPv6 firewall wasn’t set to apply on boot, I applied it and fixed the boot commands.
  • One of my squid.conf IPv6 acls was valid syntax but wrong for service operation

The script for service testing was useful and speeded up testing greatly. I’ll aim to incorporate the tests into our service monitoring software.

End Result and Service Behaviour

The important results of this work are:

  • The service is now reachable via IPv6
  • It’s now possible to use wwwcache.ox.ac.uk from a University of Oxford host to visit an IPv6 only website even if your host is IPv4 only.
  • The opposite is also true. If for some reason a host is IPv6 only, the webcache can be used to visit IPv4 only websites.

Someone queried how the service would behave if the destination is available via both IPv4 and IPv6 (or more accurately has both A and AAAA DNS records). The answer is that IPv6 will be attempted first. This is typically the default behaviour for modern operating systems and while it’s possible to alter this we will be leaving it as expected.
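As a rough illustration of “IPv6 first”, here is a short Python sketch that orders a set of candidate destination addresses so the IPv6 ones are tried before the IPv4 ones. It is a simplification: a real operating system applies a fuller address selection policy, and this is not how Squid itself is implemented. The addresses are documentation examples, not real hosts.

```python
import ipaddress

def order_destinations(addresses: list[str]) -> list[str]:
    """Put IPv6 addresses ahead of IPv4 ones, keeping the order within each family."""
    parsed = [ipaddress.ip_address(a) for a in addresses]
    ordered = sorted(parsed, key=lambda a: a.version != 6)  # sorted() is stable
    return [str(a) for a in ordered]

print(order_destinations(["192.0.2.10", "2001:db8::10"]))
```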

Related to this, if you have a good memory you’ll recall I stated in a previous post that we wouldn’t add an AAAA for a service but would make the service use a slightly different name, e.g. ntp6.oucs.ox.ac.uk instead of ntp.oucs.ox.ac.uk for the stratum 3 IPv6 provision. We also wanted to add IPv6 cautiously, enabling a service for IPv6 but making IPv4 the default where possible. For this service we’ve seemingly done an about-face and added AAAA records for the same IPv4 service name and associated interfaces. My reasoning is subjective but based on the following:

  • If a user has an issue contacting wwwcache.ox.ac.uk we’re likely to get a support ticket complaining and so be aware of the issue quickly. Compare that to a user’s computer not accessing an NTP service correctly, in which case (in my experience) the machine’s clock silently drifts out over the course of weeks or months and then, when noticed, it is wrongly assumed by the user that the entire university NTP service must be out by the same amount of time as their local clock.
  • I don’t want to have end users as my experiment subjects as such, however wwwcache.ox.ac.uk is a less used system than – for example – the main university mail relay. Hence it’s a more suitable place to use an AAAA for the first time on a main service.
  • We’re getting a bit more confident with the IPv6 deployment and as a result changing some previous opinions.

Remember that formal tests were run on the service; the end users are not being used as the test. However, I am aware of Google stating that 1 in 1000 IPv6 users had misconfigured connectivity, so I’m still keeping an eye out for odd reports that might be related to odd behaviour in a certain device or in a given network situation.

The service is lightly used and as such due to a funding decision (remembering the current UK academic budget cuts) the service is not fault tolerant. That is, the host is in warranty, has dual power supplies and RAID (and is powerful) but there is only one host.

Performance, Ethics and Privacy

In terms of performance the host has 8GB of RAM, before the work a quick check revealed 7.5GB was in use (caching via squid and the operating system itself) so the service is making good use of the hardware. The CPU is the low energy version which is powerful enough for the task and the disks are RAID1 (no, RAID5 would not be a good idea with squid). I believe I’ve covered what we would have purchased if there had been more budget in a previous post so I won’t dwell on it further.

In terms of ethics, the networks team and security team have access to the logs which are preserved for 90 days but (without getting into an entire post on the subject) within the same ethical and conduct rules as for the mail relay logs. Specifically if a person was to request logs for disciplinary procedures (or to see how hard coworker X is working) they are directed to the University Proctors office who would scrutinise the request. Most frivolous requesters including (on occasion) loud, angry and forceful managers demanding access or meta data about an account give up at this point. I can’t speak for the security team but in 3 years I’ve only had 2 queries from the proctors office relating to the mail logs and in both cases this was with regard to an external [ab]user sending unwanted mail into the university. This whole subject area might deserve a page on the main OUCS site, but in short I use the webcache myself and consider it private. We supply logs to the user for connectivity issues and have to be careful of fuzzy areas when troubleshooting with unit IT support staff on behalf of a user but personal dips into the logs are gross misconduct. We process the logs for tasks relating to the service, for example we might make summaries from the logs (“we had X users per day for this service in February”)  or process them for troubleshooting (“the summary shows one host has made 38 million queries in one day, all the other hosts are less than 10k queries, I suspect something is stuck in a loop.”).

There’s no pornography, censorship or similar filters on the webcache; people do research in areas that cover this as part of the university, and frankly there’s nothing to be gained from filtering it. If there is a social problem in a unit with an employee viewing pornography (and so generating a hostile working environment for other employees) then it is best dealt with via the local personnel/HR/management as a social/disciplinary issue, not a technical one – no block put in place on the webcache will cure the employee of inappropriate behaviour. On a distantly related note we haven’t been asked to implement the IWF filter list by JANET and I have strong opinions on the uselessness of the IWF filter list. I don’t think I’m giving away any of the security team’s secrets if I reveal they have fewer than 10 regular expression blocks in place on the webcache, which target specific virus executables and appear to have been added almost a decade ago – these won’t interfere with normal browsing (unless you really need to run a visual basic script called “AnnaKournikova.jpg.vbs” – remember that?). There are no other filters. Network access to the webcache is restricted to university IP address ranges.

This is an old service with some of the retained configuration referring to issues raised 10 years ago. I believe all the above is correct but if you think I’ve made an error please raise it with me (either by email to our group or in the comments here) and assume the error is the result of inheriting a service that’s over a decade old and not deliberate.


Kernel for the webcache

Mathematical Institute

The initial switched /64 connection to the Mathematical Institute was switched to a routed firewall connection this week and as a result I’ve routed the entire /56 that was set aside for them. I think we’ll have to require units to have a routing firewall for the /56 and limit non routed units to a maximum of 3 /64s until they have a firewall in place, otherwise we’ll have too much configuration to do for so many individual /64s to be routed on the backbone. With the work complete my contact in the Institute has gone on holiday for a fortnight, which is perfect as I’ve some catching up to do in order to improve their service. The Institute is the first unit in the university to receive a /56 IPv6 connection.
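To put the sizes in perspective, a /56 holds 256 separate /64 networks, which is why routing a unit’s /56 once to their firewall is so much less work than routing /64s individually on the backbone. A quick check with Python’s ipaddress module, using a prefix from the 2001:db8::/32 documentation range rather than any real allocation:

```python
import ipaddress

# Documentation prefix standing in for a unit's allocation.
unit = ipaddress.ip_network("2001:db8:0:100::/56")
subnets = list(unit.subnets(new_prefix=64))

print(len(subnets))             # number of /64s inside the /56
print(subnets[0], subnets[-1])  # first and last /64
```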

OUCS offices

We deployed IPv6 without router advertisements, setup a single host that was used for normal purposes (generating network traffic) and then completed a security response to a pseudo report of the host being compromised. The test highlighted some behavioural differences between our response tools for IPv4 and those for IPv6 which caused minor confusion (the differences were already documented but we could improve the situation). We also completed our migration of Netdisco to the latest CVS source which means IPv6 tracking to a local switchport is working without issue, that is, we can track a compromised host to an office wallport inside our department building. Outside OUCS we can only track a host to the final connection to the college/department, then the tracking must be internal and is handled by the independent IT staff at the unit – Netdisco is a perfect solution for this if you’re not using it already.

I’m going to talk to the security team again and then should be able to give out some addresses to internal teams in order to encourage some interest in IPv6 enabling other services.

Webcache/Centos Kernel

In order to add IPv6 connection tracking support I’ve built a new kernel for our Centos hosts, using the latest stable version at www.kernel.org. I believe I mentioned in a previous post that IPv6 connection tracking wasn’t present on the stock 2.6.18 kernel (any kernel prior to 2.6.20 I believe) and I’m somewhat reluctant to use more complex rules or wait for RHEL/Centos 6 so a new kernel seems a reasonable solution for us as long as it’s only in place until Centos 6 is available.

The main changes I made over the stock kernel config are

  • General setup ->
    • “enable deprecated sysfs features to support old userspace tools” -> I understand that you must enable this in order for the new kernel to boot on the current Centos
  • Networking Support ->
    • Network Options
      • enable “the ipv6 protocol” and enable “source address based routing” in the submenu (I would make all the others in the submenu available at least as modules)
      • enable “Network packet filtering framework (netfilter)”
        • “IPv6 Netfilter configuration”
          • enable “IPv6 connection tracking support”
          • enable “IP6 tables support”
          • I specifically selected “packet filtering” “reject” and “log”, and made all others available as modules
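After building, it’s worth confirming the options actually ended up in the resulting .config. A minimal Python sketch follows; note that the CONFIG_ symbol names are my own mapping of the menu entries above for kernels of that era, so treat them as an assumption and check them against your own kernel source. The sample text is invented.

```python
# Assumed symbol names for the IPv6, connection tracking and ip6tables entries.
REQUIRED = ["CONFIG_IPV6", "CONFIG_NF_CONNTRACK_IPV6", "CONFIG_IP6_NF_IPTABLES"]

def missing_options(config_text: str, required=REQUIRED) -> list[str]:
    """Return required options that are neither built in (=y) nor modular (=m)."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # comments include the '# CONFIG_X is not set' lines
        name, value = line.split("=", 1)
        if value in ("y", "m"):
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]

sample = """CONFIG_IPV6=y
# CONFIG_NF_CONNTRACK_IPV6 is not set
CONFIG_IP6_NF_IPTABLES=m
"""
print(missing_options(sample))
```

To use it on a real build you would read the .config file from the kernel build directory and pass its contents in.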

There’s quite a few other changes made (if you’re going to the trouble of packaging your own kernel you might as well tailor it to your needs) but they aren’t relevant to the IPv6 connection tracking.

I didn’t quite adhere to the official centos kernel rebuilding guide as I found it rather awkward to follow. From a documentation and testing point of view I think it’s always a good idea to test your instructions on a clean machine and a person that knows little about the subject area (loved ones or a coworker from another section that happens to be passing through…) in case you’re making subconscious omissions of steps that are obvious to yourself. You may need to bribe the person for their time if you find your subject area is utterly dull to the majority of the population.

The other oddity is that the newer kernel uses more bits for the storage of MCE logs, so the Centos mcelog package needs updating, otherwise it will complain each hour. I used the spec file from the existing .src.rpm and the more recent source to build a new package. Notice that stock 64-bit hosts will have mcelog but not 32-bit hosts. I believe the new kernel requires mcelog for both architectures.

I’ve tested it on our development host, I’ll be making formal plans to deploy on the university webcache on Tuesday which will make the service dual stacked. Testing the service in advance is problematic so I’ll need to formalise and scrutinise the deployment steps.

IPv6 University Firewall

We want to replace our current development IPv6 firewall with a production service. Of interest to us with regard to this is that Cisco released the 5585-X ASAs this last week. These run the same software and commands as the ASA 5510 and upwards (the 5505 is different in some aspects from the rest of the range I believe) but outperform our FWSM modules for IPv6. So we could deploy a couple of ASA 5500 series devices, develop any in-house scripts/software needed to work with them and then move to the 5585-X when the main university IPv4 firewall service is due to be replaced in 2 years’ time.

My concern would be that we deploy a lower capacity ASA, enable a service such as the HFS backup service for IPv6, then watch as hordes of workstations without a native IPv6 connection use the JANET 6to4 service, sending IPv4 traffic out onto JANET and resulting in high IPv6 traffic in through the new firewall, so that an internet connection that was barely used is swamped overnight and users complain. I think as long as we’re careful and monitor the volumes of traffic we should be able to predict and respond without issue, but it’s one reason why better performance monitoring tools for the firewall might be of interest.
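One thing that helps with the monitoring is that 6to4 clients are easy to recognise: their IPv6 addresses sit under 2002::/16, with the host’s public IPv4 address embedded in the next 32 bits. A small Python sketch of both directions (deriving the prefix, and spotting a 6to4 source); the function names are my own and 192.0.2.1 is a documentation address, not one of ours:

```python
import ipaddress

SIX_TO_FOUR = ipaddress.ip_network("2002::/16")

def six_to_four_prefix(ipv4: str) -> ipaddress.IPv6Network:
    """Return the 2002:V4ADDR::/48 prefix that 6to4 derives from an IPv4 address."""
    v4 = int(ipaddress.IPv4Address(ipv4))
    return ipaddress.IPv6Network(((0x2002 << 112) | (v4 << 80), 48))

def is_six_to_four(address: str) -> bool:
    """True if an IPv6 source address belongs to a 6to4 client."""
    return ipaddress.ip_address(address) in SIX_TO_FOUR

prefix = six_to_four_prefix("192.0.2.1")
print(prefix)
print(is_six_to_four("2002:c000:201::1"), is_six_to_four("2001:db8::1"))
```

Counting firewall log entries whose source passes `is_six_to_four` would give a rough measure of how much inbound traffic is arriving via the relay rather than natively.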

I’ve had an initial discussion with Cisco about the above and have spent some time in the past week reading through the ASA manuals and listening to the Cisco product podcasts (whilst doing other tasks). I don’t use iTunes much but I believe from memory it was simple to select the podcasts in the store section and search for ‘Cisco’, subscribe to the channels shown and then ‘show previous episodes’ for each channel and ‘get all’.

Next week

Next week I’ll concentrate on the Webcache IPv6 deployment, which will be early Tuesday during the JANET at risk period (e.g. 7am).

You may be aware that we’ve this week lost a prized team member (Oliver Gorwits) to another employer, have also recently re-employed for the wireless position and are seeing the retirement of our senior manager, so I don’t want to make too many other predictions about next week – I enjoy working on the IPv6 deployment issues but we have many other projects and support calls.

If I’m lucky I might have a deployment plan for an ASA solution to the development firewall at the end of next week.


IPv6 to client networks, the raw first attempt

All New Authoritative and Resolver DNS servers

The last DNS resolver and all of the DNS authoritative servers have now been migrated to new hardware, the resolver having been migrated at approx 6:10am this morning. There’s other work I’d like to do on the DNS, such as internally documenting any differences between our own configuration and the Team Cymru Secure Bind template, or adjusting our configuration to match. At first glance this would perhaps be specifically with regard to some of the more minor types of logging (lame servers etc), but this is a minor secondary project that will have to wait (I keep a list of these for when a miracle occurs and I have spare time).

The question of how to create IPv6 DNS records for clients was a concern, but interestingly a draft RFC was published in September that covers exactly this topic. There’s also work to be done with regard to IPv6 enabling the auth servers and resolvers; however, I’ll talk about something a bit more exciting this week, namely our first IPv6 user networks for testing.
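To give a feel for why client records are a concern, this is what the reverse (PTR) name for a single IPv6 address looks like: one label per hex digit, 32 of them, under ip6.arpa. The Python below only shows the shape of the name using a documentation address; it says nothing about what the draft recommends.

```python
import ipaddress

client = ipaddress.ip_address("2001:db8::1")  # documentation address
name = client.reverse_pointer
print(name)
print(len(name.split(".")) - 2, "hex-digit labels before ip6.arpa")
```

With a /64 per network, nobody is going to pre-populate a zone with every possible client name the way you might for an IPv4 /24.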

IPv6 on the OUCS Offices Network

The political process for enabling IPv6 on the OUCS Offices network is coming to an end. The ICTST team provide the unit level IT support for OUCS, and I’ve been querying approval for a limited IPv6 deployment on the OUCS offices network. My hope is that enabling this early on will encourage IPv6 testing among some of the core services teams – I’ve already had queries from NSMS and the Systems Development team. It sounds as if this will be approved so I’ve also been checking with the security team to ensure they’re happy with the deployment going ahead since on the local network we have to be able to track misuses down to a host/person just as any other unit would.

I’m not sure full blown IPv6 on the network would cause much issue in terms of application/service support since at least some of the Windows workstations appear to be already using the JANET 6to4 service to connect to IPv6 enabled services without either our assistance or the users’ knowledge. While there are firewalling considerations to be careful of, this connectivity is something I’m not going to complain about since it’s quite useful (in terms of testing/verifying) to have IPv6 based connections occurring from our users to IPv6 enabled services.

The plan is that if ICTST give approval we will first enable IPv6 for only one machine (the network will not have router advertisements on it initially, so it will be statically configured), we’ll then generate some standard harmless network traffic from it. At this point we stage a security response when the machine is said to have been compromised and the networks and security teams check our normal logging and security tools to ensure we can track the host down and take actions to block it. This should uncover any quirks or deficiencies in our logging or toolsets. To make the test valid the machine will be a single stack (IPv6 only) since it’s the IPv6 logging and tools we need to check.

In preparation we’ve already upgraded Netdisco to the latest CVS version which can handle IPv6. IPv6 name resolution doesn’t appear to work in the latest CVS but this doesn’t hinder us at all and looks fairly simple to fix. If I get a chance (another task for the list) I will investigate and may submit a minor patch for consideration.

IPv6 on the Mathematical Institute Network

I believe it’s now been two years since I approached the Mathematical Institute and asked if they’d be willing to take part in an IPv6 trial. They’ve been far too polite to point out the passage of time (which I originally optimistically stated would be months) but today we’ve supplied them with an actual native IPv6 connection for testing.

This is quite early for us in terms of the services we provide being IPv6 capable so there’s a fair few limitations, as a result this is clearly not a production deployment. However I wanted to get at least one interested customer on the IPv6 service as soon as possible so that they can start bringing up issues that might be obvious to themselves that we haven’t thought of – for instance due to local contact with common applications that might have odd quirks.

So the provided connection in brief:

  • Initially it’s just a /64 out of the eventual /56; it’s suitable for familiarisation with IPv6 until an IPv6 unit firewall is prepared
  • We don’t have a clean solution for DNS changes at the moment, this is being worked on
  • As mentioned last week, the university IPv6 firewall needs replacing with a production system
  • stateless DHCP and Router advertisements are provided
  • Advertised via the stateless DHCP, a single IPv6 DNS resolver is provided (later this will be replaced with the production service)

In security terms and from our immediate team’s perspective things are politically easier for us than if it were the OUCS offices network – if a compromised address can’t be tracked to a host/user by the Mathematical Institute we’ll have to cut the development IPv6 feed off. This rather simple situation for us isn’t especially helpful for the local ITSS however, so I’ll prepare a suggested toolset for IPv6 adoption to assist. For instance the already mentioned NetDisco (CVS version) will provide the information required. Not everyone has the time to set up new services so I’ll also check if perhaps NSMS could offer a local unit controlled version of the toolset on their new ‘FiDo’ device since it would have network presence in a unit as part of the wake on lan facility/green IT. It might be too heavy a load for that device but Netdisco can be split into parts (web interface, database, probe) so we’ll see.

Next month’s work

It’s probably time to redo my predicted IPv6 related work for the next month since for one reason or another it’s drifting off course from last month’s prediction.

  • Subject to ICTST approval, deploy IPv6 to teams that request it for desktop systems in OUCS
  • Assist the Mathematical Institute with their IPv6 connectivity
  • Deploy IPv6 on the new Networks Technology Group Service Network in order to enable more services (this is complex and may take most of the month)

And the 101 more minor tasks that don’t seem IPv6 related at first but are part of the critical path and are large tasks:

  • Finish migrating our network monitoring host (Nagios etc) to new hardware/network
  • Migrate our internal database server to new hardware
  • Come up with a solution for a production IPv6 capable university firewall for the interim period between now and the scheduled backbone upgrade in 2 years’ time

DNS resolvers and (unrelated) IPv6 progress

I thought I’d cover our IPv6/server replacement progress this week but also describe some IPv6 issues we bumped into in case it assists other IT Officers in the University.

DNS Resolvers

Firstly we’ve replaced the second of the three DNS resolvers this morning, it seems with less than 30 seconds downtime for the individual resolver being replaced. The process (which happens about once every five years) is now more mature – the deployment/migration instructions I created for the first migration were tested again with the second deployment with only two minor corrections. I’ve also created a formal test plan for pre and post migration which I’ve applied – the previous migration had an odd logging issue where the configuration worked in testing but was overwritten for production due to my own human error and had to be corrected. With the testing process formalised, this shouldn’t crop up again.

The load on the resolvers is quite low compared to what the hardware can cope with. Prior to the replacement hardware arriving I ran a script to show the top 10 hosts in the University making DNS queries (no other information, simply the number of queries per host, cropped at a limit of say X million queries per day). The top 5 were guaranteed to be misconfigurations, for example the top host at 38.5 million queries a day was a host asking hundreds of times a second and endlessly for the same individual DNS record. I contacted the sysadmins for the five hosts involved which reduced the queries per day by roughly 20%, but even with these hosts the query load would be manageable on lesser hardware. We’ve already used the lowest power consumption cpus we can in the server range as part of the University’s energy initiative. Perhaps the next hardware refresh will see virtualisation of the service, however this year’s work is a simple warranty refresh and there are many other services our team would virtualise first to ensure our chosen virtualisation environment was mature before the (high downtime impact) DNS service was migrated.
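The “top talkers” summary needs nothing more than a count per client address. This isn’t the script I ran, just a minimal Python sketch of the idea, assuming you have already extracted one client address per query from your resolver’s query log (the log format varies, so that step is left out). The sample addresses are invented documentation addresses.

```python
from collections import Counter

def top_clients(client_ips, limit=10):
    """Count queries per client address and return the busiest ones."""
    return Counter(client_ips).most_common(limit)

# Invented sample: one entry per query, holding only the client address.
sample = ["192.0.2.7"] * 5 + ["192.0.2.9"] * 2 + ["192.0.2.20"]
for ip, count in top_clients(sample, limit=2):
    print(ip, count)
```

Keeping only the address and a count fits the point made above: the summary carries no other information about what was being looked up.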

The low load means the speed of response in production can be considered simply a measure of what’s been cached. If the record being queried is in the cache then the response will be instant, the only delays coming from lookups to external dns servers, there’s no cpu load worth mentioning.

e.g. in this example, using dig, we get 12ms for the uncached query. In the examples that follow we then get 1 ms for the second (cached) query and 0ms for a host on the same network

$ dig www.bbc.co.uk @163.1.2.1
[...]
;; ANSWER SECTION:
www.bbc.co.uk.        161    IN    CNAME    www.bbc.net.uk.
www.bbc.net.uk.        161    IN    A    212.58.244.68
[...]
;; Query time: 12 msec

[...]

Now the query has been cached:

$ dig www.bbc.co.uk @163.1.2.1
[...]
;; Query time: 1 msec
[...]

And even then, using linux.ox.ac.uk on the same network as the DNS server, it appears the 1ms delay might well be the network from my host to the server, or possibly my workstation, but for 1ms or less I’m not going to investigate too hard.

@raven:~$ dig www.bbc.co.uk @163.1.2.1
[...]
;; Query time: 0 msec
[...]
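If you want to repeat this comparison many times rather than eyeballing it, the query time is easy to pull out of dig’s output. A small Python sketch that works on saved output text; the function name is my own and the sample strings are cut down from the examples above:

```python
import re

def query_time_ms(dig_output: str):
    """Return the 'Query time' reported by dig in milliseconds, or None if absent."""
    match = re.search(r"^;; Query time: (\d+) msec", dig_output, re.MULTILINE)
    return int(match.group(1)) if match else None

uncached = ";; ANSWER SECTION:\n;; Query time: 12 msec\n"
cached = ";; Query time: 1 msec\n"
print(query_time_ms(uncached), query_time_ms(cached))
```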

So how do our (caching resolver) nameservers compare to others? Well using namebench this morning (and the results can vary a little but I’ll explain)

Our servers are closer to hosts on our network than external servers so give the quickest responses (the ‘Sys-$address’ servers) for the first summary:

Fastest individual response (in milliseconds):
----------------------------------------------
SYS-129.67.1.1   ######### 1.86205
SYS-163.1.2.1    ######### 1.86491
SYS-129.67.1.180 ########## 1.94716
Hurricane Electr ################### 3.84903
Norton DNS US    #################### 4.03595
OpenDNS          ##################### 4.39906
Cable & Wireless ########################## 5.29504
DynGuide         ########################## 5.35607
BT-70 GB         ########################## 5.39613
Google Public DN ################################################## 10.44893
UltraDNS-2       ##################################################### 11.17086

In terms of our servers the above test is fairly typical/consistent – the University servers should always be the fastest from the list.

For the second test however, there are external DNS services which I’d suggest are receiving more queries and hence having a larger cache of queries at any point in time and so have a faster average response:

Mean response (in milliseconds):
--------------------------------
BT-70 GB         ############### 28.51
Google Public DN ##################### 39.61
OpenDNS          ############################ 54.62
Cable & Wireless ############################## 58.24
SYS-129.67.1.1   ################################### 68.50
SYS-163.1.2.1    #################################### 70.55
SYS-129.67.1.180 ###################################### 73.40
Norton DNS US    ######################################### 81.05
Hurricane Electr ########################################## 82.70
DynGuide         ################################################### 99.75
UltraDNS-2       ##################################################### 104.77

So based on the above we should all use the BT, Google or OpenDNS servers not the University DNS, right? Well there’s a couple of reasons why that might end up being slower. Firstly, using the default testing methodology of namebench this latter test is more variable. Running the test the next day/hour/minute might give quite different results, so don’t jump to conclusions. For example the above might suggest that the two new DNS servers we’ve deployed are somehow faster, whereas the (currently) older 129.67.1.180 is slower, but the next test suggests the opposite.

Mean response (in milliseconds):
--------------------------------
BT 41 GB         ############## 50.04
OpenDNS-2        ############## 50.21
SYS-129.67.1.180 ############### 50.70
OpenDNS          ################ 57.20
Google Public DN ################## 64.75
SYS-163.1.2.1    ################### 67.80
UltraDNS         #################### 70.04
Hurricane Electr ##################### 74.04
SYS-129.67.1.1   ##################### 75.29
DynGuide         ############################# 104.22
Fast GB          ##################################################### 191.12
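
To put a rough number on that variability, here’s a minimal Python sketch comparing the two mean-response charts above. The figures are copied from the charts, restricted to the servers listed under the same name in both; the variable and function names are my own and nothing here is part of namebench.

```python
# Mean response times in milliseconds, copied from the two charts above.
# Only servers listed under the same name in both runs are included.
earlier_run = {
    "Google Public DN": 39.61,
    "OpenDNS": 54.62,
    "SYS-129.67.1.1": 68.50,
    "SYS-163.1.2.1": 70.55,
    "SYS-129.67.1.180": 73.40,
    "Hurricane Electr": 82.70,
    "DynGuide": 99.75,
}
later_run = {
    "SYS-129.67.1.180": 50.70,
    "OpenDNS": 57.20,
    "Google Public DN": 64.75,
    "SYS-163.1.2.1": 67.80,
    "Hurricane Electr": 74.04,
    "SYS-129.67.1.1": 75.29,
    "DynGuide": 104.22,
}


def ranking(run):
    """Return server names ordered fastest first."""
    return sorted(run, key=run.get)


def change_between_runs(before, after):
    """Return the change in mean response (ms) for servers present in both runs."""
    return {name: round(after[name] - before[name], 2)
            for name in before if name in after}


if __name__ == "__main__":
    print("Fastest in earlier run:", ranking(earlier_run)[0])
    print("Fastest in later run:  ", ranking(later_run)[0])
    for name, delta in sorted(change_between_runs(earlier_run, later_run).items()):
        print(f"{name:<17} {delta:+.2f} ms")
```

Among these seven servers the fastest entry changes between the runs, and several servers move by more than 20 ms, which is the reason not to read too much into a single run.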

Hence namebench is a handy test, but relax and don’t panic about the results (or if you’re dishonest, simply run the test a number of times until you get the result you want to show your boss). Secondly, each of our resolvers also carries a local copy of the ox.ac.uk zone, so lookups for this will be effectively instant. Even if this weren’t the case, the authoritative servers for ox.ac.uk are also on the immediate network, so I’d expect a lookup to be faster than one sent to an external host that then contacts our authoritative servers, but this isn’t important. For example:

$ dig www.oucs.ox.ac.uk @163.1.2.1
[...]
;; Query time: 1 msec

The last resolver will be replaced next week; it’s already prepared so I’ll finish testing it today. Replacing the authoritative servers should be quite painless and not as potentially exciting.

IPv6 Work

A few issues cropped up. I mention them here not because they aren’t known, but because if you’re an average everyday sysadmin (like I am: I’m no IPv6 expert, I just happen to be tasked with implementing it on our services) you might not be aware of them.

Firstly, for our IPv4 based servers we tend to have a management interface (that you might ssh to) separate from the host’s service addresses. We use virtual interfaces (eth0:1, eth0:2) to provide these in most cases. Under IPv6, as you may know, you don’t use virtual interfaces, so your configuration might look something like:

# don't use this example, read the explanation
iface eth0 inet6 static
 address [% ipv6_management_interface %]
 gateway [% ipv6_gateway %]
 netmask 64
 mtu 1280
 post-up /sbin/ifconfig eth0 inet6 add [% ipv6_service_X  %]/64
 [...more service addresses..]

That’s fine, except the traffic from the host (e.g. making database connections) may well come from any of the service addresses, which caused an issue when the webserver for IT Support Staff was IPv6 enabled. There are roughly 10 rules set out in an RFC to define how the source address should be chosen. This article is already rather long so I’m only discussing the solution (there’s a better article on the Linux implementation), but in brief here’s what I’ve done:

iface eth0 inet6 static
 address [% ipv6_management_interface %]
 gateway [% ipv6_gateway %]
 netmask 64
 mtu 1280

 pre-up ip -6 addr add [% ipv6_service_X  %]/64 dev eth0
 pre-up ip -6 addr change [% ipv6_service_X %]/64 dev eth0 preferred_lft 0

I found documentation on this general area and on preferred_lft to be a little sparse (but please correct me in the comments if you know of a link to an article with any real meat to it). Going by the relevant section of RFC2461, the preferred lifetime is the length of time an address remains preferred. We’ve set it to zero, which results in the address being marked as deprecated: it still works fine for incoming traffic but is avoided as the source address for new outgoing connections. We’ve also altered the order in which the addresses are defined so the management address is the last initialised.
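
As a very rough illustration of why the zero preferred lifetime helps, here’s a Python sketch of just one of the source selection rules: avoid deprecated addresses when a non-deprecated one exists. It is not the kernel’s algorithm (which applies the full rule set in order), and the class, function and address labels are placeholders of my own.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Address:
    name: str                      # placeholder label, not a real address
    preferred_lft: Optional[int]   # seconds; None stands for "forever"

    @property
    def deprecated(self) -> bool:
        # A preferred lifetime of zero marks the address as deprecated.
        return self.preferred_lft == 0


def choose_source(addresses: List[Address]) -> Address:
    """Pick a source address, avoiding deprecated ones where possible."""
    usable = [a for a in addresses if not a.deprecated]
    # Fall back to a deprecated address only when nothing else is configured.
    candidates = usable or addresses
    return candidates[0]


if __name__ == "__main__":
    eth0 = [
        Address("ipv6_service_X", preferred_lft=0),
        Address("ipv6_service_Y", preferred_lft=0),
        Address("ipv6_management_interface", preferred_lft=None),
    ]
    print(choose_source(eth0).name)
```

With every service address deprecated, the only candidate left for outgoing connections is the management address, which is the behaviour we were after.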

Of unrelated interest is the mtu specified which is explained in detail by Geoff Huston so read his notes for this.

The final host for our NTP round robin is a somewhat quirky machine which has (for historical reasons that pre-date my joining the team) got an interface on both physical OUCS machine room networks, a practice we ask others not to follow and don’t follow on any other service we have. Under IPv4 a single gateway is defined and the host responds correctly to a ping or other traffic on either interface. Under IPv6 the host receives traffic on the secondary interface and replies out of the primary, causing the packets to be dropped at the network’s border. Adding a second gateway to anywhere via the secondary connection fixes ICMPv6 so it behaves as expected, however ntpd still replies out of the opposite interface to the one that received the query. From what I can find this appears to be a known problem with ntpd, and since the service is about to be migrated to a single homed host I’ve simply removed the interface from the ntp6 round robin and will allow the host’s decommissioning to fix the issue. If I had more time I might investigate further, but we are short on time compared to outstanding tasks. Sadly this host is also a component of our Nagios monitoring, so we may have to postpone the IPv6 service monitoring and perhaps speed up this base host’s migration to new hardware/software.

Lastly there was an issue last week on the (separate IPv6 only) University firewall for (if I recall correctly) roughly 40 minutes, which was my own human error and embarrassing. Although the IPv6 deployment is currently considered a non production service, the distinction is weaker as we enable more production services on IPv6, accessible either externally or via tunnelled internal hosts. The issue was a configuration management and testing one, hampered by there currently being only one firewall device for the IPv6 connection (and no test equivalent). We discussed in our team meeting yesterday contacting our switch/router hardware vendor to discuss a more mature (and upgradable) interim solution instead of waiting 2 years for the backbone upgrade project. We also need a solution for the firewall management itself, for example adding and removing webserver exemptions. We have an existing system which manages the main firewall and IPv4 exemptions, but some work and research will be needed as the IPv6 exemptions are currently handled manually and so are not scalable.

Progress

In short we’re about a week behind the original plan, and I may insert an additional week’s breathing space into the schedule in order to address minor issues that have come up during the work. Specifically, looking at last week’s targets:

  • I stated I’d be building a Centos5 custom kernel, which is required for the webcache to be IPv6 enabled. I didn’t have time for this last week but aim to revisit it this week.
  • The final host was added to the ntp6 stratum3 and had an issue as discussed above. It was removed and the present ntp6.oucs.ox.ac.uk service will be regarded as complete for now.
  • This also affects the Nagios network monitoring, which is hence delayed.
  • The expected DNS resolver deployment has gone fine. The next one will be Tuesday 5th October; when all 3 resolvers are replaced they can be IPv6 enabled.
  • I haven’t replaced any Authoritative DNS servers yet but hope to replace at least one this week.

In addition

  • I’m looking at how we’ll handle DNS for the units that want to take part in early adoption of IPv6 prior to our team having an IPv6 capable DNS management interface available for IT officers. We may use wildcards in the initial period (not generate statements, which are different).
  • I’ll try to get a public update on whether the Network Security team are ready for a unit to have IPv6 (if interested, note that our own team has basic local network sanity requirements for taking part in the early adoption testing).
  • As discussed we’ll be looking at making the IPv6 firewall a production quality service
Posted in IPv6

Surprise! You have IPv6 connectivity!

I bet you didn’t think you had IPv6 connectivity yet (certainly in any University department). After all we’re still working through our plan to light up IPv6 services in the core. Well, news flash: if you’re running Windows 7 in the University it’s likely you can already access IPv6 services.

How so? By means of what we call tunnelling mechanisms. The Internet standards designers realised that during the transition to full IPv6 access there would be islands of IPv4-only and IPv6-only systems. The idea was to create some simple transitioning mechanisms by which systems in these islands could still talk to the rest of the Internet, be it IPv4 or IPv6.

Windows 7 ships with a number of tunnelling mechanisms enabled by default, and they pretty much all work in the same way. Your client wraps up the IPv6 packet inside an IPv4 packet and fires it off to a Relay Server out on the Internet somewhere. The Relay Server has both IPv4 and IPv6 connectivity so extracts the IPv6 content and sends it natively to the target server. The reply is somewhat similar, and usually some fancy IP addressing rules are used to allow traffic to find its way back to your client.

Note that depending on local department and college firewall configurations, some of the tunnelling mechanisms may not work.

Teredo is a common tunnelling mechanism which is used when the client is on an RFC1918 IPv4 address (sometimes called private addressing), often what you get behind NAT. 6to4 is another mechanism, this time one which requires a publicly routable IPv4 address (so is common at our institution as most clients have that configuration).
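
The rule of thumb above can be sketched in a few lines of Python. This is only a simplification of the description in this post, not Windows’ actual decision logic, and the function name is my own.

```python
import ipaddress

# The three RFC1918 private ranges.
RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def likely_mechanism(client_ipv4: str) -> str:
    """Guess which tunnelling mechanism applies, per the rule of thumb above."""
    address = ipaddress.ip_address(client_ipv4)
    if any(address in network for network in RFC1918):
        return "Teredo"   # private address, typically behind NAT
    return "6to4"         # publicly routable address


if __name__ == "__main__":
    for client in ("192.168.0.10", "129.67.1.1"):
        print(client, "->", likely_mechanism(client))
```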

There are a few issues with tunnelling mechanisms, however:

  1. They don’t promote setting up native IPv6 connectivity
  2. There are security concerns because your traffic (potentially local traffic between two University systems) goes via a Relay Server on the Internet
  3. Performance may be poor because of the latency introduced by relaying out to the Internet or because the Relay Server is congested

The second point is particularly interesting to the Network Development and Network Security teams in OUCS. We’d much rather traffic local to the University didn’t relay via some untrusted server potentially on the other side of the world, and the tunnelling also makes it difficult for us to monitor for and catch malware-infected clients like we can do for IPv4.

Consider for example an IPv4 workstation in a department connecting to the IRC service irc.ox.ac.uk, which has now been IPv6 enabled. The Windows 7 client is on IPv4 and the server is on IPv6. Due to the default configuration of Windows and Microsoft’s interpretation of the RFCs a tunnelled IPv6 connection will be preferred to a native IPv4 connection (the IRC service still runs on IPv4, too!).

  1. Client is on IPv4 and asks a DNS server for the IP(s) of irc.ox.ac.uk
  2. DNS resolver replies with both 129.67.1.25 (A record) and 2001:630:440:129::407 (AAAA record used for IPv6 addresses)
  3. Windows 7 spots it’s on a publicly routable IPv4 address so starts up a 6to4 tunnel
  4. A connection to 2001:630:440:129::407 is made over the 6to4 tunnel, via a Relay Server on the Internet
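
For the curious, the “fancy IP addressing rules” in the 6to4 case amount to embedding the client’s IPv4 address in a 2002::/16 prefix, which is how replies find their way back. Here’s a small Python sketch of that mapping; 192.0.2.4 is a documentation address standing in for a real client and the function names are my own.

```python
import ipaddress


def sixtofour_prefix(public_ipv4: str) -> ipaddress.IPv6Network:
    """Return the 2002:V4ADDR::/48 prefix that 6to4 derives from an IPv4 address."""
    v4 = int(ipaddress.IPv4Address(public_ipv4))
    # 16 bits of 2002, then the 32-bit IPv4 address, then 80 bits of zero.
    prefix = (0x2002 << 112) | (v4 << 80)
    return ipaddress.IPv6Network((prefix, 48))


def embedded_ipv4(sixtofour_address: str):
    """Recover the IPv4 address embedded in a 6to4 address (None if not 6to4)."""
    return ipaddress.IPv6Address(sixtofour_address).sixtofour


if __name__ == "__main__":
    print(sixtofour_prefix("192.0.2.4"))
    print(embedded_ipv4("2002:c000:204::1"))
```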

By the way, one mitigation technique is to use an SSL connection to the IRC server, which we support :-)

To protect collegiate University interests, OxCERT took the decision to place a block on Teredo traffic (udp/3544) at the University JANET-connection firewall. We’ve recently also been looking at the 6to4 mechanism — but then hit a stumbling block…

Ideally we’d like to run a local 6to4 relay so that University systems can still use this transition mechanism. I set one up on a development server at OUCS. This server needs, due to the way 6to4 works, to be able to send IPv6 traffic with source addresses in the IPv6 range 2002::/16. Fair enough, our backbone can deal with that.

However at our connection to JANET it turns out that the JANET-UK engineers implement some IP address filters (and quite rightly so). Only IPv6 addresses in Oxford’s 2001:630:440::/44 range are permitted onto JANET, and the 2002::/16 packets are dropped on the floor :-(
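
A rough way to picture the filter is as a membership test on the source address. This Python sketch is only a model of the behaviour described above (the real thing lives in JANET-UK’s router configuration, which I haven’t seen); the function name and the sample 6to4 address are made up for illustration.

```python
import ipaddress

# Oxford's allocation, as quoted above, and the 6to4 range.
OXFORD = ipaddress.ip_network("2001:630:440::/44")
SIXTOFOUR = ipaddress.ip_network("2002::/16")


def permitted_onto_janet(source: str) -> bool:
    """Model of the source filter: only Oxford's own range is let through."""
    return ipaddress.ip_address(source) in OXFORD


if __name__ == "__main__":
    native = "2001:630:440:129::407"   # the IRC server's address from earlier
    relayed = "2002:c000:204::1"       # hypothetical 6to4 source address
    print(native, permitted_onto_janet(native))
    print(relayed, permitted_onto_janet(relayed))
```

A relay has to send packets sourced from within 2002::/16, and since that range can never fall inside our /44 those packets fail the test.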

The net effect is that sadly I can’t run a local 6to4 relay for us in Oxford, at least not without persuading JANET-UK to change their configuration policy for connected institutions. I’d quite like to do that: it seems a little unfortunate to assume we won’t want to use these transition mechanisms (and various other IPv6 goodies which use non site-specific addresses). However as I know what they’re likely to be doing in the configuration (uRPF) and I understand its limitations, I agree it would be more work for them, although not impossible by any means, to implement smarter access control lists.

Curiously, JANET-UK does not filter what we receive, only what we send. For instance we receive lots of spoofed packets from the Internet which are dropped by our JANET-connection backbone router. I suppose it’s their implementation of “be liberal in what you receive, conservative in what you send.”

The upshot of all this is that from a Windows 7 box on a publicly routed IPv4 address you probably can already ping6 ipv6.google.com, or visit http://www.kame.net/ and see a dancing turtle. I’m glad for that – IPv6 isn’t scary, and is here and alive and working well. However I’d much prefer us to be able to provide an improved and safer experience when you are inadvertently using IPv6, as will become all the more common in the future.

If we accept that use of 6to4 is inevitable then we (JANET-connected institutions) should also be able to run a local 6to4 relay for:

  1. Network security – to monitor for and catch malware-infected clients
  2. Network performance – to avoid “tromboning” local traffic via a remote (and untrusted/congested/etc) server on the Internet

Do any other institutions feel the same way? Are we barking up the wrong tree? I’d appreciate feedback in the comments section below. Regardless, good luck with your IPv6 transition — exciting times!

Posted in Backbone Network, IPv6

New DNS servers

Completed

The main progress on the IPv6 and server deployments this week:

  • This morning we’ve deployed a new DNS resolver to replace our oldest in-service host. It was due to be done last week but I spent a little longer on testing, which has made the deployment a lot smoother than it would have been if rushed through last week. The DNS resolver service itself is made up of 3 servers, with only one server being migrated. Because of this, and because the changeover was going to be short and early in the morning, a general announcement to IT staff was not made (there are other social reasons: too many announcements have the effect of crying wolf, and then staff stop reading them). One address would have been unreachable for roughly 90 seconds during the changeover at ~7:14am (I think we can make it faster for the two that follow).
  • I’ve IPv6 enabled another couple of minor servers and as a result added another host to our IPv6 stratum 3 NTP round robin DNS record.
  • I’ve done the majority of the work in preparation for IPv6 enabling the webserver that IT Support Staff use for our web based network management tools. There isn’t time to make this live in the at-risk slot today, but it’s a service we can deploy on another early morning this week without causing issues.

Webcache

The main setback has been the webcache. It’s based on a Centos 5 host, which uses the standard 2.6.18 series Linux kernel. The issue is that IPv6 connection tracking is broken on kernels prior to 2.6.20. Some Linux distributions can have slightly misleading kernel version numbers, since the distribution maintainers backport certain select newer fixes and features to the older kernel version they ship, so I tested the IPv6 connection tracking on our Centos 5 development host just in case. Sadly testing confirmed there were issues. This has further implications since oxmail.ox.ac.uk, our NTP stratum 2 and smtp.ox.ac.uk are among our Redhat/Centos based services. It’s quite a shame that I didn’t pick this up when researching/auditing our services, and in hindsight I believe a second mistake was that our IPv6 test network was Debian only, hence I didn’t spot it in testing. The Debian hosts had a kernel new enough not to suffer the issue (on Debian this is the “etch and a half” kernel onwards).

What’s the fuss about? Connection tracking means that you can make a statement in your firewall rules along the lines of “allow in traffic from anyone who’s replying to my attempt at contacting them”. To simplify things: if you have broken connection tracking then your firewall rules either don’t work or you have to make them more primitive and yet more complex to configure. The possible solutions include running a custom kernel, which I’d like to avoid on a production system if possible, since we’d have to track kernel security announcements and do all the work that would normally be done for us by the distribution’s package manager. We could also reinstall the system, perhaps with Debian, but this is a time consuming and service affecting solution. Redhat6/Centos6 should solve the issue but might not be released until January. I took a little look on a development host at putting the Redhat6 beta 2 kernel on to Centos 5 but met a dependency chain that suggested this was not the way forward. Building complex firewalls based on connectionless rules feels like a step backwards, so I’d like to avoid this.
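
If the idea of connection tracking is new to you, this toy Python sketch shows the principle: remember what we sent out and only let in traffic that answers it. It bears no resemblance to how the kernel implements this (no timeouts, protocol states or related traffic), and the class name and addresses are invented for illustration.

```python
class ConnectionTracker:
    """Toy stateful filter: remembers outbound flows, admits only their replies."""

    def __init__(self):
        self._flows = set()

    def outbound(self, local, remote):
        """Record that we contacted `remote` from `local` (each an (address, port) pair)."""
        self._flows.add((local, remote))

    def allow_inbound(self, remote, local):
        """Allow an inbound packet only if it answers a flow we started."""
        return (local, remote) in self._flows


if __name__ == "__main__":
    tracker = ConnectionTracker()
    us = ("2001:db8::10", 40000)       # documentation addresses, not real hosts
    dns = ("2001:db8:1::53", 53)
    stranger = ("2001:db8:2::99", 53)

    tracker.outbound(us, dns)
    print(tracker.allow_inbound(dns, us))        # reply to our own query
    print(tracker.allow_inbound(stranger, us))   # unsolicited traffic
```

Without that remembered state the only option is a static rule such as “allow anything from port 53”, which is the more primitive style of rule I’d like to avoid.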

I suspect we’ll use a custom kernel for a few months (e.g. our own package from the latest stable version at kernel.org), then make the webcache one of the first hosts upgraded when Centos6 is released. Once this is done we might take stock and think about the other Centos based services.

DNS

It was also planned to replace one authoritative and one resolver DNS server this week. The resolver is completed, as mentioned earlier, but the auth server hasn’t been done due to time constraints. The auth service typically has a much lighter load, so it was more important to replace the resolver. We might replace the auth server (one of three making up the service) outside of the JANET at risk period. Queries to it tend to come from other DNS servers, which have better caching and failover behaviour than the end user clients that use the resolvers, hence 60 seconds of one auth DNS server out of the three being down shouldn’t have a noticeable effect, especially if the work is done in the early hours.

It’s taken a fair time to deploy one resolver, but this has been due to integrating the older DNS configuration management system with the newer system used for our other hosts (we use cfengine). It’s not possible to totally turn off the old configuration system at this point, as it is responsible for pushing new DNS configurations across the DNS servers, but now that the configuration templates and integration are done (and tested) the remaining DNS servers should be easy and quick to configure and hence faster to deploy. As far as IPv6 goes, once all the resolvers or all the authoritative DNS servers are using the new configuration system it’s a simple matter to enable it. I was able to complete the configuration templates and testing for this last week.

To Follow

The rest of this week will involve:

  • Enabling the webserver that IT Support Staff use for our web based network management tools to support access via IPv6, possibly tomorrow morning
  • Adding the final host to the ntp stratum 3 IPv6 round robin
  • Preparing the two new hosts that will replace our other two older DNS resolvers next Tuesday
  • Building and packaging a working Centos5 kernel from the latest stable version at kernel.org and testing for stability, then considering deployment on the webcache
  • (if time allows) replacing the DNS auth servers one at a time
  • (if time allows) setting up monitoring of the IPv6 based services
Posted in IPv6

IPv6 deployment – irc.ox.ac.uk and ntp6.oucs.ox.ac.uk complete

Our targets for the IPv6 deployment for this week have been mostly met:

  • Enabling IPv6 to the first half and now the second half of the OUCS server room has been completed. We’ve had no complaints of disruption from other teams, so this appears successful.
  • All our development hosts are now IPv6 enabled. These were the first hosts enabled, which was useful for setting up our configuration management system’s handling of IPv6.
  • Two of our four IPv4 stratum 3 NTP servers are now IPv6 enabled, providing a live service currently at ntp6.oucs.ox.ac.uk. In months to come I suspect we’ll add AAAA records for ntp.oucs.ox.ac.uk, however for now we’re being cautious.
  • At IT support staff request the IRC service at irc.ox.ac.uk has been IPv6 enabled ahead of schedule. This was more rushed than I would have liked, but we were able to test on a development host and the change was completed with the service live, without downtime or disconnections. Sadly the jabber/xmpp service, which is present on the same physical server, received less testing and wasn’t successfully IPv6 enabled; AAAA records for this service have been removed until time can be set aside for proper testing and deployment. For the IRC service (which is mainly used by IT Support Staff) we decided to add AAAA records for the normal service address (rather than use, say, irc6.ox.ac.uk) and haven’t seen any reported issues yet.
  • Our team’s main database server has been IPv6 enabled. This is not a direct service to IT support staff but provides information to many of the tools and backend processes.

The following were originally planned for this morning but are behind schedule:

  • IPv6 enabling the webcache service. This simply hasn’t had time set aside for it due to other demands. It’s reasonably straightforward (a little Squid configuration, some ip6tables work, some testing), but I don’t believe the migration should be done live like the irc.ox.ac.uk one was, since although little used nowadays the webcache has a larger number of users than the IRC service.
  • Outside the IPv6 project, the migration of one of the University’s main DNS servers to new hardware will be delayed. It was ready on time this morning but I wasn’t comfortable that it had been tested enough, so I’d like to do more testing before deploying.

So my work for the next four days is:

  1. Testing the replacement DNS server with the aim of deploying in next week’s Tuesday slot instead (21st September)
  2. Planning/preparation for IPv6 enabling the webcache in the same slot
  3. IPv6 enabling the host that provides the team website IT support staff use for internal network facilities. This is also an IPv4 NTP stratum 3 host, so we can add it to the ntp6.oucs.ox.ac.uk service once enabled.
  4. Ahead of schedule, adding the network monitoring system host itself to IPv6 as it’s also an IPv4 stratum 3 NTP host, but probably not doing any work on monitoring via IPv6 until the planned date due to time.
Posted in IPv6

IPv6 and related work over the next month

We’ve a timeline of work for the IPv6 deployment for the next month which might be of interest to some, and at the same time we’ve work updating the hardware and operating systems for certain core services. There are other services that will also need to support IPv6, but I don’t want to plan at an operational level for much more than a month ahead since unexpected additional tasks tend to crop up and skew the dates.

In brief we’re enabling IPv6 to our server room and starting the first core service upgrades, in some cases mixing the migration in with a standard hardware/software refresh (where combining the tasks makes the migration simpler rather than harder). The dates reflect the weekly JANET at risk period for maintenance, Tuesday 7:00am-9:00am.

Date (project) and work:

7th September (IPv6)
  • Enabling IPv6 to first half of the OUCS server room
  • Make all networks team development hosts IPv6 enabled
14th September (IPv6/Servers)
  • Enabling IPv6 to the last half of the OUCS server room
  • Make ntp.oucs.ox.ac.uk accessible via IPv6
  • Make the webcache service accessible via IPv6
  • Migrate one authoritative and one resolver to new hardware/software
21st September (IPv6)
  • Make the comms.oucs.ox.ac.uk utilities website for IT support staff accessible via IPv6
28th September (IPv6/Servers)
  • Make authoritative and resolver DNS available via IPv6
  • Migrate network monitoring host (monitors services) to new hardware/software and IPv6 enable
5th October (IPv6/Servers)
  • Migrate database host to new hardware/software and IPv6 enable
12th October (Servers)
  • Migrate DHCP service to new hardware/software
19th October (Servers)
  • Migrate DNS master to new hardware/software
20th October (and relax…)
  • update the blog about what our IPv6 plans are for the next month

You might think that we could enable core services faster, however the above is one person’s work from a four man team. We will have normal duties and other projects underway during this period (the team manages 22 public services and about 13 internal services such as physical network monitoring and telecoms integration).

The enabling of the first part of the OUCS server room this morning (and the background research/testing, as well as the notes this section is based on) was done by another team member. The essential feature is that we’re providing IPv6 to the server room with router advertisements disabled, meaning dual stacked servers run by other teams won’t suddenly obtain an IPv6 address automatically. We can get on and IPv6 enable our team’s services with statically configured addresses without disrupting others. The configuration on a Cisco device looks like:

ipv6 nd ra suppress
ipv6 nd prefix default 2592000 604800 no-autoconfig

The Cisco IPv6 command reference for nd is available online. In the example above ignore the numbers, which are just the defaults for valid-lifetime and preferred-lifetime in seconds (normally given for a specific network prefix, which we do not need in this case) as sent via router advertisement. The important parts are the ‘suppress’ and ‘no-autoconfig’.

The first line tells the router that for ipv6 neighbour discovery it should suppress periodic router advertisements. The second line instructs the router that if asked it should tell the client not to use autoconfiguration for the network. In technical terms, when it gets a request from a client it will respond but in the Prefix Option (see RFC2461) the A-bit is cleared, meaning that the client must not use autoconfiguration for that prefix.
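
To make the A-bit concrete, here’s a minimal Python sketch that builds and inspects a Prefix Information option using the field layout from the RFC. The 2001:db8::/64 prefix is a documentation placeholder, the function names are my own, and I’m assuming the on-link (L) bit stays set when no-autoconfig is used; treat it as an illustration rather than a packet capture from our routers.

```python
import ipaddress
import struct

L_FLAG = 0x80  # on-link flag
A_FLAG = 0x40  # autonomous address-configuration flag (the "A-bit")


def build_prefix_option(prefix, valid, preferred, on_link=True, autonomous=True):
    """Pack a 32-byte Prefix Information option (type 3, length 4 x 8 bytes)."""
    net = ipaddress.IPv6Network(prefix)
    flags = (L_FLAG if on_link else 0) | (A_FLAG if autonomous else 0)
    return struct.pack("!BBBBIII16s", 3, 4, net.prefixlen, flags,
                       valid, preferred, 0, net.network_address.packed)


def may_autoconfigure(option: bytes) -> bool:
    """True if the A-bit is set, i.e. clients may autoconfigure from this prefix."""
    return bool(option[3] & A_FLAG)


if __name__ == "__main__":
    # Lifetimes are the defaults quoted in the router configuration above.
    option = build_prefix_option("2001:db8::/64", 2592000, 604800, autonomous=False)
    print(len(option), may_autoconfigure(option))
```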

For further information see this c-nsp thread on RA/RS messages and also this other thread.

Posted in IPv6

Implementing Spanning Tree

Some of the IT support staff have taken the recently published IPv6 trial conditions quite seriously and we’ve already had two queries with regards to the spanning tree requirement. These queries aren’t disputing it but rather asking about specific behaviour that will occur when implementing it.

This is a slightly tricky article to write as:

  1. In our immediate team we tend to have spanning tree present on networks from the beginning
  2. We use predominantly Cisco products on the backbone, however college or department IT staff may be using any vendor, so please excuse any Cisco specific terminology. Your vendor’s implementation should be roughly similar.

But let’s dive in.

1. Look at your network

Specifically, do you have a mix of vendors? Know in advance that although they can work together it’s possible to encounter some unexpected issues (I believe we’ve some investigative work planned in this area to assist).

You should have a login to your managed switches and a complete list/inventory of the switch devices on your network. If not, then perhaps you’ve inherited a network as part of a new position and some detective work is required to audit the network before you proceed. Cisco CDP, the IEEE LLDP or your vendor’s equivalent may be of assistance.

Draw a network diagram of your core network to help you visualise the network and speed up troubleshooting.

2. Decide on the type of spanning tree

Spanning tree is a great help, but depending on the age of your switch hardware and software you may have the opportunity to deploy rapid spanning tree (RSTP, IEEE 802.1w), which has benefits discussed below. If your network uses VLANs, you may also be able to use one of the implementations of per VLAN rapid spanning tree.

Dipping into the CCNA ICND2 Exam Certification Guide by Wendell Odom (Cisco Press), we can save a lot of time and steal the following table from p88, which illustrates the features offered by the three main per VLAN spanning tree options on Cisco devices.

Option     | Implemented via STP/802.1d | Implemented via RSTP/802.1w | Configuration Effort | Only One Instance Required for Each Redundant Path
PVST+      | Yes                        | No                          | Small                | No
PVRST      | No                         | Yes                         | Small                | No
MIST (MST) | No                         | Yes                         | Medium               | Yes

I’ve altered the titles of the second and third columns in the above table from the original ‘supports STP’ to make it clear that although an implementation may not use standard spanning tree it will co-exist with it. A network of rapid spanning tree switches can co-exist with normal spanning tree, and the switches will do the right thing in order to work together.

Don’t change the timing values of spanning tree when configuring it, leave these at the defaults. If you disagree with this then you’re probably excessively familiar with spanning tree and the advice in this article isn’t aimed at your level.

Mentions of spanning tree from this point onwards will tend to be generic unless otherwise stated and hence refer to whichever implementation you have chosen.

3. Plan – What will I see when I first implement it?

Firstly, standard spanning tree takes roughly 50 seconds to converge after a change, while rapid spanning tree may take between 2 and 10 seconds. Convergence may be noticeable to your users as a delay, so perform the introduction outside standard working hours. If you have an at risk period, such as JANET’s Tuesday 7:00-9:00am period which has been widely adopted, then use it.

In everyday use, if a link were to go down then with traditional spanning tree you might expect a reconvergence time to a redundant link of perhaps 30 to 50 seconds, since a minor topology change might not result in a full network spanning tree reconvergence (e.g. it happens faster if you only have 3 switches instead of 30). Those with a fascination for detailed scenario explanations should take a look at the Cisco Press “CCNP Switch 642-813 Official Certification Guide” by David Hucaby, p142 onwards.
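
The 30 and 50 second figures fall out of the default 802.1D timers, as this small Python sketch shows. The function names are my own; the timer values are the standard defaults, which (as advised above) you should leave alone.

```python
# Default 802.1D timers in seconds.
MAX_AGE = 20        # how long stale root information is kept before being aged out
FORWARD_DELAY = 15  # spent once in the listening state and once in learning


def port_startup_delay(forward_delay=FORWARD_DELAY):
    """Time for a port to pass through listening and learning before forwarding."""
    return 2 * forward_delay


def worst_case_reconvergence(max_age=MAX_AGE, forward_delay=FORWARD_DELAY):
    """Time to age out stale information and then bring a blocked port into forwarding."""
    return max_age + port_startup_delay(forward_delay)


if __name__ == "__main__":
    print("Port start-up delay:", port_startup_delay(), "seconds")
    print("Worst-case reconvergence:", worst_case_reconvergence(), "seconds")
```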

I’d start by enabling spanning tree on your core switch, i.e. the switch closest to the centre of your network (or one of them if you have more than one). It will initially assume it is the spanning tree root bridge until it decides otherwise from the spanning tree traffic it receives. You can manually change the bridge priority to make the switch become the root bridge in any spanning tree election. Manually picking the root bridge means the spanning tree topology should take reasonably expected paths in a more complex network. After this is done, enable spanning tree on all your other switches.

If you’re unsure, the above paragraph seems confusing and you can count all your switch devices on a few fingers and toes then simply turn on spanning tree on all your switches. The switches will elect a root bridge without your involvement and should do the right thing.
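
For those who do want to see what the election is doing, here’s a minimal Python sketch: the bridge with the lowest bridge ID (priority first, then MAC address) becomes root. The switch names and MAC addresses are invented, and real switches settle this by exchanging BPDUs rather than by sorting a list.

```python
from dataclasses import dataclass

DEFAULT_PRIORITY = 32768  # the usual default bridge priority


@dataclass
class Bridge:
    name: str
    mac: str
    priority: int = DEFAULT_PRIORITY

    def bridge_id(self):
        """Bridge ID compared as (priority, MAC address); lowest wins."""
        return (self.priority, int(self.mac.replace(":", ""), 16))


def elect_root(bridges):
    """Return the bridge with the lowest bridge ID."""
    return min(bridges, key=Bridge.bridge_id)


if __name__ == "__main__":
    # Hypothetical switches; the MAC addresses are invented.
    core = Bridge("core", "00:1a:2b:00:00:30")
    edge1 = Bridge("edge1", "00:1a:2b:00:00:10")
    edge2 = Bridge("edge2", "00:1a:2b:00:00:20")

    print(elect_root([core, edge1, edge2]).name)  # priorities equal, lowest MAC wins
    core.priority = 4096                          # manually lowered priority
    print(elect_root([core, edge1, edge2]).name)
```

Left at the defaults the switch with the lowest MAC address wins, which may well be an old edge switch, and that is why picking the root yourself is worthwhile on a larger network.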

With standard spanning tree, as you enable it you might lose contact with switches for around 30 seconds as the links start in spanning tree blocking mode and then pass through the listening and learning states (15 seconds each by default) before forwarding. This is normal.

You might see some links go down and yet the network still functions. Perhaps you didn’t know you had a redundant link (loop) on your network. Imagine you have three switches with cables connecting them together in a triangle: with spanning tree two links would be up and the other would be disabled (connecting ports put into a blocking state), preventing a loop. If one of the live links was broken, the disabled link would be made live automatically.
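
The triangle example can be mimicked with a few lines of Python. This sketch uses a plain breadth-first tree as a stand-in for spanning tree with equal link costs; real spanning tree works from BPDUs, path costs and port IDs, so take it as a picture of the outcome rather than the protocol. The switch names are invented.

```python
from collections import deque


def blocked_links(links, root):
    """Return the links left out of a breadth-first tree rooted at `root`."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    seen = {root}
    tree = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for peer in sorted(neighbours.get(node, [])):
            if peer not in seen:
                seen.add(peer)
                tree.add(frozenset((node, peer)))
                queue.append(peer)
    return [link for link in links if frozenset(link) not in tree]


if __name__ == "__main__":
    triangle = [("A", "B"), ("B", "C"), ("A", "C")]
    print(blocked_links(triangle, root="A"))

    # Break the A-B cable: the previously blocked B-C link is now needed.
    after_failure = [("B", "C"), ("A", "C")]
    print(blocked_links(after_failure, root="A"))
```

With A as root the B to C link is the one left out of the tree; once the A to B cable is removed nothing is blocked, matching the description above.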

End users’ workstation ports connected to a switch running standard spanning tree will see a delay of roughly 30 seconds or more from connection to usability. This is noticeable to users and may result in odd complaints such as “your DHCP server is too slow” or similar. You can configure “portfast” (I’m told on HP equipment this is the edge-port feature) on these ports to put the port straight into a spanning tree forwarding state and make the delay vanish, which is good. The minor risk is that someone will plug a switch that doesn’t do spanning tree into this port and another port, creating a bridging loop. Use BPDU guard (or your vendor’s equivalent) on ports you have portfast enabled on to protect against these ports being accidentally connected to another spanning tree enabled switch [this section may be expanded upon in the future].

4. Aftermath

Everything should be fine, but there would be a lot fewer IT positions if equipment always did what it should. Double check your network is functional (perhaps you have Nagios, Zabbix or an equivalent).

Look at what links (if any) spanning tree has disabled, and compare it to your network diagram. Did you know these redundant links were present?

You can now add intended redundant paths between your switches (e.g. extra cables) for fault tolerance without impacting your network. Spanning tree will automatically disable the redundant link (connecting ports put into a blocking state) when you plug it in, and start using it when the usual link is broken.

Make yourself a cup of tea, smile to yourself and realise that you will have a reduced workload and fewer odd, confusing issues with your network. When you hunger for more, read more about STP, PortFast and BPDU Guard. Perhaps you might channel bond where you have dual links between switches and one is currently disabled by spanning tree. You might consider deploying Netdisco.

Finally don’t forget to drop our team an email to state you’re ready to take part in the IPv6 trials.

Posted in Best Practices, IPv6