UK Outage Friday March 20th - RFO (Reason For Outage) - Fixed/Completed Work

  • Wednesday, 25th March, 2020
  • 11:38am

RFO

Outage Date: Friday, 20 March 2020

DataCenter Location: Saxon House, Cheltenham
 
Description: We experienced intermittent connectivity to our router in Telehouse North (THN). During this time our support portal and our phone system dropped offline, despite being connected via our out-of-band connection. There were two unrelated faults: SSE had fibre damaged near THN at 22:30, which caused intermittent connectivity over the circuit, and Virgin Media had a separate fault at their site in Wootton which dropped a number of circuits.
 
Timeline:

  • 22:35 - Link down detected at Saxon House (SSE)
  • 22:40 - SSE informed, engineers aware
  • 23:12 - SSE confirm intermittent connection between SAX and THN
  • 23:15 - Port marked admin down at SAX to prevent flapping (see the interface sketch after the timeline)
  • 01:30 - SSE confirm engineers working on the issue, next update 03:00
  • 02:47 - Link down detected at Saxon House (VM)
  • 02:47 - MPLS path down detected on our router in Telehouse North.
  • 02:48 - VM informed, engineers aware
  • 03:00 - SSE report fibre damaged at Telehouse end along with other fibre pairs in a pit bundle
  • 03:00 - Engineer dispatched to THN
  • 04:05 - SSE have engineers in the pit splicing pairs, next update 06:00
  • 05:25 - Engineer onsite at THN, both x-connect ports down
  • 05:30 - Port marked admin up at SAX in anticipation of fibre splicing
  • 05:35 - Call with SSE and VM over diversity
  • 06:00 - SSE convinced route is diverse to VM, next update 08:00
  • 06:10 - VM also convinced route is diverse to SSE
  • 06:15 - Detailed circuit maps requested from both SSE and VM
  • 08:38 - MPLS path up detected on our router in Telehouse North.
  • 08:39 - Updates posted on social media
  • 08:50 - Systems administrators investigating why phone system and support portal are offline.
  • 09:02 - MPLS path down detected on our router in Telehouse North.
  • 09:12 - SSE report splicing complete, but circuit still down, next update 11:00
  • 09:15 - MPLS path up detected on our router in Telehouse North.
  • 09:18 - VM report engineers on site splicing fibre
  • 09:20 - Circuit maps investigated, confirmed diverse.
  • 09:22 - Phone system and support portal issue isolated to DNS server (ns1) not responding to queries (see the DNS check sketch after the timeline).
  • 09:32 - Call with VM confirms their issue is at Wootton and not related to the pit near THN
  • 09:50 -  System administrators investigating issues with DNS server (ns1).
  • 10:11 - System administrators identified fault with DNS service on DNS server (ns1).
  • 10:22 - DNS service issue resolved on DNS server (ns1).
  • 10:36 - Testing resolution of DNS queries on DNS server (ns1).
  • 10:42 - Investigating DNS service on other DNS servers (ns0 & ns2).
  • 10:55 - DNS issues resolved and all DNS servers fully operational.
  • 11:14 - System administrators investigating phone system and support portal.
  • 11:24 - Phone system fault traced to failed DNS resolution of the VoIP PBX server; resolved once the DNS service on DNS server (ns1) was fixed.
  • 11:42 - Support portal connection issue identified.
  • 12:12 - MPLS path down detected on our router in Telehouse North.
  • 12:20 - MPLS path up detected on our router in Telehouse North.
  • 12:22 - System engineers prepare to add additional connection to support portal.
  • 12:32 - Provisioning of an additional DNS server begun.
  • 12:41 - MPLS path down detected on our router in Telehouse North.
  • 12:45 - SSE report engineers replacing line card, next update 14:00
  • 13:00 - VM trying to patch on to working circuit to expedite solution
  • 13:01 - MPLS path up detected on our router in Telehouse North.
  • 13:06 - MPLS path down detected on our router in Telehouse North.
  • 14:05 - Phone system confirmed working independent of connectivity at SAX
  • 14:15 - SSE confirm engineers working onsite, resolution ETA 16:30
  • 15:15 - MPLS path up detected on our router in Telehouse North.
  • 15:20 - VM confirm original circuit is back up fault free
  • 15:44 - Link Up detected at SAX (SSE)
  • 15:57 - Link Down detected at SAX (SSE)
  • 16:32 - Link Up detected at SAX (SSE)
  • 16:40 - SSE confirm circuit back up and fault free  
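
For illustration, the 23:15 and 05:30 entries refer to administratively shutting and later re-enabling the flapping port at SAX. The sketch below shows that kind of change in outline only (not our exact procedure), assuming an IOS-style router managed with the Netmiko library; the hostname, credentials and interface name are placeholders, and the actual router platform is not stated in this RFO.

```python
# Illustrative sketch only: administratively shut (or re-enable) the SAX-side
# port to stop a flapping link. Assumes an IOS-style router managed via the
# Netmiko library; hostname, credentials and interface name are placeholders.
from netmiko import ConnectHandler

SAX_ROUTER = {
    "device_type": "cisco_ios",        # assumption: the platform is not stated in the RFO
    "host": "sax-router.example.net",  # placeholder hostname
    "username": "netops",
    "password": "********",
}

def set_port_admin_state(interface: str, enable: bool) -> str:
    """Shut or no-shut the given interface and return the device output."""
    commands = [f"interface {interface}", "no shutdown" if enable else "shutdown"]
    with ConnectHandler(**SAX_ROUTER) as conn:
        output = conn.send_config_set(commands)
        conn.save_config()
    return output

# 23:15 - take the flapping port down; 05:30 - bring it back up ahead of splicing:
# set_port_admin_state("GigabitEthernet0/1", enable=False)
# set_port_admin_state("GigabitEthernet0/1", enable=True)
```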
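
The 09:22-10:55 entries cover isolating ns1 as not responding to queries and then re-testing resolution. A minimal sketch of that kind of check, assuming the dnspython package; the server address and test record below are placeholders rather than our real values.

```python
# Minimal sketch of the kind of check used to isolate ns1: query one name
# server directly and report whether it answers. Assumes the dnspython
# package; the server IP and test record below are placeholders.
import dns.exception
import dns.resolver

def nameserver_answers(server_ip: str, name: str = "example.com", rtype: str = "A") -> bool:
    """Return True if the given name server answers the query within 3 seconds."""
    resolver = dns.resolver.Resolver(configure=False)  # ignore the local resolver config
    resolver.nameservers = [server_ip]
    try:
        resolver.resolve(name, rtype, lifetime=3)
        return True
    except dns.exception.DNSException:
        # Timeouts, refused or failed responses all count as "not answering" here.
        return False

# e.g. ns1, the out-of-band server whose DNS service had crashed:
# print(nameserver_answers("192.0.2.53"))
```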

Future Plans:

We previously had three authoritative name servers: two hosted on our network and one on our out-of-band connection. The two on our network were unreachable but functional, while the out-of-band server was online but its DNS service had crashed. We have set up a fourth authoritative name server, hosted off-network with a third party, and added it to our DNS cluster to minimise the impact of any DNS service errors in future. This DNS issue is why our support portal and VoIP phone system became unreachable.
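
As a rough illustration of how the enlarged cluster could be watched going forward, the sketch below queries each authoritative server directly and flags any that stops answering. It assumes the dnspython package; the labels and addresses (including "ns3" as an assumed name for the new off-network server) are placeholders, not our actual records.

```python
# Hypothetical monitoring sketch: query every authoritative server in the
# cluster, including the new off-network one, and flag any that fails to
# answer. Assumes dnspython; labels, addresses and the test record are placeholders.
import dns.exception
import dns.resolver

AUTHORITATIVE_SERVERS = {
    "ns0": "192.0.2.10",     # on-network (placeholder address)
    "ns1": "198.51.100.10",  # out-of-band (placeholder address)
    "ns2": "192.0.2.11",     # on-network (placeholder address)
    "ns3": "203.0.113.10",   # assumed label for the new off-network server
}

def check_cluster(name: str = "example.com") -> dict:
    """Return a map of server label -> whether it answered an A query."""
    status = {}
    for label, ip in AUTHORITATIVE_SERVERS.items():
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [ip]
        try:
            resolver.resolve(name, "A", lifetime=3)
            status[label] = True
        except dns.exception.DNSException:
            status[label] = False
    return status

# for server, ok in check_cluster().items():
#     print(f"{server}: {'OK' if ok else 'NOT RESPONDING'}")
```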


We are also arranging a 10G connection from SAX to Manchester to provide a path to global transit that is not dependent on THN.
