management network too deep? move curviceps uplink, walnut outage 2022-05

changed due date to May 03, 2022

added help-wanted majorly-borked risk:high labels

walnut reboot expected: https://lists.ucc.gu.uwa.edu.au/pipermail/tech/2022-May/005525.html

Currently responding:

# nmap -sP 192.168.2.1-255
Starting Nmap 7.70 ( https://nmap.org ) at 2022-05-03 13:19 AWST
Nmap scan report for 192.168.2.1
Host is up.
Nmap scan report for 192.168.2.2
Host is up (0.00055s latency).
MAC Address: 00:1B:D5:9C:D0:3F (Cisco Systems)
Nmap scan report for clipons.sese.uwa.edu.au (192.168.2.4)
Host is up (0.00084s latency).
MAC Address: 28:94:0F:04:93:FF (Cisco Systems)
Nmap scan report for mooneye.mgmt.ucc.asn.au (192.168.2.9)
Host is up (0.038s latency).
MAC Address: 00:11:43:E1:AF:3C (Dell)
Nmap scan report for 192.168.2.20
Host is up (0.00013s latency).
MAC Address: 00:18:E7:C7:C8:6C (Cameo Communications)
Nmap scan report for 192.168.2.21
Host is up (-0.100s latency).
MAC Address: FC:EC:DA:4F:FF:12 (Ubiquiti Networks)
Nmap scan report for molmol.mgmt.ucc.asn.au (192.168.2.33)
Host is up (0.00042s latency).
MAC Address: 00:25:90:CE:37:89 (Super Micro Computer)
Nmap scan report for 192.168.2.34
Host is up (0.00015s latency).
MAC Address: 4C:5E:0C:69:AC:70 (Routerboard.com)
Nmap scan report for 192.168.2.38
Host is up (0.00016s latency).
MAC Address: 2E:78:D2:03:4F:75 (Unknown)
Nmap scan report for mudkip.mgmt.ucc.asn.au (192.168.2.46)
Host is up (-0.100s latency).
MAC Address: 80:C1:6E:77:2E:44 (Hewlett Packard)
Nmap scan report for 192.168.2.47
Host is up (-0.087s latency).
MAC Address: 80:C1:6E:74:EF:BA (Hewlett Packard)
Nmap done: 255 IP addresses (11 hosts up) scanned in 3.86 seconds

@nick That checks out with what I'm seeing.

Here's my current list of devices to investigate. (I've excluded everything that's down that I know shouldn't be down.)

192.168.2.2     lard.ucc.asn.au.            up
192.168.2.4     kerosene.ucc.asn.au.        up
192.168.2.9     mooneye.mgmt.ucc.asn.au.    up
192.168.2.10    curviceps.ucc.asn.au.       down
192.168.2.18    motsugo.mgmt.ucc.asn.au.    down
192.168.2.20    coromandel.ucc.asn.au.      up
192.168.2.21    smallwing.ucc.asn.au.       up
192.168.2.30    medico.mgmt.ucc.asn.au.     down
192.168.2.32    maltair.mgmt.ucc.asn.au.    down
192.168.2.33    molmol.mgmt.ucc.asn.au.     up
192.168.2.34    abe.ucc.asn.au.             up
192.168.2.35    murasoi.mgmt.ucc.asn.au.    down
192.168.2.37    walnut.mgmt.ucc.asn.au.     down
192.168.2.38    salmon.mgmt.ucc.asn.au.     up
192.168.2.46    mudkip.mgmt.ucc.asn.au.     up
192.168.2.47    magikarp.mgmt.ucc.asn.au.   up
192.168.2.52    dell-ph1.mgmt.ucc.asn.au.   down
192.168.2.55    machop.mgmt.ucc.asn.au.     down
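
A list like the one above can be regenerated from a ping scan instead of being maintained by hand. The following is a minimal sketch, assuming the `nmap -sP` output format shown earlier in this thread; the function names `hosts_up` and `annotate` are hypothetical, and the sample data is only a small excerpt of the hosts listed here.

```python
import re

# Matches both "Nmap scan report for 192.168.2.2" and
# "Nmap scan report for name (192.168.2.9)".
REPORT = re.compile(r"^Nmap scan report for (?:\S+ \()?(\d+\.\d+\.\d+\.\d+)\)?$")


def hosts_up(nmap_output):
    """Return the set of IPs that a ping scan reported as up."""
    up = set()
    current = None
    for line in nmap_output.splitlines():
        line = line.strip()
        match = REPORT.match(line)
        if match:
            current = match.group(1)
        elif line.startswith("Host is up") and current:
            up.add(current)
            current = None
    return up


def annotate(inventory, up):
    """Mark each (ip, name) pair in the inventory as 'up' or 'down'."""
    return [(ip, name, "up" if ip in up else "down") for ip, name in inventory]


# Excerpt of scan output in the format pasted above.
sample = """\
Nmap scan report for 192.168.2.2
Host is up (0.00055s latency).
MAC Address: 00:1B:D5:9C:D0:3F (Cisco Systems)
Nmap scan report for mooneye.mgmt.ucc.asn.au (192.168.2.9)
Host is up (0.038s latency).
"""

# Excerpt of the expected-device list.
inventory = [
    ("192.168.2.2", "lard.ucc.asn.au."),
    ("192.168.2.9", "mooneye.mgmt.ucc.asn.au."),
    ("192.168.2.10", "curviceps.ucc.asn.au."),
]

for ip, name, state in annotate(inventory, hosts_up(sample)):
    print(f"{ip:<15} {name:<28}{state}")
```

Hosts that respond but are missing from the inventory (such as 192.168.2.1 in the scan above) could be found the same way, by subtracting the inventory IPs from the set returned by `hosts_up`.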

I have manually plugged into Curviceps and it checks out as OK. The LACP link between it and Walnut seems to have failed due to one of the two links being down, which would explain the failure of things on the management network that are off of it.

I suspect the issue is on Walnut's end and somehow relates to its apparent control plane crash. I think the only way forward is a reboot of Walnut.

Rebooting Walnut fixed its control interface but did not resolve the issue generally.

Ports 23 and 24 on Curviceps refuse to work, even for connectivity to my laptop. I was able to access its dashboard through port 18, though.

Swapping the LAG to use port 17 instead of 23 for the second port has fixed everything now.

Thanks! It does seem to have made a few more hosts responsive:

Nmap scan report for 192.168.2.10
Host is up (0.0039s latency).
MAC Address: 00:1B:3F:10:F1:20 (ProCurve Networking by HP)
Nmap scan report for 192.168.2.18
Host is up (-0.10s latency).
MAC Address: 00:25:90:1F:19:2F (Super Micro Computer)
Nmap scan report for 192.168.2.30
Host is up (0.00043s latency).
MAC Address: 00:25:90:A0:69:C4 (Super Micro Computer)
Nmap scan report for maltair.mgmt.ucc.asn.au (192.168.2.32)
Host is up (0.00033s latency).
MAC Address: 38:EA:A7:A9:41:5C (Hewlett Packard)
Nmap scan report for walnut.mgmt.ucc.asn.au (192.168.2.37)
Host is up (-0.097s latency).
MAC Address: F0:9F:C2:64:53:C0 (Ubiquiti Networks)
Nmap scan report for 192.168.2.53
Host is up (0.00086s latency).
MAC Address: D0:67:E5:EF:85:CB (Dell)
Nmap scan report for machop.mgmt.ucc.asn.au (192.168.2.55)
Host is up (-0.10s latency).
MAC Address: D0:50:99:F3:52:5E (ASRock Incorporation)

Still down: .35. Possibly .52 should be .53, and there may be a missing reverse-DNS entry or two:

35.2.168.192.in-addr.arpa domain name pointer murasoi.mgmt.ucc.asn.au.
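
To hunt for missing reverse-DNS entries, the PTR name for each management address can be derived and looked up in bulk. This is a rough sketch using only the Python standard library; the helper names `ptr_name` and `lookup_ptr` are hypothetical, and `lookup_ptr` uses whatever resolver the local machine is configured with, which is assumed here to be able to see the management zone.

```python
import ipaddress
import socket


def ptr_name(ip):
    """Return the in-addr.arpa name that a PTR record for this IP would use."""
    return ipaddress.ip_address(ip).reverse_pointer


def lookup_ptr(ip):
    """Return the reverse-DNS hostname for ip, or None if no entry resolves."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return None


print(ptr_name("192.168.2.35"))  # 35.2.168.192.in-addr.arpa
```

Running `lookup_ptr` over the addresses that responded to the scan and printing those that return `None` would list the candidates for a missing entry.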

closed

mentioned in issue #36

Moving discussion on management network cleanup to #36.

removed majorly-borked label

Question: Should we move the curviceps uplink to the core switch?

reopened

I think it makes sense to have walnut at the centre of the topology. Changing it would add kerosene as an additional SPoF for anything off curviceps, because most traffic has to first go through murasoi (off walnut) anyway to be routed.
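
The SPoF argument can be checked mechanically by removing one device at a time and testing whether curviceps can still reach murasoi. The sketch below does that with a breadth-first search. The two link lists are simplified assumptions for illustration only (in particular, that kerosene is the core switch and connects directly to walnut), not a record of the real cabling.

```python
from collections import deque


def reachable(links, src, dst, removed=None):
    """Breadth-first search from src to dst, ignoring the removed device."""
    adjacency = {}
    for a, b in links:
        if removed in (a, b):
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def spofs(links, src, dst):
    """Devices whose removal alone cuts src off from dst."""
    devices = {n for link in links for n in link} - {src, dst}
    return sorted(d for d in devices if not reachable(links, src, dst, removed=d))


# Assumed current layout: curviceps uplinks to walnut.
current = [("curviceps", "walnut"), ("walnut", "murasoi"), ("walnut", "kerosene")]
# Assumed proposed layout: curviceps uplinks to kerosene instead.
proposed = [("curviceps", "kerosene"), ("kerosene", "walnut"), ("walnut", "murasoi")]

print(spofs(current, "curviceps", "murasoi"))   # ['walnut']
print(spofs(proposed, "curviceps", "murasoi"))  # ['kerosene', 'walnut']
```

Under these assumptions, moving the uplink adds kerosene to the set of devices that traffic from curviceps depends on, without removing walnut from it.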

@nick Are you happy to close this again after the latest look at the network?

> I think it makes sense to have walnut at the centre of the topology.

You're right - your last wiki edit about walnut helped, and I've updated these a bit as well:

kerosene is still a big physical SPoF for anything coming through VLAN 5, 6, 13 or 42 (though clubroom ports/VLAN 3 are expected!), but walnut doesn't really have spare ports for that sort of thing, unless we had to in a pinch. So they're both currently critical, even for remote access (and curviceps is critical for some consoles).

closed
