management network too deep? move curviceps uplink, walnut outage 2022-05
[MPT][NTU][333] tried to configure a network console port for dell-ph1.mgmt.ucc.gu.uwa.edu.au
on 2022-05-02 and discovered that a large chunk of the management network was not responding.
ACTION: move curviceps uplink to kerosene, the core switch?
Related? walnut, the Ubiquiti EdgeSwitch 16 XG in particular:
[333]: Anyone got any ideas for recovering walnut, short of resetting it? Its management is unresponsive on both the network, and the serial console as well (it froze last night as soon as I tried to run show running-config interface 0/11). I did try ^Q in case a loose connection had inadvertently been interpreted as a ^S XOFF, but no bueno. The only remaining thing I can think of is seeing if any of the hosts hooked up to it have the management VLAN trunked, and if so see if I can get to it from there. Otherwise resetting it will probably make Ceph very sad
[TPG]: Hmmm... rancid thinks that walnut has been down since about 5AM yesterday
Activity
- Nick Bannon changed due date to May 03, 2022
- Nick Bannon added help-wanted majorly-borked risk:high labels
- Author Owner
walnut reboot expected: https://lists.ucc.gu.uwa.edu.au/pipermail/tech/2022-May/005525.html (thanks @jimbo @dylanh333)
Edited by Nick Bannon
- Author Owner
Currently responding:
# nmap -sP 192.168.2.1-255
Starting Nmap 7.70 ( https://nmap.org ) at 2022-05-03 13:19 AWST
Nmap scan report for 192.168.2.1
Host is up.
Nmap scan report for 192.168.2.2
Host is up (0.00055s latency).
MAC Address: 00:1B:D5:9C:D0:3F (Cisco Systems)
Nmap scan report for clipons.sese.uwa.edu.au (192.168.2.4)
Host is up (0.00084s latency).
MAC Address: 28:94:0F:04:93:FF (Cisco Systems)
Nmap scan report for mooneye.mgmt.ucc.asn.au (192.168.2.9)
Host is up (0.038s latency).
MAC Address: 00:11:43:E1:AF:3C (Dell)
Nmap scan report for 192.168.2.20
Host is up (0.00013s latency).
MAC Address: 00:18:E7:C7:C8:6C (Cameo Communications)
Nmap scan report for 192.168.2.21
Host is up (-0.100s latency).
MAC Address: FC:EC:DA:4F:FF:12 (Ubiquiti Networks)
Nmap scan report for molmol.mgmt.ucc.asn.au (192.168.2.33)
Host is up (0.00042s latency).
MAC Address: 00:25:90:CE:37:89 (Super Micro Computer)
Nmap scan report for 192.168.2.34
Host is up (0.00015s latency).
MAC Address: 4C:5E:0C:69:AC:70 (Routerboard.com)
Nmap scan report for 192.168.2.38
Host is up (0.00016s latency).
MAC Address: 2E:78:D2:03:4F:75 (Unknown)
Nmap scan report for mudkip.mgmt.ucc.asn.au (192.168.2.46)
Host is up (-0.100s latency).
MAC Address: 80:C1:6E:77:2E:44 (Hewlett Packard)
Nmap scan report for 192.168.2.47
Host is up (-0.087s latency).
MAC Address: 80:C1:6E:74:EF:BA (Hewlett Packard)
Nmap done: 255 IP addresses (11 hosts up) scanned in 3.86 seconds
- Owner
@nick That checks out with what I'm seeing.
Here's my current list of devices to investigate. (I've excluded everything that's down that I know shouldn't be down.)
192.168.2.2   lard.ucc.asn.au.            up
192.168.2.4   kerosene.ucc.asn.au.        up
192.168.2.9   mooneye.mgmt.ucc.asn.au.    up
192.168.2.10  curviceps.ucc.asn.au.       down
192.168.2.18  motsugo.mgmt.ucc.asn.au.    down
192.168.2.20  coromandel.ucc.asn.au.      up
192.168.2.21  smallwing.ucc.asn.au.       up
192.168.2.30  medico.mgmt.ucc.asn.au.     down
192.168.2.32  maltair.mgmt.ucc.asn.au.    down
192.168.2.33  molmol.mgmt.ucc.asn.au.     up
192.168.2.34  abe.ucc.asn.au.             up
192.168.2.35  murasoi.mgmt.ucc.asn.au.    down
192.168.2.37  walnut.mgmt.ucc.asn.au.     down
192.168.2.38  salmon.mgmt.ucc.asn.au.     up
192.168.2.46  mudkip.mgmt.ucc.asn.au.     up
192.168.2.47  magikarp.mgmt.ucc.asn.au.   up
192.168.2.52  dell-ph1.mgmt.ucc.asn.au.   down
192.168.2.55  machop.mgmt.ucc.asn.au.     down
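A comparison like the one above (scan results against the list of devices that should be up) could be scripted roughly as follows. This is only a sketch: the function names hosts_up and compare are made up for illustration, the inventory is a three-entry excerpt of the list above, and it assumes the ping scan only reports hosts that answered.

```python
import re

# Matches both "Nmap scan report for 192.168.2.1" and
# "Nmap scan report for name.example (192.168.2.9)", and only counts
# entries that are followed by "Host is up".
REPORT_RE = re.compile(
    r"Nmap scan report for (?:\S+ \()?(\d+(?:\.\d+){3})\)?\s+Host is up"
)


def hosts_up(nmap_output: str) -> set[str]:
    """Return the IPv4 addresses that the ping scan reported as up."""
    return set(REPORT_RE.findall(nmap_output))


def compare(inventory: dict[str, str], nmap_output: str) -> dict[str, str]:
    """Map each inventory hostname to 'up' or 'down' based on the scan."""
    up = hosts_up(nmap_output)
    return {name: ("up" if ip in up else "down") for ip, name in inventory.items()}


# Excerpt of the device list and of the scan output quoted in this issue.
inventory = {
    "192.168.2.9": "mooneye.mgmt.ucc.asn.au",
    "192.168.2.10": "curviceps.ucc.asn.au",
    "192.168.2.37": "walnut.mgmt.ucc.asn.au",
}
scan = """\
Nmap scan report for 192.168.2.2
Host is up (0.00055s latency).
Nmap scan report for mooneye.mgmt.ucc.asn.au (192.168.2.9)
Host is up (0.038s latency).
"""

for name, state in compare(inventory, scan).items():
    print(f"{name} {state}")
```

Keying the inventory by IP rather than by name avoids depending on reverse DNS, which the scan output shows is missing for several addresses.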
- Owner
I have manually plugged into Curviceps and it checks out as OK. The LACP link between it and Walnut seems to have failed due to one of the two links being down, which would explain the failure of things on the management network that are off of it.
I suspect the issue is on Walnut's end and somehow relates to its apparent control plane crash. I think the only way forward is a reboot of Walnut.
- Owner
Rebooting Walnut fixed its control interface but did not resolve the issue generally.
- Owner
Ports 23 and 24 on Curviceps refuse to work, even for connectivity to my laptop. I was able to access its dashboard through port 18 though.
Swapping the LAG to use port 17 instead of 23 for the second port has fixed everything now.
- Author Owner
Thanks! It does seem to have made a few more hosts responsive:
Nmap scan report for 192.168.2.10
Host is up (0.0039s latency).
MAC Address: 00:1B:3F:10:F1:20 (ProCurve Networking by HP)
Nmap scan report for 192.168.2.18
Host is up (-0.10s latency).
MAC Address: 00:25:90:1F:19:2F (Super Micro Computer)
Nmap scan report for 192.168.2.30
Host is up (0.00043s latency).
MAC Address: 00:25:90:A0:69:C4 (Super Micro Computer)
Nmap scan report for maltair.mgmt.ucc.asn.au (192.168.2.32)
Host is up (0.00033s latency).
MAC Address: 38:EA:A7:A9:41:5C (Hewlett Packard)
Nmap scan report for walnut.mgmt.ucc.asn.au (192.168.2.37)
Host is up (-0.097s latency).
MAC Address: F0:9F:C2:64:53:C0 (Ubiquiti Networks)
Nmap scan report for 192.168.2.53
Host is up (0.00086s latency).
MAC Address: D0:67:E5:EF:85:CB (Dell)
Nmap scan report for machop.mgmt.ucc.asn.au (192.168.2.55)
Host is up (-0.10s latency).
MAC Address: D0:50:99:F3:52:5E (ASRock Incorporation)
Still down: .35, possibly .52 should be .53, maybe a missing reverse-DNS entry or two:
35.2.168.192.in-addr.arpa domain name pointer murasoi.mgmt.ucc.asn.au.
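To chase the "missing reverse-DNS entry or two", something like the following could list which addresses have no PTR record. It is a sketch only: missing_reverse_dns is a hypothetical helper, and it assumes it runs on a machine whose resolver can see the mgmt zone. The lookup parameter exists so the function can be exercised without touching the network.

```python
import socket
from typing import Callable, Iterable


def missing_reverse_dns(
    ips: Iterable[str],
    lookup: Callable[[str], tuple] = socket.gethostbyaddr,
) -> list[str]:
    """Return the addresses for which the reverse lookup fails."""
    missing = []
    for ip in ips:
        try:
            lookup(ip)
        except OSError:
            # socket.herror and socket.gaierror are both OSError subclasses.
            missing.append(ip)
    return missing
```

For example, missing_reverse_dns(["192.168.2.35", "192.168.2.52", "192.168.2.53"]) would return whichever of those addresses the resolver cannot name.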
- James Arcus closed
- James Arcus mentioned in issue #36
- Owner
Moving discussion on management network cleanup to #36.
- Nick Bannon removed majorly-borked label
- Author Owner
Question: Should we move the curviceps uplink to the core switch?
- Nick Bannon reopened
- Owner
I think it makes sense to have walnut at the centre of the topology. Changing it would add kerosene as an additional SPoF for anything off curviceps, because most traffic has to first go through murasoi (off walnut) anyway to be routed.
Edited by James Arcus
- Owner
@nick Are you happy to close this again after the latest look at the network?
- Author Owner
I think it makes sense to have walnut at the centre of the topology.
You're right - your last wiki edit about walnut had helped and I've updated these a bit as well: kerosene is still a big physical SPoF for anything coming through VLAN 5, 6, 13 or 42 (though clubroom ports/VLAN 3 are expected!), but walnut doesn't really have spare ports for that sort of thing, unless we had to in a pinch. So - they're both currently critical even for remote access. (and curviceps is for some consoles)
- Nick Bannon closed
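For the record, the SPoF argument in this thread can be checked with a small reachability sketch. The link lists below are a simplification, and the direct kerosene-walnut link is an ASSUMPTION: the thread does not spell out how kerosene connects to walnut. The helper names reachable and spofs are made up for illustration.

```python
from collections import deque


def reachable(links, src, dst, removed=None):
    """Breadth-first search over undirected links, ignoring the `removed` node."""
    if removed in (src, dst):
        return False
    adj = {}
    for a, b in links:
        if removed in (a, b):
            continue
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def spofs(links, src, dst):
    """Intermediate nodes whose failure cuts src off from dst."""
    nodes = {n for link in links for n in link} - {src, dst}
    return sorted(n for n in nodes if not reachable(links, src, dst, removed=n))


# ASSUMPTION: kerosene has a direct link to walnut in both layouts.
current = [("curviceps", "walnut"), ("kerosene", "walnut"), ("walnut", "murasoi")]
proposed = [("curviceps", "kerosene"), ("kerosene", "walnut"), ("walnut", "murasoi")]

print(spofs(current, "curviceps", "murasoi"))   # ['walnut']
print(spofs(proposed, "curviceps", "murasoi"))  # ['kerosene', 'walnut']
```

Under that assumption, moving the curviceps uplink to kerosene adds kerosene to the set of devices that must be up for anything off curviceps to reach murasoi, which matches the reasoning given above.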