Troubleshooting Neutron in Full HA setup
I was wondering if there's an available resource that could help us debug network issues in our OpenStack installation.
In particular, we have installed OpenStack using http://
Internal networks seem to work fine, but we are struggling to get external network connectivity. In particular, when we trace packets through iptables, they disappear into a black hole after the nat table (PREROUTING), and a "qr-12b0d136-7d: hw csum failure." message with a full stack trace appears in dmesg.
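For reference, this is roughly how we've been tracing the packets (a sketch; the router namespace UUID is a placeholder, and the interface name is from our environment):

```shell
# Find the Neutron router namespace on the network node:
ip netns list

# Watch ICMP on the router's internal interface:
ip netns exec qrouter-<UUID> tcpdump -ni qr-12b0d136-7d icmp

# Log every netfilter hop for ICMP via the raw table's TRACE target:
ip netns exec qrouter-<UUID> iptables -t raw -I PREROUTING -p icmp -j TRACE
tail -f /var/log/syslog | grep TRACE
```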
I'm reaching out because we seem to have hit a dead end and would very much appreciate a fresh pair of eyes on it.
Is there anyone who would be available for a hands-on troubleshooting session?
Any help would be greatly appreciated.
Question information
- Language: English
- Status: Solved
- Assignee: Daneyon Hansen
- Solved by: Daneyon Hansen
- Whiteboard: Provided initial feedback after reviewing the deployment. Trying to implement HA, but using Neutron L3 agents, which are not supported. Also using bonded interfaces, which have not been tested.
#1
Jon,
Here is some initial feedback after a first review of your environment:
1. You are using bonded interfaces, so you may be running into unknown factors, since bonding is not included in the HA reference architecture. You may want to get everything working on physical interfaces first; once everything is working, then try moving the configuration to bonded interfaces. For example, I see that Galera has unspecified incoming addresses due to the bonded interfaces: wsrep_incoming_
2. You are creating Quantum routers, which are not supported in the HA deployment. The HA architecture uses Neutron's provider networking extensions and relies on a physical upstream router for L3 and first-hop redundancy (e.g., HSRP). This is because Neutron supports multiple L3 agents for scalability, but the agent is still a single point of failure.
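As an illustration, the provider-network workflow looks roughly like this (a sketch; the network name, physical network label, VLAN ID, and addresses are examples and must match your plugin configuration and upstream router):

```shell
# Create a VLAN provider network mapped to a VLAN on the upstream router:
neutron net-create vlan223 --shared \
  --provider:network_type vlan \
  --provider:physical_network physnet1 \
  --provider:segmentation_id 223

# The subnet's gateway is the physical L3 gateway (e.g. an HSRP VIP),
# not a Neutron router:
neutron subnet-create vlan223 192.168.223.0/24 --name vlan223-subnet \
  --gateway 192.168.223.1 \
  --allocation-pool start=192.168.223.10,end=192.168.223.250
```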
#2
What does your bonding setup look like? Are you using the kernel bonding driver, or setting up the bond via OVS?
I saw some weirdness when I was bonding with the kernel driver and then attaching that bond as a port to my br-ex. With the latest COI, the newer OVS package allows bonding from within OVS itself. I haven't seen those issues when setting up the bond with OVS, but I'm not certain whether it was the newer OVS package or doing the bond via OVS itself that resolved the issue. Also, in addition to wsrep_incoming_
All that to say: my suggestion is to try bonding from within OVS. But like Daneyon said, get it working without bonding first. =)
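Concretely, the OVS-side bond can be sketched like this (interface and bridge names are examples; this assumes an OVS build with bonding support and a switch configured for LACP):

```shell
# Let OVS own the physical NICs instead of the kernel bonding driver:
ovs-vsctl add-bond br-ex bond0 eth2 eth3 \
  bond_mode=balance-slb lacp=active

# Verify bond membership and LACP negotiation:
ovs-appctl bond/show bond0
ovs-appctl lacp/show bond0
```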
#3
Thank you for your replies. I have now reconfigured without bonding (complete reinstall), but I'm getting the exact same result in terms of networking for the virtual machines. The only real difference is that Galera/wsrep is now able to detect the incoming IP address.
* Black-hole behavior is still present
* The router interface on the external network has status DOWN, while its admin state is UP
* The hw csum failure is still occurring
[ 5065.723256] tapa8858e62-8e: hw csum failure.
[ 5065.725424] Pid: 0, comm: swapper/24 Tainted: G O 3.2.0-60-generic #91-Ubuntu
[ 5065.725428] Call Trace:
[ 5065.725431] <IRQ> [call trace truncated in the original paste]
[ 5132.087205] qr-fc5f0ab2-a0: hw csum failure.
[ 5132.089394] Pid: 0, comm: swapper/24 Tainted: G O 3.2.0-60-generic #91-Ubuntu
[ 5132.089398] Call Trace:
[ 5132.089401] <IRQ> [call trace truncated in the original paste]
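One additional mitigation we are aware of for hw csum failures is disabling checksum offload on the affected NIC (a common workaround for offload-related csum errors, not a guaranteed fix for the root cause; the interface name is an example):

```shell
# Inspect the current offload settings on the physical NIC:
ethtool -k eth2 | grep checksum

# Disable rx/tx checksum offload, which often silences "hw csum failure":
ethtool -K eth2 rx off tx off
```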
#4
Quite possibly not a complete solution, but I see you're on the 3.2.0 kernel series. Can you update to linux-generic-
#5
I've attempted to upgrade the kernel to linux-generic-
The hw csum failure seems to be gone from the logs, but the network is still not working as expected.
Setting up an internal + external network with a router in between still yields:
* The router interface on the external network has status DOWN, while its admin state is UP
* A host on the internal network can ping both the gateway and the external-net IP, but nothing outside
* An assigned floating IP replies from within the virtual machine, but not on the external network
* VMs put directly on the external network get an IP which works fine
* Neither internal- nor external-network-connected VMs can reach the metadata server 169.254.169.254
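For reference, these are the checks we've been using on the metadata path (a sketch; namespace UUIDs are placeholders, and which namespace hosts the proxy depends on configuration):

```shell
# From inside a VM: the metadata service should answer on the link-local address.
curl -s http://169.254.169.254/latest/meta-data/instance-id

# On the network node: confirm a metadata proxy is listening in the
# router namespace (port 9697 by default) or in the DHCP namespace:
ip netns exec qrouter-<UUID> netstat -lntp
ip netns exec qdhcp-<UUID> netstat -lntp
```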
#6
You definitely need the external router interface to show up/up. This interface should be logically segmented to use the same VLANs that your Neutron OVS plugin.ini uses. The VLANs need to be trunked between the physical L3 gateway and the control/compute nodes.
VLANs defined on TOR L2 switch and L3 GW:
vlan 220
name pod1_mgt
vlan 221
name pod1_public_api
vlan 222
name pod1_swift_storage
vlan 223
name pod1-qtm--net1
vlan 224
name pod1-qtm--net2
vlan 225
name pod1-qtm--net3
Interfaces on TOR Switch and L3 GW (this should be on every interface within the path):
interface <NAME/NUMBER>
switchport mode trunk
switchport trunk native vlan 220
switchport trunk allowed vlan 220-230
spanning-tree port type edge
L3 GW (Note: no GW for Swift storage network):
interface Vlan220
ip address 192.168.220.1/24
no shutdown
interface Vlan223
ip address 192.168.223.1/24
description ### Daneyon- Neutron Provider VLAN Deployment ###
no shutdown
interface Vlan224
ip address 192.168.224.1/24
description ### Daneyon- Neutron Provider VLAN Deployment ###
no shutdown
interface Vlan225
ip address 192.168.225.1/24
description ### Daneyon- Neutron Provider VLAN Deployment ###
no shutdown
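To verify the trunking above from the switch side, show commands along these lines can be used on the TOR switch and L3 GW (NX-OS/IOS syntax; exact output varies by platform):

```
show vlan brief
show interface trunk
show spanning-tree vlan 223
```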
#7
As I mentioned in my initial feedback, there is no concept of internal/external networks and floating IPs with Neutron provider networks. Until Neutron L3 agent HA becomes available (possibly early in the J release), you cannot deploy the L3 agent within the HA architecture. Therefore you cannot configure Neutron routers or attach internal/external networks to them. Instances are spawned and obtain an IP address over the provider network from the DHCP agent(s) running on the control node(s). The instances then communicate directly over the physical provider network.
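Put differently, the supported workflow reduces to booting instances directly on the provider network (a sketch; the flavor, image, and network UUID are placeholders):

```shell
# Find the provider network's UUID, then boot directly on it:
neutron net-list
nova boot --flavor m1.small --image cirros \
  --nic net-id=<PROVIDER_NET_UUID> test-vm

# The instance gets its address from the DHCP agent on the control node
# and talks straight to the physical VLAN; no Neutron routers or
# floating IPs are involved.
```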
#8
Jon,
If your problem is solved, can you please provide a quick synopsis and select Problem Solved? If not, please provide an update on the problem.
#9
The problem is not solved. As per your comments, we decided to set up a Virtual Device Context (VDC) on a Nexus 7000 to provide HA for L3. We are now attempting to get this running, but have so far been unsuccessful.
We're utilizing UCS blades connected to a Fabric Interconnect, which connects to a Nexus 5000 and then up to the Nexus 7000. VLANs have been manually set up through a trunk on a PortChannel link between the network devices.
Plugin has been configured as per: https:/
#/etc/neutron/
[cisco_plugins]
nexus_plugin=
vswitch_
[NEXUS_
compute01=
compute02=
compute03=
compute04=
compute05=
username=XX
password=YY
However, we're struggling to get even basic functionality going (e.g., unable to list instances in Horizon when the plugin is enabled).
#10
Update:
Nexus integration seems to work; however:
* Adding a router interface to an internal network in Horizon reports success, but it doesn't show up in the list
* Adding the same interface again throws an error saying the router already has an interface
* Setting up the external interface also reports success, but the interface is DOWN
* The IP listed in Horizon is not assigned to any interface on the Nexus
* Looking at the VDC on the Nexus 7000, no new VRF was added; however, Neutron seems to have modified the existing management VRF, adding a duplicate line for VLANs
#11
Jon,
You cannot add a router interface in Horizon because you cannot use Neutron routers in the HA reference architecture. I suggest spending time familiarizing yourself with VLAN provider networks:
http://
I don't believe Horizon supports VLAN provider networking, especially with the Nexus plugin. I believe Abishek Subramanian (<email address hidden>) is involved in the Horizon work; I suggest contacting him for the latest status on Horizon support. I will be closing this support request.