Peculiar Timing Issue With PXE In Nexus Environment
I have seen some peculiar behavior doing PXE builds. I wanted to pick the brains of some experienced network engineers, as my research on the internet shows there are a ton of contradictory opinions / philosophies.
Here is the setup:
1. HP SL4540 servers with Broadcom copper/RJ-45 10GbE cards with PXE enabled.
2. Connected to Nexus 2232 10G copper/RJ-45 edge switches going back to a Nexus 5596 aggregation layer.
3. IP helper services are enabled on the 2232 so that DHCP and PXE requests are forwarded to a specific IP address.
4. Both the switch port and the NIC port are set to auto-negotiate.
With the solution fully configured, the servers either fail to find the DHCP / PXE server, or make initial contact but never complete the handshake process. The peculiar thing is that we will sometimes see this late at night, break for sleep, and wake up in the morning to find the servers fine and ready to load an OS. There is some type of timing phenomenon going on in the background, and given enough time the servers eventually find this timing window and move forward.
I’ve read dozens of articles now and I have found some experts pointing to potential causes but then others refuting the same ones. Here are the preliminary ones I found.
1. With the NIC and switch port set to auto-negotiate, the negotiation process is taking longer than the PXE request cycle, which seems to be pretty short (10 to 15 seconds). I have read tons of conflicting info. Some of it (Citrix, Altiris, VMware) says that setting full duplex will circumvent the negotiation process if it is indeed taking longer than the PXE request process. Other docs say Gigabit Ethernet requires auto-negotiate and won't work without it. What are your thoughts on this? Is auto-negotiate absolutely required for 1G/10G, and if not, can setting full duplex potentially help the NIC sync with the switch quicker? We do have tight control of server and switch, so we can ensure both are set to whatever we need.
2. I found a bunch of articles where PXE fails when certain services are active on the switch. The articles showed that these services can cause a long negotiation process to get link, or even allow link but hold packets for a bit as they bring in the new connections. The ones in particular they mentioned were:
   a. Spanning Tree Protocol
   b. EtherChannel
   c. Port Aggregation Protocol
   d. Disabled PortFast service
   What are your thoughts on this? Are there other services that could have a similar effect?
3. Outside these theories, are there any other reasons one could see why this would happen? The fact that it eventually “fixes itself” bugs me, so I want to find a definitive answer.
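To make theory 2 concrete, here is a rough sketch of the timing arithmetic in Python. It only illustrates the argument and is not a measurement: the 15-second forward delay is the classic 802.1D default (listening, then learning), the 3-second link negotiation time is a made-up placeholder, and the 15-second PXE window is the upper end of my own estimate above. The function names are hypothetical.

```python
# Rough timing model: does a switch port start forwarding before the PXE client gives up?
# All numbers are assumptions for illustration, not values read from our switches.

STP_FORWARD_DELAY_S = 15  # classic 802.1D default forward delay, applied twice (listening + learning)


def seconds_until_forwarding(link_negotiation_s, portfast_enabled):
    """Estimate the time from the start of link negotiation until the port forwards traffic."""
    # With PortFast (an edge port) the port skips the listening and learning states.
    stp_delay_s = 0 if portfast_enabled else 2 * STP_FORWARD_DELAY_S
    return link_negotiation_s + stp_delay_s


def pxe_would_time_out(link_negotiation_s, portfast_enabled, pxe_window_s):
    """Return True if the port is still not forwarding when the PXE window closes."""
    return seconds_until_forwarding(link_negotiation_s, portfast_enabled) > pxe_window_s


if __name__ == "__main__":
    for portfast in (False, True):
        delay = seconds_until_forwarding(3, portfast)
        timed_out = pxe_would_time_out(3, portfast, pxe_window_s=15)
        print(f"portfast={portfast}: forwarding after {delay}s, PXE times out: {timed_out}")
```

With these placeholder numbers, a port without PortFast would not forward until about 33 seconds after link negotiation starts, well past a 15-second PXE window, while a PortFast port would make it in time.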
Also, from the switch side, what troubleshooting steps can we take to prove or disprove these theories? We can pull the switch logs; are there any other things to look for or monitor? What would we expect to see in the switch log if one of these theories were correct? Would loading Wireshark and examining packet dumps be of any use here? Any insight would be greatly appreciated. Thanks Cisco community. :-)