Porting an io-net driver to io-pkt

This document is intended to help developers port an existing io-net network driver to a “native” io-pkt driver, in order to obtain better throughput using fewer CPU resources. To follow it, you need at least some familiarity with io-net network drivers.

Terminology

io-net
The framework and infrastructure for network drivers and protocols used with QNX Neutrino prior to 6.4.
io-pkt
The framework and infrastructure for network drivers and protocols developed for QNX Neutrino 6.4 to improve performance and make it easier to port drivers and protocols from current BSD source.

Shim and io-net drivers

It's worth first mentioning that you don't have to port an io-net driver to io-pkt! Any existing io-net driver binary should function “as-is” under io-pkt, using the “shim” driver (devnp-shim.so) that io-pkt automatically loads whenever you mount an io-net driver.

The shim driver performs a binary emulation of the io-net infrastructure, so the existing io-net driver isn't even aware that it's running under io-pkt. From io-pkt's perspective, the shim driver looks just like any other wired ethernet driver.

We did find and fix a few oddball bugs in some io-net drivers when we tested them under the shim — io-net tolerated them, but the shim emulation didn't.

The shim is responsible for translating the npkt packet buffers of io-net into the mbuf buffers of io-pkt, in both directions — transmit and receive. There actually isn't much overhead associated with this translation.

The big difference when using the shim is that a context switch is forced to occur during packet reception. This overhead doesn't occur in a native io-pkt driver, and is perhaps the primary motivator for porting a driver from io-net to io-pkt.

It's worth mentioning that with:

  1. a powerful CPU,
  2. slower data rates (e.g. 10/100 Mbit/sec), and
  3. large packets,

you can be hard-pressed to measure the performance difference between an io-net driver with the shim and a native io-pkt driver.

A performance test was conducted at QSS on x86 boxes (see #1 above) comparing the io-net devn-pcnet.so 100 Mbit driver (see #2 above) under the shim with a native PCNET driver (actually a ported BSD driver). Identical throughput of around 90 Mbits/sec was measured, and on the powerful x86 desktop PCs it was difficult to measure any reduction in CPU consumption with the io-pkt driver, even though it didn't incur a thread switch during receive. Keep in mind that performance tests often use maximum-sized packets (see #3 above), which amortize the cost of the thread switch.

The benefits of converting to io-pkt are going to be most evident when:

  1. you're running on low-power processors, where minimal CPU consumption is critical to let customer applications execute in a timely manner,
  2. maximum link speeds are used (e.g. gigabit ethernet), and
  3. minimal-sized packets are used.

Differences between an io-net driver and an io-pkt driver

The good news is that io-net and io-pkt drivers are actually pretty similar. There are a few fundamental differences, which we'll discuss below.

One major difference between io-net and io-pkt is that in io-net, an npkt is used as the fundamental packet buffer, while in io-pkt, an mbuf is used instead. For information about some handy mbuf macros that are defined in <sys/mbuf.h>, see the NetBSD documentation at http://netbsd.gw.com/cgi-bin/man-cgi?mbuf++NetBSD-current.
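
For instance, a driver typically walks an mbuf chain via the m_next pointers to find each data fragment. Here's a minimal sketch (count_fragments() is our own illustrative name; mtod(), m_next, and m_len are standard mbuf API):

    #include <sys/mbuf.h>

    /* Walk an mbuf chain, visiting each data fragment; a transmit
     * routine does much the same thing to load DMA descriptors. */
    static int
    count_fragments(struct mbuf *m)
    {
        struct mbuf *m2;
        int         frags = 0;

        for (m2 = m; m2 != NULL; m2 = m2->m_next) {
            if (m2->m_len == 0)
                continue;       /* skip empty fragments */

            /* mtod() casts this fragment's data pointer to a type */
            unsigned char *data = mtod(m2, unsigned char *);
            (void)data;         /* a real driver would DMA from here */
            frags++;
        }

        return frags;
    }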

Any network driver can be divided up into the following functional areas: initialization, transmit, receive, link state change handling, control (devctl() and ioctl()), shutdown, threading, and mutexing. Each is discussed in its own section below.

It can be helpful to look at the source (see the Networking project on our Foundry 27 website, http://community.qnx.com) of io-net and io-pkt drivers for the same hardware, and contrast the differences. For example, for PCI, you could compare speedo or i82544, and for non-PCI, you could compare mpc85xx.

Initialization

If we look at trunk/sys/dev_qnx/speedo/speedo.c, we can see a function speedo_entry(), which io-pkt calls when the DLL devnp-speedo.so is first loaded. It in turn simply calls speedo_detect(), which is located in trunk/sys/dev_qnx/speedo/detect.c.

The speedo_detect() function does the PCI bus scanning, and for each matching VID/DID, it calls back up into io-pkt via the dev_attach() function. The io-pkt manager then calls back down into the driver's speedo_attach() function, which allocates driver resources.
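
In outline, the entry/detect pattern looks something like the following sketch; my_entry(), my_ca (the driver's cfattach structure), and the bus-scanning details are hypothetical placeholders:

    int
    my_entry(void *dll_hdl, struct _iopkt_self *iopkt, char *options)
    {
        struct device   *dev;
        int             single = 0;
        int             instance = 0;

        for (;;) {
            /* Scan the PCI bus for the next matching VID/DID... */

            /* Found one: call back up into io-pkt, which in turn
             * calls down into my_attach() via the my_ca table. */
            dev = NULL;
            if (dev_attach("my", options, &my_ca, NULL,
                &single, &dev, NULL) != EOK) {
                break;
            }
            instance++;

            if (single)
                break;
        }

        return (instance > 0) ? EOK : ENODEV;
    }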

From an io-pkt driver's perspective, initialization is quite different from io-net's: it takes place in two phases, *_attach() and then *_init(). In the example above, speedo_attach() is called immediately, and then after an indeterminate period of time — e.g. when ifconfig assigns an IP address, or when Qnet is started — speedo_init() is called.

And here's the curve ball: speedo_init() may be called over and over and over again. For example, if you run the ifconfig utility twice, with two different IP addresses, speedo_init() will be called twice!


Note: When you're writing an io-pkt driver, you must be very aware that your driver's *_init() function will be called over and over again, so ideally it shouldn't allocate any resources — if it must, it has to take care not to allocate them again (or leak them) the second and third times through.
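
One defensive pattern is to guard every one-time allocation, as in this sketch (my_dev_t, rx_ring, and NUM_RX_DESC are hypothetical names):

    static int
    my_init(struct ifnet *ifp)
    {
        my_dev_t    *dev = ifp->if_softc;

        /* io-pkt may call this function again and again, so
         * allocate the rx ring only on the first pass through. */
        if (dev->rx_ring == NULL) {
            dev->rx_ring = calloc(NUM_RX_DESC, sizeof(*dev->rx_ring));
            if (dev->rx_ring == NULL)
                return ENOMEM;
        }

        /* ...halt and reinitialize the hardware, then set the
         * ifp flags to mark the driver active (see below)... */

        return EOK;
    }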

Back to speedo_attach() in detect.c. Note that it creates and fills in an io-net-style nic_config_t structure, which is used by the mii code and to make the nicinfo utility work nicely.

The speedo_attach() function also does the usual interfacing to the hardware, pretty much as the io-net driver does. However, it does make an io-pkt-specific call to interrupt_entry_init(), which is worth mentioning.

In io-pkt, interrupt handling and thread control are done by the io-pkt framework, not by the driver, for maximum performance.

To that end, the driver doesn't directly set up an interrupt handler; rather, it asks io-pkt to take the interrupt, and io-pkt then calls the driver's process_interrupt() function, discussed below in the Receive section.

At the end of speedo_attach(), note the calls into io-pkt:

if_attach()
ether_ifattach()

which are required to register the driver with io-pkt.
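
In sketch form, the tail end of an attach function wires the driver's entry points into the ifp structure and then registers it (the my_* names and the sc_dev/macaddr members are hypothetical):

    /* Hook the driver's entry points into the interface structure. */
    ifp->if_softc = dev;
    strcpy(ifp->if_xname, dev->sc_dev.dv_xname); /* embedded struct device */
    ifp->if_ioctl = my_ioctl;
    ifp->if_start = my_start;
    ifp->if_init  = my_init;
    ifp->if_stop  = my_stop;
    ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST;

    /* Register with io-pkt; the MAC address was read from the NIC. */
    if_attach(ifp);
    ether_ifattach(ifp, dev->macaddr);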

Now let's take a look at the second stage of initialization, which occurs in speedo_init() in detect.c. Remember that io-pkt can call this function over and over again, at any time! The io-pkt manager actually learns about this function via the ifp structure that the driver passed to it in the if_attach() call above.

The first thing speedo_init() does is check to see if it's dying (see “Shutdown,” below) and if so, it immediately returns. Shutdown is non-trivial.

The next thing speedo_init() does is call speedo_stop(), which can also be called directly by io-pkt. The speedo_stop() function, as the name hints, halts the hardware but doesn't free the resources allocated in speedo_attach().

Then, the NIC is reinitialized (descriptors, etc.). Note that InterruptAttach() is called only if speedo->iid is invalid.

Finally, flags are set in the ifp structure to indicate that the driver is active. This is important.
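
Sketched with hypothetical my_* names, the tail end of *_init() looks something like this:

    /* Attach the ISR only once, even though *_init() runs
     * repeatedly; dev->iid was set to -1 in my_attach(). */
    if (dev->iid == -1) {
        dev->iid = InterruptAttach(dev->irq, my_isr,
            dev, sizeof(*dev), _NTO_INTR_FLAGS_TRK_MSK);
    }

    /* Mark the interface up and ready; io-pkt and the driver's
     * own *_start() function check these flags before transmitting. */
    ifp->if_flags |= IFF_RUNNING;
    ifp->if_flags &= ~IFF_OACTIVE;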

Transmit

Now look at trunk/sys/dev_qnx/speedo/transmit.c. When io-pkt wants the driver to transmit an mbuf, it calls the driver's speedo_start() function, which is registered in speedo_init().


Note: Don't be confused by the *_start() name — it has nothing to do with the *_stop() function, which io-pkt may also call, to shut down.

The speedo_start() function checks the ifp flags (see above) and if the driver isn't running, it immediately returns.

Next, it loops, checking to see if there are any transmit descriptors free. If so, it calls the IFQ_DEQUEUE() macro to get the next queued mbuf for transmission. Each data fragment of the mbuf has its physical address loaded into a tx descriptor, and the cache is flushed.

Note that as in io-net, the driver is free to defragment the packet, if it thinks it's too fragmented.

Next, the driver writes to the hardware registers to tell the hardware that there's a new packet ready for transmission.

Note that at the very end of the speedo_start() function, it calls the io-pkt NW_SIGUNLOCK_P() macro to unlock the mutex that io-pkt locked before it called the driver's *_start() function.
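
Putting these pieces together, here's a condensed sketch of a transmit function. The my_* helpers and the tx_free/MAX_TX_FRAGS accounting are hypothetical; WTP, IFQ_DEQUEUE(), and NW_SIGUNLOCK_P() come from io-pkt:

    static void
    my_start(struct ifnet *ifp)
    {
        my_dev_t                *dev = ifp->if_softc;
        struct mbuf             *m;
        struct nw_work_thread   *wtp = WTP;

        if ((ifp->if_flags & IFF_RUNNING) == 0) {
            NW_SIGUNLOCK_P(&ifp->if_snd_ex, dev->iopkt, wtp);
            return;
        }

        /* Queue packets for as long as tx descriptors are free. */
        while (dev->tx_free >= MAX_TX_FRAGS) {
            IFQ_DEQUEUE(&ifp->if_snd, m);
            if (m == NULL)
                break;

            /* Load each fragment's physical address into a tx
             * descriptor, and flush the cache for its data. */
            my_load_tx_descriptors(dev, m);
        }

        /* Tell the hardware there's a new packet ready. */
        my_kick_transmit(dev);

        /* Release the mutex io-pkt locked before calling us. */
        NW_SIGUNLOCK_P(&ifp->if_snd_ex, dev->iopkt, wtp);
    }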

At the end of transmit.c is the speedo_transmit_complete() function, which is called from various places in the driver to harvest descriptors of completed transmissions. A very important difference from io-net is that when the driver has completed transmission of the packet, it simply calls m_free() and releases the mbuf; it doesn't attempt to return the packet to the protocol, as with io-net.
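
A sketch of that harvesting loop (the descriptor bookkeeping is hypothetical; m_freem() is the chain-freeing variant of m_free()):

    /* Reclaim descriptors the hardware has finished with. */
    while (dev->tx_busy > 0 && my_tx_descriptor_done(dev, dev->tx_head)) {
        m_freem(dev->tx_mbuf[dev->tx_head]);  /* free the whole chain */
        dev->tx_mbuf[dev->tx_head] = NULL;
        dev->tx_head = (dev->tx_head + 1) % NUM_TX_DESC;
        dev->tx_busy--;
        dev->tx_free++;
        ifp->if_opackets++;                   /* stats for nicinfo */
    }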

Receive

It's important to mention that unlike io-net, a native io-pkt driver doesn't create a high-priority receive thread to handle interrupts.

Packet reception in io-pkt starts with an interrupt from the hardware, which io-pkt initially handles, so let's look at that first.

If you look at speedo.c, you'll see two functions that can be called by io-pkt: speedo_isr() and speedo_isr_kermask().

A driver can choose either the simple kernel masking method of acknowledging interrupts, or the more elegant “real” ISR technique. The major advantage of a real ISR shows up when interrupts are shared between different devices — generally a bad idea — because far less interrupt latency is experienced if both devices' handlers do “the right thing”: check for and mask their interrupt sources in hardware. If the interrupt isn't shared between different devices, the advantage of kermask versus a real ISR isn't very noticeable.

A complete discussion of system interrupts in QNX Neutrino is outside the scope of this document. Suffice it to say that with io-pkt, a driver may choose one method or the other, possibly via a command-line option.
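
With the real ISR technique, the handler typically checks and masks its own interrupt source, then hands off to io-pkt. A sketch, where my_mask_hw_interrupt() and the inter/iopkt members are hypothetical, and interrupt_queue() is io-pkt's hand-off call:

    const struct sigevent *
    my_isr(void *arg, int iid)
    {
        my_dev_t            *dev  = arg;
        struct _iopkt_inter *ient = &dev->inter;

        /* Check for and mask our interrupt source in the hardware,
         * so a shared interrupt line can be re-enabled quickly. */
        my_mask_hw_interrupt(dev);

        /* Ask io-pkt to schedule my_process_interrupt(). */
        return interrupt_queue(dev->iopkt, ient);
    }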

Either way, after an interrupt occurs, io-pkt calls the driver's speedo_process_interrupt() function, which loops, checking the hardware registers to see why the interrupt occurred. One of the functions it calls is speedo_receive(), which is located in receive.c. The speedo_receive() function scans through all filled rx descriptors, passing their mbufs up to io-pkt via the ifp->if_input() function. Note that it calls into io-pkt to get a new mbuf via the m_getcl_wtp() function — unlike some io-net drivers, it doesn't attempt to maintain its own internal cache of buffers.
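
A condensed sketch of that receive path; the my_* descriptor helpers are hypothetical, while m_getcl_wtp() and if_input() are the io-pkt calls described above:

    int
    my_process_interrupt(void *arg, struct nw_work_thread *wtp)
    {
        my_dev_t        *dev = arg;
        struct ifnet    *ifp = &dev->ecom.ec_if;
        struct mbuf     *m, *m_new;

        while (my_rx_descriptor_filled(dev)) {
            /* Get a fresh cluster from io-pkt to replace the filled
             * buffer; no private buffer cache is maintained. */
            m_new = m_getcl_wtp(M_DONTWAIT, MT_DATA, M_PKTHDR, wtp);
            if (m_new == NULL)
                break;              /* out of buffers; retry later */

            m = my_swap_rx_buffer(dev, m_new);
            m->m_pkthdr.rcvif = ifp;
            m->m_pkthdr.len = m->m_len = my_rx_frame_length(dev);

            /* Pass the received packet up to io-pkt. */
            (*ifp->if_input)(ifp, m);
            ifp->if_ipackets++;
        }

        return 1;
    }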

Link state change handling

If you look at trunk/sys/dev_qnx/speedo/mii.c, you will notice very little difference between it and the io-net version.

An important point to emphasize is that with io-pkt, the native driver doesn't set up a timer that directly calls the housekeeping function to probe the link state every two seconds, as is done in io-net.

Instead, in io-pkt, a callout is created: we can see the callout_init() call in speedo_init_phy(), and the callout_msec() call in speedo_init(), which arms the internal io-pkt timer.

It's worth noting that a callout timer is a one-shot thing: it must be rearmed every time, which is done at the end of the speedo_MDI_MonitorPhy() function. If you don't rearm it, it will be called only once!
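
Here's a sketch of the rearming pattern, using a hypothetical my_MDI_MonitorPhy() that parallels the speedo function:

    void
    my_MDI_MonitorPhy(void *arg)
    {
        my_dev_t    *dev = arg;

        /* ...probe the PHY and handle any link state change... */

        /* The callout is one-shot: rearm it for another two
         * seconds, or this function will never run again. */
        callout_msec(&dev->mii_callout, 2 * 1000,
            my_MDI_MonitorPhy, dev);
    }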

Control (e.g. devctl() and ioctl())

If we look at trunk/sys/dev_qnx/speedo/devctl.c, we find the speedo_ioctl() function, which is very similar to io-net's speedo_devctl() function. It's passed an unsigned cmd, which the driver can choose either to implement or not.

Looking at speedo_ioctl(), we can see that it handles SIOCGDRVCOM, which nicinfo uses to get driver configuration information and counts. If the driver doesn't implement SIOCGDRVCOM, the request falls through at the end of the switch statement to the generic ether_ioctl() function in io-pkt, which implements a very bare-bones version of the nicinfo stats. (This was originally implemented for ported BSD drivers.)

Another interesting pair of cmd values is SIOCSIFMEDIA and SIOCGIFMEDIA. This functionality is new for io-pkt: in io-net, there was no way to set the link speed, duplex, and so on, except via command-line options. With io-pkt, we can use the MEDIA ioctl() commands (e.g. via the ifconfig utility) to change the link speed and duplex on the fly.
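
A skeleton ioctl handler covering the cases discussed in this section; the bsd_mii member is a hypothetical softc field, while ifmedia_ioctl() and ether_ioctl() are the standard BSD helpers:

    int
    my_ioctl(struct ifnet *ifp, unsigned long cmd, caddr_t data)
    {
        my_dev_t        *dev = ifp->if_softc;
        struct ifreq    *ifr = (struct ifreq *)data;
        int             error = 0;

        switch (cmd) {
        case SIOCSIFMEDIA:
        case SIOCGIFMEDIA:
            /* Let the BSD ifmedia layer report or change the link
             * speed and duplex on the fly (e.g. from ifconfig). */
            error = ifmedia_ioctl(ifp, ifr, &dev->bsd_mii.mii_media, cmd);
            break;

        case SIOCGDRVCOM:
            /* Fill in the nic_config_t and stats for nicinfo... */
            break;

        default:
            /* Everything else falls through to the generic handler. */
            error = ether_ioctl(ifp, cmd, data);
            break;
        }

        return error;
    }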

Look at trunk/sys/dev_qnx/speedo/bsd_media.c for details of how this is implemented — it's heavily commented, and is very similar for all native io-pkt drivers. Note that your driver doesn't need to implement the MEDIA ioctls, but if it chooses to do so, then this functionality is immediately available external to io-pkt.

Shutdown

In io-net, there was a pretty straightforward two-stage shutdown. Shutting down in io-pkt is a bit more complicated.

First off, note the speedo_stop() function previously mentioned. It can be called externally by io-pkt, or internally by the driver itself. This is necessary, but not sufficient.

In addition, the driver needs to register another callback function via the shutdownhook_establish() io-pkt function, which it calls in the speedo_attach() function. The callback function it registers is speedo_shutdown(), which simply calls speedo_stop() with the correct parameters.

The shutdownhook mechanism is special — it's called from io-pkt's SIGSEGV handler to quiesce the hardware. If this isn't done, the hardware may continue to write, via DMA, to memory that the system thinks is free. This is very bad — DMA must be halted, and hardware interrupts should be masked, when the driver's shutdown hook is called.

It's worth repeating that the callback function the driver registers via shutdownhook_establish() should do as little as possible: ideally, it should just hit some hardware registers to halt DMA and, if possible, mask the hardware interrupt — especially if the interrupt is shared!
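
In sketch form, with hypothetical my_halt_dma() and my_mask_hw_interrupt() helpers:

    /* In my_attach(): register the quiesce callback with io-pkt. */
    dev->sd_hook = shutdownhook_establish(my_shutdown, dev);

    /* The callback itself: do as little as possible. */
    void
    my_shutdown(void *arg)
    {
        my_dev_t    *dev = arg;

        /* Halt DMA and mask the hardware interrupt, nothing more;
         * in particular, don't touch the heap, which may be the
         * very thing that faulted. */
        my_halt_dma(dev);
        my_mask_hw_interrupt(dev);
    }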

The problem is that if the driver does unnecessary work in the shutdown callback, it risks faulting again. If it does, then other hardware driver threads may not get a chance to quiesce their hardware, which can be bad.

Note that the speedo_stop() function, which is fairly generic, isn't particularly well-coded in this regard: to avoid memory leaks, it scans through transmissions that are in progress and calls m_free() to release the mbufs. But if the SIGSEGV occurred because of mbuf heap corruption, a second SIGSEGV could easily occur during the subsequent m_free() calls. As you can see, shutdown isn't trivial to get correct under all circumstances.

Threading

It's worth repeating that threading in io-pkt is much more restrictive than it was with io-net. io-pkt drivers shouldn't create threads or set up direct timer calls unless absolutely necessary; this lets io-pkt control execution for optimal performance.

A complete discussion of io-pkt threads is outside the scope of this document (see Threading model in the Overview chapter of the Core Networking Stack User's Guide), but suffice it to say that unless you really, really need to create threads — which you can — it's far easier not to.

Mutexing

Just as with io-net drivers, io-pkt drivers can choose to implement a variety of hardware and data exclusive-access mechanisms. We've mentioned that before the driver's transmit *_start() function is called, an io-pkt mutex is locked; the driver must unlock it before the function returns.

However, we recommend that each io-pkt driver provide its own mutex protection for its hardware and data structures, to ensure that it functions correctly on SMP machines, and keeps working even if someone rearranges thread priorities or recodes io-pkt's threading model in the future.

It should be obvious that mutex protection is required for the transmit and receive calls. However, please be careful to ensure that adequate mutexing is also implemented for the periodic housekeeping timer and for the ioctl() calls, both of which may access hardware and alter values in the driver's data structures.
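
For example, a driver might guard its register and counter accesses with a per-device mutex, initialized in *_attach(); drv_mutex and my_read_hw_stats() below are hypothetical:

    /* In the housekeeping timer and the ioctl handler alike: */
    pthread_mutex_lock(&dev->drv_mutex);
    my_read_hw_stats(dev);      /* touches hardware and counters */
    pthread_mutex_unlock(&dev->drv_mutex);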