Parent Directory
|
Revision Log
| Links to HEAD: | (view) (download) (annotate) |
| Sticky Revision: |
net: Revert vnet/epair cleanup race mitigation Revert the mitigation code for the vnet/epair cleanup race (done in r365457). r368237 introduced a more reliable fix. MFC after: 2 weeks Sponsored by: Modirum MDPay
if: Fix panic when destroying vnet and epair simultaneously When destroying a vnet and an epair (with one end in the vnet) we often panicked. This was the result of the destruction of the epair, which destroys both ends simultaneously, happening while vnet_if_return() was moving the struct ifnet to its home vnet. This can result in a freed ifnet being re-added to the home vnet V_ifnet list. That in turn panics the next time the ifnet is used. Prevent this race by ensuring that vnet_if_return() cannot run at the same time as if_detach() or epair_clone_destroy(). PR: 238870, 234985, 244703, 250870 MFC after: 2 weeks Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D27378
Make MAXPHYS tunable. Bump MAXPHYS to 1M. Replace MAXPHYS by runtime variable maxphys. It is initialized from MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys. Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer cache buffers exactly to atop(maxbcachebuf) (currently it is sized to atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1. The +1 for pbufs allow several pbuf consumers, among them vmapbuf(), to use unaligned buffers still sized to maxphys, esp. when such buffers come from userspace (*). Overall, we save significant amount of otherwise wasted memory in b_pages[] for buffer cache buffers, while bumping MAXPHYS to desired high value. Eliminate all direct uses of the MAXPHYS constant in kernel and driver sources, except a place which initialize maxphys. Some random (and arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted straight. Some drivers, which use MAXPHYS to size embeded structures, get private MAXPHYS-like constant; their convertion is out of scope for this work. Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs, dev/siis, where either submitted by, or based on changes by mav. Suggested by: mav (*) Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27225
if: Fix non-VIMAGE build if_link_ifnet() and if_unlink_ifnet() are needed even when VIMAGE is not enabled. MFC after: 2 weeks Sponsored by: Modirum MDPay
if: Protect V_ifnet in vnet_if_return() When we terminate a vnet (i.e. jail) we move interfaces back to their home vnet. We need to protect our access to the V_ifnet CK_LIST. We could enter NET_EPOCH, but if_detach_internal() (called from if_vmove()) waits for net epoch callback completion. That's not possible from NET_EPOCH. Instead, we take the IFNET_WLOCK, build a list of the interfaces that need to move and, once we've released the lock, move them back to their home vnet. We cannot hold the IFNET_WLOCK() during if_vmove(), because that results in a LOR between ifnet_sx, in_multi_sx and iflib ctx lock. Separate out moving the ifp into or out of V_ifnet, so we can hold the lock as we do the list manipulation, but do not hold it as we if_vmove(). Reviewed by: melifaro MFC after: 2 weeks Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D27279
if: Remove ifnet_rwlock It no longer serves any purpose, as evidenced by the fact that we never take it without ifnet_sxlock. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D27278
Move all ifaddr route creation business logic to net/route/route_ifaddr.c Differential Revision: https://reviews.freebsd.org/D26318
add SIOCGIFDATA ioctl For interfaces that do not support SIOCGIFMEDIA (for which there are quite a few) the only fallback is to query the interface for if_data->ifi_link_state. While it's possible to get at if_data for an interface via getifaddrs(3) or sysctl, both are heavy weight mechanisms. SIOCGIFDATA is a simple ioctl to retrieve this fast with very little resource use in comparison. This implementation mirrors that of other similar ioctls in FreeBSD. Submitted by: Roy Marples <roy@marples.name> Reviewed by: markj MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D26538
net: mitigate vnet / epair cleanup races There's a race where dying vnets move their interfaces back to their original vnet, and if_epair cleanup (where deleting one interface also deletes the other end of the epair). This is commonly triggered by the pf tests, but also by cleanup of vnet jails. As we've not yet been able to fix the root cause of the issue work around the panic by not dereferencing a NULL softc in epair_qflush() and by not re-attaching DYING interfaces. This isn't a full fix, but makes a very common panic far less likely. PR: 244703, 238870 Reviewed by: lutz_donnerhacke.de MFC after: 4 days Differential Revision: https://reviews.freebsd.org/D26324
net: clean up empty lines in .c and .h files
Remove free_domain() and uma_zfree_domain(). These functions were introduced before UMA started ensuring that freed memory gets placed in domain-local caches. They no longer serve any purpose since UMA now provides their functionality by default. Remove them to simplyify the kernel memory allocator interfaces a bit. Reviewed by: cem, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25937
Transition from rtrequest1_fib() to rib_action(). Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib, in6_rtrequest, rtrequest_fib> and their uses and switch to to rib_action(). This is part of the new routing KPI. Submitted by: Neel Chauhan <neel AT neelc DOT org> Differential Revision: https://reviews.freebsd.org/D25546
Temporarly revert r363319 to unbreak the build. Reported by: CI Pointy hat to: melifaro
Transition from rtrequest1_fib() to rib_action(). Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib, in6_rtrequest, rtrequest_fib> and their uses and switch to to rib_action(). This is part of the new routing KPI. Submitted by: Neel Chauhan <neel AT neelc DOT org> Differential Revision: https://reviews.freebsd.org/D25546
Use epoch(9) for rtentries to simplify control plane operations. Currently the only reason of refcounting rtentries is the need to report the rtable operation details immediately after the execution. Delaying rtentry reclamation allows to stop refcounting and simplify the code. Additionally, this change allows to reimplement rib_lookup_info(), which is used by some of the customers to get the matching prefix along with nexthops, in more efficient way. The change keeps per-vnet rtzone uma zone. It adds nh_vnet field to nhop_priv to be able to reliably set curvnet even during vnet teardown. Rest of the reference counting code will be removed in the D24867 . Differential Revision: https://reviews.freebsd.org/D24866
Expose ifr_buffer_get_(buffer|length) outside if.c. This is a preparatory commit for D23933. Reviewed by: jhb
Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718
Although most of the NIC drivers are epoch ready, due to peer pressure switch over to opt-in instead of opt-out for epoch. Instead of IFF_NEEDSEPOCH, provide IFF_KNOWSEPOCH. If driver marks itself with IFF_KNOWSEPOCH, then ether_input() would not enter epoch when processing its packets. Now this will create recursive entrance in epoch in >90% network drivers, but will guarantee safeness of the transition. Mark several tested drivers as IFF_KNOWSEPOCH. Reviewed by: hselasky, jeff, bz, gallatin Differential Revision: https://reviews.freebsd.org/D23674
Partially revert VNET change and expand VNET structure. Revert parts of r353274 replacing vnet_state with a shutdown flag. Not having the state flag for the current SI_SUB_* makes it harder to debug kernel or module panics related to VNET bringup or teardown. Not having the state also does not allow us to check for other dependency levels between components, e.g. for moving interfaces. Expand the VNET structure with the new boolean flag indicating that we are doing a shutdown of a given vnet and update the vnet magic cookie for the change. Update libkvm to compile with a bool in the kernel struct. Bump __FreeBSD_version for (external) module builds to more easily detect the change. Reviewed by: hselasky MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23097
Temporarily force IFF_NEEDSEPOCH until drivers have been resolved. Recent network epoch changes have left some drivers unexpectedly broken and there is not yet a consensus on the correct fix. This is patch is a minor performance impact until we can agree on the correct path forward. Reviewed by: core, network, imp, glebius, hselasky Differential Revision: https://reviews.freebsd.org/D23515
ifa_maintain_loopback_route: adjust debugging output Correction after r333476: - write this as LOG_DEBUG again instead of LOG_INFO; - get back function name into the message; - error may be ESRCH if an address is removed in process (by carp f.e.), not only ENOENT; - expression complexity grows, so try making it more readable. MFC after: 1 week
Introduce NET_EPOCH_CALL() macro and use it everywhere where we free data based on the network epoch. The macro reverses the argument order of epoch_call(9) - first function, then its argument. NFC
Mechanically substitute assertion of in_epoch(net_epoch_preempt) to NET_EPOCH_ASSERT(). NFC
- Move global network epoch definition to epoch.h, as more different subsystems tend to need to know about it, and including if_var.h is huge header pollution for them. Polluting possible non-network users with single symbol seems much lesser evil. - Remove non-preemptible network epoch. Not used yet, and unlikely to get used in close future.
if_vmove: return proper error status if_vmove can fail if it lost a race and the vnet's already been moved. The callers (and their callers) can generally cope with this, but right now success is assumed. Plumb out the ENOENT from if_detach_internal if it happens so that the error's properly reported to userland. Reviewed by: bz, kp MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D22780
Plug loopback idaddr refcount leak. Reviewed by: markj MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D22980
Deduplicate code between if_delgroup() and if_delgroups(). Fix some style in if_addgroup(). No functional change intended. Reviewed by: hselasky MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22892
Fix a memory leak in if_delgroups() introduced in r334118. PR: 242712 Submitted by: ghuckriede@blackberry.com MFC after: 3 days
Allow kernel to compile without BPF. r297816 added some bpf magic for VIMAGE unconditionally which no longer allows kernels to compile without bpf (but with other networking). Add the missing ifdef checks and allow a kernel to compile without bpf again. PR: 242136 Reported by: dave mischler.com MFC after: 2 weeks
Add explicit SI_SUB_EPOCH Add explicit SI_SUB_EPOCH, after SI_SUB_TASKQ and before SI_SUB_SMP (EARLY_AP_STARTUP). Rename existing "SI_SUB_TASKQ + 1" to SI_SUB_EPOCH. epoch(9) consumers cannot epoch_alloc() before SI_SUB_EPOCH:SI_ORDER_SECOND, but likely should allocate before SI_SUB_SMP. Prior to this change, consumers (well, epoch itself, and net/if.c) just open-coded the SI_SUB_TASKQ + 1 order to match epoch.c, but this was fragile. Reviewed by: mmacy Differential Revision: https://reviews.freebsd.org/D22503
In if_siocaddmulti() enter VNET. Reported & tested by: garga
There is a long standing problem with multicast programming for NICs and IPv6. With IPv6 we may call if_addmulti() in context of processing of an incoming packet. Usually this is interrupt context. While most of the NIC drivers are able to reprogram multicast filters without sleeping, some of them can't. An example is e1000 family of drivers. With iflib conversion the problem was somewhat hidden. Iflib processes packets in private taskqueue, so going to sleep doesn't trigger an assertion. However, the sleep would block operation of the driver and following incoming packets would fill the ring and eventually would start being dropped. Enabling epoch for the full time of a packet processing again started to trigger assertions for e1000. Fix this problem once and for all using a general taskqueue to call if_ioctl() method in all cases when if_addmulti() is called in a non sleeping context. Note that nobody cares about returned value. Reviewed by: hselasky, kib Differential Revision: https://reviews.freebsd.org/D22154
Remove obsoleted KPIs that were used to access interface address lists.
Split out a more generic debugnet(4) from netdump(4) Debugnet is a simplistic and specialized panic- or debug-time reliable datagram transport. It can drive a single connection at a time and is currently unidirectional (debug/panic machine transmit to remote server only). It is mostly a verbatim code lift from netdump(4). Netdump(4) remains the only consumer (until the rest of this patch series lands). The INET-specific logic has been extracted somewhat more thoroughly than previously in netdump(4), into debugnet_inet.c. UDP-layer logic and up, as much as possible as is protocol-independent, remains in debugnet.c. The separation is not perfect and future improvement is welcome. Supporting INET6 is a long-term goal. Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to 'debugnet_' or 'dn_' -- sorry. I thought keeping the netdump name on the generic module would be more confusing than the refactoring. The only functional change here is the mbuf allocation / tracking. Instead of initiating solely on netdump-configured interface(s) at dumpon(8) configuration time, we watch for any debugnet-enabled NIC for link activation and query it for mbuf parameters at that time. If they exceed the existing high-water mark allocation, we re-allocate and track the new high-water mark. Otherwise, we leave the pre-panic mbuf allocation alone. In a future patch in this series, this will allow initiating netdump from panic ddb(4) without pre-panic configuration. No other functional change intended. Reviewed by: markj (earlier version) Some discussion with: emaste, jhb Objection from: marius Differential Revision: https://reviews.freebsd.org/D21421
do_link_state_change() is executed in taskqueue context and in general is allowed to sleep. Don't enter the epoch for the whole duration. If some event handlers need the epoch, they should handle that theirselves. Discussed with: hselasky
The two functions ifnet_byindex() and ifnet_byindex_locked() are exactly the same after the network stack was epochified. Merge the two into one function and cleanup all uses of ifnet_byindex_locked(). While at it: - Add branch prediction macros. - Make sure the ifnet pointer is only deferred once, also when code optimisation is disabled. Sponsored by: Mellanox Technologies
Exclude the network link eventhandler from epochification after r353292. This fixes the following assert when "options RATELIMIT" is used: panic() malloc() sysctl_add_oid() tcp_rl_ifnet_link() do_link_state_change() taskqueue_run_locked() Sponsored by: Mellanox Technologies
if_delmulti() is never called without ifp argument, assert this instead of doing a useless search through interfaces.
Add two extra functions that basically give count of addresses on interface. Such function could been implemented on top of the if_foreach_llm?addr(), but several drivers need counting, so avoid copy-n-paste inside the drivers.
Provide new KPI for network drivers to access lists of interface addresses. The KPI doesn't reveal neither how addresses are stored, how the access to them is synchronized, neither reveal struct ifaddr and struct ifmaddr. Reviewed by: gallatin, erj, hselasky, philip, stevek Differential Revision: https://reviews.freebsd.org/D21943
Remove epoch assertion from if_setlladdr(). Originally this function was protected by IF_ADDR_LOCK(), which was a mutex, so that two simultaneous if_setlladdr() can't execute. Later it was switched to IF_ADDR_RLOCK(), likely by a mistake. Later it was switched to NET_EPOCH_ENTER(). Then I incorrectly added NET_EPOCH_ASSERT() here. In reality ifp->if_addr never goes away and never changes its length. So, doing bcopy() in it is always "safe", meaning it won't dereference a wrong pointer or write into someone's else memory. Of course doing two bcopy() in parallel would result in a mess of two addresses, but net epoch doesn't protect against that, neither IF_ADDR_RLOCK() did. So for now, just remove the assertion and leave for later a proper fix. Reported by: markj
In DIAGNOSTIC block of if_delmulti_ifma_flags() enter the network epoch. This quickly plugs the regression from r353292. The locking of multicast definitely needs a broader review today... Reported by: pho, dhw
Widen NET_EPOCH coverage. When epoch(9) was introduced to network stack, it was basically dropped in place of existing locking, which was mutexes and rwlocks. For the sake of performance mutex covered areas were as small as possible, so became epoch covered areas. However, epoch doesn't introduce any contention, it just delays memory reclaim. So, there is no point to minimise epoch covered areas in sense of performance. Meanwhile entering/exiting epoch also has non-zero CPU usage, so doing this less often is a win. Not the least is also code maintainability. In the new paradigm we can assume that at any stage of processing a packet, we are inside network epoch. This makes coding both input and output path way easier. On output path we already enter epoch quite early - in the ip_output(), in the ip6_output(). This patch does the same for the input path. All ISR processing, network related callouts, other ways of packet injection to the network stack shall be performed in net_epoch. Any leaf function that walks network configuration now asserts epoch. Tricky part is configuration code paths - ioctls, sysctls. They also call into leaf functions, so some need to be changed. This patch would introduce more epoch recursions (see EPOCH_TRACE) than we had before. They will be cleaned up separately, as several of them aren't trivial. Note, that unlike a lock recursion the epoch recursion is safe and just wastes a bit of resources. Reviewed by: gallatin, hselasky, cy, adrian, kristof Differential Revision: https://reviews.freebsd.org/D19111
Factor out VNET shutdown check into an own vnet structure field. Remove the now obsolete vnet_state field. This greatly simplifies the detection of VNET shutdown and avoids code duplication. Discussed with: bz@ MFC after: 1 week Sponsored by: Mellanox Technologies
Add debugging facility EPOCH_TRACE that checks that epochs entered are properly nested and warns about recursive entrances. Unlike with locks, there is nothing fundamentally wrong with such use, the intent of tracer is to help to review complex epoch-protected code paths, and we mean the network stack here. Reviewed by: hselasky Sponsored by: Netflix Pull Request: https://reviews.freebsd.org/D21610
Add SIOCGIFDOWNREASON. The ioctl(2) is intended to provide more details about the cause of the down for the link. Eventually we might define a comprehensive list of codes for the situations. But interface also allows the driver to provide free-form null-terminated ASCII string to provide arbitrary non-formalized information. Sample implementation exists for mlx5(4), where the string is fetched from firmware controlling the port. Reviewed by: hselasky, rrs Sponsored by: Mellanox Technologies MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21527
SIOCSIFNAME: Do nothing if we're not actually changing Instead of throwing EEXIST, just succeed if the name isn't actually changing. We don't need to trigger departure or any of that because there's no change from consumers' perspective. PR: 240539 Reviewed by: brooks MFC after: 5 days Differential Revision: https://reviews.freebsd.org/D21618
Need to wait for epoch callbacks to complete before detaching a network interface. This particularly manifests itself when an INP has multicast options attached during a network interface detach. Then the IPv4 and IPv6 leave group call which results from freeing the multicast address, may access a freed ifnet structure. These are the steps to reproduce: service mdnsd onestart # installed from ports ifconfig epair create ifconfig epair0a 0/24 up ifconfig epair0a destroy Tested by: pho @ MFC after: 1 week Sponsored by: Mellanox Technologies
Restructure mbuf send tags to provide stronger guarantees. - Perform ifp mismatch checks (to determine if a send tag is allocated for a different ifp than the one the packet is being output on), in ip_output() and ip6_output(). This avoids sending packets with send tags to ifnet drivers that don't support send tags. Since we are now checking for ifp mismatches before invoking if_output, we can now try to allocate a new tag before invoking if_output sending the original packet on the new tag if allocation succeeds. To avoid code duplication for the fragment and unfragmented cases, add ip_output_send() and ip6_output_send() as wrappers around if_output and nd6_output_ifp, respectively. All of the logic for setting send tags and dealing with send tag-related errors is done in these wrapper functions. For pseudo interfaces that wrap other network interfaces (vlan and lagg), wrapper send tags are now allocated so that ip*_output see the wrapper ifp as the ifp in the send tag. The if_transmit routines rewrite the send tags after performing an ifp mismatch check. If an ifp mismatch is detected, the transmit routines fail with EAGAIN. - To provide clearer life cycle management of send tags, especially in the presence of vlan and lagg wrapper tags, add a reference count to send tags managed via m_snd_tag_ref() and m_snd_tag_rele(). Provide a helper function (m_snd_tag_init()) for use by drivers supporting send tags. m_snd_tag_init() takes care of the if_ref on the ifp meaning that code alloating send tags via if_snd_tag_alloc no longer has to manage that manually. Similarly, m_snd_tag_rele drops the refcount on the ifp after invoking if_snd_tag_free when the last reference to a send tag is dropped. This also closes use after free races if there are pending packets in driver tx rings after the socket is closed (e.g. from tcpdrop). In order for m_free to work reliably, add a new CSUM_SND_TAG flag in csum_flags to indicate 'snd_tag' is set (rather than 'rcvif'). Drivers now also check this flag instead of checking snd_tag against NULL. This avoids false positive matches when a forwarded packet has a non-NULL rcvif that was treated as a send tag. - cxgbe was relying on snd_tag_free being called when the inp was detached so that it could kick the firmware to flush any pending work on the flow. This is because the driver doesn't require ACK messages from the firmware for every request, but instead does a kind of manual interrupt coalescing by only setting a flag to request a completion on a subset of requests. If all of the in-flight requests don't have the flag when the tag is detached from the inp, the flow might never return the credits. The current snd_tag_free command issues a flush command to force the credits to return. However, the credit return is what also frees the mbufs, and since those mbufs now hold references on the tag, this meant that snd_tag_free would never be called. To fix, explicitly drop the mbuf's reference on the snd tag when the mbuf is queued in the firmware work queue. This means that once the inp's reference on the tag goes away and all in-flight mbufs have been queued to the firmware, tag's refcount will drop to zero and snd_tag_free will kick in and send the flush request. Note that we need to avoid doing this in the middle of ethofld_tx(), so the driver grabs a temporary reference on the tag around that loop to defer the free to the end of the function in case it sends the last mbuf to the queue after the inp has dropped its reference on the tag. - mlx5 preallocates send tags and was using the ifp pointer even when the send tag wasn't in use. Explicitly use the ifp from other data structures instead. - Sprinkle some assertions in various places to assert that received packets don't have a send tag, and that other places that overwrite rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer. Reviewed by: gallatin, hselasky, rgrimes, ae Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20117
Extract eventfilter declarations to sys/_eventfilter.h
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.
EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).
As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.
LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).
No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.
Fix rt_ifa selection during loopback route insertion process. Currently such routes are added with a link-level IFA, which is plain wrong. Only after the insertion they get fixed by the special link_rtrequest() ifa handler. This behaviour complicates routing code and makes ifa selection more complex. Streamline this process by explicitly moving link_rtrequest() logic to the pre-insertion rt_getifa_fib() ifa selector. Avoid calling all this logic in the loopback route case by explicitly specifying proper rt_ifa inside the ifa_maintain_loopback_route().§ MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20076
Track device's NUMA domain in ifnet & alloc ifnet from NUMA local memory This commit adds new if_alloc_domain() and if_alloc_dev() methods to allocate ifnets. When called with a domain on a NUMA machine, ifalloc_domain() will record the NUMA domain in the ifnet, and it will allocate the ifnet struct from memory which is local to that NUMA node. Similarly, if_alloc_dev() is a wrapper for if_alloc_domain which uses a driver supplied device_t to call ifalloc_domain() with the appropriate domain. Note that the new if_numa_domain field fits in an alignment pad in struct ifnet, and so does not alter the size of the structure. Reviewed by: glebius, kib, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19930
Rework CASE_IOC_IFGROUPREQ() to require a case before the macro. This is more compatible with formatting tools and looks more normal. Reported by: jhb (on a different review) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D18442
Mechanical cleanup of epoch(9) usage in network stack. - Remove macros that covertly create epoch_tracker on thread stack. Such macros a quite unsafe, e.g. will produce a buggy code if same macro is used in embedded scopes. Explicitly declare epoch_tracker always. - Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read locking macros to what they actually are - the net_epoch. Keeping them as is is very misleading. They all are named FOO_RLOCK(), while they no longer have lock semantics. Now they allow recursion and what's more important they now no longer guarantee protection against their companion WLOCK macros. Note: INP_HASH_RLOCK() has same problems, but not touched by this commit. This is non functional mechanical change. The only functionally changed functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter epoch recursively. Discussed with: jtl, gallatin
Remove part of comment that doesn't match reality.
Adapt the fix in r341008 to correctly work with EBR. IFNET_RLOCK_NOSLEEP() is epoch_enter_preempt() in FreeBSD 12+. Holding it in sysctl_rtsock() doesn't protect us from ifnet unlinking, because unlinking occurs with IFNET_WLOCK(), that is rw_wlock+sx_xlock, and it doesn check that concurrent code is running in epoch section. But while we are in epoch section, we should be able to do access to ifnet's fields, even it was unlinked. Thus do not change if_addr and if_hw_addr fields in ifnet_detach_internal() to NULL, since rtsock code can do access to these fields and this is allowed while it is running in epoch section. This should fix the race, when ifnet_detach_internal() unlinks ifnet after we checked it for IFF_DYING in sysctl_dumpentry. Move free(ifp->if_hw_addr) into ifnet_free_internal(). Also remove the NULL check for ifp->if_description, since free(9) can correctly handle NULL pointer. MFC after: 1 week
Fix possible panic during ifnet detach in rtsock. The panic can happen, when some application does dump of routing table using sysctl interface. To prevent this, set IFF_DYING flag in if_detach_internal() function, when ifnet under lock is removed from the chain. In sysctl_rtsock() take IFNET_RLOCK_NOSLEEP() to prevent ifnet detach during routes enumeration. In case, if some interface was detached in the time before we take the lock, add the check, that ifnet is not DYING. This prevents access to memory that could be freed after ifnet is unlinked. PR: 227720, 230498, 233306 Reviewed by: bz, eugen MFC after: 1 week Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D18338
For compatibility KPI functions like if_addr_rlock() that used to have mutexes but now are converted to epoch(9) use thread-private epoch_tracker. Embedding tracker into ifnet(9) or ifnet derived structures creates a non reentrable function, that will fail miserably if called simultaneously from two different contexts. A thread private tracker will provide a single tracker that would allow to call these functions safely. It doesn't allow nested call, but this is not expected from compatibility KPIs. Reviewed by: markj
Notify that the ifnet will go away, even on vnet shutdown pf subscribes to ifnet_departure_event events, so it can clean up the ifg_pf_kif and if_pf_kif pointers in the ifnet. During vnet shutdown interfaces could go away without sending the event, so pf ends up cleaning these up as part of its shutdown sequence, which happens after the ifnet has already been freed. Send the ifnet_departure_event during vnet shutdown, allowing pf to clean up correctly. MFC after: 2 weeks Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D17500
Resolve deadlock between epoch(9) and various network interface SX-locks, during if_purgeaddrs(), by not allowing to hold the epoch read lock over typical network IOCTL code paths. This is a regression issue after r334305. Reviewed by: ae (network) Differential revision: https://reviews.freebsd.org/D17647 MFC after: 1 week Sponsored by: Mellanox Technologies
Add KPI that can be used by tunneling interfaces to handle IP addresses appearing and disappearing on the host system. Such handling is need, because tunneling interfaces must use addresses, that are configured on the host as ingress addresses for tunnels. Otherwise the system can send spoofed packets with source address, that belongs to foreign host. The KPI uses ifaddr_event_ext event to implement addresses tracking. Tunneling interfaces register event handlers and then they are notified by the kernel, when an address disappears or appears. ifaddr_event_compat() handler from if.c replaced by srcaddr_change_event() in the ip_encap.c MFC after: 1 month Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D17134
Add ifaddr_event_ext event. It is similar to ifaddr_event, but the handler receives the type of event IFADDR_EVENT_ADD/IFADDR_EVENT_DEL, and the pointer to ifaddr. Also ifaddr_event now is implemented using ifaddr_event_ext handler. MFC after: 3 weeks Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D17100
For changing the MTU on tun/tap devices, it should not matter whether it is done via using ifconfig, which uses a SIOCSIFMTU ioctl() command, or doing it using a TUNSIFINFO/TAPSIFINFO ioctl() command. Without this patch, for IPv6 the new MTU is not used when creating routes. Especially, when initiating TCP connections after increasing the MTU, the old MTU is still used to compute the MSS. Thanks to ae@ and bz@ for helping to improve the patch. Reviewed by: ae@, bz@ Approved by: re (kib@) MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D17180
fix copy/paste error when clearing ifma flag CID: 1395119 Reported by: vangyzen
Add the ability to look up the 3b PCP of a VLAN interface. Use it in toe_l2_resolve to fill up the complete vtag and not just the vid. Reviewed by: kib@ MFC after: 1 week Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D16752
Fix in6_multi double free This is actually several different bugs: - The code is not designed to handle inpcb deletion after interface deletion - add reference for inpcb membership - The multicast address has to be removed from interface lists when the refcount goes to zero OR when the interface goes away - decouple list disconnect from refcount (v6 only for now) - ifmultiaddr can exist past being on interface lists - add flag for tracking whether or not it's enqueued - deferring freeing moptions makes the incpb cleanup code simpler but opens the door wider still to races - call inp_gcmoptions synchronously after dropping the the inpcb lock Fundamentally multicast needs a rewrite - but keep applying band-aids for now. Tested by: kp Reported by: novel, kp, lwhsu
Use the new VNET_DEFINE_STATIC macro when we are defining static VNET variables. Reviewed by: bz Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D16147
Deduplicate the code. Add generic function if_tunnel_check_nesting() that does check for allowed nesting level for tunneling interfaces and also does loop detection. Use it in gif(4), gre(4) and me(4) interfaces. Differential Revision: https://reviews.freebsd.org/D16162
struct ifmediareq *ifmrp is only used in the COMPAT_FREEBSD32 parts of ifioctl(). Move it inside the proper #ifdef. This was throwing a valid "Assigned but unused" warning with gcc. Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16063
epoch(9): allow preemptible epochs to compose - Add tracker argument to preemptible epochs - Inline epoch read path in kernel and tied modules - Change in_epoch to take an epoch as argument - Simplify tfb_tcp_do_segment to not take a ti_locked argument, there's no longer any benefit to dropping the pcbinfo lock and trying to do so just adds an error prone branchfest to these functions - Remove cases of same function recursion on the epoch as recursing is no longer free. - Remove the the TAILQ_ENTRY and epoch_section from struct thread as the tracker field is now stack or heap allocated as appropriate. Tested by: pho and Limelight Networks Reviewed by: kbowling at llnw dot com Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16066
if_setlladdr: don't call ioctl in epoch context PR: 228612 Reported by: markj
route: fix missed ref adds - ensure that we bump the ifa ref whenever we add a reference - defer freeing epoch protected references until after the if_purgaddrs loop
if_delgroups: add missed unlock introduced by r334118
UDP: further performance improvements on tx Cumulative throughput while running 64 netperf -H $DUT -t UDP_STREAM -- -m 1 on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps Single stream throughput increases from 910kpps to 1.18Mpps Baseline: https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg - Protect read access to global ifnet list with epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg - Protect short lived ifaddr references with epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg - Convert if_afdata read lock path to epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg A fix for the inpcbhash contention is pending sufficient time on a canary at LLNW. Reviewed by: gallatin Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15409
net: fix uninitialized variable warning
ifnet: Replace if_addr_lock rwlock with epoch + mutex Run on LLNW canaries and tested by pho@ gallatin: Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5 based ConnectX 4-LX NIC, I see an almost 12% improvement in received packet rate, and a larger improvement in bytes delivered all the way to userspace. When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1, I see, using nstat -I mce0 1 before the patch: InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree 4.98 0.00 4.42 0.00 4235592 33 83.80 4720653 2149771 1235 247.32 4.73 0.00 4.20 0.00 4025260 33 82.99 4724900 2139833 1204 247.32 4.72 0.00 4.20 0.00 4035252 33 82.14 4719162 2132023 1264 247.32 4.71 0.00 4.21 0.00 4073206 33 83.68 4744973 2123317 1347 247.32 4.72 0.00 4.21 0.00 4061118 33 80.82 4713615 2188091 1490 247.32 4.72 0.00 4.21 0.00 4051675 33 85.29 4727399 2109011 1205 247.32 4.73 0.00 4.21 0.00 4039056 33 84.65 4724735 2102603 1053 247.32 After the patch InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree 5.43 0.00 4.20 0.00 3313143 33 84.96 5434214 1900162 2656 245.51 5.43 0.00 4.20 0.00 3308527 33 85.24 5439695 1809382 2521 245.51 5.42 0.00 4.19 0.00 3316778 33 87.54 5416028 1805835 2256 245.51 5.42 0.00 4.19 0.00 3317673 33 90.44 5426044 1763056 2332 245.51 5.42 0.00 4.19 0.00 3314839 33 88.11 5435732 1792218 2499 245.52 5.44 0.00 4.19 0.00 3293228 33 91.84 5426301 1668597 2121 245.52 Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch Reviewed by: gallatin Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15366
epoch(9): allocate net epochs earlier in boot
epoch: move epoch variables to read mostly section
epoch(9): Make epochs non-preemptible by default There are risks associated with waiting on a preemptible epoch section. Change the name to make them not be the default and document the issue under CAVEATS. Reported by: markj
epoch: add non-preemptible "critical" variant adds: - epoch_enter_critical() - can be called inside a different epoch, starts a section that will acquire any MTX_DEF mutexes or do anything that might sleep. - epoch_exit_critical() - corresponding exit call - epoch_wait_critical() - wait variant that is guaranteed that any threads in a section are running. - epoch_global_critical - an epoch_wait_critical safe epoch instance Requested by: markj Approved by: sbruno
Allow different bridge types to coexist if_bridge has a lot of limitations that make it scale poorly to higher data rates. In my projects/VPC branch I leverage the bridge interface between layers for my high speed soft switch as well as for purposes of stacking in general. Reviewed by: sbruno@ Approved by: sbruno@ Differential Revision: https://reviews.freebsd.org/D15344
Slight cleanup of interface event logging. Make if_printf() use vlog() instead of vprintf(). This means it can no longer return the number of characters printed, as it used to, but every single call to if_printf() in the entire kernel ignores the return value anyway; just return 0 so we don't have to change the prototype. Consistently use if_printf() throughout sys/net/if.c, instead of a mixture of if_printf() and log(). In ifa_maintain_loopback_route(), don't needlessly log an error if we either failed to add a route because it already existed or failed to remove one because it did not. We still return an error code, though. MFC after: 1 week
Allocate epoch for networking at startup Additionally add CK to include paths for modules Approved by: sbruno@
r333175 introduced deferred deletion of multicast addresses in order to permit the driver ioctl to sleep on commands to the NIC when updating multicast filters. More generally this permitted driver's to use an sx as a softc lock. Unfortunately this change introduced a race whereby a a multicast update would still be queued for deletion when ifconfig deleted the interface thus calling down in to _purgemaddrs and synchronously deleting _all_ of the multicast addresses on the interface. Synchronously remove all external references to a multicast address before enqueueing for delete. Reported by: lwhsu Approved by: sbruno
Import the netdump client code. This is a component of a system which lets the kernel dump core to a remote host after a panic, rather than to a local storage device. The server component is available in the ports tree. netdump is particularly useful on diskless systems. The netdump(4) man page contains some details describing the protocol. Support for configuring netdump will be added to dumpon(8) in a future commit. To use netdump, the kernel must have been compiled with the NETDUMP option. The initial revision of netdump was written by Darrell Anderson and was integrated into Sandvine's OS, from which this version was derived. Reviewed by: bdrewery, cem (earlier versions), julian, sbruno MFC after: 1 month X-MFC note: use a spare field in struct ifnet Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15253
Separate list manipulation locking from state change in multicast Multicast incorrectly calls in to drivers with a mutex held causing drivers to have to go through all manner of contortions to use a non sleepable lock. Serialize multicast updates instead. Submitted by: mmacy <mmacy@mattmacy.io> Reviewed by: shurd, sbruno Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14969
Translate 32-bit ifmedia requests into native ones. We use transformation rather than accessors as virtually ever driver implements SIOCGIFMEDIA and all would have to be touched. Keep the code readable by always performing copies and (possiably no-op) transforms. Reviewed by: jhb, kib Obtained from: CheriBSD MFC after: 1 week Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14996
Remove support for the Arcnet protocol. While Arcnet has some continued deployment in industrial controls, the lack of drivers for any of the PCI, USB, or PCIe NICs on the market suggests such users aren't running FreeBSD. Evidence in the PR database suggests that the cm(4) driver (our sole Arcnet NIC) was broken in 5.0 and has not worked since. PR: 182297 Reviewed by: jhibbits, vangyzen Relnotes: yes Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15057
Remove support for FDDI networks. Defines in net/if_media.h remain in case code copied from ifconfig is in use elsewere (supporting non-existant media type is harmless). Reviewed by: kib, jhb Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15017
Remove the thread argument from ifr_buffer_*() accessors. They are always used in a context where curthread is the correct thread. This makes them more similar to the ifr_data_get_ptr() accessor.
ifconf(): correct handling of sockaddrs smaller than struct sockaddr. Portable programs that use SIOCGIFCONF (e.g. traceroute) assume that each pseudo ifreq is of length MAX(sizeof(struct ifreq), sizeof(ifr_name) + ifr_addr.sa_len). For short sockaddrs we copied too much from the source sockaddr resulting in a heap leak. I believe only one such sockaddr exists (struct sockaddr_sco which is 8 bytes) and it is unclear if such sockaddrs end up on interfaces in practice. If it did, the result would be an 8 byte heap leak on current architectures. admbugs: 869 Reviewed by: kib Obtained from: CheriBSD MFC after: 3 days Security: kernel heap leak Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14981
Move most of the contents of opt_compat.h to opt_global.h. opt_compat.h is mentioned in nearly 180 files. In-progress network driver compabibility improvements may add over 100 more so this is closer to "just about everywhere" than "only some files" per the guidance in sys/conf/options. Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h is created on all architectures. Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the set of compiled files. Reviewed by: kib, cem, jhb, jtl Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14941
Add 32-bit compat for ioctls that take struct ifgroupreq. Use an accessor to access ifgr_group and ifgr_groups. Use an macro CASE_IOC_IFGROUPREQ(cmd) in place of case statements such as "case SIOCAIFGROUP:". This avoids poluting the switch statements with large numbers of #ifdefs. Reviewed by: kib Obtained from: CheriBSD MFC after: 1 week Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14960
ifconf(): Always zero the whole struct ifreq. The previous split of zeroing ifr_name and ifr_addr seperately is safe on current architectures, but would be unsafe if pointers were larger than 8 bytes. Combining the zeroing adds no real cost (a few instructions) and makes the security property easier to verify. Reviewed by: kib, emaste Obtained from: CheriBSD MFC after: 3 days Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14912
Document and enforce assumptions about struct (in6_)ifreq. - The two types must be type-punnable for shared members of ifr_ifru. This allows compatibility accessors to be shared. - There must be no padding gap between ifr_name and ifr_ifru. This is assumed in tcpdump's use of SIOCGIFFLAGS output which attempts to be broadly portable. This is true for all current architectures, but very large (256-bit) fat-pointers could violate this invariant. Reviewed by: kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14910
Use an accessor function to access ifr_data. This fixes 32-bit compat (no ioctl command defintions are required as struct ifreq is the same size). This is believed to be sufficent to fully support ifconfig on 32-bit systems. Reviewed by: kib Obtained from: CheriBSD MFC after: 1 week Relnotes: yes Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14900
Remove infrastructure for token-ring networks. Reviewed by: cem, imp, jhb, jmallett Relnotes: yes Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14875
Fix a whitespace bug missed in refactoring prior to r331641. MFC with: r331641
Fix access to ifru_buffer on freebsd32. Make all kernel accesses to ifru_buffer go via access functions which take the process ABI into account and use an appropriate union to access members in the correct place in struct ifreq. Reviewed by: kib Obtained from: CheriBSD MFC after: 1 week Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14846
Allow to specify PCP on packets not belonging to any VLAN. According to 802.1Q-2014, VLAN tagged packets with VLAN id 0 should be considered as untagged, and only PCP and DEI values from the VLAN tag are meaningful. See for instance https://www.cisco.com/c/en/us/td/docs/switches/connectedgrid/cg-switch-sw-master/software/configuration/guide/vlan0/b_vlan_0.html. Make it possible to specify PCP value for outgoing packets on an ethernet interface. When PCP is supplied, the tag is appended, VLAN id set to 0, and PCP is filled by the supplied value. The code to do VLAN tag encapsulation is refactored from the if_vlan.c and moved into if_ethersubr.c. Drivers might have issues with filtering VID 0 packets on receive. This bug should be fixed for each driver. Reviewed by: ae (previous version), hselasky, melifaro Sponsored by: Mellanox Technologies MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D14702
This form allows you to request diffs between any two revisions of this file. For each of the two "sides" of the diff, enter a numeric revision.
| ViewVC Help | |
| Powered by ViewVC 1.1.27 |