Log of /head/sys/net/if.c

Revision 368238
Modified Tue Dec 1 16:34:43 2020 UTC (3 years, 7 months ago) by kp
File length: 108422 byte(s)
Diff to previous 368237
net: Revert vnet/epair cleanup race mitigation

Revert the mitigation code for the vnet/epair cleanup race (done in r365457).
r368237 introduced a more reliable fix.

MFC after:	2 weeks
Sponsored by:	Modirum MDPay


Revision 368237
Modified Tue Dec 1 16:23:59 2020 UTC (3 years, 7 months ago) by kp
File length: 108511 byte(s)
Diff to previous 368124
if: Fix panic when destroying vnet and epair simultaneously

When destroying a vnet and an epair (with one end in the vnet) we often
panicked. This was the result of the destruction of the epair, which destroys
both ends simultaneously, happening while vnet_if_return() was moving the
struct ifnet to its home vnet. This can result in a freed ifnet being re-added
to the home vnet V_ifnet list. That in turn panics the next time the ifnet is
used.

Prevent this race by ensuring that vnet_if_return() cannot run at the same time
as if_detach() or epair_clone_destroy().

PR:		238870, 234985, 244703, 250870
MFC after:	2 weeks
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D27378


Revision 368124
Modified Sat Nov 28 12:12:51 2020 UTC (3 years, 7 months ago) by kib
File length: 108188 byte(s)
Diff to previous 368025
Make MAXPHYS tunable.  Bump MAXPHYS to 1M.

Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can also be adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible.  Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allows several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*).  Overall, we save a significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to the desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except the one place that initializes maxphys.  Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight.  Some drivers, which use MAXPHYS to size embedded structures,
get a private MAXPHYS-like constant; their conversion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, were either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by:	imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27225
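
The "+ 1" for pbufs follows from page arithmetic: a buffer of maxphys bytes that does not start on a page boundary touches one more page than an aligned one. The following is a minimal userland sketch of that arithmetic, not kernel code. It assumes 4 KiB pages and the 1M default; SK_PAGE_SIZE, SK_MAXPHYS, atop_bytes() and pages_spanned() are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SIZE 4096u             /* assumed page size */
#define SK_MAXPHYS   (1024u * 1024u)   /* assumed value: the 1M default */

/* Bytes-to-pages conversion, in the spirit of the kernel's atop(). */
static size_t
atop_bytes(size_t bytes)
{
	return (bytes / SK_PAGE_SIZE);
}

/* Hypothetical helper: number of pages touched by [addr, addr + len). */
static size_t
pages_spanned(uintptr_t addr, size_t len)
{
	uintptr_t first, last;

	assert(len > 0);
	first = addr / SK_PAGE_SIZE;
	last = (addr + len - 1) / SK_PAGE_SIZE;
	return ((size_t)(last - first + 1));
}
```

Under these assumptions an aligned maxphys-sized buffer spans atop_bytes(SK_MAXPHYS) pages, while the same buffer starting 512 bytes into a page spans one more, which is the case the extra b_pages[] slot accommodates.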


Revision 368025
Modified Wed Nov 25 17:15:24 2020 UTC (3 years, 7 months ago) by kp
File length: 108188 byte(s)
Diff to previous 368020
if: Fix non-VIMAGE build

if_link_ifnet() and if_unlink_ifnet() are needed even when VIMAGE is not
enabled.

MFC after:	2 weeks
Sponsored by:	Modirum MDPay


Revision 368020
Modified Wed Nov 25 15:07:22 2020 UTC (3 years, 7 months ago) by kp
File length: 108167 byte(s)
Diff to previous 368015
if: Protect V_ifnet in vnet_if_return()

When we terminate a vnet (i.e. jail) we move interfaces back to their home
vnet. We need to protect our access to the V_ifnet CK_LIST.

We could enter NET_EPOCH, but if_detach_internal() (called from if_vmove())
waits for net epoch callback completion. That's not possible from NET_EPOCH.
Instead, we take the IFNET_WLOCK, build a list of the interfaces that need to
move and, once we've released the lock, move them back to their home vnet.

We cannot hold the IFNET_WLOCK() during if_vmove(), because that results in a
LOR between ifnet_sx, in_multi_sx and iflib ctx lock.

Separate out moving the ifp into or out of V_ifnet, so we can hold the lock as
we do the list manipulation, but do not hold it as we if_vmove().

Reviewed by:	melifaro
MFC after:	2 weeks
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D27279
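
The pattern described in this commit (unlink and collect under the lock, do the expensive move after dropping it) can be sketched in userland C. This is an illustration under stated assumptions, not the code from if.c: a pthread mutex stands in for IFNET_WLOCK, a fixed array stands in for the V_ifnet CK_LIST, and every name ending in _sketch is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_IFS 8

struct ifnet_sketch {
	int home_vnet;	/* vnet the interface originally belongs to */
	int cur_vnet;	/* vnet the interface currently lives in */
};

/* Stand-ins for IFNET_WLOCK and the V_ifnet list. */
static pthread_mutex_t ifnet_lock = PTHREAD_MUTEX_INITIALIZER;
static struct ifnet_sketch *v_ifnet[MAX_IFS];
static size_t v_ifnet_count;

/* Stand-in for if_vmove(): may take other locks, so it runs unlocked. */
static void
vmove_sketch(struct ifnet_sketch *ifp)
{
	ifp->cur_vnet = ifp->home_vnet;
}

/* Return borrowed interfaces of a dying vnet; returns how many moved. */
static size_t
vnet_if_return_sketch(int vnet)
{
	struct ifnet_sketch *pending[MAX_IFS];
	size_t i, keep = 0, npending = 0;

	/* Phase 1: list manipulation only, with the lock held. */
	pthread_mutex_lock(&ifnet_lock);
	for (i = 0; i < v_ifnet_count; i++) {
		struct ifnet_sketch *ifp = v_ifnet[i];

		if (ifp->cur_vnet == vnet && ifp->home_vnet != vnet)
			pending[npending++] = ifp;	/* unlink, remember */
		else
			v_ifnet[keep++] = ifp;		/* stays listed */
	}
	v_ifnet_count = keep;
	pthread_mutex_unlock(&ifnet_lock);

	/* Phase 2: the actual moves, with the lock released. */
	for (i = 0; i < npending; i++)
		vmove_sketch(pending[i]);
	return (npending);
}
```

Splitting the work this way keeps the list consistent for other lock holders while avoiding the lock order reversal that holding the lock across the move would cause.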


Revision 368015
Modified Wed Nov 25 10:56:38 2020 UTC (3 years, 7 months ago) by kp
File length: 107261 byte(s)
Diff to previous 367484
if: Remove ifnet_rwlock

It no longer serves any purpose, as evidenced by the fact that we never take it
without ifnet_sxlock.

Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D27278


Revision 367484
Modified Sun Nov 8 11:12:00 2020 UTC (3 years, 7 months ago) by melifaro
File length: 107616 byte(s)
Diff to previous 366230
Move all ifaddr route creation business logic to net/route/route_ifaddr.c

Differential Revision:	https://reviews.freebsd.org/D26318


Revision 366230
Modified Mon Sep 28 16:54:39 2020 UTC (3 years, 9 months ago) by emaste
File length: 109307 byte(s)
Diff to previous 365457
add SIOCGIFDATA ioctl

For interfaces that do not support SIOCGIFMEDIA (of which there are
quite a few) the only fallback is to query the interface for
if_data->ifi_link_state.  While it's possible to get at if_data for an
interface via getifaddrs(3) or sysctl, both are heavyweight mechanisms.

SIOCGIFDATA is a simple ioctl to retrieve this fast with very little
resource use in comparison.  This implementation mirrors that of other
similar ioctls in FreeBSD.

Submitted by:	Roy Marples <roy@marples.name>
Reviewed by:	markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D26538


Revision 365457
Modified Tue Sep 8 14:54:10 2020 UTC (3 years, 9 months ago) by kp
File length: 109076 byte(s)
Diff to previous 365071
net: mitigate vnet / epair cleanup races

There's a race where dying vnets move their interfaces back to their original
vnet, and if_epair cleanup (where deleting one interface also deletes the other
end of the epair). This is commonly triggered by the pf tests, but also by
cleanup of vnet jails.

As we've not yet been able to fix the root cause of the issue, work around the
panic by not dereferencing a NULL softc in epair_qflush() and by not
re-attaching DYING interfaces.

This isn't a full fix, but makes a very common panic far less likely.

PR:		244703, 238870
Reviewed by:	lutz_donnerhacke.de
MFC after:	4 days
Differential Revision:	https://reviews.freebsd.org/D26324


Revision 365071
Modified Tue Sep 1 21:19:14 2020 UTC (3 years, 10 months ago) by mjg
File length: 108987 byte(s)
Diff to previous 363834
net: clean up empty lines in .c and .h files


Revision 363834
Modified Tue Aug 4 13:58:36 2020 UTC (3 years, 10 months ago) by markj
File length: 108999 byte(s)
Diff to previous 363403
Remove free_domain() and uma_zfree_domain().

These functions were introduced before UMA started ensuring that freed
memory gets placed in domain-local caches.  They no longer serve any
purpose since UMA now provides their functionality by default.  Remove
them to simplify the kernel memory allocator interfaces a bit.

Reviewed by:	cem, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25937


Revision 363403
Modified Tue Jul 21 19:56:13 2020 UTC (3 years, 11 months ago) by melifaro
File length: 109073 byte(s)
Diff to previous 363320
Transition from rtrequest1_fib() to rib_action().

Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib,
in6_rtrequest, rtrequest_fib> and their uses and switch to
rib_action(). This is part of the new routing KPI.

Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision: https://reviews.freebsd.org/D25546


Revision 363320
Modified Sun Jul 19 10:53:15 2020 UTC (3 years, 11 months ago) by melifaro
File length: 109020 byte(s)
Diff to previous 363319
Temporarily revert r363319 to unbreak the build.

Reported by:	CI
Pointy hat to: melifaro


Revision 363319
Modified Sun Jul 19 09:29:27 2020 UTC (3 years, 11 months ago) by melifaro
File length: 109073 byte(s)
Diff to previous 361409
Transition from rtrequest1_fib() to rib_action().

Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib,
in6_rtrequest, rtrequest_fib> and their uses and switch to
rib_action(). This is part of the new routing KPI.

Submitted by:	Neel Chauhan <neel AT neelc DOT org>
Differential Revision:	https://reviews.freebsd.org/D25546


Revision 361409
Modified Sat May 23 10:21:02 2020 UTC (4 years, 1 month ago) by melifaro
File length: 109020 byte(s)
Diff to previous 358592
Use epoch(9) for rtentries to simplify control plane operations.

Currently the only reason for refcounting rtentries is the need to report
the rtable operation details immediately after the execution.
Delaying rtentry reclamation allows us to stop refcounting and simplify the code.
Additionally, this change allows us to reimplement rib_lookup_info(), which
is used by some of the customers to get the matching prefix along
with nexthops, in a more efficient way.

The change keeps the per-vnet rtzone uma zone. It adds the nh_vnet field to
nhop_priv to be able to reliably set curvnet even during vnet teardown.
The rest of the reference counting code will be removed in D24867.

Differential Revision:	https://reviews.freebsd.org/D24866


Revision 358592
Modified Tue Mar 3 18:05:11 2020 UTC (4 years, 4 months ago) by brooks
File length: 109024 byte(s)
Diff to previous 358333
Expose ifr_buffer_get_(buffer|length) outside if.c.

This is a preparatory commit for D23933.

Reviewed by:	jhb


Revision 358333
Modified Wed Feb 26 14:26:36 2020 UTC (4 years, 4 months ago) by kaktus
File length: 109038 byte(s)
Diff to previous 358301
Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is a non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT.

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718


Revision 358301
Modified Mon Feb 24 21:07:30 2020 UTC (4 years, 4 months ago) by glebius
File length: 108996 byte(s)
Diff to previous 358020
Although most of the NIC drivers are epoch ready, due to peer pressure
switch over to opt-in instead of opt-out for epoch.

Instead of IFF_NEEDSEPOCH, provide IFF_KNOWSEPOCH. If driver marks
itself with IFF_KNOWSEPOCH, then ether_input() would not enter epoch
when processing its packets.

For now this creates recursive epoch entry in >90% of network
drivers, but it guarantees the safety of the transition.

Mark several tested drivers as IFF_KNOWSEPOCH.

Reviewed by:		hselasky, jeff, bz, gallatin
Differential Revision:	https://reviews.freebsd.org/D23674
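
The opt-in logic can be modelled in a few lines of userland C. This is a sketch under stated assumptions, not the actual ether_input() code: the epoch is reduced to a nesting counter, the flag value is illustrative only, and all names ending in _sketch are hypothetical.

```c
#include <assert.h>

#define IFF_KNOWSEPOCH_SKETCH 0x20	/* illustrative bit value only */

struct ifnet_sketch {
	int if_flags;
};

static int epoch_depth;		/* models the epoch nesting level */
static int max_depth_seen;	/* deepest nesting observed so far */

static void
epoch_enter_sketch(void)
{
	if (++epoch_depth > max_depth_seen)
		max_depth_seen = epoch_depth;
}

static void
epoch_exit_sketch(void)
{
	epoch_depth--;
}

/* Input path: enters the epoch unless the driver has opted in. */
static void
ether_input_sketch(struct ifnet_sketch *ifp)
{
	int need_epoch = (ifp->if_flags & IFF_KNOWSEPOCH_SKETCH) == 0;

	if (need_epoch)
		epoch_enter_sketch();
	assert(epoch_depth > 0);	/* packets are processed in the epoch */
	if (need_epoch)
		epoch_exit_sketch();
}

/* An epoch-ready driver receive path that enters the epoch itself. */
static void
driver_rx_sketch(struct ifnet_sketch *ifp)
{
	epoch_enter_sketch();
	ether_input_sketch(ifp);
	epoch_exit_sketch();
}
```

In this model an epoch-ready driver that has not set the flag reaches nesting depth 2 (the recursive entrance the commit message mentions), while a driver marked with the flag stays at depth 1.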


Revision 358020
Modified Mon Feb 17 11:08:50 2020 UTC (4 years, 4 months ago) by bz
File length: 109096 byte(s)
Diff to previous 357640
Partially revert VNET change and expand VNET structure.

Revert parts of r353274 replacing vnet_state with a shutdown flag.

Not having the state flag for the current SI_SUB_* makes it harder to debug
kernel or module panics related to VNET bringup or teardown.
Not having the state also does not allow us to check for other dependency
levels between components, e.g. for moving interfaces.

Expand the VNET structure with the new boolean flag indicating that we are
doing a shutdown of a given vnet and update the vnet magic cookie for the
change.

Update libkvm to compile with a bool in the kernel struct.

Bump __FreeBSD_version for (external) module builds to more easily detect
the change.

Reviewed by:	hselasky
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D23097


Revision 357640
Modified Thu Feb 6 20:47:50 2020 UTC (4 years, 4 months ago) by jeff
File length: 108792 byte(s)
Diff to previous 356863
Temporarily force IFF_NEEDSEPOCH until drivers have been resolved.

Recent network epoch changes have left some drivers unexpectedly broken
and there is not yet a consensus on the correct fix.  This patch has
a minor performance impact until we can agree on the correct path
forward.

Reviewed by:	core, network, imp, glebius, hselasky
Differential Revision:	https://reviews.freebsd.org/D23515


Revision 356863
Modified Sat Jan 18 04:48:05 2020 UTC (4 years, 5 months ago) by eugen
File length: 108709 byte(s)
Diff to previous 356755
ifa_maintain_loopback_route: adjust debugging output

Correction after r333476:

- write this as LOG_DEBUG again instead of LOG_INFO;
- get back function name into the message;
- error may be ESRCH if an address is removed in process (by carp f.e.),
not only ENOENT;
- expression complexity grows, so try making it more readable.

MFC after:	1 week


Revision 356755
Modified Wed Jan 15 06:05:20 2020 UTC (4 years, 5 months ago) by glebius
File length: 108626 byte(s)
Diff to previous 356753
Introduce NET_EPOCH_CALL() macro and use it everywhere where we free
data based on the network epoch.   The macro reverses the argument
order of epoch_call(9) - first function, then its argument. NFC
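
A wrapper macro of this kind can be sketched as follows. This is a userland mock, not the kernel definition: it assumes an epoch_call()-style function that takes the context before the callback (as the commit text implies for that time), it runs the callback immediately instead of deferring it, and every name ending in _sketch is hypothetical.

```c
#include <assert.h>

struct epoch_context_sketch {
	int freed;
};

typedef void epoch_cb_sketch(struct epoch_context_sketch *);

/*
 * Mock of an epoch_call()-style API: context first, callback second.
 * The real facility defers the callback; this mock runs it at once.
 */
static void
epoch_call_sketch(int epoch, struct epoch_context_sketch *ctx,
    epoch_cb_sketch *cb)
{
	(void)epoch;
	cb(ctx);
}

static int net_epoch_sketch = 1;	/* stands in for the network epoch */

/* Wrapper: function first, then its argument; the epoch is implied. */
#define NET_EPOCH_CALL_SKETCH(f, c) \
	epoch_call_sketch(net_epoch_sketch, (c), (f))

/* Example callback that "frees" the object embedding the context. */
static void
destroy_cb(struct epoch_context_sketch *ctx)
{
	ctx->freed = 1;
}
```

A call site then reads NET_EPOCH_CALL_SKETCH(destroy_cb, &ctx), with the function first and no explicit epoch argument.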


Revision 356753
Modified Wed Jan 15 05:45:27 2020 UTC (4 years, 5 months ago) by glebius
File length: 108686 byte(s)
Diff to previous 356748
Mechanically substitute assertion of in_epoch(net_epoch_preempt) to
NET_EPOCH_ASSERT(). NFC


Revision 356748
Modified Wed Jan 15 03:34:21 2020 UTC (4 years, 5 months ago) by glebius
File length: 108750 byte(s)
Diff to previous 356536
- Move the global network epoch definition to epoch.h, as more different
  subsystems tend to need to know about it, and including if_var.h is
  huge header pollution for them.  Polluting possible non-network
  users with a single symbol seems a much lesser evil.
- Remove the non-preemptible network epoch.  Not used yet, and unlikely
  to get used in the near future.


Revision 356536
Modified Thu Jan 9 03:52:50 2020 UTC (4 years, 5 months ago) by kevans
File length: 108819 byte(s)
Diff to previous 356268
if_vmove: return proper error status

if_vmove can fail if it lost a race and the vnet's already been moved. The
callers (and their callers) can generally cope with this, but right now
success is assumed. Plumb out the ENOENT from if_detach_internal if it
happens so that the error's properly reported to userland.

Reviewed by:	bz, kp
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D22780
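
The error plumbing amounts to returning the inner function's status instead of discarding it. A minimal userland sketch, assuming a simplified interface with an "attached" flag; the _sketch functions are hypothetical stand-ins for if_detach_internal() and if_vmove(), not their real bodies.

```c
#include <assert.h>
#include <errno.h>

struct ifnet_sketch {
	int attached;	/* still linked into its current vnet? */
	int vnet;	/* identifier of the current vnet */
};

/* Stand-in for if_detach_internal(): fails if someone else got there first. */
static int
if_detach_internal_sketch(struct ifnet_sketch *ifp)
{
	if (!ifp->attached)
		return (ENOENT);	/* lost the race */
	ifp->attached = 0;
	return (0);
}

/* Stand-in for if_vmove(): propagate the error rather than assume success. */
static int
if_vmove_sketch(struct ifnet_sketch *ifp, int new_vnet)
{
	int error;

	error = if_detach_internal_sketch(ifp);
	if (error != 0)
		return (error);
	ifp->vnet = new_vnet;
	ifp->attached = 1;
	return (0);
}
```

Callers can then pass the ENOENT up to userland instead of reporting success for a move that never happened.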


Revision 356268
Modified Thu Jan 2 09:08:45 2020 UTC (4 years, 6 months ago) by melifaro
File length: 108697 byte(s)
Diff to previous 355942
Plug loopback ifaddr refcount leak.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D22980


Revision 355942
Modified Fri Dec 20 20:15:34 2019 UTC (4 years, 6 months ago) by markj
File length: 108609 byte(s)
Diff to previous 355938
Deduplicate code between if_delgroup() and if_delgroups().

Fix some style in if_addgroup().  No functional change intended.

Reviewed by:	hselasky
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22892


Revision 355938
Modified Fri Dec 20 17:21:57 2019 UTC (4 years, 6 months ago) by markj
File length: 109162 byte(s)
Diff to previous 355070
Fix a memory leak in if_delgroups() introduced in r334118.

PR:		242712
Submitted by:	ghuckriede@blackberry.com
MFC after:	3 days


Revision 355070
Modified Sun Nov 24 23:21:47 2019 UTC (4 years, 7 months ago) by bz
File length: 109140 byte(s)
Diff to previous 355024
Allow kernel to compile without BPF.

r297816 added some bpf magic for VIMAGE unconditionally which no longer
allows kernels to compile without bpf (but with other networking).
Add the missing ifdef checks and allow a kernel to compile without bpf
again.

PR:		242136
Reported by:	dave mischler.com
MFC after:	2 weeks


Revision 355024
Modified Fri Nov 22 23:23:40 2019 UTC (4 years, 7 months ago) by cem
File length: 109053 byte(s)
Diff to previous 354685
Add explicit SI_SUB_EPOCH

Add explicit SI_SUB_EPOCH, after SI_SUB_TASKQ and before SI_SUB_SMP
(EARLY_AP_STARTUP).  Rename existing "SI_SUB_TASKQ + 1" to SI_SUB_EPOCH.

epoch(9) consumers cannot epoch_alloc() before SI_SUB_EPOCH:SI_ORDER_SECOND,
but likely should allocate before SI_SUB_SMP.  Prior to this change,
consumers (well, epoch itself, and net/if.c) just open-coded the
SI_SUB_TASKQ + 1 order to match epoch.c, but this was fragile.

Reviewed by:	mmacy
Differential Revision:	https://reviews.freebsd.org/D22503


Revision 354685
Modified Wed Nov 13 16:28:53 2019 UTC (4 years, 7 months ago) by glebius
File length: 109061 byte(s)
Diff to previous 354149
In if_siocaddmulti() enter VNET.

Reported & tested by:	garga


Revision 354149
Modified Tue Oct 29 17:36:06 2019 UTC (4 years, 8 months ago) by glebius
File length: 109013 byte(s)
Diff to previous 353868
There is a long standing problem with multicast programming for NICs
and IPv6.  With IPv6 we may call if_addmulti() in context of processing
of an incoming packet.  Usually this is interrupt context.  While most
of the NIC drivers are able to reprogram multicast filters without
sleeping, some of them can't.  An example is e1000 family of drivers.
With iflib conversion the problem was somewhat hidden.  Iflib processes
packets in private taskqueue, so going to sleep doesn't trigger an
assertion.  However, the sleep would block operation of the driver and
following incoming packets would fill the ring and eventually would
start being dropped.  Enabling epoch for the full time of a packet
processing again started to trigger assertions for e1000.

Fix this problem once and for all using a general taskqueue to call
the if_ioctl() method in all cases when if_addmulti() is called in a
non-sleeping context.  Note that nobody cares about the returned value.

Reviewed by:	hselasky, kib
Differential Revision:	  https://reviews.freebsd.org/D22154
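
The deferral idea can be sketched with a tiny work queue in userland C. This is an illustration under stated assumptions, not the taskqueue(9) API or the code in if.c: the queue is a fixed array, "can_sleep" is a hypothetical parameter standing in for the kernel's knowledge of its context, and all names ending in _sketch are made up.

```c
#include <assert.h>
#include <stddef.h>

#define TASKQ_MAX 16

typedef void task_fn_sketch(void *);

struct task_sketch {
	task_fn_sketch *fn;
	void *arg;
};

static struct task_sketch taskq[TASKQ_MAX];
static size_t taskq_len;

/* Queue work for later; cheap and never sleeps. */
static int
taskq_enqueue_sketch(task_fn_sketch *fn, void *arg)
{
	if (taskq_len == TASKQ_MAX)
		return (-1);
	taskq[taskq_len].fn = fn;
	taskq[taskq_len].arg = arg;
	taskq_len++;
	return (0);
}

/* Runs later, from a context that is allowed to sleep. */
static void
taskq_run_sketch(void)
{
	size_t i;

	for (i = 0; i < taskq_len; i++)
		taskq[i].fn(taskq[i].arg);
	taskq_len = 0;
}

struct ifnet_sketch {
	int filter_updates;	/* times the hardware filter was reprogrammed */
};

/* Stand-in for the driver's multicast ioctl handler, which may sleep. */
static void
if_ioctl_addmulti_sketch(void *arg)
{
	((struct ifnet_sketch *)arg)->filter_updates++;
}

/* Stand-in for if_addmulti(): defer the ioctl when sleeping is not allowed. */
static void
if_addmulti_sketch(struct ifnet_sketch *ifp, int can_sleep)
{
	if (can_sleep)
		if_ioctl_addmulti_sketch(ifp);
	else
		(void)taskq_enqueue_sketch(if_ioctl_addmulti_sketch, ifp);
}
```

Because the ioctl's result is discarded in either branch, deferring it changes only when the filter is reprogrammed, not what the caller observes.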


Revision 353868
Modified Mon Oct 21 18:17:03 2019 UTC (4 years, 8 months ago) by glebius
File length: 108522 byte(s)
Diff to previous 353685
Remove obsoleted KPIs that were used to access interface address lists.


Revision 353685
Modified Thu Oct 17 16:23:03 2019 UTC (4 years, 8 months ago) by cem
File length: 110525 byte(s)
Diff to previous 353654
Split out a more generic debugnet(4) from netdump(4)

Debugnet is a simplistic and specialized panic- or debug-time reliable
datagram transport.  It can drive a single connection at a time and is
currently unidirectional (debug/panic machine transmit to remote server
only).

It is mostly a verbatim code lift from netdump(4).  Netdump(4) remains
the only consumer (until the rest of this patch series lands).

The INET-specific logic has been extracted somewhat more thoroughly than
previously in netdump(4), into debugnet_inet.c.  UDP-layer logic and up, as
much as possible as is protocol-independent, remains in debugnet.c.  The
separation is not perfect and future improvement is welcome.  Supporting
INET6 is a long-term goal.

Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to
'debugnet_' or 'dn_' -- sorry.  I thought keeping the netdump name on the
generic module would be more confusing than the refactoring.

The only functional change here is the mbuf allocation / tracking.  Instead
of initiating solely on netdump-configured interface(s) at dumpon(8)
configuration time, we watch for any debugnet-enabled NIC for link
activation and query it for mbuf parameters at that time.  If they exceed
the existing high-water mark allocation, we re-allocate and track the new
high-water mark.  Otherwise, we leave the pre-panic mbuf allocation alone.
In a future patch in this series, this will allow initiating netdump from
panic ddb(4) without pre-panic configuration.

No other functional change intended.

Reviewed by:	markj (earlier version)
Some discussion with:	emaste, jhb
Objection from:	marius
Differential Revision:	https://reviews.freebsd.org/D21421


Revision 353654
Modified Wed Oct 16 16:32:58 2019 UTC (4 years, 8 months ago) by glebius
File length: 110531 byte(s)
Diff to previous 353550
do_link_state_change() is executed in taskqueue context and in
general is allowed to sleep.  Don't enter the epoch for the
whole duration.  If some event handlers need the epoch, they
should handle that themselves.

Discussed with:	hselasky


Revision 353550
Modified Tue Oct 15 12:08:09 2019 UTC (4 years, 8 months ago) by hselasky
File length: 110600 byte(s)
Diff to previous 353549
The two functions ifnet_byindex() and ifnet_byindex_locked() are exactly the
same after the network stack was epochified. Merge the two into one function
and cleanup all uses of ifnet_byindex_locked().

While at it:
- Add branch prediction macros.
- Make sure the ifnet pointer is only dereferenced once,
  also when code optimisation is disabled.

Sponsored by:	Mellanox Technologies


Revision 353549
Modified Tue Oct 15 11:20:16 2019 UTC (4 years, 8 months ago) by hselasky
File length: 110672 byte(s)
Diff to previous 353522
Exclude the network link eventhandler from epochification after r353292.

This fixes the following assert when "options RATELIMIT" is used:
panic()
malloc()
sysctl_add_oid()
tcp_rl_ifnet_link()
do_link_state_change()
taskqueue_run_locked()

Sponsored by:	Mellanox Technologies


Revision 353522
Modified Mon Oct 14 21:18:37 2019 UTC (4 years, 8 months ago) by glebius
File length: 110672 byte(s)
Diff to previous 353420
if_delmulti() is never called without ifp argument, assert this instead
of doing a useless search through interfaces.


Revision 353420
Modified Thu Oct 10 23:44:56 2019 UTC (4 years, 8 months ago) by glebius
File length: 110898 byte(s)
Diff to previous 353419
Add two extra functions that basically give the count of addresses
on an interface.  Such functions could have been implemented on top of
the if_foreach_llm?addr(), but several drivers need counting,
so avoid copy-n-paste inside the drivers.


Revision 353419
Modified Thu Oct 10 23:42:55 2019 UTC (4 years, 8 months ago) by glebius
File length: 110327 byte(s)
Diff to previous 353314
Provide a new KPI for network drivers to access lists of interface
addresses.  The KPI reveals neither how addresses are stored nor
how access to them is synchronized, and it does not expose struct ifaddr
or struct ifmaddr.

Reviewed by:	gallatin, erj, hselasky, philip, stevek
Differential Revision:	https://reviews.freebsd.org/D21943
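
A callback-style iterator of this kind can be sketched in userland C. This is an illustration only, not the KPI itself: the storage layout, the callback shape (argument, address, running count, returning how many entries it consumed) and every name ending in _sketch are assumptions made for the example.

```c
#include <assert.h>
#include <string.h>

#define ADDR_LEN  6
#define MAX_ADDRS 4

/* In the sketch the layout is visible; a driver would treat it as opaque. */
struct ifnet_sketch {
	unsigned char addrs[MAX_ADDRS][ADDR_LEN];
	unsigned int naddrs;
};

/* Callback: gets the running count, returns how many entries it consumed. */
typedef unsigned int addr_cb_sketch(void *arg, const unsigned char *addr,
    unsigned int count);

/* Iterator: owns storage and synchronization, exposes only raw addresses. */
static unsigned int
if_foreach_addr_sketch(struct ifnet_sketch *ifp, addr_cb_sketch *cb,
    void *arg)
{
	unsigned int i, count = 0;

	/* A real implementation would take its lock or enter its epoch here. */
	for (i = 0; i < ifp->naddrs; i++)
		count += cb(arg, ifp->addrs[i], count);
	return (count);
}

/* Driver side: a two-slot hardware filter table. */
struct filter_table {
	unsigned char slots[2][ADDR_LEN];
};

static unsigned int
copy_to_filter(void *arg, const unsigned char *addr, unsigned int count)
{
	struct filter_table *ft = arg;

	if (count >= 2)
		return (0);	/* table full: skip this address */
	memcpy(ft->slots[count], addr, ADDR_LEN);
	return (1);
}
```

The driver never sees how the addresses are stored or protected; it only supplies a callback, and the return value tells it how many addresses it actually programmed.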


Revision 353314
Modified Tue Oct 8 17:55:45 2019 UTC (4 years, 8 months ago) by glebius
File length: 109325 byte(s)
Diff to previous 353312
Remove epoch assertion from if_setlladdr().  Originally this function was
protected by IF_ADDR_LOCK(), which was a mutex, so that two simultaneous
if_setlladdr() can't execute. Later it was switched to IF_ADDR_RLOCK(),
likely by mistake. Later it was switched to NET_EPOCH_ENTER(). Then I
incorrectly added NET_EPOCH_ASSERT() here.

In reality ifp->if_addr never goes away and never changes its length. So,
doing bcopy() in it is always "safe", meaning it won't dereference a wrong
pointer or write into someone else's memory. Of course doing two bcopy() in
parallel would result in a mess of two addresses, but net epoch doesn't
protect against that, and neither did IF_ADDR_RLOCK().

So for now, just remove the assertion and leave for later a proper fix.

Reported by:	markj


Revision 353312
Modified Tue Oct 8 16:45:56 2019 UTC (4 years, 8 months ago) by glebius
File length: 109428 byte(s)
Diff to previous 353292
In DIAGNOSTIC block of if_delmulti_ifma_flags() enter the network epoch.
This quickly plugs the regression from r353292. The locking of multicast
definitely needs a broader review today...

Reported by:	pho, dhw


Revision 353292
Modified Mon Oct 7 22:40:05 2019 UTC (4 years, 8 months ago) by glebius
File length: 109379 byte(s)
Diff to previous 353274
Widen NET_EPOCH coverage.

When epoch(9) was introduced to the network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, and so were the epoch covered areas.

However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point in minimising epoch covered
areas for the sake of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.

Not least is code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside the network epoch. This makes coding both the input and output
paths way easier.

On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().

This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.

Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.

This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.

Reviewed by:	gallatin, hselasky, cy, adrian, kristof
Differential Revision:	https://reviews.freebsd.org/D19111
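
The difference between epoch recursion and lock recursion can be shown with a userland model. This is a sketch under stated assumptions, not epoch(9): the epoch is reduced to a nesting counter, a default pthread mutex stands in for a non-recursive kernel lock, and all names ending in _sketch are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

static int epoch_depth;	/* models the epoch nesting level */

static void
net_epoch_enter_sketch(void)
{
	epoch_depth++;
}

static void
net_epoch_exit_sketch(void)
{
	assert(epoch_depth > 0);
	epoch_depth--;
}

/* Leaf function that walks configuration: it only asserts the epoch. */
static int
leaf_walk_sketch(void)
{
	assert(epoch_depth > 0);
	return (epoch_depth);
}

/* Configuration path that enters the epoch before calling the leaf. */
static int
ioctl_path_sketch(void)
{
	int depth;

	net_epoch_enter_sketch();
	depth = leaf_walk_sketch();
	net_epoch_exit_sketch();
	return (depth);
}

static pthread_mutex_t cfg_lock = PTHREAD_MUTEX_INITIALIZER;

/* The same recursion on a non-recursive mutex; returns the trylock result. */
static int
lock_recursion_sketch(void)
{
	int rc;

	pthread_mutex_lock(&cfg_lock);
	rc = pthread_mutex_trylock(&cfg_lock);	/* a blocking lock would hang */
	pthread_mutex_unlock(&cfg_lock);
	return (rc);
}
```

Entering the modelled epoch while already inside it only raises the nesting depth, whereas the second acquisition of the mutex is refused with EBUSY and a blocking acquisition in its place would deadlock.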


Revision 353274
Modified Mon Oct 7 14:15:41 2019 UTC (4 years, 8 months ago) by hselasky
File length: 109778 byte(s)
Diff to previous 352707
Factor out the VNET shutdown check into its own vnet structure field.
Remove the now obsolete vnet_state field. This greatly simplifies the
detection of VNET shutdown and avoids code duplication.

Discussed with:	bz@
MFC after:	1 week
Sponsored by:	Mellanox Technologies


Revision 352707
Modified Wed Sep 25 18:26:31 2019 UTC (4 years, 9 months ago) by glebius
File length: 110193 byte(s)
Diff to previous 352458
Add debugging facility EPOCH_TRACE that checks that epochs entered are
properly nested and warns about recursive entrances.  Unlike with locks,
there is nothing fundamentally wrong with such use; the intent of the tracer
is to help review complex epoch-protected code paths, and we mean the
network stack here.

Reviewed by:	hselasky
Sponsored by:	Netflix
Pull Request:	https://reviews.freebsd.org/D21610


Revision 352458
Modified Tue Sep 17 18:49:13 2019 UTC (4 years, 9 months ago) by kib
File length: 110167 byte(s)
Diff to previous 352246
Add SIOCGIFDOWNREASON.

The ioctl(2) is intended to provide more details about why the
link is down.

Eventually we might define a comprehensive list of codes for the
situations.  But the interface also allows the driver to provide a free-form
null-terminated ASCII string carrying arbitrary non-formalized
information.  A sample implementation exists for mlx5(4), where the
string is fetched from the firmware controlling the port.

Reviewed by:	hselasky, rrs
Sponsored by:	Mellanox Technologies
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21527


Revision 352246
Modified Thu Sep 12 15:36:48 2019 UTC (4 years, 9 months ago) by kevans
File length: 110142 byte(s)
Diff to previous 349507
SIOCSIFNAME: Do nothing if we're not actually changing

Instead of throwing EEXIST, just succeed if the name isn't actually
changing. We don't need to trigger departure or any of that because there's
no change from consumers' perspective.

PR:		240539
Reviewed by:	brooks
MFC after:	5 days
Differential Revision:	https://reviews.freebsd.org/D21618
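
The new behaviour is an early return placed before the "name already in use" check. A minimal userland sketch, assuming a simplified rename routine; if_rename_sketch(), its name_in_use parameter (standing in for an interface lookup by name), the length check and its EINVAL result are all hypothetical choices for the example.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define IFNAMSIZ_SKETCH 16

struct ifnet_sketch {
	char if_xname[IFNAMSIZ_SKETCH];
	int rename_events;	/* departure/arrival notifications sent */
};

/*
 * name_in_use models a lookup by name; note that it is also true when
 * the "new" name is the interface's own current name.
 */
static int
if_rename_sketch(struct ifnet_sketch *ifp, const char *new_name,
    int name_in_use)
{
	if (strlen(new_name) >= IFNAMSIZ_SKETCH)
		return (EINVAL);	/* assumed error code for the sketch */
	if (strcmp(new_name, ifp->if_xname) == 0)
		return (0);		/* no change: succeed, notify nobody */
	if (name_in_use)
		return (EEXIST);
	strcpy(ifp->if_xname, new_name);
	ifp->rename_events++;
	return (0);
}
```

Without the strcmp() check, renaming an interface to its current name would hit the in-use test and fail with EEXIST, which is the behaviour this commit removes.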


Revision 349507
Modified Fri Jun 28 10:49:04 2019 UTC (5 years ago) by hselasky
File length: 110088 byte(s)
Diff to previous 348254
Need to wait for epoch callbacks to complete before detaching a
network interface.

This particularly manifests itself when an INP has multicast options
attached during a network interface detach. Then the IPv4 and IPv6
leave group call which results from freeing the multicast address, may
access a freed ifnet structure. These are the steps to reproduce:

service mdnsd onestart # installed from ports

ifconfig epair create
ifconfig epair0a 0/24 up
ifconfig epair0a destroy

Tested by:	pho @
MFC after:	1 week
Sponsored by:	Mellanox Technologies
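
Why the detach has to wait can be modelled in userland C. This is a sketch under stated assumptions, not the kernel implementation: deferred epoch callbacks are kept in a fixed array, freeing the ifnet is modelled by clearing an "alive" flag, and all names ending in _sketch are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CB_MAX 8

struct ifnet_sketch {
	int alive;		/* 0 once the structure has been "freed" */
	int leave_group_calls;	/* successful leave-group operations */
};

typedef void cb_fn_sketch(struct ifnet_sketch *);

static struct {
	cb_fn_sketch *fn;
	struct ifnet_sketch *ifp;
} pending[CB_MAX];
static size_t npending;
static int use_after_free;	/* counts accesses to a freed ifnet */

/* Defer a callback until the epoch's callbacks are run. */
static void
epoch_call_sketch(cb_fn_sketch *fn, struct ifnet_sketch *ifp)
{
	assert(npending < CB_MAX);
	pending[npending].fn = fn;
	pending[npending].ifp = ifp;
	npending++;
}

/* Run every deferred callback that is still pending. */
static void
epoch_drain_callbacks_sketch(void)
{
	size_t i;

	for (i = 0; i < npending; i++)
		pending[i].fn(pending[i].ifp);
	npending = 0;
}

/* Callback from freeing a multicast address: it touches the ifnet. */
static void
leave_group_cb(struct ifnet_sketch *ifp)
{
	if (!ifp->alive) {
		use_after_free++;
		return;
	}
	ifp->leave_group_calls++;
}

/* Detach: with drain_first set, pending callbacks run before the free. */
static void
if_detach_sketch(struct ifnet_sketch *ifp, int drain_first)
{
	if (drain_first)
		epoch_drain_callbacks_sketch();
	ifp->alive = 0;
}
```

With the drain in place the leave-group callback sees a live interface; without it the callback runs after the free, which is the use-after-free this commit closes.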


Revision 348254
Modified Fri May 24 22:30:40 2019 UTC (5 years, 1 month ago) by jhb
File length: 109813 byte(s)
Diff to previous 347984
Restructure mbuf send tags to provide stronger guarantees.

- Perform ifp mismatch checks (to determine if a send tag is allocated
  for a different ifp than the one the packet is being output on), in
  ip_output() and ip6_output().  This avoids sending packets with send
  tags to ifnet drivers that don't support send tags.

  Since we are now checking for ifp mismatches before invoking
  if_output, we can now try to allocate a new tag before invoking
  if_output, sending the original packet on the new tag if allocation
  succeeds.

  To avoid code duplication for the fragment and unfragmented cases,
  add ip_output_send() and ip6_output_send() as wrappers around
  if_output and nd6_output_ifp, respectively.  All of the logic for
  setting send tags and dealing with send tag-related errors is done
  in these wrapper functions.

  For pseudo interfaces that wrap other network interfaces (vlan and
  lagg), wrapper send tags are now allocated so that ip*_output see
  the wrapper ifp as the ifp in the send tag.  The if_transmit
  routines rewrite the send tags after performing an ifp mismatch
  check.  If an ifp mismatch is detected, the transmit routines fail
  with EAGAIN.

- To provide clearer life cycle management of send tags, especially
  in the presence of vlan and lagg wrapper tags, add a reference count
  to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
  Provide a helper function (m_snd_tag_init()) for use by drivers
  supporting send tags.  m_snd_tag_init() takes care of the if_ref
  on the ifp meaning that code allocating send tags via if_snd_tag_alloc
  no longer has to manage that manually.  Similarly, m_snd_tag_rele
  drops the refcount on the ifp after invoking if_snd_tag_free when
  the last reference to a send tag is dropped.

  This also closes use after free races if there are pending packets in
  driver tx rings after the socket is closed (e.g. from tcpdrop).

  In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
  csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
  Drivers now also check this flag instead of checking snd_tag against
  NULL.  This avoids false positive matches when a forwarded packet
  has a non-NULL rcvif that was treated as a send tag.

- cxgbe was relying on snd_tag_free being called when the inp was
  detached so that it could kick the firmware to flush any pending
  work on the flow.  This is because the driver doesn't require ACK
  messages from the firmware for every request, but instead does a
  kind of manual interrupt coalescing by only setting a flag to
  request a completion on a subset of requests.  If all of the
  in-flight requests don't have the flag when the tag is detached from
  the inp, the flow might never return the credits.  The current
  snd_tag_free command issues a flush command to force the credits to
  return.  However, the credit return is what also frees the mbufs,
  and since those mbufs now hold references on the tag, this meant
  that snd_tag_free would never be called.

  To fix, explicitly drop the mbuf's reference on the snd tag when the
  mbuf is queued in the firmware work queue.  This means that once the
  inp's reference on the tag goes away and all in-flight mbufs have
  been queued to the firmware, tag's refcount will drop to zero and
  snd_tag_free will kick in and send the flush request.  Note that we
  need to avoid doing this in the middle of ethofld_tx(), so the
  driver grabs a temporary reference on the tag around that loop to
  defer the free to the end of the function in case it sends the last
  mbuf to the queue after the inp has dropped its reference on the
  tag.

- mlx5 preallocates send tags and was using the ifp pointer even when
  the send tag wasn't in use.  Explicitly use the ifp from other data
  structures instead.

- Sprinkle some assertions in various places to assert that received
  packets don't have a send tag, and that other places that overwrite
  rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
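
The reference counting and flag check described in the list above can be sketched in userspace C. This is a toy model only: every toy_* name and TOY_CSUM_SND_TAG are hypothetical stand-ins, not the kernel's structures or the real m_snd_tag_*() signatures. It shows an mbuf holding its own reference on a tag, and the free path consulting a flag rather than the pointer to decide whether the shared field is a send tag.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel types; all names here are hypothetical. */
struct toy_ifnet {
	int if_refcount;	/* references held on the interface */
	int tags_freed;		/* how many tags the "driver" has freed */
};

struct toy_snd_tag {
	struct toy_ifnet *ifp;
	int refcount;
};

#define TOY_CSUM_SND_TAG 0x1	/* union below holds snd_tag, not rcvif */

struct toy_mbuf {
	int csum_flags;
	union {
		struct toy_ifnet *rcvif;	/* receive path */
		struct toy_snd_tag *snd_tag;	/* transmit path */
	} u;
};

/* Initialise a tag: one reference for the caller, one taken on the ifp. */
static void
toy_snd_tag_init(struct toy_snd_tag *mst, struct toy_ifnet *ifp)
{
	ifp->if_refcount++;
	mst->ifp = ifp;
	mst->refcount = 1;
}

static void
toy_snd_tag_ref(struct toy_snd_tag *mst)
{
	mst->refcount++;
}

/* Drop a reference; the last one frees the tag and releases the ifp. */
static void
toy_snd_tag_rele(struct toy_snd_tag *mst)
{
	assert(mst->refcount > 0);
	if (--mst->refcount == 0) {
		mst->ifp->tags_freed++;	/* stands in for the driver's free hook */
		mst->ifp->if_refcount--;
	}
}

/* Attach a tag to an outbound packet, taking a reference for the mbuf. */
static void
toy_m_set_snd_tag(struct toy_mbuf *m, struct toy_snd_tag *mst)
{
	toy_snd_tag_ref(mst);
	m->u.snd_tag = mst;
	m->csum_flags |= TOY_CSUM_SND_TAG;
}

/* Free path: consult the flag, never the pointer, to pick the union arm. */
static void
toy_m_free(struct toy_mbuf *m)
{
	if (m->csum_flags & TOY_CSUM_SND_TAG) {
		toy_snd_tag_rele(m->u.snd_tag);
		m->csum_flags &= ~TOY_CSUM_SND_TAG;
	}
	m->u.snd_tag = NULL;
}
```

In this model a received packet with a non-NULL rcvif never touches a tag's reference count, and a tag outlives the socket's reference for as long as a queued packet still points at it.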

Reviewed by:	gallatin, hselasky, rgrimes, ae
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20117


Revision 347984 - (view) (download) (annotate) - [select for diffs]
Modified Mon May 20 00:38:23 2019 UTC (5 years, 1 month ago) by cem
File length: 109758 byte(s)
Diff to previous 347982
Extract eventfilter declarations to sys/_eventfilter.h

This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.

EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).

As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions.  The remainder of the patch addresses
adding appropriate includes to fix those files.

LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).

No functional change (intended).  Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed.  __FreeBSD_version has been bumped.


Revision 347982 - (view) (download) (annotate) - [select for diffs]
Modified Sun May 19 21:49:56 2019 UTC (5 years, 1 month ago) by melifaro
File length: 109751 byte(s)
Diff to previous 346579
Fix rt_ifa selection during loopback route insertion process.
  Currently such routes are added with a link-level IFA, which is
  plain wrong. Only after the insertion they get fixed by the special
  link_rtrequest() ifa handler. This behaviour complicates routing code
  and makes ifa selection more complex.
Streamline this process by explicitly moving link_rtrequest() logic
  to the pre-insertion rt_getifa_fib() ifa selector. Avoid calling all
  this logic in the loopback route case by explicitly specifying
  proper rt_ifa inside the ifa_maintain_loopback_route().

MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20076


Revision 346579 - (view) (download) (annotate) - [select for diffs]
Modified Mon Apr 22 19:24:21 2019 UTC (5 years, 2 months ago) by gallatin
File length: 110391 byte(s)
Diff to previous 343305
Track device's NUMA domain in ifnet & alloc ifnet from NUMA local memory

This commit adds new if_alloc_domain() and if_alloc_dev() methods to
allocate ifnets.  When called with a domain on a NUMA machine,
if_alloc_domain() will record the NUMA domain in the ifnet, and it will
allocate the ifnet struct from memory which is local to that NUMA
node.  Similarly, if_alloc_dev() is a wrapper for if_alloc_domain()
which uses a driver-supplied device_t to call if_alloc_domain() with
the appropriate domain.

Note that the new if_numa_domain field fits in an alignment pad in
struct ifnet, and so does not alter the size of the structure.

Reviewed by:	glebius, kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19930


Revision 343305 - (view) (download) (annotate) - [select for diffs]
Modified Tue Jan 22 17:39:26 2019 UTC (5 years, 5 months ago) by brooks
File length: 109698 byte(s)
Diff to previous 342872
Rework CASE_IOC_IFGROUPREQ() to require a case before the macro.

This is more compatible with formatting tools and looks more normal.
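
The difference between the two macro styles can be illustrated as follows. The definitions below are hypothetical, since the log does not show the real CASE_IOC_IFGROUPREQ() body; the command numbers and the OLD_/NEW_ macro names are made up for the sketch. Both switch statements expand to the same pair of case labels; only the call site differs.

```c
/* Hypothetical command numbers: a native ioctl and its 32-bit twin. */
#define TOY_SIOCAIFGROUP	0x10
#define TOY_SIOCAIFGROUP32	0x11

/* Old style: the macro supplies the leading "case" itself. */
#define OLD_CASE_IOC(cmd)	case cmd: case cmd##32

/* New style: the caller writes "case"; the macro only adds the twin. */
#define NEW_CASE_IOC(cmd)	cmd: case cmd##32

static int
old_style(int cmd)
{
	switch (cmd) {
	OLD_CASE_IOC(TOY_SIOCAIFGROUP):	/* does not look like a case label */
		return (1);
	default:
		return (0);
	}
}

static int
new_style(int cmd)
{
	switch (cmd) {
	case NEW_CASE_IOC(TOY_SIOCAIFGROUP):	/* reads as an ordinary label */
		return (1);
	default:
		return (0);
	}
}
```

Because the new call site begins with the case keyword, an indenter or formatter that recognises case labels by that keyword has something to latch on to.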

Reported by:	jhb (on a different review)
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D18442


Revision 342872 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jan 9 01:11:19 2019 UTC (5 years, 5 months ago) by glebius
File length: 109683 byte(s)
Diff to previous 342870
Mechanical cleanup of epoch(9) usage in network stack.

- Remove macros that covertly create epoch_tracker on thread stack. Such
  macros are quite unsafe, e.g. they will produce buggy code if the same
  macro is used in nested scopes. Explicitly declare epoch_tracker always.

- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
  IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
  locking macros to what they actually are - the net_epoch.
  Keeping them as is is very misleading. They all are named FOO_RLOCK(),
  while they no longer have lock semantics. Now they allow recursion and
  what's more important they now no longer guarantee protection against
  their companion WLOCK macros.
  Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.
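
A minimal userspace sketch of the explicit-tracker style follows. The toy_* types and functions are assumptions made for illustration and do not reproduce the epoch(9) API. The point is that each section names its own tracker, so nested sections cannot silently share or shadow a variable that a macro declared behind the caller's back.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model; the real epoch(9) types and functions differ. */
struct toy_tracker {
	struct toy_tracker *prev;	/* tracker of the enclosing section */
};

struct toy_epoch {
	struct toy_tracker *top;	/* innermost active section */
	int depth;
};

static void
toy_epoch_enter(struct toy_epoch *e, struct toy_tracker *et)
{
	et->prev = e->top;
	e->top = et;
	e->depth++;
}

static void
toy_epoch_exit(struct toy_epoch *e, struct toy_tracker *et)
{
	/* Sections must be exited innermost-first, with their own tracker. */
	assert(e->top == et);
	e->top = et->prev;
	e->depth--;
}

/*
 * Explicit style: both trackers are visible at the call site, so it is
 * obvious which enter pairs with which exit.  Returns the deepest
 * nesting level observed.
 */
static int
toy_nested_sections(struct toy_epoch *e)
{
	struct toy_tracker outer_et;
	int max_depth;

	toy_epoch_enter(e, &outer_et);
	{
		struct toy_tracker inner_et;

		toy_epoch_enter(e, &inner_et);
		max_depth = e->depth;
		toy_epoch_exit(e, &inner_et);
	}
	toy_epoch_exit(e, &outer_et);
	return (max_depth);
}
```

With a macro that hides the declaration, the inner block would declare a second variable of the same name, and an exit written outside that block would refer to the outer one.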

This is a non-functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.

Discussed with:	jtl, gallatin


Revision 342870 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jan 9 00:38:16 2019 UTC (5 years, 5 months ago) by glebius
File length: 109375 byte(s)
Diff to previous 341334
Remove part of comment that doesn't match reality.


Revision 341334 - (view) (download) (annotate) - [select for diffs]
Modified Fri Nov 30 10:36:14 2018 UTC (5 years, 7 months ago) by ae
File length: 109528 byte(s)
Diff to previous 341008
Adapt the fix in r341008 to correctly work with EBR.

IFNET_RLOCK_NOSLEEP() is epoch_enter_preempt() in FreeBSD 12+. Holding
it in sysctl_rtsock() doesn't protect us from ifnet unlinking, because
unlinking occurs with IFNET_WLOCK(), that is rw_wlock+sx_xlock, and it
doesn't check whether concurrent code is running in an epoch section. But
while we are in an epoch section, we should be able to access the ifnet's
fields even if it was unlinked. Thus do not set the if_addr and if_hw_addr
fields to NULL in ifnet_detach_internal(), since rtsock code may
access these fields, and this is allowed while it is running in an epoch
section.

This should fix the race, when ifnet_detach_internal() unlinks ifnet
after we checked it for IFF_DYING in sysctl_dumpentry.

Move free(ifp->if_hw_addr) into ifnet_free_internal(). Also remove the
NULL check for ifp->if_description, since free(9) can correctly handle
a NULL pointer.

MFC after:	1 week


Revision 341008 - (view) (download) (annotate) - [select for diffs]
Modified Tue Nov 27 09:04:06 2018 UTC (5 years, 7 months ago) by ae
File length: 109713 byte(s)
Diff to previous 340413
Fix possible panic during ifnet detach in rtsock.

The panic can happen when an application dumps the routing table
using the sysctl interface. To prevent this, set the IFF_DYING flag in
if_detach_internal() when the ifnet is removed from the chain under the
lock. In sysctl_rtsock(), take IFNET_RLOCK_NOSLEEP() to prevent
ifnet detach during route enumeration. In case an interface was
detached before we took the lock, also check that the ifnet
is not DYING. This prevents access to memory that could be freed after
the ifnet is unlinked.

PR:		227720, 230498, 233306
Reviewed by:	bz, eugen
MFC after:	1 week
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D18338


Revision 340413 - (view) (download) (annotate) - [select for diffs]
Modified Tue Nov 13 22:58:38 2018 UTC (5 years, 7 months ago) by glebius
File length: 109666 byte(s)
Diff to previous 340066
For compatibility KPI functions like if_addr_rlock() that used to have
mutexes but now are converted to epoch(9), use a thread-private epoch_tracker.
Embedding the tracker into ifnet(9) or ifnet-derived structures creates a
non-reentrant function that will fail miserably if called simultaneously from
two different contexts.
A thread-private tracker provides a single tracker that allows these
functions to be called safely. It doesn't allow nested calls, but these are not
expected from compatibility KPIs.

Reviewed by:	markj


Revision 340066 - (view) (download) (annotate) - [select for diffs]
Modified Fri Nov 2 16:50:17 2018 UTC (5 years, 8 months ago) by kp
File length: 109918 byte(s)
Diff to previous 339588
Notify that the ifnet will go away, even on vnet shutdown

pf subscribes to ifnet_departure_event events, so it can clean up the
ifg_pf_kif and if_pf_kif pointers in the ifnet.
During vnet shutdown interfaces could go away without sending the event,
so pf ends up cleaning these up as part of its shutdown sequence, which
happens after the ifnet has already been freed.

Send the ifnet_departure_event during vnet shutdown, allowing pf to
clean up correctly.

MFC after:	2 weeks
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D17500


Revision 339588 - (view) (download) (annotate) - [select for diffs]
Modified Mon Oct 22 13:25:26 2018 UTC (5 years, 8 months ago) by hselasky
File length: 109813 byte(s)
Diff to previous 339550
Resolve deadlock between epoch(9) and various network interface
SX-locks, during if_purgeaddrs(), by not holding the epoch
read lock across typical network IOCTL code paths. This is a regression
after r334305.

Reviewed by:		ae (network)
Differential revision:	https://reviews.freebsd.org/D17647
MFC after:		1 week
Sponsored by:		Mellanox Technologies


Revision 339550 - (view) (download) (annotate) - [select for diffs]
Modified Sun Oct 21 17:55:26 2018 UTC (5 years, 8 months ago) by ae
File length: 109782 byte(s)
Diff to previous 339537
Add KPI that can be used by tunneling interfaces to handle IP addresses
appearing and disappearing on the host system.

Such handling is needed because tunneling interfaces must use addresses
that are configured on the host as ingress addresses for tunnels.
Otherwise the system can send spoofed packets with a source address that
belongs to a foreign host.

The KPI uses ifaddr_event_ext event to implement addresses tracking.
Tunneling interfaces register event handlers and then they are
notified by the kernel, when an address disappears or appears.

The ifaddr_event_compat() handler from if.c is replaced by
srcaddr_change_event() in ip_encap.c.

MFC after:	1 month
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D17134


Revision 339537 - (view) (download) (annotate) - [select for diffs]
Modified Sun Oct 21 15:02:06 2018 UTC (5 years, 8 months ago) by ae
File length: 110062 byte(s)
Diff to previous 339012
Add ifaddr_event_ext event. It is similar to ifaddr_event, but the
handler receives the type of event IFADDR_EVENT_ADD/IFADDR_EVENT_DEL,
and the pointer to ifaddr. Also ifaddr_event now is implemented using
ifaddr_event_ext handler.
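
One way to picture the relationship between the two events is an adapter: an extended handler that forwards to a legacy handler and drops the extra arguments. The sketch below is a userspace toy under that assumption; the registry, the toy_* names, and the handler signatures are hypothetical and are not the kernel's EVENTHANDLER interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's event kinds and types. */
enum toy_event { TOY_IFADDR_ADD, TOY_IFADDR_DEL };

struct toy_ifnet { const char *name; };
struct toy_ifaddr { unsigned int addr; };

/* Extended handler: learns which address changed and how. */
typedef void toy_ext_fn(void *arg, struct toy_ifnet *ifp,
    struct toy_ifaddr *ifa, enum toy_event ev);

/* Legacy handler: only learns that something changed on ifp. */
typedef void toy_legacy_fn(void *arg, struct toy_ifnet *ifp);

struct toy_legacy {
	toy_legacy_fn *fn;
	void *arg;
};

/* Adapter registered as an extended handler on behalf of a legacy one. */
static void
toy_legacy_adapter(void *arg, struct toy_ifnet *ifp, struct toy_ifaddr *ifa,
    enum toy_event ev)
{
	struct toy_legacy *legacy = arg;

	(void)ifa;	/* the legacy interface has no room for these */
	(void)ev;
	legacy->fn(legacy->arg, ifp);
}

#define TOY_MAX_HANDLERS 4

struct toy_registry {
	toy_ext_fn *fn[TOY_MAX_HANDLERS];
	void *arg[TOY_MAX_HANDLERS];
	int count;
};

static void
toy_register(struct toy_registry *r, toy_ext_fn *fn, void *arg)
{
	assert(r->count < TOY_MAX_HANDLERS);
	r->fn[r->count] = fn;
	r->arg[r->count] = arg;
	r->count++;
}

/* Deliver one event to every registered extended handler. */
static void
toy_invoke(struct toy_registry *r, struct toy_ifnet *ifp,
    struct toy_ifaddr *ifa, enum toy_event ev)
{
	int i;

	for (i = 0; i < r->count; i++)
		r->fn[i](r->arg[i], ifp, ifa, ev);
}

/* Example extended handler: count only address removals. */
static void
toy_count_dels(void *arg, struct toy_ifnet *ifp, struct toy_ifaddr *ifa,
    enum toy_event ev)
{
	(void)ifp;
	(void)ifa;
	if (ev == TOY_IFADDR_DEL)
		(*(int *)arg)++;
}

/* Example legacy handler: count every notification. */
static void
toy_count_any(void *arg, struct toy_ifnet *ifp)
{
	(void)ifp;
	(*(int *)arg)++;
}
```

Only one dispatch path exists in this model, so legacy subscribers keep working without the event source having to fire two separate events.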

MFC after:	3 weeks
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D17100


Revision 339012 - (view) (download) (annotate) - [select for diffs]
Modified Sat Sep 29 13:01:23 2018 UTC (5 years, 9 months ago) by tuexen
File length: 109782 byte(s)
Diff to previous 338162
For changing the MTU on tun/tap devices, it should not matter whether it
is done via using ifconfig, which uses a SIOCSIFMTU ioctl() command, or
doing it using a TUNSIFINFO/TAPSIFINFO ioctl() command.
Without this patch, for IPv6 the new MTU is not used when creating routes.
Especially, when initiating TCP connections after increasing the MTU,
the old MTU is still used to compute the MSS.
Thanks to ae@ and bz@ for helping to improve the patch.

Reviewed by:		ae@, bz@
Approved by:		re (kib@)
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D17180


Revision 338162 - (view) (download) (annotate) - [select for diffs]
Modified Tue Aug 21 22:59:22 2018 UTC (5 years, 10 months ago) by mmacy
File length: 109861 byte(s)
Diff to previous 337932
fix copy/paste error when clearing ifma flag

CID: 1395119
Reported by:	vangyzen


Revision 337932 - (view) (download) (annotate) - [select for diffs]
Modified Thu Aug 16 23:46:38 2018 UTC (5 years, 10 months ago) by np
File length: 109858 byte(s)
Diff to previous 337866
Add the ability to look up the 3b PCP of a VLAN interface.  Use it in
toe_l2_resolve to fill up the complete vtag and not just the vid.
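
For reference, the complete tag is the 16-bit 802.1Q Tag Control Information field: the 3-bit PCP in bits 15-13, a 1-bit DEI in bit 12, and the 12-bit VID in bits 11-0. The helpers below are a hypothetical sketch of composing and splitting that field; the toy_* names are not the kernel's macros.

```c
#include <assert.h>
#include <stdint.h>

/* Compose an 802.1Q TCI: PCP in bits 15-13, DEI in bit 12, VID in 11-0. */
static uint16_t
toy_make_vtag(uint16_t vid, uint8_t pcp, uint8_t dei)
{
	assert(vid <= 0x0fff && pcp <= 7 && dei <= 1);
	return ((uint16_t)((pcp << 13) | (dei << 12) | vid));
}

/* Extract the 12-bit VLAN identifier. */
static uint16_t
toy_vtag_vid(uint16_t vtag)
{
	return (vtag & 0x0fff);
}

/* Extract the 3-bit priority code point. */
static uint8_t
toy_vtag_pcp(uint16_t vtag)
{
	return ((uint8_t)((vtag >> 13) & 0x7));
}
```

A tag built from the VID alone leaves the priority bits at zero, which is why a lookup for the interface's PCP is needed to fill in the whole field.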

Reviewed by:	kib@
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D16752


Revision 337866 - (view) (download) (annotate) - [select for diffs]
Modified Wed Aug 15 20:23:08 2018 UTC (5 years, 10 months ago) by mmacy
File length: 109811 byte(s)
Diff to previous 336676
Fix in6_multi double free

This is actually several different bugs:
- The code is not designed to handle inpcb deletion after interface deletion
  - add reference for inpcb membership
- The multicast address has to be removed from interface lists when the refcount
  goes to zero OR when the interface goes away
  - decouple list disconnect from refcount (v6 only for now)
- ifmultiaddr can exist past being on interface lists
  - add flag for tracking whether or not it's enqueued
- deferring freeing moptions makes the inpcb cleanup code simpler but opens the
  door wider still to races
  - call inp_gcmoptions synchronously after dropping the inpcb lock

Fundamentally multicast needs a rewrite - but keep applying band-aids for now.

Tested by: kp
Reported by: novel, kp, lwhsu


Revision 336676 - (view) (download) (annotate) - [select for diffs]
Modified Tue Jul 24 16:35:52 2018 UTC (5 years, 11 months ago) by andrew
File length: 109549 byte(s)
Diff to previous 336131
Use the new VNET_DEFINE_STATIC macro when we are defining static VNET
variables.

Reviewed by:	bz
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D16147


Revision 336131 - (view) (download) (annotate) - [select for diffs]
Modified Mon Jul 9 11:03:28 2018 UTC (5 years, 11 months ago) by ae
File length: 109549 byte(s)
Diff to previous 336062
Deduplicate the code.

Add generic function if_tunnel_check_nesting() that does check for
allowed nesting level for tunneling interfaces and also does loop
detection. Use it in gif(4), gre(4) and me(4) interfaces.
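
A combined nesting and loop check of this kind might look like the sketch below. It is a userspace model under stated assumptions: the packet's history is represented as a linked list of hops, and the toy_* names and result codes are hypothetical. How the real if_tunnel_check_nesting() records that history and which error it returns are not described in this log.

```c
#include <assert.h>
#include <stddef.h>

struct toy_ifnet { const char *name; };

/* One record per tunnel interface the packet has already passed through. */
struct toy_hop {
	const struct toy_ifnet *ifp;
	const struct toy_hop *next;
};

enum toy_result { TOY_OK, TOY_LOOP, TOY_TOO_DEEP };

/*
 * Decide whether a packet may enter tunnel interface ifp, given the
 * tunnels it has already traversed and the allowed nesting level.
 */
static enum toy_result
toy_tunnel_check_nesting(const struct toy_ifnet *ifp,
    const struct toy_hop *hops, int max_nesting)
{
	int count = 1;	/* the interface being entered is one level */

	assert(max_nesting >= 1);
	for (; hops != NULL; hops = hops->next) {
		if (hops->ifp == ifp)
			return (TOY_LOOP);	/* packet came back to us */
		if (++count > max_nesting)
			return (TOY_TOO_DEEP);
	}
	return (TOY_OK);
}
```

Keeping both checks in one walk over the history is what lets several tunnel drivers share a single helper instead of each carrying its own copy.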

Differential Revision:	https://reviews.freebsd.org/D16162


Revision 336062 - (view) (download) (annotate) - [select for diffs]
Modified Sat Jul 7 13:35:06 2018 UTC (5 years, 11 months ago) by sbruno
File length: 108548 byte(s)
Diff to previous 335924
struct ifmediareq *ifmrp is only used in the COMPAT_FREEBSD32 parts of
ifioctl().  Move it inside the proper #ifdef.  This was throwing a valid
"Assigned but unused" warning with gcc.

Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D16063


Revision 335924 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jul 4 02:47:16 2018 UTC (6 years ago) by mmacy
File length: 108548 byte(s)
Diff to previous 334402
epoch(9): allow preemptible epochs to compose

- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
  there's no longer any benefit to dropping the pcbinfo lock
  and trying to do so just adds an error prone branchfest to
  these functions
- Remove cases of same function recursion on the epoch as
  recursing is no longer free.
- Remove the TAILQ_ENTRY and epoch_section from struct
  thread as the tracker field is now stack or heap allocated
  as appropriate.

Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066


Revision 334402 - (view) (download) (annotate) - [select for diffs]
Modified Wed May 30 21:46:10 2018 UTC (6 years, 1 month ago) by mmacy
File length: 108123 byte(s)
Diff to previous 334305
if_setlladdr: don't call ioctl in epoch context

PR: 228612
Reported by: markj


Revision 334305 - (view) (download) (annotate) - [select for diffs]
Modified Tue May 29 00:53:53 2018 UTC (6 years, 1 month ago) by mmacy
File length: 108091 byte(s)
Diff to previous 334172
route: fix missed ref adds
 - ensure that we bump the ifa ref whenever we add a reference
 - defer freeing epoch protected references until after the if_purgeaddrs
   loop


Revision 334172 - (view) (download) (annotate) - [select for diffs]
Modified Thu May 24 17:54:08 2018 UTC (6 years, 1 month ago) by mmacy
File length: 108113 byte(s)
Diff to previous 334118
if_delgroups: add missed unlock introduced by r334118


Revision 334118 - (view) (download) (annotate) - [select for diffs]
Modified Wed May 23 21:02:14 2018 UTC (6 years, 1 month ago) by mmacy
File length: 108094 byte(s)
Diff to previous 333886
UDP: further performance improvements on tx

Cumulative throughput while running 64
  netperf -H $DUT -t UDP_STREAM -- -m 1
on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps

Single stream throughput increases from 910kpps to 1.18Mpps

Baseline:
https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg

- Protect read access to global ifnet list with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg

- Protect short lived ifaddr references with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg

- Convert if_afdata read lock path to epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg

A fix for the inpcbhash contention is pending sufficient time
on a canary at LLNW.

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15409


Revision 333886 - (view) (download) (annotate) - [select for diffs]
Modified Sat May 19 19:00:04 2018 UTC (6 years, 1 month ago) by mmacy
File length: 108078 byte(s)
Diff to previous 333813
net: fix uninitialized variable warning


Revision 333813 - (view) (download) (annotate) - [select for diffs]
Modified Fri May 18 20:13:34 2018 UTC (6 years, 1 month ago) by mmacy
File length: 108071 byte(s)
Diff to previous 333809
ifnet: Replace if_addr_lock rwlock with epoch + mutex

Run on LLNW canaries and tested by pho@

gallatin:
Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5
based ConnectX 4-LX NIC, I see an almost 12% improvement in received
packet rate, and a larger improvement in bytes delivered all the way
to userspace.

With the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1,
I see, using nstat -I mce0 1, before the patch:

InMpps  OMpps  InGbs  OGbs      err  TCP Est   %CPU  syscalls      csw   irq  GBfree
  4.98   0.00   4.42  0.00  4235592       33  83.80   4720653  2149771  1235  247.32
  4.73   0.00   4.20  0.00  4025260       33  82.99   4724900  2139833  1204  247.32
  4.72   0.00   4.20  0.00  4035252       33  82.14   4719162  2132023  1264  247.32
  4.71   0.00   4.21  0.00  4073206       33  83.68   4744973  2123317  1347  247.32
  4.72   0.00   4.21  0.00  4061118       33  80.82   4713615  2188091  1490  247.32
  4.72   0.00   4.21  0.00  4051675       33  85.29   4727399  2109011  1205  247.32
  4.73   0.00   4.21  0.00  4039056       33  84.65   4724735  2102603  1053  247.32

After the patch:

InMpps  OMpps  InGbs  OGbs      err  TCP Est   %CPU  syscalls      csw   irq  GBfree
  5.43   0.00   4.20  0.00  3313143       33  84.96   5434214  1900162  2656  245.51
  5.43   0.00   4.20  0.00  3308527       33  85.24   5439695  1809382  2521  245.51
  5.42   0.00   4.19  0.00  3316778       33  87.54   5416028  1805835  2256  245.51
  5.42   0.00   4.19  0.00  3317673       33  90.44   5426044  1763056  2332  245.51
  5.42   0.00   4.19  0.00  3314839       33  88.11   5435732  1792218  2499  245.52
  5.44   0.00   4.19  0.00  3293228       33  91.84   5426301  1668597  2121  245.52

Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15366


Revision 333809 - (view) (download) (annotate) - [select for diffs]
Modified Fri May 18 18:48:00 2018 UTC (6 years, 1 month ago) by mmacy
File length: 107272 byte(s)
Diff to previous 333807
epoch(9): allocate net epochs earlier in boot


Revision 333807 - (view) (download) (annotate) - [select for diffs]
Modified Fri May 18 17:58:15 2018 UTC (6 years, 1 month ago) by mmacy
File length: 107138 byte(s)
Diff to previous 333802
epoch: move epoch variables to read mostly section


Revision 333802 - (view) (download) (annotate) - [select for diffs]
Modified Fri May 18 17:29:43 2018 UTC (6 years, 1 month ago) by mmacy
File length: 107110 byte(s)
Diff to previous 333775
epoch(9): Make epochs non-preemptible by default

There are risks associated with waiting on a preemptible epoch section.
Change the name to make them not be the default and document the issue
under CAVEATS.

Reported by:	markj


Revision 333775 - (view) (download) (annotate) - [select for diffs]
Modified Fri May 18 01:52:51 2018 UTC (6 years, 1 month ago) by mmacy
File length: 107034 byte(s)
Diff to previous 333481
epoch: add non-preemptible "critical" variant

adds:
- epoch_enter_critical() - can be called inside a different epoch,
  starts a section that will not acquire any MTX_DEF mutexes or do
  anything that might sleep.
- epoch_exit_critical() - corresponding exit call
- epoch_wait_critical() - wait variant for which any threads in a
  section are guaranteed to be running.
- epoch_global_critical - an epoch_wait_critical safe epoch instance

Requested by:   markj
Approved by:	sbruno


Revision 333481 - (view) (download) (annotate) - [select for diffs]
Modified Fri May 11 05:00:40 2018 UTC (6 years, 1 month ago) by mmacy
File length: 107033 byte(s)
Diff to previous 333476
Allow different bridge types to coexist

if_bridge has a lot of limitations that make it scale poorly to higher data
rates. In my projects/VPC branch I leverage the bridge interface between
layers for my high speed soft switch as well as for purposes of stacking
in general.

Reviewed by:	sbruno@
Approved by:	sbruno@
Differential Revision:	https://reviews.freebsd.org/D15344


Revision 333476 - (view) (download) (annotate) - [select for diffs]
Modified Fri May 11 00:19:49 2018 UTC (6 years, 1 month ago) by des
File length: 107077 byte(s)
Diff to previous 333469
Slight cleanup of interface event logging.

Make if_printf() use vlog() instead of vprintf().  This means it can no
longer return the number of characters printed, as it used to, but every
single call to if_printf() in the entire kernel ignores the return value
anyway; just return 0 so we don't have to change the prototype.
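
The resulting contract can be sketched in userspace as follows. This is an illustration only: toy_if_printf() and struct toy_ifnet are hypothetical, and the sketch formats into a caller-supplied buffer so that it can be exercised outside the kernel, whereas the real function hands the message to the kernel log.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

struct toy_ifnet { const char *if_xname; };

/*
 * Format "<ifname>: <message>" into buf.  The return value is always 0,
 * mirroring the commit's choice to keep the prototype but stop reporting
 * a character count.
 */
static int
toy_if_printf(char *buf, size_t len, const struct toy_ifnet *ifp,
    const char *fmt, ...)
{
	va_list ap;
	int n;

	assert(buf != NULL && len > 0);
	n = snprintf(buf, len, "%s: ", ifp->if_xname);
	if (n >= 0 && (size_t)n < len) {
		va_start(ap, fmt);
		vsnprintf(buf + n, len - (size_t)n, fmt, ap);
		va_end(ap);
	}
	return (0);
}
```

Callers that ignored the return value before are unaffected, which matches the observation that every in-tree caller already did so.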

Consistently use if_printf() throughout sys/net/if.c, instead of a
mixture of if_printf() and log().

In ifa_maintain_loopback_route(), don't needlessly log an error if we
either failed to add a route because it already existed or failed to
remove one because it did not.  We still return an error code, though.

MFC after:	1 week


Revision 333469 - (view) (download) (annotate) - [select for diffs]
Modified Thu May 10 19:13:00 2018 UTC (6 years, 1 month ago) by mmacy
File length: 107133 byte(s)
Diff to previous 333309
Allocate epoch for networking at startup

Additionally add CK to include paths for modules

Approved by:	sbruno@


Revision 333309 - (view) (download) (annotate) - [select for diffs]
Modified Sun May 6 20:34:13 2018 UTC (6 years, 1 month ago) by mmacy
File length: 107086 byte(s)
Diff to previous 333283
r333175 introduced deferred deletion of multicast addresses in order to permit the driver ioctl
to sleep on commands to the NIC when updating multicast filters. More generally this permitted
drivers to use an sx as a softc lock. Unfortunately this change introduced a race whereby
a multicast update would still be queued for deletion when ifconfig deleted the interface,
thus calling down into _purgemaddrs and synchronously deleting _all_ of the multicast addresses
on the interface.

Synchronously remove all external references to a multicast address before enqueueing for delete.

Reported by:	lwhsu
Approved by:	sbruno


Revision 333283 - (view) (download) (annotate) - [select for diffs]
Modified Sun May 6 00:38:29 2018 UTC (6 years, 1 month ago) by markj
File length: 106587 byte(s)
Diff to previous 333175
Import the netdump client code.

This is a component of a system which lets the kernel dump core to
a remote host after a panic, rather than to a local storage device.
The server component is available in the ports tree. netdump is
particularly useful on diskless systems.

The netdump(4) man page contains some details describing the protocol.
Support for configuring netdump will be added to dumpon(8) in a future
commit. To use netdump, the kernel must have been compiled with the
NETDUMP option.

The initial revision of netdump was written by Darrell Anderson and
was integrated into Sandvine's OS, from which this version was derived.

Reviewed by:	bdrewery, cem (earlier versions), julian, sbruno
MFC after:	1 month
X-MFC note:	use a spare field in struct ifnet
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15253


Revision 333175 - (view) (download) (annotate) - [select for diffs]
Modified Wed May 2 19:36:29 2018 UTC (6 years, 2 months ago) by shurd
File length: 106507 byte(s)
Diff to previous 332997
Separate list manipulation locking from state change in multicast

Multicast incorrectly calls in to drivers with a mutex held causing drivers
to have to go through all manner of contortions to use a non sleepable lock.
Serialize multicast updates instead.

Submitted by:	mmacy <mmacy@mattmacy.io>
Reviewed by:	shurd, sbruno
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14969


Revision 332997 - (view) (download) (annotate) - [select for diffs]
Modified Wed Apr 25 15:30:42 2018 UTC (6 years, 2 months ago) by brooks
File length: 106281 byte(s)
Diff to previous 332490
Translate 32-bit ifmedia requests into native ones.

We use transformation rather than accessors as virtually every driver
implements SIOCGIFMEDIA and all would have to be touched.

Keep the code readable by always performing copies and (possibly no-op)
transforms.

Reviewed by:	jhb, kib
Obtained from:	CheriBSD
MFC after:	1 week
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14996


Revision 332490 - (view) (download) (annotate) - [select for diffs]
Modified Fri Apr 13 21:18:04 2018 UTC (6 years, 2 months ago) by brooks
File length: 104689 byte(s)
Diff to previous 332412
Remove support for the Arcnet protocol.

While Arcnet has some continued deployment in industrial controls, the
lack of drivers for any of the PCI, USB, or PCIe NICs on the market
suggests such users aren't running FreeBSD.

Evidence in the PR database suggests that the cm(4) driver (our sole
Arcnet NIC) was broken in 5.0 and has not worked since.

PR:		182297
Reviewed by:	jhibbits, vangyzen
Relnotes:	yes
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15057


Revision 332412 - (view) (download) (annotate) - [select for diffs]
Modified Wed Apr 11 17:28:24 2018 UTC (6 years, 2 months ago) by brooks
File length: 104707 byte(s)
Diff to previous 332158
Remove support for FDDI networks.

Defines in net/if_media.h remain in case code copied from ifconfig is in
use elsewhere (supporting a nonexistent media type is harmless).

Reviewed by:	kib, jhb
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15017


Revision 332158 - (view) (download) (annotate) - [select for diffs]
Modified Fri Apr 6 23:25:54 2018 UTC (6 years, 2 months ago) by brooks
File length: 104723 byte(s)
Diff to previous 332151
Remove the thread argument from ifr_buffer_*() accessors.

They are always used in a context where curthread is the correct thread.
This makes them more similar to the ifr_data_get_ptr() accessor.


Revision 332151 - (view) (download) (annotate) - [select for diffs]
Modified Fri Apr 6 20:26:56 2018 UTC (6 years, 2 months ago) by brooks
File length: 104875 byte(s)
Diff to previous 332122
ifconf(): correct handling of sockaddrs smaller than struct sockaddr.

Portable programs that use SIOCGIFCONF (e.g. traceroute) assume
that each pseudo ifreq is of length MAX(sizeof(struct ifreq),
sizeof(ifr_name) + ifr_addr.sa_len).  For short sockaddrs we copied
too much from the source sockaddr resulting in a heap leak.

I believe only one such sockaddr exists (struct sockaddr_sco which
is 8 bytes) and it is unclear if such sockaddrs end up on interfaces
in practice.  If they did, the result would be an 8 byte heap leak on
current architectures.
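
The sizing rule and the safe copy can be sketched with toy structures. The layouts below are simplified assumptions (the real struct ifreq carries a union, not a bare sockaddr) and the toy_* names are hypothetical. The sketch zeroes the whole entry, copies only the bytes the source sockaddr actually has, and reports the length a consumer walking the buffer would expect.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_IFNAMSIZ 16

/* Simplified layouts; the real structures differ. */
struct toy_sockaddr {
	unsigned char sa_len;
	unsigned char sa_family;
	char sa_data[14];
};

struct toy_ifreq {
	char ifr_name[TOY_IFNAMSIZ];
	struct toy_sockaddr ifr_addr;
};

/* Entry length: MAX(sizeof(struct ifreq), sizeof(ifr_name) + sa_len). */
static size_t
toy_entry_len(size_t sa_len)
{
	size_t need = TOY_IFNAMSIZ + sa_len;

	return (need > sizeof(struct toy_ifreq) ? need :
	    sizeof(struct toy_ifreq));
}

/*
 * Fill one entry from a sockaddr of sa_len bytes.  Copying exactly
 * sa_len bytes into a zeroed entry means a short sockaddr cannot drag
 * neighbouring memory out with it.
 */
static size_t
toy_fill_entry(struct toy_ifreq *ifr, const char *name, const void *sa,
    size_t sa_len)
{
	assert(sa_len <= sizeof(ifr->ifr_addr));
	memset(ifr, 0, sizeof(*ifr));
	strncpy(ifr->ifr_name, name, sizeof(ifr->ifr_name) - 1);
	memcpy(&ifr->ifr_addr, sa, sa_len);
	return (toy_entry_len(sa_len));
}
```

For an 8-byte sockaddr the entry is still reported at its full size, but the bytes past the eighth are zeros rather than whatever followed the source in memory.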

admbugs:	869
Reviewed by:	kib
Obtained from:	CheriBSD
MFC after:	3 days
Security:	kernel heap leak
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14981


Revision 332122 - (view) (download) (annotate) - [select for diffs]
Modified Fri Apr 6 17:35:35 2018 UTC (6 years, 2 months ago) by brooks
File length: 104672 byte(s)
Diff to previous 332088
Move most of the contents of opt_compat.h to opt_global.h.

opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compatibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c.  A fake _COMPAT_LINUX option ensures opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by:	kib, cem, jhb, jtl
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14941


Revision 332088 - (view) (download) (annotate) - [select for diffs]
Modified Thu Apr 5 22:14:55 2018 UTC (6 years, 2 months ago) by brooks
File length: 104696 byte(s)
Diff to previous 332087
Add 32-bit compat for ioctls that take struct ifgroupreq.

Use an accessor to access ifgr_group and ifgr_groups.

Use a macro CASE_IOC_IFGROUPREQ(cmd) in place of case statements such
as "case SIOCAIFGROUP:". This avoids polluting the switch statements
with large numbers of #ifdefs.

Reviewed by:	kib
Obtained from:	CheriBSD
MFC after:	1 week
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14960
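
The case-label macro technique described in r332088 might look roughly like the sketch below. All EX_ names and the command numbers are hypothetical; the real CASE_IOC_IFGROUPREQ definition in the tree may differ. The idea is that a single macro expands to one case label natively and to two labels when 32-bit compat is enabled, so the switch body needs no #ifdef.

```c
#include <assert.h>

/* Assumption for this sketch: pretend 32-bit compat support is enabled. */
#define EX_COMPAT_FREEBSD32	1

/* Hypothetical command numbers; real ones are generated by _IOW() etc. */
#define EX_SIOCAIFGROUP		0x10UL
#define EX_SIOCAIFGROUP32	0x11UL	/* hypothetical 32-bit variant */

#ifdef EX_COMPAT_FREEBSD32
/* Expands "case M(cmd):" into "case (cmd): case (cmd32):". */
#define EX_CASE_IOC_IFGROUPREQ(cmd)	\
    (cmd):				\
    case (cmd ## 32)
#else
#define EX_CASE_IOC_IFGROUPREQ(cmd)	\
    (cmd)
#endif

/* Returns 1 if cmd is the add-group ioctl in either ABI, else 0. */
int
ex_is_addgroup(unsigned long cmd)
{
	switch (cmd) {
	case EX_CASE_IOC_IFGROUPREQ(EX_SIOCAIFGROUP):
		return (1);
	default:
		return (0);
	}
}
```

The token paste happens before the argument is macro-expanded, so EX_SIOCAIFGROUP becomes EX_SIOCAIFGROUP32, which is then replaced by its value.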


Revision 332087 - (view) (download) (annotate) - [select for diffs]
Modified Thu Apr 5 21:58:28 2018 UTC (6 years, 2 months ago) by brooks
File length: 103669 byte(s)
Diff to previous 331831
ifconf(): Always zero the whole struct ifreq.

The previous split of zeroing ifr_name and ifr_addr separately is safe
on current architectures, but would be unsafe if pointers were larger
than 8 bytes. Combining the zeroing adds no real cost (a few
instructions) and makes the security property easier to verify.

Reviewed by:	kib, emaste
Obtained from:	CheriBSD
MFC after:	3 days
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14912


Revision 331831 - (view) (download) (annotate) - [select for diffs]
Modified Fri Mar 30 21:38:53 2018 UTC (6 years, 3 months ago) by brooks
File length: 103754 byte(s)
Diff to previous 331797
Document and enforce assumptions about struct (in6_)ifreq.

- The two types must be type-punnable for shared members of ifr_ifru.
  This allows compatibility accessors to be shared.

- There must be no padding gap between ifr_name and ifr_ifru.  This is
  assumed in tcpdump's use of SIOCGIFFLAGS output which attempts to be
  broadly portable.  This is true for all current architectures, but very
  large (256-bit) fat-pointers could violate this invariant.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14910
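
Layout assumptions like the two listed in r331831 can be enforced at compile time. The sketch below uses hypothetical ex_ structures rather than the real struct ifreq and struct in6_ifreq, and the union members shown are only a small illustrative subset.

```c
#include <assert.h>
#include <stddef.h>

#define EX_IFNAMSIZ	16

/* Hypothetical, simplified stand-in for struct ifreq. */
struct ex_ifreq {
	char	ifr_name[EX_IFNAMSIZ];
	union {
		short	ifru_flags[2];
		int	ifru_metric;
		void	*ifru_data;
	} ifr_ifru;
};

/* Hypothetical, simplified stand-in for struct in6_ifreq. */
struct ex_in6_ifreq {
	char	ifr_name[EX_IFNAMSIZ];
	union {
		short	ifru_flags[2];
		int	ifru_metric;
		void	*ifru_data;
		char	ifru_extra[64];	/* larger IPv6-only members */
	} ifr_ifru;
};

/* No padding gap between ifr_name and ifr_ifru. */
_Static_assert(offsetof(struct ex_ifreq, ifr_ifru) ==
    sizeof(((struct ex_ifreq *)0)->ifr_name),
    "gap between ifr_name and ifr_ifru");

/* Shared members sit at the same offset in both types. */
_Static_assert(offsetof(struct ex_ifreq, ifr_ifru) ==
    offsetof(struct ex_in6_ifreq, ifr_ifru),
    "ifr_ifru offsets differ between the two types");

/* Runtime view of the same invariant: size of the gap in bytes. */
size_t
ex_ifru_gap(void)
{
	return (offsetof(struct ex_ifreq, ifr_ifru) -
	    sizeof(((struct ex_ifreq *)0)->ifr_name));
}
```

If a pointer type ever required more than 16-byte alignment, the first assertion would fail at build time rather than silently changing the layout that userland relies on.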


Revision 331797 - (view) (download) (annotate) - [select for diffs]
Modified Fri Mar 30 18:50:13 2018 UTC (6 years, 3 months ago) by brooks
File length: 103490 byte(s)
Diff to previous 331714
Use an accessor function to access ifr_data.

This fixes 32-bit compat (no ioctl command definitions are required
as struct ifreq is the same size).  This is believed to be sufficient to
fully support ifconfig on 32-bit systems.

Reviewed by:	kib
Obtained from:	CheriBSD
MFC after:	1 week
Relnotes:	yes
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14900


Revision 331714 - (view) (download) (annotate) - [select for diffs]
Modified Wed Mar 28 23:33:26 2018 UTC (6 years, 3 months ago) by brooks
File length: 103209 byte(s)
Diff to previous 331644
Remove infrastructure for token-ring networks.

Reviewed by:	cem, imp, jhb, jmallett
Relnotes:	yes
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14875


Revision 331644 - (view) (download) (annotate) - [select for diffs]
Modified Tue Mar 27 18:55:39 2018 UTC (6 years, 3 months ago) by brooks
File length: 103229 byte(s)
Diff to previous 331641
Fix a whitespace bug missed in refactoring prior to r331641.

MFC with:	r331641


Revision 331641 - (view) (download) (annotate) - [select for diffs]
Modified Tue Mar 27 18:26:50 2018 UTC (6 years, 3 months ago) by brooks
File length: 103231 byte(s)
Diff to previous 331622
Fix access to ifru_buffer on freebsd32.

Make all kernel accesses to ifru_buffer go via access functions
which take the process ABI into account and use an appropriate union
to access members in the correct place in struct ifreq.

Reviewed by:	kib
Obtained from:	CheriBSD
MFC after:	1 week
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14846
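
An ABI-aware accessor of the kind r331641 describes could be sketched as below. The ex_ types and the caller_is_32bit parameter are assumptions made for illustration; the kernel determines the caller's ABI from the current process instead of taking a flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical native layout of the buffer member. */
struct ex_ifreq_buffer {
	size_t	length;
	void	*buffer;
};

/* Hypothetical layout as seen by a 32-bit process. */
struct ex_ifreq_buffer32 {
	uint32_t	length;		/* 32-bit size_t */
	uint32_t	buffer;		/* 32-bit user pointer */
};

/* Both layouts overlaid on the same storage. */
union ex_ifru_buffer_union {
	struct ex_ifreq_buffer		native;
	struct ex_ifreq_buffer32	compat32;
};

/* Read the length through the layout that matches the caller's ABI. */
size_t
ex_ifr_buffer_get_length(const void *data, int caller_is_32bit)
{
	const union ex_ifru_buffer_union *u = data;

	if (caller_is_32bit)
		return (u->compat32.length);
	return (u->native.length);
}
```

Routing every access through one function keeps the ABI check in a single place instead of scattering it across each ioctl handler.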


Revision 331622 - (view) (download) (annotate) - [select for diffs]
Added Tue Mar 27 15:29:32 2018 UTC (6 years, 3 months ago) by kib
File length: 100945 byte(s)
Diff to previous 326023
Allow specifying PCP on packets not belonging to any VLAN.

According to 802.1Q-2014, VLAN tagged packets with VLAN id 0 should be
considered as untagged, and only PCP and DEI values from the VLAN tag
are meaningful.  See for instance
https://www.cisco.com/c/en/us/td/docs/switches/connectedgrid/cg-switch-sw-master/software/configuration/guide/vlan0/b_vlan_0.html.

Make it possible to specify PCP value for outgoing packets on an
ethernet interface.  When PCP is supplied, the tag is appended, VLAN
id set to 0, and PCP is filled by the supplied value.  The code to do
VLAN tag encapsulation is refactored from the if_vlan.c and moved into
if_ethersubr.c.

Drivers might have issues with filtering VID 0 packets on
receive.  This bug should be fixed for each driver.

Reviewed by:	ae (previous version), hselasky, melifaro
Sponsored by:	Mellanox Technologies
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D14702
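
For reference, the 16-bit 802.1Q tag control field packs PCP into the top 3 bits, DEI into the next bit, and the VLAN id into the low 12 bits. The helpers below are a hypothetical sketch of that packing, not the code moved into if_ethersubr.c; a priority-only tag as described in r331622 is simply one with the VLAN id set to 0.

```c
#include <assert.h>
#include <stdint.h>

/* Build a tag control field: PCP (3 bits), DEI (1 bit), VID (12 bits). */
uint16_t
ex_vlan_tci(uint8_t pcp, uint8_t dei, uint16_t vid)
{
	return ((uint16_t)(((pcp & 0x7) << 13) | ((dei & 0x1) << 12) |
	    (vid & 0x0fff)));
}

/* Extract the priority code point from a tag control field. */
uint8_t
ex_vlan_pcp(uint16_t tci)
{
	return ((uint8_t)(tci >> 13));
}

/* Extract the VLAN id; 0 means the frame belongs to no VLAN. */
uint16_t
ex_vlan_vid(uint16_t tci)
{
	return (tci & 0x0fff);
}
```

For example, ex_vlan_tci(5, 0, 0) gives 0xA000: priority 5 with VLAN id 0.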


