/[base]/head/sys/netinet6/ip6_output.c
ViewVC logotype

Log of /head/sys/netinet6/ip6_output.c

Parent Directory Parent Directory | Revision Log Revision Log


Links to HEAD: (view) (download) (annotate)
Sticky Revision:


Revision 366813 - (view) (download) (annotate) - [select for diffs]
Modified Sun Oct 18 17:15:47 2020 UTC (3 years, 8 months ago) by melifaro
File length: 87765 byte(s)
Diff to previous 366569
Implement flowid calculation for outbound connections to balance
 connections over multiple paths.

Multipath routing relies on mbuf flowid data for both transit
 and outbound traffic. Current code fills mbuf flowid from inp_flowid
 for connection-oriented sockets. However, inp_flowid is currently
 not calculated for outbound connections.

This change creates simple hashing functions and starts calculating hashes
 for TCP,UDP/UDP-Lite and raw IP if multipath routes are present in the
 system.

Reviewed by:	glebius (previous version),ae
Differential Revision:	https://reviews.freebsd.org/D26523


Revision 366569 - (view) (download) (annotate) - [select for diffs]
Modified Fri Oct 9 12:06:43 2020 UTC (3 years, 8 months ago) by rscheff
File length: 87804 byte(s)
Diff to previous 365870
Add IP(V6)_VLAN_PCP to set 802.1 priority per-flow.

This adds a new IP_PROTO / IPV6_PROTO setsockopt (getsockopt)
option IP(V6)_VLAN_PCP, which can be set to -1 (interface
default), or explicitly to any priority between 0 and 7.

Note that for untagged traffic, explicitly adding a
priority will insert a special 801.1Q vlan header with
vlan ID = 0 to carry the priority setting

Reviewed by:	gallatin, rrs
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26409


Revision 365870 - (view) (download) (annotate) - [select for diffs]
Modified Fri Sep 18 02:37:57 2020 UTC (3 years, 9 months ago) by np
File length: 86699 byte(s)
Diff to previous 365071
if_vxlan(4): add support for hardware assisted checksumming, TSO, and RSS.

This lets a VXLAN pseudo-interface take advantage of hardware checksumming (tx
and rx), TSO, and RSS if the NIC is capable of performing these operations on
inner VXLAN traffic.

A VXLAN interface inherits the capabilities of its vxlandev interface if one is
specified or of the interface that hosts the vxlanlocal address. If other
interfaces will carry traffic for that VXLAN then they must have the same
hardware capabilities.

On transmit, if_vxlan verifies that the outbound interface has the required
capabilities and then translates the CSUM_ flags to their inner equivalents.
This tells the hardware ifnet that it needs to operate on the inner frame and
not the outer VXLAN headers.

An event is generated when a VXLAN ifnet starts. This allows hardware drivers to
configure their devices to expect VXLAN traffic on the specified incoming port.

On receive, the hardware does RSS and checksum verification on the inner frame.
if_vxlan now does a direct netisr dispatch to take full advantage of RSS. It is
not very clear why it didn't do this already.

Future work:
Rx: it should be possible to avoid the first trip up the protocol stack to get
the frame to if_vxlan just so it can decapsulate and requeue for a second trip
up the stack. The hardware NIC driver could directly call an if_vxlan receive
routine for VXLAN traffic instead.

Rx: LRO. depends on what happens with the previous item. There will have to to
be a mechanism to indicate that it's time for if_vxlan to flush its LRO state.

Reviewed by:	kib@
Relnotes:	Yes
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D25873


Revision 365071 - (view) (download) (annotate) - [select for diffs]
Modified Tue Sep 1 21:19:14 2020 UTC (3 years, 10 months ago) by mjg
File length: 86674 byte(s)
Diff to previous 363710
net: clean up empty lines in .c and .h files


Revision 363710 - (view) (download) (annotate) - [select for diffs]
Modified Thu Jul 30 17:43:23 2020 UTC (3 years, 11 months ago) by markj
File length: 86682 byte(s)
Diff to previous 362338
ip6_output(): Check the return value of in6_getlinkifnet().

If the destination address has an embedded scope ID, make sure that it
corresponds to a valid ifnet before proceeding.  Otherwise a sendto()
with a bogus link-local address can trigger a NULL pointer dereference.

Reported by:	syzkaller
Reviewed by:	ae
Fixes:		r358572
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25887


Revision 362338 - (view) (download) (annotate) - [select for diffs]
Modified Thu Jun 18 19:32:34 2020 UTC (4 years ago) by markj
File length: 86611 byte(s)
Diff to previous 361573
Add the SCTP_SUPPORT kernel option.

This is in preparation for enabling a loadable SCTP stack.  Analogous to
IPSEC/IPSEC_SUPPORT, the SCTP_SUPPORT kernel option must be configured
in order to support a loadable SCTP implementation.

Discussed with:	tuexen
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation


Revision 361573 - (view) (download) (annotate) - [select for diffs]
Modified Thu May 28 07:29:44 2020 UTC (4 years, 1 month ago) by melifaro
File length: 86456 byte(s)
Diff to previous 360961
Replace ip6_ouput fib6_lookup_nh_<ext|basic> calls with fib6_lookup().

fib6_lookup_nh_ represents pre-epoch generation of fib api,
providing less guarantees over pointer validness and requiring
on-stack data copying.

Conversion is straight-forwarded, as the only 2 differences are
requirement of running in network epoch and the need to handle
RTF_GATEWAY case in the caller code.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24973


Revision 360961 - (view) (download) (annotate) - [select for diffs]
Modified Tue May 12 14:01:12 2020 UTC (4 years, 1 month ago) by gallatin
File length: 86455 byte(s)
Diff to previous 360930
IPv6: sync IP_NO_SND_TAG_RL support from IPv4

The IP_NO_SND_TAG_RL flag to ip{,6}_output() means that the packets
being sent should bypass hardware rate limiting. This is typically used
by modern TCP stacks for rexmits.

This support was added to IPv4 in r352657, but never added to IPv6, even
though rack and bbr call ip6_output() with this flag.

Reviewed by:	rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D24822


Revision 360930 - (view) (download) (annotate) - [select for diffs]
Modified Mon May 11 21:23:22 2020 UTC (4 years, 1 month ago) by gallatin
File length: 86283 byte(s)
Diff to previous 360914
Fix the build

Back out the IPv6 portion of r360903, as the stamp_tag param
is apparently not supported in upstream FreeBSD.

Sponsored by:	Netflix
Pointy hat to: gallatin


Revision 360914 - (view) (download) (annotate) - [select for diffs]
Modified Mon May 11 19:17:33 2020 UTC (4 years, 1 month ago) by gallatin
File length: 86360 byte(s)
Diff to previous 360579
Ktls: never skip stamping tags for NIC TLS

The newer RACK and BBR TCP stacks have added a mechanism
to disable hardware packet pacing for TCP retransmits.
This mechanism works by skipping the send-tag stamp
on rate-limited connections when the TCP stack calls
ip_output() with the IP_NO_SND_TAG_RL flag set.

When doing NIC TLS, we must ignore this flag, as
NIC TLS packets must always be stamped.  Failure
to stamp a NIC TLS packet will result in crypto
issues.

Reviewed by:	hselasky, rrs
Sponsored by:	Netflix, Mellanox


Revision 360579 - (view) (download) (annotate) - [select for diffs]
Modified Sun May 3 00:12:56 2020 UTC (4 years, 2 months ago) by glebius
File length: 86283 byte(s)
Diff to previous 360292
Step 3: anonymize struct mbuf_ext_pgs and move all its fields into mbuf
        within m_epg namespace.
All edits except the 'struct mbuf' declaration and mb_dupcl() were done
mechanically with sed:

s/->m_ext_pgs.nrdy/->m_epg_nrdy/g
s/->m_ext_pgs.hdr_len/->m_epg_hdrlen/g
s/->m_ext_pgs.trail_len/->m_epg_trllen/g
s/->m_ext_pgs.first_pg_off/->m_epg_1st_off/g
s/->m_ext_pgs.last_pg_len/->m_epg_last_len/g
s/->m_ext_pgs.flags/->m_epg_flags/g
s/->m_ext_pgs.record_type/->m_epg_record_type/g
s/->m_ext_pgs.enc_cnt/->m_epg_enc_cnt/g
s/->m_ext_pgs.tls/->m_epg_tls/g
s/->m_ext_pgs.so/->m_epg_so/g
s/->m_ext_pgs.seqno/->m_epg_seqno/g
s/->m_ext_pgs.stailq/->m_epg_stailq/g

Reviewed by:	gallatin
Differential Revision:	https://reviews.freebsd.org/D24598


Revision 360292 - (view) (download) (annotate) - [select for diffs]
Modified Sat Apr 25 09:06:11 2020 UTC (4 years, 2 months ago) by melifaro
File length: 86287 byte(s)
Diff to previous 359919
Convert route caching to nexthop caching.

This change is build on top of nexthop objects introduced in r359823.

Nexthops are separate datastructures, containing all necessary information
 to perform packet forwarding such as gateway interface and mtu. Nexthops
 are shared among the routes, providing more pre-computed cache-efficient
 data while requiring less memory. Splitting the LPM code and the attached
 data solves multiple long-standing problems in the routing layer,
 drastically reduces the coupling with outher parts of the stack and allows
 to transparently introduce faster lookup algorithms.

Route caching was (re)introduced to minimise (slow) routing lookups, allowing
 for notably better performance for large TCP senders. Caching works by
 acquiring rtentry reference, which is protected by per-rtentry mutex.
 If the routing table is changed (checked by comparing the rtable generation id)
 or link goes down, cache record gets withdrawn.

Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nextop as a
 cached object, which is mostly mechanical. Other moving parts like cache
 cleanup on rtable change remains the same.

Differential Revision:	https://reviews.freebsd.org/D24340


Revision 359919 - (view) (download) (annotate) - [select for diffs]
Modified Tue Apr 14 14:46:06 2020 UTC (4 years, 2 months ago) by gallatin
File length: 86400 byte(s)
Diff to previous 359154
KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself.

While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.

This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.

In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.

This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone,

This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.

In a follow-on commit, I plan to remove some hacks to avoid
access ext_pgs fields of mbufs, since they will now be in
cache.

Many thanks to glebius for helping to make this better in
the Netflix tree.

Reviewed by:	hselasky, jhb, rrs, glebius (early version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D24213


Revision 359154 - (view) (download) (annotate) - [select for diffs]
Modified Thu Mar 19 21:38:52 2020 UTC (4 years, 3 months ago) by markj
File length: 86405 byte(s)
Diff to previous 358575
Fix synchronization in the IPV6_2292PKTOPTIONS set handler.

The inpcb needs to be locked when we update output packet options.
Otherwise it is possible for the IPV6_2292PKTOPTIONS handler to free
packet option structures while another thread is reading or updating
them.

Note that the option handler is still kind of broken.  For instance it
frees all options before performing privilege checks for individual
options.  However, this can be fixed separately.

Reported by:	syzbot+52eb0fd4ddc119787f9d@syzkaller.appspotmail.com
Reviewed by:	bz, tuexen
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D24125


Revision 358575 - (view) (download) (annotate) - [select for diffs]
Modified Tue Mar 3 13:48:12 2020 UTC (4 years, 4 months ago) by bz
File length: 86321 byte(s)
Diff to previous 358572
ip6: retire in6_selectroute_fib() as promised 8 years ago

In r231852 I added in6_selectroute_fib() as a compat function with the
fibnum as an extra argument compared to in6_selectroute() to keep the
KPI stable.
Way too late retire this function again and add the fib to in6_selectroute()
which also only has a single consumer now and was an orphan function before.


Revision 358572 - (view) (download) (annotate) - [select for diffs]
Modified Tue Mar 3 11:32:47 2020 UTC (4 years, 4 months ago) by bz
File length: 86325 byte(s)
Diff to previous 358311
ip6_output: use new routing KPI when not passed a cached route

Implement the equivalent of r347375 (IPv4) for the IPv6 output path.
In IPv6 we get passed a cached route (and inp) by udp6_output()
depending on whether we acquired a write lock on the INP.
In case we neither bind nor connect a first UDP packet would come in
with a cached route (wlocked) and all further packets would not.
In case we bind and do not connect we never write-lock the inp.

When we do not pass in a cached route, rather than providing the
storage for a route locally and pass it over the old lookup code
and down the stack, use the new route lookup KPI and acquire all
details we need to send the packet.

Compared to the IPv4 code the IPv6 code has a couple of possible
complications: given an option with a routing hdr/caching route there,
and path mtu (ro_pmtu) case which now equally has to deal with the
possibility of having a route which is NULL passed in, and the
fwd_tag in case a firewall changes the next hop (something to
factor out in the future).

Sponsored by:	Netflix
Reviewed by:	glebius
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D23886


Revision 358311 - (view) (download) (annotate) - [select for diffs]
Modified Tue Feb 25 15:03:41 2020 UTC (4 years, 4 months ago) by bz
File length: 84181 byte(s)
Diff to previous 358297
ip6_output: fix regression introduced in r358167 for ipv6 fragmentation

When moving the calculations for the optlen into the if (opt) block
which deals with possible extension headers I failed to initialise
unfragpartlen to the ipv6 header length if there were no extension
headers present.  Correct that mistake to make IPv6 fragment length
calculcations work again.

Reported by:	hselasky, kp
OKed by:	hselasky, kp
MFC after:	3 days
X-MFC with:	r358167
PR:		244393


Revision 358297 - (view) (download) (annotate) - [select for diffs]
Modified Mon Feb 24 19:12:20 2020 UTC (4 years, 4 months ago) by bz
File length: 84184 byte(s)
Diff to previous 358167
Fix IPv6 checksums when exthdrs are present.

In two places in ip6_output we are doing (delayed) checksum calculations.
The initial logic came from SCTP in r205075,205104 and later I copied
and adjusted it for the TCP|UDP case in r235958.
The problem was that the original SCTP offsets were already wrong for any
case with extension headers present given IPv6 extension headers are not
part of the pseudo checksum calculations.
The later changes do not help in case there is checksum offloading as for
extension headers (incl. fragments) we do currrently never offload as we
have no infrastructure to know whether the NIC can handle these cases.

Correct the offsets for delayed checksum calculations and properly handle
mbuf flags.  In addition harmonize the almost identical duplicate code.

While here eliminate the now unneeded variable hlen and add an always
missing mtod() call in the 1-b and 3 cases after the introduction of
the mb_unmapped_to_ext() calls.

Reported by:	Francis Dupont (fdupont isc.org)
PR:		243675
MFC after:	6 days
Reviewed by:	markj (earlier version), gallatin
Differential Revision:	https://reviews.freebsd.org/D23760


Revision 358167 - (view) (download) (annotate) - [select for diffs]
Modified Thu Feb 20 10:56:12 2020 UTC (4 years, 4 months ago) by bz
File length: 84151 byte(s)
Diff to previous 358071
ip6_output: improve extension header handling

Move IPv6 source address checks from after extension header heandling
to the top of the function. If we do not pass these checks there is
no reason to do a lot of work upfront.

Fold extension header preparations and length calculations together into
a single branch and macro rather than doing them sequentially.
Likewise move extension header concatination into a single branch block
only doing it if we recorded any extension header length length.

Reviewed by:	melifaro (earlier version), markj, gallatin
Sponsored by:	Netflix (partially, originally)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D23740


Revision 358071 - (view) (download) (annotate) - [select for diffs]
Modified Tue Feb 18 11:28:00 2020 UTC (4 years, 4 months ago) by bz
File length: 83803 byte(s)
Diff to previous 356974
ip6_output: update comments.

Clear up some comments and improve to panic messages.

No functional changes.

MFC after:	3 days


Revision 356974 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jan 22 05:51:22 2020 UTC (4 years, 5 months ago) by glebius
File length: 83630 byte(s)
Diff to previous 356253
Make ip6_output() and ip_output() require network epoch.

All callers that before may called into these functions
without network epoch now must enter it.


Revision 356253 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jan 1 17:32:20 2020 UTC (4 years, 6 months ago) by glebius
File length: 83678 byte(s)
Diff to previous 353356
In r343631 error code for a packet blocked by a firewall was
changed from EACCES to EPERM.  This change was not intentional,
so fix that.  Return EACCESS if a firewall forbids sending.

Noticed by:	ae


Revision 353356 - (view) (download) (annotate) - [select for diffs]
Modified Wed Oct 9 17:02:28 2019 UTC (4 years, 8 months ago) by glebius
File length: 83677 byte(s)
Diff to previous 353295
ip6_output() has a complex set of gotos, and some can jump out of
the epoch section towards return statement. Since entering epoch
is cheap, it is easier to cover the whole function with epoch,
rather than try to properly maintain its state.


Revision 353295 - (view) (download) (annotate) - [select for diffs]
Modified Mon Oct 7 23:35:23 2019 UTC (4 years, 8 months ago) by markj
File length: 83676 byte(s)
Diff to previous 353292
Improve locking in the IPV6_V6ONLY socket option handler.

Acquire the inp lock before checking whether the socket is already bound,
and around updates to the inp_vflag field.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21867


Revision 353292 - (view) (download) (annotate) - [select for diffs]
Modified Mon Oct 7 22:40:05 2019 UTC (4 years, 8 months ago) by glebius
File length: 83623 byte(s)
Diff to previous 351522
Widen NET_EPOCH coverage.

When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.

However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.

Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.

On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().

This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.

Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.

This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.

Reviewed by:	gallatin, hselasky, cy, adrian, kristof
Differential Revision:	https://reviews.freebsd.org/D19111


Revision 351522 - (view) (download) (annotate) - [select for diffs]
Modified Tue Aug 27 00:01:56 2019 UTC (4 years, 10 months ago) by jhb
File length: 83554 byte(s)
Diff to previous 350531
Add kernel-side support for in-kernel TLS.

KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets.  KTLS only supports
offload of TLS for transmitted data.  Key negotation must still be
performed in userland.  Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option.  All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type.  Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer.  Each TLS frame is described by a single
ext_pgs mbuf.  The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame().  ktls_enqueue() is then
called to schedule TLS frames for encryption.  In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed.  For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue().  Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends.  Internally,
Netflix uses proprietary pure-software backends.  This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames.  As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready().  At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation.  In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session.  TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted.  The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface.  If so, the packet is tagged
with the TLS send tag and sent to the interface.  The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation.  If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped.  In addition, a task is scheduled to refresh the TLS send
tag for the TLS session.  If a new TLS send tag cannot be allocated,
the connection is dropped.  If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag.  (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another.  As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8).  ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option.  They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax.  However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node.  The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default).  The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by:	gallatin, hselasky, rrs
Obtained from:	Netflix
Sponsored by:	Netflix, Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21277


Revision 350531 - (view) (download) (annotate) - [select for diffs]
Modified Fri Aug 2 07:41:36 2019 UTC (4 years, 11 months ago) by bz
File length: 82723 byte(s)
Diff to previous 349529
IPv6 cleanup: kernel

Finish what was started a few years ago and harmonize IPv6 and IPv4
kernel names.  We are down to very few places now that it is feasible
to do the change for everything remaining with causing too much disturbance.

Remove "aliases" for IPv6 names which confusingly could indicate
that we are talking about a different data structure or field or
have two fields, one for each address family.
Try to follow common conventions used in FreeBSD.

* Rename sin6p to sin6 as that is how it is spelt in most places.
* Remove "aliases" (#defines) for:
  - in6pcb which really is an inpcb and nothing separate
  - sotoin6pcb which is sotoinpcb (as per above)
  - in6p_sp which is inp_sp
  - in6p_flowinfo which is inp_flow
* Try to use ia6 for in6_addr rather than in6p.
* With all these gone  also rename the in6p variables to inp as
  that is what we call it in most of the network stack including
  parts of netinet6.

The reasons behind this cleanup are that we try to further
unify netinet and netinet6 code where possible and that people
will less ignore one or the other protocol family when doing
code changes as they may not have spotted places due to different
names for the same thing.

No functional changes.

Discussed with:		tuexen (SCTP changes)
MFC after:		3 months
Sponsored by:		Netflix


Revision 349529 - (view) (download) (annotate) - [select for diffs]
Modified Sat Jun 29 00:48:33 2019 UTC (5 years ago) by jhb
File length: 82826 byte(s)
Diff to previous 348966
Add an external mbuf buffer type that holds multiple unmapped pages.

Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages.  It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.

For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer.  This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers).  It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused.  To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.

Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.

NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability.  This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands.  For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.

If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output.  If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.

Submitted by:	gallatin (earlier version)
Reviewed by:	gallatin, hselasky, rrs
Discussed with:	ae, kp (firewalls)
Relnotes:	yes
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20616


Revision 348966 - (view) (download) (annotate) - [select for diffs]
Modified Tue Jun 11 22:07:39 2019 UTC (5 years ago) by jhb
File length: 82156 byte(s)
Diff to previous 348254
Sort opt_foo.h #includes and add a missing blank line in ip_output().


Revision 348254 - (view) (download) (annotate) - [select for diffs]
Modified Fri May 24 22:30:40 2019 UTC (5 years, 1 month ago) by jhb
File length: 82156 byte(s)
Diff to previous 347465
Restructure mbuf send tags to provide stronger guarantees.

- Perform ifp mismatch checks (to determine if a send tag is allocated
  for a different ifp than the one the packet is being output on), in
  ip_output() and ip6_output().  This avoids sending packets with send
  tags to ifnet drivers that don't support send tags.

  Since we are now checking for ifp mismatches before invoking
  if_output, we can now try to allocate a new tag before invoking
  if_output sending the original packet on the new tag if allocation
  succeeds.

  To avoid code duplication for the fragment and unfragmented cases,
  add ip_output_send() and ip6_output_send() as wrappers around
  if_output and nd6_output_ifp, respectively.  All of the logic for
  setting send tags and dealing with send tag-related errors is done
  in these wrapper functions.

  For pseudo interfaces that wrap other network interfaces (vlan and
  lagg), wrapper send tags are now allocated so that ip*_output see
  the wrapper ifp as the ifp in the send tag.  The if_transmit
  routines rewrite the send tags after performing an ifp mismatch
  check.  If an ifp mismatch is detected, the transmit routines fail
  with EAGAIN.

- To provide clearer life cycle management of send tags, especially
  in the presence of vlan and lagg wrapper tags, add a reference count
  to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
  Provide a helper function (m_snd_tag_init()) for use by drivers
  supporting send tags.  m_snd_tag_init() takes care of the if_ref
  on the ifp meaning that code alloating send tags via if_snd_tag_alloc
  no longer has to manage that manually.  Similarly, m_snd_tag_rele
  drops the refcount on the ifp after invoking if_snd_tag_free when
  the last reference to a send tag is dropped.

  This also closes use after free races if there are pending packets in
  driver tx rings after the socket is closed (e.g. from tcpdrop).

  In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
  csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
  Drivers now also check this flag instead of checking snd_tag against
  NULL.  This avoids false positive matches when a forwarded packet
  has a non-NULL rcvif that was treated as a send tag.

- cxgbe was relying on snd_tag_free being called when the inp was
  detached so that it could kick the firmware to flush any pending
  work on the flow.  This is because the driver doesn't require ACK
  messages from the firmware for every request, but instead does a
  kind of manual interrupt coalescing by only setting a flag to
  request a completion on a subset of requests.  If all of the
  in-flight requests don't have the flag when the tag is detached from
  the inp, the flow might never return the credits.  The current
  snd_tag_free command issues a flush command to force the credits to
  return.  However, the credit return is what also frees the mbufs,
  and since those mbufs now hold references on the tag, this meant
  that snd_tag_free would never be called.

  To fix, explicitly drop the mbuf's reference on the snd tag when the
  mbuf is queued in the firmware work queue.  This means that once the
  inp's reference on the tag goes away and all in-flight mbufs have
  been queued to the firmware, tag's refcount will drop to zero and
  snd_tag_free will kick in and send the flush request.  Note that we
  need to avoid doing this in the middle of ethofld_tx(), so the
  driver grabs a temporary reference on the tag around that loop to
  defer the free to the end of the function in case it sends the last
  mbuf to the queue after the inp has dropped its reference on the
  tag.

- mlx5 preallocates send tags and was using the ifp pointer even when
  the send tag wasn't in use.  Explicitly use the ifp from other data
  structures instead.

- Sprinkle some assertions in various places to assert that received
  packets don't have a send tag, and that other places that overwrite
  rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.

Reviewed by:	gallatin, hselasky, rgrimes, ae
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20117


Revision 347465 - (view) (download) (annotate) - [select for diffs]
Modified Fri May 10 20:15:40 2019 UTC (5 years, 1 month ago) by jhb
File length: 81863 byte(s)
Diff to previous 346677
Apply r280991 to ip6_fragment.

This uses m_dup_pkthdr() to copy all of the metadata about a packet to
each of its fragments including VLAN tags, mbuf tags, etc. instead of
hand-copying a few fields.

Reviewed by:	bz
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20117


Revision 346677 - (view) (download) (annotate) - [select for diffs]
Modified Thu Apr 25 15:37:28 2019 UTC (5 years, 2 months ago) by gallatin
File length: 81593 byte(s)
Diff to previous 346400
Track TCP connection's NUMA domain in the inpcb

Drivers can now pass up numa domain information via the
mbuf numa domain field.  This information is then used
by TCP syncache_socket() to associate that information
with the inpcb. The domain information is then fed back
into transmitted mbufs in ip{6}_output(). This mechanism
is nearly identical to what is done to track RSS hash values
in the inp_flowid.

Follow on changes will use this information for lacp egress
port selection, binding TCP pacers to the appropriate NUMA
domain, etc.

Reviewed by:	markj, kib, slavash, bz, scottl, jtl, tuexen
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20028


Revision 346400 - (view) (download) (annotate) - [select for diffs]
Modified Fri Apr 19 17:17:41 2019 UTC (5 years, 2 months ago) by tuexen
File length: 81524 byte(s)
Diff to previous 343631
Improve input validation for the socket option IPV6_CHECKSUM.

When using the IPPROTO_IPV6 level socket option IPV6_CHECKSUM on a raw
IPv6 socket, ensure that the value is either -1 or a non-negative even
number.

Reviewed by:		bz@, thj@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D19966


Revision 343631 - (view) (download) (annotate) - [select for diffs]
Modified Thu Jan 31 23:01:03 2019 UTC (5 years, 5 months ago) by glebius
File length: 81451 byte(s)
Diff to previous 342884
New pfil(9) KPI together with newborn pfil API and control utility.

The KPI have been reviewed and cleansed of features that were planned
back 20 years ago and never implemented.  The pfil(9) internals have
been made opaque to protocols with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as kernel uses same command structures that userland
ioctl uses.

In nutshell [KA]PI is about declaring filtering points, declaring
filters and linking and unlinking them together.

New [KA]PI makes it possible to reconfigure pfil(9) configuration:
change order of hooks, rehook filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only,
prepend/append a filter to existing list of filters.

Now it possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of existing packet filters yet support that,
however limited usage is already possible, e.g. default ruleset can
be moved to single interface, as soon as interface would pride their
filtering points.

Another future feature is possiblity to create pfil heads, that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when
packet has just been received by a NIC and no mbuf was yet allocated.

Differential Revision:	https://reviews.freebsd.org/D18951


Revision 342884 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jan 9 14:28:08 2019 UTC (5 years, 5 months ago) by hselasky
File length: 81394 byte(s)
Diff to previous 341827
Fix loopback traffic when using non-lo0 link local IPv6 addresses.

The loopback interface can only receive packets with a single scope ID,
namely the scope ID of the loopback interface itself. To mitigate this
packets which use the scope ID are appearing as received by the real
network interface, see "origifp" in the patch. The current code would
drop packets which are designated for loopback which use a link-local
scope ID in the destination address or source address, because they
won't match the lo0's scope ID. To fix this restore the network
interface pointer from the scope ID in the destination address for
the problematic cases. See comments added in patch for a more detailed
description.

This issue was introduced with route caching (ae@).

Reviewed by:		bz (network)
Differential Revision:	https://reviews.freebsd.org/D18769
MFC after:		1 week
Sponsored by:		Mellanox Technologies


Revision 341827 - (view) (download) (annotate) - [select for diffs]
Modified Tue Dec 11 19:32:16 2018 UTC (5 years, 6 months ago) by mjg
File length: 80306 byte(s)
Diff to previous 338450
Remove unused argument to priv_check_cred.

Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation


Revision 338450 - (view) (download) (annotate) - [select for diffs]
Modified Mon Sep 3 22:27:27 2018 UTC (5 years, 10 months ago) by bz
File length: 80336 byte(s)
Diff to previous 336301
Replicate r328271 from legacy IP to IPv6 using a single macro
to clear L2 and L3 route caches.
Also mark one function argument as __unused.

Reviewed by:	karels, ae
Approved by:	re (rgrimes)
Differential Revision:	https://reviews.freebsd.org/D17007


Revision 336301 - (view) (download) (annotate) - [select for diffs]
Modified Sun Jul 15 00:47:06 2018 UTC (5 years, 11 months ago) by mmacy
File length: 80485 byte(s)
Diff to previous 334719
acquire inp lock around ip6_pcbopt to fix IPV6_TCLASS panic

Simple fix to address panics relating to setting IPV6_TCLASS
with setsockopt(). The premise of this change is that it is
ok to call malloc with M_NOWAIT while holding a lock on the
in6p.

If it later turns out that it is not ok, then major surgery
will be required, as ip6_setpktopt() will have to be fixed
(as it also calls malloc with M_NOWAIT) which pulls in the
ip6_pcbopts(), ip6_setpktopts(), ip6_setpktopt() call chain.

Submitted by:	Jason Eggnet
Reviewed by:	rrs, transport, sbruno
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D16201


Revision 334719 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jun 6 15:45:57 2018 UTC (6 years ago) by sbruno
File length: 79955 byte(s)
Diff to previous 334707
Load balance sockets with new SO_REUSEPORT_LB option.

This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures:
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations:
As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or
threads sharing the same socket).

This is a substantially different contribution as compared to its original
incarnation at svn r332894 and reverted at svn r332967.  Thanks to rwatson@
for the substantive feedback that is included in this commit.

Submitted by:	Johannes Lundberg <johalun0@gmail.com>
Obtained from:	DragonflyBSD
Relnotes:	Yes
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D11003


Revision 334707 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jun 6 10:46:24 2018 UTC (6 years ago) by ae
File length: 79715 byte(s)
Diff to previous 332967
Use m_copyback() function to write delayed checksum when it isn't located
in the first mbuf of the chain.

MFC after:	1 week


Revision 332967 - (view) (download) (annotate) - [select for diffs]
Modified Tue Apr 24 19:55:12 2018 UTC (6 years, 2 months ago) by sbruno
File length: 80005 byte(s)
Diff to previous 332894
Revert r332894 at the request of the submitter.

Submitted by:	Johannes Lundberg <johalun0_gmail.com>
Sponsored by:	Limelight Networks


Revision 332894 - (view) (download) (annotate) - [select for diffs]
Modified Mon Apr 23 19:51:00 2018 UTC (6 years, 2 months ago) by sbruno
File length: 80245 byte(s)
Diff to previous 331484
Load balance sockets with new SO_REUSEPORT_LB option

This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations
As DragonflyBSD, a load balance group is limited to 256 pcbs
(256 programs or threads sharing the same socket).

Submitted by:	Johannes Lundberg <johanlun0@gmail.com>
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D11003


Revision 331484 - (view) (download) (annotate) - [select for diffs]
Modified Sat Mar 24 12:43:34 2018 UTC (6 years, 3 months ago) by jtl
File length: 80005 byte(s)
Diff to previous 331454
Remove some unneccessary variable sets in IPv6 code, as detected by
clang's static analyzer.

Reviewed by:	bz
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D10940


Revision 331454 - (view) (download) (annotate) - [select for diffs]
Modified Fri Mar 23 18:34:38 2018 UTC (6 years, 3 months ago) by sbruno
File length: 80011 byte(s)
Diff to previous 331436
Revert r331379 as the "simple" lock changes have revealed a deeper problem
and need for a rethink.

Submitted by:	Jason Eggleston <jason@eggnet.com>
Sponsored by:	Limelight Networks


Revision 331436 - (view) (download) (annotate) - [select for diffs]
Modified Fri Mar 23 16:56:44 2018 UTC (6 years, 3 months ago) by kp
File length: 80393 byte(s)
Diff to previous 331380
netpfil: Introduce PFIL_FWD flag

Forwarded packets passed through PFIL_OUT, which made it difficult for
firewalls to figure out if they were forwarding or producing packets. This in
turn is an issue for pf for IPv6 fragment handling: it needs to call
ip6_output() or ip6_forward() to handle the fragments. Figuring out which was
difficult (and until now, incorrect).
Having pfil distinguish the two removes an ugly piece of code from pf.

Introduce a new variant of the netpfil callbacks with a flags variable, which
has PFIL_FWD set for forwarded packets. This allows pf to reliably work out if
a packet is forwarded.

Reviewed by:	ae, kevans
Differential Revision:	https://reviews.freebsd.org/D13715


Revision 331380 - (view) (download) (annotate) - [select for diffs]
Modified Thu Mar 22 23:34:48 2018 UTC (6 years, 3 months ago) by sbruno
File length: 80390 byte(s)
Diff to previous 331379
Refactor ip6_getpcbopt() for better locking and memory management

Created GET_PKTOPT_EXT_HDR() and GET_PKTOPT_SOCKADDR() macros to
handle safely fetching options from in6p_outputopts, including
properly dealing with in6p locking and preparing memory for
sooptcopyout().

Changed the function signature of ip6_getpcbopt() to allow the
function to acquire and release locks on in6p as needed.

Submitted by:	Jason Eggleston <jason@eggnet.com>
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14619


Revision 331379 - (view) (download) (annotate) - [select for diffs]
Modified Thu Mar 22 22:29:32 2018 UTC (6 years, 3 months ago) by sbruno
File length: 80026 byte(s)
Diff to previous 331376
Simple locking fixes in ip_ctloutput, ip6_ctloutput, rip_ctloutput.

Submitted by:	Jason Eggleston <jason@eggnet.com>
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14624


Revision 331376 - (view) (download) (annotate) - [select for diffs]
Modified Thu Mar 22 21:18:34 2018 UTC (6 years, 3 months ago) by sbruno
File length: 79644 byte(s)
Diff to previous 331373
Handle locking and memory safety for IPV6_PATHMTU in ip6_ctloutput().

Submitted by:	Jason Eggleston <jason@eggnet.com>
Reviewed by:	ae
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14622


Revision 331373 - (view) (download) (annotate) - [select for diffs]
Modified Thu Mar 22 20:21:05 2018 UTC (6 years, 3 months ago) by sbruno
File length: 79448 byte(s)
Diff to previous 326023
Improve write locking in ip6_ctloutput() with macros.

Submitted by:	Jason Eggleston <jason@eggnet.com>
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14620


Revision 326023 - (view) (download) (annotate) - [select for diffs]
Modified Mon Nov 20 19:43:44 2017 UTC (6 years, 7 months ago) by pfg
File length: 79629 byte(s)
Diff to previous 321618
sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


Revision 321618 - (view) (download) (annotate) - [select for diffs]
Modified Thu Jul 27 13:03:36 2017 UTC (6 years, 11 months ago) by bz
File length: 79585 byte(s)
Diff to previous 319216
After inpcb route caching was put back in place there is no need for
flowtable anymore (as flowtable was never considered to be useful in
the forwarding path).

Reviewed by:		np
Differential Revision:	https://reviews.freebsd.org/D11448


Revision 319216 - (view) (download) (annotate) - [select for diffs]
Modified Tue May 30 14:50:28 2017 UTC (7 years, 1 month ago) by jtl
File length: 79745 byte(s)
Diff to previous 318124
Fix an unnecessary/incorrect check in the PKTOPT_EXTHDRCPY macro.

This macro allocates memory and, if malloc does not return NULL, copies
data into the new memory. However, it doesn't just check whether malloc
returns NULL. It also checks whether we called malloc with M_NOWAIT. That
is not necessary.

While it may be that malloc() will only return NULL when the M_NOWAIT flag
is set, we don't need to check for this when checking malloc's return
value. Further, in this case, the check was not completely accurate,
because it checked for flags == M_NOWAIT, rather than treating it as a bit
field and checking for (flags & M_NOWAIT).

Reviewed by:	ae
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D10942


Revision 318124 - (view) (download) (annotate) - [select for diffs]
Modified Wed May 10 00:14:55 2017 UTC (7 years, 1 month ago) by np
File length: 79768 byte(s)
Diff to previous 317282
ip6_output runs with the inp lock held, just like ip_output.


Revision 317282 - (view) (download) (annotate) - [select for diffs]
Modified Sat Apr 22 13:04:36 2017 UTC (7 years, 2 months ago) by kp
File length: 79744 byte(s)
Diff to previous 317186
Rename variable for clarity

Rename the mtu variable in ip6_fragment(), because mtu is misleading. The
variable actually holds the fragment length.
No functional change.

Suggested by: ae


Revision 317186 - (view) (download) (annotate) - [select for diffs]
Modified Thu Apr 20 09:05:53 2017 UTC (7 years, 2 months ago) by kp
File length: 79712 byte(s)
Diff to previous 315956
pf: Fix possible incorrect IPv6 fragmentation

When forwarding pf tracks the size of the largest fragment in a fragmented
packet, and refragments based on this size.
It failed to ensure that this size was a multiple of 8 (as is required for all
but the last fragment), so it could end up generating incorrect fragments.

For example, if we received an 8 byte and 12 byte fragment pf would emit a first
fragment with 12 bytes of payload and the final fragment would claim to be at
offset 8 (not 12).

We now assert that the fragment size is a multiple of 8 in ip6_fragment(), so
other users won't make the same mistake.

Reported by:	Antonios Atlasis <aatlasis at secfu net>
MFC after:	3 days


Revision 315956 - (view) (download) (annotate) - [select for diffs]
Modified Sat Mar 25 15:06:28 2017 UTC (7 years, 3 months ago) by karels
File length: 79639 byte(s)
Diff to previous 314722
Fix reference count leak with L2 caching.

ip_forward, TCP/IPv6, and probably SCTP leaked references to L2 cache
entry because they used their own routes on the stack, not in_pcb routes.
The original model for route caching was callers that provided a route
structure to ip{,6}input() would keep the route, and this model was used
for L2 caching as well. Instead, change L2 caching to be done by default
only when using a route structure in the in_pcb; the pcb deallocation
code frees L2 as well as L3 cacches. A separate change will add route
caching to TCP/IPv6.

Another suggestion was to have the transport protocols indicate willingness
to use L2 caching, but this approach keeps the changes in the network
level

Reviewed by:    ae gnn
MFC after:      2 weeks
Differential Revision:  https://reviews.freebsd.org/D10059
and those below, will be ignored--
> Description of fields to fill in above:                     76 columns --|
> PR:                       If and which Problem Report is related.
> Submitted by:             If someone else sent in the change.
> Reported by:              If someone else reported the issue.
> Reviewed by:              If someone else reviewed your modification.
> Approved by:              If you needed approval for this commit.
> Obtained from:            If the change is from a third party.
> MFC after:                N [day[s]|week[s]|month[s]].  Request a reminder email.
> MFH:                      Ports tree branch name.  Request approval for merge.
> Relnotes:                 Set to 'yes' for mention in release notes.
> Security:                 Vulnerability reference (one per line) or description.
> Sponsored by:             If the change was sponsored by an organization.
> Differential Revision:    https://reviews.freebsd.org/D### (*full* phabric URL needed).
> Empty fields above will be automatically removed.

M    netinet/in_pcb.c
M    netinet/ip_output.c
M    netinet6/ip6_output.c


Revision 314722 - (view) (download) (annotate) - [select for diffs]
Modified Mon Mar 6 04:01:58 2017 UTC (7 years, 3 months ago) by eri
File length: 79676 byte(s)
Diff to previous 314436
The patch provides the same socket option as Linux IP_ORIGDSTADDR.
Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD.

The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.

This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.

Reviewed by:	adrian, aw
Approved by:	ae (mentor)
Sponsored by:	rsync.net
Differential Revision:	D9235


Revision 314436 - (view) (download) (annotate) - [select for diffs]
Modified Tue Feb 28 23:42:47 2017 UTC (7 years, 4 months ago) by imp
File length: 79492 byte(s)
Diff to previous 313675
Renumber copyright clause 4

Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96


Revision 313675 - (view) (download) (annotate) - [select for diffs]
Modified Sun Feb 12 06:56:33 2017 UTC (7 years, 4 months ago) by eri
File length: 79492 byte(s)
Diff to previous 313524
Committed without approval from mentor.

Reported by:	gnn


Revision 313524 - (view) (download) (annotate) - [select for diffs]
Modified Fri Feb 10 05:16:14 2017 UTC (7 years, 4 months ago) by eri
File length: 79676 byte(s)
Diff to previous 313330
The patch provides the same socket option as Linux IP_ORIGDSTADDR.
Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD.

The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.

This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.

Sponsored-by: rsync.net
Differential Revision: D9235
Reviewed-by: adrian


Revision 313330 - (view) (download) (annotate) - [select for diffs]
Modified Mon Feb 6 08:49:57 2017 UTC (7 years, 4 months ago) by ae
File length: 79492 byte(s)
Diff to previous 312379
Merge projects/ipsec into head/.

 Small summary
 -------------

o Almost all IPsec releated code was moved into sys/netipsec.
o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel
  option IPSEC_SUPPORT added. It enables support for loading
  and unloading of ipsec.ko and tcpmd5.ko kernel modules.
o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by
  default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type
  support was removed. Added TCP/UDP checksum handling for
  inbound packets that were decapsulated by transport mode SAs.
  setkey(8) modified to show run-time NAT-T configuration of SA.
o New network pseudo interface if_ipsec(4) added. For now it is
  build as part of ipsec.ko module (or with IPSEC kernel).
  It implements IPsec virtual tunnels to create route-based VPNs.
o The network stack now invokes IPsec functions using special
  methods. The only one header file <netipsec/ipsec_support.h>
  should be included to declare all the needed things to work
  with IPsec.
o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed.
  Now these protocols are handled directly via IPsec methods.
o TCP_SIGNATURE support was reworked to be more close to RFC.
o PF_KEY SADB was reworked:
  - now all security associations stored in the single SPI namespace,
    and all SAs MUST have unique SPI.
  - several hash tables added to speed up lookups in SADB.
  - SADB now uses rmlock to protect access, and concurrent threads
    can do SA lookups in the same time.
  - many PF_KEY message handlers were reworked to reflect changes
    in SADB.
  - SADB_UPDATE message was extended to support new PF_KEY headers:
    SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They
    can be used by IKE daemon to change SA addresses.
o ipsecrequest and secpolicy structures were cardinally changed to
  avoid locking protection for ipsecrequest. Now we support
  only limited number (4) of bundled SAs, but they are supported
  for both INET and INET6.
o INPCB security policy cache was introduced. Each PCB now caches
  used security policies to avoid SP lookup for each packet.
o For inbound security policies added the mode, when the kernel does
  check for full history of applied IPsec transforms.
o References counting rules for security policies and security
  associations were changed. The proper SA locking added into xform
  code.
o xform code was also changed. Now it is possible to unregister xforms.
  tdb_xxx structures were changed and renamed to reflect changes in
  SADB/SPDB, and changed rules for locking and refcounting.

Reviewed by:	gnn, wblock
Obtained from:	Yandex LLC
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D9352


Revision 312379 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jan 18 13:31:17 2017 UTC (7 years, 5 months ago) by hselasky
File length: 80382 byte(s)
Diff to previous 307541
Implement kernel support for hardware rate limited sockets.

- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.

- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.

- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().

- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.

- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.

- How rate limiting works:

1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.

2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.

3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.

4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.

Reviewed by:		wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision:	https://reviews.freebsd.org/D3687
Sponsored by:		Mellanox Technologies
MFC after:		3 months


Revision 307541 - (view) (download) (annotate) - [select for diffs]
Modified Mon Oct 17 23:25:31 2016 UTC (7 years, 8 months ago) by gnn
File length: 79422 byte(s)
Diff to previous 307234
Limit the number of mbufs that can be allocated for IPV6_2292PKTOPTIONS
(and IPV6_PKTOPTIONS).

PR:		100219
Submitted by:	Joseph Kong
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D5157


Revision 307234 - (view) (download) (annotate) - [select for diffs]
Modified Thu Oct 13 20:15:47 2016 UTC (7 years, 8 months ago) by glebius
File length: 78914 byte(s)
Diff to previous 305824
- Revert r300854, r303657 which tried to fix regression from r297225.
- Fix the regression proper way using RO_RTFREE().

Submitted by:	ae


Revision 305824 - (view) (download) (annotate) - [select for diffs]
Modified Thu Sep 15 07:41:48 2016 UTC (7 years, 9 months ago) by kevlo
File length: 79083 byte(s)
Diff to previous 304713
Remove the 4.3BSD compatible macro m_copy(), use m_copym() instead.

Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D7878


Revision 304713 - (view) (download) (annotate) - [select for diffs]
Modified Wed Aug 24 00:52:30 2016 UTC (7 years, 10 months ago) by karels
File length: 79061 byte(s)
Diff to previous 303657
Fix L2 caching for UDP over IPv6

ip6_output() was missing cache invalidation code analougous to
ip_output.c. r304545 disabled L2 caching for UDP/IPv6 as a workaround.
This change adds the missing cache invalidation code and reverts
r304545.

Reviewed by:	gnn
Approved by:	gnn (mentor)
Tested by:	peter@, Mike Andrews
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D7591


Revision 303657 - (view) (download) (annotate) - [select for diffs]
Modified Tue Aug 2 12:18:06 2016 UTC (7 years, 11 months ago) by ae
File length: 78782 byte(s)
Diff to previous 303626
Fix NULL pointer dereference.
ro pointer can be NULL when IPSec consumes mbuf.

PR:		211486
MFC after:	3 days


Revision 303626 - (view) (download) (annotate) - [select for diffs]
Modified Mon Aug 1 17:02:21 2016 UTC (7 years, 11 months ago) by gallatin
File length: 78761 byte(s)
Diff to previous 302784
Rework IPV6 TCP path MTU discovery to match IPv4

- Re-write tcp_ctlinput6() to closely mimic the IPv4 tcp_ctlinput()

- Now that tcp_ctlinput6() updates t_maxseg, we can allow ip6_output()
  to send TCP packets without looking at the tcp host cache for every
  single transmit.

- Make the icmp6 code mimic the IPv4 code & avoid returning
  PRC_HOSTDEAD because it is so expensive.

Without these changes in place, every TCP6 pmtu discovery or host
unreachable ICMP resulted in a call to in6_pcbnotify() which walks the
tcbinfo table with the write lock held.  Because the tcbinfo table is
shared between IPv4 and IPv6, this causes huge scalabilty issues on
servers with lots of (~100K) TCP connections, to the point where even
a small percent of IPv6 traffic had a disproportionate impact on
overall throughput.

Reviewed by:	bz, rrs, ae (all earlier versions), lstewart (in Netflix's tree)
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D7272


Revision 302784 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jul 13 19:41:19 2016 UTC (7 years, 11 months ago) by dim
File length: 78610 byte(s)
Diff to previous 301717
Fix a page fault in ip6_setpktopt(), occurring when the pflog module is
loaded, and syncthing is started, which uses setsockopt(IPV6_PKGINFO).

This is because pflog interfaces do not normally have an IPv6 address,
causing the ND_IFINFO() macro to dereference a NULL pointer.

Reviewed by:	ae
PR:		210943
MFC after:	3 days


Revision 301717 - (view) (download) (annotate) - [select for diffs]
Modified Thu Jun 9 05:48:34 2016 UTC (8 years ago) by ae
File length: 78568 byte(s)
Diff to previous 301217
Cleanup unneded include "opt_ipfw.h".

It was used for conditional build IPFIREWALL_FORWARD support.
But IPFIREWALL_FORWARD option was removed a long time ago.


Revision 301217 - (view) (download) (annotate) - [select for diffs]
Modified Thu Jun 2 17:51:29 2016 UTC (8 years, 1 month ago) by gnn
File length: 78590 byte(s)
Diff to previous 300854
This change re-adds L2 caching for TCP and UDP, as originally added in D4306
but removed due to other changes in the system. Restore the llentry pointer
to the "struct route", and use it to cache the L2 lookup (ARP or ND6) as
appropriate.

Submitted by:	Mike Karels
Differential Revision:	https://reviews.freebsd.org/D6262


Revision 300854 - (view) (download) (annotate) - [select for diffs]
Modified Fri May 27 17:31:02 2016 UTC (8 years, 1 month ago) by glebius
File length: 78553 byte(s)
Diff to previous 300298
Plug route reference underleak that happens with FLOWTABLE after r297225.

Submitted by:	Mike Karels <mike karels.net>


Revision 300298 - (view) (download) (annotate) - [select for diffs]
Modified Fri May 20 12:17:40 2016 UTC (8 years, 1 month ago) by ae
File length: 78405 byte(s)
Diff to previous 300297
Remove ip6 adjusting from the place where pointer couldn't be changed.
And add comment after calling PFIL hooks, where it could be changed.


Revision 300297 - (view) (download) (annotate) - [select for diffs]
Modified Fri May 20 12:09:10 2016 UTC (8 years, 1 month ago) by ae
File length: 78441 byte(s)
Diff to previous 300262
Remove ip6 pointer initialization and strange check from the beginning
of ip6_output(). It isn't used until the first time adjusted.
Remove the comment about adjusting where it is actually initialized.


Revision 300262 - (view) (download) (annotate) - [select for diffs]
Modified Fri May 20 04:45:08 2016 UTC (8 years, 1 month ago) by markj
File length: 78559 byte(s)
Diff to previous 300202
Move IPv6 malloc tag definitions into the IPv6 code.


Revision 300202 - (view) (download) (annotate) - [select for diffs]
Modified Thu May 19 12:45:20 2016 UTC (8 years, 1 month ago) by ae
File length: 78500 byte(s)
Diff to previous 300054
Since PFIL can change destination address, use its always actual value
from mbuf when calculating path mtu. Remove now unused finaldst variable.
Also constify dst argument in ip6_getpmtu() and ip6_getpmtu_ctl().

Reviewed by:	melifaro
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC


Revision 300054 - (view) (download) (annotate) - [select for diffs]
Modified Tue May 17 14:06:55 2016 UTC (8 years, 1 month ago) by ae
File length: 78508 byte(s)
Diff to previous 298075
Call RO_RTFREE() when we have detected the change of destination
address, otherwise the old route will be used with new destination.

MFC after:	1 week


Revision 298075 - (view) (download) (annotate) - [select for diffs]
Modified Fri Apr 15 17:30:33 2016 UTC (8 years, 2 months ago) by pfg
File length: 78484 byte(s)
Diff to previous 297225
sys/net* : for pointers replace 0 with NULL.

Mostly cosmetical, no functional change.

Found with devel/coccinelle.


Revision 297225 - (view) (download) (annotate) - [select for diffs]
Modified Thu Mar 24 07:54:56 2016 UTC (8 years, 3 months ago) by gnn
File length: 78469 byte(s)
Diff to previous 296242
FreeBSD previously provided route caching for TCP (and UDP). Re-add
route caching for TCP, with some improvements. In particular, invalidate
the route cache if a new route is added, which might be a better match.
The cache is automatically invalidated if the old route is deleted.

Submitted by:	Mike Karels
Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D4306


Revision 296242 - (view) (download) (annotate) - [select for diffs]
Modified Tue Mar 1 00:17:14 2016 UTC (8 years, 4 months ago) by glebius
File length: 78007 byte(s)
Diff to previous 293169
New way to manage reference counting of mbuf external storage.

The m_ext.ext_cnt pointer becomes a union. It can now hold the refcount
value itself. To tell that m_ext.ext_flags flag EXT_FLAG_EMBREF is used.
The first mbuf to attach a cluster stores the refcount. The further mbufs
to reference the cluster point at refcount in the first mbuf. The first
mbuf is freed only when the last reference is freed.

The benefit over refcounts stored in separate slabs is that now refcounts
of different, unrelated mbufs do not share a cache line.

For EXT_EXTREF mbufs the zone_ext_refcnt is no longer needed, and m_extadd()
becomes void, making widely used M_EXTADD macro safe.

For EXT_SFBUF mbufs the sf_ext_ref() is removed, which was an optimization
exactly against the cache aliasing problem with regular refcounting.

Discussed with:		rrs, rwatson, gnn, hiren, sbruno, np
Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D5396
Sponsored by:		Netflix


Revision 293169 - (view) (download) (annotate) - [select for diffs]
Modified Mon Jan 4 18:32:24 2016 UTC (8 years, 6 months ago) by melifaro
File length: 78013 byte(s)
Diff to previous 293098
Finish r293098: make ip6_getpmtu() and ip6_getpmtu_ctl() use new routing API


Revision 293098 - (view) (download) (annotate) - [select for diffs]
Modified Sun Jan 3 09:54:03 2016 UTC (8 years, 6 months ago) by melifaro
File length: 77978 byte(s)
Diff to previous 292956
Handle IPV6_PATHMTU option by spliting ip6_getpmtu_ctl() from ip6_getpmtu().
Add ro_mtu field to 'struct route' to be able to pass lookup MTU back to
  the caller.

Currently, ip6_getpmtu() has 2 totally different use cases:
1) control plane (IPV6_PATHMTU req), where we just need to calculate MTU
  and return it, w/o any reusability.
2) Actual ip6_output() data path where we (nearly) always use the provided
  route lookup data. If this data is not 'valid' we need to perform another
  lookup and save the result (which cannot be re-used by ip6_output()).

Given that, handle 1) by calling separate function doing rte lookup itself.
  Resulting MTU is calculated by (newly-added) ip6_calcmtu() used by both
  ip6_getpmtu_ctl() and ip6_getpmtu().
For 2) instead of storing ref'ed rte, store mtu (the only needed data
  from the lookup result) inside newly-added ro_mtu field.
  'struct route' was shrinked by 8(or 4 bytes) in r292978. Grow it again
  by 4 bytes. New ro_mtu field will be used in other places like
  ip/tcp_output (EMSGSIZE handling from output routines).

Reviewed by:	ae


Revision 292956 - (view) (download) (annotate) - [select for diffs]
Modified Wed Dec 30 18:08:05 2015 UTC (8 years, 6 months ago) by jtl
File length: 76525 byte(s)
Diff to previous 290867
Add the appropriate case statement for IPV6_BINDMULTI so the option can be
retrieved with getsockopt().

CID:	1229928
Differential Revision:	https://reviews.freebsd.org/D4737
Reviewed by:	adrian
Sponsored by:	Juniper Networks


Revision 290867 - (view) (download) (annotate) - [select for diffs]
Modified Sun Nov 15 16:02:22 2015 UTC (8 years, 7 months ago) by melifaro
File length: 76501 byte(s)
Diff to previous 287861
Bring back the ability of passing cached route via nd6_output_ifp().


Revision 287861 - (view) (download) (annotate) - [select for diffs]
Modified Wed Sep 16 14:26:28 2015 UTC (8 years, 9 months ago) by melifaro
File length: 76489 byte(s)
Diff to previous 287525
Simplify the way of attaching IPv6 link-layer header.

Problem description:
How do we currently perform layer 2 resolution and header imposition:

For IPv4 we have the following chain:
  ip_output() -> (ether|atm|whatever)_output() -> arpresolve()

Lookup is done in proper place (link-layer output routine) and it is possible
  to provide cached lle data.

For IPv6 situation is more complex:
  ip6_output() -> nd6_output() -> nd6_output_ifp() -> (whatever)_output() ->
    nd6_storelladdr()

We have ip6_ouput() which calls nd6_output() instead of link output routine.
nd6_output() does the following:
  * checks if lle exists, creates it if needed (similar to arpresolve())
  * performes lle state transitions (similar to arpresolve())
  * calls nd6_output_ifp() which pushes packets to link output routine along
    with running SeND/MAC hooks regardless of lle state
    (e.g. works as run-hooks placeholder).

After that, iface output routine like ether_output() calls nd6_storelladdr()
  which performs lle lookup once again.

As a result, we perform lookup twice for each outgoing packet for most types
  of interfaces. We also need to maintain runtime-checked table of 'nd6-free'
  interfaces (see nd6_need_cache()).

Fix this behavior by eliminating first ND lookup. To be more specific:
  * make all nd6_output() consumers use nd6_output_ifp() instead
  * rename nd6_output[_slow]() to nd6_resolve_[slow]()
  * convert nd6_resolve() and nd6_resolve_slow() to arpresolve() semantics,
    e.g. copy L2 address to buffer instead of pushing packet towards lower
    layers
  * Make all nd6_storelladdr() users use nd6_resolve()
  * eliminate nd6_storelladdr()

The resulting callchain is the following:
  ip6_output() -> nd6_output_ifp() -> (whatever)_output() -> nd6_resolve()

Error handling:
Currently sending packet to non-existing la results in ip6_<output|forward>
  -> nd6_output() -> nd6_output _lle() which returns 0.
In new scenario packet is propagated to <ether|whatever>_output() ->
  nd6_resolve() which will return EWOULDBLOCK, and that result
  will be converted to 0.

(And EWOULDBLOCK is actually used by IB/TOE code).

Sponsored by:		Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D1469


Revision 287525 - (view) (download) (annotate) - [select for diffs]
Modified Sun Sep 6 20:57:57 2015 UTC (8 years, 9 months ago) by adrian
File length: 76503 byte(s)
Diff to previous 286452
Add support for receiving flowtype, flowid and RSS bucket information as part of recvmsg().

Submitted by:	Tiwei Bie <btw@mail.ustc.edu.cn>
Differential Revision:	https://reviews.freebsd.org/D3562


Revision 286452 - (view) (download) (annotate) - [select for diffs]
Modified Sat Aug 8 15:58:35 2015 UTC (8 years, 10 months ago) by melifaro
File length: 76027 byte(s)
Diff to previous 285107
Simplify ip[6] simploop:
Do not pass 'dst' sockaddr to ip[6]_mloopback:
  - We have explicit check for AF_INET in ip_output()
  - We assume ip header inside passed mbuf in ip_mloopback
  - We assume ip6 header inside passed mbuf in ip6_mloopback


Revision 285107 - (view) (download) (annotate) - [select for diffs]
Modified Fri Jul 3 19:01:38 2015 UTC (9 years ago) by ae
File length: 76066 byte(s)
Diff to previous 282578
Keep IPv6 address specified by IPV6_PKTINFO socket option in kernel
internal form to be able handle link-local IPv6 addresses.

Reported by:	kp
Tested by:	kp


Revision 282578 - (view) (download) (annotate) - [select for diffs]
Modified Thu May 7 14:17:43 2015 UTC (9 years, 1 month ago) by ae
File length: 75937 byte(s)
Diff to previous 280955
Mark data checksum as valid for multicast packets, that we send back
to myself via simloop.
Also remove duplicate check under #ifdef DIAGNOSTIC.

PR:		180065
MFC after:	1 week


Revision 280955 - (view) (download) (annotate) - [select for diffs]
Modified Wed Apr 1 12:15:01 2015 UTC (9 years, 3 months ago) by kp
File length: 75855 byte(s)
Diff to previous 279588
Preserve IPv6 fragment IDs accross reassembly and refragmentation

When forwarding fragmented IPv6 packets and filtering with PF we
reassemble and refragment. That means we generate new fragment headers
and a new fragment ID.

We already save the fragment IDs so we can do the reassembly so it's
straightforward to apply the incoming fragment ID on the refragmented
packets.

Differential Revision:	https://reviews.freebsd.org/D2188
Approved by:		gnn (mentor)


Revision 279588 - (view) (download) (annotate) - [select for diffs]
Modified Wed Mar 4 11:20:01 2015 UTC (9 years, 4 months ago) by ae
File length: 75832 byte(s)
Diff to previous 278842
Fix deadlock in IPv6 PCB code.

When several threads are trying to send datagram to the same destination,
but fragmentation is disabled and datagram size exceeds link MTU,
ip6_output() calls pfctlinput2(PRC_MSGSIZE). It does notify all
sockets wanted to know MTU to this destination. And since all threads
hold PCB lock while sending, taking the lock for each PCB in the
in6_pcbnotify() leads to deadlock.

RFC 3542 p.11.3 suggests notify all application wanted to receive
IPV6_PATHMTU ancillary data for each ICMPv6 packet too big message.
But it doesn't require this, when we don't receive ICMPv6 message.

Change ip6_notify_pmtu() function to be able use it directly from
ip6_output() to notify only one socket, and to notify all sockets
when ICMPv6 packet too big message received.

PR:		197059
Differential Revision:	https://reviews.freebsd.org/D1949
Reviewed by:	no objection from #network
Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC


Revision 278842 - (view) (download) (annotate) - [select for diffs]
Modified Mon Feb 16 06:30:27 2015 UTC (9 years, 4 months ago) by glebius
File length: 75965 byte(s)
Diff to previous 278832
Factor out ip6_fragment() function, to be used in IPv6 stack and pf(4).

Submitted by:		Kristof Provost
Differential Revision:	D1766


Revision 278832 - (view) (download) (annotate) - [select for diffs]
Modified Mon Feb 16 05:58:32 2015 UTC (9 years, 4 months ago) by glebius
File length: 75740 byte(s)
Diff to previous 278828
Move ip6_deletefraghdr() to frag6.c.

Suggested by:	bz


Revision 278828 - (view) (download) (annotate) - [select for diffs]
Modified Mon Feb 16 01:12:20 2015 UTC (9 years, 4 months ago) by glebius
File length: 76368 byte(s)
Diff to previous 277331
Factor out ip6_deletefraghdr() function, to be shared between IPv6
stack and pf(4).

Submitted by:	Kristof Provost
Reviewed by:	ae
Differential Revision:	D1764


Revision 277331 - (view) (download) (annotate) - [select for diffs]
Modified Sun Jan 18 18:06:40 2015 UTC (9 years, 5 months ago) by adrian
File length: 75740 byte(s)
Diff to previous 277072
Refactor / restructure the RSS code into generic, IPv4 and IPv6 specific
bits.

The motivation here is to eventually teach netisr and potentially
other networking subsystems a bit more about how RSS work queues / buckets
are configured so things have a hope of auto-configuring in the future.

* net/rss_config.[ch] takes care of the generic bits for doing
  configuration, hash function selection, etc;
* topelitz.[ch] is now in net/ rather than netinet/;
* (and would be in libkern if it didn't directly include RSS_KEYSIZE;
  that's a later thing to fix up.)
* netinet/in_rss.[ch] now just contains the IPv4 specific methods;
* and netinet/in6_rss.[ch] now just contains the IPv6 specific methods.

This should have no functional impact on anyone currently using
the RSS support.

Differential Revision:	D1383
Reviewed by:	gnn, jfv (intel driver bits)


Revision 277072 - (view) (download) (annotate) - [select for diffs]
Modified Mon Jan 12 14:52:43 2015 UTC (9 years, 5 months ago) by glebius
File length: 75710 byte(s)
Diff to previous 276692
Do not go one layer down to check ifqueue length. First, not all drivers
use ifqueue at all. Second, there is no point in this lockless check.
Either positive or negative result of the check could be incorrect after
a tick.

Sponsored by:	Nginx, Inc.


Revision 276692 - (view) (download) (annotate) - [select for diffs]
Modified Mon Jan 5 09:58:32 2015 UTC (9 years, 5 months ago) by rwatson
File length: 76042 byte(s)
Diff to previous 275710
To ease changes to underlying mbuf structure and the mbuf allocator, reduce
the knowledge of mbuf layout, and in particular constants such as M_EXT,
MLEN, MHLEN, and so on, in mbuf consumers by unifying various alignment
utility functions (M_ALIGN(), MH_ALIGN(), MEXT_ALIGN() in a single
M_ALIGN() macro, implemented by a now-inlined m_align() function:

- Move m_align() from uipc_mbuf.c to mbuf.h; mark as __inline.
- Reimplement M_ALIGN(), MH_ALIGN(), and MEXT_ALIGN() using m_align().
- Update consumers around the tree to simply use M_ALIGN().

This change eliminates a number of cases where mbuf consumers must be aware
of whether or not mbufs returned by the allocator use external storage, but
also assumptions about the size of the returned mbuf. This will make it
easier to introduce changes in how we use external storage, as well as
features such as variable-size mbufs.

Differential Revision:	https://reviews.freebsd.org/D1436
Reviewed by:	glebius, trasz, gnn, bz
Sponsored by:	EMC / Isilon Storage Division


Revision 275710 - (view) (download) (annotate) - [select for diffs]
Modified Thu Dec 11 18:35:34 2014 UTC (9 years, 6 months ago) by ae
File length: 76043 byte(s)
Diff to previous 275358
Remove flag/flags argument from the following functions:
 ipsec_getpolicybyaddr()
 ipsec4_checkpolicy()
 ip_ipsec_output()
 ip6_ipsec_output()

The only flag used here was IP_FORWARDING.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC


Revision 275358 - (view) (download) (annotate) - [select for diffs]
Modified Mon Dec 1 11:45:24 2014 UTC (9 years, 7 months ago) by hselasky
File length: 76027 byte(s)
Diff to previous 274611
Start process of removing the use of the deprecated "M_FLOWID" flag
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.

This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.

"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.

Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.

MFC after:	1 month
Sponsored by:	Mellanox Technologies


Revision 274611 - (view) (download) (annotate) - [select for diffs]
Added Mon Nov 17 01:05:29 2014 UTC (9 years, 7 months ago) by melifaro
File length: 76033 byte(s)
Diff to previous 274331
Finish r274175: do control plane MTU tracking.

Update route MTU in case of ifnet MTU change.
Add new RTF_FIXEDMTU to track explicitly specified MTU.

Old behavior:
ifconfig em0 mtu 1500->9000 -> all routes traversing em0 do not change MTU.
User has to manually update all routes.
ifconfig em0 mtu 9000->1500 -> all routes traversing em0 do not change MTU.
However, if ip[6]_output finds route with rt_mtu > interface mtu, rt_mtu
gets updated.

New behavior:
ifconfig em0 mtu 1500->9000 -> all interface routes in all fibs gets updated
with new MTU unless RTF_FIXEDMTU flag set on them.
ifconfig em0 mtu 9000->1500 -> all routes in all fibs gets updated with new
MTU unless RTF_FIXEDMTU flag set on them AND rt_mtu is less than ifp mtu.

route add ... -mtu XXX automatically sets RTF_FIXEDMTU flag.
route change .. -mtu 0 automatically removes RTF_FIXEDMTU flag.

PR:		194238
MFC after:	1 month
CR:		D1125



This form allows you to request diffs between any two revisions of this file. For each of the two "sides" of the diff, enter a numeric revision.

  Diffs between and
  Type of Diff should be a

  ViewVC Help
Powered by ViewVC 1.1.27