Log of /head/sys/netinet/tcp_usrreq.c

Revision 368819
Modified Sat Dec 19 22:04:46 2020 UTC (3 years, 6 months ago) by gallatin
File length: 73060 byte(s)
Diff to previous 365071
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain

In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
it will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.

This change provides a new socket option, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.

This patch, together with a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA, allows us to serve 190Gb/s of kTLS-encrypted
HTTPS media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross-domain traffic on the memory controller.

Reviewed by:	jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21636
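
The domain filter and its fallback described in r368819 can be sketched in plain C. This is a simplified userland model, not the kernel's implementation: the struct, the function name lb_select(), and the modulo-based hashing are assumptions made for illustration only.

```c
#include <stddef.h>

/* Hypothetical model of one worker sharing an SO_REUSEPORT_LB listen socket. */
struct lb_listener {
	int id;			/* illustrative worker identifier */
	int numa_domain;	/* domain the worker asked to be filtered to */
};

/*
 * Pick a listener for a connection that arrived on conn_domain.
 * Prefer listeners on the same domain; if none exist, fall back to
 * hashing across every listener regardless of domain.
 * Returns NULL only when the group is empty.
 */
static const struct lb_listener *
lb_select(const struct lb_listener *grp, size_t n, int conn_domain,
    unsigned int hash)
{
	size_t i, local = 0, pick;

	if (n == 0)
		return (NULL);
	for (i = 0; i < n; i++)
		if (grp[i].numa_domain == conn_domain)
			local++;
	if (local == 0)
		return (&grp[hash % n]);	/* fallback: ignore domains */

	/* Hash only across the listeners that live on conn_domain. */
	pick = hash % local;
	for (i = 0; i < n; i++) {
		if (grp[i].numa_domain != conn_domain)
			continue;
		if (pick-- == 0)
			return (&grp[i]);
	}
	return (NULL);	/* not reached */
}
```

With two workers on domain 0 and one on domain 1, a connection arriving on domain 1 always lands on the single domain-1 worker, while a connection on a domain with no workers is spread over all three.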


Revision 365071
Modified Tue Sep 1 21:19:14 2020 UTC (3 years, 10 months ago) by mjg
File length: 72813 byte(s)
Diff to previous 363256
net: clean up empty lines in .c and .h files


Revision 363256
Modified Thu Jul 16 16:46:24 2020 UTC (3 years, 11 months ago) by tuexen
File length: 72814 byte(s)
Diff to previous 361926
(Re)-allow 0.0.0.0 to be used as an address in connect() for TCP
In r361752, error handling was introduced for using 0.0.0.0 or
255.255.255.255 as the address in connect() for TCP, since both
addresses can't be used. However, the stack maps 0.0.0.0 implicitly
to a local address and at least two regressions were reported.
Therefore, re-allow the usage of 0.0.0.0.
While there, change the error indicated when using 255.255.255.255
from EAFNOSUPPORT to EACCES as mentioned in the man-page of connect().

Reviewed by:		rrs
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D25401
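
The resulting connect() address policy from r363256 can be summarized as a small check. This is only a sketch of the rule stated in the log message; the function name is hypothetical and the real kernel code path is structured differently.

```c
#include <errno.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/*
 * Returns 0 if addr (network byte order) is acceptable as a TCP
 * connect() destination under the rules described above, otherwise
 * an errno value.
 */
static int
tcp_connect_addr_check(struct in_addr addr)
{
	if (ntohl(addr.s_addr) == INADDR_BROADCAST)
		return (EACCES);	/* 255.255.255.255 is rejected */
	/* INADDR_ANY (0.0.0.0) is allowed; the stack maps it to a local address. */
	return (0);
}
```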


Revision 361926
Modified Mon Jun 8 11:48:07 2020 UTC (4 years ago) by rrs
File length: 72977 byte(s)
Diff to previous 361752
An important statistic in determining whether a server process (or client) is
being delayed is the time to first byte in and the time to first byte out.
Currently we have no way to know these; all we have is t_starttime, which
tells us when the 3-way handshake completed. We don't know when the first
request came in or how quickly we responded. Nor, from a client perspective,
do we know how long it took from when we sent out the first byte until the
server responded.

This small change adds the ability to track the TTFBs. These will show up in
BB logging, which can then be pulled for later analysis. Note that currently
all three values are tracked via the ticks variable. This provides
only a very rough estimate (with hz=1000 the resolution is 1ms). A follow-on
set of work will be to change all three of these values into something with a
much finer resolution (either microseconds or nanoseconds), though we may want
to make the resolution configurable so that on lower-powered machines we could
still use the much cheaper ticks variable.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D24902
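
To make the resolution remark in r361926 concrete, the conversion from a tick delta to wall-clock time can be sketched as below. The helper name and the use of unsigned 32-bit counters are assumptions for this example and do not mirror the kernel's types.

```c
#include <stdint.h>

/*
 * Convert the distance between two tick samples to microseconds for a
 * given hz (ticks per second). With hz = 1000, one tick is 1 ms, so no
 * interval shorter than that can be distinguished.
 */
static uint64_t
ticks_to_usec(uint32_t start_ticks, uint32_t end_ticks, uint32_t hz)
{
	/* Unsigned subtraction stays correct across counter wraparound. */
	uint32_t delta = end_ticks - start_ticks;

	return ((uint64_t)delta * 1000000u / hz);
}
```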


Revision 361752
Modified Wed Jun 3 14:16:40 2020 UTC (4 years ago) by rrs
File length: 72647 byte(s)
Diff to previous 361750
We should never allow either the broadcast address or INADDR_ANY to be
connected to or sent to. This was found when working with Michael
Tuexen and syzkaller. syzkaller seems to want to use either of
these two addresses to connect to at times. It really is
an error to do so, so let's not allow that behavior.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D24852


Revision 361750
Modified Wed Jun 3 13:51:53 2020 UTC (4 years ago) by tuexen
File length: 72152 byte(s)
Diff to previous 361228
Restrict enabling TCP-FASTOPEN to end-points in CLOSED or LISTEN state

Enabling TCP-FASTOPEN on an end-point which is in a state other than
CLOSED or LISTEN is a bug in the application, so it should not work.
Also, the TCP code does not (and need not) handle this.
While there, also simplify the setting of the TF_FASTOPEN flag.

This issue was found by running syzkaller.

Reviewed by:		rrs
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D25115
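
The state restriction in r361750 amounts to a guard in front of the flag update. The sketch below uses made-up state and flag names, and the choice of EINVAL as the error is an assumption rather than something the log message states.

```c
#include <errno.h>

/* Illustrative subset of TCP states; values are not the kernel's. */
enum tcp_state { ST_CLOSED, ST_LISTEN, ST_SYN_SENT, ST_ESTABLISHED };

#define FLAG_FASTOPEN 0x1u	/* stand-in for TF_FASTOPEN */

/*
 * Enable or disable fast open, but only while the end-point is still
 * CLOSED or LISTEN. Any other state is treated as an application bug.
 */
static int
set_fastopen(enum tcp_state state, unsigned int *flags, int enable)
{
	if (state != ST_CLOSED && state != ST_LISTEN)
		return (EINVAL);	/* assumed error code */
	if (enable)
		*flags |= FLAG_FASTOPEN;
	else
		*flags &= ~FLAG_FASTOPEN;
	return (0);
}
```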


Revision 361228
Modified Mon May 18 22:53:12 2020 UTC (4 years, 1 month ago) by karels
File length: 72057 byte(s)
Diff to previous 361081
Allow TCP to reuse local port with different destinations

Previously, tcp_connect() would bind a local port before connecting,
forcing the local port to be unique across all outgoing TCP connections
for the address family. Instead, choose a local port after selecting
the destination and the local address, requiring only that the tuple
is unique and does not match a wildcard binding.

Reviewed by:	tuexen (rscheff, rrs previous version)
MFC after:	1 month
Sponsored by:	Forcepoint LLC
Differential Revision:	https://reviews.freebsd.org/D24781
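
The idea in r361228, choosing the local port only after the destination is known so that just the full tuple has to be unique, can be illustrated with a linear search over in-use tuples. All names here are hypothetical, and the sketch leaves out the wildcard-binding check that the log message also requires.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct conn_tuple {
	uint32_t laddr, faddr;
	uint16_t lport, fport;
};

/*
 * Return the first port in [lo, hi] for which the tuple
 * (laddr, port, faddr, fport) is not already in use, or 0 if none.
 * A port used towards a different destination does not count as a clash.
 */
static uint16_t
pick_local_port(const struct conn_tuple *inuse, size_t n, uint32_t laddr,
    uint32_t faddr, uint16_t fport, uint16_t lo, uint16_t hi)
{
	for (uint32_t p = lo; p <= hi; p++) {
		bool clash = false;

		for (size_t i = 0; i < n; i++) {
			if (inuse[i].lport == p && inuse[i].laddr == laddr &&
			    inuse[i].faddr == faddr && inuse[i].fport == fport) {
				clash = true;
				break;
			}
		}
		if (!clash)
			return ((uint16_t)p);
	}
	return (0);
}
```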


Revision 361081
Modified Fri May 15 14:06:37 2020 UTC (4 years, 1 month ago) by tuexen
File length: 71317 byte(s)
Diff to previous 360879
Allow only IPv4 addresses in sendto() for TCP on AF_INET sockets.

This problem was found by looking at syzkaller reproducers for some other
problems.

Reviewed by:		rrs
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D24831


Revision 360879
Modified Sun May 10 17:43:42 2020 UTC (4 years, 1 month ago) by tuexen
File length: 71188 byte(s)
Diff to previous 360638
Remove trailing whitespace.


Revision 360638
Modified Mon May 4 20:19:57 2020 UTC (4 years, 1 month ago) by rrs
File length: 71189 byte(s)
Diff to previous 360408
Adjust the fb to have a way to ask the underlying stack
whether it can support the PRUS option (OOB). Then have
the new function call that to validate and, if needed, give the
correct error response to the user (rack
and bbr do not support the obsolete OOB data).

Sponsored by: Netflix Inc.
Differential Revision:	 https://reviews.freebsd.org/D24574


Revision 360408
Modified Mon Apr 27 23:17:19 2020 UTC (4 years, 2 months ago) by jhb
File length: 70420 byte(s)
Diff to previous 360402
Initial support for kernel offload of TLS receive.

- Add a new TCP_RXTLS_ENABLE socket option to set the encryption and
  authentication algorithms and keys as well as the initial sequence
  number.

- When reading from a socket using KTLS receive, applications must use
  recvmsg().  Each successful call to recvmsg() will return a single
  TLS record.  A new TCP control message, TLS_GET_RECORD, will contain
  the TLS record header of the decrypted record.  The regular message
  buffer passed to recvmsg() will receive the decrypted payload.  This
  is similar to the interface used by Linux's KTLS RX except that
  Linux does not return the full TLS header in the control message.

- Add plumbing to the TOE KTLS interface to request either transmit
  or receive KTLS sessions.

- When a socket is using receive KTLS, redirect reads from
  soreceive_stream() into soreceive_generic().

- Note that this interface is currently only defined for TLS 1.1 and
  1.2, though I believe we will be able to reuse the same interface
  and structures for 1.3.


Revision 360402
Modified Mon Apr 27 22:31:42 2020 UTC (4 years, 2 months ago) by jhb
File length: 70089 byte(s)
Diff to previous 357858
Add the initial sequence number to the TLS enable socket option.

This will be needed for KTLS RX.

Reviewed by:	gallatin
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D24451


Revision 357858
Modified Thu Feb 13 15:14:46 2020 UTC (4 years, 4 months ago) by tuexen
File length: 69282 byte(s)
Diff to previous 357818
sack_newdata and snd_recover hold the same value. Therefore, use only
a single instance: use snd_recover also where sack_newdata was used.

Submitted by:		Richard Scheffenegger
Differential Revision:	https://reviews.freebsd.org/D18811


Revision 357818
Modified Wed Feb 12 13:31:36 2020 UTC (4 years, 4 months ago) by rrs
File length: 69326 byte(s)
Diff to previous 357276
White space cleanup -- remove trailing tabs or spaces
from any line.

Sponsored by:	Netflix Inc.


Revision 357276
Modified Wed Jan 29 22:48:18 2020 UTC (4 years, 5 months ago) by glebius
File length: 69335 byte(s)
Diff to previous 356986
Fix missing NET_EPOCH_ENTER() when compiled with TCP_OFFLOAD.

Reported by:	Coverity
CID:		1413162


Revision 356986
Modified Wed Jan 22 15:06:59 2020 UTC (4 years, 5 months ago) by bz
File length: 69335 byte(s)
Diff to previous 356983
Fix NOINET kernels after r356983.

All gotos to the label are within the #ifdef INET section, which leaves
us with an unused label.  Cover the label under #ifdef INET as well to
avoid the warning and compile time error.


Revision 356983
Modified Wed Jan 22 06:10:41 2020 UTC (4 years, 5 months ago) by glebius
File length: 69316 byte(s)
Diff to previous 356982
Make in_pcbladdr() require network epoch entered by its callers.  Together
with this widen network epoch coverage up to tcp_connect() and udp_connect().

Revisions from r356974 and up to this revision cover D23187.

Differential Revision:	https://reviews.freebsd.org/D23187


Revision 356982
Modified Wed Jan 22 06:07:27 2020 UTC (4 years, 5 months ago) by glebius
File length: 69244 byte(s)
Diff to previous 356981
Remove extraneous NET_EPOCH_ASSERT - the full function is covered.


Revision 356981
Modified Wed Jan 22 06:06:27 2020 UTC (4 years, 5 months ago) by glebius
File length: 69267 byte(s)
Diff to previous 356980
Re-absorb tcp_detach() back into tcp_usr_detach() as the comment suggests.
Not a functional change.


Revision 356980
Modified Wed Jan 22 06:04:56 2020 UTC (4 years, 5 months ago) by glebius
File length: 69830 byte(s)
Diff to previous 356978
Don't enter network epoch in tcp_usr_detach. A PCB removal doesn't
require that.


Revision 356978
Modified Wed Jan 22 06:01:26 2020 UTC (4 years, 5 months ago) by glebius
File length: 70018 byte(s)
Diff to previous 356976
tcp_usr_attach() doesn't need network epoch.  in_pcbfree() and
in_pcbdetach() perform all necessary synchronization themselves.


Revision 356976
Modified Wed Jan 22 05:54:58 2020 UTC (4 years, 5 months ago) by glebius
File length: 70136 byte(s)
Diff to previous 356975
Inline tcp_attach() into tcp_usr_attach().  Not a functional change.


Revision 356975
Modified Wed Jan 22 05:53:16 2020 UTC (4 years, 5 months ago) by glebius
File length: 70554 byte(s)
Diff to previous 355304
Make tcp_output() require network epoch.

Enter the epoch before calling into tcp_output() from those
functions that didn't do so before.

This eliminates a bunch of epoch recursions in TCP.


Revision 355304
Modified Mon Dec 2 20:58:04 2019 UTC (4 years, 7 months ago) by trasz
File length: 70186 byte(s)
Diff to previous 355273
Make use of the stats(3) framework in the TCP stack.

This makes it possible to retrieve per-connection statistical
information such as the receive window size, RTT, or goodput,
using a newly added TCP_STATS getsockopt(2) option, and extract
them using the stats_voistat_fetch(3) API.

See the net/tcprtt port for an example consumer of this API.

Compared to the existing TCP_INFO system, the main differences
are that this mechanism is easy to extend without breaking ABI,
and provides statistical information instead of raw "snapshots"
of values at a given point in time.  stats(3) is more generic
and can be used in both userland and the kernel.

Reviewed by:	thj
Tested by:	thj
Obtained from:	Netflix
Relnotes:	yes
Sponsored by:	Klara Inc, Netflix
Differential Revision:	https://reviews.freebsd.org/D20655


Revision 355273
Modified Sun Dec 1 21:01:33 2019 UTC (4 years, 7 months ago) by tuexen
File length: 68108 byte(s)
Diff to previous 354422
Move all ECN related flags from the flags to the flags2 field.
This allows adding more ECN related flags in the future.
No functional change intended.

Submitted by:		Richard Scheffenegger
Reviewed by:		rrs@, tuexen@
Differential Revision:	https://reviews.freebsd.org/D22497


Revision 354422
Modified Thu Nov 7 00:10:14 2019 UTC (4 years, 7 months ago) by glebius
File length: 67902 byte(s)
Diff to previous 354044
Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER() in
TCP functions that are executed in syscall context.  No
functional change here.


Revision 354044
Modified Thu Oct 24 20:05:10 2019 UTC (4 years, 8 months ago) by tuexen
File length: 68234 byte(s)
Diff to previous 353328
Ensure that the flags indicating IPv4/IPv6 are not changed by failing
bind() calls. This would lead to inconsistent state resulting in a panic.
A fix for stable/11 was committed in
https://svnweb.freebsd.org/base?view=revision&revision=338986
An accelerated MFC is planned as discussed with emaste@.

Reported by:		syzbot+2609a378d89264ff5a42@syzkaller.appspotmail.com
Obtained from:		jtl@
MFC after:		1 day
Sponsored by:		Netflix, Inc.


Revision 353328
Modified Tue Oct 8 21:34:06 2019 UTC (4 years, 8 months ago) by jhb
File length: 66909 byte(s)
Diff to previous 351522
Add a TOE KTLS mode and a TOE hook for allocating TLS sessions.

This adds the glue to allocate TLS sessions and invokes it from
the TLS enable socket option handler.  This also adds some counters
for active TOE sessions.

The TOE KTLS mode is returned by getsockopt(TCP_TXTLS_MODE) when
TOE KTLS is in use on a socket, but cannot be set via setsockopt().

To simplify various checks, a TLS session now includes an explicit
'mode' member set to the value returned by TCP_TXTLS_MODE.  Various
places that used to check 'sw_encrypt' against NULL to determine
software vs ifnet (NIC) TLS now check 'mode' instead.

Reviewed by:	np, gallatin
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21891


Revision 351522
Modified Tue Aug 27 00:01:56 2019 UTC (4 years, 10 months ago) by jhb
File length: 66988 byte(s)
Diff to previous 350531
Add kernel-side support for in-kernel TLS.

KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets.  KTLS only supports
offload of TLS for transmitted data.  Key negotiation must still be
performed in userland.  Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option.  All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type.  Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer.  Each TLS frame is described by a single
ext_pgs mbuf.  The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame().  ktls_enqueue() is then
called to schedule TLS frames for encryption.  In the case of
sendfile(2), sendfile_iodone() calls ktls_enqueue() instead of
pru_ready(), leaving the mbufs marked M_NOTREADY until encryption is
completed.  For other writes (vn_sendfile when pages are available,
write(2), etc.), the PRUS_NOTREADY flag is set when invoking
pru_send() along with invoking ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue().  Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends.  Internally,
Netflix uses proprietary pure-software backends.  This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames.  As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready().  At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation.  In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session.  TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted.  The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface.  If so, the packet is tagged
with the TLS send tag and sent to the interface.  The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation.  If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped.  In addition, a task is scheduled to refresh the TLS send
tag for the TLS session.  If a new TLS send tag cannot be allocated,
the connection is dropped.  If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag.  (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another.  As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8).  ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option.  They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax.  However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node.  The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default).  The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by:	gallatin, hselasky, rrs
Obtained from:	Netflix
Sponsored by:	Netflix, Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21277


Revision 350531
Modified Fri Aug 2 07:41:36 2019 UTC (4 years, 11 months ago) by bz
File length: 66203 byte(s)
Diff to previous 349529
IPv6 cleanup: kernel

Finish what was started a few years ago and harmonize IPv6 and IPv4
kernel names.  We are down to so few places now that it is feasible
to do the change for everything remaining without causing too much disturbance.

Remove "aliases" for IPv6 names which confusingly could indicate
that we are talking about a different data structure or field or
have two fields, one for each address family.
Try to follow common conventions used in FreeBSD.

* Rename sin6p to sin6 as that is how it is spelt in most places.
* Remove "aliases" (#defines) for:
  - in6pcb which really is an inpcb and nothing separate
  - sotoin6pcb which is sotoinpcb (as per above)
  - in6p_sp which is inp_sp
  - in6p_flowinfo which is inp_flow
* Try to use ia6 for in6_addr rather than in6p.
* With all these gone, also rename the in6p variables to inp as
  that is what we call it in most of the network stack including
  parts of netinet6.

The reasons behind this cleanup are that we try to further
unify netinet and netinet6 code where possible and that people
will be less likely to overlook one or the other protocol family when
making code changes, as they may otherwise miss places due to different
names for the same thing.

No functional changes.

Discussed with:		tuexen (SCTP changes)
MFC after:		3 months
Sponsored by:		Netflix


Revision 349529
Modified Sat Jun 29 00:48:33 2019 UTC (5 years ago) by jhb
File length: 66240 byte(s)
Diff to previous 346360
Add an external mbuf buffer type that holds multiple unmapped pages.

Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages.  It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.

For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer.  This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers).  It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused.  To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.

Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.

NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability.  This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands.  For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.

If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output.  If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.

Submitted by:	gallatin (earlier version)
Reviewed by:	gallatin, hselasky, rrs
Discussed with:	ae, kp (firewalls)
Relnotes:	yes
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20616


Revision 346360
Modified Thu Apr 18 23:21:26 2019 UTC (5 years, 2 months ago) by jhb
File length: 66262 byte(s)
Diff to previous 341335
Push down INP_WLOCK slightly in tcp_ctloutput.

The inp lock is not needed for testing the V6 flag as that flag is set
once when the inp is created and never changes.  For non-TCP socket
options the lock is immediately dropped after checking that flag.
This just pushes the lock down to only be acquired for TCP socket
options.

This isn't a hot-path, more a cosmetic cleanup I noticed while reading
the code.

Reviewed by:	bz
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19740


Revision 341335
Modified Fri Nov 30 10:50:07 2018 UTC (5 years, 7 months ago) by tuexen
File length: 66304 byte(s)
Diff to previous 338291
Limit option_len for the TCP_CCALGOOPT.

Limiting the length to 2048 bytes seems to be acceptable, since
the values used right now are using 8 bytes.

Reviewed by:		glebius, bz, rrs
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18366


Revision 338291
Modified Fri Aug 24 10:50:19 2018 UTC (5 years, 10 months ago) by tuexen
File length: 66239 byte(s)
Diff to previous 338138
Fix a shadowed variable warning.
Thanks to Peter Lei for reporting the issue.

Approved by:		re(kib@)
MFH:			1 month
Sponsored by:		Netflix, Inc.


Revision 338138
Modified Tue Aug 21 14:12:30 2018 UTC (5 years, 10 months ago) by tuexen
File length: 66235 byte(s)
Diff to previous 338102
Enabling the IPPROTO_IPV6 level socket option IPV6_USE_MIN_MTU on a TCP
socket resulted in sending fragmented IPv6 packets.

This is fixed by reducing the MSS to the appropriate value. In addition,
if the socket option is set before the handshake happens, announce this
MSS to the peer. This is not strictly required, but done since TCP
is conservative.

PR:			173444
Reviewed by:		bz@, rrs@
MFC after:		1 month
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16796
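
The "appropriate value" in r338138 follows from the IPv6 minimum MTU of 1280 bytes. A minimal sketch of the arithmetic, assuming a 40-byte IPv6 header, a 20-byte TCP header without options, and no extension headers (the constant and function names are made up here):

```c
/* Sizes in bytes; names are local to this sketch. */
#define V6_MIN_MTU	1280u	/* minimum link MTU every IPv6 path supports */
#define V6_HDR_LEN	40u	/* fixed IPv6 header */
#define TCP_HDR_LEN	20u	/* TCP header without options */

/*
 * Compute the MSS for a TCP/IPv6 connection. When use_min_mtu is set,
 * the path MTU is clamped to the IPv6 minimum so that segments never
 * need fragmenting. Assumes path_mtu >= V6_MIN_MTU, as IPv6 requires.
 */
static unsigned int
tcp6_mss_for(unsigned int path_mtu, int use_min_mtu)
{
	unsigned int mtu = path_mtu;

	if (use_min_mtu && mtu > V6_MIN_MTU)
		mtu = V6_MIN_MTU;
	return (mtu - V6_HDR_LEN - TCP_HDR_LEN);
}
```

For a 1500-byte path this gives 1440 without the option and 1220 with it.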


Revision 338102
Modified Mon Aug 20 12:43:18 2018 UTC (5 years, 10 months ago) by rrs
File length: 65154 byte(s)
Diff to previous 338053
This change represents a substantial restructuring of the way we
reassemble inbound TCP segments. The old algorithm just blindly
dropped in segments without coalescing. This meant that every
segment could take up greater and greater room on the linked list
of segments. This of course is now subject to a tighter limit (100)
of segments, which in a high-BDP situation will cause us to be a
lot more inefficient as we drop segments beyond 100 entries that
we receive. What this restructuring does is cause the reassembly
buffer to coalesce segments, putting an emphasis on the two
common cases (which avoid walking the list of segments), i.e.
where we add to the back of the queue of segments and where we
add to the front. The reassembly buffer also supports
a couple of debug options (black box logging as well as counters
for code coverage). These are compiled out by default but can
be added by uncommenting the defines.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D16626
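
The coalescing strategy of r338102, with its fast paths for the back and the front of the queue, can be modelled on a sorted array of byte ranges. This is a simplified sketch under loud assumptions: the real code operates on a linked list of mbuf-backed entries, the names below are invented, and sequence-number wraparound is ignored.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REASS_MAX 100	/* mirrors the segment limit mentioned above */

struct seg { uint32_t start, end; };	/* half-open range [start, end) */

struct reass_q {
	struct seg s[REASS_MAX];	/* sorted, non-overlapping */
	size_t n;
};

/*
 * Insert [start, end) and merge it with touching or overlapping entries.
 * Returns 0 on success, -1 if a new entry is needed but the queue is full.
 */
static int
reass_insert(struct reass_q *q, uint32_t start, uint32_t end)
{
	size_t i, j;

	/* Fast path 1: the data extends (or touches) the last segment. */
	if (q->n > 0 && start >= q->s[q->n - 1].start &&
	    start <= q->s[q->n - 1].end) {
		if (end > q->s[q->n - 1].end)
			q->s[q->n - 1].end = end;
		return (0);
	}
	/* Fast path 2: the data ends inside (or touches) the first segment. */
	if (q->n > 0 && end >= q->s[0].start && end <= q->s[0].end) {
		if (start < q->s[0].start)
			q->s[0].start = start;
		return (0);
	}
	/* Slow path: walk to the first entry that can touch the new range. */
	for (i = 0; i < q->n && q->s[i].end < start; i++)
		;
	/* Absorb every entry that overlaps or touches it. */
	for (j = i; j < q->n && q->s[j].start <= end; j++) {
		if (q->s[j].start < start)
			start = q->s[j].start;
		if (q->s[j].end > end)
			end = q->s[j].end;
	}
	if (j == i && q->n == REASS_MAX)
		return (-1);
	/* Replace entries [i, j) with the single merged segment. */
	memmove(&q->s[i + 1], &q->s[j], (q->n - j) * sizeof(q->s[0]));
	q->s[i].start = start;
	q->s[i].end = end;
	q->n = q->n - (j - i) + 1;
	return (0);
}
```

In-order arrivals hit the first fast path and never grow the queue, which is why coalescing keeps a high-BDP flow well under the 100-entry limit in this model.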


Revision 338053
Modified Sun Aug 19 14:56:10 2018 UTC (5 years, 10 months ago) by tuexen
File length: 65153 byte(s)
Diff to previous 336962
Don't expose the uptime via the TCP timestamps.

The TCP client side or the TCP server side when not using SYN-cookies
used the uptime as the TCP timestamp value. This patch uses in all
cases an offset, which is the result of a keyed hash function taking
the source and destination addresses and port numbers into account.
The keyed hash function is the same as used for the initial TSN.

Reviewed by:		rrs@
MFC after:		1 month
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16636
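
The per-connection timestamp offset from r338053 can be illustrated as follows. The hash used here is a stand-in (FNV-1a mixed with a key) chosen only to keep the example short; it is not the kernel's keyed hash and is not cryptographically strong. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct tuple4 {
	uint32_t laddr, faddr;
	uint16_t lport, fport;
};

/* Stand-in keyed hash: FNV-1a over the key, then over the message. */
static uint32_t
keyed_hash32(const uint8_t *key, size_t keylen, const uint8_t *msg,
    size_t msglen)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < keylen; i++) {
		h ^= key[i];
		h *= 16777619u;
	}
	for (i = 0; i < msglen; i++) {
		h ^= msg[i];
		h *= 16777619u;
	}
	return (h);
}

/* Derive a stable offset from the secret key and the connection tuple. */
static uint32_t
ts_offset(const uint8_t *key, size_t keylen, const struct tuple4 *t)
{
	uint8_t buf[12];

	/* Serialize field by field so struct padding never enters the hash. */
	memcpy(buf, &t->laddr, 4);
	memcpy(buf + 4, &t->faddr, 4);
	memcpy(buf + 8, &t->lport, 2);
	memcpy(buf + 10, &t->fport, 2);
	return (keyed_hash32(key, keylen, buf, sizeof(buf)));
}

/* Timestamp sent on the wire: uptime plus offset, modulo 2^32. */
static uint32_t
ts_value(uint32_t uptime_ms, uint32_t offset)
{
	return (uptime_ms + offset);
}
```

Because the offset is constant for a given connection, timestamp differences within that connection are preserved, while a peer can no longer read the host's uptime directly from the value.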


Revision 336962
Modified Tue Jul 31 06:27:05 2018 UTC (5 years, 11 months ago) by tuexen
File length: 64961 byte(s)
Diff to previous 336940
Fix INET only builds.

r336940 introduced an "unused variable" warning on platforms which
support INET, but not INET6, like MALTA and MALTA64 as reported
by Mark Millard. Improve the #ifdefs to address this issue.

Sponsored by:		Netflix, Inc.


Revision 336940
Modified Mon Jul 30 21:27:26 2018 UTC (5 years, 11 months ago) by tuexen
File length: 64921 byte(s)
Diff to previous 336934
Allow implicit TCP connection setup for TCP/IPv6.

TCP/IPv4 allows an implicit connection setup using sendto(), which
is used for TTCP and TCP fast open. This patch adds support for
TCP/IPv6.
While there, improve some tests for detecting multicast addresses,
which are mapped.

Reviewed by:		bz@, kbowling@, rrs@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16458


Revision 336934
Modified Mon Jul 30 20:35:50 2018 UTC (5 years, 11 months ago) by tuexen
File length: 62366 byte(s)
Diff to previous 336596
Fix some TCP fast open issues.

The following issues are fixed:
* Whenever a TCP server with TCP fast open enabled, calls accept(),
  recv(), send(), and close() before the TCP-ACK segment has been received,
  the TCP connection is just dropped and the reception of the TCP-ACK
  segment triggers the sending of a TCP-RST segment.
* Whenever a TCP server with TCP fast open enabled, calls accept(), recv(),
  send(), send(), and close() before the TCP-ACK segment has been received,
  the first byte provided in the second send call is not transferred.
* Whenever a TCP client with TCP fast open enabled calls sendto() followed
  by close() the TCP connection is just dropped.

Reviewed by:		jtl@, kbowling@, rrs@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16485


Revision 336596 - (view) (download) (annotate) - [select for diffs]
Modified Sun Jul 22 05:37:58 2018 UTC (5 years, 11 months ago) by mmacy
File length: 62301 byte(s)
Diff to previous 335924
NULL out cc_data in pluggable TCP {cc}_cb_destroy

When ABE was added (rS331214) to NewReno and the leak was fixed (rS333699),
NewReno gained a destructor (newreno_cb_destroy) for per-connection state.
Other congestion controls may allocate and free cc_data on entry and exit, but
the field is never explicitly NULLed when moving back to NewReno, which only
allocates stateful data internally (no entry constructor). This results in a
situation where newreno_cb_destroy might be called on a junk pointer.

 -    NULL out cc_data in the framework after calling {cc}_cb_destroy
 -    free(9) checks for NULL, so there is no need to perform non-NULL checks
     before calling free.
 -    Improve a comment about NewReno in tcp_ccalgounload
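
The pattern described in the list above can be sketched in user space. This is only an illustration: `struct conn`, `struct cc_algo`, `algo_cb_destroy` and `framework_cc_detach` are hypothetical stand-ins rather than the kernel's types or functions, and the C library's free() stands in for free(9).

```c
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct conn {
	void *cc_data;		/* per-connection congestion control state */
};

struct cc_algo {
	void (*cb_destroy)(struct conn *);	/* may be NULL */
};

/* Algorithm-side destructor: frees its state; free(NULL) is a no-op. */
static void
algo_cb_destroy(struct conn *c)
{
	free(c->cc_data);
}

/* Framework-side teardown: call the destructor, then clear the pointer. */
static void
framework_cc_detach(struct conn *c, const struct cc_algo *algo)
{
	if (algo->cb_destroy != NULL)
		algo->cb_destroy(c);
	c->cc_data = NULL;	/* a later destructor call sees NULL, not junk */
}
```

Because the framework clears the pointer in one place, each algorithm's destructor can call free unconditionally, and a second teardown is harmless.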

This is the result of a debugging session from Jason Wolfe, Jason Eggleston,
and mmacy@ and very helpful insight from lstewart@.

Submitted by: Kevin Bowling
Reviewed by: lstewart
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16282


Revision 335924 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jul 4 02:47:16 2018 UTC (6 years ago) by mmacy
File length: 62278 byte(s)
Diff to previous 332774
epoch(9): allow preemptible epochs to compose

- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
  there's no longer any benefit to dropping the pcbinfo lock
  and trying to do so just adds an error prone branchfest to
  these functions
- Remove cases of same function recursion on the epoch as
  recursing is no longer free.
- Remove the TAILQ_ENTRY and epoch_section from struct
  thread as the tracker field is now stack or heap allocated
  as appropriate.

Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066


Revision 332774 - (view) (download) (annotate) - [select for diffs]
Modified Thu Apr 19 15:03:48 2018 UTC (6 years, 2 months ago) by rrs
File length: 61932 byte(s)
Diff to previous 332770
These two modules need the tcp_hpts.h file when the option
is enabled (not sure how LINT/build-universe missed this).
Oops.

Sponsored by:	Netflix Inc


Revision 332770 - (view) (download) (annotate) - [select for diffs]
Modified Thu Apr 19 13:37:59 2018 UTC (6 years, 2 months ago) by rrs
File length: 61902 byte(s)
Diff to previous 332120
This commit brings in the TCP high precision timer system (tcp_hpts).
It is the forerunner/foundational work of bringing in both Rack and BBR
which use hpts for pacing out packets. The feature is optional and requires
the TCPHPTS option to be enabled before the feature will be active. TCP
modules that use it must ensure that the base component is compiled into
the kernel in which they are loaded.

MFC after:	Never
Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D15020


Revision 332120 - (view) (download) (annotate) - [select for diffs]
Modified Fri Apr 6 17:20:37 2018 UTC (6 years, 2 months ago) by jtl
File length: 61487 byte(s)
Diff to previous 332045
If a user closes the socket before we call tcp_usr_abort(), then
tcp_drop() may unlock the INP.  Currently, tcp_usr_abort() does not
check for this case, which results in a panic while trying to unlock
the already-unlocked INP (not to mention, a use-after-free violation).

Make tcp_usr_abort() check the return value of tcp_drop(). In the case
where tcp_drop() returns NULL, tcp_usr_abort() can skip further steps
to abort the connection and simply unlock the INP_INFO lock prior to
returning.

Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.


Revision 332045 - (view) (download) (annotate) - [select for diffs]
Modified Wed Apr 4 21:12:35 2018 UTC (6 years, 2 months ago) by emaste
File length: 61438 byte(s)
Diff to previous 331901
Fix kernel memory disclosure in tcp_ctloutput

strcpy was used to copy a string into a buffer copied to userland, which
left uninitialized data after the terminating 0-byte.  Use the same
approach as in tcp_subr.c: strncpy and explicit '\0'.
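
The difference between the two copy patterns can be sketched in user space. The buffer size and the function names below are assumptions made for illustration; they are not taken from tcp_ctloutput.

```c
#include <stddef.h>
#include <string.h>

#define NAME_MAX_LEN 16	/* hypothetical buffer size, for illustration only */

/*
 * Leaky pattern: strcpy() writes the string and its terminator only, so
 * bytes after the terminator keep whatever the buffer held before.
 */
static void
copy_name_leaky(char buf[NAME_MAX_LEN], const char *name)
{
	strcpy(buf, name);	/* caller guarantees strlen(name) < NAME_MAX_LEN */
}

/*
 * Safer pattern: strncpy() pads the rest of the copied range with zero
 * bytes when the source is shorter than the limit; the explicit '\0'
 * covers the final byte and the case where the source fills the buffer.
 */
static void
copy_name_padded(char buf[NAME_MAX_LEN], const char *name)
{
	strncpy(buf, name, NAME_MAX_LEN - 1);
	buf[NAME_MAX_LEN - 1] = '\0';
}

/* Count bytes after the terminator that are still non-zero. */
static size_t
stale_bytes(const char buf[NAME_MAX_LEN])
{
	size_t i, n = 0;

	for (i = strlen(buf) + 1; i < NAME_MAX_LEN; i++)
		if (buf[i] != 0)
			n++;
	return (n);
}
```

If the whole buffer is later copied out, every byte counted by `stale_bytes()` is old memory content that the reader was never meant to see.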

admbugs:	765, 822
MFC after:	1 day
Reported by:	Ilja Van Sprundel <ivansprundel@ioactive.com>
Reported by:	Vlad Tsyrklevich
Security:	Kernel memory disclosure
Sponsored by:	The FreeBSD Foundation


Revision 331901 - (view) (download) (annotate) - [select for diffs]
Modified Tue Apr 3 01:08:54 2018 UTC (6 years, 3 months ago) by np
File length: 61341 byte(s)
Diff to previous 331485
Add a hook to allow the toedev handling an offloaded connection to
provide accurate TCP_INFO.

Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D14816


Revision 331485 - (view) (download) (annotate) - [select for diffs]
Modified Sat Mar 24 12:48:10 2018 UTC (6 years, 3 months ago) by jtl
File length: 61278 byte(s)
Diff to previous 331347
Make the TCP blackbox code committed in r331347 be an optional feature
controlled by the TCP_BLACKBOX option.

Enable this as part of amd64 GENERIC. For now, leave it disabled on
other platforms.

Sponsored by:	Netflix, Inc.


Revision 331347 - (view) (download) (annotate) - [select for diffs]
Modified Thu Mar 22 09:40:08 2018 UTC (6 years, 3 months ago) by jtl
File length: 61224 byte(s)
Diff to previous 331322
Add the "TCP Blackbox Recorder" which we discussed at the developer
summits at BSDCan and BSDCam in 2017.

The TCP Blackbox Recorder allows you to capture events on a TCP connection
in a ring buffer. It stores metadata with the event. It optionally stores
the TCP header associated with an event (if the event is associated with a
packet) and also optionally stores information on the sockets.

It supports setting a log ID on a TCP connection and using this to correlate
multiple connections that share a common log ID.

You can log connections in different modes. If you are doing a coordinated
test with a particular connection, you may tell the system to put it in
mode 4 (continuous dump). Or, if you just want to monitor for errors, you
can put it in mode 1 (ring buffer) and dump all the ring buffers associated
with the connection ID when we receive an error signal for that connection
ID. You can set a default mode that will be applied to a particular ratio
of incoming connections. You can also manually set a mode using a socket
option.

This commit includes only basic probes. rrs@ has added quite an abundance
of probes in his TCP development work. He plans to commit those soon.

There are user-space programs which we plan to commit as ports. These read
the data from the log device and output pcapng files, and then let you
analyze the data (and metadata) in the pcapng files.

Reviewed by:	gnn (previous version)
Obtained from:	Netflix, Inc.
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D11085


Revision 331322 - (view) (download) (annotate) - [select for diffs]
Modified Wed Mar 21 20:59:30 2018 UTC (6 years, 3 months ago) by glebius
File length: 59250 byte(s)
Diff to previous 330002
The net.inet.tcp.nolocaltimewait=1 optimization prevents local TCP connections
from entering the TIME_WAIT state. However, it omits sending the ACK for the
FIN, which results in RST. This becomes a bigger deal if the sysctl
net.inet.tcp.blackhole is 2. In this case RST isn't sent, so the other side of
the connection (also local) keeps retransmitting FINs.

To fix that in tcp_twstart() we will not call tcp_close() immediately. Instead
we will allocate a tcptw on stack and proceed to the end of the function all
the way to tcp_twrespond(), to generate the correct ACK, then we will drop the
last PCB reference.

While here, make a few tiny improvements:
- use bools for boolean variables
- staticize nolocaltimewait
- remove pointless acquisition of socket lock

Reported by:	jtl
Reviewed by:	jtl
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D14697


Revision 330002 - (view) (download) (annotate) - [select for diffs]
Modified Mon Feb 26 03:03:41 2018 UTC (6 years, 4 months ago) by pkelsey
File length: 59110 byte(s)
Diff to previous 330001
Greatly reduce the number of #ifdefs supporting the TCP_RFC7413 kernel option.

The conditional compilation support is now centralized in
tcp_fastopen.h and tcp_var.h. This doesn't provide the minimum
theoretical code/data footprint when TCP_RFC7413 is disabled, but
nearly all the TFO code should wind up being removed by the optimizer,
the additional footprint in the syncache entries is a single pointer,
and the additional overhead in the tcpcb is at the end of the
structure.

This enables the TCP_RFC7413 kernel option by default in amd64 and
arm64 GENERIC.

Reviewed by:	hiren
MFC after:	1 month
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14048


Revision 330001 - (view) (download) (annotate) - [select for diffs]
Modified Mon Feb 26 02:53:22 2018 UTC (6 years, 4 months ago) by pkelsey
File length: 59319 byte(s)
Diff to previous 326023
This is an implementation of the client side of TCP Fast Open (TFO)
[RFC7413]. It also includes a pre-shared key mode of operation in
which the server requires the client to be in possession of a shared
secret in order to successfully open TFO connections with that server.

The names of some existing fastopen sysctls have changed (e.g.,
net.inet.tcp.fastopen.enabled -> net.inet.tcp.fastopen.server_enable).

Reviewed by:	tuexen
MFC after:	1 month
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14047


Revision 326023 - (view) (download) (annotate) - [select for diffs]
Modified Mon Nov 20 19:43:44 2017 UTC (6 years, 7 months ago) by pfg
File length: 58412 byte(s)
Diff to previous 322648
sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


Revision 322648 - (view) (download) (annotate) - [select for diffs]
Modified Fri Aug 18 07:27:15 2017 UTC (6 years, 10 months ago) by tuexen
File length: 58368 byte(s)
Diff to previous 318649
Ensure inp_vflag is consistently set for TCP endpoints.

Make sure that the flags INP_IPV4 and INP_IPV6 are consistently set
for inpcbs used for TCP sockets, no matter if the setting is derived
from the net.inet6.ip6.v6only sysctl or the IPV6_V6ONLY socket option.
For UDP this was already done right.

PR:		221385
MFC after:	1 week


Revision 318649 - (view) (download) (annotate) - [select for diffs]
Modified Mon May 22 15:29:10 2017 UTC (7 years, 1 month ago) by tuexen
File length: 58289 byte(s)
Diff to previous 314436
The connect() system call should return -1 and set errno to EAFNOSUPPORT
if it is called on a TCP socket
 * with an IPv6 address and the socket is bound to an
    IPv4-mapped IPv6 address.
 * with an IPv4-mapped IPv6 address and the socket is bound to an
   IPv6 address.
Thanks to Jonathan T. Leighton for reporting this issue.

Reviewed by:		bz gnn
MFC after:		3 days
Differential Revision:	https://reviews.freebsd.org/D9163


Revision 314436 - (view) (download) (annotate) - [select for diffs]
Modified Tue Feb 28 23:42:47 2017 UTC (7 years, 4 months ago) by imp
File length: 58111 byte(s)
Diff to previous 313528
Renumber copyright clause 4

Renumber clause 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistence on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96


Revision 313528 - (view) (download) (annotate) - [select for diffs]
Modified Fri Feb 10 05:58:16 2017 UTC (7 years, 4 months ago) by eri
File length: 58111 byte(s)
Diff to previous 313527
Revert r313527

Heh svn is not git


Revision 313527 - (view) (download) (annotate) - [select for diffs]
Modified Fri Feb 10 05:51:39 2017 UTC (7 years, 4 months ago) by eri
File length: 58229 byte(s)
Diff to previous 313330
Correct missed variable name.

Reported-by: ohartmann@walstatt.org


Revision 313330 - (view) (download) (annotate) - [select for diffs]
Modified Mon Feb 6 08:49:57 2017 UTC (7 years, 4 months ago) by ae
File length: 58111 byte(s)
Diff to previous 307551
Merge projects/ipsec into head/.

 Small summary
 -------------

o Almost all IPsec related code was moved into sys/netipsec.
o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel
  option IPSEC_SUPPORT added. It enables support for loading
  and unloading of ipsec.ko and tcpmd5.ko kernel modules.
o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by
  default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type
  support was removed. Added TCP/UDP checksum handling for
  inbound packets that were decapsulated by transport mode SAs.
  setkey(8) modified to show run-time NAT-T configuration of SA.
o New network pseudo interface if_ipsec(4) added. For now it is
  built as part of the ipsec.ko module (or with an IPSEC kernel).
  It implements IPsec virtual tunnels to create route-based VPNs.
o The network stack now invokes IPsec functions using special
  methods. Only one header file, <netipsec/ipsec_support.h>,
  needs to be included to declare everything needed to work
  with IPsec.
o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed.
  Now these protocols are handled directly via IPsec methods.
o TCP_SIGNATURE support was reworked to be closer to the RFC.
o PF_KEY SADB was reworked:
  - now all security associations stored in the single SPI namespace,
    and all SAs MUST have unique SPI.
  - several hash tables added to speed up lookups in SADB.
  - SADB now uses rmlock to protect access, and concurrent threads
    can do SA lookups at the same time.
  - many PF_KEY message handlers were reworked to reflect changes
    in SADB.
  - SADB_UPDATE message was extended to support new PF_KEY headers:
    SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They
    can be used by IKE daemon to change SA addresses.
o ipsecrequest and secpolicy structures were cardinally changed to
  avoid locking protection for ipsecrequest. Now we support
  only limited number (4) of bundled SAs, but they are supported
  for both INET and INET6.
o INPCB security policy cache was introduced. Each PCB now caches
  used security policies to avoid SP lookup for each packet.
o A mode was added for inbound security policies in which the kernel
  checks the full history of applied IPsec transforms.
o Reference counting rules for security policies and security
  associations were changed. Proper SA locking was added to the xform
  code.
o xform code was also changed. Now it is possible to unregister xforms.
  tdb_xxx structures were changed and renamed to reflect changes in
  SADB/SPDB, and changed rules for locking and refcounting.

Reviewed by:	gnn, wblock
Obtained from:	Yandex LLC
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D9352


Revision 307551 - (view) (download) (annotate) - [select for diffs]
Modified Tue Oct 18 07:16:49 2016 UTC (7 years, 8 months ago) by jch
File length: 58105 byte(s)
Diff to previous 307153
Fix a double-free when an inp transitions to INP_TIMEWAIT state
after having been dropped.

This fix enforces in_pcbdrop() logic in tcp_input():

"in_pcbdrop() is used by TCP to mark an inpcb as unused and avoid future packet
delivery or event notification when a socket remains open but TCP has closed."

PR:			203175
Reported by:		Palle Girgensohn, Slawa Olhovchenkov
Tested by:		Slawa Olhovchenkov
Reviewed by:		Slawa Olhovchenkov
Approved by:		gnn, Slawa Olhovchenkov
Differential Revision:	https://reviews.freebsd.org/D8211
MFC after:		1 week
Sponsored by:		Verisign, inc


Revision 307153 - (view) (download) (annotate) - [select for diffs]
Modified Wed Oct 12 19:06:50 2016 UTC (7 years, 8 months ago) by jtl
File length: 57651 byte(s)
Diff to previous 306769
The TFO server-side code contains some changes that are not conditioned on
the TCP_RFC7413 kernel option. This change removes those few instructions
from the packet processing path.

While not strictly necessary, for the sake of consistency, I applied the
new IS_FASTOPEN macro to all places in the packet processing path that
used the (t_flags & TF_FASTOPEN) check.

Reviewed by:	hiren
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D8219


Revision 306769 - (view) (download) (annotate) - [select for diffs]
Modified Thu Oct 6 16:28:34 2016 UTC (7 years, 8 months ago) by jtl
File length: 57656 byte(s)
Diff to previous 305810
Remove "long" variables from the TCP stack (not including the modular
congestion control framework).

Reviewed by:	gnn, lstewart (partial)
Sponsored by:	Juniper Networks, Netflix
Differential Revision:	(multiple)
Tested by:	Limelight, Netflix


Revision 305810 - (view) (download) (annotate) - [select for diffs]
Modified Wed Sep 14 14:48:00 2016 UTC (7 years, 9 months ago) by tuexen
File length: 57664 byte(s)
Diff to previous 304223
Ensure that the IPPROTO_TCP level socket options
* TCP_KEEPINIT
* TCP_KEEPINTVL
* TCP_KEEPIDLE
* TCP_KEEPCNT
always report the values currently used when getsockopt()
is used. This wasn't the case when the sysctl-inherited default
values were used.
Ensure that the IPPROTO_TCP level socket option TCP_INFO has the
TCPI_OPT_ECN flag set in the tcpi_options field when ECN support
has been negotiated successfully.
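
From user space, the keepalive values can be queried along the following lines. This is a minimal sketch: the helper name `get_tcp_int_opt` is hypothetical, and it assumes the platform's <netinet/tcp.h> defines the option constant being passed (TCP_KEEPINIT, for example, is not available everywhere).

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

/*
 * Read one integer-valued IPPROTO_TCP option. Returns 0 on success and
 * -1 on failure, in which case errno is set by getsockopt().
 */
static int
get_tcp_int_opt(int fd, int optname, int *value)
{
	socklen_t len = sizeof(*value);

	return (getsockopt(fd, IPPROTO_TCP, optname, value, &len));
}
```

Calling `get_tcp_int_opt(fd, TCP_KEEPIDLE, &idle)` on a freshly created TCP socket, before any setsockopt(), is the case this change addresses: the value returned should be the default actually in effect.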

Reviewed by:	rrs, jtl, hiren
MFC after:	1 month
Differential Revision:	7833


Revision 304223 - (view) (download) (annotate) - [select for diffs]
Modified Tue Aug 16 15:11:46 2016 UTC (7 years, 10 months ago) by rrs
File length: 57590 byte(s)
Diff to previous 298673
Here we update the modular tcp to be able to switch to an
alternate TCP stack in other than the closed state (pre-listen/connect).
The idea is that *if* that is supported by the alternate stack, it
is asked if it's OK to switch. If it approves the "handoff" then we
allow the switch to happen. Also the fini() function now gets a flag
to tell if you are switching away *or* the tcb is destroyed. The
init() call into the alternate stack is moved to the end so the
tcb is more fully formed before the init transpires.

Sponsored by:	Netflix Inc.
Differential Revision:	D6790


Revision 298673 - (view) (download) (annotate) - [select for diffs]
Modified Tue Apr 26 23:02:18 2016 UTC (8 years, 2 months ago) by cem
File length: 57183 byte(s)
Diff to previous 296881
tcp_usrreq: Free allocated buffer in relock case

The disgusting macro INP_WLOCK_RECHECK may early-return.  In
tcp_default_ctloutput() the TCP_CCALGOOPT case allocates memory before invoking
this macro, which may leak memory.

Add a _CLEANUP variant that takes a code argument to perform variable cleanup
in the early return path.  Use it to free the 'pbuf' allocated in
tcp_default_ctloutput().
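
The shape of the problem and of the _CLEANUP variant can be sketched with a user-space mock. Everything here is hypothetical (`struct mock_inp`, `MOCK_RECHECK_CLEANUP`, `set_option`, `cleanup_runs`); the real macro also reacquires the inpcb lock, which this sketch leaves out.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for an inpcb that may have been dropped. */
struct mock_inp {
	int dropped;
};

/*
 * Mock of an early-returning recheck macro. The 'cleanup' argument is
 * arbitrary code that runs only on the early-return path.
 */
#define MOCK_RECHECK_CLEANUP(inp, cleanup) do {		\
	if ((inp)->dropped) {				\
		cleanup;				\
		return (ECONNRESET);			\
	}						\
} while (0)

static int cleanup_runs;	/* counts early-return cleanups, for the demo */

static int
set_option(struct mock_inp *inp, size_t len)
{
	char *pbuf = malloc(len);

	if (pbuf == NULL)
		return (ENOMEM);
	/* Without the cleanup argument, pbuf would leak on early return. */
	MOCK_RECHECK_CLEANUP(inp, free(pbuf); cleanup_runs++);
	/* ... the option would be processed using pbuf here ... */
	free(pbuf);
	return (0);
}
```

The hidden `return` inside the macro is what makes the leak easy to miss, which is why the cleanup code has to travel with the macro call.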

I am not especially happy with this macro, but I reckon it's not any worse than
INP_WLOCK_RECHECK already was.

Reported by:	Coverity
CID:		1350286
Sponsored by:	EMC / Isilon Storage Division


Revision 296881 - (view) (download) (annotate) - [select for diffs]
Modified Tue Mar 15 00:15:10 2016 UTC (8 years, 3 months ago) by glebius
File length: 57013 byte(s)
Diff to previous 296352
Redo r294869. The array of counters for TCP states doesn't belong to
struct tcpstat, because the structure can be zeroed out by netstat(1) -z,
and of course running connection counts shouldn't be touched.

Place running connection counts into separate array, and provide
separate read-only sysctl oid for it.


Revision 296352 - (view) (download) (annotate) - [select for diffs]
Modified Thu Mar 3 17:46:38 2016 UTC (8 years, 4 months ago) by gnn
File length: 57024 byte(s)
Diff to previous 294931
Fix dtrace probes (introduced in 287759): debug__input was used
for output and drop; connect didn't always fire a user probe;
some probes were missing in fastpath.

Submitted by:	Hannes Mehnert
Sponsored by:	REMS, EPSRC
Differential Revision:	https://reviews.freebsd.org/D5525


Revision 294931 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jan 27 17:59:39 2016 UTC (8 years, 5 months ago) by glebius
File length: 56981 byte(s)
Diff to previous 294902
Rename netinet/tcp_cc.h to netinet/cc/cc.h.

Discussed with:	lstewart


Revision 294902 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jan 27 07:34:00 2016 UTC (8 years, 5 months ago) by glebius
File length: 56982 byte(s)
Diff to previous 294869
Fix issues with TCP_CONGESTION handling after r294540:
o Return back the buf[TCP_CA_NAME_MAX] for TCP_CONGESTION,
  for TCP_CCALGOOPT use dynamically allocated *pbuf.
o For SOPT_SET TCP_CONGESTION do NULL terminating of string
  taking from userland.
o For SOPT_SET TCP_CONGESTION do the search for the algorithm
  keeping the inpcb lock.
o For SOPT_GET TCP_CONGESTION first strlcpy() the name
  holding the inpcb lock into temporary buffer, then copyout.

Together with:	lstewart


Revision 294869 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jan 27 00:45:46 2016 UTC (8 years, 5 months ago) by glebius
File length: 56960 byte(s)
Diff to previous 294540
Augment struct tcpstat with tcps_states[], which is used for book-keeping
the amount of TCP connections by state.  Provides a cheap way to get
connection count without traversing the whole pcb list.

Sponsored by:	Netflix


Revision 294540 - (view) (download) (annotate) - [select for diffs]
Modified Fri Jan 22 02:07:48 2016 UTC (8 years, 5 months ago) by glebius
File length: 56920 byte(s)
Diff to previous 294536
Provide new socket option TCP_CCALGOOPT, which stands for TCP congestion
control algorithm options.  The argument is variable length and is opaque
to TCP, forwarded directly to the algorithm's ctl_output method.

Provide new includes directory netinet/cc, where algorithm specific
headers can be installed.

The new API doesn't yet have any in tree consumers.

The original code written by lstewart.
Reviewed by:	rrs, emax
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D711


Revision 294536 - (view) (download) (annotate) - [select for diffs]
Modified Thu Jan 21 22:53:12 2016 UTC (8 years, 5 months ago) by glebius
File length: 56234 byte(s)
Diff to previous 294535
Refactor TCP_CONGESTION setsockopt handling:
- Use M_TEMP instead of stack variable.
- Unroll error handling, removing several levels of indentation.


Revision 294535 - (view) (download) (annotate) - [select for diffs]
Modified Thu Jan 21 22:34:51 2016 UTC (8 years, 5 months ago) by glebius
File length: 56430 byte(s)
Diff to previous 293284
- Rename cc.h to more meaningful tcp_cc.h.
- Declare it a kernel only include, which it already is.
- Don't include tcp.h implicitly from tcp_cc.h


Revision 293284 - (view) (download) (annotate) - [select for diffs]
Modified Thu Jan 7 00:14:42 2016 UTC (8 years, 5 months ago) by glebius
File length: 56401 byte(s)
Diff to previous 292706
Historically we have two fields in tcpcb to describe sender MSS: t_maxopd,
and t_maxseg. This dualism emerged with T/TCP, but was not properly cleaned
up after T/TCP removal. After all permutations over the years the result is
that t_maxopd stores a minimum of peer offered MSS and MTU reduced by minimum
protocol header. And t_maxseg stores (t_maxopd - TCPOLEN_TSTAMP_APPA) if
timestamps are in action, or is equal to t_maxopd otherwise. That's a very
rough estimate of MSS reduced by options length. Throughout the code it
was used in places, where preciseness was not important, like cwnd or
ssthresh calculations.

With this change:

- t_maxopd goes away.
- t_maxseg now stores MSS not adjusted by options.
- new function tcp_maxseg() is provided, that calculates MSS reduced by
  options length. The functions gives a better estimate, since it takes
  into account SACK state as well.
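
The relationship between the unadjusted MSS and the value reduced by options can be sketched as follows. This is a deliberate simplification, not the tcp_maxseg() implementation: it accounts only for the timestamp option (12 bytes including two NOPs of padding, which is the value TCPOLEN_TSTAMP_APPA stands for) and ignores the SACK state mentioned above. The names are hypothetical.

```c
#define TSTAMP_APPA_LEN 12	/* timestamp option padded with two NOPs */

/*
 * Payload bytes available per segment once fixed option overhead is
 * subtracted from the unadjusted MSS. SACK blocks are not modelled.
 */
static unsigned int
maxseg_less_options(unsigned int t_maxseg, int timestamps_on)
{
	unsigned int optlen = timestamps_on ? TSTAMP_APPA_LEN : 0;

	return (t_maxseg > optlen ? t_maxseg - optlen : 0);
}
```

For an unadjusted MSS of 1460 bytes, this gives 1448 bytes of payload per segment with timestamps in use and 1460 without.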

Reviewed by:	jtl
Differential Revision:	https://reviews.freebsd.org/D3593


Revision 292706 - (view) (download) (annotate) - [select for diffs]
Modified Thu Dec 24 19:09:48 2015 UTC (8 years, 6 months ago) by pkelsey
File length: 56456 byte(s)
Diff to previous 292309
Implementation of server-side TCP Fast Open (TFO) [RFC7413].

TFO is disabled by default in the kernel build.  See the top comment
in sys/netinet/tcp_fastopen.c for implementation particulars.

Reviewed by:	gnn, jch, stas
MFC after:	3 days
Sponsored by:	Verisign, Inc.
Differential Revision:	https://reviews.freebsd.org/D4350


Revision 292309 - (view) (download) (annotate) - [select for diffs]
Modified Wed Dec 16 00:56:45 2015 UTC (8 years, 6 months ago) by rrs
File length: 54979 byte(s)
Diff to previous 289276
First cut of the modularization of our TCP stack. Still
to do is to clean up the timer handling using the async-drain.
Other optimizations may be coming to go with this. What's here
will allow different tcp implementations (one included).
Reviewed by:	jtl, hiren, transports
Sponsored by:	Netflix Inc.
Differential Revision:	D4055


Revision 289276 - (view) (download) (annotate) - [select for diffs]
Modified Wed Oct 14 00:35:37 2015 UTC (8 years, 8 months ago) by hiren
File length: 52850 byte(s)
Diff to previous 287830
There are times when it would be really nice to have a record of the last few
packets and/or state transitions from each TCP socket. That would help with
narrowing down certain problems we see in the field that are hard to reproduce
without understanding the history of how we got into a certain state. This
change provides just that.

It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is
destroyed, the list is freed. I thought this was likely to be more
performance-friendly than saving copies of the tcpcb. Plus, with the packets,
you should be able to reverse-engineer what happened to the tcpcb.

To enable the feature, you will need to compile a kernel with the TCPPCAP
option. Even then, the feature defaults to being deactivated. You can activate
it by setting a positive value for the number of captured packets. You can do
that on either a global basis or on a per-socket basis (via a setsockopt call).

There is no way to get the packets out of the kernel other than using kmem or
getting a coredump. I thought that would help some of the legal/privacy concerns
regarding such a feature. However, it should be possible to add a future effort
to export them in PCAP format.

I tested this at low scale, and found that there were no mbuf leaks and the peak
mbuf usage appeared to be unchanged with and without the feature.

The main performance concern I can envision is the number of mbufs that would be
used on systems with a large number of sockets. If you save five packets per
direction per socket and have 3,000 sockets, that will consume at least 30,000
mbufs just to keep these packets. I tried to reduce the concerns associated with
this by limiting the number of clusters (not mbufs) that could be used for this
feature. Again, in my testing, that appears to work correctly.
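
The lower bound quoted above follows from simple arithmetic, assuming at least one mbuf per saved packet. The function name below is hypothetical.

```c
/*
 * Lower bound on mbufs held by the capture lists: one mbuf per saved
 * packet, two directions per socket.
 */
static unsigned long
min_capture_mbufs(unsigned long pkts_per_dir, unsigned long sockets)
{
	return (pkts_per_dir * 2UL * sockets);
}
```

With five packets per direction and 3,000 sockets, this is 5 * 2 * 3,000 = 30,000 mbufs.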

Differential Revision:	D3100
Submitted by:		Jonathan Looney <jlooney at juniper dot net>
Reviewed by:		gnn, hiren


Revision 287830 - (view) (download) (annotate) - [select for diffs]
Modified Tue Sep 15 20:04:30 2015 UTC (8 years, 9 months ago) by hiren
File length: 52166 byte(s)
Diff to previous 287759
Remove unnecessary tcp state transition call.

Differential Revision:	D3451
Reviewed by:		markj
MFC after:		2 weeks
Sponsored by:		Limelight Networks


Revision 287759 - (view) (download) (annotate) - [select for diffs]
Modified Sun Sep 13 15:50:55 2015 UTC (8 years, 9 months ago) by gnn
File length: 52166 byte(s)
Diff to previous 286443
Add DTrace probe points, translators and a corresponding script
to provide the TCPDEBUG functionality with pure DTrace.

Reviewed by:	rwatson
MFC after:	2 weeks
Sponsored by:	Limelight Networks
Differential Revision:	D3530


Revision 286443 - (view) (download) (annotate) - [select for diffs]
Modified Sat Aug 8 08:40:36 2015 UTC (8 years, 10 months ago) by jch
File length: 51427 byte(s)
Diff to previous 286227
Fix a kernel assertion issue introduced with r286227:
Avoid too strict INP_INFO_RLOCK_ASSERT checks due to
tcp_notify() being called from in6_pcbnotify().

Reported by:	Larry Rosenman <ler@lerctr.org>
Submitted by:	markj, jch


Revision 286227 - (view) (download) (annotate) - [select for diffs]
Modified Mon Aug 3 12:13:54 2015 UTC (8 years, 11 months ago) by jch
File length: 51344 byte(s)
Diff to previous 286027
Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability:

- The existing TCP INP_INFO lock continues to protect the global inpcb list
  stability during full list traversal (e.g. tcp_pcblist()).

- A new INP_LIST lock protects inpcb list actual modifications (inp allocation
  and free) and inpcb global counters.

This allows using the TCP INP_INFO_RLOCK lock in critical paths (e.g. tcp_input())
and INP_INFO_WLOCK only in occasional operations that walk all connections.

PR:			183659
Differential Revision:	https://reviews.freebsd.org/D2599
Reviewed by:		jhb, adrian
Tested by:		adrian, nitroboost-gmail.com
Sponsored by:		Verisign, Inc.


Revision 286027 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jul 29 17:59:13 2015 UTC (8 years, 11 months ago) by pkelsey
File length: 51320 byte(s)
Diff to previous 279821
Revert r265338, r271089 and r271123 as those changes do not handle
non-inline urgent data and introduce an mbuf exhaustion attack vector
similar to FreeBSD-SA-15:15.tcp, but not requiring VNETs.

Address the issue described in FreeBSD-SA-15:15.tcp.

Reviewed by:	glebius
Approved by:	so
Approved by:	jmallett (mentor)
Security:	FreeBSD-SA-15:15.tcp
Sponsored by:	Norse Corp, Inc.


Revision 279821 - (view) (download) (annotate) - [select for diffs]
Modified Mon Mar 9 20:29:16 2015 UTC (9 years, 3 months ago) by jch
File length: 51307 byte(s)
Diff to previous 275333
In TCP, connect() can return incorrect error code EINVAL
instead of EADDRINUSE or ECONNREFUSED

PR:			https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=196035
Differential Revision:	https://reviews.freebsd.org/D1982
Reported by:		Mark Nunberg <mnunberg@haskalah.org>
Submitted by:		Harrison Grundy <harrison.grundy@astrodoggroup.com>
Reviewed by:		adrian, jch, glebius, gnn
Approved by:		jhb
MFC after:		2 weeks


Revision 275333 - (view) (download) (annotate) - [select for diffs]
Modified Sun Nov 30 13:43:52 2014 UTC (9 years, 7 months ago) by glebius
File length: 51179 byte(s)
Diff to previous 275329
Merge from projects/sendfile:

- Provide pru_ready function for TCP.
- Don't call tcp_output() from tcp_usr_send() if no ready data was put
  into the socket buffer.
- In case of dropped connection don't try to m_freem() not ready data.

Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix


Revision 275329 - (view) (download) (annotate) - [select for diffs]
Modified Sun Nov 30 13:24:21 2014 UTC (9 years, 7 months ago) by glebius
File length: 50393 byte(s)
Diff to previous 275320
Merge from projects/sendfile: extend protocols API to support
sending not ready data:
o Add new flag to pru_send() flags - PRUS_NOTREADY.
o Add new protocol method pru_ready().

Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix


Revision 275320 - (view) (download) (annotate) - [select for diffs]
Modified Sun Nov 30 12:11:01 2014 UTC (9 years, 7 months ago) by glebius
File length: 50379 byte(s)
Diff to previous 273850
Missed in r274421: use sbavail() instead of bare access to sb_cc.


Revision 273850 - (view) (download) (annotate) - [select for diffs]
Modified Thu Oct 30 08:53:56 2014 UTC (9 years, 8 months ago) by jch
File length: 50375 byte(s)
Diff to previous 273014
Fix a race condition in TCP timewait between tcp_tw_2msl_reuse() and
tcp_tw_2msl_scan().  This race condition drives unplanned timewait
timeout cancellation.  Also simplify implementation by holding inpcb
reference and removing tcptw reference counting.

Differential Revision:	https://reviews.freebsd.org/D826
Submitted by:		Marc De la Gueronniere <mdelagueronniere@verisign.com>
Submitted by:		jch
Reviewed By:		jhb (mentor), adrian, rwatson
Sponsored by:		Verisign, Inc.
MFC after:		2 weeks
X-MFC-With:		r264321


Revision 273014 - (view) (download) (annotate) - [select for diffs]
Modified Sun Oct 12 23:01:25 2014 UTC (9 years, 8 months ago) by jch
File length: 49734 byte(s)
Diff to previous 271391
A connection in TIME_WAIT state before calling close() did not actually
receive any RST packet.  Do not set error to ECONNRESET in this case.

Differential Revision:	https://reviews.freebsd.org/D879
Reviewed by:		rpaulo, adrian
Approved by:		jhb (mentor)
Sponsored by:		Verisign, Inc.


Revision 271391 - (view) (download) (annotate) - [select for diffs]
Modified Wed Sep 10 13:17:35 2014 UTC (9 years, 9 months ago) by ae
File length: 49703 byte(s)
Diff to previous 265338
Make in6_pcblookup_hash_locked and in6_pcbladdr static.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC


Revision 265338 - (view) (download) (annotate) - [select for diffs]
Modified Sun May 4 23:25:32 2014 UTC (10 years, 2 months ago) by glebius
File length: 50820 byte(s)
Diff to previous 261242
The FreeBSD-SA-14:08.tcp was a lesson on not doing acrobatics with
mixing on stack memory and UMA memory in one linked list.

Thus, rewrite TCP reassembly code in terms of memory usage. The
algorithm remains unchanged.

We actually do not need extra memory to build a reassembly queue.
Arriving mbufs are always packet header mbufs. So we have the length
of data as pkthdr.len. We have m_nextpkt for linkage. And we need
only one pointer to point at the tcphdr; use PH_loc for that.

In tcpcb the t_segq field becomes an mbuf pointer. The t_segqlen
field now counts not packets, but bytes in the queue. This gives
us more precision when comparing to socket buffer limits.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
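
The memory layout described in r265338 can be modelled in user space. This is a minimal sketch under stated assumptions, not the kernel code: struct pkt, struct reass and reass_append() are hypothetical stand-ins for a packet header mbuf, the relevant tcpcb fields and the queueing step. It shows only the linkage and the byte-based accounting; the real reassembly code also orders segments by sequence number, which is omitted here.

```c
#include <stddef.h>

/* Hypothetical stand-in for a packet header mbuf. */
struct pkt {
	struct pkt	*nextpkt;	/* plays the role of m_nextpkt */
	int		 len;		/* plays the role of pkthdr.len */
	const void	*th;		/* plays the role of the tcphdr pointer in PH_loc */
};

/* Hypothetical stand-in for the two tcpcb fields named in the commit. */
struct reass {
	struct pkt	*segq;		/* queue head, like t_segq as an mbuf pointer */
	int		 segqlen;	/* bytes (not packets) queued, like t_segqlen */
};

/*
 * Append a segment to the tail of the queue without allocating anything.
 * Returns 0 on success, or -1 if the queued byte count would exceed limit
 * (limit stands in for a socket buffer limit).
 */
int
reass_append(struct reass *r, struct pkt *p, int limit)
{
	struct pkt **pp;

	if (r->segqlen + p->len > limit)
		return (-1);
	/* Walk to the tail using the packets' own link field. */
	for (pp = &r->segq; *pp != NULL; pp = &(*pp)->nextpkt)
		;
	p->nextpkt = NULL;
	*pp = p;
	r->segqlen += p->len;
	return (0);
}
```

Because the counter tracks bytes, the comparison against the limit is exact regardless of how large or small the individual segments are.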


Revision 261242 - (view) (download) (annotate) - [select for diffs]
Modified Tue Jan 28 20:28:32 2014 UTC (10 years, 5 months ago) by gnn
File length: 50833 byte(s)
Diff to previous 257846
Decrease lock contention within the TCP accept case by removing
the INP_INFO lock from tcp_usr_accept.  As the PR/patch states
this was following the advice already in the code.
See the PR below for a full discussion of this change and its
measured effects.

PR:		183659
Submitted by:	Julian Charbon
Reviewed by:	jhb


Revision 257846 - (view) (download) (annotate) - [select for diffs]
Modified Fri Nov 8 13:04:14 2013 UTC (10 years, 7 months ago) by glebius
File length: 51308 byte(s)
Diff to previous 257176
Make TCP_KEEP* socket options readable. At least PostgreSQL wants
to read the values.

Reported by:	sobomax
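
Reading the options back from user space might look like the following. This is a minimal sketch: struct keepalive_params and get_keepalive() are hypothetical names introduced for illustration, and TCP_KEEPINIT is left out because not every platform defines it.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Hypothetical container for the values read back from the socket. */
struct keepalive_params {
	int	idle_secs;	/* TCP_KEEPIDLE, in seconds */
	int	intvl_secs;	/* TCP_KEEPINTVL, in seconds */
	int	probe_count;	/* TCP_KEEPCNT, a plain count */
};

/*
 * Read the per-socket keepalive settings with getsockopt().
 * Returns 0 on success, -1 with errno set on failure.
 */
int
get_keepalive(int fd, struct keepalive_params *out)
{
	socklen_t len;

	len = sizeof(out->idle_secs);
	if (getsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
	    &out->idle_secs, &len) == -1)
		return (-1);
	len = sizeof(out->intvl_secs);
	if (getsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
	    &out->intvl_secs, &len) == -1)
		return (-1);
	len = sizeof(out->probe_count);
	if (getsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
	    &out->probe_count, &len) == -1)
		return (-1);
	return (0);
}
```

On a freshly created socket the values returned are the system defaults, which differ between operating systems.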


Revision 257176 - (view) (download) (annotate) - [select for diffs]
Modified Sat Oct 26 17:58:36 2013 UTC (10 years, 8 months ago) by glebius
File length: 50863 byte(s)
Diff to previous 254889
r48589 promised to remove the implicit inclusion of if_var.h soon. Prepare
for this by adding if_var.h to the files that need it. Also, explicitly
include all headers that are currently pulled in only through implicit
pollution via if_var.h.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.


Revision 254889 - (view) (download) (annotate) - [select for diffs]
Modified Sun Aug 25 21:54:41 2013 UTC (10 years, 10 months ago) by markj
File length: 50839 byte(s)
Diff to previous 245934
Implement the ip, tcp, and udp DTrace providers. The probe definitions use
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.

Tested by:	gnn, hiren
MFC after:	1 month


Revision 245934 - (view) (download) (annotate) - [select for diffs]
Modified Sat Jan 26 01:41:42 2013 UTC (11 years, 5 months ago) by np
File length: 50783 byte(s)
Diff to previous 245921
Add checks for SO_NO_OFFLOAD in a couple of places that I missed earlier
in r245915.


Revision 245921 - (view) (download) (annotate) - [select for diffs]
Modified Fri Jan 25 22:50:52 2013 UTC (11 years, 5 months ago) by np
File length: 50690 byte(s)
Diff to previous 245915
There is no need to call into the TOE driver twice in pru_rcvd (tod_rcvd
and then tod_output right after that).

Reviewed by:	bz@


Revision 245915 - (view) (download) (annotate) - [select for diffs]
Modified Fri Jan 25 20:23:33 2013 UTC (11 years, 5 months ago) by np
File length: 50684 byte(s)
Diff to previous 240985
Heed SO_NO_OFFLOAD.

MFC after:	1 week


Revision 240985 - (view) (download) (annotate) - [select for diffs]
Modified Thu Sep 27 07:13:21 2012 UTC (11 years, 9 months ago) by glebius
File length: 50546 byte(s)
Diff to previous 237263
Fix a bug in TCP_KEEPCNT setting, which slipped in during the last round
of review of r231025.

Unlike other options from this family TCP_KEEPCNT doesn't specify
time interval, but a count, thus parameter supplied doesn't need
to be multiplied by hz.

Reported & tested by:	amdmi3
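
The distinction between time-valued and count-valued options can be illustrated with a small user-space model. This is a sketch, not the kernel source: keep_opt_to_internal() is a hypothetical function, and hz is passed in as a parameter standing for the kernel's ticks-per-second value.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Illustrative model only: convert a user-supplied keepalive option value
 * to the unit it would be stored in.  Time intervals given in seconds are
 * scaled to ticks; TCP_KEEPCNT is a count and is stored unscaled.
 * Returns 0 for options this model does not know about.
 */
unsigned int
keep_opt_to_internal(int optname, unsigned int value, unsigned int hz)
{
	switch (optname) {
	case TCP_KEEPIDLE:
	case TCP_KEEPINTVL:
#ifdef TCP_KEEPINIT		/* not defined on every platform */
	case TCP_KEEPINIT:
#endif
		return (value * hz);	/* seconds -> ticks */
	case TCP_KEEPCNT:
		return (value);		/* a count, no scaling */
	default:
		return (0);
	}
}
```

Scaling TCP_KEEPCNT by hz as well would turn a request for 8 probes into 8000 probes at hz = 1000, which is the kind of error the fix above addresses.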


Revision 237263 - (view) (download) (annotate) - [select for diffs]
Modified Tue Jun 19 07:34:13 2012 UTC (12 years ago) by np
File length: 50415 byte(s)
Diff to previous 231025
- Updated TOE support in the kernel.

- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
  These are available as t3_tom and t4_tom modules that augment cxgb(4)
  and cxgbe(4) respectively.  The cxgb/cxgbe drivers continue to work as
  usual with or without these extra features.

- iWARP driver for Terminator 3 ASIC (kernel verbs).  T4 iWARP is in the
  works and will follow soon.

Build-tested with make universe.

30s overview
============
What interfaces support TCP offload?  Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE

Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe

Which connections are offloaded?  Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe

Reviewed by:	bz, gnn
Sponsored by:	Chelsio communications.
MFC after:	~3 months (after 9.1, and after ensuring MFC is feasible)


Revision 231025 - (view) (download) (annotate) - [select for diffs]
Added Sun Feb 5 16:53:02 2012 UTC (12 years, 4 months ago) by glebius
File length: 49796 byte(s)
Diff to previous 229714
Add new socket options: TCP_KEEPINIT, TCP_KEEPIDLE, TCP_KEEPINTVL and
TCP_KEEPCNT, that allow control of the initial timeout, idle time, idle
re-send interval and idle send count on a per-socket basis.

Reviewed by:	andre, bz, lstewart
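
From an application, setting these options might look like the sketch below. The helper name set_keepalive() is hypothetical and not part of the commit. TCP_KEEPINIT is omitted because not every platform defines it, and the example assumes SO_KEEPALIVE must be enabled for the idle settings to have any effect.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper: enable keepalive on a TCP socket and set the idle
 * time and probe interval (both in seconds) and the probe count.
 * Returns 0 on success, -1 with errno set on failure.
 */
int
set_keepalive(int fd, int idle_secs, int intvl_secs, int probe_count)
{
	int on = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
		return (-1);
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
	    &idle_secs, sizeof(idle_secs)) == -1)
		return (-1);
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
	    &intvl_secs, sizeof(intvl_secs)) == -1)
		return (-1);
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
	    &probe_count, sizeof(probe_count)) == -1)
		return (-1);
	return (0);
}
```

A call such as set_keepalive(fd, 60, 10, 5) asks for the first probe after 60 seconds of idle time, then up to 5 probes spaced 10 seconds apart.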


