Log of /head/sys/netinet/in_pcb.h

Revision 368819 - (view) (download) (annotate) - [select for diffs]
Modified Sat Dec 19 22:04:46 2020 UTC (3 years, 6 months ago) by gallatin
File length: 35285 byte(s)
Diff to previous 367148
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain

In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
it will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.

This change provides a new socket option, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
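
The filtering and fallback described above can be sketched in user-space C.
This is only an illustration under stated assumptions: struct lb_listener
and lb_pick() are hypothetical names, not the kernel's data structures, and
the modulo hash stands in for whatever hash the kernel actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one listener in a SO_REUSEPORT_LB group. */
struct lb_listener {
	int id;          /* illustrative worker identifier */
	int numa_domain; /* NUMA domain the worker asked to serve */
};

/*
 * Pick a listener for a connection with hash `hash` that arrived on
 * NUMA domain `conn_domain`.  Listeners on the same domain are
 * preferred; if there are none, hash across the whole group.
 */
static const struct lb_listener *
lb_pick(const struct lb_listener *grp, size_t n, uint32_t hash,
    int conn_domain)
{
	size_t i, local = 0;

	assert(grp != NULL && n > 0);

	/* Count the listeners that share the connection's domain. */
	for (i = 0; i < n; i++)
		if (grp[i].numa_domain == conn_domain)
			local++;

	/* Fallback: nobody serves this domain, so hash over everyone. */
	if (local == 0)
		return (&grp[hash % n]);

	/* Otherwise return the (hash % local)-th listener on the domain. */
	size_t want = hash % local;
	for (i = 0; i < n; i++) {
		if (grp[i].numa_domain != conn_domain)
			continue;
		if (want-- == 0)
			return (&grp[i]);
	}
	return (NULL); /* not reached: local > 0 guarantees a match */
}
```

With four listeners alternating between domains 0 and 1, a connection on
domain 1 is only ever handed to a domain-1 listener, while a connection on
a domain with no listener falls back to the plain hash.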

This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA, allow us to serve 190Gb/s of kTLS-encrypted
HTTPS media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross-domain traffic on the memory controller.

Reviewed by:	jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21636


Revision 367148 - (view) (download) (annotate) - [select for diffs]
Modified Thu Oct 29 22:18:56 2020 UTC (3 years, 8 months ago) by jhb
File length: 35226 byte(s)
Diff to previous 366569
Call m_snd_tag_rele() to free send tags.

Send tags are refcounted and if_snd_tag_free() is called by
m_snd_tag_rele() when the last reference is dropped on a send tag.
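
The release-on-last-reference rule can be sketched as follows. The names
snd_tag_sketch, tag_ref(), tag_rele() and demo_free() are hypothetical, and
the counter is a plain integer here for brevity; kernel code would use an
atomic reference count.

```c
#include <assert.h>

/* Hypothetical miniature of a refcounted send tag. */
struct snd_tag_sketch {
	unsigned int refcount;
	/* Stands in for the driver's if_snd_tag_free() method. */
	void (*free_fn)(struct snd_tag_sketch *);
};

static int tags_freed; /* demo bookkeeping only */

static void
demo_free(struct snd_tag_sketch *t)
{
	(void)t;
	tags_freed++;
}

/* Take an additional reference. */
static void
tag_ref(struct snd_tag_sketch *t)
{
	t->refcount++;
}

/* Drop a reference; the free method runs only for the last one. */
static void
tag_rele(struct snd_tag_sketch *t)
{
	assert(t->refcount > 0);
	if (--t->refcount == 0)
		t->free_fn(t);
}
```

Callers never invoke the free method directly; they only drop their own
reference, which keeps a tag alive while any other holder still uses it.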

Reviewed by:	gallatin, hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26995


Revision 366569 - (view) (download) (annotate) - [select for diffs]
Modified Fri Oct 9 12:06:43 2020 UTC (3 years, 8 months ago) by rscheff
File length: 35248 byte(s)
Diff to previous 361228
Add IP(V6)_VLAN_PCP to set 802.1 priority per-flow.

This adds a new IPPROTO_IP / IPPROTO_IPV6 setsockopt (getsockopt)
option IP(V6)_VLAN_PCP, which can be set to -1 (interface
default), or explicitly to any priority between 0 and 7.

Note that for untagged traffic, explicitly adding a
priority will insert a special 802.1Q VLAN header with
VLAN ID = 0 to carry the priority setting.
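
As an illustration of what such a priority-only tag carries, the 16-bit
802.1Q Tag Control Information field packs the priority (PCP) into the top
three bits and the VLAN ID into the low twelve. The helper name vlan_tci()
is hypothetical, and the DEI bit is left at zero in this sketch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build an 802.1Q TCI value: PCP (3 bits) | DEI (1 bit) | VID (12 bits).
 * A priority-tagged frame for otherwise untagged traffic uses VID 0.
 */
static uint16_t
vlan_tci(int pcp, uint16_t vid)
{
	assert(pcp >= 0 && pcp <= 7);
	assert(vid <= 0x0fff);
	return ((uint16_t)((pcp << 13) | vid));
}
```

For example, priority 5 on untagged traffic gives vlan_tci(5, 0), which is
0xA000.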

Reviewed by:	gallatin, rrs
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26409


Revision 361228 - (view) (download) (annotate) - [select for diffs]
Modified Mon May 18 22:53:12 2020 UTC (4 years, 1 month ago) by karels
File length: 34823 byte(s)
Diff to previous 357818
Allow TCP to reuse local port with different destinations

Previously, tcp_connect() would bind a local port before connecting,
forcing the local port to be unique across all outgoing TCP connections
for the address family. Instead, choose a local port after selecting
the destination and the local address, requiring only that the tuple
is unique and does not match a wildcard binding.
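
A simplified sketch of that selection rule follows. The conn4 structure and
both functions are hypothetical, and one deliberate simplification is
labeled in the code: any unconnected binding on a port is treated as a
wildcard conflict, regardless of its local address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical connection record; fport == 0 marks an unconnected bind. */
struct conn4 {
	uint32_t laddr, faddr; /* local and foreign IPv4 addresses */
	uint16_t lport, fport; /* local and foreign ports */
};

static bool
tuple_in_use(const struct conn4 *tab, size_t n, const struct conn4 *c)
{
	for (size_t i = 0; i < n; i++) {
		if (tab[i].lport != c->lport)
			continue;
		/* Simplification: an unconnected bind claims the port. */
		if (tab[i].fport == 0)
			return (true);
		/* Same port is fine unless the whole tuple matches. */
		if (tab[i].laddr == c->laddr && tab[i].faddr == c->faddr &&
		    tab[i].fport == c->fport)
			return (true);
	}
	return (false);
}

/* Return the first port in [lo, hi] giving an unused tuple, or 0. */
static uint16_t
pick_lport(const struct conn4 *tab, size_t n, struct conn4 c,
    uint16_t lo, uint16_t hi)
{
	assert(lo != 0 && lo <= hi);
	for (uint32_t p = lo; p <= hi; p++) {
		c.lport = (uint16_t)p;
		if (!tuple_in_use(tab, n, &c))
			return (c.lport);
	}
	return (0);
}
```

The point of the change shows up when two connections go to different
destinations: both may be given the same local port, because only the full
tuple has to be unique.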

Reviewed by:	tuexen (rscheff, rrs previous version)
MFC after:	1 month
Sponsored by:	Forcepoint LLC
Differential Revision:	https://reviews.freebsd.org/D24781


Revision 357818 - (view) (download) (annotate) - [select for diffs]
Modified Wed Feb 12 13:31:36 2020 UTC (4 years, 4 months ago) by rrs
File length: 34658 byte(s)
Diff to previous 356752
Whitespace cleanup -- remove trailing tabs or spaces
from any line.

Sponsored by:	Netflix Inc.


Revision 356752 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jan 15 03:41:15 2020 UTC (4 years, 5 months ago) by glebius
File length: 34671 byte(s)
Diff to previous 356663
Stop header pollution and don't include if_var.h via in_pcb.h.


Revision 356663 - (view) (download) (annotate) - [select for diffs]
Modified Sun Jan 12 17:52:32 2020 UTC (4 years, 5 months ago) by tuexen
File length: 34715 byte(s)
Diff to previous 354490
Fix race when accepting TCP connections.

When expanding a SYN-cache entry to a socket/inp a two step approach was
taken:
1) The local address was filled in, then the inp was added to the hash
   table.
2) The remote address was filled in and the inp was relocated in the
   hash table.
Before the epoch changes, a write lock was held while this happened and
the code looking up entries held a corresponding read lock.
Since the read lock went away with the introduction of
epochs, the half-populated inp could be found during lookup.
This resulted in processing TCP segments in the context of the wrong
TCP connection.
This patch changes the above procedure so that the inp is fully
populated before it is inserted into the hash table.
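
The "populate first, publish last" ordering can be illustrated in portable
C11. This is an analogy only, not the kernel code: inp_sketch, bucket,
pcb_publish() and pcb_lookup() are hypothetical, and a single atomic pointer
stands in for a hash chain walked by lockless readers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct inp_sketch {
	uint32_t laddr, faddr; /* local and foreign addresses */
	uint16_t lport, fport; /* local and foreign ports */
};

/* One "hash bucket"; readers load it without taking a lock. */
static _Atomic(struct inp_sketch *) bucket;

static void
pcb_publish(struct inp_sketch *p, uint32_t laddr, uint16_t lport,
    uint32_t faddr, uint16_t fport)
{
	assert(p != NULL);
	/* Fill in both endpoints while the pcb is still private. */
	p->laddr = laddr;
	p->lport = lport;
	p->faddr = faddr;
	p->fport = fport;
	/* Only now make it visible; release orders the stores above. */
	atomic_store_explicit(&bucket, p, memory_order_release);
}

static const struct inp_sketch *
pcb_lookup(void)
{
	/* Acquire pairs with the release store in pcb_publish(). */
	return (atomic_load_explicit(&bucket, memory_order_acquire));
}
```

A reader that observes the pointer therefore also observes every field,
which is the property the two-step insertion lost once the read lock was
gone.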

Thanks to Paul <devgs@ukr.net> for reporting the issue on the net@
mailing list and for testing the patch!

Reviewed by:		rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D22971


Revision 354490 - (view) (download) (annotate) - [select for diffs]
Modified Thu Nov 7 22:26:54 2019 UTC (4 years, 7 months ago) by glebius
File length: 34700 byte(s)
Diff to previous 354480
Remove now unused INP_INFO_RLOCK macros.


Revision 354480 - (view) (download) (annotate) - [select for diffs]
Modified Thu Nov 7 21:03:15 2019 UTC (4 years, 7 months ago) by glebius
File length: 35065 byte(s)
Diff to previous 354478
Remove now unused INP_HASH_RLOCK() macros.


Revision 354478 - (view) (download) (annotate) - [select for diffs]
Modified Thu Nov 7 20:57:51 2019 UTC (4 years, 7 months ago) by glebius
File length: 35380 byte(s)
Diff to previous 350531
Add INP_UNLOCK() which will do whatever R/W unlock is required.


Revision 350531 - (view) (download) (annotate) - [select for diffs]
Modified Fri Aug 2 07:41:36 2019 UTC (4 years, 11 months ago) by bz
File length: 35327 byte(s)
Diff to previous 350501
IPv6 cleanup: kernel

Finish what was started a few years ago and harmonize IPv6 and IPv4
kernel names.  We are down to so few places now that it is feasible
to make the change for everything remaining without causing too much disturbance.

Remove "aliases" for IPv6 names which confusingly could indicate
that we are talking about a different data structure or field or
have two fields, one for each address family.
Try to follow common conventions used in FreeBSD.

* Rename sin6p to sin6 as that is how it is spelt in most places.
* Remove "aliases" (#defines) for:
  - in6pcb which really is an inpcb and nothing separate
  - sotoin6pcb which is sotoinpcb (as per above)
  - in6p_sp which is inp_sp
  - in6p_flowinfo which is inp_flow
* Try to use ia6 for in6_addr rather than in6p.
* With all these gone, also rename the in6p variables to inp as
  that is what we call it in most of the network stack including
  parts of netinet6.

The reasons behind this cleanup are that we try to further
unify netinet and netinet6 code where possible, and that people
are less likely to overlook one or the other protocol family when
making code changes; different names for the same thing make such
places easy to miss.

No functional changes.

Discussed with:		tuexen (SCTP changes)
MFC after:		3 months
Sponsored by:		Netflix


Revision 350501 - (view) (download) (annotate) - [select for diffs]
Modified Thu Aug 1 14:17:31 2019 UTC (4 years, 11 months ago) by rrs
File length: 35550 byte(s)
Diff to previous 349893
This adds the third step in getting BBR into the tree. BBR and
an updated rack depend on having access to the new
ratelimit api in this commit.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D20953


Revision 349893 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jul 10 20:40:39 2019 UTC (4 years, 11 months ago) by rrs
File length: 35361 byte(s)
Diff to previous 346677
This commit updates rack to what is basically being used at Netflix
and lays some of the groundwork for committing BBR. The
hpts system is updated, as are some other utilities needed
for the introduction of BBR. This is part 1 of 3
commits, which will conclude with BBRv1 being
added as a new TCP stack.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D20834


Revision 346677 - (view) (download) (annotate) - [select for diffs]
Modified Thu Apr 25 15:37:28 2019 UTC (5 years, 2 months ago) by gallatin
File length: 35110 byte(s)
Diff to previous 342872
Track TCP connection's NUMA domain in the inpcb

Drivers can now pass up numa domain information via the
mbuf numa domain field.  This information is then used
by TCP syncache_socket() to associate that information
with the inpcb. The domain information is then fed back
into transmitted mbufs in ip{6}_output(). This mechanism
is nearly identical to what is done to track RSS hash values
in the inp_flowid.

Follow on changes will use this information for lacp egress
port selection, binding TCP pacers to the appropriate NUMA
domain, etc.

Reviewed by:	markj, kib, slavash, bz, scottl, jtl, tuexen
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20028


Revision 342872 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jan 9 01:11:19 2019 UTC (5 years, 5 months ago) by glebius
File length: 35112 byte(s)
Diff to previous 341595
Mechanical cleanup of epoch(9) usage in network stack.

- Remove macros that covertly create epoch_tracker on thread stack. Such
  macros are quite unsafe, e.g. they will produce buggy code if the same
  macro is used in nested scopes. Always declare epoch_tracker explicitly.

- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
  IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
  locking macros to what they actually are - the net_epoch.
  Keeping them as is is very misleading. They all are named FOO_RLOCK(),
  while they no longer have lock semantics. Now they allow recursion and
  what's more important they now no longer guarantee protection against
  their companion WLOCK macros.
  Note: INP_HASH_RLOCK() has the same problems, but is not touched by this commit.

This is a non-functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.

Discussed with:	jtl, gallatin


Revision 341595 - (view) (download) (annotate) - [select for diffs]
Modified Wed Dec 5 17:06:00 2018 UTC (5 years, 6 months ago) by markj
File length: 35128 byte(s)
Diff to previous 339039
Clamp the INPCB port hash tables to IPPORT_MAX + 1 chains.

Memory beyond that limit was previously unused, wasting roughly 1MB per
8GB of RAM.  Also retire INP_PCBLBGROUP_PORTHASH, which was identical to
INP_PCBPORTHASH.
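
The reasoning is that a port number is 16 bits wide, so a port-keyed table
can never index more than IPPORT_MAX + 1 distinct chains. A minimal sketch
of the clamp, with the hypothetical helper name clamp_porthash():

```c
#include <assert.h>

#ifndef IPPORT_MAX
#define IPPORT_MAX 65535 /* largest port number; repeated here for the sketch */
#endif

/* Cap the requested number of port hash chains at IPPORT_MAX + 1. */
static unsigned int
clamp_porthash(unsigned int nelements)
{
	const unsigned int limit = IPPORT_MAX + 1u;

	assert(nelements > 0);
	return (nelements < limit ? nelements : limit);
}
```

A request for 1024 chains is left alone, while a request for 1048576 chains
is reduced to 65536.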

Reviewed by:	glebius
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D17803


Revision 339039 - (view) (download) (annotate) - [select for diffs]
Modified Mon Oct 1 10:46:00 2018 UTC (5 years, 9 months ago) by ae
File length: 35202 byte(s)
Diff to previous 338571
Add INP_INFO_WUNLOCK_ASSERT() macro and use it instead of
INP_INFO_UNLOCK_ASSERT() in TCP-related code. For encapsulated traffic
it is possible that the code is running in a net_epoch_preempt section,
and INP_INFO_UNLOCK_ASSERT() is too strict an assertion for that case.

PR:		231428
Reviewed by:	mmacy, tuexen
Approved by:	re (kib)
Differential Revision:	https://reviews.freebsd.org/D17335


Revision 338571 - (view) (download) (annotate) - [select for diffs]
Modified Mon Sep 10 19:00:29 2018 UTC (5 years, 9 months ago) by markj
File length: 35120 byte(s)
Diff to previous 338509
Fix synchronization of LB group access.

Lookups are protected by an epoch section, so the LB group linkage must
be a CK_LIST rather than a plain LIST.  Furthermore, we were not
deferring LB group frees, so in_pcbremlbgrouphash() could race with
readers and cause a use-after-free.

Reviewed by:	sbruno, Johannes Lundberg <johalun0@gmail.com>
Tested by:	gallatin
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17031


Revision 338509 - (view) (download) (annotate) - [select for diffs]
Modified Thu Sep 6 19:55:40 2018 UTC (5 years, 9 months ago) by bz
File length: 35078 byte(s)
Diff to previous 338138
The inp_lle field of struct inpcb, along with two "valid" flags
for the rt and lle cache, was added in r191129 (2009).
To the best of my knowledge they have never been used, and route caching
has converted the inp_rt field from that commit to inp_route,
rendering this field and these flags obsolete.

Convert the pointer into a spare pointer to not change the size of
the structure anymore (and to have a spare pointer) and mark the
two fields as unused.

Reviewed by:	markj, karels
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17062


Revision 338138 - (view) (download) (annotate) - [select for diffs]
Modified Tue Aug 21 14:12:30 2018 UTC (5 years, 10 months ago) by tuexen
File length: 35150 byte(s)
Diff to previous 338115
Enabling the IPPROTO_IPV6 level socket option IPV6_USE_MIN_MTU on a TCP
socket resulted in sending fragmented IPV6 packets.

This is fixed by reducing the MSS to the appropriate value. In addition,
if the socket option is set before the handshake happens, announce this
MSS to the peer. This is not strictly required, but done since TCP
is conservative.

PR:			173444
Reviewed by:		bz@, rrs@
MFC after:		1 month
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16796


Revision 338115 - (view) (download) (annotate) - [select for diffs]
Modified Mon Aug 20 20:06:36 2018 UTC (5 years, 10 months ago) by bz
File length: 35122 byte(s)
Diff to previous 337279
GC inc_isipv6; it was added for "temp" compatibility in 2001, r86764
and does not seem to be used.


Revision 337279 - (view) (download) (annotate) - [select for diffs]
Modified Sat Aug 4 00:03:21 2018 UTC (5 years, 11 months ago) by glebius
File length: 35176 byte(s)
Diff to previous 335979
Now that after r335979 the kernel addresses in API structures are
fixed size, there is no reason left for the unions.

Discussed with:	brooks


Revision 335979 - (view) (download) (annotate) - [select for diffs]
Modified Thu Jul 5 13:13:48 2018 UTC (5 years, 11 months ago) by brooks
File length: 35209 byte(s)
Diff to previous 335924
Make struct xinpcb and friends word-size independent.

Replace size_t members with ksize_t (uint64_t) and pointer members
(never used as pointers in userspace, but instead as unique
identifiers) with kvaddr_t (uint64_t). This makes the structs
identical between 32-bit and 64-bit ABIs.
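
The effect can be demonstrated with a small structure. The type and member
names below are hypothetical (the commit's own typedefs are ksize_t and
kvaddr_t); the point is only that fixed-width members yield the same layout
on 32-bit and 64-bit ABIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical equivalents of the fixed-width types named in the commit. */
typedef uint64_t sketch_ksize_t;
typedef uint64_t sketch_kvaddr_t;

struct xexample {
	sketch_ksize_t  xe_len;   /* was size_t: 4 or 8 bytes by ABI */
	sketch_kvaddr_t xe_addr;  /* was a pointer, used only as an identifier */
	uint32_t        xe_flags;
	uint32_t        xe_spare;
};

/* These hold whether the compiler targets ILP32 or LP64. */
static_assert(sizeof(struct xexample) == 24, "size is ABI independent");
static_assert(offsetof(struct xexample, xe_addr) == 8, "fixed offset");
static_assert(offsetof(struct xexample, xe_flags) == 16, "fixed offset");
```

Had xe_len been a size_t and xe_addr a pointer, the structure would be 16
bytes on a 32-bit ABI and 24 bytes on a 64-bit one, so a 32-bit userland
tool could not parse data exported by a 64-bit kernel.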

On 64-bit systems, the ABI is maintained. On 32-bit systems,
this is an ABI breaking change. The ABI of most of these structs
was previously broken in r315662.  This also imposes a small API
change on userspace consumers who must handle kernel pointers
becoming virtual addresses.

PR:		228301 (exp-run by antoine)
Reviewed by:	jtl, kib, rwatson (various versions)
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15386


Revision 335924 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jul 4 02:47:16 2018 UTC (6 years ago) by mmacy
File length: 35153 byte(s)
Diff to previous 335356
epoch(9): allow preemptible epochs to compose

- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
  there's no longer any benefit to dropping the pcbinfo lock
  and trying to do so just adds an error prone branchfest to
  these functions
- Remove cases of same function recursion on the epoch as
  recursing is no longer free.
- Remove the TAILQ_ENTRY and epoch_section from struct
  thread as the tracker field is now stack or heap allocated
  as appropriate.

Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066


Revision 335356 - (view) (download) (annotate) - [select for diffs]
Modified Tue Jun 19 01:54:00 2018 UTC (6 years ago) by mmacy
File length: 34751 byte(s)
Diff to previous 335093
convert inpcbinfo hash and info rwlocks to epoch + mutex

- Convert inpcbinfo info & hash locks to epoch for read and mutex for write
- Garbage collect code that handled INP_INFO_TRY_RLOCK failures as
  INP_INFO_RLOCK which can no longer fail

When running 64 netperfs sending minimal sized packets on a 2x8x2, this
reduces unhalted core cycles samples in rwlock rlock/runlock in udp_send
from 51% to 3%.

Overall packet throughput rate is limited by CPU affinity and NIC driver
design choices.

On the receiver, unhalted core cycles samples in in_pcblookup_hash went from
13% to 1.6%.

Tested by LLNW and pho@

Reviewed by: jtl
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15686


Revision 335093 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jun 13 23:19:54 2018 UTC (6 years ago) by mmacy
File length: 34729 byte(s)
Diff to previous 335017
Fix PCBGROUPS build post CK conversion of pcbinfo


Revision 335017 - (view) (download) (annotate) - [select for diffs]
Modified Tue Jun 12 22:18:27 2018 UTC (6 years ago) by mmacy
File length: 34726 byte(s)
Diff to previous 335016
Defer inpcbport free until after a grace period has elapsed

This is a dependency for inpcbinfo rlock conversion to epoch


Revision 335016 - (view) (download) (annotate) - [select for diffs]
Modified Tue Jun 12 22:18:20 2018 UTC (6 years ago) by mmacy
File length: 34689 byte(s)
Diff to previous 335015
mechanical CK macro conversion of inpcbinfo lists

This is a dependency for converting the inpcbinfo hash and info rlocks
to epoch.


Revision 335015 - (view) (download) (annotate) - [select for diffs]
Modified Tue Jun 12 22:18:15 2018 UTC (6 years ago) by mmacy
File length: 34648 byte(s)
Diff to previous 334719
Defer inpcb deletion until after a grace period has elapsed

Deferring the actual free of the inpcb until after a grace
period has elapsed will allow us to convert the inpcbinfo
info and hash read locks to epoch.

Reviewed by: gallatin, jtl
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15510


Revision 334719 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jun 6 15:45:57 2018 UTC (6 years ago) by sbruno
File length: 34611 byte(s)
Diff to previous 333984
Load balance sockets with new SO_REUSEPORT_LB option.

This patch adds a new socket option, SO_REUSEPORT_LB, which allows multiple
programs or threads to bind to the same port; incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures:
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations:
As in DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or
threads sharing the same socket).

This is a substantially different contribution as compared to its original
incarnation at svn r332894 and reverted at svn r332967.  Thanks to rwatson@
for the substantive feedback that is included in this commit.

Submitted by:	Johannes Lundberg <johalun0@gmail.com>
Obtained from:	DragonflyBSD
Relnotes:	Yes
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D11003


Revision 333984 - (view) (download) (annotate) - [select for diffs]
Modified Mon May 21 16:13:43 2018 UTC (6 years, 1 month ago) by mmacy
File length: 33414 byte(s)
Diff to previous 333915
inpcb: revert deferred inpcb free pending further review


Revision 333915 - (view) (download) (annotate) - [select for diffs]
Modified Sun May 20 04:38:04 2018 UTC (6 years, 1 month ago) by mmacy
File length: 33451 byte(s)
Diff to previous 333910
inpcb: defer destruction of inpcb until after a grace period has elapsed

in_pcbfree will remove the inpcb from the list and release the rtentry
while the vnet is set, but the actual destruction will be deferred
until any threads in a (not yet used) epoch section no longer potentially
have references.


Revision 333910 - (view) (download) (annotate) - [select for diffs]
Modified Sun May 20 02:17:30 2018 UTC (6 years, 1 month ago) by mmacy
File length: 33414 byte(s)
Diff to previous 332967
in_pcb: add helper for deferring inpcb rele calls from list functions


Revision 332967 - (view) (download) (annotate) - [select for diffs]
Modified Tue Apr 24 19:55:12 2018 UTC (6 years, 2 months ago) by sbruno
File length: 33203 byte(s)
Diff to previous 332894
Revert r332894 at the request of the submitter.

Submitted by:	Johannes Lundberg <johalun0_gmail.com>
Sponsored by:	Limelight Networks


Revision 332894 - (view) (download) (annotate) - [select for diffs]
Modified Mon Apr 23 19:51:00 2018 UTC (6 years, 2 months ago) by sbruno
File length: 34106 byte(s)
Diff to previous 332770
Load balance sockets with new SO_REUSEPORT_LB option

This patch adds a new socket option, SO_REUSEPORT_LB, which allows multiple
programs or threads to bind to the same port; incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations
As in DragonflyBSD, a load balance group is limited to 256 pcbs
(256 programs or threads sharing the same socket).

Submitted by:	Johannes Lundberg <johanlun0@gmail.com>
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D11003


Revision 332770 - (view) (download) (annotate) - [select for diffs]
Modified Thu Apr 19 13:37:59 2018 UTC (6 years, 2 months ago) by rrs
File length: 33203 byte(s)
Diff to previous 326023
This commit brings in the TCP high precision timer system (tcp_hpts).
It is the forerunner/foundational work of bringing in both Rack and BBR
which use hpts for pacing out packets. The feature is optional and requires
the TCPHPTS option to be enabled before the feature will be active. TCP
modules that use it must ensure that the base component is compiled into
the kernel in which they are loaded.

MFC after:	Never
Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D15020


Revision 326023 - (view) (download) (annotate) - [select for diffs]
Modified Mon Nov 20 19:43:44 2017 UTC (6 years, 7 months ago) by pfg
File length: 29563 byte(s)
Diff to previous 323219
sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


Revision 323219 - (view) (download) (annotate) - [select for diffs]
Modified Wed Sep 6 13:56:18 2017 UTC (6 years, 9 months ago) by hselasky
File length: 29519 byte(s)
Diff to previous 318793
Add support for a generic backpressure indicator for ratelimited
transmit queues as well as non-ratelimited ones.

Add the required structure bits in order to support a backpressure
indication with ratelimited connections as well as non-ratelimited
ones. The backpressure indicator is a value between zero and 65535
inclusive, indicating if the destination transmit queue is empty or
full, respectively. Applications can use this value as a decision point
for when to stop transmitting data to avoid endless ENOBUFS error
codes upon transmitting an mbuf. This indicator is also useful to
reduce the latency for ratelimited queues.
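
One way to picture the indicator is as queue occupancy scaled onto the
0..65535 range, with the sender pausing above a chosen threshold. Both
helpers below are hypothetical, and the linear scaling is an assumption made
for illustration; the commit does not say how drivers compute the value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Map queue occupancy to 0 (empty) .. 65535 (full); assumed linear. */
static uint16_t
queue_level(uint32_t used, uint32_t capacity)
{
	assert(capacity > 0 && used <= capacity);
	return ((uint16_t)(((uint64_t)used * 65535u) / capacity));
}

/* Decision point: stop submitting once the level reaches high_water. */
static bool
should_pause(uint16_t level, uint16_t high_water)
{
	return (level >= high_water);
}
```

A sender that pauses at, say, three quarters full avoids pushing mbufs at a
queue that would only answer with ENOBUFS.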

Reviewed by:		gallatin, kib, gnn
Differential Revision:	https://reviews.freebsd.org/D11518
Sponsored by:		Mellanox Technologies


Revision 318793 - (view) (download) (annotate) - [select for diffs]
Modified Wed May 24 17:47:16 2017 UTC (7 years, 1 month ago) by glebius
File length: 29465 byte(s)
Diff to previous 318321
o Rearrange struct inpcb fields to optimize the TCP output code path
  considering cache line hits and misses.  Put the lock and hash list
  glue into the first cache line, put inp_refcount inp_flags inp_socket
  into the second cache line.
o On allocation zero out entire structure except the lock and list entries,
  including inp_route inp_lle inp_gencnt.  When inp_route and inp_lle were
  introduced, they were added below inp_zero_size, resulting in their not being
  cleared after free/alloc.  This definitely was a source of bugs with route
  caching.  Could be that r315956 has just fixed one of them.
  The inp_gencnt is reinitialized on every alloc, so it is safe to clear it.
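
The "zero everything except the lock and list entries" idea can be sketched
with offsetof(). The structure and names here are hypothetical and far
smaller than struct inpcb; the technique is to keep the preserved members at
the front and clear from a marker member to the end.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct inp_zero_sketch {
	/* Preserved across free/alloc: lock and list linkage. */
	int   lock_word;
	void *hash_link;
	/* Everything from refcount to the end is cleared on allocation. */
	int   refcount;
	int   flags;
	void *route_cache;
};

#define INP_ZERO_START	offsetof(struct inp_zero_sketch, refcount)

static void
inp_alloc_init(struct inp_zero_sketch *p)
{
	assert(p != NULL);
	/* Clear from the marker member through the end of the struct. */
	memset((char *)p + INP_ZERO_START, 0, sizeof(*p) - INP_ZERO_START);
}
```

A member added after the marker is cleared automatically, which avoids the
kind of stale route cache pointer the commit message describes.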

This has been proved to improve TCP performance at Netflix.

Obtained from:		rrs
Differential Revision:	D10686


Revision 318321 - (view) (download) (annotate) - [select for diffs]
Modified Mon May 15 21:58:36 2017 UTC (7 years, 1 month ago) by glebius
File length: 29419 byte(s)
Diff to previous 315684
Reduce in_pcbinfo_init() by two params.  No users supply any flags to this
function (they used to say UMA_ZONE_NOFREE), so flag parameter goes away.
The zone_fini parameter also goes away.  Previously no protocols (except
divert) supplied zone_fini function, so inpcb locks were leaked with slabs.
This was okay while zones were allocated with UMA_ZONE_NOFREE flag, but now
this is a leak.  Fix that by supplying the inpcb_fini() function as the fini method
for all inpcb zones.


Revision 315684 - (view) (download) (annotate) - [select for diffs]
Modified Tue Mar 21 16:23:44 2017 UTC (7 years, 3 months ago) by glebius
File length: 29439 byte(s)
Diff to previous 315662
Force same alignment on struct xinpgen as we have on struct xinpcb.  This
fixes 32-bit builds.


Revision 315662 - (view) (download) (annotate) - [select for diffs]
Modified Tue Mar 21 06:39:49 2017 UTC (7 years, 3 months ago) by glebius
File length: 29426 byte(s)
Diff to previous 314930
Hide struct inpcb, struct tcpcb from the userland.

This is a painful change, but it is needed.  On the one hand, we avoid
modifying them, and this slows down some ideas, on the other hand we still
eventually modify them and tools like netstat(1) never work on the next version of
FreeBSD.  We maintain a ton of spares in them, and we already got some ifdef
hell at the end of tcpcb.

Details:
- Hide struct inpcb, struct tcpcb under _KERNEL || _WANT_FOO.
- Make struct xinpcb, struct xtcpcb pure API structures, not including
  kernel structures inpcb and tcpcb inside.  Export into these structures
  the fields from inpcb and tcpcb that are known to be used, and put there
  a ton of spare space.
- Make kernel and userland utilities compilable after these changes.
- Bump __FreeBSD_version.

Reviewed by:	rrs, gnn
Differential Revision:	D10018


Revision 314930 - (view) (download) (annotate) - [select for diffs]
Modified Thu Mar 9 00:55:19 2017 UTC (7 years, 3 months ago) by glebius
File length: 28871 byte(s)
Diff to previous 314722
Make inp_lock_assert() depend on INVARIANT_SUPPORT, not INVARIANTS.
This allows INVARIANTS-enabled modules that use this function to load
successfully on a kernel that has only INVARIANT_SUPPORT.


Revision 314722 - (view) (download) (annotate) - [select for diffs]
Modified Mon Mar 6 04:01:58 2017 UTC (7 years, 3 months ago) by eri
File length: 28914 byte(s)
Diff to previous 314436
The patch provides the same socket option as Linux IP_ORIGDSTADDR.
Unfortunately the two will have different integer values, because the Linux value is already assigned in FreeBSD.

The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.

This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.

Reviewed by:	adrian, aw
Approved by:	ae (mentor)
Sponsored by:	rsync.net
Differential Revision:	D9235


Revision 314436 - (view) (download) (annotate) - [select for diffs]
Modified Tue Feb 28 23:42:47 2017 UTC (7 years, 4 months ago) by imp
File length: 28844 byte(s)
Diff to previous 313675
Renumber copyright clause 4

Renumber clause 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistence on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96


Revision 313675 - (view) (download) (annotate) - [select for diffs]
Modified Sun Feb 12 06:56:33 2017 UTC (7 years, 4 months ago) by eri
File length: 28844 byte(s)
Diff to previous 313528
Committed without approval from mentor.

Reported by:	gnn


Revision 313528 - (view) (download) (annotate) - [select for diffs]
Modified Fri Feb 10 05:58:16 2017 UTC (7 years, 4 months ago) by eri
File length: 28914 byte(s)
Diff to previous 313527
Revert r313527

Heh svn is not git


Revision 313527 - (view) (download) (annotate) - [select for diffs]
Modified Fri Feb 10 05:51:39 2017 UTC (7 years, 4 months ago) by eri
File length: 28933 byte(s)
Diff to previous 313524
Correct missed variable name.

Reported-by: ohartmann@walstatt.org


Revision 313524 - (view) (download) (annotate) - [select for diffs]
Modified Fri Feb 10 05:16:14 2017 UTC (7 years, 4 months ago) by eri
File length: 28914 byte(s)
Diff to previous 312379
The patch provides the same socket option as Linux IP_ORIGDSTADDR.
Unfortunately the two will have different integer values, because the Linux value is already assigned in FreeBSD.

The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.

This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.

Sponsored-by: rsync.net
Differential Revision: D9235
Reviewed-by: adrian


Revision 312379 - (view) (download) (annotate) - [select for diffs]
Modified Wed Jan 18 13:31:17 2017 UTC (7 years, 5 months ago) by hselasky
File length: 28844 byte(s)
Diff to previous 302153
Implement kernel support for hardware rate limited sockets.

- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.

- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API supports rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.

- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().

- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.

- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.

- How rate limiting works:

1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.

2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.

3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN, which causes ip_output() to release and clear the send
tag. On the next ip_output() call, allocation of a new "snd_tag" is attempted.

4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.

Reviewed by:		wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision:	https://reviews.freebsd.org/D3687
Sponsored by:		Mellanox Technologies
MFC after:		3 months


Revision 302153 - (view) (download) (annotate) - [select for diffs]
Modified Thu Jun 23 21:07:15 2016 UTC (8 years ago) by np
File length: 28349 byte(s)
Diff to previous 298995
Add spares to struct ifnet and socket for packet pacing and/or general
use.  Update comments regarding the spare fields in struct inpcb.

Bump __FreeBSD_version for the changes to the size of the structures.

Reviewed by:	gnn@
Approved by:	re@ (gjb@)
Sponsored by:	Chelsio Communications


Revision 298995 - (view) (download) (annotate) - [select for diffs]
Modified Tue May 3 18:05:43 2016 UTC (8 years, 2 months ago) by pfg
File length: 28349 byte(s)
Diff to previous 297225
sys/net*: minor spelling fixes.

No functional change.


Revision 297225 - (view) (download) (annotate) - [select for diffs]
Modified Thu Mar 24 07:54:56 2016 UTC (8 years, 3 months ago) by gnn
File length: 28349 byte(s)
Diff to previous 287481
FreeBSD previously provided route caching for TCP (and UDP). Re-add
route caching for TCP, with some improvements. In particular, invalidate
the route cache if a new route is added, which might be a better match.
The cache is automatically invalidated if the old route is deleted.

Submitted by:	Mike Karels
Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D4306


Revision 287481 - (view) (download) (annotate) - [select for diffs]
Modified Sat Sep 5 10:15:19 2015 UTC (8 years, 9 months ago) by glebius
File length: 28102 byte(s)
Diff to previous 286443
Use Jenkins hash for TCP syncache.

o Unlike xor, in Jenkins hash every bit of input affects virtually
  every bit of output, thus salting the hash actually works. With
  xor salting only provides a false sense of security, since if
  hash(x) collides with hash(y), then of course, hash(x) ^ salt
  would also collide with hash(y) ^ salt. [1]
o Jenkins provides much better distribution than xor, very close to
  ideal.

A TCP connection setup/teardown benchmark has shown a 10% increase
with the default hash size, and with bigger hashes that still leave a
possibility of collisions. With an enormous hash size, when the dataset is
an order of magnitude smaller than the hash size, the benchmark has
shown a 4% decrease in performance, which is expected and
acceptable.

Noticed by:	Jeffrey Knockel <jeffk cs.unm.edu> [1]
Benchmarks by:	jch
Reviewed by:	jch, pkelsey, delphij
Security:	strengthens protection against hash collision DoS
Sponsored by:	Nginx, Inc.
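
The salting argument in this commit can be sketched in a few lines. The xor_hash() below folds 32-bit words together starting from the salt, so two keys that collide do so under every salt. The jenkins_oaat() function is the well-known Jenkins one-at-a-time hash, used here only as a simple stand-in; it is an assumption for illustration and not the routine the syncache actually calls. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR-fold the 32-bit words of a key, starting from the salt. */
uint32_t
xor_hash(const uint32_t *words, size_t n, uint32_t salt)
{
	uint32_t h = salt;

	assert(words != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		h ^= words[i];
	return (h);
}

/* Jenkins one-at-a-time hash over the key bytes, seeded with the salt. */
uint32_t
jenkins_oaat(const void *key, size_t len, uint32_t salt)
{
	const uint8_t *p = key;
	uint32_t h = salt;

	for (size_t i = 0; i < len; i++) {
		h += p[i];
		h += h << 10;
		h ^= h >> 6;
	}
	h += h << 3;
	h ^= h >> 11;
	h += h << 15;
	return (h);
}
```

With xor_hash(), swapping two words of a key (for example source and destination address) always yields the same value whatever the salt is, so an attacker can precompute colliding inputs. In the mixing hash the salt feeds into every later step, so collisions found for one salt are not expected to carry over to another.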


Revision 286443 - (view) (download) (annotate) - [select for diffs]
Modified Sat Aug 8 08:40:36 2015 UTC (8 years, 10 months ago) by jch
File length: 27957 byte(s)
Diff to previous 286227
Fix a kernel assertion issue introduced with r286227:
Avoid too strict INP_INFO_RLOCK_ASSERT checks due to
tcp_notify() being called from in6_pcbnotify().

Reported by:	Larry Rosenman <ler@lerctr.org>
Submitted by:	markj, jch


Revision 286227 - (view) (download) (annotate) - [select for diffs]
Modified Mon Aug 3 12:13:54 2015 UTC (8 years, 11 months ago) by jch
File length: 27899 byte(s)
Diff to previous 275358
Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability:

- The existing TCP INP_INFO lock continues to protect the global inpcb list
  stability during full list traversal (e.g. tcp_pcblist()).

- A new INP_LIST lock protects inpcb list actual modifications (inp allocation
  and free) and inpcb global counters.

This allows the TCP INP_INFO_RLOCK lock to be used in critical paths (e.g. tcp_input()),
with INP_INFO_WLOCK needed only in occasional operations that walk all connections.

PR:			183659
Differential Revision:	https://reviews.freebsd.org/D2599
Reviewed by:		jhb, adrian
Tested by:		adrian, nitroboost-gmail.com
Sponsored by:		Verisign, Inc.


Revision 275358 - (view) (download) (annotate) - [select for diffs]
Modified Mon Dec 1 11:45:24 2014 UTC (9 years, 7 months ago) by hselasky
File length: 25578 byte(s)
Diff to previous 274331
Start process of removing the use of the deprecated "M_FLOWID" flag
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but no longer has any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.

This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.

"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.

Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.

MFC after:	1 month
Sponsored by:	Mellanox Technologies


Revision 274331 - (view) (download) (annotate) - [select for diffs]
Modified Sun Nov 9 21:33:01 2014 UTC (9 years, 7 months ago) by melifaro
File length: 25602 byte(s)
Diff to previous 271400
Remove faith(4) and faithd(8) from base. It looks like the industry
has chosen different (and more traditional) stateless/stateful
NAT64 as the translation mechanism. Last non-trivial commits to both
faith(4) and faithd(8) happened more than 12 years ago, so I assume
it is time to drop RFC3142 in FreeBSD.

No objections from:	net@


Revision 271400 - (view) (download) (annotate) - [select for diffs]
Modified Wed Sep 10 16:26:18 2014 UTC (9 years, 9 months ago) by ae
File length: 25618 byte(s)
Diff to previous 271386
Add scope zone id to the in_endpoints and hc_metrics structures.

A non-global IPv6 address can be used in more than one zone of the same
scope. This zone index is used to identify to which zone a non-global
address belongs.

Also we can have many foreign hosts with equal non-global addresses,
but from different zones. So, they can have different metrics in the
host cache.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC


Revision 271386 - (view) (download) (annotate) - [select for diffs]
Modified Wed Sep 10 12:35:42 2014 UTC (9 years, 9 months ago) by ae
File length: 25496 byte(s)
Diff to previous 271293
Introduce INP6_PCBHASHKEY macro. Replace usage of hardcoded part of
IPv6 address as hash key in all places.

Obtained from:	Yandex LLC


Revision 271293 - (view) (download) (annotate) - [select for diffs]
Modified Tue Sep 9 01:45:39 2014 UTC (9 years, 9 months ago) by adrian
File length: 25441 byte(s)
Diff to previous 268557
Add support for receiving and setting flowtype, flowid and RSS bucket
information as part of recvmsg().

This is primarily used for debugging/verification of the various
processing paths in the IP, PCB and driver layers.

Unfortunately the current implementation of the control message path
results in a ~10% or so drop in UDP frame throughput when it's used.

Differential Revision:	https://reviews.freebsd.org/D527
Reviewed by:	grehan


Revision 268557 - (view) (download) (annotate) - [select for diffs]
Modified Sat Jul 12 05:40:13 2014 UTC (9 years, 11 months ago) by adrian
File length: 25279 byte(s)
Diff to previous 268479
Expose in_pcbbind_check_bindmulti() so the upcoming IPv6 RSS changes
can be made to use it.


Revision 268479 - (view) (download) (annotate) - [select for diffs]
Modified Thu Jul 10 03:10:56 2014 UTC (9 years, 11 months ago) by adrian
File length: 25193 byte(s)
Diff to previous 266418
Implement the first stage of multi-bind listen sockets and RSS socket
awareness.

* Introduce IP_BINDMULTI - indicating that it's okay to bind multiple
  sockets on the same bind details.

  Although the PCB code has been taught about this (see below) this patch
  doesn't introduce the rest of the PCB changes necessary to distribute
  lookups among multiple PCB entries in the global wildcard table.

* Introduce IP_RSS_LISTEN_BUCKET - placing a listen socket into the
  given RSS bucket (and thus a single PCBGROUP hash.)

* Modify the PCB add path to be aware of IP_BINDMULTI:
  + Only allow further PCB entries to be added if the owner credentials
    match and IP_BINDMULTI has been specified.  I.e., only allow further
    IP_BINDMULTI sockets to appear if the first bind() was IP_BINDMULTI.

* Teach the PCBGROUP code about IP_RSS_LISTEN_BUCKET marked PCB entries.
  Instead of using the wildcard logic and hashing, these sockets are
  simply placed into the PCBGROUP and _not_ in the wildcard hash.

* When doing a PCBGROUP lookup, do a wildcard match as well.
  This allows for an RSS bucket PCB entry to appear in a PCBGROUP
  rather than having to exist in the wildcard list.

Tested:

* TCP IPv4 server testing with igb(4)
* TCP IPv4 server testing with ix(4)

TODO:

* The pcbgroup lookup code duplicated the wildcard and wildcard-PCB
  logic.  This could be refactored into a single function.

* This doesn't yet work for IPv6 (The PCBGROUP code in netinet6/ doesn't
  yet know about this); nor does it yet fully work for UDP.


Revision 266418 - (view) (download) (annotate) - [select for diffs]
Modified Sun May 18 22:30:12 2014 UTC (10 years, 1 month ago) by adrian
File length: 24982 byte(s)
Diff to previous 264879
Add the flowtype to the inpcb.

The flowid isn't enough to use as part of any RSS related CPU affinity
lookups - the RSS code would like to know what kind of hash it is.


Revision 264879 - (view) (download) (annotate) - [select for diffs]
Modified Thu Apr 24 12:52:31 2014 UTC (10 years, 2 months ago) by smh
File length: 24930 byte(s)
Diff to previous 252710
Fix jailed raw sockets not setting the correct source address by
calling in_pcbladdr instead of prison_get_ip4

MFC after:	1 month


Revision 252710 - (view) (download) (annotate) - [select for diffs]
Modified Thu Jul 4 18:38:00 2013 UTC (11 years ago) by trociny
File length: 24840 byte(s)
Diff to previous 250300
In r227207, to fix the issue with possible NULL inp_socket pointer
dereferencing, when checking for SO_REUSEPORT option (and SO_REUSEADDR
for multicast), INP_REUSEPORT flag was introduced to cache the socket
option.  It was decided then that one flag would be enough to cache
both SO_REUSEPORT and SO_REUSEADDR: when processing SO_REUSEADDR
setsockopt(2), it was checked if it was called for a multicast address
and INP_REUSEPORT was set accordingly.

Unfortunately that approach does not work when setsockopt(2) is called
before binding to a multicast address: the multicast check fails and
INP_REUSEPORT is not set.

Fix this by adding INP_REUSEADDR flag to unconditionally cache
SO_REUSEADDR.

PR:		179901
Submitted by:	Michael Gmelin freebsd grem.de (initial version)
Reviewed by:	rwatson
MFC after:	1 week


Revision 250300 - (view) (download) (annotate) - [select for diffs]
Modified Mon May 6 16:42:18 2013 UTC (11 years, 1 month ago) by andre
File length: 24726 byte(s)
Diff to previous 249318
Back out r249318, r249320 and r249327 due to a heisenbug most
likely related to a race condition in the ipi_hash_lock with
the exact cause currently unknown but under investigation.


Revision 249318 - (view) (download) (annotate) - [select for diffs]
Modified Tue Apr 9 21:02:20 2013 UTC (11 years, 2 months ago) by andre
File length: 24734 byte(s)
Diff to previous 241129
Change certain heavily used network related mutexes and rwlocks to
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.

NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.
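
As a sketch of the idea, a lock can be given a cache line of its own with an alignment specifier. The 64-byte line size and all names below are assumptions for illustration; the kernel uses its own CACHE_LINE_SIZE and alignment macros, which are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

#define	EXAMPLE_CACHE_LINE	64	/* assumed line size for this sketch */

/* A lock word that owns a whole cache line; the compiler adds tail padding. */
struct padded_lock {
	_Alignas(EXAMPLE_CACHE_LINE) volatile int	lock_word;
};

static_assert(sizeof(struct padded_lock) == EXAMPLE_CACHE_LINE,
    "lock must fill exactly one line");

/* Two hot locks declared side by side no longer share a line. */
struct hot_locks {
	struct padded_lock	a;
	struct padded_lock	b;
};

/* Index of the cache line, relative to the struct start, for a byte offset. */
size_t
line_of(size_t offset)
{
	return (offset / EXAMPLE_CACHE_LINE);
}
```

The NB paragraph above is the counterpoint: if a frequently written variable is always touched together with the lock, moving the lock onto its own line may turn one cache miss into two.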


Revision 241129 - (view) (download) (annotate) - [select for diffs]
Modified Tue Oct 2 12:03:02 2012 UTC (11 years, 9 months ago) by glebius
File length: 24726 byte(s)
Diff to previous 236959
 There is a complex race in in_pcblookup_hash() and in_pcblookup_group().
Both functions need to obtain lock on the found PCB, and they can't do
classic inter-lock with the PCB hash lock, due to lock order reversal.
To keep the PCB stable, these functions put a reference on it and after PCB
lock is acquired drop it. If the reference was the last one, this means
we've raced with in_pcbfree() and the PCB is no longer valid.

  This approach works okay only if we are acquiring writer-lock on the PCB.
In case of reader-lock, the following scenario can happen:

  - 2 threads locate pcb, and do in_pcbref() on it.
  - These 2 threads drop the inp hash lock.
  - Another thread comes to delete pcb via in_pcbfree(), it obtains hash lock,
    does in_pcbremlists(), drops hash lock, and runs in_pcbrele_wlocked(), which
    doesn't free the pcb due to two references on it. Then it unlocks the pcb.
  - 2 aforementioned threads acquire reader lock on the pcb and run
    in_pcbrele_rlocked(). One gets 1 from in_pcbrele_rlocked() and continues,
    second gets 0 and considers pcb freed, returns.
  - The thread that got 1 continues working with detached pcb, which later
    leads to panic in the underlying protocol level.

  To plug that problem, an additional INPCB flag is introduced: INP_FREED. We
check for that flag in in_pcbrele_rlocked() and, if it is set, we pretend
that that was the last reference.

Discussed with:		rwatson, jhb
Reported by:		Vladimir Medvedkin <medved rambler-co.ru>
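
The interleaving described in this commit can be replayed sequentially in a small userspace model. Everything here is hypothetical: struct ex_pcb, EX_PCB_FREED and the two functions only mimic the roles of the inpcb, INP_FREED and in_pcbrele_rlocked(); no real locking is performed, and the return convention (true means "treat the pcb as gone") is chosen for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define	EX_PCB_FREED	0x1	/* stands in for INP_FREED */

struct ex_pcb {
	int	refcount;
	int	flags;
};

/*
 * Drop one reference. Returns true when the caller must treat the pcb as
 * gone: either this was the last reference, or (with the fix enabled) the
 * pcb was already marked freed by the thread that detached it.
 */
bool
ex_pcb_rele(struct ex_pcb *p, bool check_freed)
{
	assert(p->refcount > 0);
	p->refcount--;
	if (p->refcount == 0)
		return (true);
	return (check_freed && (p->flags & EX_PCB_FREED) != 0);
}

/*
 * Replay the scenario and report whether the first reader wrongly keeps
 * using the detached pcb.
 */
bool
first_reader_uses_detached_pcb(bool check_freed)
{
	struct ex_pcb p = { .refcount = 1, .flags = 0 };

	p.refcount += 2;			/* two lookups take references */
	p.flags |= EX_PCB_FREED;		/* freeing thread detaches the pcb */
	(void)ex_pcb_rele(&p, check_freed);	/* and drops its own reference */
	return (!ex_pcb_rele(&p, check_freed));	/* first reader releases */
}
```

Without the flag check the first reader sees a non-final release and carries on with a detached pcb; with the check it backs off, as the second reader already did.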


Revision 236959 - (view) (download) (annotate) - [select for diffs]
Modified Tue Jun 12 14:02:38 2012 UTC (12 years ago) by tuexen
File length: 24666 byte(s)
Diff to previous 233096
Add an IP_RECVTOS socket option which delivers, for each received UDP/IPv4
packet, a cmsg of type IP_RECVTOS containing the TOS byte,
much like IP_RECVTTL does for the TTL. This allows a protocol
implemented on top of UDP to support ECN.

MFC after: 3 days
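
A minimal sketch of the receive side, assuming the TOS value is delivered as a single byte in a control message at level IPPROTO_IP. The commit states the message type is IP_RECVTOS; accepting IP_TOS as well is an assumption made for systems that label the message that way. The helper names are hypothetical, and the ECN field is taken as the low two bits of the TOS byte.

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Ask the stack to attach the TOS byte to each received datagram. */
int
enable_recvtos(int s)
{
	int on = 1;

	return (setsockopt(s, IPPROTO_IP, IP_RECVTOS, &on, sizeof(on)));
}

/* Return the TOS byte from a message filled in by recvmsg(2), or -1. */
int
tos_from_msg(struct msghdr *msg)
{
	struct cmsghdr *c;
	unsigned char tos;

	assert(msg != NULL);
	for (c = CMSG_FIRSTHDR(msg); c != NULL; c = CMSG_NXTHDR(msg, c)) {
		if (c->cmsg_level == IPPROTO_IP &&
		    (c->cmsg_type == IP_RECVTOS || c->cmsg_type == IP_TOS)) {
			memcpy(&tos, CMSG_DATA(c), sizeof(tos));
			return (tos);
		}
	}
	return (-1);
}

/* Extract the two ECN bits from a TOS byte. */
int
ecn_from_tos(int tos)
{
	return (tos & 0x03);
}
```

A UDP-based protocol would call enable_recvtos() once, then feed each received msghdr through tos_from_msg() and ecn_from_tos() to detect congestion marks.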


Revision 233096 - (view) (download) (annotate) - [select for diffs]
Modified Sat Mar 17 21:51:39 2012 UTC (12 years, 3 months ago) by rmh
File length: 24592 byte(s)
Diff to previous 227207
Hide a few declarations from userland (including `struct inpcbgroup'). This
removes the dependency on <machine/param.h> which was introduced with SVN
rev 222748 (due to CACHE_LINE_SIZE).

Reviewed by:	bde
MFC after:	10 days


Revision 227207 - (view) (download) (annotate) - [select for diffs]
Modified Sun Nov 6 10:47:20 2011 UTC (12 years, 7 months ago) by trociny
File length: 24592 byte(s)
Diff to previous 224151
Cache SO_REUSEPORT socket option in inpcb-layer in order to avoid
inp_socket->so_options dereference when we may not acquire the lock on
the inpcb.

This fixes the crash due to NULL pointer dereference in
in_pcbbind_setup() when inp_socket->so_options in a pcb returned by
in_pcblookup_local() was checked.

Reported by:	dave jones <s.dave.jones@gmail.com>, Arnaud Lacombe <lacombar@gmail.com>
Suggested by:	rwatson
Glanced by:	rwatson
Tested by:	dave jones <s.dave.jones@gmail.com>


Revision 224151 - (view) (download) (annotate) - [select for diffs]
Modified Sun Jul 17 21:15:20 2011 UTC (12 years, 11 months ago) by bz
File length: 24525 byte(s)
Diff to previous 222787
Add spares to the network stack for FreeBSD-9:
- TCP keep* timers
- TCP UTO (adjust from what was there already)
- netmap
- route caching
- user cookie (temporary to allow for the real fix)

Slightly re-shuffle struct ifnet moving fields out of the middle
of spares and to better align.

Discussed with:	rwatson (slightly earlier version)


Revision 222787 - (view) (download) (annotate) - [select for diffs]
Modified Mon Jun 6 21:45:32 2011 UTC (13 years ago) by bz
File length: 24473 byte(s)
Diff to previous 222748
Unbreak kernels with non-default PCBGROUP included but no WITNESS.
Rather than including lock.h in in_pcbgroup.c in right order, fix it
for all consumers of in_pcb.h by further header file pollution under
#ifdef KERNEL.

Reported by:	Pan Tsu (inyaoo gmail.com)


Revision 222748 - (view) (download) (annotate) - [select for diffs]
Modified Mon Jun 6 12:55:02 2011 UTC (13 years, 1 month ago) by rwatson
File length: 24451 byte(s)
Diff to previous 222691
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup.  pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.

Connections are assigned to connection groups based on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups.  During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock.  By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details).  This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).

Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems".  However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.

Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies.  Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.

Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect.  In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).

Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.

Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.

Reviewed by:    bz
Sponsored by:   Juniper Networks, Inc.


Revision 222691 - (view) (download) (annotate) - [select for diffs]
Modified Sat Jun 4 16:33:06 2011 UTC (13 years, 1 month ago) by rwatson
File length: 21722 byte(s)
Diff to previous 222488
Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc.  For now, these arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expense of the
software version of Toeplitz (and similar hashes).

Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.

(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI.  However, it won't be useful without
other previous possibly less MFCable changes.)

Reviewed by:    bz
Sponsored by:   Juniper Networks, Inc.


Revision 222488 - (view) (download) (annotate) - [select for diffs]
Modified Mon May 30 09:43:55 2011 UTC (13 years, 1 month ago) by rwatson
File length: 21430 byte(s)
Diff to previous 222217
Decompose the current single inpcbinfo lock into two locks:

- The existing ipi_lock continues to protect the global inpcb list and
  inpcb counter.  This lock is now relegated to a small number of
  allocation and free operations, and occasional operations that walk
  all connections (including, awkwardly, certain UDP multicast receive
  operations -- something to revisit).

- A new ipi_hash_lock protects the two inpcbinfo hash tables for
  looking up connections and bound sockets, manipulated using new
  INP_HASH_*() macros.  This lock, combined with inpcb locks, protects
  the 4-tuple address space.

Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required.  As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.

A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb.  Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed.  In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup.  The new lookup flags,
supplementing the existing INPLOOKUP_WILDCARD flag, are:

  INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
  INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb

Callers must pass exactly one of these flags (for the time being).

Some notes:

- All protocols are updated to work within the new regime; especially,
  TCP, UDPv4, and UDPv6.  pcbinfo ipi_lock acquisitions are largely
  eliminated, and global hash lock hold times are dramatically reduced
  compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
  may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
  is no longer available -- hash lookup locks are now held only very
  briefly during inpcb lookup, rather than for potentially extended
  periods.  However, the pcbinfo ipi_lock will still be acquired if a
  connection state might change such that a connection is added or
  removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
  due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
  callers to acquire hash locks and perform one or more lookups atomically
  with 4-tuple allocation: this is required only for TCPv6, as there is no
  in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
  locking, which relates to source address selection.  This needs
  attention, as it likely significantly reduces parallelism in this code
  for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
  somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
  is no longer sufficient.  A second check once the inpcb lock is held
  should do the trick, keeping the general case from requiring the inpcb
  lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
  which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
  undesirable, and probably another argument is required to take care of
  this (or a char array name field in the pcbinfo?).

This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics.  It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.

Reviewed by:    bz
Sponsored by:   Juniper Networks, Inc.


Revision 222217 - (view) (download) (annotate) - [select for diffs]
Modified Mon May 23 19:32:02 2011 UTC (13 years, 1 month ago) by rwatson
File length: 20352 byte(s)
Diff to previous 222213
Continue to refine inpcb reference counting and locking, in preparation for
reworking of inpcbinfo locking:

(1) Convert inpcb reference counting from manually manipulated integers to
    the refcount(9) KPI.  This allows the refcount to be managed atomically
    with an inpcb read lock rather than write lock, or even with no inpcb
    lock at all.  As a result, in_pcbref() also no longer requires an inpcb
    lock, so can be performed solely using the lock used to look up an
    inpcb.

(2) Shift more inpcb freeing activity from the in_pcbrele() context (via
    in_pcbfree_internal) to the explicit in_pcbfree() context.  This means
    that the inpcb refcount is increasingly used only to maintain memory
    stability, not actually defer the clean up of inpcb protocol parts.
    This is desirable as many of those protocol parts required the pcbinfo
    lock, which we'd like not to acquire in in_pcbrele() contexts.  Document
    this in comments better.

(3) Introduce new read-locked and write-locked in_pcbrele() variations,
    in_pcbrele_rlocked() and in_pcbrele_wlocked(), which allow the inpcb to
    be properly unlocked as needed.  in_pcbrele() is a wrapper around the
    latter, and should probably go away at some point.  This makes it
    easier to use this weak reference model when holding only a read lock,
    as will happen in the future.

This may well be safe to MFC, but some more KBI analysis is required.

Reviewed by:    bz
MFC after:      3 weeks
Sponsored by:   Juniper Networks, Inc.


Revision 222213 - (view) (download) (annotate) - [select for diffs]
Modified Mon May 23 13:51:57 2011 UTC (13 years, 1 month ago) by rwatson
File length: 20272 byte(s)
Diff to previous 220879
A number of quite incremental refinements to struct inpcbinfo's definition:

(1) Add a locking guide for inpcbinfo.
(2) Annotate inpcbinfo fields with synchronisation information; not all
    annotations are 100% satisfactory.
(3) Reorder inpcbinfo fields so that the lock is at the head of the
    structure, and close to fields it protects.
(4) Sort fields that will eventually be hashlock/pcbgroup-related together
    even though they remain locked by ipi_lock for now.

Reviewed by:	bz
Sponsored by:	Juniper Networks
X-MFC after:	KBI analysis required


Revision 220879 - (view) (download) (annotate) - [select for diffs]
Modified Wed Apr 20 08:00:29 2011 UTC (13 years, 2 months ago) by bz
File length: 19467 byte(s)
Diff to previous 219579
MFp4 CH=191470:

Move the ipport_tick_callout and related functions from ip_input.c
to in_pcb.c.  The random source port allocation code has been merged
and is now local to in_pcb.c only.
Use a SYSINIT to get the callout started and no longer depend on
initialization from the inet code, which would not work in an IPv6
only setup.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	4 days


Revision 219579 - (view) (download) (annotate) - [select for diffs]
Modified Sat Mar 12 21:46:37 2011 UTC (13 years, 3 months ago) by bz
File length: 19540 byte(s)
Diff to previous 205157
Merge the two identical implementations for local port selections from
in_pcbbind_setup() and in6_pcbsetport() in a single in_pcb_lport().

MFC after:	2 weeks


Revision 205157 - (view) (download) (annotate) - [select for diffs]
Modified Sun Mar 14 18:59:11 2010 UTC (14 years, 3 months ago) by rwatson
File length: 19451 byte(s)
Diff to previous 204806
Abstract out initialization of most aspects of struct inpcbinfo from
their calling contexts in {IP divert, raw IP sockets, TCP, UDP} and
create new helper functions: in_pcbinfo_init() and in_pcbinfo_destroy()
to do this work in a central spot.  As inpcbinfo becomes more complex
due to ongoing work to add connection groups, this will reduce code
duplication.

MFC after:      1 month
Reviewed by:    bz
Sponsored by:   Juniper Networks


Revision 204806 - (view) (download) (annotate) - [select for diffs]
Modified Sat Mar 6 21:24:11 2010 UTC (14 years, 4 months ago) by rwatson
File length: 19256 byte(s)
Diff to previous 196041
Wrap use of rw_try_upgrade() on pcbinfo with macro INP_INFO_TRY_UPGRADE()
to match other pcbinfo locking macros.

MFC after:	1 week


Revision 196041 - (view) (download) (annotate) - [select for diffs]
Modified Sun Aug 2 22:47:08 2009 UTC (14 years, 11 months ago) by rwatson
File length: 19189 byte(s)
Diff to previous 195727
Add padding to struct inpcb, missed during our padding sweep earlier in
the release cycle.

Approved by:	re (kensmith)


Revision 195727 - (view) (download) (annotate) - [select for diffs]
Modified Thu Jul 16 21:13:04 2009 UTC (14 years, 11 months ago) by rwatson
File length: 19148 byte(s)
Diff to previous 195699
Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used.  Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with:	bz, julian
Reviewed by:	bz
Approved by:	re (kensmith, kib)


Revision 195699 - (view) (download) (annotate) - [select for diffs]
Modified Tue Jul 14 22:48:30 2009 UTC (14 years, 11 months ago) by rwatson
File length: 19200 byte(s)
Diff to previous 194739
Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator.  Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...).  This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack.  Virtualized global variables are
tagged at compile-time, placing them in a special linker set, which is
loaded into a contiguous region of kernel memory.  Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy.  Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address.  When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
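
A rough userland sketch of the idea in the paragraph above: each network stack instance gets its own copy of the virtualized globals, initialized from a reference copy, and an accessor macro turns (vnet, symbol) into a per-vnet address. The names demo_vnet_set, demo_ref, demo_vnet, demo_vnet_init() and DEMO_VNET() are hypothetical, the initial values are only illustrative, and a plain struct stands in for the linker set.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the linker set of virtualized globals. */
struct demo_vnet_set {
	int	ip_defttl;
	int	tcp_mssdflt;
};

/* The "reference copy", statically initialized (illustrative values). */
static const struct demo_vnet_set demo_ref = { 64, 536 };

/* One network stack instance with its private copy of the set. */
struct demo_vnet {
	void	*data;
};

/* Set up per-vnet memory and initialize it from the reference copy. */
static int
demo_vnet_init(struct demo_vnet *v)
{
	v->data = malloc(sizeof(demo_ref));
	if (v->data == NULL)
		return (-1);
	memcpy(v->data, &demo_ref, sizeof(demo_ref));
	return (0);
}

/*
 * Accessor: the compiler adds the member's offset to the per-vnet base
 * address, so the same symbol name resolves to different storage in
 * different vnets.
 */
#define	DEMO_VNET(v, name)	(((struct demo_vnet_set *)(v)->data)->name)
```

Writing DEMO_VNET(&a, ip_defttl) = 32 in one instance leaves the value seen through another instance untouched, which is the isolation property the accessor is there to provide.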

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by:  bz
Reviewed by:            bz, zec
Discussed with:         gnn, jamie, jeff, jhb, julian, sam
Suggested by:           peter
Approved by:            re (kensmith)


Revision 194739
Modified Tue Jun 23 17:03:45 2009 UTC (15 years ago) by bz
File length: 18369 byte(s)
Diff to previous 193217
After cleaning up rt_tables from vnet.h and cleaning up opt_route.h
a lot of files no longer need route.h either. Garbage collect them.
While here remove now unneeded vnet.h #includes as well.


Revision 193217
Modified Mon Jun 1 10:30:00 2009 UTC (15 years, 1 month ago) by pjd
File length: 18393 byte(s)
Diff to previous 192116
- Rename IP_NONLOCALOK IP socket option to IP_BINDANY, to be more consistent
  with OpenBSD (and BSD/OS originally). We can't easily make it a SOL_SOCKET
  option as there is no more space for more SOL_SOCKET options, but this option
  also fits better as an IP socket option, it seems.
- Implement this functionality also for IPv6 and RAW IP sockets.
- Always compile it in (don't use additional kernel options).
- Remove sysctl to turn this functionality on and off.
- Introduce new privilege - PRIV_NETINET_BINDANY, which allows use of this
  functionality (currently only unjailed root can use it).

Discussed with:	julian, adrian, jhb, rwatson, kmacy


Revision 192116
Modified Thu May 14 20:59:36 2009 UTC (15 years, 1 month ago) by rwatson
File length: 18448 byte(s)
Diff to previous 191688
Staticize two functions not used outside of in_pcb.c: in_pcbremlists() and
db_print_inpcb().

MFC after:	1 month


Revision 191688
Modified Thu Apr 30 13:36:26 2009 UTC (15 years, 2 months ago) by zec
File length: 18622 byte(s)
Diff to previous 191160
Permit building kernels with options VIMAGE, restricted to only a single
active network stack instance.  Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:

1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables.  As an example, V_ifnet becomes:

    options VIMAGE:          ((struct vnet_net *) vnet_net)->_ifnet
    default build:           vnet_net_0._ifnet
    options VIMAGE_GLOBALS:  ifnet

2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:

    INIT_VNET_NET(ifp->if_vnet); becomes

    struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];

3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals.  If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.

4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet.  options VIMAGE builds
will fill in those fields as required.

5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.

6) struct sysctl_oid has been extended with two additional fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod.  SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.

Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.

Reviewed by:	bz, rwatson
Approved by:	julian (mentor)


Revision 191160
Modified Thu Apr 16 23:02:56 2009 UTC (15 years, 2 months ago) by kmacy
File length: 18522 byte(s)
Diff to previous 191158
s/void/void */


Revision 191158
Modified Thu Apr 16 22:47:43 2009 UTC (15 years, 2 months ago) by kmacy
File length: 18521 byte(s)
Diff to previous 191129
restore spare pointers for MFCing


Revision 191129
Modified Wed Apr 15 22:22:00 2009 UTC (15 years, 2 months ago) by kmacy
File length: 18467 byte(s)
Diff to previous 191126
- convert pspare pointers in inpcb to an llentry and rtentry cache
- add flags to indicate their validity


Revision 191126
Modified Wed Apr 15 22:09:42 2009 UTC (15 years, 2 months ago) by kmacy
File length: 18291 byte(s)
Diff to previous 191125
- add second flags field to inpcb
- update comments in vflag


Revision 191125
Modified Wed Apr 15 21:39:56 2009 UTC (15 years, 2 months ago) by kmacy
File length: 18371 byte(s)
Diff to previous 190880
provide additional convenience macros for inpcb locking (upgrade, downgrade, exclusive)


Revision 190880
Modified Fri Apr 10 06:16:14 2009 UTC (15 years, 2 months ago) by kmacy
File length: 18198 byte(s)
Diff to previous 189848
Import "flowid" support for serializing flows across transmit queues

Reviewed by:	rwatson and jeli


Revision 189848
Modified Sun Mar 15 09:58:31 2009 UTC (15 years, 3 months ago) by rwatson
File length: 18053 byte(s)
Diff to previous 189657
Correct a number of evolved problems with inp_vflag and inp_flags:
certain flags that should have been in inp_flags ended up in inp_vflag,
meaning that they were inconsistently locked, and in one case,
interpreted.  Move the following flags from inp_vflag to gaps in the
inp_flags space (and clean up the inp_flags constants to make gaps
more obvious to future takers):

  INP_TIMEWAIT
  INP_SOCKREF
  INP_ONESBCAST
  INP_DROPPED

Some aspects of this change have no effect on kernel ABI at all, as these
are UDP/TCP/IP-internal uses; however, netstat and sockstat detect
INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this
into account.

MFC after:      1 week (or after dependencies are MFC'd)
Reviewed by:    bz


Revision 189657
Modified Wed Mar 11 00:29:22 2009 UTC (15 years, 3 months ago) by rwatson
File length: 17942 byte(s)
Diff to previous 189637
Add INP_INHASHLIST flag for inpcb->inp_flags to indicate whether
or not the inpcb is currently on various hash lookup lists, rather
than using (lport != 0) to detect this.  This means that the full
4-tuple of a connection can be retained after close, which should
lead to more sensible netstat output in the window between TCP
close and socket close.

MFC after:	2 weeks
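
The difference between the two membership tests can be sketched as below. This is an assumption-laden illustration: struct demo_inpcb is a cut-down stand-in for struct inpcb, the flag value DEMO_INP_INHASHLIST is made up, and the helper functions are hypothetical; no real hash lists are maintained.

```c
#include <assert.h>
#include <stdint.h>

#define	DEMO_INP_INHASHLIST	0x01	/* hypothetical flag value */

/* Cut-down stand-in for struct inpcb: flags plus the two ports. */
struct demo_inpcb {
	int		inp_flags;
	uint16_t	inp_lport;
	uint16_t	inp_fport;
};

/* Record that the pcb is now on the hash lookup lists. */
static void
demo_pcb_insert(struct demo_inpcb *inp)
{
	inp->inp_flags |= DEMO_INP_INHASHLIST;
}

/*
 * Take the pcb off the lists.  Only the flag is cleared; the ports are
 * left alone, so the connection's addressing stays visible to tools.
 */
static void
demo_pcb_remove(struct demo_inpcb *inp)
{
	inp->inp_flags &= ~DEMO_INP_INHASHLIST;
}

/* Membership test based on the flag rather than (lport != 0). */
static int
demo_pcb_inhashlist(const struct demo_inpcb *inp)
{
	return ((inp->inp_flags & DEMO_INP_INHASHLIST) != 0);
}
```

With the older (lport != 0) convention, removal had to zero the local port to mark the pcb as unhashed, which is why the tuple could not be shown after close.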


Revision 189637
Modified Tue Mar 10 17:57:41 2009 UTC (15 years, 3 months ago) by rwatson
File length: 17873 byte(s)
Diff to previous 189615
Remove unused v6 macro aliases for inpcb fields:

        in6p_ip6_nxt
        in6p_vflag
        in6p_flags
        in6p_socket
        in6p_lport
        in6p_fport
        in6p_ppcb

Remove unused v6 macro aliases for inpcb flags:

        IN6P_HIGHPORT
        IN6P_LOWPORT
        IN6P_ANONPORT
        IN6P_RECVIF
        IN6P_MTUDISC
        IN6P_FAITH
        IN6P_CONTROLOPTS

References to in6p_lport and in6p_fport in sockstat are also replaced with
normal inp_lport and inp_fport references.

MFC after:	3 days
Reviewed by:	bz


Revision 189615
Added Tue Mar 10 11:04:19 2009 UTC (15 years, 3 months ago) by rwatson
File length: 18632 byte(s)
Diff to previous 186955
Remove now-unused INP_UNMAPPABLEOPTS.

MFC after:	3 days
Discussed with:	bz


