Log of /releng/12.0/sys/cddl/contrib/opensolaris/uts/common/fs
Revision 341085 - Modified Tue Nov 27 17:58:25 2018 UTC by markj
MFstable/12 r340970:
Ensure that directory entry padding bytes are zeroed.
Approved by: re (gjb)
Revision 340470 - Modified Fri Nov 16 00:00:59 2018 UTC by gjb
- Copy stable/12@r340462 to releng/12.0 as part of the 12.0-RELEASE
cycle.
- Prune svn:mergeinfo from the new branch.
- Update from BETA4 to RC1.
Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation
Revision 339372 - Modified Mon Oct 15 21:59:24 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Skip VDEV_IO_DONE stage only for ZIO_TYPE_FREE.
Device removal code uses zio_vdev_child_io() with a ZIO_TYPE_NULL parent, which never happened before. It confused FreeBSD-specific TRIM code, which does not use VDEV_IO_DONE for logical ZIO_TYPE_FREE ZIOs. As a result of that stage being skipped, device removal ZIOs leaked references and memory that were supposed to be freed by VDEV_IO_DONE, leaving them stuck.
It is a quick patch rather than a nice fix, but hopefully we'll be able to drop it altogether when the alternative TRIM implementation finally gets landed.
PR: 228750, 229007
Discussed with: allanjude, avg, smh
Approved by: re (delphij)
MFC after: 5 days
Sponsored by: iXsystems, Inc.
Revision 339355 - Modified Sun Oct 14 16:14:01 2018 UTC by mjg
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
zfs: fix a panic after failed mount
r338927("zfs: depessimize zfs_root with rmlocks") failed to error check
the mount before caching root vnode.
Results in crashes in rrw_enter_read_impl tracing back to zfs_mount.
Reported by: Mike Tancsa
Tested by: allanjude
Approved by: re (kib)
Revision 339335 - Modified Fri Oct 12 16:55:28 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
This is a part of ZoL port of device removal patch:
commit a1d477c24c7badc89c60955995fd84d311938486
Author: Matthew Ahrens <mahrens@delphix.com>
Ported-by: Tim Chase <tim@chase2k.com>
Approved by: re (kib)
MFC after: 1 week
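A minimal sketch of the kind of guard described in r339335 (illustrative helper, not the committed vdev_compact_children() diff): never pass a zero size to kmem_alloc(); hand back NULL for an empty array instead.

#include <sys/kmem.h>	/* kmem_alloc(), KM_SLEEP */

/* Hypothetical helper: an empty array becomes NULL, not a 0-byte allocation. */
static void **
example_alloc_ptr_array(size_t count)
{
	if (count == 0)
		return (NULL);
	return (kmem_alloc(count * sizeof (void *), KM_SLEEP));
}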
Revision 339329 - Modified Fri Oct 12 15:14:22 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Add ZIO_TYPE_FREE support for indirect vdevs.
Upstream code expects only ZIO_TYPE_READ and some ZIO_TYPE_WRITE
requests to removed (indirect) vdevs, while on FreeBSD there is also
ZIO_TYPE_FREE (TRIM). ZIO_TYPE_FREE requests do not have data buffers, so they don't need the pointer adjustment.
PR: 228750, 229007
Reviewed by: allanjude, sef
Approved by: re (kib)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17523
Revision 339299 - Modified Wed Oct 10 22:59:15 2018 UTC by allanjude
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Pull in a follow-on commit to resolve a deadlock in ZFS sequential
resilver (r334844)
MFV/ZoL: Fix deadlock in IO pipeline
commit a76f3d0437e5e974f0f748f8735af3539443b388
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date: Fri Mar 16 16:46:06 2018 -0700
Fix deadlock in IO pipeline
In vdev_queue_aggregate() the zio_execute() bypass should not be
called under the vdev queue lock. This can result in a deadlock
as shown in the stack traces below.
Drop the vdev queue lock then walk the parents of the aggregate IO
to determine the list of component IOs to be bypassed. This can
be done safely without holding the io_lock since the new aggregate
IO has not yet been returned and its parents cannot change.
--- THREAD 1 ---
arc_read()
zio_nowait()
zio_vdev_io_start()
vdev_queue_io() <--- mutex_enter(vq->vq_lock)
vdev_queue_io_to_issue()
vdev_queue_aggregate()
zio_execute()
vdev_queue_io_to_issue()
vdev_queue_aggregate()
zio_execute()
zio_vdev_io_assess()
zio_wait_for_children() <- mutex_enter(zio->io_lock)
--- THREAD 2 --- (inverse order)
arc_read()
zio_change_priority() <- mutex_enter(zio->zio_lock)
vdev_queue_change_io_priority() <- mutex_enter(vq->vq_lock)
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported by: ZFS Leadership Meeting
Reviewed by: mav
Approved by: re (kib)
Obtained from: ZFS-on-Linux
MFC after: 2 weeks
Sponsored by: Klara Systems
Differential Revision: https://reviews.freebsd.org/D17495
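A generic sketch of the locking pattern behind this fix (hypothetical types and names, not the ZFS diff): take the work off the queue while holding its lock, drop the lock, and only then run callbacks that may acquire other locks.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

struct item {
	TAILQ_ENTRY(item) link;
	void (*callback)(struct item *);	/* may take other locks */
};

struct workqueue {
	struct mtx		 lock;
	TAILQ_HEAD(, item)	 pending;
};

static void
workqueue_drain(struct workqueue *q)
{
	TAILQ_HEAD(, item) local = TAILQ_HEAD_INITIALIZER(local);
	struct item *it;

	/* Move the pending work aside while holding the queue lock. */
	mtx_lock(&q->lock);
	TAILQ_CONCAT(&local, &q->pending, link);
	mtx_unlock(&q->lock);

	/* Run the callbacks only after the queue lock is dropped. */
	while ((it = TAILQ_FIRST(&local)) != NULL) {
		TAILQ_REMOVE(&local, it, link);
		it->callback(it);
	}
}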
Revision 339289 - Modified Wed Oct 10 19:39:47 2018 UTC by allanjude
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Resolve a hang in ZFS during vnode reclaimation
This is caused by a deadlock between zil_commit() and zfs_zget().
Add a way for zfs_zget() to break out of the retry loop in the common case.
PR: 229614
Reported by: grembo, Andreas Sommer, many others
Tested by: Andreas Sommer, Vicki Pfau
Reviewed by: avg (no objection)
Approved by: re (gjb)
MFC after: 2 months
Sponsored by: Klara Systems
Differential Revision: https://reviews.freebsd.org/D17460
Revision 339009 - Modified Sat Sep 29 01:26:07 2018 UTC by allanjude
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Avoid panic when adjusting priority of a read in the face of an IO error
PR: 231516
Reported by: sbruno
Approved by: re (rgrimes)
Obtained from: ZFS-on-Linux
X-MFC-with: 334844
Sponsored by: Klara Systems
MFV/ZoL: Fix zio->io_priority failed (7 < 6) assert
commit c26cf0966d131b722c32f8ccecfe5791a789d975
Author: Tony Hutter <hutter2@llnl.gov>
Date: Tue May 29 18:13:48 2018 -0700
Fix zio->io_priority failed (7 < 6) assert
This fixes an assert in vdev_queue_change_io_priority():
VERIFY3(zio->io_priority < ZIO_PRIORITY_NUM_QUEUEABLE) failed (7 < 6)
PANIC at vdev_queue.c:832:vdev_queue_change_io_priority()
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Revision 338927 - Modified Tue Sep 25 17:58:06 2018 UTC by mjg
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
zfs: depessimize zfs_root with rmlocks
Currently vfs calls the root method on each absolute lookup and when
crossing mount points.
zfs_root ends up looking up the inode internally as if it were not instantiated, which results in significant lock contention on systems like EPYC.
Store the vnode in the mount point and protect the access with rmlocks.
This is a temporary hack for 12.0.
Sample result:
before:
make -s -j 128 buildkernel 2778.09s user 3319.45s system 8370% cpu 1:12.85 total
after:
make -s -j 128 buildkernel 3199.57s user 1772.78s system 8232% cpu 1:00.40 total
Tested by: pho (zfs mount/unmount tests)
Reviewed by: kib, mav, sef (different parts)
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17233
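A minimal sketch of the caching idea (hypothetical structure and function names, not the committed zfs_root() change): keep the root vnode in per-mount data behind an rmlock, so the common path only takes the cheap read side.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rmlock.h>
#include <sys/vnode.h>

struct example_mount {
	struct rmlock	 root_lock;	/* rm_init(&root_lock, "example_root") */
	struct vnode	*root_vnode;	/* cached root vnode, or NULL */
};

static int
example_root(struct example_mount *md, struct vnode **vpp,
    int (*slow_lookup)(struct example_mount *, struct vnode **))
{
	struct rm_priotracker tracker;
	int error;

	/* Fast path: hand out the cached root vnode under the read lock. */
	rm_rlock(&md->root_lock, &tracker);
	if (md->root_vnode != NULL) {
		vrefact(md->root_vnode);
		*vpp = md->root_vnode;
		rm_runlock(&md->root_lock, &tracker);
		return (0);
	}
	rm_runlock(&md->root_lock, &tracker);

	/* Slow path: do the real lookup, then publish the result. */
	error = slow_lookup(md, vpp);
	if (error == 0) {
		rm_wlock(&md->root_lock);
		if (md->root_vnode == NULL) {
			vrefact(*vpp);
			md->root_vnode = *vpp;	/* cache holds its own ref */
		}
		rm_wunlock(&md->root_lock);
	}
	return (error);
}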
Revision 338869 - Modified Fri Sep 21 21:56:00 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r338866: 9700 ZFS resilvered mirror does not balance reads
illumos/illumos-gate@82f63c3c2bf5e4378706e8dcfccf717d67371be9
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: re (delphij)
Revision 338724 - Modified Mon Sep 17 16:16:57 2018 UTC by markj
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Fix an nvpair leak in vdev_geom_read_config().
Also change the behaviour slightly: instead of freeing "config" if the
last nvlist doesn't pass the tests, return the last config that did pass
those tests. This matches the comment at the beginning of the function.
PR: 230704
Diagnosed by: avg
Reviewed by: asomers, avg
Tested by: Mark Martinec <Mark.Martinec@ijs.si>
Approved by: re (gjb)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D17202
Revision 338656 - Modified Thu Sep 13 17:56:48 2018 UTC by vangyzen
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Set zfs_arc_meta_strategy to metadata only
The previous default of "balanced" appears to have caused pathological
behavior, including very poor performance and 100% CPU load in the
arc_reclaim_thread.
The symptoms appeared when the daily periodic run started.
With this change, the system--and the ARC in particular--behaved
normally during a manual daily periodic run.
From Mark Johnston: The port of the balanced strategy is incomplete,
since arc_prune_async() is a no-op on FreeBSD. (This also seems
to imply that r337653 is a no-op.) After 12 is branched we can
port the remaining bits and consider changing the default back.
Submitted by: markj (essentially)
Reviewed by: markj
Approved by: re (gjb)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D17156
Revision 338416 - Modified Fri Aug 31 21:45:05 2018 UTC by markj
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Re-compute the ARC size before computing the MFU target.
This fixes an upstream regression introduced in r331404, causing overly
aggressive reclamation of the ARC when under pressure.
Diagnosed by: Paul <devgs@ukr.net>
Approved by: re (gjb)
MFC after: 3 days
Revision 338370 - Modified Wed Aug 29 12:24:19 2018 UTC by kib
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Remove {max/min}_offset() macros, use vm_map_{max/min}() inlines.
Exposing max_offset and min_offset defines in public headers is
causing clashes with variable names, for example when building QEMU.
Based on the submission by: royger
Reviewed by: alc, markj (previous version)
Sponsored by: The FreeBSD Foundation (kib)
MFC after: 1 week
Approved by: re (marius)
Differential revision: https://reviews.freebsd.org/D16881
Revision 338288 - Modified Fri Aug 24 01:59:25 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Unblock speculative prefetcher also on pool creation.
The fix in r331950 appeared to be incomplete, fixing only the case of pool import, but not pool creation, leaving the prefetcher still blocked for newly created pools.
Approved by: re (gjb)
MFC after: 1 week
Revision 338206 - Modified Wed Aug 22 16:32:53 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Add dmu_tx_assign() error handling in zfs_unlinked_drain().
The error handling got lost in r334810, while according to the report an error there may happen if the dataset is over quota. In such a case just leave the node in the unlinked list to be freed some time later.
PR: 229887
Sponsored by: iXsystems, Inc.
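A sketch of the usual DMU transaction error-handling pattern this fix restores (hypothetical wrapper, not the zfs_unlinked_drain() body): an over-quota dataset can make dmu_tx_assign() fail, in which case the tx is aborted and the node simply stays on the unlinked list.

#include <sys/dmu.h>
#include <sys/dmu_tx.h>

static int
example_free_unlinked(objset_t *os, uint64_t obj)
{
	dmu_tx_t *tx;
	int error;

	tx = dmu_tx_create(os);
	dmu_tx_hold_free(tx, obj, 0, DMU_OBJECT_END);
	error = dmu_tx_assign(tx, TXG_WAIT);
	if (error != 0) {
		/* e.g. over quota: give up for now, retry on a later drain. */
		dmu_tx_abort(tx);
		return (error);
	}
	/* ... actually free the object here ... */
	dmu_tx_commit(tx);
	return (0);
}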
Revision 338205 - Modified Wed Aug 22 16:27:24 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Create separate taskqueue to call zfs_unlinked_drain().
r334810 introduced zfs_unlinked_drain() dispatch to a taskqueue on every deletion of a file with extended attributes. Using system_taskq for that, with its multiple threads, caused all available CPU threads to uselessly spin on busy locks when many files were deleted, completely blocking the system.
Use of a single dedicated taskqueue is the only easy solution I've found (see the sketch after this entry), though it would be great if we could specify that a given task should only ever execute once at a time, never in parallel, while many such tasks could use different threads at the same time.
Sponsored by: iXsystems, Inc.
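A minimal sketch of the serialization trick (hypothetical names, not the committed code): a taskqueue backed by a single thread runs the enqueued drain tasks one at a time instead of fanning them out across system_taskq's many threads.

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <sys/proc.h>
#include <sys/taskqueue.h>

static struct taskqueue *example_drain_tq;	/* hypothetical */

static void
example_drain_tq_init(void *arg __unused)
{
	example_drain_tq = taskqueue_create("example_drain", M_WAITOK,
	    taskqueue_thread_enqueue, &example_drain_tq);
	/* One worker thread: queued tasks can never run in parallel. */
	taskqueue_start_threads(&example_drain_tq, 1, PWAIT, "example_drain");
}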
Revision 338142 - Modified Tue Aug 21 16:37:37 2018 UTC by markj
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Set arc_kmem_cache_reap_retry_ms to 0 and make it configurable.
r329759 introduced this parameter, which controls the rate at which ZFS
UMA zones are drained when the ARC reclaim thread is shrinking the ARC.
The reclamation target is derived from the global free page count, and
arc_shrink() only frees buffers back to UMA, so the free page count is
not updated until the zones are drained. Thus, back-to-back calls to
arc_shrink() within the arc_kmem_cache_reap_retry_ms interval do not
provide immediate feedback to the arc_reclaim control loop, so we may
free more of the ARC than needed to address a transient page shortage.
As we do not implement the asynchronous zone draining added in r329759,
disable the retry interval, restoring pre-r329759 behaviour. That is,
we will drain the ZFS UMA zones before each attempt to shrink the ARC.
Reviewed by: mav
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
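A sketch of how such a knob is typically exposed on FreeBSD (illustrative variable name only, not the actual arc.c hunk): a loader tunable plus read/write sysctl under vfs.zfs.

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

SYSCTL_DECL(_vfs_zfs);

static int example_reap_retry_ms = 0;	/* 0: drain UMA zones on every reap */
SYSCTL_INT(_vfs_zfs, OID_AUTO, example_reap_retry_ms, CTLFLAG_RWTUN,
    &example_reap_retry_ms, 0,
    "Minimum time between consecutive kmem cache reaps (ms)");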
Revision 337972 - Modified Fri Aug 17 15:17:09 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
9751 Allocation throttling misplacing ditto blocks
Relax allocation throttling for ditto blocks. Due to random imbalances in allocation, throttling tends to push block copies to the one vdev that looks slightly better at the moment. A slightly less strict policy improves both data security and, surprisingly, write performance, since we don't need to touch extra metaslabs on each vdev to respect the minimum distance.
Sponsored by: iXsystems, Inc.
Revision 337970 - Modified Fri Aug 17 15:00:41 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
9738 Fix third block copy allocations, broken at 9112.
Use METASLAB_WEIGHT_CLAIM weight to allocate tertiary blocks.
Previous use of METASLAB_WEIGHT_SECONDARY for that caused errors
later in the metaslab_activate_allocator() call, leading to massive loading of unneeded metaslabs and write freezes.
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Revision 337923 - Modified Thu Aug 16 18:44:50 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Make vfs.zfs.zio.dva_throttle_enabled sysctl writable.
Not sure what I thought originally, but as I see now, runtime changes work fine, and the code even seems designed for this.
Revision 337922 - Modified Thu Aug 16 18:40:16 2018 UTC by jamie
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Put jail(2) under COMPAT_FREEBSD11. It has been the "old" way of creating
jails since FreeBSD 7.
Along with the system call, put the various security.jail.allow_foo and
security.jail.foo_allowed sysctls partly under COMPAT_FREEBSD11 (or
BURN_BRIDGES). These sysctls had two disparate uses: on the system side,
they were global permissions for jails created via jail(2) which lacked
fine-grained permission controls; inside a jail, they're read-only
descriptions of what the current jail is allowed to do. The first use
is obsolete along with jail(2), but keep them for the second, read-only use.
Differential Revision: D14791
Revision 337870 - Modified Wed Aug 15 21:01:57 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Fix mismerge in r337196.
ZoL made the same mistake, and fixed it with a separate commit 863522b1f9:
dsl_scan_scrub_cb: don't double-account non-embedded blocks
We were doing count_block() twice inside this function, once
unconditionally at the beginning (intended to catch the embedded block
case) and once near the end after processing the block.
The double-accounting caused the "zpool scrub" progress statistics in
"zpool status" to climb from 0% to 200% instead of 0% to 100%, and
showed double the I/O rate it was actually seeing.
This was apparently a regression introduced in commit 00c405b4b5e8,
which was an incorrect port of this OpenZFS commit:
https://github.com/openzfs/openzfs/commit/d8a447a7
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Steven Noonan <steven@uplinklabs.net>
Closes #7720
Closes #7738
Reported by: sef
Revision 337677 - Modified Sun Aug 12 03:15:30 2018 UTC by mmacy
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV/ZoL: Add dbuf hash and dbuf cache kstats
TODO: KSTAT_TYPE_NAMED support
commit 5e021f56d3437d3523904652fe3cc23ea1f4cb70
Author: Giuseppe Di Natale <dinatale2@users.noreply.github.com>
Date: Mon Jan 29 10:24:52 2018 -0800
Add dbuf hash and dbuf cache kstats
Introduce kstats about the dbuf hash and dbuf cache
to make it easier to inspect state. This should help
with debugging and understanding of these portions
of the codebase.
Correct format of dbuf kstat file.
Introduce a dbc column to dbufs kstat to indicate if
a dbuf is in the dbuf cache.
Introduce field filtering in the dbufstat python script.
Introduce a no header option to the dbufstat python script.
Introduce a test case to test basic mru->mfu list movement
in the ARC.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6906
Revision 337676 - Modified Sun Aug 12 02:24:18 2018 UTC by mmacy
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV/ZoL: Fix stack dbuf_hold_impl()
commit fc5bb51f08a6c91ff9ad3559d0266eeeab0b1f61
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date: Thu Aug 26 10:52:00 2010 -0700
Fix stack dbuf_hold_impl()
This commit preserves the recursive function dbuf_hold_impl() but moves
the local variables and function arguments to the heap to minimize
the stack frame size. Enough space is initially allocated on the
stack for 20 levels of recursion. This technique was based on commit
34229a2f2ac07363f64ddd63e014964fff2f0671 which reduced stack usage of
traverse_visitbp().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Revision 337672 - Modified Sun Aug 12 01:29:30 2018 UTC by mmacy
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV/ZoL: Fix stack noinline
commit 60948de1ef976aabaa3630707bcc8b5867508507
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date: Thu Aug 26 10:58:36 2010 -0700
Fix stack noinline
Certain functions must never be automatically inlined by gcc because
they are stack heavy or called recursively. This patch flags all
such functions I've found as 'noinline' to prevent gcc from making
the optimization.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Revision 337671 - Modified Sun Aug 12 01:17:32 2018 UTC by mmacy
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV/ZoL: Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z
commit 81edd3e83409218879e7af293daa86b0c40eb015
Author: Peng <peng.hse@xtaotech.com>
Date: Wed Jun 8 15:22:07 2016 +0800
Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z
The following scenario can result in garbage in the dn_spill field.
The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR
is clear to ensure the dn_spill field is cleared.
Current txg = A.
* A new spill buffer is created. Its dbuf is initialized with
db_blkptr = NULL and it's dirtied.
Current txg = B.
* The spill buffer is modified. It's marked as dirty in this txg.
* Additional changes make the spill buffer unnecessary because the
xattr fits into the bonus buffer, so it's removed. The dbuf is
undirtied in this txg, but it's still referenced and cannot be
destroyed.
Current txg = C.
* Starts syncing of txg A
* dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr
is NULL, dbuf_check_blkptr() is called.
* The dbuf starts being written and it reaches the ready state
(not done yet).
* A new change makes the spill buffer necessary again.
sa_build_layouts() ends up calling dbuf_find() to locate the
dbuf. It finds the old dbuf because it has not been destroyed yet
(it will be destroyed when the previous write is done and there
are no more references). The old dbuf has db_blkptr != NULL.
* txg A write is complete and the dbuf released. However it's still
referenced, so it's not destroyed.
Current txg = D.
* Starts syncing of txg B
* dbuf_sync_leaf() is called for the bonus buffer. Its contents are
directly copied into the dnode, overwriting the blkptr area because,
in txg B, the bonus buffer was big enough to hold the entire xattr.
* At this point, the db_blkptr of the spill buffer used in txg C
gets corrupted.
Signed-off-by: Peng <peng.hse@xtaotech.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3937
Revision 337670 - Modified Sun Aug 12 01:10:18 2018 UTC by mmacy
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV/ZoL: add dbuf stats
NB: disabled pending the addition of KSTAT_TYPE_RAW support to the
SPL
commit e0b0ca983d6897bcddf05af2c0e5d01ff66f90db
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date: Wed Oct 2 17:11:19 2013 -0700
Add visibility into cached dbufs
Currently there is no mechanism to inspect which dbufs are being cached by the system. There are some coarse counters in arcstats, but they only give a rough idea of what's being cached. This patch aims to improve the current situation by adding a new dbufs kstat.
When read, this new kstat will walk all cached dbufs linked into the dbuf_hash. For each dbuf it will dump detailed information about the buffer. It will also dump additional information about the referenced arc buffer and its related dnode. This provides a more complete view into exactly what is being cached.
With this generic infrastructure in place utilities can be written
to post-process the data to understand exactly how the caching is
working. For example, the data could be processed to show a list
of all cached dnodes and how much space they're consuming. Or a
similar list could be generated based on dnode type. Many other
ways to interpret the data exist based on what kinds of questions
you're trying to answer.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Revision 337669 - Modified Sun Aug 12 00:45:53 2018 UTC by mmacy
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV/ZoL: Implement large_dnode pool feature
commit 50c957f702ea6d08a634e42f73e8a49931dd8055
Author: Ned Bass <bass6@llnl.gov>
Date: Wed Mar 16 18:25:34 2016 -0700
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by a large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
Revision 337660 - Modified Sat Aug 11 22:01:52 2018 UTC by mmacy
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Enable balanced arc pruning
Taken from:
commit f6046738365571bd647f804958dfdff8a32fbde4
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date: Sat May 30 09:57:53 2015 -0500
Make arc_prune() asynchronous
As described in the comment above arc_adapt_thread() it is critical
that the arc_adapt_thread() function never sleep while holding a hash
lock. This behavior was possible in the Linux implementation because
the arc_prune() logic was implemented to be synchronous. Under
illumos the analogous dnlc_reduce_cache() function is asynchronous.
To address this the arc_do_user_prune() function has been reworked into two new functions as follows:
* arc_prune_async() is an asynchronous implementation which dispatches
the prune callback to be run by the system taskq. This makes it
suitable to use in the context of the arc_adapt_thread().
* arc_prune() is a synchronous implementation which depends on the
arc_prune_async() implementation but blocks until the outstanding
callbacks complete. This is used in arc_kmem_reap_now() where it
is safe, and expected, that memory will be freed.
This patch additionally adds the zfs_arc_meta_strategy module option, which allows the meta reclaim strategy to be configured. It defaults to a balanced strategy which has proved to work well under Linux, but the illumos meta-only strategy can be enabled.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Revision 337653 - Modified Sat Aug 11 19:45:04 2018 UTC by mmacy
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Limit the amount of dnode metadata in the ARC
In addition import most recent arc_prune_async implementation as dependency
commit 25458cbef9e59ef9ee6a7e729ab2522ed308f88f
Author: Tim Chase <tim@chase2k.com>
Date: Wed Jul 13 07:42:40 2016 -0500
Limit the amount of dnode metadata in the ARC
Metadata-intensive workloads can cause the ARC to become permanently
filled with dnode_t objects as they're pinned by the VFS layer.
Subsequent data-intensive workloads may only benefit from about
25% of the potential ARC (arc_c_max - arc_meta_limit).
In order to help track metadata usage more precisely, the other_size metadata arcstat has been replaced with dbuf_size, dnode_size and bonus_size.
The new zfs_arc_dnode_limit tunable, which defaults to 10% of
zfs_arc_meta_limit, defines the minimum number of bytes which is desirable
to be consumed by dnodes. Attempts to evict non-metadata will trigger
async prune tasks if the space used by dnodes exceeds this limit.
The new zfs_arc_dnode_reduce_percent tunable specifies the amount by
which the excess dnode space is attempted to be pruned as a percentage of
the amount by which zfs_arc_dnode_limit is being exceeded. By default,
it tries to unpin 10% of the dnodes.
The problem of dnode metadata pinning was observed with the following
testing procedure (in this example, zfs_arc_max is set to 4GiB):
- Create a large number of small files until arc_meta_used exceeds
arc_meta_limit (3GiB with default tuning) and arc_prune
starts increasing.
- Create a 3GiB file with dd. Observe arc_meta_used. It will still
be around 3GiB.
- Repeatedly read the 3GiB file and observe arc_meta_limit as before.
It will continue to stay around 3GiB.
With this modification, space for the 3GiB file is gradually made
available as subsequent demands on the ARC are made. The previous behavior
can be restored by setting zfs_arc_dnode_limit to the same value as the
zfs_arc_meta_limit.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4345
Issue #4512
Issue #4773
Closes #4858
Revision 337594 - Modified Fri Aug 10 23:42:11 2018 UTC by mmacy
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
ZFS/MFV: Use cached feature info in spa_add_feature_stats()
commit 417104bdd3c7ce07ec58674dd078f9891c3bc780
Author: Ned Bass <bass6@llnl.gov>
Date: Thu Feb 26 12:24:11 2015 -0800
Use cached feature info in spa_add_feature_stats()
Avoid issuing I/O to the pool when retrieving feature flags information.
Trying to read the ZAPs from disk means that zpool clear would hang if
the pool is suspended and recovery would require a reboot. To keep the
feature stats resident in memory, we hang a cached nvlist off of the
spa. It is built up from disk the first time spa_add_feature_stats() is
called, and refreshed thereafter using the cached feature reference
counts. spa_add_feature_stats() gets called at pool import time so we
can be sure the cached nvlist will be available if the pool is later
suspended.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3082
Revision 337567 - Modified Fri Aug 10 06:42:08 2018 UTC by mmacy
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Performance optimization of AVL tree comparator functions
MFV:
commit ee36c709c3d5f7040e1bd11f5c75318aa03e789f
Author: Gvozden Neskovic <neskovic@gmail.com>
Date: Sat Aug 27 20:12:53 2016 +0200
perf: 2.75x faster ddt_entry_compare()
The first 256 bits of ddt_key_t are a block checksum, which is expected to be close to random data. Hence, on average, the comparison only needs to look at the first few bytes of the keys. To reduce the number of conditional jump instructions, the result is computed as sign(memcmp(k1, k2)).
The sign of an integer 'a' can be obtained as `(0 < a) - (a < 0)` := {-1, 0, 1}, which is computed efficiently (see the comparator sketch after this entry). Synthetic performance evaluation of
original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R)
CPU E5-2660 v3:
old 6.85789 s
new 2.49089 s
perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare()
Compute the result directly instead of using conditionals
perf: zfs_range_compare()
Speedup between 1.1x - 2.5x, depending on compiler version and
optimization level.
perf: spa_error_entry_compare()
`bcmp()` is not suitable for comparator use. Use `memcmp()` instead.
perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare()
perf: 2.8x faster zil_bp_compare()
perf: 2.8x faster mze_compare()
perf: faster dbuf_compare()
perf: faster compares in spa_misc
perf: 2.8x faster layout_hash_compare()
perf: 2.8x faster space_reftree_compare()
perf: libzfs: faster avl tree comparators
perf: guid_compare()
perf: dsl_deadlist_compare()
perf: perm_set_compare()
perf: 2x faster range_tree_seg_compare()
perf: faster unique_compare()
perf: faster vdev_cache _compare()
perf: faster vdev_uberblock_compare()
perf: faster fuid _compare()
perf: faster zfs_znode_hold_compare()
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Richard Elling <richard.elling@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5033
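A minimal sketch of the branch-free comparator idiom described above for ddt_entry_compare() (stand-alone example; the 32-byte length is an assumption standing in for the checksum portion of ddt_key_t):

#include <string.h>

#define	EXAMPLE_KEY_LEN	32	/* assumed: the 256-bit checksum prefix */

static int
example_key_compare(const void *a, const void *b)
{
	int cmp = memcmp(a, b, EXAMPLE_KEY_LEN);

	/* Map any memcmp() result onto {-1, 0, 1} without branches. */
	return ((cmp > 0) - (cmp < 0));
}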
Revision 337229 - Modified Fri Aug 3 02:16:45 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
Reduce taskq and context-switch cost of zio pipe
When doing a read from disk, ZFS creates 3 ZIO's: a zio_null(), the
logical zio_read(), and then a physical zio. Currently, each of these
results in a separate taskq_dispatch(zio_execute).
On high-read-iops workloads, this causes a significant performance
impact. By processing all 3 ZIO's in a single taskq entry, we reduce the
overhead on taskq locking and context switching. We accomplish this by
allowing zio_done() to return a "next zio to execute" to zio_execute().
This results in a ~12% performance increase for random reads, from
96,000 iops to 108,000 iops (with recordsize=8k, on SSD's).
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-59292
Closes #7736
zfsonlinux/zfs@62840030a7dceaee013ddbcc1eebcfc7922edf7c
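A toy illustration of the "return the next thing to execute" idea (hypothetical types, not the ZIO pipeline itself): a stage handler can hand back another work item for the same thread to run, avoiding a taskq dispatch per stage.

typedef struct work work_t;
struct work {
	work_t *(*w_stage)(work_t *);	/* runs one stage, returns next item or NULL */
};

static void
work_execute(work_t *w)
{
	/* Run chained stages back to back in a single taskq entry. */
	while (w != NULL)
		w = w->w_stage(w);
}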
Revision 337213 - Modified Fri Aug 3 00:14:36 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r337212:
9465 ARC check for 'anon_size > arc_c/2' can stall the system
illumos/illumos-gate@abe1fd01ce5a83718c5a840daeab4abdaec1c104
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Don Brady <don.brady@delphix.com>
Revision 337211 - Modified Fri Aug 3 00:01:48 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r337210: 9577 remove zfs_dbuf_evict_key tsd
The zfs_dbuf_evict_key TSD (thread-specific data) is not necessary - we can
instead pass a flag down in a few places to prevent recursive dbuf eviction.
Making this change has 3 benefits:
1. The code semantics are easier to understand.
2. On Linux, performance is improved, because creating/removing TSD values
(by setting to NULL vs non-NULL) is expensive, and we do it very often.
3. According to Nexenta, the current semantics can cause a deadlock when
concurrently calling dmu_objset_evict_dbufs() (which is rare today, but they
are working on a "parallel unmount" change that triggers this more easily)
illumos/illumos-gate@c2919acbea007fa95c709b60d073db9a24526e01
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
Revision 337209 - Modified Thu Aug 2 23:56:07 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r337208: 9591 ms_shift can be incorrectly changed in MOS config for
indirect vdevs that have been historically expanded
illumos/illumos-gate@11f6a9680e013a7c9c57dc0b64d3e91e2eee1a6b
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
Revision 337207 - Modified Thu Aug 2 23:50:03 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r337206: 9338 moved dnode has incorrect dn_next_type
illumos/illumos-gate@c7fbe46df966ea665df63b6e6071808987e839d1
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
Revision 337205 - Modified Thu Aug 2 23:46:30 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r337204: 9439 ZFS double-free due to failure to dirty indirect block
illumos/illumos-gate@99a19144e82244f3426f055cc73af8a937c0135c
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
Revision 337202 - Modified Thu Aug 2 23:43:01 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r337200:
9438 Holes can lose birth time info if a block has a mix of birth times
Ultimately, the problem here is that when you truncate and write a file in
the same transaction group, the dbuf for the indirect block will be zeroed
out to deal with the truncation, and then written for the write. During
this process, we will lose hole birth time information for any holes in the
range. In the case where a dnode is being freed, we need to determine
whether the block should be converted to a higher-level hole in the zio
pipeline, and if so do it when the dnode is being synced out.
illumos/illumos-gate@738e2a3ce3b2579222d6855e7fe75b5bcfcddf8d
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>
Revision 337198 - Modified Thu Aug 2 23:25:49 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r337197: 9456 ztest failure in zil_commit_waiter_timeout
illumos/illumos-gate@b6031810da58df96413bf76e068638fcab1f228a
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Prakash Surya <prakash.surya@delphix.com>
Revision 337196 - Modified Thu Aug 2 23:23:10 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r337195: 9454 ::zfs_blkstats should count embedded blocks
illumos/illumos-gate@dec267e7ea9828898b1c64462daa6636c4ef5e29
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
Revision 337194 - Modified Thu Aug 2 23:15:10 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r337193:
9424 ztest failure: "unprotected error in call to Lua API (Invalid value type 'function' for key 'error')"
illumos/illumos-gate@fe3ba4d1227d8746116ece7240682b13595c3142
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
Revision 337191 - Modified Thu Aug 2 21:59:46 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r337190: 9486 reduce memory used by device removal on fragmented pools
In the most fragmented real-world cases, this reduces memory used by the
mapping from ~1GB to ~50MB of RAM per 1TB of storage removed. Less
fragmented cases will typically also see around 50-100MB of RAM per 1TB
of storage.
illumos/illumos-gate@cfd63e1b1bcf7ba4bf72f55ddbd87ce008d2986d
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
Revision 337183 - Modified Thu Aug 2 21:19:35 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r337182: 9330 stack overflow when creating a deeply nested dataset
Datasets that are deeply nested (~100 levels) are impractical. We just put
a limit of 50 levels on newly created datasets. Existing datasets should
work without a problem.
illumos/illumos-gate@5ac95da7d61660aa299c287a39277cb0372be959
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Revision 337181 - Modified Thu Aug 2 21:07:04 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
9539 Make zvol operations use _by_dnode routines
Continues what was started in 7801 (add more by-dnode routines) by fully
converting zvols to avoid unnecessary dnode_hold() calls. This saves a
small amount of CPU time and slightly improves latencies of operations
on zvols.
illumos/illumos-gate@8dfe5547fbf0979fc1065a8b6fddc1e940a7cf4f
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Rick McNeal <rick.mcneal@nexenta.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Richard Yao <richard.yao@prophetstor.com>
Revision 337177 - Modified Thu Aug 2 20:33:13 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r337175: 9487 Free objects when receiving full stream as clone
All objects after the last written or freed object are not supposed to
exist after receiving the stream. We should free them accordingly, as if
a freeobjects record for them had been included in the stream.
zfsonlinux/zfs@48fbb9ddbf2281911560dfbc2821aa8b74127315
illumos/illumos-gate@7864b8192b8d30471fa2240466d516292e5765b8
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>
Revision 337172 - Modified Thu Aug 2 20:18:49 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r337171:
9464 txg_kick() fails to see that we are quiescing, forcing transactions
to their next stages without leaving them accumulate changes
Ideally we would like txg_kick() to get triggered only when we are sure
that we are not syncing AND not quiescing any txg. This way we can kick
an open TXG to the quiescing state when we are sure that there is nothing
going on and we would benefit from the different states running
concurrently.
illumos/illumos-gate@fa41d87de9ec9000964c605eb01d6dc19e4a1abe
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
Revision 337169 - Modified Thu Aug 2 20:06:46 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r337167: 9442 decrease indirect block size of spacemaps
Updates to indirect blocks of spacemaps can contribute significantly to
write inflation. Therefore we want to reduce the indirect block size of
spacemaps from 128K to 16K.
illumos/illumos-gate@221813c13b43ef48330b03725e00edee85108cf1
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Albert Lee <trisk@forkgnu.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
Revision 337030 - Modified Wed Aug 1 03:21:17 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r337029:
9426 metaslab size can exceed offset addressable by spacemap
metaslab size can exceed offset addressable by spacemap. The vdev can
address up to 2^63 * SPA_MAXBLOCKSIZE (512). A metaslab can address up to
2^47 * 2^vdev_ashift. Therefore we may need to increase the number of
metaslabs so that the maximum metaslab size is capped at the amount that
can be addressed by the spacemap. This should happen in
vdev_metaslab_set_size().
illumos/illumos-gate@b4bf0cf0458759c67920a031021a9d96cd683cfe
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Don Brady <don.brady@delphix.com>
Revision 337028 - Modified Wed Aug 1 03:07:33 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r337027:
9328 zap code can take advantage of c99
9329 panic in zap_leaf_lookup() due to concurrent zapification
illumos/illumos-gate@bf26014c5541b6119f34e0d95294b7f2eb105ac2
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
Revision 337025 - Modified Wed Aug 1 02:39:44 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r337022:
9403 assertion failed in arc_buf_destroy() when concurrently reading block with checksum error
This assertion (VERIFY) failure was reported when reading a block. Turns out
the problem is that if we get an i/o error (ECKSUM in this case), and there
are multiple concurrent ARC reads of the same block (from different clones),
then the ARC will put multiple buf's on the same ANON hdr, which isn't
supposed to happen, and then causes a panic when we try to arc_buf_destroy()
the buf.
illumos/illumos-gate@fa98e487a9619b7902f218663be219e787a57dad
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Matthew Ahrens <mahrens@delphix.com>
Revision 337021 - Modified Tue Jul 31 23:00:58 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r337020: 9443 panic when scrub a v10 pool
illumos/illumos-gate@bb1f424574ac8e08069d0ba993c2a41ffe796794
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
Revision 337017 - Modified Tue Jul 31 22:50:50 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r337014:
9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on the deleted queue
illumos/illumos-gate@20b5dafb425396adaebd0267d29e1026fc4dc413
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>
Revision 337007 - Modified Tue Jul 31 21:06:04 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r336991, r337001:
9102 zfs should be able to initialize storage devices
The first access to a disk block can incur a performance penalty on some
platforms (e.g. AWS's EBS, VMware VMDKs). Therefore it is recommended that
volumes be "thick provisioned", where supported by the platform (VMware).
Thick provisioning is time consuming and often is ignored. If the thick
provision step is omitted, customers will see suboptimal performance until
we have written to all parts of the LUN. ZFS should be able to initialize
any unused storage to remove any first-write penalty that exists.
illumos/illumos-gate@094e47e980b0796b94b1b8f51f462a64d246e516
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: George Wilson <george.wilson@delphix.com>
Revision 336961 - Modified Tue Jul 31 01:02:22 2018 UTC by mav
Original Path: head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r336960: 9256 zfs send space estimation off by > 10% on some datasets
illumos/illumos-gate@df477c0afa111b5205c872dab36dbfde391656de
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Paul Dagnelie <pcd@delphix.com>
Revision
336959 -
Directory Listing
Modified
Tue Jul 31 00:58:21 2018 UTC
(6 years, 2 months ago)
by
mav
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r336958: 9337 zfs get all is slow due to uncached metadata
This project's goal is to make read-heavy channel programs and zfs(1m)
administrative commands faster by caching all the metadata that they will
need in the dbuf layer. This will prevent the metadata from being evicted, so
that any future call to, e.g., 'zfs get all' won't have to go to disk (much).
illumos/illumos-gate@adb52d9262f45a04318fc6e188fe2b7f59d989a5
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
Revision
336956 -
Directory Listing
Modified
Tue Jul 31 00:47:27 2018 UTC
(6 years, 2 months ago)
by
mav
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r336955: 9236 nuke spa_dbgmsg
We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least,
metaslab_condense() should call zfs_dbgmsg because it's important and rare
enough to always log. It's possible that the message in zio_dva_allocate()
would be too high-frequency for zfs_dbgmsg.
illumos/illumos-gate@21f7c81cc1156e9202ce3412d3ecaa697c3b2222
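For illustration, a minimal sketch of the preferred call, assuming
zfs_dbgmsg()'s usual printf-style interface from sys/zfs_debug.h; the
wrapper function and message text are hypothetical:

    /* Hypothetical sketch: log a rare but important event with zfs_dbgmsg(). */
    static void
    example_log_condense(uint64_t ms_id, uint64_t smp_size)
    {
        zfs_dbgmsg("condensing metaslab %llu, space map size %llu",
            (u_longlong_t)ms_id, (u_longlong_t)smp_size);
    }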
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
Revision
336954 -
Directory Listing
Modified
Tue Jul 31 00:37:45 2018 UTC
(6 years, 2 months ago)
by
mav
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r336952: 9192 explicitly pass good_writes to vdev_uberblock/label_sync
Currently vdev_label_sync and vdev_uberblock_sync take a zio_t and assume
that its io_private is a pointer to the good_writes count. They should
instead accept this argument explicitly.
illumos/illumos-gate@a3b5583021b7b45676bf1f0cc68adf7a97900b56
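A rough sketch of the new calling convention, with simplified, hypothetical
signatures (the real functions take more arguments; zio_t comes from the ZFS
sys/zio.h): the counter arrives as an explicit parameter instead of being
recovered from the parent zio's io_private.

    /* Hypothetical completion callback: the caller passes the counter. */
    static void
    example_uberblock_sync_done(zio_t *zio, uint64_t *good_writes)
    {
        if (zio->io_error == 0)
            atomic_inc_64(good_writes);
    }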
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
Revision
336951 -
Directory Listing
Modified
Tue Jul 31 00:25:39 2018 UTC
(6 years, 2 months ago)
by
mav
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r336950: 9290 device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk failure
and silent damage (e.g. some data on disk is not right, but ZFS hasn't
detected the whole device as being broken). However, the current device
removal implementation bypasses some of the mirror's redundancy.
illumos/illumos-gate@3a4b1be953ee5601bab748afa07c26ed4996cde6
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
Revision
336949 -
Directory Listing
Modified
Tue Jul 31 00:02:42 2018 UTC
(6 years, 2 months ago)
by
mav
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r336948: 9112 Improve allocation performance on high-end systems
On high-end systems running async sequential write workloads, especially
NUMA systems with flash or NVMe storage, one significant performance
bottleneck is selecting a metaslab to do allocations from. This process
can be parallelized, providing significant performance increases for
these workloads.
illumos/illumos-gate@f78cdc34af236a6199dd9e21376f4a46348c0d56
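A rough, hypothetical illustration of the idea (invented names, not the
actual ZFS data structures): keep several independently locked allocator
slots and spread incoming allocations across them, so concurrent writers no
longer serialize on a single metaslab-selection lock.

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    #define EXAMPLE_ALLOCATORS  4   /* hypothetical slot count */

    struct example_allocator {
        struct mtx  ea_lock;    /* per-slot lock, mtx_init()ed elsewhere */
        uint64_t    ea_cursor;  /* per-slot metaslab selection state */
    };

    static struct example_allocator example_slots[EXAMPLE_ALLOCATORS];

    /* Hash the caller (e.g. the zio pointer) onto a slot to spread contention. */
    static struct example_allocator *
    example_pick_allocator(uintptr_t hint)
    {
        return (&example_slots[hint % EXAMPLE_ALLOCATORS]);
    }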
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Paul Dagnelie <pcd@delphix.com>
Revision
336947 -
Directory Listing
Modified
Mon Jul 30 23:47:38 2018 UTC
(6 years, 2 months ago)
by
mav
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r336946: 9238 ZFS Spacemap Encoding V2
The current space map encoding has the following disadvantages:
[1] Assuming a 512-byte sector size, each entry can represent a segment of at
most 16MB (see the worked example below). This makes the encoding very
inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (e.g.
device removal, zpool checkpoint), we've started imposing limits on the
vdevs that can be used with them based on the maximum addressable offset
(currently 64PB for a top-level vdev).
The new encoding remains backwards compatible with the old one. The introduced
two-word entry format, besides extending the limits imposed by the single-entry
layout, also includes a vdev field and some extra padding after its prefix.
The extra padding after the prefix is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the door
for pool-wide space maps.
One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided because we don't know of any setups
that use more than 16M vdevs for the time being and we wanted to fit the vdev
field in the space map. In addition, that gives us some extra bits in dva_t.
illumos/illumos-gate@17f11284b49b98353b5119463254074fd9bc0a28
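To make the 16MB figure in [1] concrete (a worked sketch, assuming the
original one-word entry's 15-bit run-length field and 512-byte sectors):

    2^15 sectors per entry * 512 bytes per sector = 32,768 * 512 B = 16 MB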
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
Revision
336943 -
Directory Listing
Modified
Mon Jul 30 22:03:29 2018 UTC
(6 years, 2 months ago)
by
mav
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r336942: 9189 Add debug to vdev_label_read_config when txg check fails
illumos/illumos-gate@b6bf6e1540f30bd97b8d6e2c21d95e17841e0f23
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
Revision
336180 -
Directory Listing
Modified
Tue Jul 10 20:11:32 2018 UTC
(6 years, 3 months ago)
by
sef
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
Fix up some missed changes and mis-merges from the sequential scan code
(r334844). Most of the changes involve moving some code around to
reduce conflicts with future merges. One of the missing changes
was a notification on scrub cancellation.
Approved by: mav
Sponsored by: iXsystems Inc
Revision
336017 -
Directory Listing
Modified
Thu Jul 5 22:56:13 2018 UTC
(6 years, 3 months ago)
by
sef
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
This exposes ZFS user and group quotas via the normal
quotactl(2) mechanism. (Read-only at this point, however.)
In particular, this is to allow rpc.rquotad to query quotas
for NFS mounts, allowing users to see their quotas on the
hosts using the datasets.
The changes specifically:
* Add new RPC entry points for querying quotas.
* Change the library routines to allow non-UFS quotas.
* Change rquotad to check for quotas on mounted filesystems,
rather than being limited to entries in /etc/fstab.
* Lastly, add a VFS entry point for ZFS to query quotas.
Note that this makes one unavoidable behavioural change: if quotas
are enabled, then they can be queried, as opposed to the current
method of checking for quotas being specified in fstab. (With
ZFS, if there are user or group quotas, they're used, always.)
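A minimal userland sketch of the kind of query this enables, using the
standard quotactl(2) interface; the mountpoint /tank/home is a hypothetical
ZFS dataset used only for illustration:

    #include <sys/types.h>
    #include <ufs/ufs/quota.h>

    #include <err.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct dqblk dq;

        /* Ask for the calling user's quota on a (hypothetical) ZFS mount. */
        if (quotactl("/tank/home", QCMD(Q_GETQUOTA, USRQUOTA),
            getuid(), &dq) == -1)
            err(1, "quotactl");
        printf("blocks used: %ju, hard limit: %ju\n",
            (uintmax_t)dq.dqb_curblocks, (uintmax_t)dq.dqb_bhardlimit);
        return (0);
    }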
Reviewed by: delphij, mav
Approved by: mav
Sponsored by: iXsystems Inc
Differential Revision: https://reviews.freebsd.org/D15886
Revision
334810 -
Directory Listing
Modified
Thu Jun 7 18:59:32 2018 UTC
(6 years, 4 months ago)
by
benno
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
Break recursion involving getnewvnode and zfs_rmnode.
When we're at our vnode limit, getnewvnode will call into the vnode LRU
cache to free up vnodes. If the vnode we try to recycle is a ZFS vnode we
end up, eventually, in zfs_rmnode. If the ZFS vnode we're recycling
represents something with extended attributes, zfs_rmnode will call
zfs_zget which will attempt to allocate another vnode. If the next vnode we
try to recycle is also a ZFS vnode representing something with extended
attributes, we can recurse further. This recursion is unbounded and can end
up overflowing the stack.
In order to avoid this, restructure zfs_rmnode to simply add the extended
attribute directory's object ID to the unlinked set, thus not requiring the
allocation of a vnode. We then schedule a task that calls zfs_unlinked_drain
which will do the work of properly marking the vnodes for unlinking.
zfs_unlinked_drain is also called on mount so these will be cleaned up
there.
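The deferral pattern itself, reduced to a minimal hypothetical kernel sketch
(the real change queues zfs_unlinked_drain; all names here are invented):
instead of recursing, record the object and hand the heavy work to a
taskqueue thread, which runs on its own stack.

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/malloc.h>
    #include <sys/taskqueue.h>

    struct example_cleanup {
        struct task ec_task;
        uint64_t    ec_obj;     /* e.g. an object id queued for later */
    };

    static void
    example_cleanup_task(void *arg, int pending __unused)
    {
        struct example_cleanup *ec = arg;

        /* ... vnode-allocating cleanup of ec->ec_obj happens here ... */
        free(ec, M_TEMP);
    }

    static void
    example_defer_cleanup(uint64_t obj)
    {
        struct example_cleanup *ec;

        ec = malloc(sizeof(*ec), M_TEMP, M_WAITOK | M_ZERO);
        ec->ec_obj = obj;
        TASK_INIT(&ec->ec_task, 0, example_cleanup_task, ec);
        taskqueue_enqueue(taskqueue_thread, &ec->ec_task);
    }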
Reviewed by: avg, mav
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D15342
Revision
334203 -
Directory Listing
Modified
Fri May 25 07:29:52 2018 UTC
(6 years, 5 months ago)
by
avg
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
fix zfs_getpages crash when called from sendfile, followup to r329363
It turns out that sendfile_swapin() has an optimization where it may
insert pointers to bogus_page into the page array that it passes to
VOP_GETPAGES. That happens to work with buffer cache, because it
extensively uses bogus_page internally, so it has the necessary checks.
However, ZFS did not expect bogus_page as VOP_GETPAGES(9) does not
document such a (ab)use of bogus_page.
So, this commit adds checks and handling of bogus_page.
I expect that use of bogus_page with VOP_GETPAGES will get documented
sooner rather than later.
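A stripped-down, hypothetical sketch of the kind of check this adds to a
getpages handler (the real zfs_getpages logic is more involved; bogus_page
is the VM system's global placeholder page, declared here for clarity):

    #include <sys/param.h>
    #include <vm/vm.h>
    #include <vm/vm_page.h>

    extern vm_page_t bogus_page;    /* VM/buffer-cache placeholder page */

    static void
    example_getpages_fill(vm_page_t *ma, int count)
    {
        int i;

        for (i = 0; i < count; i++) {
            /* Skip slots that sendfile filled with the placeholder page. */
            if (ma[i] == bogus_page)
                continue;
            /* ... read file data into ma[i] here ... */
        }
    }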
Reported by: Andrew Reilly <areilly@bigpond.net.au>, delphij
Tested by: Andrew Reilly <areilly@bigpond.net.au>
Requested by: many
MFC after: 1 week
Revision
333630 -
Directory Listing
Modified
Tue May 15 13:27:29 2018 UTC
(6 years, 5 months ago)
by
avg
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
Fix 'zpool create -t <tempname>'
Creating a pool with a temporary name fails when we also specify custom
dataset properties: this is because we mistakenly call
zfs_set_prop_nvlist() on the "real" pool name which, as expected,
cannot be found because the SPA is present in the namespace with the
temporary name.
Fix this by specifying the correct pool name when setting the dataset
properties.
Author: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Obtained from: ZFS on Linux, zfsonlinux/zfs@4ceb8dd6fdfdde
MFC after: 1 week
Revision
333425 -
Directory Listing
Modified
Wed May 9 18:47:24 2018 UTC
(6 years, 5 months ago)
by
mmacy
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
Eliminate the overhead of gratuitous repeated reinitialization of cap_rights
- Add macros to allow preinitialization of cap_rights_t.
- Convert most commonly used code paths to use preinitialized cap_rights_t.
A 3.6% speedup in fstat was measured with this change.
Reported by: mjg
Reviewed by: oshogbo
Approved by: sbruno
MFC after: 1 month
Revision
333263 -
Directory Listing
Modified
Fri May 4 20:54:27 2018 UTC
(6 years, 5 months ago)
by
jamie
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
Make it easier for filesystems to count themselves as jail-enabled,
by doing most of the work in a new function, prison_add_vfs, in kern_jail.c.
Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and
the rest is taken care of. This includes adding a jail parameter like
allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed.
Both of these used to be a static list of known filesystems, with
predefined permission bits.
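A minimal sketch of the opt-in from a filesystem author's side, for a
hypothetical filesystem "myfs" (vfsops members elided): passing VFCF_JAIL to
VFS_SET() is now all that is needed; prison_add_vfs then generates the
corresponding allow.mount.myfs jail parameter and
security.jail.mount_myfs_allowed sysctl.

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/module.h>
    #include <sys/mount.h>

    /* Hypothetical filesystem; only the jail opt-in is shown. */
    static struct vfsops myfs_vfsops = {
        /* ... .vfs_mount, .vfs_unmount, .vfs_root, etc. elided ... */
    };

    /* VFCF_JAIL marks the filesystem as mountable from within a jail. */
    VFS_SET(myfs_vfsops, myfs, VFCF_JAIL);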
Reviewed by: kib
Differential Revision: D14681
Revision
333234 -
Directory Listing
Modified
Fri May 4 00:56:41 2018 UTC
(6 years, 5 months ago)
by
emaste
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
zfs_ioctl: avoid out-of-bound read
admbugs: 796
Submitted by: Domagoj Stolfa <ds815@cam.ac.uk>
Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Reviewed by: avg
MFC after: 1 day
Revision
332523 -
Directory Listing
Modified
Mon Apr 16 00:54:58 2018 UTC
(6 years, 6 months ago)
by
mav
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
9433 Fix ARC hit rate
When the compressed ARC feature was added in commit d3c2ae1,
the method of reference counting in the ARC was modified. As
part of this accounting change, the arc_buf_add_ref() function
was removed entirely.
This would have been fine, but the arc_buf_add_ref() function
served a second undocumented purpose of updating the ARC access
information when taking a hold on a dbuf. Without this logic
in place a cached dbuf would not migrate its associated
arc_buf_hdr_t to the MFU list. This would negatively impact
the ARC hit rate, particularly on systems with a small ARC.
This change reinstates the missing call to arc_access() from
dbuf_hold() by implementing a new arc_buf_access() function.
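A condensed, hypothetical view of the reinstated behaviour (not the literal
patch; dmu_buf_impl_t and arc_buf_access() come from the ZFS dbuf/ARC
headers): when a hold finds the dbuf already cached, its backing ARC buffer
is touched so the header can migrate toward the MFU list.

    /* Hypothetical excerpt from a dbuf cache-hit path. */
    static void
    example_dbuf_hold_hit(dmu_buf_impl_t *db)
    {
        if (db->db_buf != NULL)
            arc_buf_access(db->db_buf);     /* refresh ARC access state */
    }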
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Revision
332426 -
Directory Listing
Modified
Thu Apr 12 10:37:26 2018 UTC
(6 years, 6 months ago)
by
avg
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
allow ZFS pool to have temporary name for duration of current import
The change adds a -t <name> option to zpool create and a -t option to the
form of zpool import that takes an old name and a new name. This allows
importing (or creating) a pool under a name that's different from its real,
permanent name, without affecting that name. This is useful when working
with VM images or images of other physical systems if they happen to
have a ZFS pool with the same name as the host system.
The changes come from ZoL with some small tweaks.
The porting has been done by julian.
The change is being submitted to OpenZFS:
https://github.com/openzfs/openzfs/pull/600
Submitted by: julian
Reviewed by: smh
MFC after: 2 weeks
Sponsored by: Panzura (porting)
Differential Revision: https://reviews.freebsd.org/D14972
Revision
332365 -
Directory Listing
Modified
Tue Apr 10 13:56:06 2018 UTC
(6 years, 6 months ago)
by
markj
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
Set zfs_arc_free_target to v_free_target.
Page daemon output is now regulated by a PID controller with a setpoint
of v_free_target. Moreover, the page daemon now wakes up regularly
rather than waiting for a wakeup from another thread. This means that
the free page count is unlikely to drop below the old
zfs_arc_free_target value, and as a result the ARC was not readily
freeing pages under memory pressure. Address the immediate problem by
updating zfs_arc_free_target to match the page daemon's new behaviour.
Reported and tested by: truckman
Discussed with: jeff
X-MFC with: r329882
Differential Revision: https://reviews.freebsd.org/D14994
Revision
331950 -
Directory Listing
Modified
Tue Apr 3 21:16:41 2018 UTC
(6 years, 6 months ago)
by
mav
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
9434 Speculative prefetch is blocked by device removal code.
Device removal code does not set spa_indirect_vdevs_loaded for pools
that never experienced device removal. At least one visible consequence
of this is a completely blocked speculative prefetcher. This patch sets
the variable in such situations.
Revision
331713 -
Directory Listing
Modified
Wed Mar 28 23:17:29 2018 UTC
(6 years, 7 months ago)
by
mav
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r331712:
9280 Assertion failure while running removal_with_ganging test with 4K devices
illumos/illumos-gate@243952c7eeef020886e3e2e3df99a513df40584a
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matt Ahrens <Matt.Ahrens@delphix.com>
Revision
331711 -
Directory Listing
Modified
Wed Mar 28 23:05:48 2018 UTC
(6 years, 7 months ago)
by
mav
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV 331710:
9188 increase size of dbuf cache to reduce indirect block decompression
illumos/illumos-gate@268bbb2a2fa79c36d4695d13a595ba50a7754b76
With compressed ARC (6950) we use up to 25% of our CPU to decompress indirect
blocks, under a workload of random cached reads. To reduce this decompression
cost, we would like to increase the size of the dbuf cache so that more
indirect blocks can be stored uncompressed.
If we are caching entire large files of recordsize=8K, the indirect blocks
use 1/64th as much memory as the data blocks (assuming they have the same
compression ratio). We suggest making the dbuf cache be 1/32nd of all memory,
so that in this scenario we should be able to keep all the indirect blocks
decompressed in the dbuf cache. (We want it to be more than the 1/64th that
the indirect blocks would use because we need to cache other stuff in the
dbuf cache as well.)
In real world workloads, this won't help as dramatically as the example
above, but we think it's still worth it because the risk of decreasing
performance is low. The potential negative performance impact is that we
will be slightly reducing the size of the ARC (by ~3%).
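The 1/64th figure follows from the default sizes (a worked sketch, assuming
128KB indirect blocks and 128-byte block pointers):

    128KB indirect block / 128 B per blkptr  = 1024 block pointers
    1024 pointers * 8KB records              = 8MB of file data covered
    128KB of indirect metadata / 8MB of data = 1/64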
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: George Wilson <george.wilson@delphix.com>
Revision
331709 -
Directory Listing
Modified
Wed Mar 28 22:50:05 2018 UTC
(6 years, 7 months ago)
by
mav
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r331708:
9321 arc_loan_compressed_buf() can increment arc_loaned_bytes by the wrong value
illumos/illumos-gate@9be12bd737714550277bd02b0c693db560976990
arc_loan_compressed_buf() increments arc_loaned_bytes by psize unconditionally.
In the case of zfs_compressed_arc_enabled=0, when the buf is returned via
arc_return_buf(), if ARC_BUF_COMPRESSED(buf) is false, then arc_loaned_bytes
is decremented by lsize, not psize.
Switch to using arc_buf_size(buf), instead of psize, which will return
psize or lsize, depending on the result of ARC_BUF_COMPRESSED(buf).
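A minimal, hypothetical illustration of the accounting rule being restored
(invented counter and helper names; arc_buf_t, arc_buf_size() and
ARC_BUF_COMPRESSED() come from the ZFS ARC headers): the loan and return
paths must charge and credit the counter through the same size accessor,
whichever of psize or lsize it resolves to.

    /* Hypothetical sketch: loan and return must use one consistent size. */
    static uint64_t example_loaned_bytes;

    static void
    example_loan_buf(arc_buf_t *buf)
    {
        atomic_add_64(&example_loaned_bytes, arc_buf_size(buf));
    }

    static void
    example_return_buf(arc_buf_t *buf)
    {
        atomic_add_64(&example_loaned_bytes, -(int64_t)arc_buf_size(buf));
    }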
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Allan Jude <allanjude@freebsd.org>
Revision
331707 -
Directory Listing
Modified
Wed Mar 28 22:29:06 2018 UTC
(6 years, 7 months ago)
by
mav
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r331706:
9235 rename zpool_rewind_policy_t to zpool_load_policy_t
illumos/illumos-gate@5dafeea3ebd2dd77affc802bcb90f63faf01589f
We want to be able to pass various settings during import/open of a pool,
which are not only related to rewind. Instead of adding a new policy and
duplicating a bunch of code, we should just rename rewind_policy to a more
generic term like load_policy.
For instance, we'd like to set spa->spa_import_flags from the nvlist,
rather from a flags parameter passed to spa_import as in some cases we want
those flags not only for the import case, but also for the open case. One
such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb) which would
allow zfs to open a pool when logs are missing.
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
Revision
331705 -
Directory Listing
Modified
Wed Mar 28 22:10:06 2018 UTC
(6 years, 7 months ago)
by
mav
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV 331704:
9191 dump vdev tree to zfs_dbgmsg when spa load fails due to missing log devices
illumos/illumos-gate@ccef24b493bcbd146fcd6d8946666cae081470b6
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
Revision
331703 -
Directory Listing
Modified
Wed Mar 28 22:07:31 2018 UTC
(6 years, 7 months ago)
by
mav
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV 331702:
9187 racing condition between vdev label and spa_last_synced_txg in vdev_validate
illumos/illumos-gate@d1de72cfa29ab77ff80e2bb0e668a6afa5bccaf0
ztest failed with uncorrectable IO error despite having the fix for #7163.
Both sides of the mirror have CANT_OPEN_BAD_LABEL, which also distinguishes
it from that issue.
It definitely seems like a race condition between vdev_validate and spa_sync:
1. Thread A (spa_sync): vdev label is updated to latest txg
2. Thread B (vdev_validate): vdev label's txg is compared to spa_last_synced_txg and is ahead.
3. Thread A (spa_sync): spa_last_synced_txg is updated to latest txg.
Solution: do not check txg in vdev_validate unless config lock is held.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
Revision
331701 -
Directory Listing
Added
Wed Mar 28 22:01:27 2018 UTC
(6 years, 7 months ago)
by
mav
Original Path:
head/sys/cddl/contrib/opensolaris/uts/common/fs
MFV r331695, 331700: 9166 zfs storage pool checkpoint
illumos/illumos-gate@8671400134a11c848244896ca51a7db4d0f69da4
Storage Pool Checkpoint (aka zpool checkpoint) can be thought of as a
“pool-wide snapshot” (or a
variation of extreme rewind that doesn’t corrupt your data). It remembers
the entire state of the pool at the point that it was taken and the user
can revert back to it later or discard it. Its generic use case is an
administrator that is about to perform a set of destructive actions to ZFS
as part of a critical procedure. She takes a checkpoint of the pool before
performing the actions, then rewinds back to it if one of them fails or puts
the pool into an unexpected state. Otherwise, she discards it. With the
assumption that no one else is making modifications to ZFS, she basically
wraps all these actions into a “high-level transaction”.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>