
Enable systemd-oomd by default for all variants

Summary

Provide a better experience for Fedora users in out-of-memory (OOM) situations by enabling systemd-oomd by default. Actions taken by systemd-oomd operate on a per-cgroup level, aligning well with the life cycle of systemd units. systemd-oomd primarily uses Linux pressure stall information (PSI) to make decisions based on wasted productivity due to resource shortages; it also supports swap-based actions.

Owners

Current status

  • Targeted release: Fedora 34
  • Last updated: 2021-03-30
  • FESCo issue: #2535
  • Tracker bug: #1913794
  • Release notes tracker: #627

Detailed description

The primary mechanism used by systemd-oomd for detecting when the system is out of memory is memory pressure. Memory pressure measures the percentage of time a cgroup has “wasted” due to lack of memory. This includes time spent reclaiming free memory, faulting in recently resident pages, and loading in anonymous pages from swap. When a monitored cgroup’s memory pressure exceeds the specified thresholds, systemd-oomd will perform action(s) on the targeted cgroup’s descendants, starting from the cgroups with the most reclaim scans. Reclaim activity is used here, rather than the largest consumer, as it reflects values set in the cgroup memory controller for memory protection (such as memory.low).
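The pressure figures described above can be inspected directly. The kernel exposes PSI system-wide in /proc/pressure/memory and per cgroup in a memory.pressure file, with lines of the form "some avg10=... avg60=... avg300=... total=...". The following is a minimal sketch for reading one of those values; psi_avg10 is a helper name made up for this example and is not part of systemd.

```shell
#!/bin/sh
# psi_avg10 KIND [FILE]: print the avg10 value from a PSI file.
# KIND is "some" or "full"; FILE defaults to the system-wide memory PSI file.
# (psi_avg10 is an illustrative helper, not a systemd tool.)
psi_avg10() {
    kind=$1
    file=${2:-/proc/pressure/memory}
    awk -v kind="$kind" '
        $1 == kind {
            # Scan the key=value fields and print the avg10 value only.
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^avg10=/) {
                    sub(/^avg10=/, "", $i)
                    print $i
                }
            }
        }
    ' "$file"
}
```

For example, psi_avg10 full prints the system-wide value, and passing a cgroup's memory.pressure file as the second argument (for instance one under /sys/fs/cgroup/user.slice/, assuming the unified cgroup hierarchy is mounted there) prints the value for that cgroup.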

The memory pressure configuration will be ManagedOOMMemoryPressure=kill and ManagedOOMMemoryPressureLimit=50% on user@.service. With these settings, systemd-oomd sends SIGKILL to all processes under a selected cgroup when total memory pressure on all tasks exceeds 50% for 20 seconds.
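The "exceeds 50% for 20 seconds" rule can be illustrated with a small filter. This is only a sketch of the rule as stated in this proposal, not systemd-oomd's implementation. The function name first_sustained_breach and the one-sample-per-second input format are assumptions made for the example.

```shell
#!/bin/sh
# first_sustained_breach LIMIT COUNT: read one pressure percentage per line
# from stdin (assumed to be one sample per second) and print the 1-based
# index of the first sample at which the value has stayed above LIMIT for
# COUNT consecutive samples. Prints nothing if that never happens.
first_sustained_breach() {
    awk -v limit="$1" -v count="$2" '
        # Extend the current run while above the limit, otherwise reset it.
        { run = ($1 + 0 > limit + 0) ? run + 1 : 0 }
        run >= count { print NR; exit }
    '
}
```

With a limit of 50 and a count of 20, a single sample at or below 50 resets the run, so short spikes in pressure do not qualify.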

For swap-based actions, systemd-oomd will monitor system-wide swap space and act when available swap falls below the configured threshold, starting with the cgroups with the highest swap usage and working down to the lowest. Keeping some amount of swap (if enabled) available will prevent the kernel OOM killer from killing processes unpredictably and spending an unbounded amount of time afterwards.

The swap configuration will be SwapUsedLimitPercent=90% in oomd.conf and ManagedOOMSwap=kill on -.slice (the root cgroup slice). With these settings, systemd-oomd sends SIGKILL to all processes under a selected cgroup when swap usage exceeds 90%.
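To get a feel for how close a system is to the 90% threshold, swap usage can be computed from the SwapTotal and SwapFree fields of /proc/meminfo. This is a rough sketch for manual checks, not how systemd-oomd itself measures swap; swap_used_percent is a name invented for this example.

```shell
#!/bin/sh
# swap_used_percent [FILE]: print used swap as a whole-number percentage,
# based on the SwapTotal and SwapFree lines of a meminfo-style file.
# Prints 0 when no swap is configured.
swap_used_percent() {
    file=${1:-/proc/meminfo}
    awk '
        $1 == "SwapTotal:" { total = $2 }
        $1 == "SwapFree:"  { free = $2 }
        END {
            if (total > 0)
                printf "%d\n", (total - free) * 100 / total
            else
                print 0
        }
    ' "$file"
}
```

A result above 90 would correspond to the condition under which the configuration above allows systemd-oomd to act.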

Feedback

Can we integrate this with GIO's GMemoryMonitor API?

Likely yes, though it is not planned by the maintainers for the near term.

Can we exclude certain units from being killed?

Setting ManagedOOMPreference=avoid or ManagedOOMPreference=omit on systemd units that are leaf cgroup nodes, or cgroups with memory.oom.group set to 1, can prevent them from being targeted by systemd-oomd. avoid de-prioritizes the unit as a kill candidate, while omit makes systemd-oomd ignore it entirely. Since these settings are meant to be used sparingly (e.g. for critical services), their use is limited to root-owned cgroups.
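One way to apply such a preference to a service is a unit drop-in. The sketch below writes the drop-in under a staging directory so it can be reviewed before it is installed. The helper name write_oom_preference, the unit name example-critical.service and the file name 50-oom-preference.conf are all hypothetical, and the [Service] section assumes the unit is a service rather than a slice or scope.

```shell
#!/bin/sh
# write_oom_preference ROOT UNIT PREFERENCE: write a drop-in for UNIT under
# ROOT (an empty ROOT targets the live system and needs root privileges).
# PREFERENCE is "avoid" or "omit". The file name 50-oom-preference.conf is
# an arbitrary choice for this example.
write_oom_preference() {
    root=$1 unit=$2 preference=$3
    dir="$root/etc/systemd/system/$unit.d"
    mkdir -p "$dir"
    printf '[Service]\nManagedOOMPreference=%s\n' "$preference" \
        > "$dir/50-oom-preference.conf"
}
```

For example, write_oom_preference /tmp/stage example-critical.service avoid produces the file under /tmp/stage. After installing a drop-in into the real /etc/systemd/system, run sudo systemctl daemon-reload so that systemd picks it up.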

How will this work if everything is in the same cgroup?

It will not work as systemd-oomd acts on a per-cgroup level. Applications will need to spawn processes into separate cgroups (e.g. with systemd-run) or use a desktop environment (e.g. GNOME, KDE) that does this for them.
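Whether two applications share a cgroup can be checked from /proc. On a system using the unified cgroup hierarchy, each /proc/PID/cgroup file contains a line of the form "0::/path". The following sketch extracts that path; cgroup_of is a helper name made up for this example.

```shell
#!/bin/sh
# cgroup_of [FILE]: print the cgroup v2 path recorded in a /proc/PID/cgroup
# file. With no argument it reads /proc/self/cgroup, which reports the
# cgroup the calling shell's child process runs in.
cgroup_of() {
    sed -n 's/^0:://p' "${1:-/proc/self/cgroup}"
}
```

If cgroup_of /proc/PID/cgroup prints the same path for two unrelated applications, they would be killed together. A program started with systemd-run --user --scope should instead report its own transient scope.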

Should spins that don't put processes in separate cgroups be excluded from this change?

That will be left up to the maintainers of those spins. Based on feedback, the current plan is to enable systemd-oomd with the specified configuration by default to minimize fragmentation of the Fedora install base (the Upgrade/Compatibility section has been updated to reflect this). A separate subpackage, "systemd-oomd-defaults", controls the policy for systemd-oomd. Excluding or removing it (and then running systemctl daemon-reload) will prevent systemd-oomd from killing anything, because without a policy systemd-oomd does not act.

Benefit to Fedora

  • To address the request for better user feedback in https://pagure.io/fedora-workstation/issue/202, systemd-oomd currently logs to the journal when a pressure or swap action is about to occur. There are also debug logs for each process that is sent a SIGKILL, and these can be bumped up in priority. Further notification mechanisms (i.e. over D-Bus) can also be implemented depending on feedback.
  • While systemd-oomd is simpler to configure than the oomd used at Facebook, the algorithm is largely the same. As such, the following case study can be used as an example of how PSI and cgroup killing can release memory in situations that process killing does not normally resolve, and lead to better utilization: https://facebookincubator.github.io/oomd/docs/oomd-casestudy.html
  • OOM killing in userspace, before the kernel OOM killer kicks in, has been shown to be effective at keeping a system functional. An OOM kill in the kernel is slow, possibly leading to an unbounded amount of time spent swapping pages in and out and evicting the page cache.
  • PSI-based actions, compared with looking at raw memory consumption numbers, better reflect memory protection policies set via cgroup resource control (e.g. memory.low).

Scope

  • Proposal owners:
    • Implement and land additional refinements to systemd-oomd
      • Remove swap as a hard requirement to running systemd-oomd
      • Configurable memory pressure time window knob
      • Per-unit knob to exclude units from being killed
    • Enable oomd by default with a sensible configuration that users can easily opt out of
    • Test days
    • Aid with documentation
  • Other developers:
    • systemd: review PRs as needed
  • Release engineering: https://pagure.io/releng/issue/9913
  • Policies and guidelines: N/A
  • Trademark approval: N/A

Upgrade/compatibility impact

systemd-oomd will be enabled by default on both upgrades and new installs. Systems that were previously running earlyoom will be transitioned in a process similar to running these commands:

sudo systemctl disable --now earlyoom
sudo systemctl enable --now systemd-oomd

How to test

The systemd 247 build for Fedora includes all the artifacts for systemd-oomd. It is disabled by default but can be started with:

sudo systemctl enable --now systemd-oomd

At this point you can decide which units to set properties on. For example, to enable swap-based killing on all units below the root slice:

sudo systemctl edit --force -- -.slice
[Slice]
ManagedOOMSwap=kill
# save and exit

Note that the following memory pressure example requires the changes listed in “Scope” to work as expected, as systemd-oomd shipped with systemd v247 does not support changing the time window for memory pressure. This example was run on a system with swap:

sudo systemctl edit user@.service
[Service]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=10%
# save and exit

systemd-run --user tail /dev/zero # will lead to a lot of reclaim and then OOM if not killed

User experience

This should be a fully transparent change for users.

Dependencies

None. If changes to oomd are required to address feedback to this proposal, they will need to be merged in systemd.

Contingency plan

  • Contingency mechanism: For Workstation, the change owner will revert all changes and we’ll go back to using earlyoom
  • Contingency deadline: Final freeze
  • Blocks release? No
  • Blocks product? No

Documentation

https://www.freedesktop.org/software/systemd/man/systemd-oomd.html
https://www.freedesktop.org/software/systemd/man/oomctl.html
https://www.freedesktop.org/software/systemd/man/oomd.conf.html

Release Notes

systemd-oomd is enabled by default. Depending on which systemd units have ManagedOOMSwap=kill or ManagedOOMMemoryPressure=kill, systemd-oomd will SIGKILL all the processes under the appropriate descendant cgroups when the configured limits are exceeded.

To revert to earlyoom, run:

sudo systemctl disable --now systemd-oomd
sudo systemctl enable --now earlyoom

See man oomd.conf for configuration options.