From Fedora Project Wiki
(→‎Scope: add link to latest pr)
 
(37 intermediate revisions by 3 users not shown)
Line 1: Line 1:
= Shorter Shutdown Timer =
= Shorter Shutdown Timer =


{{Change_Proposal_Banner}}


== Summary ==
== Summary ==
<!-- A sentence or two summarizing what this change is and what it will do. This information is used for the overall changeset summary page for each release. Note that motivation for the change should be in the Benefit to Fedora section below, and this part should answer the question "What?" rather than "Why?". -->
<!-- A sentence or two summarizing what this change is and what it will do. This information is used for the overall changeset summary page for each release. Note that motivation for the change should be in the Benefit to Fedora section below, and this part should answer the question "What?" rather than "Why?". -->


A downstream configuration change to reduce the systemd unit timeout from 2 minutes to 15 seconds.
A downstream configuration change to reduce the systemd unit timeout from 2 minutes to 45 seconds and send SIGABRT to generate a core dump before SIGKILL.


== Owner ==
== Owner ==
<!--
* Name: [[User:catanzaro| Michael Catanzaro]], [[User:aday| Allan Day]], [[User:zbyszek| Zbigniew Jędrzejewski-Szmek]]
For change proposals to qualify as self-contained, owners of all affected packages need to be included here. Alternatively, a SIG can be listed as an owner if it owns all affected packages.
* Email: mcatanzaro at redhat dot com, aday at redhat dot com, zbyszek at in dot waw dot pl
This should link to your home wiki page so we know who you are.
-->
* Name: [[User:FASAcountName| catanzaro]]
<!-- Include you email address that you can be reached should people want to contact you about helping with your change, status is requested, or technical issues need to be resolved. If the change proposal is owned by a SIG, please also add a primary contact person. -->
* Email: mcatanzaro at redhat dot com
<!--- UNCOMMENT only for Changes with assigned Shepherd (by FESCo)
* FESCo shepherd: [[User:FASAccountName| Shehperd name]] <email address>
-->




== Current status ==
== Current status ==
[[Category:ChangePageIncomplete]]
[[Category:ChangeAcceptedF38]]
<!-- When your change proposal page is completed and ready for review and announcement -->
<!-- When your change proposal page is completed and ready for review and announcement -->
<!-- remove Category:ChangePageIncomplete and change it to Category:ChangeReadyForWrangler -->
<!-- remove Category:ChangePageIncomplete and change it to Category:ChangeReadyForWrangler -->
Line 40: Line 31:
ON_QA -> change is fully code complete
ON_QA -> change is fully code complete
-->
-->
* FESCo issue: <will be assigned by the Wrangler>
* [https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/N6TW6EBPG5Q6D6SEV4FFGQH3E6HUTVE2/ devel thread]
* Tracker bug: <will be assigned by the Wrangler>
* FESCo issue: [https://pagure.io/fesco/issue/2928 #2928]
* Release notes tracker: <will be assigned by the Wrangler>
* Tracker bug: [https://bugzilla.redhat.com/show_bug.cgi?id=2161753 #2161753]
* Release notes tracker: [https://pagure.io/fedora-docs/release-notes/issue/954 #954]


== Detailed Description ==
== Detailed Description ==
<!-- Expand on the summary, if appropriate.  A couple sentences suffices to explain the goal, but the more details you can provide the better. -->
<!-- Expand on the summary, if appropriate.  A couple sentences suffices to explain the goal, but the more details you can provide the better. -->


Currently, services which have stalled or are misbehaving can prevent shutdown for up to 2 minutes. This causes extreme frustrating for our users - someone goes to shutdown or reboot and then unexpectedly has to wait a long time before they can do anything else.
Currently, a service that fails to stop at shutdown time can block shutdown for up to 2 minutes. This is extremely frustrating for our users - someone goes to shutdown or reboot their system, and then unexpectedly has to wait for a long time before they can do anything else.
 
[https://pagure.io/fesco/issue/2853#comment-811075 The most common service to cause this issue is PackageKit], but there are others.


The most common service to cause this issue is PackageKit, but there are others.
When a service fails to shutdown when it is instructed to do so, it is not behaving properly, and it is preventing the system from behaving in an orderly and predictable manner. Desktop APIs exist for cases when services or apps legitimately need to prevent shutdown, and these allow the shutdown inhibit to be communicated to admins and users, so they understand what is happening. When the user decides to shut down anyway, services must terminate in a timely manner. The Workstation Working Group feels that 15 seconds is the maximum appropriate time for both system and user services, and that Fedora should be robust to buggy and misbehaving services that do not shut down in an appropriate manner. However, FESCo has requested that we start with a 45 second timeout instead of dropping immediately to 15 seconds.


The Workstation Working Group has been attempting to eliminate this bug for a number of years. Investigations have revealed that it's not possible to find and fix every misbehaving service: in some cases the misbehaviour comes from design flaws that cannot be reasonably fixed.
To facilitate debugging when a service fails to stop cleanly, we will use TimeoutStopFailureMode=abort to crash services that fail to stop in the time allotted. This will cause the service to crash with SIGABRT so that a core dump will be generated.


An attempt has also been [https://github.com/systemd/systemd/pull/18386 made to have the unit timeout changed in upstream systemd]. That attempt did not go anywhere, despite various efforts to move it along.
=== History ===
 
The Workstation Working Group has been [https://pagure.io/fedora-workstation/issue/163 working on this issue for several years]. Investigations have revealed that it's not possible to fix every misbehaving service: in some cases the misbehaviour comes from design flaws that are difficult to resolve.
 
An attempt has also been [https://github.com/systemd/systemd/pull/18386 made to have the unit timeout changed in upstream systemd]. That attempt did not go anywhere, despite various efforts to move it along. We are no longer comfortable waiting for upstream changes to land.
 
To our knowledge, there are no issues that will result from forcing services to stop after 45 seconds on typical systems. However, system administrators may need to configure a higher timeout if waiting longer for a particular service, which may be true for database services or virtual machine managers, for example. Sensitive services may disable the timeout altogether; Postgres and virt-manager already do this.


== Feedback ==
== Feedback ==
<!-- Summarize the feedback from the community and address why you chose not to accept proposed alternatives. This section is optional for all change proposals but is strongly suggested. Incorporating feedback here as it is raised gives FESCo a clearer view of your proposal and leaves a good record for the future. If you get no feedback, that is useful to note in this section as well. For innovative or possibly controversial ideas, consider collecting feedback before you file the change proposal. -->
<!-- Summarize the feedback from the community and address why you chose not to accept proposed alternatives. This section is optional for all change proposals but is strongly suggested. Incorporating feedback here as it is raised gives FESCo a clearer view of your proposal and leaves a good record for the future. If you get no feedback, that is useful to note in this section as well. For innovative or possibly controversial ideas, consider collecting feedback before you file the change proposal. -->


The Workstation Working Group has [https://pagure.io/fedora-workstation/issue/163 a ticket where they have been tracking and discussing this issue]. This change [https://pagure.io/fesco/issue/2853 was also previously proposed to FESCo], where there was some discussion.
* Fedora Server wishes to be cautious and use a longer shutdown timer, but this change proposal is implemented in a way that would affect all Fedora editions. We should find a way to allow different Fedora editions to have different defaults, perhaps by altering config files.
* The short shutdown timer [https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/M56AM4ZUNHCHIDGTF6HUBWQ7INQO74AZ/ might not be long enough for libvirt to shut down VMs]. Databases and virtual machines really must not be killed forcibly. Service files may already request longer timeouts, but would need to be modified to do so.
* The short shutdown timer could [https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/RR4XZFEMAKSEI3ES6D7FA2OMV7IRCK2J/ brick Pinephone modems]. This seems like a hardware bug rather than something that should affect Fedora's default behavior.
* This change proposal now incorporates use of TimeoutStopFailureMode=abort due to discussion feedback, to facilitate debugging of services that do not stop properly.


== Benefit to Fedora ==
== Benefit to Fedora ==
Line 89: Line 92:
-->
-->


The primary benefit of the change will be to eliminate a very annoying and - frankly - embarrassing bug. Our users shouldn't have to randomly sit waiting for their machine to shutdown.
The primary benefit of the change will be to mitigate a very annoying and - frankly - embarrassing bug. Our users shouldn't have to randomly sit waiting for their machine to shutdown. It will also encourage the correct use of shutdown inhibit APIs.


It will also encourage the correct use of shutdown inhibit APIs by services.
Although this change will "paper over" bugs in services without fixing them, we emphasize that reducing the timeout is not merely a workaround for buggy services, but also the desired permanent design. Of course it is desirable to fix the underlying bugs as well, but it doesn't make sense to require this before fixing the service timeout to match our needs.


== Scope ==
== Scope ==
* Proposal owners:
* Proposal owners:
<!-- What work do the feature owners have to accomplish to complete the feature in time for release?  Is it a large change affecting many parts of the distribution or is it a very isolated change? What are those changes?-->
<!-- What work do the feature owners have to accomplish to complete the feature in time for release?  Is it a large change affecting many parts of the distribution or is it a very isolated change? What are those changes?-->
** Merge [https://src.fedoraproject.org/rpms/systemd/pull-request/85 the downstream systemd PR] to reduce the unit timout
** Merge [https://src.fedoraproject.org/rpms/systemd/pull-request/85 pull request to shorten timeout to 45 s] to {{package|systemd}}.
** Merge [https://src.fedoraproject.org/rpms/systemd/pull-request/102 pull request to set TimeoutStopFailureMode=abort] to {{package|systemd}}.
* Other developers: <!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
* Other developers: <!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
<!-- What work do other developers have to accomplish to complete the feature in time for release?  Is it a large change affecting many parts of the distribution or is it a very isolated change? What are those changes?-->
<!-- What work do other developers have to accomplish to complete the feature in time for release?  Is it a large change affecting many parts of the distribution or is it a very isolated change? What are those changes?-->
 
** Test their packages with the new behavior and report issues as necessary.
* Release engineering: [https://pagure.io/releng/issues #Releng issue number] <!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
* Release engineering: [https://pagure.io/releng/issue/11193 #11193] <!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
<!-- Does this feature require coordination with release engineering (e.g. changes to installer image generation or update package delivery)?  Is a mass rebuild required?  include a link to the releng issue.  
<!-- Does this feature require coordination with release engineering (e.g. changes to installer image generation or update package delivery)?  Is a mass rebuild required?  include a link to the releng issue.  
The issue is required to be filed prior to feature submission, to ensure that someone is on board to do any process development work and testing and that all changes make it into the pipeline; a bullet point in a change is not sufficient communication -->
The issue is required to be filed prior to feature submission, to ensure that someone is on board to do any process development work and testing and that all changes make it into the pipeline; a bullet point in a change is not sufficient communication -->


* Policies and guidelines: N/A (not needed for this Change) <!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
* Policies and guidelines: No policy or guideline changes required <!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
<!-- Do the packaging guidelines or other documents need to be updated for this feature?  If so, does it need to happen before or after the implementation is done?  If a FPC ticket exists, add a link here. Please submit a pull request with the proposed changes before submitting your Change proposal. -->
<!-- Do the packaging guidelines or other documents need to be updated for this feature?  If so, does it need to happen before or after the implementation is done?  If a FPC ticket exists, add a link here. Please submit a pull request with the proposed changes before submitting your Change proposal. -->


Line 110: Line 114:
<!-- If your Change may require trademark approval (for example, if it is a new Spin), file a ticket ( https://pagure.io/Fedora-Council/tickets/issues ) requesting trademark approval from the Fedora Council. This approval will be done via the Council's consensus-based process. -->
<!-- If your Change may require trademark approval (for example, if it is a new Spin), file a ticket ( https://pagure.io/Fedora-Council/tickets/issues ) requesting trademark approval from the Fedora Council. This approval will be done via the Council's consensus-based process. -->


* Alignment with Objectives:  
* Alignment with Objectives: N/A (not needed for this Change)
<!-- Does your proposal align with the current Fedora Objectives: https://docs.fedoraproject.org/en-US/project/objectives/ ? It's okay if it doesn't, but it's something to consider -->
<!-- Does your proposal align with the current Fedora Objectives: https://docs.fedoraproject.org/en-US/project/objectives/ ? It's okay if it doesn't, but it's something to consider -->


Line 118: Line 122:
<!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
<!-- REQUIRED FOR SYSTEM WIDE CHANGES -->


System and user services will be killed with SIGABRT 45 seconds after receiving SIGTERM. Previously they would receive SIGKILL 1 minute 30 seconds for most system and user services, or 2 minutes for user manager system services (the system service that runs all user services for a user). Services will have less time to shut down gracefully by default, and a core dump will be generated to facilitate bug reporting. These defaults are configurable and system administrators who require longer timeouts would need to adjust them before or after upgrade. You may edit the DefaultTimeoutStopSec= setting in /etc/systemd/user.conf and /etc/systemd/system.conf. You may also create a drop-in to change the TimeoutStopSec= setting for user@service.


== How To Test ==
== How To Test ==
Line 150: Line 155:
-->
-->


This change will make the Fedora user experience less annoying.
This change will make the Fedora user experience less annoying. It will also encourage the use of the existing inhibit APIs, which provide better feedback for users when system shutdown does need to be delayed.


== Dependencies ==
== Dependencies ==
Line 164: Line 169:
* Contingency mechanism: the change owners will revert the change in systemd.
* Contingency mechanism: the change owners will revert the change in systemd.
<!-- When is the last time the contingency mechanism can be put in place?  This will typically be the beta freeze. -->
<!-- When is the last time the contingency mechanism can be put in place?  This will typically be the beta freeze. -->
* Contingency deadline: if we back out the change it would be best to do it before beta freeze.
* Contingency deadline: if we back out the change it would be best to do it before beta freeze, but this can happen at any point.
<!-- Does finishing this feature block the release, or can we ship with the feature in incomplete state? -->
<!-- Does finishing this feature block the release, or can we ship with the feature in incomplete state? -->
* Blocks release? No.
* Blocks release? No.

Latest revision as of 23:09, 22 February 2023

Shorter Shutdown Timer

Summary

A downstream configuration change to reduce the systemd unit stop timeout from as much as 2 minutes to 45 seconds, and to send SIGABRT on timeout so that a core dump is generated before SIGKILL.

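Owner

  • Name: Michael Catanzaro, Allan Day, Zbigniew Jędrzejewski-Szmek
  • Email: mcatanzaro at redhat dot com, aday at redhat dot com, zbyszek at in dot waw dot pl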


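Current status

  • devel thread: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/N6TW6EBPG5Q6D6SEV4FFGQH3E6HUTVE2/
  • FESCo issue: #2928 (https://pagure.io/fesco/issue/2928)
  • Tracker bug: #2161753 (https://bugzilla.redhat.com/show_bug.cgi?id=2161753)
  • Release notes tracker: #954 (https://pagure.io/fedora-docs/release-notes/issue/954)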

Detailed Description

Currently, a service that fails to stop at shutdown time can block shutdown for up to 2 minutes. This is extremely frustrating for our users: someone goes to shut down or reboot their system, and then unexpectedly has to wait a long time before they can do anything else.

The most common service to cause this issue is PackageKit, but there are others.

When a service fails to shut down when instructed to do so, it is misbehaving, and it prevents the system from behaving in an orderly and predictable manner. Desktop APIs exist for cases when services or apps legitimately need to prevent shutdown, and these allow the shutdown inhibit to be communicated to admins and users, so they understand what is happening. When the user decides to shut down anyway, services must terminate in a timely manner. The Workstation Working Group feels that 15 seconds is the maximum appropriate time for both system and user services, and that Fedora should be robust to buggy and misbehaving services that do not shut down in an appropriate manner. However, FESCo has requested that we start with a 45-second timeout instead of dropping immediately to 15 seconds.

To facilitate debugging when a service fails to stop cleanly, we will set TimeoutStopFailureMode=abort. A service that fails to stop in the time allotted will then be terminated with SIGABRT, so that a core dump is generated.
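
As a rough illustration of what these two settings mean for a single unit, the fragment below sketches a per-unit drop-in. The unit name example.service and the file name are hypothetical. TimeoutStopSec= and TimeoutStopFailureMode= are documented systemd.service options, but this sketch is not necessarily how the Fedora systemd package applies the new defaults.

```ini
# /etc/systemd/system/example.service.d/50-stop-timeout.conf
# (hypothetical unit and file name, for illustration only)
[Service]
# Wait up to 45 seconds after the stop request before giving up.
TimeoutStopSec=45s
# On timeout, send SIGABRT (which normally produces a core dump)
# instead of going straight to SIGKILL.
TimeoutStopFailureMode=abort
```

A unit that should not be aborted could instead set TimeoutStopFailureMode=terminate in its own drop-in, which is the upstream systemd default. Drop-ins take effect after systemctl daemon-reload.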

History

The Workstation Working Group has been working on this issue for several years. Investigations have revealed that it's not possible to fix every misbehaving service: in some cases the misbehaviour comes from design flaws that are difficult to resolve.

An attempt has also been made to have the unit timeout changed in upstream systemd. That attempt did not go anywhere, despite various efforts to move it along. We are no longer comfortable waiting for upstream changes to land.

To our knowledge, there are no issues that will result from forcing services to stop after 45 seconds on typical systems. However, system administrators may need to configure a higher timeout if a particular service needs longer to stop, which may be the case for database services or virtual machine managers, for example. Sensitive services may disable the timeout altogether; Postgres and virt-manager already do this.
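
A minimal sketch of how a sensitive service might opt out of the stop timeout, assuming a hypothetical unit called example-db.service (the unit and file names are illustrative and not taken from any real package):

```ini
# /etc/systemd/system/example-db.service.d/50-no-stop-timeout.conf
# (hypothetical unit and file name, for illustration only)
[Service]
# "infinity" disables the stop timeout, so systemd waits for the
# service to finish its own shutdown however long it takes.
TimeoutStopSec=infinity
```

A finite value such as TimeoutStopSec=5min could be used instead where an upper bound is still wanted.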

Feedback

  • Fedora Server wishes to be cautious and use a longer shutdown timer, but this change proposal is implemented in a way that would affect all Fedora editions. We should find a way to allow different Fedora editions to have different defaults, perhaps by altering config files.
  • The short shutdown timer might not be long enough for libvirt to shut down VMs. Databases and virtual machines really must not be killed forcibly. Service files can already request longer timeouts, but those that do not would need to be modified to do so.
  • The short shutdown timer could brick Pinephone modems. This seems like a hardware bug rather than something that should affect Fedora's default behavior.
  • This change proposal now incorporates use of TimeoutStopFailureMode=abort due to discussion feedback, to facilitate debugging of services that do not stop properly.

Benefit to Fedora

The primary benefit of the change will be to mitigate a very annoying and, frankly, embarrassing bug. Our users shouldn't have to randomly sit waiting for their machine to shut down. It will also encourage the correct use of shutdown inhibit APIs.

Although this change will "paper over" bugs in services without fixing them, we emphasize that reducing the timeout is not merely a workaround for buggy services, but also the desired permanent design. Of course it is desirable to fix the underlying bugs as well, but it doesn't make sense to require this before fixing the service timeout to match our needs.

Scope

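  • Proposal owners:
      • Merge the pull request to shorten the timeout to 45 s (https://src.fedoraproject.org/rpms/systemd/pull-request/85) into systemd.
      • Merge the pull request to set TimeoutStopFailureMode=abort (https://src.fedoraproject.org/rpms/systemd/pull-request/102) into systemd.
  • Other developers:
      • Test their packages with the new behavior and report issues as necessary.
  • Release engineering: #11193 (https://pagure.io/releng/issue/11193)
  • Policies and guidelines: No policy or guideline changes required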
  • Trademark approval: N/A (not needed for this Change)
  • Alignment with Objectives: N/A (not needed for this Change)

Upgrade/compatibility impact

System and user services will be killed with SIGABRT 45 seconds after receiving SIGTERM. Previously, they would receive SIGKILL after 1 minute 30 seconds for most system and user services, or after 2 minutes for user manager system services (the system service that runs all user services for a user). Services will have less time to shut down gracefully by default, and a core dump will be generated to facilitate bug reporting. These defaults are configurable, and system administrators who require longer timeouts will need to adjust them before or after upgrade. You may edit the DefaultTimeoutStopSec= setting in /etc/systemd/system.conf and /etc/systemd/user.conf. You may also create a drop-in to change the TimeoutStopSec= setting for user@.service.
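
The adjustments described above could look roughly like the following. This is a sketch only: the drop-in file names and the timeout values are arbitrary assumptions, and drop-in directories are used here instead of editing system.conf and user.conf directly.

```ini
# /etc/systemd/system.conf.d/50-stop-timeout.conf  (hypothetical file name)
# Default stop timeout for system services.
[Manager]
DefaultTimeoutStopSec=90s

# /etc/systemd/user.conf.d/50-stop-timeout.conf  (hypothetical file name)
# Default stop timeout for user services.
[Manager]
DefaultTimeoutStopSec=90s

# /etc/systemd/system/user@.service.d/50-stop-timeout.conf  (hypothetical file name)
# Stop timeout for the per-user service manager itself.
[Service]
TimeoutStopSec=120s
```

Each commented path marks a separate file. Changes to the manager defaults apply after the respective service manager is reloaded or the system is rebooted.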

How To Test

Given the intermittent and unpredictable nature of the bug that is being targeted, the best way to test is by using the upcoming Fedora release. Are shutdown delays eliminated as intended? Do system services experience issues as a result of the change?
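
For a more targeted check, one option is a throwaway unit that deliberately ignores SIGTERM, so that a stop request can only complete via the timeout. The unit below is a hypothetical test fixture, not part of the change itself, and the unit name is made up for illustration.

```ini
# /etc/systemd/system/ignore-sigterm-test.service  (hypothetical test unit)
[Unit]
Description=Test service that ignores SIGTERM

[Service]
# The shell ignores SIGTERM, and the sleep it starts inherits that
# ignored disposition, so the stop request alone never ends the service.
ExecStart=/bin/sh -c 'trap "" TERM; while :; do sleep 1; done'
```

After systemctl daemon-reload and systemctl start ignore-sigterm-test.service, running systemctl stop ignore-sigterm-test.service should return after roughly the configured timeout rather than hanging for minutes, and coredumpctl list can be used to check whether a core dump was recorded for the aborted process.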

User Experience

This change will make the Fedora user experience less annoying. It will also encourage the use of the existing inhibit APIs, which provide better feedback for users when system shutdown does need to be delayed.

Dependencies

No specific changes are required in other packages. However, service developers may want to take this opportunity to examine the shutdown behavior of their components.

Contingency Plan

  • Contingency mechanism: the change owners will revert the change in systemd.
  • Contingency deadline: if we back out the change it would be best to do it before beta freeze, but this can happen at any point.
  • Blocks release? No.

Documentation

Documentation isn't required for this minor configuration change. Services that legitimately need to prevent system shutdown should use systemd inhibitor locks. Desktop applications can use the XDG inhibit portal.

Release Notes