From Fedora Project Wiki
(→‎Benefit to Fedora: mention Ubuntu too)
Line 77: Line 77:
The developers who are interested in this kind of optimization work can perform it within Fedora, without having to build separate repositories. The users who have the appropriate hardware will gain performance benefits. Faster code is also more energy-efficient. The change will be automatic and transparent to users.
The developers who are interested in this kind of optimization work can perform it within Fedora, without having to build separate repositories. The users who have the appropriate hardware will gain performance benefits. Faster code is also more energy-efficient. The change will be automatic and transparent to users.


Note that other distributions use higher microarchitecture levels. For example RHEL 9 uses x86-64-v2 as the baseline, RHEL 10 will use x86-64-v3, and other distros provide optimized variants (OpenSUSE, Arch Linux). We implement the same change in Fedora in a way that is scoped more narrowly, but should provide the same performance and energy benefits.
Note that other distributions use higher microarchitecture levels. For example RHEL 9 uses x86-64-v2 as the baseline, RHEL 10 will use x86-64-v3, and other distros provide optimized variants ([https://en.opensuse.org/openSUSE:X86-64-Architecture-Levels OpenSUSE], Arch Linux, [https://ubuntu.com/blog/optimising-ubuntu-performance-on-amd64-architecture Ubuntu]). We implement the same change in Fedora in a way that is scoped more narrowly, but should provide the same performance and energy benefits.


== Scope ==
== Scope ==

Revision as of 21:17, 28 December 2023

Optimized Binaries for the AMD64 / x86_64 Architecture

Important.png
This is a proposed Change for Fedora Linux.
This document represents a proposed Change. As part of the Changes process, proposals are publicly announced in order to receive community feedback. This proposal will only be implemented if approved by the Fedora Engineering Steering Committee.

Summary

Additional paths will be inserted into the search path used for executables on systems which have a compatible CPU. Those additional paths will mirror the AMD64 / x86_64 "microarchitecture levels" supported by the glibc-hwcaps mechanism: x86-64-v2, x86-64-v3, x86_64-v4. Systemd will be modified to insert the additional directories into the $PATH environment variable (affecting all programs on the system) and the equivalent internal mechanism in systemd (affecting what executables are used by services). Individual packages can provide optimized libraries via the glibc-hwcaps mechanism and optimized executables via the extended search path. This optimized code will be used if the CPU supports it. Which packages provide the optimized code and at which level will be made by individual package maintainers based on benchmark results.

Owner

NOTE: I'm writing and filling this proposal on the last day allowed for system-wide proposals. It is too large for one person. If you are interested, please let me know or even add yourself to the list of Owners. I would love to have more people working on this.

Current status

  • Targeted release: Fedora Linux 40
  • Last updated: 2023-12-28
  • Announced
  • Discussion thread
  • FESCo issue: <will be assigned by the Wrangler>
  • Tracker bug: <will be assigned by the Wrangler>
  • Release notes tracker: <will be assigned by the Wrangler>

Detailed Description

Fedora binaries for the AMD64 / x86_64 architecture are compiled with code-generation flags that support almost all CPU variants. But newer generations of processors gained additional instructions that may be used to generate faster code. A vendor-independent x86-64 psABI supplement defines four "microachitecture levels": x86-64-v1 (the baseline, our code targets this), x86-64-v2 (+SSE3, CentoOS targets this), x86-64-v3 (+AVX), x86-64-v4 (+AVX512) [1]. When code is compiled for a higher microarchitecture level it will crash (with SIGILL, "illegal instruction") on CPUs which do not support it. Benchmark results show small differences in performance: usually in the range from -5% to 10%, with no discernible difference for most code, but some applications benefit, with gains of 120% in some benchmarks [e.g. 2, 4].

Over the years, various people have expressed interest in raising the required microarchitecture levels. But we have been very conservative in making changes, because support is missing in many older CPUs that are still in use, and in fact, even in some CPUs produced and sold today. By raising the required level we would make Fedora completely unusable on many machines. It also seems that recompiling all packages with the changed options would largely be a waste of resources, because for most code it makes no difference. But for some of the numerical or cryptographic code there are noticeable gains and it seems to be worth the effort to provide optimized code. This also makes Fedora more attractive to people interested in optimization.

The dynamic linker already has the glibc-hwcaps mechanism to load optimized implementations of shared objects [3]. This means that packages can provide optimized libraries and they linker will be automatically load them from separate directories if appropriate. (For AMD64, this is /usr/lib64/glibc-hwcaps/x86-64-v{2,3,4}/.)

To extend the glibc-hwcaps mechanism to executables, systemd will be modified to extend the search path with appropriate directories. When started, it will check the CPU capabilities and modify the executable search path it has internally and which is also used to set $PATH for services. (For AMD64, /usr/bin/glibc-hwcaps/x86-64-v{2,3,4}/.)

Note: the ELF format provides the IFUNC mechanism to dynamically select a variant of a function (symbol) when an executable is loaded [5]. This is in particular used to load code using specific CPU instructions when those are supported. This mechanism is both more general (because it allows arbitrary selection criteria), more fine-grained (because there can be other variants than just a few fixed microarchitecture levels), and more efficient (becuase only the parts of the code that benefit from this need to be provided in multiple variants). In particular, glibc already makes extensive use of this to provide optimized code, which is then widely used by other libraries and programs. This means that even though we compile code in a way where the lowest baseline is supported, modern CPU instructions are already widely used. This is one of the reasons why compiling for a higher baseline often doesn't make any difference in benchmarks. The IFUNC mechanism or an equivalent mechanism should generally be preferred. Nevertheless, that needs to be implented in the program or library itself, which is not trivial. The two mechanisms in this Proposal are intended for the packages which do not support IFUNCs or some other equivalent mechanism.

[1] https://hackweek.opensuse.org/all/projects/support-glibc-hwcaps-and-micro-architecture-package-generation
[2] https://gitlab.archlinux.org/archlinux/rfcs/-/blob/master/rfcs/0002-march.rst
[3] https://sourceware.org/pipermail/libc-alpha/2021-February/122207.html
[4] https://blog.centos.org/2023/08/centos-isa-sig-performance-investigation/
[5] https://jasoncc.github.io/gnu_gcc_glibc/gnu-ifunc.html

Glibc-hwcaps together with the new feature in systemd provide a generic mechanism. It will be up to individual packages to actually provide code which makes use of it. Individual package maintainers are encouraged to benchmark their packages after recompilation, and provide the optimized variants if useful. (I.e. the code in question is measureably faster and the program is ran often enough for this to make a difference.)

The Change Owners will implement the packaging changes for a few packages while developing the general mechanism and will submit those as pull requests. Other maintainers are asked to do the same for their packages.

Optimized variants of programs and libraries MAY be packaged in a separate subpackage. The general packaging rules should be applied, i.e. a separate package or packages SHOULD be created if it is files are large enough.

Available benchmark results [2,4] are narrow and not very convincing. We should plan an evaluation of results after one release. If it turns out that the real gains are too small, we can scrap the effort. On the other hand, we should also consider other architectures. For example, microarchitecture levels z{14,15} for s390x or power{9,10} for ppc64le. Other architectures are not included in this Change Proposal to reduce its scope.

Feedback

Benefit to Fedora

The developers who are interested in this kind of optimization work can perform it within Fedora, without having to build separate repositories. The users who have the appropriate hardware will gain performance benefits. Faster code is also more energy-efficient. The change will be automatic and transparent to users.

Note that other distributions use higher microarchitecture levels. For example RHEL 9 uses x86-64-v2 as the baseline, RHEL 10 will use x86-64-v3, and other distros provide optimized variants (OpenSUSE, Arch Linux, Ubuntu). We implement the same change in Fedora in a way that is scoped more narrowly, but should provide the same performance and energy benefits.

Scope

  • Proposal owners:
    • Extend systemd to set the executable search path using the same criteria as the dynamic linker.
    • Implement packaging changes for at least one package with a library and at least one package with executables and submit this as pull requests.
    • Provide a pull request for the Packaging Guidelines to describe the changes listed in Description above.
  • Other developers:
    • Do benchmarking and implement packaging changes for other packages if beneficial.
  • Policies and guidelines: TBD.
  • Trademark approval: N/A (not needed for this Change)
  • Alignment with Community Initiatives:

Upgrade/compatibility impact

No impact.

How To Test

  • Use /usr/bin/ld.so --help to check which hwcaps are supported by the system.
  • Install one or more packages which provide optimized code.
  • Restart the system or re-login to reinitialize $PATH.
  • Check that appropriate directories are present in $PATH.
  • Run some benchmarks and check that the optimized code is indeed faster.


User Experience

There should be no impact for users. If they optimized code is available and installed for their hardware, various tasks may finish faster and use less energy.


Dependencies

Contingency Plan

  • Contingency mechanism: Undo the changes in packages which introduced them and recompile.
  • Contingency deadline: Any time.
  • Blocks release? No.

Documentation

Release Notes

Packages which benefit from being compiled for higher AMD64 microarchitecture levels (x86-64-v2, x86-64-v3, x86_64-v4) are now provided with optimized variants which will be used automatically on appropriate CPUs. This includes: TBD1, TBD2, TBD3.