From Fedora Project Wiki
(Copy of Changes/PythonStaticSpeedup)
 
No edit summary
 
(29 intermediate revisions by 5 users not shown)
Line 3: Line 3:
= Build Python with -fno-semantic-interposition for better performance =


 
{{admon/important|Simplified version of another change proposal|This change was originally proposed for [[Releases/32|Fedora 32]] as [[Changes/PythonStaticSpeedup]], however based on [https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/NWPVQSKVWDKA75PDEIJNJIFL5C5SJXB2/ community feedback], it has been significantly reduced.}}


== Summary ==
Python 3 in Fedora has traditionally been built with a shared library, libpython3.?.so, and the final binary was dynamically linked against it. This change creates a static library and links the final python3 binary against it, which provides a significant performance improvement of up to 27%, depending on the workload. The static library will not be shipped. The shared library will continue to exist in a separate subpackage. In essence, python3 will no longer depend on libpython.
We add the <code>-fno-semantic-interposition</code> compiler flag when building Python interpreters, as it provides a significant performance improvement of up to 27%, depending on the workload. Users will no longer be able to use LD_PRELOAD to override a symbol from libpython, which we consider a good trade-off for the speedup.


== Owner ==
Line 18: Line 18:
* Responsible WG:
-->
-->
* Shout-out: [[User:Jankratochvil|Jan Kratochvíl]] for first suggesting this instead of the original proposal, followed by [[User:Kkofler|Kevin Kofler]]. [[User:Fweimer|Florian Weimer]] for providing answers to our questions. David Gray for originally suggesting to link Python statically to gain performance.


== Current status ==
Line 29: Line 30:
CLOSED as NEXTRELEASE -> change is completed and verified and will be delivered in next release under development
-->
-->
* Tracker bug: <will be assigned by the Wrangler>
* Tracker bug: [https://bugzilla.redhat.com/show_bug.cgi?id=1779341 #1779341]
* Release notes tracker: <will be assigned by the Wrangler>
* Release notes tracker: [https://pagure.io/fedora-docs/release-notes/issue/421 #421]


== Detailed Description ==


When we compile the python3 package on Fedora (prior to this change), we create the libpython3.?.so shared library, and the final python3 binary (<code>/usr/bin/python3</code>) is dynamically linked against it. However, by building the libpython3.?.a static library and statically linking the final binary against it, we can achieve a performance gain of 5% to 27%, depending on the workload. Link-time optimizations and profile-guided optimizations also have a greater impact when python3 is linked statically.
When we build the Python interpreter with the <code>-fno-semantic-interposition</code> compiler/linker flag, we can achieve a performance gain of 5% to 27%, depending on the workload. Link-time optimizations and profile-guided optimizations also have a greater impact when python3 is built this way.
 
For a vague-linkage function definition, a call site in the same translation unit may inline the callee. Whether -fno-semantic-interposition is enabled has no effect.
 
For a non-vague-linkage function definition, by default (-fsemantic-interposition) the -fpic mode does not allow a call site in the same translation unit to inline the callee or perform other interprocedural optimizations. -fno-semantic-interposition re-enables interprocedural optimizations.
 
If a caller inlines a callee, using LD_PRELOAD to interpose the callee will not affect the caller, but many other uses of LD_PRELOAD still work. We consider this small LD_PRELOAD limitation a good trade-off for the speedup.


Since Python 3.8, [https://docs.python.org/3.8/whatsnew/3.8.html#debug-build-uses-the-same-abi-as-release-build C extensions must no longer be linked to libpython by default]. Applications embedding Python now need to utilize the --embed flag for python3-config to be linked to libpython. During the [[Changes/Python3.8|Python 3.8 upgrade and rebuilds]] we've uncovered various cases of packages linking to libpython implicitly through various hacks within their buildsystems and fixed as many as possible. However, there are legitimate reasons to link an application to libpython and for those cases libpython should be provided so applications that embed Python can continue to do so.
Interposition is enabled by default in compilers like GCC: function calls to a library go through a "Procedure Linkage Table" (PLT). This indirection is required to allow a library loaded via the LD_PRELOAD environment variable to override a function. The indirection puts more pressure on the CPU level 1 cache (instruction cache). In terms of performance, the main drawback is that function calls from a library to itself cannot be inlined, in order to respect the interposition semantics. Inlining is usually a big win in terms of performance.


This mirrors the Debian/Ubuntu way of building Python, where they offer a statically linked binary and an additional libpython subpackage. The libpython subpackage will be created and python3-devel will depend on it, so packages that embed Python will keep working.
Disabling interposition for libpython removes the overhead on function calls by avoiding the PLT indirection, and allows more function calls to be inlined. This concerns function calls from libpython to libpython, which are very common in Python: almost all function calls are calls from libpython to libpython.


The change was first done in Debian and Ubuntu years ago, followed by Python 3.8. manylinux1 and manylinux2010 ABI don't link C extensions to libpython either (to support Debian/Ubuntu).
If Fedora users need to use LD_PRELOAD to override symbols in libpython, the recommended way is to build a custom Python without <code>-fno-semantic-interposition</code>.


By applying this change, libpython's namespace will be separated from Python's, so '''C extensions that are still linked to libpython''' might experience side effects or break.
It is still possible to use LD_PRELOAD to override symbols in other libraries (for example in glibc).


There is one exception for C extensions. If an application is linked to libpython in order to embed Python, C extensions used only within this application can continue to be linked to libpython.
=== Affected Pythons ===


Currently there is no upstream option to build the static library alongside the shared one and statically link the final binary to it, so we have to rely on a downstream patch to achieve it. We plan to work with upstream to incorporate the changes there as well.
Primarily, we will change the interpreter in the {{package|python3}} package, that is, Python 3.8 in Fedora 32 and any later version of Python in future Fedora releases.


Before the change, python3.8 is dynamically linked to libpython3.8:
Impact on other Python packages (and generally software using Python) is not anticipated (other than the possible speedup).


We will also change the [https://developer.fedoraproject.org/tech/languages/python/multiple-pythons.html alternate Python interpreters] where possible and useful, primarily the upstream supported versions of CPython, such as {{package|python39}} (if already packaged), {{package|python37}} and {{package|python36}}.
<pre>
+-------------------+
|                   |
|                   |         +--------------------+
|  libpython3.8.so  <---------+ /usr/bin/python3.8 |
|                   |         +--------------------+
|                   |
+-------------------+
</pre>


After the change, python3.8 is statically linked to libpython3.8:
=== Affected Fedora releases ===


This is a Fedora 32 change and it will be implemented in Rawhide (Fedora 32) only. Any future versions of Fedora will inherit the change until it is reverted for some reason.
<pre>
                              +-----------------------+
                              |                       |
                              |  /usr/bin/python3.8   |
                              |                       |
+-------------------+         | +-------------------+ |
|                   |         | |                   | |
|                   |         | |                   | |
|  libpython3.8.so  |         | |  libpython3.8.a   | |
|                   |         | |                   | |
|                   |         | |                   | |
+-------------------+         | +-------------------+ |
                              +-----------------------+
</pre>


As a negative side effect, when both libpython3.8.so and /usr/bin/python3.8 are installed, the filesystem footprint will be slightly increased (libpython3.8.so on Python 3.8.0, x86_64 is ~3.4M). On the other hand, only a very small number of packages will depend on libpython3.8.so.
If it turns out that there are absolutely no issues, we might consider backporting the speedup to already released Fedora versions (for example Fedora 31). Such action would be separately coordinated with [https://docs.fedoraproject.org/en-US/fesco/ FESCo].


== Benefit to Fedora ==
Line 114: Line 100:


<pre>
+-------------------------+------------------+------------------------------+
| Benchmark               | python38-3.8.0-1 | python38-3.8.0-666           |
+=========================+==================+==============================+
| nbody                   | 238 ms           | 174 ms: 1.36x faster (-27%)  |
+-------------------------+------------------+------------------------------+
| raytrace                | 919 ms           | 686 ms: 1.34x faster (-25%)  |
+-------------------------+------------------+------------------------------+
| scimark_lu              | 285 ms           | 215 ms: 1.33x faster (-25%)  |
+-------------------------+------------------+------------------------------+
| scimark_sparse_mat_mult | 8.20 ms          | 6.20 ms: 1.32x faster (-24%) |
+-------------------------+------------------+------------------------------+
| django_template         | 204 ms           | 156 ms: 1.31x faster (-24%)  |
+-------------------------+------------------+------------------------------+
| chaos                   | 203 ms           | 156 ms: 1.30x faster (-23%)  |
+-------------------------+------------------+------------------------------+
| logging_simple          | 15.6 us          | 12.2 us: 1.28x faster (-22%) |
+-------------------------+------------------+------------------------------+
| richards                | 124 ms           | 97.0 ms: 1.28x faster (-22%) |
+-------------------------+------------------+------------------------------+
| scimark_fft             | 652 ms           | 511 ms: 1.27x faster (-22%)  |
+-------------------------+------------------+------------------------------+
| hexiom                  | 17.4 ms          | 13.8 ms: 1.27x faster (-21%) |
+-------------------------+------------------+------------------------------+
| logging_format          | 17.1 us          | 13.5 us: 1.27x faster (-21%) |
+-------------------------+------------------+------------------------------+
| nqueens                 | 174 ms           | 137 ms: 1.26x faster (-21%)  |
+-------------------------+------------------+------------------------------+
| crypto_pyaes            | 201 ms           | 160 ms: 1.26x faster (-20%)  |
+-------------------------+------------------+------------------------------+
| deltablue               | 12.6 ms          | 10.0 ms: 1.25x faster (-20%) |
+-------------------------+------------------+------------------------------+
| unpickle_pure_python    | 576 us           | 463 us: 1.24x faster (-20%)  |
+-------------------------+------------------+------------------------------+
| pickle_pure_python      | 799 us           | 644 us: 1.24x faster (-19%)  |
+-------------------------+------------------+------------------------------+
| go                      | 449 ms           | 362 ms: 1.24x faster (-19%)  |
+-------------------------+------------------+------------------------------+
| spectral_norm           | 247 ms           | 200 ms: 1.24x faster (-19%)  |
+-------------------------+------------------+------------------------------+
| scimark_monte_carlo     | 185 ms           | 151 ms: 1.23x faster (-19%)  |
+-------------------------+------------------+------------------------------+
| logging_silent          | 340 ns           | 276 ns: 1.23x faster (-19%)  |
+-------------------------+------------------+------------------------------+
| unpickle                | 23.3 us          | 19.1 us: 1.22x faster (-18%) |
+-------------------------+------------------+------------------------------+
| float                   | 200 ms           | 166 ms: 1.21x faster (-17%)  |
+-------------------------+------------------+------------------------------+
| mako                    | 26.6 ms          | 22.0 ms: 1.21x faster (-17%) |
+-------------------------+------------------+------------------------------+
| xml_etree_generate      | 159 ms           | 133 ms: 1.20x faster (-17%)  |
+-------------------------+------------------+------------------------------+
| xml_etree_process       | 128 ms           | 107 ms: 1.20x faster (-16%)  |
+-------------------------+------------------+------------------------------+
| fannkuch                | 795 ms           | 670 ms: 1.19x faster (-16%)  |
+-------------------------+------------------+------------------------------+
| chameleon               | 15.7 ms          | 13.3 ms: 1.18x faster (-15%) |
+-------------------------+------------------+------------------------------+
| scimark_sor             | 347 ms           | 294 ms: 1.18x faster (-15%)  |
+-------------------------+------------------+------------------------------+
| pathlib                 | 35.7 ms          | 30.2 ms: 1.18x faster (-15%) |
+-------------------------+------------------+------------------------------+
| regex_compile           | 301 ms           | 255 ms: 1.18x faster (-15%)  |
+-------------------------+------------------+------------------------------+
| genshi_text             | 48.3 ms          | 41.2 ms: 1.17x faster (-15%) |
+-------------------------+------------------+------------------------------+
| sympy_str               | 459 ms           | 394 ms: 1.17x faster (-14%)  |
+-------------------------+------------------+------------------------------+
| genshi_xml              | 102 ms           | 87.6 ms: 1.16x faster (-14%) |
+-------------------------+------------------+------------------------------+
| 2to3                    | 540 ms           | 465 ms: 1.16x faster (-14%)  |
+-------------------------+------------------+------------------------------+
| sqlite_synth            | 4.89 us          | 4.25 us: 1.15x faster (-13%) |
+-------------------------+------------------+------------------------------+
| sympy_expand            | 704 ms           | 613 ms: 1.15x faster (-13%)  |
+-------------------------+------------------+------------------------------+
| html5lib                | 162 ms           | 141 ms: 1.15x faster (-13%)  |
+-------------------------+------------------+------------------------------+
| sympy_integrate         | 34.2 ms          | 30.0 ms: 1.14x faster (-12%) |
+-------------------------+------------------+------------------------------+
| dulwich_log             | 121 ms           | 107 ms: 1.13x faster (-11%)  |
+-------------------------+------------------+------------------------------+
| sympy_sum               | 286 ms           | 253 ms: 1.13x faster (-11%)  |
+-------------------------+------------------+------------------------------+
| xml_etree_iterparse     | 170 ms           | 152 ms: 1.12x faster (-11%)  |
+-------------------------+------------------+------------------------------+
| telco                   | 10.2 ms          | 9.14 ms: 1.11x faster (-10%) |
+-------------------------+------------------+------------------------------+
| meteor_contest          | 171 ms           | 154 ms: 1.11x faster (-10%)  |
+-------------------------+------------------+------------------------------+
| json_dumps              | 20.0 ms          | 18.0 ms: 1.11x faster (-10%) |
+-------------------------+------------------+------------------------------+
| tornado_http            | 425 ms           | 384 ms: 1.11x faster (-10%)  |
+-------------------------+------------------+------------------------------+
| xml_etree_parse         | 249 ms           | 226 ms: 1.10x faster (-9%)   |
+-------------------------+------------------+------------------------------+
| sqlalchemy_imperative   | 53.4 ms          | 49.6 ms: 1.08x faster (-7%)  |
+-------------------------+------------------+------------------------------+
| python_startup          | 13.7 ms          | 12.7 ms: 1.07x faster (-7%)  |
+-------------------------+------------------+------------------------------+
| json_loads              | 43.3 us          | 40.7 us: 1.06x faster (-6%)  |
+-------------------------+------------------+------------------------------+
| python_startup_no_site  | 9.29 ms          | 8.75 ms: 1.06x faster (-6%)  |
+-------------------------+------------------+------------------------------+
| pickle_dict             | 33.8 us          | 32.0 us: 1.06x faster (-5%)  |
+-------------------------+------------------+------------------------------+
| sqlalchemy_declarative  | 272 ms           | 258 ms: 1.05x faster (-5%)   |
+-------------------------+------------------+------------------------------+
</pre>
<pre>
+-------------------------+------------------------------+------------------------------+
| Benchmark               | python38-3.8.0-1 (original)  | python38-3.8.0-2 (changed)   |
+=========================+==============================+==============================+
| scimark_lu              | 294 ms                       | 213 ms: 1.38x faster (-27%)  |
+-------------------------+------------------------------+------------------------------+
| scimark_sparse_mat_mult | 8.61 ms                      | 6.39 ms: 1.35x faster (-26%) |
+-------------------------+------------------------------+------------------------------+
| nbody                   | 236 ms                       | 179 ms: 1.32x faster (-24%)  |
+-------------------------+------------------------------+------------------------------+
| django_template         | 203 ms                       | 158 ms: 1.29x faster (-22%)  |
+-------------------------+------------------------------+------------------------------+
| raytrace                | 910 ms                       | 709 ms: 1.28x faster (-22%)  |
+-------------------------+------------------------------+------------------------------+
| logging_format          | 17.7 us                      | 13.8 us: 1.28x faster (-22%) |
+-------------------------+------------------------------+------------------------------+
| richards                | 124 ms                       | 97.2 ms: 1.27x faster (-21%) |
+-------------------------+------------------------------+------------------------------+
| unpickle                | 23.9 us                      | 18.8 us: 1.27x faster (-21%) |
+-------------------------+------------------------------+------------------------------+
| chaos                   | 200 ms                       | 158 ms: 1.26x faster (-21%)  |
+-------------------------+------------------------------+------------------------------+
| hexiom                  | 17.6 ms                      | 14.0 ms: 1.26x faster (-21%) |
+-------------------------+------------------------------+------------------------------+
| logging_simple          | 15.8 us                      | 12.5 us: 1.26x faster (-21%) |
+-------------------------+------------------------------+------------------------------+
| nqueens                 | 179 ms                       | 142 ms: 1.26x faster (-20%)  |
+-------------------------+------------------------------+------------------------------+
| logging_silent          | 340 ns                       | 273 ns: 1.25x faster (-20%)  |
+-------------------------+------------------------------+------------------------------+
| crypto_pyaes            | 201 ms                       | 162 ms: 1.24x faster (-19%)  |
+-------------------------+------------------------------+------------------------------+
| scimark_fft             | 653 ms                       | 527 ms: 1.24x faster (-19%)  |
+-------------------------+------------------------------+------------------------------+
| scimark_monte_carlo     | 190 ms                       | 154 ms: 1.24x faster (-19%)  |
+-------------------------+------------------------------+------------------------------+
| pickle_pure_python      | 795 us                       | 646 us: 1.23x faster (-19%)  |
+-------------------------+------------------------------+------------------------------+
| go                      | 443 ms                       | 361 ms: 1.23x faster (-18%)  |
+-------------------------+------------------------------+------------------------------+
| deltablue               | 12.6 ms                      | 10.4 ms: 1.22x faster (-18%) |
+-------------------------+------------------------------+------------------------------+
| spectral_norm           | 245 ms                       | 201 ms: 1.22x faster (-18%)  |
+-------------------------+------------------------------+------------------------------+
| float                   | 203 ms                       | 167 ms: 1.21x faster (-18%)  |
+-------------------------+------------------------------+------------------------------+
| mako                    | 27.0 ms                      | 22.2 ms: 1.21x faster (-18%) |
+-------------------------+------------------------------+------------------------------+
| scimark_sor             | 347 ms                       | 286 ms: 1.21x faster (-17%)  |
+-------------------------+------------------------------+------------------------------+
| unpickle_pure_python    | 575 us                       | 475 us: 1.21x faster (-17%)  |
+-------------------------+------------------------------+------------------------------+
| fannkuch                | 803 ms                       | 667 ms: 1.20x faster (-17%)  |
+-------------------------+------------------------------+------------------------------+
| pathlib                 | 35.3 ms                      | 29.5 ms: 1.20x faster (-17%) |
+-------------------------+------------------------------+------------------------------+
| pyflate                 | 1.15 sec                     | 959 ms: 1.19x faster (-16%)  |
+-------------------------+------------------------------+------------------------------+
| sympy_expand            | 707 ms                       | 600 ms: 1.18x faster (-15%)  |
+-------------------------+------------------------------+------------------------------+
| regex_compile           | 303 ms                       | 258 ms: 1.18x faster (-15%)  |
+-------------------------+------------------------------+------------------------------+
| chameleon               | 15.7 ms                      | 13.3 ms: 1.18x faster (-15%) |
+-------------------------+------------------------------+------------------------------+
| sympy_str               | 461 ms                       | 394 ms: 1.17x faster (-15%)  |
+-------------------------+------------------------------+------------------------------+
| genshi_xml              | 104 ms                       | 88.4 ms: 1.17x faster (-15%) |
+-------------------------+------------------------------+------------------------------+
| dulwich_log             | 116 ms                       | 100 ms: 1.16x faster (-14%)  |
+-------------------------+------------------------------+------------------------------+
| sympy_integrate         | 34.4 ms                      | 29.9 ms: 1.15x faster (-13%) |
+-------------------------+------------------------------+------------------------------+
| genshi_text             | 49.1 ms                      | 42.9 ms: 1.15x faster (-13%) |
+-------------------------+------------------------------+------------------------------+
| 2to3                    | 535 ms                       | 471 ms: 1.14x faster (-12%)  |
+-------------------------+------------------------------+------------------------------+
| json_dumps              | 20.4 ms                      | 18.0 ms: 1.13x faster (-12%) |
+-------------------------+------------------------------+------------------------------+
| sympy_sum               | 285 ms                       | 252 ms: 1.13x faster (-12%)  |
+-------------------------+------------------------------+------------------------------+
| xml_etree_process       | 128 ms                       | 114 ms: 1.12x faster (-11%)  |
+-------------------------+------------------------------+------------------------------+
| sqlite_synth            | 4.75 us                      | 4.24 us: 1.12x faster (-11%) |
+-------------------------+------------------------------+------------------------------+
| telco                   | 10.1 ms                      | 8.98 ms: 1.12x faster (-11%) |
+-------------------------+------------------------------+------------------------------+
| meteor_contest          | 168 ms                       | 150 ms: 1.12x faster (-11%)  |
+-------------------------+------------------------------+------------------------------+
| sqlalchemy_imperative   | 53.3 ms                      | 47.7 ms: 1.12x faster (-11%) |
+-------------------------+------------------------------+------------------------------+
| tornado_http            | 425 ms                       | 382 ms: 1.11x faster (-10%)  |
+-------------------------+------------------------------+------------------------------+
| xml_etree_generate      | 159 ms                       | 144 ms: 1.10x faster (-9%)   |
+-------------------------+------------------------------+------------------------------+
| sqlalchemy_declarative  | 271 ms                       | 251 ms: 1.08x faster (-7%)   |
+-------------------------+------------------------------+------------------------------+
| json_loads              | 43.5 us                      | 40.4 us: 1.08x faster (-7%)  |
+-------------------------+------------------------------+------------------------------+
| python_startup          | 13.9 ms                      | 13.1 ms: 1.06x faster (-6%)  |
+-------------------------+------------------------------+------------------------------+
| unpickle_list           | 6.68 us                      | 6.29 us: 1.06x faster (-6%)  |
+-------------------------+------------------------------+------------------------------+
</pre>
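The relationship between the timing columns and the "x faster" and percentage figures can be sketched as follows. This is a minimal illustration, assuming both figures are derived from the ratio of the two mean timings; the helper name <code>describe_speedup</code> is hypothetical and is not part of pyperformance. Because the table shows rounded timings, recomputing a row from them may differ in the last digit.

```python
def describe_speedup(old_time: float, new_time: float) -> str:
    """Format a before/after timing pair in the style of the table rows."""
    ratio = old_time / new_time                  # 1.32 reads as "1.32x faster"
    change = (new_time - old_time) / old_time    # negative means less time spent
    return f"{ratio:.2f}x faster ({change:+.0%})"


# nbody row of the changed build: 236 ms before, 179 ms after
print(describe_speedup(236, 179))
```

Both timings must use the same unit, since only their ratio matters.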


Line 226: Line 206:
* Proposal owners:
<!-- What work do the feature owners have to accomplish to complete the feature in time for release?  Is it a large change affecting many parts of the distribution or is it a very isolated change? What are those changes?-->
** Review and merge the [https://src.fedoraproject.org/rpms/python3/pull-request/133 pull request with the implementation].
** Review and merge the [https://src.fedoraproject.org/rpms/python3/pull-request/151 pull request with the implementation].
** Go through the Python C extension packages that are linked to libpython and test if things work correctly. A copr repository will be provided for testing.
** Monitor Koschei for significant problems.
** Backport the change to alternate Python versions.
** Attempt to upstream the change: https://bugs.python.org/issue38980


* Other developers: Other developers are encouraged to test the new statically linked python3 and check if their package works as expected <!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
* Other developers are encouraged to check if their package works as expected <!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
<!-- What work do other developers have to accomplish to complete the feature in time for release?  Is it a large change affecting many parts of the distribution or is it a very isolated change? What are those changes?-->


* Release engineering: [https://pagure.io/releng/issue/8953 #8953] This change does not require a mass rebuild, however a rebuild of the affected packages will be required. The affected packages will be rebuilt in copr first.
* Release engineering: N/A (not needed for this Change) -- this change does not require a mass rebuild nor any other special releng work


* Policies and guidelines: The packaging guidelines will need to be updated to explicitly mention that C extensions should not be linked to libpython, and that the python3 binary is statically linked. <!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
* Policies and guidelines: N/A (not needed for this Change) <!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
<!-- Do the packaging guidelines or other documents need to be updated for this feature?  If so, does it need to happen before or after the implementation is done?  If a FPC ticket exists, add a link here. -->


Line 244: Line 226:


<!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
Affected package maintainers should verify that their packages work as expected. The only impact end users should see is a performance increase for workloads relying on Python.
Python package maintainers should verify that their packages work as expected. The only impact end users should see is a performance increase for workloads relying on Python.


== How To Test ==
Line 263: Line 245:
<!-- REQUIRED FOR SYSTEM WIDE CHANGES -->


Copr repo with instructions: https://copr.fedorainfracloud.org/coprs/g/python/Python3_statically_linked/
Test that everything Python related in Fedora works as usual.
 
=== Package changes test ===
The change will bring the new <code>libpython3</code> subpackage as a dependency of <code>python3-devel</code>.
 
Test that it's installed:
<pre>
$ rpm -q libpython3
</pre>
 
Test that it's uninstalled if <code>python3-devel</code> is removed:
<pre>
$ dnf remove python3-devel
</pre>


Test that <code>python3-libs</code> no longer includes the libpython shared library.
=== Was the flag applied test ===
<pre>
$ rpm -ql python3-libs | grep libpython3
</pre>


=== Dynamic linker test ===
You can test whether the <code>-fno-semantic-interposition</code> flag was applied for your Python build:
 
To check that the python3.8 program is not linked to libpython, ldd can be used. For example, Python 3.7 will still be linked to libpython:


<pre>
$ ldd /usr/bin/python3.7|grep libpython
libpython3.7m.so.1.0 => /lib64/libpython3.7m.so.1.0 (0x00007fbb57333000)
</pre>
<pre>
>>> import sysconfig
>>> '-fno-semantic-interposition' in (sysconfig.get_config_var('PY_CFLAGS') + sysconfig.get_config_var('PY_CFLAGS_NODIST'))
True
>>> '-fno-semantic-interposition' in (sysconfig.get_config_var('PY_LDFLAGS') + sysconfig.get_config_var('PY_LDFLAGS_NODIST'))
True
</pre>


But python3.8 will no longer be linked to libpython:
Before the change, you would see <code>False</code>, <code>False</code>.
 
<pre>
$ ldd /usr/bin/python3.8|grep libpython
</pre>
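The interactive <code>sysconfig</code> check for the flag can also be wrapped in a small script. This is a minimal sketch and not part of the proposal: the helper name <code>flag_in_config</code> is hypothetical, and it assumes that a build variable may be undefined, in which case <code>sysconfig.get_config_var()</code> returns <code>None</code> and plain string concatenation would fail.

```python
import sysconfig

FLAG = "-fno-semantic-interposition"


def flag_in_config(flag, names, get_var=sysconfig.get_config_var):
    """Return True if flag appears in any of the named build variables."""
    # get_config_var() returns None for variables the build did not define,
    # so substitute an empty string before splitting into individual flags.
    return any(flag in (get_var(name) or "").split() for name in names)


if __name__ == "__main__":
    print("compiler flags:", flag_in_config(FLAG, ("PY_CFLAGS", "PY_CFLAGS_NODIST")))
    print("linker flags:  ", flag_in_config(FLAG, ("PY_LDFLAGS", "PY_LDFLAGS_NODIST")))
```

The <code>get_var</code> parameter exists only so the lookup can be replaced in tests; by default the script reads the configuration of the interpreter that runs it.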


=== Performance test ===


The performance speedup can be measured using the official Python benchmark suite [https://pyperformance.readthedocs.io/ pyperformance]: see [https://pyperformance.readthedocs.io/usage.html#run-benchmarks Run benchmarks].
=== Namespace test ===
The following script can be used to verify that the change is in effect:
<pre>
import ctypes
import sys
EMPTY_TUPLE_SINGLETON = ()
def get_empty_tuple(lib):
    # Call PyTuple_New(0)
    func = lib.PyTuple_New
    func.argtypes = (ctypes.c_ssize_t,)
    func.restype = ctypes.py_object
    return func(0)
def test_lib(libname, lib):
    obj = get_empty_tuple(lib)
    if obj is EMPTY_TUPLE_SINGLETON:
        print("%s: SAME namespace" % libname)
    else:
        print("%s: DIFFERENT namespace" % libname)
def test():
    program = ctypes.pythonapi
    if hasattr(sys, 'abiflags'):
        abiflags = sys.abiflags
    else:
        # Python 2
        abiflags = ''
    ver = sys.version_info
    filename = ('libpython%s.%s%s.so.1.0'
                % (ver.major, ver.minor, abiflags))
    libpython = ctypes.cdll.LoadLibrary(filename)
    test_lib('program', program)
    test_lib('libpython', libpython)
test()
</pre>
Output before the change:
<pre>
program: SAME namespace
libpython: SAME namespace
</pre>
Output after the change:
<pre>
program: SAME namespace
libpython: DIFFERENT namespace
</pre>


== User Experience ==
Line 376: Line 283:


<!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
While this specific change is not dependent on anything else, we would like to ensure that all the packages that link to libpython continue to work as expected.
This change is not dependent on anything else.
 
Currently (30/10/2019) 118 packages on rawhide depend on libpython.
 
Result of the "repoquery --repo=rawhide --source --whatrequires 'libpython3.8.so.1.0()(64bit)' " command on Fedora Rawhide, x86_64:
 
*COPASI
*Io-language
*OpenImageIO
*YafaRay
*antimony
*blender
*boost
*calamares
*calibre
*cantor
*ceph
*clingo
*condor
*createrepo_c
*csound
*cvc4
*dionaea
*dmlite
*domoticz
*fontforge
*freecad
*gdb
*gdcm
*gdl
*getdp
*glade
*globus-net-manager
*glom
*gnucash
*gpaw
*hamlib
*hokuyoaist
*hugin
*insight
*kdevelop-python
*kicad
*kitty
*krita
*lammps
*ldns
*libCombine
*libarcus https://src.fedoraproject.org/rpms/libarcus/pull-request/8
*libarcus-lulzbot
*libbatch
*libcec
*'''libcomps'''
*'''libdnf'''
*libftdi
*libkml
*libkolabxml
*libldb
*libnuml
*libpeas
*libplist
*libreoffice
*librepo
*libsavitar
*libsbml
*libsedml
*libtalloc
*libyang
*libyui-bindings
*link-grammar
*lldb
*mathgl
*med
*mod_wsgi
*nautilus-python
*nbdkit
*nest
*netgen-mesher
*neuron
*nextpnr
*nordugrid-arc
*nwchem
*openbabel
*openscap
*opentrep
*openvdb
*pam_wrapper
*paraview
*perl-Inline-Python
*pidgin
*pitivi
*plplot
*postgresql
*pynac
*pyotherside
*pythia8
*python-gstreamer1
*python-jep
*python-qt5
*<del>python3</del>
*qgis
*qpid-dispatch
*qpid-proton
*rdkit
*renderdoc
*rmol
*root
*samba
*scidavis
*sigil
*swift-lang
*texworks
*thunarx-python
*trademgen
*trellis
*unbound
*uwsgi
*vdr-epg-daemon
*vigra
*'''vim'''
*vrpn
*vtk
*weechat
*znc
 
Packages in '''bold''' are the ones present in the default docker/podman "fedora:rawhide" image.


== Contingency Plan ==


<!-- If you cannot complete your feature by the final development freeze, what is the backup plan?  This might be as simple as "Revert the shipped configuration".  Or it might not (e.g. rebuilding a number of dependent packages).  If you feature is not completed in time we want to assure others that other parts of Fedora will not be in jeopardy.  -->
* Contingency mechanism: If issues appear that cannot be fixed in a timely manner the change can be easily reverted and will be considered again for the next fedora release. Also a proper upgrade path mechanism will be provided in case of reversion, since libpython.3.?.so will be a separate package with this change.
* Contingency mechanism: If issues appear that cannot be fixed in a timely manner the change can be easily reverted and will be considered again for the next fedora release.
<!-- When is the last time the contingency mechanism can be put in place?  This will typically be the beta freeze. -->
* Contingency deadline: Before the beta freeze of Fedora 32 (2020-02-25)
Line 516: Line 299:


<!-- REQUIRED FOR SYSTEM WIDE CHANGES -->
The documentation will be reflected in the changes for the python packaging guidelines.
This change proposal has all the documentation.
 
See the [[Changes/PythonStaticSpeedup|previous change proposal]] and the [https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/NWPVQSKVWDKA75PDEIJNJIFL5C5SJXB2/ thread about it on the devel mailing list] for more relevant information about what we are not doing


== Release Notes ==
Line 525: Line 310:
-->
-->


[[Category:ChangePageIncomplete]]
TBD. Be sure to mention PEP [https://www.python.org/dev/peps/pep-0445/ 445] and [https://www.python.org/dev/peps/pep-0454/ 454].
 
[[Category:ChangeAcceptedF32]]
<!-- When your change proposal page is completed and ready for review and announcement -->
<!-- remove Category:ChangePageIncomplete and change it to Category:ChangeReadyForWrangler -->
Line 532: Line 319:


<!-- Select proper category, default is Self Contained Change -->
<!-- [[Category:SelfContainedChange]] -->
[[Category:SelfContainedChange]]
[[Category:SystemWideChange]]
<!-- [[Category:SystemWideChange]] -->

Latest revision as of 13:38, 9 June 2021

Build Python with -fno-semantic-interposition for better performance

Simplified version of another change proposal
This change was originally proposed for Fedora 32 as Changes/PythonStaticSpeedup; however, based on community feedback, it has been significantly reduced.

Summary

We add the -fno-semantic-interposition compiler flag when building Python interpreters, as it provides significant performance improvement, up to 27% depending on the workload. Users will no longer be able to use LD_PRELOAD to override a symbol from libpython, which we consider a good trade off for the speedup.

Owner

Current status

Detailed Description

When we build the Python interpreter with the -fno-semantic-interposition compiler/linker flag, we can achieve a performance gain of 5% to 27% depending on the workload. Link time optimizations and profile guided optimizations also have a greater impact when python3 is built this way.

For a vague-linkage function definition, a call site in the same translation unit may inline the callee. Whether -fno-semantic-interposition is enabled has no effect.

For a non-vague-linkage function definition, by default (-fsemantic-interposition) the -fpic mode does not allow a call site in the same translation unit to inline the callee or perform other interprocedural optimizations. -fno-semantic-interposition re-enables interprocedural optimizations.

If a caller inlines a callee, using LD_PRELOAD to interpose the callee will not affect the caller. Many other uses of LD_PRELOAD still work. We consider this small LD_PRELOAD limitation a good trade off for the speedup.

Interposition is enabled by default in compilers like GCC: function calls to a library go through a "Procedure Linkage Table" (PLT). This indirection is required to allow a library loaded via the LD_PRELOAD environment variable to override a function. The indirection puts more pressure on the CPU level 1 instruction cache. In terms of performance, the main drawback is that function calls from a library to the same library cannot be inlined, because inlining would violate the interposition semantics. Inlining is usually a big win in terms of performance.

Disabling interposition for libpython removes the overhead on function calls by avoiding the PLT indirection, and allows more function calls to be inlined. This applies to function calls from libpython to libpython, which are very common in Python: almost all function calls are calls from libpython to libpython.

If Fedora users need to use LD_PRELOAD to override symbols in libpython, the recommended way is to build a custom Python without -fno-semantic-interposition.

It is still possible to use LD_PRELOAD to override symbols in other libraries (for example in glibc).

Affected Pythons

Primarily, we will change the interpreter in the python3 package, that is Python 3.8 in Fedora 32 and any later version of Python in future Fedora releases.

Impact on other Python packages (and generally software using Python) is not anticipated (other than the possible speedup).

We will also change the alternate Python interpreters where possible and useful, primarily the upstream supported versions of CPython, such as python39 (if already packaged), python37 and python36.

Affected Fedora releases

This is a Fedora 32 change and it will be implemented in Rawhide (Fedora 32) only. Any future versions of Fedora will inherit the change until it is reverted for some reason.

If it turns out that there are absolutely no issues, we might consider backporting the speedup to already released Fedora versions (for example Fedora 31). Such action would be separately coordinated with FESCo.

Benefit to Fedora

Python's performance will increase significantly, depending on the workload. Since many core components of the OS also depend on Python, this could lead to an increase in their performance as well. However, individual benchmarks will need to be conducted to verify the performance gain for those components.

pyperformance results, ignoring differences smaller than 5%:

+-------------------------+------------------------------+------------------------------+
| Benchmark               | python38-3.8.0-1 (original)  | python38-3.8.0-2 (changed)   |
+=========================+==============================+==============================+
| scimark_lu              | 294 ms                       | 213 ms: 1.38x faster (-27%)  |
+-------------------------+------------------------------+------------------------------+
| scimark_sparse_mat_mult | 8.61 ms                      | 6.39 ms: 1.35x faster (-26%) |
+-------------------------+------------------------------+------------------------------+
| nbody                   | 236 ms                       | 179 ms: 1.32x faster (-24%)  |
+-------------------------+------------------------------+------------------------------+
| django_template         | 203 ms                       | 158 ms: 1.29x faster (-22%)  |
+-------------------------+------------------------------+------------------------------+
| raytrace                | 910 ms                       | 709 ms: 1.28x faster (-22%)  |
+-------------------------+------------------------------+------------------------------+
| logging_format          | 17.7 us                      | 13.8 us: 1.28x faster (-22%) |
+-------------------------+------------------------------+------------------------------+
| richards                | 124 ms                       | 97.2 ms: 1.27x faster (-21%) |
+-------------------------+------------------------------+------------------------------+
| unpickle                | 23.9 us                      | 18.8 us: 1.27x faster (-21%) |
+-------------------------+------------------------------+------------------------------+
| chaos                   | 200 ms                       | 158 ms: 1.26x faster (-21%)  |
+-------------------------+------------------------------+------------------------------+
| hexiom                  | 17.6 ms                      | 14.0 ms: 1.26x faster (-21%) |
+-------------------------+------------------------------+------------------------------+
| logging_simple          | 15.8 us                      | 12.5 us: 1.26x faster (-21%) |
+-------------------------+------------------------------+------------------------------+
| nqueens                 | 179 ms                       | 142 ms: 1.26x faster (-20%)  |
+-------------------------+------------------------------+------------------------------+
| logging_silent          | 340 ns                       | 273 ns: 1.25x faster (-20%)  |
+-------------------------+------------------------------+------------------------------+
| crypto_pyaes            | 201 ms                       | 162 ms: 1.24x faster (-19%)  |
+-------------------------+------------------------------+------------------------------+
| scimark_fft             | 653 ms                       | 527 ms: 1.24x faster (-19%)  |
+-------------------------+------------------------------+------------------------------+
| scimark_monte_carlo     | 190 ms                       | 154 ms: 1.24x faster (-19%)  |
+-------------------------+------------------------------+------------------------------+
| pickle_pure_python      | 795 us                       | 646 us: 1.23x faster (-19%)  |
+-------------------------+------------------------------+------------------------------+
| go                      | 443 ms                       | 361 ms: 1.23x faster (-18%)  |
+-------------------------+------------------------------+------------------------------+
| deltablue               | 12.6 ms                      | 10.4 ms: 1.22x faster (-18%) |
+-------------------------+------------------------------+------------------------------+
| spectral_norm           | 245 ms                       | 201 ms: 1.22x faster (-18%)  |
+-------------------------+------------------------------+------------------------------+
| float                   | 203 ms                       | 167 ms: 1.21x faster (-18%)  |
+-------------------------+------------------------------+------------------------------+
| mako                    | 27.0 ms                      | 22.2 ms: 1.21x faster (-18%) |
+-------------------------+------------------------------+------------------------------+
| scimark_sor             | 347 ms                       | 286 ms: 1.21x faster (-17%)  |
+-------------------------+------------------------------+------------------------------+
| unpickle_pure_python    | 575 us                       | 475 us: 1.21x faster (-17%)  |
+-------------------------+------------------------------+------------------------------+
| fannkuch                | 803 ms                       | 667 ms: 1.20x faster (-17%)  |
+-------------------------+------------------------------+------------------------------+
| pathlib                 | 35.3 ms                      | 29.5 ms: 1.20x faster (-17%) |
+-------------------------+------------------------------+------------------------------+
| pyflate                 | 1.15 sec                     | 959 ms: 1.19x faster (-16%)  |
+-------------------------+------------------------------+------------------------------+
| sympy_expand            | 707 ms                       | 600 ms: 1.18x faster (-15%)  |
+-------------------------+------------------------------+------------------------------+
| regex_compile           | 303 ms                       | 258 ms: 1.18x faster (-15%)  |
+-------------------------+------------------------------+------------------------------+
| chameleon               | 15.7 ms                      | 13.3 ms: 1.18x faster (-15%) |
+-------------------------+------------------------------+------------------------------+
| sympy_str               | 461 ms                       | 394 ms: 1.17x faster (-15%)  |
+-------------------------+------------------------------+------------------------------+
| genshi_xml              | 104 ms                       | 88.4 ms: 1.17x faster (-15%) |
+-------------------------+------------------------------+------------------------------+
| dulwich_log             | 116 ms                       | 100 ms: 1.16x faster (-14%)  |
+-------------------------+------------------------------+------------------------------+
| sympy_integrate         | 34.4 ms                      | 29.9 ms: 1.15x faster (-13%) |
+-------------------------+------------------------------+------------------------------+
| genshi_text             | 49.1 ms                      | 42.9 ms: 1.15x faster (-13%) |
+-------------------------+------------------------------+------------------------------+
| 2to3                    | 535 ms                       | 471 ms: 1.14x faster (-12%)  |
+-------------------------+------------------------------+------------------------------+
| json_dumps              | 20.4 ms                      | 18.0 ms: 1.13x faster (-12%) |
+-------------------------+------------------------------+------------------------------+
| sympy_sum               | 285 ms                       | 252 ms: 1.13x faster (-12%)  |
+-------------------------+------------------------------+------------------------------+
| xml_etree_process       | 128 ms                       | 114 ms: 1.12x faster (-11%)  |
+-------------------------+------------------------------+------------------------------+
| sqlite_synth            | 4.75 us                      | 4.24 us: 1.12x faster (-11%) |
+-------------------------+------------------------------+------------------------------+
| telco                   | 10.1 ms                      | 8.98 ms: 1.12x faster (-11%) |
+-------------------------+------------------------------+------------------------------+
| meteor_contest          | 168 ms                       | 150 ms: 1.12x faster (-11%)  |
+-------------------------+------------------------------+------------------------------+
| sqlalchemy_imperative   | 53.3 ms                      | 47.7 ms: 1.12x faster (-11%) |
+-------------------------+------------------------------+------------------------------+
| tornado_http            | 425 ms                       | 382 ms: 1.11x faster (-10%)  |
+-------------------------+------------------------------+------------------------------+
| xml_etree_generate      | 159 ms                       | 144 ms: 1.10x faster (-9%)   |
+-------------------------+------------------------------+------------------------------+
| sqlalchemy_declarative  | 271 ms                       | 251 ms: 1.08x faster (-7%)   |
+-------------------------+------------------------------+------------------------------+
| json_loads              | 43.5 us                      | 40.4 us: 1.08x faster (-7%)  |
+-------------------------+------------------------------+------------------------------+
| python_startup          | 13.9 ms                      | 13.1 ms: 1.06x faster (-6%)  |
+-------------------------+------------------------------+------------------------------+
| unpickle_list           | 6.68 us                      | 6.29 us: 1.06x faster (-6%)  |
+-------------------------+------------------------------+------------------------------+
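
Each row of the table reports two figures: a speedup factor and a percentage in parentheses. The following is a minimal sketch of one plausible way the two relate, assuming the factor is the original time divided by the changed time and the percentage is the relative change in run time. The helper name compare is hypothetical and is not part of pyperformance. Because the table lists rounded means, recomputed values can differ from the printed ones by about a percentage point.

```python
def compare(original, changed):
    """Return (speedup factor, percent change in run time).

    Hypothetical helper for illustration; both timings must use the same unit.
    """
    speedup = original / changed                       # > 1 means the changed build is faster
    percent = (changed - original) / original * 100.0  # negative means less time was needed
    return speedup, percent


# scimark_lu row from the table above, using the rounded means in milliseconds.
speedup, percent = compare(294.0, 213.0)
print(f"{speedup:.2f}x faster ({percent:+.1f}%)")
```

The same function can be applied to any other row, as long as both timings are first converted to the same unit.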

Scope

  • Other developers are encouraged to check if their package works as expected
  • Release engineering: N/A (not needed for this Change) -- this change does not require a mass rebuild nor any other special releng work
  • Policies and guidelines: N/A (not needed for this Change)
  • Trademark approval: N/A (not needed for this Change)

Upgrade/compatibility impact

Python package maintainers should verify that their packages work as expected. The only impact end users should see is a performance increase for workloads relying on Python.

How To Test

Test that everything Python related in Fedora works as usual.

Was the flag applied test

You can test whether the -fno-semantic-interposition flag was applied for your Python build:

>>> import sysconfig
>>> '-fno-semantic-interposition' in (sysconfig.get_config_var('PY_CFLAGS') + sysconfig.get_config_var('PY_CFLAGS_NODIST'))
True
>>> '-fno-semantic-interposition' in (sysconfig.get_config_var('PY_LDFLAGS') + sysconfig.get_config_var('PY_LDFLAGS_NODIST'))
True

Before the change, you would see False, False.
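
The interactive check above concatenates the values returned by sysconfig.get_config_var(), which returns None for a variable the build does not define, so the concatenation can raise TypeError on such builds. The following is a sketch of a more defensive variant of the same check. The function names flag_in_config_vars and report are hypothetical; the variable names are the ones used above. It also matches the flag as a whole whitespace-separated token rather than as a substring.

```python
import sysconfig

FLAG = "-fno-semantic-interposition"


def flag_in_config_vars(flag, *names):
    """Return True if `flag` is a token in any of the named sysconfig variables."""
    for name in names:
        value = sysconfig.get_config_var(name)
        # get_config_var() returns None when the variable is not defined,
        # so only inspect values that are actually strings.
        if isinstance(value, str) and flag in value.split():
            return True
    return False


def report():
    """Check the compiler and linker flag variables separately."""
    return {
        "cflags": flag_in_config_vars(FLAG, "PY_CFLAGS", "PY_CFLAGS_NODIST"),
        "ldflags": flag_in_config_vars(FLAG, "PY_LDFLAGS", "PY_LDFLAGS_NODIST"),
    }


if __name__ == "__main__":
    for kind, present in report().items():
        print(f"{kind}: {present}")
```

Run it with the interpreter you want to inspect, for example the python3 binary from the package under test.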

Performance test

The performance speedup can be measured using the official Python benchmark suite pyperformance: see Run benchmarks.
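
As a rough sketch of how a before/after measurement could be scripted, the helpers below build the pyperformance command lines for one benchmark run and for comparing two result files. The helper names and the JSON file names are hypothetical, and the sketch assumes pyperformance is installed for the interpreter running the script; see the pyperformance documentation for the authoritative usage.

```python
import subprocess
import sys


def run_command(output_json, benchmarks=None):
    """Build the command line for `pyperformance run`, writing results to a JSON file."""
    cmd = [sys.executable, "-m", "pyperformance", "run", "-o", output_json]
    if benchmarks:
        # -b takes a comma-separated list of benchmark names.
        cmd += ["-b", ",".join(benchmarks)]
    return cmd


def compare_command(before_json, after_json):
    """Build the command line for `pyperformance compare` on two result files."""
    return [sys.executable, "-m", "pyperformance", "compare", before_json, after_json]


def measure(output_json, benchmarks=None):
    """Run the benchmarks; this can take a long time for the full suite."""
    subprocess.run(run_command(output_json, benchmarks), check=True)
```

One possible workflow is to call measure("before.json") with the original build installed, measure("after.json") with the changed build installed, and then run the command returned by compare_command("before.json", "after.json").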

User Experience

Python based workloads should see a performance gain of up to 27%.

Dependencies

This change is not dependent on anything else.

Contingency Plan

  • Contingency mechanism: If issues appear that cannot be fixed in a timely manner, the change can be easily reverted and will be considered again for the next Fedora release.
  • Contingency deadline: Before the beta freeze of Fedora 32 (2020-02-25)
  • Blocks release? Yes
  • Blocks product? None

Documentation

This change proposal has all the documentation.

See the previous change proposal and the thread about it on the devel mailing list for more relevant information about what we are not doing.

Release Notes

TBD. Be sure to mention PEP 445 and 454.