From Fedora Project Wiki

Revision as of 13:42, 5 July 2023 by Catanzaro

Privacy-preserving Telemetry for Fedora Workstation

Summary

Red Hat proposes to enable limited data collection of anonymous Fedora Workstation usage metrics.

Please don't panic yet! Fedora is an open source community project, and nobody is interested in violating user privacy. We do not want to collect data about individual users. We want to collect only aggregate usage metrics that are actually needed to achieve specific Fedora improvement objectives, and no more. We understand that if we violate our users' trust, then we won't have many users left, so if metrics collection is approved, we will need to be very careful to roll this out in a way that respects our users at all times. (For example, we should not collect users' search queries, because that would be creepy.)

We believe an open source community can ethically collect limited aggregate data on how its software is used without involving big data companies or building creepy tracking profiles that are not in the best interests of users. Users will have the option to disable data upload before any data is sent for the first time. Our service will be operated by Fedora on Fedora infrastructure, and will not depend on Google Analytics or any other controversial third-party services. And in contrast to proprietary software operating systems, you can redirect the data collection to your own private metrics server instead of Fedora's to see precisely what data is being collected from you, because the server components are open source too.

Keep in mind this Fedora change proposal is just that: a proposal. It must undergo community review and must be approved by the community-elected Fedora Engineering Steering Committee (FESCo) before it can be implemented, just like any other Fedora change proposal. We welcome community participation and fully expect this proposal may need to be modified significantly depending on Fedora community feedback.

Owner

Current status

  • Targeted release: Fedora Linux 40
  • Last updated: 2023-07-05
  • FESCo issue: <will be assigned by the Wrangler>
  • Tracker bug: <will be assigned by the Wrangler>
  • Release notes tracker: <will be assigned by the Wrangler>

Detailed Description

We intend to deploy the Endless OS metrics system. This blog post contains a description of how the system works. We do not plan to deploy the eos-phone-home component in Fedora.

How will data collection be approved?

Although this change is being proposed by Red Hat, we feel it is essential to ensure the Fedora community has ultimate oversight over metrics collection. Community control is required to maintain user trust. If this change proposal is approved, then we'll need new policies and procedures to ensure community oversight over metrics collection and ensure Fedora users can be confident that our metrics collection does not violate their privacy.

We can say "we would never collect personally-identifiable data" and write software that really doesn't collect any such data, but this alone will never be enough to ensure user confidence. We will need a metrics collection policy that describes what sort of data may be collected by Fedora (anonymous, non-invasive), and what sort of data may not be collected. Such a policy does not exist currently. We will also want to ensure the Fedora community has ultimate control over which particular metrics are collected. One option is that each metric to be collected should be separately approved by FESCo. Collection of particular metrics in a particular data format is ultimately an engineering decision, and therefore FESCo seems like an appropriate approval point. Because FESCo members are elected regularly by the Fedora community, this also provides the community with ultimate control over metrics collection via the election process. But other oversight and approval structures would work too.

What data might we collect?

We are not proposing to collect any of these particular metrics just yet, because a process for Fedora community approval of metrics to be collected does not yet exist. That said, in the interests of maximum transparency, we wish to give you an idea of what sorts of metrics we might propose to collect in the future.

One of the main goals of metrics collection is to analyze whether Red Hat is achieving its goal to make Fedora Workstation the premier developer platform for cloud software development. Accordingly, we want to know things like which IDEs are most popular among our users, and which runtimes are used to create containers using Toolbx.

Metrics can also be used to inform user interface design decisions. For example, we want to collect the clickthrough rate of the recommended software banners in GNOME Software to assess which banners are actually useful to users. We also want to know how frequently panels in gnome-control-center are visited to determine which panels could be consolidated or removed, because there are other settings we want to add, but our usability research indicates that the current high quantity of settings panels already makes it difficult for users to find commonly-used settings.

Metrics can help us understand the hardware we should be optimizing Fedora for. For example, our boot performance on hard drives dropped drastically when systemd-readahead was removed. Ubuntu has maintained its own readahead implementation, but Fedora does not because we assume that not many users use Fedora on hard drives. It would be nice to collect a metric that indicates whether primary storage is a solid state drive or a hard disk, so we can see actual hard drive usage instead of guessing. We would also want to collect hardware information (for example, laptop model ID) that would be useful for collaboration with hardware vendors such as Lenovo.
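As a rough illustration of how little data the storage metric would need, here is a minimal sketch in Python. It relies on the Linux sysfs attribute /sys/block/<device>/queue/rotational, which reads 1 for rotating media and 0 otherwise. The function name storage_kind and its return labels are hypothetical and are not part of eos-metrics-instrumentation or of this proposal. Note that a value of 0 only means "non-rotational", which also covers NVMe devices and many virtual disks.

```python
from pathlib import Path


def storage_kind(device: str, sysfs_root: Path = Path("/sys/block")) -> str:
    """Classify a block device (e.g. "sda") using its sysfs rotational flag.

    Hypothetical helper for illustration only. Returns "rotational",
    "non-rotational", or "unknown" if the flag is missing or unreadable.
    """
    flag = sysfs_root / device / "queue" / "rotational"
    try:
        value = flag.read_text().strip()
    except OSError:
        return "unknown"
    return {"1": "rotational", "0": "non-rotational"}.get(value, "unknown")
```

A single three-valued answer like this is enough to count hard drive usage in aggregate without recording anything about the device itself.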

Other Fedora teams may have other metrics they wish to collect. For example, Fedora localization wishes to count users of particular locales to evaluate which locales are in poorer shape relative to their usage.

This is only a small sample of what we might want to know; no doubt other community members can think of many more interesting data points to collect. But note the purpose of all of the above metrics is to inform specific design decisions, not to build tracking profiles. We only need to collect data in aggregate, and have no need to associate the data we collect with particular users.

Metrics transparency

Transparency is required to provide confidence that Fedora metrics collection is not creepy or invasive. Since Fedora is open source, a developer can review the source code to verify exactly what it is doing and what data is being collected. But most Fedora users are not software developers, and few software developers have time or inclination to review the source code of the operating system to see what it is doing. To retain user trust, we need an easy way for users to understand exactly what data we are collecting. We propose to maintain a documentation page showing the current metrics database schema, so users can see exactly which fields are in the database and what example data looks like.

Experienced users may gain additional confidence by building and running their own metrics collection server; all of the components of the server (discussed below) are open source, and we will provide instructions for how to run a simple server yourself and view its metrics database. You can redirect metrics from Fedora's server to your own by changing a URL in a configuration file.

User control

A new metrics collection setting will be added to the privacy page in gnome-initial-setup and also to the privacy page in gnome-control-center. This setting will be a toggle that will enable or disable metrics collection for the entire system. We want to ensure that metrics are never submitted to Fedora without the user's knowledge and consent, so the underlying setting will be off by default in order to ensure metrics upload is not unexpectedly turned on when upgrading from an older version of Fedora. However, we also want to ensure that the data we collect is meaningful, so gnome-initial-setup will default to displaying the toggle as enabled, even though the underlying setting will initially be disabled. (The underlying setting will not actually be enabled until the user finishes the privacy page, to ensure users have the opportunity to disable the setting before any data is uploaded.) This is to ensure the system is opt-out, not opt-in. This is essential because we know that opt-in metrics are not very useful. Few users would opt in, and these users would not be representative of Fedora users as a whole. We are not interested in opt-in metrics.

To make this a little more confusing, metrics collection is actually separate from uploading. Collection is always initially enabled, while uploading is always initially disabled. The graphical toggle enables or disables both at the same time. That is, a newly-installed Fedora system will always collect metrics locally at first, but the collected metrics will be deleted and never submitted to Fedora if the user disables the metrics collection toggle on the privacy page. If the user leaves the toggle enabled, then the collected metrics may be submitted only after finishing the privacy page.

Metrics uploading will be opt-in for users who upgrade from previous versions of Fedora Workstation, because we don't yet have a mechanism to ask the user to consent to data collection after a system upgrade like we do for new installations, but metrics collection will be opt-out. That is, your upgraded system will collect metrics locally but will never submit them to Fedora. If you visit the privacy page in gnome-control-center, then both collection and uploading will be either enabled or disabled depending on your selection. Unlike gnome-initial-setup, the switch in gnome-control-center will default to off if you have not seen the switch in gnome-initial-setup and have not previously selected a value for the setting.

This might sound complicated, but it is consistent. If the user has not yet made a decision whether to allow telemetry, we collect it locally so that it's ready to submit if the user approves telemetry in the future, but we never upload it. Once the user makes a decision, then we either upload it or delete it and stop collecting.
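The rules above amount to a small three-state machine: undecided, allowed, and denied. The following Python sketch is one way to model it, purely for illustration. The names Consent and MetricsState are hypothetical and do not correspond to any setting or class in eos-event-recorder-daemon.

```python
from dataclasses import dataclass, field
from enum import Enum


class Consent(Enum):
    UNDECIDED = "undecided"  # user has not yet finished a privacy page
    ALLOWED = "allowed"      # toggle left on or switched on
    DENIED = "denied"        # toggle switched off


@dataclass
class MetricsState:
    """Hypothetical model of the collection/upload rules described above."""
    consent: Consent = Consent.UNDECIDED
    local_events: list = field(default_factory=list)
    uploaded: list = field(default_factory=list)

    @property
    def collecting(self) -> bool:
        # Collection runs until the user explicitly says no.
        return self.consent is not Consent.DENIED

    @property
    def uploading(self) -> bool:
        # Upload runs only after the user explicitly says yes.
        return self.consent is Consent.ALLOWED

    def record(self, event: str) -> None:
        if self.collecting:
            self.local_events.append(event)

    def decide(self, allow: bool) -> None:
        self.consent = Consent.ALLOWED if allow else Consent.DENIED
        if not allow:
            self.local_events.clear()  # delete anything collected so far

    def flush(self) -> None:
        if self.uploading:
            self.uploaded.extend(self.local_events)
            self.local_events.clear()
```

In this model, a fresh install and an upgraded system both start as UNDECIDED. The only difference is which position the toggle shows when the user first sees it.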

GDPR

It is Fedora Legal's obligation to ensure our data collection complies with legal requirements in the jurisdictions in which Red Hat operates. This is not an obligation of the Fedora community, so there is no need to discuss GDPR rules on our mailing lists. The proposal owners will not respond to mailing list posts that discuss GDPR or similar legal obligations during this change proposal discussion. In short, let's keep discussion focused on what Fedora SHOULD or SHOULD NOT do, rather than what we MUST or MUST NOT do.

That said, Fedora Legal has determined that if we collect any personally-identifiable data, the entire metrics system must be opt-in. Since we are only interested in opt-out metrics due to the low value of opt-in metrics, we must accordingly never collect any personally-identifiable data. We must also not collect any data that could become personally-identifiable if combined with other data, which notably means IP addresses must not be stored. We only want to collect anonymous data anyway, but we need to be especially mindful of the possibility that combining two "anonymous" data points could result in the data no longer being anonymous.

Fedora data collection policy

Fedora Legal requires that we publish a Fedora data collection policy separate from the existing Fedora Privacy Policy, which is designed to address usage of Fedora websites. This is currently a work in progress that we're not quite ready to share yet. You can expect it to be very short and very generic.

Metrics server infrastructure

We propose to deploy Azafea, the open source metrics collection server used by Endless OS. An Azafea deployment consists of five components: an nginx proxy server, azafea-metrics-proxy, redis, azafea itself, and a Postgres database. nginx proxies HTTP requests to azafea-metrics-proxy, which is itself a simple HTTP server that adds metrics into the redis database, where they will be fetched by Azafea and stored into Postgres. We will provide instructions on how to set up your own server and see for yourself what data gets collected.
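To make the data flow concrete, here is a toy Python model of the proxy, queue, and worker stages. It is not Azafea code: a deque stands in for redis, a list stands in for the Postgres table, and JSON is assumed as the payload format only to keep the example self-contained. The function names proxy_receive and worker_drain are hypothetical.

```python
import json
from collections import deque


def proxy_receive(body: bytes, queue: deque) -> None:
    """Stand-in for azafea-metrics-proxy: enqueue the request body only.

    Nothing about the client connection (such as its IP address) is kept.
    """
    queue.append(body)


def worker_drain(queue: deque, table: list) -> int:
    """Stand-in for azafea: pop queued bodies, parse them, store the rows.

    Returns the number of rows stored. Malformed bodies are skipped.
    """
    stored = 0
    while queue:
        body = queue.popleft()
        try:
            event = json.loads(body)
        except ValueError:
            continue  # assumption: unparseable payloads are simply dropped
        table.append(event)
        stored += 1
    return stored
```

The point of the queue in the middle is that the HTTP-facing side can stay simple and fast while parsing and database writes happen separately.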

Metrics client infrastructure

The client side consists of eos-metrics, eos-event-recorder-daemon, and eos-metrics-instrumentation. eos-metrics is a D-Bus interface that applications and services may use to record events, plus a GObject library that provides a simple API around the D-Bus interface. eos-event-recorder-daemon is the service that actually implements this interface: it collects incoming metrics, batches them together, and sends them to the metrics server at predefined intervals. eos-metrics-instrumentation is the component that actually collects specific metrics. Originally, we had planned to not use this component and instead write our own fedora-metrics-instrumentation that would collect only a few particular metrics that are approved via Fedora community process. However, currently we are planning to ship eos-metrics-instrumentation and instead ensure that it is not collecting more metrics than would be acceptable to the Fedora community. A review process to decide which metrics to collect and which metrics to disable will be required.
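The batching behavior described for eos-event-recorder-daemon can be sketched as follows. This is a simplified Python model under stated assumptions, not the daemon's actual implementation or API: the class name BatchingRecorder, the event dictionary layout, and the injected send callable are all hypothetical.

```python
from typing import Callable, List, Optional


class BatchingRecorder:
    """Hypothetical model: queue events and send them in batches."""

    def __init__(self, send: Callable[[List[dict]], None], interval_s: float):
        self._send = send              # callable that uploads one batch
        self._interval_s = interval_s  # minimum seconds between uploads
        self._pending: List[dict] = []
        self._last_flush = 0.0

    def record(self, event_id: str, payload: Optional[dict] = None) -> None:
        """Queue one event; nothing is sent yet."""
        self._pending.append({"event_id": event_id, "payload": payload})

    def tick(self, now: float) -> bool:
        """Send the pending batch if the interval has elapsed.

        Returns True if a batch was sent, False otherwise.
        """
        if not self._pending or now - self._last_flush < self._interval_s:
            return False
        batch, self._pending = self._pending, []
        self._send(batch)
        self._last_flush = now
        return True
```

Passing the clock value into tick, rather than reading it inside, keeps the sketch easy to test and makes the upload schedule explicit.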

Data set considerations

Although we assume the metrics server administrator is not malicious and will not actively attempt to deanonymize users, we will still take reasonable precautions to make it difficult to correlate metrics to a particular user, starting by not storing any IP address information in the metrics database. Additionally, each metric that we collect will be considered individual, non-correlatable data by default, unless approved to be correlated with particular other metrics via future Fedora community process. That is, if a user submits two data points, we usually don't want the ability to know that these data points were both submitted by the same user.

Each metric is stored in the database with a Unix timestamp indicating when it was generated on the client. If abused, this timestamp could allow correlation of data points that are collected at the same time as each other, or at a fixed time offset to other events. For example, if the system were designed to collect two metrics exactly 300 seconds after the system was booted, then just looking at the timestamps would be enough to determine that both metrics recorded at the same time were submitted by the same user. Accordingly, we should consider modifying the metrics server to reduce timestamp granularity at least somewhat.
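One simple way to reduce timestamp granularity is to round each timestamp down to the start of a fixed-size bucket before storing it. The Python sketch below shows the idea. The function name coarsen_timestamp and the one-hour default bucket are assumptions for illustration, since the proposal does not specify a granularity.

```python
def coarsen_timestamp(ts: int, bucket_s: int = 3600) -> int:
    """Round a Unix timestamp down to the start of its bucket.

    With the (assumed) default of 3600 seconds, every event generated
    within the same hour is stored with the same timestamp.
    """
    if bucket_s <= 0:
        raise ValueError("bucket_s must be positive")
    return ts - (ts % bucket_s)
```

For example, coarsen_timestamp(7205) and coarsen_timestamp(9000) both yield 7200, so events from different users in the same hour become indistinguishable by time alone. Bucketing makes correlation harder rather than impossible: on a quiet server, a bucket may still contain events from only one user.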

History

Currently Fedora's only form of metrics collection is DNF Better Counting, but this only counts Fedora installations. That is useful, but we want to count more than just how many users we have.

Fedora's first metrics collection attempt was Smolt, a precursor to hw-probe which collected data on user hardware. The current proposal is different from Smolt because it will collect more than just hardware data, and also because Smolt collected only opt-in data. The current proposal would be opt-out, not opt-in.

This change proposal will likely be compared to the Ubuntu spyware complaints from a decade ago, when Ubuntu desktop users' search queries were sent to Amazon by default. Let's not do that.

Feedback

We will endeavor to update this section of the change proposal to include a summary of Fedora community discussion of this proposal.

Benefit to Fedora

The main benefit to Fedora is that we will be able to use collected metrics to inform design decisions. It is very common for developers to wish to know something about how Fedora software is used, and we will finally have a way to answer such questions.

Occasionally, Red Hat might need to collect specific metrics to justify additional time spent on contributing to Fedora or additional investment in Fedora.

Scope

  • Proposal owners:

This change requires substantial technical and nontechnical work from the change owners. Most notably, we will need to package eos-metrics, eos-event-recorder-daemon, and eos-metrics-instrumentation properly for Fedora; they are currently packaged in a copr. We also still need to modify eos-metrics-instrumentation so that it does not send events not approved for use in Fedora, as we expect to collect less data than Endless OS.

  • Other developers:

This proposal will require substantial effort by Community Platform Engineering (CPE) to host the metrics server infrastructure.

  • Policies and guidelines: New processes and guidelines are proposed above under the section "How will data collection be approved?"
  • Trademark approval: N/A (not needed for this Change)
  • Alignment with Objectives: This change does not align with any current Fedora Initiatives, which are very limited in scope. That said, one of the main purposes of metrics collection is to determine whether we are achieving other objectives not listed on the wiki page. For example, we want Fedora Workstation to become the premier developer workstation operating system. To that end, we want to know how many of our users are using particular IDEs.

Upgrade/compatibility impact

We would like to enable metrics upload for upgraded systems, but this isn't trivial because we want to obtain user consent before enabling metrics upload. This would require us to design a user interface that would run on upgraded systems and present the setting to users. We have not yet created such a user interface, so for now metrics upload will need to default to disabled for systems upgraded from older versions of Fedora. Since the underlying setting will be off by default, we don't need to do anything special to achieve this.

How To Test

The ultimate goal is to see metrics appear in the Postgres database of a metrics server, but configuring and running the server is not trivial. Accordingly, we propose to publish a separate document detailing how to set up and configure a metrics server for testing purposes, how to redirect metrics to the custom server, and how to force the client to immediately submit metrics to ease testing. Although we don't actually expect many community members to seriously run their own metrics servers, we still want to document the steps involved so that interested developers can see exactly how it works.

User Experience

A new metrics collection setting will be added to the privacy page in gnome-initial-setup and also to the privacy page in gnome-control-center. This setting will be a simple toggle that will enable or disable all metrics upload for the entire system. Users who do not want any metrics upload should feel confident that uploading can be disabled with a simple toggle.

Fedora users should be confident that Fedora metrics collection respects their privacy and collects only limited, anonymous usage data.

Dependencies

Any package that wishes to collect a metric would need to depend on eos-metrics. For example, if we were to collect statistics on which system settings panels are used most frequently, then the gnome-control-center package would need to depend on eos-metrics in order to send a metric to eos-event-recorder-daemon.

Contingency Plan

  • Contingency mechanism: We would need to remove the eos-metrics, eos-event-recorder-daemon, and eos-metrics-instrumentation packages from the workstation-product comps group, and rebuild any packages that gained a dependency on eos-metrics.
  • Contingency deadline: Beta freeze
  • Blocks release? Yes, if the change is incomplete, it will need to be reverted before release.

Documentation

This feature will depend on several different upstream projects with varying amounts of documentation.

The client side consists of eos-metrics, eos-event-recorder-daemon, and eos-metrics-instrumentation. The best documentation of eos-metrics available online is its D-Bus interface XML. eos-metrics also contains normal API documentation that will be built and installed in a docs subpackage, but this is not currently available online. The eos-event-recorder-daemon and eos-metrics-instrumentation components do not appear to have any online documentation.

On the server end, the metrics server consists of azafea-metrics-proxy feeding metrics into redis, where they will be pulled by azafea and then added to a Postgres database. Documentation for azafea-metrics-proxy and azafea can be reviewed online. Events recognized by the server are documented here. Note that this documentation is currently focused on use by Endless OS rather than by Fedora, and includes documentation of many events that are no longer sent by Endless OS. This change proposal does not propose to enable sending any particular events in Fedora.

Release Notes

Release notes are not required for the initial proposal. We need to write the release notes before change freeze.