Releases: 19 · Avg: 6/mo · Versions: v2.11.0-rc.0 → v2.13.1

🔍 Debuggability

Add compute job pool spans (PR #7236)

The compute job pool in the router is used to execute CPU intensive work outside of the main I/O worker threads, including GraphQL parsing, query planning, and introspection. This PR adds spans to jobs that are on this pool to allow users to see when latency is introduced due to resource contention within the compute job pool.

compute_job:
- job.type: (query_parsing|query_planning|introspection)
compute_job.execution
- job.age: P1-P8
- job.type: (query_parsing|query_planning|introspection)

Jobs are executed highest priority (P8) first. Jobs that are low priority (P1) age over time, eventually executing at highest priority. The age of a job can be used to diagnose whether it was waiting in the queue because of other, higher priority jobs that were also queued.
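To illustrate how aging interacts with priority, here is a minimal sketch. The `Job` struct, the `AGE_STEP` interval, and the `effective_priority` and `next_job` helpers are hypothetical names chosen for this example. The router's actual aging schedule is not described in these notes, so treat the numbers as assumptions.

```rust
use std::time::Duration;

/// Highest priority level (P8), as described above.
const MAX_PRIORITY: u8 = 8;

/// Assumption for illustration: a job is promoted one level per 10 ms of waiting.
const AGE_STEP: Duration = Duration::from_millis(10);

/// Hypothetical job record: the priority it was queued with and how long it has waited.
struct Job {
    queued_priority: u8, // 1 (P1) ..= 8 (P8)
    waited: Duration,
}

/// Effective priority after aging: one level per elapsed `AGE_STEP`, capped at P8.
fn effective_priority(job: &Job) -> u8 {
    let steps = job.waited.as_millis() / AGE_STEP.as_millis();
    let promoted = u128::from(job.queued_priority) + steps;
    promoted.min(u128::from(MAX_PRIORITY)) as u8
}

/// Index of the job to run next: the one with the highest effective priority.
fn next_job(queue: &[Job]) -> Option<usize> {
    (0..queue.len()).max_by_key(|&i| effective_priority(&queue[i]))
}
```

In this sketch, a P1 job that has waited long enough outranks a freshly queued P4 job, which is the situation the `job.age` attribute helps to spot.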

By @bryncooke in https://github.com/apollographql/router/pull/7236

Add compute job pool metrics (PR #7184)

The compute job pool in the router is used to execute CPU intensive work outside of the main I/O worker threads, including GraphQL parsing, query planning, and introspection. When this pool becomes saturated it is difficult for users to see why so that they can take action. This change adds new metrics to help users understand how long jobs are waiting to be processed.

New metrics:

apollo.router.compute_jobs.queue_is_full - A counter of requests rejected because the queue was full.
apollo.router.compute_jobs.duration - A histogram of time spent in the compute pipeline by the job, including the queue and query planning.
- job.type: (query_planning, query_parsing, introspection)
- job.outcome: (executed_ok, executed_error, channel_error, rejected_queue_full, abandoned)
apollo.router.compute_jobs.queue.wait.duration - A histogram of time spent in the compute queue by the job.
- job.type: (query_planning, query_parsing, introspection)
apollo.router.compute_jobs.execution.duration - A histogram of time spent to execute job (excludes time spent in the queue).
- job.type: (query_planning, query_parsing, introspection)
apollo.router.compute_jobs.active_jobs - A gauge of the number of compute jobs being processed in parallel.
- job.type: (query_planning, query_parsing, introspection)

By @carodewig in https://github.com/apollographql/router/pull/7184

🐛 Fixes

Fix hanging requests when compute job queue is full (PR #7273)

When the compute job queue was full, requests could hang until timeout. Now, the router immediately returns a SERVICE_UNAVAILABLE response to the user.
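The fail-fast behavior can be sketched with a bounded channel from the Rust standard library. This is not the router's code: the `Enqueue` enum and the `try_enqueue` and `bounded_queue` helpers are hypothetical, and jobs are represented as plain strings for brevity.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical outcome of trying to enqueue a compute job.
#[derive(Debug, PartialEq)]
enum Enqueue {
    Accepted,
    /// Queue full: the caller should answer right away (for example with SERVICE_UNAVAILABLE).
    RejectedQueueFull,
    /// The worker side of the queue is gone.
    ChannelClosed,
}

/// Build a bounded queue holding at most `capacity` pending jobs.
fn bounded_queue(capacity: usize) -> (SyncSender<String>, Receiver<String>) {
    sync_channel(capacity)
}

/// Enqueue without blocking, so a full queue fails fast instead of hanging.
fn try_enqueue(queue: &SyncSender<String>, job: String) -> Enqueue {
    match queue.try_send(job) {
        Ok(()) => Enqueue::Accepted,
        Err(TrySendError::Full(_)) => Enqueue::RejectedQueueFull,
        Err(TrySendError::Disconnected(_)) => Enqueue::ChannelClosed,
    }
}
```

The design point is the use of `try_send` rather than a blocking `send`: the caller learns immediately that the queue is full and can return an error instead of waiting for a slot.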

By @BrynCooke in https://github.com/apollographql/router/pull/7273

Increase compute job pool queue size (PR #7205)

We previously set this queue size to 20 (per thread). However, this may be too small in resource-constrained environments.

This patch increases the queue size to 1,000 jobs per thread. For reference, in older router versions before the introduction of the compute job worker pool, the equivalent queue size was 1,000.

By @goto-bus-stop in https://github.com/apollographql/router/pull/7205

🐛 Fixes

Entity-cache: handle multiple key directives (PR #7228)

This PR fixes a bug in entity caching, introduced by the fix in https://github.com/apollographql/router/pull/6888, that affected types declaring several @key directives with different fields.

For example, this applies to an entity like the following in your schema:

type Product @key(fields: "upc") @key(fields: "sku") {
  upc: ID!
  sku: ID!
  name: String
}

By @duckki & @bnjjj in https://github.com/apollographql/router/pull/7228

Improve Error Message for Invalid JWT Header Values (PR #7121)

Enhanced parsing error messages for JWT Authorization header values now provide developers with clear, actionable feedback while ensuring that no sensitive data is exposed.

Examples of the updated error messages:

-         Header Value: '<invalid value>' is not correctly formatted. prefix should be 'Bearer'
+         Value of 'authorization' JWT header should be prefixed with 'Bearer'

-         Header Value: 'Bearer' is not correctly formatted. Missing JWT
+         Value of 'authorization' JWT header has only 'Bearer' prefix but no JWT token

By @IvanGoncharov in https://github.com/apollographql/router/pull/7121

Fix crash when an invalid query plan is generated (PR #7214)

When an invalid query plan is generated, the router could panic and crash. This could happen if there are gaps in the GraphQL validation implementation. Now, even if there are unresolved gaps, the router will handle it gracefully and reject the request.

By @goto-bus-stop in https://github.com/apollographql/router/pull/7214

🐛 Fixes

Support `@context`/`@fromContext` when using Connectors (PR #7132)

This fixes a bug that dropped the @context and @fromContext directives when introducing a connector.

By @lennyburdette in https://github.com/apollographql/router/pull/7132

📃 Configuration

Add new configurable delivery pathway for high cardinality Apollo Studio metrics (PR #7138)

This change provides a secondary pathway for new "realtime" Studio metrics whose delivery interval is configurable due to their higher cardinality. These metrics will respect telemetry.apollo.batch_processor.scheduled_delay as configured on the realtime path.

All other Apollo metrics will maintain the previous hardcoded 60s send interval.

By @rregitsky and @timbotnik in https://github.com/apollographql/router/pull/7138

🐛 Fixes

Fix potential telemetry deadlock (PR #7142)

The tracing_subscriber crate uses RwLocks to manage access to a Span's Extensions. Deadlocks are possible when multiple threads access this lock, including with reentrant locks:

// Thread 1              |  // Thread 2
let _rg1 = lock.read();  |
                         |  // will block
                         |  let _wg = lock.write();
// may deadlock          |
let _rg2 = lock.read();  |

This fix removes an opportunity for reentrant locking while extracting a Datadog identifier.

There is also a potential for deadlocks when the root and active spans' Extensions are acquired at the same time, if multiple threads are attempting to access those Extensions but in a different order. This fix removes a few cases where multiple spans' Extensions are acquired at the same time.
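Both hazards come down to how long lock guards are held. The following sketch shows the general pattern with `std::sync::RwLock`. It is not the code from the fix: a plain `u64` stands in for a span's Extensions, and `read_twice` and `bump_both` are hypothetical helpers.

```rust
use std::sync::RwLock;

// Risky shape, shown only as a comment: a second `read()` while the first
// guard is still alive may deadlock if another thread requests `write()`
// in between.
//
//     let first = lock.read().unwrap();
//     let second = lock.read().unwrap(); // may deadlock

/// Safer shape: copy what is needed and drop the guard before locking again.
fn read_twice(lock: &RwLock<u64>) -> (u64, u64) {
    let first = {
        let guard = lock.read().unwrap();
        *guard
    }; // `guard` is dropped here, so no read lock is held any more
    let second = *lock.read().unwrap();
    (first, second)
}

/// With two locks, avoid holding both at once: each guard below is a
/// temporary that is released at the end of its own statement.
fn bump_both(root: &RwLock<u64>, active: &RwLock<u64>) {
    *root.write().unwrap() += 1;
    *active.write().unwrap() += 1;
}
```

Because no function here ever holds two guards at the same time, the order in which other threads take the locks no longer matters.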

By @carodewig in https://github.com/apollographql/router/pull/7142

Connection shutdown timeout (PR #7058)

When a connection is closed, we call graceful_shutdown on hyper and then wait for the connection to close.

Hyper 0.x has various issues around shutdown that may result in us waiting for extended periods for the connection to eventually be closed.

This PR introduces a configurable timeout from the termination signal to actual termination, defaulted to 60 seconds. The connection is forcibly terminated after the timeout is reached.
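The idea of bounding the wait can be sketched with a standard-library channel. This is only an illustration: the router is asynchronous and built on hyper, whereas this example uses a blocking `recv_timeout`, and the `Shutdown` enum and `await_close` function are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical result of waiting for a connection to finish.
#[derive(Debug, PartialEq)]
enum Shutdown {
    /// The connection closed by itself within the timeout.
    Graceful,
    /// The timeout elapsed, so the connection has to be terminated forcibly.
    Forced,
}

/// Wait for a "closed" signal, but never longer than `timeout`.
fn await_close(closed: &Receiver<()>, timeout: Duration) -> Shutdown {
    match closed.recv_timeout(timeout) {
        // Either an explicit signal or a dropped sender means the connection is done.
        Ok(()) | Err(RecvTimeoutError::Disconnected) => Shutdown::Graceful,
        Err(RecvTimeoutError::Timeout) => Shutdown::Forced,
    }
}
```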

To configure the timeout, set the option in the router's YAML configuration. It accepts human-readable time durations:

supergraph:
  connection_shutdown_timeout: 60s

Note that even after connections have been terminated, the router will still hold on to pipelines if early_cancel has not been set to true, because it keeps trying to complete in-flight requests.

To release pipelines sooner, users can set early_cancel to true

supergraph:
  early_cancel: true

AND/OR use traffic shaping timeouts:

traffic_shaping:
  router:
    timeout: 60s

By @BrynCooke in https://github.com/apollographql/router/pull/7058

🔒 Security

Certain query patterns may cause resource exhaustion

Corrects a set of denial-of-service (DoS) vulnerabilities that made it possible for an attacker to render the router inoperable with certain simple query patterns due to uncontrolled resource consumption. All prior-released versions and configurations are vulnerable except those where persisted_queries.enabled, persisted_queries.safelist.enabled, and persisted_queries.safelist.require_id are all true.

See the associated GitHub Advisories GHSA-3j43-9v8v-cp3f, GHSA-84m6-5m72-45fp, GHSA-75m2-jhh5-j5g2, and GHSA-94hh-jmq8-2fgp, and the apollo-compiler GitHub Advisory GHSA-7mpv-9xg6-5r79 for more information.

By @sachindshinde and @goto-bus-stop.

🐛 Fixes

Use correct default values on omitted OTLP endpoints (PR #6931)

Previously, when the configuration didn't specify an OTLP endpoint, the Router would always default to http://localhost:4318. However, port 4318 is the correct default only for the HTTP protocol, while port 4317 should be used for gRPC.

Additionally, all other telemetry defaults in the Router configuration consistently use 127.0.0.1 as the hostname rather than localhost.

With this change, the Router now uses:

http://127.0.0.1:4317 as the default for gRPC protocol
http://127.0.0.1:4318 as the default for HTTP protocol

This ensures protocol-appropriate port defaults and consistent hostname usage across all telemetry configurations.
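As a small illustration of the resolution rule, the sketch below picks an endpoint from an optional configured value and the protocol. The `Protocol` enum and both function names are hypothetical; only the two default URLs come from the note above.

```rust
/// OTLP transport protocols named in the note above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Protocol {
    Grpc,
    Http,
}

/// Default endpoint used when the configuration omits one.
fn default_otlp_endpoint(protocol: Protocol) -> &'static str {
    match protocol {
        Protocol::Grpc => "http://127.0.0.1:4317",
        Protocol::Http => "http://127.0.0.1:4318",
    }
}

/// An explicitly configured endpoint wins; otherwise fall back to the protocol default.
fn resolve_endpoint(configured: Option<&str>, protocol: Protocol) -> String {
    configured
        .unwrap_or(default_otlp_endpoint(protocol))
        .to_string()
}
```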

By @IvanGoncharov in https://github.com/apollographql/router/pull/6931

🔒 Security

Add `batching.maximum_size` configuration option to limit maximum client batch size (PR #7005)

Add an optional maximum_size parameter to the batching configuration.

When specified, the router will reject requests which contain more than maximum_size queries in the client batch.
When unspecified, the router performs no size checking (the current behavior).

If the number of queries provided exceeds the maximum batch size, the entire batch fails with error code 422 (Unprocessable Content). For example:

{
  "errors": [
    {
      "message": "Invalid GraphQL request",
      "extensions": {
        "details": "Batch limits exceeded: you provided a batch with 3 entries, but the configured maximum router batch size is 2",
        "code": "BATCH_LIMIT_EXCEEDED"
      }
    }
  ]
}
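The check itself is simple. Here is a minimal sketch, assuming a hypothetical `check_batch_size` function that is not taken from the router's source. `None` models the unspecified case, in which no size check is performed.

```rust
/// Check a client batch against an optional `maximum_size`.
/// `None` means no limit is configured, so every batch passes.
fn check_batch_size(entries: usize, maximum_size: Option<usize>) -> Result<(), String> {
    match maximum_size {
        // Only a batch strictly larger than the limit is rejected.
        Some(max) if entries > max => Err(format!(
            "Batch limits exceeded: you provided a batch with {entries} entries, \
             but the configured maximum router batch size is {max}"
        )),
        _ => Ok(()),
    }
}
```

A batch with exactly `maximum_size` entries is accepted, since the router rejects only requests containing more than `maximum_size` queries.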

By @carodewig in https://github.com/apollographql/router/pull/7005

🔍 Debuggability

Add `apollo.router.pipelines` metrics (PR #6967)

When the router reloads, either via schema change or config change, a new request pipeline is created. Existing request pipelines are closed once their requests finish. However, this may not happen if there are ongoing long requests that do not finish, such as Subscriptions.

To enable debugging when request pipelines are being kept around, a new gauge metric has been added:

apollo.router.pipelines - The number of request pipelines active in the router
- schema.id - The Apollo Studio schema hash associated with the pipeline.
- launch.id - The Apollo Studio launch id associated with the pipeline (optional).
- config.hash - The hash of the configuration

By @BrynCooke in https://github.com/apollographql/router/pull/6967

Add `apollo.router.open_connections` metric (PR #7023)

To help users to diagnose when connections are keeping pipelines hanging around, the following metric has been added:

apollo.router.open_connections - The number of connections open in the router
- schema.id - The Apollo Studio schema hash associated with the pipeline.
- launch.id - The Apollo Studio launch id associated with the pipeline (optional).
- config.hash - The hash of the configuration.
- server.address - The address that the router is listening on.
- server.port - The port that the router is listening on if not a unix socket.
- http.connection.state - Either active or terminating.

You can use this metric to monitor when connections are open via long running requests or keepalive messages.

By @BrynCooke in https://github.com/apollographql/router/pull/7023

🚀 Features

Connectors: support for traffic shaping (PR #6737)

Traffic shaping is now supported for connectors. To target a specific source, use the subgraph_name.source_name under the new connector.sources property of traffic_shaping. Settings under connector.all will apply to all connectors. deduplicate_query is not supported at this time.

Example config:

traffic_shaping:
  connector:
    all:
      timeout: 5s
    sources:
      connector-graph.random_person_api:
        global_rate_limit:
          capacity: 20
          interval: 1s
        experimental_http2: http2only
        timeout: 1s

By @andrewmcgivery in https://github.com/apollographql/router/pull/6737

Connectors: Support TLS configuration (PR #6995)

Connectors now support TLS configuration for using custom certificate authorities and client certificate authentication.

tls:
  connector:
    sources:
      connector-graph.random_person_api:
        certificate_authorities: ${file.ca.crt}
        client_authentication:
          certificate_chain: ${file.client.crt}
          key: ${file.client.key}

By @andrewmcgivery in https://github.com/apollographql/router/pull/6995

Update JWT handling (PR #6930)

This PR updates JWT handling in the AuthenticationPlugin:

- Users may now set a new config option config.authentication.router.jwt.on_error.
  - When set to the default Error, JWT-related errors will be returned to users (the current behavior).
  - When set to Continue, JWT errors will instead be ignored, and JWT claims will not be set in the request context.
- When JWTs are processed, whether processing succeeds or fails, the request context will contain a new variable apollo::authentication::jwt_status which notes the result of processing.
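The two on_error modes can be pictured as a small decision function. This sketch is hypothetical and not the plugin's code: claims and errors are modeled as strings, and `apply_on_error` is an invented name.

```rust
/// The two modes described above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum OnError {
    Error,
    Continue,
}

/// Decide what happens after JWT validation.
/// `Ok(Some(claims))`: claims are stored in the request context.
/// `Ok(None)`: the request continues without claims.
/// `Err(message)`: the error is returned to the user.
fn apply_on_error(
    validation: Result<String, String>,
    on_error: OnError,
) -> Result<Option<String>, String> {
    match (validation, on_error) {
        (Ok(claims), _) => Ok(Some(claims)),
        (Err(message), OnError::Error) => Err(message),
        (Err(_), OnError::Continue) => Ok(None),
    }
}
```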

By @Velfi in https://github.com/apollographql/router/pull/6930

Introduce PQ manifest `hot_reload` option for local manifests (PR #6987)

This change introduces a persisted_queries.hot_reload configuration option to allow the router to hot reload local PQ manifest changes.

If you configure local_manifests, you can set hot_reload to true to automatically reload manifest files whenever they change. This lets you update local manifest files without restarting the router.

persisted_queries:
  enabled: true
  local_manifests:
    - ./path/to/persisted-query-manifest.json
  hot_reload: true

Note: This change explicitly does not piggyback on the existing --hot-reload flag.

By @trevor-scheer in https://github.com/apollographql/router/pull/6987

Add support to get/set URI scheme in Rhai (Issue #6897)

This adds support for reading and writing the URI scheme via request.uri.scheme and request.subgraph.uri.scheme in Rhai, which makes it possible to switch between http and https for subgraph fetches. For example:

fn subgraph_service(service, subgraph){
    service.map_request(|request|{
        log_info(`${request.subgraph.uri.scheme}`);
        if request.subgraph.uri.scheme == {} {
            log_info("Scheme is not explicitly set");
        }
        request.subgraph.uri.scheme = "https";
        request.subgraph.uri.host = "api.apollographql.com";
        request.subgraph.uri.path = "/api/graphql";
        request.subgraph.uri.port = 1234;
        log_info(`${request.subgraph.uri}`);
    });
}

By @starJammer in https://github.com/apollographql/router/pull/6906

Add `router config validate` subcommand (PR #7016)

Adds new router config validate subcommand to allow validation of a router config file without fully starting up the Router.

./router config validate <path-to-config-file.yaml>

By @andrewmcgivery in https://github.com/apollographql/router/pull/7016

Enable remote proxy downloads of the Router

This enables users without direct download access to specify a remote proxy mirror location for the GitHub download of the Apollo Router releases.

By @LongLiveCHIEF in https://github.com/apollographql/router/pull/6667

Add metric to measure cardinality overflow frequency (PR #6998)

Adds a new counter metric, apollo.router.telemetry.metrics.cardinality_overflow, that is incremented when the cardinality overflow log from opentelemetry-rust occurs. This log means that a metric in a batch has reached a cardinality of > 2000 and that any excess attributes will be ignored.
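To make the overflow condition concrete, here is a simplified model of a per-metric cardinality cap. It is not how opentelemetry-rust is implemented: the `MetricSeries` type is hypothetical, attribute sets are reduced to strings, and the exact boundary behavior at the limit is an assumption.

```rust
use std::collections::HashSet;

/// Limit taken from the note above.
const CARDINALITY_LIMIT: usize = 2000;

/// Hypothetical tracker for one metric: the distinct attribute sets seen so far,
/// plus a counter of how often a new set was refused.
struct MetricSeries {
    seen: HashSet<String>,
    overflow_count: u64,
}

impl MetricSeries {
    fn new() -> Self {
        MetricSeries {
            seen: HashSet::new(),
            overflow_count: 0,
        }
    }

    /// Record a measurement. Returns `false` when the attribute set is new
    /// and the limit has already been reached, in which case it is ignored.
    fn record(&mut self, attributes: &str) -> bool {
        if self.seen.contains(attributes) {
            return true;
        }
        if self.seen.len() >= CARDINALITY_LIMIT {
            self.overflow_count += 1;
            return false;
        }
        self.seen.insert(attributes.to_string());
        true
    }
}
```

In this model, `overflow_count` plays the role of the new counter: it only grows when previously unseen attribute combinations arrive after the cap is reached.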

By @rregitsky in https://github.com/apollographql/router/pull/6998

Add metrics for value completion errors (PR #6905)

When the router encounters a value completion error, it is not included in the GraphQL errors array, making it harder to observe. To surface this issue more clearly, the router now counts value completion errors via the metric instruments apollo.router.graphql.error and apollo.router.operations.error, distinguishable via the code attribute with value RESPONSE_VALIDATION_FAILED.

By @timbotnik in https://github.com/apollographql/router/pull/6905

Add span events to error spans for connectors and demand control plugin (PR #6727)

New span events have been added to trace spans which include errors. These span events include the GraphQL error code that relates to the error. So far, this only includes errors generated by connectors and the demand control plugin.

By @bonnici in https://github.com/apollographql/router/pull/6727

Changes to experimental error metrics (PR #6966)

In 2.0.0, an experimental metric telemetry.apollo.errors.experimental_otlp_error_metrics was introduced to track errors with additional attributes. A few related changes are included here:

- Sending these metrics now also respects the subgraph's send flag, e.g. telemetry.apollo.errors.subgraph.[all|(subgraph name)].send.
- A new configuration option telemetry.apollo.errors.subgraph.[all|(subgraph name)].redaction_policy has been added. This flag only applies when redact is set to true. When set to ErrorRedactionPolicy.Strict, error redaction will behave as it has in the past. Setting this to ErrorRedactionPolicy.Extended will allow the extensions.code value from subgraph errors to pass through redaction and be sent to Studio.
- A warning about incompatibility of error telemetry with connectors will be suppressed when this feature is enabled, since it does support connectors when using the new mode.

By @timbotnik in https://github.com/apollographql/router/pull/6966

🐛 Fixes

Export gauge instruments (Issue #6859)

Previously in router 2.x, when using the router's OTel meter_provider() to report metrics from Rust plugins, gauge instruments such as those created using .u64_gauge() weren't exported. The router now exports these instruments.

By @yanns in https://github.com/apollographql/router/pull/6865

Use `batch_processor` config for Apollo metrics `PeriodicReader` (PR #7024)

The Apollo OTLP batch_processor configurations telemetry.apollo.batch_processor.scheduled_delay and telemetry.apollo.batch_processor.max_export_timeout now also control the Apollo OTLP PeriodicReader export interval and timeout, respectively. This update brings parity between Apollo OTLP metrics and non-Apollo OTLP exporter metrics.

By @rregitsky in https://github.com/apollographql/router/pull/7024

Reduce Brotli encoding compression level (Issue #6857)

The Brotli encoding compression level has been changed from 11 to 4 to improve performance and mimic other compression algorithms' fast setting. A lower level is also a more reasonable choice for dynamic workloads.

By @carodewig in https://github.com/apollographql/router/pull/7007

CPU count inference improvements for `cgroup` environments (PR #6787)

This fixes an issue where the fleet_detector plugin would not correctly infer the CPU limits for a system which used cgroup or cgroup2.

By @nmoutschen in https://github.com/apollographql/router/pull/6787

Separate entity keys and representation variables in entity cache key (Issue #6673)

This fix separates the entity keys and representation variable values in the cache key, to avoid issues with @requires for example.

[!IMPORTANT]

If you have enabled Distributed query plan caching, this release contains changes which necessarily alter the hashing algorithm used for the cache keys. On account of this, you should anticipate additional cache regeneration cost when updating between these versions while the new hashing algorithm comes into service.

By @bnjjj in https://github.com/apollographql/router/pull/6888

Replace Rhai-specific hot-reload functionality with general hot-reload (PR #6950)

In Router 2.0, the Rhai hot-reload capability was not working. Architectural improvements to the router meant that the entire service stack was no longer re-created for each request.

The fix adds the Rhai source files to the primary list of elements watched by the router (configuration, schema, etc.) and removes the old Rhai-specific file watching logic.

If --hot-reload is enabled, the router will reload on changes to Rhai source code just like it would for changes to configuration, for example.

By @garypen in https://github.com/apollographql/router/pull/6950

📃 Configuration

Make experimental OTLP error metrics feature flag non-experimental (PR #7033)

Because the OTLP error metrics feature is being promoted to preview from experimental, this change updates its feature flag name from experimental_otlp_error_metrics to preview_extended_error_metrics.

By @merylc in https://github.com/apollographql/router/pull/7033

[!TIP] All notable changes to Router v2.x after its initial release will be documented in this file. To see previous history, see the changelog prior to v2.0.0.
