The compute job pool in the router is used to execute CPU intensive work outside of the main I/O worker threads, including GraphQL parsing, query planning, and introspection. This PR adds spans to jobs on this pool so that users can see when latency is introduced by resource contention within the compute job pool.
compute_job:
  job.type: (query_parsing|query_planning|introspection)
compute_job.execution:
  job.age: P1-P8
  job.type: (query_parsing|query_planning|introspection)
Jobs are executed highest priority (P8) first. Jobs that are low priority (P1) age over time, eventually executing at highest priority. The age of a job can be used to diagnose whether a job was waiting in the queue because of other higher priority jobs also in the queue.
By @bryncooke in https://github.com/apollographql/router/pull/7236
The compute job pool in the router is used to execute CPU intensive work outside of the main I/O worker threads, including GraphQL parsing, query planning, and introspection. When this pool becomes saturated it is difficult for users to see why so that they can take action. This change adds new metrics to help users understand how long jobs are waiting to be processed.
New metrics:
apollo.router.compute_jobs.queue_is_full - A counter of requests rejected because the queue was full.
apollo.router.compute_jobs.duration - A histogram of time spent in the compute pipeline by the job, including the queue and query planning.
  job.type: (query_planning, query_parsing, introspection)
  job.outcome: (executed_ok, executed_error, channel_error, rejected_queue_full, abandoned)
apollo.router.compute_jobs.queue.wait.duration - A histogram of time spent in the compute queue by the job.
  job.type: (query_planning, query_parsing, introspection)
apollo.router.compute_jobs.execution.duration - A histogram of time spent to execute job (excludes time spent in the queue).
  job.type: (query_planning, query_parsing, introspection)
apollo.router.compute_jobs.active_jobs - A gauge of the number of compute jobs being processed in parallel.
  job.type: (query_planning, query_parsing, introspection)
By @carodewig in https://github.com/apollographql/router/pull/7184
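If the default histogram buckets are too coarse to show short queue waits, the router's metric views can be used to override them. The following is a minimal sketch, assuming the standard telemetry.exporters.metrics.common.views configuration and that durations are recorded in seconds; the bucket boundaries are illustrative values, not recommendations.

```yaml
telemetry:
  exporters:
    metrics:
      common:
        views:
          # Instrument name taken from the list of new metrics above
          - name: apollo.router.compute_jobs.queue.wait.duration
            aggregation:
              histogram:
                # Illustrative bucket boundaries (assumed to be seconds)
                buckets:
                  - 0.001
                  - 0.01
                  - 0.1
                  - 1
                  - 10
```

Comparing this histogram with apollo.router.compute_jobs.execution.duration may help separate time spent waiting in the queue from time spent doing the work.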
The compute job pool in the router is used to execute CPU intensive work outside of the main I/O worker threads, including GraphQL parsing, query planning, and introspection. When the pool is busy, jobs enter a queue.
When the compute job queue was full, requests could hang until timeout. Now, the router immediately returns a SERVICE_UNAVAILABLE response to the user.
By @BrynCooke in https://github.com/apollographql/router/pull/7273
The compute job pool in the router is used to execute CPU intensive work outside of the main I/O worker threads, including GraphQL parsing, query planning, and introspection. When the pool is busy, jobs enter a queue.
We previously set this queue size to 20 (per thread). However, this may be too small in resource-constrained environments.
This patch increases the queue size to 1,000 jobs per thread. For reference, in older router versions before the introduction of the compute job worker pool, the equivalent queue size was 1,000.
By @goto-bus-stop in https://github.com/apollographql/router/pull/7205
This PR fixes a bug in entity caching introduced by the fix in https://github.com/apollographql/router/pull/6888 for cases where several @key directives with different fields were declared on a type as documented here.
For example if you have this kind of entity in your schema:
type Product @key(fields: "upc") @key(fields: "sku") {
  upc: ID!
  sku: ID!
  name: String
}
By @duckki & @bnjjj in https://github.com/apollographql/router/pull/7228
Enhanced parsing error messages for JWT Authorization header values now provide developers with clear, actionable feedback while ensuring that no sensitive data is exposed.
Examples of the updated error messages:
- Header Value: '<invalid value>' is not correctly formatted. prefix should be 'Bearer'
+ Value of 'authorization' JWT header should be prefixed with 'Bearer'
- Header Value: 'Bearer' is not correctly formatted. Missing JWT
+ Value of 'authorization' JWT header has only 'Bearer' prefix but no JWT token
By @IvanGoncharov in https://github.com/apollographql/router/pull/7121
When an invalid query plan is generated, the router could panic and crash. This could happen if there are gaps in the GraphQL validation implementation. Now, even if there are unresolved gaps, the router will handle it gracefully and reject the request.
By @goto-bus-stop in https://github.com/apollographql/router/pull/7214
@context/@fromContext when using Connectors (PR #7132)
This fixes a bug that dropped the @context and @fromContext directives when introducing a connector.
By @lennyburdette in https://github.com/apollographql/router/pull/7132
This change provides a secondary pathway for new "realtime" Studio metrics whose delivery interval is configurable due to their higher cardinality. These metrics will respect telemetry.apollo.batch_processor.scheduled_delay as configured on the realtime path.
All other Apollo metrics will maintain the previous hardcoded 60s send interval.
By @rregitsky and @timbotnik in https://github.com/apollographql/router/pull/7138
The tracing_subscriber crate uses RwLocks to manage access to a Span's Extensions. Deadlocks are possible when
multiple threads access this lock, including with reentrant locks:
// Thread 1              |  // Thread 2
let _rg1 = lock.read();  |
                         |  // will block
                         |  let _wg = lock.write();
// may deadlock          |
let _rg2 = lock.read();  |
This fix removes an opportunity for reentrant locking while extracting a Datadog identifier.
There is also a potential for deadlocks when the root and active spans' Extensions are acquired at the same time, if
multiple threads are attempting to access those Extensions but in a different order. This fix removes a few cases
where multiple spans' Extensions are acquired at the same time.
By @carodewig in https://github.com/apollographql/router/pull/7142
When a connection is closed, we call graceful_shutdown on hyper and then wait for the connection to close.
Hyper 0.x has various issues around shutdown that may result in us waiting for extended periods for the connection to eventually be closed.
This PR introduces a configurable timeout from the termination signal to actual termination, which defaults to 60 seconds. The connection is forcibly terminated after the timeout is reached.
To configure it, set the option in the router YAML configuration. It accepts human-readable time durations:
supergraph:
  connection_shutdown_timeout: 60s
Note that even after connections have been terminated, the router still holds on to pipelines while it tries to complete in-flight requests, unless early_cancel is set to true.
To avoid this, users can set early_cancel to true:
supergraph:
  early_cancel: true
AND/OR use traffic shaping timeouts:
traffic_shaping:
  router:
    timeout: 60s
By @BrynCooke in https://github.com/apollographql/router/pull/7058
Corrects a set of denial-of-service (DoS) vulnerabilities that made it possible for an attacker to render the router inoperable with certain simple query patterns due to uncontrolled resource consumption. All prior-released versions and configurations are vulnerable except those where persisted_queries.enabled, persisted_queries.safelist.enabled, and persisted_queries.safelist.require_id are all true.
See the associated GitHub Advisories GHSA-3j43-9v8v-cp3f, GHSA-84m6-5m72-45fp, GHSA-75m2-jhh5-j5g2, and GHSA-94hh-jmq8-2fgp, and the apollo-compiler GitHub Advisory GHSA-7mpv-9xg6-5r79 for more information.
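For reference, the only configuration described above as not vulnerable corresponds to a router YAML fragment along these lines. This is a sketch built from the three option names in this entry; upgrading remains the fix, and this fragment is not a substitute for it.

```yaml
persisted_queries:
  enabled: true
  safelist:
    enabled: true
    # Only operations sent by ID from the safelist are accepted
    require_id: true
```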
By @sachindshinde and @goto-bus-stop.
Previously, when the configuration didn't specify an OTLP endpoint, the Router would always default to http://localhost:4318. However, port 4318 is the correct default only for the HTTP protocol, while port 4317 should be used for gRPC.
Additionally, all other telemetry defaults in the Router configuration consistently use 127.0.0.1 as the hostname rather than localhost.
With this change, the Router now uses:
http://127.0.0.1:4317 as the default for gRPC protocol
http://127.0.0.1:4318 as the default for HTTP protocol
This ensures protocol-appropriate port defaults and consistent hostname usage across all telemetry configurations.
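Deployments that relied on the old implicit default can pin the endpoint explicitly instead. A minimal sketch for the tracing exporter, assuming the usual telemetry.exporters.tracing.otlp layout; the endpoint values simply restate the new defaults listed above.

```yaml
telemetry:
  exporters:
    tracing:
      otlp:
        enabled: true
        # gRPC collectors conventionally listen on 4317
        protocol: grpc
        endpoint: http://127.0.0.1:4317
        # For the HTTP protocol, the matching pair would be:
        # protocol: http
        # endpoint: http://127.0.0.1:4318
```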
By @IvanGoncharov in https://github.com/apollographql/router/pull/6931
This fix separates the entity keys and representation variable values in the cache key, to avoid issues with @requires for example.
By @bnjjj in https://github.com/apollographql/router/pull/6888
batching.maximum_size configuration option to limit maximum client batch size (PR #7005)
Add an optional maximum_size parameter to the batching configuration.
When set, the router processes at most maximum_size queries in the client batch.
If the number of queries provided exceeds the maximum batch size, the entire batch fails with error code 422 (Unprocessable Content). For example:
{
  "errors": [
    {
      "message": "Invalid GraphQL request",
      "extensions": {
        "details": "Batch limits exceeded: you provided a batch with 3 entries, but the configured maximum router batch size is 2",
        "code": "BATCH_LIMIT_EXCEEDED"
      }
    }
  ]
}
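A configuration that would produce the error above might look like the following sketch. The enabled and mode keys are the existing batching options; the limit of 2 is chosen only to match the example response.

```yaml
batching:
  enabled: true
  mode: batch_http_link
  # Reject any client batch containing more than 2 queries
  maximum_size: 2
```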
By @carodewig in https://github.com/apollographql/router/pull/7005
apollo.router.pipelines metrics (PR #6967)
When the router reloads, either via schema change or config change, a new request pipeline is created. Existing request pipelines are closed once their requests finish. However, this may not happen if there are ongoing long requests that do not finish, such as Subscriptions.
To enable debugging when request pipelines are being kept around, a new gauge metric has been added:
apollo.router.pipelines - The number of request pipelines active in the router
  schema.id - The Apollo Studio schema hash associated with the pipeline.
  launch.id - The Apollo Studio launch id associated with the pipeline (optional).
  config.hash - The hash of the configuration
By @BrynCooke in https://github.com/apollographql/router/pull/6967
apollo.router.open_connections metric (PR #7023)
To help users diagnose when connections are keeping pipelines hanging around, the following metric has been added:
apollo.router.open_connections - The number of connections currently open on the router
  schema.id - The Apollo Studio schema hash associated with the pipeline.
  launch.id - The Apollo Studio launch id associated with the pipeline (optional).
  config.hash - The hash of the configuration.
  server.address - The address that the router is listening on.
  server.port - The port that the router is listening on if not a unix socket.
  http.connection.state - Either active or terminating.
You can use this metric to monitor when connections are open via long running requests or keepalive messages.
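One way to inspect these gauges locally is through the router's Prometheus exporter. A minimal sketch, assuming the standard telemetry.exporters.metrics.prometheus options; the listen address and path are example values.

```yaml
telemetry:
  exporters:
    metrics:
      prometheus:
        enabled: true
        # Example scrape address and path; adjust for your deployment
        listen: 127.0.0.1:9090
        path: /metrics
```

In Prometheus output, metric names typically appear with dots replaced by underscores, so the gauges would likely show up as apollo_router_pipelines and apollo_router_open_connections.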
By @BrynCooke in https://github.com/apollographql/router/pull/7009
Traffic shaping is now supported for connectors. To target a specific source, use the subgraph_name.source_name under the new connector.sources property of traffic_shaping. Settings under connector.all will apply to all connectors. deduplicate_query is not supported at this time.
Example config:
traffic_shaping:
  connector:
    all:
      timeout: 5s
    sources:
      connector-graph.random_person_api:
        global_rate_limit:
          capacity: 20
          interval: 1s
        experimental_http2: http2only
        timeout: 1s
By @andrewmcgivery in https://github.com/apollographql/router/pull/6737
Connectors now support TLS configuration for using custom certificate authorities and client certificate authentication.
tls:
  connector:
    sources:
      connector-graph.random_person_api:
        certificate_authorities: ${file.ca.crt}
        client_authentication:
          certificate_chain: ${file.client.crt}
          key: ${file.client.key}
By @andrewmcgivery in https://github.com/apollographql/router/pull/6995
This PR updates JWT handling in the AuthenticationPlugin:
A new config option is available: config.authentication.router.jwt.on_error.
  When set to Error, JWT-related errors will be returned to users (the current behavior).
  When set to Continue, JWT errors will instead be ignored, and JWT claims will not be set in the request context.
The request context now includes a new variable, apollo::authentication::jwt_status, which notes the result of processing.
By @Velfi in https://github.com/apollographql/router/pull/6930
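As an illustration of the on_error option described above, the following sketch lets requests with invalid JWTs continue without claims. The JWKS URL is a hypothetical placeholder; only on_error and its two values come from this entry.

```yaml
authentication:
  router:
    jwt:
      jwks:
        # Hypothetical JWKS location, replace with your identity provider's
        - url: https://example.com/.well-known/jwks.json
      # Error (current behavior) returns JWT errors to the client;
      # Continue ignores them and leaves claims unset in the request context
      on_error: Continue
```

With Continue, later stages would need to consult apollo::authentication::jwt_status in the request context to tell whether authentication actually succeeded.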
hot_reload option for local manifests (PR #6987)
This change introduces a persisted_queries.hot_reload configuration option to allow the router to hot reload local PQ manifest changes.
If you configure local_manifests, you can set hot_reload to true to automatically reload manifest files whenever they change. This lets you update local manifest files without restarting the router.
persisted_queries:
  enabled: true
  local_manifests:
    - ./path/to/persisted-query-manifest.json
  hot_reload: true
Note: This change explicitly does not piggyback on the existing --hot-reload flag.
By @trevor-scheer in https://github.com/apollographql/router/pull/6987
This adds support for reading and writing the scheme through request.uri.scheme and request.subgraph.uri.scheme in Rhai, making it possible to switch between http and https for subgraph fetches. For example:
fn subgraph_service(service, subgraph){
    service.map_request(|request|{
        // Log the current scheme, which may be unset
        log_info(`${request.subgraph.uri.scheme}`);
        if request.subgraph.uri.scheme == {} {
            log_info("Scheme is not explicitly set");
        }
        // Rewrite the subgraph URI piece by piece
        request.subgraph.uri.scheme = "https";
        request.subgraph.uri.host = "api.apollographql.com";
        request.subgraph.uri.path = "/api/graphql";
        request.subgraph.uri.port = 1234;
        log_info(`${request.subgraph.uri}`);
    });
}
By @starJammer in https://github.com/apollographql/router/pull/6906
router config validate subcommand (PR #7016)
Adds a new router config validate subcommand to allow validation of a router config file without fully starting up the Router.
./router config validate <path-to-config-file.yaml>
By @andrewmcgivery in https://github.com/apollographql/router/pull/7016
This enables users without direct download access to specify a remote proxy mirror location for the GitHub download of the Apollo Router releases.
By @LongLiveCHIEF in https://github.com/apollographql/router/pull/6667
Adds a new counter metric, apollo.router.telemetry.metrics.cardinality_overflow, that is incremented when the cardinality overflow log from opentelemetry-rust occurs. This log means that a metric in a batch has reached a cardinality of > 2000 and that any excess attributes will be ignored.
By @rregitsky in https://github.com/apollographql/router/pull/6998
When the router encounters a value completion error, it is not included in the GraphQL errors array, making it harder to observe. To surface this issue more clearly, the router now counts value completion errors via the metric instruments apollo.router.graphql.error and apollo.router.operations.error, distinguishable via the code attribute with value RESPONSE_VALIDATION_FAILED.
By @timbotnik in https://github.com/apollographql/router/pull/6905
apollo.router.pipelines metrics (PR #6967)
When the router reloads, either via schema change or config change, a new request pipeline is created. Existing request pipelines are closed once their requests finish. However, this may not happen if there are ongoing long requests that do not finish, such as Subscriptions.
To enable debugging when request pipelines are being kept around, a new gauge metric has been added:
apollo.router.pipelines - The number of request pipelines active in the router
  schema.id - The Apollo Studio schema hash associated with the pipeline.
  launch.id - The Apollo Studio launch id associated with the pipeline (optional).
  config.hash - The hash of the configuration
By @BrynCooke in https://github.com/apollographql/router/pull/6967
apollo.router.open_connections metric (PR #7023)
To help users diagnose when connections are keeping pipelines hanging around, the following metric has been added:
apollo.router.open_connections - The number of connections currently open on the router
  schema.id - The Apollo Studio schema hash associated with the pipeline.
  launch.id - The Apollo Studio launch id associated with the pipeline (optional).
  config.hash - The hash of the configuration.
  server.address - The address that the router is listening on.
  server.port - The port that the router is listening on if not a unix socket.
  http.connection.state - Either active or terminating.
You can use this metric to monitor when connections are open via long running requests or keepalive messages.
By @bryncooke in https://github.com/apollographql/router/pull/7023
New span events have been added to trace spans which include errors. These span events include the GraphQL error code that relates to the error. So far, this only includes errors generated by connectors and the demand control plugin.
By @bonnici in https://github.com/apollographql/router/pull/6727
In 2.0.0, an experimental metric telemetry.apollo.errors.experimental_otlp_error_metrics was introduced to track errors with additional attributes. A few related changes are included here:
These metrics now respect the subgraph send flag, e.g. telemetry.apollo.errors.subgraph.[all|(subgraph name)].send.
A new option, telemetry.apollo.errors.subgraph.[all|(subgraph name)].redaction_policy, has been added. This flag only applies when redact is set to true. When set to ErrorRedactionPolicy.Strict, error redaction will behave as it has in the past. Setting this to ErrorRedactionPolicy.Extended will allow the extensions.code value from subgraph errors to pass through redaction and be sent to Studio.
By @timbotnik in https://github.com/apollographql/router/pull/6966
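The options above could be combined as in the following sketch, which applies to all subgraphs. The lowercase extended value is an assumption about how ErrorRedactionPolicy.Extended is written in YAML; check the configuration schema of your router version before relying on it.

```yaml
telemetry:
  apollo:
    errors:
      subgraph:
        all:
          # Report subgraph errors to Studio
          send: true
          # Redact error details...
          redact: true
          # ...but let extensions.code through (assumed spelling of the Extended policy)
          redaction_policy: extended
```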
Previously in router 2.x, when using the router's OTel meter_provider() to report metrics from Rust plugins, gauge instruments such as those created using .u64_gauge() weren't exported. The router now exports these instruments.
By @yanns in https://github.com/apollographql/router/pull/6865
batch_processor config for Apollo metrics PeriodicReader (PR #7024)
The Apollo OTLP batch_processor configurations telemetry.apollo.batch_processor.scheduled_delay and telemetry.apollo.batch_processor.max_export_timeout now also control the Apollo OTLP PeriodicReader export interval and timeout, respectively. This update brings parity between Apollo OTLP metrics and non-Apollo OTLP exporter metrics.
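A minimal sketch of the two settings named above; the durations are illustrative values rather than recommended ones.

```yaml
telemetry:
  apollo:
    batch_processor:
      # Also used as the PeriodicReader export interval for Apollo OTLP metrics
      scheduled_delay: 10s
      # Also used as the PeriodicReader export timeout
      max_export_timeout: 30s
```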
By @rregitsky in https://github.com/apollographql/router/pull/7024
The Brotli encoding compression level has been changed from 11 to 4 to improve performance and mimic other compression algorithms' fast setting. This value is also a much more reasonable value for dynamic workloads.
By @carodewig in https://github.com/apollographql/router/pull/7007
cgroup environments (PR #6787)
This fixes an issue where the fleet_detector plugin would not correctly infer the CPU limits for a system which used cgroup or cgroup2.
By @nmoutschen in https://github.com/apollographql/router/pull/6787
This fix separates the entity keys and representation variable values in the cache key, to avoid issues with @requires for example.
[!IMPORTANT]
If you have enabled Distributed query plan caching, this release contains changes which necessarily alter the hashing algorithm used for the cache keys. On account of this, you should anticipate additional cache regeneration cost when updating between these versions while the new hashing algorithm comes into service.
By @bnjjj in https://github.com/apollographql/router/pull/6888
In Router 2.0 the rhai hot-reload capability was not working. This was because of architectural improvements to the router which meant that the entire service stack was no longer re-created for each request.
The fix adds the Rhai source files to the primary list of elements watched by the router (configuration, schema, etc.) and removes the old Rhai-specific file watching logic.
If --hot-reload is enabled, the router will reload on changes to Rhai source code just like it would for changes to configuration, for example.
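For context, a Rhai setup that would benefit from this fix might be configured as in the sketch below. The directory and file names are hypothetical, and the router would need to be started with --hot-reload for changes to be picked up.

```yaml
rhai:
  # Hypothetical directory containing the Rhai sources watched by the router
  scripts: ./rhai
  # Hypothetical entry-point script within that directory
  main: main.rhai
```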
By @garypen in https://github.com/apollographql/router/pull/6950
Because the OTLP error metrics feature is being promoted to preview from experimental, this change updates its feature flag name from experimental_otlp_error_metrics to preview_extended_error_metrics.
By @merylc in https://github.com/apollographql/router/pull/7033
[!TIP] All notable changes to Router v2.x after its initial release will be documented in this file. To see previous history, see the changelog prior to v2.0.0.