Skip to content

Recover stuck service operations after transient DB failures#5011

Draft
serdarozerr wants to merge 13 commits into
cloudfoundry:mainfrom
sap-contributions:fix/recover-failed-delayed-jobs
Draft

Recover stuck service operations after transient DB failures#5011
serdarozerr wants to merge 13 commits into
cloudfoundry:mainfrom
sap-contributions:fix/recover-failed-delayed-jobs

Conversation

@serdarozerr
Copy link
Copy Markdown
Contributor

Problem

When CCDB becomes temporarily unreachable during a service broker polling cycle, the polling job fails permanently even though the broker is still processing the operation. This leaves the service instance stuck:

  • last_operation.state = 'in progress' (broker still working)
  • pollable_job.state = FAILED (CC gave up)
  • User cannot retry or delete: 422 "operation in progress"

Solution

Adds a new periodic scheduled job DelayedJobsRecover that scans delayed_jobs for permanently failed rows and re-enqueues polling for any broker operation still in progress.

Detection chain:
delayed_jobs.failed_at IS NOT NULL → linked pollable_job in POLLING/FAILED state → service instance last_operation.state = 'in progress'

Recovery:
Deserializes the original handler from the dead delayed_job (preserving @start_time, @maximum_duration, @first_time = false) and calls enqueue_next_job, the same path used by normal poll cycles. A row-level FOR UPDATE lock with delayed_job_guid re-verification prevents double recovery when multiple CC instances run concurrently.

  • I have reviewed the contributing guide

  • I have viewed, signed, and submitted the Contributor License Agreement

  • I have made this pull request to the main branch

  • I have run all the unit tests using bundle exec rake

  • I have run CF Acceptance Tests

…e operations

Introduces a new periodic recovery job that scans permanently failed delayed_jobs
and re-enqueues polling for service operations still in progress at the broker.

Recovers cases where a transient DB connection error caused the polling job to
fail permanently (max_attempts=1) while the broker operation was still running,
leaving the service instance stuck in 'in progress' with no active poller.
The previous implementation queried dead delayed_jobs then performed
separate lookups per row to find the pollable job, entity, and last
operation state. Replace with a single 4-table join across
service_instance_operations, service_instances, jobs, and delayed_jobs,
filtering all conditions in one query
@serdarozerr serdarozerr force-pushed the fix/recover-failed-delayed-jobs branch from a3a10fe to f37a14c Compare May 5, 2026 15:21
@logger ||= Steno.logger('cc.background')
end

def batch_size
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a constant: BATCH_SIZE.

module Runtime
class DelayedJobsRecover < VCAP::CloudController::Jobs::CCJob
def perform
logger.info('Recover halted delayed jobs')
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer: "Recovering stuck delayed jobs"

join(:delayed_jobs, guid: Sequel[:jobs][:delayed_job_guid]).
where(Sequel[:service_instance_operations][:state] => 'in progress').
where(Sequel[:service_instance_operations][:type] => 'create').
where { Sequel[:service_instance_operations][:created_at] > cutoff_time }.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use something similar to FailedJobsCleanup (only DB time is used):

where(Sequel.lit("#{Sequel[:service_instance_operations][:created_at]} > CURRENT_TIMESTAMP - INTERVAL '?' SECOND", default_maximum_duration_seconds.to_i))

def max_attempts
1
end

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add a job_name_in_configuration method.

# unwrap the serialized handler and re-enqueue via the reoccurring job's enqueue_next_job method
inner_job = Jobs::Enqueuer.unwrap_job(delayed.payload_object)
inner_job.send(:enqueue_next_job, pjob)
end
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to add an explicit log for every re-enqueued job.

end

def logger
@logger ||= Steno.logger('cc.background')
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logger should be more specific, e.g. cc.background.delayed-jobs-recover.

PollableJobModel.db.transaction do
pjob = PollableJobModel.where(guid: pollable_guid,
delayed_job_guid: delayed.guid).
for_update.first
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add skip_locked - so if two processes try to lock the same row, the first one succeeds and the second one does nothing.

where(Sequel[:service_instance_operations][:state] => 'in progress').
where(Sequel[:service_instance_operations][:type] => 'create').
where { Sequel[:service_instance_operations][:created_at] > cutoff_time }.
where(Sequel[:jobs][:state] => [PollableJobModel::POLLING_STATE, PollableJobModel::FAILED_STATE]).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After giving this some additional thoughts, I think that we should not take failed jobs into account here. This is a terminal state that we already exposed to the user - bringing a "dead job back to life" would violate user expectations.

What we might want to have in addition is triggering orphan mitigation for the combination pollable job is failed and service instance operation is in progress.

@philippthun
Copy link
Copy Markdown
Member

When focusing on failed jobs where the pollable job is still POLLING, this PR could be extended to all async operations:

service_instance.create, service_instance.update, service_instance.delete
service_binding.create, service_binding.delete
service_key.create, service_key.delete
service_route_binding.create, service_route_binding.delete

# Test up migration
expect(db.indexes(:jobs)).not_to include(:jobs_operation_state_index)
expect { Sequel::Migrator.run(db, migrations_path, target: current_migration_index, allow_missing_migration_files: true) }.not_to raise_error
expect(db.indexes(:jobs)).to include(:jobs_operation_state_index)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also work for postgres. Then you could get rid of the if.

# pjob is FAILED with operation=service_instance.create, delayed_job has failed_at set.
# Override individual parameters to break a single filter and test exclusion.
def make_stuck_scenario(
sio_state: 'in progress',
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very Minor: I prefer explicit names like service_instance_state.

module VCAP::CloudController
module Jobs
module Runtime
class DelayedJobsRecover < VCAP::CloudController::Jobs::CCJob
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about calling it DelayedServiceJobsRecover? Something to make clear that this is only for service related jobs.

# All filter conditions are satisfied: sio is in progress/create/within cutoff,
# pjob is FAILED with operation=service_instance.create, delayed_job has failed_at set.
# Override individual parameters to break a single filter and test exclusion.
def make_stuck_scenario(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: maybe prepare_stuck_service_instance fits better?

…failed polling jobs

When a CC polling job permanently fails due to a transient error (e.g. a brief
DB connection flip), the client is left with no path to a final state: the
pollable job shows FAILED while the service instance operation remains stuck
'in progress' indefinitely. Previously this was addressed by reenqueuing the
delayed job, but that approach was fragile and incomplete.

This cleanup job detects stuck create operations whose polling job has
permanently failed (delayed_jobs.failed_at IS NOT NULL) and resolves them by
marking the operation and pollable job as failed and triggering OrphanMitigator
to deprovision any broker-side resource, giving clients a definitive final state.

Extends coverage to service bindings and service keys. Renames the class and
file from DelayedJobsRecover to ServiceOperationsCreateInProgressCleanup to
reflect the correct scope.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants