ps/Modules/Cole.PowerShell.Developer/Public/Invoke-JobRunner.ps1

416 lines
21 KiB
PowerShell
Raw Normal View History

2023-05-30 22:51:22 -07:00
function Invoke-JobRunner {
<#
.SYNOPSIS
What if invoke-parallel, but faster, with results written as jobs finish, and not waiting for the slowest batched item to complete before starting more.
.DESCRIPTION
When using Invoke-JobRunner, if you know that all of your tasks will complete rather quickly (defined here as approximately under 2 seconds per task),
You should just invoke the work in a loop instead of trying to split your tasks up.
If your tasks will take longer than that, then this JobRunner is the correct choice.
There is no system throttling for CPU contention, so you are advised to monitor the CPU usage yourself if you are likely to break things.
.PARAMETER JobInputs
An array of inputs for the job. Can be a primitive value or an object. Will be the first object provided to the script parameters.
.PARAMETER ScriptBlock
A script to be executed for each JobInput
.PARAMETER InitializationScript
Used to provide initialization ahead of the job running.
.PARAMETER AdditionalArguments
An additional set of parameters provided after the first object in the parameter list.
These would be common arguments shared across all job entities. If you need per-job parameters, make your JobInput an Object with distinct properties.
.PARAMETER OverallTimeoutSeconds
How long the entire Invoke-JobRunner process should run before timing out. Non-zero values will timeout in that many seconds.
This means that a 30s timeout may kill a job that has run for less than 30s if it was started after iteration 0.
.PARAMETER JobTimeoutSeconds
How long a single job should timeout after. This value is unrelated to the OverallTimeoutSeconds value.
.PARAMETER JobLimitSize
How many jobs should be in flight at one time. May also be referred to as Parallelization or Thread counts.
If your code uses a lot of Start-Sleep, you should increase this number to distribute the load for more workers at one time, each waiting for results/sleeps.
If your code is CPU saturating, you should leave this value as the default (8).
.PARAMETER LoopWaitDelay
THIS VALUE IS IN SECONDS
How long to wait between checking for completed jobs. AKA Don't heat the CPU needlessly
If you know your jobs will take a long time to complete, setting this value will help the running host from going into CPU spin-state.
The best advice is to set this to half the time you think a job will take to complete, in seconds.
Example: I know my script should take between 5 and 10 seconds to run per input. I would set this value to 3s.
Job completion checks would occur at 3s, 6s, 9s, 12s, etc.
.PARAMETER Credential
Used for running under a given user's context
.PARAMETER UseBatchProcessing
Legacy processing mode. Introduces batch processing when requested, but is non-preferred. See Description for details on why this is not preferred.
.PARAMETER ReturnObjects
Should the results be returned or discarded?
.NOTES
This was written to see if we could get faster performance from Invoke-Parallel while still getting the results back faster.
The goal is to wait for any job to complete, get all the completed jobs, then start jobs equal to the pool count until the running jobs count is the same as our pool limit.
Any one job may finish before another, so if we use `Wait-Job -All` then we have to wait for all started jobs to finish before we can read any of them.
In this method, I check to see if any jobs are finished, not all jobs, so I can work them 1 by 1.
I've also added the ability to stop jobs that go past a certain runtime to try to enforce limits for testing purposes (some jobs time out after 90 minutes, wouldn't it be nice to replicate that locally?)
The other incarnations of this function also have a memory leak, so that has been resolved as well. (keeping jobs in state instead of disposing of them once finished.)
But wait, what about batching jobs into equal sized elements then running those in sequential process?
This method should be faster than having X worker threads each work its own queue of batched entities, except for very fast sets (the cost of instantiating new jobs is about 2s per).
Additionally, work that occurs inside the inner batched job can't be returned to the parent until the entire job finishes, which defeats the purpose of this runner.
More details on speed versus batches follows:
**These examples consider you are using Invoke-Parallel**
Assume that I have some set of 16 jobs X, and 4 worker threads allocated, such that the jobs return in either 1, 2, 3, or 4 seconds. If things were truly random, I might end up with units of work taking:
Effort per job in time(s): 1 4 3 2 4 3 1 2 4 1 2 3 3 1 2 4
So if we put those into 4 equal sized blocks we get runtimes of:
Worker -> Iteration timeline (|1| = 1 second, |2 | = 2s, |4 | = 4s)
1 - |1|4 |3 |2 | = 10s
2 - |4 |3 |1|2 | = 10s
3 - |4 |1|2 |3 | = 10s
4 - |3 |1|2 |4 | = 10s
Time to get all results = 10s, time to first diagnostic results = 10s
And overall the batch takes 10s to run, plus overhead, but I only get results of _any_ of the work after 10s
What if the batch were presented in such a way as the ordering naturally tending towards
Effort per job in time(s): 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4
I have no way of knowing that this is how things will work out, but assuming that it did, we would see units of work like so:
Worker -> Iteration timeline (|1| = 1 second, |2 | = 2s, |4 | = 4s)
1 - |1|1|1|1| = 4s
2 - |2 |2 |2 |2 | = 8s
3 - |3 |3 |3 |3 | = 9s
4 - |4 |4 |4 |4 | = 16s
Time to get all results = 16s, time to first diagnostic results = 16s
We would not see any results until 16s. This is the worst case, clearly, and the average case is the 10s of the first example.
**These examples consider you are using Invoke-JobRunner**
However, this algorithm would rather approach the concept that we might use more worker threads and let new jobs enqueue as they finish, so that if we had 6 queues instead of 4, we would see:
Original example
Effort per job in time(s): 1 4 3 2 4 3 1 2 4 1 2 3 3 1 2 4
Worker -> Iteration timeline (|1| = 1 second, |2 | = 2s, |4 | = 4s)
1 - |1|1|2 |4 |
2 - |4 |3 |
3 - |3 |2 |2 |
4 - |2 |4 |
5 - |4 |1|
6 - |3 |3 |
Time to get all results = 8s, time to first diagnostic results = 1s
We did cheat by using 2 more worker thread allocations, but now our longest iteration time was 8s,
yet we had all our diagnostic output from the other threds written within 6s,
instead of having to wait the full 10 to see any of the threads, as opposed to the original model.
The difference is we did not have to re-group or re-render worker processes to batch things, at the cost of a little more CPU usage.
Let's re-do it with the original 4 workers, same breakdown of work as before
Worker -> Iteration timeline (|1| = 1 second, |2 | = 2s, |4 | = 4s)
1 - |1|4 |1|3 |
2 - |4 |2 |3 |4 |
3 - |3 |1|4 |2 |
4 - |2 |3 |2 |1|
Time to get all results = 13s, time to first diagnostic results = 1s
Now our longest timeline was 13s, instead of 10, but at 10s we had all but the last thread's complete diagnostic output and it only took a few more seconds to get the final return data to return all objects.
Being able to see all of the diagnostic output of the 15 completed jobs at 10s in is nicer for managing tasks that we rely on seeing all of the results before we can monitor progress.
Let's turn to the second example, with 6 and then 4 worker threads, as we did last time:
Second example
Effort per job in time(s): 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4
Worker -> Iteration timeline (|1| = 1 second, |2 | = 2s, |4 | = 4s)
1 - |1|2 |4 |
2 - |1|2 |4 |
3 - |1|3 |4 |
4 - |1|3 |4 |
5 - |2 |3 |
6 - |2 |3 |
Time to get all results = 8s, time to first diagnostic results = 1s
And if we look at it again with only 4 workers (original worker count, not modified 50% additional workers)
Effort per job in time(s): 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4
Worker -> Iteration timeline (|1| = 1 second, |2 | = 2s, |4 | = 4s)
1 - |1|2 |3 |4 |
2 - |1|2 |3 |4 |
3 - |1|2 |3 |4 |
4 - |1|2 |3 |4 |
Time to get all results = 10s, time to first diagnostic results = 1s
The reason to add additional workers is because we aren't in spinlock waiting on jobs to finish,
and we can see more work getting done,
and we can start seeing the results of our work sooner,
and since we can see the impact of our work, we are able to more effectively see what is taking too long (not getting diagnostic output per input)
The principal gained benefit of this function is the increased time to diagnostic output being returned.
The second benefit is reduced memory overhead because I'm cleaning up the job-pool as I go, which previously resulted in excess memory consumption.
Unfortunately, for very small jobs, there is still the very unjoyous reminder that all jobs will take approximately 2 seconds to start.
This means that for sufficiently fast tasks, batching is a better option. Longer running tasks are better served by not-batching.
#>
[CmdletBinding(DefaultParameterSetName = 'ThreadPerObject')]
[OutputType([object[]])]
param(
[Parameter(Mandatory = $true, ValueFromPipeline = $true, Position = 0)]
[ValidateNotNullOrEmpty()]
[Alias('Objects')]
[object[]]$JobInputs,
[Parameter(Mandatory = $true)]
[ValidateNotNullOrEmpty()]
[Alias('Script')]
[ScriptBlock]$ScriptBlock,
[Parameter(Mandatory = $false)]
[Alias('Arguments')]
[object[]]$AdditionalArguments,
[Parameter()]
[ScriptBlock]$InitializationScript = $null,
[Parameter()]
[int]$OverallTimeoutSeconds = -1,
[Parameter()]
[int]$JobTimeoutSeconds = -1,
[Parameter()]
[Alias('NumThreads')]
[int]$JobLimitSize = 8,
[Parameter()]
[ValidateRange(0, 900)]
[int]$LoopWaitDelay = 0,
[Parameter()]
[PSCredential]$Credential = $null,
[Parameter(ParameterSetName = 'UseBatchProcessing')]
[switch]$UseBatchProcessing,
[Parameter(ParameterSetName = 'ThreadPerObject')]
[switch]$ThreadPerObject,
[Parameter()]
[switch]$StopProcessingJobsOnError,
[Parameter()]
[switch]$ReturnObjects
)
$logLead = Get-LogLeadName
if ($StopProcessingJobsOnError) {
Write-Warning "$logLead : This will cause your jobs to end, and may cause an unpredictable state if job scripts can not be restarted."
if (Test-IsInteractiveSession) {
Write-Warning "$logLead : You have 5 seconds to hit ctrl-c and re-evaluate your life choices. Inconsistent states on job-termination can be disastrous."
Start-Sleep -Seconds 5
}
}
# If the user input did not specify a value, we should supply a default
if ($LoopWaitDelay -eq 0) {
# The goal here is to not stay in a spin-lock on the OS waiting for jobs to finish, because jobs will usually take longer than 150ms to complete.
# Basically, when PS hits a Start-Sleep (as we use below), the OS knows to background it.
# The value here is actually kind of counter intuitive but it's the right-ish way to go.
# If you want more details, you should lookup System.Activites.Bookmark and System.Activites.Delay
# The reason for twice is to prevent CPU spin, but on short-tasks (under 20ms) they will be completed long before we check for results.
# We likely want to revisit and set this to be a higher multiplier-value in the future
# Allow the system two thread worker attempts before we re-entry on the job to look for updated thread completion
$systemTimeSlice = Get-WindowsThreadSliceTime -Milliseconds
$sleepTimeSlice = $systemTimeSlice * 2
} else {
# Convert to milliseconds (they can enter up to 900 on the loop delay which is 15 minutes)
$sleepTimeSlice = $LoopWaitDelay * 1000
}
if ($null -eq $InitializationScript) {
# I'm greedy, I want us to use more CPU than the underlying system host
$InitializationScript = { (Get-Process -Id $PID).PriorityClass = 'AboveNormal' }
}
$psStopWatch = [System.Diagnostics.Stopwatch]::StartNew()
$hitProcessTimeoutState = $false
$progressActivityLabel = "$logLead$PID"
Write-Progress -Activity $progressActivityLabel -Status 'Processing inputs' -PercentComplete 0
$transformedJobInputs = New-Object -TypeName "System.Collections.ArrayList"
foreach ($jobInput in $JobInputs) {
$transformedJobInputs.Add($jobInput) | Out-Null
}
# Could store them on the array like this to have more data
$jobs = New-Object -TypeName "System.Collections.ArrayList"
$results = @()
Write-Host "$logLead : Starting on $($transformedJobInputs.Count) elements"
[ScriptBlock]$batchScript = $null
if ($UseBatchProcessing) {
# Define script that runs per thread.
$batchScript = {
param(
[object]$script,
[object[]]$ArgumentList
)
# Deserialize script block, turn it into a script block again.
$script = [scriptblock]::Create($script)
# Invoke user-provided script block on each object.
# SRE-13225 - The ErrorAction on Invoke-Command does NOT affect what happens INSIDE
# the ScriptBlock. Because we're batching things to be parallelized on "shared threads"
# this ErrorAction allows us to have a failure in the middle of a batch without
# halting the entire batch.
foreach ($object in $objects) {
Invoke-Command -ErrorAction Continue -ScriptBlock $script -ArgumentList $ArgumentList -NoNewScope
}
}
# Redefine JobInputs array into batches instead of single entities
$batchSize = [Math]::Ceiling($JobInputs.Count / $JobLimitSize)
$batchIterator = 0
$batchSlice = $null
$transformedJobInputs = New-Object -TypeName "System.Collections.ArrayList"
while (($batchSlice = $JobInputs[$batchIterator..$($batchIterator + $batchSize - 1)]) -and (!(Test-IsCollectionNullOrEmpty $batchSlice)) -and ($batchIterator += $batchSize)) {
Write-Debug "$logLead : Created a batch of $($batchSlice.Count) elements with values: [$($batchSlice -join ',')]"
$transformedJobInputs.Add($batchSlice) | Out-Null
}
}
$totalJobCount = $transformedJobInputs.Count
Write-Progress -Activity $progressActivityLabel -Status 'Running threads' -PercentComplete 0
while ($jobs.Count -gt 0 -or $transformedJobInputs.Count -gt 0) {
$removeJobs = @()
# now receive completed jobs
foreach ($job in $jobs) {
$removeJob = $false
$jobId = $job.Id
$jobStatus = Get-Job -Id $jobId
if ($OverallTimeoutSeconds -gt 0) {
if ($psStopWatch.Elapsed.TotalSeconds -gt $OverallTimeoutSeconds) {
<# Inline double-if because PS doesn't support eager evaluation of boolean states and I want to go faster when 0 #>
Stop-Job -Id $jobId | Out-Null
Write-Host "$logLead : Stopped job with id $jobId because it has exceeded the allowed overall timeout value of $OverallTimeoutSeconds seconds. Input object was [$($job.JobInput)]"
$removeJob = $true
$hitProcessTimeoutState = $true
}
}
if ($JobTimeoutSeconds -gt 0) {
if (([System.DateTime]::Now - $job.StartTime).TotalSeconds -gt $JobTimeoutSeconds) {
<# Inline double-if because PS doesn't support eager evaluation of boolean states and I want to go faster when 0 #>
Stop-Job -Id $jobId | Out-Null
Write-Host "$logLead : Stopped job with id $jobId because it has exceeded the allowed job timeout value of $JobTimeoutSeconds seconds. Input object was [$($job.JobInput)]"
$removeJob = $true
}
}
$jobStatus = Get-Job -Id $jobId
if ($jobStatus.State -in @('Stopped', 'Failed', 'Completed', 'Disconnected')) {
if ($ReturnObjects) {
$results += Receive-Job -Id $jobId -ErrorAction Continue
} else {
Receive-Job -Id $jobId -ErrorAction Continue
}
if ($jobStatus.State -in @('Failed', 'Disconnected')) {
Write-Warning "$logLead : Job with ID [$jobId] has failed with JobStateInfo.Reason [$($jobStatus.ChildJobs[0].JobStateInfo.Reason)]"
Write-Warning "$logLead : Job with ID [$jobId] had input [$($job.JobInput)]"
# $throwAtEnd += $job
# Legacy behavior to stop processing jobs when something errors
# This could leave things in an incomplete state on prior threads because we force stopping the jobs
if ($StopProcessingJobsOnError) {
$hitProcessTimeoutState = $true
}
}
Write-Verbose "$logLead : Received job with id $jobId"
Remove-Job -Id $jobId | Out-Null # ensure the job is disposed of
$removeJob = $true
}
if ($removeJob) {
$removeJobs += $job
}
}
foreach ($job in $removeJobs) {
$jobs.Remove($job) | Out-Null
Write-Verbose "$logLead : Removed job with id $($job.Id)"
}
Write-Progress -Activity $progressActivityLabel -PercentComplete (100 * ($totalJobCount - $transformedJobInputs.Count) / $totalJobCount)
if ($hitProcessTimeoutState) {
if ($transformedJobInputs.Count -gt 0) {
Write-Error "$logLead : We hit the job timeout stage, no further jobs will be processed. There are still $($transformedJobInputs.Count) remaining jobs that did not get started."
Write-Host "$logLead : The following job inputs did not start processing: [$((ConvertTo-Json $transformedJobInputs -Depth 5))]"
}
if ($jobs.Count -gt 0) {
Write-Warning "$logLead : Additional jobs still processing will now be stopped."
foreach ($job in $jobs) {
$jobId = $job.Id
Stop-Job -Id $jobId | Out-Null
Write-Host "$logLead : Stopped job with id $jobId because it has exceeded the allowed overall timeout value of $OverallTimeoutSeconds seconds. Input object was [$($job.JobInput)]"
if ($ReturnObjects) {
$results += Receive-Job -Id $jobId -ErrorAction Continue
} else {
Receive-Job -Id $jobId -ErrorAction Continue
}
Write-Verbose "$logLead : Received job with id $jobId"
Remove-Job -Id $jobId | Out-Null # ensure the job is disposed of
}
}
break
}
while ($jobs.Count -le $jobLimitSize -and $transformedJobInputs.Count -gt 0) {
# add new jobs
$originalJobInput = $transformedJobInputs[0]
# Grab next batch to start instead of one per thread
$ArgumentList = New-Object -TypeName "System.Collections.ArrayList"
$ArgumentList.Add($originalJobInput) | Out-Null
foreach ($additionalArgument in $AdditionalArguments) {
$ArgumentList.Add($additionalArgument) | Out-Null
}
$splat = @{
ScriptBlock = $ScriptBlock
ArgumentList = $ArgumentList
InitializationScript = $InitializationScript
}
if ($UseBatchProcessing) {
$splat.ScriptBlock = $batchScript
$splat.ArgumentList = $ScriptBlock, $ArgumentList
}
if ($null -ne $Credential) {
$splat.Credential = $Credential
}
$job = Start-Job @splat
$jobState = @{ Id = $job.Id; JobInput = (ConvertTo-Json $originalJobInput -Compress -Depth 5); StartTime = [System.DateTime]::Now; }
$jobs.Add($jobState) | Out-Null
# Wait-Job -Id $job.Id -Timeout 0 | Out-Null # stick it on the background
Write-Verbose "$logLead : Started job with id $($job.Id)"
$transformedJobInputs.RemoveAt(0) | Out-Null # We know we are done when this array is empty. It was a clone of the original input.
}
$currentJobsIds = @()
foreach ($job in $jobs) {
$currentJobsIds += $job.Id
}
if ($currentJobsIds.Count -gt 0) {
Write-Debug "$logLead : Waiting on jobs $currentJobsIds"
Wait-Job -Any -Id $currentJobsIds -Timeout 0 | Out-Null
}
# Introduce a little jitter so that jobs have a chance to finish before we check for results.
Start-Sleep -Milliseconds $sleepTimeSlice
} # end outer while
Write-Host "$logLead : Finished in $($psStopWatch.Elapsed) seconds"
$psStopWatch.Stop()
Write-Progress -Activity $progressActivityLabel -Completed
if ($ReturnObjects) {
return $results
}
}