When you launch a job, one of the most time-consuming steps is provisioning new instances for it to run on. While you are debugging, it helps to keep machines provisioned so that you can quickly launch new jobs as you fix bugs. We created engine lease to solve this problem.
The engine lease create command creates a timed 'lease' of GPU nodes. These nodes stay up until the lease times out or you run engine lease stop LEASE_ID. Only jobs that you launch can run on these instances.
When the lease ends, any jobs running on leased machines continue to run. If you launch a job that requires more machines than you currently have leased, engine automatically provisions extra, non-leased machines for the job.
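The rule for mixing leased and non-leased machines comes down to simple arithmetic. The Bash function below only illustrates that rule as described above. It is not engine's own code, and the name extra_instances is hypothetical.

```shell
#!/usr/bin/env bash
# Illustration only: how many extra, non-leased machines engine would
# provision for a job, given how many machines you currently have leased.
# extra_instances is a hypothetical name, not part of the engine CLI.
extra_instances() {
  local required=$1 leased=$2
  if (( required > leased )); then
    echo $(( required - leased ))
  else
    echo 0   # the lease already covers the whole job
  fi
}

# Example: a job that needs 3 machines while 2 are leased.
extra_instances 3 2   # prints 1
```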
Example Use Case
You are debugging some improvements to your model, and you know you will be launching many new jobs in a short period of time. You create a lease for two machines for one hour. After a few minutes your leased machines will be ready to run jobs. At this point you can launch your first job. The first time you launch a job on leased machines, it will launch about a minute faster. Subsequent jobs launched on leased machines should only take a minute or two total to launch. Once your job is running smoothly, it can be left running on leased machines. The job will continue running even after the lease ends.
It takes around 3-5 minutes for a lease to become ready.
    engine lease create -n 2 -m 20 -i p2.8xlarge
    Creating "lease-1-1559757406". Requesting 2x p2.8xlarge instances for 20 minutes.

    engine lease list  # Initially, 0 instances are running
    You have no active leases.

    engine lease list  # Some time later, 1/2 instances have started
    "lease-1-1559757406" for 2x p2.8xlarge instances is starting (1/2 instances are ready).

    engine lease list  # Finally, the lease startup is complete
    "lease-1-1559757406" for 2x p2.8xlarge instances will finish in 15 minutes.
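If you script this workflow, you may want to wait for the lease to finish starting before you push. The following is a minimal Bash sketch, assuming the human-readable output formats shown above stay stable. They are not a documented interface, and the sketch also assumes that engine lease create prints to stdout. lease_id_from_output pulls the lease ID out of the create output. wait_for_lease polls engine lease list until the lease's line says it "will finish in", which in the example above is how a fully started lease is reported. The function names, the 30-second interval, and the 20-attempt limit are hypothetical choices, not part of engine.

```shell
#!/usr/bin/env bash
# Sketch: script the create -> wait -> launch workflow around `engine lease`.
# ASSUMPTION: engine's human-readable output keeps the format shown in this
# guide. Function names are hypothetical helpers, not engine commands.

# Read `engine lease create` output on stdin and print the lease ID,
# e.g. 'Creating "lease-1-1559757406". ...' -> lease-1-1559757406
lease_id_from_output() {
  sed -n 's/^Creating "\([^"]*\)".*/\1/p' | head -n 1
}

# Read `engine lease list` output on stdin; succeed if the given lease has
# finished starting (ASSUMPTION: its line then says "will finish in").
lease_is_ready() {
  local lease_id=$1 line
  while IFS= read -r line; do
    if [[ $line == *"\"$lease_id\""*"will finish in"* ]]; then
      return 0
    fi
  done
  return 1
}

# Poll `engine lease list` until the lease is ready or we give up.
# Defaults: check every 30 seconds, at most 20 times (about 10 minutes).
wait_for_lease() {
  local lease_id=$1 interval=${2:-30} max_attempts=${3:-20}
  local attempt status
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    status=$(engine lease list)
    if lease_is_ready "$lease_id" <<<"$status"; then
      return 0
    fi
    if (( attempt < max_attempts )); then
      sleep "$interval"
    fi
  done
  echo "lease $lease_id not ready after $max_attempts checks" >&2
  return 1
}

# Usage:
#   lease_id=$(engine lease create -n 2 -m 20 -i p2.8xlarge | lease_id_from_output)
#   wait_for_lease "$lease_id" && git push project branch
```

Checking the status once per attempt keeps the loop simple. The attempt limit stops a script from hanging forever if a lease never becomes ready, and the default of about 10 minutes leaves headroom over the typical 3-5 minute startup.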
Launch your job on the leased instances.
    git push project branch
    Counting objects: 1, done.
    Writing objects: 100% (1/1), 196 bytes | 196.00 KiB/s, done.
    Total 1 (delta 0), reused 0 (delta 0)
    remote: Submitting job to Engine API
    remote: Project `project` on branch `branch` updated to `00000011111122222223333333444444`
    remote: Your job id is comfortable-reducer
    ...
You can either stop the lease or wait for it to expire. Stopping a lease while a job is running on it will not stop the job.
    engine lease stop lease-1-1559757406
    Marked instances in 'lease-1-1559757406' for termination.
Options
-n / --num-instances INTEGER
    Choose how many instances of instance-type to launch.
-m / --minutes INTEGER
    Reserve this lease for this many minutes.
-i / --instance-type
    Choose the instance type to launch on:
    p2.xlarge (1x K80 GPU; 4x vCPUs)
    p2.8xlarge (8x K80 GPU; 32x vCPUs)
    p3.2xlarge (1x V100 GPU; 8x vCPUs)
    p3.16xlarge (8x V100 GPU; 64x vCPUs)
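To see how much hardware a lease gives you, multiply the -n value by the per-instance figures in the list above. The following small Bash sketch does that. It requires Bash 4 or later for associative arrays. lease_capacity is a hypothetical helper, and its table only copies the four instance types listed here.

```shell
#!/usr/bin/env bash
# Per-instance GPU and vCPU counts, copied from the -i / --instance-type list.
declare -A GPUS=( [p2.xlarge]=1 [p2.8xlarge]=8 [p3.2xlarge]=1 [p3.16xlarge]=8 )
declare -A VCPUS=( [p2.xlarge]=4 [p2.8xlarge]=32 [p3.2xlarge]=8 [p3.16xlarge]=64 )

# Hypothetical helper: total GPUs and vCPUs for `-n NUM -i TYPE`.
lease_capacity() {
  local num_instances=$1 instance_type=$2
  if [[ -z ${GPUS[$instance_type]+set} ]]; then
    echo "unknown instance type: $instance_type" >&2
    return 1
  fi
  local gpus=${GPUS[$instance_type]} vcpus=${VCPUS[$instance_type]}
  echo "$(( num_instances * gpus )) GPUs, $(( num_instances * vcpus )) vCPUs"
}

# The lease from the example above: engine lease create -n 2 -m 20 -i p2.8xlarge
lease_capacity 2 p2.8xlarge   # prints: 16 GPUs, 64 vCPUs
```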