Design of KTQueue
KTQueue
KTQueue is a job manager based on Kubernetes that supports GPU scheduling.
Why you should use KTQueue
If you have a cluster with 1-100 GPUs that you want to share among multiple users, you may find that your GPUs are not fully utilized, because:
- you can't watch your nodes' status, so you can't start a job immediately after another one finishes
- you have to mask GPUs manually (by setting CUDA_VISIBLE_DEVICES) to avoid GPU usage conflicts
- you have to spend time finding a node with enough idle GPUs before you can start a job, a chore you should not have to care about
- you can't conveniently access the output of your job, so you can't stop a job that has already failed and release its GPUs as soon as possible
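For reference, the manual masking described above usually looks something like the following. This is a minimal sketch of the workaround that KTQueue aims to replace, not part of KTQueue itself; the helper names `build_env` and `run_on_gpus` are hypothetical and chosen only for illustration.

```python
import os
import subprocess


def build_env(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only the given GPU indices.

    CUDA_VISIBLE_DEVICES is the standard CUDA variable; everything else here
    (function names, arguments) is a hypothetical helper for illustration.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def run_on_gpus(command, gpu_ids):
    """Run a command so that it only sees the chosen GPUs; return its exit code."""
    return subprocess.run(command, env=build_env(gpu_ids)).returncode


# Example (the user must already know that GPUs 0 and 2 are idle on this node):
# run_on_gpus(["python", "train.py"], [0, 2])
```

Every user has to pick the indices by hand and coordinate with everyone else on the node, which is exactly the bookkeeping that tends to leave GPUs idle.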
Goals
- provide a user-friendly web interface to create and monitor jobs
- use Docker to provide a uniform environment for each job, so that a job can run on any node
- use Docker to allocate GPUs, avoiding GPU usage conflicts
- use a shared filesystem to provide datasets and share code among nodes
Dependencies
- a shared network filesystem (CephFS preferred, NFS is fine)
- MongoDB
- Kubernetes (>1.6.0)
- Docker
What KTQueue does
Kubernetes has supported GPU scheduling since version 1.6.0.
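To make the Kubernetes GPU scheduling concrete, here is a minimal sketch of a pod manifest that requests GPUs, built as a plain Python dictionary. It is an illustration under stated assumptions, not KTQueue's actual job template: the function `gpu_pod_manifest` and its arguments are hypothetical. The default resource name `alpha.kubernetes.io/nvidia-gpu` is the one used by Kubernetes releases of the 1.6 era; newer clusters that run the NVIDIA device plugin use `nvidia.com/gpu` instead, so check which one your cluster exposes.

```python
import json


def gpu_pod_manifest(name, image, command, gpu_count,
                     resource_name="alpha.kubernetes.io/nvidia-gpu"):
    """Build a pod manifest that asks the scheduler for `gpu_count` GPUs.

    Hypothetical helper for illustration; the resource name is an assumption
    that depends on the Kubernetes version and GPU plugin in use.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # A batch job should not be restarted forever once it exits.
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,
                "command": list(command),
                # GPUs are requested through resource limits; the scheduler
                # then places the pod on a node with enough free GPUs.
                "resources": {"limits": {resource_name: gpu_count}},
            }],
        },
    }


manifest = gpu_pod_manifest("train-job", "my-registry/trainer:latest",
                            ["python", "train.py"], gpu_count=2)
manifest_json = json.dumps(manifest, indent=2)
```

The resulting JSON can be saved to a file and submitted with `kubectl create -f`. With a request like this, the scheduler, rather than the user, finds a node with enough idle GPUs.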