@comzyh 2017-05-02T09:50:16.000000Z

Design of KTQueue

KTQueue

KTQueue is a job manager built on Kubernetes that supports GPU scheduling.

Why you should use KTQueue

If you have a cluster with 1-100 GPUs that you want to share among multiple users, you may find that the GPUs are not fully utilized, because:

you can't watch the status of your nodes, so you can't start a new job as soon as a previous one finishes
you have to mask GPUs manually (by setting CUDA_VISIBLE_DEVICES) to avoid GPU usage conflicts
you have to spend time finding a node with enough idle GPUs before you can start a job, even though you do not care which node it runs on
you can't conveniently access the output of your job, so you can't stop a job that has already failed and release its GPUs as soon as possible
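To make the manual workflow in the list above concrete, here is a minimal Python sketch of what a user has to do by hand without a scheduler: pick a node with enough idle GPUs, then restrict the job to those GPUs through CUDA_VISIBLE_DEVICES. The function names (`pick_node`, `masked_env`) and the shape of the `idle_gpus` mapping are hypothetical and exist only for illustration; they are not part of KTQueue.

```python
import os


def pick_node(idle_gpus, needed):
    """Return (node, gpu_ids) for the first node with at least `needed`
    idle GPUs, or None if no node qualifies.

    `idle_gpus` is a hypothetical mapping of node name -> list of idle GPU
    indices, which a user would otherwise collect by hand on each node.
    """
    for node, gpu_ids in sorted(idle_gpus.items()):
        if len(gpu_ids) >= needed:
            return node, sorted(gpu_ids)[:needed]
    return None


def masked_env(gpu_ids, base_env=None):
    """Return a copy of the environment in which only `gpu_ids` are visible
    to CUDA applications."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


if __name__ == "__main__":
    idle = {"node-a": [3], "node-b": [0, 2, 5]}
    choice = pick_node(idle, needed=2)
    if choice is None:
        print("no node has enough idle GPUs; try again later")
    else:
        node, gpus = choice
        env = masked_env(gpus)
        print(node, env["CUDA_VISIBLE_DEVICES"])
```

Every user has to repeat these steps for every job, and nothing stops two users from choosing the same GPUs at the same time. That is the bookkeeping a scheduler is meant to take over.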

Goal

provide a user-friendly web interface for creating and monitoring jobs
use Docker to provide a uniform environment for each job, so that a job can run on any node
use Docker to allocate GPUs and avoid GPU usage conflicts
use a shared filesystem to provide datasets and share code among nodes

Dependencies

a shared filesystem over the network (CephFS preferred, NFS is fine)
MongoDB
Kubernetes (>1.6.0)
Docker

What KTQueue does

Kubernetes has supported GPU scheduling since version 1.6.0.
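As an illustration of what GPU scheduling looks like from the job's side, the following Python sketch builds a Pod manifest that asks the Kubernetes scheduler for a number of GPUs. It assumes the alpha resource name `alpha.kubernetes.io/nvidia-gpu` used by Kubernetes releases of the 1.6 era (later releases expose GPUs through device plugins under names such as `nvidia.com/gpu`). The helper `gpu_pod_manifest`, the pod name, the image, and the command are hypothetical; this is not necessarily the manifest KTQueue itself generates.

```python
import json

# Assumption: alpha GPU resource name from Kubernetes 1.6-era releases.
GPU_RESOURCE = "alpha.kubernetes.io/nvidia-gpu"


def gpu_pod_manifest(name, image, command, gpus):
    """Build a Pod manifest (as a plain dict) that requests `gpus` GPUs.

    The scheduler only places the Pod on a node that has that many
    unallocated GPUs, so the user never has to pick a node by hand.
    """
    if gpus < 0:
        raise ValueError("gpus must be non-negative")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Run the job once; do not restart it after it exits.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "command": list(command),
                    "resources": {"limits": {GPU_RESOURCE: gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = gpu_pod_manifest(
        name="train-job",              # hypothetical job name
        image="example/train:latest",  # hypothetical image
        command=["python", "train.py"],
        gpus=2,
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl create -f`. The point of the sketch is that the GPU count becomes a declared resource limit instead of a manually chosen device list.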