ggml_cuda_init: found 2 CUDA devices (Total VRAM: 24575 MiB):
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes, VRAM: 12287 MiB
  Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes, VRAM: 12287 MiB
-dev, --device                          comma-separated list of devices to use for offloading (none = don't
                                        use --list-devices to see a list of available devices
                                        (env: LLAMA_ARG_DEVICE)
--list-devices                          print list of available devices and exit
-ot, --override-tensor =,...            override tensor buffer type
                                        (env: LLAMA_ARG_OVERRIDE_TENSOR)
-cmoe, --cpu-moe                        keep all Mixture of Experts (MoE) weights in the CPU
-ncmoe, --n-cpu-moe N                   keep the Mixture of Experts (MoE) weights of the first N layers in the
-sm, --split-mode {none,layer,row}      how to split the model across multiple GPUs, one of:
                                        - layer (default): split layers and KV across GPUs
                                        - row: split rows across GPUs
                                        (env: LLAMA_ARG_SPLIT_MODE)
-ts, --tensor-split N0,N1,N2,...        fraction of the model to offload to each GPU, comma-separated list of
                                        (env: LLAMA_ARG_TENSOR_SPLIT)
-mg, --main-gpu INDEX                   the GPU to use for the model (with split-mode = none), or for
                                        intermediate results and KV (with split-mode = row) (default: 0)
-fit, --fit [on|off]                    whether to adjust unset arguments to fit in device memory ('on' or
                                        target margin per device for --fit, comma-separated list of values,
                                        single value is broadcast across all devices, default: 1024
--check-tensors                         check model tensor data for invalid values (default: false)
--op-offload, --no-op-offload           whether to offload host tensor operations to device (default: true)
-otd, --override-tensor-draft =,...     override tensor buffer type for draft model
-cmoed, --cpu-moe-draft                 keep all Mixture of Experts (MoE) weights in the CPU for the draft
-ncmoed, --n-cpu-moe-draft N            keep the Mixture of Experts (MoE) weights of the first N layers in the
-devd, --device-draft                   comma-separated list of devices to use for offloading the draft model
                                        use --list-devices to see a list of available devices
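
The multi-GPU flags listed above can be combined into one invocation. The following is a rough sketch, not a definitive recipe: it turns per-device VRAM sizes (here the two 12287 MiB values from the log) into a --tensor-split proportion list and assembles an argument list. The binary name `llama-server`, the `-m` model flag, the path `model.gguf`, and the helper names `tensor_split` and `build_args` are assumptions made for illustration; only --split-mode, --tensor-split, and --main-gpu come from the help text. It also assumes that --tensor-split accepts relative proportions, which the truncated help line does not confirm.

```python
from functools import reduce
from math import gcd


def tensor_split(vram_mib):
    """Reduce per-device VRAM sizes to the smallest integer proportions."""
    divisor = reduce(gcd, vram_mib)
    return [v // divisor for v in vram_mib]


def build_args(model_path, vram_mib, split_mode="layer", main_gpu=0):
    """Assemble an argument list from the multi-GPU flags in the help text."""
    if split_mode not in {"none", "layer", "row"}:
        raise ValueError(f"unknown split mode: {split_mode}")
    proportions = ",".join(str(p) for p in tensor_split(vram_mib))
    return [
        "llama-server",                 # assumed binary name
        "-m", model_path,               # assumed model flag, hypothetical path
        "--split-mode", split_mode,     # {none,layer,row}, default: layer
        "--tensor-split", proportions,  # assumed to take relative proportions
        "--main-gpu", str(main_gpu),    # default: 0
    ]


# VRAM sizes taken from the ggml_cuda_init log above.
args = build_args("model.gguf", [12287, 12287])
print(" ".join(args))
```

With two identical cards the proportions reduce to 1,1, an even split. Passing the list to `subprocess.run(args)` would launch the process, provided the assumed binary is on the PATH.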