TextSynth uses custom inference code to achieve faster inference on both GPUs and CPUs. It has the following characteristics:

A CPU-only version is freely available.

Benchmarks:

Performance of the GPT-NeoX 20B model on an Nvidia RTX A6000 GPU. For the speed measurement, 200 tokens are generated with a batch size of 1:

Precision   LAMBADA (ppl)   LAMBADA (acc)   Max GPU memory (GB)   Speed (tokens/s)
float16     3.66            72.6%           40.7                  15
8 bits      3.66            72.6%           21.7                  27
4 bits      3.71            72.0%           11.6                  41
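The memory column roughly tracks the storage needed for the model weights alone. A back-of-the-envelope sketch (assuming 20B parameters; the measured figures above are slightly higher because they also include activations, the KV cache and runtime overhead):

```python
# Rough weight-memory estimate for a 20B-parameter model at
# different precisions (1 GB = 1e9 bytes).
N_PARAMS = 20e9

def weight_memory_gb(bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GB."""
    return N_PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("float16", 16), ("8 bits", 8), ("4 bits", 4)]:
    # float16 -> 40 GB, 8 bits -> 20 GB, 4 bits -> 10 GB
    print(f"{name:8s} ~{weight_memory_gb(bits):.0f} GB")
```

These estimates (40, 20 and 10 GB) line up with the measured 40.7, 21.7 and 11.6 GB, which is why 4-bit quantization lets the model fit on much smaller GPUs.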

Performance of the Stable Diffusion 1.4 model on an Nvidia RTX A6000 GPU. For the speed measurement, a single image is generated using 50 timesteps and a batch size of 1:

Precision   Max GPU memory (GB)   Generation time (s)
float16     2.8                   1.90
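Throughput numbers like those above can be collected with a simple wall-clock wrapper. A minimal sketch, where `generate_token` stands in for a hypothetical single-token generation call (not an actual TextSynth API):

```python
import time

def tokens_per_second(generate_token, n_tokens=200):
    """Call generate_token() n_tokens times and return the
    measured throughput in tokens per second."""
    start = time.perf_counter()
    for _ in range(n_tokens):
        generate_token()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed
```

At the 15 tokens/s measured for float16, generating the 200-token sample takes about 13 seconds; at 41 tokens/s in 4-bit mode it takes under 5.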