TextSynth uses custom inference code to achieve faster inference on GPUs and CPUs; its performance characteristics are summarized in the benchmarks below.

Companies wishing to use the TextSynth inference code on their servers can contact us at: contact at textsynth dot com.

Benchmarks:

Performance using the GPT-NeoX 20B model on an RTX A6000 Nvidia GPU. For the speed measurement, 200 tokens are generated using a batch size of 1:

Precision   LAMBADA (ppl)   LAMBADA (acc)   Max GPU memory (GB)   Speed (tokens/s)
float16     3.67            72.6%           40.7                  15
8 bits      3.68            72.4%           21.7                  27
4 bits      3.77            72.2%           11.6                  41
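The memory savings in the table come from storing weights at reduced precision while keeping accuracy nearly unchanged. As an illustration of the idea (not TextSynth's actual implementation), here is a minimal sketch of symmetric per-group 4-bit weight quantization in NumPy; the function names, group size, and int4 range are assumptions for the example:

```python
import numpy as np

def quantize_4bit(w, group_size=64):
    """Illustrative symmetric per-group 4-bit quantization.

    Each group of `group_size` weights shares one float scale;
    the quantized values fit in the signed int4 range -7..7.
    """
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale, shape):
    """Recover approximate float32 weights from int4 codes and scales."""
    return (q * scale).astype(np.float32).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128)).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s, w.shape)
max_err = float(np.abs(w - w_hat).max())  # small reconstruction error
```

With 4 bits per weight plus one scale per group, storage drops to roughly a quarter of float16 while the reconstruction error stays small, which is consistent with the modest perplexity increase (3.67 to 3.77) shown above.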

Performance using the Stable Diffusion 1.4 model on an RTX A6000 Nvidia GPU. For the speed measurement, a single image is generated using 50 timesteps and a batch size of 1:

Precision   Max GPU memory (GB)   Generation time (s)
float16     2.8                   1.98
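The speed figures above (tokens/s for text, seconds per image) can be reproduced with a simple wall-clock measurement. This is a generic timing sketch, not TextSynth's benchmark harness; the `generate` callable and the stand-in sleep-based generator are hypothetical placeholders:

```python
import time

def tokens_per_second(generate, n_tokens=200):
    """Time one generation call and report throughput in tokens/s.

    `generate` is a user-supplied function that produces `n_tokens`
    tokens (here a placeholder; in practice, a model's decode loop).
    """
    t0 = time.perf_counter()
    generate(n_tokens)
    dt = time.perf_counter() - t0
    return n_tokens / dt

# Stand-in generator that "produces" a token every millisecond.
rate = tokens_per_second(lambda n: time.sleep(n * 0.001))
```

For image generation the same pattern applies, timing one call with a fixed number of timesteps and reporting the elapsed seconds directly.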