FastAPI Sync/Async benchmark

First off, this benchmark is specific to our use case: we host Torch models deployed as Fargate containers. I wanted to know how much of a difference it makes when (for this use case) async view functions are wrongly used instead of sync ones. My expectation was that the sync ones would be faster, because the work is purely CPU-bound and there is nothing to await.

We host morphology models using FastAPI and PyTorch (CPU only!). In the benchmark I use one that predicts the gender of a German noun. The noun is generated randomly but always ends with "Garten", so the model should return a valid gender (masculine in this case).
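For reference, here is a minimal sketch of the two view-function variants being compared. The route and parameter names follow the k6 script below; predict_gender is a hypothetical stand-in for the actual Torch inference call, and the response shape is simplified:

from fastapi import FastAPI

app = FastAPI()

def predict_gender(lemma: str) -> str:
    # Hypothetical stand-in for the CPU-bound PyTorch inference call.
    return "masc"

# Sync variant: FastAPI dispatches this to a worker threadpool.
@app.get("/de-noun/")
def inflect_view(lemma: str):
    return {"lemma": lemma, "gender": predict_gender(lemma)}

# Async variant under test: the CPU-bound call runs directly on the event loop.
# @app.get("/de-noun/")
# async def inflect_view(lemma: str):
#     return {"lemma": lemma, "gender": predict_gender(lemma)}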

FastAPI is started like this:

uvicorn main:app --port 8000 --workers 1

In deployment we use only one worker, but I wanted to try more here too.
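For the multi-worker runs in the results table below, only the worker count changes, e.g.:

uvicorn main:app --port 8000 --workers 4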

To run the stress test I used k6. This script stress tests the API for 30 seconds with 10 virtual users (later with 100):

import http from 'k6/http';
import { check } from 'k6';

export const options = {
  vus: 10,
  duration: '30s',
}
export default function() {
  const newNoun = (Math.random() + 1).toString(36).substring(4) + '-Garten'
  const res = http.get('http://127.0.0.1:8000/de-noun/?lemma='+ newNoun)
  check(res, { 'status was 200': (r) => r.status == 200 })
}

The script is then run with: k6 run script.js.

Experiments

I always ran the script twice and used the second run. This guarantees that the model is already loaded into memory, which is hard to control when hammering the API with multiple requests.

All benchmarks (uvicorn and k6) ran on the same notebook, an AMD Ryzen 7 with 16 cores and 32 GB of memory. I didn't want a (real) network in the benchmark.

| view func | uvicorn workers | k6 vus | http_reqs | duration |
|-----------|-----------------|--------|-----------|----------|
| sync  | 1 | 10  | 1627 (54.05475/s)  | avg=184.74ms min=101.94ms med=184.44ms max=241ms |
| async | 1 | 10  | 767 (25.242789/s)  | avg=393.63ms min=52.38ms med=393.65ms max=486.22ms |
| sync  | 1 | 100 | 1518 (48.157565/s) | avg=2.03s min=751.44ms med=2.06s max=3.08s |
| async | 1 | 100 | 845 (24.773537/s)  | avg=3.79s min=60.51ms med=4.02s max=6.45s |
| sync  | 4 | 10  | 468 (14.760021/s)  | avg=661.84ms min=176.63ms med=266.65ms max=12.25s |
| async | 4 | 10  | 41 (1.228459/s)    | avg=393.63ms min=52.38ms med=393.65ms max=486.22ms |

The hypothesis that using async def inflect_view() is slower seems to be correct. With a plain async def, the CPU-bound Torch call runs directly on the event loop and blocks it, so requests are effectively handled one after another, while a sync def is dispatched to FastAPI's threadpool. A large number of virtual users results in unstable response times (up to several seconds!). And finally, the number of completed HTTP requests crashes with multiple workers. I suspect this is caused by Torch already consuming all CPUs, so multiple workers are not a good idea for our use case.
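For illustration only (we did not benchmark this variant): if an async view were required, the blocking call could be offloaded to the threadpool with Starlette's run_in_threadpool, reusing the hypothetical predict_gender helper from above:

from fastapi import FastAPI
from fastapi.concurrency import run_in_threadpool

app = FastAPI()

def predict_gender(lemma: str) -> str:
    # Hypothetical stand-in for the CPU-bound Torch call.
    return "masc"

@app.get("/de-noun/")
async def inflect_view(lemma: str):
    # Offloading keeps the event loop responsive despite the CPU-bound work.
    gender = await run_in_threadpool(predict_gender, lemma)
    return {"lemma": lemma, "gender": gender}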

Conclusion

The setup we started with was sync functions and one worker, because the Fargate container we use for this only has one CPU. This seems to be the best setup for us at the moment. Pressure from 100 virtual users may happen and the response times may get slower then, but at least all requests were fulfilled.

I also learned here that k6 is actually easy to use for stress testing HTTP APIs.