Deploying a trained AI/ML model is often a challenge. In particular, determining which hardware a given use case requires can be a puzzle. On the one hand, the model must respond quickly enough given the expected request volume. On the other hand, operating servers is expensive, so leaving server resources unused should be avoided.
There is a wide range of options for deploying models. For example, you must decide whether you need a server with or without a hardware accelerator such as a GPU. Other typical design criteria are the number of CPU cores, the amount of RAM, and the disk size. Choosing the right hardware is particularly difficult when you deploy multiple models, each interacting with the hardware in a different way, to a single server.
It is often proposed to provision and scale resources only on demand. However, this leads to cold-start problems, i.e. waiting times until an instance is provisioned. In many applications, on-demand provisioning is therefore not an option. Instead, the minimum amount of computing resources that must be kept available should be determined. This way, the models remain available and can respond quickly, while resource usage is kept as low as possible.
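As a rough sketch of how such a minimum can be estimated, Little's law relates request rate and response latency to the average number of in-flight requests. The numbers and the per-instance concurrency parameter below are purely illustrative assumptions, not measurements:

```python
import math

def min_instances(requests_per_second: float,
                  avg_latency_seconds: float,
                  concurrency_per_instance: int) -> int:
    """Estimate the minimum number of warm instances via Little's law:
    average in-flight requests = arrival rate * average latency."""
    in_flight = requests_per_second * avg_latency_seconds
    return max(1, math.ceil(in_flight / concurrency_per_instance))

# Illustrative numbers: 50 req/s, 200 ms per response,
# each instance serving 4 requests concurrently.
print(min_instances(50, 0.2, 4))  # -> 3
```

A real estimate should of course be based on measured latencies and peak (not average) load, but this kind of back-of-the-envelope calculation gives a lower bound on the capacity that has to stay warm.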
Servers with GPU accelerators are usually much more expensive than CPU-only servers. CPU servers are often underestimated: they are sometimes sufficient to meet the request requirements. GPU servers, on the other hand, are typically a good choice when multiple models are deployed, as the models can share the computing resources. The key to a low-resource deployment is therefore knowing exactly which computing resources are needed in each case.
How to find the right hardware?
The best way to find the right hardware is to deploy the models to different servers for testing and to measure the performance achieved. Since there is a wide range of possible hardware constellations, this means analyzing a large number of different servers. The deployment configuration should be optimized for each of these servers so that the results are comparable in a fair way.
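A minimal sketch of such a measurement, assuming the model is exposed as a plain Python callable (any real setup would benchmark the served endpoint instead), could look like this:

```python
import statistics
import time

def benchmark(predict, payload, warmup: int = 5, runs: int = 50) -> dict:
    """Measure per-request latency of a model's predict callable.
    Warmup calls are discarded so one-time costs (JIT compilation,
    cache population) do not distort the results."""
    for _ in range(warmup):
        predict(payload)
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(payload)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    mean = statistics.mean(latencies)
    return {
        "mean_s": mean,
        "p95_s": latencies[int(0.95 * len(latencies)) - 1],
        "throughput_rps": 1.0 / mean,
    }

# Stand-in for a real model: any callable taking one input.
result = benchmark(lambda x: sum(i * i for i in range(10_000)), None)
```

Running this harness with identical payloads on each candidate server yields the comparable latency and throughput figures that the search needs.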
Another difficulty in finding the right architecture is that several server instances are usually operated in parallel so that all requests can be answered quickly enough. When searching for the right server architecture, you should therefore also compare whether several inexpensive servers with weak hardware, operated in parallel, deliver better results than a single expensive server with strong hardware. This adds further server combinations that need to be tested during the search.
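The parallel-versus-single-server trade-off boils down to a cost-per-throughput comparison. The throughput figures and hourly prices below are illustrative assumptions, not benchmark results:

```python
import math

def cost_to_serve(server_throughput_rps: float,
                  server_cost_per_hour: float,
                  required_rps: float) -> tuple:
    """Number of parallel servers needed for the target load,
    and the resulting total hourly cost."""
    n = math.ceil(required_rps / server_throughput_rps)
    return n, n * server_cost_per_hour

# Assumed: a small CPU server handles 40 req/s at $0.10/h,
# a GPU server handles 300 req/s at $1.20/h; target load is 100 req/s.
cpu = cost_to_serve(40, 0.10, 100)   # three CPU servers, about $0.30/h
gpu = cost_to_serve(300, 1.20, 100)  # one GPU server, $1.20/h
```

Under these assumed numbers, three weak CPU servers in parallel would serve the load at a quarter of the GPU server's cost; with different measured throughputs the conclusion can easily flip, which is why the combinations have to be tested.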
To avoid having to carry out this time-consuming search manually, our toolbox includes a solution that automatically measures the performance of models on many different server architectures. For this purpose, the configuration is optimized for each server and the models are benchmarked on the target hardware. With these results, you can determine exactly which hardware a given use case requires. This can often save considerable server costs, as it may turn out that a combination of inexpensive servers meets the requirements.
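Once benchmark results are available, the final selection step is simple: filter out every configuration that misses the latency requirement and pick the cheapest of the rest. The server names, latency values, and prices below are hypothetical placeholders:

```python
# Hypothetical benchmark results; all values are illustrative.
results = [
    {"server": "cpu-small", "p95_ms": 180, "cost_per_hour": 0.10},
    {"server": "cpu-large", "p95_ms": 90,  "cost_per_hour": 0.40},
    {"server": "gpu",       "p95_ms": 25,  "cost_per_hour": 1.20},
]

def cheapest_meeting(results, max_p95_ms: float):
    """Cheapest server whose measured p95 latency meets the requirement,
    or None if no candidate qualifies."""
    feasible = [r for r in results if r["p95_ms"] <= max_p95_ms]
    return min(feasible, key=lambda r: r["cost_per_hour"]) if feasible else None

best = cheapest_meeting(results, 100)  # -> the "cpu-large" entry
```

With a 100 ms p95 requirement, the mid-range CPU server wins here despite the GPU server being three times faster, because speed beyond the requirement brings no benefit while the cost difference remains.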
Get in Touch!
Partner with us to optimize your models and unlock their full potential. Share your requirements, and we’ll show you how we can drive your success.
Need more information? Reach out today to discuss how we can help you achieve your goals.